This page provides an overview of the reference documentation available for PySpark, the Python API for Spark. For more information about PySpark, see PySpark on Azure Databricks.
Data types
For a complete list of PySpark data types, see PySpark data types.
Classes
| Reference | Description |
|---|---|
| Catalog | Interface for managing databases, tables, functions, and other catalog metadata. |
| Column | Operations for working with DataFrame columns, including transformations and expressions. |
| Data Types | Available data types in PySpark SQL, including primitive types, complex types, and user-defined types. |
| DataFrame | Distributed collection of data organized into named columns, similar to a table in a relational database. |
| DataFrameNaFunctions | Functionality for working with missing data in a DataFrame. |
| DataFrameReader | Interface used to load a DataFrame from external storage systems. |
| DataFrameStatFunctions | Statistical functions for use with a DataFrame. |
| DataFrameWriter | Interface used to write a DataFrame to external storage systems. |
| DataFrameWriterV2 | Interface used to write a DataFrame to external storage (version 2). |
| DataSource | APIs for implementing custom data sources to read from external systems. For information about custom data sources, see PySpark custom data sources. |
| DataSourceArrowWriter | A base class for data source writers that process data using PyArrow's RecordBatch. |
| DataSourceRegistration | A wrapper for data source registration. |
| DataSourceReader | A base class for data source readers. |
| DataSourceStreamArrowWriter | A base class for data stream writers that process data using PyArrow's RecordBatch. |
| DataSourceStreamReader | A base class for streaming data source readers. |
| DataSourceStreamWriter | A base class for data stream writers. |
| DataSourceWriter | A base class for data source writers responsible for saving data to a custom data source in batch mode. |
| DataStreamReader | Interface used to load a streaming DataFrame from external storage systems. |
| DataStreamWriter | Interface used to write a streaming DataFrame to external storage systems. |
| Geography | A class to represent a Geography value in Python. |
| Geometry | A class to represent a Geometry value in Python. |
| GroupedData | Methods for grouping data and performing aggregation operations on grouped DataFrames. |
| InputPartition | A base class representing an input partition returned by the partitions() method of DataSourceReader. |
| Observation | Collects metrics and observes DataFrames during query execution for monitoring and debugging. |
| PlotAccessor | Accessor for DataFrame plotting functionality in PySpark. |
| ProtoBuf | Support for serializing and deserializing data using Protocol Buffers format. |
| Row | Represents a row of data in a DataFrame, providing access to individual field values. |
| RuntimeConfig | Runtime configuration options for Spark SQL, including execution and optimizer settings. For information on configuration that is only available on Databricks, see Set Spark configuration properties on Azure Databricks. |
| SimpleDataSourceStreamReader | A base class for simplified streaming data source readers that read data and plan the latest offset simultaneously. |
| SparkSession | The entry point for reading data and executing SQL queries in PySpark applications. |
| Stateful Processor | Manages state across streaming batches for complex stateful operations in structured streaming. |
| StreamingQuery | A handle to a query that is executing continuously in the background as new data arrives. |
| StreamingQueryListener | Abstract class for listening to streaming query lifecycle events. |
| StreamingQueryManager | Manages all active StreamingQuery instances associated with a SparkSession. |
| UserDefinedFunction (UDF) | User-defined functions for applying custom Python logic to DataFrame columns. |
| UDFRegistration | Wrapper for user-defined function registration. This instance can be accessed by spark.udf. |
| UserDefinedTableFunction (UDTF) | User-defined table functions that return multiple rows for each input row. |
| UDTFRegistration | Wrapper for user-defined table function registration. This instance can be accessed by spark.udtf. |
| VariantVal | Represents semi-structured data with flexible schema, which supports dynamic types and nested structures. |
| Window | Window functions for performing calculations across a set of table rows related to the current row. |
| WindowSpec | A window specification that defines the partitioning, ordering, and frame boundaries used by window functions. |
| WriterCommitMessage | A commit message returned by DataSourceWriter.write and sent back to the driver as an input parameter of DataSourceWriter.commit or DataSourceWriter.abort. |
Functions
For a complete list of available built-in functions, see PySpark functions.