Note
Access to this page requires authorization. You can try signing in or changing directories.
Access to this page requires authorization. You can try changing directories.
Maps an iterator of batches in the current DataFrame using a Python native function that is performed on pyarrow.RecordBatchs both as input and output, and returns the result as a DataFrame.
Syntax
mapInArrow(func: "ArrowMapIterFunction", schema: Union[StructType, str], barrier: bool = False, profile: Optional[ResourceProfile] = None)
Parameters
| Parameter | Type | Description |
|---|---|---|
func |
function | a Python native function that takes an iterator of pyarrow.RecordBatchs, and outputs an iterator of pyarrow.RecordBatchs. |
schema |
DataType or str | the return type of the func in PySpark. The value can be either a pyspark.sql.types.DataType object or a DDL-formatted type string. |
barrier |
bool, optional, default False | Use barrier mode execution, ensuring that all Python workers in the stage will be launched concurrently. |
profile |
ResourceProfile, optional | The optional ResourceProfile to be used for mapInArrow. |
Returns
DataFrame
Examples
import pyarrow as pa
df = spark.createDataFrame([(1, 21), (2, 30)], ("id", "age"))
def filter_func(iterator):
for batch in iterator:
pdf = batch.to_pandas()
yield pa.RecordBatch.from_pandas(pdf[pdf.id == 1])
df.mapInArrow(filter_func, df.schema).show()
# +---+---+
# | id|age|
# +---+---+
# | 1| 21|
# +---+---+
df.mapInArrow(filter_func, df.schema, barrier=True).collect()
# [Row(id=1, age=21)]