Note
Access to this page requires authorization. You can try signing in or changing directories.
Access to this page requires authorization. You can try changing directories.
Partitions the output by the given columns on the file system. The output is laid out similar to Hive's partitioning scheme.
Syntax
partitionBy(*cols)
Parameters
| Parameter | Type | Description |
|---|---|---|
*cols |
str or list | Names of the columns to partition by. |
Returns
DataStreamWriter
Examples
df = spark.readStream.format("rate").load()
df.writeStream.partitionBy("value")
# <...streaming.readwriter.DataStreamWriter object ...>
Partition a Rate source stream by timestamp and write to Parquet:
import tempfile
import time
with tempfile.TemporaryDirectory(prefix="partitionBy1") as d:
with tempfile.TemporaryDirectory(prefix="partitionBy2") as cp:
df = spark.readStream.format("rate").option("rowsPerSecond", 10).load()
q = df.writeStream.partitionBy(
"timestamp").format("parquet").option("checkpointLocation", cp).start(d)
time.sleep(5)
q.stop()
spark.read.schema(df.schema).parquet(d).show()