clusterBy (DataStreamWriter)

Clusters the output by the given columns. Records with similar values on the clustering columns are grouped together in the same file. Clustering improves query efficiency by allowing queries with predicates on the clustering columns to skip unnecessary data. Unlike partitioning, clustering can be used on high-cardinality columns.

Syntax

clusterBy(*cols)

Parameters

Parameter Type Description
*cols str or list Names of the columns to cluster by.

Returns

DataStreamWriter

Examples

df = spark.readStream.format("rate").load()
df.writeStream.clusterBy("value")
# <...streaming.readwriter.DataStreamWriter object ...>

Cluster a Rate source stream by timestamp and write to Parquet:

import tempfile
import time
with tempfile.TemporaryDirectory(prefix="clusterBy1") as d:
    with tempfile.TemporaryDirectory(prefix="clusterBy2") as cp:
        df = spark.readStream.format("rate").option("rowsPerSecond", 10).load()
        q = df.writeStream.clusterBy(
            "timestamp").format("parquet").option("checkpointLocation", cp).start(d)
        time.sleep(5)
        q.stop()
        spark.read.schema(df.schema).parquet(d).show()