Clusters the data by the given columns to optimize query performance.
Syntax
clusterBy(*cols)
Parameters
| Parameter | Type | Description |
|---|---|---|
| *cols | str or list | Names of the columns to cluster by. |
Returns
DataFrameWriter configured with the given clustering columns, allowing further chained calls.
Examples
Write a DataFrame into a Parquet file with clustering.
```python
import tempfile

with tempfile.TemporaryDirectory(prefix="clusterBy") as d:
    spark.createDataFrame(
        [{"age": 100, "name": "Alice"}, {"age": 120, "name": "Ruifeng Zheng"}]
    ).write.clusterBy("name").mode("overwrite").format("parquet").save(d)