Groups the DataFrame by the specified columns so that aggregation can be performed on them. See GroupedData for all the available aggregate functions.
Syntax
groupBy(*cols: "ColumnOrNameOrOrdinal")
Parameters
| Parameter | Type | Description |
|---|---|---|
| cols | list, str, int or Column | The columns to group by. Each element can be a column name (str), an expression (Column), a column ordinal (int, 1-based), or a list of these. |
Returns
GroupedData: A GroupedData object representing the data grouped by the specified columns.
Notes
Column ordinals start from 1, unlike the 0-based indexing of __getitem__ (for example, df[0]).
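The off-by-one between the two conventions can be illustrated in plain Python. This is a minimal sketch, not Spark's implementation: `resolve_ordinal` is a hypothetical helper introduced only for illustration, and the column list mirrors the `name`/`age` schema used in the examples.

```python
# Hypothetical illustration of 1-based ordinals vs. 0-based indexing.
# This is NOT Spark's code; resolve_ordinal is an assumed helper name.
columns = ["name", "age"]


def resolve_ordinal(ordinal, columns):
    """Map a 1-based column ordinal to the column name it refers to."""
    if not 1 <= ordinal <= len(columns):
        raise IndexError(f"ordinal {ordinal} is out of range 1..{len(columns)}")
    return columns[ordinal - 1]


# Ordinal 1 and index 0 both refer to the first column.
first_by_ordinal = resolve_ordinal(1, columns)
first_by_index = columns[0]

# Ordinal 2 and index 1 both refer to the second column.
second_by_ordinal = resolve_ordinal(2, columns)
second_by_index = columns[1]

print(first_by_ordinal, first_by_index)
print(second_by_ordinal, second_by_index)
```

Under this convention, a call such as df.groupBy(1) on the example DataFrame would be expected to group by name, whereas df[0] selects that same column through 0-based indexing.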
Examples
df = spark.createDataFrame(
    [("Alice", 2), ("Bob", 2), ("Bob", 2), ("Bob", 5)],
    schema=["name", "age"],
)
# With no grouping columns, the aggregation runs over the whole DataFrame.
df.groupBy().avg().show()
# +--------+
# |avg(age)|
# +--------+
# |    2.75|
# +--------+
# Group by name and pass a dictionary to sum the age column.
df.groupBy("name").agg({"age": "sum"}).sort("name").show()
# +-----+--------+
# | name|sum(age)|
# +-----+--------+
# |Alice|       2|
# |  Bob|       9|
# +-----+--------+
# Group by a Column expression and compute the maximum of each numeric column.
df.groupBy(df.name).max().sort("name").show()
# +-----+--------+
# | name|max(age)|
# +-----+--------+
# |Alice|       2|
# |  Bob|       5|
# +-----+--------+
# Group by a list mixing a column name and a Column, then count rows per group.
df.groupBy(["name", df.age]).count().sort("name", "age").show()
# +-----+---+-----+
# | name|age|count|
# +-----+---+-----+
# |Alice|  2|    1|
# |  Bob|  2|    2|
# |  Bob|  5|    1|
# +-----+---+-----+
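To make the grouping semantics concrete, the aggregations above can be reproduced in plain Python over the same four rows. This is only an analogy for what the results mean, not how Spark executes them; the variable names are chosen for illustration.

```python
from collections import defaultdict

# Same rows as the example DataFrame: (name, age).
rows = [("Alice", 2), ("Bob", 2), ("Bob", 2), ("Bob", 5)]

# Analogue of groupBy("name").agg({"age": "sum"}): one running total per name.
sum_by_name = defaultdict(int)
for name, age in rows:
    sum_by_name[name] += age

# Analogue of groupBy(["name", df.age]).count(): the group key is a tuple.
count_by_name_age = defaultdict(int)
for name, age in rows:
    count_by_name_age[(name, age)] += 1

# Analogue of groupBy().avg(): no key, so every row falls into a single group.
avg_age = sum(age for _, age in rows) / len(rows)

print(dict(sum_by_name))
print(dict(count_by_name_age))
print(avg_age)
```

Each distinct key corresponds to one output row in the grouped result, and an empty key list collapses the whole DataFrame into one group.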