Computes aggregates and returns the result as a DataFrame.
The available aggregate functions can be:

- Built-in aggregation functions, such as `avg`, `max`, `min`, `sum`, `count`.
- Group aggregate pandas UDFs, created with `pyspark.sql.functions.pandas_udf`.
Syntax
agg(*exprs)
Parameters
| Parameter | Type | Description |
|---|---|---|
| `exprs` | dict or Column | A dict mapping from column name (string) to aggregate function name (string), or a list of aggregate Column expressions. |
Returns
DataFrame
Notes
Built-in aggregation functions and group aggregate pandas UDFs cannot be mixed in a single call to this function.
When exprs is a single dict, the key is the column to perform aggregation on and the value is the aggregate function. When exprs is a list of Column expressions, each expression specifies an aggregation to compute.
Examples
```python
import pandas as pd
from pyspark.sql import functions as sf
from pyspark.sql.functions import pandas_udf

df = spark.createDataFrame(
    [(2, "Alice"), (3, "Alice"), (5, "Bob"), (10, "Bob")], ["age", "name"])

# Group by name and count each group.
df.groupBy(df.name).agg({"*": "count"}).sort("name").show()
# +-----+--------+
# | name|count(1)|
# +-----+--------+
# |Alice|       2|
# |  Bob|       2|
# +-----+--------+

# Group by name and calculate the minimum age.
df.groupBy(df.name).agg(sf.min(df.age)).sort("name").show()
# +-----+--------+
# | name|min(age)|
# +-----+--------+
# |Alice|       2|
# |  Bob|       5|
# +-----+--------+

# Same as above, but using a group aggregate pandas UDF.
@pandas_udf('int')
def min_udf(v: pd.Series) -> int:
    return v.min()

df.groupBy(df.name).agg(min_udf(df.age)).sort("name").show()
# +-----+------------+
# | name|min_udf(age)|
# +-----+------------+
# |Alice|           2|
# |  Bob|           5|
# +-----+------------+
```