Calculates the approximate quantiles of numerical columns of a DataFrame.
The result of this algorithm has the following deterministic bound: if the DataFrame has N elements and we request the quantile at probability p up to error err, then the algorithm will return a sample x from the DataFrame so that the exact rank of x is close to (p * N). More precisely, floor((p - err) * N) <= rank(x) <= ceil((p + err) * N).
This method implements a variation of the Greenwald-Khanna algorithm with some speed optimizations.
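The rank bound above can be checked in plain Python, without Spark. The following sketch (the helper name `rank_within_bound` is illustrative, not part of the PySpark API) verifies that a returned sample's rank in the sorted data falls inside [floor((p - err) * N), ceil((p + err) * N)]:

```python
import math

def rank_within_bound(sample, data, p, err):
    """Check floor((p - err) * N) <= rank(sample) <= ceil((p + err) * N)."""
    n = len(data)
    # 1-based rank of the returned sample in the sorted data
    rank = sorted(data).index(sample) + 1
    return math.floor((p - err) * n) <= rank <= math.ceil((p + err) * n)

# For the 5-element column [1, 2, 3, 4, 5], a median request (p = 0.5)
# with err = 0.05 must return a value whose rank lies in
# [floor(2.25), ceil(2.75)] = [2, 3], i.e. the value 2 or 3.
print(rank_within_bound(3, [1, 2, 3, 4, 5], 0.5, 0.05))  # True
```

With err = 0.05 and only 5 rows the bound is tight, so the approximate median is forced to be exact here; larger relativeError values widen the admissible rank window and let Spark keep fewer samples per partition.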
Syntax
approxQuantile(col, probabilities, relativeError)
Parameters
| Parameter | Type | Description |
|---|---|---|
| col | str, list, or tuple | A single column name, or a list of names for multiple columns. |
| probabilities | list or tuple of float | A list of quantile probabilities. Each number must be a float in the range [0, 1]. For example, 0.0 is the minimum, 0.5 is the median, and 1.0 is the maximum. |
| relativeError | float | The relative target precision to achieve (>= 0). If set to zero, the exact quantiles are computed, which could be very expensive. Values greater than 1 give the same result as 1. |
Returns
list
If col is a string, returns a list of floats. If col is a list or tuple of strings, returns a list of lists of floats.
Notes
Null values are ignored in numerical columns before calculation. For columns containing only null values, an empty list is returned.
Examples
Calculate quantiles for a single column.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data = [(1,), (2,), (3,), (4,), (5,)]
df = spark.createDataFrame(data, ["values"])
df.stat.approxQuantile("values", [0.0, 0.5, 1.0], 0.05)
# [1.0, 3.0, 5.0]
```
Calculate quantiles for multiple columns.
```python
data = [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50)]
df = spark.createDataFrame(data, ["col1", "col2"])
df.stat.approxQuantile(["col1", "col2"], [0.0, 0.5, 1.0], 0.05)
# [[1.0, 3.0, 5.0], [10.0, 30.0, 50.0]]
```
Handle null values.
```python
data = [(1,), (None,), (3,), (4,), (None,)]
df = spark.createDataFrame(data, ["values"])
df.stat.approxQuantile("values", [0.0, 0.5, 1.0], 0.05)
# [1.0, 3.0, 4.0]
```