pyspark.pandas.groupby.GroupBy.transform
GroupBy.transform(func, *args, **kwargs)
Apply function column-by-column to the GroupBy object.
The function passed to transform must take a Series as its first argument and return a Series. The given function is executed for each series in each grouped data.
While transform is a very flexible method, its downside is that using it can be quite a bit slower than using more specific methods like agg. pandas-on-Spark offers a wide range of methods that will be much faster than using transform for their specific purposes, so try to use them before reaching for transform.
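As an illustration of the advice above, the following sketch uses plain pandas (an assumption made so that it runs without a Spark session) to compare a generic transform call with the dedicated cumsum method. pandas-on-Spark's GroupBy also provides cumsum, so the same substitution should apply there, although the actual speed difference depends on the workload.

```python
import pandas as pd

# Same sample data as in the Examples section, built with plain pandas.
pdf = pd.DataFrame({'A': [0, 0, 1], 'B': [1, 2, 3], 'C': [4, 6, 5]})
g = pdf.groupby('A')

# Generic approach: a user-defined function applied to each column of each group.
via_transform = g.transform(lambda x: x.cumsum())

# Dedicated approach: the built-in group-wise cumulative sum.
via_builtin = g.cumsum()

print(via_transform)
print(via_builtin)
```

Both calls produce the same values here; the dedicated method simply avoids running arbitrary Python code for every group.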
Note
This API executes the function once to infer the type, which is potentially expensive, for instance, when the dataset is created after aggregations or sorting.
To avoid this, specify the return type in func, for instance, as below:
>>> def convert_to_string(x) -> ps.Series[str]:
...     return x.apply("a string {}".format)
When the given function has the return type annotated, the original index of the GroupBy object will be lost, and a default index will be attached to the result. Please be careful about configuring the default index. See also Default Index Type.
Note
The series within func is actually a pandas series. Therefore, any pandas API within this function is allowed.
Parameters
func : callable
A callable that takes a Series as its first argument, and returns a Series.
*args
Positional arguments to pass to func.
**kwargs
Keyword arguments to pass to func.
Returns
applied : DataFrame
See also
aggregate
Apply aggregate function to the GroupBy object.
Series.apply
Apply a function to a Series.
Examples
>>> import pyspark.pandas as ps
>>> df = ps.DataFrame({'A': [0, 0, 1],
...                    'B': [1, 2, 3],
...                    'C': [4, 6, 5]}, columns=['A', 'B', 'C'])
>>> g = df.groupby('A')
Notice that g has two groups, 0 and 1. Calling transform in various ways, we can get different grouping results.
Below, the function passed to transform takes a Series as its argument and returns a Series. transform applies the function on each series in each grouped data, and combines them into a new DataFrame:
>>> def convert_to_string(x) -> ps.Series[str]:
...     return x.apply("a string {}".format)
>>> g.transform(convert_to_string)
            B           C
0  a string 1  a string 4
1  a string 2  a string 6
2  a string 3  a string 5
>>> def plus_max(x) -> ps.Series[int]:
...     return x + x.max()
>>> g.transform(plus_max)
   B   C
0  3  10
1  4  12
2  6  10
You can omit the type hint and let pandas-on-Spark infer its type.
>>> def plus_min(x):
...     return x + x.min()
>>> g.transform(plus_min)
   B   C
0  2   8
1  3  10
2  6  10
In case of Series, it works as below.
>>> df.B.groupby(df.A).transform(plus_max)
0    3
1    4
2    6
Name: B, dtype: int64
>>> (df * -1).B.groupby(df.A).transform(abs)
0    1
1    2
2    3
Name: B, dtype: int64
You can also specify extra arguments to pass to the function.
>>> def calculation(x, y, z) -> ps.Series[int]:
...     return x + x.min() + y + z
>>> g.transform(calculation, 5, z=20)
    B   C
0  27  33
1  28  35
2  31  35
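Because the series passed to func is a pandas series (see the note above), one possible way to check a function before running it on Spark is to try it on a small plain pandas sample first. The sketch below assumes pandas is installed locally; the variable names pdf, group0_b, direct, and local are illustrative only, and the pandas type hints are used just for this local check.

```python
import pandas as pd

def plus_max(x: pd.Series) -> pd.Series:
    # x is one column of one group; add the group's maximum to every value.
    return x + x.max()

# Small local sample mirroring the DataFrame used in the examples.
pdf = pd.DataFrame({'A': [0, 0, 1], 'B': [1, 2, 3], 'C': [4, 6, 5]})

# Call the function directly on column B of the group where A == 0.
group0_b = pdf.loc[pdf['A'] == 0, 'B']
direct = plus_max(group0_b)

# Run the same function through plain pandas groupby/transform.
local = pdf.groupby('A').transform(plus_max)

print(direct)
print(local)
```

The local result matches the plus_max output shown earlier, which suggests the function body itself is sound before type inference and distributed execution come into play.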