pyspark.pandas.groupby.GroupBy.transform

GroupBy.transform(func, *args, **kwargs)

Apply function column-by-column to the GroupBy object.

The function passed to transform must take a Series as its first argument and return a Series. The given function is executed on each Series (column) within each group.
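The per-group, per-column behaviour described above can be pictured with a small pure-Python sketch. This is only an illustration under simplifying assumptions: plain lists stand in for Series, the helper name group_transform is hypothetical, and it does not reflect how pandas-on-Spark actually distributes the work.

```python
from collections import defaultdict

def group_transform(keys, columns, func):
    """Apply func to each column within each group, keeping row order.

    keys    -- list of group keys, one per row
    columns -- dict mapping column name to a list of values
    func    -- takes a list of values and returns a list of the same length
    """
    # Collect the row positions that belong to each group.
    positions = defaultdict(list)
    for i, key in enumerate(keys):
        positions[key].append(i)

    result = {}
    for name, values in columns.items():
        out = [None] * len(values)
        for idx in positions.values():
            transformed = func([values[i] for i in idx])
            # Write each transformed value back to its original row.
            for i, v in zip(idx, transformed):
                out[i] = v
        result[name] = out
    return result

def plus_max(xs):
    # Add the group maximum to every value in the group.
    return [x + max(xs) for x in xs]

result = group_transform([0, 0, 1], {"B": [1, 2, 3], "C": [4, 6, 5]}, plus_max)
print(result)  # {'B': [3, 4, 6], 'C': [10, 12, 10]}
```

The grouping column itself is not passed to the function; only the remaining columns are transformed, and the output has one row per input row.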

While transform is a very flexible method, its downside is that it can be quite a bit slower than more specific methods such as agg. pandas-on-Spark offers a wide range of methods that are much faster than transform for their specific purposes, so try them before reaching for transform.
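As a rough illustration of preferring a dedicated method, the sketch below computes a group-wise cumulative sum in two ways. It is written against plain pandas so that it runs without a Spark session; it assumes the pandas-on-Spark GroupBy exposes a cumsum method with the same name, and it makes no claim about the actual speed difference.

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 0, 1], "B": [1, 2, 3], "C": [4, 6, 5]})
g = df.groupby("A")

# Generic route: a user function runs on every column of every group.
via_transform = g.transform(lambda x: x.cumsum())

# Dedicated route: the built-in group-wise cumulative sum.
via_builtin = g.cumsum()

print(via_builtin)
```

Both routes produce the same values here; the dedicated method simply avoids calling a user-supplied Python function.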

Note

This API executes the function once to infer the type, which is potentially expensive, for instance, when the dataset is created after aggregations or sorting.

To avoid this, specify the return type in func, for instance, as below:

>>> def convert_to_string(x) -> ps.Series[str]:
...     return x.apply("a string {}".format)

When the given function has the return type annotated, the original index of the GroupBy object will be lost, and a default index will be attached to the result. Please be careful about configuring the default index. See also Default Index Type.

Note

The series within func is actually a pandas Series. Therefore, any pandas API is allowed within this function.
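Because func receives a pandas Series, one practical consequence is that it can be tried out on a plain pandas Series before being passed to transform. The following is a minimal sketch; the function name demean and the sample values are chosen purely for illustration.

```python
import pandas as pd

def demean(x):
    # Any pandas Series method is available here, e.g. mean().
    return x - x.mean()

# One group's worth of data, shaped like what func would receive.
group = pd.Series([4, 6], name="C")
out = demean(group)
print(out.tolist())  # [-1.0, 1.0]
```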

Parameters
func : callable

A callable that takes a Series as its first argument, and returns a Series.

*args

Positional arguments to pass to func.

**kwargs

Keyword arguments to pass to func.

Returns
applied : DataFrame

See also

aggregate

Apply aggregate function to the GroupBy object.

Series.apply

Apply a function to a Series.

Examples

>>> df = ps.DataFrame({'A': [0, 0, 1],
...                    'B': [1, 2, 3],
...                    'C': [4, 6, 5]}, columns=['A', 'B', 'C'])
>>> g = df.groupby('A')

Notice that g has two groups, 0 and 1. Calling transform in various ways gives different results. Below, each function passed to transform takes a Series as its argument and returns a Series. transform applies the function to each Series in each group and combines the results into a new DataFrame:

>>> def convert_to_string(x) -> ps.Series[str]:
...     return x.apply("a string {}".format)
>>> g.transform(convert_to_string)  
            B           C
0  a string 1  a string 4
1  a string 2  a string 6
2  a string 3  a string 5
>>> def plus_max(x) -> ps.Series[int]:
...     return x + x.max()
>>> g.transform(plus_max)  
   B   C
0  3  10
1  4  12
2  6  10

You can omit the type hint and let pandas-on-Spark infer its type.

>>> def plus_min(x):
...     return x + x.min()
>>> g.transform(plus_min)  
   B   C
0  2   8
1  3  10
2  6  10

For a Series, it works as follows.

>>> df.B.groupby(df.A).transform(plus_max)
0    3
1    4
2    6
Name: B, dtype: int64
>>> (df * -1).B.groupby(df.A).transform(abs)
0    1
1    2
2    3
Name: B, dtype: int64

You can also specify extra arguments to pass to the function.

>>> def calculation(x, y, z) -> ps.Series[int]:
...     return x + x.min() + y + z
>>> g.transform(calculation, 5, z=20)  
    B   C
0  27  33
1  28  35
2  31  35