Applying Arbitrary Functions for Grouping Data with Pandas – Data Analytics in Python

Nov 17, 2022

In this tutorial, we explore how to apply arbitrary functions to groups in Pandas. It's a handy technique for data analysis with Python and Pandas: the GroupBy apply method is a higher-order function, so we can pass it our own function to compute custom aggregations.

>>> import pandas as pd
>>> df = pd.DataFrame({'Student':['Beth', 'Alex', 'Diana', 'Adrian'],
                  'Age': [18, 19, 18, 19],
                  'Math': [75, 82, 89, 85],
                  'Science': [65, 75, 86, 90],
                  'Teacher': ['William', 'William', 'Robert', 'Robert']})

To get an idea of how our data looks, we can print the records as a table (the column order may differ depending on your Pandas version):

>>> df.head()
   Age  Math  Science Student  Teacher
0   18    75       65    Beth  William
1   19    82       75    Alex  William
2   18    89       86   Diana   Robert
3   19    85       90  Adrian   Robert

Consider the max function applied to the data grouped by Teacher:

>>> df.groupby('Teacher').max()
          Age  Math  Science Student
Robert    19    89       90   Diana
William   19    82       75    Beth

The built-in max function can also be passed to apply:

>>> df.groupby('Teacher').apply(max)
         Age  Math  Science Student  Teacher
Robert    19    89       90   Diana   Robert
William   19    82       75    Beth  William
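
As a minimal sketch of the same idea with an explicit callable: the lambda below and the column selection are illustrative additions, not part of the original example. The lambda calls each group's own .max() method, so the result does not depend on how a given Pandas version treats the built-in max inside apply.

```python
import pandas as pd

df = pd.DataFrame({'Student': ['Beth', 'Alex', 'Diana', 'Adrian'],
                   'Age': [18, 19, 18, 19],
                   'Math': [75, 82, 89, 85],
                   'Science': [65, 75, 86, 90],
                   'Teacher': ['William', 'William', 'Robert', 'Robert']})

# Restrict to the numeric columns (an assumption made for this sketch).
numeric_cols = ['Age', 'Math', 'Science']

# Built-in aggregation: column-wise maximum within each Teacher group.
builtin_max = df.groupby('Teacher')[numeric_cols].max()

# The same computation through apply with an explicit function:
# each group arrives as a DataFrame, and g.max() returns one value per column.
applied_max = df.groupby('Teacher')[numeric_cols].apply(lambda g: g.max())

print(builtin_max)
print(applied_max)
```

Both calls should produce one row per teacher, holding the highest Age, Math, and Science values in that teacher's group.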

In the code above, we passed a function as an argument to the ‘apply’ method. In the same way, we can pass our own custom functions and get the results we want. Let's define a function that finds the best teacher in our data:

def best_teacher(group_dframe):
    return pd.DataFrame({'Math': [group_dframe.loc[group_dframe.Math.idxmax()].Teacher],
                        'Science': [group_dframe.loc[group_dframe.Science.idxmax()].Teacher]})

The function above takes the DataFrame of a single group as its argument (apply calls it once per group) and returns a one-row DataFrame holding, for each subject, the name of the Teacher whose student has the maximum score.

Let's examine the function more closely. Consider the list that is passed as the value for the key ‘Math’ in the dictionary defined in the function above:

[group_dframe.loc[group_dframe.Math.idxmax()].Teacher]

Let's dissect the above list step by step for a better understanding of what's going on.

group_dframe.Math.idxmax()

The above line returns the index of the maximum value for Math.

group_dframe.loc[group_dframe.Math.idxmax()]

Here .loc fetches the row at the previously found index of the maximum value for Math. For more on .loc, you can see my post How to use .loc, .iloc, .ix in Pandas.

Now finally:

group_dframe.loc[group_dframe.Math.idxmax()].Teacher

The line above fetches the Teacher from the row extracted in the previous step. Since that row holds the maximum Math score in the group, the Teacher returned here is the one whose student got the highest marks in Math.
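
The three steps above can be tried on a single group outside of apply. The sketch below is an illustration under one assumption: it builds the Age 18 group by hand with a boolean filter, to stand in for what groupby('Age') would pass to the function. The names group, row_label, best_row, and teacher are hypothetical and chosen for this example only.

```python
import pandas as pd

df = pd.DataFrame({'Student': ['Beth', 'Alex', 'Diana', 'Adrian'],
                   'Age': [18, 19, 18, 19],
                   'Math': [75, 82, 89, 85],
                   'Science': [65, 75, 86, 90],
                   'Teacher': ['William', 'William', 'Robert', 'Robert']})

# One group as the function would see it: the rows where Age is 18
# (Beth at index 0 and Diana at index 2).
group = df[df['Age'] == 18]

# Step 1: index label of the row with the highest Math score.
row_label = group.Math.idxmax()

# Step 2: the full row stored under that label.
best_row = group.loc[row_label]

# Step 3: the Teacher value from that row.
teacher = best_row.Teacher

print(row_label, best_row.Student, teacher)
```

Note that idxmax returns an index label rather than a position, which is why it pairs with .loc and not .iloc.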

Now let's group the DataFrame by Age and apply our function:

>>> group_dframe = df.groupby('Age')
>>> group_dframe.apply(best_teacher)
         Math Science
18  0  Robert  Robert
19  0  Robert  Robert
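
To run the whole example end to end, a sketch along these lines can be used. Selecting the Math, Science, and Teacher columns before calling apply is an addition made here, not part of the original example. It keeps the grouping column out of the data handed to the function, which some recent Pandas releases warn about.

```python
import pandas as pd

df = pd.DataFrame({'Student': ['Beth', 'Alex', 'Diana', 'Adrian'],
                   'Age': [18, 19, 18, 19],
                   'Math': [75, 82, 89, 85],
                   'Science': [65, 75, 86, 90],
                   'Teacher': ['William', 'William', 'Robert', 'Robert']})


def best_teacher(group_dframe):
    # For each subject, look up the row with the top score in this group
    # and report that row's Teacher.
    return pd.DataFrame({
        'Math': [group_dframe.loc[group_dframe.Math.idxmax()].Teacher],
        'Science': [group_dframe.loc[group_dframe.Science.idxmax()].Teacher],
    })


# Only the columns the function needs are passed to each call (an assumption
# of this sketch; the original applies the function to the full groups).
result = df.groupby('Age')[['Math', 'Science', 'Teacher']].apply(best_teacher)
print(result)
```

In this data set Robert teaches the top scorer in both subjects for both age groups, so every cell of the result is Robert.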