In this tutorial, we will explore how to apply arbitrary functions to groupings in Pandas. It's a handy technique for data analysis with Python and Pandas. Because GroupBy's apply is a higher-order function (it takes another function as its argument), we can use it to run custom aggregations on each group.
>>> import pandas as pd
>>> df = pd.DataFrame({'Student': ['Beth', 'Alex', 'Diana', 'Adrian'],
...                    'Age': [18, 19, 18, 19],
...                    'Math': [75, 82, 89, 85],
...                    'Science': [65, 75, 86, 90],
...                    'Teacher': ['William', 'William', 'Robert', 'Robert']})
To get an idea of how our data looks, we can print the records as a table:
>>> df.head()
   Age  Math  Science Student  Teacher
0   18    75       65    Beth  William
1   19    82       75    Alex  William
2   18    89       86   Diana   Robert
3   19    85       90  Adrian   Robert
Consider the max function applied to the data grouped by Teacher:
>>> df.groupby('Teacher').max()
         Age  Math  Science Student
Teacher
Robert    19    89       90   Diana
William   19    82       75    Beth
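Python's built-in max function can also be passed to apply. Pandas translates the built-in into its own column-wise maximum here; recent Pandas releases deprecate that translation and suggest passing the string 'max' instead.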
>>> df.groupby('Teacher').apply(max)
         Age  Math  Science Student  Teacher
Teacher
Robert    19    89       90   Diana   Robert
William   19    82       75    Beth  William
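As a minimal sketch of the same idea, the snippet below requests the maximum by name and then passes a small custom function to apply. The names scores, by_name and score_range are illustrative and not part of the original tutorial, and the two score columns are selected explicitly so that only numeric data reaches the lambda.

```python
import pandas as pd

df = pd.DataFrame({'Student': ['Beth', 'Alex', 'Diana', 'Adrian'],
                   'Age': [18, 19, 18, 19],
                   'Math': [75, 82, 89, 85],
                   'Science': [65, 75, 86, 90],
                   'Teacher': ['William', 'William', 'Robert', 'Robert']})

# Restrict the grouping to the numeric score columns (illustrative choice).
scores = df.groupby('Teacher')[['Math', 'Science']]

# Ask for the aggregation by name instead of passing the built-in max.
by_name = scores.agg('max')

# A custom function: the spread between the best and worst score per teacher.
# Each call receives one group's DataFrame and returns a Series per column.
score_range = scores.apply(lambda g: g.max() - g.min())

print(by_name)
print(score_range)
```

The lambda returns one Series per group, so apply stacks the results into a table with one row per teacher.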
In the code above, we passed a function as an argument to the ‘apply’ method. In the same way, we can pass our own custom functions and get the results we want. Let's define a function that finds the best teacher in our data:
def best_teacher(group_dframe):
    return pd.DataFrame({'Math': [group_dframe.loc[group_dframe.Math.idxmax()].Teacher],
                         'Science': [group_dframe.loc[group_dframe.Science.idxmax()].Teacher]})
The function above receives the DataFrame for a single group as its argument and returns a one-row DataFrame holding, for each subject, the name of the teacher whose student has the highest score in that group.
Let's examine the function more closely. Consider the list that is passed as the value for the key ‘Math’ in the dictionary inside the function:
[group_dframe.loc[group_dframe.Math.idxmax()].Teacher]
Let's dissect this list step by step to understand what's going on.
group_dframe.Math.idxmax()
The line above returns the index label of the row with the maximum Math score.
group_dframe.loc[group_dframe.Math.idxmax()]
Next, the .loc indexer fetches the row at that index label. For more on .loc, see my post How to use .loc, .iloc, .ix in Pandas.
Now finally:
group_dframe.loc[group_dframe.Math.idxmax()].Teacher
The line above fetches the Teacher from the row extracted in the previous step. Since that row holds the maximum Math score, the Teacher returned here is the one whose student scored the highest marks in Math.
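The three steps above can be tried on a single group outside of apply. This is only a sketch: the Age 18 group is built here with a boolean filter as a stand-in for what apply would pass in, and the names group, idx, row and teacher are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Student': ['Beth', 'Alex', 'Diana', 'Adrian'],
                   'Age': [18, 19, 18, 19],
                   'Math': [75, 82, 89, 85],
                   'Science': [65, 75, 86, 90],
                   'Teacher': ['William', 'William', 'Robert', 'Robert']})

# Stand-in for one group: the rows where Age is 18 (Beth and Diana).
group = df[df['Age'] == 18]

# Step 1: index label of the row with the highest Math score.
idx = group['Math'].idxmax()

# Step 2: fetch that whole row by label.
row = group.loc[idx]

# Step 3: read the Teacher value from the row.
teacher = row['Teacher']

print(idx, teacher)
```

Note that idxmax returns an index label rather than a position, which is why it pairs with .loc and not .iloc.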
Now let's group the DataFrame by Age and apply our function:
>>> group_dframe = df.groupby('Age')
>>> group_dframe.apply(best_teacher)
         Math Science
Age
18  0  Robert  Robert
19  0  Robert  Robert
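The extra 0 level in the index above comes from returning a one-row DataFrame for each group. As a hedged variation, not the version used in this tutorial, the function can return a Series instead, which gives one flat row per Age. The name best_teacher_flat is hypothetical, and the columns are selected explicitly so the grouping column Age is not passed into the function.

```python
import pandas as pd

df = pd.DataFrame({'Student': ['Beth', 'Alex', 'Diana', 'Adrian'],
                   'Age': [18, 19, 18, 19],
                   'Math': [75, 82, 89, 85],
                   'Science': [65, 75, 86, 90],
                   'Teacher': ['William', 'William', 'Robert', 'Robert']})

def best_teacher_flat(group):
    # Hypothetical variant of best_teacher that returns a Series, so each
    # group contributes exactly one row to the combined result.
    return pd.Series({
        'Math': group.loc[group['Math'].idxmax(), 'Teacher'],
        'Science': group.loc[group['Science'].idxmax(), 'Teacher'],
    })

# Select only the columns the function needs before calling apply.
result = df.groupby('Age')[['Math', 'Science', 'Teacher']].apply(best_teacher_flat)

print(result)
```

The values match the table above; only the shape of the index differs.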

