Data Aggregation with Pandas GroupBy
Pandas is a powerful Python library for data analysis and manipulation. A core feature is the groupby()
method, which allows you to split a DataFrame into groups based on the values of one or more columns. This is often followed by applying an aggregation function to calculate summary statistics for each group. This tutorial will cover how to use groupby()
with aggregation functions, such as sum()
, to gain insights from your data.
Understanding the Basics
The groupby()
method takes one or more column names as arguments. It then creates groups of rows where the values in the specified columns are the same. After grouping, you can apply functions to each group to perform calculations, transform data, or filter results.
Example Dataset
Let’s consider a sample DataFrame representing fruit sales data:
import pandas as pd
data = {'Fruit': ['Apples', 'Apples', 'Apples', 'Apples', 'Oranges', 'Oranges', 'Oranges', 'Oranges', 'Grapes', 'Grapes', 'Grapes', 'Grapes', 'Grapes'],
'Date': ['10/6/2016', '10/6/2016', '10/6/2016', '10/7/2016', '10/7/2016', '10/6/2016', '10/6/2016', '10/7/2016', '10/7/2016', '10/7/2016', '10/7/2016', '10/7/2016', '10/7/2016'],
'Name': ['Bob', 'Bob', 'Mike', 'Steve', 'Bob', 'Tom', 'Mike', 'Bob', 'Bob', 'Tom', 'Bob', 'Bob', 'Tony'],
'Number': [7, 8, 9, 10, 2, 15, 57, 65, 1, 87, 22, 12, 15]}
df = pd.DataFrame(data)
print(df)
This DataFrame contains information about the type of fruit sold (Fruit
), the date of the sale (Date
), the name of the seller (Name
), and the number of fruits sold (Number
).
Grouping and Aggregating
Let’s say we want to calculate the total number of each fruit sold by each seller. We can achieve this using groupby()
and sum()
.
grouped_df = df.groupby(['Fruit', 'Name'])['Number'].sum()
print(grouped_df)
This code first groups the DataFrame by the ‘Fruit’ and ‘Name’ columns. Then, it selects the ‘Number’ column and applies the sum()
function to calculate the sum of the ‘Number’ column for each group. The result is a Series with a multi-level index (Fruit and Name).
You can also reset the index to convert the grouped result back into a DataFrame with ‘Fruit’ and ‘Name’ as regular columns:
reset_df = df.groupby(['Fruit', 'Name'])['Number'].sum().reset_index()
print(reset_df)
Using agg()
for Multiple Aggregations
The agg()
function provides more flexibility. It allows you to apply multiple aggregation functions at once.
grouped_agg = df.groupby(['Fruit', 'Name']).agg({'Number': ['sum', 'mean', 'max']})
print(grouped_agg)
This code groups the data by ‘Fruit’ and ‘Name’ and then applies the sum()
, mean()
, and max()
functions to the ‘Number’ column. The result is a DataFrame with a multi-level column index (Number -> sum, mean, max).
Alternative Aggregation Syntax
You can also provide a dictionary that maps column names to the aggregation functions. This is useful when you want to perform different aggregations on different columns.
grouped_agg = df.groupby(['Fruit', 'Name']).agg({'Number': 'sum'})
print(grouped_agg)
This will produce the same result as df.groupby(['Fruit', 'Name'])['Number'].sum()
.
Further Exploration
Pandas groupby()
is a versatile tool for data analysis. Explore other aggregation functions like count()
, min()
, std()
, and more. Experiment with different grouping columns and aggregation techniques to unlock valuable insights from your data.