Introduction
In data analysis, it’s common to group data and perform operations on these groups. A frequent task is counting unique values within each group. In this tutorial, we will explore how to use the pandas
library in Python to count unique identifiers (ID
) for each distinct domain using a DataFrame.
Scenario
Suppose you have a dataset containing records of user interactions with different domains. Each record consists of an ID
, representing a user or session, and a domain
, which is the website the user interacted with. The goal is to determine how many unique users (or sessions) there are for each domain.
Sample Dataset
Consider this sample dataset:
ID domain
0 123 vk.com
1 123 vk.com
2 123 twitter.com
3 456 vk.com
4 456 facebook.com
5 456 vk.com
6 456 google.com
7 789 twitter.com
8 789 vk.com
Here, each user or session (ID
) may appear multiple times across different domains.
Methodology
Grouping and Counting Unique Values
To achieve our goal of counting unique IDs
per domain, we’ll use the powerful grouping capabilities provided by pandas. The main functions we will focus on are:
groupby()
nunique()
Step-by-step Guide
-
Import Pandas Library:
Ensure you have the pandas library installed and import it in your script.
import pandas as pd
-
Create a DataFrame:
Construct a DataFrame from your data using either a dictionary, list of lists, or reading from an external file (CSV, Excel).
data = { 'ID': [123, 123, 123, 456, 456, 456, 456, 789, 789], 'domain': ['vk.com', 'vk.com', 'twitter.com', 'vk.com', 'facebook.com', 'vk.com', 'google.com', 'twitter.com', 'vk.com'] } df = pd.DataFrame(data)
-
Group by Domain and Count Unique IDs:
Use the
groupby()
method to group data based on the domain. Then apply thenunique()
function on theID
column to count unique values within each group.result = df.groupby('domain')['ID'].nunique() print(result)
This will output:
domain facebook.com 1 google.com 1 twitter.com 2 vk.com 3 Name: ID, dtype: int64
-
Display Results as a DataFrame (Optional):
If you prefer the results in a tabular format rather than a Series, convert it to a DataFrame using
reset_index()
:result_df = df.groupby('domain')['ID'].nunique().reset_index() print(result_df)
The output will be:
domain ID 0 facebook.com 1 1 google.com 1 2 twitter.com 2 3 vk.com 3
Additional Considerations
-
Stripping Whitespace or Special Characters:
If your domain names contain leading or trailing whitespace or special characters, you might want to clean them before grouping. Use the
str.strip()
method for this purpose:result = df.groupby(df.domain.str.strip("'"))['ID'].nunique()
-
Efficiency Considerations:
Depending on your dataset’s size and complexity, consider efficiency. The
groupby
andnunique
approach is efficient and concise compared to other methods like removing duplicates before counting.
Summary
In this tutorial, we learned how to count unique values per group using pandas’ groupby()
and nunique()
functions. This technique is widely applicable in data analysis tasks where identifying distinct elements within categories or groups is necessary.
By mastering these techniques, you’ll be able to efficiently analyze datasets, derive insights from grouped data, and streamline your data processing workflows.