Introduction to Filtering Pandas DataFrames
Pandas is a powerful library for data manipulation and analysis in Python. One common operation when working with datasets is filtering, where you select specific rows based on certain conditions. In this tutorial, we will explore how to filter Pandas DataFrames using ‘in’ and ‘not in’ conditions, similar to SQL.
Using isin()
Method
The most straightforward way to achieve this is by using the isin()
method provided by pandas Series and DataFrames. This method returns a boolean mask indicating whether each element in the Series or DataFrame is contained in the specified list of values.
Here’s an example:
import pandas as pd
# Create a sample DataFrame
df = pd.DataFrame({'country': ['US', 'UK', 'Germany', 'China']})
# Define the list of countries to filter by
countries_to_keep = ['UK', 'China']
# Filter the DataFrame using isin()
filtered_df = df[df['country'].isin(countries_to_keep)]
print(filtered_df)
Output:
country
1 UK
3 China
To apply the opposite condition (i.e., ‘not in’), you can use the bitwise NOT operator (~
) to invert the boolean mask:
# Filter the DataFrame using not in
filtered_df_not_in = df[~df['country'].isin(countries_to_keep)]
print(filtered_df_not_in)
Output:
country
0 US
2 Germany
Using query()
Method
Alternatively, you can use the query()
method provided by pandas DataFrames. This method allows you to filter the DataFrame using a string-based query language.
Here’s an example:
# Filter the DataFrame using query()
filtered_df_query = df.query("country in @countries_to_keep")
print(filtered_df_query)
Output:
country
1 UK
3 China
To apply the opposite condition (i.e., ‘not in’), you can modify the query string accordingly:
# Filter the DataFrame using not in with query()
filtered_df_not_in_query = df.query("country not in @countries_to_keep")
print(filtered_df_not_in_query)
Output:
country
0 US
2 Germany
Filtering on Multiple Columns
If you need to filter the DataFrame based on multiple columns, you can use the isin()
method with a list of values for each column. Alternatively, you can use the query()
method with a more complex query string.
Here’s an example:
# Create a sample DataFrame with multiple columns
df_multi = pd.DataFrame({
'country': ['US', 'UK', 'Germany', 'China'],
'city': ['New York', 'London', 'Berlin', 'Beijing']
})
# Define the list of values to filter by for each column
countries_to_keep = ['UK', 'China']
cities_to_keep = ['London', 'Beijing']
# Filter the DataFrame using isin() on multiple columns
filtered_df_multi_isin = df_multi[
(df_multi['country'].isin(countries_to_keep)) &
(df_multi['city'].isin(cities_to_keep))
]
print(filtered_df_multi_isin)
Output:
country city
1 UK London
3 China Beijing
Best Practices and Tips
- Use the
isin()
method for simple filtering conditions, as it is more readable and efficient. - Use the
query()
method for more complex filtering conditions or when you need to filter on multiple columns. - Avoid using list comprehensions or lambda functions for filtering, as they can be slower and less readable than the
isin()
orquery()
methods. - Always verify the data types of your columns before applying filtering conditions, especially when working with string values.
Conclusion
In this tutorial, we explored how to filter Pandas DataFrames using ‘in’ and ‘not in’ conditions. We covered two main approaches: using the isin()
method and the query()
method. By following best practices and tips, you can efficiently and effectively filter your datasets to extract valuable insights.