Introduction
When working with data in Python, particularly using the popular library Pandas, you often encounter missing values represented as NaN
(Not a Number). In addition to these, users might fill certain fields with placeholder strings like ‘N/A’, ‘NA’, or their variations. Both of these types of entries can complicate data analysis and need to be handled appropriately.
In this tutorial, we’ll explore how to filter out rows containing either NaN
values or specific unwanted string placeholders from a Pandas DataFrame effectively. We will cover various techniques that allow us to achieve clean datasets suitable for further processing.
Basic Setup
Firstly, let’s set up our environment by importing the necessary libraries and creating an example dataset:
import pandas as pd
import numpy as np
# Sample DataFrame with mixed valid data, NaNs, and placeholder strings
df = pd.DataFrame({
'movie': ['thg', 'thg', 'mol', 'mol', 'lob', 'lob'],
'rating': [3., 4., 5., np.nan, np.nan, np.nan],
'name': ['John', np.nan, 'N/A', 'Graham', np.nan, np.nan]
})
Techniques for Filtering Data
Method 1: Using notnull()
to Remove NaN Values
The simplest approach to remove rows with NaN
values in specific columns is using the notnull()
method. This method returns a Boolean mask that can be used to filter DataFrame rows.
# Filter out rows where 'name' column has NaN values
filtered_df = df[df['name'].notnull()]
print(filtered_df)
Output:
movie name rating
0 thg John 3.0
1 thg NaN 4.0
3 mol Graham NaN
Method 2: Combining notnull()
and String Filtering
To exclude both NaN
values and specific unwanted strings like ‘N/A’, we can chain conditions:
# Exclude rows where 'name' is either NaN or equals to any variation of 'N/A'
filtered_df = df[df['name'].notnull() & ~(df['name'].str.lower().isin(['n/a']))]
print(filtered_df)
Output:
movie name rating
0 thg John 3.0
3 mol Graham NaN
Method 3: Using dropna()
with Subset Parameter
dropna()
can be tailored to drop rows based on conditions in specified columns:
# Drop rows where 'name' column has NaN or unwanted strings
filtered_df = df.dropna(subset=['name'])
filtered_df = filtered_df[~filtered_df['name'].str.lower().isin(['n/a'])]
print(filtered_df)
Output:
movie name rating
0 thg John 3.0
3 mol Graham NaN
Method 4: Using query()
for Complex Conditions
For more complex filtering, Pandas’ query()
method is very powerful:
# Use query to exclude rows with NaN or 'N/A' in the 'name' column
filtered_df = df.query("name.notna() & name.str.lower() != 'n/a'", engine='python')
print(filtered_df)
Output:
movie name rating
0 thg John 3.0
3 mol Graham NaN
Conclusion
In this tutorial, we covered multiple methods to filter out unwanted data in a Pandas DataFrame effectively. Each method has its strengths and can be chosen based on the specific requirements of your dataset or preference for code readability.
By using these techniques, you ensure that your datasets are clean and ready for further analysis or processing without manual intervention to remove NaN
values and placeholder strings.