Filtering rows in a Pandas DataFrame based on string patterns is a common task, especially when working with large datasets. In this tutorial, we will cover how to filter rows that contain a specific string pattern using the str.contains method.
Introduction to Pandas DataFrames
Before diving into filtering rows, let’s first introduce Pandas DataFrames. A DataFrame is a two-dimensional table of data with columns of potentially different types. It’s similar to an Excel spreadsheet or a table in a relational database. You can create a DataFrame from a dictionary, where the keys become the column names and the values become the column data.
Creating a Sample DataFrame
Let’s create a sample DataFrame to demonstrate how to filter rows based on string patterns.
import pandas as pd
# Create a sample DataFrame
data = {'ids': ['aball', 'bball', 'cnut', 'fball'],
        'vals': [1, 2, 3, 4]}
df = pd.DataFrame(data)
print(df)
Output:
     ids  vals
0  aball     1
1  bball     2
2   cnut     3
3  fball     4
Filtering Rows Using str.contains
To filter rows that contain a specific string pattern, you can use the str.contains method. This method returns a boolean Series with True for the rows where the pattern is found and False elsewhere. Indexing the DataFrame with that Series keeps only the rows marked True.
Let’s filter the rows that contain the string "ball".
# Filter rows that contain the string "ball"
filtered_df = df[df['ids'].str.contains('ball')]
print(filtered_df)
Output:
     ids  vals
0  aball     1
1  bball     2
3  fball     4
As you can see, the filtered DataFrame only includes the rows where the ids column contains the string "ball".
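To make the intermediate step visible, the following sketch prints the boolean mask on its own before it is used for indexing. It rebuilds the same sample data so it can run by itself; the variable names mask and non_matching are illustrative only.

```python
import pandas as pd

# Same sample data as in the tutorial
df = pd.DataFrame({'ids': ['aball', 'bball', 'cnut', 'fball'],
                   'vals': [1, 2, 3, 4]})

# str.contains produces one boolean per row
mask = df['ids'].str.contains('ball')
print(mask.tolist())

# Inverting the mask with ~ keeps the rows that do NOT contain "ball"
non_matching = df[~mask]
print(non_matching)
```

Keeping the mask in its own variable makes it easy to reuse, invert, or combine with other conditions using & and |.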
Handling Missing Values
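When using str.contains, you might encounter missing values (NaN) in the column you are searching. A missing value has no string to test, so the result for that row can itself be missing rather than True or False, and a mask that contains missing values cannot be used for boolean indexing. To handle these cases, you can pass na=False, which treats missing values as non-matches.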
# Filter rows that contain the string "ball" and exclude NaN values
filtered_df = df[df['ids'].str.contains('ball', na=False)]
print(filtered_df)
This ensures that rows with NaN in the ids column are treated as non-matching and left out of the filtered DataFrame.
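The sample DataFrame used so far has no missing values, so the effect of na=False is not visible there. The sketch below uses a hypothetical variant of the data (df_nan, with one id replaced by NaN) to show the behavior; it is an illustration under that assumption, not part of the original dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical variant of the sample data with one missing id
df_nan = pd.DataFrame({'ids': ['aball', np.nan, 'cnut', 'fball'],
                       'vals': [1, 2, 3, 4]})

# na=False makes the missing entry count as "no match"
mask = df_nan['ids'].str.contains('ball', na=False)
print(mask.tolist())

# The mask is now purely boolean, so it is safe to index with
filtered_nan = df_nan[mask]
print(filtered_nan)
```

If missing ids should be kept instead of dropped, na=True has the opposite effect and marks them as matches.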
Using Regular Expressions
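You can also use regular expressions (regex) to filter rows based on more complex string patterns; str.contains itself treats its pattern as a regular expression by default. Another option is the DataFrame filter method with a regex pattern. Note that filter matches against labels (the index when axis=0), not against the values in a column, so the ids column must first be moved into the index with set_index. The example below keeps the rows whose ids value ends with "ball".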
# Filter rows where the ids column ends with "ball"
filtered_df = df.set_index('ids').filter(regex='ball$', axis=0)
print(filtered_df)
Output:
       vals
ids        
aball     1
bball     2
fball     4
The result only includes rows whose ids value ends with "ball". Because of set_index, ids is now the index of the result rather than a regular column.
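If you would rather keep ids as a regular column, the same "ends with" filter can be written with str.contains and an anchored pattern, or with str.endswith for a plain suffix. The sketch below adds one hypothetical row, 'ballast', which is not in the original data, to show that the $ anchor excludes values that merely contain "ball".

```python
import pandas as pd

# Sample data plus one hypothetical row ('ballast') that contains
# "ball" but does not end with it
df = pd.DataFrame({'ids': ['aball', 'bball', 'cnut', 'fball', 'ballast'],
                   'vals': [1, 2, 3, 4, 5]})

# Anchored regex: "ball" must appear at the end of the string
ends_with_ball = df[df['ids'].str.contains(r'ball$')]
print(ends_with_ball)

# For a fixed suffix, str.endswith gives the same rows without regex
same_rows = df[df['ids'].str.endswith('ball')]
print(same_rows)
```

Both approaches keep the original integer index and the ids column, which can be convenient when the filtered result feeds into later steps that expect the original layout.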
Conclusion
In this tutorial, we covered how to filter rows in a Pandas DataFrame based on string patterns using the str.contains method. We also discussed how to handle missing values and use regular expressions to filter rows based on more complex string patterns. With these techniques, you can efficiently filter your DataFrames to extract the data you need.