Merging data from multiple sources is a common task in data analysis. When working with pandas DataFrames, you can use the merge function to combine two DataFrames based on one or more columns. In this tutorial, we will explore how to merge pandas DataFrames on multiple columns.
Introduction to Merge Function
The merge function is used to join two DataFrames based on one or more common columns. The basic syntax of the merge function is:
pd.merge(left, right, on=None, left_on=None, right_on=None, how='inner', sort=False)
Here:
left and right are the DataFrames to be merged.
on is a label or list of labels for the column(s) to merge on. If on is specified, then left_on and right_on must not be specified.
left_on and right_on are labels or lists of labels for the column(s) to merge on in the left and right DataFrames respectively.
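The rule that on cannot be combined with left_on and right_on can be checked directly. The following is a minimal sketch; the DataFrames left and right and the column names key, x, and y are made up for illustration and do not appear elsewhere in this tutorial.

```python
import pandas as pd

# Hypothetical sample data for illustration only
left = pd.DataFrame({"key": ["a", "b"], "x": [1, 2]})
right = pd.DataFrame({"key": ["a", "b"], "y": [3, 4]})

# Either form on its own is accepted
via_on = pd.merge(left, right, on="key")
via_left_right = pd.merge(left, right, left_on="key", right_on="key")

# Passing on together with left_on/right_on is rejected by pandas
try:
    pd.merge(left, right, on="key", left_on="key", right_on="key")
    combined_rejected = False
except pd.errors.MergeError as exc:
    combined_rejected = True
    print("MergeError:", exc)
```

In this sketch the first two calls match the same rows, while the third raises pandas.errors.MergeError.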
Merging on Multiple Columns
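To merge two DataFrames on multiple columns whose names differ between the two DataFrames, pass a list of column names to the left_on and right_on parameters. The two lists must have the same length, and the columns are paired by position, so the order of the columns must correspond between the two lists.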
Here is an example:
import pandas as pd
# Create sample DataFrames
A_df = pd.DataFrame({
    'A_c1': ['a', 'b', 'c'],
    'c2': [1, 2, 3],
    'value_A': [10, 20, 30]
})
B_df = pd.DataFrame({
    'B_c1': ['a', 'b', 'd'],
    'c2': [1, 2, 4],
    'value_B': [100, 200, 400]
})
# Merge the DataFrames on multiple columns
new_df = pd.merge(A_df, B_df, how='left', left_on=['A_c1', 'c2'], right_on=['B_c1', 'c2'])
print(new_df)
This will output:
  A_c1  c2  value_A B_c1  value_B
0    a   1       10    a    100.0
1    b   2       20    b    200.0
2    c   3       30  NaN      NaN
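As you can see, the DataFrames were merged on both key pairs: A_c1 with B_c1, and c2 with c2. The third row of A_df has no match in B_df, so the columns coming from B_df hold missing values for that row.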
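The ordering requirement for left_on and right_on can be seen in a small experiment. This is a sketch with made-up data: the DataFrames people and records and their column names are hypothetical and chosen only so that both key columns hold strings.

```python
import pandas as pd

# Hypothetical sample data for illustration only
people = pd.DataFrame({
    "first": ["Ada", "Alan"],
    "last": ["Lovelace", "Turing"],
})
records = pd.DataFrame({
    "given": ["Ada", "Alan"],
    "family": ["Lovelace", "Turing"],
    "year": [1815, 1912],
})

# Keys are paired by position: first with given, last with family
matched = pd.merge(people, records,
                   left_on=["first", "last"], right_on=["given", "family"])

# Swapping the order pairs first with family and last with given
swapped = pd.merge(people, records,
                   left_on=["first", "last"], right_on=["family", "given"])

print(len(matched), len(swapped))  # 2 0
```

No error is raised in the swapped case; the inner merge simply finds no matching rows, which is why a mismatched order can go unnoticed.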
Merging on Identical Column Names
If the column names to merge on are identical between the two DataFrames, you can use the on parameter instead of left_on and right_on. This will simplify your code and avoid duplicate columns in the merged DataFrame.
Here is an example:
import pandas as pd
# Create sample DataFrames
df1 = pd.DataFrame({
    'col1': ['a', 'b', 'c'],
    'col2': [1, 2, 3],
    'value_1': [10, 20, 30]
})
df2 = pd.DataFrame({
    'col1': ['a', 'b', 'd'],
    'col2': [1, 2, 4],
    'value_2': [100, 200, 400]
})
# Merge the DataFrames on identical column names
merged_df = df1.merge(df2, on=['col1', 'col2'])
print(merged_df)
This will output:
  col1  col2  value_1  value_2
0    a     1       10      100
1    b     2       20      200
As you can see, the DataFrames were merged on both col1 and col2 columns without duplicate columns in the result.
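To make the point about duplicate columns concrete, the sketch below compares the columns produced by on with those produced by left_on and right_on when the key has a different name on the right side. The DataFrames and the column name key are assumptions made for this illustration.

```python
import pandas as pd

# Hypothetical sample data for illustration only
left = pd.DataFrame({"col1": ["a", "b"], "value_1": [10, 20]})
right = pd.DataFrame({"col1": ["a", "b"], "value_2": [100, 200]})
renamed = right.rename(columns={"col1": "key"})

# Same key name on both sides: the key appears once in the result
same_names = left.merge(right, on="col1")

# Different key names: both key columns are kept in the result
different_names = left.merge(renamed, left_on="col1", right_on="key")

print(list(same_names.columns))       # ['col1', 'value_1', 'value_2']
print(list(different_names.columns))  # ['col1', 'value_1', 'key', 'value_2']

# The redundant key column can be dropped afterwards if it is not needed
deduplicated = different_names.drop(columns="key")
```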
Merging on Index
You can also merge one side on column names and the other side on its index. To do this, pass left_index=True or right_index=True for the side that uses its index, and left_on or right_on for the side that uses columns.
Here is an example:
import pandas as pd
# Create sample DataFrames
df1 = pd.DataFrame({
    'col1': ['a', 'b', 'c'],
    'col2': [1, 2, 3],
    'value_1': [10, 20, 30]
})
df2 = pd.DataFrame({
    'value_2': [100, 200, 400]
}, index=['a', 'b', 'd'])
# Merge the DataFrames on column names and index
merged_df = df1.merge(df2, left_on='col1', right_index=True)
print(merged_df)
This will output:
  col1  col2  value_1  value_2
0    a     1       10      100
1    b     2       20      200
As you can see, the DataFrames were merged on the col1 column of df1 and the index of df2.
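The same idea extends to several key columns when the other DataFrame carries a MultiIndex with one level per key. The following sketch assumes a right-hand DataFrame built with pd.MultiIndex.from_tuples; the variable names left, right, and merged are chosen for illustration.

```python
import pandas as pd

left = pd.DataFrame({
    "col1": ["a", "b", "c"],
    "col2": [1, 2, 3],
    "value_1": [10, 20, 30],
})

# The right DataFrame has a two-level index instead of key columns
right = pd.DataFrame(
    {"value_2": [100, 200, 400]},
    index=pd.MultiIndex.from_tuples([("a", 1), ("b", 2), ("d", 4)]),
)

# Two columns on the left are matched against the two index levels on the right
merged = left.merge(right, left_on=["col1", "col2"], right_index=True)
print(merged)
```

The number of columns in left_on has to equal the number of index levels on the right. With the default inner merge, only the rows for ('a', 1) and ('b', 2) are kept here.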
Choosing the Merge Type
By using the how parameter, you can choose the type of merge to perform:
inner: Returns only the rows that have matches in both DataFrames.
left: Returns all the rows from the left DataFrame and matching rows from the right DataFrame. If no match is found, the result will contain NaN values.
right: Similar to left, but returns all the rows from the right DataFrame and matching rows from the left DataFrame.
outer: Returns all rows from both DataFrames.
Here is an example:
import pandas as pd
# Create sample DataFrames
df1 = pd.DataFrame({
    'col1': ['a', 'b', 'c'],
    'value_1': [10, 20, 30]
})
df2 = pd.DataFrame({
    'col1': ['a', 'b', 'd'],
    'value_2': [100, 200, 400]
})
# Perform different types of merge
inner_df = df1.merge(df2, on='col1')
left_df = df1.merge(df2, on='col1', how='left')
right_df = df1.merge(df2, on='col1', how='right')
outer_df = df1.merge(df2, on='col1', how='outer')
print("Inner Merge:")
print(inner_df)
print("\nLeft Merge:")
print(left_df)
print("\nRight Merge:")
print(right_df)
print("\nOuter Merge:")
print(outer_df)
This will output the results of each type of merge.
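One quick way to compare the four merge types is to count the rows each one returns. The sketch below rebuilds the same sample data and collects the counts in a dictionary; the names row_counts and outer_keys are introduced here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    "col1": ["a", "b", "c"],
    "value_1": [10, 20, 30],
})
df2 = pd.DataFrame({
    "col1": ["a", "b", "d"],
    "value_2": [100, 200, 400],
})

# Number of rows returned by each merge type
row_counts = {
    how: len(df1.merge(df2, on="col1", how=how))
    for how in ["inner", "left", "right", "outer"]
}
print(row_counts)  # {'inner': 2, 'left': 3, 'right': 3, 'outer': 4}

# Keys present in the outer merge: the union of the keys from both sides
outer_keys = set(df1.merge(df2, on="col1", how="outer")["col1"])
```

Only a and b appear in both DataFrames, so the inner merge has two rows; the left and right merges each keep one unmatched key (c and d respectively), and the outer merge keeps both.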
In conclusion, merging pandas DataFrames on multiple columns is a powerful operation that allows you to combine data from different sources. By choosing the correct merge type and using the on, left_on, and right_on parameters, you can achieve the desired result for your data analysis tasks.