Importing multiple CSV files into a single DataFrame is a common task when working with data. In this tutorial, we will explore how to use the pandas library to import and concatenate multiple CSV files.
Introduction to Pandas
Pandas is a powerful Python library used for data manipulation and analysis. It provides data structures such as Series (1-dimensional labeled array) and DataFrames (2-dimensional labeled data structure with columns of potentially different types).
Importing Necessary Libraries
To start, we need to import the necessary libraries:
import pandas as pd
We will also use the glob
library to find all CSV files in a directory:
import glob
And optionally, we can use the pathlib
library for path manipulation:
from pathlib import Path
Finding CSV Files
To find all CSV files in a directory, we can use the glob.glob()
function:
path = 'data/' # replace with your directory path
files = glob.glob(path + '*.csv')
Alternatively, we can use the Path
class from pathlib
to achieve the same result:
path = Path('data/')
files = list(path.glob('*.csv'))
Importing CSV Files
Now that we have a list of CSV files, we can import them into DataFrames using a loop:
dfs = []
for file in files:
df = pd.read_csv(file)
dfs.append(df)
We can also use a list comprehension to achieve the same result:
dfs = [pd.read_csv(file) for file in files]
Concatenating DataFrames
To concatenate the DataFrames, we can use the pd.concat()
function:
df_concat = pd.concat(dfs, ignore_index=True)
The ignore_index=True
parameter resets the index of the resulting DataFrame.
Adding a Source Column
It’s often useful to add a source column to identify which file each row comes from. We can do this by adding a new column to each DataFrame before concatenating:
dfs = []
for i, file in enumerate(files):
df = pd.read_csv(file)
df['source'] = f'File {i}'
dfs.append(df)
Alternatively, we can use the assign()
method to add a new column:
df_concat = pd.concat((pd.read_csv(file).assign(source=f'S{i}') for i, file in enumerate(files)), ignore_index=True)
Example Use Case
Suppose we have two CSV files, data1.csv
and data2.csv
, with the following contents:
data1.csv
:
name,age
John,25
Jane,30
data2.csv
:
name,age
Bob,35
Alice,20
We can import and concatenate these files using the above code:
files = glob.glob('data/*.csv')
dfs = [pd.read_csv(file) for file in files]
df_concat = pd.concat(dfs, ignore_index=True)
print(df_concat)
Output:
name age
0 John 25
1 Jane 30
2 Bob 35
3 Alice 20
By adding a source column, we can identify which file each row comes from:
dfs = []
for i, file in enumerate(files):
df = pd.read_csv(file)
df['source'] = f'File {i}'
dfs.append(df)
df_concat = pd.concat(dfs, ignore_index=True)
print(df_concat)
Output:
name age source
0 John 25 File 0
1 Jane 30 File 0
2 Bob 35 File 1
3 Alice 20 File 1
Conclusion
In this tutorial, we learned how to import and concatenate multiple CSV files using the pandas library. We also explored how to add a source column to identify which file each row comes from. By following these steps, you can easily work with multiple CSV files in your data analysis projects.