Importing and Concatenating Multiple CSV Files with Pandas

Importing multiple CSV files into a single DataFrame is a common task when working with data. In this tutorial, we will explore how to use the pandas library to import and concatenate multiple CSV files.

Introduction to Pandas

Pandas is a powerful Python library used for data manipulation and analysis. It provides data structures such as Series (1-dimensional labeled array) and DataFrames (2-dimensional labeled data structure with columns of potentially different types).

Importing Necessary Libraries

To start, we need to import the necessary libraries:

import pandas as pd

We will also use the glob library to find all CSV files in a directory:

import glob

And optionally, we can use the pathlib library for path manipulation:

from pathlib import Path

Finding CSV Files

To find all CSV files in a directory, we can use the glob.glob() function:

path = 'data/'  # replace with your directory path
files = glob.glob(path + '*.csv')

Alternatively, we can use the Path class from pathlib to achieve the same result:

path = Path('data/')
files = list(path.glob('*.csv'))

Importing CSV Files

Now that we have a list of CSV files, we can import them into DataFrames using a loop:

dfs = []
for file in files:
    df = pd.read_csv(file)
    dfs.append(df)

We can also use a list comprehension to achieve the same result:

dfs = [pd.read_csv(file) for file in files]

Concatenating DataFrames

To concatenate the DataFrames, we can use the pd.concat() function:

df_concat = pd.concat(dfs, ignore_index=True)

The ignore_index=True parameter resets the index of the resulting DataFrame.

Adding a Source Column

It’s often useful to add a source column to identify which file each row comes from. We can do this by adding a new column to each DataFrame before concatenating:

dfs = []
for i, file in enumerate(files):
    df = pd.read_csv(file)
    df['source'] = f'File {i}'
    dfs.append(df)

Alternatively, we can use the assign() method to add a new column:

df_concat = pd.concat((pd.read_csv(file).assign(source=f'S{i}') for i, file in enumerate(files)), ignore_index=True)

Example Use Case

Suppose we have two CSV files, data1.csv and data2.csv, with the following contents:

data1.csv:

name,age
John,25
Jane,30

data2.csv:

name,age
Bob,35
Alice,20

We can import and concatenate these files using the above code:

files = glob.glob('data/*.csv')
dfs = [pd.read_csv(file) for file in files]
df_concat = pd.concat(dfs, ignore_index=True)
print(df_concat)

Output:

     name  age
0    John   25
1    Jane   30
2     Bob   35
3   Alice   20

By adding a source column, we can identify which file each row comes from:

dfs = []
for i, file in enumerate(files):
    df = pd.read_csv(file)
    df['source'] = f'File {i}'
    dfs.append(df)
df_concat = pd.concat(dfs, ignore_index=True)
print(df_concat)