Visualizing Multiple Distributions with Overlapping Histograms using Matplotlib

Histograms are a fundamental tool in data visualization: they show how the values in a dataset are distributed and how often each range of values occurs. When comparing multiple datasets, it helps to plot them on the same chart. This tutorial walks through plotting two histograms on a single chart using Python’s Matplotlib library, with techniques that keep both visible even when their counts differ widely.

Introduction

Histograms display data distributions by grouping numbers into bins and drawing a bar whose height is the number of values in each bin. When two datasets share one chart, the bars of one histogram can hide those of the other. This tutorial addresses that problem using Matplotlib’s plotting capabilities, covering both overlapping (semi-transparent) and side-by-side bar representations.
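To make the binning step concrete, here is a minimal sketch that counts values per bin with NumPy's np.histogram, without drawing anything. The sample values and the choice of four bins over the range 0 to 4 are made up for illustration.

```python
import numpy as np

# Small hand-written sample (illustrative values) so the counts are easy to check by eye
data = np.array([0.5, 1.2, 1.8, 2.1, 2.4, 2.9, 3.3])

# Four equal-width bins over [0, 4]: the edges are 0, 1, 2, 3, 4
counts, edges = np.histogram(data, bins=4, range=(0, 4))

# Each bin includes its left edge; only the last bin also includes its right edge
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f} to {right:.0f}: {count}")
```

The counts array holds exactly the bar heights that plt.hist would draw for the same data and bins.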

Setting Up Your Environment

Before you begin, ensure you have the necessary libraries installed:

pip install matplotlib numpy

These tools provide the core functionalities needed to generate histograms and manipulate their aesthetics.

Basic Histogram Plotting

Start by overlaying two histograms on the same axes, using transparency so that both remain visible. Here’s an example using Gaussian distributions for demonstration:

import random
import numpy as np
from matplotlib import pyplot as plt

# Generate sample data
x = [random.gauss(3, 1) for _ in range(400)]
y = [random.gauss(4, 2) for _ in range(400)]

# Define bin edges
bins = np.linspace(-10, 10, 100)

# Plot histograms with transparency to overlap them
plt.hist(x, bins, alpha=0.5, label='x')
plt.hist(y, bins, alpha=0.5, label='y')

# Add legend and display plot
plt.legend(loc='upper right')
plt.show()

In this code:

  • alpha sets the opacity of the bars (0 is fully transparent, 1 is fully opaque); at 0.5, both histograms remain visible where they overlap.
  • The label parameter names each dataset so the legend can distinguish them.
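Overlaid histograms are only comparable bar for bar when both use the same bin edges. The fixed range above works for this sample data; for data with an unknown range, one option is to derive the edges from the pooled data. The following is a sketch under that assumption, using np.histogram_bin_edges; the seed, the bin count of 40, and the name shared_bins are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
x = rng.normal(3, 1, 400)
y = rng.normal(4, 2, 400)

# Compute the edges once from the pooled data so both histograms share identical bins
shared_bins = np.histogram_bin_edges(np.concatenate([x, y]), bins=40)

# Because the edges span the pooled minimum and maximum, every sample lands in a bin
counts_x, _ = np.histogram(x, bins=shared_bins)
counts_y, _ = np.histogram(y, bins=shared_bins)
print(counts_x.sum(), counts_y.sum())
```

The resulting shared_bins array can be passed to plt.hist as the bins argument in place of the fixed np.linspace(-10, 10, 100) edges.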

Plotting Histograms Side-by-Side

When your objective is to compare distributions without overlap, plot histograms side-by-side. This approach visually distinguishes the frequency of each dataset:

import numpy as np
import matplotlib.pyplot as plt

# Generate random data
x = np.random.normal(1, 2, 5000)
y = np.random.normal(-1, 3, 2000)

# Define bins
bins = np.linspace(-10, 10, 30)

# Plot histograms side-by-side (histtype='bar' is the default)
plt.hist([x, y], bins, label=['x', 'y'], histtype='bar')

# Add legend and show plot
plt.legend(loc='upper right')
plt.show()

Here:

  • Passing a list of datasets with histtype='bar' draws the bars for each bin next to each other; histtype='barstacked' would instead stack them on top of one another.
  • The arrays can have different lengths; hist accepts datasets of unequal size.
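The difference between adjacent and stacked bars is easiest to see with both drawn from the same data. The sketch below places histtype='bar' and histtype='barstacked' in two panels; the seed, figure size, and variable names are illustrative, and the Agg backend is selected only so the sketch also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; remove this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
x = rng.normal(1, 2, 5000)
y = rng.normal(-1, 3, 2000)
bins = np.linspace(-10, 10, 30)  # 30 edges, so 29 bins

fig, (ax_side, ax_stack) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

# Left panel: bars of x and y sit next to each other inside each bin
side_counts, _, _ = ax_side.hist([x, y], bins, label=['x', 'y'], histtype='bar')
ax_side.set_title("histtype='bar'")
ax_side.legend(loc='upper right')

# Right panel: bars of y are drawn on top of the bars of x
ax_stack.hist([x, y], bins, label=['x', 'y'], histtype='barstacked')
ax_stack.set_title("histtype='barstacked'")
ax_stack.legend(loc='upper right')

fig.tight_layout()
```

To view the result, call plt.show() under an interactive backend or save the figure with fig.savefig.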

Handling Differing Sample Sizes

When datasets have significantly different sample sizes, raw counts make the larger dataset dominate the chart. Normalization allows a fair comparison by rescaling the bar heights so that each histogram shows relative frequencies or probability densities rather than raw counts:

import numpy as np
import matplotlib.pyplot as plt

# Generate data with differing sizes
x = np.random.normal(1, 2, 5000)
y = np.random.normal(-1, 3, 2000)

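# Weight each sample by 1/N so that the bar heights of each dataset sum to 1
x_weights = np.ones_like(x) / len(x)
y_weights = np.ones_like(y) / len(y)

# Plot normalized histograms (bar height = fraction of samples in the bin)
plt.hist([x, y], bins=30, weights=[x_weights, y_weights], label=['x', 'y'], alpha=0.5)

# Add legend and display plot
plt.legend(loc='upper right')
plt.show()

In this code:

  • weights gives every sample a weight of 1/N, so each bar shows the fraction of that dataset falling in the bin.
  • density=True is an alternative that rescales each dataset so that its total bar area equals 1; combining it with uniform weights is redundant, because the density rescaling cancels them out.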
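The two normalizations are related but not identical, which a quick numerical check makes visible. This sketch uses np.histogram rather than plotting; the seed and variable names are illustrative, and the bin edges are derived from the data so that no sample falls outside them.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed
x = rng.normal(1, 2, 5000)

# Edges derived from the data itself, so every sample is counted
bins = np.histogram_bin_edges(x, bins=30)
bin_widths = np.diff(bins)

# Uniform weights of 1/N: each bar is the fraction of samples in that bin
fractions, _ = np.histogram(x, bins=bins, weights=np.ones_like(x) / len(x))

# density=True: each bar is the fraction divided by the bin width
density, _ = np.histogram(x, bins=bins, density=True)

print(fractions.sum())               # heights sum to 1 (up to rounding)
print((density * bin_widths).sum())  # area sums to 1 (up to rounding)
```

With equal-width bins the two versions differ only by a constant factor, so the shapes look the same and only the y-axis scale changes.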

Advanced Customization

For complex visualizations where different y-axis scales are needed, dual axes can be employed:

import numpy as np
import matplotlib.pyplot as plt

# Generate sample data
y1 = np.random.normal(-2, 2, 1000)
y2 = np.random.normal(2, 2, 5000)

fig, ax1 = plt.subplots()
ax2 = ax1.twinx()

# Use common bin edges so the two sets of bars line up
bins = np.histogram_bin_edges(np.concatenate([y1, y2]), bins=10)

# Compute the counts without drawing, so no axes need to be cleared afterwards
n1, _ = np.histogram(y1, bins=bins)
n2, _ = np.histogram(y2, bins=bins)

# Each bar takes 40% of the bin width; the second set is shifted right by one bar width
width = (bins[1] - bins[0]) * 0.4

# Draw the bars side by side, each dataset on its own y-axis
ax1.bar(bins[:-1], n1, width=width, align='edge', color='b')
ax2.bar(bins[:-1] + width, n2, width=width, align='edge', color='g')

# Set axis labels and display plot
ax1.set_ylabel("Count (y1)", color='b')
ax2.set_ylabel("Count (y2)", color='g')
plt.tight_layout()
plt.show()

Here:

  • Dual axes (twinx) allow separate y-scales for comparison.
  • Bar positions are adjusted using width to ensure clear visualization.
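With twin axes, each axis keeps its own legend entries, so a single call to legend on one axis lists only that axis's bars. One way to get a combined legend is to collect the handles from both axes, as sketched below. The seed, the labels, and the use of shared bin edges are assumptions for illustration, and the Agg backend is selected only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; remove this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed
y1 = rng.normal(-2, 2, 1000)
y2 = rng.normal(2, 2, 5000)

# Shared bin edges and counts, computed without drawing
bins = np.histogram_bin_edges(np.concatenate([y1, y2]), bins=10)
n1, _ = np.histogram(y1, bins=bins)
n2, _ = np.histogram(y2, bins=bins)
width = (bins[1] - bins[0]) * 0.4

fig, ax1 = plt.subplots()
ax2 = ax1.twinx()
ax1.bar(bins[:-1], n1, width=width, align='edge', color='b', label='y1')
ax2.bar(bins[:-1] + width, n2, width=width, align='edge', color='g', label='y2')

# Merge the legend entries of both axes into one legend
handles1, labels1 = ax1.get_legend_handles_labels()
handles2, labels2 = ax2.get_legend_handles_labels()
ax2.legend(handles1 + handles2, labels1 + labels2, loc='upper right')
```

Drawing the legend on ax2 places it above the bars of both axes, since the twin axis is rendered on top of the original one.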

Conclusion

This tutorial covered several methods of plotting multiple histograms on a single chart using Matplotlib: overlapping bars with transparency, side-by-side bars, normalized plots, and dual y-axes. Whether the datasets have equal or differing sample sizes, these techniques keep both distributions readable.
