Creating histograms is a fundamental aspect of data visualization, providing insights into the distribution and frequency of data. When comparing multiple datasets, it becomes essential to plot them together for comparative analysis. This tutorial will guide you through plotting two histograms on a single chart using Python’s Matplotlib library, with techniques to ensure both are visible regardless of their count values.
Introduction
Histograms display data distributions by grouping numbers into bins and showing the frequency of each bin as bars. When comparing datasets on one chart, a common problem is that one histogram obscures the other because their bars overlap. This tutorial addresses that problem using Matplotlib’s plotting capabilities, offering solutions for both overlaid (semi-transparent) and side-by-side bar representations.
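To make the binning step concrete, the following minimal sketch counts a small hand-picked sample with NumPy's np.histogram, which returns per-bin counts of the kind a histogram plot displays. The sample values and bin edges are illustrative assumptions, not data from this tutorial.

```python
import numpy as np

# Illustrative sample and bin edges (assumptions for this sketch)
data = [1, 2, 2, 3, 3, 3, 4]
edges = [1, 2, 3, 4, 5]

# Each bin covers [left, right), except the last bin,
# which also includes its right edge
counts, edges_out = np.histogram(data, bins=edges)

# Print one line per bin: its left edge, right edge, and count
for left, right, count in zip(edges_out[:-1], edges_out[1:], counts):
    print(left, right, count)
```

Every value lands in exactly one bin, so the counts add up to the number of samples as long as no value lies outside the outermost edges.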
Setting Up Your Environment
Before you begin, ensure you have the necessary libraries installed:
pip install matplotlib numpy
These tools provide the core functionalities needed to generate histograms and manipulate their aesthetics.
Basic Histogram Plotting
Start by overlaying two histograms on the same axes, using the same bin edges for both and partial transparency so neither hides the other. Here’s an example using Gaussian distributions for demonstration:
import random
import numpy as np
from matplotlib import pyplot as plt
# Generate sample data
x = [random.gauss(3, 1) for _ in range(400)]
y = [random.gauss(4, 2) for _ in range(400)]
# Define bin edges
bins = np.linspace(-10, 10, 100)
# Plot histograms with transparency to overlap them
plt.hist(x, bins, alpha=0.5, label='x')
plt.hist(y, bins, alpha=0.5, label='y')
# Add legend and display plot
plt.legend(loc='upper right')
plt.show()
In this code:
- alpha sets the transparency of the bars, allowing both histograms to be visible.
- The label parameter helps in distinguishing datasets when using a legend.
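Overlaid histograms are only comparable bar for bar when both use the same bin edges, which is why the example passes one bins array to both calls. Instead of hard-coding a range such as -10 to 10, the edges can be derived from the data. The sketch below is one way to do that with np.histogram_bin_edges on the pooled samples; the seed and the choice of 40 bins are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
x = rng.normal(3, 1, 400)
y = rng.normal(4, 2, 400)

# Derive one set of edges that spans both datasets
pooled = np.concatenate([x, y])
bins = np.histogram_bin_edges(pooled, bins=40)

# Both datasets are then counted against identical edges
x_counts, _ = np.histogram(x, bins=bins)
y_counts, _ = np.histogram(y, bins=bins)
print(len(bins), x_counts.sum(), y_counts.sum())
```

The resulting bins array can be passed to plt.hist in place of a hard-coded np.linspace range, so no samples fall outside the plotted bins.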
Plotting Histograms Side-by-Side
When your objective is to compare distributions without overlap, plot histograms side-by-side. This approach visually distinguishes the frequency of each dataset:
import numpy as np
import matplotlib.pyplot as plt
# Generate random data
x = np.random.normal(1, 2, 5000)
y = np.random.normal(-1, 3, 2000)
# Define bins
bins = np.linspace(-10, 10, 30)
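# Plot histograms side-by-side (the default histtype='bar' groups the bars per bin)
plt.hist([x, y], bins, label=['x', 'y'])
# Add legend and show plot
plt.legend(loc='upper right')
plt.show()
Here:
- Passing a list of datasets with the default histtype='bar' draws the bars for each bin next to one another. Setting histtype='barstacked' would instead stack them on top of each other, which shows the combined total rather than a side-by-side comparison.
- Arrays can be of different lengths without affecting the visualization.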
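When hist receives a list of datasets with its default histtype, it returns one row of per-bin counts for each dataset alongside the shared bin edges. The sketch below inspects those return values as a quick sanity check. It assumes a fixed seed and selects the non-interactive Agg backend so that it runs without a display; both are choices made for this sketch only.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display needed)
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(1, 2, 5000)
y = rng.normal(-1, 3, 2000)
bins = np.linspace(-10, 10, 30)

fig, ax = plt.subplots()
counts, edges, _ = ax.hist([x, y], bins, label=["x", "y"])
ax.legend(loc="upper right")

# Samples outside the outermost edges are not counted,
# so compare against the in-range totals
in_x = int(((x >= bins[0]) & (x <= bins[-1])).sum())
in_y = int(((y >= bins[0]) & (y <= bins[-1])).sum())
print(len(counts), counts[0].sum(), counts[1].sum(), in_x, in_y)
```

Because each row holds only its own dataset's counts, the two totals differ when the sample sizes differ, which is the situation the next section deals with.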
Handling Differing Sample Sizes
When datasets have very different sample sizes, raw counts make the smaller dataset look insignificant next to the larger one. Normalization puts both on a common scale by rescaling the bar heights relative to each dataset's own size:
import numpy as np
import matplotlib.pyplot as plt
# Generate data with differing sizes
x = np.random.normal(1, 2, 5000)
y = np.random.normal(-1, 3, 2000)
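# Weight each sample by 1/N so every dataset's bar heights sum to 1
x_weights = np.ones_like(x) / len(x)
y_weights = np.ones_like(y) / len(y)
# Plot normalized histograms (heights are fractions of each dataset)
plt.hist([x, y], bins=30, weights=[x_weights, y_weights], label=['x', 'y'], alpha=0.5)
# Add legend and display plot
plt.legend(loc='upper right')
plt.show()
In this code:
- weights gives every sample a weight of 1/N, so each bar shows the fraction of its dataset that falls in that bin and the bar heights of each dataset sum to 1.
- density=True is an alternative that rescales each dataset so the area under its bars equals 1. It is not combined with weights here because density renormalizes the weighted counts, which makes the constant 1/N weights redundant.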
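The two normalizations are not interchangeable, and a quick numerical check can make the difference concrete. The sketch below uses np.histogram, which also takes weights and density arguments, on seeded random data (the seed and the bin count are assumptions). Per-sample weights of 1/N make the bar heights sum to 1, whereas density=True makes the bar areas (height times bin width) sum to 1.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(1, 2, 5000)

# Option 1: weight every sample by 1/N
weights = np.ones_like(x) / len(x)
fractions, edges = np.histogram(x, bins=30, weights=weights)

# Option 2: let NumPy normalize to a probability density
density, _ = np.histogram(x, bins=30, density=True)

widths = np.diff(edges)
print(fractions.sum())           # close to 1: heights are fractions of the sample
print((density * widths).sum())  # close to 1: total area under the bars
```

With equal-width bins the two results differ only by a constant factor, the bin width, so the shapes match and only the y-axis scale changes.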
Advanced Customization
For complex visualizations where different y-axis scales are needed, dual axes can be employed:
import numpy as np
import matplotlib.pyplot as plt
# Generate sample data
y1 = np.random.normal(-2, 2, 1000)
y2 = np.random.normal(2, 2, 5000)
# Count both datasets on the same bin edges so the bars line up
bins = np.linspace(min(y1.min(), y2.min()), max(y1.max(), y2.max()), 11)
n1, _ = np.histogram(y1, bins=bins)
n2, _ = np.histogram(y2, bins=bins)
fig, ax1 = plt.subplots()
ax2 = ax1.twinx()
# Give each dataset 40% of the bin width; shift the second set by one bar width
width = (bins[1] - bins[0]) * 0.4
# Draw the bars directly, one dataset per y-axis
ax1.bar(bins[:-1], n1, width=width, align='edge', color='b')
ax2.bar(bins[:-1] + width, n2, width=width, align='edge', color='g')
# Set axis labels and display plot
ax1.set_ylabel("Count (y1)", color='b')
ax2.set_ylabel("Count (y2)", color='g')
plt.tight_layout()
plt.show()
Here:
- Dual axes (twinx) allow separate y-scales for comparison.
- Bar positions are adjusted using width to ensure clear visualization.
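The shift-by-width idea extends to more than two datasets. The following sketch is a hypothetical helper, grouped_bar_lefts (not part of Matplotlib), that computes the left edge of every bar when several datasets share each bin. The 0.8 fill fraction is an assumption chosen to leave a small gap between neighboring bins.

```python
import numpy as np

def grouped_bar_lefts(edges, n_datasets, fill=0.8):
    """Return (lefts, width) for side-by-side bars inside shared bins.

    lefts has shape (n_datasets, n_bins); row i holds the left edges
    for dataset i. Hypothetical helper, for illustration only.
    """
    edges = np.asarray(edges, dtype=float)
    bin_widths = np.diff(edges)
    # Each dataset gets an equal slice of the filled part of the bin
    width = bin_widths * fill / n_datasets
    # Dataset i is shifted right by i bar widths within each bin
    offsets = np.arange(n_datasets)[:, None] * width[None, :]
    lefts = edges[:-1][None, :] + offsets
    return lefts, width

edges = np.linspace(0, 10, 6)  # five bins, each 2 units wide
lefts, width = grouped_bar_lefts(edges, n_datasets=2, fill=0.8)
print(lefts[0])  # left edges for the first dataset
print(lefts[1])  # left edges for the second dataset, one bar width to the right
```

Each row can then be drawn with a call such as ax.bar(lefts[i], counts_i, width=width, align='edge'), where counts_i stands for the per-bin counts of dataset i.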
Conclusion
This tutorial covered various methods of plotting multiple histograms on a single chart using Matplotlib, including handling transparency, side-by-side bars, and normalized plots. Whether comparing distributions with equal or differing sample sizes, these techniques facilitate clear and effective data visualization.