Understanding Linear Relationships and Visualization
In data science and statistics, understanding the relationship between variables is crucial. A linear relationship implies that a change in one variable is associated with a proportional change in another. Visualizing such relationships using scatter plots and regression lines can provide valuable insights. The ggplot2 package in R provides a powerful and flexible system for creating elegant and informative graphics, including those displaying linear relationships. This tutorial will guide you through adding regression lines to your plots using ggplot2.
Creating a Basic Scatter Plot
Before adding a regression line, let’s create a basic scatter plot. First, we’ll generate some sample data:
# Create sample data
set.seed(123) # for reproducibility
data <- data.frame(
  x.plot = rep(seq(1, 5), 10),
  y.plot = rnorm(50)
)
# Create a scatter plot
library(ggplot2)
ggplot(data, aes(x = x.plot, y = y.plot)) +
  geom_point()
This code generates a scatter plot showing the relationship between x.plot and y.plot. The aes() function maps the variables to the x and y axes. geom_point() adds the individual data points to the plot.
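Because y.plot above is drawn with rnorm() independently of x.plot, any line fitted to it will be nearly flat. As a minimal sketch, assuming you want sample data with a visible trend, the following simulates a response with a known intercept and slope. The names trend_data, x, and y and the chosen coefficients are illustrative and not part of the original example.

```r
library(ggplot2)

set.seed(123)  # for reproducibility

# Hypothetical data: y depends linearly on x (intercept 2, slope 0.5) plus noise
trend_data <- data.frame(x = rep(seq(1, 5), 10))
trend_data$y <- 2 + 0.5 * trend_data$x + rnorm(50, sd = 0.5)

# Same plotting pattern as above, applied to the new data frame
p_trend <- ggplot(trend_data, aes(x = x, y = y)) +
  geom_point()
p_trend
```

The rest of the tutorial works the same way on either data frame; only the steepness of the fitted line differs.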
Adding a Regression Line with geom_smooth()
The simplest way to add a regression line is using geom_smooth(). By default, geom_smooth() picks a smoother based on sample size: a locally weighted (LOESS) fit for fewer than 1,000 observations and a generalized additive model for larger data sets. To specifically fit and display a linear regression line, set the method argument to "lm":
ggplot(data, aes(x = x.plot, y = y.plot)) +
  geom_point() +
  geom_smooth(method = "lm")
This adds a linear regression line to your scatter plot, representing the least-squares best-fit line through the data. The shaded area around the line is the confidence interval for the fitted line (95% by default), computed from the standard error of the prediction. To remove the shaded area, set se = FALSE within geom_smooth():
ggplot(data, aes(x = x.plot, y = y.plot)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
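If you keep the shaded area, its extent is controlled by the level argument of geom_smooth(), the confidence level used for the interval, which is 0.95 by default. A small sketch, reusing the sample data from above, that draws a 99% interval instead (the name p_band is just for illustration):

```r
library(ggplot2)

set.seed(123)
data <- data.frame(
  x.plot = rep(seq(1, 5), 10),
  y.plot = rnorm(50)
)

# level = 0.99 widens the shaded interval around the fitted line
p_band <- ggplot(data, aes(x = x.plot, y = y.plot)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, level = 0.99)
p_band
```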
Controlling the Regression Formula
While geom_smooth(method = "lm") uses the formula y ~ x by default, you can specify it explicitly with the formula argument, which also silences the message ggplot2 prints about the formula it chose. In this formula, y and x refer to the plot's y and x aesthetics (the dependent and independent variables), not to column names in your data.
ggplot(data, aes(x = x.plot, y = y.plot)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
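In most cases y ~ x is sufficient, but you can also specify other formulas in x and y, such as the polynomial y ~ poly(x, 2). A multiple regression with additional predictors cannot be expressed this way, because the formula only sees the x and y aesthetics. For that, fit the model yourself, compute the predictions, and add them to the data frame as a new column to plot.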
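As a rough sketch of both cases: a polynomial term can go directly into formula, since it still involves only x and y, whereas a model with a second predictor is fitted with lm() first and its predictions are drawn with geom_line(). The data frame multi_data and the grouping column grp are made up for illustration.

```r
library(ggplot2)

set.seed(123)

# Hypothetical data with a second predictor (grp) that shifts the response
multi_data <- data.frame(
  x = rep(seq(1, 5), 10),
  grp = rep(c("a", "b"), each = 25)
)
multi_data$y <- 1 + 0.5 * multi_data$x +
  ifelse(multi_data$grp == "b", 1, 0) + rnorm(50, sd = 0.3)

# Case 1: a quadratic fit uses only x and y, so geom_smooth() can fit it
p_poly <- ggplot(multi_data, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE)

# Case 2: fit the multiple regression yourself and store the predictions
fit <- lm(y ~ x + grp, data = multi_data)
multi_data$pred <- predict(fit)

p_multi <- ggplot(multi_data, aes(x = x, y = y, color = grp)) +
  geom_point() +
  geom_line(aes(y = pred))  # one fitted line per group, parallel by construction
p_multi
```

Mapping color to grp also groups the lines, so each level of grp gets its own fitted line.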
Using geom_abline() for Custom Lines
If you have already calculated the slope and intercept of a linear model (e.g., using lm()), you can use geom_abline() to add a line with those specific parameters. This is useful when you want to plot a line that isn’t necessarily based on the data in the plot itself. Unlike geom_smooth(), the line drawn by geom_abline() spans the full width of the plot rather than only the range of the data.
# Fit a linear model
model <- lm(y.plot ~ x.plot, data = data)
# Extract slope and intercept
slope <- coef(model)[2]
intercept <- coef(model)[1]
ggplot(data, aes(x = x.plot, y = y.plot)) +
  geom_point() +
  geom_abline(intercept = intercept, slope = slope, color = "blue")
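Because geom_abline() accepts any intercept and slope, it also works for reference lines that do not come from a fit. A sketch, reusing the sample data: y.plot was simulated with rnorm(50), whose mean is 0 regardless of x.plot, so a horizontal line at 0 shows the true relationship next to the estimated one. The dashed styling and the names coefs and p_ref are arbitrary choices.

```r
library(ggplot2)

set.seed(123)
data <- data.frame(
  x.plot = rep(seq(1, 5), 10),
  y.plot = rnorm(50)
)

model <- lm(y.plot ~ x.plot, data = data)
coefs <- coef(model)  # named vector: "(Intercept)" and "x.plot"

p_ref <- ggplot(data, aes(x = x.plot, y = y.plot)) +
  geom_point() +
  # Estimated line from the fitted model
  geom_abline(intercept = coefs[["(Intercept)"]], slope = coefs[["x.plot"]],
              color = "blue") +
  # Reference line: the data were simulated with mean 0 and no dependence on x
  geom_abline(intercept = 0, slope = 0, linetype = "dashed")
p_ref
```

Extracting coefficients by name rather than by position avoids mixing up intercept and slope if the model formula changes.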
Adding Statistical Details
Beyond the regression line, you can add further statistical information to your plot, such as the correlation coefficient (r) and the equation of the regression line. The ggpubr package provides convenient functions for this:
library(ggpubr)
ggplot(data, aes(x = x.plot, y = y.plot)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  stat_cor(label.x = 3, label.y = 1.5, size = 4) + # Add correlation coefficient
  stat_regline_equation(label.x = 3, label.y = 1.2, size = 4) # Add regression equation
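This code adds the correlation coefficient (together with its p-value, which stat_cor() prints by default) and the equation of the regression line to the plot, making it more informative. Adjust the label.x and label.y arguments, which are given in data coordinates, to position the labels as desired.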
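If you would rather not depend on ggpubr, similar labels can be assembled by hand. The sketch below computes Pearson's r with cor() and the coefficients with lm(), then places the text with annotate(). The label format and positions are arbitrary choices and do not reproduce ggpubr's exact output.

```r
library(ggplot2)

set.seed(123)
data <- data.frame(
  x.plot = rep(seq(1, 5), 10),
  y.plot = rnorm(50)
)

# Compute the statistics that the labels report
r <- cor(data$x.plot, data$y.plot)  # Pearson correlation
coefs <- coef(lm(y.plot ~ x.plot, data = data))

# Build the label text; %+.2f prints the slope with an explicit sign
r_label <- sprintf("r = %.2f", r)
eq_label <- sprintf("y = %.2f %+.2f x",
                    coefs[["(Intercept)"]], coefs[["x.plot"]])

p_stats <- ggplot(data, aes(x = x.plot, y = y.plot)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  annotate("text", x = 3, y = 1.5, label = r_label, hjust = 0) +
  annotate("text", x = 3, y = 1.2, label = eq_label, hjust = 0)
p_stats
```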