Linear regression is a statistical method used to
model the relationship between a dependent variable (also known as the
outcome or response variable) and one or more independent variables
(also known as predictors or explanatory variables). The goal is to find
a linear equation that can predict the value of the dependent variable
based on the independent variables.
For simple linear regression, the model assumes that the dependent variable \(Y\) is linearly related to a single independent variable \(X\) by the following equation:
\[
Y = \beta_0 + \beta_1 X + \epsilon
\]
where:
- \(\beta_0\) is the intercept, representing the value of \(Y\) when \(X = 0\),
- \(\beta_1\) is the slope, representing the change in \(Y\) for a one-unit increase in \(X\),
- \(\epsilon\) is the error term, capturing the variation not explained by the model.
For multiple linear regression, more than one
independent variable is used: \[
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \epsilon
\]
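The coefficients in these equations are usually estimated by ordinary least squares, which chooses the values that minimize the sum of squared residuals. As a minimal sketch, the closed-form solution \(\hat{\beta} = (X^\top X)^{-1} X^\top y\) can be compared with what `lm()` returns. The toy vectors `x1`, `x2`, and `y` below are made up for illustration and are not part of the tutorial's dataset.

```r
# Hypothetical toy data, chosen only for illustration
x1 <- c(1, 2, 3, 4, 5, 6)
x2 <- c(2, 1, 4, 3, 6, 5)
y  <- c(3.1, 3.9, 7.2, 7.8, 11.1, 11.9)

# Design matrix: a column of 1s for the intercept, then the predictors
X <- cbind(Intercept = 1, x1 = x1, x2 = x2)

# Solve the normal equations (X'X) beta = X'y
beta_hat <- solve(crossprod(X), crossprod(X, y))

# lm() fits the same model (it uses a QR decomposition internally)
fit <- lm(y ~ x1 + x2)

print(drop(beta_hat))
print(coef(fit))
```

Both printed vectors should agree up to floating-point error. In practice `lm()` is preferable, because solving the normal equations directly can be numerically unstable when predictors are strongly correlated.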
Advantages:
- Interpretability: Linear regression models are easy to interpret. Each coefficient represents how much the dependent variable is expected to change for a one-unit change in the corresponding independent variable, holding the other variables constant.
- Simplicity: Linear regression is simple and can be implemented easily using statistical software or tools.
- Speed: It’s computationally efficient and works well for small to medium-sized datasets.
Disadvantages:
- Linearity assumption: Linear regression assumes a linear relationship between the variables. If the true relationship is non-linear, the model won’t perform well.
- Sensitivity to outliers: The presence of outliers can significantly affect the estimates of the regression coefficients.
- Multicollinearity: In multiple linear regression, if the independent variables are highly correlated (multicollinear), the coefficient estimates become unstable and their standard errors inflate.
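The sensitivity to outliers can be seen in a small experiment. The sketch below uses made-up data that lie exactly on the line \(y = 2x + 1\), then replaces one observation with an extreme value and refits the model. The variable names and the value 60 are illustrative assumptions.

```r
# Hypothetical data lying exactly on the line y = 2x + 1
x <- 1:10
y <- 2 * x + 1

# Same data, but with the last observation replaced by an extreme value
y_outlier <- y
y_outlier[10] <- 60

# Fit both models and pull out the slope estimates
slope_clean   <- unname(coef(lm(y ~ x))["x"])
slope_outlier <- unname(coef(lm(y_outlier ~ x))["x"])

print(slope_clean)
print(slope_outlier)
```

In this toy case a single altered point roughly doubles the estimated slope, which is why it is worth inspecting the data for outliers before trusting the coefficients.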
Applications:
- Predictive modeling: Used in various fields to predict outcomes (e.g., predicting house prices based on square footage, number of bedrooms, etc.).
- Trend analysis: To examine the relationship between variables over time.
- Financial modeling: To predict stock prices or sales.
Pros:
- Easy to use and understand.
- Can handle both continuous and categorical predictors (categorical predictors must first be encoded, for example as dummy variables).
- Provides insights into relationships between variables.
Cons:
- Not suitable for non-linear relationships without transformations.
- Standard inference (t-tests, p-values, confidence intervals) assumes that residuals (errors) are normally distributed, which matters most in small samples.
- Prone to overfitting when too many predictors are used relative to the number of observations.
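To illustrate how a categorical predictor can enter the model, here is a minimal sketch. The data and the names `hours`, `group`, and `scores` are hypothetical. When a predictor is stored as a factor, `lm()` builds the 0/1 dummy columns itself, and `model.matrix()` shows the result.

```r
# Hypothetical data: one numeric and one two-level categorical predictor
hours  <- c(2, 4, 6, 8, 10, 12)
group  <- factor(c("online", "campus", "online", "campus", "online", "campus"))
scores <- c(55, 60, 63, 70, 72, 79)

# lm() converts the factor into dummy (0/1) columns automatically
fit <- lm(scores ~ hours + group)

# Inspect the design matrix that lm() built behind the scenes
print(model.matrix(fit))
print(coef(fit))
```

The first factor level in alphabetical order (`campus` here) becomes the reference category, so the `grouponline` coefficient is the expected difference between the two groups at the same number of hours.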
Linear Regression Example in R
We will perform simple linear regression using a
dataset of students’ study hours and their corresponding exam scores to
predict the exam score based on study hours.
Step 1: Create the Data
We will create a dataset with two variables: study hours (the independent variable) and test scores (the dependent variable).
# Create the data: study hours and test scores (fixed values, so no random seed is needed)
study_hours <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20) # Independent variable
test_scores <- c(50, 52, 58, 62, 66, 70, 74, 78, 82, 90) # Dependent variable
# Combine into a data frame
data <- data.frame(study_hours, test_scores)
head(data)
Step 2: Fit the Linear Regression Model
We will fit a simple linear regression model where we predict the
test scores based on the study hours.
# Fit the linear regression model
model <- lm(test_scores ~ study_hours, data = data)
# Display the summary of the model
summary(model)
Call:
lm(formula = test_scores ~ study_hours, data = data)
Residuals:
    Min      1Q  Median      3Q     Max
-1.2606 -0.8818 -0.2000  0.4818  2.4364
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 44.53333    0.83750   53.17 1.73e-11 ***
study_hours  2.15152    0.06749   31.88 1.02e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.226 on 8 degrees of freedom
Multiple R-squared: 0.9922, Adjusted R-squared: 0.9912
F-statistic: 1016 on 1 and 8 DF, p-value: 1.021e-09
Output (simplified):
Call:
lm(formula = test_scores ~ study_hours, data = data)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 44.53333    0.83750   53.17 1.73e-11 ***
study_hours  2.15152    0.06749   31.88 1.02e-09 ***
Interpretation:
- Intercept (Estimate = 44.53): When study hours are 0, the expected test score is about 44.5. Because the data only cover 2 to 20 study hours, this value is an extrapolation.
- Slope (Estimate = 2.15): For every additional hour of study, the test score is expected to increase by about 2.15 points.
- The p-values for both the intercept and the slope are extremely small (p < 0.001), indicating that both are statistically significant.
- The multiple R-squared of 0.9922 means the model explains about 99% of the variation in test scores in this small dataset.
The regression equation can be written as: \[
\text{Test Score} = 44.53 + 2.15 \times \text{Study Hours}
\]
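The estimates that `summary()` reports can be reproduced by hand, which is a useful check on the interpretation. The sketch below recomputes the slope and intercept from the usual simple-regression formulas (slope = covariance / variance of the predictor) and then asks for confidence intervals with `confint()`. The data are repeated so the block runs on its own.

```r
# Same data as in Step 1
study_hours <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
test_scores <- c(50, 52, 58, 62, 66, 70, 74, 78, 82, 90)
model <- lm(test_scores ~ study_hours)

# Slope and intercept from the closed-form formulas for simple regression
slope     <- cov(study_hours, test_scores) / var(study_hours)
intercept <- mean(test_scores) - slope * mean(study_hours)

print(c(intercept = intercept, slope = slope))
print(coef(model))     # should match the hand-computed values
print(confint(model))  # 95% confidence intervals for both coefficients
```

If the hand-computed values and `coef(model)` disagree, the most likely cause is that the model was fitted to different data than expected.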
Step 3: Make Predictions
We can use the model to predict test scores based on a new set of
study hours.
# Predict test scores for new study hours
new_study_hours <- data.frame(study_hours = c(5, 10, 15))
predictions <- predict(model, newdata = new_study_hours)
# Show the predictions
predictions
Output:
       1        2        3
55.29091 66.04848 76.80606
Interpretation:
For students who studied 5, 10, and 15 hours, the predicted test scores are approximately 55.3, 66.0, and 76.8, respectively.
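Point predictions say nothing about their uncertainty. As a sketch of one way to quantify it, `predict()` for `lm` objects accepts `interval = "prediction"`, which returns lower and upper bounds for a single new observation at each value of the predictor. The object name `pred_int` is an arbitrary choice for this example.

```r
# Same data and model as above
study_hours <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
test_scores <- c(50, 52, 58, 62, 66, 70, 74, 78, 82, 90)
model <- lm(test_scores ~ study_hours)

new_study_hours <- data.frame(study_hours = c(5, 10, 15))

# interval = "prediction" adds bounds for an individual new observation
pred_int <- predict(model, newdata = new_study_hours,
                    interval = "prediction", level = 0.95)

print(pred_int)  # columns: fit (point prediction), lwr, upr
```

Using `interval = "confidence"` instead gives the narrower interval for the mean response rather than for an individual student.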
Step 4: Visualize the Relationship
It’s useful to visualize the regression line along with the data
points to better understand the relationship between the variables.
# Plot the data
plot(data$study_hours, data$test_scores,
     main = "Linear Regression: Study Hours vs Test Scores",
     xlab = "Study Hours", ylab = "Test Scores",
     pch = 19, col = "blue")
# Add the regression line
abline(model, col = "red")
[Figure: scatterplot of study hours vs. test scores with the fitted regression line]
Interpretation:
The scatterplot shows the individual data points (study hours and
test scores), while the red line represents the fitted regression line.
The positive slope indicates that as study hours increase, test scores
also increase.
Multiple Linear Regression Example in R
Let’s extend the example to multiple linear
regression, where we predict test scores based on both
study hours and sleep hours.
Step 1: Create the Data
We will generate another variable, sleep hours,
which represents the number of hours of sleep the student had before the
exam.
# Add sleep hours data
sleep_hours <- c(8, 7, 6, 7, 8, 9, 6, 8, 7, 9)
# Combine the new data into a data frame
data_mult <- data.frame(study_hours, sleep_hours, test_scores)
head(data_mult)
Step 2: Fit the Multiple Linear Regression Model
We will fit a multiple linear regression model using both study hours
and sleep hours as predictors.
# Fit the multiple linear regression model
model_mult <- lm(test_scores ~ study_hours + sleep_hours, data = data_mult)
# Display the summary of the model
summary(model_mult)
Call:
lm(formula = test_scores ~ study_hours + sleep_hours, data = data_mult)
Residuals:
     Min       1Q   Median       3Q      Max
-1.07531 -0.97928 -0.09798  0.64383  1.95995
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 41.34610    2.79188  14.809 1.53e-06 ***
study_hours  2.12783    0.06869  30.977 9.43e-09 ***
sleep_hours  0.45970    0.38509   1.194    0.271
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.195 on 7 degrees of freedom
Multiple R-squared: 0.9935, Adjusted R-squared: 0.9917
F-statistic: 535.9 on 2 and 7 DF, p-value: 2.201e-08
Output (simplified):
Call:
lm(formula = test_scores ~ study_hours + sleep_hours, data = data_mult)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 41.34610    2.79188  14.809 1.53e-06 ***
study_hours  2.12783    0.06869  30.977 9.43e-09 ***
sleep_hours  0.45970    0.38509   1.194    0.271
Interpretation:
- Intercept (Estimate = 41.35): When both study hours and sleep hours are zero, the expected test score is 41.35. No student in the data has zero hours of either, so this value is an extrapolation.
- Study hours (Estimate = 2.13): For each additional hour of study, test scores are expected to increase by 2.13 points, holding sleep hours constant.
- Sleep hours (Estimate = 0.46): For each additional hour of sleep, test scores are expected to increase by 0.46 points, holding study hours constant. However, this effect is not statistically significant (p-value = 0.27), so these data provide no clear evidence that sleep hours matter once study hours are accounted for.
The regression equation can be written as: \[
\text{Test Score} = 41.35 + 2.13 \times \text{Study Hours} + 0.46 \times
\text{Sleep Hours}
\]
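One way to judge whether the extra predictor is worth keeping is to compare the two nested models directly. The sketch below first checks how correlated the two predictors are, then uses `anova()` to run an F-test of the simple model against the multiple model. The name `comparison` is chosen only for this example.

```r
# Same data as in the two examples above
study_hours <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
sleep_hours <- c(8, 7, 6, 7, 8, 9, 6, 8, 7, 9)
test_scores <- c(50, 52, 58, 62, 66, 70, 74, 78, 82, 90)

model      <- lm(test_scores ~ study_hours)
model_mult <- lm(test_scores ~ study_hours + sleep_hours)

# Correlation between the predictors (a quick multicollinearity check)
print(cor(study_hours, sleep_hours))

# F-test: does adding sleep_hours significantly reduce the residual error?
comparison <- anova(model, model_mult)
print(comparison)
```

Because only one predictor is added, the p-value of this F-test equals the p-value of the `sleep_hours` t-test in the model summary, so both lead to the same conclusion.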
Assumptions of Linear Regression:
- Linearity: The relationship between the independent and dependent variables should be linear.
- Independence: The observations should be independent of each other.
- Homoscedasticity: The variance of the residuals (errors) should be constant across all levels of the independent variables.
- Normality: The residuals should be normally distributed.
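These assumptions can be checked informally from the residuals of a fitted model. The following is a minimal sketch using the study-hours data and base R only. The choice of diagnostics (a residuals-vs-fitted plot, a normal Q-Q plot, and a Shapiro-Wilk test) is one common option, not the only one.

```r
# Same data and model as in the simple regression example
study_hours <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
test_scores <- c(50, 52, 58, 62, 66, 70, 74, 78, 82, 90)
model <- lm(test_scores ~ study_hours)

res <- resid(model)

# Linearity and homoscedasticity: look for no pattern and roughly even spread
plot(fitted(model), res,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0, lty = 2)

# Normality: points should fall close to the reference line
qqnorm(res)
qqline(res)

# Shapiro-Wilk test: a small p-value suggests non-normal residuals
sw <- shapiro.test(res)
print(sw)
```

With only 10 observations these checks have little power, so they are better read as a rough screen than as proof that the assumptions hold. Independence usually has to be argued from how the data were collected rather than tested from the residuals alone.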
Summary:
Linear regression is a widely used method for predicting a dependent
variable based on one
