Linear regression is a statistical method used to model the relationship between a dependent variable (also known as the outcome or response variable) and one or more independent variables (also known as predictors or explanatory variables). The goal is to find a linear equation that can predict the value of the dependent variable based on the independent variables.

For simple linear regression, the model assumes that the dependent variable \(Y\) is linearly related to a single independent variable \(X\) by the following equation:

\[ Y = \beta_0 + \beta_1 X + \epsilon \]

Where:

  • \(\beta_0\) is the intercept, representing the value of \(Y\) when \(X = 0\),

  • \(\beta_1\) is the slope, representing the change in \(Y\) for a one-unit increase in \(X\),

  • \(\epsilon\) is the error term, capturing the variation not explained by the model.
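The coefficients are usually estimated by ordinary least squares (OLS), which is also what R's lm() function uses. As a sketch, for observations \((x_i, y_i)\), \(i = 1, \dots, m\) (the index \(m\), the hats, and the bars for sample means are notation introduced here for illustration), the estimates minimize the sum of squared residuals:

\[ (\hat\beta_0, \hat\beta_1) = \arg\min_{\beta_0, \beta_1} \sum_{i=1}^{m} (y_i - \beta_0 - \beta_1 x_i)^2 \]

For a single predictor this has the closed-form solution:

\[ \hat\beta_1 = \frac{\sum_{i=1}^{m} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{m} (x_i - \bar{x})^2}, \qquad \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x} \]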

For multiple linear regression, more than one independent variable is used:

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \epsilon \]
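In matrix form, the least-squares estimate for the multiple regression model can be written compactly. The notation here is an assumption made for illustration: \(\mathbf{y}\) is the vector of \(m\) observed responses and \(\mathbf{X}\) is the \(m \times (n + 1)\) matrix whose first column is all ones and whose remaining columns hold the \(n\) predictors. Provided \(\mathbf{X}^\top \mathbf{X}\) is invertible:

\[ \hat{\boldsymbol\beta} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} \]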


Linear Regression Example in R

We will perform simple linear regression on a dataset of students’ study hours and their corresponding test scores, predicting the test score from study hours.

Step 1: Create the Data

We will create a dataset where we have two variables:

  • Study hours: Number of hours a student studied.

  • Test score: The student’s score on a test.

# Create the data: study hours and test scores
# (the values are hard-coded, so no random seed is needed)
study_hours <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)  # Independent variable
test_scores <- c(50, 52, 58, 62, 66, 70, 74, 78, 82, 90)  # Dependent variable

# Combine into a data frame
data <- data.frame(study_hours, test_scores)
head(data)

Step 2: Fit the Linear Regression Model

We will fit a simple linear regression model where we predict the test scores based on the study hours.

# Fit the linear regression model
model <- lm(test_scores ~ study_hours, data = data)

# Display the summary of the model
summary(model)

Call:
lm(formula = test_scores ~ study_hours, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.2606 -0.8818 -0.2000  0.4818  2.4364 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 44.53333    0.83750   53.17 1.73e-11 ***
study_hours  2.15152    0.06749   31.88 1.02e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.226 on 8 degrees of freedom
Multiple R-squared:  0.9922,    Adjusted R-squared:  0.9912 
F-statistic:  1016 on 1 and 8 DF,  p-value: 1.021e-09

Output (simplified):

Call:
lm(formula = test_scores ~ study_hours, data = data)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 44.53333    0.83750   53.17 1.73e-11 ***
study_hours  2.15152    0.06749   31.88 1.02e-09 ***

Interpretation:

  • Intercept (Estimate = 44.53): When study hours are 0, the expected test score is 44.53. Since 0 hours lies outside the observed range (2 to 20 hours), this value is an extrapolation.

  • Slope (Estimate = 2.15): For every additional hour of study, the test score is expected to increase by about 2.15 points.

  • The p-values for both the intercept and slope are extremely small (p < 0.001), indicating that both are statistically significant.

The regression equation can be written as: \[ \text{Test Score} = 44.53 + 2.15 \times \text{Study Hours} \]
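To see where the estimates reported by summary(model) come from, the slope and intercept can be recomputed by hand from the least-squares formulas (sample covariance divided by sample variance for the slope). The following is a minimal sketch; the names `slope`, `intercept`, `hand`, and `fit` are introduced here for illustration only.

```r
# Same data as in Step 1
study_hours <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
test_scores <- c(50, 52, 58, 62, 66, 70, 74, 78, 82, 90)

# Closed-form least-squares estimates for a single predictor:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
slope <- cov(study_hours, test_scores) / var(study_hours)
intercept <- mean(test_scores) - slope * mean(study_hours)
hand <- c(intercept, slope)
hand

# Refit with lm() and check that both approaches agree
fit <- lm(test_scores ~ study_hours)
all.equal(unname(coef(fit)), hand)

# 95% confidence intervals for the intercept and slope
confint(fit, level = 0.95)
```

The confidence intervals from confint() give a range of plausible values for each coefficient, which is often more informative than the p-value alone.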

Step 3: Make Predictions

We can use the model to predict test scores based on a new set of study hours.

# Predict test scores for new study hours
new_study_hours <- data.frame(study_hours = c(5, 10, 15))
predictions <- predict(model, newdata = new_study_hours)

# Show the predictions
predictions
       1        2        3 
55.29091 66.04848 76.80606 

Output:

       1        2        3 
55.29091 66.04848 76.80606 

Interpretation:

For students who studied 5, 10, and 15 hours, the predicted test scores are approximately 55.3, 66.0, and 76.8, respectively.
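Point predictions say nothing about their uncertainty. As a hedged sketch, predict() for lm objects also accepts an `interval` argument: "confidence" gives an interval for the mean score at a given number of study hours, while "prediction" gives a wider interval for an individual student's score. The names `pred_int` and `conf_int` are introduced here for illustration.

```r
# Rebuild the data and model from Steps 1 and 2 so this block runs on its own
study_hours <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
test_scores <- c(50, 52, 58, 62, 66, 70, 74, 78, 82, 90)
data <- data.frame(study_hours, test_scores)
model <- lm(test_scores ~ study_hours, data = data)

new_study_hours <- data.frame(study_hours = c(5, 10, 15))

# 95% confidence interval for the mean test score at each value
conf_int <- predict(model, newdata = new_study_hours,
                    interval = "confidence", level = 0.95)

# 95% prediction interval for a single new student's test score
pred_int <- predict(model, newdata = new_study_hours,
                    interval = "prediction", level = 0.95)

# Each result is a matrix with columns fit, lwr, and upr
conf_int
pred_int
```

With only ten observations, both intervals rest heavily on the model assumptions, so they are best read as rough guides.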


Step 4: Visualize the Relationship

It’s useful to visualize the regression line along with the data points to better understand the relationship between the variables.

# Plot the data
plot(data$study_hours, data$test_scores, 
     main = "Linear Regression: Study Hours vs Test Scores",
     xlab = "Study Hours", ylab = "Test Scores", 
     pch = 19, col = "blue")

# Add the regression line
abline(model, col = "red")

Interpretation:

The scatterplot shows the individual data points (study hours and test scores), while the red line represents the fitted regression line. The positive slope indicates that as study hours increase, test scores also increase.


Multiple Linear Regression Example in R

Let’s extend the example to multiple linear regression, where we predict test scores based on both study hours and sleep hours.

Step 1: Create the Data

We will generate another variable, sleep hours, which represents the number of hours of sleep the student had before the exam.

# Add sleep hours data
sleep_hours <- c(8, 7, 6, 7, 8, 9, 6, 8, 7, 9)

# Combine the new data into a data frame
data_mult <- data.frame(study_hours, sleep_hours, test_scores)
head(data_mult)

Step 2: Fit the Multiple Linear Regression Model

We will fit a multiple linear regression model using both study hours and sleep hours as predictors.

# Fit the multiple linear regression model
model_mult <- lm(test_scores ~ study_hours + sleep_hours, data = data_mult)

# Display the summary of the model
summary(model_mult)

Call:
lm(formula = test_scores ~ study_hours + sleep_hours, data = data_mult)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.07531 -0.97928 -0.09798  0.64383  1.95995 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 41.34610    2.79188  14.809 1.53e-06 ***
study_hours  2.12783    0.06869  30.977 9.43e-09 ***
sleep_hours  0.45970    0.38509   1.194    0.271    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.195 on 7 degrees of freedom
Multiple R-squared:  0.9935,    Adjusted R-squared:  0.9917 
F-statistic: 535.9 on 2 and 7 DF,  p-value: 2.201e-08

Output (simplified):

Call:
lm(formula = test_scores ~ study_hours + sleep_hours, data = data_mult)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 41.34610    2.79188  14.809 1.53e-06 ***
study_hours  2.12783    0.06869  30.977 9.43e-09 ***
sleep_hours  0.45970    0.38509   1.194    0.271    

Interpretation:

  • Intercept (Estimate = 41.35): When both study hours and sleep hours are zero, the expected test score is 41.35. This is an extrapolation well outside the observed data.

  • Study hours (Estimate = 2.13): For each additional hour of study, test scores are expected to increase by about 2.13 points, holding sleep hours constant.

  • Sleep hours (Estimate = 0.46): For each additional hour of sleep, test scores are estimated to increase by about 0.46 points, holding study hours constant. However, this effect is not statistically significant (p-value = 0.27), so the data do not provide evidence that sleep hours improve the prediction.

The regression equation can be written as: \[ \text{Test Score} = 41.35 + 2.13 \times \text{Study Hours} + 0.46 \times \text{Sleep Hours} \]
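One way to judge whether adding sleep hours is worthwhile is to compare the two nested models directly. The sketch below uses anova() for an F-test and also compares adjusted R-squared; the names `model_simple` and `comparison` are introduced here for illustration.

```r
# Rebuild the data so this block runs on its own
study_hours <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
sleep_hours <- c(8, 7, 6, 7, 8, 9, 6, 8, 7, 9)
test_scores <- c(50, 52, 58, 62, 66, 70, 74, 78, 82, 90)
data_mult <- data.frame(study_hours, sleep_hours, test_scores)

# Fit the smaller and the larger model on the same data
model_simple <- lm(test_scores ~ study_hours, data = data_mult)
model_mult <- lm(test_scores ~ study_hours + sleep_hours, data = data_mult)

# F-test for the nested models: does sleep_hours reduce the
# residual sum of squares by more than chance would explain?
comparison <- anova(model_simple, model_mult)
comparison

# Adjusted R-squared penalizes the extra predictor
c(simple = summary(model_simple)$adj.r.squared,
  multiple = summary(model_mult)$adj.r.squared)
```

Because only one predictor is added, the p-value of this F-test matches the p-value of the t-test for sleep_hours in the model summary.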


Assumptions of Linear Regression:

  1. Linearity: The relationship between the independent and dependent variables should be linear.

  2. Independence: The observations should be independent of each other.

  3. Homoscedasticity: The variance of the residuals (errors) should be constant across all levels of the independent variables.

  4. Normality: The residuals should be normally distributed.
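These assumptions can be checked informally from the residuals of a fitted model. The following is a minimal sketch using base R only, applied to the simple model from the first example; the names `res` and `fitted_vals` are introduced here for illustration, and with only ten observations any formal test has little power.

```r
# Rebuild the data and model so this block runs on its own
study_hours <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
test_scores <- c(50, 52, 58, 62, 66, 70, 74, 78, 82, 90)
data <- data.frame(study_hours, test_scores)
model <- lm(test_scores ~ study_hours, data = data)

res <- residuals(model)
fitted_vals <- fitted(model)

# Linearity and homoscedasticity: residuals vs fitted values should
# scatter around zero with no clear curve or funnel shape
plot(fitted_vals, res,
     main = "Residuals vs Fitted",
     xlab = "Fitted values", ylab = "Residuals", pch = 19)
abline(h = 0, lty = 2)

# Normality: points in the Q-Q plot should lie close to the line
qqnorm(res)
qqline(res)

# Shapiro-Wilk test of normality on the residuals
shapiro.test(res)
```

Independence usually cannot be verified from a plot alone; it depends mainly on how the data were collected.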


Summary:

Linear regression is a widely used method for predicting a dependent variable based on one or more independent variables.

