---
title: "Linear Regression"
output: html_notebook
---


**Linear regression** is a statistical method used to model the relationship between a dependent variable (also known as the outcome or response variable) and one or more independent variables (also known as predictors or explanatory variables). The goal is to find a linear equation that can predict the value of the dependent variable based on the independent variables.

For **simple linear regression**, the model assumes that the dependent variable \( Y \) is linearly related to a single independent variable \( X \) by the following equation:
\[
Y = \beta_0 + \beta_1 X + \epsilon
\]
Where:

- \( \beta_0 \) is the **intercept**, representing the value of \( Y \) when \( X = 0 \),
- \( \beta_1 \) is the **slope**, representing the change in \( Y \) for a one-unit increase in \( X \),
- \( \epsilon \) is the error term, capturing the variation not explained by the model.

For **multiple linear regression**, more than one independent variable is used:
\[
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \epsilon
\]
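
The equations above define the model but not how its coefficients are estimated. As a brief sketch, assuming the ordinary least squares criterion (the one R's `lm()` minimizes), the simple-regression estimates have a closed form. The symbols \( m \) (number of observations), \( \bar{x} \), and \( \bar{y} \) (sample means) are notation introduced here for illustration:
\[
\hat{\beta}_1 = \frac{\sum_{i=1}^{m} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{m} (x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
\]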

#### **Advantages**:

- **Interpretability**: Linear regression models are easy to interpret. The coefficients represent how much the dependent variable changes with each unit change in the independent variables.

- **Simplicity**: Linear regression is simple and can be implemented easily using statistical software or tools.

- **Speed**: It’s computationally efficient and works well for small to medium-sized datasets.

#### **Disadvantages**:

- **Linearity assumption**: Linear regression assumes a linear relationship between the variables. If the true relationship is non-linear, the model will fit poorly and its predictions will be systematically off.

- **Sensitivity to outliers**: The presence of outliers can significantly affect the estimates of the regression coefficients.

- **Multicollinearity**: In multiple linear regression, highly correlated independent variables (multicollinearity) inflate the standard errors of the coefficients, which makes the individual estimates unstable and hard to interpret.
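
One common way to quantify multicollinearity is the variance inflation factor (VIF). The chunk below is a minimal sketch that computes it by hand in base R, using the study and sleep hours from the examples later in this notebook. The helper name `vif_manual` is hypothetical and not part of any package.

```{r}
# Predictor values copied from the examples later in this notebook
study_hours <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
sleep_hours <- c(8, 7, 6, 7, 8, 9, 6, 8, 7, 9)
predictors <- data.frame(study_hours, sleep_hours)

# Hypothetical helper: VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from
# regressing predictor j on all the other predictors
vif_manual <- function(df) {
  sapply(names(df), function(v) {
    others <- setdiff(names(df), v)
    r2 <- summary(lm(reformulate(others, response = v), data = df))$r.squared
    1 / (1 - r2)
  })
}

vif_values <- vif_manual(predictors)
vif_values  # both values are about 1.09 for these two predictors
```

Values near 1 suggest little collinearity. A common rule of thumb treats values above roughly 5 to 10 as a warning sign.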

#### **Applications**:

- **Predictive modeling**: Used in various fields to predict outcomes (e.g., predicting house prices based on square footage, number of bedrooms, etc.).

- **Trend analysis**: To examine the relationship between variables over time.

- **Financial modeling**: To predict stock prices or sales.

#### **Pros**:
- Easy to use and understand.
- Can handle both continuous and categorical predictors (with some modifications).
- Provides insights into relationships between variables.
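
The "modifications" needed for categorical predictors usually amount to indicator (dummy) coding, which `lm()` applies automatically to factors. The following is a small sketch on made-up data; the data frame `section_data` and its values are hypothetical and only illustrate the mechanics.

```{r}
# Hypothetical data: scores for students in two made-up course sections
section_data <- data.frame(
  test_score  = c(55, 60, 66, 58, 65, 72),
  study_hours = c(2, 4, 6, 2, 4, 6),
  section     = factor(c("A", "A", "A", "B", "B", "B"))
)

# lm() expands the factor into a 0/1 indicator column automatically
fit_cat <- lm(test_score ~ study_hours + section, data = section_data)

# The design matrix contains the indicator "sectionB" (section A is the baseline)
colnames(model.matrix(fit_cat))
coef(fit_cat)
```

The `sectionB` coefficient is the estimated difference in score between section B and the baseline section A at the same number of study hours.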

#### **Cons**:
- Not suitable for non-linear relationships without transformations.
- Relies on normally distributed residuals (errors) for exact p-values and confidence intervals, especially in small samples.
- Prone to overfitting when many predictors are used relative to the number of observations.
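
As an illustration of handling a non-linear relationship through a transformation, the sketch below adds a squared term with `I(x^2)`. The model stays linear in its coefficients, so `lm()` still applies. The vectors `x` and `y` are hypothetical data made up for this example.

```{r}
# Hypothetical curved data: roughly quadratic in x, with small fixed deviations
x <- 1:10
y <- 2 + 0.5 * x^2 + c(0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.4, -0.3, 0.1, -0.1)

fit_line <- lm(y ~ x)           # straight line only
fit_quad <- lm(y ~ x + I(x^2))  # adds a squared term; still linear in the coefficients

# Compare adjusted R-squared, which penalizes the extra term
c(line = summary(fit_line)$adj.r.squared,
  quad = summary(fit_quad)$adj.r.squared)
```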

---

### **Linear Regression Example in R**

We will perform **simple linear regression** using a dataset of students’ study hours and their corresponding exam scores to predict the exam score based on study hours.

#### **Step 1: Create the Data**

We will create a dataset where we have two variables:

- **Study hours**: Number of hours a student studied.

- **Test score**: The student’s score on a test.

```{r}
# Create the data: study hours and test scores (fixed values, so no random seed is needed)
study_hours <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)  # Independent variable
test_scores <- c(50, 52, 58, 62, 66, 70, 74, 78, 82, 90)  # Dependent variable

# Combine into a data frame
data <- data.frame(study_hours, test_scores)
head(data)
```


#### **Step 2: Fit the Linear Regression Model**

We will fit a simple linear regression model where we predict the test scores based on the study hours.

```{r}
# Fit the linear regression model
model <- lm(test_scores ~ study_hours, data = data)

# Display the summary of the model
summary(model)
```


#### **Output** (simplified):
```
Call:
lm(formula = test_scores ~ study_hours, data = data)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 44.53333    0.83750   53.17 1.73e-11 ***
study_hours  2.15152    0.06749   31.88 1.02e-09 ***

Residual standard error: 1.226 on 8 degrees of freedom
Multiple R-squared:  0.9922,    Adjusted R-squared:  0.9912
```

#### **Interpretation**:

- **Intercept (Estimate = 44.53)**: When study hours are 0, the expected test score is 44.53. The data only cover 2 to 20 study hours, so this value is an extrapolation.

- **Slope (Estimate = 2.15)**: For every additional hour of study, the test score is expected to increase by about 2.15 points.

- The p-value for both the intercept and slope is extremely small (p < 0.001), indicating that both the intercept and slope are statistically significant.

- The multiple R-squared of 0.9922 means the model explains about 99% of the variance in test scores.

The regression equation can be written as:
\[
\text{Test Score} = 44.53 + 2.15 \times \text{Study Hours}
\]
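
To see where the fitted coefficients come from, the chunk below is a small sketch that recomputes them with the standard closed-form least squares formulas and compares the result with `lm()`. The names `fit_check`, `slope_manual`, and `intercept_manual` are introduced here for illustration only.

```{r}
# Re-create the data from Step 1 so this chunk runs on its own
study_hours <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
test_scores <- c(50, 52, 58, 62, 66, 70, 74, 78, 82, 90)
fit_check <- lm(test_scores ~ study_hours)

# Closed-form least squares estimates for one predictor
x_centered <- study_hours - mean(study_hours)
y_centered <- test_scores - mean(test_scores)
slope_manual <- sum(x_centered * y_centered) / sum(x_centered^2)
intercept_manual <- mean(test_scores) - slope_manual * mean(study_hours)

# Both rows should agree up to floating-point rounding
rbind(manual = c(intercept_manual, slope_manual), lm = coef(fit_check))

# Share of the variance in test scores explained by the model
summary(fit_check)$r.squared
```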

#### **Step 3: Make Predictions**

We can use the model to predict test scores based on a new set of study hours.

```{r}
# Predict test scores for new study hours
new_study_hours <- data.frame(study_hours = c(5, 10, 15))
predictions <- predict(model, newdata = new_study_hours)

# Show the predictions
predictions
```


#### **Output**:
```
       1        2        3 
55.29091 66.04848 76.80606 
```

#### **Interpretation**:
For students who studied 5, 10, and 15 hours, the predicted test scores are about 55.29, 66.05, and 76.81, respectively.
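
Point predictions say nothing about their uncertainty. As a hedged sketch, `predict()` can also return prediction intervals through its `interval` argument. The object names `fit_pred`, `new_hours`, and `pred_int` are chosen here for illustration.

```{r}
# Re-create the data and model so this chunk runs on its own
study_hours <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
test_scores <- c(50, 52, 58, 62, 66, 70, 74, 78, 82, 90)
fit_pred <- lm(test_scores ~ study_hours)

new_hours <- data.frame(study_hours = c(5, 10, 15))

# interval = "prediction" adds lower and upper bounds for an individual new score
pred_int <- predict(fit_pred, newdata = new_hours,
                    interval = "prediction", level = 0.95)
pred_int  # columns: fit (point prediction), lwr, upr
```

Using `interval = "confidence"` instead gives the narrower interval for the mean score at each value of study hours.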

---

#### **Step 4: Visualize the Relationship**

It’s useful to visualize the regression line along with the data points to better understand the relationship between the variables.

```{r}
# Plot the data
plot(data$study_hours, data$test_scores, 
     main = "Linear Regression: Study Hours vs Test Scores",
     xlab = "Study Hours", ylab = "Test Scores", 
     pch = 19, col = "blue")

# Add the regression line
abline(model, col = "red")
```


#### **Interpretation**:
The scatterplot shows the individual data points (study hours and test scores), while the red line represents the fitted regression line. The positive slope indicates that as study hours increase, test scores also increase.
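
The plot can also show how uncertain the fitted line is. The chunk below is one possible sketch that overlays a 95% confidence band for the mean response using base graphics. The names `fit_plot`, `grid`, and `band` are introduced here for illustration.

```{r}
# Re-create the data and model so this chunk runs on its own
study_hours <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
test_scores <- c(50, 52, 58, 62, 66, 70, 74, 78, 82, 90)
fit_plot <- lm(test_scores ~ study_hours)

# Evaluate the confidence interval on a fine grid of study hours
grid <- data.frame(study_hours = seq(2, 20, length.out = 50))
band <- predict(fit_plot, newdata = grid, interval = "confidence")

plot(study_hours, test_scores,
     xlab = "Study Hours", ylab = "Test Scores",
     pch = 19, col = "blue")
abline(fit_plot, col = "red")

# Dashed lines mark the lower and upper limits of the band
lines(grid$study_hours, band[, "lwr"], lty = 2)
lines(grid$study_hours, band[, "upr"], lty = 2)
```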

---

### **Multiple Linear Regression Example in R**

Let’s extend the example to **multiple linear regression**, where we predict test scores based on both **study hours** and **sleep hours**.

#### **Step 1: Create the Data**

We will generate another variable, **sleep hours**, which represents the number of hours of sleep the student had before the exam.

```{r}
# Add sleep hours data
sleep_hours <- c(8, 7, 6, 7, 8, 9, 6, 8, 7, 9)

# Combine the new data into a data frame
data_mult <- data.frame(study_hours, sleep_hours, test_scores)
head(data_mult)
```


#### **Step 2: Fit the Multiple Linear Regression Model**

We will fit a multiple linear regression model using both study hours and sleep hours as predictors.

```{r}
# Fit the multiple linear regression model
model_mult <- lm(test_scores ~ study_hours + sleep_hours, data = data_mult)

# Display the summary of the model
summary(model_mult)
```


#### **Output** (simplified):
```
Call:
lm(formula = test_scores ~ study_hours + sleep_hours, data = data_mult)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 41.34610    2.79188  14.809 1.53e-06 ***
study_hours  2.12783    0.06869  30.977 9.43e-09 ***
sleep_hours  0.45970    0.38509   1.194    0.271    

Residual standard error: 1.195 on 7 degrees of freedom
Multiple R-squared:  0.9935,    Adjusted R-squared:  0.9917
```

#### **Interpretation**:

- **Intercept (Estimate = 41.35)**: When both study hours and sleep hours are zero, the expected test score is 41.35. Neither predictor is near zero in the data, so this value is an extrapolation.

- **Study hours (Estimate = 2.13)**: For each additional hour of study, test scores are expected to increase by about 2.13 points, holding sleep hours constant.

- **Sleep hours (Estimate = 0.46)**: For each additional hour of sleep, test scores are expected to increase by about 0.46 points, holding study hours constant. However, sleep hours are not statistically significant (p-value = 0.27).

The regression equation can be written as:
\[
\text{Test Score} = 41.35 + 2.13 \times \text{Study Hours} + 0.46 \times \text{Sleep Hours}
\]
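
A complementary way to judge whether sleep hours add anything beyond study hours is to compare the two nested models directly. The sketch below uses `anova()` for that comparison. The object names `fit_simple`, `fit_full`, and `comparison` are chosen here for illustration.

```{r}
# Re-create the data so this chunk runs on its own
study_hours <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
sleep_hours <- c(8, 7, 6, 7, 8, 9, 6, 8, 7, 9)
test_scores <- c(50, 52, 58, 62, 66, 70, 74, 78, 82, 90)

fit_simple <- lm(test_scores ~ study_hours)
fit_full   <- lm(test_scores ~ study_hours + sleep_hours)

# F-test: does the larger model reduce the residual sum of squares
# by more than would be expected by chance?
comparison <- anova(fit_simple, fit_full)
comparison
```

When only one predictor is added, this F-test gives the same p-value as the t-test for that coefficient in `summary(fit_full)`.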

---

### **Assumptions of Linear Regression**:

1. **Linearity**: The relationship between the independent and dependent variables should be linear.

2. **Independence**: The observations should be independent of each other.

3. **Homoscedasticity**: The variance of the residuals (errors) should be constant across all levels of the independent variables.

4. **Normality**: The residuals should be normally distributed.
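
These assumptions can be checked informally from the fitted model. The chunk below is a minimal sketch of common base R diagnostics, assuming the simple study-hours model from earlier. The name `fit_diag` is introduced here for illustration.

```{r}
# Re-create the data and model so this chunk runs on its own
study_hours <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
test_scores <- c(50, 52, 58, 62, 66, 70, 74, 78, 82, 90)
fit_diag <- lm(test_scores ~ study_hours)

# Residuals vs fitted (linearity, constant variance) and normal Q-Q plot (normality)
op <- par(mfrow = c(1, 2))
plot(fit_diag, which = 1:2)
par(op)

# Shapiro-Wilk test of residual normality (low power with only 10 points)
shapiro.test(residuals(fit_diag))

# Cook's distance highlights observations with a large influence on the fit
round(cooks.distance(fit_diag), 2)
```

With only ten observations these checks are rough guides rather than firm evidence either way.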

---

### **Summary**:
Linear regression is a widely used method for predicting a dependent variable based on one or more independent variables.