Logistic regression is a statistical method used to model the relationship between a binary dependent variable and one or more independent variables. The dependent variable in logistic regression is binary, meaning it has only two possible outcomes (e.g., “success/failure,” “yes/no,” or “0/1”).

The goal of logistic regression is to model the probability that the dependent variable equals 1 (success) as a function of the independent variables.

The logistic function (also known as the sigmoid function) maps any real-valued number to a value between 0 and 1, representing a probability:

\[ P(Y = 1 | X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n)}} \]

Where:

  • \(P(Y = 1 | X)\) is the probability that the dependent variable \(Y\) equals 1 (success),
  • \(\beta_0\) is the intercept,
  • \(\beta_1, \dots, \beta_n\) are the coefficients for the independent variables \(X_1, \dots, X_n\).

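Advantages:

  • Interpretable coefficients: The coefficients of logistic regression represent the change in the log-odds of the outcome for a one-unit change in the predictor.
  • Probabilistic output: Logistic regression provides a probability of the outcome, which is useful in many real-world applications.
  • Less strict assumptions: Unlike linear regression, logistic regression does not assume a linear relationship between the predictors and the outcome itself; linearity is assumed only on the log-odds scale.

Disadvantages:

  • Linearity in log-odds: Logistic regression assumes a linear relationship between the independent variables and the log-odds of the outcome.
  • Not suitable for continuous outcomes: Logistic regression can only handle binary or categorical outcomes.
  • Multicollinearity: Highly correlated independent variables can cause issues with coefficient estimation.

Applications:

  • Medical research: Predicting the probability of a disease (e.g., cancer diagnosis based on age and lifestyle factors).
  • Marketing: Predicting whether a customer will purchase a product based on their demographics and browsing history.
  • Credit scoring: Estimating the likelihood of a borrower defaulting on a loan based on financial history and other factors.

Pros:

  • Easy to implement and interpret.
  • Works well when the relationship between independent variables and the binary outcome is approximately linear in the log-odds.
  • Can handle both continuous and categorical predictor variables.

Cons:

  • Assumes a linear relationship between the independent variables and the log-odds of the outcome.
  • Sensitive to outliers and multicollinearity.
  • Can become unstable with small datasets or if the classes are highly imbalanced.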


Logistic Regression Example in R

In this example, we’ll use logistic regression to model whether students pass or fail an exam based on the number of hours they studied. The outcome is binary (pass = 1, fail = 0).

Step 1: Create the Data

We’ll create a small dataset by hand with the number of study hours and the pass/fail outcome for each student. The values are typed in directly, so the set.seed() call below has no effect on them.

# Create the data: study hours and pass/fail outcome
set.seed(123)
study_hours <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)  # Independent variable
pass_fail <- c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1)  # Dependent variable (0 = fail, 1 = pass)

# Combine into a data frame
data <- data.frame(study_hours, pass_fail)
head(data)

The dataset contains 10 observations with study hours as the independent variable and pass/fail as the binary dependent variable (0 = fail, 1 = pass).

Step 2: Fit the Logistic Regression Model

We can use the glm() function in R to fit the logistic regression model. The argument family = binomial(link = "logit") specifies a binomial response with the logit link function, which is what makes the model a logistic regression.

# Fit the logistic regression model
logit_model <- glm(pass_fail ~ study_hours, data = data, family = binomial(link = "logit"))
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Display the summary of the model
summary(logit_model)

Call:
glm(formula = pass_fail ~ study_hours, family = binomial(link = "logit"), 
    data = data)

Coefficients:
             Estimate Std. Error z value
(Intercept)   -200.37  265802.23  -0.001
study_hours     22.26   29255.79   0.001
            Pr(>|z|)
(Intercept)    0.999
study_hours    0.999

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1.3460e+01  on 9  degrees of freedom
Residual deviance: 8.6042e-10  on 8  degrees of freedom
AIC: 4

Number of Fisher Scoring iterations: 25

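The warnings and the enormous standard errors above arise because the data are perfectly separated: every student who studied 8 hours or fewer failed and every student who studied 10 hours or more passed. In that situation the maximum likelihood estimates do not exist, so the reported coefficients are not meaningful. The simplified output below illustrates what a fit to data without such separation looks like; it is not produced by the ten observations above.

Output (simplified, illustrative):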

Call:
glm(formula = pass_fail ~ study_hours, family = binomial(link = "logit"), 
    data = data)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -6.1856     2.8436  -2.175   0.0296 *  
study_hours   0.6414     0.2920   2.196   0.0281 *  

Interpretation:

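  • Intercept (Estimate = -6.19): The log-odds of passing when study hours are 0 is -6.19, which corresponds to a probability of about 0.002.

  • Study hours (Estimate = 0.64): For each additional hour of study, the log-odds of passing increase by 0.64, which multiplies the odds of passing by exp(0.64) ≈ 1.90.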

Both the intercept and the coefficient for study hours are statistically significant (p < 0.05).

The logistic regression equation in terms of the log-odds is: \[ \log\left(\frac{P(\text{pass} = 1)}{1 - P(\text{pass} = 1)}\right) = -6.19 + 0.64 \times \text{Study Hours} \]

Step 3: Predicting Probabilities

We can use the model to predict the probability of passing for a new set of study hours.

# Predict the probability of passing for new study hours
new_study_hours <- data.frame(study_hours = c(5, 10, 15))
predicted_prob <- predict(logit_model, newdata = new_study_hours, type = "response")

# Show the predicted probabilities
predicted_prob
           1            2            3 
2.220446e-16 1.000000e+00 1.000000e+00 

Interpretation:

The actual output above shows probabilities of essentially 0 or 1 because the data are perfectly separated. Substituting the illustrative coefficients (-6.1856 and 0.6414) into the logistic function instead gives:

  • For a student who studies 5 hours, the probability of passing is approximately 0.05 (5%).
  • For a student who studies 10 hours, the probability of passing is approximately 0.56 (56%).
  • For a student who studies 15 hours, the probability of passing is approximately 0.97 (97%).

Step 4: Visualize the Logistic Curve

It’s helpful to visualize the relationship between the study hours and the predicted probability of passing.

# Plot the data
plot(data$study_hours, data$pass_fail, 
     main = "Logistic Regression: Study Hours vs Pass/Fail",
     xlab = "Study Hours", ylab = "Pass/Fail", 
     pch = 19, col = "blue")

# Add the logistic regression curve
curve(predict(logit_model, newdata = data.frame(study_hours = x), type = "response"), 
      add = TRUE, col = "red")

Interpretation:

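The scatterplot shows the individual data points, with 0 indicating failure and 1 indicating passing. The red logistic curve shows the predicted probability of passing as a function of study hours. In general the curve starts low (close to 0) and increases as study hours increase, eventually leveling off near 1. For the perfectly separated data used here, the fitted curve is nearly a step function that jumps from 0 to 1 between 8 and 10 study hours; with overlapping outcomes the rise would be gradual.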


Logistic Regression Assumptions:

  1. Binary outcome: The dependent variable must be binary.

  2. Linearity in log-odds: The independent variables should be linearly related to the log-odds of the outcome.

  3. Independence: The observations must be independent of each other.

  4. No multicollinearity: The independent variables should not be highly correlated with each other.


Multiple Logistic Regression Example in R

Let’s extend this to multiple logistic regression, where we predict the pass/fail outcome based on both study hours and sleep hours.

Step 1: Create the Data

We’ll generate a new variable, sleep hours, representing the number of hours the student slept before the exam.

# Add sleep hours data
sleep_hours <- c(6, 7, 8, 7, 8, 9, 7, 9, 8, 10)

# Combine into a data frame
data_mult <- data.frame(study_hours, sleep_hours, pass_fail)
head(data_mult)

Step 2: Fit the Multiple Logistic Regression Model

We’ll fit a multiple logistic regression model using both study hours and sleep hours as predictors.

# Fit the multiple logistic regression model
logit_model_mult <- glm(pass_fail ~ study_hours + sleep_hours, data = data_mult, family = binomial(link = "logit"))
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Display the summary of the model
summary(logit_model_mult)

Call:
glm(formula = pass_fail ~ study_hours + sleep_hours, family = binomial(link = "logit"), 
    data = data_mult)

Coefficients:
             Estimate Std. Error z value
(Intercept)   -279.39  750813.03       0
study_hours     11.59   24056.98       0
sleep_hours     23.29   97858.23       0
            Pr(>|z|)
(Intercept)        1
study_hours        1
sleep_hours        1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1.3460e+01  on 9  degrees of freedom
Residual deviance: 4.7863e-10  on 7  degrees of freedom
AIC: 6

Number of Fisher Scoring iterations: 25

Output (simplified):

Call:
glm(formula = pass_fail ~ study_hours + sleep_hours, family = binomial(link = "logit"), 
    data = data_mult)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept

---
title: "Logistic Regression"
output: html_notebook
---

**Logistic regression** is a statistical method used to model the relationship between a binary dependent variable and one or more independent variables. The dependent variable in logistic regression is binary, meaning it has only two possible outcomes (e.g., "success/failure," "yes/no," or "0/1"). 

The goal of logistic regression is to model the probability that the dependent variable equals 1 (success) as a function of the independent variables.

The **logistic function** (also known as the sigmoid function) maps any real-valued number to a value between 0 and 1, representing a probability:
\[
P(Y = 1 | X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n)}}
\]
Where:

- \( P(Y = 1 | X) \) is the probability that the dependent variable \( Y \) equals 1 (success),
- \( \beta_0 \) is the intercept,
- \( \beta_1, \dots, \beta_n \) are the coefficients for the independent variables \( X_1, \dots, X_n \).
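
The formula above can be evaluated directly in R. The following is a minimal sketch, not part of the original analysis: the helper name `logistic` and the coefficient values `beta0` and `beta1` are hypothetical and chosen only for illustration. It also checks the hand-written function against base R's `plogis()`, which computes the same quantity.

```{r}
# Logistic (sigmoid) function written out explicitly
logistic <- function(z) 1 / (1 + exp(-z))

# Hypothetical coefficients for a single predictor (illustration only)
beta0 <- -6
beta1 <- 0.6
x <- c(0, 5, 10, 15, 20)

# Linear predictor (log-odds) and the corresponding probabilities
z <- beta0 + beta1 * x
p <- logistic(z)
p

# Base R's plogis() implements the same function
all.equal(p, plogis(z))
```

Whatever the value of the linear predictor, the result stays strictly between 0 and 1, and a linear predictor of 0 maps to a probability of 0.5.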

#### **Advantages**:

- **Interpretable coefficients**: The coefficients of logistic regression represent the change in the log-odds of the outcome for a one-unit change in the predictor.

- **Probabilistic output**: Logistic regression provides a probability of the outcome, which is useful in many real-world applications.

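- **Less strict assumptions**: Unlike linear regression, logistic regression does not assume a linear relationship between the predictors and the outcome itself, normally distributed errors, or constant error variance; linearity is assumed only on the log-odds scale.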

#### **Disadvantages**:

- **Linearity in log-odds**: Logistic regression assumes a linear relationship between the independent variables and the log-odds of the outcome.

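- **Not suitable for continuous outcomes**: Binary logistic regression handles only two-category outcomes. Outcomes with more categories need extensions such as multinomial or ordinal logistic regression, and continuous outcomes need a different model.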

- **Multicollinearity**: Highly correlated independent variables can cause issues with coefficient estimation.

#### **Applications**:

- **Medical research**: Predicting the probability of a disease (e.g., cancer diagnosis based on age and lifestyle factors).

- **Marketing**: Predicting whether a customer will purchase a product based on their demographics and browsing history.

- **Credit scoring**: Estimating the likelihood of a borrower defaulting on a loan based on financial history and other factors.

#### **Pros**:
- Easy to implement and interpret.
- Works well when the relationship between independent variables and the binary outcome is approximately linear in the log-odds.
- Can handle both continuous and categorical predictor variables.

#### **Cons**:
- Assumes a linear relationship between the independent variables and the log-odds of the outcome.
- Sensitive to outliers and multicollinearity.
- Can become unstable with small datasets or if the classes are highly imbalanced.

---

### **Logistic Regression Example in R**

In this example, we’ll use logistic regression to model whether students pass or fail an exam based on the number of hours they studied. The outcome is binary (pass = 1, fail = 0).

#### **Step 1: Create the Data**

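We’ll create a small dataset by hand with the number of study hours and the pass/fail outcome for each student. The values are typed in directly, so the `set.seed()` call below has no effect on them.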

```{r}
# Create the data: study hours and pass/fail outcome
set.seed(123)
study_hours <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)  # Independent variable
pass_fail <- c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1)  # Dependent variable (0 = fail, 1 = pass)

# Combine into a data frame
data <- data.frame(study_hours, pass_fail)
head(data)
```


The dataset contains 10 observations with study hours as the independent variable and pass/fail as the binary dependent variable (0 = fail, 1 = pass).
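
Before fitting, it can be worth checking whether the outcome is perfectly separated by the predictor, because maximum likelihood estimates for logistic regression do not exist in that case. The sketch below is one simple way to check this for a single predictor; the names `max_fail`, `min_pass`, and `separated` are introduced here for illustration.

```{r}
study_hours <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
pass_fail <- c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1)

# Largest study time among failures and smallest among passes
max_fail <- max(study_hours[pass_fail == 0])
min_pass <- min(study_hours[pass_fail == 1])

# If the two groups do not overlap, a single cutoff classifies everyone correctly
separated <- max_fail < min_pass
c(max_fail = max_fail, min_pass = min_pass)
separated
```

For these ten observations the groups do not overlap, so any cutoff between 8 and 10 hours separates passes from failures completely.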

#### **Step 2: Fit the Logistic Regression Model**

We can use the **`glm()`** function in R to fit the logistic regression model. The argument **`family = binomial(link = "logit")`** specifies a binomial response with the logit link function, which is what makes the model a logistic regression.

```{r}
# Fit the logistic regression model
logit_model <- glm(pass_fail ~ study_hours, data = data, family = binomial(link = "logit"))

# Display the summary of the model
summary(logit_model)
```
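
A fitted `glm` object carries a few diagnostics that are worth inspecting before interpreting coefficients. The sketch below refits the same model under separate, hypothetical names (`sep_data`, `sep_fit`) so it does not overwrite the objects above. The `suppressWarnings()` call is there only to keep the sketch quiet; the warnings it hides are themselves informative.

```{r}
sep_data <- data.frame(
  study_hours = c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20),
  pass_fail   = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1)
)

# Same model as above, refit under a different name
sep_fit <- suppressWarnings(
  glm(pass_fail ~ study_hours, data = sep_data, family = binomial(link = "logit"))
)

sep_fit$converged                       # did the fitting algorithm converge?
sep_fit$iter                            # number of Fisher scoring iterations used
coef(summary(sep_fit))[, "Std. Error"]  # very large values suggest separation
```

A non-converged fit, an iteration count at the default maximum, and standard errors that dwarf the estimates are typical symptoms of perfectly separated data.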


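The ten observations above are perfectly separated: every student who studied 8 hours or fewer failed and every student who studied 10 hours or more passed. Running `glm()` on them therefore produces convergence warnings and enormous, unstable estimates rather than the values shown below. The simplified output below illustrates what a fit to data without such separation looks like.

#### **Output** (simplified, illustrative):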
```
Call:
glm(formula = pass_fail ~ study_hours, family = binomial(link = "logit"), 
    data = data)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -6.1856     2.8436  -2.175   0.0296 *  
study_hours   0.6414     0.2920   2.196   0.0281 *  
```

#### **Interpretation**:

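- **Intercept (Estimate = -6.19)**: The log-odds of passing when study hours are 0 is -6.19, which corresponds to a probability of about 0.002.

- **Study hours (Estimate = 0.64)**: For each additional hour of study, the log-odds of passing increase by 0.64, which multiplies the odds of passing by exp(0.64) ≈ 1.90.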

Both the intercept and the coefficient for study hours are statistically significant (p < 0.05).

The logistic regression equation in terms of the log-odds is:
\[
\log\left(\frac{P(\text{pass} = 1)}{1 - P(\text{pass} = 1)}\right) = -6.19 + 0.64 \times \text{Study Hours}
\]
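
The log-odds equation can be turned into more familiar quantities with a few lines of arithmetic. The sketch below copies the two coefficients from the simplified output above into hypothetical variables `b0` and `b1`; it does not use the fitted model object.

```{r}
# Coefficients copied from the simplified output above
b0 <- -6.1856
b1 <- 0.6414

# Odds ratio: multiplicative change in the odds of passing per extra study hour
odds_ratio <- exp(b1)

# Study hours at which the predicted probability equals 0.5 (log-odds = 0)
hours_at_half <- -b0 / b1

# Probabilities implied by the equation at a few study times
hours <- c(5, 10, 15)
prob <- plogis(b0 + b1 * hours)

round(c(odds_ratio = odds_ratio, hours_at_half = hours_at_half), 2)
round(prob, 3)
```

Reporting the odds ratio and the 50% point alongside the raw coefficients often makes the model easier to explain than log-odds alone.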

#### **Step 3: Predicting Probabilities**

We can use the model to predict the probability of passing for a new set of study hours.

```{r}
# Predict the probability of passing for new study hours
new_study_hours <- data.frame(study_hours = c(5, 10, 15))
predicted_prob <- predict(logit_model, newdata = new_study_hours, type = "response")

# Show the predicted probabilities
predicted_prob
```


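#### **Interpretation**:

Because the ten observations are perfectly separated, the fitted model itself predicts probabilities of essentially 0 for 5 hours and 1 for 10 and 15 hours. Substituting the illustrative coefficients (-6.1856 and 0.6414) into the logistic function instead gives:

- For a student who studies 5 hours, the probability of passing is approximately 0.05 (5%).
- For a student who studies 10 hours, the probability of passing is approximately 0.56 (56%).
- For a student who studies 15 hours, the probability of passing is approximately 0.97 (97%).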

#### **Step 4: Visualize the Logistic Curve**

It’s helpful to visualize the relationship between the study hours and the predicted probability of passing.

```{r}
# Plot the data
plot(data$study_hours, data$pass_fail, 
     main = "Logistic Regression: Study Hours vs Pass/Fail",
     xlab = "Study Hours", ylab = "Pass/Fail", 
     pch = 19, col = "blue")

# Add the logistic regression curve
curve(predict(logit_model, newdata = data.frame(study_hours = x), type = "response"), 
      add = TRUE, col = "red")
```


#### **Interpretation**:
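The scatterplot shows the individual data points, with 0 indicating failure and 1 indicating passing. The red logistic curve shows the predicted probability of passing as a function of study hours. In general the curve starts low (close to 0) and increases as study hours increase, eventually leveling off near 1. For the perfectly separated data used here, the fitted curve is nearly a step function that jumps from 0 to 1 between 8 and 10 study hours; with overlapping outcomes the rise would be gradual.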

---

### **Logistic Regression Assumptions**:

1. **Binary outcome**: The dependent variable must be binary.

2. **Linearity in log-odds**: The independent variables should be linearly related to the log-odds of the outcome.

3. **Independence**: The observations must be independent of each other.

4. **No multicollinearity**: The independent variables should not be highly correlated with each other.

---

### **Multiple Logistic Regression Example in R**

Let’s extend this to **multiple logistic regression**, where we predict the pass/fail outcome based on both **study hours** and **sleep hours**.

#### **Step 1: Create the Data**

We’ll generate a new variable, **sleep hours**, representing the number of hours the student slept before the exam.

```{r}
# Add sleep hours data
sleep_hours <- c(6, 7, 8, 7, 8, 9, 7, 9, 8, 10)

# Combine into a data frame
data_mult <- data.frame(study_hours, sleep_hours, pass_fail)
head(data_mult)
```
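
Since multicollinearity is one of the assumptions listed earlier, it is reasonable to look at how strongly the two predictors are related before fitting. The sketch below is a minimal check; the names `r` and `vif` are introduced here. It relies on the standard result that, with exactly two predictors, the variance inflation factor equals 1 / (1 - r^2).

```{r}
study_hours <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
sleep_hours <- c(6, 7, 8, 7, 8, 9, 7, 9, 8, 10)

# Pearson correlation between the two predictors
r <- cor(study_hours, sleep_hours)

# With exactly two predictors, the variance inflation factor is 1 / (1 - r^2)
vif <- 1 / (1 - r^2)

round(c(correlation = r, vif = vif), 2)
```

The two predictors are clearly positively correlated in this toy dataset, which is one more reason to read the individual coefficients of the multiple model with caution.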


#### **Step 2: Fit the Multiple Logistic Regression Model**

We’ll fit a multiple logistic regression model using both study hours and sleep hours as predictors.

```{r}
# Fit the multiple logistic regression model
logit_model_mult <- glm(pass_fail ~ study_hours + sleep_hours, data = data_mult, family = binomial(link = "logit"))

# Display the summary of the model
summary(logit_model_mult)
```


#### **Output** (simplified):
```
Call:
glm(formula = pass_fail ~ study_hours + sleep_hours, family = binomial(link = "logit"), 
    data = data_mult)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept