---
title: "Logistic Regression"
output: html_notebook
---

**Logistic regression** is a statistical method used to model the relationship between a binary dependent variable and one or more independent variables. The dependent variable in logistic regression is binary, meaning it has only two possible outcomes (e.g., "success/failure," "yes/no," or "0/1"). 

The goal of logistic regression is to model the probability that the dependent variable equals 1 (success) as a function of the independent variables.

The **logistic function** (also known as the sigmoid function) maps any real-valued number to a value between 0 and 1, representing a probability:
\[
P(Y = 1 | X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n)}}
\]
Where:
- \( P(Y = 1 | X) \) is the probability that the dependent variable \( Y \) equals 1 (success),
- \( \beta_0 \) is the intercept,
- \( \beta_1, \dots, \beta_n \) are the coefficients for the independent variables \( X_1, \dots, X_n \).
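
As a quick illustration of the formula above, the following sketch evaluates the logistic function by hand and compares it with base R's `plogis()`, which computes the same quantity. The helper name `sigmoid` and the input values are chosen here purely for illustration.

```{r}
# Hypothetical helper: the logistic (sigmoid) function written out directly
sigmoid <- function(z) 1 / (1 + exp(-z))

# A few illustrative values of the linear predictor beta_0 + beta_1 * X_1 + ...
z <- c(-5, 0, 5)

sigmoid(z)                        # values strictly between 0 and 1; sigmoid(0) is 0.5
all.equal(sigmoid(z), plogis(z))  # base R's plogis() is the same function
```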

#### **Advantages**:

- **Interpretable coefficients**: The coefficients of logistic regression represent the change in the log-odds of the outcome for a one-unit change in the predictor.

- **Probabilistic output**: Logistic regression provides a probability of the outcome, which is useful in many real-world applications.

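- **Less strict assumptions**: Unlike linear regression, logistic regression does not require normally distributed errors or constant error variance, and it does not assume a linear relationship between the predictors and the outcome itself (linearity is assumed on the log-odds scale instead).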
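
To make the log-odds interpretation of a coefficient concrete, the sketch below shows that exponentiating a coefficient gives an odds ratio that is the same wherever along the predictor it is evaluated. The coefficients `b0` and `b1` and the helper `odds_at` are hypothetical values and names chosen for illustration, not estimates from any data in this document.

```{r}
# Hypothetical coefficients, chosen only for illustration
b0 <- -2
b1 <- 0.5

# Odds of the outcome at a given predictor value x
odds_at <- function(x) {
  p <- plogis(b0 + b1 * x)
  p / (1 - p)
}

# The ratio of odds for a one-unit increase does not depend on the starting point
odds_at(4) / odds_at(3)
odds_at(9) / odds_at(8)
exp(b1)   # the odds ratio implied by the coefficient
```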

#### **Disadvantages**:

- **Linearity in log-odds**: Logistic regression assumes a linear relationship between the independent variables and the log-odds of the outcome.

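- **Not suitable for continuous outcomes**: Binary logistic regression models a two-level outcome. Categorical outcomes with more than two levels need extensions such as multinomial or ordinal logistic regression, and continuous outcomes need a different model.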

- **Multicollinearity**: Highly correlated independent variables can cause issues with coefficient estimation.

#### **Applications**:

- **Medical research**: Predicting the probability of a disease (e.g., cancer diagnosis based on age and lifestyle factors).

- **Marketing**: Predicting whether a customer will purchase a product based on their demographics and browsing history.

- **Credit scoring**: Estimating the likelihood of a borrower defaulting on a loan based on financial history and other factors.

#### **Pros**:
- Easy to implement and interpret.
- Works well when the relationship between independent variables and the binary outcome is approximately linear in the log-odds.
- Can handle both continuous and categorical predictor variables.

#### **Cons**:
- Assumes a linear relationship between the independent variables and the log-odds of the outcome.
- Sensitive to outliers and multicollinearity.
- Can become unstable with small datasets, highly imbalanced classes, or predictors that perfectly separate the two outcomes.

---

### **Logistic Regression Example in R**

In this example, we’ll use logistic regression to model whether students pass or fail an exam based on the number of hours they studied. The outcome is binary (pass = 1, fail = 0).

#### **Step 1: Create the Data**

We’ll enter a small dataset by hand, with the number of study hours and the pass/fail outcome for each student.

```{r}
# Create the data: study hours and pass/fail outcome
# (the values are entered by hand, so no random seed is needed)
study_hours <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)  # Independent variable
pass_fail <- c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1)  # Dependent variable (0 = fail, 1 = pass)

# Combine into a data frame
data <- data.frame(study_hours, pass_fail)
head(data)
```


The dataset contains 10 observations with study hours as the independent variable and pass/fail as the binary dependent variable (0 = fail, 1 = pass).

#### **Step 2: Fit the Logistic Regression Model**

We can use the **`glm()`** function in R to fit the logistic regression model. The argument **`family = binomial(link = "logit")`** specifies a binomial response with the logit link function, which is what makes this a logistic regression.

```{r}
# Fit the logistic regression model
logit_model <- glm(pass_fail ~ study_hours, data = data, family = binomial(link = "logit"))

# Display the summary of the model
summary(logit_model)
```
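
Because logit is the default link for `binomial()`, the shorter `family = binomial` should fit the same model. The sketch below checks this on a small hypothetical dataset; the `demo` data frame and its columns `hours` and `passed` are assumptions made for illustration and are not the data used in this example.

```{r}
# Hypothetical data in which passes and fails overlap across study hours
demo <- data.frame(
  hours  = 1:10,
  passed = c(0, 0, 0, 1, 0, 1, 0, 1, 1, 1)
)

fit_explicit <- glm(passed ~ hours, data = demo, family = binomial(link = "logit"))
fit_default  <- glm(passed ~ hours, data = demo, family = binomial)

binomial()$link                                   # "logit"
all.equal(coef(fit_explicit), coef(fit_default))  # same estimates either way
```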


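#### **Output** (simplified):
```
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Call:
glm(formula = pass_fail ~ study_hours, family = binomial(link = "logit"), 
    data = data)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)   -200.37  265802.23  -0.001    0.999
study_hours     22.26   29255.79   0.001    0.999

    Null deviance: 1.3460e+01  on 9  degrees of freedom
Residual deviance: 8.6042e-10  on 8  degrees of freedom

Number of Fisher Scoring iterations: 25
```

#### **Interpretation**:

- **Warnings**: The two warnings indicate complete separation. In this dataset every student who studied 8 hours or fewer failed and every student who studied 10 hours or more passed, so study hours predict the outcome perfectly. In that situation the maximum likelihood estimates are not finite, and the fitting algorithm stops at its iteration limit (25) with the coefficients still growing.

- **Intercept (Estimate = -200.37)**: Nominally, the log-odds of passing when study hours are 0.

- **Study hours (Estimate = 22.26)**: Nominally, the increase in the log-odds of passing for each additional hour of study.

Neither estimate is reliable: the standard errors (265802.23 and 29255.79) dwarf the estimates and the p-values are 0.999, so the magnitudes should not be interpreted. The residual deviance of essentially zero is a further sign of separation rather than of a well-estimated model.

The logistic regression equation in terms of the log-odds is:
\[
\log\left(\frac{P(\text{pass} = 1)}{1 - P(\text{pass} = 1)}\right) = -200.37 + 22.26 \times \text{Study Hours}
\]

The fitted log-odds are zero (a probability of 0.5) at about 9 study hours, which lies between the last failing student (8 hours) and the first passing student (10 hours).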
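
With a single numeric predictor, complete separation (all values in one outcome group lying below all values in the other, which prevents the maximum likelihood estimates from settling on finite values) can be checked directly. The following sketch copies the example data under the hypothetical names `hrs` and `outcome`, runs that check, and inspects the `converged` flag that `glm()` stores on the fitted object.

```{r}
# Copy of the example data under hypothetical names
hrs     <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
outcome <- c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1)

# Complete separation: the largest 'fail' value lies below the smallest 'pass' value
separated <- max(hrs[outcome == 0]) < min(hrs[outcome == 1])
separated   # TRUE for these data (8 < 10)

# Refit with warnings silenced and check whether the iterative fit converged
sep_fit <- suppressWarnings(glm(outcome ~ hrs, family = binomial))
sep_fit$converged
```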

#### **Step 3: Predicting Probabilities**

We can use the model to predict the probability of passing for a new set of study hours.

```{r}
# Predict the probability of passing for new study hours
new_study_hours <- data.frame(study_hours = c(5, 10, 15))
predicted_prob <- predict(logit_model, newdata = new_study_hours, type = "response")

# Show the predicted probabilities
predicted_prob
```


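#### **Output**:
```
           1            2            3 
2.220446e-16 1.000000e+00 1.000000e+00 
```

#### **Interpretation**:
- For a student who studies 5 hours, the predicted probability of passing is effectively 0.
- For a student who studies 10 hours, the predicted probability of passing is effectively 1.
- For a student who studies 15 hours, the predicted probability of passing is effectively 1.

These all-or-nothing predictions follow from the data: every student with 8 or fewer study hours failed and every student with 10 or more passed (complete separation), so the model places the switch from fail to pass between 8 and 10 hours and has no information about how gradual it is.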
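
`predict(..., type = "response")` applies the logistic function to the fitted log-odds. A minimal sketch of that equivalence follows, using a small hypothetical dataset; `demo`, `hours`, `passed`, and `demo_fit` are illustrative names, not part of this example's data.

```{r}
# Hypothetical data in which passes and fails overlap across study hours
demo <- data.frame(
  hours  = 1:10,
  passed = c(0, 0, 0, 1, 0, 1, 0, 1, 1, 1)
)
demo_fit <- glm(passed ~ hours, data = demo, family = binomial)

new_hours <- c(3, 6, 9)

# Log-odds from the fitted coefficients, then mapped to probabilities
b        <- coef(demo_fit)
log_odds <- b[["(Intercept)"]] + b[["hours"]] * new_hours
manual   <- plogis(log_odds)

# The same probabilities obtained from predict()
from_predict <- predict(demo_fit, newdata = data.frame(hours = new_hours),
                        type = "response")

all.equal(manual, unname(from_predict))
```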

#### **Step 4: Visualize the Logistic Curve**

It’s helpful to visualize the relationship between the study hours and the predicted probability of passing.

```{r}
# Plot the data
plot(data$study_hours, data$pass_fail, 
     main = "Logistic Regression: Study Hours vs Pass/Fail",
     xlab = "Study Hours", ylab = "Pass/Fail", 
     pch = 19, col = "blue")

# Add the logistic regression curve
curve(predict(logit_model, newdata = data.frame(study_hours = x), type = "response"), 
      add = TRUE, col = "red")
```


#### **Interpretation**:
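The scatterplot shows the individual data points, with 0 indicating failure and 1 indicating passing. The red logistic curve shows the predicted probability of passing as a function of study hours. Because every student with 8 or fewer study hours failed and every student with 10 or more passed, the fitted curve for this dataset is almost a step: it stays near 0 up to 8 hours and jumps to near 1 by 10 hours. With data in which passes and fails overlap, the curve would instead rise gradually in an S shape before leveling off near 1.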
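
One useful landmark on such a curve is the point where the predicted probability equals 0.5. That is where the log-odds are zero, i.e. at \(-\beta_0 / \beta_1\). The sketch below computes it for a small hypothetical dataset (`demo`, `demo_fit`, and `hours_at_half` are illustrative names) and confirms the result with `predict()`.

```{r}
# Hypothetical data in which passes and fails overlap across study hours
demo <- data.frame(
  hours  = 1:10,
  passed = c(0, 0, 0, 1, 0, 1, 0, 1, 1, 1)
)
demo_fit <- glm(passed ~ hours, data = demo, family = binomial)

# Study hours at which the fitted log-odds are zero (probability 0.5)
b <- coef(demo_fit)
hours_at_half <- -b[["(Intercept)"]] / b[["hours"]]
hours_at_half

# Check: the predicted probability at that point should be 0.5
predict(demo_fit, newdata = data.frame(hours = hours_at_half), type = "response")
```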

---

### **Logistic Regression Assumptions**:

1. **Binary outcome**: The dependent variable must be binary.

2. **Linearity in log-odds**: The independent variables should be linearly related to the log-odds of the outcome.

3. **Independence**: The observations must be independent of each other.

4. **No multicollinearity**: The independent variables should not be highly correlated with each other.
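
The multicollinearity assumption can be screened before fitting by looking at the correlation between predictors. The sketch below uses the study-hours and sleep-hours values from the multiple regression example later in this document, under the hypothetical names `study` and `sleep`. With only two predictors, the variance inflation factor reduces to 1 / (1 - r^2); with more predictors a dedicated routine would be needed, which is not shown here.

```{r}
# Predictor values copied from the multiple regression example (hypothetical names)
study <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
sleep <- c(6, 7, 8, 7, 8, 9, 7, 9, 8, 10)

# Pearson correlation between the two predictors
r <- cor(study, sleep)
r

# Variance inflation factor for the two-predictor case
vif_two_predictors <- 1 / (1 - r^2)
vif_two_predictors
```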

---

### **Multiple Logistic Regression Example in R**

Let’s extend this to **multiple logistic regression**, where we predict the pass/fail outcome based on both **study hours** and **sleep hours**.

#### **Step 1: Create the Data**

We’ll generate a new variable, **sleep hours**, representing the number of hours the student slept before the exam.

```{r}
# Add sleep hours data
sleep_hours <- c(6, 7, 8, 7, 8, 9, 7, 9, 8, 10)

# Combine into a data frame
data_mult <- data.frame(study_hours, sleep_hours, pass_fail)
head(data_mult)
```


#### **Step 2: Fit the Multiple Logistic Regression Model**

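We’ll fit a multiple logistic regression model using both study hours and sleep hours as predictors. Because study hours alone completely separate passes from fails in this dataset, `glm()` again warns that fitted probabilities numerically 0 or 1 occurred, and the resulting estimates are not reliable.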

```{r}
# Fit the multiple logistic regression model
logit_model_mult <- glm(pass_fail ~ study_hours + sleep_hours, data = data_mult, family = binomial(link = "logit"))

# Display the summary of the model
summary(logit_model_mult)
```


#### **Output** (simplified):
```
Call:
glm(formula = pass_fail ~ study_hours + sleep_hours, family = binomial(link = "logit"), 
    data = data_mult)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept