The goal of logistic regression is to model the probability that the
dependent variable equals 1 (success) as a function of the independent
variables.
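For a single predictor, this probability is usually written with the logistic (inverse-logit) function. As a sketch, with \(x\) standing for the predictor and \(\beta_0\), \(\beta_1\) for the intercept and slope (symbols introduced here for illustration):
\[
P(Y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}
\quad\Longleftrightarrow\quad
\log\left(\frac{P(Y = 1 \mid x)}{1 - P(Y = 1 \mid x)}\right) = \beta_0 + \beta_1 x
\]
The right-hand form is the log-odds (logit) scale, which is the scale on which R reports the coefficients.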
Logistic Regression Example in R
In this example, we’ll use logistic regression to model whether
students pass or fail an exam based on the number of hours they studied.
The outcome is binary (pass = 1, fail = 0).
Step 1: Create the Data
We’ll create a small dataset by hand with the number of study hours and the
pass/fail outcome for each student.
# Create the data: study hours and pass/fail outcome
set.seed(123)
study_hours <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20) # Independent variable
pass_fail <- c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1) # Dependent variable (0 = fail, 1 = pass)
# Combine into a data frame
data <- data.frame(study_hours, pass_fail)
head(data)
The dataset contains 10 observations with study hours as the
independent variable and pass/fail as the binary dependent variable (0 =
fail, 1 = pass).
Step 2: Fit the Logistic Regression Model
We can use the glm() function in R to fit the logistic regression model. The
family = binomial(link = "logit") argument specifies that we are using a
logistic regression model with the logit link function.
# Fit the logistic regression model
logit_model <- glm(pass_fail ~ study_hours, data = data, family = binomial(link = "logit"))
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Display the summary of the model
summary(logit_model)
Call:
glm(formula = pass_fail ~ study_hours, family = binomial(link = "logit"),
data = data)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -200.37  265802.23  -0.001    0.999
study_hours    22.26   29255.79   0.001    0.999
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1.3460e+01 on 9 degrees of freedom
Residual deviance: 8.6042e-10 on 8 degrees of freedom
AIC: 4
Number of Fisher Scoring iterations: 25
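The two warnings and the enormous standard errors are what one would expect when the outcome is perfectly separated by a predictor. Here every student with 8 or fewer study hours fails and every student with 10 or more passes, so the likelihood keeps improving as the slope grows and no finite maximum likelihood estimate exists. A minimal sketch of a check for this, assuming a single numeric predictor (the helper name is_separated is hypothetical, not part of R):

```r
# Data as defined in Step 1
study_hours <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
pass_fail   <- c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1)

# Hypothetical helper: TRUE if some cut point on x splits the 0s from the 1s
is_separated <- function(x, y) {
  max(x[y == 0]) < min(x[y == 1]) || max(x[y == 1]) < min(x[y == 0])
}

separated <- is_separated(study_hours, pass_fail)
print(separated)

# Cross-tabulate the outcome against a cut at 9 hours:
# the off-diagonal cells are empty when the data are separated
print(table(more_than_9_hours = study_hours > 9, pass = pass_fail))
```

When a check like this returns TRUE, the coefficient table should not be interpreted as is. Common remedies are to collect more data in which outcomes overlap or to use a penalized fit.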
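Output (simplified and illustrative; because of the perfect separation signaled by the warnings above, the actual fit does not produce these estimates. They show what a converged fit on data with overlapping outcomes might look like):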
Call:
glm(formula = pass_fail ~ study_hours, family = binomial(link = "logit"),
data = data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.1856 2.8436 -2.175 0.0296 *
study_hours 0.6414 0.2920 2.196 0.0281 *
Interpretation (of the simplified output):
- Intercept (Estimate = -6.19): The log-odds of passing when study
hours are 0 is -6.19.
- Study hours (Estimate = 0.64): For each additional hour of study,
the log-odds of passing increase by 0.64, which multiplies the odds of
passing by exp(0.64) ≈ 1.90.
- Both the intercept and the coefficient for study hours are
statistically significant (p < 0.05).
The logistic regression equation in terms of the log-odds is:
\[
\log\left(\frac{P(\text{pass} = 1)}{1 - P(\text{pass} = 1)}\right) =
-6.19 + 0.64 \times \text{Study Hours}
\]
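To turn the log-odds equation into probabilities and an odds ratio, apply the inverse logit. The sketch below uses the unrounded estimates from the simplified output (-6.1856 and 0.6414) as plain numbers rather than a fitted model; the variable names are chosen here for illustration.

```r
# Coefficients copied from the simplified output (assumed values, not re-estimated)
b0 <- -6.1856
b1 <- 0.6414

hours <- c(5, 10, 15)

# Linear predictor on the log-odds scale
log_odds <- b0 + b1 * hours

# Inverse logit: plogis(z) is 1 / (1 + exp(-z))
prob <- plogis(log_odds)

# Odds ratio for one additional hour of study
odds_ratio <- exp(b1)

print(round(data.frame(hours, log_odds, prob), 3))
print(round(odds_ratio, 2))
```

An odds ratio of about 1.9 means each additional hour roughly doubles the odds of passing, not the probability of passing.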
Step 3: Predicting Probabilities
We can use the model to predict the probability of passing for a new
set of study hours.
# Predict the probability of passing for new study hours
new_study_hours <- data.frame(study_hours = c(5, 10, 15))
predicted_prob <- predict(logit_model, newdata = new_study_hours, type = "response")
# Show the predicted probabilities
predicted_prob
1 2 3
2.220446e-16 1.000000e+00 1.000000e+00
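Output (illustrative; the values above are numerically 0 or 1 because of the perfect separation, so the probabilities below are instead computed from the simplified coefficients -6.1856 and 0.6414):
     1      2      3
0.0484 0.5569 0.9688
Interpretation:
- For a student who studies 5 hours, the probability of passing is
approximately 0.05 (5%).
- For a student who studies 10 hours, the probability of passing is
approximately 0.56 (56%).
- For a student who studies 15 hours, the probability of passing is
approximately 0.97 (97%).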
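predict() can return either the linear predictor (log-odds) or the probability, and the two are linked by the inverse logit. A minimal sketch, refitting the Step 2 model so the block runs on its own (suppressWarnings() is used here only to silence the glm.fit warnings shown in Step 2):

```r
study_hours <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
pass_fail   <- c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1)
data <- data.frame(study_hours, pass_fail)

logit_model <- suppressWarnings(
  glm(pass_fail ~ study_hours, data = data, family = binomial(link = "logit"))
)

new_study_hours <- data.frame(study_hours = c(5, 10, 15))

# type = "link" gives log-odds; type = "response" gives probabilities
eta    <- predict(logit_model, newdata = new_study_hours, type = "link")
p_resp <- predict(logit_model, newdata = new_study_hours, type = "response")

# Applying the model's inverse link to the log-odds reproduces type = "response"
p_manual <- family(logit_model)$linkinv(eta)
print(all.equal(unname(p_manual), unname(p_resp)))
```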
Step 4: Visualize the Logistic Curve
It’s helpful to visualize the relationship between the study hours
and the predicted probability of passing.
# Plot the data
plot(data$study_hours, data$pass_fail,
main = "Logistic Regression: Study Hours vs Pass/Fail",
xlab = "Study Hours", ylab = "Pass/Fail",
pch = 19, col = "blue")
# Add the logistic regression curve
curve(predict(logit_model, newdata = data.frame(study_hours = x), type = "response"),
add = TRUE, col = "red")
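Interpretation:
The scatterplot shows the individual data points, with 0 indicating
failure and 1 indicating passing. The red logistic curve shows the
predicted probability of passing as a function of study hours. The curve
starts low (close to 0) and increases as study hours increase, leveling
off near 1. Because these data are perfectly separated, the fitted curve
is almost a step that jumps from 0 to 1 between 8 and 10 hours; with
overlapping outcomes it would be a more gradual S-shape.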
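One way to summarize the curve is the number of study hours at which the predicted probability crosses 0.5. That is where the log-odds equal zero, i.e. at -intercept / slope. A sketch, refitting the Step 2 model so it runs on its own (the name hours_at_half is introduced here for illustration):

```r
study_hours <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
pass_fail   <- c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1)
data <- data.frame(study_hours, pass_fail)

# Warnings silenced only because they are the ones already shown in Step 2
logit_model <- suppressWarnings(
  glm(pass_fail ~ study_hours, data = data, family = binomial(link = "logit"))
)

# Log-odds are zero (probability 0.5) where intercept + slope * hours = 0
b <- coef(logit_model)
hours_at_half <- unname(-b["(Intercept)"] / b["study_hours"])
print(hours_at_half)
```

With the perfectly separated data this crossing point lands between the last failing student (8 hours) and the first passing one (10 hours).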
Multiple Logistic Regression Example in R
Let’s extend this to multiple logistic regression,
where we predict the pass/fail outcome based on both study
hours and sleep hours.
Step 1: Create the Data
We’ll generate a new variable, sleep hours,
representing the number of hours the student slept before the exam.
# Add sleep hours data
sleep_hours <- c(6, 7, 8, 7, 8, 9, 7, 9, 8, 10)
# Combine into a data frame
data_mult <- data.frame(study_hours, sleep_hours, pass_fail)
head(data_mult)
Step 2: Fit the Multiple Logistic Regression Model
We’ll fit a multiple logistic regression model using both study hours
and sleep hours as predictors.
# Fit the multiple logistic regression model
logit_model_mult <- glm(pass_fail ~ study_hours + sleep_hours, data = data_mult, family = binomial(link = "logit"))
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Display the summary of the model
summary(logit_model_mult)
Call:
glm(formula = pass_fail ~ study_hours + sleep_hours, family = binomial(link = "logit"),
data = data_mult)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -279.39  750813.03       0        1
study_hours    11.59   24056.98       0        1
sleep_hours    23.29   97858.23       0        1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1.3460e+01 on 9 degrees of freedom
Residual deviance: 4.7863e-10 on 7 degrees of freedom
AIC: 6
Number of Fisher Scoring iterations: 25
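The same complete separation affects this fit (note the standard errors and the 25 Fisher scoring iterations), so the estimates above are not interpretable. As a sketch of what a usable multiple logistic fit looks like, the block below simulates a larger, hypothetical dataset in which outcomes overlap. The sample size, the true coefficients, and the names n, sim, and fit_sim are assumptions made here for illustration and are not taken from the example above.

```r
set.seed(42)
n <- 200

# Hypothetical predictors
sim <- data.frame(
  study_hours = runif(n, min = 0, max = 20),
  sleep_hours = runif(n, min = 4, max = 10)
)

# Assumed true model on the log-odds scale, used only to generate outcomes
eta <- -5 + 0.4 * sim$study_hours + 0.2 * sim$sleep_hours
sim$pass_fail <- rbinom(n, size = 1, prob = plogis(eta))

fit_sim <- glm(pass_fail ~ study_hours + sleep_hours,
               data = sim, family = binomial(link = "logit"))

# Estimates on the log-odds scale, then as odds ratios
print(coef(summary(fit_sim)))
print(exp(coef(fit_sim)))
```

Each exponentiated coefficient is the multiplicative change in the odds of passing for a one-unit increase in that predictor, holding the other predictor fixed.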
Output (simplified):
```
Call:
glm(formula = pass_fail ~ study_hours + sleep_hours, family = binomial(link = "logit"),
    data = data_mult)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept
