Logistic regression is a statistical method for modeling the relationship between a binary dependent variable and one or more independent variables. A binary variable has only two possible outcomes (e.g., “success/failure,” “yes/no,” or “0/1”).

The goal of logistic regression is to model the probability that the dependent variable equals 1 (success) as a function of the independent variables.

The logistic function (also known as the sigmoid function) maps any real-valued number to a value between 0 and 1, representing a probability:

\[ P(Y = 1 | X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n)}} \]

Where:

  • \(P(Y = 1 | X)\) is the probability that the dependent variable \(Y\) equals 1 (success),
  • \(\beta_0\) is the intercept,
  • \(\beta_1, \dots, \beta_n\) are the coefficients for the independent variables \(X_1, \dots, X_n\).
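To make the formula concrete, here is a minimal sketch that evaluates the logistic function for one predictor. The coefficient values and the predictor value are hypothetical, chosen only for illustration; they do not come from any model in this text.

```r
# Logistic (sigmoid) function: maps a real-valued linear predictor to (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

# Hypothetical coefficients and predictor value, chosen only for illustration
beta0 <- -1.5
beta1 <- 0.8
x1 <- 2

linear_predictor <- beta0 + beta1 * x1   # value on the log-odds scale
p <- sigmoid(linear_predictor)           # P(Y = 1 | X)
p

# Base R's plogis() computes the same function
all.equal(p, plogis(linear_predictor))
```

In practice plogis() is usually the more convenient choice, since it is already part of base R; the hand-written sigmoid() is shown only to mirror the formula.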

Advantages:

Disadvantages:

Applications:


Logistic Regression Example in R

In this example, we’ll use logistic regression to model whether students pass or fail an exam based on the number of hours they studied. The outcome is binary (pass = 1, fail = 0).

Step 1: Create the Data

We’ll create a small dataset by hand, with the number of study hours and the pass/fail outcome for each student.

# Create the data: study hours and pass/fail outcome
set.seed(123)
study_hours <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)  # Independent variable
pass_fail <- c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1)  # Dependent variable (0 = fail, 1 = pass)

# Combine into a data frame
data <- data.frame(study_hours, pass_fail)
head(data)

The dataset contains 10 observations with study hours as the independent variable and pass/fail as the binary dependent variable (0 = fail, 1 = pass).

Step 2: Fit the Logistic Regression Model

We can use the glm() function in R to fit the logistic regression model. The argument family = binomial(link = "logit") specifies a binomial response with the logit link function, which is what makes this a logistic regression.

# Fit the logistic regression model
logit_model <- glm(pass_fail ~ study_hours, data = data, family = binomial(link = "logit"))
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Display the summary of the model
summary(logit_model)

Call:
glm(formula = pass_fail ~ study_hours, family = binomial(link = "logit"), 
    data = data)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)   -200.37  265802.23  -0.001    0.999
study_hours     22.26   29255.79   0.001    0.999

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1.3460e+01  on 9  degrees of freedom
Residual deviance: 8.6042e-10  on 8  degrees of freedom
AIC: 4

Number of Fisher Scoring iterations: 25
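
The two warnings, the very large standard errors, and the 25 Fisher scoring iterations are typical of perfect separation. The sketch below is one simple way to check for it on a single predictor, under the assumption that passing goes with more study hours; the helper names (max_fail_hours, min_pass_hours, is_separated, fit) are introduced here for illustration only.

```r
# Recreate the data from Step 1
study_hours <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
pass_fail   <- c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1)

# Perfect separation on one predictor: every failing student studied
# fewer hours than every passing student
max_fail_hours <- max(study_hours[pass_fail == 0])
min_pass_hours <- min(study_hours[pass_fail == 1])
is_separated <- max_fail_hours < min_pass_hours
is_separated

# Refit quietly and inspect the convergence information stored on the glm object
fit <- suppressWarnings(
  glm(pass_fail ~ study_hours, family = binomial(link = "logit"))
)
fit$converged   # FALSE signals that Fisher scoring did not converge
fit$iter        # number of Fisher scoring iterations used
```

When a single cutoff splits the outcomes like this, the maximum-likelihood estimates do not exist: the slope can always be made larger to improve the fit, so the reported coefficients and p-values should not be interpreted.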

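In this dataset every student with 8 or fewer study hours failed and every student with 10 or more passed, so the data are perfectly separated. That is why glm() issues the warnings above and reports huge coefficients and standard errors. The simplified output below is illustrative only: it is not what the code above produces, and it is used here to show how coefficients are read.

Output (simplified, illustrative):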

Call:
glm(formula = pass_fail ~ study_hours, family = binomial(link = "logit"), 
    data = data)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -6.1856     2.8436  -2.175   0.0296 *  
study_hours   0.6414     0.2920   2.196   0.0281 *  

Interpretation:

  • Intercept (Estimate = -6.19): The log-odds of passing when study hours are 0 is -6.19.

  • Study hours (Estimate = 0.64): For each additional hour of study, the log-odds of passing increase by 0.64.

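In the simplified output, both the intercept and the coefficient for study hours are statistically significant (p < 0.05). In the actual summary(logit_model) output above, both p-values are 0.999, so no such conclusion can be drawn from the fitted model itself.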

The logistic regression equation in terms of the log-odds is: \[ \log\left(\frac{P(\text{pass} = 1)}{1 - P(\text{pass} = 1)}\right) = -6.19 + 0.64 \times \text{Study Hours} \]
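
Log-odds are hard to read directly, so it can help to convert the slope to an odds ratio. The sketch below takes the two coefficients from the simplified output as given (they are copied in by hand, not extracted from a fitted model) and also computes the number of study hours at which the predicted probability crosses 0.5.

```r
# Coefficients copied from the simplified output (treated as given)
b0 <- -6.1856   # intercept, on the log-odds scale
b1 <-  0.6414   # change in log-odds per additional study hour

# Exponentiating a coefficient gives an odds ratio
odds_ratio <- exp(b1)
round(odds_ratio, 2)

# Study hours at which the predicted probability is 0.5 (log-odds = 0)
hours_at_half <- -b0 / b1
round(hours_at_half, 2)
```

Under these coefficients, each additional study hour multiplies the odds of passing by roughly 1.9, and the predicted probability reaches 0.5 at a little under 10 study hours.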

Step 3: Predicting Probabilities

We can use the model to predict the probability of passing for a new set of study hours.

# Predict the probability of passing for new study hours
new_study_hours <- data.frame(study_hours = c(5, 10, 15))
predicted_prob <- predict(logit_model, newdata = new_study_hours, type = "response")

# Show the predicted probabilities
predicted_prob
           1            2            3 
2.220446e-16 1.000000e+00 1.000000e+00 

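Interpretation:

The probabilities printed above come from the fitted model: essentially 0 for 5 study hours, and 1 for 10 and 15 study hours. Because the training data are perfectly separated, the model behaves like a hard cutoff between 8 and 10 hours rather than giving graded probabilities.

For comparison, plugging the coefficients from the simplified output (intercept -6.1856, slope 0.6414) into the logistic function gives graded values:

  • For a student who studies 5 hours, the probability of passing is approximately 0.05 (5%).
  • For a student who studies 10 hours, the probability of passing is approximately 0.56 (56%).
  • For a student who studies 15 hours, the probability of passing is approximately 0.97 (97%).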
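
To see how predict() relates to the logistic function, the sketch below fits a model to a hypothetical variant of the data in which passes and fails overlap, so that the fit is well defined. The outcome vector pass_fail_alt and the names alt_data and alt_model are assumptions made for this illustration and are not part of the example above.

```r
# Hypothetical outcomes with some overlap (NOT the data from Step 1),
# so that the maximum-likelihood fit exists
study_hours   <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
pass_fail_alt <- c(0, 0, 0, 1, 0, 1, 0, 1, 1, 1)
alt_data <- data.frame(study_hours, pass_fail_alt)

alt_model <- glm(pass_fail_alt ~ study_hours, data = alt_data,
                 family = binomial(link = "logit"))

new_study_hours <- data.frame(study_hours = c(5, 10, 15))

# type = "link" returns log-odds; type = "response" returns probabilities
log_odds <- predict(alt_model, newdata = new_study_hours, type = "link")
prob     <- predict(alt_model, newdata = new_study_hours, type = "response")

# Applying the logistic function to the log-odds reproduces the probabilities
all.equal(plogis(log_odds), prob)
```

The two prediction types carry the same information on different scales: type = "link" is convenient for reasoning about coefficients, while type = "response" is what most readers expect to see reported.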

Step 4: Visualize the Logistic Curve

It’s helpful to visualize the relationship between the study hours and the predicted probability of passing.

# Plot the data
plot(data$study_hours, data$pass_fail, 
     main = "Logistic Regression: Study Hours vs Pass/Fail",
     xlab = "Study Hours", ylab = "Pass/Fail", 
     pch = 19, col = "blue")

# Add the logistic regression curve
curve(predict(logit_model, newdata = data.frame(study_hours = x), type = "response"), 
      add = TRUE, col = "red")

Interpretation:

The scatterplot shows the individual data points, with 0 indicating failure and 1 indicating passing. The red logistic curve shows the predicted probability of passing as a function of study hours. The curve starts near 0 and rises to near 1 as study hours increase. Because this dataset is perfectly separated, the rise is almost a vertical step between 8 and 10 study hours; with overlapping outcomes it would be a more gradual S-shape.


Logistic Regression Assumptions:

  1. Binary outcome: The dependent variable must be binary.

  2. Linearity in log-odds: The independent variables should be linearly related to the log-odds of the outcome.

  3. Independence: The observations must be independent of each other.

  4. No multicollinearity: The independent variables should not be highly correlated with each other.
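
The linearity assumption (item 2) follows directly from the form of the model. As a short derivation, write \(\eta = \beta_0 + \beta_1 X_1 + \dots + \beta_n X_n\) for the linear predictor and \(p = P(Y = 1 | X)\); the symbols \(\eta\) and \(p\) are shorthand introduced here.

\[ p = \frac{1}{1 + e^{-\eta}}, \qquad 1 - p = \frac{e^{-\eta}}{1 + e^{-\eta}} \]

\[ \frac{p}{1 - p} = \frac{1}{e^{-\eta}} = e^{\eta} \]

\[ \log\left(\frac{p}{1 - p}\right) = \eta = \beta_0 + \beta_1 X_1 + \dots + \beta_n X_n \]

So the model is linear in the log-odds, not in the probability itself, which is why each coefficient is read as a change in log-odds per unit change in its predictor.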


Multiple Logistic Regression Example in R

Let’s extend this to multiple logistic regression, where we predict the pass/fail outcome based on both study hours and sleep hours.

Step 1: Create the Data

We’ll generate a new variable, sleep hours, representing the number of hours the student slept before the exam.

# Add sleep hours data
sleep_hours <- c(6, 7, 8, 7, 8, 9, 7, 9, 8, 10)

# Combine into a data frame
data_mult <- data.frame(study_hours, sleep_hours, pass_fail)
head(data_mult)
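
Before fitting, it is worth checking the no-multicollinearity assumption for the two predictors. The sketch below is one rough way to do that: it computes their correlation and, for the two-predictor case, a variance inflation factor from the R-squared of regressing one predictor on the other. The names r, r_squared, and vif are introduced here for illustration.

```r
# Recreate the predictors from the examples above
study_hours <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
sleep_hours <- c(6, 7, 8, 7, 8, 9, 7, 9, 8, 10)

# Pairwise correlation between the two predictors
r <- cor(study_hours, sleep_hours)
round(r, 2)

# Variance inflation factor for a two-predictor model: 1 / (1 - R^2),
# where R^2 comes from regressing one predictor on the other
r_squared <- summary(lm(sleep_hours ~ study_hours))$r.squared
vif <- 1 / (1 - r_squared)
round(vif, 2)
```

For these values the correlation is about 0.75, so the two predictors share a good deal of information. With only 10 observations, that makes the individual coefficients harder to estimate separately.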

Step 2: Fit the Multiple Logistic Regression Model

We’ll fit a multiple logistic regression model using both study hours and sleep hours as predictors.

# Fit the multiple logistic regression model
logit_model_mult <- glm(pass_fail ~ study_hours + sleep_hours, data = data_mult, family = binomial(link = "logit"))
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Display the summary of the model
summary(logit_model_mult)

Call:
glm(formula = pass_fail ~ study_hours + sleep_hours, family = binomial(link = "logit"), 
    data = data_mult)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)   -279.39  750813.03       0        1
study_hours     11.59   24056.98       0        1
sleep_hours     23.29   97858.23       0        1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1.3460e+01  on 9  degrees of freedom
Residual deviance: 4.7863e-10  on 7  degrees of freedom
AIC: 6

Number of Fisher Scoring iterations: 25

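The warning, the very large standard errors, and the p-values of 1 above are signs of perfect separation: study_hours alone splits passes from fails in this dataset, so these estimates should not be interpreted as effect sizes.

Output (simplified):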

```
Call:
glm(formula = pass_fail ~ study_hours + sleep_hours, family = binomial(link = "logit"),
    data = data_mult)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept

