20  Classification

Author: Dr. Devan Becker

Jam: Sigma Oasis by Phish

Published: 2024-07-24

20.1 Logistic Regression

Goal: Predict a 1

  • Response: 0 or 1
    • Predictions: probability of a 1?

Why Not Linear Regression?

The plot on the right shows Default versus Credit Card Balance (with jittering).

  • “Default” is 0 for No, 1 for Yes
    • Dummy encoding for the response
  • Linear model predicts negative values!
    • Also, values above 1.
  • What could the slope possibly represent?

The mean of \(y\) at each value of \(x\): for a 0/1 response, that's the probability of a 1.

The Logistic Function - A Sigmoid

If \(t\in\mathbb{R}\), then \[ \sigma(t) = \dfrac{\exp(t)}{1 + \exp(t)}\in(0,1) \] where \(\sigma(\cdot)\) is the logistic function.

Logistic Function - Now with Parameters

\[ \sigma(\beta_0 + \beta_1 t) \]

#| standalone: true
#| viewerHeight: 600
#| viewerWidth: 800

library(shiny)  # needed for fluidPage(), sliderInput(), renderPlot(), etc.

xseq <- seq(-10, 10, 0.05)
sigma <- function(t) exp(t) / (1 + exp(t))

ui <- fluidPage(
    sidebarLayout(
        sidebarPanel(
            sliderInput("b0", "Intercept B_0", -5, 5, 0, 0.1, animate = list(interval = 100)),
            sliderInput("b1", "Slope B_1", -5, 5, 1, 0.1, animate = list(interval = 100))
        ),
        mainPanel(plotOutput("plot"))
    )
)

server <- function(input, output) {
    output$plot <- renderPlot({
        # Apply the sigmoid to the linear predictor chosen by the sliders
        yseq <- sigma(input$b0 + input$b1 * xseq)
        plot(yseq ~ xseq,
            xlim = c(-10, 10),
            ylim = c(0, 1),
            ylab = bquote(sigma(beta[0] * " + " * beta[1] * t)),
            xlab = "t",
            type = "l")
    })
}

shinyApp(ui = ui, server = server)

Logistic Function - Now with Parameters Estimated from DATA

\[\begin{align*} \eta(x_i) &= \underline{x}_i^T\underline{\beta}\\ p_i &= \sigma(\eta(x_i))\\ \implies \log\left(\frac{p_i}{1-p_i}\right) &= \underline{x}_i^T\underline{\beta} \end{align*}\]

\(\eta(x_i) = -10.65 + 0.0054\cdot\text{balance}_i\)

Logistic Regression

  • The response is 0 or 1 (no or yes, don’t default or default, etc.)
  • The probability of a 1 increases according to the sigmoid function.
    • The linear predictor is \(\eta(x_i) = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \cdots\)
    • The probability of class 1 is \(P(\text{class }1 | \text{predictors}) = \sigma(\eta(x_i))\)
  • Instead of normality assumptions, we use a binomial distribution.

It’s just one step away from a linear model!
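To make that "one step" concrete, here is a minimal sketch in R. The data are simulated (all names and parameter values are illustrative, not from the slides), and the only change from lm() is the family = binomial argument.

set.seed(1)
n <- 1000
x <- runif(n, -3, 3)
beta0 <- -1; beta1 <- 2                      # true parameters (illustrative)
p <- exp(beta0 + beta1 * x) / (1 + exp(beta0 + beta1 * x))
y <- rbinom(n, size = 1, prob = p)           # 0/1 responses from the sigmoid

fit <- glm(y ~ x, family = binomial)         # the "one step" away from lm()
coef(fit)                                    # should be close to (-1, 2)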

Interpreting Slope Parameters

  • General structure: “For each one unit increase in \(x_i\), some function of \(y_i\) changes by some function of \(\beta\)”.
  • For logistic regression:
    • For each one unit increase in \(x_i\), \(\log\left(\frac{p(x_i)}{1-p(x_i)}\right)\) increases by \(\beta\).
  • The odds are \(\frac{p(x_i)}{1-p(x_i)}\).
    • “1 in 5 people with odds of 1/4 will default.”
  • \(\beta\) is the change in log odds for a one unit increase.
    • This is why \(\beta\) is called the “log odds ratio”.

Odds Ratios

Consider the one-predictor example.

\[ \eta(x_i) = \beta_0 + \beta_1x_{i1}\text{ and }\eta(x_i + 1) = \beta_0 + \beta_1x_{i1} + \beta_1\text{, }\therefore \eta(x_i + 1) - \eta(x_i) = \beta_1 \]

The log odds ratio is defined as \[ \log(OR) = \log\left( \frac{p(x_i + 1)}{1 - p(x_i + 1)} \middle/ \frac{p(x_i)}{1 - p(x_i)}\right) \] Using \(\eta(x_i) = \log\left(\frac{p(x_i)}{1-p(x_i)}\right)\), show that \(\log(OR) = \beta_1\).
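For reference, the derivation is two lines: the log of a quotient of odds is a difference of log odds, and each log odds is a linear predictor.

\[\begin{align*} \log(OR) &= \log\left(\frac{p(x_i + 1)}{1 - p(x_i + 1)}\right) - \log\left(\frac{p(x_i)}{1-p(x_i)}\right)\\ &= \eta(x_i + 1) - \eta(x_i) = \beta_1 \end{align*}\]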

The Loss Function

For all observations:

  • If \(y_i = 0\), we want \(p(x_i)\) to be as low as possible.
    • Maximize \(1 - P(Y_{i} = 1|\beta_0,\beta_1,X)\)
  • If \(y_i = 1\), we want \(p(x_i)\) to be as high as possible.
    • Maximize \(P(Y_{i} = 1|\beta_0,\beta_1,X)\)

These can be combined as: \[ \prod_{i:y_{i} = 0}\left(1 - P(Y_{i} = 1|\beta_0,\beta_1,X)\right)\prod_{i:y_i=1}P(Y_i = 1|\beta_0,\beta_1,X) \] which is equivalent to: \[ \prod_{i=1}^n(1 - P(Y_{i} = 1|\beta_0,\beta_1,X))^{1 - Y_i}P(Y_i = 1|\beta_0,\beta_1,X)^{Y_i} \]

Which is NOT just the sum of squared errors!

Unlike linear regression, there’s no closed form for \(\hat{\beta}_0\) and \(\hat\beta_1\); we need numerical methods.
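As a sketch of those numerical methods (reusing the simulated x and y from the earlier sketch; negloglik is a hypothetical helper, not a built-in), we can hand the negative log of the product above to optim():

# Negative log-likelihood: the product on this slide, logged and negated
negloglik <- function(beta, x, y) {
    eta <- beta[1] + beta[2] * x
    p <- exp(eta) / (1 + exp(eta))
    -sum(y * log(p) + (1 - y) * log(1 - p))
}

# Numerical search replaces linear regression's closed form
fit_num <- optim(c(0, 0), negloglik, x = x, y = y)
fit_num$par  # essentially coef(glm(y ~ x, family = binomial))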

Examples: Two different predictors in the Default data

\[ \eta(x_i) = -3.5 + 0.5\cdot\text{student} \]

The odds of a student defaulting are \(\exp(0.5)\approx1.65\) times as high as those of a non-student.

\[ \eta(x_i) = -10.65 + 0.005\cdot\text{balance} \]

Each extra dollar of CC balance increases odds of defaulting by a factor of 1.005.

The scale of the predictors matters.
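A sketch of both points in R, assuming the ISLR package's Default data (fit_bal and fit_bal100 are illustrative names): exponentiating a coefficient gives its odds ratio, and rescaling the predictor rescales the ratio.

library(ISLR)  # assumed source of the Default data
fit_bal <- glm(default ~ balance, family = binomial, data = Default)
exp(coef(fit_bal))     # per-dollar odds ratio, roughly 1.005

# Same model, with balance measured in hundreds of dollars:
fit_bal100 <- glm(default ~ I(balance / 100), family = binomial, data = Default)
exp(coef(fit_bal100))  # per-$100 odds ratio, much larger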

Odds versus Probabilities

“The odds of a student defaulting are \(\exp(0.5)\approx1.65\) times as high as a non-student.”

\[ \frac{P(\text{default} | \text{student} = 1)}{1 - P(\text{default} | \text{student} = 1)} \biggm/ \frac{P(\text{default} | \text{student} = 0)}{1 - P(\text{default} | \text{student} = 0)} = 1.65 \]

This cannot be solved for \(P(\text{default} | \text{student} = 1)\)!

\[ P(\text{default} | \text{student} = 1) = \dfrac{\exp(\eta(x_i))}{1 + \exp(\eta(x_i))} = \dfrac{\exp(-3.5 + 0.5)}{1 + \exp(-3.5 + 0.5)} \approx 0.047 \]

Multiple Linear Logistic Regression

  • Predictors can be multicollinear, confounded, and have interactions.
    • Logistic is just Linear on a transformed scale!
  • We do not look for transformations of the response.
    • The left-hand side is already a transformation of the response: \(\log\left(\frac{p(x_i)}{1-p(x_i)}\right)\)!
  • We do look for transformations of the predictors!
    • Sigmoid + Polynomial is where the real fun is.
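A minimal sketch of a predictor transformation, reusing the Default data from the previous sketch (fit_poly is an illustrative name):

# Quadratic in balance on the predictor side; the response is untouched
fit_poly <- glm(default ~ poly(balance, 2), family = binomial, data = Default)
summary(fit_poly)$coefficients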

Errors in Logistic Regression: Deviance

  • All “errors” are either \(p(x_i)\) or \(1 - p(x_i)\).
    • i.e., distances are measured from either 0 or 1.

Instead of squared errors, we use the deviance.

  • If \(p(x_i)\) were the true probability in a binomial distribution, what’s the probability of the observed value (0 or 1)?
    • In other words, \(P(Y_i = 1| \beta_0, \beta_1, X)\) and \(1 - P(Y_i = 1| \beta_0, \beta_1, X)\) are the residuals!
      • The residual we use depends on what the response is. If \(y_i = 0\), the residual is \(P(Y_i = 1 | \beta_0, \beta_1, X)\)
    • This is used more broadly in Generalized Linear Models (GLMs). Logistic Regression is one of many GLMs.
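A sketch connecting this to code, reusing fit_bal from the earlier sketch: the residual deviance is \(-2\) times the log of the product of \(P(\text{observed value})\) terms.

p_hat <- predict(fit_bal, type = "response")   # fitted p(x_i)
y01 <- as.numeric(Default$default == "Yes")    # response as 0/1

# -2 * log of the product of P(observed value) terms:
-2 * sum(y01 * log(p_hat) + (1 - y01) * log(1 - p_hat))
deviance(fit_bal)  # the same number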

Logistic Decision Boundaries

\[ P(\text{defaulting} | \eta(x_i)) > p \implies a + bx_1 + cx_2 + dx_3 > e \]

for some (linear) hyperplane \(a + bx_1 + cx_2 + dx_3\) and some value \(e\).

  • Choosing \(p=0.5\) is standard, but other thresholds can be chosen.
    • Cancer example: want to be more permissive of false positives.
      • Would rather operate and be wrong than falsely tell the patient that they’re healthy!
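A sketch of moving the threshold, reusing p_hat from the deviance sketch (pred_50 and pred_20 are illustrative names):

pred_50 <- ifelse(p_hat > 0.5, "Yes", "No")  # standard threshold p = 0.5
pred_20 <- ifelse(p_hat > 0.2, "Yes", "No")  # more permissive of false positives
table(pred_20)  # lowering the threshold flags many more potential defaulters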

Predictions - Just Plug it In!

            Intercept   Student   Balance     Income
\(\beta\)      -10.87     -0.65    0.0057   0.000003

We can make a prediction for a student with $2,000 balance and $20,000 income (the arithmetic below uses the unrounded coefficient estimates): \[\begin{align*} \eta(x) &= \beta_0 + \beta_1\cdot 1 + \beta_2\cdot 2000 + \beta_3\cdot 20000 \approx 0.0178\\ &\\ P(\text{defaulting} | x) &= \dfrac{\exp(\eta(x))}{1 + \exp(\eta(x))} \approx \dfrac{\exp(0.0178)}{1 + \exp(0.0178)} \approx 0.504\\ &\\ &P(\text{defaulting} | x) > 0.5 \implies \text{Predict Default} \end{align*}\]
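The same plug-in prediction can be done with predict(), assuming the Default data as before (fit_all and new_obs are illustrative names); small differences from the hand calculation come from coefficient rounding.

fit_all <- glm(default ~ student + balance + income,
               family = binomial, data = Default)
new_obs <- data.frame(student = "Yes", balance = 2000, income = 20000)
predict(fit_all, newdata = new_obs, type = "link")      # eta(x)
predict(fit_all, newdata = new_obs, type = "response")  # sigma(eta(x))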

Standard Errors: It’s complicated

We don’t have a closed-form estimate like \(\hat\beta = (X^TX)^{-1}X^TY\); we have to resort to numerical methods.

There is a closed form for \(V(\hat\beta)\) and a way to test significance, but they’re beyond the scope of this course.

  • Relies on likelihoods, which we avoided in this course.
  • You can assume significance tests in R output are correct.

20.2 Classification Basics

Goal: Predict a Category

  • Binary: Yes/no, success/failure, etc.
  • Categorical: 2 or more categories.
    • A.k.a. qualitative, but that’s a social science word.

In both: predict whether an observation is in category \(j\) given its predictors. \[ P(Y_i = j| x = x_i) \stackrel{def}{=} p_j(x_i) \]

Classification Confusion

Confusion Matrix: A tabular summary of classification errors.

                          True Pay (\(\cdot 0\))    True Def (\(\cdot 1\))
Pred Pay (\(0 \cdot\))    Good (00)                 Bad (01)
Pred Def (\(1 \cdot\))    Bad (10)                  Good (11)
  • Two ways to be wrong
  • Two ways to be right
  • Different applications have different needs

Accuracy: \(\dfrac{\text{Correct Predictions}}{\text{Number of Predictions}} =\frac{00 + 11}{00 + 01 + 10 + 11}\)
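A sketch of building this in R, reusing pred_50 and the Default data from the earlier sketches:

confusion <- table(Predicted = pred_50, True = Default$default)
confusion                               # the 2x2 table above
sum(diag(confusion)) / sum(confusion)   # accuracy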

Is “Accuracy” Good?

Task: Predict whether a person has cancer

(In this made-up example, 0.2% of people have cancer.)

                   True Healthy             True Cancer
Pred. Healthy      All good                 Lose a Life
Pred. Cancer       Expensive/Invasive       Save a Life
  • Easy: 99.8% accuracy.
    • Always guess “Not Cancer”
  • Very Hard: 99.82% accuracy.

The Confusion Matrix for Default Data

                  True Payment   True Default
Pred Payment      9627           228
Pred Default      40             105
  • This model: 97.32% accuracy.
    • Naive model: always predict “Pay” - 96.67% accuracy!

Other important measures (not on exam):

  • Sensitivity: \(\dfrac{\text{True Positives}}{\text{All Positives in Data}} = \dfrac{9627}{9627 + 40} = 99.58\%\) (Naive: 100%)
  • Specificity: \(\dfrac{\text{True Negatives}}{\text{All Negatives in Data}} = \dfrac{105}{105 + 228} = 31.53\%\) (Naive: 0%)
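A sketch computing these from the confusion object in the earlier sketch, keeping this slide's convention that a payment (default = “No”) is the positive class:

TP <- confusion["No", "No"];   FN <- confusion["Yes", "No"]
TN <- confusion["Yes", "Yes"]; FP <- confusion["No", "Yes"]
TP / (TP + FN)  # sensitivity
TN / (TN + FP)  # specificity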

Logistic Regression in R

The residual plots are the same as before.

The predictions can either be on the logit scale (type = "link", the default) or on the response scale (probabilities).

Regularization is often used with logistic regression (in Python’s scikit-learn package, ridge (L2) regularization is applied by default without warning the user).
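For contrast with scikit-learn's silent default, here is a sketch of explicit regularization in R, assuming the glmnet package (not part of base R; base glm() applies no penalty):

library(glmnet)  # assumed; provides penalized GLMs
X <- model.matrix(default ~ student + balance + income, data = Default)[, -1]
y <- Default$default
ridge_fit <- glmnet(X, y, family = "binomial", alpha = 0)  # alpha = 0: ridge
coef(ridge_fit, s = 0.01)  # coefficients at penalty lambda = 0.01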