Logistic Regression and the Intercept
In this example, we consider the logistic regression model \[\Pr(y=1)=\frac{1}{1+\text{exp}(-(b_0+b_1 x))}.\] We generate \(y\) based on a sample of 5000 \(x\) values and consider the IMV associated with predicting new \(y\) outcomes (based on new \(x\) observations) for different values of \(b_0\) and \(b_1\).
The baseline model uses only the overall proportion of 1s — i.e., \(\text{mean}(y)\) — as a flat prediction. The enhanced model uses the fitted logistic regression predictions \(\hat{p}\). The IMV measures how much better the regression predictions are compared to simply using the prevalence.
The key question this example addresses: does the intercept \(b_0\) matter for the IMV, even when the slope \(b_1\) is held fixed? Adjust \(b_1\) with the slider and run the simulation to find out.
#| '!! shinylive warning !!': |
#| shinylive does not work in self-contained HTML documents.
#| Please set `embed-resources: false` in your metadata.
#| standalone: true
#| viewerHeight: 900
library(shiny)
imv.binary<-function(y, p1, p2, sigma=1e-4) {
p1<-ifelse(p1<sigma,sigma,p1)
p2<-ifelse(p2<sigma,sigma,p2)
p1<-ifelse(p1>1-sigma,1-sigma,p1)
p2<-ifelse(p2>1-sigma,1-sigma,p2)
ll<-function(x,p) {
z<-log(p)*x+log(1-p)*(1-x)
z<-sum(z)/length(x)
exp(z)
}
loglik1<-ll(y,p1)
loglik2<-ll(y,p2)
getcoins<-function(a) {
f<-function(p,a) abs(p*log(p)+(1-p)*log(1-p)-log(a))
nlminb(.5,f,lower=0.001,upper=.999,a=a)$par
}
c1<-getcoins(loglik1)
c2<-getcoins(loglik2)
ew<-function(p1,p0) (p1-p0)/p0
imv<-ew(c2,c1)
imv
}
ui <- fluidPage(
titlePanel("Logistic Regression: IMV vs b0"),
sidebarLayout(
sidebarPanel(
sliderInput("b1",
"Slope (b1):",
min = -2,
max = 2,
value = 1,
step = 0.1),
actionButton("run", "Run Simulation", class = "btn-primary"),
hr(),
fluidRow(
column(6, numericInput("ymin", "Y-axis min:", value = -0.2, step = 0.05)),
column(6, numericInput("ymax", "Y-axis max:", value = 0.5, step = 0.05))
),
helpText("Adjust b1 and click 'Run Simulation' to see how IMV varies with b0.")
),
mainPanel(
plotOutput("omPlot", height = "600px")
)
)
)
server <- function(input, output, session) {
sim_results <- eventReactive(input$run, {
withProgress(message = 'Running simulation...', value = 0, {
b1 <- input$b1
b0vals <- seq(-2, 2, length.out = 100)
om <- numeric()
for (i in seq_along(b0vals)) {
b0 <- b0vals[i]
pr <- function(x, b0, b1) 1/(1 + exp(-(b0 + b1*x)))
x <- rnorm(5000)
p <- pr(x, b0 = b0, b1 = b1)
y <- rbinom(length(x), 1, p)
df <- data.frame(x = x, y = y)
m <- glm(y ~ x, df, family = 'binomial')
##oos data
x2 <- rnorm(1000)
p2 <- pr(x2, b0 = b0, b1 = b1)
y2 <- rbinom(length(x2), 1, p2)
p.oos <- predict(m, data.frame(x = x2), type = 'response')
om[as.character(b0)] <- imv.binary(y2, mean(y), p.oos)
incProgress(1/length(b0vals))
}
list(b0vals = b0vals, om = om)
})
})
output$omPlot <- renderPlot({
req(sim_results())
results <- sim_results()
# Fit loess
loess_fit <- loess(results$om ~ results$b0vals, span = 0.5)
loess_pred <- predict(loess_fit)
# Plot points
plot(results$b0vals, results$om,
xlab = expression(b[0]),
ylab = "IMV(prevalence, fitted model)",
ylim=c(input$ymin, input$ymax),
main = paste0("IMV vs b0 (b1 = ", input$b1, ")"),
pch = 19,
col = "steelblue")
# Add loess smooth
lines(results$b0vals, loess_pred, col = "darkred", lwd = 2)
# Add reference line and grid
abline(h = 0, lty = 2, col = "red")
grid()
})
}
shinyApp(ui = ui, server = server)
What you should see:
- The IMV peaks near \(b_0=0\) (balanced classes) and falls off as \(|b_0|\) grows — when the intercept pushes the outcome prevalence toward 0 or 1, the baseline (mean \(y\)) already captures most of the outcome, leaving little for the regression to improve on.
- The curve is symmetric around \(b_0=0\): a large positive intercept and a large negative intercept produce the same IMV because both create an extreme prevalence with the same predictability.
- As \(b_1\) increases (stronger slope), the peak IMV at \(b_0=0\) rises and the curve stays elevated over a wider range of \(b_0\) values — a stronger predictor can partially overcome the headwind of extreme prevalence.