Logistic Regression and the Oracle
The Oracle IMV is a special diagnostic that is only available in simulation studies, where the true data-generating probabilities \(p\) are known. Normally, we only have estimated probabilities \(\hat{p}\) from a fitted model. The Oracle IMV compares \(\hat{p}\) directly against the true \(p\) — measuring not how well the model predicts new outcomes, but how close the estimated probabilities are to the truth. As sample size grows and estimation improves, \(\hat{p} \to p\) and the Oracle IMV should shrink toward zero.
We consider the logistic regression model \[p=\Pr(y=1)=\frac{1}{1+\text{exp}(-(b_0+b_1(x_1+x_2+x_3+x_4+x_5)))},\] where five independent standard-normal covariates each carry the same slope \(b_1\). We generate \(y\) based on samples across a wide range of sample sizes. For each sample size we fit the correctly specified model (yielding estimates \(\hat{p}\)) and compute IMV(\(\hat{p}\), \(p\)) — the Oracle IMV — which measures how far the estimates are from the true probabilities. With six parameters to estimate, small samples produce noisy \(\hat{p}\); a large Oracle IMV indicates substantial estimation noise; an Oracle IMV near zero means \(\hat{p} \approx p\).
Adjust \(b_0\) and \(b_1\) and run the simulation to see how estimation quality changes with sample size.
#| '!! shinylive warning !!': |
#| shinylive does not work in self-contained HTML documents.
#| Please set `embed-resources: false` in your metadata.
#| standalone: true
#| viewerHeight: 900
library(shiny)
imv.binary<-function(y, p1, p2, sigma=1e-4) {
p1<-ifelse(p1<sigma,sigma,p1)
p2<-ifelse(p2<sigma,sigma,p2)
p1<-ifelse(p1>1-sigma,1-sigma,p1)
p2<-ifelse(p2>1-sigma,1-sigma,p2)
ll<-function(x,p) {
z<-log(p)*x+log(1-p)*(1-x)
z<-sum(z)/length(x)
exp(z)
}
loglik1<-ll(y,p1)
loglik2<-ll(y,p2)
getcoins<-function(a) {
f<-function(p,a) abs(p*log(p)+(1-p)*log(1-p)-log(a))
nlminb(.5,f,lower=0.001,upper=.999,a=a)$par
}
c1<-getcoins(loglik1)
c2<-getcoins(loglik2)
ew<-function(p1,p0) (p1-p0)/p0
imv<-ew(c2,c1)
imv
}
ui <- fluidPage(
titlePanel("Logistic Regression: Sample Size vs Out-of-Sample Performance"),
sidebarLayout(
sidebarPanel(
sliderInput("b0",
"Intercept (b0):",
min = -2,
max = 2,
value = 0,
step = 0.1),
sliderInput("b1",
"Slope (b1):",
min = 0,
max = .5,
value = 0.25,
step = 0.01),
actionButton("run", "Run Simulation", class = "btn-primary"),
hr(),
fluidRow(
column(6, numericInput("ymin", "Y-axis min:", value = -0.2, step = 0.05)),
column(6, numericInput("ymax", "Y-axis max:", value = 0.5, step = 0.05))
),
p("This app computes the Oracle IMV — the gap between estimated probabilities and the true generating probabilities — as a function of sample size.")
),
mainPanel(
plotOutput("omPlot", height = "500px"),
hr(),
verbatimTextOutput("summary")
)
)
)
server <- function(input, output, session) {
simData <- eventReactive(input$run, {
b0 <- input$b0
b1 <- input$b1
N <- runif(100, 2, 4)
Nvals <- round(10^N)
om <- numeric()
withProgress(message = 'Running simulation...', value = 0, {
for (i in seq_along(Nvals)) {
N <- Nvals[i]
lp <- function(x1,x2,x3,x4,x5,b0,b1) b0 + b1*(x1+x2+x3+x4+x5)
pr <- function(x1,x2,x3,x4,x5,b0,b1) 1/(1+exp(-lp(x1,x2,x3,x4,x5,b0,b1)))
x1<-rnorm(N); x2<-rnorm(N); x3<-rnorm(N); x4<-rnorm(N); x5<-rnorm(N)
p <- pr(x1,x2,x3,x4,x5,b0,b1)
y <- rbinom(N, 1, p)
df <- data.frame(x1=x1,x2=x2,x3=x3,x4=x4,x5=x5)
m <- glm(y ~ x1+x2+x3+x4+x5, data=df, family='binomial')
## oos data
x1<-rnorm(10000); x2<-rnorm(10000); x3<-rnorm(10000); x4<-rnorm(10000); x5<-rnorm(10000)
p2 <- pr(x1,x2,x3,x4,x5,b0,b1)
y2 <- rbinom(10000, 1, p2)
df2 <- data.frame(x1=x1,x2=x2,x3=x3,x4=x4,x5=x5)
p.oos <- predict(m, df2, type='response')
om[as.character(N)] <- imv.binary(y2, p.oos, p2)
incProgress(1/length(Nvals))
}
})
list(n = as.numeric(names(om)), om = om)
})
output$omPlot <- renderPlot({
data <- simData()
plot(data$n, data$om,
xlab = "Sample Size (N)",
ylab = "Oracle IMV: IMV(estimated p, true p)",
main = paste0("Oracle IMV vs Sample Size\n(b0 = ",
input$b0, ", b1 = ", input$b1, ", 5 covariates)"),
pch = 16,
col = rgb(0, 0, 1, 0.5),
cex = 1.2,
ylim=c(input$ymin, input$ymax),
log = "x")
# Add a smoothed trend line
lines(lowess(data$n, data$om), col = "red", lwd = 2)
grid()
})
output$summary <- renderPrint({
data <- simData()
cat("Simulation Summary\n")
cat("==================\n\n")
cat("Parameters:\n")
cat(" b0 (Intercept):", input$b0, "\n")
cat(" b1 (Slope):", input$b1, "\n\n")
cat("Number of simulations: 100\n")
cat("Sample Size Range:", min(data$n), "to", max(data$n), "\n")
cat("IMV Range:", round(min(data$om), 4), "to", round(max(data$om), 4), "\n")
cat("Mean IMV:", round(mean(data$om), 4), "\n")
})
}
shinyApp(ui = ui, server = server)
What you should see:
- As sample size increases, the Oracle IMV decreases toward zero — this is the statistical property of consistency: with enough data, \(\hat{p} \to p\) and the estimation gap vanishes.
- With six parameters to estimate, the decay is noticeably slower than in a two-parameter model: the Oracle IMV stays elevated well past \(N=100\) and the convergence trend is clearly visible across the full range.
- When \(b_1\) is small, the outcome probabilities are close to the prevalence and the signal is weak — estimation noise persists to larger \(N\).
- When \(b_1\) is larger, each covariate contributes more discriminating information and the estimates converge faster.
- The Oracle IMV is always non-negative by construction: estimated probabilities can never be systematically better than the truth they are approximating.