Logistic Regression and the Oracle

The Oracle IMV is a special diagnostic that is only available in simulation studies, where the true data-generating probabilities \(p\) are known. Normally, we only have estimated probabilities \(\hat{p}\) from a fitted model. The Oracle IMV compares \(\hat{p}\) directly against the true \(p\) — measuring not how well the model predicts new outcomes, but how close the estimated probabilities are to the truth. As sample size grows and estimation improves, \(\hat{p} \to p\) and the Oracle IMV should shrink toward zero.

We consider the logistic regression model \[p=\Pr(y=1)=\frac{1}{1+\text{exp}(-(b_0+b_1 x))}.\] We generate \(y\) based on samples of \(x\) values across a wide range of sample sizes. For each sample size we fit the correctly specified model (yielding estimates \(\hat{p}\)) and compute IMV(\(\hat{p}\), \(p\)) — the Oracle IMV — which measures how far the estimates are from the true probabilities. A large Oracle IMV indicates substantial estimation noise; an Oracle IMV near zero means \(\hat{p} \approx p\).

Adjust \(b_0\) and \(b_1\) and run the simulation to see how estimation quality changes with sample size.

#| '!! shinylive warning !!': |
#|   shinylive does not work in self-contained HTML documents.
#|   Please set `embed-resources: false` in your metadata.
#| standalone: true
#| viewerHeight: 900

library(shiny)

imv.binary<-function(y, p1, p2, sigma=1e-4) {
    p1<-ifelse(p1<sigma,sigma,p1)
    p2<-ifelse(p2<sigma,sigma,p2)
    p1<-ifelse(p1>1-sigma,1-sigma,p1)
    p2<-ifelse(p2>1-sigma,1-sigma,p2)
    
    ll<-function(x,p) {
        z<-log(p)*x+log(1-p)*(1-x)
        z<-sum(z)/length(x)
        exp(z)
    }    
    loglik1<-ll(y,p1)
    loglik2<-ll(y,p2)
    getcoins<-function(a) {
        f<-function(p,a) abs(p*log(p)+(1-p)*log(1-p)-log(a))
        nlminb(.5,f,lower=0.001,upper=.999,a=a)$par
    }
    c1<-getcoins(loglik1)
    c2<-getcoins(loglik2)
    ew<-function(p1,p0) (p1-p0)/p0
    imv<-ew(c2,c1)
    imv
}

ui <- fluidPage(
  titlePanel("Logistic Regression: Sample Size vs Out-of-Sample Performance"),
  
  sidebarLayout(
    sidebarPanel(
      sliderInput("b0", 
                  "Intercept (b0):", 
                  min = -2, 
                  max = 2, 
                  value = 0, 
                  step = 0.1),
      
      sliderInput("b1", 
                  "Slope (b1):", 
                  min = 0, 
                  max = .5, 
                  value = 0.25, 
                  step = 0.01),
      
      actionButton("run", "Run Simulation", class = "btn-primary"),
      
      hr(),
      
      p("This app computes the Oracle IMV — the gap between estimated probabilities and the true generating probabilities — as a function of sample size.")
    ),
    
    mainPanel(
      plotOutput("omPlot", height = "500px"),
      
      hr(),
      
      verbatimTextOutput("summary")
    )
  )
)

server <- function(input, output, session) {
  
  simData <- eventReactive(input$run, {
    b0 <- input$b0
    b1 <- input$b1
    
    N <- runif(100, 2.4, 4)
    Nvals <- round(10^N)
    
    om <- numeric()
    
    withProgress(message = 'Running simulation...', value = 0, {
      for (i in seq_along(Nvals)) {
        N <- Nvals[i]
        
        pr <- function(x, b0, b1) 1/(1 + exp(-(b0 + b1*x)))
        
        x <- rnorm(N)
        p <- pr(x, b0 = b0, b1 = b1)
        y <- rbinom(length(x), 1, p)
        df <- data.frame(x = x, y = y)
        m <- glm(y ~ x, df, family = 'binomial')
        
        ## oos data
        x2 <- rnorm(10000)
        p2 <- pr(x2, b0 = b0, b1 = b1)
        y2 <- rbinom(length(x2), 1, p2)
        p.oos <- predict(m, data.frame(x = x2), type = 'response')
        om[as.character(N)] <- imv.binary(y2, p.oos, p2)
        
        incProgress(1/length(Nvals))
      }
    })
    
    list(n = as.numeric(names(om)), om = om)
  })
  
  output$omPlot <- renderPlot({
    data <- simData()
    
    plot(data$n, data$om,
         xlab = "Sample Size (N)",
         ylab = "Oracle IMV: IMV(estimated p, true p)",
         main = paste0("Oracle IMV vs Sample Size\n(b0 = ", 
                      input$b0, ", b1 = ", input$b1, ")"),
         pch = 16,
         col = rgb(0, 0, 1, 0.5),
         cex = 1.2,
         ylim=c(0,.1),
         log = "x")
    
    # Add a smoothed trend line
    lines(lowess(data$n, data$om), col = "red", lwd = 2)
    grid()
  })
  
  output$summary <- renderPrint({
    data <- simData()
    cat("Simulation Summary\n")
    cat("==================\n\n")
    cat("Parameters:\n")
    cat("  b0 (Intercept):", input$b0, "\n")
    cat("  b1 (Slope):", input$b1, "\n\n")
    cat("Number of simulations: 100\n")
    cat("Sample Size Range:", min(data$n), "to", max(data$n), "\n")
    cat("IMV Range:", round(min(data$om), 4), "to", round(max(data$om), 4), "\n")
    cat("Mean IMV:", round(mean(data$om), 4), "\n")
  })
}

shinyApp(ui = ui, server = server)

What you should see:

As sample size increases, the Oracle IMV decreases toward zero — this is the statistical property of consistency: with enough data, \(\hat{p} \to p\) and the estimation gap vanishes.
When \(b_1\) is small, the outcome probabilities are all close to the prevalence and hard to distinguish. Even with large \(N\), there is little signal for the model to recover, so estimation noise remains relatively high.
When \(b_1\) is large, the model has strong signal to estimate and converges quickly — the Oracle IMV drops steeply as \(N\) grows.
The Oracle IMV is always non-negative by construction: estimated probabilities can never be systematically better than the truth they are approximating.

← Back to Home