The InterModel Vigorish (IMV)
  • Home
  • Simulation Examples
    • Logistic regression and the intercept
    • Logistic regression and the Oracle
    • Logistic regression and the Overfit
    • 2PL versus 3PL predictions
    • The collapse of the thresholded IMV
    • Implied probabilities and the IMV
    • IRT versus Cognitive Diagnostic Models
    • Multidimensional IRT and unidimensional scoring
  • Computational Examples
    • Logistic regression (glm)
    • Mixed-effects logistic regression (glmer)
    • Unidimensional IRT
    • Multidimensional IRT

Logistic Regression and the Oracle

The Oracle IMV is a special diagnostic that is only available in simulation studies, where the true data-generating probabilities \(p\) are known. Normally, we only have estimated probabilities \(\hat{p}\) from a fitted model. The Oracle IMV compares \(\hat{p}\) directly against the true \(p\) — measuring not how well the model predicts new outcomes, but how close the estimated probabilities are to the truth. As sample size grows and estimation improves, \(\hat{p} \to p\) and the Oracle IMV should shrink toward zero.

We consider the logistic regression model \[p=\Pr(y=1)=\frac{1}{1+\text{exp}(-(b_0+b_1(x_1+x_2+x_3+x_4+x_5)))},\] where five independent standard-normal covariates each carry the same slope \(b_1\). We generate \(y\) based on samples across a wide range of sample sizes. For each sample size we fit the correctly specified model (yielding estimates \(\hat{p}\)) and compute IMV(\(\hat{p}\), \(p\)) — the Oracle IMV — which measures how far the estimates are from the true probabilities. With six parameters to estimate, small samples produce noisy \(\hat{p}\); a large Oracle IMV indicates substantial estimation noise; an Oracle IMV near zero means \(\hat{p} \approx p\).

Adjust \(b_0\) and \(b_1\) and run the simulation to see how estimation quality changes with sample size.

#| '!! shinylive warning !!': |
#|   shinylive does not work in self-contained HTML documents.
#|   Please set `embed-resources: false` in your metadata.
#| standalone: true
#| viewerHeight: 900

library(shiny)

imv.binary<-function(y, p1, p2, sigma=1e-4) {
    p1<-ifelse(p1<sigma,sigma,p1)
    p2<-ifelse(p2<sigma,sigma,p2)
    p1<-ifelse(p1>1-sigma,1-sigma,p1)
    p2<-ifelse(p2>1-sigma,1-sigma,p2)
    
    ll<-function(x,p) {
        z<-log(p)*x+log(1-p)*(1-x)
        z<-sum(z)/length(x)
        exp(z)
    }    
    loglik1<-ll(y,p1)
    loglik2<-ll(y,p2)
    getcoins<-function(a) {
        f<-function(p,a) abs(p*log(p)+(1-p)*log(1-p)-log(a))
        nlminb(.5,f,lower=0.001,upper=.999,a=a)$par
    }
    c1<-getcoins(loglik1)
    c2<-getcoins(loglik2)
    ew<-function(p1,p0) (p1-p0)/p0
    imv<-ew(c2,c1)
    imv
}

ui <- fluidPage(
  titlePanel("Logistic Regression: Sample Size vs Out-of-Sample Performance"),
  
  sidebarLayout(
    sidebarPanel(
      sliderInput("b0", 
                  "Intercept (b0):", 
                  min = -2, 
                  max = 2, 
                  value = 0, 
                  step = 0.1),
      
      sliderInput("b1", 
                  "Slope (b1):", 
                  min = 0, 
                  max = .5, 
                  value = 0.25, 
                  step = 0.01),
      
      actionButton("run", "Run Simulation", class = "btn-primary"),
      
      hr(),
      fluidRow(
        column(6, numericInput("ymin", "Y-axis min:", value = -0.2, step = 0.05)),
        column(6, numericInput("ymax", "Y-axis max:", value =  0.5, step = 0.05))
      ),
      p("This app computes the Oracle IMV — the gap between estimated probabilities and the true generating probabilities — as a function of sample size.")
    ),
    
    mainPanel(
      plotOutput("omPlot", height = "500px"),
      
      hr(),
      
      verbatimTextOutput("summary")
    )
  )
)

server <- function(input, output, session) {
  
  simData <- eventReactive(input$run, {
    b0 <- input$b0
    b1 <- input$b1
    
    N <- runif(100, 2, 4)
    Nvals <- round(10^N)

    om <- numeric()

    withProgress(message = 'Running simulation...', value = 0, {
      for (i in seq_along(Nvals)) {
        N <- Nvals[i]

        lp <- function(x1,x2,x3,x4,x5,b0,b1) b0 + b1*(x1+x2+x3+x4+x5)
        pr <- function(x1,x2,x3,x4,x5,b0,b1) 1/(1+exp(-lp(x1,x2,x3,x4,x5,b0,b1)))

        x1<-rnorm(N); x2<-rnorm(N); x3<-rnorm(N); x4<-rnorm(N); x5<-rnorm(N)
        p <- pr(x1,x2,x3,x4,x5,b0,b1)
        y <- rbinom(N, 1, p)
        df <- data.frame(x1=x1,x2=x2,x3=x3,x4=x4,x5=x5)
        m <- glm(y ~ x1+x2+x3+x4+x5, data=df, family='binomial')

        ## oos data
        x1<-rnorm(10000); x2<-rnorm(10000); x3<-rnorm(10000); x4<-rnorm(10000); x5<-rnorm(10000)
        p2 <- pr(x1,x2,x3,x4,x5,b0,b1)
        y2 <- rbinom(10000, 1, p2)
        df2 <- data.frame(x1=x1,x2=x2,x3=x3,x4=x4,x5=x5)
        p.oos <- predict(m, df2, type='response')
        om[as.character(N)] <- imv.binary(y2, p.oos, p2)

        incProgress(1/length(Nvals))
      }
    })
    
    list(n = as.numeric(names(om)), om = om)
  })
  
  output$omPlot <- renderPlot({
    data <- simData()
    
    plot(data$n, data$om,
         xlab = "Sample Size (N)",
         ylab = "Oracle IMV: IMV(estimated p, true p)",
         main = paste0("Oracle IMV vs Sample Size\n(b0 = ",
                      input$b0, ", b1 = ", input$b1, ", 5 covariates)"),
         pch = 16,
         col = rgb(0, 0, 1, 0.5),
         cex = 1.2,
         ylim=c(input$ymin, input$ymax),
         log = "x")
    
    # Add a smoothed trend line
    lines(lowess(data$n, data$om), col = "red", lwd = 2)
    grid()
  })
  
  output$summary <- renderPrint({
    data <- simData()
    cat("Simulation Summary\n")
    cat("==================\n\n")
    cat("Parameters:\n")
    cat("  b0 (Intercept):", input$b0, "\n")
    cat("  b1 (Slope):", input$b1, "\n\n")
    cat("Number of simulations: 100\n")
    cat("Sample Size Range:", min(data$n), "to", max(data$n), "\n")
    cat("IMV Range:", round(min(data$om), 4), "to", round(max(data$om), 4), "\n")
    cat("Mean IMV:", round(mean(data$om), 4), "\n")
  })
}

shinyApp(ui = ui, server = server)

What you should see:

  • As sample size increases, the Oracle IMV decreases toward zero — this is the statistical property of consistency: with enough data, \(\hat{p} \to p\) and the estimation gap vanishes.
  • With six parameters to estimate, the decay is noticeably slower than in a two-parameter model: the Oracle IMV stays elevated well past \(N=100\) and the convergence trend is clearly visible across the full range.
  • When \(b_1\) is small, the outcome probabilities are close to the prevalence and the signal is weak — estimation noise persists to larger \(N\).
  • When \(b_1\) is larger, each covariate contributes more discriminating information and the estimates converge faster.
  • The Oracle IMV is always non-negative by construction: estimated probabilities can never be systematically better than the truth they are approximating.

← Back to Home