The InterModel Vigorish (IMV)
  • Home
  • Simulation Examples
    • Logistic regression and the intercept
    • Logistic regression and the Oracle
    • Logistic regression and the Overfit
    • 2PL versus 3PL predictions
    • The collapse of the thresholded IMV
    • Implied probabilities and the IMV
    • IRT versus Cognitive Diagnostic Models
    • Multidimensional IRT and unidimensional scoring
  • Computational Examples
    • Logistic regression (glm)
    • Mixed-effects logistic regression (glmer)
    • Unidimensional IRT
    • Multidimensional IRT

The InterModel Vigorish (IMV)

What is the IMV?

When two models both predict the same binary outcome, how much better is one than the other — and does that difference actually matter?

Standard metrics like R², AUC, and the F₁ score can answer versions of this question, but they have a shared limitation: their values depend on the baseline difficulty of the prediction problem. An improvement of 0.03 in AUC means something very different when predicting a rare event (prevalence = 2%) versus a common one (prevalence = 50%). This makes it hard to compare model improvements across different datasets or outcomes.

The InterModel Vigorish (IMV) is designed to fix this. It is a metric for quantifying the change in predictive accuracy between two models — a baseline and an enhanced prediction — in a way that is portable (comparable across outcomes with different prevalences) and intuitive (grounded in a concrete physical analogy).


The Weighted Coin Analogy

The IMV is built on an analogy to weighted coins.

Any predictive system that assigns probabilities to binary outcomes can be mapped to an equivalent weighted coin — a physical object whose bias exactly matches the average uncertainty in those predictions. When you have two such systems, you can ask: by how much does the enhanced model’s coin outperform the baseline model’s coin in a single-blind bet?

More formally, the IMV is the expected proportional winnings from betting according to the enhanced model’s probabilities when the baseline model sets the odds. A positive IMV means the enhanced model provides genuine predictive value beyond the baseline; a negative IMV signals overfitting or model misspecification.

This framing has a key consequence: because the baseline model defines the bet, the IMV is always a statement about relative improvement, not absolute accuracy. The same absolute improvement in log-likelihood will yield a larger IMV when the baseline outcome is highly uncertain (prevalence near 0.5) than when it is already predictable — which is exactly the right behavior.

Try it yourself

Two weighted coins are flipped 20 times each. You see the outcomes but not the true weights. Enter your best guesses for each coin’s probability of heads — these define your enhanced model. The baseline model uses only the overall average (treating both coins identically). The IMV then measures how much your guesses improve on that baseline.

#| '!! shinylive warning !!': |
#|   shinylive does not work in self-contained HTML documents.
#|   Please set `embed-resources: false` in your metadata.
#| standalone: true
#| viewerHeight: 680

library(shiny)

# ── IMV helpers ──────────────────────────────────────────────────────────────

get_coin_weight <- function(avg_ll, sigma = 1e-6) {
  target <- log(avg_ll)
  # The entropy H(w) = w*log(w) + (1-w)*log(1-w) has a minimum of log(0.5) at w=0.5.
  # If target is below this floor (with small tolerance for floating point),
  # there is no solution — predictions are worse than a fair coin.
  tol <- 1e-9
  if (target < log(0.5) - tol) stop("implausible")
  # Boundary case: target is at or very near the floor — coin weight is 0.5
  if (target <= log(0.5) + tol) return(0.5)
  f <- function(w) w * log(w) + (1 - w) * log(1 - w) - target
  uniroot(f, lower = 0.5 + sigma, upper = 1 - sigma)$root
}

avg_ll <- function(y, p, sigma = 1e-4) {
  p <- pmin(pmax(p, sigma), 1 - sigma)
  exp(mean(y * log(p) + (1 - y) * log(1 - p)))
}

imv_coins <- function(y, p_baseline, p_enhanced) {
  ll0 <- avg_ll(y, p_baseline)
  ll1 <- avg_ll(y, p_enhanced)
  w0  <- tryCatch(get_coin_weight(ll0), error = function(e) NA)
  w1  <- tryCatch(get_coin_weight(ll1), error = function(e) NA)
  list(w0 = w0, w1 = w1, imv = if (!is.na(w0) && !is.na(w1)) (w1 - w0) / w0 else NA)
}

# ── UI ───────────────────────────────────────────────────────────────────────

ui <- fluidPage(

  tags$head(tags$style(HTML("
    body { font-family: 'Georgia', serif; background: #fafaf8; }
    .well { background: #f0ede6; border: none; border-radius: 8px; }
    .flip-row { font-size: 1.3em; letter-spacing: 3px; margin: 4px 0; }
    .coin-label { font-weight: bold; font-size: 1.05em; margin-top: 10px; }
    .result-box { background: #fff; border: 1px solid #ddd; border-radius: 8px;
                  padding: 16px 20px; margin-top: 12px; }
    .result-box h4 { margin-top: 0; }
    .imv-positive { color: #2a7d2e; font-weight: bold; }
    .imv-negative { color: #b33a1e; font-weight: bold; }
    .imv-zero     { color: #666;    font-weight: bold; }
    .reveal-box   { background: #f5f0e8; border-left: 4px solid #8b6914;
                    padding: 12px 16px; border-radius: 4px; margin-top: 10px; }
    .error-box    { background: #fff5f5; border: 1px solid #f5c6cb; border-radius: 8px;
                    padding: 16px 20px; margin-top: 12px; }
    .error-box h4 { margin-top: 0; color: #b33a1e; }
    hr.thin { border-top: 1px solid #ddd; margin: 14px 0; }
    .explainer-inner { background: #fafafa; border: 1px solid #e8e8e8;
                       border-radius: 6px; padding: 14px 16px; font-size: 0.88em; }
    .stat-grid { display: grid; grid-template-columns: 1fr 1fr; gap: 8px; margin: 10px 0; }
    .stat-card { background: #f0ede6; border-radius: 6px; padding: 8px 10px; }
    .stat-label { font-size: 0.78em; color: #666; margin-bottom: 2px; }
    .stat-val   { font-size: 1.1em; font-weight: bold; }
    .val-ok   { color: #2a7d2e; }
    .val-fail { color: #b33a1e; }
    .curve-wrap { position: relative; width: 100%; height: 200px; margin: 10px 0 4px; }
  "))),

  titlePanel(NULL),

  sidebarLayout(
    sidebarPanel(width = 4,
      actionButton("new_game", "🎲  New coins", class = "btn-primary btn-block",
                   style = "margin-bottom:14px;"),
      tags$div(class = "coin-label", "Coin 1 outcomes:"),
      uiOutput("flips1_ui"),
      tags$div(class = "coin-label", style = "margin-top:10px;", "Coin 2 outcomes:"),
      uiOutput("flips2_ui"),
      tags$hr(class = "thin"),
      tags$p("Enter your probability guesses:"),
      sliderInput("g1", "Your guess for Coin 1 (p₁):",
                  min = 0.01, max = 0.99, value = 0.5, step = 0.01),
      sliderInput("g2", "Your guess for Coin 2 (p₂):",
                  min = 0.01, max = 0.99, value = 0.5, step = 0.01),
      actionButton("submit", "⚖️  Compute IMV", class = "btn-success btn-block",
                   style = "margin-top:6px;")
    ),

    mainPanel(width = 8,
      uiOutput("results_ui")
    )
  )
)

# ── Server ───────────────────────────────────────────────────────────────────

server <- function(input, output, session) {

  # Reactive game state
  game <- reactiveValues(
    p1 = NULL, p2 = NULL,
    flips1 = NULL, flips2 = NULL,
    submitted = FALSE
  )

  # Generate new coins on button press (and on startup)
  observeEvent(input$new_game, {
    game$p1 <- runif(1, 0.3, 0.85)
    game$p2 <- runif(1, 0.3, 0.85)
    game$flips1 <- rbinom(20, 1, game$p1)
    game$flips2 <- rbinom(20, 1, game$p2)
    game$submitted <- FALSE
    updateSliderInput(session, "g1", value = 0.5)
    updateSliderInput(session, "g2", value = 0.5)
  }, ignoreNULL = FALSE)   # run on startup too

  # Show flip sequences
  fmt_flips <- function(flips) {
    paste(ifelse(flips == 1, "H", "T"), collapse = " ")
  }

  output$flips1_ui <- renderUI({
    req(game$flips1)
    heads <- sum(game$flips1)
    tags$div(
      tags$div(class = "flip-row", fmt_flips(game$flips1)),
      tags$small(style = "color:#555;",
        sprintf("(%d heads, %d tails out of 20)", heads, 20 - heads))
    )
  })

  output$flips2_ui <- renderUI({
    req(game$flips2)
    heads <- sum(game$flips2)
    tags$div(
      tags$div(class = "flip-row", fmt_flips(game$flips2)),
      tags$small(style = "color:#555;",
        sprintf("(%d heads, %d tails out of 20)", heads, 20 - heads))
    )
  })

  # On submit: compute and display results
  observeEvent(input$submit, {
    game$submitted <- TRUE
  })

  output$results_ui <- renderUI({

    if (!game$submitted) {
      return(tags$div(
        style = "color:#888; margin-top:30px; font-style:italic;",
        "Adjust your probability guesses using the sliders, then click",
        tags$strong("Compute IMV"), "to see your results."
      ))
    }

    req(game$flips1, game$flips2)

    # Build outcome and prediction vectors over both coins combined
    y_all  <- c(game$flips1, game$flips2)
    coin_id <- c(rep(1, 20), rep(2, 20))

    # Baseline: same probability for every flip = overall mean
    p_base <- mean(y_all)
    p_baseline_vec <- rep(p_base, 40)

    # Enhanced: user's guesses per coin
    p_enhanced_vec <- ifelse(coin_id == 1, input$g1, input$g2)

    res <- imv_coins(y_all, p_baseline_vec, p_enhanced_vec)

    # Check which model(s) failed
    baseline_failed  <- is.na(res$w0)
    enhanced_failed  <- is.na(res$w1)

    if (baseline_failed || enhanced_failed) {

      # avg_ll() returns exp(mean log-likelihood), so log() recovers mean log-likelihood
      ll_enhanced <- log(avg_ll(y_all, p_enhanced_vec))
      ll_baseline <- log(avg_ll(y_all, p_baseline_vec))
      log2_floor  <- log(0.5)
      gap_enh     <- ll_enhanced - log2_floor
      gap_base    <- ll_baseline - log2_floor

      failed_model <- if (baseline_failed && enhanced_failed) {
        "both the baseline and your guesses are"
      } else if (baseline_failed) {
        "the baseline model is"
      } else {
        "your guesses are"
      }

      # Pass values into JS via data attributes on a hidden div
      return(tagList(
        tags$div(class = "error-box",
          tags$h4("\u26a0\ufe0f Predictions too far from the data"),
          tags$p(
            "The IMV cannot be computed because ", failed_model,
            " so inconsistent with the observed flips that they perform",
            " worse than a fair coin (p\u00a0=\u00a00.5). In the weighted-coin",
            " framework, no valid coin weight exists for predictions this poor."
          ),
          tags$p("Try adjusting your guesses to be closer to the observed",
                 " proportion of heads for each coin."),
          tags$div(class = "reveal-box",
            tags$strong("Observed proportions:"),
            tags$br(),
            sprintf("Coin 1: %.2f heads  (your guess: %.2f)",
                    mean(game$flips1), input$g1),
            tags$br(),
            sprintf("Coin 2: %.2f heads  (your guess: %.2f)",
                    mean(game$flips2), input$g2)
          ),

          tags$div(class = "explainer-inner", style = "margin-top: 14px;",
              tags$p(
                "The IMV maps every set of predictions to an equivalent weighted coin",
                " via the bijection: find w \u2208 (0.5, 1) such that its Bernoulli entropy",
                " H(w)\u00a0=\u00a0w\u00a0log\u00a0w\u00a0+\u00a0(1\u2212w)\u00a0log(1\u2212w)",
                " equals your average log-likelihood \u2113\u0304.",
                " But H(w) has a floor at H(0.5)\u00a0=\u00a0\u2212log\u00a02\u00a0\u2248\u00a0\u22120.693 —",
                " the entropy of a perfectly fair coin.",
                " If your \u2113\u0304 falls below this floor, the equation has no solution.",
                " Your predictions are so confidently wrong they carry",
                " less information than a fair coin flip."
              ),

              tags$div(class = "stat-grid",
                tags$div(class = "stat-card",
                  tags$div(class = "stat-label", "Your avg log-likelihood \u2113\u0304"),
                  tags$div(class = paste("stat-val", if (enhanced_failed) "val-fail" else "val-ok"),
                           sprintf("%.4f", ll_enhanced))
                ),
                tags$div(class = "stat-card",
                  tags$div(class = "stat-label", "Feasibility floor \u2212log\u00a02"),
                  tags$div(class = "stat-val", "\u22120.6931")
                ),
                tags$div(class = "stat-card",
                  tags$div(class = "stat-label", "Gap to floor"),
                  tags$div(class = paste("stat-val", if (enhanced_failed) "val-fail" else "val-ok"),
                           sprintf("%+.4f", gap_enh))
                ),
                tags$div(class = "stat-card",
                  tags$div(class = "stat-label", "Baseline \u2113\u0304"),
                  tags$div(class = paste("stat-val", if (baseline_failed) "val-fail" else "val-ok"),
                           sprintf("%.4f", ll_baseline))
                )
              ),

              tags$p(style = "font-size:0.85em; color:#555; margin: 8px 0 4px;",
                "The curve below shows H(w) — the Bernoulli entropy as a function of coin weight w.",
                " The solid black line is the feasibility floor \u2212log\u00a02.",
                " The red dashed line is your average log-likelihood \u2113\u0304.",
                " A coin weight exists only when the red line meets or crosses the curve."),

              tags$div(class = "curve-wrap",
                tags$script(src = "https://cdnjs.cloudflare.com/ajax/libs/Chart.js/4.4.1/chart.umd.js"),
                tags$canvas(id = "entropy-curve",
                  `aria-label` = "Entropy curve showing feasibility floor and current log-likelihood")
              ),

              # Inline script — polls until Chart.js ready then draws
              tags$script(HTML(sprintf("
(function() {
  var LOG2 = Math.log(0.5);
  var ll_enh  = %f;
  var ll_base = %f;
  var failed_enh  = %s;
  var failed_base = %s;

  function entropy(w) {
    return w * Math.log(w) + (1 - w) * Math.log(1 - w);
  }
  function coinWeight(ll) {
    if (ll < LOG2) return null;
    var f = function(w) { return entropy(w) - ll; };
    var lo = 0.5, hi = 1 - 1e-9;
    for (var i = 0; i < 60; i++) {
      var mid = (lo + hi) / 2;
      if (f(mid) < 0) hi = mid; else lo = mid;
    }
    return (lo + hi) / 2;
  }

  function drawChart() {
    var canvas = document.getElementById('entropy-curve');
    if (!canvas || typeof Chart === 'undefined') {
      setTimeout(drawChart, 50);
      return;
    }

    var wPts = [], hPts = [];
    for (var i = 0; i <= 200; i++) {
      var w = 0.5 + (i / 200) * 0.499;
      wPts.push(w.toFixed(3));
      hPts.push(entropy(w));
    }

    var yMin = Math.min(-2.6, ll_enh - 0.3, ll_base - 0.1);

    var datasets = [
      { label: 'Floor: \u2212log 2',
        data: wPts.map(function() { return LOG2; }),
        borderColor: '#111', borderWidth: 2, borderDash: [],
        pointRadius: 0, fill: false },
      { label: 'H(w) \u2014 Bernoulli entropy curve', data: hPts,
        borderColor: '#185fa5', borderWidth: 2,
        pointRadius: 0, tension: 0.3, fill: false },
      { label: 'Your \u2113\u0304 (avg log-likelihood)',
        data: wPts.map(function() { return ll_enh; }),
        borderColor: '#a32d2d',
        borderWidth: 2.5, borderDash: [6,4], pointRadius: 0, fill: false }
    ];

    var w_enh = coinWeight(ll_enh);
    if (w_enh !== null) {
      datasets.push({
        label: '\u03c9', data: [{ x: w_enh.toFixed(3), y: entropy(w_enh) }],
        borderColor: 'transparent', backgroundColor: '#2a7d2e',
        pointRadius: 7, showLine: false
      });
    }

    new Chart(canvas.getContext('2d'), {
      type: 'line',
      data: { labels: wPts, datasets: datasets },
      options: {
        responsive: true, maintainAspectRatio: false, animation: false,
        plugins: { legend: { display: false }, tooltip: { enabled: false } },
        scales: {
          x: { title: { display: true, text: 'Coin weight w', font: { size: 11 } },
               ticks: { maxTicksLimit: 6,
                 callback: function(v,i) { return i%%40===0 ? parseFloat(wPts[i]).toFixed(2) : ''; } },
               grid: { color: 'rgba(0,0,0,0.06)' } },
          y: { title: { display: true, text: 'H(w)', font: { size: 11 } },
               min: yMin, max: 0.05,
               ticks: { callback: function(v) { return v.toFixed(1); } },
               grid: { color: 'rgba(0,0,0,0.06)' } }
        }
      }
    });
  }

  drawChart();
})();
              ", ll_enhanced, ll_baseline,
                 tolower(as.character(enhanced_failed)),
                 tolower(as.character(baseline_failed))
              )))
            ) # explainer-inner
        )    # error-box
      ))     # tagList
    }        # if failed

    imv_val <- res$imv
    imv_class <- if (imv_val > 0.005) "imv-positive" else
                 if (imv_val < -0.005) "imv-negative" else "imv-zero"
    imv_interp <- if (imv_val > 0.005)
      "Your guesses improve on the baseline — you captured something real about the two coins."
    else if (imv_val < -0.005)
      "Your guesses perform worse than just using the overall average. The baseline beats you here."
    else
      "Your guesses and the baseline perform about the same — not much signal picked up."

    tagList(
      tags$div(class = "result-box",
        tags$h4("📊 Results"),

        tags$p(tags$strong("Baseline model:"),
          sprintf("treats both coins the same (p = %.3f, implied coin weight = %.4f)",
                  p_base, res$w0)),

        tags$p(tags$strong("Your model:"),
          sprintf("Coin 1 = %.2f, Coin 2 = %.2f  →  implied coin weight = %.4f",
                  input$g1, input$g2, res$w1)),

        tags$hr(class = "thin"),

        tags$p(
          tags$strong("IMV = "),
          tags$span(class = imv_class,
                    sprintf("%.4f", imv_val))
        ),
        tags$p(style = "color:#444; font-style:italic;", imv_interp),

        tags$hr(class = "thin"),

        tags$div(class = "reveal-box",
          tags$strong("🔍 True coin weights:"),
          tags$br(),
          sprintf("Coin 1: p₁ = %.3f  (you guessed %.2f, sample proportion = %.2f)",
                  game$p1, input$g1, mean(game$flips1)),
          tags$br(),
          sprintf("Coin 2: p₂ = %.3f  (you guessed %.2f, sample proportion = %.2f)",
                  game$p2, input$g2, mean(game$flips2))
        )
      )
    )
  })
}

shinyApp(ui, server)

The IMV Structure

Every IMV calculation involves exactly two components: a baseline model and an enhanced model. The baseline is the simpler prediction — often just the outcome prevalence, or a model without a key predictor. The enhanced model is the one being evaluated. Both are translated into equivalent weighted coins, and the IMV is the proportional gain from betting with the enhanced coin when the baseline coin sets the odds.

Baseline Prediction Enhanced Prediction w₀ (baseline coin) w₁ (enhanced coin) Single-Blind Bet IMV (w₁−w₀)/w₀ Observed data Analogous physical systems Scale-invariant

The IMV is then \((w_1 - w_0) / w_0\) — the proportional gain. A positive value means the enhanced model’s coin is heavier (more informative); a negative value means the enhanced model is actually worse than the baseline, typically a sign of overfitting.

Why not R² or AUC?

The portability of the IMV comes from the fact that it is always measuring the same thing: reduction in outcome uncertainty. When the outcome prevalence is near 50%, there is substantial uncertainty in the outcome and a good model can yield a large IMV. When prevalence is extreme — near 0% or 100% — there is almost no uncertainty left to reduce, so even a perfect model yields a small IMV. This is the right behavior: the IMV is small when there is little to predict.

Standard metrics like R² and AUC do not behave this way, and their deviations from this pattern can be misleading. To see the difference, consider the logistic regression model:

\[\Pr(y=1) = \frac{1}{1 + \exp(-(b_0 + b_1 x))}\]

Here \(x\) is a predictor, \(b_1\) controls the strength of the relationship (the signal), and \(b_0\) is the intercept. Changing \(b_0\) shifts the outcome prevalence without changing how informative \(x\) is — the signal is the same, only the baseline uncertainty changes.

The simulation below fixes \(b_1\) and sweeps \(b_0\) from −3 to 3, computing IMV, McFadden’s R², and AUC at each point. The baseline model is the prevalence alone; the enhanced model is the fitted logistic regression. Each metric is averaged over 20 replications (N = 2,000 each) per \(b_0\) value. The x-axis shows both \(b_0\) and the implied prevalence.

#| '!! shinylive warning !!': |
#|   shinylive does not work in self-contained HTML documents.
#|   Please set `embed-resources: false` in your metadata.
#| standalone: true
#| viewerHeight: 480

library(shiny)

sigmoid <- function(x) 1 / (1 + exp(-x))

get_coin_weight <- function(avg_ll_geom, sigma = 1e-6) {
  target <- log(avg_ll_geom)
  tol <- 1e-9
  if (target < log(0.5) - tol) return(NA)
  if (target <= log(0.5) + tol) return(0.5)
  f <- function(w) w * log(w) + (1 - w) * log(1 - w) - target
  tryCatch(uniroot(f, lower = 0.5 + sigma, upper = 1 - sigma)$root,
           error = function(e) NA)
}

avg_ll_geom <- function(y, p, sigma = 1e-4) {
  p <- pmin(pmax(p, sigma), 1 - sigma)
  exp(mean(y * log(p) + (1 - y) * log(1 - p)))
}

compute_one <- function(b0, b1, N = 2000, n_rep = 20) {
  imv_v <- r2_v <- auc_v <- numeric(n_rep)
  for (i in seq_len(n_rep)) {
    x   <- rnorm(N);  p_true <- sigmoid(b0 + b1 * x)
    y   <- rbinom(N, 1, p_true)
    x2  <- rnorm(N);  p2_true <- sigmoid(b0 + b1 * x2)
    y2  <- rbinom(N, 1, p2_true)
    fit <- glm(y ~ x, family = binomial)
    p_hat <- sigmoid(coef(fit)[1] + coef(fit)[2] * x2)
    prev  <- mean(y)
    # IMV
    w0 <- get_coin_weight(avg_ll_geom(y2, rep(prev, N)))
    w1 <- get_coin_weight(avg_ll_geom(y2, p_hat))
    imv_v[i] <- if (!is.na(w0) && !is.na(w1) && w0 > 0) (w1 - w0) / w0 else NA
    # McFadden R2
    ll_null <- sum(dbinom(y2, 1, prev, log = TRUE))
    ll_full <- sum(dbinom(y2, 1, p_hat, log = TRUE))
    r2_v[i] <- 1 - ll_full / ll_null
    # AUC
    pos <- p_hat[y2 == 1]; neg <- p_hat[y2 == 0]
    auc_v[i] <- mean(outer(pos, neg, ">")) + 0.5 * mean(outer(pos, neg, "=="))
  }
  c(imv = mean(imv_v, na.rm = TRUE),
    r2  = mean(r2_v,  na.rm = TRUE),
    auc = mean(auc_v, na.rm = TRUE),
    prev = sigmoid(b0))   # prevalence at x=0 (mean of x dist)
}

ui <- fluidPage(
  tags$head(tags$style(HTML("
    body { font-family: Georgia, serif; background: #fafaf8; }
    .well { background: #f0ede6; border: none; border-radius: 8px; }
  "))),
  sidebarLayout(
    sidebarPanel(width = 3,
      sliderInput("b1", HTML("Slope b<sub>1</sub><br><small>(signal strength — held fixed)</small>"),
                  min = 0.3, max = 2.0, value = 1.0, step = 0.1),
      tags$hr(),
      actionButton("run", "▶  Run simulation", class = "btn-primary btn-block"),
      tags$p(style = "font-size:0.8em; color:#888; margin-top:8px;",
        "Sweeps b₀ from −3 to 3 in steps of 0.25.",
        " Each point averages 20 replications (N = 2,000).",
        " Takes ~20 seconds.")
    ),
    mainPanel(width = 9,
      plotOutput("sweep_plot", height = "340px")
    )
  )
)

server <- function(input, output, session) {

  sweep <- reactiveVal(NULL)

  observeEvent(input$run, {
    b1 <- input$b1
    b0_seq <- seq(-3, 3, by = 0.25)
    withProgress(message = "Sweeping b₀...", value = 0, {
      res <- t(sapply(seq_along(b0_seq), function(i) {
        incProgress(1/length(b0_seq),
                    detail = sprintf("b₀ = %.2f (prevalence ≈ %.0f%%)",
                                     b0_seq[i], sigmoid(b0_seq[i]) * 100))
        compute_one(b0_seq[i], b1)
      }))
      sweep(data.frame(b0 = b0_seq, res))
    })
  })

  output$sweep_plot <- renderPlot({
    req(sweep())
    d <- sweep()

    # x-axis labels: b0 on top, prevalence on bottom
    b0_ticks  <- seq(-3, 3, by = 1)
    prev_ticks <- round(sigmoid(b0_ticks) * 100)

    par(mfrow = c(1, 3), mar = c(5, 4, 6, 1), family = "serif")

    plot_panel <- function(y, ylab, col, main) {
      plot(d$b0, y, type = "l", lwd = 2.5, col = col,
           xlab = "", ylab = ylab, main = main,
           xaxt = "n", cex.main = 1.05, cex.lab = 0.95,
           panel.first = abline(h = mean(y, na.rm=TRUE),
                                lty = 3, col = "gray70"))
      abline(h = 0, col = "gray80")
      # bottom axis: b0 values
      axis(1, at = b0_ticks, labels = b0_ticks, cex.axis = 0.85)
      mtext(expression(b[0]), side = 1, line = 2.2, cex = 0.85)
      # top axis: prevalence
      axis(3, at = b0_ticks,
           labels = paste0(prev_ticks, "%"), cex.axis = 0.8, col.axis = "gray40")
      mtext("Prevalence", side = 3, line = 3.5, cex = 0.8, col = "gray40")
    }

    plot_panel(d$imv, "IMV",          "#2a7d2e", paste0("IMV  (b₁ = ", input$b1, ")"))
    plot_panel(d$r2,  "McFadden R²",  "#b33a1e", paste0("McFadden R²  (b₁ = ", input$b1, ")"))
    plot_panel(d$auc, "AUC",          "#185fa5", paste0("AUC  (b₁ = ", input$b1, ")"))

  }, res = 96)
}

shinyApp(ui, server)

All three metrics vary with prevalence — but for different reasons and with different implications:

  • IMV peaks near 50% prevalence and tapers toward zero at the extremes. This reflects the structure of outcome uncertainty: when the outcome is nearly certain (prevalence near 0% or 100%), there is very little uncertainty to reduce and even an excellent model earns a small IMV. This is the desired behavior — the IMV is measuring something real.
  • McFadden R² also peaks near 50% but collapses much more sharply at extreme prevalences, understating model quality in a way that is hard to interpret. At very low or high prevalence it can suggest a model is nearly worthless even when it is predicting as well as the data allow.
  • AUC is more stable across prevalences but can remain high at extreme prevalences for a misleading reason: when nearly all outcomes are the same value, ranking predictions correctly is easy even for an uninformative model.

The dotted horizontal line in each panel marks the metric’s mean across the sweep. The key comparison is not the shape of any single curve but what happens when you try to compare two datasets with different prevalences: an IMV of 0.05 means the same thing regardless of the dataset’s prevalence; an R² of 0.05 does not.


The interactive examples below are organized into two groups. If you are new to the IMV, we recommend working through the binary prediction examples first.

Binary Prediction

These three examples all use logistic regression and are designed to build intuition progressively.

Note1. Logistic regression and the intercept

The simplest starting point. Shows how the IMV varies as a function of the slope parameter b₁ and the intercept b₀ in a standard logistic regression. Demonstrates the key point that the IMV is near zero when b₁ ≈ 0 (the predictor carries no information), and illustrates how the intercept — which controls baseline prevalence — modulates the IMV even when b₁ is held fixed.

Note2. Logistic regression and the Oracle

Introduces the Oracle IMV: a diagnostic available only in simulation, where the true generating probabilities p are known. The Oracle IMV measures how far estimated predictions p̂ are from the truth — and shows that this gap shrinks to zero as sample size increases (consistency). A useful tool for understanding estimation quality and the role of sample size.

Note3. Logistic regression and the Overfit

Introduces the Overfit IMV: what happens when you evaluate model fit on the same data used to estimate the model rather than on held-out data. A correctly specified model (y ~ x) is compared to an overspecified one (y ~ x + x²) under both in-sample and out-of-sample evaluation. The in-sample IMV is positive (the extra term appears to help); the out-of-sample IMV is negative (it actually hurts). The gap between the two narrows as sample size grows.

IRT Examples

These examples apply the IMV to item response theory models, where the quantity of interest is how well an IRT model predicts item responses relative to a simpler baseline.

Note4. 2PL versus 3PL predictions

Compares the predictive performance of a 2PL model (no guessing parameter) against a 3PL model (with guessing) when data are generated from the 3PL. Examines how the IMV varies as a function of both the guessing parameter c and sample size. Key finding: even when guessing is present, the 3PL’s predictive advantage over the 2PL is small — and can be negative in small samples due to the difficulty of estimating c.

Note5. The collapse of the thresholded IMV

Extends the IMV to polytomous (multi-category) items via the thresholded IMV (ω_t). Demonstrates the elegant theoretical property that, as the threshold parameters of a graded response model (GRM) converge, the thresholded IMV collapses to the binary IMV based on the corresponding dichotomization. This makes the IMV comparable across items with different numbers of response categories.


Computational Examples

These examples show how to use the imv R package to compute IMV values in real-life data analysis scenarios. The package is available on CRAN and can be installed with install.packages("imv"). Each example uses a publicly available dataset and walks through model fitting, IMV computation, and interpretation of results.

Note6. Logistic regression (glm)

Uses the imv package with glm() to compare a baseline and enhanced logistic regression model. Demonstrates the standard two-model workflow and shows how to read and interpret the cross-validated IMV output.

Note7. Mixed-effects logistic regression (glmer)

Applies the imv package to glmer() models from lme4, illustrating how IMV handles clustered data where observations are nested within groups.

Note8. Unidimensional IRT

Uses the imv package with mirt to compare IRT models fit to dichotomous item response data. Shows how IMV quantifies the predictive gain from adding item discrimination parameters (1PL vs. 2PL).

Note9. Multidimensional IRT

Extends the IRT example to multidimensional models, comparing a unidimensional and multidimensional mirt solution to assess whether a second dimension provides meaningful predictive improvement.


How to Interpret IMV Values

Because the IMV is defined relative to a baseline, its absolute magnitude depends on the comparison being made. Some reference points from published work:

  • IMV ≈ 0: The enhanced model offers no predictive improvement over the baseline. This can indicate either that the extra predictors are uninformative or, in small samples, that the model is overfitting.
  • IMV > 0, small (e.g., 0.001–0.01): Modest but potentially meaningful improvement. Typical range for comparing closely related IRT models (e.g., 2PL vs. 3PL, or 1PL vs. 2PL on well-behaved items).
  • IMV > 0, moderate (e.g., 0.01–0.05): Substantively meaningful improvement. Typical range when a new predictor explains a real portion of variance.
  • IMV < 0: The enhanced model performs worse than the baseline on new data. A diagnostic flag for overfitting or model misspecification.

These benchmarks are discussed in more depth, with simulation-based reference distributions, in the Psychometrika paper below.


References

The InterModel Vigorish (IMV) as a flexible and portable approach for quantifying predictive accuracy with binary outcomes
Domingue BW, Rahal C, Faul J, Freese J, Kanopka K, Rigos A, … & Tripathi AS · 2025
PLoS ONE, 20(3), e0316491 Open Access
Understanding the "fit" of models designed to predict binary outcomes has been a long-standing problem across the social sciences. We propose a flexible, portable, and intuitive metric for quantifying the change in accuracy between two predictive systems in the case of a binary outcome: the InterModel Vigorish (IMV). The IMV is based on an analogy to weighted coins, well-characterized physical systems with tractable probabilities. The IMV is always a statement about the change in fit relative to some baseline model—which can be as simple as the prevalence—whereas other metrics are stand-alone measures that need to be further manipulated to yield indices related to differences in fit across models. Moreover, the IMV is consistently interpretable independent of baseline prevalence. We contrast this metric with alternatives in numerous simulations. The IMV is more sensitive to estimation error than many alternatives and also shows distinctive sensitivity to prevalence. We consider its performance using examples spanning the social and natural sciences. The IMV allows for precise answers to questions about changes in model fit in a variety of settings in a manner that will be useful for furthering research and the understanding of social outcomes.
Journal →
The InterModel Vigorish as a lens for understanding (and quantifying) the value of item response models for dichotomously coded items
Domingue BW, Kanopka K, Kapoor R, Pohl S, Chalmers RP, Rahal C, & Rhemtulla M · 2024
Psychometrika, 89(3), 1034–1054
The deployment of statistical models, such as those used in item response theory (IRT), necessitates the use of indices that are informative about the degree to which a given model is appropriate for a specific data context. We introduce the InterModel Vigorish (IMV) as an index that can be used to quantify accuracy for models of dichotomous item responses based on the improvement across two sets of predictions. This index has a range of desirable features: it can be used for the comparison of non-nested models and its values are highly portable and generalizable. We use this fact to compare predictive performance across a variety of simulated data contexts and also demonstrate qualitative differences in behavior between the IMV and other common indices (e.g., the AIC and RMSEA). We also illustrate the utility of the IMV in empirical applications with data from 89 dichotomous item response datasets, helping illustrate how the IMV can be used in practice and substantiating claims regarding various aspects of model performance.
Journal →
Implied probabilities of polytomous response functions for model-based prediction and comparison
Domingue BW, Kanopka K, Ulitzsch E, & Zhang L · 2025
Behaviormetrika, 52(2), 683–705
Polytomous response models are typically motivated by assumptions about specific dichotomizations of the responses. We build on this concept by considering all possible dichotomizations. We first enumerate the full range of possible dichotomizations for an outcome in a given number of categories. We then show that many of these dichotomizations lead to "implied probabilities"—the probability associated with a specific dichotomization that can be computed directly from the category response function associated with a given model—that have dramatically different forms when computed for different models. The differences can be used to evaluate model fit.
Journal →
Evaluating model predictive performance in confirmatory factor analysis with binary outcomes using the InterModel Vigorish
Zhang L, Rahal C, Kanopka K, Ulitzsch E, Zhang Z, & Domingue BW · 2026
Multivariate Behavioral Research
Confirmatory Factor Analysis (CFA) has been widely used to assess the fit of theoretical measurement models to observed data. We introduce the InterModel Vigorish (IMV) to the field; a predictive fit index that offers novel perspectives for model comparison. The IMV complements traditional fit indices by offering additional information to support model evaluation, with a particular emphasis on a model's generalizability to the hold-out data. It also yields an interpretable and intuitive metric that facilitates meaningful comparisons. We extend it into the CFA framework with binary outcomes and conduct four simulation studies to evaluate its effectiveness. The simulation results suggest that IMV effectively gauges model misspecification, offering insights both at the scale and item levels. As designed, it is insensitive to changes in sample size. By focusing on predictive accuracy, the IMV discourages overfitting. It also enables item-level comparisons, offering richer diagnostic information. To facilitate the practical application of IMV, we offer an empirical example that demonstrates its efficacy in applied research. The paper is accompanied by an R package to further advance the use of the IMV in the CFA space.
Journal →

Contact & Computing Resources

Additional code and computing resources are available on GitHub.