The InterModel Vigorish (IMV)
What is the IMV?
When two models both predict the same binary outcome, how much better is one than the other — and does that difference actually matter?
Standard metrics like R², AUC, and the F₁ score can answer versions of this question, but they have a shared limitation: their values depend on the baseline difficulty of the prediction problem. An improvement of 0.03 in AUC means something very different when predicting a rare event (prevalence = 2%) versus a common one (prevalence = 50%). This makes it hard to compare model improvements across different datasets or outcomes.
The InterModel Vigorish (IMV) is designed to fix this. It is a metric for quantifying the change in predictive accuracy between two models — a baseline and an enhanced prediction — in a way that is portable (comparable across outcomes with different prevalences) and intuitive (grounded in a concrete physical analogy).
The Weighted Coin Analogy
The IMV is built on an analogy to weighted coins.
Any predictive system that assigns probabilities to binary outcomes can be mapped to an equivalent weighted coin — a physical object whose bias exactly matches the average uncertainty in those predictions. When you have two such systems, you can ask: by how much does the enhanced model’s coin outperform the baseline model’s coin in a single-blind bet?
More formally, the IMV is the expected proportional winnings from betting according to the enhanced model’s probabilities when the baseline model sets the odds. A positive IMV means the enhanced model provides genuine predictive value beyond the baseline; a negative IMV signals overfitting or model misspecification.
This framing has a key consequence: because the baseline model defines the bet, the IMV is always a statement about relative improvement, not absolute accuracy. The same absolute improvement in log-likelihood will yield a larger IMV when the baseline outcome is highly uncertain (prevalence near 0.5) than when it is already predictable — which is exactly the right behavior.
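To make the mapping concrete, here is a minimal sketch of the computation in Python (the interactive app below implements the same logic in R; the function names and the 7-heads-in-10 toy data are ours, purely for illustration). Each model's mean log-likelihood ℓ̄ is matched to the coin weight w solving H(w) = w log w + (1−w) log(1−w) = ℓ̄ on (0.5, 1), and the IMV is the proportional change in weight.

```python
import math
from statistics import mean

def neg_entropy(w):
    # H(w) = w*log(w) + (1 - w)*log(1 - w): increasing on (0.5, 1),
    # with its floor H(0.5) = log(0.5) ≈ -0.693 at the fair coin
    return w * math.log(w) + (1 - w) * math.log(1 - w)

def coin_weight(ll_bar):
    # Invert H on (0.5, 1): find the weight whose H equals the mean
    # log-likelihood ll_bar; below the floor, no coin exists
    if ll_bar < math.log(0.5) - 1e-9:
        raise ValueError("predictions are worse than a fair coin")
    lo, hi = 0.5, 1.0 - 1e-12
    for _ in range(100):        # bisection; H is increasing on this interval
        mid = (lo + hi) / 2
        if neg_entropy(mid) < ll_bar:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def imv(y, p_base, p_enh):
    # Mean log-likelihood of each prediction vector, then the
    # proportional gain in equivalent coin weight
    ll = lambda p: mean(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                        for yi, pi in zip(y, p))
    w0, w1 = coin_weight(ll(p_base)), coin_weight(ll(p_enh))
    return (w1 - w0) / w0

# Toy data: 10 flips, 7 heads. The baseline predicts the pooled mean (0.7)
# for every flip; the "enhanced" predictions are sharper.
y = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
print(imv(y, [0.7] * 10, [0.8] * 7 + [0.2] * 3))
```

On this toy data the baseline's coin weight is exactly 0.7 (its constant prediction equals the sample mean), the enhanced weight is about 0.94, and the IMV comes out near 0.34: betting with the sharper predictions against odds set by the baseline yields roughly a 34% expected gain.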
Try it yourself
Two weighted coins are flipped 20 times each. You see the outcomes but not the true weights. Enter your best guesses for each coin’s probability of heads — these define your enhanced model. The baseline model uses only the overall average (treating both coins identically). The IMV then measures how much your guesses improve on that baseline.
#| standalone: true
#| viewerHeight: 680
library(shiny)
# ── IMV helpers ──────────────────────────────────────────────────────────────
get_coin_weight <- function(avg_ll, sigma = 1e-6) {
  target <- log(avg_ll)
  # The entropy H(w) = w*log(w) + (1-w)*log(1-w) has a minimum of log(0.5) at w = 0.5.
  # If target is below this floor (with small tolerance for floating point),
  # there is no solution — predictions are worse than a fair coin.
  tol <- 1e-9
  if (target < log(0.5) - tol) stop("implausible")
  # Boundary case: target is at or very near the floor — coin weight is 0.5
  if (target <= log(0.5) + tol) return(0.5)
  f <- function(w) w * log(w) + (1 - w) * log(1 - w) - target
  uniroot(f, lower = 0.5 + sigma, upper = 1 - sigma)$root
}

avg_ll <- function(y, p, sigma = 1e-4) {
  p <- pmin(pmax(p, sigma), 1 - sigma)
  exp(mean(y * log(p) + (1 - y) * log(1 - p)))
}

imv_coins <- function(y, p_baseline, p_enhanced) {
  ll0 <- avg_ll(y, p_baseline)
  ll1 <- avg_ll(y, p_enhanced)
  w0 <- tryCatch(get_coin_weight(ll0), error = function(e) NA)
  w1 <- tryCatch(get_coin_weight(ll1), error = function(e) NA)
  list(w0 = w0, w1 = w1, imv = if (!is.na(w0) && !is.na(w1)) (w1 - w0) / w0 else NA)
}
# ── UI ───────────────────────────────────────────────────────────────────────
ui <- fluidPage(
  tags$head(tags$style(HTML("
    body { font-family: 'Georgia', serif; background: #fafaf8; }
    .well { background: #f0ede6; border: none; border-radius: 8px; }
    .flip-row { font-size: 1.3em; letter-spacing: 3px; margin: 4px 0; }
    .coin-label { font-weight: bold; font-size: 1.05em; margin-top: 10px; }
    .result-box { background: #fff; border: 1px solid #ddd; border-radius: 8px;
                  padding: 16px 20px; margin-top: 12px; }
    .result-box h4 { margin-top: 0; }
    .imv-positive { color: #2a7d2e; font-weight: bold; }
    .imv-negative { color: #b33a1e; font-weight: bold; }
    .imv-zero { color: #666; font-weight: bold; }
    .reveal-box { background: #f5f0e8; border-left: 4px solid #8b6914;
                  padding: 12px 16px; border-radius: 4px; margin-top: 10px; }
    .error-box { background: #fff5f5; border: 1px solid #f5c6cb; border-radius: 8px;
                 padding: 16px 20px; margin-top: 12px; }
    .error-box h4 { margin-top: 0; color: #b33a1e; }
    hr.thin { border-top: 1px solid #ddd; margin: 14px 0; }
    details { margin-top: 14px; }
    summary { cursor: pointer; font-size: 0.9em; color: #666;
              padding: 6px 0; user-select: none; }
    summary:hover { color: #333; }
    details[open] summary { margin-bottom: 10px; }
    .explainer-inner { background: #fafafa; border: 1px solid #e8e8e8;
                       border-radius: 6px; padding: 14px 16px; font-size: 0.88em; }
    .stat-grid { display: grid; grid-template-columns: 1fr 1fr; gap: 8px; margin: 10px 0; }
    .stat-card { background: #f0ede6; border-radius: 6px; padding: 8px 10px; }
    .stat-label { font-size: 0.78em; color: #666; margin-bottom: 2px; }
    .stat-val { font-size: 1.1em; font-weight: bold; }
    .val-ok { color: #2a7d2e; }
    .val-fail { color: #b33a1e; }
    .curve-wrap { position: relative; width: 100%; height: 200px; margin: 10px 0 4px; }
  "))),
  titlePanel(NULL),
  sidebarLayout(
    sidebarPanel(width = 4,
      actionButton("new_game", "🎲 New coins", class = "btn-primary btn-block",
                   style = "margin-bottom:14px;"),
      tags$div(class = "coin-label", "Coin 1 outcomes:"),
      uiOutput("flips1_ui"),
      tags$div(class = "coin-label", style = "margin-top:10px;", "Coin 2 outcomes:"),
      uiOutput("flips2_ui"),
      tags$hr(class = "thin"),
      tags$p("Enter your probability guesses:"),
      sliderInput("g1", "Your guess for Coin 1 (p₁):",
                  min = 0.01, max = 0.99, value = 0.5, step = 0.01),
      sliderInput("g2", "Your guess for Coin 2 (p₂):",
                  min = 0.01, max = 0.99, value = 0.5, step = 0.01),
      actionButton("submit", "⚖️ Compute IMV", class = "btn-success btn-block",
                   style = "margin-top:6px;")
    ),
    mainPanel(width = 8,
      uiOutput("results_ui")
    )
  )
)
# ── Server ───────────────────────────────────────────────────────────────────
server <- function(input, output, session) {
  # Reactive game state
  game <- reactiveValues(
    p1 = NULL, p2 = NULL,
    flips1 = NULL, flips2 = NULL,
    submitted = FALSE
  )

  # Generate new coins on button press (and on startup)
  observeEvent(input$new_game, {
    game$p1 <- runif(1, 0.3, 0.85)
    game$p2 <- runif(1, 0.3, 0.85)
    game$flips1 <- rbinom(20, 1, game$p1)
    game$flips2 <- rbinom(20, 1, game$p2)
    game$submitted <- FALSE
    updateSliderInput(session, "g1", value = 0.5)
    updateSliderInput(session, "g2", value = 0.5)
  }, ignoreNULL = FALSE)  # run on startup too

  # Show flip sequences
  fmt_flips <- function(flips) {
    paste(ifelse(flips == 1, "H", "T"), collapse = " ")
  }
  output$flips1_ui <- renderUI({
    req(game$flips1)
    heads <- sum(game$flips1)
    tags$div(
      tags$div(class = "flip-row", fmt_flips(game$flips1)),
      tags$small(style = "color:#555;",
                 sprintf("(%d heads, %d tails out of 20)", heads, 20 - heads))
    )
  })
  output$flips2_ui <- renderUI({
    req(game$flips2)
    heads <- sum(game$flips2)
    tags$div(
      tags$div(class = "flip-row", fmt_flips(game$flips2)),
      tags$small(style = "color:#555;",
                 sprintf("(%d heads, %d tails out of 20)", heads, 20 - heads))
    )
  })

  # On submit: compute and display results
  observeEvent(input$submit, {
    game$submitted <- TRUE
  })
  output$results_ui <- renderUI({
    if (!game$submitted) {
      return(tags$div(
        style = "color:#888; margin-top:30px; font-style:italic;",
        "Adjust your probability guesses using the sliders, then click",
        tags$strong("Compute IMV"), "to see your results."
      ))
    }
    req(game$flips1, game$flips2)

    # Build outcome and prediction vectors over both coins combined
    y_all <- c(game$flips1, game$flips2)
    coin_id <- c(rep(1, 20), rep(2, 20))
    # Baseline: same probability for every flip = overall mean
    p_base <- mean(y_all)
    p_baseline_vec <- rep(p_base, 40)
    # Enhanced: user's guesses per coin
    p_enhanced_vec <- ifelse(coin_id == 1, input$g1, input$g2)
    res <- imv_coins(y_all, p_baseline_vec, p_enhanced_vec)
    # Check which model(s) failed
    baseline_failed <- is.na(res$w0)
    enhanced_failed <- is.na(res$w1)
    if (baseline_failed || enhanced_failed) {
      # avg_ll() returns exp(mean log-likelihood), so log() recovers the mean log-likelihood
      ll_enhanced <- log(avg_ll(y_all, p_enhanced_vec))
      ll_baseline <- log(avg_ll(y_all, p_baseline_vec))
      gap_enh <- ll_enhanced - log(0.5)
      failed_model <- if (baseline_failed && enhanced_failed) {
        "both the baseline and your guesses are"
      } else if (baseline_failed) {
        "the baseline model is"
      } else {
        "your guesses are"
      }
      # Render the error box; numeric values are interpolated into the
      # inline JS further down via sprintf()
      return(tagList(
        tags$div(class = "error-box",
          tags$h4("\u26a0\ufe0f Predictions too far from the data"),
          tags$p(
            "The IMV cannot be computed because ", failed_model,
            " so inconsistent with the observed flips that they perform",
            " worse than a fair coin (p\u00a0=\u00a00.5). In the weighted-coin",
            " framework, no valid coin weight exists for predictions this poor."
          ),
          tags$p("Try adjusting your guesses to be closer to the observed",
                 " proportion of heads for each coin."),
          tags$div(class = "reveal-box",
            tags$strong("Observed proportions of heads:"),
            tags$br(),
            sprintf("Coin 1: %.2f (your guess: %.2f)",
                    mean(game$flips1), input$g1),
            tags$br(),
            sprintf("Coin 2: %.2f (your guess: %.2f)",
                    mean(game$flips2), input$g2)
          ),
          tags$details(
            tags$summary("Why does this happen? (click to expand)"),
            tags$div(class = "explainer-inner",
              tags$p(
                "The IMV maps every set of predictions to an equivalent weighted coin",
                " via the bijection: find w \u2208 (0.5, 1) such that its Bernoulli entropy",
                " H(w)\u00a0=\u00a0w\u00a0log\u00a0w\u00a0+\u00a0(1\u2212w)\u00a0log(1\u2212w)",
                " equals your average log-likelihood \u2113\u0304.",
                " But H(w) has a floor at H(0.5)\u00a0=\u00a0\u2212log\u00a02\u00a0\u2248\u00a0\u22120.693 —",
                " the entropy of a perfectly fair coin.",
                " If your \u2113\u0304 falls below this floor, the equation has no solution.",
                " Your predictions are so confidently wrong they carry",
                " less information than a fair coin flip."
              ),
              tags$div(class = "stat-grid",
                tags$div(class = "stat-card",
                  tags$div(class = "stat-label", "Your avg log-likelihood \u2113\u0304"),
                  tags$div(class = paste("stat-val", if (enhanced_failed) "val-fail" else "val-ok"),
                           sprintf("%.4f", ll_enhanced))
                ),
                tags$div(class = "stat-card",
                  tags$div(class = "stat-label", "Feasibility floor \u2212log\u00a02"),
                  tags$div(class = "stat-val", "\u22120.6931")
                ),
                tags$div(class = "stat-card",
                  tags$div(class = "stat-label", "Gap to floor"),
                  tags$div(class = paste("stat-val", if (enhanced_failed) "val-fail" else "val-ok"),
                           sprintf("%+.4f", gap_enh))
                ),
                tags$div(class = "stat-card",
                  tags$div(class = "stat-label", "Baseline \u2113\u0304"),
                  tags$div(class = paste("stat-val", if (baseline_failed) "val-fail" else "val-ok"),
                           sprintf("%.4f", ll_baseline))
                )
              ),
              tags$p(style = "font-size:0.85em; color:#555; margin: 8px 0 4px;",
                     "The curve below shows H(w) — the Bernoulli entropy as a function of coin weight w.",
                     " The solid black line is the feasibility floor \u2212log\u00a02.",
                     " The red dashed line is your average log-likelihood \u2113\u0304.",
                     " A coin weight exists only when the red line meets or crosses the curve."),
              tags$div(class = "curve-wrap",
                tags$script(src = "https://cdnjs.cloudflare.com/ajax/libs/Chart.js/4.4.1/chart.umd.js"),
                tags$canvas(id = "entropy-curve",
                            `aria-label` = "Entropy curve showing feasibility floor and current log-likelihood")
              ),
              # Inline script: polls until Chart.js is ready, then draws
              tags$script(HTML(sprintf("
                (function() {
                  var LOG2 = Math.log(0.5);   // the feasibility floor, -log 2
                  var ll_enh = %f;
                  var ll_base = %f;

                  function entropy(w) {
                    return w * Math.log(w) + (1 - w) * Math.log(1 - w);
                  }

                  // entropy() is increasing on (0.5, 1), so when the midpoint is
                  // still below the target log-likelihood we must move lo up
                  function coinWeight(ll) {
                    if (ll < LOG2) return null;
                    var f = function(w) { return entropy(w) - ll; };
                    var lo = 0.5, hi = 1 - 1e-9;
                    for (var i = 0; i < 60; i++) {
                      var mid = (lo + hi) / 2;
                      if (f(mid) < 0) lo = mid; else hi = mid;
                    }
                    return (lo + hi) / 2;
                  }

                  function drawChart() {
                    var canvas = document.getElementById('entropy-curve');
                    if (!canvas || typeof Chart === 'undefined') {
                      setTimeout(drawChart, 50);
                      return;
                    }
                    var wPts = [], hPts = [];
                    for (var i = 0; i <= 200; i++) {
                      var w = 0.5 + (i / 200) * 0.499;
                      wPts.push(w.toFixed(3));
                      hPts.push(entropy(w));
                    }
                    var yMin = Math.min(-2.6, ll_enh - 0.3, ll_base - 0.1);
                    var datasets = [
                      { label: 'H(w) \u2014 Bernoulli entropy curve', data: hPts,
                        borderColor: '#185fa5', borderWidth: 2,
                        pointRadius: 0, tension: 0.3, fill: false },
                      { label: 'Floor: \u2212log 2',
                        data: wPts.map(function() { return LOG2; }),
                        borderColor: '#111', borderWidth: 2, borderDash: [],
                        pointRadius: 0, fill: false },
                      { label: 'Your \u2113\u0304 (avg log-likelihood)',
                        data: wPts.map(function() { return ll_enh; }),
                        borderColor: '#a32d2d',
                        borderWidth: 1.5, borderDash: [4,3], pointRadius: 0, fill: false }
                    ];
                    var w_enh = coinWeight(ll_enh);
                    if (w_enh !== null) {
                      // Snap to the nearest grid label so the point lands on the
                      // category axis used by the line chart
                      var idx = Math.round(((w_enh - 0.5) / 0.499) * 200);
                      datasets.push({
                        label: '\u03c9', data: [{ x: wPts[idx], y: hPts[idx] }],
                        borderColor: 'transparent', backgroundColor: '#2a7d2e',
                        pointRadius: 7, showLine: false
                      });
                    }
                    new Chart(canvas.getContext('2d'), {
                      type: 'line',
                      data: { labels: wPts, datasets: datasets },
                      options: {
                        responsive: true, maintainAspectRatio: false, animation: false,
                        plugins: { legend: { display: false }, tooltip: { enabled: false } },
                        scales: {
                          x: { title: { display: true, text: 'Coin weight w', font: { size: 11 } },
                               ticks: { maxTicksLimit: 6,
                                        callback: function(v,i) { return i%%40===0 ? parseFloat(wPts[i]).toFixed(2) : ''; } },
                               grid: { color: 'rgba(0,0,0,0.06)' } },
                          y: { title: { display: true, text: 'H(w)', font: { size: 11 } },
                               min: yMin, max: 0.05,
                               ticks: { callback: function(v) { return v.toFixed(1); } },
                               grid: { color: 'rgba(0,0,0,0.06)' } }
                        }
                      }
                    });
                  }
                  drawChart();
                })();
              ", ll_enhanced, ll_baseline)))
            )  # explainer-inner
          )  # details
        )  # error-box
      ))  # tagList
    }  # if failed

    imv_val <- res$imv
    imv_class <- if (imv_val > 0.005) "imv-positive" else
      if (imv_val < -0.005) "imv-negative" else "imv-zero"
    imv_interp <- if (imv_val > 0.005)
      "Your guesses improve on the baseline — you captured something real about the two coins."
    else if (imv_val < -0.005)
      "Your guesses perform worse than just using the overall average. The baseline beats you here."
    else
      "Your guesses and the baseline perform about the same — not much signal picked up."

    tagList(
      tags$div(class = "result-box",
        tags$h4("📊 Results"),
        tags$p(tags$strong("Baseline model:"),
               sprintf("treats both coins the same (p = %.3f, implied coin weight = %.4f)",
                       p_base, res$w0)),
        tags$p(tags$strong("Your model:"),
               sprintf("Coin 1 = %.2f, Coin 2 = %.2f → implied coin weight = %.4f",
                       input$g1, input$g2, res$w1)),
        tags$hr(class = "thin"),
        tags$p(
          tags$strong("IMV = "),
          tags$span(class = imv_class,
                    sprintf("%.4f", imv_val))
        ),
        tags$p(style = "color:#444; font-style:italic;", imv_interp),
        tags$hr(class = "thin"),
        tags$div(class = "reveal-box",
          tags$strong("🔍 True coin weights:"),
          tags$br(),
          sprintf("Coin 1: p₁ = %.3f (you guessed %.2f, sample proportion = %.2f)",
                  game$p1, input$g1, mean(game$flips1)),
          tags$br(),
          sprintf("Coin 2: p₂ = %.3f (you guessed %.2f, sample proportion = %.2f)",
                  game$p2, input$g2, mean(game$flips2))
        )
      )
    )
  })
}
shinyApp(ui, server)
A Guided Tour of the Examples
The interactive examples below are organized into two groups. If you are new to the IMV, we recommend working through the binary prediction examples first.
Binary Prediction
These three examples all use logistic regression and are designed to build intuition progressively.
IRT Examples
These examples apply the IMV to item response theory models, where the quantity of interest is how well an IRT model predicts item responses relative to a simpler baseline.
How to Interpret IMV Values
Because the IMV is defined relative to a baseline, its absolute magnitude depends on the comparison being made. Some reference points from published work:
- IMV ≈ 0: The enhanced model offers no predictive improvement over the baseline. This can indicate either that the extra predictors are uninformative or, in small samples, that the model is overfitting.
- IMV > 0, small (e.g., 0.001–0.01): Modest but potentially meaningful improvement. Typical range for comparing closely related IRT models (e.g., 2PL vs. 3PL, or 1PL vs. 2PL on well-behaved items).
- IMV > 0, moderate (e.g., 0.01–0.05): Substantively meaningful improvement. Typical range when a new predictor explains a real portion of variance.
- IMV < 0: The enhanced model performs worse than the baseline on new data. A diagnostic flag for overfitting or model misspecification.
These benchmarks are discussed in more depth, with simulation-based reference distributions, in the Psychometrika paper below.
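One way to get a feel for these ranges is a small holdout simulation. The sketch below is in Python, and every design choice in it (the two-group setup, the 0.6/0.9 and 0.75/0.75 prevalences, the sample size, and the seed) is an illustrative assumption rather than a value from the papers; it redefines the entropy-inversion helpers so the snippet stands alone.

```python
import math
import random
from statistics import mean

def neg_entropy(w):
    # H(w) = w*log(w) + (1 - w)*log(1 - w); increasing on (0.5, 1)
    return w * math.log(w) + (1 - w) * math.log(1 - w)

def coin_weight(ll_bar):
    # Invert H on (0.5, 1); no solution below the fair-coin floor log(0.5)
    if ll_bar < math.log(0.5) - 1e-9:
        raise ValueError("predictions are worse than a fair coin")
    lo, hi = 0.5, 1.0 - 1e-12
    for _ in range(100):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if neg_entropy(mid) < ll_bar else (lo, mid)
    return (lo + hi) / 2

def imv(y, p_base, p_enh):
    ll = lambda p: mean(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                        for yi, pi in zip(y, p))
    w0, w1 = coin_weight(ll(p_base)), coin_weight(ll(p_enh))
    return (w1 - w0) / w0

def holdout_imv(prev_a, prev_b, n=4000, seed=7):
    # Train: estimate pooled and per-group proportions; test: fresh draws
    rng = random.Random(seed)
    draw = lambda p: [1 if rng.random() < p else 0 for _ in range(n)]
    train_a, train_b = draw(prev_a), draw(prev_b)
    test_a, test_b = draw(prev_a), draw(prev_b)
    pooled = mean(train_a + train_b)                      # baseline: one shared p
    p_enh = [mean(train_a)] * n + [mean(train_b)] * n     # enhanced: per group
    return imv(test_a + test_b, [pooled] * (2 * n), p_enh)

# Real signal: the groups genuinely differ, so group-specific predictions
# beat the pooled baseline on new data.
print(holdout_imv(0.6, 0.9))     # clearly positive

# No signal: identical groups, so the extra "predictor" is noise and the
# holdout IMV sits near zero (sometimes slightly negative).
print(holdout_imv(0.75, 0.75))
```

Under these settings the first call comes out clearly positive and the second within rounding of zero; a persistently negative value on holdout data is the overfitting flag described above.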
References
Domingue, B. W., Rahal, C., Faul, J., Freese, J., Kanopka, K., Rigos, A., … & Tripathi, A. S. (2025). The InterModel Vigorish (IMV) as a flexible and portable approach for quantifying predictive accuracy with binary outcomes. PLoS ONE, 20(3), e0316491. → Paper
Domingue, B. W., Kanopka, K., Kapoor, R., Pohl, S., Chalmers, R. P., Rahal, C., & Rhemtulla, M. (2024). The InterModel Vigorish as a lens for understanding (and quantifying) the value of item response models for dichotomously coded items. Psychometrika, 89(3), 1034–1054. → Paper
Domingue, B. W., Kanopka, K., Ulitzsch, E., & Zhang, L. (2025). Implied probabilities of polytomous response functions for model-based prediction and comparison. Behaviormetrika, 52(2), 683–705. → Paper
Zhang, L., Rahal, C., Kanopka, K., Ulitzsch, E., Zhang, Z., & Domingue, B. W. (2026). Evaluating model predictive performance in confirmatory factor analysis with binary outcomes using the InterModel Vigorish. Multivariate Behavioral Research. DOI: 10.1080/00273171.2026.2645212. → Paper
Contact & Computing Resources
Additional code and computing resources are available on GitHub.