A benchmark that measures reasoning across multiple cognitive skills
Rohit Patel, Alexandre Rezende, Steven McClain
May 18, 2026
Current LLM benchmarks either inflate scores by testing specialized knowledge (GPQA, HLE) or rely purely on abstract reasoning divorced from practical context (ARC-AGI). GIM introduces 820 original problems whose difficulty comes from integrating multiple cognitive operations (constraint satisfaction, state tracking, epistemic vigilance, audience calibration) while staying grounded in broadly accessible knowledge. Each problem is expert-authored and scored against a decomposed rubric (median 6 criteria). The authors calibrate a 2-parameter logistic IRT model on more than 200k prompt-response pairs across 28 models, producing ability estimates that are robust to missing data and error distortion. A leaderboard covers 22 models in 47 configurations, and a study of test-time compute shows that within-family choices (thinking budget, quantization) matter as much as model selection. The evaluation framework, IRT parameters, and 615 public problems are released.
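To make the IRT step concrete, here is a minimal sketch of a 2-parameter logistic (2PL) item response function and a grid-search ability estimate. It is not the authors' implementation. It assumes binary pass/fail outcomes (the paper's rubric scores may be fractional), already-calibrated item parameters, and hypothetical names (`p_correct`, `estimate_ability`, the item ids). Skipping unattempted items in the likelihood is one common way an IRT estimate can tolerate missing data.

```python
import math

def p_correct(theta, a, b):
    """2PL item response function: probability of success at ability theta.

    a is the item's discrimination, b is its difficulty.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def estimate_ability(responses, items, grid=None):
    """Grid-search maximum-likelihood ability estimate.

    responses: dict item_id -> 0 or 1 (unattempted items are simply absent)
    items:     dict item_id -> (a, b), assumed already calibrated
    """
    if grid is None:
        # Candidate abilities from -4.00 to 4.00 in steps of 0.01.
        grid = [i / 100.0 for i in range(-400, 401)]
    best_theta, best_ll = None, -math.inf
    for theta in grid:
        ll = 0.0
        # Only observed items contribute to the log-likelihood.
        for item_id, y in responses.items():
            a, b = items[item_id]
            p = p_correct(theta, a, b)
            ll += math.log(p) if y else math.log(1.0 - p)
        if ll > best_ll:
            best_theta, best_ll = theta, ll
    return best_theta

# Hypothetical item parameters (discrimination, difficulty), for illustration only.
items = {"q1": (1.0, -1.0), "q2": (1.5, 0.0), "q3": (0.8, 1.0)}
strong = estimate_ability({"q1": 1, "q2": 1, "q3": 0}, items)
weak = estimate_ability({"q1": 1, "q2": 0}, items)  # q3 was never attempted
```

Because each model is scored only on the items it attempted, two models evaluated on different subsets still land on the same ability scale, provided the item parameters are shared.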