How much of a reasoning chain can you trust before it breaks?
Matt Y. Cheung, Ashok Veeraraghavan, Hanjie Chen, Guha Balakrishnan
May 28, 2026
Language models often produce reasoning chains that are only partly correct: valid intermediate work followed by a critical error. CROP applies conformal prediction to certify the longest contiguous prefix of a reasoning trace that stays below a risk threshold, and automatically routes the unreliable remainder to human review or repair. Tested on six reasoning datasets, it outperforms standard uncertainty metrics because it measures what matters in this setting: how much valid reasoning can be preserved before hallucination sets in.
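To make the idea of a certified prefix concrete, here is a minimal sketch of one way a split-conformal threshold could be calibrated and then used to cut a trace. It is not the paper's algorithm. The per-step risk scores, the first-error labels, and the function names (`first_error_max_score`, `calibrate_threshold`, `certified_prefix_length`) are all assumptions made for illustration. The sketch assumes calibration and test traces are exchangeable, in which case the chance that a kept prefix contains an error should be at most `alpha`.

```python
import math
from typing import List, Optional, Sequence


def first_error_max_score(scores: Sequence[float], first_error: Optional[int]) -> float:
    """Largest risk score up to and including the first wrong step.

    A prefix cut at threshold tau swallows the first error exactly when this
    value is below tau. Traces with no error can never fail, so return +inf.
    """
    if first_error is None:
        return math.inf
    return max(scores[: first_error + 1])


def calibrate_threshold(
    cal_scores: List[Sequence[float]],
    cal_first_errors: List[Optional[int]],
    alpha: float,
) -> float:
    """Pick tau as the k-th smallest calibration value, k = floor(alpha * (n + 1))."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    values = sorted(
        first_error_max_score(s, e) for s, e in zip(cal_scores, cal_first_errors)
    )
    k = math.floor(alpha * (len(values) + 1))
    if k == 0:
        # Too few calibration traces for this alpha: certify nothing.
        return -math.inf
    return values[k - 1]


def certified_prefix_length(scores: Sequence[float], tau: float) -> int:
    """Count leading steps whose risk score is strictly below tau."""
    length = 0
    for s in scores:
        if s >= tau:
            break
        length += 1
    return length


# Hypothetical calibration data: per-step risk scores (higher = more suspect)
# and the index of the first incorrect step (None if the trace is fully correct).
cal_scores = [
    [0.1, 0.2, 0.9],
    [0.1, 0.6, 0.3],
    [0.2, 0.1, 0.1],
    [0.05, 0.3, 0.8, 0.4],
]
cal_first_errors = [2, 1, None, 2]

tau = calibrate_threshold(cal_scores, cal_first_errors, alpha=0.5)
keep = certified_prefix_length([0.1, 0.3, 0.85, 0.2], tau)
print(tau, keep)  # steps [0, keep) are kept; the rest would go to review
```

The strict comparison in `certified_prefix_length` is a deliberate choice in this sketch: it keeps the bound valid when scores tie with the threshold, at the cost of slightly shorter prefixes.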