How much of a reasoning chain can you trust before it breaks?

Language models often produce reasoning traces that are only partly correct: valid intermediate steps followed by a critical error. CROP applies conformal prediction to certify the longest contiguous prefix of a reasoning trace whose risk stays below a specified threshold, and automatically routes the uncertified remainder for human review or repair. Tested on six reasoning datasets, it outperforms standard uncertainty metrics on the quantity that actually matters: how much valid reasoning can be preserved before the first hallucination.
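
To make the idea of prefix certification concrete, here is a minimal sketch of one generic way to calibrate such a cutoff with split conformal prediction. It is not CROP's actual procedure, which this summary does not spell out. The per-step risk scores, the labeled index of the first erroneous step, and the names `calibrate_threshold` and `certified_prefix_length` are all assumptions made for illustration.

```python
import math
from typing import List, Optional, Sequence, Tuple

# Hypothetical calibration record: per-step risk scores for one trace, plus the
# index of its first erroneous step (None if the whole trace is valid).
Trace = Tuple[Sequence[float], Optional[int]]


def calibrate_threshold(calib: List[Trace], alpha: float) -> float:
    """Pick a score cutoff tau so that, under exchangeability, a certified
    prefix contains an erroneous step with probability at most alpha."""
    critical = []
    for scores, first_err in calib:
        if first_err is None:
            # No error anywhere: no finite cutoff can certify a bad prefix.
            critical.append(math.inf)
        else:
            # A prefix reaches the first error only if the cutoff strictly
            # exceeds every score up to and including that step.
            critical.append(max(scores[: first_err + 1]))
    critical.sort()
    # Conservative rank: the k-th smallest critical score, k = floor(alpha * (n + 1)).
    k = math.floor(alpha * (len(critical) + 1))
    return critical[k - 1] if k >= 1 else -math.inf


def certified_prefix_length(scores: Sequence[float], tau: float) -> int:
    """Length of the longest contiguous prefix whose scores are all strictly below tau."""
    length = 0
    for s in scores:
        if not s < tau:
            break
        length += 1
    return length


# Toy calibration set with made-up scores and error labels.
calib: List[Trace] = [
    ([0.1, 0.2, 0.9], 2),
    ([0.1, 0.6, 0.3], 1),
    ([0.2, 0.1], None),
    ([0.5, 0.7], 0),
    ([0.1, 0.3, 0.8], 2),
    ([0.2, 0.4], 1),
    ([0.3, 0.2, 0.1], None),
]
tau = calibrate_threshold(calib, alpha=0.25)
new_trace_scores = [0.1, 0.3, 0.45, 0.7, 0.2]
n_certified = certified_prefix_length(new_trace_scores, tau)
# Steps [0, n_certified) are certified; the remainder goes to review or repair.
needs_review = new_trace_scores[n_certified:]
```

In this sketch the cutoff is deliberately conservative: when the calibration set is too small for the requested `alpha`, it returns negative infinity and certifies nothing, so the entire trace is routed for review.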