Why AI judges change their minds mid-conversation

When LLMs evaluate items in sequence within one conversation, they drift toward the sentiment of prior judgments—a bias that grows stronger when they're already uncertain. Testing 75,898 calls across 11 models shows models consistently shift their ratings after exposure to positive or negative evaluation histories, with negative bias outweighing positive by 62%. Larger models reduce but don't eliminate the effect, and it concentrates on borderline cases rather than clear-cut ones. For anyone deploying LLMs as judges or graders, the fix is simple: evaluate each item in a fresh context rather than batching them together.