Why optimized prompts fail to transfer between tasks

Automated prompt optimization frameworks such as DSPy and TextGrad improve LLM performance on individual benchmarks, but those improvements rarely transfer to new tasks or models. By analyzing thousands of optimized prompts across frameworks and LLMs, researchers identified the cause: certain edit types, such as adding complexity or meta-instructions, systematically harm math and multi-hop reasoning, while step-by-step prompts help logical tasks. These failures are not random; they are predictable mismatches between edit types and task characteristics.