Can AI learn what works by analyzing failed experiments?

Organizations run A/B tests but rarely use the results to design better interventions. Researchers tested whether AI agents with access to analytical tools could learn from one round of healthcare messaging experiments (693K patient visits) and generate improved message variants for the next round. The AI method produced messages with a 69.8% click-through rate, 6.5 percentage points above baseline, whereas frontier LLMs guessing without experimental data failed. The finding: domain-specific experimental evidence matters far more than general reasoning ability.
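
To make the headline numbers concrete, here is a minimal arithmetic sketch. It assumes the 6.5 points are absolute percentage points of click-through rate, which the summary does not state explicitly. The function names are hypothetical and are not taken from the study.

```python
def implied_baseline(treated_ctr_pct: float, lift_points: float) -> float:
    """Baseline CTR in percent, assuming the lift is in absolute percentage points."""
    return treated_ctr_pct - lift_points


def relative_lift(treated_ctr_pct: float, baseline_ctr_pct: float) -> float:
    """Improvement over baseline, expressed as a fraction of the baseline."""
    return (treated_ctr_pct - baseline_ctr_pct) / baseline_ctr_pct


# Figures quoted in the summary: 69.8% CTR, 6.5 points above baseline.
baseline = implied_baseline(69.8, 6.5)
print(f"implied baseline CTR: {baseline:.1f}%")
print(f"relative lift: {relative_lift(69.8, baseline):.1%}")
```

Under that assumption the baseline click-through rate works out to about 63.3%, so the reported gain is roughly a 10% relative improvement.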