Can language models automate quantitative trading strategy testing?

Quantitative backtesting, the validation of trading strategies against historical data, is technically complex, and that complexity limits its adoption. This work introduces BacktestBench, the first large-scale benchmark for automated backtesting, covering metrics calculation, ticker selection, strategy selection, and parameter tuning. The authors propose AutoBacktest, a multi-agent system that decomposes backtesting into semantic extraction, SQL retrieval, and Python code generation. Experiments with 23 mainstream LLMs show that grounded verification and standardized indicator representations are critical for end-to-end performance. The benchmark and baseline are intended for both researchers advancing LLM reasoning and quantitative finance practitioners seeking to lower the barriers to strategy testing.
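
To make the idea of backtesting and metrics calculation concrete, the following is a minimal sketch in Python of the kind of computation involved. It is an illustration only, not AutoBacktest or any part of BacktestBench: the moving-average crossover rule, the function names (`sma`, `backtest_sma_crossover`, `total_return`, `max_drawdown`), the window sizes, and the price series are all assumptions chosen for the example.

```python
from typing import List, Optional


def sma(prices: List[float], window: int) -> List[Optional[float]]:
    """Simple moving average; None until `window` prices are available."""
    out: List[Optional[float]] = []
    for i in range(len(prices)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(prices[i + 1 - window:i + 1]) / window)
    return out


def backtest_sma_crossover(prices: List[float], fast: int = 2, slow: int = 3) -> List[float]:
    """Hypothetical strategy: hold the asset on day t if the fast SMA was
    above the slow SMA at the close of day t-1, otherwise stay in cash.
    Returns the equity curve, starting from 1.0."""
    fast_ma = sma(prices, fast)
    slow_ma = sma(prices, slow)
    equity = [1.0]
    for t in range(1, len(prices)):
        f, s = fast_ma[t - 1], slow_ma[t - 1]
        # Using yesterday's signal avoids look-ahead bias.
        in_market = f is not None and s is not None and f > s
        daily_ret = prices[t] / prices[t - 1] - 1.0 if in_market else 0.0
        equity.append(equity[-1] * (1.0 + daily_ret))
    return equity


def total_return(equity: List[float]) -> float:
    """Cumulative return over the whole equity curve."""
    return equity[-1] / equity[0] - 1.0


def max_drawdown(equity: List[float]) -> float:
    """Largest peak-to-trough decline, as a non-positive fraction."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = min(worst, value / peak - 1.0)
    return worst


if __name__ == "__main__":
    # Made-up closing prices, for illustration only.
    closes = [100, 101, 102, 104, 103, 105]
    curve = backtest_sma_crossover(closes, fast=2, slow=3)
    print(f"total return: {total_return(curve):.4f}")
    print(f"max drawdown: {max_drawdown(curve):.4f}")
```

Even this toy version requires choosing a signal, avoiding look-ahead bias, and defining each metric precisely, which hints at why automating the full workflow is nontrivial.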