Teaching language models to write solvers once, reuse forever

Most LLM approaches to combinatorial optimization solve each instance independently at inference time. This work instead uses reinforcement learning to shift that reasoning into the model weights, training a code LLM to generate a single solver that works across an entire problem family. On Synergistic Dependency Selection (a constrained quadratic knapsack variant), Group Relative Policy Optimization (GRPO) with feasibility-gated rewards fine-tunes Qwen2.5-Coder-14B to reliably produce correct Simulated Annealing implementations. The resulting solver achieves a 5% gap to optimal, versus 28.7% for base-model sampling, and costs 91× less to execute repeatedly. A single frozen solver generalizes across the test set. Preliminary results on Job Shop Scheduling suggest partial transfer to new domains, though performance remains sensitive to reward design and scaffolding choices.
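
To make "feasibility-gated rewards" and the group-relative part of GRPO concrete, here is a minimal Python sketch. It is not the reward used in this work. The function names, the objective-ratio quality term, and the choice of population standard deviation are all assumptions made for illustration. The only ideas taken from the text are that infeasible outputs earn nothing and that each sampled solver is scored relative to the other samples in its group.

```python
from statistics import mean, pstdev


def gated_reward(feasible: bool, objective: float, best_known: float) -> float:
    """Hypothetical feasibility-gated reward (an assumption, not the paper's formula).

    Infeasible solutions get 0 regardless of objective value; feasible ones
    get the fraction of the best-known objective they reach, clipped to [0, 1].
    """
    if not feasible or best_known <= 0:
        return 0.0
    return max(0.0, min(1.0, objective / best_known))


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group, as GRPO does.

    Each reward is centered on the group mean and divided by the group
    standard deviation; eps avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical solver samples for one prompt: (feasible, objective).
samples = [(True, 95.0), (False, 120.0), (True, 70.0), (True, 100.0)]
rewards = [gated_reward(ok, obj, best_known=100.0) for ok, obj in samples]
advantages = group_relative_advantages(rewards)
```

In this toy group, the second sample reports the highest objective but violates a constraint, so the gate zeroes its reward and it receives the lowest advantage. That is the intended effect of gating: the policy cannot gain by trading feasibility for objective value.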