Smart bit allocation for compressed language models

Large language models need efficient quantization, but deciding which modules get more bits is expensive to recompute for each target budget. GAMMA addresses this by learning module sensitivity rankings in a single post-training pass, then using integer programming to assign an exact bit-width to each module for any deployment target. Because the learned rankings are reusable across budgets, per-budget adaptation drops from hours to minutes. On Llama and Qwen models (8B–32B), GAMMA reaches an average of 2.5 bits per weight while matching the quality of 3-bit fixed-precision quantization, and it outperforms search-based mixed-precision methods by up to 7 points.
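To make the two-stage idea concrete, here is a minimal sketch of the second stage only: given sensitivity scores that are already fixed, choose one bit-width per module under an average-bit budget. Everything in it is an assumption for illustration rather than GAMMA's actual formulation: the function name `allocate_bits`, the toy module sizes and scores, the candidate bit-widths (2, 3, 4), and the proxy loss `sensitivity * 2**(-bits)`. The small integer program is solved exactly with dynamic programming, standing in for a general integer-programming solver.

```python
from typing import Dict, List, Sequence, Tuple


def allocate_bits(
    sizes: Sequence[int],
    sensitivity: Sequence[float],
    avg_bits: float,
    choices: Sequence[int] = (2, 3, 4),
) -> List[int]:
    """Pick one bit-width per module, minimizing a proxy loss under a size budget.

    Hypothetical helper: the proxy loss and bit-width choices are assumptions.
    """
    budget = int(avg_bits * sum(sizes))  # total weight bits allowed
    # best maps "bits used so far" -> (proxy loss, bit-widths chosen so far)
    best: Dict[int, Tuple[float, List[int]]] = {0: (0.0, [])}
    for size, sens in zip(sizes, sensitivity):
        nxt: Dict[int, Tuple[float, List[int]]] = {}
        for used, (loss, picked) in best.items():
            for b in choices:
                total = used + size * b
                if total > budget:
                    continue  # this choice would exceed the budget
                # Assumed proxy: a module's error halves with each extra bit.
                cand = loss + sens * 2.0 ** (-b)
                if total not in nxt or cand < nxt[total][0]:
                    nxt[total] = (cand, picked + [b])
        best = nxt
    if not best:
        raise ValueError("budget is below the smallest available bit-width")
    # Among all feasible totals, return the allocation with the lowest proxy loss.
    return min(best.values(), key=lambda entry: entry[0])[1]


# Toy inputs (made up): relative module sizes and fixed sensitivity scores.
sizes = [4, 4, 2, 2]
sensitivity = [0.9, 0.1, 0.5, 0.2]

# The same scores are reused for each budget; only the cheap solve is repeated.
for target in (2.5, 3.0):
    print(target, allocate_bits(sizes, sensitivity, target))
```

In this toy setup the scores are computed once and only the allocation step is rerun per budget, which mirrors the reuse the summary describes. At both targets the most sensitive modules receive the extra bits, while the least sensitive large module stays at 2 bits.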