Smart bit allocation for compressed language models
Zhangyang Yao, Haiyan Zhao, Haoyu Wang, Tianbo Huang, Lihua Zhang, Xu Han
May 18, 2026
Mixed-precision quantization of large language models requires deciding which modules get more bits, and that decision is expensive to recompute for every target bit budget. GAMMA addresses this by learning module sensitivity rankings in a single post-training pass, then using integer programming to assign exact per-module bit widths for any deployment target. Because the learned rankings are reusable across budgets, per-budget adaptation drops from hours to minutes. On Llama and Qwen models (8B–32B), GAMMA reaches an average precision of 2.5 bits while matching the quality of 3-bit fixed-precision quantization, and it outperforms search-based mixed-precision methods by up to 7 points.
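To make the budgeted allocation step concrete, here is a minimal toy sketch in Python. It is not GAMMA's implementation: the module sizes, the sensitivity scores, the candidate bit widths, and the error proxy (sensitivity times 4 to the power of minus the bit width) are all assumptions chosen for illustration, and exhaustive enumeration stands in for a real integer programming solver. The function name `allocate_bits` is hypothetical.

```python
from itertools import product


def allocate_bits(sizes, sensitivities, avg_bits, choices=(2, 3, 4)):
    """Pick one bit width per module, minimizing a proxy error under a budget.

    sizes         -- weight count per module (toy units)
    sensitivities -- per-module sensitivity scores (assumed to be given)
    avg_bits      -- target average bits per weight across all modules
    choices       -- candidate bit widths (an assumption for this sketch)
    """
    budget = avg_bits * sum(sizes)  # total bits allowed over all weights
    best_cost, best_assign = float("inf"), None
    # Exhaustive enumeration: exact, but only feasible for a few modules.
    for assign in product(choices, repeat=len(sizes)):
        used = sum(n * b for n, b in zip(sizes, assign))
        if used > budget:
            continue  # violates the size budget
        # Assumed proxy: error shrinks by 4x for every extra bit.
        cost = sum(s * 4.0 ** (-b) for s, b in zip(sensitivities, assign))
        if cost < best_cost:
            best_cost, best_assign = cost, assign
    return best_assign  # None if even the smallest widths exceed the budget


# Sensitivities are fixed once; only the budget changes between calls.
sizes = [4, 4, 2, 2]
sensitivities = [8.0, 1.0, 4.0, 0.5]
for target in (2.5, 3.0):
    print(target, allocate_bits(sizes, sensitivities, target))
```

The loop at the end mirrors the reuse described above: the sensitivity scores are computed once, and only the cheap allocation step is rerun for each new budget. For models with hundreds of modules, enumeration would have to be replaced by an actual integer programming or dynamic programming solver.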