cs.AI

Smart bit allocation for compressed language models

Zhangyang Yao, Haiyan Zhao, Haoyu Wang, Tianbo Huang, Lihua Zhang, Xu Han

May 18, 2026

Quantizing large language models efficiently means deciding which modules get more bits, and that decision is expensive to recompute for every target budget. GAMMA addresses this by learning module sensitivity rankings in a single post-training pass, then using integer programming to assign exact bit allocations for any deployment target. Because the learned rankings are reusable across budgets, per-budget adaptation drops from hours to minutes. On Llama and Qwen models (8B–32B), GAMMA reaches an average precision of 2.5 bits while matching the quality of 3-bit fixed-precision quantization, and outperforms search-based mixed-precision methods by up to 7 points.
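To make the budgeted allocation step concrete, here is a minimal sketch of assigning per-module bit widths under an average-bit budget. It is not GAMMA's actual formulation, which this summary does not spell out. The function name `allocate_bits`, the candidate widths (2, 3, 4), the sensitivity values, and the proxy loss `sensitivity * 4**(-bits)` are all assumptions made for illustration. The integer program is solved exactly as a multiple-choice knapsack using dynamic programming.

```python
from typing import List, Sequence


def allocate_bits(
    sizes: Sequence[int],
    sensitivities: Sequence[float],
    avg_bits_budget: float,
    choices: Sequence[int] = (2, 3, 4),
) -> List[int]:
    """Pick one bit width per module, minimizing a proxy loss under a budget.

    Hypothetical setup: sizes are integer parameter counts (any unit),
    sensitivities are precomputed per-module scores, and the proxy loss
    sens * 4**(-bits) is an assumed stand-in for quantization error.
    """
    inf = float("inf")
    # Largest total number of bits allowed across all modules.
    budget = int(avg_bits_budget * sum(sizes) + 1e-9)

    # best[c] = minimal proxy loss using exactly c total bits so far.
    best = [inf] * (budget + 1)
    best[0] = 0.0
    back = []  # back[i][c] = bit width chosen for module i to reach state c

    for size, sens in zip(sizes, sensitivities):
        new = [inf] * (budget + 1)
        choice = [None] * (budget + 1)
        for used, loss in enumerate(best):
            if loss == inf:
                continue
            for bits in choices:
                total = used + size * bits
                if total > budget:
                    continue
                cand = loss + sens * 4.0 ** (-bits)
                if cand < new[total]:
                    new[total] = cand
                    choice[total] = bits
        best = new
        back.append(choice)

    end = min(range(budget + 1), key=lambda c: best[c])
    if best[end] == inf:
        raise ValueError("budget is below the smallest candidate bit width")

    # Walk the recorded choices backwards to recover the allocation.
    allocation = []
    used = end
    for size, choice in zip(reversed(sizes), reversed(back)):
        bits = choice[used]
        allocation.append(bits)
        used -= size * bits
    allocation.reverse()
    return allocation


if __name__ == "__main__":
    sizes = [1, 1, 1, 1]                 # equal-sized modules (assumed)
    sensitivities = [8.0, 1.0, 4.0, 0.5]  # made-up scores, computed once
    # The same scores are reused for each budget; only the solve is repeated.
    for target in (2.0, 2.5, 4.0):
        print(target, allocate_bits(sizes, sensitivities, target))
```

The loop at the bottom mirrors the reuse idea from the summary: the sensitivity scores are fixed once, and only the comparatively cheap allocation step runs again for each new budget. For realistic module counts and parameter sizes, a dedicated integer programming solver would likely replace this dynamic program.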
Published as "GAMMA: Global Bit Allocation for Mixed-Precision Models under Arbitrary Budgets" (arXiv:2605.18475).