Protecting fine details when shrinking vision-language models

Distilling large vision-language models into efficient hybrid architectures (mixing Mamba-2 and attention) preserves performance on reasoning benchmarks but fails dramatically on OCR and document tasks. The cause is an imbalance in the training loss: low-density background patches (sky, texture) dominate the loss computation, while the sparse high-density patches that contain text and edges receive too little weight to be protected. HEED replaces uniform residual alignment with density-weighted alignment, using patch self-dissimilarity as a training-free importance signal. On a 10-benchmark suite, HEED gains 5.13 points over standard distillation; on OCRBench v2 alone, it recovers 8.7 points. The final student reaches teacher-level performance with 4.12× throughput and 68% memory savings at 128k context, and it adds no parameters or inference overhead.
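To make the idea of density-weighted alignment concrete, here is a minimal sketch in plain Python. It is not HEED's actual formulation, which this summary does not spell out. It assumes that a patch's self-dissimilarity is one minus its mean cosine similarity to the other patches in the image, that weights are normalized to average 1, and that the alignment term is a per-patch squared error between student and teacher features. All function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def self_dissimilarity(patches):
    """Assumed importance signal: 1 - mean cosine similarity to all other patches.

    Patches that resemble many others (uniform background) score low;
    patches that resemble few others (text, edges) score high.
    """
    if len(patches) < 2:
        raise ValueError("need at least two patches")
    scores = []
    for i, p in enumerate(patches):
        sims = [cosine(p, q) for j, q in enumerate(patches) if j != i]
        scores.append(1.0 - sum(sims) / len(sims))
    return scores


def density_weights(scores, eps=1e-8):
    """Normalize scores so the weights average to 1 (assumed normalization)."""
    n = len(scores)
    total = sum(scores)
    if total <= eps:
        # All patches look alike: fall back to uniform weighting.
        return [1.0] * n
    return [n * s / (total + eps) for s in scores]


def weighted_alignment_loss(student, teacher, weights):
    """Weighted mean of per-patch squared errors between student and teacher."""
    per_patch = [
        sum((a - b) ** 2 for a, b in zip(s, t))
        for s, t in zip(student, teacher)
    ]
    return sum(w * e for w, e in zip(weights, per_patch)) / len(per_patch)


# Toy image: three near-identical background patches and one distinct patch.
teacher_feats = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
weights = density_weights(self_dissimilarity(teacher_feats))
```

In this toy case the distinct fourth patch receives a larger weight than any background patch, so a student error of the same size costs more there than on the background. The weights are computed from features alone, with no learned parameters, which matches the training-free property described above. A real implementation would operate on batched tensors rather than Python lists.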