How to avoid encoding the same video frame twice?

Haowen Hou, Zhen Huang, Zheming Liang, Qingyi Si, Chenglin Li, Shuai Dong, Kele Shao, Ruilin Li, Dianyi Wang, Nan Duan, Jiaqi Wang

Video multimodal models waste compute by encoding each frame independently, even though adjacent frames are nearly identical. AdaCodec instead sends a full reference frame only when the scene changes unpredictably, and otherwise transmits compact descriptions of motion and pixel differences. Across 11 benchmarks, this adaptive approach matches Qwen3-VL-8B's performance with one-seventh of the token budget and cuts response time from 9.3 seconds to 1.6 seconds.
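
The adaptive idea can be illustrated with a toy sketch. This is not AdaCodec's actual method: the mean-absolute-difference test, the `threshold` value, the flattened-list frame representation, and the names `mean_abs_diff` and `plan_encoding` are all assumptions made for illustration, and motion is not modeled at all. It only shows the decision between sending a full reference frame and sending a compact difference against the last reference.

```python
from typing import List, Tuple

# Hypothetical representation: a frame is a flattened list of grayscale values.
Frame = List[float]


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average absolute pixel difference between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def plan_encoding(frames: List[Frame], threshold: float = 0.2) -> List[Tuple[str, Frame]]:
    """Label each frame 'full' (new reference) or 'delta' (residual only).

    The threshold is an assumed stand-in for "the scene changed unpredictably".
    """
    plan: List[Tuple[str, Frame]] = []
    reference: Frame = []
    for frame in frames:
        if not reference or mean_abs_diff(frame, reference) > threshold:
            # Large change (or first frame): send the whole frame and reset the reference.
            plan.append(("full", frame))
            reference = frame
        else:
            # Small change: send only the pixel differences against the reference.
            residual = [x - y for x, y in zip(frame, reference)]
            plan.append(("delta", residual))
    return plan


frames = [
    [0.0, 0.0, 0.0, 0.0],  # first frame, always sent in full
    [0.0, 0.1, 0.0, 0.0],  # nearly identical to the reference
    [1.0, 1.0, 1.0, 1.0],  # abrupt scene change
]
kinds = [kind for kind, _ in plan_encoding(frames)]
print(kinds)  # ['full', 'delta', 'full']
```

In this toy version the savings come from the residual being mostly zeros and therefore cheap to represent; how AdaCodec turns motion and differences into tokens is not specified in the summary above.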