Why can't vision-language models count beyond their training data?
Xingzhou Pang, Yifan Hou, Junling Wang, Mrinmaya Sachan
May 28, 2026
Vision-language models such as GPT-4V fail sharply when asked to count objects in quantities beyond their training range, despite excelling at other visual tasks. The researchers split counting into three stages and found that the models perceive individual objects and judge relative quantity well; the failure lies in the final stage, translating visual magnitude into number tokens. The models learn separate statistical patterns for each modality rather than a unified number space, so they cannot generalize to unseen quantities. More data alone won't fix this; the models need architectural constraints that enforce shared representations across vision and language.
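As a rough illustration of why a per-token mapping struggles to extrapolate, here is a toy sketch in Python. It is not the paper's experiment or architecture: `visual_magnitude`, the two readouts, and the training range of 1 to 10 are all hypothetical, and the visual signal is simply assumed to grow linearly with the true count. The first readout stores an independent prototype for each number token seen in training; the second fits a single shared number line.

```python
# Toy illustration only: every name and number here is an assumption,
# not something taken from the paper.

def visual_magnitude(count: int) -> float:
    """Hypothetical stand-in for a vision encoder's magnitude signal.

    Assumed to be monotonic (here, linear) in the true object count.
    """
    return 0.5 * count + 2.0


TRAIN_COUNTS = list(range(1, 11))  # assumed training range: counts 1..10


def fit_token_prototypes(counts):
    # One independent prototype per number token; nothing ties "7" to "8".
    return {c: visual_magnitude(c) for c in counts}


def predict_token(prototypes, magnitude):
    # Pick the seen token whose prototype is closest to the observed magnitude.
    return min(prototypes, key=lambda c: abs(prototypes[c] - magnitude))


def fit_number_line(counts):
    # Ordinary least squares for count = slope * magnitude + intercept.
    xs = [visual_magnitude(c) for c in counts]
    ys = [float(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_number_line(params, magnitude):
    slope, intercept = params
    return round(slope * magnitude + intercept)


prototypes = fit_token_prototypes(TRAIN_COUNTS)
line = fit_number_line(TRAIN_COUNTS)

# Columns: true count, per-token readout, shared number-line readout.
for true_count in (5, 10, 15, 20):
    m = visual_magnitude(true_count)
    print(true_count, predict_token(prototypes, m), predict_number_line(line, m))
```

Under these assumptions both readouts agree inside the training range, but the per-token readout returns 10 for true counts of 15 and 20, while the shared number line returns 15 and 20. Real models are far messier; the sketch only shows that a good magnitude signal is not enough when the mapping to number tokens has no shared structure.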