Why can't vision-language models count beyond their training data?

Vision-language models such as GPT-4V fail dramatically when asked to count objects beyond their training range, despite excelling at other visual tasks. Researchers dissected this failure into three stages and found that the models perceive individual objects and judge relative quantity well; the real problem is the final step, translating visual magnitude into number tokens. The models learn separate statistical patterns for each modality rather than a unified number space, so they cannot generalize to unseen quantities. More data alone won't fix this; the models need architectural constraints that enforce shared representations across vision and language.
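
The gap between memorized per-modality patterns and a shared number space can be illustrated with a toy sketch. This is not the researchers' method or any real model: `visual_magnitude`, `lookup_readout`, `shared_readout`, and the linear magnitude signal are all hypothetical names and assumptions chosen for illustration. The sketch contrasts a readout that memorizes magnitude-to-token pairs seen in training with one that maps magnitude onto a single number line.

```python
# Toy illustration only; every name and number here is an assumption,
# not taken from the study or from any real vision-language model.

def visual_magnitude(n_objects: int) -> float:
    """Hypothetical stand-in for a vision encoder's magnitude signal.

    Assumed to grow linearly with the number of objects.
    """
    return 0.5 * n_objects


TRAIN_COUNTS = range(1, 11)  # counts 1..10 are "seen in training"

# Readout A: a memorized magnitude -> number-token table.
# It models a pattern learned separately for the output modality.
lookup = {visual_magnitude(n): str(n) for n in TRAIN_COUNTS}


def lookup_readout(magnitude: float) -> str:
    # Picks the nearest memorized magnitude, so it can never emit
    # a number token outside the training range.
    nearest = min(lookup, key=lambda m: abs(m - magnitude))
    return lookup[nearest]


# Readout B: one shared number line. A single scale factor is fitted
# by least squares (through the origin) on the same training counts.
_num = sum(visual_magnitude(n) * n for n in TRAIN_COUNTS)
_den = sum(visual_magnitude(n) ** 2 for n in TRAIN_COUNTS)
scale = _num / _den


def shared_readout(magnitude: float) -> str:
    # The same mapping applies to any magnitude, seen or unseen.
    return str(round(magnitude * scale))


if __name__ == "__main__":
    for n in (7, 15):  # 7 is in range, 15 is beyond it
        m = visual_magnitude(n)
        print(n, lookup_readout(m), shared_readout(m))
```

Under these assumptions both readouts agree inside the training range, but for 15 objects the lookup readout saturates at "10" while the shared readout returns "15". The sketch only shows why a shared representation can extrapolate; it does not say how such a constraint would be built into a real architecture.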