Do vision models fail when robots face real lighting and clutter?

Leyi Wu, Yifan Zhao, Jinjie Zhang, Suzeyu Chen, Wosong Chen, Zhifei Chen, Tianshuo Xu, Qingchun He, Hongxin Hu, Haojian Huang, Yangkai Wei, Wenqian Li, Yinchuan Li, Ying-Cong Chen

Vision-language models (VLMs) power robot perception, but existing benchmarks test them on clean images rather than the messy real world. RoboStressBench decomposes real-world visual chaos into four physical factors (materials, viewpoint, lighting, and geometry), then tests how each one breaks state-of-the-art VLMs at recognition and planning. Different stressors wreck different capabilities, and the authors show that a stress-aware agent, which detects and edits problematic images before reasoning, significantly improves performance in harsh conditions.
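
To make the detect, edit, then reason loop concrete, here is a minimal Python sketch of how such a stress-aware agent might be wired together. It is an illustration under stated assumptions, not the authors' implementation: the `StressAwareAgent` class, the `detect`, `edit`, and `reason` callables, the 0.5 threshold, and the dict-based stand-in for an image are all hypothetical. Only the four factor names come from the summary.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# The four stress factors named in the summary.
FACTORS = ("materials", "viewpoint", "lighting", "geometry")

# Hypothetical stand-in for an image: per-factor stress levels in [0, 1].
Image = Dict[str, float]


@dataclass
class StressAwareAgent:
    detect: Callable[[Image], Dict[str, float]]  # hypothetical stress detector
    edit: Callable[[Image, str], Image]          # hypothetical per-factor image editor
    reason: Callable[[Image], str]               # hypothetical VLM call
    threshold: float = 0.5                       # assumed cutoff, not from the source

    def run(self, image: Image) -> str:
        # Score the image once, then repair each factor that exceeds the cutoff.
        scores = self.detect(image)
        for factor in FACTORS:
            if scores.get(factor, 0.0) > self.threshold:
                image = self.edit(image, factor)
        # Reason only on the (possibly edited) image.
        return self.reason(image)


# Toy stand-ins so the sketch runs end to end.
def toy_detect(image: Image) -> Dict[str, float]:
    return {f: image.get(f, 0.0) for f in FACTORS}


def toy_edit(image: Image, factor: str) -> Image:
    # Pretend the editor fully removes the stressor.
    return {**image, factor: 0.0}


def toy_reason(image: Image) -> str:
    # Pretend the VLM degrades whenever any stressor is still strong.
    return "ok" if all(image.get(f, 0.0) <= 0.5 for f in FACTORS) else "degraded"


agent = StressAwareAgent(toy_detect, toy_edit, toy_reason)
harsh = {"lighting": 0.9}
print(toy_reason(harsh))  # reasoning directly on the stressed image
print(agent.run(harsh))   # reasoning after detection and editing
```

Keeping detection, editing, and reasoning as separate callables mirrors the per-factor framing of the benchmark: each stressor can be handled by its own repair step before the model is asked to reason.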