Making robot vision robust without retraining on new data

Yiyang Fu, Chubin Zhang, Shukai Gong, Yufan Deng, Kaiwei Sun, Qiyang Min, Qibin Hou, Yansong Tang, Jianan Wang, Daquan Zhou

Vision-language-action (VLA) models, which let robots act from visual input, collapse when they encounter real-world corruptions such as blur, fog, or noise that were absent from their training data. The authors propose the Information Bottleneck Adapter, a lightweight module that filters noise from visual inputs without requiring extra data or augmentation. On long-horizon robot tasks, it recovers 30% of the lost performance and lets even small 0.5B-parameter models match 7B-scale models under corrupted visuals.
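
To make the idea of an information bottleneck on visual features concrete, here is a minimal sketch of a generic variational bottleneck layer. It is not the authors' implementation: the summary does not describe the adapter's architecture, so the class name `BottleneckAdapter`, the linear maps, the latent size, and the `beta` weight are all hypothetical choices made for illustration. The layer compresses each feature vector into a small stochastic code and returns a KL penalty that, during training, would discourage the code from carrying more information than the task needs.

```python
import numpy as np


class BottleneckAdapter:
    """Hypothetical variational bottleneck layer (illustrative, not the paper's code)."""

    def __init__(self, feature_dim, latent_dim, beta=1e-3, seed=0):
        self.rng = np.random.default_rng(seed)
        in_scale = 1.0 / np.sqrt(feature_dim)
        out_scale = 1.0 / np.sqrt(latent_dim)
        # Linear maps producing the mean and log-variance of the latent code.
        self.w_mu = self.rng.normal(0.0, in_scale, (feature_dim, latent_dim))
        self.w_logvar = self.rng.normal(0.0, in_scale, (feature_dim, latent_dim))
        # Linear map projecting the latent code back to the feature size.
        self.w_out = self.rng.normal(0.0, out_scale, (latent_dim, feature_dim))
        # Assumed weight on the compression penalty.
        self.beta = beta

    def forward(self, features, sample=True):
        """Return (filtered_features, weighted_kl_penalty) for a (batch, feature_dim) array."""
        mu = features @ self.w_mu
        logvar = features @ self.w_logvar
        if sample:
            # Reparameterization: z = mu + sigma * eps, with eps ~ N(0, I).
            eps = self.rng.standard_normal(mu.shape)
            z = mu + np.exp(0.5 * logvar) * eps
        else:
            # Deterministic path: use the mean code only.
            z = mu
        # KL(N(mu, sigma^2) || N(0, I)), summed over latent dims, averaged over the batch.
        kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=1).mean()
        filtered = z @ self.w_out
        return filtered, self.beta * kl
```

In use, a batch of visual features of shape `(batch, feature_dim)` would pass through `forward` before reaching the policy, and the returned penalty would be added to the training loss. Passing `sample=False` gives a deterministic output, which is one plausible choice at inference time.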