Why multimodal AI struggles when tasks look similar but need different answers
Yu-Cheng Shi, Zhen-Hao Xie, Jun-Tao Tang, Da-Wei Zhou
June 1, 2026
Multimodal AI models trained on a sequence of tasks often fail when semantically similar tasks require different response structures, for example by confusing visual question answering (short text) with grounding (coordinates). ProtoAda addresses this by routing tasks on both semantic similarity and output format, then consolidating parameter updates geometrically to avoid gradient interference. The reported experiments show substantial gains, particularly on tasks prone to format corruption during continual learning.