cs.CV

Why multimodal AI struggles when tasks look similar but need different answers

Yu-Cheng Shi, Zhen-Hao Xie, Jun-Tao Tang, Da-Wei Zhou

June 1, 2026

Multimodal AI models trained on a sequence of tasks often fail when semantically similar tasks require different response structures, for example by confusing visual question answering (short text) with grounding (coordinates). ProtoAda addresses this by routing tasks based on both semantic similarity and output format, then consolidating parameter updates geometrically to avoid gradient interference. The authors report substantial gains, particularly on tasks prone to format corruption during continual learning.
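To make the routing idea concrete, here is a minimal sketch of how a router might score a task against stored prototypes using both a semantic embedding and an output-format embedding. This is not the paper's implementation: the function names, the weighting `alpha`, the `threshold`, the two-dimensional toy embeddings, and the rule "open a new adapter when nothing matches" are all assumptions chosen for illustration, and the geometric consolidation step is not covered.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def route(task_sem, task_fmt, prototypes, alpha=0.5, threshold=0.8):
    """Pick an adapter for a task (hypothetical routing rule).

    prototypes: list of (semantic_vector, format_vector) pairs, one per adapter.
    Returns the index of the best-matching adapter, or None when no prototype
    scores at least `threshold`, signalling that a new adapter is needed.
    """
    best_idx, best_score = None, -1.0
    for idx, (proto_sem, proto_fmt) in enumerate(prototypes):
        # Assumed scoring: weighted mix of semantic and format similarity.
        score = (alpha * cosine(task_sem, proto_sem)
                 + (1 - alpha) * cosine(task_fmt, proto_fmt))
        if score > best_score:
            best_idx, best_score = idx, score
    return best_idx if best_score >= threshold else None


# Toy data: one existing adapter, learned on a VQA-style task (short text).
vqa_prototype = (np.array([1.0, 0.0]), np.array([1.0, 0.0]))
prototypes = [vqa_prototype]

# A second VQA-like task: similar meaning, same output format.
same_format = route(np.array([0.9, 0.1]), np.array([1.0, 0.0]), prototypes)

# A grounding-style task: similar meaning, but coordinate output.
different_format = route(np.array([0.95, 0.05]), np.array([0.0, 1.0]), prototypes)
```

In this toy setup the second VQA-like task reuses the existing adapter, while the grounding-style task is sent to a new one even though its semantic embedding is almost identical. A router that looked at semantic similarity alone would have merged the two, which is the failure mode described above.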
Published as "ProtoAda: Prototype-Guided Adaptive Adapter Expansion and Geometric Consolidation for Multimodal Continual Instruction Tuning" (arXiv:2606.02576).