Separating language memory from vision for robot control
Jingjing Fan, Siyuan Li, Botao Ren, Zhidong Deng
May 18, 2026
Current robot control systems mix language understanding with visual reasoning in a single neural backbone, forcing the two to compete for capacity. Key-Gram instead uses an external memory module that breaks instructions into task-specific components, retrieves linguistic knowledge via fast lookup tables, and injects it into the visual processing pipeline through gating mechanisms. This separation lets the backbone focus on visual reasoning while reusable instruction knowledge scales independently. Tests on RoboTwin2.0, LIBERO, and real dual-arm robots show consistent gains: a 29.5% improvement on simulation benchmarks and strong zero-shot transfer to new tasks.
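The pipeline described above (decompose the instruction, look up stored knowledge, gate it into the visual features) can be illustrated with a toy sketch. This is not the authors' implementation: the summary does not specify how instructions are decomposed, how the tables are indexed, or what form the gate takes. The word-bigram keys, the CRC32-hashed table, the scalar sigmoid gate, and every name here (`instruction_keys`, `LookupMemory`, `gated_inject`) are assumptions made purely for illustration.

```python
import math
import random
import zlib


def instruction_keys(instruction, n=2):
    """Split an instruction into word n-grams (assumed keying scheme)."""
    words = instruction.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


class LookupMemory:
    """Hashed embedding table standing in for the external memory (hypothetical)."""

    def __init__(self, num_slots=256, dim=4, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        # Randomly initialised here; in a real system these rows would be learned.
        self.table = [[rng.gauss(0.0, 0.1) for _ in range(dim)]
                      for _ in range(num_slots)]

    def slot(self, key):
        # CRC32 gives a stable index (Python's built-in hash is salted per process).
        return zlib.crc32(key.encode("utf-8")) % len(self.table)

    def retrieve(self, keys):
        """Constant-time lookup per key, averaged into one memory vector."""
        if not keys:
            return [0.0] * self.dim
        rows = [self.table[self.slot(k)] for k in keys]
        return [sum(col) / len(rows) for col in zip(*rows)]


def gated_inject(visual_token, memory_vec, gate_weights, gate_bias=0.0):
    """Add memory to a visual feature, scaled by a scalar sigmoid gate (assumed form)."""
    features = visual_token + memory_vec  # list concatenation: [visual; memory]
    score = sum(w * x for w, x in zip(gate_weights, features)) + gate_bias
    gate = 1.0 / (1.0 + math.exp(-score))
    return [v + gate * m for v, m in zip(visual_token, memory_vec)]


memory = LookupMemory()
keys = instruction_keys("pick up the red block")
mem_vec = memory.retrieve(keys)
visual_token = [0.5, -0.2, 0.1, 0.0]  # stand-in for one backbone feature vector
fused = gated_inject(visual_token, mem_vec, gate_weights=[0.1] * 8)
```

In this sketch the lookup table can grow (more slots, wider rows) without touching the function that produces `visual_token`, which is the kind of independent scaling the summary attributes to separating instruction memory from the visual backbone.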