cs.AI

Separating language memory from vision for robot control

Jingjing Fan, Siyuan Li, Botao Ren, Zhidong Deng

May 18, 2026

Current robot control systems handle language understanding and visual reasoning in a single neural backbone, forcing the two to compete for capacity. Key-Gram instead uses an external memory module that breaks instructions into task-specific components, retrieves linguistic knowledge via fast lookup tables, and injects it into the visual processing pipeline through gating mechanisms. This separation lets the backbone focus on visual reasoning while reusable instruction knowledge scales independently. Tests on RoboTwin2.0, LIBERO, and real dual-arm robots show consistent gains: a 29.5% improvement on simulation benchmarks and strong zero-shot transfer to new tasks.
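The decompose, look up, and gate pipeline described above can be illustrated with a small sketch. This is not the paper's implementation. The word n-gram keys, the CRC32 hashing, the table size, the feature width, and every function name here are assumptions chosen for illustration, and the table holds fixed pseudo-values where a real system would hold learned embeddings.

```python
import math
import zlib

DIM = 4          # feature width (assumed; tiny for illustration)
TABLE_SIZE = 64  # number of memory slots (assumed)

# Stand-in for a learned embedding table: fixed, deterministic values.
TABLE = [[math.sin(r * DIM + c + 1) for c in range(DIM)]
         for r in range(TABLE_SIZE)]


def ngrams(instruction, n=2):
    """Split an instruction into word n-grams, used here as lookup keys."""
    words = instruction.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def slot(key):
    """Map a key to a table index with a deterministic hash."""
    return zlib.crc32(key.encode("utf-8")) % TABLE_SIZE


def retrieve(instruction):
    """Average the table rows addressed by the instruction's n-grams."""
    keys = ngrams(instruction)
    if not keys:
        return [0.0] * DIM
    rows = [TABLE[slot(k)] for k in keys]
    return [sum(col) / len(rows) for col in zip(*rows)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_inject(visual, memory, gate_logits):
    """Residual update: visual + sigmoid(gate) * memory, elementwise."""
    return [v + sigmoid(g) * m
            for v, m, g in zip(visual, memory, gate_logits)]


# Hypothetical visual feature vector and instruction.
visual = [0.5, -0.2, 0.1, 0.9]
memory = retrieve("pick up the red block")

closed = gated_inject(visual, memory, [-50.0] * DIM)  # gate near 0
opened = gated_inject(visual, memory, [50.0] * DIM)   # gate near 1
```

With the gate closed, the visual features pass through essentially unchanged. With it open, the retrieved memory is added in full. In this sketch the lookup involves no backbone computation, which is one way to read the claim that instruction knowledge can scale independently of the visual model.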
Published as "Key-Gram: Extensible World Knowledge for Embodied Manipulation", arXiv:2605.18556