cs.CV

How can transformers handle many modalities at once?

Giordano Cicchetti, Eleonora Grassucci, Danilo Comminiello

June 4, 2026

Current multimodal transformers either compute attention over pairs of modalities, which scales poorly as the number of modalities grows, or concatenate all modalities into a single sequence, which misses joint dependencies among them. GRAMformer introduces Volumetric Multimodal cross-Attention (VMA), which computes attention scores from the geometric volume spanned by query and key vectors drawn from several modalities at once. Three-way and four-way interactions are therefore captured natively rather than approximated through pairs, which the authors report improves both speed and accuracy on multimodal tasks.
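To make the volume idea concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the function names (gram_volume, volumetric_attention_weights), the unit normalization, the negative-volume scoring rule, the temperature value, and the assumption that key tokens are index-aligned across modalities are all illustrative choices. The only standard fact it relies on is that the volume of the parallelotope spanned by a set of vectors equals the square root of the determinant of their Gram matrix, so vectors that agree span a small volume and vectors that disagree span a large one.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def gram_volume(vectors):
    """Volume of the parallelotope spanned by the unit-normalized vectors.

    Computed as sqrt(det(G)), where G is the Gram matrix of pairwise
    dot products. Returns a value in [0, 1] for unit vectors.
    """
    unit = [normalize(v) for v in vectors]
    n = len(unit)
    g = [[sum(a * b for a, b in zip(unit[i], unit[j])) for j in range(n)]
         for i in range(n)]

    # Determinant by Gaussian elimination with partial pivoting.
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(g[r][col]))
        if abs(g[pivot][col]) < 1e-12:
            return 0.0  # linearly dependent vectors span zero volume
        if pivot != col:
            g[col], g[pivot] = g[pivot], g[col]
            det = -det
        det *= g[col][col]
        for r in range(col + 1, n):
            factor = g[r][col] / g[col][col]
            for c in range(col, n):
                g[r][c] -= factor * g[col][c]
    return math.sqrt(max(det, 0.0))


def volumetric_attention_weights(query, keys_by_modality, temperature=0.1):
    """Softmax over negative volumes (assumed scoring rule, not the paper's).

    keys_by_modality is a list with one entry per key modality; each entry
    is a list of key vectors, assumed index-aligned across modalities.
    """
    n_tokens = len(keys_by_modality[0])
    scores = []
    for j in range(n_tokens):
        group = [query] + [keys[j] for keys in keys_by_modality]
        scores.append(-gram_volume(group) / temperature)

    # Numerically stable softmax.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# One query from modality A attends jointly over two tokens that each
# have a key in modality B and a key in modality C.
query = [1.0, 0.0, 0.0]
keys_b = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
keys_c = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
weights = volumetric_attention_weights(query, [keys_b, keys_c])
```

In this toy input the first token's keys are parallel to the query (zero volume) and the second token's keys are mutually orthogonal to it (unit volume), so the first token receives the larger weight. Each score depends on all three vectors at once, which is the sense in which a volume-based score is a joint rather than a pairwise interaction.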
Published as "GRAMformer: Any-Order Modality Interactions via Volumetric Multimodal Cross-Attention" (arXiv:2606.06249).