How can transformers handle many modalities at once?

Current multimodal transformers either compute attention through pairwise interactions, which scales poorly as the number of modalities grows, or concatenate all inputs, which misses joint dependencies among modalities. GRAMformer introduces Volumetric Multimodal cross-Attention (VMA), which computes attention scores as a function of the geometric volume spanned by query and key vectors from multiple modalities simultaneously. This captures three-way and four-way interactions natively rather than approximating them through pairs, improving both speed and accuracy on multimodal tasks.
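
To make the idea of a volume-based score concrete, here is a minimal sketch in Python with NumPy. It relies on one standard fact: the volume of the parallelotope spanned by a set of vectors equals the square root of the determinant of their Gram matrix. Everything else is an assumption made for illustration and is not taken from GRAMformer itself: the function names `gram_volume` and `volumetric_scores`, the unit-normalization of the vectors, the choice of negative volume as the logit (so that well-aligned vectors score higher), and the `temperature` parameter.

```python
import numpy as np


def gram_volume(vectors):
    """Volume of the parallelotope spanned by the rows of `vectors` (shape (m, d)).

    Rows are unit-normalized first (an assumption of this sketch), so the
    result lies in [0, 1]: 0 for linearly dependent rows, 1 for orthogonal rows.
    """
    v = vectors / np.linalg.norm(vectors, axis=-1, keepdims=True)
    gram = v @ v.T                      # (m, m) matrix of pairwise inner products
    det = np.linalg.det(gram)
    return float(np.sqrt(max(det, 0.0)))  # clamp tiny negative rounding errors


def volumetric_scores(query, keys_a, keys_b, temperature=0.1):
    """Hypothetical three-way attention weights for one query.

    Position i is scored by the volume spanned jointly by the query,
    keys_a[i] (modality A), and keys_b[i] (modality B). A smaller volume
    means the three vectors are better aligned, so the logit is -volume.
    """
    volumes = np.array([
        gram_volume(np.stack([query, ka, kb]))
        for ka, kb in zip(keys_a, keys_b)
    ])
    logits = -volumes / temperature
    logits -= logits.max()              # numerical stability for the softmax
    weights = np.exp(logits)
    return weights / weights.sum()


if __name__ == "__main__":
    q = np.array([1.0, 0.0, 0.0])
    keys_a = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
    keys_b = np.array([[1.0, 0.1, 0.0], [0.0, 0.0, 1.0]])
    print(volumetric_scores(q, keys_a, keys_b))
```

In this toy input, the first position has keys that nearly coincide with the query, so its spanned volume collapses and it receives most of the attention weight, while the second position has mutually orthogonal vectors and is down-weighted. Note that a single determinant scores all three vectors jointly, which is the contrast with summing separate pairwise dot products.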