cs.AI

Open multilingual text embeddings that top nine major benchmarks

Ziyin Zhang, Zihan Liao, Hang Yu, Peng Di, Rui Wang

May 14, 2026

Most high-quality text embedding models are expensive to run, trained on a narrow set of languages, and closed-source, which limits both accessibility and reproducibility. ML-Embed tackles all three issues with 3-Dimensional Matryoshka Learning, which combines flexible embedding size (MRL), variable inference depth (MLL), and a new parameter-efficient component (MEL). Models ranging from 140M to 8B parameters are trained on a massively multilingual dataset and evaluated across 430 tasks, achieving top performance on 9 of 17 MTEB benchmarks and notably strong results for low-resource languages. All models, training data, and code are publicly released.
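To make the "flexible embedding size" idea concrete, here is a minimal sketch of how a Matryoshka-style embedding is typically consumed: keep only the leading coordinates and re-normalize before computing cosine similarity. This is not the ML-Embed code. The function names `truncate_embedding` and `cosine` and the toy vectors are hypothetical, chosen only for illustration, and the sketch does not cover the depth (MLL) or parameter-efficiency (MEL) components.

```python
import math


def truncate_embedding(vec, dim):
    """Keep the first `dim` coordinates of `vec` and rescale to unit length.

    Hypothetical helper: assumes the model was trained so that leading
    coordinates form a usable embedding on their own (the Matryoshka property).
    """
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero prefix cannot be normalized
    return [x / norm for x in head]


def cosine(a, b):
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


# Toy 4-dimensional vectors standing in for real model outputs.
query = [0.5, 0.5, 0.5, 0.5]
doc = [0.5, 0.5, -0.5, 0.5]

full_score = cosine(truncate_embedding(query, 4), truncate_embedding(doc, 4))
short_score = cosine(truncate_embedding(query, 2), truncate_embedding(doc, 2))
```

Truncation trades some accuracy for a smaller index and faster search. How much accuracy is lost at each size depends on the model and should be measured on the target task.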
Published as "ML-Embed: Inclusive and Efficient Embeddings for a Multilingual World", arXiv:2605.15081