Open multilingual text embeddings that top nine major benchmarks
Ziyin Zhang, Zihan Liao, Hang Yu, Peng Di, Rui Wang
May 14, 2026
Most high-quality text embedding models are expensive to run, trained on a narrow set of languages, and closed-source, which limits both accessibility and reproducibility. ML-Embed tackles all three issues with 3-Dimensional Matryoshka Learning, which combines flexible embedding size (MRL), variable inference depth (MLL), and a new parameter-efficient component (MEL). Models ranging from 140M to 8B parameters are trained on a massively multilingual dataset and evaluated across 430 tasks, achieving top performance on 9 of 17 MTEB benchmarks and notably strong results for low-resource languages. All models, training data, and code are publicly released.
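To make the "flexible embedding size" idea concrete, here is a minimal sketch of how a Matryoshka-style embedding is typically shortened at inference time: keep a leading prefix of the vector and re-normalize it. This is not ML-Embed's released code. The helper name `truncate_embedding`, the sizes 1024/256/64, and the random vectors standing in for model outputs are all assumptions made for illustration.

```python
import numpy as np


def truncate_embedding(vec, dim):
    """Keep the first `dim` coordinates and L2-normalize the result.

    Hypothetical helper: with Matryoshka-style training, a prefix of the
    full vector is meant to be usable as a smaller embedding on its own.
    """
    head = np.asarray(vec, dtype=np.float64)[..., :dim]
    norm = np.linalg.norm(head, axis=-1, keepdims=True)
    # Guard against division by zero for an all-zero prefix.
    return head / np.clip(norm, 1e-12, None)


# Random stand-ins for two full-size embeddings (assumed 1024-dimensional).
rng = np.random.default_rng(0)
full = rng.normal(size=(2, 1024))

for d in (1024, 256, 64):
    small = truncate_embedding(full, d)
    # Rows are unit length, so the dot product is the cosine similarity.
    similarity = float(small[0] @ small[1])
    print(f"dim={d:4d} shape={small.shape} cosine={similarity:+.4f}")
```

With random inputs the similarities carry no meaning; with a model trained for this property, shorter prefixes would be expected to trade some accuracy for lower storage and faster search.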