A massive, open dataset designed for training image-generation models
Benjamin Aubin, Gonzalo Iñaki Quintana, Onur Tasar, Sanjeev Sreetharan, Urszula Czerwinska, Damien Henry, Clément Chadebec
May 20, 2026
Training text-to-image models typically requires expensive, proprietary datasets. MONET provides an open alternative: 104.9M image-caption pairs curated from 2.9B raw sources through automated filtering, deduplication, and re-captioning with multiple vision-language models. A 4B-parameter diffusion model trained only on MONET matched state-of-the-art scores on standard benchmarks, which supports the dataset's quality and makes reproducible image-generation research more broadly accessible.