A massive, open dataset designed for training image-generation models

Benjamin Aubin, Gonzalo Iñaki Quintana, Onur Tasar, Sanjeev Sreetharan, Urszula Czerwinska, Damien Henry, Clément Chadebec

Training text-to-image models typically requires expensive, proprietary datasets. MONET provides an open alternative: 104.9M image-caption pairs curated from 2.9B raw sources through automated filtering, deduplication, and re-captioning with multiple vision-language models. A 4B-parameter diffusion model trained only on MONET matched state-of-the-art scores on standard benchmarks, which is strong evidence of the dataset's quality and makes reproducible image-generation research far more accessible.
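
To make the curation steps concrete, here is a minimal sketch of a filter-then-deduplicate pass, followed by the retention rate implied by the figures above. It is an illustration under stated assumptions, not MONET's actual pipeline: the `Sample` type, the `passes_filter` rule (a minimum side length of 256 pixels), and exact deduplication by SHA-256 content hash are all hypothetical stand-ins, and the re-captioning step is omitted.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """Hypothetical record for one raw image-caption pair."""
    image_bytes: bytes  # encoded image content
    width: int
    height: int
    caption: str


def passes_filter(sample: Sample, min_side: int = 256) -> bool:
    """Assumed quality rule: keep images whose shorter side is >= min_side."""
    return min(sample.width, sample.height) >= min_side


def curate(samples, min_side: int = 256):
    """Apply the filter, then drop exact duplicates (first occurrence wins)."""
    seen = set()
    kept = []
    for s in samples:
        if not passes_filter(s, min_side):
            continue
        digest = hashlib.sha256(s.image_bytes).hexdigest()
        if digest in seen:
            continue  # identical image bytes already kept
        seen.add(digest)
        kept.append(s)
    return kept


def retention_rate(kept: float, raw: float) -> float:
    """Fraction of the raw pool that survives curation."""
    return kept / raw


raw = [
    Sample(b"cat", 512, 512, "a cat"),
    Sample(b"cat", 512, 512, "a cat, duplicate"),
    Sample(b"dog", 128, 512, "a dog, too small"),
    Sample(b"bird", 1024, 768, "a bird"),
]
kept = curate(raw)
print(len(kept), "of", len(raw), "kept")          # 2 of 4 kept

# Retention implied by the reported counts: 104.9M kept out of 2.9B raw.
print(f"{retention_rate(104.9e6, 2.9e9):.1%}")    # 3.6%
```

On the reported counts, only about 3.6% of the raw pool survives curation. A pipeline at this scale would more plausibly rely on near-duplicate detection (for example perceptual hashes or embedding similarity) than on exact byte hashes, which is used here only to keep the sketch short.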