cs.CV

A massive, open dataset designed for training image-generation models

Benjamin Aubin, Gonzalo Iñaki Quintana, Onur Tasar, Sanjeev Sreetharan, Urszula Czerwinska, Damien Henry, Clément Chadebec

May 20, 2026

Training text-to-image models typically requires expensive, proprietary datasets. MONET provides an open alternative: 104.9M image-caption pairs curated from 2.9B raw sources through automated filtering, deduplication, and re-captioning with multiple vision-language models. A 4B-parameter diffusion model trained only on MONET matched state-of-the-art scores on standard benchmarks, which supports the dataset's quality and makes reproducible image-generation research more accessible.
Published as "MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset", arXiv:2605.21272