A 28-trillion-pixel dataset for training image generators without legal risk

Keshigeyan Chandrasegaran, Kyle Sargent, Suchir Agarwal, Michael Jang, Michael Poli, Juan Carlos Niebles, Justin Johnson, Jiajun Wu, Li Fei-Fei

Training visual generative models at scale is limited by the scarcity of large datasets that can be used legally. GPIC addresses this with 100M training images (28 trillion pixels in total), each captioned by a vision-language model and permissively licensed for both research and commercial use. The dataset is deduplicated and safety-filtered, and it is hosted on Hugging Face together with benchmarking protocols and baseline flow-matching models, which lowers the infrastructure barrier for researchers and practitioners building image generators.
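
To put the headline numbers in perspective, the short calculation below derives the mean pixel count per image from the two figures quoted above. It is a back-of-the-envelope sketch that assumes both figures are exact round numbers; the variable names are illustrative, and the result says nothing about the actual distribution of resolutions or aspect ratios in the dataset.

```python
import math

# Figures quoted in the summary, taken as exact round numbers (an assumption).
TOTAL_PIXELS = 28 * 10**12   # 28 trillion pixels
NUM_IMAGES = 100 * 10**6     # 100M training images

# Mean pixel count per image.
avg_pixels_per_image = TOTAL_PIXELS // NUM_IMAGES

# Side length of a square image with roughly that many pixels
# (integer floor of the square root).
equivalent_square_side = math.isqrt(avg_pixels_per_image)

print(avg_pixels_per_image)      # 280000
print(equivalent_square_side)    # 529
```

Under these assumptions, the average image holds about 280,000 pixels, comparable to a square image of roughly 529 × 529.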