A tool that auto-generates ML dataset metadata without uploading your data
Rafi Al Attrach, Rajna Fani, Sebastian Lobentanzer, Joan Giner-Miguelez, Debanshu Das, Varuni H. K., Nobin Sarwar, Rajat Ghosh, Anwai Archit, Surbhi Motghare, Christina Conrad Parry, Luis Oala, Lara Grosso, Joaquin Vanschoren, Steffen Vogler, Sujata Goswami, Eric S. Rosenthal, Marzyeh Ghassemi, Matthew McDermott, Tom Pollard
May 14, 2026
Many high-value ML datasets live in governed, private repositories and cannot simply be uploaded to a public platform to generate metadata. Croissant Baker is an open-source command-line tool that addresses this by generating structured JSON-LD Croissant metadata locally from a dataset directory, using a modular registry of handlers for different file formats. Validated against producer-authored ground truth on 140+ datasets spanning multiple domains, it achieves 97–100% agreement. It also handles real-world scale, including a 374-file, 886-million-row Parquet dataset. The tool is aimed at dataset curators and ML platform operators who need to meet standards such as NeurIPS's new Croissant submission requirement.
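To make the handler-registry idea concrete, here is a minimal sketch of how a local tool might map file suffixes to format handlers and assemble Croissant-style JSON-LD without the data leaving the machine. It is an illustration under stated assumptions, not Croissant Baker's actual code or API: the names `HANDLERS`, `register`, `csv_columns`, and `bake` are hypothetical, only CSV is handled, and the output is a simplified subset of the Croissant vocabulary rather than a complete, validated document.

```python
import csv
import hashlib
from pathlib import Path
from typing import Callable, Dict, List

# Hypothetical registry: file suffix -> function that returns column names.
HANDLERS: Dict[str, Callable[[Path], List[str]]] = {}


def register(suffix: str):
    """Decorator that adds a handler to the registry for one file suffix."""
    def decorator(fn: Callable[[Path], List[str]]):
        HANDLERS[suffix] = fn
        return fn
    return decorator


@register(".csv")
def csv_columns(path: Path) -> List[str]:
    # Read only the header row; an empty file yields no columns.
    with path.open(newline="") as f:
        return next(csv.reader(f), [])


def bake(root: Path, name: str) -> dict:
    """Walk a dataset directory and build simplified Croissant-style metadata."""
    distribution, record_sets = [], []
    for path in sorted(root.rglob("*")):
        handler = HANDLERS.get(path.suffix.lower())
        if not path.is_file() or handler is None:
            continue  # skip directories and formats with no registered handler
        rel = path.relative_to(root).as_posix()
        distribution.append({
            "@type": "cr:FileObject",
            "@id": rel,
            "contentUrl": rel,
            # Hashing the whole file in memory is fine for a sketch only.
            "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
        })
        record_sets.append({
            "@type": "cr:RecordSet",
            "@id": f"{rel}#records",
            "field": [{"@type": "cr:Field", "name": c} for c in handler(path)],
        })
    return {
        "@context": {
            "sc": "https://schema.org/",
            "cr": "http://mlcommons.org/croissant/",
        },
        "@type": "sc:Dataset",
        "name": name,
        "distribution": distribution,
        "recordSet": record_sets,
    }
```

Calling `bake(Path("my_dataset"), "demo")` and passing the result to `json.dumps` would yield a JSON-LD document. Supporting another format in this sketch means registering one more function under its suffix, which is the appeal of a registry design: the directory walk and metadata assembly stay unchanged.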