cs.LG

A tool that auto-generates ML dataset metadata without uploading your data

Rafi Al Attrach, Rajna Fani, Sebastian Lobentanzer, Joan Giner-Miguelez, Debanshu Das, Varuni H. K., Nobin Sarwar, Rajat Ghosh, Anwai Archit, Surbhi Motghare, Christina Conrad Parry, Luis Oala, Lara Grosso, Joaquin Vanschoren, Steffen Vogler, Sujata Goswami, Eric S. Rosenthal, Marzyeh Ghassemi, Matthew McDermott, Tom Pollard

May 14, 2026

Many high-value ML datasets live in governed, private repositories and cannot simply be uploaded to a public platform to generate metadata. Croissant Baker is an open-source command-line tool that addresses this by generating structured JSON-LD Croissant metadata locally from a dataset directory, using a modular registry of handlers for different file formats. Validated against producer-authored ground truth across 140+ datasets spanning multiple domains, it achieves 97–100% agreement, and it handles real-world scale, including a 374-file, 886-million-row Parquet dataset. The tool is aimed at dataset curators and ML platform operators who need to meet standards such as NeurIPS's new Croissant submission requirement.
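To make the idea of a handler registry concrete, here is a minimal Python sketch of how local metadata generation of this kind could be organised. It is not Croissant Baker's actual code or API: the names `register`, `HANDLERS`, `csv_fields`, and `bake` are hypothetical, only CSV is handled, every column is typed as text, and the output is a simplified Croissant-style dictionary that omits the `@context`, checksums, and field sources a valid Croissant file would need.

```python
import csv
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List, Tuple

# Hypothetical registry: file extension -> (MIME type, handler function).
# These names are illustrative assumptions, not Croissant Baker's real API.
Handler = Callable[[Path], List[dict]]
HANDLERS: Dict[str, Tuple[str, Handler]] = {}


def register(extension: str, encoding_format: str) -> Callable[[Handler], Handler]:
    """Decorator that adds a handler for one file extension to the registry."""
    def decorator(func: Handler) -> Handler:
        HANDLERS[extension] = (encoding_format, func)
        return func
    return decorator


@register(".csv", "text/csv")
def csv_fields(path: Path) -> List[dict]:
    """Read only the header row and describe each column as a text field."""
    with path.open(newline="") as fh:
        header = next(csv.reader(fh), [])
    # Assumption: no type inference; every column is reported as sc:Text.
    return [{"@type": "cr:Field", "name": col, "dataType": "sc:Text"} for col in header]


def bake(directory: Path, name: str) -> dict:
    """Walk a directory locally and build a simplified Croissant-style dict."""
    distribution: List[dict] = []
    record_sets: List[dict] = []
    for path in sorted(directory.rglob("*")):
        entry = HANDLERS.get(path.suffix.lower())
        if not path.is_file() or entry is None:
            continue  # files without a registered handler are skipped
        encoding_format, handler = entry
        rel = path.relative_to(directory).as_posix()
        distribution.append({
            "@type": "cr:FileObject",
            "@id": rel,
            "contentUrl": rel,
            "encodingFormat": encoding_format,
        })
        record_sets.append({
            "@type": "cr:RecordSet",
            "name": path.stem,
            "field": handler(path),
        })
    return {
        "@type": "sc:Dataset",
        "name": name,
        "distribution": distribution,
        "recordSet": record_sets,
    }


if __name__ == "__main__":
    # Small demo on a throwaway directory; nothing leaves the local machine.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "labels.csv").write_text("id,label\n1,cat\n")
        print(json.dumps(bake(root, "demo"), indent=2))
```

Supporting another format in this sketch would mean registering one more decorated function, for example a Parquet handler that reads only the file's schema, without changing `bake`. That separation is the design point a modular registry is meant to capture.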
Published as "Croissant Baker: Metadata Generation for Discoverable, Governable, and Reusable ML Datasets", arXiv:2605.15079