cs.CV

Fixing manga's biggest dataset for modern AI systems

Jeonghun Baek, Atsuyuki Miyai, Shota Onohara, Hikaru Ikuta, Kiyoharu Aizawa

May 20, 2026

Manga109 is the main dataset for training AI on manga understanding and translation, but its dialogue annotations contained transcription errors, missing text, and inconsistent segmentation that undermined modern OCR systems. The researchers systematically identified five categories of annotation problems and fixed approximately 29,000 entries using automated detection followed by manual review, producing Manga109-v2026. The cleaned dataset now aligns with how contemporary multimodal models process manga panels.
Published as "Manga109-v2026: Revisiting Manga109 Annotations for Modern Manga Understanding" (arXiv:2605.21182)