Fixing manga's biggest dataset for modern AI systems
Jeonghun Baek, Atsuyuki Miyai, Shota Onohara, Hikaru Ikuta, Kiyoharu Aizawa
May 20, 2026
Manga109 is the main dataset for training AI on manga understanding and translation, but its dialogue annotations contained transcription errors, missing text, and inconsistent segmentation that undermined modern OCR systems. The researchers systematically identified five categories of annotation problems and corrected approximately 29,000 entries through automated detection followed by manual review, producing Manga109-v2026. The cleaned dataset now aligns with how contemporary multimodal models process manga panels.