cs.CL

Breaking documents into pieces to assign topics more precisely

Hoonsang Yoon, Takyoung Kim, Wonkee Lee, Ilmin Cho, Dilek Hakkani-Tür, Stanley Jungkyu Choi

May 18, 2026

Traditional topic modeling treats each document as a single unit, an assumption that breaks down for real-world texts such as product reviews, which often cover several unrelated subjects. This work shifts the unit of analysis from documents to segments (coherent spans that each express one theme) and proposes segment-based topic allocation (SBTA). The authors introduce SemEval-STM, a new benchmark dataset in which LLMs decompose documents into topical segments that human annotators then refine. To measure coherence, they extend the word intrusion evaluation task to the segment level. Across multiple models, SBTA produces cleaner topics and better supports analysis of heterogeneous documents. The dataset and framework are released on Hugging Face.
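The difference between document-level and segment-level assignment can be illustrated with a toy sketch. This is not the authors' SBTA method: the keyword lists in `TOPIC_WORDS`, the sentence-based `split_into_segments` helper, and the overlap-count scoring are all assumptions invented for illustration, standing in for a real segmenter and topic model.

```python
import re
from collections import Counter

# Hypothetical topic vocabularies, invented for this illustration only.
TOPIC_WORDS = {
    "battery": {"battery", "charge", "charging", "hours"},
    "screen": {"screen", "display", "brightness", "colors"},
    "shipping": {"shipping", "delivery", "arrived", "package"},
}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def assign_topic(text):
    """Return the topic whose vocabulary matches the most tokens, or None."""
    counts = Counter()
    for token in tokenize(text):
        for topic, words in TOPIC_WORDS.items():
            if token in words:
                counts[topic] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def split_into_segments(document):
    """Naive stand-in for topical segmentation: one segment per sentence."""
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [part for part in parts if part]


def assign_by_segment(document):
    """Assign a topic to each segment instead of to the whole document."""
    return [(seg, assign_topic(seg)) for seg in split_into_segments(document)]


review = (
    "The battery lasts two full days on one charge. "
    "The screen has dull colors and low brightness. "
    "Delivery was slow and the package was dented."
)

# Document-level assignment collapses the review into a single topic.
doc_topic = assign_topic(review)

# Segment-level assignment keeps one topic per coherent span.
segment_topics = [topic for _, topic in assign_by_segment(review)]

print("document:", doc_topic)
print("segments:", segment_topics)
```

In this toy review the document-level label keeps only the topic with the most keyword matches, while the segment-level labels recover all three subjects, which is the kind of heterogeneity the summary describes.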
Published as "From Documents to Segments: A Contextual Reformulation for Topic Assignment", arXiv:2605.17714