cs.CL

A smarter way to chop text into tokens

Craig W. Schmidt, Michael Krumdick, Adam Wiemerslage, Seth Ebner, Varshini Reddy, Yuval Pinter, Chris Tanner

May 21, 2026

Tokenization, the step that breaks text into chunks for language models to process, is a hidden bottleneck that wastes context length. This work introduces ToaST, which builds binary trees from character patterns, then greedily selects vocabulary tokens to minimize the total token count needed across all text. On English, it cuts token usage by 11% compared to industry-standard tokenizers (BPE, WordPiece) while improving 1.5B-parameter model scores by 2.6–7.6% across benchmarks.
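To make the idea of "split trees plus greedy vocabulary selection" more concrete, here is a toy sketch in Python. It is not the paper's algorithm: the midpoint split rule, the function names (`split_tree`, `tokenize`, `greedy_vocab`), and the tiny word-count corpus are all assumptions made for illustration, and the real method may build trees and score candidates quite differently. The sketch only shows the general shape: each word gets a binary tree of substrings, tokenization walks the tree top-down and emits the first node found in the vocabulary, and the vocabulary grows one token at a time by picking whichever candidate lowers the corpus-wide token count the most.

```python
from collections import Counter


def split_tree(word):
    """Build a binary tree over a non-empty word as nested (text, left, right) tuples.

    Assumption: split at the midpoint. This is a stand-in rule chosen for
    simplicity, not the rule used in the paper.
    """
    if len(word) == 1:
        return (word, None, None)  # leaf: a single character
    mid = len(word) // 2
    return (word, split_tree(word[:mid]), split_tree(word[mid:]))


def tokenize(tree, vocab):
    """Walk the tree top-down, emitting a node as soon as its text is in vocab."""
    text, left, right = tree
    if text in vocab or left is None:  # single characters are always emitted
        return [text]
    return tokenize(left, vocab) + tokenize(right, vocab)


def node_strings(tree):
    """Collect the text of every internal (multi-character) node."""
    text, left, right = tree
    if left is None:
        return set()
    return {text} | node_strings(left) | node_strings(right)


def total_tokens(trees, counts, vocab):
    """Token count over the whole corpus, weighting each word by its frequency."""
    return sum(counts[w] * len(tokenize(trees[w], vocab)) for w in counts)


def greedy_vocab(counts, extra_tokens):
    """Greedily add up to extra_tokens multi-character tokens to a character vocab."""
    trees = {w: split_tree(w) for w in counts}
    vocab = {ch for w in counts for ch in w}
    candidates = set().union(*(node_strings(t) for t in trees.values()))
    for _ in range(extra_tokens):
        current = total_tokens(trees, counts, vocab)
        remaining = sorted(candidates - vocab)  # sorted for deterministic ties
        if not remaining:
            break
        best = min(remaining, key=lambda c: total_tokens(trees, counts, vocab | {c}))
        if total_tokens(trees, counts, vocab | {best}) >= current:
            break  # no candidate reduces the count any further
        vocab.add(best)
    return vocab, trees


# Hypothetical toy corpus: word -> frequency.
counts = Counter({"the": 5, "then": 2, "hen": 1})
vocab, trees = greedy_vocab(counts, extra_tokens=2)
print(sorted(vocab))
print(total_tokens(trees, counts, vocab))
```

Re-scoring every candidate against the full corpus at each step is quadratic and only workable at toy scale; it is written this way to keep the objective (total token count) visible rather than to suggest how a production tokenizer would be trained.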
Published as Tokenization with Split Trees arXiv:2605.22705