A smarter way to chop text into tokens
Craig W. Schmidt, Michael Krumdick, Adam Wiemerslage, Seth Ebner, Varshini Reddy, Yuval Pinter, Chris Tanner
May 21, 2026
Tokenization, the step that breaks text into chunks for a language model to process, is a hidden bottleneck that wastes context length. This work introduces ToaST, which builds binary trees from character patterns, then greedily selects vocabulary tokens to minimize the total token count needed across all the text. On English, it cuts token usage by 11% compared with the industry-standard tokenizers (BPE, WordPiece) while improving the scores of 1.5B-parameter models by 2.6–7.6% across benchmarks.
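To make the idea of greedily choosing tokens to minimize total token count concrete, here is a toy sketch in Python. It is not the ToaST algorithm: it skips the binary-tree construction entirely, uses a plain longest-match segmenter, and brute-forces every substring as a candidate. The function names (`count_tokens`, `total_tokens`, `greedy_vocab`) and the tiny word-frequency table are assumptions made up for illustration.

```python
from collections import Counter


def count_tokens(word, vocab):
    """Tokens needed for `word` under longest-match segmentation.

    Assumption for this sketch: single characters are always available,
    so every word can be segmented even with an empty vocabulary.
    """
    i, n = 0, 0
    while i < len(word):
        # Try the longest piece starting at i, falling back to one character.
        for j in range(len(word), i, -1):
            if j - i == 1 or word[i:j] in vocab:
                i = j
                break
        n += 1
    return n


def total_tokens(word_counts, vocab):
    """Frequency-weighted token count over the whole (toy) corpus."""
    return sum(c * count_tokens(w, vocab) for w, c in word_counts.items())


def greedy_vocab(word_counts, budget):
    """Add up to `budget` multi-character tokens, one at a time.

    Each round adds the candidate substring that lowers the total token
    count the most, and stops early if no candidate helps.
    """
    candidates = {
        w[i:j]
        for w in word_counts
        for i in range(len(w))
        for j in range(i + 2, len(w) + 1)
    }
    vocab = set()
    best_total = total_tokens(word_counts, vocab)
    for _ in range(budget):
        best = None
        for cand in sorted(candidates - vocab):
            t = total_tokens(word_counts, vocab | {cand})
            if t < best_total:
                best_total, best = t, cand
        if best is None:
            break
        vocab.add(best)
    return vocab, best_total


# Hypothetical word frequencies, invented for this example.
word_counts = Counter({"token": 3, "tokens": 2, "tokenize": 1})
print(total_tokens(word_counts, set()))   # 35 (characters only)
print(greedy_vocab(word_counts, 2))       # two added tokens bring the total to 8
```

On this toy data the character-only baseline needs 35 tokens, and adding just two multi-character tokens brings the total down to 8. The brute-force search rescans the corpus for every candidate, so it only illustrates the objective; a practical method needs a far more efficient way to score candidates.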