A smarter way to chop text into tokens

Craig W. Schmidt, Michael Krumdick, Adam Wiemerslage, Seth Ebner, Varshini Reddy, Yuval Pinter, Chris Tanner

Tokenization, the step that breaks text into chunks for a language model to process, is a hidden bottleneck: the more tokens a text needs, the more context length it consumes. This work introduces ToaST, which builds binary trees from character patterns and then greedily selects vocabulary tokens to minimize the total token count across all text. On English, it uses 11% fewer tokens than the industry-standard tokenizers (BPE, WordPiece) while improving 1.5B-parameter model scores by 2.6–7.6% across benchmarks.
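To make the "tree, then greedy selection" idea concrete, here is a toy sketch in Python. It is not the authors' implementation. The summary does not say how ToaST builds its trees or scores candidates, so the midpoint split in `build_tree`, the brute-force search in `greedy_vocab`, and every function name here are assumptions made purely for illustration. Single characters are treated as always available, and each round adds the one multi-character string that lowers the frequency-weighted token count the most.

```python
def build_tree(s):
    """Binary tree over a word as (string, left, right); leaves have no children.

    Assumption: split at the midpoint. The real splitting rule is not given
    in the summary above.
    """
    if len(s) <= 1:
        return (s, None, None)
    mid = len(s) // 2
    return (s, build_tree(s[:mid]), build_tree(s[mid:]))


def count_tokens(node, vocab):
    """Tokens needed for one tree: emit a node if it is in the vocabulary
    (or is a leaf), otherwise descend into both children."""
    s, left, right = node
    if left is None or s in vocab:
        return 1
    return count_tokens(left, vocab) + count_tokens(right, vocab)


def node_strings(node):
    """Yield every multi-character string that appears as a tree node."""
    s, left, right = node
    if left is None:
        return
    yield s
    yield from node_strings(left)
    yield from node_strings(right)


def total_tokens(trees, vocab):
    """Frequency-weighted token count over all words."""
    return sum(freq * count_tokens(tree, vocab) for tree, freq in trees)


def greedy_vocab(word_counts, budget):
    """Greedily add up to `budget` tokens, each time picking the candidate
    that reduces the total token count the most (brute force, toy scale)."""
    trees = [(build_tree(w), c) for w, c in word_counts.items()]
    candidates = {s for tree, _ in trees for s in node_strings(tree)}
    vocab = set()
    for _ in range(budget):
        best, best_total = None, total_tokens(trees, vocab)
        for cand in sorted(candidates - vocab):
            t = total_tokens(trees, vocab | {cand})
            if t < best_total:
                best, best_total = cand, t
        if best is None:  # no candidate helps any more
            break
        vocab.add(best)
    return vocab, total_tokens(trees, vocab)
```

For example, `greedy_vocab({"abab": 3, "ab": 2}, 2)` starts from 16 character-level tokens, picks "abab" first (total 7), then "ab" (total 5). The brute-force loop recounts the whole corpus for every candidate, which is fine for a toy but would need incremental bookkeeping at realistic scale.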