Teaching molecules a chemical language with atomic semantics
Takayuki Kimura
May 16, 2026
Molecular AI currently relies on syntactic tokenizations such as SMILES, which do not capture chemical meaning. VQ-Atom uses graph neural network embeddings and vector quantization to assign each atom a discrete token representing its local chemical environment, creating a semantically grounded molecular language for Transformers. Tested on protein-ligand interaction prediction without 3D structures, VQ-Atom outperforms conventional tokenization approaches, demonstrating that token design shapes how well language models capture molecular properties.
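The vector quantization step can be illustrated with a minimal sketch. It assumes per-atom embeddings have already been produced by a graph neural network and assigns each one to its nearest codebook vector by squared Euclidean distance. The function names, the two-dimensional toy embeddings, and the three-entry codebook are all hypothetical and are not taken from the paper; the actual VQ-Atom architecture, distance metric, and training procedure may differ.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(embeddings: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each atom embedding to the index of its nearest codebook vector.

    The returned indices play the role of discrete atom tokens.
    """
    tokens = []
    for emb in embeddings:
        distances = [squared_distance(emb, code) for code in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens


# Hypothetical codebook: each entry stands for one local chemical environment.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

# Hypothetical GNN outputs for a four-atom molecule (toy values, not real data).
atom_embeddings = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8], [1.1, 0.1]]

tokens = quantize(atom_embeddings, codebook)
print(tokens)
```

In this toy setup the second and fourth atoms land on the same codebook entry, so they receive the same token. That is the sense in which a token stands for a local environment rather than a character in a string; the resulting token sequence is what a Transformer would consume.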