cs.LG

Teaching molecules a chemical language with atomic semantics

Takayuki Kimura

May 16, 2026

Molecular AI currently relies on syntactic tokenizations like SMILES that don't capture chemical meaning. VQ-Atom uses graph neural network embeddings and vector quantization to assign atoms to discrete tokens representing their local chemical environments, creating a semantically grounded molecular language for Transformers. Tested on protein-ligand interaction prediction without 3D structures, VQ-Atom outperforms conventional tokenization approaches, demonstrating that token design fundamentally shapes how well language models capture molecular properties.
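The quantization step described above can be illustrated with a minimal sketch. It is not the paper's implementation: the embeddings and codebook below are made-up 2-D toy values standing in for graph neural network outputs and a learned codebook, and the names `quantize_atoms` and `squared_distance` are hypothetical. It only shows the general idea of vector quantization, where each atom embedding is mapped to the index of its nearest codebook vector and that index serves as the atom's discrete token.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize_atoms(
    atom_embeddings: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> List[int]:
    """Assign each atom embedding the index of its nearest codebook vector.

    Hypothetical helper: in a real system the embeddings would come from a
    GNN and the codebook would be learned; here both are supplied by hand.
    """
    token_ids = []
    for embedding in atom_embeddings:
        nearest = min(
            range(len(codebook)),
            key=lambda k: squared_distance(embedding, codebook[k]),
        )
        token_ids.append(nearest)
    return token_ids


# Toy values (assumptions for illustration only): three codebook entries,
# each standing in for one "local chemical environment".
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]

# Four atom embeddings, one per atom in an imaginary molecule.
atom_embeddings = [[0.1, -0.1], [0.9, 1.2], [-0.8, 0.7], [0.2, 0.1]]

token_ids = quantize_atoms(atom_embeddings, codebook)
print(token_ids)
```

In this toy setup the first and fourth atoms fall closest to the same codebook entry and so share a token, which is the sense in which a token can stand for a local environment rather than a character in a string. The resulting token sequence is what a Transformer would consume in place of SMILES characters.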
Published as "Atoms as Language: VQ-Atom: Semantic Discretization for Molecular Representation Learning", arXiv:2605.16823