cs.CL

Speech-to-text translation that works in real time

Titouan Parcollet, Shucong Zhang, Xianrui Zheng, Rogier C. van Dalen

May 14, 2026

Conventional speech-to-text translation systems chain separate speech recognition and translation modules, which lets errors cascade from one stage to the next and adds latency. This work builds a single end-to-end SpeechLLM that operates in genuine streaming mode: the LLM itself decides when it has heard enough audio to emit the next translation token, rather than waiting for a complete utterance or emitting output at fixed intervals. Training relies on automatic alignments between speech and text. On multiple language pairs, the system achieves translation quality comparable to non-streaming baselines while keeping latency at 1–2 seconds, making it practical for real-time applications.
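The wait-or-emit decision can be illustrated with a toy loop. This is only a sketch under stated assumptions, not the paper's implementation: `stream_translate`, `make_aligned_policy`, the `WAIT` sentinel, and the `chunks_needed` list are all hypothetical names, and the lookup-table policy merely stands in for the decision a trained SpeechLLM would make. The `chunks_needed` values play the role that speech-to-text alignments might play in deciding how much audio each target token requires.

```python
from typing import Callable, List, Optional, Sequence, Tuple

WAIT = None  # hypothetical sentinel: the policy asks for more audio

Policy = Callable[[List[str], List[str]], Optional[str]]


def stream_translate(
    chunks: Sequence[str], next_action: Policy, max_tokens: int = 50
) -> List[Tuple[int, str]]:
    """Interleave reading audio chunks with writing translation tokens.

    Returns (chunks_heard, token) pairs so each token's delay is visible.
    """
    heard: List[str] = []
    written: List[str] = []
    trace: List[Tuple[int, str]] = []
    pos = 0
    while len(written) < max_tokens:
        action = next_action(heard, written)
        if action is WAIT:
            if pos == len(chunks):
                break  # audio exhausted and the policy still waits: stop
            heard.append(chunks[pos])  # READ one more chunk
            pos += 1
        else:
            written.append(action)  # WRITE one token
            trace.append((len(heard), action))
    return trace


def make_aligned_policy(tokens: Sequence[str], chunks_needed: Sequence[int]) -> Policy:
    """Toy stand-in for the model: emit token i once enough chunks were heard."""

    def policy(heard: List[str], written: List[str]) -> Optional[str]:
        i = len(written)
        if i < len(tokens) and len(heard) >= chunks_needed[i]:
            return tokens[i]
        return WAIT

    return policy


# Illustrative data only: four audio chunks and a three-token translation.
policy = make_aligned_policy(["guten", "Morgen", "Welt"], [2, 2, 4])
trace = stream_translate(["c0", "c1", "c2", "c3"], policy)
print(trace)
```

In this toy run the first two tokens are written after two chunks and the last one only after all four, which is the behavior that separates an adaptive policy from one that emits at fixed intervals or waits for the full utterance.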
Published as "Streaming Speech-to-Text Translation with a SpeechLLM", arXiv:2605.14766