Speech-to-text translation that works in real time

Current speech-to-text translation systems rely on separate speech recognition and translation modules, which introduces cascading errors and added latency. This work builds a single end-to-end SpeechLLM that operates in genuine streaming mode: the LLM itself decides when it has heard enough audio to emit the next translation token, rather than waiting for a complete utterance or emitting output at fixed intervals. The model is trained using automatic alignments between speech and text. On multiple language pairs, the system achieves translation quality comparable to non-streaming baselines while keeping latency at 1–2 seconds, making it practical for real-time applications.
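
The model-driven read/write decision described above can be illustrated with a minimal sketch. It is not the system's implementation: `stream_translate`, `make_toy_policy`, and the chunk and token values are hypothetical names and data chosen for illustration, and the toy policy simply looks up a hand-written alignment where a trained SpeechLLM would make the decision itself. The sketch also assumes that every token becomes emittable before the audio ends, so it has no end-of-stream flush.

```python
from typing import Callable, Iterable, Iterator, List, Optional, Tuple

# A policy sees the audio heard so far and the tokens emitted so far.
# It returns the next token to write, or None to wait for more audio.
Policy = Callable[[List[str], List[str]], Optional[str]]


def stream_translate(
    audio_chunks: Iterable[str], policy: Policy
) -> Iterator[Tuple[int, str]]:
    """Yield (chunks_heard, token) pairs as soon as the policy writes them."""
    heard: List[str] = []
    emitted: List[str] = []
    for chunk in audio_chunks:
        heard.append(chunk)
        # After each new chunk, let the policy write as many tokens as it
        # wants before the loop reads more audio.
        while True:
            token = policy(heard, emitted)
            if token is None:  # policy chose to wait for more audio
                break
            emitted.append(token)
            yield len(heard), token


def make_toy_policy(aligned: List[Tuple[str, int]]) -> Policy:
    """Hypothetical stand-in for the model.

    `aligned` pairs each target token with the number of audio chunks that
    must be heard before it may be emitted (an assumed alignment format).
    """
    def policy(heard: List[str], emitted: List[str]) -> Optional[str]:
        i = len(emitted)
        if i < len(aligned) and len(heard) >= aligned[i][1]:
            return aligned[i][0]
        return None
    return policy


# Illustrative data only: four audio chunks and three aligned target tokens.
aligned = [("Guten", 1), ("Morgen", 2), ("zusammen", 4)]
chunks = ["c0", "c1", "c2", "c3"]
result = list(stream_translate(chunks, make_toy_policy(aligned)))
print(result)
```

In this toy run the first two tokens are emitted as soon as their chunks arrive, while the third waits until the fourth chunk. That is the behaviour that distinguishes a content-dependent policy from emitting at fixed intervals.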