Making robotic speech sound naturally conversational

Parshav Singla, Agnik Banerjee, Aaditya Arora, Shruti Aggarwal, Anil Kumar Verma, Vikram C M, Raj Prakash Gohil, Gopal Kumar Agarwal

Read speech, the kind produced by text-to-speech systems, lacks the variations in intonation, stress, and rhythm that make conversation sound natural. This work uses deep neural networks to analyze and modify prosodic features, and uses HiFi-GAN for high-quality waveform synthesis. Tested on multiple datasets and rated by listeners using the Mean Opinion Score (MOS), the method improves naturalness and accuracy over conventional approaches. It is intended for virtual assistants, customer service bots, and language learning tools, where computational efficiency matters.
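
As a point of reference for the Mean Opinion Score evaluation mentioned above, a MOS is typically the arithmetic mean of listener ratings on a 1-to-5 scale, often reported with a confidence interval. The following is a minimal sketch of that calculation, not the authors' evaluation code. The function name `mean_opinion_score`, the normal-approximation interval, and the ratings are all assumptions made for illustration; the ratings are not data from this work.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (MOS, half-width of an approximate 95% confidence interval).

    `ratings` is a list of listener scores on the usual 1-5 scale.
    The interval uses a normal approximation (1.96 * standard error),
    which is an assumption of this sketch.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    mos = statistics.mean(ratings)
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical listener ratings for illustration only (not from the paper).
baseline = [3, 3, 4, 2, 3, 4, 3, 3]
modified = [4, 4, 5, 3, 4, 4, 5, 4]

for name, ratings in (("baseline", baseline), ("modified", modified)):
    mos, hw = mean_opinion_score(ratings)
    print(f"{name}: MOS = {mos:.2f} +/- {hw:.2f}")
```

Comparing two systems this way only indicates a difference when the intervals are reasonably separated; a real listening test would also control for listener and utterance effects.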