Paper title
Towards Building Text-To-Speech Systems for the Next Billion Users
Authors
Abstract
Deep-learning-based text-to-speech (TTS) systems have been evolving rapidly with advances in model architectures, training methodologies, and generalization across speakers and languages. However, these advances have not been thoroughly investigated for Indian language speech synthesis. Such investigation is computationally expensive given the number and diversity of Indian languages, relatively lower resource availability, and the diverse set of advances in neural TTS that remain untested. In this paper, we evaluate the choice of acoustic models, vocoders, supplementary loss functions, training schedules, and speaker and language diversity for Dravidian and Indo-Aryan languages. Based on this, we identify monolingual models with FastPitch and HiFi-GAN V1, trained jointly on male and female speakers, as performing best. With this setup, we train and evaluate TTS models for 13 languages and find that our models significantly improve upon existing models in all languages, as measured by mean opinion scores. We open-source all models on the Bhashini platform.
