ZHAO Li-xuan, YANG Jian. Indonesian speech synthesis based on BERT pre-trained language model[J]. Journal of Yunnan University: Natural Sciences Edition, 2021, 43(6): 1086-1095. DOI: 10.7540/j.ynu.20210053

Indonesian speech synthesis based on BERT pre-trained language model

  • Abstract: End-to-end speech synthesis systems based on deep neural networks and trained on large-scale, high-quality <text, audio> corpora can already synthesize high-quality speech. For low-resource, less commonly used languages, however, the limited size of the available corpora means that end-to-end synthesis performance still needs improvement. In recent years, the field of natural language processing (NLP) has succeeded in training language models in a weakly supervised manner on massive amounts of unlabeled text, and pre-trained language models such as BERT have been shown to significantly improve many NLP tasks. This paper explores methods for improving an Indonesian end-to-end speech synthesis system based on a pre-trained language model. First, BERT pre-trained word-vector information is embedded into the speech synthesis system using contextual-information concatenation and word-vector concatenation. Then, on this basis, the influence of the encoder structure on synthesis performance is investigated. Finally, the speech synthesized by each method described in the paper is evaluated both subjectively and objectively. Experimental results show that, compared with the baseline speech synthesis system, the new method improves system performance by nearly 15% in objective evaluation, effectively reducing the Mel-cepstral distortion and the fundamental frequency (F0) frame error rate of the synthesized speech; the subjective mean opinion score (MOS) reaches 4.15, well above the baseline system's 3.72.
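The two objective measures named in the abstract can be illustrated with a minimal sketch. The abstract does not state how the paper computes them, so the code below follows the commonly used textbook definitions and should be read as an assumption, not as the authors' implementation. The function names, the 20% pitch tolerance, and the convention that an F0 value of 0 marks an unvoiced frame are all hypothetical choices made for illustration; the inputs are assumed to be time-aligned frame by frame.

```python
import math


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Mean Mel-cepstral distortion in dB over time-aligned frames.

    Each frame is a sequence of mel-cepstral coefficients, assumed to have
    the 0th (energy) coefficient already removed.
    """
    if not ref_frames or len(ref_frames) != len(syn_frames):
        raise ValueError("inputs must be non-empty and time-aligned")
    # Common definition: (10 / ln 10) * sqrt(2 * sum of squared differences)
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        squared_diff = sum((r - s) ** 2 for r, s in zip(ref, syn))
        total += scale * math.sqrt(squared_diff)
    return total / len(ref_frames)


def f0_frame_error(ref_f0, syn_f0, tolerance=0.2):
    """Fraction of frames with a voicing error or a gross pitch error.

    An F0 of 0 marks an unvoiced frame (assumed convention). A gross pitch
    error is a voiced frame whose F0 deviates from the reference by more
    than `tolerance` (relative), here 20% by assumption.
    """
    if not ref_f0 or len(ref_f0) != len(syn_f0):
        raise ValueError("inputs must be non-empty and time-aligned")
    errors = 0
    for r, s in zip(ref_f0, syn_f0):
        ref_voiced, syn_voiced = r > 0, s > 0
        if ref_voiced != syn_voiced:
            errors += 1  # voicing decision error
        elif ref_voiced and abs(s - r) > tolerance * r:
            errors += 1  # gross pitch error
    return errors / len(ref_f0)
```

Lower values are better for both measures, which is why the abstract reports them as being reduced relative to the baseline.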

     
