Abstract:
End-to-end speech synthesis systems based on deep neural networks and trained on large-scale, high-quality corpora can already synthesize high-quality speech. However, because such corpora are limited in size for low-resource, non-mainstream languages, the performance of end-to-end speech synthesis for these languages still needs improvement. In Natural Language Processing (NLP), language models are trained in a weakly supervised manner on massive amounts of unlabeled text, and pre-trained language models have been shown to significantly improve many NLP tasks. We explore a method for improving the performance of an Indonesian end-to-end speech synthesis system with a pre-trained language model. First, we embed BERT pre-trained word vector information into the speech synthesis system using context-information concatenation and word-vector concatenation. We then further investigate the influence of encoder structure on synthesis performance. Finally, the speech synthesized by the different methods is evaluated both subjectively and objectively. Experimental results show that, compared with the baseline speech synthesis system, the proposed method improves performance by nearly 15% according to the objective evaluation, effectively reducing the Mel-cepstral distortion and the fundamental-frequency frame error rate of the synthesized speech. The mean opinion score is 4.15, versus 3.72 for the baseline synthesis system.
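The word-vector concatenation idea described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes word-level BERT vectors have been precomputed, and that a phoneme-to-word alignment (the number of phonemes per word, here a hypothetical `phonemes_per_word` list) is available so the word vectors can be upsampled to the phoneme time axis before being concatenated with the encoder outputs.

```python
import numpy as np

def upsample_word_vectors(word_vecs, phonemes_per_word):
    """Repeat each word-level vector once per phoneme of that word.

    word_vecs: (num_words, d_bert) array of precomputed BERT word vectors.
    phonemes_per_word: list of phoneme counts, one per word (assumed alignment).
    Returns an array of shape (total_phonemes, d_bert).
    """
    return np.repeat(word_vecs, phonemes_per_word, axis=0)

def concat_embeddings(encoder_out, word_vecs, phonemes_per_word):
    """Concatenate upsampled word vectors with phoneme encoder outputs.

    encoder_out: (total_phonemes, d_enc) phoneme-level encoder states.
    Returns an array of shape (total_phonemes, d_enc + d_bert).
    """
    upsampled = upsample_word_vectors(word_vecs, phonemes_per_word)
    # the upsampled word vectors must align one-to-one with encoder frames
    assert upsampled.shape[0] == encoder_out.shape[0]
    return np.concatenate([encoder_out, upsampled], axis=-1)

# Example: 3 words with 2, 3, and 2 phonemes (7 phoneme positions total),
# a 256-dim encoder and 768-dim BERT vectors (illustrative sizes).
enc = np.random.randn(7, 256)
bert = np.random.randn(3, 768)
combined = concat_embeddings(enc, bert, [2, 3, 2])
print(combined.shape)  # (7, 1024)
```

The concatenated representation would then feed the attention/decoder stage of the synthesizer in place of the plain encoder outputs.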