WANG Yao, LONG Hua, SHAO Yu-bin, DU Qing-zhi. Multilingual recognition of short-time broadcast speech with variable duration[J]. Journal of Yunnan University: Natural Sciences Edition, 2022, 44(3): 490-496. DOI: 10.7540/j.ynu.20210232

Multilingual recognition of short-time broadcast speech with variable duration

Abstract: Language identification performance degrades sharply when utterances are very short and when training and test utterances differ in duration. To address this, a variable-duration language identification model for short broadcast speech (Variable Duration-Language Identification, VD-LID) is proposed. First, utterances of different durations are normalized to a common duration. Then, the log power spectrum envelope map of each normalized utterance is extracted as the language feature. Finally, the language features are fed into a residual neural network for classification. Experimental results show that, compared with conventional input features, the log power spectrum envelope map raises language identification accuracy on short utterances to 82.4%. Compared with a language identification model without the duration normalization layer, VD-LID improves accuracy by 27.9% and 37.7% for test utterances of 5 s and 10 s, respectively.
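The first two stages of the pipeline can be illustrated with a minimal NumPy sketch. The abstract does not say how duration normalization or the envelope is computed, so everything here is an assumption made for illustration: utterances are cropped or tiled to a fixed length, the log power spectrum uses 25 ms Hann-windowed frames with a 10 ms hop at an assumed 16 kHz sampling rate, and the envelope is obtained by cepstral liftering. The function names (`regularize_duration`, `log_power_spectrum`, `spectral_envelope`) are hypothetical, and the residual network classifier is omitted.

```python
import numpy as np

def regularize_duration(signal, target_len):
    """Crop or tile a 1-D signal to exactly target_len samples (assumed scheme)."""
    signal = np.asarray(signal, dtype=np.float64)
    if signal.size == 0:
        raise ValueError("signal must be non-empty")
    if signal.size >= target_len:
        return signal[:target_len]
    reps = -(-target_len // signal.size)  # ceiling division
    return np.tile(signal, reps)[:target_len]

def log_power_spectrum(signal, frame_len=400, hop=160, n_fft=512):
    """Frame the signal, apply a Hann window, and return log power per frame."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(frame_len)
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    return np.log(power + 1e-10)  # shape: (n_frames, n_fft // 2 + 1)

def spectral_envelope(log_power, n_keep=30):
    """Smooth each frame's log spectrum by keeping low-quefrency cepstral terms
    (assumed envelope method; requires n_keep >= 2)."""
    if n_keep < 2:
        raise ValueError("n_keep must be at least 2")
    cep = np.fft.irfft(log_power, axis=1)  # real, symmetric cepstrum per frame
    lifter = np.zeros(cep.shape[1])
    lifter[:n_keep] = 1.0
    lifter[-(n_keep - 1):] = 1.0  # mirror of indices 1..n_keep-1
    return np.fft.rfft(cep * lifter, axis=1).real

# Usage: a 1.2 s utterance at an assumed 16 kHz, normalized to 3 s.
rng = np.random.default_rng(0)
utterance = rng.standard_normal(19200)
fixed = regularize_duration(utterance, target_len=48000)
envelope_map = spectral_envelope(log_power_spectrum(fixed))
```

Under these assumptions every utterance, whatever its original duration, yields an envelope map of the same shape, which is what lets a fixed-input classifier such as a residual network handle variable-duration speech.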

     
