Abstract:
To improve the accuracy of language recognition at low signal-to-noise ratios, a new feature extraction and fusion method is introduced that places voiced-segment detection at the front end of feature extraction. Based on a model of human auditory perception, Gammatone Frequency Cepstral Coefficients (GFCC) are extracted as feature parameters, compressed and de-noised by principal component analysis, and fused with Teager-energy-operator cepstral parameters for each voiced segment. Language recognition experiments with a Gaussian mixture model-universal background model (GMM-UBM) show that the method based on the fused feature set improves recognition accuracy for the five languages by 23.7% to 34.0% at signal-to-noise ratios from −5 dB to 0 dB, compared with the method based on log Mel-scale filter-bank energy features, and also yields significant gains at other signal-to-noise levels.
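As context for the Teager-energy-operator cepstral parameters mentioned above, the sketch below shows the core discrete Teager energy operator, ψ[x(n)] = x(n)² − x(n−1)·x(n+1), which the paper's fused features build on. This is a minimal NumPy illustration of the operator only, not the authors' full cepstrum pipeline; the function name and signal are illustrative.

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager energy operator:
    psi[x(n)] = x(n)^2 - x(n-1) * x(n+1).
    It tracks instantaneous signal energy and is often more
    noise-robust than squared amplitude for voiced-speech analysis."""
    x = np.asarray(x, dtype=float)
    # Valid only for interior samples n = 1 .. N-2.
    return x[1:-1] ** 2 - x[:-2] * x[2:]

# Example: for a pure tone A*sin(w*n), the operator is exactly
# constant and equal to (A*sin(w))^2, by a trigonometric identity.
n = np.arange(1000)
tone = np.sin(0.1 * np.pi * n)
psi = teager_energy(tone)
```

For a constant-frequency tone, `psi` is flat at sin²(0.1π) ≈ 0.0955; on real speech it rises sharply in voiced segments, which is what makes it useful for the voiced-segment detection stage.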