A study on the generalization capability of acoustic models for robust speech recognition

IEEE Transactions on Audio, Speech and Language Processing

In statistical learning theory, good generalization capability refers to small performance degradation when a model is evaluated on unseen testing data drawn from the same distribution as the training data, i.e., the matched training-testing case. Recently, the soft-margin estimation (SME) method was proposed to improve the generalization capability of acoustic models for clean speech recognition and achieved notable success. In this paper, we study the generalization capability of acoustic models for robust speech recognition, where the training and testing data follow different distributions (i.e., the mismatched training-testing case). Our analysis of the effect of noise on the log-likelihood values of noisy speech features suggests that, even though the testing data are mismatched with the training data, better robustness can still be achieved by improving the acoustic model's generalization capability with SME. This is confirmed by our experimental study on the Aurora-2 and Aurora-3 tasks, where SME improves recognition performance significantly in both the matched and the low/medium-mismatch testing cases. However, the improvement in severely mismatched cases is relatively small. To alleviate the violation of the SME assumption that training and testing data share the same distribution, we apply mean and variance normalization (MVN) to the speech features prior to model training. Experiments show that when the training-testing mismatch is reduced, SME delivers larger performance improvements. We expect SME to further improve the robustness of speech recognition when combined with other robustness methods. Although this study is conducted on noisy speech recognition tasks, the method and findings in this paper make no assumption about the type of distortion and can be extended to other types of distortion in other machine learning applications.
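
To make the feature-processing step concrete, below is a minimal sketch of utterance-level mean and variance normalization. The function name normalize_features and the NumPy-based implementation are illustrative assumptions, not the paper's actual code; MVN here simply standardizes each feature dimension to zero mean and unit variance per utterance.

    import numpy as np

    def normalize_features(features, eps=1e-8):
        # features: (num_frames, num_dims) array of speech features
        # (e.g., MFCCs) for one utterance.
        mean = features.mean(axis=0, keepdims=True)  # per-dimension mean
        std = features.std(axis=0, keepdims=True)    # per-dimension std
        # Standardize each dimension to zero mean and unit variance;
        # eps guards against division by zero for constant dimensions.
        return (features - mean) / (std + eps)

Applying the same normalization to both training and testing utterances moves their feature distributions closer together, which is why MVN helps the SME same-distribution assumption hold more closely.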