A Scalable Approach to Using DNN-Derived Features in GMM-HMM Based Acoustic Modeling for LVCSR
- Zhijie Yan
- Qiang Huo
- Jian Xu
14th Annual Conference of the International Speech Communication Association, InterSpeech 2013
We present a new scalable approach to using deep neural network (DNN) derived features in Gaussian mixture density hidden Markov model (GMM-HMM) based acoustic modeling for large vocabulary continuous speech recognition (LVCSR). The DNN-based feature extractor is trained on a subset of the training data to mitigate the scalability issue of DNN training, while the GMM-HMMs are trained with state-of-the-art scalable training methods and tools to leverage the whole training set. In a benchmark evaluation, we first trained a DNN on the 309-hour Switchboard-I (SWB) training data; used in a traditional DNN-HMM approach, it achieves a word error rate (WER) of 15.4% on the NIST 2000 Hub5 evaluation set. When the same DNN is used as a feature extractor and the 2,000-hour "SWB+Fisher" training data is used to train the GMM-HMMs, our DNN-GMM-HMM approach achieves a WER of 13.8%. With per-conversation-side unsupervised adaptation, the WER drops further to 13.1%.
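To put the three reported WERs in perspective, the relative reductions can be worked out directly from the figures above. The short sketch below is illustrative arithmetic only; the function and constant names are hypothetical and not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction, in percent, from baseline to new."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# WERs (%) on the NIST 2000 Hub5 evaluation set, as reported in the abstract.
# The constant names are illustrative labels, not identifiers from the paper.
DNN_HMM_WER = 15.4               # DNN-HMM, 309-hour SWB training data
DNN_GMM_HMM_WER = 13.8           # DNN features + GMM-HMMs, 2,000-hour SWB+Fisher
DNN_GMM_HMM_ADAPTED_WER = 13.1   # plus per-conversation-side unsupervised adaptation

if __name__ == "__main__":
    print(f"{relative_wer_reduction(DNN_HMM_WER, DNN_GMM_HMM_WER):.1f}")              # 10.4
    print(f"{relative_wer_reduction(DNN_GMM_HMM_WER, DNN_GMM_HMM_ADAPTED_WER):.1f}")  # 5.1
    print(f"{relative_wer_reduction(DNN_HMM_WER, DNN_GMM_HMM_ADAPTED_WER):.1f}")      # 14.9
```

That is, moving from the DNN-HMM baseline to the DNN-GMM-HMM system corresponds to roughly a 10.4% relative WER reduction, and adaptation brings the total to roughly 14.9% relative to the baseline.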