Exploring pre-training with alignments for RNN transducer based end-to-end speech recognition

ICASSP

Recently, the recurrent neural network transducer (RNN-T) architecture has become an emerging trend in end-to-end automatic speech recognition research, owing to its suitability for online streaming speech recognition. However, RNN-T training is difficult because of its large memory requirements and complicated neural structure. A common way to ease RNN-T training is to employ a connectionist temporal classification (CTC) model together with an RNN language model (RNNLM) to initialize the RNN-T parameters. In this work, we instead leverage external alignments to seed the RNN-T model. Two pre-training solutions are explored, referred to as encoder pre-training and whole-network pre-training, respectively. Evaluated on 65,000 hours of Microsoft anonymized production data with personally identifiable information removed, our proposed methods obtain significant improvements. In particular, the encoder pre-training solution achieved 10% and 8% relative word error rate reductions compared with random initialization and the widely used CTC+RNNLM initialization strategy, respectively. Our solutions also significantly reduce the RNN-T model latency relative to the baseline.
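To make the encoder pre-training idea concrete, below is a minimal sketch assuming a PyTorch LSTM encoder that is first trained as a frame-level classifier against external alignment labels and then copied into an otherwise randomly initialized RNN-T encoder. The layer sizes, label inventory, loss, and toy data are illustrative assumptions, not details taken from the abstract.

```python
# Sketch: pre-train an acoustic encoder on frame-level alignments, then
# transfer its weights into the RNN-T encoder. All dimensions are hypothetical.
import torch
import torch.nn as nn

NUM_LABELS = 4000   # hypothetical output-unit inventory (e.g., word pieces)
FEAT_DIM = 80       # hypothetical acoustic feature dimension
HIDDEN = 512        # hypothetical encoder width

class Encoder(nn.Module):
    """Acoustic encoder shared by the pre-training stage and the RNN-T model."""
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(FEAT_DIM, HIDDEN, num_layers=2, batch_first=True)

    def forward(self, feats):
        out, _ = self.lstm(feats)
        return out  # (batch, time, HIDDEN)

# Stage 1: train the encoder as a frame classifier on external alignments.
encoder = Encoder()
frame_classifier = nn.Linear(HIDDEN, NUM_LABELS)
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(frame_classifier.parameters()), lr=1e-4
)
ce_loss = nn.CrossEntropyLoss()

feats = torch.randn(8, 200, FEAT_DIM)           # toy batch of acoustic features
align = torch.randint(0, NUM_LABELS, (8, 200))  # toy frame-level alignment labels

optimizer.zero_grad()
logits = frame_classifier(encoder(feats))       # (batch, time, NUM_LABELS)
loss = ce_loss(logits.reshape(-1, NUM_LABELS), align.reshape(-1))
loss.backward()
optimizer.step()

# Stage 2: seed the RNN-T encoder with the pre-trained weights; in this sketch
# the prediction and joint networks would start from random initialization.
rnnt_encoder = Encoder()
rnnt_encoder.load_state_dict(encoder.state_dict())
```

In this sketch the alignment labels supply per-frame supervision, so the encoder learns acoustic-to-label mappings before RNN-T training begins; whole-network pre-training would instead initialize all RNN-T components before the final transducer-loss training stage.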