Towards Contextual Spelling Correction for Customization of End-to-End Speech Recognition Systems

  • Xiaoqiang Wang
  • Yanqing Liu
  • ,
  • Veljko Miljanic
  • Sheng Zhao
  • Hosam Khalil

IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol 30: pp. 3089-3097

Contextual biasing is an important and challenging
task for end-to-end automatic speech recognition (ASR) systems,
which aims to achieve better recognition performance by biasing
the ASR system to particular context phrases such as person
names, music lists, and proper nouns. Existing methods mainly
include contextual LM biasing and adding a bias encoder to
end-to-end ASR models. In this work, we introduce a novel
approach to contextual biasing that adds a contextual spelling
correction model on top of the end-to-end ASR system. We
incorporate contextual information into a sequence-to-sequence
spelling correction model with a shared context encoder. The proposed
model includes two different mechanisms: autoregressive
(AR) and non-autoregressive (NAR). We also propose filtering
algorithms to handle large-size context lists, and performance
balancing mechanisms to control the biasing degree of the model.
The proposed model is a general biasing solution which is
domain-insensitive and can be adopted in different scenarios.
Experiments show that the proposed method achieves as much
as 51% relative word error rate (WER) reduction over the ASR
system and outperforms traditional biasing methods. Compared
to the AR solution, the NAR model reduces model size by 43.2%
and speeds up inference by 2.1 times.
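
To make the headline metric concrete, the following is a minimal sketch of how a word error rate and a relative WER reduction are commonly computed. It is not the authors' evaluation code. The function names, the example utterance, and the WER values of 0.10 and 0.049 are illustrative assumptions, chosen only so that the arithmetic yields a 51% relative reduction.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def relative_wer_reduction(baseline_wer, new_wer):
    """Fraction of the baseline WER removed by the new system."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical utterance: two misrecognized name words out of five.
example_wer = wer("play songs by joan baez", "play songs by john bias")

# Hypothetical WERs (not taken from the paper): 10% before, 4.9% after.
example_reduction = relative_wer_reduction(0.10, 0.049)
```

Under these assumed numbers, `example_reduction` is about 0.51: a relative reduction measures the share of the baseline's errors that were removed, not the absolute drop in percentage points.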