Online Meeting Recognition in Noisy Environments with Time-Frequency Mask Based MVDR Beamforming
- Shoko Araki
- Nobutaka Ito
- Marc Delcroix
- Atsunori Ogawa
- Keisuke Kinoshita
- Takuya Higuchi
- Takuya Yoshioka
- Dong Tran
- Shigeki Karita
- Tomohiro Nakatani
HSCMA 2017 | Published by IEEE
This paper presents our new online meeting recognition prototype, which works even in noisy environments. For speech enhancement, we employ a mask-based minimum variance distortionless response (MVDR) beamformer, which has recently been shown to be a successful front-end for state-of-the-art deep neural network (DNN)-based automatic speech recognition (ASR) systems. The steering vectors of an MVDR beamformer are estimated from time-frequency masks, and estimating these masks accurately and efficiently is challenging, especially in adverse environments. To address this, we employ a probabilistic spatial dictionary, which consists of a pre-trained probability distribution of source location features for each candidate speaker location. The dictionary significantly simplifies the estimation problem and therefore enables accurate and computationally efficient mask estimation. Our new prototype operated stably with a latency of 3-5 seconds in a real exhibition setting. We also report quantitative evaluation results.
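To make the dictionary idea concrete, the following is a minimal sketch of how a mask value could be read off a set of pre-trained distributions: for one time-frequency point, the posterior probability of each candidate location given the observed location feature serves as that location's mask. Everything specific here is an assumption for illustration and not taken from the paper: the scalar feature, the 1-D Gaussian distribution family, the uniform priors, the seat names, and the function names.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian; stands in for one pre-trained dictionary entry."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def location_posteriors(feature, dictionary, priors=None):
    """Posterior probability of each candidate location for one time-frequency point."""
    names = list(dictionary)
    if priors is None:
        # Assumption: every candidate location is equally likely a priori.
        priors = {name: 1.0 / len(names) for name in names}
    weighted = {
        name: priors[name] * gaussian_pdf(feature, *dictionary[name])
        for name in names
    }
    total = sum(weighted.values())
    if total == 0.0:
        # Feature is far from every entry: fall back to the priors.
        return dict(priors)
    return {name: value / total for name, value in weighted.items()}


# Hypothetical dictionary: (mean, std) of a scalar location feature per candidate.
SPATIAL_DICTIONARY = {
    "seat_left": (-1.0, 0.3),
    "seat_center": (0.0, 0.3),
    "seat_right": (1.0, 0.3),
}

# Mask values for one time-frequency point whose observed feature is -0.9.
masks = location_posteriors(-0.9, SPATIAL_DICTIONARY)
best = max(masks, key=masks.get)
```

Because the distributions are fixed in advance, each time-frequency point needs only a likelihood evaluation and a normalization per candidate, which is one plausible reading of why such a dictionary keeps mask estimation cheap. In a mask-based MVDR front-end, masks of this kind would then weight the observed signals when accumulating the spatial statistics from which the steering vectors are estimated; that step is not shown here.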