CNN with phonetic attention for text-independent speaker verification
- Tianyan Zhou,
- Yong Zhao,
- Jinyu Li,
- Yifan Gong,
- Jian Wu
IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
Text-independent speaker verification imposes no constraints on the spoken content and usually needs long observations to make reliable predictions. In this paper, we propose two speaker embedding approaches that integrate phonetic information into an attention-based residual convolutional neural network (CNN). Phonetic features are extracted from the bottleneck layer of a pretrained acoustic model. In implicit phonetic attention (IPA), the phonetic features are projected by a transformation network into multi-channel feature maps and then concatenated with the raw acoustic features as the input to the CNN. In explicit phonetic attention (EPA), the phonetic features are directly connected to the attentive pooling layer through a separate 1-dimensional CNN to generate the attention weights. By incorporating spoken content and the attention mechanism, the system can not only distill the speaker-discriminant frames but also actively normalize the phonetic variations. Multi-head attention and discriminative objectives are further studied to improve the system. Experiments on the VoxCeleb corpus show that the proposed system outperforms the state of the art by around 43% relative.
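As a rough illustration of the two mechanisms described in the abstract, the PyTorch sketch below shows one plausible reading of IPA (phonetic features projected into extra feature-map channels for the CNN input) and EPA (a separate 1-D CNN over the phonetic features producing attentive-pooling weights). All module names, layer sizes, and tensor layouts are assumptions made for illustration and are not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ImplicitPhoneticAttention(nn.Module):
    """IPA sketch: project phonetic bottleneck features into multi-channel
    feature maps and concatenate them with the acoustic features along the
    channel axis before the residual CNN (dimensions are assumptions)."""

    def __init__(self, phonetic_dim=256, feat_dim=64, extra_channels=1):
        super().__init__()
        # Transformation network: one linear projection per extra channel.
        self.project = nn.Linear(phonetic_dim, feat_dim * extra_channels)
        self.extra_channels = extra_channels
        self.feat_dim = feat_dim

    def forward(self, acoustic, phonetic):
        # acoustic: (batch, 1, time, feat_dim); phonetic: (batch, time, phonetic_dim)
        b, t, _ = phonetic.shape
        maps = self.project(phonetic).view(b, t, self.extra_channels, self.feat_dim)
        maps = maps.permute(0, 2, 1, 3)            # (batch, extra_channels, time, feat_dim)
        return torch.cat([acoustic, maps], dim=1)  # multi-channel input to the CNN


class ExplicitPhoneticAttention(nn.Module):
    """EPA sketch: a separate 1-D CNN over the phonetic features generates
    per-frame attention weights used to pool the frame-level speaker
    embeddings into an utterance-level embedding."""

    def __init__(self, phonetic_dim=256, hidden=128):
        super().__init__()
        self.score_net = nn.Sequential(
            nn.Conv1d(phonetic_dim, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, 1, kernel_size=1),
        )

    def forward(self, frame_embeddings, phonetic):
        # frame_embeddings: (batch, time, emb_dim); phonetic: (batch, time, phonetic_dim)
        scores = self.score_net(phonetic.transpose(1, 2)).squeeze(1)  # (batch, time)
        weights = F.softmax(scores, dim=-1).unsqueeze(-1)             # (batch, time, 1)
        return (weights * frame_embeddings).sum(dim=1)                # utterance embedding
```

In this reading, IPA lets the CNN itself learn how to use the phonetic information, while EPA uses it only to decide which frames the attentive pooling should emphasize; a multi-head variant would simply produce several weight vectors instead of one.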