VatLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning
- Qiushi Zhu ,
- Long Zhou ,
- Ziqiang Zhang ,
- Shujie Liu ,
- Binxing Jiao ,
- Jie Zhang ,
- Lirong Dai ,
- Daxin Jiang ,
- Jinyu Li ,
- Furu Wei
IEEE Transactions on Multimedia
Although speech is a simple and effective way for humans to communicate with the outside world, realistic speech interaction also involves multimodal information, e.g., vision and text.
How to design a unified framework to integrate different modalities and leverage different resources (e.g., visual-audio pairs, audio-text pairs, unlabeled speech, and unlabeled text) to facilitate speech representation learning has not been well explored.
In this paper, we propose VatLM (Visual-Audio-Text Language Model), a unified cross-modal representation learning framework.
The proposed VatLM employs a unified backbone network to model the modality-independent information and utilizes three simple modality-dependent modules to preprocess visual, speech, and text inputs.
In order to integrate these three modalities into one shared semantic space, VatLM is optimized with a masked prediction task on unified tokens produced by our proposed unified tokenizer.
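To make the pre-training setup concrete, the following is a minimal sketch (not the official VatLM code) of the idea described above: three lightweight modality-dependent frontends map visual, audio, and text inputs into a shared embedding space, a single shared Transformer backbone models the modality-independent information, and a linear head predicts unified token IDs at masked positions. All module names, input dimensions, and the vocabulary size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class UnifiedMaskedPredictionModel(nn.Module):
    def __init__(self, d_model=768, n_layers=12, n_heads=12, unified_vocab=1000):
        super().__init__()
        # Modality-dependent frontends (simplified placeholders).
        self.visual_frontend = nn.Linear(512, d_model)      # e.g., lip-ROI features
        self.audio_frontend = nn.Linear(104, d_model)       # e.g., filterbank frames
        self.text_frontend = nn.Embedding(30000, d_model)   # e.g., phoneme/subword IDs
        self.mask_emb = nn.Parameter(torch.zeros(d_model))
        # Shared, modality-independent Transformer backbone.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        # Prediction head over the unified tokenizer's vocabulary.
        self.head = nn.Linear(d_model, unified_vocab)

    def forward(self, feats, modality, mask):
        # feats: (B, T, F) floats for visual/audio, or (B, T) int IDs for text
        # mask:  (B, T) bool, True at positions to mask and predict
        if modality == "visual":
            x = self.visual_frontend(feats)
        elif modality == "audio":
            x = self.audio_frontend(feats)
        else:
            x = self.text_frontend(feats)
        # Replace masked positions with a learned mask embedding.
        x = torch.where(mask.unsqueeze(-1), self.mask_emb.expand_as(x), x)
        h = self.backbone(x)
        return self.head(h)  # logits over unified token IDs


# Masked-prediction loss on one hypothetical audio batch; the targets stand in
# for unified tokens that the shared tokenizer would assign to each frame.
model = UnifiedMaskedPredictionModel()
audio = torch.randn(2, 50, 104)
mask = torch.rand(2, 50) < 0.3
targets = torch.randint(0, 1000, (2, 50))
logits = model(audio, "audio", mask)
loss = nn.functional.cross_entropy(logits[mask], targets[mask])
```

Because all three modalities are trained to predict the same unified token targets through the same backbone, their representations are pushed toward one shared semantic space, which is the property the abstract's analysis refers to.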
We evaluate the pre-trained VatLM on audio-visual downstream tasks, including audio-visual speech recognition (AVSR) and visual speech recognition (VSR).
Results show that the proposed VatLM outperforms previous state-of-the-art models, such as the audio-visual pre-trained AV-HuBERT model, and the analysis also demonstrates that VatLM is capable of aligning different modalities into the same space.