Exploring the Integration of Speech Separation and Recognition with Self-Supervised Learning Representation [paper]

Authors: Yoshiki Masuyama1*, Xuankai Chang2*, Wangyou Zhang3, Samuele Cornell4, Zhong-Qiu Wang2, Nobutaka Ono1, Yanmin Qian3, Shinji Watanabe2

Affiliations: 1Tokyo Metropolitan University, Japan; 2Carnegie Mellon University, USA; 3Shanghai Jiao Tong University, China; 4Università Politecnica delle Marche, Italy

Abstract: Neural speech separation has made remarkable progress, and its integration with automatic speech recognition (ASR) is an important step towards realizing multi-speaker ASR. This work investigates speech separation as an ASR front-end in reverberant and noisy-reverberant scenarios. Specifically, we explore two multi-channel separation methods, mask-based beamforming and complex spectral mapping, as well as the best features to feed the ASR back-end. We employ a recent self-supervised learning representation (SSLR) as the feature and show that it improves recognition performance over conventional filterbank features. To further improve multi-speaker recognition, we present a carefully designed training strategy for integrating speech separation and recognition with SSLR. The proposed integration of TF-GridNet-based complex spectral mapping and WavLM-based SSLR achieves a 2.5% word error rate on the reverberant WHAMR! test set, significantly outperforming an existing integration of mask-based MVDR beamforming and filterbank features (28.9%).
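
To make the training strategy concrete, below is a minimal PyTorch sketch of the pipeline: a separation front-end produces per-speaker waveforms, an SSLR model turns each estimate into features, the ASR back-end consumes those features, and the recognition loss is backpropagated through the whole stack during joint fine-tuning. All three classes (ToySeparator, ToySSLR, ToyASR) are hypothetical stand-ins written for this page; the paper itself builds on TF-GridNet [2], WavLM [3], and a joint CTC/attention ASR model [4], whose code is not reproduced here.

import torch
import torch.nn as nn

class ToySeparator(nn.Module):
    """Stand-in for the TF-GridNet front-end: mixture -> n_spk waveform estimates."""
    def __init__(self, n_spk=2):
        super().__init__()
        self.conv = nn.Conv1d(1, n_spk, kernel_size=3, padding=1)

    def forward(self, mixture):                       # (batch, samples)
        return self.conv(mixture.unsqueeze(1))        # (batch, n_spk, samples)

class ToySSLR(nn.Module):
    """Stand-in for WavLM: waveform -> frame-level SSLR feature sequence."""
    def __init__(self, dim=64, hop=320):
        super().__init__()
        self.proj = nn.Conv1d(1, dim, kernel_size=hop, stride=hop)

    def forward(self, wav):                           # (batch, samples)
        return self.proj(wav.unsqueeze(1)).transpose(1, 2)  # (batch, frames, dim)

class ToyASR(nn.Module):
    """Stand-in for the ASR back-end: features -> per-frame token logits."""
    def __init__(self, dim=64, vocab=32):
        super().__init__()
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, feats):                         # (batch, frames, dim)
        h, _ = self.rnn(feats)
        return self.out(h)                            # (batch, frames, vocab)

separator, sslr, asr = ToySeparator(), ToySSLR(), ToyASR()
ctc = nn.CTCLoss()  # toy loss; the paper's back-end combines CTC and attention [4]

mixture = torch.randn(2, 16000)             # batch of two 1-second 16 kHz mixtures
targets = torch.randint(1, 32, (2, 2, 8))   # (batch, n_spk, tokens), toy labels

sources = separator(mixture)                # (batch, n_spk, samples)
loss = 0.0
for spk in range(sources.shape[1]):
    # A real system must resolve the speaker permutation (e.g., via
    # permutation-invariant training); here the order is assumed fixed.
    feats = sslr(sources[:, spk])
    logp = asr(feats).log_softmax(-1).transpose(0, 1)   # (frames, batch, vocab)
    frame_len = torch.full((2,), logp.shape[0], dtype=torch.long)
    token_len = torch.full((2,), targets.shape[2], dtype=torch.long)
    loss = loss + ctc(logp, targets[:, spk], frame_len, token_len)
loss.backward()  # joint fine-tuning: gradients reach the SSLR and the separator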

Accepted to: IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, Mohonk Mountain House, New Paltz, NY, USA, Oct 22-25, 2023

Audio examples: Spatialized WSJ0-2mix anechoic examples, Spatialized WSJ0-2mix reverberant examples. In the predicted transcripts below, words in capital letters mark recognition errors; verbalized punctuation such as ",comma" and ".period" is part of the original WSJ transcription convention.

Spatialized WSJ0-2mix anechoic examples (440o0311_0.50963_423c0205_-0.50963.wav)

Input mixture (channel 0)


Oracle source spk1 (channel 0)
Oracle transcript for spk1: in certain cases ,comma the cards are given free to subscribers .period.


Oracle source spk2 (channel 0)
Oracle transcript for spk2: the other was mitsubishi motors corporation's u. s. sales operation.


TF-GridNet + WavLM ASR (w/o joint fine-tuning), spk1 (channel 0)
Predicted transcript for spk1: THE OTHER in certain cases ,comma the cards are given free to subscribers .period.


TF-GridNet + WavLM ASR (w/o joint fine-tuning), spk2 (channel 0)
Predicted transcript for spk2: the other was INSUPER motors corporation's u. s. sales operation TO SUBSCRIBERS .PERIOD .


TF-GridNet + WavLM ASR (w/ joint fine-tuning), spk1 (channel 0)
Predicted transcript for spk1: in certain cases ,comma the cards are given free to subscribers .period.


TF-GridNet + WavLM ASR (w/ joint fine-tuning), spk2 (channel 0)
Predicted transcript for spk2: the other was mitsubishi motors corporation's u. s. sales operation.


Spatialized WSJ0-2mix reverberant examples (050a0511_2.4737_22ga010g_-2.4737.wav)

Input mixture (channel 0)


Oracle source spk1 (channel 0)
Oracle transcript for spk1: the statue of liberty and ellis island are within the new jersey waters of new york bay.


Oracle source spk2 (channel 0)
Oracle transcript for spk2: the population lives by herding goats and sheep or by trading.


TF-GridNet + WavLM ASR (w/o joint fine-tuning), spk1 (channel 0)
Predicted transcript for spk1: the statue of liberty and L. S. island are within the new jersey waters of new york bay.


TF-GridNet + WavLM ASR (w/o joint fine-tuning), spk2 (channel 0)
Predicted transcript for spk2: the population lives by HURTING goats and sheep or by trading.


TF-GridNet + WavLM ASR (w/ joint fine-tuning), spk1 (channel 0)
Predicted transcript for spk1: the statue of liberty and ellis island are within the new jersey waters of new york bay.


TF-GridNet + WavLM ASR (w/ joint fine-tuning), spk2 (channel 0)
Predicted transcript for spk2: the population lives by herding goats and sheep or by trading.
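
The capitalized errors above map directly onto the word error rate (WER) metric quoted in the abstract. Below is a small sketch of the standard WER computation, word-level Levenshtein distance divided by the reference length, applied to the spk2 example above; the exact text normalization used in the paper's scoring pipeline is an assumption here.

def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    r, h = ref.lower().split(), hyp.lower().split()
    # Dynamic-programming edit distance over words (substitute/delete/insert).
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i][j] = min(d[i - 1][j - 1] + (r[i - 1] != h[j - 1]),  # substitute/match
                          d[i - 1][j] + 1,                           # delete
                          d[i][j - 1] + 1)                           # insert
    return d[len(r)][len(h)] / len(r)

ref = "the population lives by herding goats and sheep or by trading."
hyp = "the population lives by HURTING goats and sheep or by trading."
print(f"{wer(ref, hyp):.1%}")  # one substitution in 11 words -> 9.1%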


References:

[1] Z.-Q. Wang, J. Le Roux, and J. R. Hershey, "Multi-Channel Deep Clustering: Discriminative Spectral and Spatial Embeddings for Speaker-Independent Speech Separation," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 1-5. [paper]
[2] Z.-Q. Wang, S. Cornell, S. Choi, Y. Lee, B. Y. Kim, and S. Watanabe, "TF-GridNet: Integrating Full- and Sub-Band Modeling for Speech Separation," arXiv preprint arXiv:2211.12433, 2022. [paper]
[3] S. Chen et al., "WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing," IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505-1518, 2022. [paper]
[4] S. Kim, T. Hori, and S. Watanabe, "Joint CTC-attention based end-to-end speech recognition using multi-task learning," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 4835-4839. [paper]