Cross-speaker encoding network for multi-talker speech recognition
Refereed conference paper presented and published in conference proceedings
Other information
Abstract: End-to-end multi-talker speech recognition has garnered great interest as an effective approach to directly transcribe overlapped speech from multiple speakers. Current methods typically adopt either 1) single-input multiple-output (SIMO) models with a branched encoder, or 2) single-input single-output (SISO) models based on an attention-based encoder-decoder architecture with serialized output training (SOT). In this work, we propose a Cross-Speaker Encoding (CSE) network to address the limitations of SIMO models by aggregating cross-speaker representations. Furthermore, the CSE model is integrated with SOT to leverage the advantages of both SIMO and SISO while mitigating their drawbacks. To the best of our knowledge, this work represents an early effort to integrate SIMO and SISO for multi-talker speech recognition. Experiments on the two-speaker LibrispeechMix dataset show that the CSE model reduces word error rate (WER) by 8% over the SIMO baseline. The CSE-SOT model reduces WER by 10% overall and by 16% on high-overlap speech compared to the SOT model.
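The abstract describes cross-speaker encoding only at a high level, and this record does not reproduce the paper's architecture. Purely as an illustrative sketch of the general idea (all function names, the residual-fusion choice, and the dimensions below are assumptions, not the authors' implementation), aggregating cross-speaker representations can be pictured as each speaker branch attending over the other branch's frame-level encodings:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys_values):
    # scaled dot-product attention: queries come from one speaker
    # branch, keys/values from the other branch (hypothetical helper)
    d_k = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d_k)
    return softmax(scores) @ keys_values

def cross_speaker_encode(h1, h2):
    # each branch is enriched with information attended from the
    # other branch, fused here via a simple residual sum (an assumed
    # fusion scheme, chosen for illustration)
    z1 = h1 + cross_attend(h1, h2)
    z2 = h2 + cross_attend(h2, h1)
    return z1, z2

# toy frame-level encoder outputs for two speaker branches: (T, d)
rng = np.random.default_rng(0)
h1 = rng.standard_normal((50, 64))
h2 = rng.standard_normal((50, 64))
z1, z2 = cross_speaker_encode(h1, h2)
print(z1.shape, z2.shape)  # shapes are preserved: (50, 64) (50, 64)
```

The fused representations keep the per-branch shape, so downstream per-speaker decoders (or, as in the CSE-SOT combination, a single serialized decoder) can consume them in place of the original branch outputs.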
All Author(s) List: Jiawen Kang, Lingwei Meng, Mingyu Cui, Haohan Guo, Xixin Wu, Xunying Liu, Helen Meng
Name of Conference: 2024 IEEE International Conference on Acoustics, Speech and Signal Processing
Start Date of Conference: 14/04/2024
End Date of Conference: 19/04/2024
Place of Conference: Seoul
Country/Region of Conference: South Korea
Proceedings Title: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Year: 2024
Month: 4
Publisher: IEEE
Pages: 11986 - 11990
ISBN: 979-8-3503-4486-8
eISBN: 979-8-3503-4485-1
ISSN: 1520-6149
eISSN: 2379-190X
Languages: English-United States