Spectro-Temporal Modelling with Time-Frequency LSTM and Structured Output Layer for Voice Conversion
Refereed conference paper presented and published in conference proceedings


Other information
Abstract: From speech, speaker identity is largely characterized by the spectro-temporal structure of the spectrum. Although recent research has demonstrated the effectiveness of long short-term memory (LSTM) recurrent neural networks (RNNs) in voice conversion, conventional LSTM-RNN approaches model only the temporal evolution of speech features. In this paper, we improve the conventional LSTM-RNN method for voice conversion by employing a two-dimensional time-frequency LSTM (TFLSTM) to model spectro-temporal warping along both the time and frequency axes. A multi-task learned structured output layer (SOL) is then adopted to capture the dependencies between spectral and pitch parameters, with spectral parameter targets conditioned on the pitch parameter predictions. Experimental results show that the proposed approach outperforms conventional systems in both speech quality and speaker similarity.
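To illustrate the two ideas named in the abstract, the following is a minimal NumPy sketch, not the authors' implementation: a "frequency-then-time" LSTM scan standing in for the TFLSTM, and an output layer in which the spectral head is conditioned on the pitch head's prediction, standing in for the SOL. All dimensions, weight names (`Wf`, `Wt`, `Wp`, `Ws`), and the random toy input are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def lstm_step(x, h, c, W):
    """One LSTM cell step; W maps [x; h] to the four gate pre-activations."""
    z = W @ np.concatenate([x, h])
    i, f, g, o = np.split(z, 4)
    sig = lambda v: 1.0 / (1.0 + np.exp(-v))
    c = sig(f) * c + sig(i) * np.tanh(g)
    h = sig(o) * np.tanh(c)
    return h, c

def tflstm(spec, hidden=8):
    """TFLSTM sketch: scan each frame along the frequency axis first,
    then scan the per-frame summaries along the time axis."""
    T, F = spec.shape
    Wf = rng.standard_normal((4 * hidden, 1 + hidden)) * 0.1       # frequency-axis cell
    Wt = rng.standard_normal((4 * hidden, hidden + hidden)) * 0.1  # time-axis cell
    freq_summaries = []
    for t in range(T):
        h, c = np.zeros(hidden), np.zeros(hidden)
        for f in range(F):                      # frequency-axis scan within frame t
            h, c = lstm_step(spec[t, f:f + 1], h, c, Wf)
        freq_summaries.append(h)
    outputs, (h, c) = [], (np.zeros(hidden), np.zeros(hidden))
    for t in range(T):                          # time-axis scan over frame summaries
        h, c = lstm_step(freq_summaries[t], h, c, Wt)
        outputs.append(h)
    return np.stack(outputs)

spec = rng.standard_normal((5, 12))             # toy input: 5 frames x 12 frequency bins
H = tflstm(spec)

# SOL sketch: the spectral head sees the hidden state *and* the pitch prediction,
# so spectral targets are conditioned on the pitch parameter estimate.
Wp = rng.standard_normal((1, 8)) * 0.1          # pitch regression head
Ws = rng.standard_normal((12, 8 + 1)) * 0.1     # spectral head: hidden + pitch input
pitch = H @ Wp.T
spectral = np.concatenate([H, pitch], axis=1) @ Ws.T
print(H.shape, pitch.shape, spectral.shape)
```

In the paper both heads are trained jointly (multi-task learning); this sketch only shows the forward-pass dependency of the spectral output on the pitch output.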
All Author(s) List: Runnan LI, Zhiyong WU, Yishuang NING, Lifa SUN, Helen MENG, Lianhong CAI
Name of Conference: 18th Annual Conference of the International Speech Communication Association, INTERSPEECH 2017
Start Date of Conference: 20/08/2017
End Date of Conference: 24/08/2017
Place of Conference: Stockholm
Country/Region of Conference: Sweden
Proceedings Title: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publisher: International Speech Communication Association (ISCA)
Pages: 3409–3413
Languages: English (United States)
Keywords: voice conversion, time-frequency long short-term memory (TFLSTM), structured output layer (SOL)

Last updated on 2021-01-19 at 02:26