Attention-based Recurrent Generator with Gaussian Tolerance for Statistical Parametric Speech Synthesis
Refereed conference paper presented and published in conference proceedings

Other information
Abstract: Conventional statistical parametric speech synthesis (SPSS) generates frame-level acoustic features in two separately optimized steps, namely duration prediction and acoustic feature generation. It also incorporates a conditional independence assumption to generate independent output frames given textual inputs. Both factors constrain the quality of the generated speech output. This work proposes to apply the attention-based recurrent generator (ARG) with Gaussian Tolerance (GT) for SPSS, where duration prediction and acoustic feature generation are jointly optimized with an attention mechanism, and the dependency across output frames is modeled by acoustic feature generation conditioned on preceding frames. GT is introduced to train ARG to acquire robustness based on previous output frames with errors. Perceptual experiments comparing the naturalness between ARG and the conventional hidden Markov model show a gain in mean opinion score (MOS) and the effectiveness of GT.
Index Terms: Statistical parametric speech synthesis, Attention mechanism, Sequence to sequence, Joint optimization.
All Author(s) List: Xixin WU, Shiyin KANG, Lifa SUN, Yishuang NING, Zhiyong WU, Helen MENG
Name of Conference: The 3rd International Workshop on Affective Social Multimedia Computing (ASMMC) 2017
Start Date of Conference: 25/08/2017
End Date of Conference: 25/08/2017
Place of Conference: Stockholm
Country/Region of Conference: Sweden
Languages: English-United States

Last updated on 2018-11-05 at 12:33