Tacotron: Towards End-to-End Speech Synthesis

Yuxuan Wang; RJ Skerry-Ryan; Daisy Stanton; Yonghui Wu; Ron J. Weiss; Navdeep Jaitly; Zongheng Yang; Ying Xiao; Zhifeng Chen; Samy Bengio; Quoc Khai Le; Yannis Agiomyrgiannakis; Rob Clark; Rif A. Saurous

doi:10.21437/interspeech.2017-1452

Affiliation (all authors): Google (United States)

August 16, 2017

Abstract

A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. Building these components often requires extensive domain expertise and may contain brittle design choices. In this paper, we present Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters. Given <text, audio> pairs, the model can be trained completely from scratch with random initialization. We present several key techniques to make the sequence-to-sequence framework perform well for this challenging task. Tacotron achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness. In addition, since Tacotron generates speech at the frame level, it's substantially faster than sample-level autoregressive methods.
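The speed claim in the last sentence can be illustrated by counting sequential generation steps. The sketch below is not taken from the paper: the function name `autoregressive_steps` and the utterance length, sample rate, and frame shift are hypothetical values chosen for illustration. Step count is only a rough proxy for speed, since the cost of each step also differs between models.

```python
def autoregressive_steps(duration_s, sample_rate_hz, frame_shift_ms):
    """Return (sample_level_steps, frame_level_steps) for one utterance.

    A sample-level model emits one waveform sample per sequential step;
    a frame-level model emits one acoustic frame per sequential step.
    """
    sample_steps = round(duration_s * sample_rate_hz)
    frame_steps = round(duration_s * 1000 / frame_shift_ms)
    return sample_steps, frame_steps


# Hypothetical settings (assumptions, not stated in the abstract):
# a 2-second utterance, 24 kHz audio, one frame every 12.5 ms.
sample_steps, frame_steps = autoregressive_steps(2.0, 24_000, 12.5)
print(f"sample-level steps: {sample_steps}")
print(f"frame-level steps:  {frame_steps}")
print(f"ratio:              {sample_steps // frame_steps}x fewer sequential steps")
```

Under these assumed settings a frame-level decoder takes a few hundred times fewer sequential steps than a sample-level one for the same audio duration.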
