Google's Neural Machine Translation System: Bridging the Gap between Human and Machine TranslationYonghui Wu, Mike Schuster, Zhifeng Chen et al.|arXiv (Cornell University)|2016 Neural Machine Translation (NMT) is an end-to-end learning approach for automated translation, with the potential to overcome many of the weaknesses of conventional phrase-based translation systems. Unfortunately, NMT systems are known to be computationally expensive both in training and in translation inference. Also, most NMT systems have difficulty with rare words. These issues have hindered NMT's use in practical deployments and services, where both accuracy and speed are essential. In this work, we present GNMT, Google's Neural Machine Translation system, which attempts to address many of these issues. Our model consists of a deep LSTM network with 8 encoder and 8 decoder layers using attention and residual connections. To improve parallelism and therefore decrease training time, our attention mechanism connects the bottom layer of the decoder to the top layer of the encoder. To accelerate the final translation speed, we employ low-precision arithmetic during inference computations. To improve handling of rare words, we divide words into a limited set of common sub-word units ("wordpieces") for both input and output. This method provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models, naturally handles translation of rare words, and ultimately improves the overall accuracy of the system. Our beam search technique employs a length-normalization procedure and uses a coverage penalty, which encourages generation of an output sentence that is most likely to cover all the words in the source sentence. On the WMT'14 English-to-French and English-to-German benchmarks, GNMT achieves competitive results to state-of-the-art. Using a human side-by-side evaluation on a set of isolated simple sentences, it reduces translation errors by an average of 60% compared to Google's phrase-based production system.
Conformer: Convolution-augmented Transformer for Speech RecognitionRecently Transformer and Convolution neural network (CNN) based models have shown promising results in Automatic Speech Recognition (ASR), outperforming Recurrent neural networks (RNNs).Transformer models are good at capturing content-based global interactions, while CNNs exploit local features effectively.In this work, we achieve the best of both worlds by studying how to combine convolution neural networks and transformers to model both local and global dependencies of an audio sequence in a parameter-efficient way.To this regard, we propose the convolution-augmented transformer for speech recognition, named Conformer.Conformer significantly outperforms the previous Transformer and CNN based models achieving state-of-the-art accuracies.On the widely used LibriSpeech benchmark, our model achieves WER of 2.1%/4.3%without using a language model and 1.9%/3.9%with an external language model on test/testother.We also observe competitive performance of 2.7%/6.3%with a small model of only 10M parameters.
Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram PredictionsThis paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms. Our model achieves a mean opinion score (MOS) of 4.53 comparable to a MOS of 4.58 for professionally recorded speech. To validate our design choices, we present ablation studies of key components of our system and evaluate the impact of using mel spectrograms as the conditioning input to WaveNet instead of linguistic, duration, and F0 features. We further show that using this compact acoustic intermediate representation allows for a significant reduction in the size of the WaveNet architecture.
Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot TranslationMelvin Johnson, Mike Schuster, Quoc V. Le et al.|Transactions of the Association for Computational Linguistics|2017 We propose a simple solution to use a single Neural Machine Translation (NMT) model to translate between multiple languages. Our solution requires no changes to the model architecture from a standard NMT system but instead introduces an artificial token at the beginning of the input sentence to specify the required target language. Using a shared wordpiece vocabulary, our approach enables Multilingual NMT systems using a single model. On the WMT’14 benchmarks, a single multilingual model achieves comparable performance for English→French and surpasses state-of-theart results for English→German. Similarly, a single multilingual model surpasses state-of-the-art results for French→English and German→English on WMT’14 and WMT’15 benchmarks, respectively. On production corpora, multilingual models of up to twelve language pairs allow for better translation of many individual pairs. Our models can also learn to perform implicit bridging between language pairs never seen explicitly during training, showing that transfer learning and zero-shot translation is possible for neural translation. Finally, we show analyses that hints at a universal interlingua representation in our models and also show some interesting examples when mixing languages.
Tacotron: Towards End-to-End Speech SynthesisA text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module.Building these components often requires extensive domain expertise and may contain brittle design choices.In this paper, we present Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters.Given <text, audio> pairs, the model can be trained completely from scratch with random initialization.We present several key techniques to make the sequence-tosequence framework perform well for this challenging task.Tacotron achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness.In addition, since Tacotron generates speech at the frame level, it's substantially faster than sample-level autoregressive methods.