Text to Speech without Text

Team Project @ CSIE5431 Applied Deep Learning

Compare MBV and VQVAE for discrete representations of subword units in the ZeroSpeech 2019 Challenge.



Figure 1. MBV and VQVAE model architecture.

Standard training methods for text-to-speech usually require large quantities of labeled training data, such as text or phoneme labels. However, collecting high-quality parallel corpora for low-resource languages is quite challenging and costly. In this project, we compare the Multilabel-Binary Vectors (MBV) autoencoder and the Vector Quantized Variational Autoencoder (VQVAE). The two methods share a similar autoencoder backbone but discretize the continuous encoder output in different ways, as shown in Figure 1 above.
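To make the contrast concrete, here is a minimal PyTorch sketch of the two discretization layers: MBV binarizes each encoder dimension with a straight-through estimator, while VQVAE snaps each frame to its nearest codebook vector. Module names, dimensions, and hyperparameters are illustrative assumptions, not the exact implementation from our report.

```python
import torch
import torch.nn as nn


class MBVQuantizer(nn.Module):
    """Binarize encoder outputs into multilabel binary vectors (MBV)."""

    def forward(self, z):
        probs = torch.sigmoid(z)              # per-dimension code probabilities
        hard = (probs > 0.5).float()          # discrete 0/1 multilabel vector
        # Straight-through estimator: hard codes on the forward pass,
        # sigmoid gradients on the backward pass.
        return probs + (hard - probs).detach()


class VQQuantizer(nn.Module):
    """Snap each encoder frame to its nearest codebook vector (VQVAE)."""

    def __init__(self, num_codes=256, dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.beta = beta

    def forward(self, z):
        # z: (batch, time, dim)
        flat = z.reshape(-1, z.size(-1))                  # (batch*time, dim)
        dist = torch.cdist(flat, self.codebook.weight)    # (batch*time, num_codes)
        idx = dist.argmin(dim=-1).view(z.shape[:-1])      # (batch, time)
        q = self.codebook(idx)                            # nearest codebook vectors
        # Codebook and commitment losses (van den Oord et al., 2017).
        vq_loss = ((q - z.detach()) ** 2).mean() \
            + self.beta * ((z - q.detach()) ** 2).mean()
        # Straight-through: copy decoder gradients from q back to z.
        return z + (q - z).detach(), idx, vq_loss
```

Both layers keep the autoencoder differentiable end to end; the difference is only in how the bottleneck is made discrete.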

[Audio samples: Source, Continuous, MBV, MBV-ADV, VQVAEv1, VQVAEv2]

Figure 2. Speech samples for voice conversion.

To improve performance and explore the trade-offs, we introduced several techniques, including adversarial learning and vector quantization. For evaluation, we first compare same-speaker reconstruction in terms of training loss, spectrograms, and voice samples. We then perform voice conversion and compare the generated speech in terms of bitrate and quality, as shown in Figure 2 above. For more information, please refer to our report.
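For the bitrate axis of this comparison, a simplified estimate in the spirit of the ZeroSpeech 2019 metric multiplies the empirical entropy of the discrete symbols by the number of symbols emitted per second. The function name and inputs below are illustrative assumptions, not the official evaluation script.

```python
import math
from collections import Counter


def bitrate(symbol_sequences, total_duration_sec):
    """Estimate the bitrate (bits/sec) of discrete unit sequences.

    symbol_sequences: list of per-utterance lists of hashable symbols
    total_duration_sec: summed audio duration of all utterances
    """
    symbols = [s for seq in symbol_sequences for s in seq]
    counts = Counter(symbols)
    n = len(symbols)
    # Empirical entropy of the symbol distribution (bits per symbol).
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    # Bits per second = symbols per second * bits per symbol.
    return (n / total_duration_sec) * entropy


# Example: two utterances of discrete codes covering 1.5 s of audio in total.
print(bitrate([[3, 3, 7, 1], [7, 7, 2]], total_duration_sec=1.5))
```

A lower bitrate at comparable speech quality indicates a more compact discrete representation, which is the trade-off the figure above illustrates.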