Adapter-Based Extension of Multi-Speaker
Text-To-Speech Model For New Speakers

Fine-tuning is a popular method for adapting text-to-speech (TTS) models to new speakers, but it comes with challenges: it usually requires several hours of high-quality speech per speaker, and it can degrade synthesis quality for previously learned speakers. In this paper, we propose an alternative approach to TTS adaptation based on parameter-efficient adapter modules. In the proposed approach, a few adapter modules are added between the layers of the pretrained network; the pretrained model is frozen, and only the adapters are fine-tuned on the speech of a new speaker. The result is a new model with a high degree of parameter sharing with the original one. Our experiments on the LibriTTS, HiFi-TTS, and VCTK datasets validate the effectiveness of the adapter-based method through objective and subjective metrics.
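As an illustration of the adapter mechanism described above, the following PyTorch sketch shows one common way to implement it. The bottleneck dimension, ReLU activation, and toy linear backbone are assumptions made for the example, not details taken from the paper: a small residual bottleneck adapter is appended to each frozen backbone layer, and only the adapter parameters are passed to the optimizer when adapting to a new speaker.

# Minimal sketch of adapter-based speaker adaptation (assumed details,
# not the authors' exact implementation). A bottleneck adapter with a
# residual connection is inserted after each frozen backbone layer;
# only the adapter weights are updated for the new speaker.
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, residual."""

    def __init__(self, hidden_dim: int, bottleneck_dim: int = 32):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))


class AdaptedLayer(nn.Module):
    """Wraps a pretrained (frozen) layer and appends a trainable adapter."""

    def __init__(self, pretrained_layer: nn.Module, hidden_dim: int):
        super().__init__()
        self.layer = pretrained_layer
        self.adapter = Adapter(hidden_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.adapter(self.layer(x))


# Toy "pretrained" backbone standing in for the multi-speaker TTS network.
hidden_dim = 256
backbone = nn.Sequential(*[nn.Linear(hidden_dim, hidden_dim) for _ in range(4)])

# Freeze the pretrained weights, then wrap each layer with an adapter.
for p in backbone.parameters():
    p.requires_grad = False
adapted = nn.Sequential(*[AdaptedLayer(layer, hidden_dim) for layer in backbone])

# Only adapter parameters remain trainable for the new speaker.
trainable = [p for p in adapted.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4)

x = torch.randn(8, hidden_dim)      # stand-in for frame-level features
loss = adapted(x).pow(2).mean()     # placeholder loss for illustration
loss.backward()
optimizer.step()

Because the backbone is shared and frozen, adapting to each new speaker stores only the small adapter weights rather than a full copy of the model, which is the parameter-sharing property referred to in the abstract.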

I. Comparison across different datasets

LibriTTS

[Audio samples for three male and three female speakers, comparing Reference, Adapter (proposed), Full fine-tuning, and Ground truth.]

HiFi-TTS

[Audio samples for three male and three female speakers, comparing Reference, Adapter (proposed), Full fine-tuning, and Ground truth.]

VCTK

[Audio samples for three male and three female speakers, comparing Reference, Adapter (proposed), Full fine-tuning, and Ground truth.]

II. Comparison across different amounts of training data

MALE

[Two sample sets: each provides Reference and Ground truth audio, plus Adapter and Full FT outputs trained on 1 min, 5 min, 15 min, and 60 min of data.]

FEMALE

[Two sample sets: each provides Reference and Ground truth audio, plus Adapter and Full FT outputs trained on 1 min, 5 min, 15 min, and 60 min of data.]