Adapter-Based Extension of Multi-Speaker Text-To-Speech Model For New Speakers
Fine-tuning is a popular method for adapting text-to-speech (TTS) models to new speakers, but it has notable drawbacks: it typically requires several hours of high-quality speech per speaker, and it can degrade synthesis quality for previously learned speakers. In this paper, we propose an alternative approach to TTS adaptation based on parameter-efficient adapter modules. In the proposed approach, a few adapter modules are added between the layers of the pretrained network; the pretrained model is frozen, and only the adapters are fine-tuned on the speech of a new speaker. This yields a new model with a high degree of parameter sharing with the original one. Our experiments on the LibriTTS, HiFi-TTS, and VCTK datasets validate the effectiveness of the adapter-based method through objective and subjective metrics.
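To make the idea concrete, the following is a minimal sketch of a generic bottleneck adapter (down-projection, nonlinearity, up-projection, residual connection) in plain NumPy. The class name, dimensions, and zero initialization of the up-projection are illustrative assumptions about adapter modules in general, not the exact architecture used in this paper; in practice the backbone layers would be frozen and only the adapter weights updated.

```python
import numpy as np


def relu(x):
    return np.maximum(0.0, x)


class BottleneckAdapter:
    """Illustrative adapter inserted after a frozen backbone layer.

    Computes h + relu(h @ W_down) @ W_up. With W_up initialized to
    zero, the adapter starts as an identity mapping, so inserting it
    does not change the pretrained model's outputs before training.
    """

    def __init__(self, d_model, d_bottleneck, rng):
        # Small random init for the down-projection (assumed scale).
        self.W_down = rng.normal(0.0, 0.02, size=(d_model, d_bottleneck))
        # Zero init for the up-projection -> identity at initialization.
        self.W_up = np.zeros((d_bottleneck, d_model))

    def __call__(self, h):
        return h + relu(h @ self.W_down) @ self.W_up


rng = np.random.default_rng(0)
adapter = BottleneckAdapter(d_model=8, d_bottleneck=2, rng=rng)
h = rng.normal(size=(4, 8))  # a batch of hidden states
out = adapter(h)
# At initialization the adapter leaves hidden states unchanged,
# so previously learned speakers are unaffected until training starts.
print(np.allclose(out, h))
```

Because only `W_down` and `W_up` (2 * d_model * d_bottleneck parameters per adapter) are trained, each new speaker adds only a small set of weights on top of the shared frozen backbone.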