StarGAN-ZSVC: Towards zero-shot voice conversion in low-resource contexts
Matthew Baas and Herman Kamper.
arXiv link: https://arxiv.org/abs/2106.00043
Voice conversion samples
Here we provide samples of converted speech to compare zero-shot voice conversion models, namely AutoVC and the proposed StarGAN-ZSVC model. Conversions are given below for various seen/unseen source/target pairings, as evaluated on the test set. In each case, three utterances were sampled from the test set.
Note: the AutoVC model used in this comparison is trained on the small 9-minute VCC 2018 dataset described in the paper, while the AutoVC model released by the original authors is trained on the VCTK dataset, which consists of 4960 minutes of audio, a significantly larger dataset. Our AutoVC model therefore performs substantially worse than the original work because it is trained in an extremely low-resource context, as per the experimental setup in Section 4 of the paper. We chose this setup so that both models are exposed to the same amount of training data.
Seen-to-seen samples
| Source-target speaker | Input audio | Ground truth target | AutoVC output | StarGAN-ZSVC output |
|---|---|---|---|---|
| VCC2SF2-VCC2SM1 (10001.wav) | (audio) | (audio) | (audio) | (audio) |
| VCC2SM2-VCC2SF1 (10020.wav) | (audio) | (audio) | (audio) | (audio) |
| VCC2SM2-VCC2SM1 (10062.wav) | (audio) | (audio) | (audio) | (audio) |
Seen-to-unseen samples
| Source-target speaker | Input audio | Ground truth target | AutoVC output | StarGAN-ZSVC output |
|---|---|---|---|---|
| VCC2SF1-VCC2TF1 (10025.wav) | (audio) | (audio) | (audio) | (audio) |
| VCC2SM2-VCC2TM1 (10036.wav) | (audio) | (audio) | (audio) | (audio) |
| VCC2SM1-VCC2TF1 (10050.wav) | (audio) | (audio) | (audio) | (audio) |
Unseen-to-seen samples
| Source-target speaker | Input audio | Ground truth target | AutoVC output | StarGAN-ZSVC output |
|---|---|---|---|---|
| VCC2TM1-VCC2SM1 (10025.wav) | (audio) | (audio) | (audio) | (audio) |
| VCC2TF2-VCC2SF1 (10036.wav) | (audio) | (audio) | (audio) | (audio) |
| VCC2TF1-VCC2SM1 (10050.wav) | (audio) | (audio) | (audio) | (audio) |
Unseen-to-unseen samples
| Source-target speaker | Input audio | Ground truth target | AutoVC output | StarGAN-ZSVC output |
|---|---|---|---|---|
| VCC2TM2-VCC2TF1 (10001.wav) | (audio) | (audio) | (audio) | (audio) |
| VCC2TF2-VCC2TM1 (10020.wav) | (audio) | (audio) | (audio) | (audio) |
| VCC2TF2-VCC2TF1 (10062.wav) | (audio) | (audio) | (audio) | (audio) |
Pretrained models and reproduction
- Speaker embedding model (source code and pretrained model)
- Model, example evaluations, and pretrained checkpoints: this source code includes a notebook with an example of how to perform both seen-to-seen and zero-shot voice conversion using the pretrained checkpoints (a rough sketch of this flow is given below).
If you have any queries about the model or its use, please reach out to the authors.
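
To illustrate the general zero-shot conversion flow, here is a minimal Python sketch. All module names, checkpoint filenames, and call signatures below are hypothetical placeholders, not the repository's actual API; the linked notebook shows the real entry points, and the feature extraction settings must match those used in training.

```python
# Hypothetical sketch of zero-shot voice conversion with pretrained checkpoints.
# The checkpoint paths, model interfaces, and mel settings here are illustrative
# assumptions -- consult the linked source code and notebook for the real API.
import torch
import librosa

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the pretrained speaker encoder and StarGAN-ZSVC generator
# (hypothetical TorchScript checkpoint names).
speaker_encoder = torch.jit.load("speaker_encoder.pt", map_location=device).eval()
generator = torch.jit.load("stargan_zsvc_generator.pt", map_location=device).eval()

def log_mel(path, sr=16000, n_mels=80):
    """Compute a log-mel spectrogram; exact settings must match training."""
    wav, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=n_mels)
    return torch.from_numpy(librosa.power_to_db(mel)).float()  # (n_mels, frames)

with torch.no_grad():
    src_mel = log_mel("source_utterance.wav").unsqueeze(0).to(device)
    ref_mel = log_mel("target_reference.wav").unsqueeze(0).to(device)

    # Speaker embeddings for the source and target speakers. Because the
    # embeddings are computed from audio alone, the target speaker need not
    # have been seen during training -- this is what makes conversion zero-shot.
    src_emb = speaker_encoder(src_mel)
    tgt_emb = speaker_encoder(ref_emb := ref_mel)

    # The generator maps the source spectrogram, conditioned on the speaker
    # embeddings, to a spectrogram in the target speaker's voice.
    converted_mel = generator(src_mel, src_emb, tgt_emb)

# A neural vocoder (see the paper) then inverts `converted_mel` to a waveform.
```

For an unseen-to-unseen conversion, both `source_utterance.wav` and `target_reference.wav` would simply come from speakers outside the training set; no retraining or fine-tuning is involved.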
Citation
Here is the BibTeX citation for the work:
```bibtex
@InProceedings{10.1007/978-3-030-66151-9_5,
  author    = "Baas, Matthew and Kamper, Herman",
  editor    = "Gerber, Aurona",
  title     = "StarGAN-ZSVC: Towards Zero-Shot Voice Conversion in Low-Resource Contexts",
  booktitle = "Artificial Intelligence Research",
  year      = "2020",
  publisher = "Springer International Publishing",
  address   = "Cham",
  pages     = "69--84",
}
```