StarGAN-ZSVC: Towards zero-shot voice conversion in low-resource contexts
Matthew Baas and Herman Kamper.
arXiv link: https://arxiv.org/abs/2106.00043
Voice conversion samples
Here we provide samples of converted speech to compare zero-shot voice conversion models, namely AutoVC and the proposed StarGAN-ZSVC model. Conversions are given below for various seen/unseen source/target pairings, as evaluated on the test set. In each case, three utterances were sampled from the test set.
Note: the AutoVC model used in this comparison is trained on the small 9-minute VCC 2018 dataset described in the paper, while the AutoVC model released by the original authors is trained on the VCTK dataset, which consists of 4960 minutes of audio, a significantly larger dataset. Our AutoVC model therefore performs substantially worse than the original work because it is trained in an extremely low-resource context, as per the experimental setup in Section 4 of the paper. We chose this setup so that both models are exposed to the same amount of training data.
Seen-to-seen samples
| Source-target speaker | Input audio | Ground truth target | AutoVC output | StarGAN-ZSVC output |
|---|---|---|---|---|
| VCC2SF2-VCC2SM1 (10001.wav) | (audio) | (audio) | (audio) | (audio) |
| VCC2SM2-VCC2SF1 (10020.wav) | (audio) | (audio) | (audio) | (audio) |
| VCC2SM2-VCC2SM1 (10062.wav) | (audio) | (audio) | (audio) | (audio) |
Seen-to-unseen samples
| Source-target speaker | Input audio | Ground truth target | AutoVC output | StarGAN-ZSVC output |
|---|---|---|---|---|
| VCC2SF1-VCC2TF1 (10025.wav) | (audio) | (audio) | (audio) | (audio) |
| VCC2SM2-VCC2TM1 (10036.wav) | (audio) | (audio) | (audio) | (audio) |
| VCC2SM1-VCC2TF1 (10050.wav) | (audio) | (audio) | (audio) | (audio) |
Unseen-to-seen samples
| Source-target speaker | Input audio | Ground truth target | AutoVC output | StarGAN-ZSVC output |
|---|---|---|---|---|
| VCC2TM1-VCC2SM1 (10025.wav) | (audio) | (audio) | (audio) | (audio) |
| VCC2TF2-VCC2SF1 (10036.wav) | (audio) | (audio) | (audio) | (audio) |
| VCC2TF1-VCC2SM1 (10050.wav) | (audio) | (audio) | (audio) | (audio) |
Unseen-to-unseen samples
| Source-target speaker | Input audio | Ground truth target | AutoVC output | StarGAN-ZSVC output |
|---|---|---|---|---|
| VCC2TM2-VCC2TF1 (10001.wav) | (audio) | (audio) | (audio) | (audio) |
| VCC2TF2-VCC2TM1 (10020.wav) | (audio) | (audio) | (audio) | (audio) |
| VCC2TF2-VCC2TF1 (10062.wav) | (audio) | (audio) | (audio) | (audio) |
Pretrained models and reproduction
- Speaker embedding model (source code and pretrained model)
- Model, example evaluations, and pretrained checkpoints: this source code includes a notebook with an example of how to perform both seen-to-seen and zero-shot voice conversion using the pretrained checkpoints (a rough sketch of this flow is given below).
If you have any queries about the model or its use, please reach out to the authors.
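
To illustrate the general zero-shot conversion flow, here is a minimal Python sketch. All module names, checkpoint filenames, and call signatures below are hypothetical placeholders, not the repository's actual API; the linked notebook shows the real entry points, and the feature extraction settings must match those used in training.

```python
# Hypothetical sketch of zero-shot voice conversion with pretrained checkpoints.
# The checkpoint paths, model interfaces, and mel settings here are illustrative
# assumptions -- consult the linked source code and notebook for the real API.
import torch
import librosa

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the pretrained speaker encoder and StarGAN-ZSVC generator
# (hypothetical TorchScript checkpoint names).
speaker_encoder = torch.jit.load("speaker_encoder.pt", map_location=device).eval()
generator = torch.jit.load("stargan_zsvc_generator.pt", map_location=device).eval()

def log_mel(path, sr=16000, n_mels=80):
    """Compute a log-mel spectrogram; exact settings must match training."""
    wav, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=n_mels)
    return torch.from_numpy(librosa.power_to_db(mel)).float()  # (n_mels, frames)

with torch.no_grad():
    src_mel = log_mel("source_utterance.wav").unsqueeze(0).to(device)
    ref_mel = log_mel("target_reference.wav").unsqueeze(0).to(device)

    # Speaker embeddings for the source and target speakers. Because the
    # embeddings are computed from audio alone, the target speaker need not
    # have been seen during training -- this is what makes conversion zero-shot.
    src_emb = speaker_encoder(src_mel)
    tgt_emb = speaker_encoder(ref_emb := ref_mel)

    # The generator maps the source spectrogram, conditioned on the speaker
    # embeddings, to a spectrogram in the target speaker's voice.
    converted_mel = generator(src_mel, src_emb, tgt_emb)

# A neural vocoder (see the paper) then inverts `converted_mel` to a waveform.
```

For an unseen-to-unseen conversion, both `source_utterance.wav` and `target_reference.wav` would simply come from speakers outside the training set; no retraining or fine-tuning is involved.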
Citation
Here is the BibTeX citation for the work:
```bibtex
@InProceedings{10.1007/978-3-030-66151-9_5,
  author    = "Baas, Matthew and Kamper, Herman",
  editor    = "Gerber, Aurona",
  title     = "StarGAN-ZSVC: Towards Zero-Shot Voice Conversion in Low-Resource Contexts",
  booktitle = "Artificial Intelligence Research",
  year      = "2020",
  publisher = "Springer International Publishing",
  address   = "Cham",
  pages     = "69--84",
}
```