StarGAN-ZSVC: Towards zero-shot voice conversion in low-resource contexts

Matthew Baas and Herman Kamper.

arXiv link: https://arxiv.org/abs/2106.00043

Voice conversion samples

Here we provide samples of converted speech to compare two zero-shot voice conversion models: AutoVC and the proposed StarGAN-ZSVC model. Conversions are given below for each combination of seen/unseen source and target speakers, as evaluated on the test set. For each combination, 3 utterances were sampled from the test set.
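The selection described above can be sketched roughly as follows. This is a minimal illustration, not the authors' actual script: the seen/unseen speaker split is inferred from the sample tables on this page, and the function name `sample_pairs`, the `seed` parameter, and the utterance list passed in are hypothetical.

```python
import random

# Assumption: speaker split inferred from the sample tables on this page.
SEEN = ["VCC2SF1", "VCC2SF2", "VCC2SM1", "VCC2SM2"]
UNSEEN = ["VCC2TF1", "VCC2TF2", "VCC2TM1", "VCC2TM2"]


def sample_pairs(sources, targets, utterances, k=3, seed=0):
    """Draw k distinct (source, target, utterance) triples without replacement.

    Hypothetical helper: `utterances` is a list of test-set file names,
    and `seed` only makes the draw repeatable.
    """
    rng = random.Random(seed)
    # Enumerate every valid triple, skipping conversions of a speaker to itself.
    candidates = [
        (src, tgt, utt)
        for src in sources
        for tgt in targets
        if src != tgt
        for utt in utterances
    ]
    return rng.sample(candidates, k)


if __name__ == "__main__":
    # File names here are placeholders taken from the tables on this page.
    test_utterances = ["10001.wav", "10020.wav", "10062.wav"]
    conditions = {
        "seen-to-seen": (SEEN, SEEN),
        "seen-to-unseen": (SEEN, UNSEEN),
        "unseen-to-seen": (UNSEEN, SEEN),
        "unseen-to-unseen": (UNSEEN, UNSEEN),
    }
    for name, (sources, targets) in conditions.items():
        for src, tgt, utt in sample_pairs(sources, targets, test_utterances):
            print(f"{name}: {src}-{tgt} ({utt})")
```

Sampling from the full list of candidate triples, rather than picking speakers and utterances separately, guarantees that the 3 samples in each case are distinct.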

Note: the AutoVC model used in this comparison is trained on the small 9-minute VCC 2018 dataset as described in the paper, whereas the AutoVC model trained by the original authors uses the VCTK dataset, which contains 4960 minutes of audio and is therefore significantly larger. As a result, our AutoVC model performs substantially worse than the original work, because it is trained in an extremely low-resource context, as per the experimental setup in Section 4 of the paper. We chose this setup so that both models are compared after being exposed to the same amount of training data.

Seen-to-seen samples

Source-target speaker Input audio Ground truth target AutoVC output StarGAN-ZSVC output

VCC2SF2-VCC2SM1 (10001.wav)

VCC2SM2-VCC2SF1 (10020.wav)

VCC2SM2-VCC2SM1 (10062.wav)

Seen-to-unseen samples

Source-target speaker Input audio Ground truth target AutoVC output StarGAN-ZSVC output

VCC2SF1-VCC2TF1 (10025.wav)

VCC2SM2-VCC2TM1 (10036.wav)

VCC2SM1-VCC2TF1 (10050.wav)

Unseen-to-seen samples

Source-target speaker Input audio Ground truth target AutoVC output StarGAN-ZSVC output

VCC2TM1-VCC2SM1 (10025.wav)

VCC2TF2-VCC2SF1 (10036.wav)

VCC2TF1-VCC2SM1 (10050.wav)

Unseen-to-unseen samples

Source-target speaker Input audio Ground truth target AutoVC output StarGAN-ZSVC output

VCC2TM2-VCC2TF1 (10001.wav)

VCC2TF2-VCC2TM1 (10020.wav)

VCC2TF2-VCC2TF1 (10062.wav)

Pretrained models and reproduction

If you have any queries about the model or its use, please reach out to the authors.

Citation

Here is the BibTeX citation for this work:

@InProceedings{10.1007/978-3-030-66151-9_5,
  author="Baas, Matthew and Kamper, Herman",
  editor="Gerber, Aurona",
  title="StarGAN-ZSVC: Towards Zero-Shot Voice Conversion in Low-Resource Contexts",
  booktitle="Artificial Intelligence Research",
  year="2020",
  publisher="Springer International Publishing",
  address="Cham",
  pages="69--84",
}