Voice Conversion Can Improve ASR in Very Low-Resource Settings

Matthew Baas and Herman Kamper.

arXiv link: https://arxiv.org/abs/2111.02674

Voice conversion samples

Here we provide several samples of original and voice-converted speech using the voice conversion model detailed in the paper. Samples are given both for English as the seen language, and for the unseen languages and speakers from Experiments 1 and 3 of the paper.

For each setting below, we present the input utterance, the converted output utterance, and an utterance from the desired reference speaker. Sampling is made consistent by sorting all the filenames of the data being sampled and then drawing 3 utterances per language with Python's random library, using a fixed seed of 123 across all languages. In all cases below except English, both the language and all source speakers are unseen during training. For English the language is seen, but the source speakers are from the LibriSpeech test set and are thus still unseen during training. Input and reference utterances are resampled to 16 kHz for the datasets where they are not originally at 16 kHz.
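The deterministic sampling described above can be sketched in Python. This is a minimal sketch: the helper name and example filenames are hypothetical, and the exact call order in the paper's code may differ, but the idea of sorting before seeded sampling is what makes the selection reproducible.

```python
import random

def sample_utterances(filenames, n=3, seed=123):
    """Deterministically sample n utterances from a dataset:
    sort the filenames first (so the population order is fixed),
    then sample with a seeded RNG so the draw is reproducible."""
    rng = random.Random(seed)  # fixed seed of 123, as in the paper
    return rng.sample(sorted(filenames), n)

# Hypothetical filenames, for illustration only:
files = ["spk2_utt1.wav", "spk1_utt3.wav", "spk1_utt1.wav", "spk3_utt2.wav"]
print(sample_utterances(files))
```

Because the filenames are sorted before sampling, the same 3 utterances are selected regardless of the order in which the filesystem lists the files.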

English

These are sampled from LibriSpeech's augmented test-clean subset from Experiment 1. Speaker IDs are as specified in the LibriSpeech dataset.

| Model | Input speaker & utterance | Reference speaker & utterance | Output utterance |
| --- | --- | --- | --- |
| Full model | 4446 | 589 | (audio) |
| Full model | 6829 | 337 | (audio) |
| Full model | 1995 | 6037 | (audio) |
| Sans Q | 4446 | 589 | (audio) |
| Sans Q | 6829 | 337 | (audio) |
| Sans Q | 1995 | 6037 | (audio) |
| Sans HGST | 4446 | 589 | (audio) |
| Sans HGST | 6829 | 337 | (audio) |
| Sans HGST | 1995 | 6037 | (audio) |

Afrikaans

These are sampled from the augmented data generated for the +100% augmentation factor for Afrikaans in Experiment 3. Source speaker IDs are those from the Afrikaans dataset, while reference speaker IDs are those from LibriSpeech.

| Model | Input speaker & utterance | Reference speaker & utterance | Output utterance |
| --- | --- | --- | --- |
| Full model | 0184 | 373 | (audio) |
| Full model | 1919 | 8254 | (audio) |
| Full model | 0184 | 5890 | (audio) |

isiXhosa

These are sampled from the augmented data generated for the +100% augmentation factor for isiXhosa in Experiment 3. Source speaker IDs are those from the isiXhosa dataset, while reference speaker IDs are those from LibriSpeech.

| Model | Input speaker & utterance | Reference speaker & utterance | Output utterance |
| --- | --- | --- | --- |
| Full model | 0050 | 4455 | (audio) |
| Full model | 1547 | 4104 | (audio) |
| Full model | 0050 | 6127 | (audio) |

Setswana

These are sampled from the augmented data generated for the +100% augmentation factor for Setswana in Experiment 3. Source speaker IDs are those from the Setswana dataset, while reference speaker IDs are those from LibriSpeech.

| Model | Input speaker & utterance | Reference speaker & utterance | Output utterance |
| --- | --- | --- | --- |
| Full model | 0045 | 3500 | (audio) |
| Full model | 1932 | 2391 | (audio) |
| Full model | 0045 | 5393 | (audio) |

Sepedi

These are sampled from the augmented data generated for the +100% augmentation factor for Sepedi in Experiment 3. Source speaker IDs are those from the Sepedi dataset, while reference speaker IDs are those from LibriSpeech.

| Model | Input speaker & utterance | Reference speaker & utterance | Output utterance |
| --- | --- | --- | --- |
| Full model | 0045 | 3500 | (audio) |
| Full model | 1932 | 2391 | (audio) |
| Full model | 0045 | 5393 | (audio) |

Citation

@inproceedings{baas2022lowresource,
  title="Voice Conversion Can Improve ASR in Very Low-Resource Settings",
  author="Baas, Matthew and Kamper, Herman",
  booktitle="Interspeech",
  year=2022
}