Voice Conversion Can Improve ASR in Very Low-Resource Settings

Matthew Baas and Herman Kamper.

arXiv link: https://arxiv.org/abs/2111.02674

Voice conversion samples

Here we provide several samples of original and voice-converted speech produced by the voice conversion model detailed in the paper. Samples are given both for English, the seen language, and for the unseen languages and speakers from Experiments 1 and 3 of the paper.

For each setting below, we present the input utterance, the converted output utterance, and an utterance from the desired reference speaker. To keep sampling consistent, we sort all filenames in the data being sampled and then sample 3 utterances per language using Python’s random library with a fixed seed of 123 across all languages. In all cases below except English, both the language and all source speakers are unseen during training. For English, the language is seen, but the source speakers are from the LibriSpeech test set and are thus still unseen during training. For datasets not originally at 16 kHz, all input and reference utterances are resampled to 16 kHz.
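The sampling procedure described above can be sketched roughly as follows. This is a minimal illustration, not the authors' actual script: the function name `sample_utterances`, the use of `random.Random(...).sample`, and the choice to reseed the generator for each language are all assumptions, and the filenames in the usage note are hypothetical.

```python
import random


def sample_utterances(filenames, n=3, seed=123):
    """Reproducibly pick n utterances from a collection of filenames."""
    # Sort first so the result does not depend on filesystem listing order.
    ordered = sorted(str(name) for name in filenames)
    # Dedicated generator with a fixed seed (assumed to be reseeded per language).
    rng = random.Random(seed)
    return rng.sample(ordered, n)
```

For example, calling `sample_utterances(["b.wav", "a.wav", "d.wav", "c.wav"])` returns the same three filenames on every run and for every ordering of the input list, which is the property the fixed seed and the sort are meant to guarantee.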

English

These are sampled from LibriSpeech’s augmented test-clean subset from Experiment 1. Speaker IDs are those specified in the LibriSpeech dataset.

Model	Input speaker & utterance	Reference speaker & utterance
Full model	4446	589
Full model	6829	337
Full model	1995	6037
Sans Q	4446	589
Sans Q	6829	337
Sans Q	1995	6037
Sans HGST	4446	589
Sans HGST	6829	337
Sans HGST	1995	6037

Afrikaans

These are sampled from the augmented data generated for the +100% augmentation factor for Afrikaans in Experiment 3. Source speaker IDs are those specified in the Afrikaans dataset, while reference speaker IDs are those from LibriSpeech.

Model	Input speaker & utterance	Reference speaker & utterance
Full model	0184	373
Full model	1919	8254
Full model	0184	5890

isiXhosa

These are sampled from the augmented data generated for the +100% augmentation factor for isiXhosa in Experiment 3. Source speaker IDs are those specified in the isiXhosa dataset, while reference speaker IDs are those from LibriSpeech.

Model	Input speaker & utterance	Reference speaker & utterance
Full model	0050	4455
Full model	1547	4104
Full model	0050	6127

Setswana

These are sampled from the augmented data generated for the +100% augmentation factor for Setswana in Experiment 3. Source speaker IDs are those specified in the Setswana dataset, while reference speaker IDs are those from LibriSpeech.

Model	Input speaker & utterance	Reference speaker & utterance
Full model	0045	3500
Full model	1932	2391
Full model	0045	5393

Sepedi

These are sampled from the augmented data generated for the +100% augmentation factor for Sepedi in Experiment 3. Source speaker IDs are those specified in the Sepedi dataset, while reference speaker IDs are those from LibriSpeech.

Model	Input speaker & utterance	Reference speaker & utterance
Full model	0045	3500
Full model	1932	2391
Full model	0045	5393

Citation

@inproceedings{baas2022lowresource,
  title="Voice Conversion Can Improve ASR in Very Low-Resource Settings",
  author="Baas, Matthew and Kamper, Herman",
  booktitle="Interspeech",
  year=2022
}