Voice Conversion Can Improve ASR in Very Low-Resource Settings
Matthew Baas and Herman Kamper.
Arxiv link: https://arxiv.org/abs/2111.02674
Voice conversion samples
Here we provide several samples of original and voice-converted speech produced by the voice conversion model detailed in the paper. Samples are given both for English, the seen language, and for the unseen languages and speakers from Experiments 1 and 3 of the paper.
For each setting below, we present the input utterance, the converted output utterance, and an utterance from the desired reference speaker.
Sampling is made consistent by sorting all the filenames of the data being sampled and then sampling 3 utterances for each language with Python’s random library, using a fixed seed of 123 across all languages.
In all the cases below except English, the language as well as all source speakers are unseen during training.
For English the language is seen, but the source speakers are from the LibriSpeech test set and thus are still unseen during training.
All input and reference utterances from datasets that are not originally sampled at 16 kHz are resampled to 16 kHz.
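The sampling procedure described above can be sketched roughly as follows. This is a minimal illustration and not the authors' actual script: the function name `sample_utterances`, the use of `random.Random(...).sample`, the re-seeding per language, and the example filenames are all assumptions. Only the sorting step, the 3 utterances per language, and the seed of 123 come from the description above.

```python
import random


def sample_utterances(filenames, n_samples=3, seed=123):
    """Pick n_samples filenames reproducibly (hypothetical helper).

    Sorting first makes the result independent of the order in which
    the filesystem happens to list the files.
    """
    ordered = sorted(filenames)
    # Assumption: a fresh generator seeded with 123 is used for each language.
    rng = random.Random(seed)
    return rng.sample(ordered, n_samples)


if __name__ == "__main__":
    # Hypothetical filenames, for illustration only.
    files = ["utt_c.wav", "utt_a.wav", "utt_e.wav", "utt_b.wav", "utt_d.wav"]
    print(sample_utterances(files))
```

Because the list is sorted before sampling and the generator is seeded, repeated runs over the same set of files select the same utterances regardless of listing order.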
English
These are sampled from LibriSpeech’s augmented test-clean subset from Experiment 1. Speaker IDs are those specified in the LibriSpeech dataset.
Model | Input speaker & utterance | Reference speaker & utterance | Output utterance |
---|---|---|---|
Full model | 4446 | 589 | |
Full model | 6829 | 337 | |
Full model | 1995 | 6037 | |
Sans Q | 4446 | 589 | |
Sans Q | 6829 | 337 | |
Sans Q | 1995 | 6037 | |
Sans HGST | 4446 | 589 | |
Sans HGST | 6829 | 337 | |
Sans HGST | 1995 | 6037 | |
Afrikaans
These are sampled from the augmented data generated for the +100% augmentation factor for Afrikaans in Experiment 3.
Source speaker IDs are those given in the Afrikaans dataset, while reference speaker IDs are those from LibriSpeech.
Model | Input speaker & utterance | Reference speaker & utterance | Output utterance |
---|---|---|---|
Full model | 0184 | 373 | |
Full model | 1919 | 8254 | |
Full model | 0184 | 5890 | |
isiXhosa
These are sampled from the augmented data generated for the +100% augmentation factor for isiXhosa in Experiment 3.
Source speaker IDs are those given in the isiXhosa dataset, while reference speaker IDs are those from LibriSpeech.
Model | Input speaker & utterance | Reference speaker & utterance | Output utterance |
---|---|---|---|
Full model | 0050 | 4455 | |
Full model | 1547 | 4104 | |
Full model | 0050 | 6127 | |
Setswana
These are sampled from the augmented data generated for the +100% augmentation factor for Setswana in Experiment 3.
Source speaker IDs are those given in the Setswana dataset, while reference speaker IDs are those from LibriSpeech.
Model | Input speaker & utterance | Reference speaker & utterance | Output utterance |
---|---|---|---|
Full model | 0045 | 3500 | |
Full model | 1932 | 2391 | |
Full model | 0045 | 5393 | |
Sepedi
These are sampled from the augmented data generated for the +100% augmentation factor for Sepedi in Experiment 3.
Source speaker IDs are those given in the Sepedi dataset, while reference speaker IDs are those from LibriSpeech.
Model | Input speaker & utterance | Reference speaker & utterance | Output utterance |
---|---|---|---|
Full model | 0045 | 3500 | |
Full model | 1932 | 2391 | |
Full model | 0045 | 5393 | |
Citation
@inproceedings{baas2022lowresource,
title= "Voice Conversion Can Improve ASR in Very Low-Resource Settings",
author="Baas, Matthew and Kamper, Herman",
booktitle="Interspeech",
year=2022
}