kNN-VC SACAIR 2023 demo

Demo website for the paper submitted to SACAIR 2023: Voice Conversion for Stuttered Speech, Instruments, Unseen Languages and Textually Described Voices.

1. Introduction

This is the demonstration website for the paper submitted to SACAIR 2023: Voice Conversion for Stuttered Speech, Instruments, Unseen Languages and Textually Described Voices.

2. Code & Pretrained models

The model used in our experiments is the kNN-VC voice conversion model. Its code and checkpoints are available here.
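At the time of writing, the released model can be loaded through torch.hub roughly as follows (a sketch based on the repository's documented usage; all paths are placeholders):

```python
import torch

# Load the pretrained kNN-VC pipeline (WavLM encoder + HiFi-GAN vocoder)
# from the official repository via torch.hub.
knn_vc = torch.hub.load('bshall/knn-vc', 'knn_vc',
                        prematched=True, trust_repo=True, pretrained=True)

query_seq = knn_vc.get_features('/path/to/source.wav')         # source features
matching_set = knn_vc.get_matching_set(['/path/to/ref1.wav'])  # reference features
out_wav = knn_vc.match(query_seq, matching_set, topk=4)        # converted waveform
```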


3. Stuttering experiments

Here we convert an utterance recorded by a non-stuttering speaker into the voice of a stuttering speaker.

3.1 Practical samples

In the samples below, we convert non-stuttered speech into the voice of a stuttering speaker.

In the first sample below, the reference is built from a small amount of the stuttering speaker's audio: whole words are stitched together to form a short recording of minimally stuttered speech (but with clear word-concatenation artifacts). We then take a clean recording of a non-stuttering speaker reading a piece of text and apply kNN-VC with $k=4$ to convert it from the clean speaker to the stuttering speaker's voice, producing output speech in the correct voice but with the stuttering and artifacts largely removed.
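For intuition, the core of this conversion is a simple k-nearest-neighbours regression over self-supervised speech features. The sketch below shows just that matching step, assuming $(T, D)$-shaped feature matrices; WavLM feature extraction and HiFi-GAN vocoding are omitted:

```python
import torch
import torch.nn.functional as F

def knn_match(query: torch.Tensor, matching_set: torch.Tensor, k: int = 4) -> torch.Tensor:
    """Replace every source frame with the mean of its k nearest
    reference frames under cosine similarity (the core kNN-VC step).

    query:        (T, D) features of the source utterance.
    matching_set: (N, D) features pooled from the reference audio.
    """
    # Cosine similarity between every source frame and every reference frame.
    sims = F.normalize(query, dim=-1) @ F.normalize(matching_set, dim=-1).T  # (T, N)
    # Indices of the k most similar reference frames for each source frame.
    _, idx = sims.topk(k, dim=-1)  # (T, k)
    # Average the selected reference frames to obtain the converted features,
    # which are then vocoded to a waveform (not shown here).
    return matching_set[idx].mean(dim=1)  # (T, D)
```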

For the second, third, and fourth samples, the reference waveforms are simply the raw stuttered recordings, with heavy stuttering and silences. This makes the task slightly harder for the model, since the reference speech is further out-of-domain. Finally, we have also applied voicefixer (a recent bandwidth-extension model) to the outputs to upsample them from 16 kHz to 44.1 kHz.
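Assuming the pip-installable voicefixer package, this upsampling step looks roughly like the following (the filenames are placeholders; mode 0 is the package's standard restoration mode):

```python
from voicefixer import VoiceFixer

vf = VoiceFixer()
# Restore and upsample a 16 kHz kNN-VC output to 44.1 kHz.
vf.restore(input='knnvc_output.wav', output='knnvc_output_44k.wav',
           cuda=False, mode=0)
```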

| Sample | Source | Reference | kNN-VC output | kNN-VC $\rightarrow$ voicefixer output |
| :---: | :---: | :---: | :---: | :---: |
| 0 | *(audio)* | *(audio)* | *(audio)* | *(audio)* |
| 1 | *(audio)* | *(audio)* | *(audio)* | *(audio)* |
| 2 | *(audio)* | *(audio)* | *(audio)* | *(audio)* |
| 3 | *(audio)* | *(audio)* | *(audio)* | *(audio)* |

3.2 Stuttering Events in Podcasts

In a second, larger-scale experiment, we apply the same conversion process to the Stuttering Events in Podcasts (SEP-28k) dataset. Concretely, we convert 200 LibriSpeech dev-clean utterances (the same set as in the original kNN-VC paper) to 10 random stuttered speakers from the stuttering-events dataset. In the conversion process we use up to 30 short reference utterances from each stuttered speaker. We exclude speakers with fewer than 15 utterances, and we filter the reference utterances to only those with actual word/sound stuttering events (utterances with SoundRep > 0 or WordRep > 0).
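A sketch of this filtering, assuming the dataset's labels are available as a CSV with per-clip SoundRep/WordRep annotation counts and an episode/speaker identifier (the file and column names here are assumptions):

```python
import pandas as pd

# Hypothetical label file; SoundRep/WordRep are the per-clip
# repetition-event counts referenced above.
labels = pd.read_csv('SEP-28k_labels.csv')

# Keep only clips with actual sound/word repetition events.
stuttered = labels[(labels['SoundRep'] > 0) | (labels['WordRep'] > 0)]

# Exclude speakers with fewer than 15 qualifying utterances, then take
# up to 30 reference utterances from each remaining speaker.
counts = stuttered.groupby('EpId').size()
eligible = counts[counts >= 15].index
references = (stuttered[stuttered['EpId'].isin(eligible)]
              .groupby('EpId').head(30))
```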

In this experiment we are purely interested in how intelligibility and speaker similarity degrade when performing stuttered voice conversion instead of regular voice conversion, so, to keep things comparable with the original paper, we do not apply voicefixer to the outputs.

| Sample | Source | Reference | kNN-VC output | FreeVC output |
| :---: | :---: | :---: | :---: | :---: |
| 0 | *(audio)* | *(audio)* | *(audio)* | *(audio)* |
| 1 | *(audio)* | *(audio)* | *(audio)* | *(audio)* |
| 2 | *(audio)* | *(audio)* | *(audio)* | *(audio)* |

4. Cross-lingual voice conversion

Next we apply kNN-VC to cross-lingual voice conversion using the Multilingual LibriSpeech (MLS) dataset. Concretely, we sample 3 speakers from each language in MLS and convert 16 utterances from each speaker to every other sampled speaker, thereby covering all possible source/target language conversion pairs. Below are some sample conversions:

| Sample | Source | Reference | kNN-VC output | FreeVC output |
| :---: | :---: | :---: | :---: | :---: |
| 00 | *(audio)* | *(audio)* | *(audio)* | *(audio)* |
| 01 | *(audio)* | *(audio)* | *(audio)* | *(audio)* |
| 02 | *(audio)* | *(audio)* | *(audio)* | *(audio)* |
| 03 | *(audio)* | *(audio)* | *(audio)* | *(audio)* |
| 04 | *(audio)* | *(audio)* | *(audio)* | *(audio)* |
| 05 | *(audio)* | *(audio)* | *(audio)* | *(audio)* |
| 06 | *(audio)* | *(audio)* | *(audio)* | *(audio)* |
| 07 | *(audio)* | *(audio)* | *(audio)* | *(audio)* |
| 08 | *(audio)* | *(audio)* | *(audio)* | *(audio)* |
| 09 | *(audio)* | *(audio)* | *(audio)* | *(audio)* |
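For concreteness, the speaker-sampling and pairing procedure can be sketched as follows (the metadata layout here is a hypothetical stand-in for MLS):

```python
import itertools
import random

def build_conversion_pairs(utterances, spk_per_lang=3, utts_per_spk=16, seed=0):
    """Enumerate (source wav, target speaker) pairs covering every
    source/target language combination.

    utterances: dict mapping language -> {speaker: [wav paths]},
                a hypothetical layout of the MLS metadata.
    """
    rng = random.Random(seed)
    # Sample a fixed number of speakers per language.
    sampled = {lang: rng.sample(sorted(spks), spk_per_lang)
               for lang, spks in utterances.items()}
    speakers = [(lang, s) for lang, spks in sampled.items() for s in spks]
    pairs = []
    # Every ordered pair of distinct sampled speakers, across all languages.
    for (src_lang, src), (tgt_lang, tgt) in itertools.permutations(speakers, 2):
        for wav in utterances[src_lang][src][:utts_per_spk]:
            pairs.append((wav, (tgt_lang, tgt)))
    return pairs
```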

5. Instrument conversion

We use the Musical Instruments Sound Dataset to see how kNN-VC and FreeVC perform on truly out-of-domain source and reference audio. Concretely, we attempt to convert a short piece of audio played on one instrument to sound as though it is played on another instrument (using a recording of the other instrument as the reference). Note that the models have only seen speech during training, so the outputs contain fairly heavy distortions.

| Sample | Source | Reference | kNN-VC output | FreeVC output |
| :---: | :---: | :---: | :---: | :---: |
| 1 | *(audio)* | *(audio)* | *(audio)* | *(audio)* |
| 2 | *(audio)* | *(audio)* | *(audio)* | *(audio)* |
| 3 | *(audio)* | *(audio)* | *(audio)* | *(audio)* |
| 4 | *(audio)* | *(audio)* | *(audio)* | *(audio)* |

6. Text-to-voice experiments

Here we convert an utterance to another voice, where the target voice is specified not by a reference clip but by a textual description of how the voice should sound. Concretely, we hand-label textual descriptions for 90 train/validation speakers and 10 test speakers from the LibriTTS dataset. Below we show a few conversion samples from the test-clean subset:

| Sample | Source | Target speaker description | kNN-VC output |
| :---: | :---: | --- | :---: |
| 0 | *(audio)* | “A man with a deep and lackluster voice, occasionally slurring his pronunciations, speaking in a subtly monotone manner, with moments of excitement occasionally.” | *(audio)* |
| 1 | *(audio)* | “Young woman with a high-pitched voice, employing a somewhat forced and artificially upbeat tone in her speech.” | *(audio)* |
| 2 | *(audio)* | “Young woman with a high-pitched voice, employing a somewhat forced and artificially upbeat tone in her speech.” | *(audio)* |
| 3 | *(audio)* | “An elderly woman with a velvety and resonant low voice, speaking in an animated and energetic, yet tender and caring manner.” | *(audio)* |
| 4 | *(audio)* | “Seasoned gentleman with a deep and low voice, speaking in a soft and tranquil cadence, emanating an aura of refinement and profound insight.” | *(audio)* |
| 5 | *(audio)* | “A young woman with a squeaky and animated voice, speaking in a rapid and highly expressive manner.” | *(audio)* |
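The description-to-voice mapping itself is detailed in the paper. Purely as an illustration of one simple retrieval-style baseline (not necessarily the paper's method), one could embed the descriptions with an off-the-shelf sentence encoder and reuse the audio of the closest-matching labelled training speaker as the kNN-VC matching set:

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative retrieval baseline only; the speaker IDs and descriptions
# below are hypothetical stand-ins for the 90 labelled training speakers.
encoder = SentenceTransformer('all-MiniLM-L6-v2')

train_descriptions = {
    'spk_a': 'A man with a deep and lackluster voice ...',
    'spk_b': 'Young woman with a high-pitched voice ...',
}

def nearest_speaker(query: str) -> str:
    """Return the training speaker whose description best matches `query`."""
    ids = list(train_descriptions)
    embs = encoder.encode([query] + [train_descriptions[i] for i in ids],
                          convert_to_tensor=True)
    sims = util.cos_sim(embs[0:1], embs[1:])  # (1, num_speakers)
    return ids[int(sims.argmax())]

# The returned speaker's audio would then serve as the kNN-VC matching set.
```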

Citation

Please cite our paper using the following BibTeX:

```bibtex
@inproceedings{baas2023sacair,
  title={Voice Conversion for Stuttered Speech, Instruments, Unseen Languages and Textually Described Voices},
  author={Baas, Matthew and Kamper, Herman},
  booktitle={SACAIR},
  year={2023}
}
```

Thank you for checking out our work!