Audio Samples from "Jointly Trained Conversion Model and WaveNet Vocoder for Non-parallel Voice Conversion using Mel-spectrograms and Phonetic Posteriorgrams"

Authors: Songxiang Liu, Yuewen Cao, Xixin Wu, Lifa Sun, Xunying Liu and Helen Meng

Abstract: The N10 system in the Voice Conversion Challenge 2018 (VCC 2018) achieved high voice conversion (VC) performance in terms of speech naturalness and speaker similarity. We believe that further improvements can be gained from joint optimization (instead of separate optimization) of the conversion model and the WaveNet vocoder, as well as from leveraging information in the acoustic representation of the speech waveform, e.g., Mel-spectrograms. In this paper, we propose a VC architecture that jointly trains a conversion model mapping phonetic posteriorgrams (PPGs) to Mel-spectrograms and a WaveNet vocoder. The conversion model has a bottleneck layer, whose outputs are concatenated with the PPGs before being fed into the WaveNet vocoder as local conditioning. A weighted sum of a Mel-spectrogram prediction loss and a WaveNet loss serves as the objective function for jointly optimizing the parameters of the conversion model and the WaveNet vocoder. Objective and subjective evaluation results show that the proposed approach achieves improved speech naturalness and speaker similarity of the converted speech for both cross-gender and intra-gender conversions.
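To make the training objective concrete, below is a minimal sketch of the joint loss described in the abstract, assuming a PyTorch-style API. The names `conversion_model`, `wavenet`, the method `wavenet.nll`, and the weight `alpha`, as well as the choice of L2 for the Mel-spectrogram loss, are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def joint_loss(conversion_model, wavenet, ppg, mel_target, wav_target, alpha=1.0):
    # The conversion model maps PPGs to Mel-spectrograms and also exposes
    # the bottleneck (BN) features used for local conditioning.
    mel_pred, bn_features = conversion_model(ppg)

    # Mel-spectrogram prediction loss (L2 assumed here for illustration).
    mel_loss = F.mse_loss(mel_pred, mel_target)

    # Local conditioning: BN features concatenated with the PPGs.
    conditioning = torch.cat([bn_features, ppg], dim=-1)

    # WaveNet negative log-likelihood on the target waveform
    # (`nll` is a hypothetical method name).
    wavenet_loss = wavenet.nll(wav_target, conditioning)

    # Weighted sum: gradients flow into both modules, so the conversion
    # model and the WaveNet vocoder are optimized jointly.
    return alpha * mel_loss + wavenet_loss
```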


System Description

Baseline 1: Separately trained PPG-to-Mel-spectrogram conversion model and WaveNet vocoder.

Baseline 2: Jointly trained condition network and WaveNet vocoder.

Proposed: Jointly trained PPG-to-Mel-spectrogram conversion model and WaveNet vocoder.

Ablation 1: The proposed approach with the top FC layer of the conversion model removed.

Ablation 2: The proposed approach with the PPG residual connection removed, so that only the bottleneck (BN) features are fed to the WaveNet vocoder (the structural differences are illustrated in the sketch after this list).
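The sketch below illustrates how the two ablations modify the proposed architecture, assuming a PyTorch-style model. The GRU encoder and all module and tensor names are hypothetical placeholders; only the structural differences listed above are taken from the paper.

```python
import torch
import torch.nn as nn

class ConversionModel(nn.Module):
    def __init__(self, ppg_dim, bn_dim, mel_dim, use_top_fc=True):
        super().__init__()
        # Stand-in encoder producing bottleneck (BN) features.
        self.encoder = nn.GRU(ppg_dim, bn_dim, batch_first=True)
        # Ablation 1 removes this top FC layer mapping BN features to Mels.
        self.top_fc = nn.Linear(bn_dim, mel_dim) if use_top_fc else None

    def forward(self, ppg, use_ppg_residual=True):
        bn, _ = self.encoder(ppg)                # BN features, (B, T, bn_dim)
        mel = self.top_fc(bn) if self.top_fc is not None else None
        if use_ppg_residual:
            # Proposed: BN features concatenated with PPGs for conditioning.
            cond = torch.cat([bn, ppg], dim=-1)
        else:
            # Ablation 2: PPG residual connection removed, BN features only.
            cond = bn
        return mel, cond
```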


System Comparison

Source speakers: "slt" and "rms" from CMU Arctic dataset.

Target speakers: "clb" and "bdl" from CMU Arctic dataset.

Female-to-Female conversion (slt -> clb)

1. Text content: "He was a merry monarch, especially so for an Asiatic."

[Audio samples: Source Speech, Target Speech, Baseline 1, Baseline 2, Ablation 1, Ablation 2, Proposed]

2. Text content: "Beyond refusing to sell us food, they left us to ourselves."

[Audio samples: Source Speech, Target Speech, Baseline 1, Baseline 2, Ablation 1, Ablation 2, Proposed]

Male-to-Female conversion (rms -> clb)

1. Text content: "He was a merry monarch, especially so for an Asiatic."

[Audio samples: Source Speech, Target Speech, Baseline 1, Baseline 2, Ablation 1, Ablation 2, Proposed]

2. Text content: "Beyond refusing to sell us food, they left us to ourselves."

[Audio samples: Source Speech, Target Speech, Baseline 1, Baseline 2, Ablation 1, Ablation 2, Proposed]

Male-to-Male conversion (rms -> bdl)

1. Text content: "He was a merry monarch, especially so for an Asiatic."

[Audio samples: Source Speech, Target Speech, Baseline 1, Baseline 2, Ablation 1, Ablation 2, Proposed]

2. Text content: "Beyond refusing to sell us food, they left us to ourselves."

[Audio samples: Source Speech, Target Speech, Baseline 1, Baseline 2, Ablation 1, Ablation 2, Proposed]

Female-to-Male conversion (slt -> bdl)

1. Text content: "He was a merry monarch, especially so for an Asiatic."

[Audio samples: Source Speech, Target Speech, Baseline 1, Baseline 2, Ablation 1, Ablation 2, Proposed]

2. Text content: "Beyond refusing to sell us food, they left us to ourselves."

[Audio samples: Source Speech, Target Speech, Baseline 1, Baseline 2, Ablation 1, Ablation 2, Proposed]