Authors: Songxiang Liu, Yuewen Cao, Xixin Wu, Lifa Sun, Xunying Liu and Helen Meng
Abstract: The N10 system in the Voice Conversion Challenge 2018 (VCC 2018) achieved high voice conversion (VC) performance in terms of speech naturalness and speaker similarity. We believe that further improvements can be gained from joint optimization (instead of separate optimization) of the conversion model and the WaveNet vocoder, as well as from leveraging information in the acoustic representation of the speech waveform, e.g., Mel-spectrograms. In this paper, we propose a VC architecture that jointly trains a conversion model, which maps phonetic posteriorgrams (PPGs) to Mel-spectrograms, and a WaveNet vocoder. The conversion model has a bottleneck layer, whose outputs are concatenated with the PPGs before being fed into the WaveNet vocoder as local conditioning. A weighted sum of a Mel-spectrogram prediction loss and a WaveNet loss is used as the objective function to jointly optimize the parameters of the conversion model and the WaveNet vocoder. Objective and subjective evaluation results show that the proposed approach improves voice conversion quality in terms of the naturalness and speaker similarity of the converted speech for both cross-gender and intra-gender conversions.
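The weighted-sum objective described in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact loss forms or weights, so everything here is an assumption made for illustration: a mean-squared-error Mel-spectrogram loss, a categorical negative log-likelihood WaveNet loss, and the hypothetical names `mel_mse_loss`, `wavenet_nll_loss`, `joint_loss`, `w_mel` and `w_wav`.

```python
import math


def mel_mse_loss(pred_frames, target_frames):
    """Mean squared error over all Mel bins of all frames (assumed loss form)."""
    total, count = 0.0, 0
    for pred, target in zip(pred_frames, target_frames):
        for p, t in zip(pred, target):
            total += (p - t) ** 2
            count += 1
    return total / count


def wavenet_nll_loss(class_probs, target_classes):
    """Mean negative log-likelihood of the target sample classes (assumed loss form)."""
    nll = 0.0
    for probs, target in zip(class_probs, target_classes):
        nll -= math.log(probs[target])
    return nll / len(target_classes)


def joint_loss(mel_loss, wavenet_loss, w_mel=1.0, w_wav=1.0):
    """Weighted sum of the two losses; the weights are hypothetical hyperparameters."""
    return w_mel * mel_loss + w_wav * wavenet_loss


# Toy values only: one 2-bin Mel frame and one 2-class waveform sample.
mel = mel_mse_loss([[1.0, 2.0]], [[1.0, 4.0]])
wav = wavenet_nll_loss([[0.5, 0.5]], [0])
print(joint_loss(mel, wav, w_mel=1.0, w_wav=0.5))
```

In a joint setup of this kind, the gradient of the single scalar returned by `joint_loss` would reach both the conversion model and the vocoder, which is what distinguishes it from training the two separately.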
Baseline 1: Separately trained PPG-to-Mel-spectrogram conversion model and the WaveNet vocoder.
Baseline 2: Jointly trained condition network and the WaveNet vocoder.
Proposed: Jointly trained PPG-to-Mel-spectrogram conversion model and the WaveNet vocoder.
Ablation 1: Remove the top FC layer of the conversion model from the proposed approach.
Ablation 2: Remove the PPG residual connection from the proposed approach and feed only the bottleneck (BN) features to the WaveNet vocoder.
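The difference between the Proposed system and Ablation 2 lies in what the WaveNet vocoder receives as local conditioning. The following is a minimal sketch under stated assumptions: the function name `local_conditioning`, the flag `use_ppg_residual`, the frame dimensions, and the concatenation order (BN features first, then PPGs) are all hypothetical and not taken from the paper.

```python
def local_conditioning(bn_frames, ppg_frames, use_ppg_residual=True):
    """Build per-frame local conditioning vectors for the vocoder.

    use_ppg_residual=True  -> BN features concatenated with PPGs (Proposed).
    use_ppg_residual=False -> BN features only (Ablation 2).
    """
    if not use_ppg_residual:
        return [list(bn) for bn in bn_frames]
    if len(bn_frames) != len(ppg_frames):
        raise ValueError("BN features and PPGs must have the same number of frames")
    # Concatenation order is an assumption made for this sketch.
    return [list(bn) + list(ppg) for bn, ppg in zip(bn_frames, ppg_frames)]


# Toy input: one frame with a 2-dim BN vector and a 3-dim PPG vector.
bn = [[0.1, 0.2]]
ppg = [[0.7, 0.3, 0.0]]
proposed = local_conditioning(bn, ppg)
ablation_2 = local_conditioning(bn, ppg, use_ppg_residual=False)
print(len(proposed[0]), len(ablation_2[0]))
```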
Source speakers: "slt" and "rms" from the CMU Arctic dataset.
Target speakers: "clb" and "bdl" from the CMU Arctic dataset.
1. Text content: "He was a merry monarch, especially so for an Asiatic."
Source Speech | Target Speech |
---|---|
Baseline 1 | Baseline 2 | Ablation 1 | Ablation 2 | Proposed |
---|---|---|---|---|
2. Text content: "Beyond refusing to sell us food, they left us to ourselves."
Source Speech | Target Speech |
---|---|
Baseline 1 | Baseline 2 | Ablation 1 | Ablation 2 | Proposed |
---|---|---|---|---|
1. Text content: "He was a merry monarch, especially so for an Asiatic."
Source Speech | Target Speech |
---|---|
Baseline 1 | Baseline 2 | Ablation 1 | Ablation 2 | Proposed |
---|---|---|---|---|
2. Text content: "Beyond refusing to sell us food, they left us to ourselves."
Source Speech | Target Speech |
---|---|
Baseline 1 | Baseline 2 | Ablation 1 | Ablation 2 | Proposed |
---|---|---|---|---|
1. Text content: "He was a merry monarch, especially so for an Asiatic."
Source Speech | Target Speech |
---|---|
Baseline 1 | Baseline 2 | Ablation 1 | Ablation 2 | Proposed |
---|---|---|---|---|
2. Text content: "Beyond refusing to sell us food, they left us to ourselves."
Source Speech | Target Speech |
---|---|
Baseline 1 | Baseline 2 | Ablation 1 | Ablation 2 | Proposed |
---|---|---|---|---|
1. Text content: "He was a merry monarch, especially so for an Asiatic."
Source Speech | Target Speech |
---|---|
Baseline 1 | Baseline 2 | Ablation 1 | Ablation 2 | Proposed |
---|---|---|---|---|
2. Text content: "Beyond refusing to sell us food, they left us to ourselves."
Source Speech | Target Speech |
---|---|
Baseline 1 | Baseline 2 | Ablation 1 | Ablation 2 | Proposed |
---|---|---|---|---|