Performance of the DSFF-SVC framework

Evaluation Tasks

  1. Recording Studio Setting: Following most existing works, we use a high-quality singing corpus recorded in a studio as the experimental data. The vocals are clean, with virtually no noise or environmental interference.
  2. In-the-Wild Setting: In real-world SVC applications, the singing voice is usually separated from background music, which leaves artifacts or reverb in the vocals. We treat this as a more challenging conversion task that tests the robustness of SVC systems.

Base Models

  1. TransformerSVC: It uses a vanilla encoder-only Transformer as the acoustic model and outputs a mel-spectrogram.
  2. VitsSVC: A VITS-based model similar to SoftVC-VITS. It is an end-to-end framework that produces the waveform directly.
  3. DiffWaveNetSVC: It uses a diffusion-based acoustic model that generates a mel-spectrogram, which is proposed by [citation missing]. The internal encoder of the diffusion framework is a bidirectional non-causal dilated CNN, similar to WaveNet.
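
To make "non-causal dilated CNN" concrete, here is a minimal pure-Python sketch of a dilated 1-D convolution that looks both backward and forward in time, together with the receptive field of a stack of such layers. The kernel size, the dilation schedule, and the all-ones weights are illustrative assumptions, not the settings used by DiffWaveNetSVC.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Non-causal, zero-padded 1-D dilated convolution over a list of floats."""
    half = len(kernel) // 2  # assumes an odd kernel length
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t + (k - half) * dilation  # reaches past AND future frames
            if 0 <= idx < len(signal):       # frames outside the signal count as zero
                acc += w * signal[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilations):
    """Number of input frames that influence one output frame of the stack."""
    return 1 + (kernel_size - 1) * sum(dilations)


# Hypothetical stack: kernel size 3, dilation doubling at each layer.
DILATIONS = [1, 2, 4, 8]
KERNEL = [1.0, 1.0, 1.0]

# Push a unit impulse through the stack to see which frames it reaches.
impulse = [0.0] * 41
impulse[20] = 1.0
x = impulse
for d in DILATIONS:
    x = dilated_conv1d(x, KERNEL, d)

touched = [i for i, v in enumerate(x) if v != 0.0]
span = touched[-1] - touched[0] + 1  # width of the region the impulse reached
```

Under these assumptions the impulse spreads symmetrically around frame 20, and `span` equals `receptive_field(3, DILATIONS)`. A causal (WaveNet-style autoregressive) layer would instead reach only past frames.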

In the following tables, W denotes using the WeNet feature only, W + W denotes using both WeNet and Whisper features, and W + W + C denotes using WeNet, Whisper, and ContentVec features.
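
As a reading aid, the nine systems compared in each table can be enumerated as the product of the three base models and the three feature sets. This is only a sketch of the labeling scheme; the names `MODELS`, `FEATURE_SETS`, and `system_labels` are hypothetical and not taken from the framework's code.

```python
from itertools import product

MODELS = ["TransformerSVC", "VitsSVC", "DiffWaveNetSVC"]

# Abbreviation -> content features it stands for, as defined in the text.
FEATURE_SETS = {
    "W": ("WeNet",),
    "W + W": ("WeNet", "Whisper"),
    "W + W + C": ("WeNet", "Whisper", "ContentVec"),
}


def system_labels():
    """Return one label per (model, feature set) pair, in table-column order."""
    return [f"{model} ({abbr})" for model, abbr in product(MODELS, FEATURE_SETS)]


labels = system_labels()
for label in labels:
    print(label)
```

Each table then has a "Source" column followed by these nine columns.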

Recording Studio Setting

Reference


Conversion Results

Source | TransformerSVC (W) | TransformerSVC (W + W) | TransformerSVC (W + W + C) | VitsSVC (W) | VitsSVC (W + W) | VitsSVC (W + W + C) | DiffWaveNetSVC (W) | DiffWaveNetSVC (W + W) | DiffWaveNetSVC (W + W + C)



In-the-Wild Setting

Reference (English Male)


Conversion Results

Source | TransformerSVC (W) | TransformerSVC (W + W) | TransformerSVC (W + W + C) | VitsSVC (W) | VitsSVC (W + W) | VitsSVC (W + W + C) | DiffWaveNetSVC (W) | DiffWaveNetSVC (W + W) | DiffWaveNetSVC (W + W + C)

Reference (English Female)


Conversion Results

Source | TransformerSVC (W) | TransformerSVC (W + W) | TransformerSVC (W + W + C) | VitsSVC (W) | VitsSVC (W + W) | VitsSVC (W + W + C) | DiffWaveNetSVC (W) | DiffWaveNetSVC (W + W) | DiffWaveNetSVC (W + W + C)

Reference (Chinese Male)


Conversion Results

Source | TransformerSVC (W) | TransformerSVC (W + W) | TransformerSVC (W + W + C) | VitsSVC (W) | VitsSVC (W + W) | VitsSVC (W + W + C) | DiffWaveNetSVC (W) | DiffWaveNetSVC (W + W) | DiffWaveNetSVC (W + W + C)

Reference (Chinese Female)


Conversion Results

Source | TransformerSVC (W) | TransformerSVC (W + W) | TransformerSVC (W + W + C) | VitsSVC (W) | VitsSVC (W + W) | VitsSVC (W + W + C) | DiffWaveNetSVC (W) | DiffWaveNetSVC (W + W) | DiffWaveNetSVC (W + W + C)