Table: Objective evaluation results of different semantic-based features and their integration.
In this page, we will show some representative cases to
We select DiffWaveNetSVC as the base model and conduct experiments under the Recording Studio Setting. The target singer is Opencpop, whose training corpus contains 5.2 hours of studio-recorded singing voice. Here are some reference samples:
First, we evaluate the reconstruction performance of the different semantic-based features. The following samples are from Opencpop's test set:
Sample | Ground Truth | WeNet | WeNet + Whisper | WeNet + Whisper + ContentVec |
---|---|---|---|---|
#1 | ||||
MCD | 0.00 | 12.56 | 8.00 | 5.02 |
Mel Spectrogram | ||||
#2 | ||||
MCD | 0.00 | 11.20 | 6.21 | 3.66 |
Mel Spectrogram | ||||
#3 | ||||
MCD | 0.00 | 13.40 | 8.82 | 6.51 |
Mel Spectrogram |
Observation: After integrating diverse semantic-based features, the mel spectrograms move closer to the ground truth (MCD drops for every sample shown), and the overall audio quality also improves.
Furthermore, we use M4Singer recordings as the source audio for singing voice conversion, with Opencpop as the target singer. Here we evaluate the melody modeling capability of the different semantic-based features.
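The page does not state how MCD was computed, so the following is only a minimal sketch of the standard mel-cepstral distortion formula. It assumes the two mel-cepstral sequences are already frame-aligned (no DTW step) and skips the 0th (energy) coefficient, which is a common convention; the function name `mel_cepstral_distortion` is hypothetical.

```python
import math


def mel_cepstral_distortion(ref_frames, conv_frames):
    """Average MCD in dB over two frame-aligned mel-cepstral sequences.

    Each argument is a list of frames, and each frame is a list of
    mel-cepstral coefficients. The 0th coefficient is skipped
    (assumption: it carries energy rather than spectral shape).
    """
    if not ref_frames or len(ref_frames) != len(conv_frames):
        raise ValueError("sequences must be non-empty and frame-aligned")

    # Constant factor of the standard formula: (10 / ln 10) * sqrt(2).
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)

    total = 0.0
    for ref, conv in zip(ref_frames, conv_frames):
        squared_diff = sum((r - c) ** 2 for r, c in zip(ref[1:], conv[1:]))
        total += scale * math.sqrt(squared_diff)
    return total / len(ref_frames)


# Identical sequences give zero distortion, as in the Ground Truth column.
frames = [[0.5, 1.0, -0.3], [0.4, 0.8, -0.1]]
print(mel_cepstral_distortion(frames, frames))  # 0.0
```

Lower values mean the converted spectrum is closer to the reference.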
Sample | Source | WeNet | WeNet + Whisper | WeNet + Whisper + ContentVec |
---|---|---|---|---|
#1 | ||||
F0CORR | 1.00 | -0.44 | 0.90 | 0.92 |
F0 | ||||
#2 | ||||
F0CORR | 1.00 | -0.22 | 0.61 | 0.68 |
F0 | ||||
#3 | ||||
F0CORR | 1.00 | -0.16 | 0.22 | 0.72 |
F0 |
Observation: After integrating diverse semantic-based features, the melody trajectories of the converted audio move closer to those of the source, which serves as the ground truth here. However, semantic-based features alone still struggle to model melody adequately, and the converted audio can sound out of tune to human listeners. Therefore, explicit melody modeling (such as F0 features) remains necessary for SVC with current technology.
In this section, we evaluate the lyrics modeling capability of the different semantic-based features. As before, the source audio comes from M4Singer and the target singer is Opencpop.
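As a rough illustration of the F0CORR metric, the sketch below computes the Pearson correlation between two F0 contours. The page does not give the exact definition, so two assumptions are made here: the contours are frame-aligned, and only frames that are voiced in both (F0 > 0) are compared. The function name `f0_corr` is hypothetical.

```python
import math


def f0_corr(source_f0, converted_f0):
    """Pearson correlation of two frame-aligned F0 contours (in Hz).

    Frames where either contour is unvoiced (F0 <= 0) are ignored
    (assumption, not confirmed by the page).
    """
    pairs = [(s, c) for s, c in zip(source_f0, converted_f0) if s > 0 and c > 0]
    if len(pairs) < 2:
        raise ValueError("need at least two frames voiced in both contours")

    n = len(pairs)
    mean_s = sum(s for s, _ in pairs) / n
    mean_c = sum(c for _, c in pairs) / n

    cov = sum((s - mean_s) * (c - mean_c) for s, c in pairs)
    var_s = sum((s - mean_s) ** 2 for s, _ in pairs)
    var_c = sum((c - mean_c) ** 2 for _, c in pairs)
    if var_s == 0 or var_c == 0:
        raise ValueError("correlation is undefined for a constant contour")
    return cov / math.sqrt(var_s * var_c)
```

A value near 1.00 means the converted melody rises and falls with the source, while a negative value (as in the WeNet column) means the contour tends to move in the opposite direction.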
Source | WeNet | WeNet + Whisper | WeNet + Whisper + ContentVec | |
---|---|---|---|---|
#1 |
这只是刚刚入门接下来你还会会弹琴会写歌会双截棍 zhè zhǐ shì gāng gāng rù mén jiē xià lái nǐ hái huì huì tán qín huì xiě gē huì shuāng jié gùn |
|||
这只是刚刚入门接下来你还会弹琴会写歌会上街滚 (CER: 18.2%) | 这只是当童话吗地下安利啊狗货看清单写报上节归 (CER: 81.8%) | 这只是刚刚入门接下来你还会弹琴会写歌会双戒柜 (CER: 13.6%) | 就只是刚刚入门接下来你还会弹琴会写歌会双截棍 (CER: 9.1%) | |
zhè zhǐ shì gāng gāng rù mén jiē xià lái nǐ hái huì tán qín huì xiě gē huì shàng jiē gǔn | zhè zhǐ shì dāng tóng huà ma dì xià ān lì a gǒu huò kàn qīng dān xiě bào shàng jié guī | zhè zhǐ shì gāng gāng rù mén jiē xià lái nǐ hái huì tán qín huì xiě gē huì shuāng jiè guì | jiù zhǐ shì gāng gāng rù mén jiē xià lái nǐ hái huì tán qín huì xiě gē huì shuāng jié gùn | |
#2 |
我好想对你对你宠爱才短短几个礼拜心情坏因为你不在 wǒ hǎo xiǎng duì nǐ duì nǐ chǒng ài cái duǎn duǎn jǐ gè lǐ bài xīn qíng huài yīn wèi nǐ bù zài |
|||
我好想对你对你宠爱才短短几个礼拜心情会因为你不在 (CER: 4.2%) | 同好心愿意爱出爱大胆即刻必败应尽力为你不猜 (CER: 90.5%) | 我好心对你对你宠爱才等等几个礼拜心情欢意为你不在 (CER: 20.8%) | 我好想对你对你宠爱才短短几个礼拜心情欢意为你不在 (CER: 8.3%) | |
wǒ hǎo xiǎng duì nǐ duì nǐ chǒng ài cái duǎn duǎn jǐ gè lǐ bài xīn qíng huì yīn wèi nǐ bù zài | tóng hào xīn yuàn yì ài chū ài dà dǎn jí kè bì bài yīng jìn lì wèi nǐ bù cāi | wǒ hǎo xīn duì nǐ duì nǐ chǒng ài cái děng děng jǐ gè lǐ bài xīn qíng huān yì wèi nǐ bù zài | wǒ hǎo xiǎng duì nǐ duì nǐ chǒng ài cái duǎn duǎn jǐ gè lǐ bài xīn qíng huān yì wèi nǐ bù zài | |
#3 |
夕阳下我拉着你一起望着天天慢的静静的度过旧旧的时间 xī yáng xià wǒ lā zhe nǐ yì qǐ wàng zhe tiān tiān màn de jìng jìng de dù guò jiù jiù de shí jiān |
|||
夕阳下午拉着你一起望着天静慢地静静地度过久久的时间 (CER: 24.0%) | 心跳不欢声你一切望着天心慢的听见的多古典就失恋 (CER: 73.9%) | 深夜下午拉着你一起望着天听慢的静静的兜过去的时间 (CER: 29.2%) | 夕阳下我拉着你一起望着天给你慢的静静的度过久久的时间 (CER: 15.4%) | |
xī yáng xià wǔ lā zhe nǐ yì qǐ wàng zhe tiān jìng màn dì jìng jìng dì dù guò jiǔ jiǔ de shí jiān | xīn tiào bù huān shēng nǐ yī qiè wàng zhe tiān xīn màn de tīng jiàn de duō gǔ diǎn jiù shī liàn | shēn yè xià wǔ lā zhe nǐ yì qǐ wàng zhe tiān tīng màn de jìng jìng de dōu guò qù de shí jiān | xī yáng xià wǒ lā zhe nǐ yì qǐ wàng zhe tiān gěi nǐ màn de jìng jìng de dù guò jiǔ jiǔ de shí jiān | |
#4 |
顾不顾将相王侯管不管万世千秋求只求爱化解这万丈红尘纷乱永无休 gù bù gù jiàng xiàng wáng hóu guǎn bù guǎn wàn shì qiān qiū qiú zhǐ qiú ài huà jiě zhè wàn zhàng hóng chén fēn luàn yǒng wú xiū |
|||
孤不孤将相望后管不管万世千秋求知求爱化解这万丈红尘纷乱永无休 (CER: 16.7%) | 牵挂不顾登场问候问不完把时间救当初仇爱我将 (CER: 72.3%) | 顾不顾正向往后还不关为师千秋求知求爱化解这万丈红尘纷乱永无休 (CER: 30.0%) | 顾不顾长相往后管不管为师千秋求知求爱化解这万丈红尘纷乱永无休 (CER: 20.0%) | |
gū bù gū jiàng xiàng wàng hòu guǎn bù guǎn wàn shì qiān qiū qiú zhī qiú ài huà jiě zhè wàn zhàng hóng chén fēn luàn yǒng wú xiū | qiān guà bù gù dēng chǎng wèn hòu wèn bù wán bǎ shí jiān jiù dāng chū chóu ài wǒ jiāng | gù bù gù zhèng xiàng wǎng hòu hái bù guān wèi shī qiān qiū qiú zhī qiú ài huà jiě zhè wàn zhàng hóng chén fēn luàn yǒng wú xiū | gù bù gù zhǎng xiàng wǎng hòu guǎn bù guǎn wèi shī qiān qiū qiú zhī qiú ài huà jiě zhè wàn zhàng hóng chén fēn luàn yǒng wú xiū | |
#5 |
让我断了气铁了心爱的过火一回头就找到出路让我成为了无情的K歌之王 ràng wǒ duàn le qì tiě le xīn ài de guò huǒ yī huí tóu jiù zhǎo dào chū lù ràng wǒ chéng wéi liǎo wú qíng de K gē zhī wáng |
|||
让我断了气贴了心态的火火一回头就找到出路让我成为了无情的K哥之王 (CER: 12.5%) | 让我等了几天的心拍电话你回头都早老住让我成为了无形的黑洞失望 (CER: 63.3%) | 让我断了气贴了心海的火火一回头就找到出路让我成为了无形的配歌时光 (CER: 21.9%) | 让我断了七天的心海的过火一回头就找到出路让我成为了无情的配歌之王 (CER: 15.6%) | |
ràng wǒ duàn le qì tiē le xīn tài de huǒ huǒ yī huí tóu jiù zhǎo dào chū lù ràng wǒ chéng wéi liǎo wú qíng de K gē zhī wáng | ràng wǒ děng le jǐ tiān de xīn pāi diàn huà nǐ huí tóu dōu zǎo lǎo zhù ràng wǒ chéng wéi liǎo wú xíng de hēi dòng shī wàng | ràng wǒ duàn le qì tiē le xīn hǎi de huǒ huǒ yī huí tóu jiù zhǎo dào chū lù ràng wǒ chéng wéi liǎo wú xíng de pèi gē shí guāng | ràng wǒ duàn le qī tiān de xīn hǎi de guò huǒ yī huí tóu jiù zhǎo dào chū lù ràng wǒ chéng wéi liǎo wú qíng de pèi gē zhī wáng |
Observation: After integrating diverse semantic-based features, the intelligibility of the converted audio improves: the CER drops for all five samples compared with using WeNet alone.
Finally, we evaluate the speaker similarity of the converted audio.
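For reference, the CER values above can be understood as the character-level edit distance between the reference lyrics and a transcription, divided by the reference length. The sketch below assumes that definition; the page does not say which ASR system or text normalization was used, and the function name `cer` is hypothetical.

```python
def cer(reference, hypothesis):
    """Character error rate: Levenshtein distance / reference length."""
    ref, hyp = list(reference), list(hypothesis)
    if not ref:
        raise ValueError("reference must be non-empty")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j characters of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Sample #2: reference lyrics vs. the transcription in the Source column.
reference = "我好想对你对你宠爱才短短几个礼拜心情坏因为你不在"
hypothesis = "我好想对你对你宠爱才短短几个礼拜心情会因为你不在"
print(f"{cer(reference, hypothesis):.1%}")  # 4.2%
```

One substituted character out of 24 gives the 4.2% reported for that cell.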
Source | WeNet | Whisper | ContentVec |
---|---|---|---|
WeNet + Whisper | WeNet + ContentVec | Whisper + ContentVec | WeNet + Whisper + ContentVec |
---|---|---|---|
Observation: The speaker similarities obtained with the different semantic-based features are comparable and hard to rank by human perception.