Table: Comparison between the proposed ReTrans and cross attention for feature fusion.
To verify the necessity and efficiency of the proposed ReTrans strategy for aligning features with mismatched time resolutions, we select classic cross attention as a baseline. Specifically, under the DSFF-SVC framework, we use only WeNet to extract semantic-based features and try different strategies to align their resolution with that of the F0 and energy features. We conduct the experiments on DiffWaveNetSVC and follow NaturalSpeech 2 in adopting cross attention for feature fusion within the diffusion model.
Source | Cross Attention | ReTrans |
---|---|---|
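
To illustrate why cross attention can serve as a resolution-alignment baseline, the following is a minimal sketch of single-head scaled dot-product cross attention in plain Python. It is not the implementation used in DiffWaveNetSVC or NaturalSpeech 2. The function names, the sequence lengths, the feature dimension, and the choice of the F0/energy-rate stream as queries are all assumptions made for illustration, and learned projections are omitted. The point it shows is that the output always has one vector per query frame, regardless of how many frames the key/value sequence has.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross attention (no learned projections).

    queries: T_q vectors of size d (assumed here: the F0/energy-rate stream)
    keys, values: T_k vectors (assumed here: the semantic features)
    Returns T_q vectors, one fused vector per query frame.
    """
    d = len(queries[0])
    dim_v = len(values[0])
    outputs = []
    for q in queries:
        # Similarity of this query frame to every key frame.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value frames.
        fused = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)]
        outputs.append(fused)
    return outputs


random.seed(0)
T_SEMANTIC, T_PROSODY, DIM = 5, 20, 4  # hypothetical sizes, for illustration only
semantic = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(T_SEMANTIC)]
prosody = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(T_PROSODY)]

# The semantic sequence (5 frames) is re-expressed at the query rate (20 frames).
aligned = cross_attention(prosody, semantic, semantic)
print(len(aligned), len(aligned[0]))
```

Because each output frame is a softmax-weighted average of the value frames, the alignment is learned implicitly through the attention weights rather than imposed by an explicit resampling step.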