
VITS-based Singing Voice Conversion System with DSPGAN post-processing for SVCC2023
Key Points
- This paper introduces the T02 team's VITS-based singing voice conversion (SVC) system for SVCC2023, featuring a feature extractor (HuBERT, F0), a voice converter, and a DSPGAN post-processor for enhanced audio quality.
- The system employs a two-stage training strategy to adapt to limited target speaker data, incorporating pre-training on speech and singing data, along with adaptation tricks such as data augmentation and joint training with auxiliary singers.
- Official SVCC2023 results demonstrate the system's strong performance, ranking 1st in naturalness and 2nd in similarity for the challenging cross-domain task, with ablation studies confirming the effectiveness of its design choices.
The paper presents the T02 team's VITS-based singing voice conversion (SVC) system for the Singing Voice Conversion Challenge 2023 (SVCC2023), designed to convert source singing voices to target singers' voices while preserving lyrics and melody. The system focuses on decoupling speaker timbre, linguistic content, and melody, and addresses challenges of limited target speaker data and audio artifacts.
The core methodology of the system is structured into a feature extractor, a VITS-based voice converter, a key shifter, and a DSPGAN post-processor.
- Feature Extractor: This module is responsible for decomposing the input singing voice. It leverages a variant of the HuBERT model [19] to extract 256-dimensional speaker-independent linguistic content (SSL features). Fundamental frequency (F0) contours, crucial for melody preservation, are computed using the PYIN algorithm [20]. Speaker identity is represented by speaker embeddings drawn from a look-up table (LUT).
- Voice Converter (VITS-based): The central component is built upon the VITS (Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech) architecture [17]. It comprises a posterior encoder, a prior encoder, a decoder, and a discriminator.
- Training: The posterior encoder maps the source waveform to a hidden representation z, modeling the posterior distribution q(z|x). The decoder reconstructs z back to the original waveform, forming a self-reconstruction scheme. A key modification to the standard HiFi-GAN decoder in VITS is the injection of a sine-based excitation signal [21], derived from F0, into the hidden features of the HiFi-GAN decoder [22] to enhance singing voice reconstruction quality. A multi-period discriminator (MPD) and a multi-scale discriminator (MSD) are employed for adversarial training, ensuring high-fidelity waveform generation. The prior encoder fuses the speaker ID, F0, and SSL features to model the prior distribution. A convertible flow transforms the prior distribution toward the posterior distribution, with a KL-divergence loss (L_kl in Fig. 2) applied between the prior and posterior.
- Inference: The concatenated prior encoder and decoder perform the conversion. The prior encoder takes the SSL features, shifted F0, and target speaker ID as input to generate the target singing voice waveform.
- Key Shifter: To account for differing pitch ranges between source and target singers, this module adjusts the F0 contour. It calculates the average F0 of the source (F̄0_src) and target (F̄0_tgt) singers. The pitch-shift amount is then computed as the difference between the two averages, i.e., Δ = F̄0_tgt − F̄0_src. The source F0 sequence is then shifted by Δ to align its mean with the target, improving speaker similarity. For the cross-domain task, where only target speech F0 is available, the average pitch of the in-domain task's target singer is used as a reference instead.
- Post-processor (DSPGAN): Although the VITS model is end-to-end, the converted audio may still contain artifacts such as metallic noise. A fine-tuned DSPGAN [18], a GAN-based universal vocoder, is therefore used as a post-processor to re-synthesize the waveform. DSPGAN uses sine excitation for harmonic modeling and a DSP module that extracts mel-spectrograms from the VITS-generated waveform, using them as time-frequency-domain supervision to eliminate artifacts and improve overall audio quality.
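The key-shift step described above can be sketched in a few lines. This is an illustrative reimplementation, not the paper's code: the function name `key_shift` and the choice of shifting in linear Hz are my assumptions (the actual system may operate in log-F0 or semitone space), and unvoiced frames (F0 = 0) are excluded from the means and left untouched:

```python
import numpy as np

def key_shift(src_f0, tgt_f0):
    """Shift the source F0 contour so that its voiced-frame mean
    matches the target singer's voiced-frame mean (linear-Hz sketch)."""
    src_mean = src_f0[src_f0 > 0].mean()   # average F0 of the source
    tgt_mean = tgt_f0[tgt_f0 > 0].mean()   # average F0 of the target
    delta = tgt_mean - src_mean            # pitch-shift amount
    shifted = src_f0.copy()
    shifted[shifted > 0] += delta          # keep unvoiced frames at 0
    return shifted, delta
```

For the cross-domain task, `tgt_f0` would be replaced by the F0 statistics of the corresponding in-domain target singer, as the paper describes.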
Training Strategy: Given limited target speaker data (around 10 minutes), a two-stage training strategy is adopted:
- Pre-training: The VITS-based voice converter is first pre-trained on the VCTK speech dataset [23], followed by further pre-training on a large mixed singing dataset (73.3 hours from NUS48e [27], Opencpop [28], M4singer [29], and Opensinger [30]). The DSPGAN post-processor is first trained on a mixed speech dataset (951.6 hours) and then fine-tuned on the mixed singing dataset.
- Adaptation: The pre-trained conversion model is adapted to the specific target singer data provided by SVCC2023. To mitigate overfitting and enhance generalization, two key tricks are employed:
- Data Augmentation: Speed perturbation [31] is applied to the target speaker's singing data. Audio clips are randomly varied in speed (factor 0.8 to 1.4) while preserving pitch, effectively doubling the dataset.
- Joint Training: During fine-tuning, the limited target speaker data is jointly trained with data from two auxiliary singers selected from the mixed singing dataset (those with the largest data subsets). This stabilizes the adaptation process.
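The speed-perturbation trick above can be sketched with a naive overlap-add (OLA) time stretch, which changes duration while leaving the sampling rate (and hence pitch) untouched. All names here are hypothetical and the paper does not specify its stretching algorithm; practical toolkits typically use higher-quality WSOLA or phase-vocoder methods:

```python
import numpy as np

def ola_time_stretch(x, rate, frame_len=1024, syn_hop=256):
    """Naive overlap-add time stretch: output duration ~ len(x) / rate,
    pitch preserved because samples are re-laid-out, not resampled."""
    assert len(x) >= frame_len, "input shorter than one analysis frame"
    ana_hop = int(round(syn_hop * rate))       # analysis hop on the input
    win = np.hanning(frame_len)
    n_frames = (len(x) - frame_len) // ana_hop + 1
    out = np.zeros(syn_hop * (n_frames - 1) + frame_len)
    norm = np.zeros_like(out)
    for i in range(n_frames):
        a, s = i * ana_hop, i * syn_hop
        out[s:s + frame_len] += win * x[a:a + frame_len]
        norm[s:s + frame_len] += win
    return out / np.maximum(norm, 1e-8)        # window-sum normalization

def speed_perturb(x, rng):
    """Randomly vary speed by a factor in [0.8, 1.4], as in the paper."""
    rate = rng.uniform(0.8, 1.4)
    return ola_time_stretch(x, rate), rate
```

Each perturbed copy is added alongside the original clip, which is how the augmentation "effectively doubles" the target singer's data.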
Experimental Results: The system was evaluated in SVCC2023 for naturalness and speaker similarity. In the cross-domain task, where only speech data of the target speaker is available, the system (T02) demonstrated superior performance, ranking 1st in naturalness and 2nd in similarity among English listeners. In the in-domain task, the system achieved 5th place in both naturalness and similarity for English listeners. Ablation studies confirmed the effectiveness of key design choices: removing speech pre-training, adaptation tricks (data augmentation and auxiliary training), or the DSPGAN post-processor significantly degraded both naturalness and similarity, indicating their crucial roles in achieving high-quality conversion and robust adaptation.