This is an anonymous sample site for the StyleMoE paper submitted to the Audio Imagination Workshop at NeurIPS 2024. Below, you will find audio samples from the paper. The code will be released after acceptance.
Recent advances in style transfer text-to-speech (TTS) have improved the expressiveness of synthesized speech. Despite these advancements, encoding stylistic information from diverse and unseen reference speech remains challenging. This paper introduces StyleMoE, an approach that divides the style embedding space into tractable subsets handled by style experts. The proposed method replaces the style encoder in a TTS system with a Mixture of Experts (MoE) layer. By using a gating network to route reference speech samples to style experts that specialize in different regions of the style space, StyleMoE improves speaking-style coverage in style transfer TTS. Our experiments demonstrate, both objectively and subjectively, improved style transfer for diverse and unseen styles. The approach enhances existing state-of-the-art style transfer TTS models and is, to our knowledge, the first study of style MoE in TTS.
Figure 1: The architecture of StyleMoE-TTS. Red modules are taken from GenerSpeech; green modules form the Mixture of Experts layer; purple modules are style experts, with darker purple marking the experts selected by the gating network. Subfigures (a) and (b) illustrate the integration of StyleMoE into StyleMoE-TTS. Subfigure (c) depicts the StyleMoE layer, wherein each Style Expert block is a style reference encoder. Subfigure (d) illustrates the gating network.
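To make the routing step in subfigures (c) and (d) concrete, below is a minimal, hypothetical sketch of top-k gated mixture-of-experts routing as described in the abstract: a gating network scores each style expert from the reference embedding, the top-k experts are selected, and their style vectors are combined with renormalized gate weights. All names and the linear gating form here are illustrative assumptions, not the paper's actual implementation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def style_moe(reference, experts, gate_weights, k=1):
    """Route a reference embedding to the top-k style experts.

    reference:    list[float], an embedding of the reference speech
    experts:      list of callables, each mapping the reference to a style vector
    gate_weights: one weight vector per expert for a (hypothetical) linear gate
    k:            number of experts kept per reference sample
    """
    # Gating network: a linear score per expert, then softmax (subfigure d).
    scores = [sum(w * x for w, x in zip(wv, reference)) for wv in gate_weights]
    probs = softmax(scores)
    # Keep only the top-k experts and renormalize their gate values.
    topk = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in topk)
    # Weighted sum of the selected experts' style embeddings (subfigure c).
    out = None
    for i in topk:
        vec = experts[i](reference)
        scaled = [probs[i] / norm * v for v in vec]
        out = scaled if out is None else [a + b for a, b in zip(out, scaled)]
    return out, topk
```

With k=1, as in the samples below, only the single highest-scoring expert encodes each reference, so each expert can specialize on a subset of the style space.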
Audio samples (players not preserved in this text version) compare the following conditions:

Reference Audio | Baseline | StyleEnsemble (experts=2) | StyleMoE (experts=2, k=1)
---|---|---|---