
HunyuanVideo-Avatar: High-Fidelity Audio-Driven Human Animation for Multiple Characters
Key Points
- HunyuanVideo-Avatar proposes a novel multimodal diffusion transformer (MM-DiT) based model to overcome challenges in audio-driven human animation, including generating dynamic videos, maintaining character consistency, achieving precise emotion alignment, and enabling multi-character scenarios.
- The model introduces three key innovations: a character image injection module for robust consistency, an Audio Emotion Module (AEM) for transferring emotional cues from reference images, and a Face-Aware Audio Adapter (FAA) for isolated audio injection in multi-character animations.
- HunyuanVideo-Avatar demonstrates superior performance over state-of-the-art methods, generating high-fidelity, emotion-controllable, and multi-character dialogue videos in dynamic and immersive settings.
HunyuanVideo-Avatar is a novel multimodal diffusion transformer (MM-DiT)-based framework designed for high-fidelity audio-driven human animation, specifically addressing critical challenges in generating dynamic, emotion-controllable, and multi-character dialogue videos.
The paper identifies three primary limitations in existing audio-driven animation methods: (i) generating highly dynamic videos while preserving character consistency, (ii) achieving precise emotion alignment between characters and audio, and (iii) enabling multi-character audio-driven animation. HunyuanVideo-Avatar directly tackles these issues through three core methodological innovations.
First, to ensure robust character consistency under dynamic motion, the framework introduces a Character Image Injection Module. Conventional addition-based character conditioning schemes suffer from an inherent condition mismatch between training and inference; the injection module replaces them with a more stable and accurate conditioning mechanism for character identity, so that generated animations maintain strong character consistency while still permitting dynamic and expressive movement.
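The paper does not spell out the injection mechanism at this level of detail, but the contrast with addition-based conditioning can be sketched as follows. In this hypothetical illustration, `additive_condition` shows the conventional scheme (pooled character features added to every latent token), while `concat_condition` shows one common injection-style alternative in which character-image tokens are appended to the latent sequence so attention can read identity features directly; the function names and tensor shapes are assumptions, not the authors' implementation.

```python
import numpy as np

def additive_condition(latent, char_tokens):
    """Conventional addition-based conditioning (illustrative).

    latent:      (N, d) video latent tokens
    char_tokens: (M, d) character-image feature tokens

    Pools the character tokens and adds the result to every latent
    token -- the style of scheme the paper says can cause a
    train/inference condition mismatch.
    """
    return latent + char_tokens.mean(axis=0)

def concat_condition(latent, char_tokens):
    """Injection-by-concatenation alternative (illustrative).

    Appends character-image tokens to the latent sequence so that
    self-attention inside the transformer can attend to identity
    features directly instead of receiving them as an additive bias.
    Returns a sequence of length N + M.
    """
    return np.concatenate([latent, char_tokens], axis=0)
```

The concatenation variant keeps identity information available at every attention layer without perturbing the latent values themselves, which is one way an injection module can avoid the mismatch problem of additive conditioning.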
Second, to achieve fine-grained and accurate emotion style control, an Audio Emotion Module (AEM) is incorporated. Rather than inferring emotion directly from the input audio, the AEM extracts emotional cues from an "emotion reference image" and transfers that information to the target generated video. This gives precise control over the animated character's emotional style: expressions and body language follow the emotional state specified by the visual reference instead of relying solely on potentially ambiguous emotional signals in the audio.
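A minimal sketch of this kind of emotion transfer, assuming a standard cross-attention formulation in which video tokens query emotion-reference-image tokens; the function `aem_transfer`, the projection matrices, and the residual form are illustrative assumptions, not the paper's stated architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def aem_transfer(video_tokens, emotion_tokens, Wq, Wk, Wv):
    """Transfer emotion-reference features into video tokens (sketch).

    video_tokens:   (N, d) tokens of the generated video latent
    emotion_tokens: (M, d) features from the emotion reference image
    Wq, Wk, Wv:     (d, d) learned projection matrices (assumed)
    """
    q = video_tokens @ Wq                       # queries from video
    k = emotion_tokens @ Wk                     # keys from reference
    v = emotion_tokens @ Wv                     # values from reference
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    # Residual injection: each video token absorbs a weighted mix of
    # emotion-reference features.
    return video_tokens + attn @ v
```

Because the keys and values come from the reference image rather than the audio, the emotional style of the output is decoupled from whatever affect the audio signal may or may not carry.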
Third, to facilitate multi-character audio-driven animation, the model proposes a Face-Aware Audio Adapter (FAA). This adapter operates by isolating the audio-driven character within the latent space using a "latent-level face mask." This masking mechanism allows for the independent injection of audio information specific to each character. By leveraging cross-attention mechanisms, the FAA enables distinct audio streams to drive the animation of individual characters within a multi-character scene without interference, ensuring that each character's movements and expressions are accurately synchronized with their respective audio inputs. This isolation and independent injection capability is crucial for generating coherent and interactive multi-character dialogue videos.
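The masked, per-character injection described above can be sketched as follows. This is an illustrative interpretation: `faa_inject`, the boolean face masks over latent tokens, and the residual update are all assumptions standing in for the paper's latent-level face mask and cross-attention design.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v):
    """Single-head cross-attention: q (Nq, d) attends over k/v (Nkv, d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def faa_inject(latent, audio_feats, face_masks):
    """Inject each character's audio only into its own face region.

    latent:      (N, d) video latent tokens
    audio_feats: list of (M_i, d) audio features, one per character
    face_masks:  list of (N,) boolean masks selecting each character's
                 face tokens in the latent (the "latent-level face mask")
    """
    out = latent.copy()
    for audio, mask in zip(audio_feats, face_masks):
        # Only the masked tokens query this character's audio stream,
        # so the streams cannot interfere with each other.
        update = cross_attention(latent[mask], audio, audio)
        out[mask] = out[mask] + update          # residual audio injection
    return out
```

The key property is isolation: tokens outside every face mask are untouched, and each character's motion is driven solely by its own audio stream, which is what makes independent multi-character lip sync possible in one shared latent.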
Collectively, these innovations empower HunyuanVideo-Avatar to surpass state-of-the-art methods on benchmark datasets and a newly proposed wild dataset, demonstrating its capability to generate realistic avatars in dynamic, immersive scenarios with superior character consistency, emotional fidelity, and multi-character interaction.