NVIDIA PersonaPlex: Natural Conversational AI With Any Role and Voice
Paper

2026.01.24
·Web·by 이호민
#Conversational AI · #Full Duplex · #Persona · #NVIDIA · #LLM

Key Points

  1. PersonaPlex introduces a full-duplex conversational AI system that combines natural turn-taking, interruptions, and backchannels with deep customization of voice and persona through joint text and voice prompts.
  2. Built on the 7-billion-parameter Moshi architecture, it processes user audio and prompts concurrently to stream responses, trained on a blend of real human conversations for naturalness and synthetic dialogues for task-specific adherence.
  3. The system achieves state-of-the-art performance in conversational dynamics and task adherence, generalizes well to out-of-domain scenarios, and releases both code and model weights.

PersonaPlex is a full-duplex conversational AI model designed to overcome the traditional trade-off between voice/role customization and natural conversational dynamics. Prior cascaded pipelines (ASR→LLM→TTS) offered customization but lacked natural turn-taking, while full-duplex models such as Moshi provided naturalness but were limited to a fixed voice and role. PersonaPlex resolves this by letting users select diverse voices and define any role via text prompts, delivering natural conversation without sacrificing customization.

The core methodology of PersonaPlex is built upon the Moshi architecture, comprising 7 billion parameters, and operates as a single, unified model rather than a cascade of separate components. This architecture enables simultaneous listening and speaking, which is crucial for real-time interaction, interruption handling, and generating natural conversational behaviors like backchannels ("uh-huh," "oh").
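The simultaneous listen-and-speak behavior can be pictured as a per-frame loop in which the model ingests the latest user audio tokens and emits its own tokens in the same step. The sketch below is a toy illustration of that control flow only; the function and token names are hypothetical, not the released API.

```python
# Minimal sketch of a full-duplex step loop: at every frame the model both
# consumes user tokens and emits agent tokens, which is what makes
# interruptions and backchannels possible. All names are illustrative.

def duplex_step(model_state: int, user_tokens: list) -> tuple:
    """One frame: ingest user tokens, emit agent tokens, update state."""
    # Toy "model": emit a backchannel while the user talks, else silence.
    agent_tokens = ["uh-huh"] if user_tokens else ["<silence>"]
    return model_state + 1, agent_tokens

state = 0
outputs = []
for user_frame in [["hello"], ["how"], [], ["are"]]:
    state, agent_frame = duplex_step(state, user_frame)
    outputs.append(agent_frame)
```

The key point is that the agent stream is produced every frame, even when it is silence, so the model never has to "wait for its turn" the way a cascaded pipeline does.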

PersonaPlex uses a hybrid prompting architecture with two primary input streams:

  1. Voice prompt: An audio embedding that encodes vocal characteristics, speaking style, and prosody.
  2. Text prompt: Natural language text that specifies the agent's role, background information, and conversational context.
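The two conditioning streams can be thought of as a single persona object the model attends to at every step. The sketch below only illustrates how the two inputs are packaged together; the class and function names (and the toy tokenizer) are hypothetical, not PersonaPlex's actual interface.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PersonaPrompt:
    """Hypothetical container for the two PersonaPlex conditioning streams."""
    voice_embedding: List[float]  # encodes vocal timbre, style, prosody
    role_text: str                # natural-language role/context description

def build_prompt_prefix(prompt: PersonaPrompt) -> dict:
    # In the real model both streams are consumed jointly by the
    # Temporal/Depth Transformers; here we just package them as a
    # prefix the decoder would attend to at every generation step.
    return {
        "voice": prompt.voice_embedding,
        "text_tokens": prompt.role_text.lower().split(),  # toy tokenizer
    }

persona = PersonaPrompt(
    voice_embedding=[0.12, -0.40, 0.88],  # stand-in for an audio embedding
    role_text="You are a patient medical receptionist at a small clinic.",
)
prefix = build_prompt_prefix(persona)
```

Because the same prefix format is used for every data source, swapping either the voice embedding or the role text changes the persona without retraining.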

These inputs are processed jointly by the model to create a coherent persona. The internal architecture consists of:

  • A Mimi speech encoder (a combination of a ConvNet and a Transformer) which converts user audio into tokens.
  • Temporal and Depth Transformers that process the ongoing conversation, integrating information from both the user's speech and the persona prompts.
  • A Mimi speech decoder (a Transformer coupled with a ConvNet) that generates the agent's output speech.
  • The Helium language model, which underlies the system and provides semantic understanding and broad generalization.

Audio processing occurs at a 24kHz sample rate. The dual-stream configuration, where the model concurrently processes user input and generates its own response, is key to achieving low-latency interaction and natural conversational flow.
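For intuition about the latency budget, the arithmetic below relates the 24 kHz sample rate to token frames. The 12.5 Hz frame rate is Mimi's published figure from the Moshi work, assumed here rather than stated in this summary.

```python
SAMPLE_RATE_HZ = 24_000  # stated above
FRAME_RATE_HZ = 12.5     # Mimi's frame rate per the Moshi paper (assumption)

# Each token frame covers this many raw audio samples...
samples_per_frame = int(SAMPLE_RATE_HZ / FRAME_RATE_HZ)

# ...which corresponds to this much wall-clock time per frame.
frame_duration_ms = 1000 / FRAME_RATE_HZ
```

At 80 ms per frame, the model gets a fresh chance to react to the user (interrupt, backchannel, yield the floor) roughly twelve times per second, which is what makes the interaction feel low-latency.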

A significant challenge in training PersonaPlex was the scarcity of conversational speech data rich in non-verbal cues and with separated speaker audio. To overcome this, the training data strategy involved a blend of two main sources:

  1. Real Conversations: 7,303 conversations (1,217 hours) from the Fisher English corpus. These conversations were retrospectively annotated with contextual and personality descriptors (prompts) using GPT-OSS-120B. This approach transforms unscripted human conversations into persona-supervised data, capturing authentic interaction patterns, including natural backchanneling, expressions, and emotional responses.
  2. Synthetic Conversations: To expand coverage across diverse scenarios and topics, PersonaPlex was trained on:
    • 39,322 assistant role conversations (410 hours).
    • 105,410 customer service conversations (1,840 hours).
    Conversation transcripts were generated using large language models (Qwen3-32B and GPT-OSS-120B), and the speech was synthesized using Chatterbox TTS. This synthetic data enabled training for specific task-following behaviors, with varied user/agent voices and detailed text prompts for specific roles (e.g., bank agent, medical receptionist).
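The synthetic pipeline described above (LLM writes the transcript, TTS voices each turn) can be sketched as follows. The function names are illustrative stand-ins, not the actual Qwen3-32B / GPT-OSS-120B or Chatterbox tooling APIs.

```python
def generate_transcript(role_prompt: str, scenario: str) -> list:
    """Stand-in for prompting an LLM (Qwen3-32B / GPT-OSS-120B) to write
    a dialogue. Returns (speaker, utterance) turns."""
    return [
        ("user", f"Hi, I'm calling about {scenario}."),
        ("agent", "Of course, I can help with that. Could I get your name?"),
    ]

def synthesize_turn(speaker: str, text: str, voice_id: str) -> bytes:
    """Stand-in for Chatterbox TTS: returns raw audio for one turn."""
    return f"<audio:{voice_id}:{text}>".encode()  # placeholder payload

def build_example(role_prompt: str, scenario: str, voices: dict) -> list:
    """Assemble one synthetic training conversation with assigned voices."""
    turns = generate_transcript(role_prompt, scenario)
    return [synthesize_turn(spk, txt, voices[spk]) for spk, txt in turns]

clips = build_example(
    role_prompt="You are a bank agent handling card issues.",
    scenario="a blocked debit card",
    voices={"user": "voice_07", "agent": "voice_21"},
)
```

Varying the role prompt, scenario, and voice assignments across runs is what lets the synthetic corpus cover many roles and speakers cheaply.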

The model trains in a single stage, blending these data sources. A key finding is that the final model exhibits the behavioral richness and natural speech patterns derived from the real Fisher conversations, combined with the task-adherence learned from the synthetic data. This is facilitated by using the same hybrid prompt format (voice and text) across both data sources, acting as a bridge between task knowledge and natural interaction patterns.
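One simple way to blend such sources in a single training stage is to sample each example's source with probability proportional to its hours. The hours below come from the data description above, but hours-proportional sampling is an illustrative assumption; the paper's actual mixing ratio may differ.

```python
import random

# Hours per source, from the data description above. Sampling
# proportional to hours is an assumption for illustration only.
SOURCES = {
    "fisher_real": 1_217,
    "synthetic_assistant": 410,
    "synthetic_customer_service": 1_840,
}

def sample_source(rng: random.Random) -> str:
    """Pick a data source for the next training example."""
    names, hours = zip(*SOURCES.items())
    return rng.choices(names, weights=hours, k=1)[0]

rng = random.Random(0)
batch_sources = [sample_source(rng) for _ in range(1000)]
```

Because every source uses the same hybrid voice-and-text prompt format, the model sees a consistent conditioning interface regardless of which source a given example was drawn from.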

PersonaPlex demonstrates efficient specialization, requiring under 5,000 hours of directed data on top of Moshi’s pretrained weights to enable task-following while retaining broad conversational competence. It also shows emergent generalization beyond its training domains, handling complex and out-of-distribution scenarios (e.g., technical crisis management) with appropriate emotional tone and domain-specific reasoning, attributed to the broad corpus used for pretraining Moshi's language model, Helium.

Evaluation was performed using FullDuplexBench for conversational dynamics (turn-taking, user interruption, pause handling) and response quality (judged by GPT-4o), and ServiceDuplexBench (an extension for customer service task adherence). PersonaPlex significantly outperforms other commercial and open-source systems, including Moshi, Freeze Omni, Gemini Live, and Qwen 2.5 Omni, across metrics for conversational dynamics, latency, and task adherence in both general assistant and specialized customer service roles.

The model code and weights are released under MIT License and NVIDIA Open Model License, respectively, building upon the CC-BY-4.0 licensed Moshi model from Kyutai. The ServiceDuplexBench benchmark will also be made available.