VoxCPM: An Open-Source Model for Multilingual Speech Synthesis and Voice Cloning

Overview

VoxCPM is an open-source speech synthesis project from OpenBMB. Its latest release, VoxCPM2, focuses on multilingual TTS, natural-language voice design, controllable voice cloning, and 48kHz high-quality audio generation.

Basic information

Item	Details
GitHub	OpenBMB/VoxCPM
Positioning	Open-source multilingual TTS and voice cloning system based on continuous speech representations
Main release	VoxCPM2
Model size	2B parameters
Training data	More than 2 million hours of multilingual speech data
Languages	30 global languages + 9 Chinese dialects
Core capabilities	Text-to-speech, voice design, controllable voice cloning, high-fidelity cloning, streaming synthesis
Audio output	48kHz high-quality audio
Technical approach	Tokenizer-Free, diffusion autoregressive architecture, built on MiniCPM-4
License	Apache-2.0
Main language	Python
Stars	Around 30.7k as of 2026-06-19
Online demo	Hugging Face Demo / China-friendly demo
Documentation	ReadTheDocs
Models	Hugging Face / ModelScope

What problem does it solve?

Many speech synthesis systems first compress audio into discrete tokens, predict those tokens, and then reconstruct speech. This route is practical and widely used, but it can also lose part of the fine-grained information in speech, such as tone, emotion, rhythm, and texture.
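The information loss described above can be illustrated with a toy example. The sketch below is not how a neural audio codec or VoxCPM works internally; it only assumes a simple scalar quantizer (the function names are made up for illustration) to show that snapping a continuous signal to a small set of discrete values leaves a reconstruction error, and that the error shrinks as the number of levels grows.

```python
import math


def quantize(value, levels):
    """Snap a value in [-1, 1] to the nearest of `levels` evenly spaced points."""
    step = 2.0 / (levels - 1)
    index = round((value + 1.0) / step)
    return -1.0 + index * step


def max_quantization_error(signal, levels):
    """Largest absolute difference between the signal and its quantized copy."""
    return max(abs(x - quantize(x, levels)) for x in signal)


# A toy "continuous" signal: one cycle of a sine wave sampled at 200 points.
signal = [math.sin(2 * math.pi * n / 200) for n in range(200)]

# Few discrete levels lose more detail than many levels.
coarse_error = max_quantization_error(signal, levels=8)
fine_error = max_quantization_error(signal, levels=256)

print(f"8 levels:   max error {coarse_error:.4f}")
print(f"256 levels: max error {fine_error:.4f}")
```

Real discrete-token systems use learned codebooks rather than evenly spaced levels, but the trade-off is the same in spirit: whatever falls between codebook entries has to be recovered by the decoder or is lost.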

VoxCPM takes a different route. It emphasizes a Tokenizer-Free design: instead of relying on discrete audio tokenization, it directly generates continuous speech representations. The goal is to preserve more vocal detail and make synthesized speech sound more natural and expressive, especially in voice cloning scenarios where timbre consistency matters.

This makes VoxCPM useful for scenarios such as:

Multilingual dubbing and narration
AI podcasts, short-video voiceovers, and spoken content generation
Character voice exploration
Voice cloning with emotion and speaking-rate control
High-quality speech product prototypes

Core features

1. Multilingual text-to-speech

VoxCPM2 supports 30 global languages. Users can input raw text directly without adding explicit language tags. For Chinese, it also covers 9 dialects, including Sichuanese, Cantonese, Wu Chinese, Northeastern Mandarin, Henan dialect, Shaanxi dialect, Shandong dialect, Tianjin dialect, and Hokkien.

For content products, this is practical: one speech generation pipeline can cover Chinese, English, Japanese, Korean, French, Spanish, and more, instead of maintaining a separate model for every language.

2. Natural-language voice design

One of the more interesting capabilities of VoxCPM2 is Voice Design. You do not always need a reference audio clip. Instead, you can describe the target voice in natural language, such as “a young female voice, gentle and sweet” or “a middle-aged male voice, calm, magnetic, and slow-paced.”

This is useful for character voice exploration, podcast narration, virtual streamers, game NPC dubbing, and similar workflows. Compared with choosing from a few fixed TTS voices, it feels closer to “describe a voice and generate a new voice that matches the description.”

3. Controllable voice cloning

If you have a reference audio clip, VoxCPM2 can clone its timbre and optionally apply style instructions. For example, it can keep the original voice identity while making the output sound happier, faster, or more expressive.

This is more flexible than simple voice imitation. In real applications, users often want not only the same voice, but also the ability to control how that voice speaks across different content.

4. High-fidelity cloning

The project also provides a higher-fidelity cloning mode: you provide both the reference audio and its transcript, and the model generates the new speech as a continuation of the reference audio. This mode focuses on preserving details such as timbre, rhythm, emotion, and speaking style.

If the goal is to stay as close as possible to a reference voice, rather than just producing a roughly similar voice, this mode is a better fit.

5. 48kHz output and real-time streaming

VoxCPM2 natively outputs 48kHz audio. With AudioVAE V2’s asymmetric encode/decode design, it can generate high-quality output from a 16kHz reference audio clip without requiring an external super-resolution module.

For inference performance, the project README reports a real-time factor (RTF, synthesis time divided by the duration of the generated audio) as low as about 0.3 on an NVIDIA RTX 4090 with the standard PyTorch implementation, and about 0.13 when accelerated by Nano-vLLM or vLLM-Omni. An RTF below 1 means audio is generated faster than it plays back, so VoxCPM is not only an offline experiment; it also has a path toward service-oriented deployment.
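To make the RTF figures concrete: the real-time factor is the synthesis time divided by the duration of the audio produced, so a lower value means faster generation. The sketch below uses hypothetical helper names and simply applies that definition to the two reported values for a 60-second clip; actual timings depend on hardware, text length, and settings.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent synthesizing / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def synthesis_time(audio_seconds, rtf):
    """Estimated wall-clock time needed to generate `audio_seconds` of audio."""
    return audio_seconds * rtf


# Estimated time to generate one minute of speech at the reported RTF values.
standard_seconds = synthesis_time(60.0, 0.3)      # roughly 18 seconds
accelerated_seconds = synthesis_time(60.0, 0.13)  # roughly 7.8 seconds

print(f"Standard PyTorch: ~{standard_seconds:.1f} s per minute of audio")
print(f"Accelerated:      ~{accelerated_seconds:.1f} s per minute of audio")
```

The same `real_time_factor` function can be used to measure a local setup: time a call to the model, divide by the length of the returned audio in seconds, and compare the result with the reported numbers.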

Who is it for?

VoxCPM is especially relevant for:

Developers building multilingual TTS services
Content teams that need Chinese, English, and other multilingual voice generation
Engineers researching voice cloning, speech generation, and continuous speech representations
Product teams working on AI podcasts, digital humans, virtual streamers, game dubbing, or short-video voiceovers
Teams that need a commercially usable open-source TTS model for local deployment or further development

Quick start

The simplest way to start is to install the Python package:

pip install voxcpm

The main environment requirements are Python 3.10 to 3.12, PyTorch 2.5.0 or later, and CUDA 12.0 or later.
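Before installing, it can help to check the environment against these requirements. The following is a minimal sketch with hypothetical helper names; it only compares version numbers and assumes the thresholds listed above (Python 3.10 to 3.12, PyTorch 2.5.0 or later, CUDA 12.0 or later).

```python
import re
import sys

MIN_PYTHON = (3, 10)
MAX_PYTHON = (3, 12)
MIN_TORCH = (2, 5, 0)
MIN_CUDA = (12, 0, 0)


def parse_version(text):
    """Turn a string such as '2.5.1+cu121' or '12.1' into a (major, minor, patch) tuple."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def python_supported(version_info=None):
    """Check that the interpreter's major.minor version is within 3.10 to 3.12."""
    if version_info is None:
        version_info = sys.version_info
    return MIN_PYTHON <= tuple(version_info[:2]) <= MAX_PYTHON


def torch_supported(version_text):
    """Check a PyTorch version string against the 2.5.0 minimum."""
    return parse_version(version_text) >= MIN_TORCH


def cuda_supported(version_text):
    """Check a CUDA version string against the 12.0 minimum."""
    return parse_version(version_text) >= MIN_CUDA


print("Python OK:", python_supported())
```

With PyTorch installed, `torch.__version__` and `torch.version.cuda` supply the strings for the last two checks; note that `torch.version.cuda` is `None` on CPU-only builds.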

A basic text-to-speech example:

from voxcpm import VoxCPM
import soundfile as sf

# Load the VoxCPM2 checkpoint by its model ID
model = VoxCPM.from_pretrained(
    "openbmb/VoxCPM2",
    load_denoiser=False,
)

# Synthesize speech for the given text
wav = model.generate(
    text="VoxCPM2 is the recommended multilingual speech synthesis version.",
    cfg_value=2.0,
    inference_timesteps=10,
)

# Save the waveform at the model's sample rate
sf.write("demo.wav", wav, model.tts_model.sample_rate)

To try voice design, put the voice description at the beginning of the text:

wav = model.generate(
    text="(a young female voice, gentle and sweet) Hello, welcome to VoxCPM2!",
    cfg_value=2.0,
    inference_timesteps=10,
)

sf.write("voice_design.wav", wav, model.tts_model.sample_rate)
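When generating many lines with the same designed voice, the prompt format from the example above can be wrapped in a small helper. This is a sketch only: `with_voice_description` is a hypothetical name, not part of the voxcpm package, and it assumes the parenthesized-prefix format shown in the example.

```python
def with_voice_description(description, text):
    """Prefix `text` with a parenthesized voice description, as in the example above."""
    description = description.strip()
    if not description:
        # No description given: return the plain text unchanged.
        return text
    return f"({description}) {text}"


voice = "a young female voice, gentle and sweet"
lines = [
    "Hello, welcome to VoxCPM2!",
    "Today we will look at multilingual speech synthesis.",
]

# Each prompt carries the same voice description in front of its own text.
prompts = [with_voice_description(voice, line) for line in lines]
for prompt in prompts:
    print(prompt)
```

Each resulting string can then be passed as the `text` argument of `model.generate`, exactly as in the voice design call above.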

The command-line interface is also available:

# Voice design
voxcpm design \
  --text "VoxCPM2 brings a new speech synthesis experience." \
  --output out.wav

# Voice cloning
voxcpm clone \
  --text "This is a voice cloning demo." \
  --reference-audio path/to/voice.wav \
  --output out.wav
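For batch jobs, the same CLI calls can be assembled from Python. The sketch below uses hypothetical helper names and only the subcommands and flags shown above; it builds argument lists without running them, so it makes no assumptions about other CLI options.

```python
import shlex


def design_command(text, output):
    """Argument list for the `voxcpm design` call shown above."""
    return ["voxcpm", "design", "--text", text, "--output", output]


def clone_command(text, reference_audio, output):
    """Argument list for the `voxcpm clone` call shown above."""
    return [
        "voxcpm", "clone",
        "--text", text,
        "--reference-audio", reference_audio,
        "--output", output,
    ]


command = clone_command(
    text="This is a voice cloning demo.",
    reference_audio="path/to/voice.wav",
    output="out.wav",
)

# shlex.join produces a shell-safe string for logging or copy-pasting.
printable = shlex.join(command)
print(printable)
```

To execute a command, pass the list to `subprocess.run(command, check=True)`; passing a list rather than a string avoids shell-quoting problems with text that contains quotes or spaces.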

If you only want to hear what the model can do, start with the official demo. For higher-throughput deployment, check the Nano-vLLM-VoxCPM and vLLM-Omni deployment options.

Conclusion

VoxCPM2's main value is that it brings multilingual synthesis, voice design, voice cloning, and high-quality output together in one open-source system. It is also approachable for developers: it provides a Python API, a CLI, a web demo, and deployment-oriented acceleration paths.

If you are building content generation tools, speech products, digital humans, AI podcasts, or voice cloning applications, VoxCPM is worth a close look. It can be used as a ready-made TTS and cloning component, and it is also a useful reference implementation for studying continuous speech representations and open-source speech generation systems.