
One-line positioning
VoxCPM is an open-source speech synthesis project from OpenBMB. Its latest release, VoxCPM2, focuses on multilingual TTS, natural-language voice design, controllable voice cloning, and 48kHz high-quality audio generation.
Basic information
| Item | Details |
|---|---|
| GitHub | OpenBMB/VoxCPM |
| Positioning | Open-source multilingual TTS and voice cloning system based on continuous speech representations |
| Main release | VoxCPM2 |
| Model size | 2B parameters |
| Training data | More than 2 million hours of multilingual speech data |
| Languages | 30 global languages + 9 Chinese dialects |
| Core capabilities | Text-to-speech, voice design, controllable voice cloning, high-fidelity cloning, streaming synthesis |
| Audio output | 48kHz high-quality audio |
| Technical approach | Tokenizer-Free, diffusion autoregressive architecture, built on MiniCPM-4 |
| License | Apache-2.0 |
| Main language | Python |
| Stars | Around 30.7k as of 2026-06-19 |
| Online demo | Hugging Face Demo / China-friendly demo |
| Documentation | ReadTheDocs |
| Models | Hugging Face / ModelScope |
What problem does it solve?
Many speech synthesis systems first compress audio into discrete tokens, predict those tokens, and then reconstruct speech. This route is practical and widely used, but it can also lose part of the fine-grained information in speech, such as tone, emotion, rhythm, and texture.
VoxCPM takes a different route. It emphasizes a Tokenizer-Free design: instead of relying on discrete audio tokenization, it directly generates continuous speech representations. The goal is to preserve more vocal detail and make synthesized speech sound more natural and expressive, especially in voice cloning scenarios where timbre consistency matters.
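To make the trade-off concrete, here is a toy sketch of what snapping continuous values to a small codebook does. It is only an illustration: the feature values and the four-entry codebook are invented, and nothing here reflects VoxCPM's actual model or any real audio codec.

```python
# Toy illustration only: this is NOT VoxCPM's model or any real audio codec.
# It shows why snapping continuous values to a small codebook loses detail.

def quantize(value: float, codebook: list[float]) -> float:
    """Return the codebook entry closest to `value` (a stand-in for a discrete token)."""
    return min(codebook, key=lambda entry: abs(entry - value))

# Hypothetical continuous features, e.g. a slowly rising pitch contour.
continuous = [0.10, 0.18, 0.27, 0.33, 0.41, 0.58]

# A deliberately coarse codebook with only four entries.
codebook = [0.0, 0.25, 0.5, 0.75]

discrete = [quantize(v, codebook) for v in continuous]
errors = [abs(v - q) for v, q in zip(continuous, discrete)]

print("continuous:", continuous)
print("quantized: ", discrete)
print("max error: ", round(max(errors), 3))
```

In this toy case several distinct input values collapse onto the same codebook entry, which is the kind of fine-grained loss a continuous representation tries to avoid.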
The tokenizer-free design makes VoxCPM a good fit for scenarios such as:
- Multilingual dubbing and narration
- AI podcasts, short-video voiceovers, and spoken content generation
- Character voice exploration
- Voice cloning with emotion and speaking-rate control
- High-quality speech product prototypes
Core features
1. Multilingual text-to-speech
VoxCPM2 supports 30 global languages. Users can input raw text directly without adding explicit language tags. For Chinese, it also covers 9 dialects: Sichuanese, Cantonese, Wu Chinese, Northeastern Mandarin, Henan dialect, Shaanxi dialect, Shandong dialect, Tianjin dialect, and Hokkien.
For content products, this is practical: one speech generation pipeline can cover Chinese, English, Japanese, Korean, French, Spanish, and more, instead of maintaining a separate model for every language.
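A single-pipeline setup might look like the sketch below. The helper name `synthesize_all` and the dictionary keys are hypothetical; the only VoxCPM call assumed is `model.generate(text=..., cfg_value=..., inference_timesteps=...)`, as shown in the Quick start. A stand-in callable is used so the sketch runs without downloading model weights.

```python
from typing import Any, Callable

def synthesize_all(generate: Callable[..., Any], scripts: dict[str, str]) -> dict[str, Any]:
    """Run one generate callable over several scripts, keyed by an output name."""
    results = {}
    for name, text in scripts.items():
        # No language tag is attached; the raw text is passed as-is.
        results[name] = generate(text=text, cfg_value=2.0, inference_timesteps=10)
    return results

# The keys are just output names chosen for this example, not language tags.
scripts = {
    "en": "Welcome to the show.",
    "fr": "Bienvenue dans l'émission.",
    "es": "Bienvenidos al programa.",
}

# Stand-in for model.generate so the sketch is runnable on its own.
def fake_generate(text: str, **kwargs: Any) -> str:
    return f"<audio for {text!r}>"

outputs = synthesize_all(fake_generate, scripts)
print(sorted(outputs))
```

With a loaded model, the call would presumably become `synthesize_all(model.generate, scripts)`, with each result written out via `soundfile`.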
2. Natural-language voice design
One of the more interesting capabilities of VoxCPM2 is Voice Design. You do not always need a reference audio clip. Instead, you can describe the target voice in natural language, such as “a young female voice, gentle and sweet” or “a middle-aged male voice, calm, magnetic, and slow-paced.”
This is useful for character voice exploration, podcast narration, virtual streamers, game NPC dubbing, and similar workflows. Rather than picking from a small set of fixed TTS voices, you describe the voice you want and the model generates a new one to match.
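The Quick start example in this article passes the voice description as a parenthesized prefix of the text. A small helper for assembling such prompts could look like this; `build_voice_design_text` is a hypothetical name, not part of the VoxCPM package.

```python
def build_voice_design_text(description: str, text: str) -> str:
    """Prefix `text` with a parenthesized voice description."""
    # Tolerate descriptions that already come wrapped in parentheses.
    description = description.strip().strip("()").strip()
    if not description:
        raise ValueError("voice description must not be empty")
    return f"({description}) {text.strip()}"

prompt = build_voice_design_text(
    "a middle-aged male voice, calm, magnetic, and slow-paced",
    "Good evening, and welcome back.",
)
print(prompt)
```

The resulting string would then be passed as the `text` argument of `model.generate`.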
3. Controllable voice cloning
If you have a reference audio clip, VoxCPM2 can clone its timbre and optionally apply style instructions. For example, it can keep the original voice identity while making the output sound happier, faster, or more expressive.
This is more flexible than simple voice imitation. In real applications, users often want not only the same voice, but also the ability to control how that voice speaks across different content.
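As a rough sketch, a cloning job can be scripted by assembling the `voxcpm clone` command shown in the Quick start. The flags used are the ones that appear there. How a style instruction is passed is an assumption here: the sketch reuses the parenthesized-prefix format from voice design, which should be checked against the project documentation. `build_clone_command` is a hypothetical helper.

```python
import shlex

def build_clone_command(text: str, reference_audio: str, output: str, style: str = "") -> list[str]:
    """Build an argument list for the `voxcpm clone` command."""
    # Assumption: a style instruction is passed as a parenthesized prefix of the
    # text, mirroring the voice-design format. Verify against the project docs.
    if style:
        text = f"({style}) {text}"
    return [
        "voxcpm", "clone",
        "--text", text,
        "--reference-audio", reference_audio,
        "--output", output,
    ]

cmd = build_clone_command(
    "This is a voice cloning demo.",
    "path/to/voice.wav",
    "out.wav",
    style="slightly faster, cheerful",
)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell-quoting problems when the text contains quotes or parentheses.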
4. High-fidelity cloning
The project also provides a higher-fidelity cloning mode: provide both reference audio and its transcript, and the model continues from the reference audio. This mode focuses on preserving details such as timbre, rhythm, emotion, and speaking style.
If the goal is to stay as close as possible to a reference voice, rather than just producing a roughly similar voice, this mode is a better fit.
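The choice between the modes described so far comes down to which inputs you have. The sketch below encodes that decision; the function and the mode labels are hypothetical names for this article, not identifiers from the VoxCPM API.

```python
from typing import Optional

def pick_mode(reference_audio: Optional[str] = None, transcript: Optional[str] = None) -> str:
    """Choose a generation mode from the inputs that are available."""
    if reference_audio and transcript:
        return "high_fidelity_clone"  # reference audio plus its transcript
    if reference_audio:
        return "controllable_clone"   # reference audio only, optional style instruction
    if transcript:
        raise ValueError("a transcript is only useful together with its reference audio")
    return "text_only"                # plain TTS or natural-language voice design

print(pick_mode())
print(pick_mode("voice.wav"))
print(pick_mode("voice.wav", "Text spoken in voice.wav."))
```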
5. 48kHz output and real-time streaming
VoxCPM2 natively outputs 48kHz audio. With AudioVAE V2’s asymmetric encode/decode design, it can generate high-quality output from a 16kHz reference audio clip without requiring an external super-resolution module.
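For inference performance, the project README reports an RTF (real-time factor: synthesis time divided by the duration of the generated audio, so values below 1 are faster than real time) as low as around 0.3 on an NVIDIA RTX 4090 with the standard PyTorch implementation, and around 0.13 when accelerated by Nano-vLLM or vLLM-Omni. At these speeds audio is generated several times faster than it plays back, so VoxCPM is viable for service-oriented deployment as well as offline experiments.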
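To see what those RTF figures mean in practice, the arithmetic can be sketched as follows. The helper names are hypothetical, and the 60-second clip is just an example length.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds

def seconds_to_generate(audio_seconds: float, rtf: float) -> float:
    """Estimated wall-clock time to synthesize `audio_seconds` at a given RTF."""
    return audio_seconds * rtf

# Apply the two RTF figures quoted in the README to a 60-second clip.
for label, rtf in [("PyTorch", 0.3), ("accelerated", 0.13)]:
    print(label, round(seconds_to_generate(60.0, rtf), 1), "s")
```

To measure RTF on your own hardware, one option is to time a `model.generate` call with `time.perf_counter()` and divide by `len(wav) / model.tts_model.sample_rate`.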
Who is it for?
VoxCPM is especially relevant for:
- Developers building multilingual TTS services
- Content teams that need Chinese, English, and other multilingual voice generation
- Engineers researching voice cloning, speech generation, and continuous speech representations
- Product teams working on AI podcasts, digital humans, virtual streamers, game dubbing, or short-video voiceovers
- Teams that need a commercially usable open-source TTS model for local deployment or further development
Quick start
The simplest way to start is to install the Python package:

    pip install voxcpm

The main environment requirements are Python 3.10 to 3.12, PyTorch 2.5.0 or later, and CUDA 12.0 or later.
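Before installing, a quick check of the local environment against those requirements can save some debugging. This is a minimal sketch: `python_supported` is a hypothetical helper, and the PyTorch part only reports versions if PyTorch is already installed.

```python
import sys

def python_supported(version: tuple[int, int] = sys.version_info[:2]) -> bool:
    """True if the interpreter is within the stated 3.10 to 3.12 range."""
    return (3, 10) <= version <= (3, 12)

print("Python OK:", python_supported())

try:
    import torch  # only inspected if it is already installed
    # torch.version.cuda is None on CPU-only builds.
    print("PyTorch:", torch.__version__, "| CUDA build:", torch.version.cuda)
except ImportError:
    print("PyTorch is not installed yet")
```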
A basic text-to-speech example:

    from voxcpm import VoxCPM
    import soundfile as sf

    # Load the VoxCPM2 weights; the optional denoiser is not loaded here.
    model = VoxCPM.from_pretrained(
        "openbmb/VoxCPM2",
        load_denoiser=False,
    )

    # Synthesize speech from plain text.
    wav = model.generate(
        text="VoxCPM2 is the recommended multilingual speech synthesis version.",
        cfg_value=2.0,
        inference_timesteps=10,
    )

    # Write the waveform at the model's own sample rate.
    sf.write("demo.wav", wav, model.tts_model.sample_rate)

To try voice design, put the voice description at the beginning of the text:

    wav = model.generate(
        text="(a young female voice, gentle and sweet) Hello, welcome to VoxCPM2!",
        cfg_value=2.0,
        inference_timesteps=10,
    )
    sf.write("voice_design.wav", wav, model.tts_model.sample_rate)

The command-line interface is also available:

    # Voice design
    voxcpm design \
      --text "VoxCPM2 brings a new speech synthesis experience." \
      --output out.wav

    # Voice cloning
    voxcpm clone \
      --text "This is a voice cloning demo." \
      --reference-audio path/to/voice.wav \
      --output out.wav

If you only want to try the effect, start with the official demo. If you need higher-throughput deployment, check the Nano-vLLM-VoxCPM and vLLM-Omni deployment options.
Conclusion
VoxCPM2's value lies in bringing multilingual synthesis, voice design, voice cloning, and high-quality output together in one open-source system. It is also approachable for developers: it provides a Python API, a CLI, a web demo, and deployment-oriented acceleration paths.
If you are building content generation tools, speech products, digital humans, AI podcasts, or voice cloning applications, VoxCPM is worth a close look. It can be used as a ready-made TTS and cloning component, and it is also a useful reference implementation for studying continuous speech representations and open-source speech generation systems.
