
# MOSS-Audio-Tokenizer-ONNX

This repository provides the ONNX exports of MOSS-Audio-Tokenizer (encoder & decoder), enabling torch-free audio encoding/decoding for the MOSS-TTS family.

## Overview

MOSS-Audio-Tokenizer is the unified discrete audio interface for the entire MOSS-TTS Family, based on the Cat (Causal Audio Tokenizer with Transformer) architecture — a 1.6B-parameter, pure Causal Transformer audio tokenizer trained on 3M hours of diverse audio.

This ONNX repository is designed for lightweight, torch-free deployment scenarios. It serves as the audio tokenizer component in the MOSS-TTS llama.cpp inference backend, which combines llama.cpp (for the Qwen3 backbone) with ONNX Runtime or TensorRT (for the audio tokenizer) to achieve fully PyTorch-free TTS inference.

## Supported Backends

| Backend | Runtime | Use Case |
|---|---|---|
| ONNX Runtime (GPU) | `onnxruntime-gpu` | Recommended starting point |
| ONNX Runtime (CPU) | `onnxruntime` | CPU-only / no CUDA |
| TensorRT | Build from ONNX | Maximum throughput (user-built engines) |

> **Note:** We do not provide pre-built TensorRT engines, as they are tied to your specific GPU architecture and TensorRT version. To use TRT, build engines from the ONNX models yourself — see `moss_audio_tokenizer/trt/build_engine.sh` in the main repository.

## Repository Contents

| File | Description |
|---|---|
| `encoder.onnx` | ONNX model for audio encoding (waveform → discrete codes) |
| `decoder.onnx` | ONNX model for audio decoding (discrete codes → waveform) |
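To size buffers around the encoder and decoder, it helps to know how waveform length maps to code length. The 24 kHz input rate and 12.5 Hz code rate are from the model card below; the exact tensor layouts are an assumption you should verify with `session.get_inputs()` / `get_outputs()`. A minimal sketch of the arithmetic:

```python
# Sketch of the waveform <-> code-frame size relationship.
# The 24 kHz -> 12.5 Hz ratio is from the model card; treat the exact
# ONNX tensor layouts as an assumption and inspect the session I/O names.
SAMPLE_RATE_HZ = 24000   # model input sample rate
FRAME_RATE_HZ = 12.5     # discrete code frame rate after encoding

SAMPLES_PER_FRAME = int(SAMPLE_RATE_HZ / FRAME_RATE_HZ)  # 1920 samples per code frame

def num_code_frames(num_samples: int) -> int:
    """Number of full discrete code frames produced for a waveform of given length."""
    return num_samples // SAMPLES_PER_FRAME

print(SAMPLES_PER_FRAME)                      # 1920
print(num_code_frames(10 * SAMPLE_RATE_HZ))   # 125 frames for 10 s of audio
```

So a 10-second clip encodes to 125 code frames per quantizer level, which is what makes the 12.5 Hz rate attractive for LLM-style backbones.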

## Quick Start

```bash
# Download the ONNX weights
huggingface-cli download OpenMOSS-Team/MOSS-Audio-Tokenizer-ONNX \
  --local-dir weights/MOSS-Audio-Tokenizer-ONNX
```

This is typically used together with MOSS-TTS-GGUF for the llama.cpp inference pipeline. See the llama.cpp Backend documentation for the full end-to-end setup.
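Once downloaded, the models load with a plain `onnxruntime.InferenceSession`. The sketch below shows the provider-selection pattern (prefer CUDA, fall back to CPU); the model path follows the download command above, and the session-inspection step is there because the exact input/output names are an assumption — check them before wiring up inference:

```python
# Hedged sketch: loads the encoder with ONNX Runtime if the weights are
# present. Input/output tensor names are NOT assumed here; the script
# prints them so you can wire up inference against the real signature.
import os

ENCODER_PATH = "weights/MOSS-Audio-Tokenizer-ONNX/encoder.onnx"

def pick_providers(available):
    """Prefer the CUDA execution provider, falling back to CPU."""
    order = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    chosen = [p for p in order if p in available]
    return chosen or ["CPUExecutionProvider"]

if os.path.exists(ENCODER_PATH):
    import numpy as np
    import onnxruntime as ort

    sess = ort.InferenceSession(
        ENCODER_PATH, providers=pick_providers(ort.get_available_providers())
    )
    print([i.name for i in sess.get_inputs()])    # inspect real input names
    print([o.name for o in sess.get_outputs()])   # inspect real output names
```

The same pattern applies to `decoder.onnx`; for TensorRT, build engines first as described in the note above.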

## Main Repositories

| Repository | Description |
|---|---|
| OpenMOSS/MOSS-TTS | MOSS-TTS Family main repository (includes llama.cpp backend, PyTorch inference, and all models) |
| OpenMOSS/MOSS-Audio-Tokenizer | MOSS-Audio-Tokenizer source code, PyTorch weights, ONNX/TRT export scripts, and evaluation |
| OpenMOSS-Team/MOSS-Audio-Tokenizer | PyTorch weights on Hugging Face (for `trust_remote_code=True` usage) |
| OpenMOSS-Team/MOSS-TTS-GGUF | Pre-quantized GGUF backbone weights (companion to this ONNX repo) |

## About MOSS-Audio-Tokenizer

MOSS-Audio-Tokenizer compresses 24 kHz raw audio to a 12.5 Hz frame rate using a 32-layer Residual Vector Quantizer (RVQ), supporting high-fidelity reconstruction at bitrates from 0.125 kbps to 4 kbps. It is trained from scratch on 3 million hours of speech, sound effects, and music, and achieves state-of-the-art reconstruction quality among open-source audio tokenizers.
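The bitrate range follows directly from the frame rate and the number of RVQ levels used: each quantizer level emits one code per 12.5 Hz frame. The 10 bits per code below is an inference from the numbers above (4000 bps ÷ 12.5 Hz ÷ 32 quantizers, i.e. an assumed 1024-entry codebook), not something stated in this card:

```python
# RVQ bitrate arithmetic implied by the figures above. BITS_PER_CODE = 10
# is an assumption inferred from 4000 bps / (12.5 Hz * 32 quantizers).
FRAME_RATE_HZ = 12.5
BITS_PER_CODE = 10

def bitrate_bps(n_quantizers: int) -> float:
    """Bitrate when only the first n RVQ levels are kept at inference time."""
    return FRAME_RATE_HZ * n_quantizers * BITS_PER_CODE

print(bitrate_bps(1))    # 125.0  -> the 0.125 kbps minimum
print(bitrate_bps(32))   # 4000.0 -> the 4 kbps maximum
```

This matches the evaluation table below, where the MOSS rows at Nq = 6, 8, 16, 24, and 32 land at 750, 1000, 2000, 3000, and 4000 bps respectively.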

For the full model description, architecture details, and evaluation metrics, please refer to the main repositories listed above.

## Evaluation Metrics

The table below compares the reconstruction quality of open-source audio tokenizers with MOSS Audio Tokenizer on speech and audio/music data.

- Speech metrics are evaluated on LibriSpeech test-clean (English) and AISHELL-2 (Chinese), reported as EN/ZH.
- Audio metrics are evaluated on the AudioSet evaluation subset; music metrics are evaluated on MUSDB, reported as audio/music.
- STFT-Dist. denotes the STFT distance.
- Higher is better for the speech metrics; lower is better for the audio/music metrics (Mel-Loss, STFT-Dist.).
- Nq denotes the number of quantizers.
| Model | bps | Frame rate (Hz) | Nq | Speech: SIM ↑ (EN/ZH) | Speech: STOI ↑ (EN/ZH) | Speech: PESQ-NB ↑ (EN/ZH) | Speech: PESQ-WB ↑ (EN/ZH) | Audio/Music: Mel-Loss ↓ | Audio/Music: STFT-Dist. ↓ |
|---|---|---|---|---|---|---|---|---|---|
| XCodec2.0 | 800 | 50 | 1 | 0.82 / 0.74 | 0.92 / 0.86 | 3.04 / 2.46 | 2.43 / 1.96 | -- / -- | -- / -- |
| MiMo Audio Tokenizer | 850 | 25 | 4 | 0.80 / 0.74 | 0.91 / 0.87 | 2.94 / 2.62 | 2.39 / 2.14 | 0.82 / 0.81 | 2.33 / 2.23 |
| Higgs Audio Tokenizer | 1000 | 25 | 4 | 0.77 / 0.68 | 0.83 / 0.82 | 3.03 / 2.61 | 2.48 / 2.14 | 0.83 / 0.80 | 2.20 / 2.05 |
| SpeechTokenizer | 1000 | 50 | 2 | 0.36 / 0.25 | 0.77 / 0.68 | 1.59 / 1.38 | 1.25 / 1.17 | -- / -- | -- / -- |
| XY-Tokenizer | 1000 | 12.5 | 8 | 0.85 / 0.79 | 0.92 / 0.87 | 3.10 / 2.63 | 2.50 / 2.12 | -- / -- | -- / -- |
| BigCodec | 1040 | 80 | 1 | 0.84 / 0.69 | 0.93 / 0.88 | 3.27 / 2.55 | 2.68 / 2.06 | -- / -- | -- / -- |
| Mimi | 1100 | 12.5 | 8 | 0.74 / 0.59 | 0.91 / 0.85 | 2.80 / 2.24 | 2.25 / 1.78 | 1.24 / 1.19 | 2.62 / 2.49 |
| MOSS Audio Tokenizer (Ours) | 750 | 12.5 | 6 | 0.82 / 0.75 | 0.93 / 0.89 | 3.14 / 2.73 | 2.60 / 2.22 | 0.86 / 0.85 | 2.21 / 2.10 |
| MOSS Audio Tokenizer (Ours) | 1000 | 12.5 | 8 | 0.88 / 0.81 | 0.94 / 0.91 | 3.38 / 2.96 | 2.87 / 2.43 | 0.82 / 0.80 | 2.16 / 2.04 |
| DAC | 1500 | 75 | 2 | 0.48 / 0.41 | 0.83 / 0.79 | 1.87 / 1.67 | 1.48 / 1.37 | -- / -- | -- / -- |
| Encodec | 1500 | 75 | 2 | 0.60 / 0.45 | 0.85 / 0.81 | 1.94 / 1.80 | 1.56 / 1.48 | 1.12 / 1.04 | 2.60 / 2.42 |
| Higgs Audio Tokenizer | 2000 | 25 | 8 | 0.90 / 0.83 | 0.85 / 0.85 | 3.59 / 3.22 | 3.11 / 2.73 | 0.74 / 0.70 | 2.07 / 1.92 |
| SpeechTokenizer | 2000 | 50 | 4 | 0.66 / 0.50 | 0.88 / 0.80 | 2.38 / 1.79 | 1.92 / 1.49 | -- / -- | -- / -- |
| Qwen3 TTS Tokenizer | 2200 | 12.5 | 16 | 0.95 / 0.88 | 0.96 / 0.93 | 3.66 / 3.10 | 3.19 / 2.62 | -- / -- | -- / -- |
| MiMo Audio Tokenizer | 2250 | 25 | 12 | 0.89 / 0.83 | 0.95 / 0.92 | 3.57 / 3.25 | 3.05 / 2.71 | 0.70 / 0.68 | 2.21 / 2.10 |
| Mimi | 2475 | 12.5 | 18 | 0.89 / 0.76 | 0.94 / 0.91 | 3.49 / 2.90 | 2.97 / 2.35 | 1.10 / 1.06 | 2.45 / 2.32 |
| MOSS Audio Tokenizer (Ours) | 1500 | 12.5 | 12 | 0.92 / 0.86 | 0.95 / 0.93 | 3.64 / 3.27 | 3.20 / 2.74 | 0.77 / 0.74 | 2.08 / 1.96 |
| MOSS Audio Tokenizer (Ours) | 2000 | 12.5 | 16 | 0.95 / 0.89 | 0.96 / 0.94 | 3.78 / 3.46 | 3.41 / 2.96 | 0.73 / 0.70 | 2.03 / 1.90 |
| DAC | 3000 | 75 | 4 | 0.74 / 0.67 | 0.90 / 0.88 | 2.76 / 2.47 | 2.31 / 2.07 | 0.86 / 0.83 | 2.23 / 2.10 |
| MiMo Audio Tokenizer | 3650 | 25 | 20 | 0.91 / 0.85 | 0.95 / 0.93 | 3.73 / 3.44 | 3.25 / 2.89 | 0.66 / 0.65 | 2.17 / 2.06 |
| SpeechTokenizer | 4000 | 50 | 8 | 0.85 / 0.69 | 0.92 / 0.85 | 3.05 / 2.20 | 2.60 / 1.87 | -- / -- | -- / -- |
| Mimi | 4400 | 12.5 | 32 | 0.94 / 0.83 | 0.96 / 0.94 | 3.80 / 3.31 | 3.43 / 2.78 | 1.02 / 0.98 | 2.34 / 2.21 |
| Encodec | 4500 | 75 | 6 | 0.86 / 0.75 | 0.92 / 0.91 | 2.91 / 2.63 | 2.46 / 2.15 | 0.91 / 0.84 | 2.33 / 2.17 |
| DAC | 6000 | 75 | 8 | 0.89 / 0.84 | 0.95 / 0.94 | 3.75 / 3.57 | 3.41 / 3.20 | 0.65 / 0.63 | 1.97 / 1.87 |
| MOSS Audio Tokenizer (Ours) | 3000 | 12.5 | 24 | 0.96 / 0.92 | 0.97 / 0.96 | 3.90 / 3.64 | 3.61 / 3.20 | 0.69 / 0.66 | 1.98 / 1.84 |
| MOSS Audio Tokenizer (Ours) | 4000 | 12.5 | 32 | 0.97 / 0.93 | 0.97 / 0.96 | 3.95 / 3.71 | 3.69 / 3.30 | 0.68 / 0.64 | 1.96 / 1.82 |

### LibriSpeech Speech Metrics (MOSS Audio Tokenizer vs. Open-source Tokenizers)

The plots below compare our MOSS Audio Tokenizer model with other open-source speech tokenizers on the LibriSpeech dataset, evaluated with SIM, STOI, PESQ-NB, and PESQ-WB (higher is better). We control the bps of the same model by adjusting the number of RVQ codebooks used during inference.

*(Plots: SIM, STOI, PESQ-NB, and PESQ-WB vs. bitrate)*

## Citation

If you use this code or these results in your paper, please cite our work as: