
# MOSS-Audio-Tokenizer-ONNX

This repository provides the ONNX exports of MOSS-Audio-Tokenizer (encoder & decoder), enabling torch-free audio encoding/decoding for the MOSS-TTS family.

## Overview

MOSS-Audio-Tokenizer is the unified discrete audio interface for the entire MOSS-TTS Family, based on the Cat (Causal Audio Tokenizer with Transformer) architecture — a 1.6B-parameter, pure Causal Transformer audio tokenizer trained on 3M hours of diverse audio.

This ONNX repository is designed for lightweight, torch-free deployment scenarios. It serves as the audio tokenizer component in the MOSS-TTS llama.cpp inference backend, which combines llama.cpp (for the Qwen3 backbone) with ONNX Runtime or TensorRT (for the audio tokenizer) to achieve fully PyTorch-free TTS inference.

## Supported Backends

| Backend | Runtime | Use Case |
|---|---|---|
| ONNX Runtime (GPU) | `onnxruntime-gpu` | Recommended starting point |
| ONNX Runtime (CPU) | `onnxruntime` | CPU-only / no CUDA |
| TensorRT | Build from ONNX | Maximum throughput (user-built engines) |

> **Note:** We do not provide pre-built TensorRT engines, as they are tied to your specific GPU architecture and TensorRT version. To use TRT, build engines from the ONNX models yourself — see `moss_audio_tokenizer/trt/build_engine.sh` in the main repository.

## Repository Contents

| File | Description |
|---|---|
| `encoder.onnx` | ONNX model for audio encoding (waveform → discrete codes) |
| `decoder.onnx` | ONNX model for audio decoding (discrete codes → waveform) |
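To size buffers around the encoder and decoder, it helps to know how waveform length maps to code length. The 24 kHz input rate and 12.5 Hz code rate are from the model card below; the exact tensor layouts are an assumption you should verify with `session.get_inputs()` / `get_outputs()`. A minimal sketch of the arithmetic:

```python
# Sketch of the waveform <-> code-frame size relationship.
# The 24 kHz -> 12.5 Hz ratio is from the model card; treat the exact
# ONNX tensor layouts as an assumption and inspect the session I/O names.
SAMPLE_RATE_HZ = 24000   # model input sample rate
FRAME_RATE_HZ = 12.5     # discrete code frame rate after encoding

SAMPLES_PER_FRAME = int(SAMPLE_RATE_HZ / FRAME_RATE_HZ)  # 1920 samples per code frame

def num_code_frames(num_samples: int) -> int:
    """Number of full discrete code frames produced for a waveform of given length."""
    return num_samples // SAMPLES_PER_FRAME

print(SAMPLES_PER_FRAME)                      # 1920
print(num_code_frames(10 * SAMPLE_RATE_HZ))   # 125 frames for 10 s of audio
```

So a 10-second clip encodes to 125 code frames per quantizer level, which is what makes the 12.5 Hz rate attractive for LLM-style backbones.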

## Quick Start

```bash
# Download the ONNX weights
huggingface-cli download OpenMOSS-Team/MOSS-Audio-Tokenizer-ONNX \
  --local-dir weights/MOSS-Audio-Tokenizer-ONNX
```

This is typically used together with MOSS-TTS-GGUF for the llama.cpp inference pipeline. See the llama.cpp Backend documentation for the full end-to-end setup.
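Once downloaded, the models load with a plain `onnxruntime.InferenceSession`. The sketch below shows the provider-selection pattern (prefer CUDA, fall back to CPU); the model path follows the download command above, and the session-inspection step is there because the exact input/output names are an assumption — check them before wiring up inference:

```python
# Hedged sketch: loads the encoder with ONNX Runtime if the weights are
# present. Input/output tensor names are NOT assumed here; the script
# prints them so you can wire up inference against the real signature.
import os

ENCODER_PATH = "weights/MOSS-Audio-Tokenizer-ONNX/encoder.onnx"

def pick_providers(available):
    """Prefer the CUDA execution provider, falling back to CPU."""
    order = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    chosen = [p for p in order if p in available]
    return chosen or ["CPUExecutionProvider"]

if os.path.exists(ENCODER_PATH):
    import numpy as np
    import onnxruntime as ort

    sess = ort.InferenceSession(
        ENCODER_PATH, providers=pick_providers(ort.get_available_providers())
    )
    print([i.name for i in sess.get_inputs()])    # inspect real input names
    print([o.name for o in sess.get_outputs()])   # inspect real output names
```

The same pattern applies to `decoder.onnx`; for TensorRT, build engines first as described in the note above.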

## Main Repositories

| Repository | Description |
|---|---|
| OpenMOSS/MOSS-TTS | MOSS-TTS Family main repository (includes llama.cpp backend, PyTorch inference, and all models) |
| OpenMOSS/MOSS-Audio-Tokenizer | MOSS-Audio-Tokenizer source code, PyTorch weights, ONNX/TRT export scripts, and evaluation |
| OpenMOSS-Team/MOSS-Audio-Tokenizer | PyTorch weights on Hugging Face (for `trust_remote_code=True` usage) |
| OpenMOSS-Team/MOSS-TTS-GGUF | Pre-quantized GGUF backbone weights (companion to this ONNX repo) |

## About MOSS-Audio-Tokenizer

MOSS-Audio-Tokenizer compresses 24 kHz raw audio to a 12.5 Hz frame rate using a 32-layer Residual Vector Quantizer (RVQ), supporting high-fidelity reconstruction at bitrates from 0.125 kbps to 4 kbps. It is trained from scratch on 3 million hours of speech, sound effects, and music, and achieves state-of-the-art reconstruction quality among open-source audio tokenizers.
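The bitrate range follows directly from the frame rate and the number of RVQ levels used: each quantizer level emits one code per 12.5 Hz frame. The 10 bits per code below is an inference from the numbers above (4000 bps ÷ 12.5 Hz ÷ 32 quantizers, i.e. an assumed 1024-entry codebook), not something stated in this card:

```python
# RVQ bitrate arithmetic implied by the figures above. BITS_PER_CODE = 10
# is an assumption inferred from 4000 bps / (12.5 Hz * 32 quantizers).
FRAME_RATE_HZ = 12.5
BITS_PER_CODE = 10

def bitrate_bps(n_quantizers: int) -> float:
    """Bitrate when only the first n RVQ levels are kept at inference time."""
    return FRAME_RATE_HZ * n_quantizers * BITS_PER_CODE

print(bitrate_bps(1))    # 125.0  -> the 0.125 kbps minimum
print(bitrate_bps(32))   # 4000.0 -> the 4 kbps maximum
```

This matches the evaluation table below, where the MOSS rows at Nq = 6, 8, 16, 24, and 32 land at 750, 1000, 2000, 3000, and 4000 bps respectively.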

For the full model description, architecture details, and evaluation metrics, please refer to the main repositories listed above.

## Evaluation Metrics

The table below compares the reconstruction quality of open-source audio tokenizers with MOSS Audio Tokenizer on speech and audio/music data.

- Speech metrics are evaluated on LibriSpeech test-clean (English) and AISHELL-2 (Chinese), reported as EN/ZH.
- Audio metrics are evaluated on the AudioSet evaluation subset; music metrics are evaluated on MUSDB, reported as audio/music.
- STFT-Dist. denotes the STFT distance.
- Higher is better for the speech metrics; lower is better for the audio/music metrics (Mel-Loss, STFT-Dist.).
- Nq denotes the number of quantizers.
| Model | bps | Frame rate (Hz) | Nq | Speech: SIM ↑ (EN/ZH) | Speech: STOI ↑ (EN/ZH) | Speech: PESQ-NB ↑ (EN/ZH) | Speech: PESQ-WB ↑ (EN/ZH) | Audio/Music: Mel-Loss ↓ | Audio/Music: STFT-Dist. ↓ |
|---|---|---|---|---|---|---|---|---|---|
| XCodec2.0 | 800 | 50 | 1 | 0.82 / 0.74 | 0.92 / 0.86 | 3.04 / 2.46 | 2.43 / 1.96 | -- / -- | -- / -- |
| MiMo Audio Tokenizer | 850 | 25 | 4 | 0.80 / 0.74 | 0.91 / 0.87 | 2.94 / 2.62 | 2.39 / 2.14 | 0.82 / 0.81 | 2.33 / 2.23 |
| Higgs Audio Tokenizer | 1000 | 25 | 4 | 0.77 / 0.68 | 0.83 / 0.82 | 3.03 / 2.61 | 2.48 / 2.14 | 0.83 / 0.80 | 2.20 / 2.05 |
| SpeechTokenizer | 1000 | 50 | 2 | 0.36 / 0.25 | 0.77 / 0.68 | 1.59 / 1.38 | 1.25 / 1.17 | -- / -- | -- / -- |
| XY-Tokenizer | 1000 | 12.5 | 8 | 0.85 / 0.79 | 0.92 / 0.87 | 3.10 / 2.63 | 2.50 / 2.12 | -- / -- | -- / -- |
| BigCodec | 1040 | 80 | 1 | 0.84 / 0.69 | 0.93 / 0.88 | 3.27 / 2.55 | 2.68 / 2.06 | -- / -- | -- / -- |
| Mimi | 1100 | 12.5 | 8 | 0.74 / 0.59 | 0.91 / 0.85 | 2.80 / 2.24 | 2.25 / 1.78 | 1.24 / 1.19 | 2.62 / 2.49 |
| MOSS Audio Tokenizer (Ours) | 750 | 12.5 | 6 | 0.82 / 0.75 | 0.93 / 0.89 | 3.14 / 2.73 | 2.60 / 2.22 | 0.86 / 0.85 | 2.21 / 2.10 |
| MOSS Audio Tokenizer (Ours) | 1000 | 12.5 | 8 | 0.88 / 0.81 | 0.94 / 0.91 | 3.38 / 2.96 | 2.87 / 2.43 | 0.82 / 0.80 | 2.16 / 2.04 |
| DAC | 1500 | 75 | 2 | 0.48 / 0.41 | 0.83 / 0.79 | 1.87 / 1.67 | 1.48 / 1.37 | -- / -- | -- / -- |
| Encodec | 1500 | 75 | 2 | 0.60 / 0.45 | 0.85 / 0.81 | 1.94 / 1.80 | 1.56 / 1.48 | 1.12 / 1.04 | 2.60 / 2.42 |
| Higgs Audio Tokenizer | 2000 | 25 | 8 | 0.90 / 0.83 | 0.85 / 0.85 | 3.59 / 3.22 | 3.11 / 2.73 | 0.74 / 0.70 | 2.07 / 1.92 |
| SpeechTokenizer | 2000 | 50 | 4 | 0.66 / 0.50 | 0.88 / 0.80 | 2.38 / 1.79 | 1.92 / 1.49 | -- / -- | -- / -- |
| Qwen3 TTS Tokenizer | 2200 | 12.5 | 16 | 0.95 / 0.88 | 0.96 / 0.93 | 3.66 / 3.10 | 3.19 / 2.62 | -- / -- | -- / -- |
| MiMo Audio Tokenizer | 2250 | 25 | 12 | 0.89 / 0.83 | 0.95 / 0.92 | 3.57 / 3.25 | 3.05 / 2.71 | 0.70 / 0.68 | 2.21 / 2.10 |
| Mimi | 2475 | 12.5 | 18 | 0.89 / 0.76 | 0.94 / 0.91 | 3.49 / 2.90 | 2.97 / 2.35 | 1.10 / 1.06 | 2.45 / 2.32 |
| MOSS Audio Tokenizer (Ours) | 1500 | 12.5 | 12 | 0.92 / 0.86 | 0.95 / 0.93 | 3.64 / 3.27 | 3.20 / 2.74 | 0.77 / 0.74 | 2.08 / 1.96 |
| MOSS Audio Tokenizer (Ours) | 2000 | 12.5 | 16 | 0.95 / 0.89 | 0.96 / 0.94 | 3.78 / 3.46 | 3.41 / 2.96 | 0.73 / 0.70 | 2.03 / 1.90 |
| DAC | 3000 | 75 | 4 | 0.74 / 0.67 | 0.90 / 0.88 | 2.76 / 2.47 | 2.31 / 2.07 | 0.86 / 0.83 | 2.23 / 2.10 |
| MiMo Audio Tokenizer | 3650 | 25 | 20 | 0.91 / 0.85 | 0.95 / 0.93 | 3.73 / 3.44 | 3.25 / 2.89 | 0.66 / 0.65 | 2.17 / 2.06 |
| SpeechTokenizer | 4000 | 50 | 8 | 0.85 / 0.69 | 0.92 / 0.85 | 3.05 / 2.20 | 2.60 / 1.87 | -- / -- | -- / -- |
| Mimi | 4400 | 12.5 | 32 | 0.94 / 0.83 | 0.96 / 0.94 | 3.80 / 3.31 | 3.43 / 2.78 | 1.02 / 0.98 | 2.34 / 2.21 |
| Encodec | 4500 | 75 | 6 | 0.86 / 0.75 | 0.92 / 0.91 | 2.91 / 2.63 | 2.46 / 2.15 | 0.91 / 0.84 | 2.33 / 2.17 |
| DAC | 6000 | 75 | 8 | 0.89 / 0.84 | 0.95 / 0.94 | 3.75 / 3.57 | 3.41 / 3.20 | 0.65 / 0.63 | 1.97 / 1.87 |
| MOSS Audio Tokenizer (Ours) | 3000 | 12.5 | 24 | 0.96 / 0.92 | 0.97 / 0.96 | 3.90 / 3.64 | 3.61 / 3.20 | 0.69 / 0.66 | 1.98 / 1.84 |
| MOSS Audio Tokenizer (Ours) | 4000 | 12.5 | 32 | 0.97 / 0.93 | 0.97 / 0.96 | 3.95 / 3.71 | 3.69 / 3.30 | 0.68 / 0.64 | 1.96 / 1.82 |

### LibriSpeech Speech Metrics (MOSS Audio Tokenizer vs. Open-source Tokenizers)

The plots below compare our MOSS Audio Tokenizer model with other open-source speech tokenizers on the LibriSpeech dataset, evaluated with SIM, STOI, PESQ-NB, and PESQ-WB (higher is better). We control the bps of the same model by adjusting the number of RVQ codebooks used during inference.

*(Plots: SIM, STOI, PESQ-NB, and PESQ-WB vs. bitrate)*

## Citation

If you use this code or these results in your paper, please cite our work as: