This review traces the evolution of end-to-end speech-to-speech models from SpeechGPT (May 2023) through Qwen3-Omni (September 2025), covering 13 models that progressively reduced latency, added full-duplex capabilities, and scaled beyond a hundred billion parameters.
Traditional voice assistants follow a cascaded pipeline: ASR (Automatic Speech Recognition) transcribes speech to text, an LLM generates a text response, and TTS (Text-to-Speech) synthesizes the output. This approach introduces latency at each stage and loses paralinguistic information—emotion, prosody, speaker identity—that doesn't survive the text bottleneck.
Figure 1: Cascaded vs end-to-end speech processing. The text bottleneck in cascaded systems loses paralinguistic information.
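To make the latency and information loss concrete, here is a minimal sketch of the cascaded design; the stage functions are hypothetical stubs, not any particular system's API.

```python
# Sketch of the cascaded pipeline described above, with stubbed stages.
# Names are illustrative placeholders. The point is that the LLM only ever
# sees text, so emotion, prosody, and speaker identity in the input audio
# cannot influence the response or the synthesized voice.

def asr(audio: bytes) -> str:
    """Speech -> text. Paralinguistic information is discarded here."""
    return "what's the weather like"          # placeholder transcript

def llm(prompt: str) -> str:
    """Text -> text response."""
    return "It looks sunny this afternoon."

def tts(text: str) -> bytes:
    """Text -> speech, with a voice chosen independently of the user's audio."""
    return text.encode()                      # placeholder waveform

def cascaded_turn(audio_in: bytes) -> bytes:
    return tts(llm(asr(audio_in)))            # three serial stages; latency accumulates
```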
SpeechGPT (May 2023)
Core Innovation: Uses HuBERT to convert speech into discrete units, then treats these units as a new "language" that the LLM learns alongside text.
Three-stage training:
Modality-adaptation pre-training: LLaMA-7B on LibriLight speech units
Cross-modal instruction fine-tuning on SpeechInstruct dataset
Chain-of-modality instruction fine-tuning with LoRA
This work established the foundational paradigm: convert speech to discrete tokens, process with LLM backbone, generate speech tokens that decode back to audio.
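A minimal sketch of that paradigm: extend an LLM's vocabulary with one token per discrete speech unit so speech and text share a single autoregressive backbone. GPT-2 and K=1000 units are stand-ins here for illustration (SpeechGPT itself builds on LLaMA-7B), and the `<unit_i>` token names are assumptions.

```python
# Speech units as a "new language": add one vocabulary entry per HuBERT
# k-means cluster and resize the embedding table accordingly.
from transformers import AutoModelForCausalLM, AutoTokenizer

K = 1000  # number of discrete speech units (k-means clusters over HuBERT features)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

unit_tokens = [f"<unit_{i}>" for i in range(K)]   # illustrative token names
tokenizer.add_tokens(unit_tokens)
model.resize_token_embeddings(len(tokenizer))

# A discretized utterance is now just another token sequence the LLM can model:
units = [17, 942, 308, 308, 5]                    # placeholder HuBERT + k-means output
ids = tokenizer.convert_tokens_to_ids([f"<unit_{u}>" for u in units])
```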
Mini-Omni (August 2024)
Core Innovation: Text-Instruct Delay Parallel Decoding. A single forward pass generates 8 tokens per step: 1 text token plus 7 SNAC audio-codec tokens, so text and speech are produced simultaneously.
Architecture:
SNAC codec: 7 complementary token layers with one-step delay
"Any Model Can Talk": Preserves base LLM reasoning while adding speech
Batch approach: Two parallel samples for text-to-audio capability transfer
Figure 2: Mini-Omni generates text and 7 SNAC audio layers in parallel at each timestep.
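The delay layout itself is easy to sketch. The helper below is illustrative only (shapes, padding value, and toy codes are assumptions): shifting each codec layer by one extra step lets all layers be predicted in parallel at every decoding step.

```python
# Delay pattern for parallel codec decoding: layer l is shifted right by l
# steps, so each decoding step emits one token per audio layer (plus the text
# token) in a single forward pass. Mini-Omni applies this to 7 SNAC layers.
import numpy as np

def apply_delay(codes: np.ndarray, pad: int = -1) -> np.ndarray:
    """codes: (n_layers, T) codec tokens -> (n_layers, T + n_layers - 1) delayed layout."""
    n_layers, T = codes.shape
    out = np.full((n_layers, T + n_layers - 1), pad, dtype=codes.dtype)
    for l in range(n_layers):
        out[l, l:l + T] = codes[l]   # layer l starts l steps later
    return out

codes = np.arange(7 * 5).reshape(7, 5)   # 7 layers x 5 frames (toy values)
delayed = apply_delay(codes)
print(delayed.shape)                     # (7, 11): column t holds layer l's frame t - l
```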
Moshi (September 2024)
Figure 4: Moshi separates temporal dependencies (7B Temporal Transformer) from depth/codebook dependencies (small Depth Transformer). Inner Monologue predicts text before audio at each timestep.
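A structural sketch of that split, using small recurrent stand-ins (a GRU in place of each Transformer) and toy dimensions, purely to show the two nested loops: one temporal step per frame, then a short autoregressive pass over that frame's codebooks.

```python
# Simplified sketch of a temporal/depth split: a large temporal model runs
# once per frame, and a small depth model predicts that frame's Q codebooks
# from the resulting hidden state. Modules and sizes are toy stand-ins, not
# Moshi's configuration.
import torch, torch.nn as nn

Q, VOCAB, D = 8, 2048, 256   # codebooks per frame, codebook size, hidden dim (toy)

temporal = nn.GRU(D, D, batch_first=True)   # stand-in for the large temporal Transformer
depth = nn.GRU(D, D, batch_first=True)      # stand-in for the small Depth Transformer
embed = nn.Embedding(VOCAB, D)
head = nn.Linear(D, VOCAB)

def decode_frame(history_emb: torch.Tensor) -> list[int]:
    """history_emb: (1, t, D) embeddings of past frames -> Q codebook ids for the next frame."""
    _, h = temporal(history_emb)             # one temporal step per frame
    codes, inp, state = [], h.transpose(0, 1), None
    for _ in range(Q):                       # depth-wise AR loop over codebooks
        out, state = depth(inp, state)
        code = head(out[:, -1]).argmax(-1)   # greedy pick for the sketch
        codes.append(int(code))
        inp = embed(code).unsqueeze(1)       # feed the predicted code back in
    return codes

print(decode_frame(torch.randn(1, 4, D)))    # e.g. 8 codebook indices for one frame
```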
Qwen2.5-Omni (March 2025)
Core Innovation: Thinker-Talker architecture. The Talker directly reads the Thinker's hidden states (not its text output), preserving information that would be lost in text serialization.
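A sketch of that coupling, with illustrative stand-in modules and dimensions: the speech generator conditions on the reasoning model's hidden states rather than on its decoded text.

```python
# Thinker-Talker coupling sketch: the "talker" consumes projected hidden
# states, so information that would not survive text serialization can still
# reach speech generation. Modules and dims are toy stand-ins.
import torch, torch.nn as nn

D_THINKER, D_TALKER, N_CODES = 1024, 512, 2048   # toy sizes

proj = nn.Linear(D_THINKER, D_TALKER)            # bridge thinker states into the talker
talker = nn.TransformerEncoderLayer(d_model=D_TALKER, nhead=8, batch_first=True)
codec_head = nn.Linear(D_TALKER, N_CODES)

def talk(thinker_hidden: torch.Tensor) -> torch.Tensor:
    """thinker_hidden: (B, T, D_THINKER) -> (B, T, N_CODES) speech-token logits."""
    x = proj(thinker_hidden)   # hidden states, not decoded text
    x = talker(x)
    return codec_head(x)

logits = talk(torch.randn(2, 16, D_THINKER))
print(logits.shape)            # torch.Size([2, 16, 2048])
```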
Step-Audio 2 (July 2025)
Core Innovation: RL-based reasoning + RAG. PPO for sequence quality and GRPO for audio realism; retrieval-augmented generation with web search, plus audio search for timbre switching.
Scale: Trained on 8 million hours of speech data.
Performance: 3.18% WER (LibriSpeech), 76.55% paralinguistic accuracy
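For intuition, the group-relative advantage at the heart of GRPO can be written in a few lines: sample a group of responses per prompt and normalize each reward against the group statistics. The reward values below are made up for illustration.

```python
# Group-relative advantage as used in GRPO-style training: rewards for a
# group of sampled responses are standardized within the group.
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """rewards: (group_size,) scalar rewards for one prompt's sampled responses."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

rewards = np.array([0.2, 0.9, 0.4, 0.7])   # e.g. realism scores for 4 sampled audio replies
print(grpo_advantages(rewards))            # positive -> reinforced, negative -> discouraged
```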
Qwen3-Omni (September 2025)
Core Innovation: 32-codebook MTP + causal ConvNet. Multi-token prediction covers all codebook layers, and a lightweight causal ConvNet replaces the block-wise DiT so streaming can begin from the first codec frame.
MoE Thinker-Talker: 30B total, 3B active
AuT encoder: Trained from scratch on 20M hours
Code2Wav: Causal ConvNet replacing block-wise DiT
Results: Open-source SOTA on 32/36 benchmarks. 234ms first-packet latency.
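The causality trick that enables streaming from the first codec frame is simple to sketch: pad convolutions only on the left, so output at time t never depends on future frames. Channel sizes below are toy values, not Code2Wav's configuration.

```python
# Causal 1D convolution: left-pad by (kernel - 1) * dilation so frame t only
# sees frames <= t, allowing the vocoder to run as codec frames arrive.
import torch, torch.nn as nn

class CausalConv1d(nn.Module):
    def __init__(self, c_in: int, c_out: int, kernel: int, dilation: int = 1):
        super().__init__()
        self.pad = (kernel - 1) * dilation        # pad the past only
        self.conv = nn.Conv1d(c_in, c_out, kernel, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = nn.functional.pad(x, (self.pad, 0))   # (B, C, T): no look-ahead
        return self.conv(x)

layer = CausalConv1d(64, 64, kernel=3)
frames = torch.randn(1, 64, 10)                   # 10 codec frames received so far
print(layer(frames).shape)                        # torch.Size([1, 64, 10])
```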
3. Key Architectures Explained
Audio Tokenization Approaches
| Model | Codec | Frame Rate | Key Property |
| --- | --- | --- | --- |
| SpeechGPT | HuBERT discrete units | — | Semantic-focused |
| Mini-Omni/2 | SNAC | — | 7 hierarchical layers with one-step delay |
| Moshi | Mimi | 12.5 Hz | Semantic distilled into the first codebook |
| Baichuan-Omni-1.5 | 8-layer RVQ | 12.5 Hz | Whisper encoder; captures semantic and acoustic properties |
| Step-Audio | Dual-codebook | 16.7 Hz + 25 Hz | Separate semantic/acoustic codebooks |
| Qwen3-Omni | 32-codebook MTP | — | Multi-token prediction across all codebooks |
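One practical consequence of these numbers: the audio token rate the backbone must model is the frame rate times the codebook count. A quick back-of-the-envelope using table values (Mimi's codebook count of 8 is an assumption taken from the Moshi paper, not listed above):

```python
# Audio token throughput = frame_rate (Hz) x number of codebooks per frame.
codecs = {
    "Baichuan-Omni-1.5 (8-layer RVQ @ 12.5 Hz)": (12.5, 8),
    "Moshi (Mimi @ 12.5 Hz, 8 codebooks assumed)": (12.5, 8),
}
for name, (frame_rate_hz, n_codebooks) in codecs.items():
    print(f"{name}: {frame_rate_hz * n_codebooks:.0f} audio tokens per second")
```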
Decoder Types and Latency
| Model | Latency | Decoder Type | Technique |
| --- | --- | --- | --- |
| LLaMA-Omni | 226 ms | Non-autoregressive CTC | Parallel discrete-unit prediction |
| Moshi | 160-200 ms | Depth + Temporal Transformers | Parallel stream modeling |
| MinMo (speech-to-text) | ~100 ms | AR streaming | Token interleaving (5:15) |
| LLaMA-Omni2 | ~600 ms | AR + CosyVoice 2 | Quality over latency |
| Qwen3-Omni | 234 ms | MTP + causal ConvNet | Streaming from first codec frame |
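The 5:15 interleaving listed for MinMo can be sketched as a simple scheduling function that lets speech synthesis start while text is still being generated; the exact packing details belong to the paper, so treat this as an illustrative approximation with toy token values.

```python
# Interleave n_text text tokens followed by n_speech speech tokens, repeated,
# so the speech decoder receives material early instead of waiting for the
# full text response.
def interleave(text_tokens: list[int], speech_tokens: list[int],
               n_text: int = 5, n_speech: int = 15) -> list[tuple[str, int]]:
    out, t, s = [], 0, 0
    while t < len(text_tokens) or s < len(speech_tokens):
        out += [("text", x) for x in text_tokens[t:t + n_text]]
        out += [("speech", x) for x in speech_tokens[s:s + n_speech]]
        t += n_text
        s += n_speech
    return out

stream = interleave(list(range(12)), list(range(100, 140)))
print(stream[:8])   # 5 text tokens, then the first speech tokens
```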
4. Comparison Tables
Model Overview
| Model | Date | Params | Training Data | Key Innovation |
| --- | --- | --- | --- | --- |
| SpeechGPT | May 2023 | 7B | SpeechInstruct | HuBERT + chain-of-modality |
| Mini-Omni | Aug 2024 | — | VoiceAssistant-400K | Parallel 8 tokens/step |
| LLaMA-Omni | Sep 2024 | 8B | 200K pairs | NAR CTC decoder |
| Moshi | Sep 2024 | 7B | 2.1T tokens | Inner Monologue + full-duplex |
| Baichuan-Omni | Oct 2024 | 7B | — | First omni-modal 7B |
| Mini-Omni2 | Oct 2024 | 0.5B | Limited | Command-based interruption |
| MinMo | Jan 2025 | 8B | 1.4M hours | Token interleaving (5:15) |
| Baichuan-Omni-1.5 | Jan 2025 | — | 500B tokens | 8-layer RVQ at 12.5 Hz |
| Step-Audio | Feb 2025 | 130B | — | Dual-codebook (2:3 interleave) |
| Qwen2.5-Omni | Mar 2025 | — | — | Thinker-Talker (hidden states) |
| LLaMA-Omni2 | May 2025 | 0.5-32B | 200K dialogues | AR decoder (quality focus) |
| Step-Audio 2 | Jul 2025 | — | 8M hours | RL (PPO/GRPO) + RAG |
| Qwen3-Omni | Sep 2025 | 30B (3B active) | 20M hours | 32-codebook MTP + ConvNet |
The Arc of Innovation
Figure 6: Evolution of focus areas in speech-to-speech research.
5. References
Zhang et al. "SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities." arXiv:2305.11000, May 2023.
Xie & Wu. "Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming." arXiv:2408.16725, Aug 2024.
Fang et al. "LLaMA-Omni: Seamless Speech Interaction with Large Language Models." arXiv:2409.06666, Sep 2024.
Défossez et al. "Moshi: A Speech-Text Foundation Model for Real-Time Dialogue." arXiv:2410.00037, Oct 2024.
Li et al. "Baichuan-Omni Technical Report." arXiv:2410.08565, Oct 2024.
Xie & Wu. "Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities." arXiv:2410.11190, Oct 2024.
Chen et al. "MinMo: A Multimodal Large Language Model for Seamless Voice Interaction." arXiv:2501.06282, Jan 2025.
Li et al. "Baichuan-Omni-1.5 Technical Report." arXiv:2501.15368, Jan 2025.
Huang et al. "Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction." arXiv:2502.11946, Feb 2025.
Xu et al. "Qwen2.5-Omni Technical Report." arXiv:2503.20215, Mar 2025.
Fang et al. "LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis." arXiv:2505.02625, May 2025. ACL 2025.
Wu et al. "Step-Audio 2 Technical Report." arXiv:2507.16632, Jul 2025.
Xu et al. "Qwen3-Omni Technical Report." arXiv:2509.17765, Sep 2025.
Ji et al. "WavChat: A Survey of Spoken Dialogue Models." arXiv:2411.13577, Nov 2024. (Survey reference)