Speech-to-Speech Models

A Literature Review

Siddharth Choudhary
January 2026
This review traces the evolution of end-to-end speech-to-speech models from SpeechGPT (May 2023) through Qwen3-Omni (September 2025), covering 13 models that progressively reduced latency, added full-duplex capabilities, and scaled to hundreds of billions of parameters.
Contents

  1. Overview: Cascaded vs End-to-End
  2. Chronological Timeline
  3. Key Architectures Explained
  4. Comparison Tables
  5. References

1. Overview: Cascaded vs End-to-End

Traditional voice assistants follow a cascaded pipeline: ASR (Automatic Speech Recognition) transcribes speech to text, an LLM generates a text response, and TTS (Text-to-Speech) synthesizes the output. This approach introduces latency at each stage and loses paralinguistic information—emotion, prosody, speaker identity—that doesn't survive the text bottleneck.

[Figure: Speech In → ASR → text → LLM → text → TTS → Speech Out (high latency, loses prosody) vs. Speech In → Speech LLM (unified) → Speech Out (low latency, preserves prosody)]
Figure 1: Cascaded vs end-to-end speech processing. The text bottleneck in cascaded systems loses paralinguistic information.
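
To make the interface difference concrete, here is a minimal Python sketch of the two pipelines. All functions (asr, llm, tts, speech_llm) are hypothetical stubs rather than any particular system; the point is where information is forced through text.

```python
# Hypothetical stand-ins for the three cascaded stages and for a unified speech LLM;
# a real system would call actual models here.
def asr(speech):        return "what is the weather"       # speech -> text (prosody lost)
def llm(prompt):        return "It looks sunny today."     # text -> text
def tts(text):          return f"<audio: {text}>"          # text -> speech
def speech_llm(speech): return "<audio: spoken reply>"     # speech -> speech, one model

def cascaded_pipeline(speech_in):
    # Each hop adds latency, and only text crosses the ASR boundary:
    # emotion, prosody, and speaker identity never reach the LLM.
    text = asr(speech_in)
    reply_text = llm(text)
    return tts(reply_text)

def end_to_end_pipeline(speech_in):
    # A single model consumes and produces speech (as tokens or features),
    # so paralinguistic information can flow all the way through.
    return speech_llm(speech_in)

print(cascaded_pipeline("<audio: user question>"))
print(end_to_end_pipeline("<audio: user question>"))
```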

2. Chronological Timeline

2023: Foundation

1. SpeechGPT (May 2023)

Zhang et al. · arXiv:2305.11000
Core Innovation: Uses HuBERT to quantize speech into discrete units, then treats these units as a new "language" that the LLM learns alongside text.

Three-stage training:

  1. Modality-adaptation pre-training: LLaMA-7B on LibriLight speech units
  2. Cross-modal instruction fine-tuning on SpeechInstruct dataset
  3. Chain-of-modality instruction fine-tuning with LoRA

This work established the foundational paradigm: convert speech to discrete tokens, process with LLM backbone, generate speech tokens that decode back to audio.
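
A minimal sketch of that paradigm follows, using hypothetical stand-ins for the HuBERT quantizer and the unit vocoder; the token format shown is illustrative, not SpeechGPT's exact vocabulary.

```python
# Hypothetical stand-ins for HuBERT k-means quantization and a unit vocoder;
# unit IDs and special tokens below are made up for illustration.
def hubert_units(audio):
    """Pretend HuBERT encoder + k-means clustering: returns discrete unit IDs."""
    return [312, 97, 97, 512, 4]

def unit_vocoder(units):
    """Pretend unit-to-waveform vocoder."""
    return f"<waveform for units {units}>"

def to_speech_tokens(units):
    # Units become ordinary vocabulary items the LLM can read and write.
    return ["<sosp>"] + [f"<unit_{u}>" for u in units] + ["<eosp>"]

prompt_tokens = to_speech_tokens(hubert_units("<user audio>"))
print(prompt_tokens)
# An LLM fine-tuned on this extended vocabulary would then autoregressively
# generate response unit tokens, which the vocoder turns back into audio:
response_units = [44, 18, 903]          # stand-in for LLM output
print(unit_vocoder(response_units))
```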

2024: The Cambrian Explosion

2. Mini-Omni (August 2024)

Xie & Wu · arXiv:2408.16725
Core Innovation: Text-Instruct Delay Parallel Decoding. In a single forward pass, the model generates 8 tokens: 1 text token + 7 SNAC audio codec layers, so text and speech are produced simultaneously.
Architecture:
  • SNAC codec: 7 complementary token layers with one-step delay
  • "Any Model Can Talk": Preserves base LLM reasoning while adding speech
  • Batch approach: Two parallel samples for text-to-audio capability transfer
[Figure: one LLM hidden state feeds a text head plus seven SNAC audio heads, yielding 8 tokens per step ("Hello" + SNAC layers 1-7)]
Figure 2: Mini-Omni generates text and 7 SNAC audio layers in parallel at each timestep.
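
The sketch below lays out what such a delayed parallel schedule could look like, assuming a one-step delay per SNAC layer as the bullet above suggests; token names and the step count are placeholders, not Mini-Omni's actual IDs.

```python
# Placeholder tokens only; assumes SNAC layer k is shifted k steps relative to
# the text stream, so each forward pass emits 1 text token plus 7 audio tokens
# that refer to earlier positions.
PAD = "_"
N_AUDIO_LAYERS = 7
N_STEPS = 10

def delayed_layout(n_steps):
    rows = {"text": [f"t{i}" for i in range(n_steps)]}
    for layer in range(1, N_AUDIO_LAYERS + 1):
        # SNAC layer k starts k steps after the text stream.
        rows[f"snac{layer}"] = [PAD] * layer + [f"a{layer}.{i}" for i in range(n_steps - layer)]
    return rows

for name, row in delayed_layout(N_STEPS).items():
    print(f"{name:>6}: {' '.join(row)}")
```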

3. LLaMA-Omni (September 2024)

Fang et al. · arXiv:2409.06666
Core Innovation: Non-autoregressive CTC speech decoder. Uses Connectionist Temporal Classification to predict the entire discrete unit sequence in parallel, eliminating sequential decoding latency.
Architecture:
  • Speech encoder + adaptor + Llama-3.1-8B + NAR Transformer decoder (425M params)
  • Decoder: 2 Transformer layers, upsample factor λ=25
  • CTC learns alignment without pre-aligned training data

Key metrics: 226ms latency, trained in <3 days on 4 GPUs

[Figure: autoregressive decoding emits units u₁, u₂, … sequentially (O(N) steps), while CTC predicts all units in parallel (O(1) steps). Collapsing example: path [C, C, ε, A, ε, T, T] → remove duplicates → [C, ε, A, ε, T] → remove blanks (ε) → "CAT"; many paths collapse to the same output, and the loss sums over all valid paths.]
Figure 3: CTC enables parallel prediction of all output tokens. The blank token (ε) allows flexible alignment.
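
The collapsing rule from Figure 3 is simple enough to state as runnable code. This is the generic CTC collapse (merge repeats, then drop blanks), not LLaMA-Omni's implementation; the training loss sums probability over every path that collapses to the target sequence.

```python
BLANK = "ε"

def ctc_collapse(path):
    """Merge repeated symbols, then drop blanks (the two rules from Figure 3)."""
    merged = [s for i, s in enumerate(path) if i == 0 or s != path[i - 1]]
    return [s for s in merged if s != BLANK]

print(ctc_collapse(["C", "C", BLANK, "A", BLANK, "T", "T"]))  # ['C', 'A', 'T']
print(ctc_collapse(["C", BLANK, "A", "A", "T", BLANK]))       # ['C', 'A', 'T']  (another valid path)
```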

4. Moshi (September/October 2024)

Défossez et al. (Kyutai) · arXiv:2410.00037
Core Innovation: Inner Monologue + Depth-Temporal Factorization. Predicts time-aligned text tokens as prefix to audio tokens. Separates inter-codebook (Depth Transformer) from inter-timestep (Temporal Transformer) dependencies.
Architecture:
  • Helium: 7B text LLM backbone (2.1T tokens)
  • Mimi codec: 12.5Hz, 1.1kbps, 80ms latency
  • Three streams: User audio, Moshi audio, inner monologue text

Latency: 160ms theoretical, 200ms practical, making Moshi the first real-time full-duplex spoken LLM.

[Figure: the 7B Temporal Transformer models dependencies across timesteps (h₁, h₂, h₃); at each step a small Depth Transformer emits the inner-monologue text token ("Hel", "lo") followed by the audio codec tokens c₀…cₖ]
Figure 4: Moshi separates temporal dependencies (7B Transformer) from depth/codebook dependencies (small Depth Transformer). Inner Monologue predicts text before audio at each timestep.
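
A structural sketch of this factorization follows, with hypothetical stand-ins for both transformers and an illustrative codebook count; it shows only the control flow (one temporal step, then text, then codec tokens), not Moshi's real modules.

```python
import random

K = 8  # illustrative codebook count per frame, not necessarily Mimi's exact figure

def temporal_transformer(history):
    """Stand-in for the 7B backbone: one context vector per frame."""
    return f"h_{len(history)}"

def depth_transformer(context):
    """Stand-in for the small depth model: text token first (inner monologue),
    then the K codec tokens for the same frame."""
    text_token = f"txt({context})"
    codec_tokens = [random.randint(0, 2047) for _ in range(K)]
    return text_token, codec_tokens

history = []
for step in range(3):
    h = temporal_transformer(history)      # dependency across timesteps
    text, codecs = depth_transformer(h)    # dependency across codebooks
    history.append((text, codecs))
    print(step, text, codecs)
```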

5. Baichuan-Omni (October 2024)

Li et al. · arXiv:2410.08565
Core Innovation: First open-source 7B model to process four modalities simultaneously: image, video, audio, and text.

Two-stage training: multimodal alignment → multitask fine-tuning across all modalities.

6. Mini-Omni2 (October 2024)

Xie & Wu · arXiv:2410.11190
Core Innovation: Command-based interruption. Recognizes vocal commands like "Stop Omni" during generation. Uses a batch-based dual-model approach that halves memory usage.
  • Visual encoder: CLIP ViT-B/32 → 50 feature tokens
  • Audio encoder: Whisper-small
  • LLM backbone: Qwen2-0.5B
2025: Industrial Scale

7. MinMo (January 2025)

Chen et al. · arXiv:2501.06282
Core Innovation: Token interleaving with an AR streaming decoder. Uses a fixed ratio of 5 semantic tokens followed by 15 speech tokens (see the sketch at the end of this entry).

Scale: ~8B parameters trained on 1.4 million hours of speech data.

  • Voice encoder: SenseVoice-large
  • LLM: Qwen2.5-7B-Instruct
  • Voice decoder: AR streaming Transformer + CosyVoice 2

Four-stage curriculum: STT → TTS → S2S → duplex alignment

Latency: ~100ms STT, ~600ms full-duplex
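
A minimal sketch of the 5:15 interleaving schedule, using placeholder tokens; the `interleave` helper is hypothetical and only illustrates how the two streams would be merged into a single decoding sequence.

```python
def interleave(semantic, speech, n_sem=5, n_speech=15):
    """Merge two token streams into one decoding sequence at a fixed 5:15 ratio."""
    out, s, a = [], 0, 0
    while s < len(semantic) or a < len(speech):
        out += semantic[s:s + n_sem];  s += n_sem
        out += speech[a:a + n_speech]; a += n_speech
    return out

semantic = [f"s{i}" for i in range(10)]   # placeholder semantic tokens
speech   = [f"a{i}" for i in range(30)]   # placeholder speech tokens
print(interleave(semantic, speech))
```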

8. Baichuan-Omni-1.5 (January 2025)

Li et al. · arXiv:2501.15368
Core Innovation: 8-layer RVQ audio tokenizer at 12.5Hz. Uses Whisper Large encoder → 8-layer Residual Vector Quantizer preserving both semantic and acoustic properties.

Training data: 500B tokens (text, audio, vision)
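
A toy residual vector quantizer illustrating the idea behind the 8-layer RVQ: each layer quantizes the residual left by the previous one. Codebooks, dimensions, and values here are invented for illustration and bear no relation to the real tokenizer.

```python
import random

random.seed(0)
DIM, LAYERS, CODES = 4, 8, 16   # toy sizes, not the real tokenizer's
codebooks = [[[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(CODES)]
             for _ in range(LAYERS)]

def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((codebook[i][d] - vec[d]) ** 2 for d in range(DIM)))

def rvq_encode(vec):
    residual, indices = list(vec), []
    for cb in codebooks:
        idx = nearest(cb, residual)
        indices.append(idx)
        # Each layer only has to encode what the previous layers missed.
        residual = [residual[d] - cb[idx][d] for d in range(DIM)]
    return indices

print(rvq_encode([0.3, -0.7, 0.1, 0.9]))   # 8 indices per frame at 12.5 Hz
```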

9. Step-Audio (February 2025)

Huang et al. · arXiv:2502.11946
Core Innovation: Dual-codebook tokenization with 2:3 temporal interleaving. Semantic (16.7Hz, 1024 codebook) + acoustic (25Hz, 4096 codebook) tokenizers.

Scale: 130B parameters—first production-ready open-source solution at this scale.

  • Foundation: Step-1 130B with audio-contextualized pretraining
  • Speech decoder: Hybrid flow matching + neural vocoding
  • Capabilities: Dialect, emotion, singing, RAP, tool calling
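
A sketch of the 2:3 temporal interleaving described above, assuming the merge simply alternates 2 semantic tokens (16.7 Hz) with 3 acoustic tokens (25 Hz) so that both groups cover the same duration (2/16.7 s ≈ 3/25 s ≈ 0.12 s); tokens are placeholders.

```python
def interleave_2_3(semantic, acoustic):
    """Merge the 16.7 Hz semantic and 25 Hz acoustic streams as 2:3 groups."""
    out = []
    for i in range(len(semantic) // 2):
        out += semantic[2 * i:2 * i + 2]   # 2 semantic tokens ≈ 0.12 s at 16.7 Hz
        out += acoustic[3 * i:3 * i + 3]   # 3 acoustic tokens = 0.12 s at 25 Hz
    return out

semantic = [f"s{i}" for i in range(6)]     # placeholder 16.7 Hz tokens
acoustic = [f"a{i}" for i in range(9)]     # placeholder 25 Hz tokens
print(interleave_2_3(semantic, acoustic))
```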

10. Qwen2.5-Omni (March 2025)

Xu et al. · arXiv:2503.20215
Core Innovation: Thinker-Talker architecture. Talker directly reads Thinker's hidden states (not text output), preserving information that would be lost in text serialization.
  • Thinker: Transformer decoder + Whisper-large-v3 encoder
  • Talker: Dual-track autoregressive Transformer
  • TMRoPE: 40ms = 1 temporal unit for audio-video alignment
  • Sliding-window DiT: Reduces first-packet delay
[Figure: the Thinker (standard LLM with a text head producing "Hello, how...") passes its hidden states to the Talker (dual-track AR Transformer), which emits audio tokens; the Talker sees hidden states, not text output]
Figure 5: Qwen's Thinker-Talker separation. The Talker receives high-dimensional hidden states rather than discrete text tokens.
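
A structural sketch of the Thinker-Talker interface with hypothetical stand-in modules; the point is only that the Talker consumes continuous hidden states rather than the sampled text.

```python
def thinker(input_tokens):
    """Stand-in for the Thinker: returns hidden states and the text-head output."""
    hidden_states = [[0.1 * i, -0.2 * i] for i in range(len(input_tokens))]  # toy vectors
    text_tokens = ["Hello", ",", "how", "..."]
    return hidden_states, text_tokens

def talker(hidden_states):
    """Stand-in for the Talker: conditioned on continuous hidden states,
    not on the discrete text tokens a cascaded TTS would receive."""
    return [f"audio_token_{i}" for i in range(len(hidden_states))]

hidden, text = thinker(["user", "speech", "tokens"])
print(text)            # what a text-only interface would expose
print(talker(hidden))  # what the Talker actually consumes and produces
```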

11. LLaMA-Omni2 (May 2025)

Fang et al. · arXiv:2505.02625 · ACL 2025
Core Innovation: Autoregressive streaming decoder (vs LLaMA-Omni's CTC). Switches to AR + CosyVoice 2 flow matching, trading latency for better audio quality.

Key insight: 200K well-curated dialogues outperform millions of hours of noisy data.

Latency: ~600ms (higher than v1's 226ms, but better UTMOS quality scores)

12. Step-Audio 2 (July 2025)

Wu et al. · arXiv:2507.16632
Core Innovation: RL-based reasoning + RAG. PPO for sequence quality, GRPO for audio realism. RAG with web search and audio search for timbre switching.

Scale: Trained on 8 million hours of speech data.

Performance: 3.18% WER (LibriSpeech), 76.55% paralinguistic accuracy

13. Qwen3-Omni (September 2025)

Xu et al. · arXiv:2509.17765
Core Innovation: 32-codebook MTP + causal ConvNet. Multi-token prediction for all codebook layers. Lightweight ConvNet replaces DiT for streaming from first codec frame.
  • MoE Thinker-Talker: 30B total, 3B active
  • AuT encoder: Trained from scratch on 20M hours
  • Code2Wav: Causal ConvNet replacing block-wise DiT

Results: Open-source SOTA on 32/36 benchmarks. 234ms first-packet latency.
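
A sketch of how multi-token prediction over all codebooks enables streaming from the first codec frame; `mtp_step` and `causal_convnet_decode` are hypothetical stand-ins, not the released implementation.

```python
import random

N_CODEBOOKS = 32

def mtp_step(context):
    """Stand-in for one Talker step: one token for every codebook at once."""
    return [random.randint(0, 1023) for _ in range(N_CODEBOOKS)]

def causal_convnet_decode(frame):
    """Stand-in for Code2Wav: consumes one complete codec frame with no
    look-ahead, so synthesis can begin at the very first frame."""
    return f"<audio chunk for frame starting {frame[:3]}...>"

context, audio_chunks = [], []
for step in range(3):
    frame = mtp_step(context)              # all 32 codebooks in a single step
    audio_chunks.append(causal_convnet_decode(frame))
    context.append(frame)
print(audio_chunks[0])                     # streaming output available after step 0
```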

3. Key Architectures Explained

Audio Tokenization Approaches

Model | Codec | Frame Rate | Key Property
SpeechGPT | HuBERT discrete units | – | Semantic-focused
Mini-Omni/2 | SNAC (7 layers) | – | Hierarchical with one-step delay
Moshi | Mimi | 12.5 Hz | Semantic in early acoustic layers
Baichuan-Omni-1.5 | 8-layer RVQ | 12.5 Hz | Whisper encoder + dual properties
Step-Audio | Dual-codebook | 16.7 Hz + 25 Hz | Separate semantic/acoustic
Qwen3-Omni | 32-codebook | – | Multi-token prediction (MTP)

Decoder Types and Latency

Model | Latency | Decoder Type | Technique
LLaMA-Omni | 226 ms | Non-autoregressive (CTC) | Parallel discrete-unit prediction
Moshi | 160-200 ms | Depth + Temporal | Parallel stream modeling
MinMo | ~100 ms (STT) | AR streaming | Token interleaving (5:15)
LLaMA-Omni2 | ~600 ms | AR + CosyVoice 2 | Quality over latency
Qwen3-Omni | 234 ms | MTP + ConvNet | Streaming from first frame

4. Comparison Tables

Model Overview

Model | Date | Params | Training Data | Key Innovation
SpeechGPT | May 2023 | 7B | SpeechInstruct | HuBERT + chain-of-modality
Mini-Omni | Aug 2024 | – | VoiceAssistant-400K | Parallel 8 tokens/step
LLaMA-Omni | Sep 2024 | 8B | 200K pairs | NAR CTC decoder
Moshi | Sep 2024 | 7B | 2.1T tokens | Inner Monologue + full-duplex
Baichuan-Omni | Oct 2024 | 7B | – | First omni-modal 7B
Mini-Omni2 | Oct 2024 | 0.5B | Limited | Command-based interruption
MinMo | Jan 2025 | 8B | 1.4M hours | Token interleaving (5:15)
Baichuan-Omni-1.5 | Jan 2025 | – | 500B tokens | 8-layer RVQ at 12.5 Hz
Step-Audio | Feb 2025 | 130B | – | Dual-codebook (2:3 interleave)
Qwen2.5-Omni | Mar 2025 | – | – | Thinker-Talker (hidden states)
LLaMA-Omni2 | May 2025 | 0.5-32B | 200K dialogues | AR decoder (quality focus)
Step-Audio 2 | Jul 2025 | – | 8M hours | RL (PPO/GRPO) + RAG
Qwen3-Omni | Sep 2025 | 30B (3B active) | 20M hours | 32-codebook MTP + ConvNet

The Arc of Innovation

  • 2023: Feasibility ("Can we do this?")
  • Early 2024: Latency ("How fast?")
  • Late 2024: Full-duplex ("How naturally?")
  • 2025: Scale + quality ("How well?")
Figure 6: Evolution of focus areas in speech-to-speech research.

5. References

  1. Zhang et al. "SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities." arXiv:2305.11000, May 2023.
  2. Xie & Wu. "Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming." arXiv:2408.16725, Aug 2024.
  3. Fang et al. "LLaMA-Omni: Seamless Speech Interaction with Large Language Models." arXiv:2409.06666, Sep 2024.
  4. Défossez et al. "Moshi: A Speech-Text Foundation Model for Real-Time Dialogue." arXiv:2410.00037, Oct 2024.
  5. Li et al. "Baichuan-Omni Technical Report." arXiv:2410.08565, Oct 2024.
  6. Xie & Wu. "Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities." arXiv:2410.11190, Oct 2024.
  7. Chen et al. "MinMo: A Multimodal Large Language Model for Seamless Voice Interaction." arXiv:2501.06282, Jan 2025.
  8. Li et al. "Baichuan-Omni-1.5 Technical Report." arXiv:2501.15368, Jan 2025.
  9. Huang et al. "Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction." arXiv:2502.11946, Feb 2025.
  10. Xu et al. "Qwen2.5-Omni Technical Report." arXiv:2503.20215, Mar 2025.
  11. Fang et al. "LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis." arXiv:2505.02625, May 2025. ACL 2025.
  12. Wu et al. "Step-Audio 2 Technical Report." arXiv:2507.16632, Jul 2025.
  13. Xu et al. "Qwen3-Omni Technical Report." arXiv:2509.17765, Sep 2025.
  14. Ji et al. "WavChat: A Survey of Spoken Dialogue Models." arXiv:2411.13577, Nov 2024. (Survey reference)

Last updated: January 2026 · Siddharth Choudhary