Speech-to-Speech Models

A Literature Review

Siddharth Choudhary
January 2026
This review traces the evolution of end-to-end speech-to-speech models from SpeechGPT (May 2023) through Qwen3-Omni (September 2025), covering 13 models that progressively reduced latency, added full-duplex capabilities, and scaled to hundreds of billions of parameters.
Contents

  1. Overview: Cascaded vs End-to-End
  2. Chronological Timeline
  3. Key Architectures Explained
  4. Comparison Tables
  5. References

1. Overview: Cascaded vs End-to-End

Traditional voice assistants follow a cascaded pipeline: ASR (Automatic Speech Recognition) transcribes speech to text, an LLM generates a text response, and TTS (Text-to-Speech) synthesizes the output. This approach introduces latency at each stage and loses paralinguistic information—emotion, prosody, speaker identity—that doesn't survive the text bottleneck.

[Figure: Speech In → ASR → text → LLM → text → TTS → Speech Out (high latency, loses prosody) vs. Speech In → Speech LLM (unified) → Speech Out (low latency, preserves prosody)]
Figure 1: Cascaded vs end-to-end speech processing. The text bottleneck in cascaded systems loses paralinguistic information.
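
To make the interface difference concrete, here is a minimal Python sketch of the two pipelines. All functions (asr, llm, tts, speech_llm) are hypothetical stubs rather than any particular system; the point is where information is forced through text.

```python
# Hypothetical stand-ins for the three cascaded stages and for a unified speech LLM;
# a real system would call actual models here.
def asr(speech):        return "what is the weather"       # speech -> text (prosody lost)
def llm(prompt):        return "It looks sunny today."     # text -> text
def tts(text):          return f"<audio: {text}>"          # text -> speech
def speech_llm(speech): return "<audio: spoken reply>"     # speech -> speech, one model

def cascaded_pipeline(speech_in):
    # Each hop adds latency, and only text crosses the ASR boundary:
    # emotion, prosody, and speaker identity never reach the LLM.
    text = asr(speech_in)
    reply_text = llm(text)
    return tts(reply_text)

def end_to_end_pipeline(speech_in):
    # A single model consumes and produces speech (as tokens or features),
    # so paralinguistic information can flow all the way through.
    return speech_llm(speech_in)

print(cascaded_pipeline("<audio: user question>"))
print(end_to_end_pipeline("<audio: user question>"))
```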

2. Chronological Timeline

2023: Foundation

1. SpeechGPT (May 2023)

Zhang et al. · arXiv:2305.11000
Core Innovation: Uses HuBERT to quantize speech into discrete units, then treats these units as a new "language" that the LLM learns alongside text.

Three-stage training:

  1. Modality-adaptation pre-training: LLaMA-7B on LibriLight speech units
  2. Cross-modal instruction fine-tuning on SpeechInstruct dataset
  3. Chain-of-modality instruction fine-tuning with LoRA

This work established the foundational paradigm: convert speech to discrete tokens, process with LLM backbone, generate speech tokens that decode back to audio.
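
A minimal sketch of that paradigm follows, using hypothetical stand-ins for the HuBERT quantizer and the unit vocoder; the token format shown is illustrative, not SpeechGPT's exact vocabulary.

```python
# Hypothetical stand-ins for HuBERT k-means quantization and a unit vocoder;
# unit IDs and special tokens below are made up for illustration.
def hubert_units(audio):
    """Pretend HuBERT encoder + k-means clustering: returns discrete unit IDs."""
    return [312, 97, 97, 512, 4]

def unit_vocoder(units):
    """Pretend unit-to-waveform vocoder."""
    return f"<waveform for units {units}>"

def to_speech_tokens(units):
    # Units become ordinary vocabulary items the LLM can read and write.
    return ["<sosp>"] + [f"<unit_{u}>" for u in units] + ["<eosp>"]

prompt_tokens = to_speech_tokens(hubert_units("<user audio>"))
print(prompt_tokens)
# An LLM fine-tuned on this extended vocabulary would then autoregressively
# generate response unit tokens, which the vocoder turns back into audio:
response_units = [44, 18, 903]          # stand-in for LLM output
print(unit_vocoder(response_units))
```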

2024: The Cambrian Explosion

2. Mini-Omni (August 2024)

Xie & Wu · arXiv:2408.16725
Core Innovation: Text-Instruct Delay Parallel Decoding. In a single forward pass, the model generates 8 tokens: 1 text token + 7 SNAC audio codec layers, so text and speech are produced simultaneously.
Architecture:
  • SNAC codec: 7 complementary token layers with one-step delay
  • "Any Model Can Talk": Preserves base LLM reasoning while adding speech
  • Batch approach: Two parallel samples for text-to-audio capability transfer
[Figure: one LLM hidden state feeds a text head plus seven SNAC audio heads, yielding 8 tokens per step ("Hello" + SNAC layers 1-7)]
Figure 2: Mini-Omni generates text and 7 SNAC audio layers in parallel at each timestep.
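
The sketch below lays out what such a delayed parallel schedule could look like, assuming a one-step delay per SNAC layer as the bullet above suggests; token names and the step count are placeholders, not Mini-Omni's actual IDs.

```python
# Placeholder tokens only; assumes SNAC layer k is shifted k steps relative to
# the text stream, so each forward pass emits 1 text token plus 7 audio tokens
# that refer to earlier positions.
PAD = "_"
N_AUDIO_LAYERS = 7
N_STEPS = 10

def delayed_layout(n_steps):
    rows = {"text": [f"t{i}" for i in range(n_steps)]}
    for layer in range(1, N_AUDIO_LAYERS + 1):
        # SNAC layer k starts k steps after the text stream.
        rows[f"snac{layer}"] = [PAD] * layer + [f"a{layer}.{i}" for i in range(n_steps - layer)]
    return rows

for name, row in delayed_layout(N_STEPS).items():
    print(f"{name:>6}: {' '.join(row)}")
```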

3. LLaMA-Omni (September 2024)

Fang et al. · arXiv:2409.06666
Core Innovation: Non-autoregressive CTC speech decoder. Uses Connectionist Temporal Classification to predict the entire discrete unit sequence in parallel, eliminating sequential decoding latency.
Architecture:
  • Speech encoder + adaptor + Llama-3.1-8B + NAR Transformer decoder (425M params)
  • Decoder: 2 Transformer layers, upsample factor λ=25
  • CTC learns alignment without pre-aligned training data

Key metrics: 226ms latency, trained in <3 days on 4 GPUs

[Figure: autoregressive decoding emits units u₁, u₂, … sequentially (O(N) steps), while CTC predicts all units in parallel (O(1) steps). Collapsing example: path [C, C, ε, A, ε, T, T] → remove duplicates → [C, ε, A, ε, T] → remove blanks (ε) → "CAT"; many paths collapse to the same output, and the loss sums over all valid paths.]
Figure 3: CTC enables parallel prediction of all output tokens. The blank token (ε) allows flexible alignment.
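
The collapsing rule from Figure 3 is simple enough to state as runnable code. This is the generic CTC collapse (merge repeats, then drop blanks), not LLaMA-Omni's implementation; the training loss sums probability over every path that collapses to the target sequence.

```python
BLANK = "ε"

def ctc_collapse(path):
    """Merge repeated symbols, then drop blanks (the two rules from Figure 3)."""
    merged = [s for i, s in enumerate(path) if i == 0 or s != path[i - 1]]
    return [s for s in merged if s != BLANK]

print(ctc_collapse(["C", "C", BLANK, "A", BLANK, "T", "T"]))  # ['C', 'A', 'T']
print(ctc_collapse(["C", BLANK, "A", "A", "T", BLANK]))       # ['C', 'A', 'T']  (another valid path)
```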

4. Moshi (September/October 2024)

Défossez et al. (Kyutai) · arXiv:2410.00037
Core Innovation: Inner Monologue + Depth-Temporal Factorization. Predicts time-aligned text tokens as prefix to audio tokens. Separates inter-codebook (Depth Transformer) from inter-timestep (Temporal Transformer) dependencies.
Architecture:
  • Helium: 7B text LLM backbone (2.1T tokens)
  • Mimi codec: 12.5Hz, 1.1kbps, 80ms latency
  • Three streams: User audio, Moshi audio, inner monologue text

Latency: 160ms theoretical, 200ms practical, making Moshi the first real-time full-duplex spoken LLM.

[Figure: the 7B Temporal Transformer models dependencies across timesteps (h₁, h₂, h₃); at each step a small Depth Transformer emits the inner-monologue text token ("Hel", "lo") followed by the audio codec tokens c₀…cₖ]
Figure 4: Moshi separates temporal dependencies (7B Transformer) from depth/codebook dependencies (small Depth Transformer). Inner Monologue predicts text before audio at each timestep.
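
A structural sketch of this factorization follows, with hypothetical stand-ins for both transformers and an illustrative codebook count; it shows only the control flow (one temporal step, then text, then codec tokens), not Moshi's real modules.

```python
import random

K = 8  # illustrative codebook count per frame, not necessarily Mimi's exact figure

def temporal_transformer(history):
    """Stand-in for the 7B backbone: one context vector per frame."""
    return f"h_{len(history)}"

def depth_transformer(context):
    """Stand-in for the small depth model: text token first (inner monologue),
    then the K codec tokens for the same frame."""
    text_token = f"txt({context})"
    codec_tokens = [random.randint(0, 2047) for _ in range(K)]
    return text_token, codec_tokens

history = []
for step in range(3):
    h = temporal_transformer(history)      # dependency across timesteps
    text, codecs = depth_transformer(h)    # dependency across codebooks
    history.append((text, codecs))
    print(step, text, codecs)
```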

5. Baichuan-Omni (October 2024)

Li et al. · arXiv:2410.08565
Core Innovation: First open-source 7B model to process four modalities simultaneously: image, video, audio, and text.

Two-stage training: multimodal alignment → multitask fine-tuning across all modalities.

6. Mini-Omni2 (October 2024)

Xie & Wu · arXiv:2410.11190
Core Innovation: Command-based interruption. Recognizes vocal commands like "Stop Omni" during generation. Uses a batch-based dual-model approach that halves memory usage.
  • Visual encoder: CLIP ViT-B/32 → 50 feature tokens
  • Audio encoder: Whisper-small
  • LLM backbone: Qwen2-0.5B
2025: Industrial Scale

7. MinMo (January 2025)

Chen et al. · arXiv:2501.06282
Core Innovation: Token interleaving with an AR streaming decoder. Uses a fixed ratio of 5 semantic tokens followed by 15 speech tokens (see the sketch at the end of this entry).

Scale: ~8B parameters trained on 1.4 million hours of speech data.

  • Voice encoder: SenseVoice-large
  • LLM: Qwen2.5-7B-Instruct
  • Voice decoder: AR streaming Transformer + CosyVoice 2

Four-stage curriculum: STT → TTS → S2S → duplex alignment

Latency: ~100ms STT, ~600ms full-duplex
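
A minimal sketch of the 5:15 interleaving schedule, using placeholder tokens; the `interleave` helper is hypothetical and only illustrates how the two streams would be merged into a single decoding sequence.

```python
def interleave(semantic, speech, n_sem=5, n_speech=15):
    """Merge two token streams into one decoding sequence at a fixed 5:15 ratio."""
    out, s, a = [], 0, 0
    while s < len(semantic) or a < len(speech):
        out += semantic[s:s + n_sem];  s += n_sem
        out += speech[a:a + n_speech]; a += n_speech
    return out

semantic = [f"s{i}" for i in range(10)]   # placeholder semantic tokens
speech   = [f"a{i}" for i in range(30)]   # placeholder speech tokens
print(interleave(semantic, speech))
```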

8. Baichuan-Omni-1.5 (January 2025)

Li et al. · arXiv:2501.15368
Core Innovation: 8-layer RVQ audio tokenizer at 12.5Hz. Uses Whisper Large encoder → 8-layer Residual Vector Quantizer preserving both semantic and acoustic properties.

Training data: 500B tokens (text, audio, vision)
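
A toy residual vector quantizer illustrating the idea behind the 8-layer RVQ: each layer quantizes the residual left by the previous one. Codebooks, dimensions, and values here are invented for illustration and bear no relation to the real tokenizer.

```python
import random

random.seed(0)
DIM, LAYERS, CODES = 4, 8, 16   # toy sizes, not the real tokenizer's
codebooks = [[[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(CODES)]
             for _ in range(LAYERS)]

def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((codebook[i][d] - vec[d]) ** 2 for d in range(DIM)))

def rvq_encode(vec):
    residual, indices = list(vec), []
    for cb in codebooks:
        idx = nearest(cb, residual)
        indices.append(idx)
        # Each layer only has to encode what the previous layers missed.
        residual = [residual[d] - cb[idx][d] for d in range(DIM)]
    return indices

print(rvq_encode([0.3, -0.7, 0.1, 0.9]))   # 8 indices per frame at 12.5 Hz
```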

9. Step-Audio (February 2025)

Huang et al. · arXiv:2502.11946
Core Innovation: Dual-codebook tokenization with 2:3 temporal interleaving. Semantic (16.7Hz, 1024 codebook) + acoustic (25Hz, 4096 codebook) tokenizers.

Scale: 130B parameters—first production-ready open-source solution at this scale.

  • Foundation: Step-1 130B with audio-contextualized pretraining
  • Speech decoder: Hybrid flow matching + neural vocoding
  • Capabilities: Dialect, emotion, singing, RAP, tool calling
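
A sketch of the 2:3 temporal interleaving described above, assuming the merge simply alternates 2 semantic tokens (16.7 Hz) with 3 acoustic tokens (25 Hz) so that both groups cover the same duration (2/16.7 s ≈ 3/25 s ≈ 0.12 s); tokens are placeholders.

```python
def interleave_2_3(semantic, acoustic):
    """Merge the 16.7 Hz semantic and 25 Hz acoustic streams as 2:3 groups."""
    out = []
    for i in range(len(semantic) // 2):
        out += semantic[2 * i:2 * i + 2]   # 2 semantic tokens ≈ 0.12 s at 16.7 Hz
        out += acoustic[3 * i:3 * i + 3]   # 3 acoustic tokens = 0.12 s at 25 Hz
    return out

semantic = [f"s{i}" for i in range(6)]     # placeholder 16.7 Hz tokens
acoustic = [f"a{i}" for i in range(9)]     # placeholder 25 Hz tokens
print(interleave_2_3(semantic, acoustic))
```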

10. Qwen2.5-Omni (March 2025)

Xu et al. · arXiv:2503.20215
Core Innovation: Thinker-Talker architecture. Talker directly reads Thinker's hidden states (not text output), preserving information that would be lost in text serialization.
  • Thinker: Transformer decoder + Whisper-large-v3 encoder
  • Talker: Dual-track autoregressive Transformer
  • TMRoPE: 40ms = 1 temporal unit for audio-video alignment
  • Sliding-window DiT: Reduces first-packet delay
[Figure: the Thinker (standard LLM with a text head producing "Hello, how...") passes its hidden states to the Talker (dual-track AR Transformer), which emits audio tokens; the Talker sees hidden states, not text output]
Figure 5: Qwen's Thinker-Talker separation. The Talker receives high-dimensional hidden states rather than discrete text tokens.
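
A structural sketch of the Thinker-Talker interface with hypothetical stand-in modules; the point is only that the Talker consumes continuous hidden states rather than the sampled text.

```python
def thinker(input_tokens):
    """Stand-in for the Thinker: returns hidden states and the text-head output."""
    hidden_states = [[0.1 * i, -0.2 * i] for i in range(len(input_tokens))]  # toy vectors
    text_tokens = ["Hello", ",", "how", "..."]
    return hidden_states, text_tokens

def talker(hidden_states):
    """Stand-in for the Talker: conditioned on continuous hidden states,
    not on the discrete text tokens a cascaded TTS would receive."""
    return [f"audio_token_{i}" for i in range(len(hidden_states))]

hidden, text = thinker(["user", "speech", "tokens"])
print(text)            # what a text-only interface would expose
print(talker(hidden))  # what the Talker actually consumes and produces
```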

11. LLaMA-Omni2 (May 2025)

Fang et al. · arXiv:2505.02625 · ACL 2025
Core Innovation: Autoregressive streaming decoder (vs LLaMA-Omni's CTC). Switches to AR + CosyVoice 2 flow matching, trading latency for better audio quality.

Key insight: 200K well-curated dialogues outperform millions of hours of noisy data.

Latency: ~600ms (higher than v1's 226ms, but better UTMOS quality scores)

12. Step-Audio 2 (July 2025)

Wu et al. · arXiv:2507.16632
Core Innovation: RL-based reasoning + RAG. PPO for sequence quality, GRPO for audio realism. RAG with web search and audio search for timbre switching.

Scale: Trained on 8 million hours of speech data.

Performance: 3.18% WER (LibriSpeech), 76.55% paralinguistic accuracy

13. Qwen3-Omni (September 2025)

Xu et al. · arXiv:2509.17765
Core Innovation: 32-codebook MTP + causal ConvNet. Multi-token prediction for all codebook layers. Lightweight ConvNet replaces DiT for streaming from first codec frame.
  • MoE Thinker-Talker: 30B total, 3B active
  • AuT encoder: Trained from scratch on 20M hours
  • Code2Wav: Causal ConvNet replacing block-wise DiT

Results: Open-source SOTA on 32/36 benchmarks. 234ms first-packet latency.
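
A sketch of how multi-token prediction over all codebooks enables streaming from the first codec frame; `mtp_step` and `causal_convnet_decode` are hypothetical stand-ins, not the released implementation.

```python
import random

N_CODEBOOKS = 32

def mtp_step(context):
    """Stand-in for one Talker step: one token for every codebook at once."""
    return [random.randint(0, 1023) for _ in range(N_CODEBOOKS)]

def causal_convnet_decode(frame):
    """Stand-in for Code2Wav: consumes one complete codec frame with no
    look-ahead, so synthesis can begin at the very first frame."""
    return f"<audio chunk for frame starting {frame[:3]}...>"

context, audio_chunks = [], []
for step in range(3):
    frame = mtp_step(context)              # all 32 codebooks in a single step
    audio_chunks.append(causal_convnet_decode(frame))
    context.append(frame)
print(audio_chunks[0])                     # streaming output available after step 0
```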

3. Key Architectures Explained

Audio Tokenization Approaches

Model | Codec | Frame Rate | Key Property
SpeechGPT | HuBERT discrete units | – | Semantic-focused
Mini-Omni/2 | SNAC (7 layers) | – | Hierarchical with one-step delay
Moshi | Mimi | 12.5 Hz | Semantic in early acoustic layers
Baichuan-Omni-1.5 | 8-layer RVQ | 12.5 Hz | Whisper encoder + dual properties
Step-Audio | Dual-codebook | 16.7 Hz + 25 Hz | Separate semantic/acoustic
Qwen3-Omni | 32-codebook | – | Multi-token prediction (MTP)

Decoder Types and Latency

Model | Latency | Decoder Type | Technique
LLaMA-Omni | 226 ms | Non-autoregressive (CTC) | Parallel discrete-unit prediction
Moshi | 160-200 ms | Depth + Temporal | Parallel stream modeling
MinMo | ~100 ms (STT) | AR streaming | Token interleaving (5:15)
LLaMA-Omni2 | ~600 ms | AR + CosyVoice 2 | Quality over latency
Qwen3-Omni | 234 ms | MTP + ConvNet | Streaming from first frame

4. Comparison Tables

Model Overview

Model | Date | Params | Training Data | Key Innovation
SpeechGPT | May 2023 | 7B | SpeechInstruct | HuBERT + chain-of-modality
Mini-Omni | Aug 2024 | – | VoiceAssistant-400K | Parallel 8 tokens/step
LLaMA-Omni | Sep 2024 | 8B | 200K pairs | NAR CTC decoder
Moshi | Sep 2024 | 7B | 2.1T tokens | Inner Monologue + full-duplex
Baichuan-Omni | Oct 2024 | 7B | – | First omni-modal 7B
Mini-Omni2 | Oct 2024 | 0.5B | Limited | Command-based interruption
MinMo | Jan 2025 | 8B | 1.4M hours | Token interleaving (5:15)
Baichuan-Omni-1.5 | Jan 2025 | – | 500B tokens | 8-layer RVQ at 12.5 Hz
Step-Audio | Feb 2025 | 130B | – | Dual-codebook (2:3 interleave)
Qwen2.5-Omni | Mar 2025 | – | – | Thinker-Talker (hidden states)
LLaMA-Omni2 | May 2025 | 0.5-32B | 200K dialogues | AR decoder (quality focus)
Step-Audio 2 | Jul 2025 | – | 8M hours | RL (PPO/GRPO) + RAG
Qwen3-Omni | Sep 2025 | 30B (3B active) | 20M hours | 32-codebook MTP + ConvNet

The Arc of Innovation

  • 2023: Feasibility ("Can we do this?")
  • Early 2024: Latency ("How fast?")
  • Late 2024: Full-duplex ("How naturally?")
  • 2025: Scale + quality ("How well?")
Figure 6: Evolution of focus areas in speech-to-speech research.

5. References

  1. Zhang et al. "SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities." arXiv:2305.11000, May 2023.
  2. Xie & Wu. "Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming." arXiv:2408.16725, Aug 2024.
  3. Fang et al. "LLaMA-Omni: Seamless Speech Interaction with Large Language Models." arXiv:2409.06666, Sep 2024.
  4. Défossez et al. "Moshi: A Speech-Text Foundation Model for Real-Time Dialogue." arXiv:2410.00037, Oct 2024.
  5. Li et al. "Baichuan-Omni Technical Report." arXiv:2410.08565, Oct 2024.
  6. Xie & Wu. "Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities." arXiv:2410.11190, Oct 2024.
  7. Chen et al. "MinMo: A Multimodal Large Language Model for Seamless Voice Interaction." arXiv:2501.06282, Jan 2025.
  8. Li et al. "Baichuan-Omni-1.5 Technical Report." arXiv:2501.15368, Jan 2025.
  9. Huang et al. "Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction." arXiv:2502.11946, Feb 2025.
  10. Xu et al. "Qwen2.5-Omni Technical Report." arXiv:2503.20215, Mar 2025.
  11. Fang et al. "LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis." arXiv:2505.02625, May 2025. ACL 2025.
  12. Wu et al. "Step-Audio 2 Technical Report." arXiv:2507.16632, Jul 2025.
  13. Xu et al. "Qwen3-Omni Technical Report." arXiv:2509.17765, Sep 2025.
  14. Ji et al. "WavChat: A Survey of Spoken Dialogue Models." arXiv:2411.13577, Nov 2024. (Survey reference)

Last updated: January 2026 · Siddharth Choudhary