Speech Language Models
Treating speech as a sequence of discrete tokens generated by language-model-style architectures — neural codecs, speech tokenization, and TTS recast as next-token prediction over audio.
A speech language model treats speech as a sequence of discrete tokens and generates it autoregressively, the way a text language model generates words. The recipe inverts decades of speech-synthesis methodology. Instead of regressing acoustic features (mel-spectrograms, vocoder parameters) from text with a specialised model, a neural audio codec compresses waveforms into sequences of tokens drawn from a small vocabulary, and a transformer trained with next-token prediction generates those tokens conditioned on text, a voice prompt, or both. The codec's decoder then converts the generated tokens back into a waveform. The resulting systems share architecture, training objective, and scaling behaviour with text language models, and they inherit much of the zero-shot, in-context, and instruction-following behaviour of those models.
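Written out, the generation step amounts to a standard autoregressive factorisation. The notation here is an assumption introduced for illustration rather than taken from any one paper: $a_1, \dots, a_T$ are codec tokens, $x$ is the input text, $v$ is an optional voice prompt, and $\theta$ denotes the transformer parameters.

```latex
p_\theta(a_{1:T} \mid x, v) \;=\; \prod_{t=1}^{T} p_\theta\!\left(a_t \mid a_{<t},\, x,\, v\right),
\qquad
\mathcal{L}(\theta) \;=\; -\sum_{t=1}^{T} \log p_\theta\!\left(a_t \mid a_{<t},\, x,\, v\right)
```

Training minimises $\mathcal{L}(\theta)$ over recorded speech, and synthesis samples $a_t$ one step at a time before the codec decoder maps $a_{1:T}$ back to audio. Depending on the system, $a_t$ may be a single token or the stack of codebook entries for frame $t$.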
Discrete speech tokens
The keystone of the paradigm is speech tokenization: turning a continuous waveform into a finite sequence of integers without losing the information needed to reconstruct intelligible, natural-sounding speech. Two families of tokens dominate practice. Acoustic tokens, produced by neural codecs like SoundStream and EnCodec, are residual-vector-quantised codes optimised for waveform reconstruction. They preserve speaker identity, prosody, and recording conditions, but they are hard to model: every frame carries one code per quantiser level, and each level encodes the residual left by the levels before it, so the codebooks within a frame are strongly interdependent. Semantic tokens, produced by discretising the representations of self-supervised models like HuBERT and WavLM, capture mostly linguistic content and discard most paralinguistic information. They are easier for a language model to predict but lose the cues a vocoder needs to render natural audio. Guo et al. survey this design space, mapping the trade-offs between codebook size, frame rate, reconstruction fidelity, and downstream LM perplexity, and argue that hybrid tokenizers (semantic tokens for the LM, acoustic tokens for resynthesis, with a learned bridge between them) have become the dominant pattern.
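To make the acoustic-token side concrete, here is a minimal sketch of residual vector quantisation and the token bitrate it implies. It is a toy under stated assumptions, not the SoundStream or EnCodec implementation: the codebooks are random rather than learned, every function name is hypothetical, and a zero vector is placed in each codebook so that adding a stage can never increase the reconstruction error.

```python
import math
import numpy as np

def rvq_encode(frames, codebooks):
    """Quantise frames (num_frames, dim) with a stack of codebooks.

    Each codebook codes the residual left over by the previous ones.
    Returns integer codes of shape (num_codebooks, num_frames).
    """
    residual = frames.astype(np.float64).copy()
    codes = []
    for cb in codebooks:  # cb has shape (codebook_size, dim)
        # Squared distance from every residual to every codeword.
        dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)
        idx = dists.argmin(axis=1)
        codes.append(idx)
        residual = residual - cb[idx]
    return np.stack(codes, axis=0)

def rvq_decode(codes, codebooks):
    """Reconstruct frames by summing the selected codeword from each level."""
    return sum(cb[idx] for cb, idx in zip(codebooks, codes))

def make_toy_codebooks(num_codebooks=4, size=16, dim=8, seed=0):
    """Random codebooks with shrinking scale (an assumption; real ones are learned)."""
    rng = np.random.default_rng(seed)
    codebooks = []
    for level in range(num_codebooks):
        cb = rng.normal(scale=0.5 ** level, size=(size, dim))
        cb[0] = 0.0  # toy choice: a zero entry guarantees the error never grows
        codebooks.append(cb)
    return codebooks

def bitrate_bps(frame_rate_hz, num_codebooks, codebook_size):
    """Bits per second needed to transmit the token stream."""
    return frame_rate_hz * num_codebooks * math.log2(codebook_size)

codebooks = make_toy_codebooks()
frames = np.random.default_rng(1).normal(size=(50, 8))
codes = rvq_encode(frames, codebooks)

# Mean squared error when decoding with only the first k codebooks.
errors = [
    float(np.mean((frames - rvq_decode(codes[:k], codebooks[:k])) ** 2))
    for k in range(1, len(codebooks) + 1)
]
print(codes.shape, errors)
print(bitrate_bps(75, 8, 1024))  # illustrative settings: 75 frames/s, 8 levels, 1024 entries
```

The trade-off named in the text shows up directly in `bitrate_bps`: more levels or a higher frame rate improve fidelity but hand the language model proportionally more tokens per second to predict.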
Neural codec language models
Once speech is tokenised, generating it becomes a sequence-modelling problem. Peng et al. introduce VoiceCraft, a transformer decoder that performs token infilling over EnCodec tokens. A token-rearrangement trick that combines causal masking with delayed stacking lets a single model handle both zero-shot text-to-speech and editing of arbitrary regions within an existing utterance. Trained on diverse audiobooks, podcasts, and YouTube videos, the same architecture generalises to unseen voices and recording conditions, and the authors report that it matches or exceeds VALL-E and commercial systems on naturalness and speaker similarity. Ji et al. push the controllability axis with TextrolSpeech, a corpus and codec-LM TTS pipeline that conditions generation on free-text style descriptions (“an excited woman speaking quickly”), showing that the same next-token recipe can absorb prompt-based style control of the kind familiar from text generation.
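The two rearrangement steps can be sketched on plain Python lists. This is one plausible reading of causal masking and delayed stacking, not VoiceCraft's code: the function names, the `MASK` and `PAD` ids, and the single-span simplification are all assumptions made for illustration.

```python
MASK = -2  # hypothetical id marking where a span was cut out
PAD = -1   # hypothetical id filling the gaps created by the delay

def rearrange_for_infilling(tokens, start, end, mask=MASK):
    """Move tokens[start:end] to the end of the sequence.

    A left-to-right model then sees the context on both sides of the gap
    before it has to generate the missing span.
    """
    prefix, span, suffix = tokens[:start], tokens[start:end], tokens[end:]
    return prefix + [mask] + suffix + [mask] + span

def delay_stack(codes, pad=PAD):
    """Shift codebook k right by k steps.

    codes is a list of num_codebooks rows, each with num_frames tokens.
    After the shift, column t holds codebook k's token for frame t - k, so
    level k of a frame is predicted after the lower levels of that frame.
    """
    num_codebooks, num_frames = len(codes), len(codes[0])
    width = num_frames + num_codebooks - 1
    return [
        [pad] * k + list(row) + [pad] * (width - num_frames - k)
        for k, row in enumerate(codes)
    ]

def undo_delay(delayed, num_frames):
    """Invert delay_stack to recover frame-aligned codes."""
    return [row[k:k + num_frames] for k, row in enumerate(delayed)]

print(rearrange_for_infilling([1, 2, 3, 4, 5, 6], start=2, end=4))
print(delay_stack([[1, 2, 3], [4, 5, 6]]))
```

For text-to-speech, the span to generate is simply everything after the voice prompt, which is why one model can cover both synthesis and editing.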
The view that codec-LM TTS is “just” a language-modelling problem invites importing techniques from the text side. Lu et al. show that an instruction-following speech language model can be built without any speech instruction-tuning data, by aligning a pre-trained speech encoder with a text LLM and relying on text-side instruction tuning to transfer through the shared representation. The result is a single model that follows spoken instructions in natural language, an early step toward general-purpose speech assistants that do not route through a separate ASR-then-LLM pipeline.
Hybrid: diffusion, flow, and variational generators conditioned on tokens
Pure autoregressive token prediction is not the only way to consume speech tokens. Li et al. propose StyleTTS, a style-based generative model that factors speech into a content stream and a learned style vector, sampling diverse, expressive utterances through a generator architecture inspired by StyleGAN; the style vector itself can be predicted from a reference utterance or a text prompt, giving a tractable handle on prosody and speaking style that pure codec-LM TTS struggles to expose. Miao et al.’s EfficientTTS 2 unifies text-to-waveform generation and voice conversion in a single one-stage variational end-to-end model, narrowing the long-standing gap between two-stage (mel + vocoder) and one-stage TTS pipelines and laying out a recipe for training stability when the encoder, posterior, and decoder are jointly optimised.
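One common way for StyleGAN-family generators to inject a style vector is adaptive instance normalisation (AdaIN). The sketch below assumes that mechanism for illustration. The text above does not state the exact layer used, and the projection matrices `w_gamma` and `w_beta`, the shapes, and the style vectors are all hypothetical.

```python
import numpy as np

def adain(features, style, w_gamma, w_beta, eps=1e-5):
    """Re-style content features with a style vector.

    features: (channels, time) content representation
    style:    (style_dim,) style vector, e.g. predicted from a reference clip
    w_gamma, w_beta: (channels, style_dim) projections (learned in practice)
    """
    # Strip each channel's own statistics across time ...
    mean = features.mean(axis=1, keepdims=True)
    std = features.std(axis=1, keepdims=True)
    normalised = (features - mean) / (std + eps)
    # ... then impose a scale and shift that depend only on the style vector.
    gamma = w_gamma @ style
    beta = w_beta @ style
    return (1.0 + gamma)[:, None] * normalised + beta[:, None]

rng = np.random.default_rng(0)
channels, frames, style_dim = 4, 100, 3
content = rng.normal(size=(channels, frames))
w_gamma = rng.normal(size=(channels, style_dim))
w_beta = rng.normal(size=(channels, style_dim))

style_a = np.array([1.0, 0.0, 0.0])
style_b = np.array([0.0, 1.0, 0.0])
out_a = adain(content, style_a, w_gamma, w_beta)
out_b = adain(content, style_b, w_gamma, w_beta)
print(out_a.mean(axis=1), out_b.mean(axis=1))
```

The same content features produce different outputs under different style vectors, which is the handle on prosody and speaking style that the paragraph describes.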
The same conditioning idea, discrete semantic codes feeding a continuous generator, is explored systematically by Qiang et al., who compare semantic-coding choices for minimally-supervised speech synthesis in a system that pairs a conditional diffusion acoustic model with a language model over semantic tokens. Guan et al.’s MM-TTS fuses textual, visual, and audio prompts into a shared style embedding for expressive, multi-modal-conditioned TTS. Qi et al.’s PAVITS carries the prosody-aware end-to-end VITS architecture into emotional voice conversion (EVC), showing that conditioning a flow-based generator on explicit prosody descriptors improves naturalness and emotional similarity.
Data: the prerequisite for scale
Speech language models exhibit the same scaling sensitivities as their text counterparts, and the field has been bottlenecked less by architecture than by the scarcity of large, diverse, openly available speech corpora. He et al. address this with Emilia, an extensive multilingual dataset built from in-the-wild speech across six languages, with a deliberately heterogeneous mix of speaking styles, accents, and recording conditions. The dataset is offered as the substrate for the next generation of speech generation models — a role analogous to the one that Common Crawl plays for text LMs — and the paper documents the cleaning and tagging pipeline so the methodology can be reproduced for new languages.
Open problems
The frontier is moving quickly. Codebook design (flat vs. residual, single vs. multi-stream, fixed vs. learned frame rate) is still settling. The field has not converged on whether to model acoustic and semantic tokens jointly, hierarchically, or with a separate acoustic decoder. Conditioning interfaces (text prompts, reference clips, structured style descriptors, free-text instructions) proliferate without standardisation. Evaluation lags far behind: naturalness MOS, speaker similarity, and word error rate (WER) are too coarse to discriminate between current systems, and human evaluation is expensive and noisy. Finally, the safety surface of speech LMs (voice cloning, deepfakes, attribution) is wider than that of text LMs, and watermarking, provenance, and detection are open methodological problems rather than solved engineering ones.
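Word error rate illustrates why the automatic metrics are coarse. The sketch below computes it as word-level edit distance divided by reference length. It assumes whitespace tokenisation and lower-casing only, whereas real evaluations first transcribe the synthesised audio with an ASR system and apply fuller text normalisation. The function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, insertions, and deletions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference word
                curr[j - 1] + 1,     # insert a hypothesis word
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]

def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)

# One substitution (sat -> sit) and one deletion (the) over six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Two systems whose output transcribes to the same words receive the same WER regardless of how natural or expressive the audio is, which is one reason the metric cannot separate strong systems from one another.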
Sources
- paper · primary · 2024peng-2024
- paper · supporting · 2024he-2024
- paper · primary · 2025li-2025
- paper · supporting · 2024ji-2024
- paper · primary · 2024miao-2024
- paper · supporting · 2024qiang-2024
- paper · supporting · 2024guan-2024
- paper · primary · 2025lu-2025
- paper · supporting · 2024qi-2024
Review this topic
This page was drafted by an agent and is waiting on expert review. Spotted a wrong prerequisite, a missing concept, a misattributed source, or a factual slip? Tell us — your review opens a tracked issue maintainers act on.