Natural Language Processing
The computational analysis and generation of human language — parsing, semantics, machine translation, and large language models.
Natural language processing is the field concerned with enabling computers to understand, generate, and reason about human language. It sits at the intersection of computer science, linguistics, and machine learning, and its challenges are among the deepest in AI — human language is ambiguous, context-dependent, culturally situated, and endlessly creative. The field has undergone a dramatic transformation over the past decade, from hand-crafted rules and statistical models to deep neural architectures that can translate between languages, answer questions, summarize documents, and carry on extended conversations.
Linguistic Foundations and Text Processing
Before any learning algorithm can operate on text, the text must be transformed into a representation that a computer can manipulate. This pipeline begins with text processing: converting raw character streams into structured units suitable for analysis.
Tokenization is the first step — splitting a string of text into individual units called tokens. For languages like English, tokens roughly correspond to words and punctuation marks, but even here the task is non-trivial: contractions (“don’t”), hyphenated compounds (“state-of-the-art”), and URLs all require careful handling. For languages without whitespace delimiters (Chinese, Japanese, Thai), tokenization is itself a significant NLP problem. Modern systems increasingly use subword tokenization methods — Byte Pair Encoding (BPE), introduced to NLP by Rico Sennrich and colleagues in 2016, and SentencePiece — that learn a vocabulary of frequent character sequences from data, balancing the granularity of character-level models with the efficiency of word-level models.
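The BPE learning loop can be sketched in a few lines: start from characters and repeatedly merge the most frequent adjacent pair of symbols. This is a toy illustration (the word-frequency corpus below echoes the example in Sennrich et al.'s paper); real tokenizers also handle byte-level fallback and apply the learned merges at encoding time.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn BPE merges from a word-frequency dict (a toy sketch).

    Each word starts as a sequence of characters; the most frequent
    adjacent symbol pair is merged into one symbol, repeatedly.
    """
    vocab = {tuple(w): f for w, f in words.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Re-segment every word with the new merge applied.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges

corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe_merges(corpus, 4)
```

The first merges learned are frequent suffix fragments like "es" and "est", illustrating how subword units emerge from corpus statistics alone.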
The linguistic structure of language is traditionally described at multiple levels. Morphology studies the internal structure of words: how stems, prefixes, suffixes, and inflections combine to form words and convey grammatical information. Syntax describes how words combine into phrases and sentences according to grammatical rules. Semantics concerns the meaning of words, phrases, and sentences. Pragmatics addresses how meaning depends on context — what the speaker intends, what the listener already knows, and the social situation. Each level presents distinct challenges for computational systems.
Ambiguity pervades every level of language. Lexical ambiguity arises when a single word has multiple meanings (“bank” as a financial institution versus a river bank). Syntactic ambiguity arises when a sentence can be parsed in multiple ways (“I saw the man with the telescope” — who has the telescope?). Semantic ambiguity arises when a sentence has multiple interpretations even after parsing. Resolving these ambiguities requires context, world knowledge, and common-sense reasoning — making NLP one of the “AI-complete” problems, in the sense that fully solving it would seem to require solving AI in general.
The historical trajectory of the field mirrors the broader arc of AI. Early NLP systems in the 1950s and 1960s — including the Georgetown-IBM experiment in machine translation (1954) and Joseph Weizenbaum’s ELIZA (1966) — used hand-crafted rules and pattern matching. The statistical revolution of the 1990s, championed by researchers like Frederick Jelinek at IBM, replaced rules with probabilistic models trained on large corpora. And the neural revolution of the 2010s replaced hand-engineered features with learned representations, culminating in the transformer architecture and large language models.
Language Models and the N-Gram Tradition
A language model assigns probabilities to sequences of words (or tokens), capturing regularities in how language is used. Language models are the backbone of NLP: they underlie speech recognition, machine translation, text generation, spelling correction, and many other applications.
The earliest computational language models were n-gram models, which estimate the probability of a word given the preceding words using the Markov assumption: P(w_i | w_1, ..., w_{i-1}) ≈ P(w_i | w_{i-n+1}, ..., w_{i-1}). A unigram model considers words independently; a bigram model conditions on the immediately preceding word; a trigram model on the two preceding words. Probabilities are estimated by counting occurrences in a training corpus and normalizing.
The fundamental problem with n-gram models is data sparsity: most n-grams, especially for n ≥ 3, never appear in any finite corpus, making naive probability estimates unreliable. Smoothing techniques redistribute probability mass from observed n-grams to unseen ones. Add-one (Laplace) smoothing adds a count of one to every n-gram — simple but crude. Good-Turing smoothing estimates the probability of unseen events using the frequency of events seen once. Kneser-Ney smoothing, one of the most effective methods, uses absolute discounting combined with a clever backoff distribution based on the diversity of contexts in which a word appears.
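A minimal sketch of add-one smoothing over bigrams, using the counting-and-normalizing recipe above (illustrative only; a production model would use Kneser-Ney and handle sentence boundaries):

```python
from collections import Counter

def bigram_probs_laplace(tokens, vocab):
    """Bigram probabilities with add-one (Laplace) smoothing — a toy sketch.

    P(w2 | w1) = (count(w1, w2) + 1) / (count(w1) + |V|)
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    V = len(vocab)

    def prob(w1, w2):
        return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)

    return prob

tokens = "the cat sat on the mat".split()
p = bigram_probs_laplace(tokens, vocab=set(tokens))
# The seen bigram ("the", "cat") outscores the unseen ("the", "sat"),
# but the unseen bigram still receives nonzero probability.
```

Note how smoothing trades a little probability mass from observed events to guarantee that no bigram is ever assigned zero probability.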
Perplexity is the standard evaluation metric for language models, defined as PP(W) = P(w_1 w_2 ... w_N)^(-1/N), where W is a test corpus of N words. Lower perplexity indicates a better model — one that assigns higher probability to the actual text. Perplexity can be interpreted as the effective number of equally likely next words the model is choosing among at each step. N-gram models typically achieve perplexities in the hundreds; modern neural language models reduce this to the low tens on standard benchmarks.
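The definition can be checked directly in code. This sketch computes perplexity from the per-token probabilities a model assigned, and verifies the "effective number of choices" interpretation:

```python
import math

def perplexity(probs):
    """Perplexity over a test sequence, given the probability the model
    assigned to each token: PP = exp(-(1/N) * sum(log p_i))."""
    n = len(probs)
    return math.exp(-sum(math.log(p) for p in probs) / n)

# A model that assigns uniform probability 1/4 to every token is
# "choosing among 4 equally likely words" — its perplexity is exactly 4.
pp_uniform = perplexity([0.25] * 8)

# A model that is more confident about the true tokens scores lower.
pp_confident = perplexity([0.5] * 8)
```

Working in log space, as here, avoids the numerical underflow that multiplying thousands of small probabilities would cause.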
Neural language models replace the discrete probability tables of n-gram models with continuous function approximators. The first influential neural language model was proposed by Yoshua Bengio and colleagues in 2003, which used a feedforward network over concatenated word embeddings. This approach addressed the sparsity problem by mapping words into a continuous space where similar words have similar representations, enabling the model to generalize from seen contexts to unseen but similar ones.
Word Embeddings and Distributional Semantics
The insight that words can be represented as dense vectors in a continuous space — word embeddings — transformed NLP in the 2010s. The foundational idea traces back to the distributional hypothesis, articulated by linguist John Rupert Firth in 1957: “You shall know a word by the company it keeps.” Words that appear in similar contexts tend to have similar meanings, and embedding methods exploit this regularity.
Word2Vec, introduced by Tomas Mikolov and colleagues at Google in 2013, provided two efficient architectures for learning embeddings from large corpora. The Skip-gram model predicts context words from a target word; the Continuous Bag-of-Words (CBOW) model predicts a target word from its context. Training uses negative sampling — a simplified form of noise-contrastive estimation that avoids the computational expense of a full softmax over the vocabulary. The resulting vectors exhibit striking algebraic properties: the famous analogy vec(“king”) − vec(“man”) + vec(“woman”) ≈ vec(“queen”) demonstrated that semantic relationships are encoded as linear directions in the vector space.
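The analogy arithmetic itself is simple vector algebra plus a nearest-neighbor search under cosine similarity. The embeddings below are hand-made for illustration, not trained vectors:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def analogy(emb, a, b, c):
    """Return the word whose vector is closest to vec(b) - vec(a) + vec(c),
    excluding the three query words — the classic analogy evaluation."""
    target = [y - x + z for x, y, z in zip(emb[a], emb[b], emb[c])]
    candidates = (w for w in emb if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(emb[w], target))

# Toy 3-d vectors: one dimension loosely for "royalty", one for "male",
# one for "female" (purely illustrative values).
emb = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.5, 0.5, 0.5],
}
result = analogy(emb, "man", "king", "woman")
```

With trained embeddings the same search, typically over hundreds of thousands of words, recovers the analogy relationships described above.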
GloVe (Global Vectors), developed by Jeffrey Pennington, Richard Socher, and Christopher Manning at Stanford in 2014, approaches the same goal differently. Rather than training on local context windows, GloVe factorizes the global word co-occurrence matrix, combining the strengths of count-based and prediction-based methods. The objective minimizes a weighted least-squares loss over log co-occurrence counts, and the resulting embeddings perform comparably to Word2Vec on most benchmarks.
FastText, developed at Facebook in 2017, extends Word2Vec by representing each word as a bag of character n-grams, allowing the model to generate embeddings for words not seen during training (out-of-vocabulary words) and to capture morphological regularities — the embeddings for “running,” “runner,” and “ran” share subword components that encode their common root.
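The subword decomposition at the heart of FastText is easy to reproduce. This sketch extracts boundary-marked character n-grams (FastText's default range is 3 to 6) and shows the shared units between morphologically related words:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams with boundary markers, as in FastText — a sketch.
    Wrapping the word in '<' and '>' keeps prefixes and suffixes distinct
    from word-internal sequences."""
    w = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(w) - n + 1):
            grams.add(w[i:i + n])
    return grams

# Related word forms share subword units that encode their common root;
# a word's embedding is the sum of its n-gram embeddings, so unseen
# words still get a representation from their pieces.
shared = char_ngrams("running") & char_ngrams("runner")
```

Because every word decomposes into n-grams drawn from a fixed hash table, even an out-of-vocabulary word like a typo or rare inflection maps to a meaningful vector.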
A critical limitation of static word embeddings is that they assign a single vector to each word type regardless of context. The word “bank” receives the same embedding whether it appears in “river bank” or “bank account.” Contextual embeddings — where the representation of a word depends on its surrounding context — addressed this limitation. ELMo (Embeddings from Language Models), introduced by Matthew Peters and colleagues in 2018, generates context-dependent representations using a bidirectional LSTM language model. Each word’s embedding is a learned combination of the hidden states across all layers, capturing both syntactic and semantic information. ELMo was rapidly superseded by transformer-based contextual models, but it demonstrated the power of the underlying idea.
Recurrent Networks and Sequence Modeling
Before transformers, recurrent neural networks (RNNs) were the dominant architecture for processing sequential data in NLP. An RNN maintains a hidden state that is updated at each time step as a function of the current input and the previous hidden state: h_t = f(W_x x_t + W_h h_{t-1} + b). This recurrence allows the network to, in principle, capture dependencies across arbitrary distances in a sequence.
In practice, vanilla RNNs suffer from the vanishing gradient problem: gradients backpropagated through many time steps shrink exponentially, making it nearly impossible to learn long-range dependencies. The Long Short-Term Memory (LSTM) architecture, introduced by Sepp Hochreiter and Jürgen Schmidhuber in 1997, addresses this with a cell state that runs through the sequence like a conveyor belt, regulated by three gates — the input gate, forget gate, and output gate — that control what information is written, retained, and read. The Gated Recurrent Unit (GRU), proposed by Kyunghyun Cho and colleagues in 2014, simplifies the LSTM by merging the cell state and hidden state and using only two gates, achieving comparable performance with fewer parameters.
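The gate arithmetic of a single LSTM step can be sketched with scalar states (real implementations use weight matrices over vectors; the tied weights below are arbitrary illustrative values):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM cell update for scalar input and state — a minimal sketch.
    params maps each gate name to its (w_x, w_h, b) weights."""
    def gate(name, squash):
        w_x, w_h, b = params[name]
        return squash(w_x * x + w_h * h_prev + b)

    i = gate("input", sigmoid)        # how much new information to write
    f = gate("forget", sigmoid)       # how much old cell state to keep
    o = gate("output", sigmoid)       # how much of the cell state to expose
    g = gate("candidate", math.tanh)  # proposed new cell content

    c = f * c_prev + i * g            # cell state: the "conveyor belt"
    h = o * math.tanh(c)              # hidden state, read through the gate
    return h, c

# Illustrative weights only — a trained LSTM learns these per dimension.
params = {name: (0.5, 0.5, 0.0)
          for name in ("input", "forget", "output", "candidate")}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.0, params=params)
```

The key point is the additive update of c: because the cell state is carried forward by a gated sum rather than repeated squashing, gradients along it do not vanish the way they do in a vanilla RNN.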
Bidirectional RNNs process the sequence in both directions — left-to-right and right-to-left — and concatenate the hidden states, giving each position access to both past and future context. Stacked (multi-layer) architectures build deeper representations by feeding the hidden states of one RNN layer as input to the next.
The sequence-to-sequence (seq2seq) framework, introduced by Ilya Sutskever, Oriol Vinyals, and Quoc Le in 2014, applies RNNs to variable-length input-output problems like machine translation. An encoder RNN reads the input sequence and compresses it into a fixed-length context vector; a decoder RNN generates the output sequence conditioned on this vector. The bottleneck of compressing an entire input into a single vector was addressed by the attention mechanism, proposed by Dzmitry Bahdanau and colleagues in 2014, which allows the decoder to attend to different parts of the encoder’s output at each generation step. Attention proved transformative — it improved translation quality substantially and laid the groundwork for the transformer.
The Transformer and Pre-trained Language Models
The transformer architecture, introduced by Ashish Vaswani and colleagues in their 2017 paper Attention Is All You Need, replaced recurrence entirely with self-attention, allowing every position in a sequence to attend to every other position in a single operation. The core computation is scaled dot-product attention:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V,
where Q (queries), K (keys), and V (values) are linear projections of the input and d_k is the dimensionality of the keys. Multi-head attention runs this operation multiple times in parallel with different projection matrices and concatenates the results, enabling the model to capture different types of relationships simultaneously. Positional encodings — originally sinusoidal functions, later replaced by learned embeddings or rotary position embeddings (RoPE) — inject sequence-order information that the attention mechanism does not inherently encode.
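Scaled dot-product attention is compact enough to implement directly. This sketch represents matrices as lists of rows and omits the learned projections and multi-head machinery:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention — a sketch. Each query row attends
    over all key rows; each output row is a weighted sum of value rows."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# One query that matches the first key far more strongly than the second:
Q = [[1.0, 0.0]]
K = [[10.0, 0.0], [0.0, 10.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = attention(Q, K, V)
# The output row is pulled almost entirely toward the first value row.
```

The 1/sqrt(d_k) scaling keeps the dot products from growing with dimensionality, which would otherwise push the softmax into regions with vanishing gradients.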
The transformer’s parallelizability over sequence length made it far more efficient to train on modern GPUs than recurrent models, and it quickly became the dominant architecture for NLP and beyond. Two families of pre-trained models emerged from it.
BERT (Bidirectional Encoder Representations from Transformers), introduced by Jacob Devlin and colleagues at Google in 2018, is a bidirectional encoder trained with masked language modeling — randomly masking 15% of tokens and predicting them from surrounding context — and next sentence prediction. BERT’s contextual embeddings can be fine-tuned for a wide range of downstream tasks (classification, question answering, named entity recognition), and it set new state-of-the-art results on nearly every NLP benchmark at the time of its release. Variants followed rapidly: RoBERTa (Facebook, 2019) showed that training longer on more data with better hyperparameters improved performance substantially; ALBERT reduced parameter count through factorized embeddings; DistilBERT compressed the model through knowledge distillation.
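The masked-language-modeling corruption step can be sketched as follows. The 80/10/10 split of selected positions follows the recipe in the BERT paper; the helper itself is illustrative, not the original implementation:

```python
import random

def mask_tokens(tokens, vocab, rng, mask_rate=0.15):
    """BERT-style masked-LM corruption — a sketch.
    Roughly mask_rate of positions are selected for prediction. Of those,
    80% become [MASK], 10% a random token, 10% stay unchanged; labels
    record the original tokens the model must recover."""
    corrupted, labels = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_rate:
            continue
        labels[i] = tok
        r = rng.random()
        if r < 0.8:
            corrupted[i] = "[MASK]"
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)
        # else: keep the original token (the model must still predict it)
    return corrupted, labels

rng = random.Random(0)
tokens = "the quick brown fox jumps over the lazy dog".split()
corrupted, labels = mask_tokens(tokens, vocab=tokens, rng=rng)
```

Keeping 10% of selected tokens unchanged prevents the model from learning that unmasked tokens are always correct, which would hurt it at fine-tuning time when no [MASK] tokens appear.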
The GPT series (OpenAI, 2018 onward) takes the opposite approach: a unidirectional (autoregressive) decoder trained to predict the next token. GPT-2 (2019) demonstrated that language models trained on large web corpora could generate remarkably coherent text. GPT-3 (2020), with 175 billion parameters, revealed in-context learning — the ability to perform new tasks given only a few examples in the prompt, without any gradient updates. Scaling laws, identified by Jared Kaplan and colleagues, showed smooth power-law relationships between model size, data, compute, and loss, providing a principled basis for building ever-larger models. The T5 model (Google, 2019) unified all NLP tasks into a text-to-text framework, treating classification, translation, summarization, and question answering as instances of the same sequence-to-sequence problem.
Machine Translation and Information Extraction
Machine translation — the automatic conversion of text from one natural language to another — has been a driving application of NLP since the field’s earliest days. The Georgetown-IBM experiment of 1954 used hand-crafted rules to translate Russian sentences into English; the ALPAC report of 1966 famously declared machine translation impractical and cut funding for a decade.
The statistical machine translation (SMT) paradigm, pioneered at IBM in the late 1980s by Peter Brown and colleagues, modeled translation probabilistically using Bayes’ rule: e* = argmax_e P(e | f) = argmax_e P(f | e) P(e), where f is the foreign sentence and e is the English translation. Phrase-based SMT extended word-level models to phrases, capturing local reordering and idiomatic expressions. SMT dominated the field for two decades, with systems built on parallel corpora, alignment models, and language models.
Neural machine translation (NMT) replaced the entire SMT pipeline with a single end-to-end neural network. The seq2seq model with attention, described above, was the first successful NMT architecture, and the transformer subsequently became the standard. The gains were rapid: Google’s neural machine translation system reduced translation errors by roughly 60% relative to the previous phrase-based system on several language pairs. Multilingual models like mBART and M2M-100 can translate between many language pairs — including pairs for which no parallel training data exists — through zero-shot translation, leveraging shared representations learned across languages.
Evaluation in machine translation relies on automatic metrics and human judgment. BLEU (Bilingual Evaluation Understudy), proposed by Kishore Papineni and colleagues in 2002, measures n-gram overlap between a candidate translation and one or more reference translations. BLEU is fast and widely used but has well-known limitations: it rewards surface similarity rather than meaning, penalizes valid paraphrases, and correlates imperfectly with human quality judgments. METEOR, TER, and learned metrics like COMET address some of these limitations.
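A simplified sentence-level BLEU against a single reference shows the core mechanics of clipped n-gram precision and the brevity penalty (real BLEU is corpus-level, supports multiple references, and applies smoothing for short sentences):

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU against one reference — a simplified sketch."""
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        cand = Counter(zip(*[candidate[i:] for i in range(n)]))
        ref = Counter(zip(*[reference[i:] for i in range(n)]))
        # Clipped counts: each candidate n-gram scores at most as many
        # times as it appears in the reference.
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        total = max(sum(cand.values()), 1)
        if overlap == 0:
            return 0.0  # unsmoothed: any empty n-gram order zeroes the score
        log_prec_sum += math.log(overlap / total)
    # Brevity penalty discourages gaming precision with very short output.
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(log_prec_sum / max_n)

ref = "the cat is on the mat".split()
perfect = bleu(ref, ref)
partial = bleu("the cat".split(), ref)
```

The geometric mean over n-gram orders is what makes unsmoothed BLEU collapse to zero when any order has no matches, one reason sentence-level use requires smoothing in practice.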
Information extraction encompasses tasks that identify structured information in unstructured text. Named entity recognition (NER) identifies mentions of people, organizations, locations, dates, and other entity types. The standard approach casts NER as sequence labeling using the BIO tagging scheme (Beginning, Inside, Outside), with models ranging from conditional random fields to transformer-based architectures. Relation extraction identifies relationships between entities (“Einstein was born in Ulm”); event extraction identifies events and their participants; and entity linking maps mentions to entries in a knowledge base, resolving ambiguity (“Paris” the city versus “Paris” the person).
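The BIO scheme can be illustrated with the sentence from the relation-extraction example above. This sketch converts labeled entity spans to per-token tags (span boundaries and types are hypothetical annotations for the example):

```python
def spans_to_bio(tokens, spans):
    """Convert labeled entity spans to BIO tags — a sketch.
    spans: list of (start, end, type) with end exclusive, non-overlapping.
    B- marks the first token of an entity, I- any continuation, O the rest."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"
    return tags

tokens = ["Einstein", "was", "born", "in", "Ulm", "."]
spans = [(0, 1, "PER"), (4, 5, "LOC")]
tags = spans_to_bio(tokens, spans)
```

The B-/I- distinction matters for adjacent entities of the same type: without it, two consecutive person names would be indistinguishable from one long name.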
Sentiment Analysis and Question Answering
Sentiment analysis — determining the attitude, opinion, or emotion expressed in a text — is one of the most commercially important NLP applications. At its simplest, it is a binary classification problem: is a product review positive or negative? More nuanced formulations include multi-class sentiment (e.g., a five-star rating scale), aspect-based sentiment (the food was excellent but the service was slow), and emotion detection (joy, anger, sadness, fear).
Early approaches relied on sentiment lexicons — hand-curated lists of words annotated with polarity scores — and simple classifiers over bag-of-words features. The neural era brought models that capture context, negation, and compositional meaning far more effectively. Transformer-based models fine-tuned on sentiment datasets now achieve near-human accuracy on standard benchmarks. Challenges remain in detecting sarcasm and irony, handling implicit sentiment (where no explicitly evaluative language is used), and performing sentiment analysis across languages and domains.
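The lexicon-based approach can be sketched in a few lines, including the kind of simple negation handling that early systems bolted on (the lexicon entries and one-word negation scope below are illustrative; real systems used richer lexicons and rules):

```python
def lexicon_sentiment(tokens, lexicon, negators=("not", "never", "no")):
    """Lexicon-based polarity scoring with naive negation flipping —
    a sketch of the pre-neural approach."""
    score, negate = 0, False
    for tok in tokens:
        if tok in negators:
            negate = True
            continue
        polarity = lexicon.get(tok, 0)
        score += -polarity if negate else polarity
        negate = False  # negation scopes over only the next word here
    return score

lexicon = {"excellent": 2, "good": 1, "slow": -1, "terrible": -2}
positive = lexicon_sentiment("the food was excellent".split(), lexicon)
negated = lexicon_sentiment("the service was not good".split(), lexicon)
```

The brittleness is visible immediately: this scorer handles "not good" but misses "not very good", sarcasm, and any sentiment expressed without lexicon words, which is exactly what pushed the field toward learned, contextual models.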
Question answering (QA) systems take a natural-language question and return an answer, either by extracting a span from a given passage (extractive QA) or by generating a free-form response (abstractive QA). The SQuAD benchmark (Stanford Question Answering Dataset), introduced in 2016, catalyzed rapid progress in extractive QA: models read a passage and predict the start and end positions of the answer span. Within two years, machine performance on SQuAD matched the reported human baseline.
Open-domain QA requires the additional step of retrieving relevant documents from a large corpus before extracting or generating an answer. Dense passage retrieval (DPR), which uses learned embeddings to retrieve passages, has largely replaced sparse methods like TF-IDF and BM25 for this task. Retrieval-augmented generation (RAG) combines a retriever with a generative language model, allowing the system to ground its answers in retrieved evidence and reduce the hallucination problem that plagues purely generative models. Multi-hop reasoning — answering questions that require synthesizing information from multiple passages — remains an active and challenging research area.
Large Language Models and Modern Frontiers
The era of large language models (LLMs) has fundamentally reshaped NLP and, arguably, the entire landscape of AI. Models with tens or hundreds of billions of parameters, trained on trillions of tokens of internet text, exhibit capabilities that were not explicitly trained for: few-shot learning from a handful of prompt examples, chain-of-thought reasoning where the model articulates intermediate steps to solve multi-step problems, and instruction following that allows the model to carry out diverse tasks described in natural language.
Prompting techniques have become a new form of programming. Zero-shot prompting provides only a task description; few-shot prompting includes a small number of input-output examples. Chain-of-thought (CoT) prompting, introduced by Jason Wei and colleagues in 2022, dramatically improves performance on reasoning tasks by eliciting step-by-step explanations. Self-consistency aggregates multiple chain-of-thought samples and takes the majority answer, improving robustness.
Fine-tuning remains important for adapting LLMs to specific tasks and aligning them with human preferences. Instruction tuning trains models on collections of tasks phrased as instructions, improving their ability to follow novel instructions at test time. Reinforcement learning from human feedback (RLHF), used in training ChatGPT and similar systems, trains a reward model on human preference judgments and then fine-tunes the language model to maximize that reward. Parameter-efficient fine-tuning methods like LoRA (Low-Rank Adaptation) modify only a small number of parameters, making adaptation feasible even for very large models.
The frontiers of NLP research stretch across many dimensions. Multimodal language models integrate text with images, audio, and video, moving toward systems that can see, hear, and read simultaneously. Long-context learning extends the effective context window of transformers to handle entire books or codebases. Grounding connects language to perception and action, a prerequisite for embodied AI. Bias and fairness research documents and mitigates the ways in which language models absorb and amplify societal biases from their training data. And the questions of alignment and safety — how to ensure that increasingly capable language systems act in accordance with human values — have moved from academic workshops to front-page policy debates, marking NLP as not just a technical field but one with profound social and ethical implications.