Language Model Fairness

Large language models inherit the social patterns of their training data, and they increasingly mediate consequential decisions in hiring, education, health care, and law. Language model fairness is the research area concerned with measuring those inherited patterns, locating them in model internals, and intervening to reduce harms, whether through data curation, architectural changes, fine-tuning objectives, or post-hoc decoding constraints. The field is methodologically distinct from generic NLP evaluation: it borrows from sociolinguistics, audit studies in the social sciences, and algorithmic-fairness theory, alongside the standard tooling of probing and benchmarking.

Bias in language models takes at least three forms. Representational bias appears in the geometry of the embedding space: words associated with social groups cluster with stereotype-aligned attributes. Allocational bias appears when downstream predictions, such as résumé screening or recidivism estimation, distribute outcomes unevenly across groups. Generative bias appears when text produced by the model differs in tone, register, or content depending on the demographic context of the prompt. Each form requires its own measurement: probing tasks for representational bias, controlled disparate-impact studies for allocational bias, and controlled-generation evaluations for generative bias.
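As one illustration of the allocational measurements mentioned above, a disparate-impact study often reduces to comparing selection rates across groups. The sketch below is a minimal example under stated assumptions, not a method taken from any particular study: the function names `selection_rates` and `disparate_impact_ratio` are hypothetical, and `toy_records` is invented data used only to make the block runnable.

```python
from collections import defaultdict


def selection_rates(records):
    """Return {group: fraction selected} for (group, selected) pairs."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, selected in records:
        totals[group] += 1
        positives[group] += int(bool(selected))
    return {group: positives[group] / totals[group] for group in totals}


def disparate_impact_ratio(records):
    """Lowest group selection rate divided by the highest.

    A value of 1.0 means every group is selected at the same rate;
    smaller values mean a larger gap between groups.
    """
    rates = selection_rates(records)
    highest = max(rates.values())
    if highest == 0:
        return float("nan")  # no group had any positive outcome
    return min(rates.values()) / highest


# Invented toy data: group "A" is selected 4 times out of 5, group "B" 2 out of 5.
toy_records = (
    [("A", True)] * 4 + [("A", False)] * 1
    + [("B", True)] * 2 + [("B", False)] * 3
)

if __name__ == "__main__":
    print(selection_rates(toy_records))
    print(disparate_impact_ratio(toy_records))
```

A single ratio is a coarse summary. A controlled study would also hold the non-demographic content of each input fixed, so that a difference in rates can be attributed to the group signal rather than to other differences between inputs.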

A particularly subtle finding has been the persistence of covert stereotyping even after overt-bias mitigation. Hofmann et al. (2024) showed that contemporary language models, including those whose explicit racial associations have been substantially reduced through alignment training, continue to produce systematically more negative judgments about speakers of African American English than about speakers of standardised American English, in tasks ranging from employment decisions to criminal sentencing. The paper introduces the methodology of matched guise probing, which isolates linguistic features as the trigger for differential treatment, and demonstrates that the covert bias strengthens with model scale even as overt bias diminishes. This work has reframed alignment as insufficient to remove sociolinguistic prejudice, and reopened the question of how training data and fine-tuning interact to produce these patterns.
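The general shape of a matched guise comparison can be sketched in a few lines. This is an illustrative outline under stated assumptions, not the procedure or code of Hofmann et al.: the prompt wording in `PROMPT_TEMPLATE`, the function `matched_guise_gap`, and the caller-supplied scorer `log_prob(prompt, continuation)` are all hypothetical, and `toy_log_prob` is a placeholder that stands in for a real model only so the block runs.

```python
from statistics import mean
from typing import Callable, Iterable, Tuple

# Illustrative prompt wording (an assumption, not the wording used in the paper).
PROMPT_TEMPLATE = 'A person who says "{text}" is'


def matched_guise_gap(
    pairs: Iterable[Tuple[str, str]],
    attribute: str,
    log_prob: Callable[[str, str], float],
) -> float:
    """Mean log-probability gap for `attribute` between two guises.

    Each pair holds two renderings of the same content (guise A, guise B),
    so only the linguistic variety differs within a pair. A positive result
    means the scorer ties the attribute more strongly to guise A.
    Raises statistics.StatisticsError if `pairs` is empty.
    """
    gaps = []
    for text_a, text_b in pairs:
        score_a = log_prob(PROMPT_TEMPLATE.format(text=text_a), attribute)
        score_b = log_prob(PROMPT_TEMPLATE.format(text=text_b), attribute)
        gaps.append(score_a - score_b)
    return mean(gaps)


def toy_log_prob(prompt: str, continuation: str) -> float:
    """Placeholder scorer, NOT a model: it keys on a marker in the text."""
    return -1.0 if "[guise-a]" in prompt else -2.0


# Placeholder text pairs; a real study would use matched texts in each variety.
toy_pairs = [
    ("[guise-a] first text", "[guise-b] first text"),
    ("[guise-a] second text", "[guise-b] second text"),
]

if __name__ == "__main__":
    print(matched_guise_gap(toy_pairs, " lazy", toy_log_prob))
```

Passing the scorer in as a callable keeps the comparison logic independent of any particular model API. In practice `log_prob` would wrap whatever interface exposes token probabilities for the model under audit.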

Open methodological questions dominate the field: how to audit closed models without access to their internals, how to design benchmarks that resist over-fitting and do not themselves reinforce stereotypes, how to disentangle bias from genuine demographic differences in language use, and how to combine sociotechnical evaluation with engineering interventions. The work increasingly intersects with policy: bias audits are mandated in some jurisdictions for high-stakes deployment, and fairness research now informs both model release decisions and regulatory frameworks.
