Data & Information
The storage, retrieval, and analysis of data — databases, information retrieval, and data science.
The transformation of raw data into structured knowledge is one of the defining achievements of computing. From the earliest tabulating machines that Herman Hollerith designed for the 1890 United States Census to the planet-scale search engines and analytics platforms of the twenty-first century, the problem has always been the same: how do we store vast quantities of information, find what we need within it, and extract meaning from what we find? The Data and Information branch of computer science addresses this question at every level of abstraction, from the mathematical foundations of relational algebra to the statistical machinery of modern data science.
The story begins with databases, the systems that give data its structure and permanence. Before Edgar F. Codd published his landmark 1970 paper introducing the relational model, data management was dominated by hierarchical and network systems whose application programs were tied to the physical layout of records on disk. Codd’s insight was to separate the logical description of data from its physical storage, freeing programmers to think in terms of relations, tuples, and queries rather than pointers and byte offsets. This idea gave rise to SQL, helped shape the ACID transaction model, and launched an industry of database management systems that underpin a large share of modern applications. The study of databases encompasses not only the relational tradition but also the diverse landscape of NoSQL systems, distributed architectures, and the theoretical machinery of query optimization and concurrency control that make it all work at scale.
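The separation of logical description from physical storage can be illustrated with a small sketch using Python's built-in `sqlite3` module. The `employee` table, its rows, and the helper `names_in` are hypothetical, chosen only for illustration. The point is that the query text stays the same when the physical design changes, here by adding an index.

```python
import sqlite3

# Hypothetical schema and data, used only to illustrate data independence.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)")
conn.executemany(
    "INSERT INTO employee (name, dept) VALUES (?, ?)",
    [("Ada", "research"), ("Grace", "systems"), ("Edgar", "research")],
)

# A declarative query: it says what is wanted, not how to locate it on disk.
QUERY = "SELECT name FROM employee WHERE dept = ? ORDER BY name"


def names_in(dept):
    """Return the names of employees in the given department."""
    return [row[0] for row in conn.execute(QUERY, (dept,))]


before = names_in("research")

# Change the physical design by adding an index. The query text is untouched;
# the database engine decides whether and how to use the new access path.
conn.execute("CREATE INDEX idx_employee_dept ON employee (dept)")
after = names_in("research")

print(before, after)
```

Both calls return the same rows. Only the access path available to the engine has changed, which is the kind of freedom the relational model was meant to provide.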
Once data is stored, the next challenge is finding the right piece of it at the right time. Information retrieval is the science of search, originating in the library classification systems of the nineteenth century and maturing through the pioneering work of Gerard Salton and his SMART system at Cornell in the 1960s. The field gave the world the inverted index, the TF-IDF weighting scheme, and ultimately the web search engines that became the primary interface between humanity and its collective knowledge. Today information retrieval spans everything from classical Boolean and vector space models to neural retrieval systems built on transformer architectures, along with the evaluation methodologies, ranking algorithms, and recommender systems that power the modern information economy.
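As a rough illustration of two of the ideas named above, the following sketch builds an inverted index over a tiny corpus and ranks documents with a simple TF-IDF score. The corpus, the whitespace tokenizer, and the function names are assumptions made for the example. The weighting shown (raw term frequency times the natural log of N over document frequency) is one common variant among several, not a canonical definition.

```python
import math
from collections import Counter, defaultdict

# Hypothetical three-document corpus, for illustration only.
docs = {
    1: "relational databases store data in tables",
    2: "search engines rank documents with an inverted index",
    3: "an inverted index maps terms to documents",
}


def tokenize(text):
    """Lowercase and split on whitespace; real systems do far more."""
    return text.lower().split()


# Inverted index: term -> {doc_id: term frequency in that document}.
index = defaultdict(dict)
for doc_id, text in docs.items():
    for term, tf in Counter(tokenize(text)).items():
        index[term][doc_id] = tf


def idf(term):
    """Inverse document frequency: rarer terms receive larger weights."""
    df = len(index.get(term, {}))
    return math.log(len(docs) / df) if df else 0.0


def search(query):
    """Score each matching document by summing tf * idf over query terms."""
    scores = defaultdict(float)
    for term in tokenize(query):
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf * idf(term)
    # Highest score first; ties are broken by document id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


results = search("relational index")
print(results)
```

For the query `"relational index"`, document 1 ranks first because "relational" occurs in only one document and so carries more weight than "index", which occurs in two.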
The third pillar of this branch is data science, the discipline concerned with extracting actionable knowledge from data through a combination of statistical analysis, algorithmic pattern discovery, and visual communication. Its intellectual roots reach back to John Tukey’s advocacy for exploratory data analysis in the 1960s and 1970s, and to the knowledge discovery in databases movement of the 1990s led by researchers such as Usama Fayyad and Rakesh Agrawal. Data science brings together clustering, classification, association rule mining, dimensionality reduction, anomaly detection, and time series analysis into a coherent pipeline that begins with messy, real-world data and ends with interpretable insight. The explosive growth of available data in the twenty-first century has made these techniques important across a wide range of domains, from healthcare and finance to social science and engineering.
These three sub-topics form a natural progression. Databases provide the infrastructure for organizing and persisting data. Information retrieval builds on that infrastructure to solve the problem of finding relevant information within large collections. Data science completes the arc by providing the analytical tools to discover patterns, make predictions, and communicate findings. Together they represent the full lifecycle of data, from storage through search to understanding, and they connect outward to nearly every other area of computer science, from algorithms and data structures that provide their theoretical backbone to artificial intelligence and machine learning that increasingly drive their most powerful techniques.
Explore
- 01 Databases
  The theory and systems for structured data storage and retrieval: relational models, query languages, transactions, and modern database architectures.
- 02 Information Retrieval
  The science of searching and organizing large collections of information: indexing, ranking, text processing, and web search.
- 03 Data Science
  The extraction of knowledge from data: statistical analysis, data mining, visualization, and the data science pipeline.