Data Structures for Storage and Retrieval
Explore how databases store and retrieve data under the hood. Learn about append-only logs, indexes, and the fundamental trade-offs between read and write performance that shape every storage engine.
The Two Things Every Database Must Do
At its core, every database has just two jobs. When you give it data, it needs to store that data somewhere safe. When you ask for that data later, it needs to find it and give it back to you. That's it. Everything else (query languages, replication, transactions) builds on top of these two fundamental operations.
In earlier chapters, we explored data models and query languages: how you, as a developer, format your data and ask for it back. Now we're going to flip perspectives and look at things from the database's point of view. How does it actually store your data? And more importantly, how does it find it again quickly?
You might wonder why this matters. After all, you're probably not going to build your own database engine. But understanding what's happening under the hood helps you choose the right database for your workload and tune it for better performance. The difference between a well-chosen storage engine and a poorly chosen one can be the difference between queries that take milliseconds and queries that take minutes.
The World's Simplest Database
Let's start with something almost absurdly simple: a database implemented in just a few lines of Bash. This will help us understand the fundamental trade-offs that all databases face.
#!/bin/bash
db_set () {
echo "$1,$2" >> database
}
db_get () {
grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
}

That's a complete key-value database. The db_set function takes a key and a value and appends them to a file called database. The db_get function searches for the key and returns the most recent value associated with it.
Let's see it in action:
$ db_set 123456 '{"name":"London","attractions":["Big Ben","London Eye"]}'
$ db_set 42 '{"name":"San Francisco","attractions":["Golden Gate Bridge"]}'
$ db_get 42
{"name":"San Francisco","attractions":["Golden Gate Bridge"]}It works! You can store JSON documents (or any text) and retrieve them by key. But how does it actually work?
The Power of Append-Only Logs
Look at what happens when we update a key:
$ db_set 42 '{"name":"San Francisco","attractions":["Exploratorium"]}'
$ db_get 42
{"name":"San Francisco","attractions":["Exploratorium"]}
$ cat database
123456,{"name":"London","attractions":["Big Ben","London Eye"]}
42,{"name":"San Francisco","attractions":["Golden Gate Bridge"]}
42,{"name":"San Francisco","attractions":["Exploratorium"]}

Notice something interesting: the old value for key 42 is still in the file. We didn't overwrite it; we just appended a new entry. When we read, we scan for all occurrences of the key and take the last one (that's what tail -n 1 does).
This append-only approach is called a log. In this context, "log" doesn't mean application logs or error messages. It means an append-only sequence of records, possibly binary, possibly only readable by programs.
The db_set function has fantastic write performance. Appending to the end of a file is one of the fastest things a storage device can do: a spinning disk's head never has to seek to a new location, and even SSDs handle sequential writes faster than random ones. Many real databases use this same principle internally.
The Read Problem
But there's a catch. While db_set is blazing fast, db_get is painfully slow. Every time you look up a key, the function has to scan the entire file from beginning to end, checking every line.
If your database has 10 records, this is fine. If it has 10 million records, every lookup has to examine all 10 million lines. Double the data, double the lookup time. In computer science terms, the lookup is O(n): it scales linearly with the amount of data.
# Simulating our simple database in Python
class SimpleDatabase:
def __init__(self, filename):
self.filename = filename
def db_set(self, key, value):
"""Append a key-value pair to the file. O(1) - constant time!"""
with open(self.filename, 'a') as f:
f.write(f"{key},{value}\n")
def db_get(self, key):
"""Scan the entire file to find the latest value. O(n) - linear time!"""
result = None
with open(self.filename, 'r') as f:
for line in f:
k, v = line.strip().split(',', 1)
if k == key:
result = v # Keep the latest match
return result
# Performance demonstration
db = SimpleDatabase("test.db")
# Writing 1000 records is fast
import time
start = time.time()
for i in range(1000):
db.db_set(f"key{i}", f"value{i}")
print(f"1000 writes: {time.time() - start:.3f} seconds")
# But reading requires scanning everything
start = time.time()
for i in range(100):
db.db_get("key999") # Always have to scan to the end
print(f"100 reads: {time.time() - start:.3f} seconds")The Solution: Indexes
To find data quickly, we need a shortcut, a way to jump directly to the data we want instead of scanning everything. This is what an index provides.
Think of a book's index at the back. Instead of reading every page to find mentions of "database," you look up "database" in the index, and it tells you exactly which pages to turn to. A database index works the same way: it's additional metadata that acts as a signpost, pointing you directly to the data you need.
Let's add a simple index to our database:
class IndexedDatabase:
    def __init__(self, filename):
        self.filename = filename
        self.index = {}  # key -> byte offset in file
        self._rebuild_index()
    def _rebuild_index(self):
        """Build index by scanning the file once at startup."""
        try:
            # Binary mode, so offsets are unambiguous byte positions
            with open(self.filename, 'rb') as f:
                offset = 0
                for line in f:
                    key = line.split(b',', 1)[0].decode('utf-8')
                    self.index[key] = offset
                    offset += len(line)
        except FileNotFoundError:
            pass
    def db_set(self, key, value):
        """Append to file AND update the index."""
        with open(self.filename, 'ab') as f:
            offset = f.tell()  # End of file = where this record starts
            f.write(f"{key},{value}\n".encode('utf-8'))
        self.index[key] = offset  # Update index
    def db_get(self, key):
        """Use index to jump directly to the data. O(1)!"""
        if key not in self.index:
            return None
        with open(self.filename, 'rb') as f:
            f.seek(self.index[key])  # Jump to the right position
            line = f.readline().decode('utf-8')
        return line.rstrip('\n').split(',', 1)[1]
# Now both operations are fast!
db = IndexedDatabase("indexed.db")
# Writing is still fast (plus a tiny index update)
db.db_set("user:1", '{"name": "Alice", "age": 30}')
db.db_set("user:2", '{"name": "Bob", "age": 25}')
db.db_set("user:3", '{"name": "Carol", "age": 35}')
# Reading is now O(1) - constant time!
print(db.db_get("user:2"))  # Jumps directly to the data

The Index Trade-off
Indexes seem magical: they make reads fast! So why doesn't every database index everything by default?
Because indexes have a cost. Every time you write data, you also have to update the index. For our simple append-only database, writing was just appending to a file, one of the fastest operations available. With an index, every write now involves:
- Appending the data to the file
- Updating the index with the new location
If you have multiple indexes (to support different types of queries), every write has to update all of them. This overhead adds up.
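To see why, here's a minimal in-memory sketch. The class and field names are hypothetical, not any real database's API; the point is that a single logical insert now performs three bookkeeping steps instead of one:

# Hypothetical sketch: every write must also maintain each index
class UsersTable:
    def __init__(self):
        self.rows = []     # the primary data: a list of user dicts
        self.by_name = {}  # index 1: name -> row number
        self.by_age = {}   # index 2: age -> list of row numbers
    def insert(self, user):
        row_id = len(self.rows)
        self.rows.append(user)                                   # 1. write the data
        self.by_name[user["name"]] = row_id                      # 2. update the name index
        self.by_age.setdefault(user["age"], []).append(row_id)   # 3. update the age index
        return row_id

table = UsersTable()
table.insert({"name": "Alice", "age": 30})
table.insert({"name": "Bob", "age": 30})
print(table.rows[table.by_name["Bob"]])           # direct lookup by name
print([table.rows[i] for i in table.by_age[30]])  # direct lookup by age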
This is why databases don't automatically index everything. They require you, the developer or DBA, to choose which indexes to create. You know your application's query patterns: which fields you search by, which queries need to be fast, and how often you write versus read.
Real Databases: Two Approaches
Real databases face the same fundamental challenge: balancing write performance against read performance. Over decades of research and engineering, two main families of storage engines have emerged:
Log-structured storage engines embrace the append-only approach. They optimize for writes by always appending, then periodically reorganizing the data in the background. The canonical example is the LSM-tree, used in LevelDB, RocksDB, and Cassandra.
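That background reorganization is commonly called compaction: the engine rewrites a log segment, keeping only the most recent value for each key. Here's a toy sketch of the idea; real engines compact binary segment files, not lists of strings:

def compact(segment_lines):
    """Rewrite a log segment, keeping only the latest value per key."""
    latest = {}
    for line in segment_lines:
        key, value = line.split(',', 1)
        latest[key] = value  # later entries win over earlier ones
    return [f"{key},{value}" for key, value in latest.items()]

segment = [
    '42,{"attractions":["Golden Gate Bridge"]}',
    '123456,{"name":"London"}',
    '42,{"attractions":["Exploratorium"]}',
]
print(compact(segment))
# ['42,{"attractions":["Exploratorium"]}', '123456,{"name":"London"}']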
Page-oriented storage engines like B-trees take a different approach. They divide the database into fixed-size pages and update data in place. This makes reads very efficient but requires more care with writes. Most traditional relational databases use B-trees.
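To contrast with the append-only style, here's a minimal sketch of the page-oriented principle: treat the file as an array of fixed-size pages, and update data by seeking to the right page and overwriting it in place. This illustrates only the principle; a real B-tree adds a tree of keys on top to locate the right page, and the file name and page size here are just for illustration:

PAGE_SIZE = 4096  # real engines typically use pages of around 4 KB

def write_page(f, page_number, data):
    """Overwrite one fixed-size page in place."""
    assert len(data) <= PAGE_SIZE
    f.seek(page_number * PAGE_SIZE)
    f.write(data.ljust(PAGE_SIZE, b'\x00'))  # pad to a full page

def read_page(f, page_number):
    f.seek(page_number * PAGE_SIZE)
    return f.read(PAGE_SIZE)

with open("paged.db", "w+b") as f:
    write_page(f, 0, b"page 0: some records")
    write_page(f, 3, b"page 3: other records")
    write_page(f, 0, b"page 0: updated in place")  # overwrite, don't append
    print(read_page(f, 0).rstrip(b'\x00'))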
We'll explore both approaches in detail in upcoming sections. For now, the key insight is that there's no single "best" approach; it depends on your workload. Analytics workloads (lots of reads, few writes) have different needs than transactional workloads (many small reads and writes).
The Meaning of "Log"
Before we continue, let's clarify terminology. When developers hear "log," they often think of application logs, those text files full of error messages and debugging information.
In database internals, "log" means something broader: any append-only sequence of records. It might be human-readable text, or it might be compact binary data that only programs can interpret. The key property is that new data is always added at the end; existing data is never modified in place.
This append-only property is powerful. Sequential writes are fast on both spinning hard drives and SSDs. Append-only structures are easier to make crash-safe: if you crash mid-write, you just have a partial record at the end that you can discard. And they're easier to replicate: you just ship the new records to replicas.
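Here's a sketch of that crash-safety argument, assuming the newline-terminated record format from our earlier examples: on startup, scan the file and truncate anything after the last complete record.

import os

def recover(filename):
    """Drop a torn record left by a crash mid-write.
    Assumes every complete record ends with a newline."""
    with open(filename, 'rb') as f:
        data = f.read()
    if data and not data.endswith(b'\n'):
        # Everything after the last newline is an incomplete record
        os.truncate(filename, data.rfind(b'\n') + 1)

# Simulate a crash that interrupted the second write
with open("crash.db", 'wb') as f:
    f.write(b'42,complete record\n123456,partial rec')  # no trailing newline

recover("crash.db")
with open("crash.db", 'rb') as f:
    print(f.read())  # b'42,complete record\n'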
Key Takeaways
Let's summarize the fundamental concepts we've covered:
The Two Operations: Every database must store data (writes) and retrieve data (reads). Everything else builds on these primitives.
The Simple Log: Appending to a file is the simplest and fastest write operation. But it makes reads expensive: you have to scan everything.
Indexes Solve Reads: An index is additional metadata that points to where data lives. It turns O(n) scans into O(1) lookups.
The Fundamental Trade-off: Indexes speed up reads but slow down writes. Every write must update all indexes. Choose indexes based on your query patterns.
Two Families: Log-structured engines (append-only, compact later) and page-oriented engines (update in place) represent different trade-offs.
Understanding these trade-offs will help you make better decisions about which database to use, how to configure it, and which indexes to create. In the next sections, we'll dive deeper into the specific data structures (hash indexes, SSTables, LSM-trees, and B-trees) that power real-world databases.