B-Trees
Deep dive into B-Trees - the most successful data structure in database history, powering indexes in PostgreSQL, MySQL, Oracle, and virtually every relational database for over 50 years.
The Most Successful Data Structure in Database History
While LSM-Trees are gaining popularity, they're the newcomers. The real veteran of database indexing is the B-Tree, introduced in 1970 and described as "ubiquitous" by 1979. More than 50 years later, B-Trees remain the default index implementation in almost every relational database: PostgreSQL, MySQL, Oracle, SQL Server, and many NoSQL databases too.
Like SSTables, B-Trees keep data sorted by key, enabling both fast lookups and efficient range queries. But the underlying philosophy is completely different. Where LSM-Trees write sequentially to variable-size segments, B-Trees read and write fixed-size pages (typically 4 KB), updating them in place. This design maps directly to how disk hardware works: both hard drives and SSDs operate on fixed-size blocks.
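To make "page" concrete, here is a minimal sketch (not taken from any real database engine) of how a storage engine might read and write fixed-size pages at byte offsets within a single file. The read_page and write_page helpers are hypothetical, built on the Unix-only os.pread and os.pwrite calls:
import os

PAGE_SIZE = 4096  # 4 KB: the unit of disk I/O

def read_page(fd: int, page_id: int) -> bytes:
    """Read one fixed-size page at offset page_id * PAGE_SIZE."""
    return os.pread(fd, PAGE_SIZE, page_id * PAGE_SIZE)

def write_page(fd: int, page_id: int, data: bytes) -> None:
    """Overwrite one page in place; the payload is padded to exactly PAGE_SIZE bytes."""
    assert len(data) <= PAGE_SIZE
    os.pwrite(fd, data.ljust(PAGE_SIZE, b"\x00"), page_id * PAGE_SIZE)
A page reference stored inside another page is then just a page_id, i.e. an offset into this file.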
Understanding Pages and Pointers
Think of a B-Tree as a tree structure, but instead of nodes in memory, we have pages on disk. Each page is identified by an address (like a disk location), and pages can contain references (pointers) to other pages. It's like pointers in a programming language, but pointing to disk locations instead of memory addresses.
The tree has one special page called the root. Every lookup starts at the root and follows page references down through the tree until it reaches the data.
How B-Tree Lookups Work
Let's walk through finding a specific key. Say we're looking for key 251 in the tree.
- Start at the root page: The root contains keys that act as boundaries (like signposts). We see boundaries at 100, 200, 300. Since 251 is between 200 and 300, we follow the pointer for that range.
- Navigate internal pages: The next page breaks down the 200-300 range further. We see boundaries at 220, 250, 280. Since 251 is between 250 and 280, we follow that pointer.
- Reach a leaf page: Finally we arrive at a leaf page containing individual keys and their values. We scan this small page to find key 251.
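As a minimal sketch of that descent, using the boundary keys from the walkthrough above, each step is just a sorted search over a page's keys to pick which child pointer to follow (child_index is a hypothetical helper, not part of the implementations later in this section):
import bisect

root_keys = [100, 200, 300]       # boundary keys in the root page
internal_keys = [220, 250, 280]   # boundary keys in the next page down

def child_index(boundary_keys, search_key):
    """Return the index of the child pointer to follow for search_key."""
    return bisect.bisect_right(boundary_keys, search_key)

print(child_index(root_keys, 251))      # 2 -> the pointer for the 200-300 range
print(child_index(internal_keys, 251))  # 2 -> the pointer for the 250-280 range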
The number of child pointers a page can hold is called the branching factor. In the walkthrough above it's small for readability, but in real databases a branching factor of 500 or more is common. Why does this matter? Because higher branching factors mean shallower trees.
The Magic of Logarithmic Depth
Here's why B-Trees are so efficient: a B-Tree with n keys has a depth of only O(log n). With a branching factor of 500, let's calculate:
- Level 1 (root): 1 page, up to 500 children
- Level 2: 500 pages, up to 250,000 children
- Level 3: 250,000 pages, up to 125 million children
- Level 4: 125 million pages
A four-level B-Tree with 4 KB pages and a branching factor of 500 can store up to 256 terabytes of data. And to find any key, you only need to read one page per level: just four disk reads!
import math

def btree_depth(num_keys: int, branching_factor: int) -> int:
    """Calculate the minimum depth needed to store num_keys."""
    if num_keys <= 0:
        return 0
    # At depth d, the tree can reference up to branching_factor^d keys
    return math.ceil(math.log(num_keys) / math.log(branching_factor))
# Examples
print(f"1 million keys, branching 500: {btree_depth(1_000_000, 500)} levels")
print(f"1 billion keys, branching 500: {btree_depth(1_000_000_000, 500)} levels")
print(f"1 trillion keys, branching 500: {btree_depth(1_000_000_000_000, 500)} levels")
# Output:
# 1 million keys, branching 500: 3 levels
# 1 billion keys, branching 500: 4 levels
# 1 trillion keys, branching 500: 5 levels
Updates and Insertions
Updating an existing key is straightforward:
- Search for the leaf page containing the key
- Modify the value in that page
- Write the page back to disk
The page stays in the same location, so all pointers to it remain valid.
Inserting a new key requires finding the right leaf page and adding the key there. But what if the page is full?
When a page overflows, we split it into two half-full pages and update the parent page to point to both. If the parent also overflows, it splits too, and this can cascade all the way up to the root. If the root splits, we create a new root; this is the only way a B-Tree grows taller.
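To make the split concrete, here is a tiny standalone sketch with made-up keys, separate from the fuller implementation below: the full page's sorted keys are divided in half, and the middle key becomes the new separator in the parent.
def split_leaf(keys):
    """Split a full leaf's sorted keys; return (left_half, right_half, separator)."""
    mid = len(keys) // 2
    return keys[:mid], keys[mid:], keys[mid]

left, right, separator = split_leaf([10, 20, 30, 40])
print(left, right, separator)  # [10, 20] [30, 40] 30 -> parent gains separator 30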
B-Tree Implementation
Let's build a simple B-Tree in Python to understand the mechanics:
from typing import Optional, List, Tuple, Any
class BTreeNode:
"""A node (page) in the B-Tree."""
def __init__(self, leaf: bool = False):
self.leaf = leaf
self.keys: List[Any] = []
self.values: List[Any] = [] # Only used in leaf nodes
self.children: List['BTreeNode'] = [] # Only used in internal nodes
def __repr__(self):
return f"BTreeNode(keys={self.keys}, leaf={self.leaf})"
class BTree:
"""
A B-Tree implementation demonstrating the core concepts.
In a real database, nodes would be pages on disk, and we'd use
page IDs instead of object references.
"""
def __init__(self, order: int = 4):
"""
order: maximum number of children per node (branching factor).
A node can have at most (order - 1) keys.
"""
self.order = order
self.root = BTreeNode(leaf=True)
def search(self, key) -> Optional[Any]:
"""Search for a key and return its value, or None if not found."""
return self._search_node(self.root, key)
def _search_node(self, node: BTreeNode, key) -> Optional[Any]:
"""Recursively search for a key starting from node."""
# Find the position where key should be
i = 0
while i < len(node.keys) and key > node.keys[i]:
i += 1
# Check if we found the key
if i < len(node.keys) and key == node.keys[i]:
if node.leaf:
return node.values[i]
else:
# In internal nodes, the key is just a separator
# The actual value is in a leaf - go right
return self._search_node(node.children[i + 1], key)
# Key not found in this node
if node.leaf:
return None # Key doesn't exist
else:
# Follow the appropriate child pointer
return self._search_node(node.children[i], key)
def insert(self, key, value):
"""Insert a key-value pair into the B-Tree."""
root = self.root
# If root is full, split it first
if len(root.keys) == self.order - 1:
new_root = BTreeNode(leaf=False)
new_root.children.append(self.root)
self._split_child(new_root, 0)
self.root = new_root
self._insert_non_full(self.root, key, value)
def _insert_non_full(self, node: BTreeNode, key, value):
"""Insert into a node that is guaranteed to have room."""
i = len(node.keys) - 1
if node.leaf:
# Find position and insert
while i >= 0 and key < node.keys[i]:
i -= 1
i += 1
node.keys.insert(i, key)
node.values.insert(i, value)
else:
# Find the child to descend into
while i >= 0 and key < node.keys[i]:
i -= 1
i += 1
# If child is full, split it first
if len(node.children[i].keys) == self.order - 1:
self._split_child(node, i)
                if key >= node.keys[i]:  # a key equal to the separator belongs in the right child
i += 1
self._insert_non_full(node.children[i], key, value)
    def _split_child(self, parent: BTreeNode, child_index: int):
        """Split a full child node into two nodes."""
        full_child = parent.children[child_index]
        mid = (self.order - 1) // 2
        # Create new node for the right half of the keys
        new_node = BTreeNode(leaf=full_child.leaf)
        promoted_key = full_child.keys[mid]
        if full_child.leaf:
            # Leaf split (B+Tree style): the middle key is *copied* up as a
            # separator, so the key and its value stay in the right leaf
            new_node.keys = full_child.keys[mid:]
            new_node.values = full_child.values[mid:]
            full_child.keys = full_child.keys[:mid]
            full_child.values = full_child.values[:mid]
        else:
            # Internal split: the middle key *moves* up into the parent
            new_node.keys = full_child.keys[mid + 1:]
            new_node.children = full_child.children[mid + 1:]
            full_child.keys = full_child.keys[:mid]
            full_child.children = full_child.children[:mid + 1]
        # Insert promoted key and new child pointer into the parent
        parent.keys.insert(child_index, promoted_key)
        parent.children.insert(child_index + 1, new_node)
def range_query(self, start_key, end_key) -> List[Tuple[Any, Any]]:
"""Return all key-value pairs where start_key <= key <= end_key."""
results = []
self._range_query(self.root, start_key, end_key, results)
return results
    def _range_query(self, node: BTreeNode, start_key, end_key, results: List):
        """Recursively collect key-value pairs with start_key <= key <= end_key."""
        if node.leaf:
            for key, value in zip(node.keys, node.values):
                if start_key <= key <= end_key:
                    results.append((key, value))
            return
        # Visit every child whose key range can overlap the query range.
        # Child i holds keys below keys[i]; child i+1 holds keys from keys[i] upward.
        for i, child in enumerate(node.children):
            lower = node.keys[i - 1] if i > 0 else None
            upper = node.keys[i] if i < len(node.keys) else None
            if (upper is None or start_key < upper) and (lower is None or end_key >= lower):
                self._range_query(child, start_key, end_key, results)
def visualize(self):
"""Print a simple visualization of the tree."""
self._visualize_node(self.root, 0)
def _visualize_node(self, node: BTreeNode, level: int):
indent = " " * level
node_type = "LEAF" if node.leaf else "INTERNAL"
print(f"{indent}[{node_type}] keys={node.keys}")
for child in node.children:
self._visualize_node(child, level + 1)
# Usage example
if __name__ == "__main__":
tree = BTree(order=4) # Small order for demonstration
# Insert some data
data = [(10, "ten"), (20, "twenty"), (5, "five"), (15, "fifteen"),
(25, "twenty-five"), (30, "thirty"), (35, "thirty-five")]
for key, value in data:
print(f"Inserting {key}...")
tree.insert(key, value)
tree.visualize()
print()
# Search
print(f"Search for 15: {tree.search(15)}")
print(f"Search for 99: {tree.search(99)}")
# Range query
print(f"Range 10-25: {tree.range_query(10, 25)}")The Danger of In-Place Updates
B-Trees modify pages in place. This is fundamentally different from LSM-Trees' append-only approach, and it introduces challenges.
When you overwrite a page on disk, the hardware has to:
- On HDD: Move the disk head to the right position, wait for the platter to spin to the right spot, then write
- On SSD: Erase a large block (SSDs can't overwrite individual bytes), then rewrite
More critically, some operations require updating multiple pages atomically. When you split a page, you need to:
- Write the original page (now half-full)
- Write the new page (other half)
- Update the parent page with the new pointer
What if the database crashes after step 1 but before step 3? You're left with an orphan page that nothing points to: a corrupted index.
Write-Ahead Logging (WAL) for Crash Safety
The solution is the Write-Ahead Log (WAL), also called a redo log. Before making any changes to B-Tree pages, you first append the intended changes to a log file. Only after the log is safely on disk do you modify the actual pages.
If the database crashes mid-operation, it can replay the WAL on startup to complete any partial operations or undo them, restoring the B-Tree to a consistent state.
import json
import os
from typing import List, Dict, Any
class WAL:
"""Write-Ahead Log for crash recovery."""
def __init__(self, path: str):
self.path = path
self.log_file = open(path, 'a+')
def log_operation(self, operation: Dict[str, Any]):
"""Log an operation before performing it."""
entry = json.dumps(operation) + '\n'
self.log_file.write(entry)
self.log_file.flush()
os.fsync(self.log_file.fileno()) # Ensure it's on disk
def mark_complete(self, operation_id: str):
"""Mark an operation as successfully completed."""
self.log_operation({'type': 'COMPLETE', 'id': operation_id})
def get_incomplete_operations(self) -> List[Dict]:
"""Find operations that were logged but not marked complete."""
self.log_file.seek(0)
operations = {}
for line in self.log_file:
entry = json.loads(line.strip())
if entry['type'] == 'COMPLETE':
operations.pop(entry['id'], None)
else:
operations[entry['id']] = entry
return list(operations.values())
def clear(self):
"""Clear the log after successful checkpoint."""
self.log_file.close()
open(self.path, 'w').close()
self.log_file = open(self.path, 'a+')
# Usage example
class DurableBTreePage:
"""A B-Tree page that uses WAL for durability."""
def __init__(self, page_id: int, wal: WAL):
self.page_id = page_id
self.wal = wal
self.data = {}
def update(self, key, value):
import uuid
operation_id = str(uuid.uuid4())
# Step 1: Log the intended change
self.wal.log_operation({
'id': operation_id,
'type': 'UPDATE',
'page_id': self.page_id,
'key': key,
'old_value': self.data.get(key),
'new_value': value
})
# Step 2: Perform the change
self.data[key] = value
# Step 3: Mark complete
        self.wal.mark_complete(operation_id)
Concurrency Control with Latches
Another challenge with in-place updates: concurrent access. What if one thread is splitting a page while another is trying to read from it? The reader might see an inconsistent half-updated state.
B-Trees solve this with latches (lightweight locks). Before reading or modifying a page, a thread must acquire the appropriate latch. This ensures mutual exclusion but adds complexity.
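As a rough sketch of the idea (hypothetical classes, not how any particular engine implements latching), each page can carry its own latch that a thread must hold while reading or modifying that page:
import threading

class LatchedPage:
    """A page protected by a latch (a lightweight, short-lived lock)."""
    def __init__(self, page_id: int):
        self.page_id = page_id
        self.latch = threading.Lock()
        self.keys = []
        self.values = []

    def read(self, key):
        # Hold the latch so a concurrent split can't expose a half-updated page
        with self.latch:
            if key in self.keys:
                return self.values[self.keys.index(key)]
            return None

    def insert(self, key, value):
        with self.latch:
            self.keys.append(key)
            self.values.append(value)
Real engines typically use reader-writer latches so many readers can share a page; a plain mutex keeps the sketch short.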
LSM-Trees have an advantage here: since segments are immutable once written, readers can access them without locks. The only coordination needed is when swapping old segments for new ones, which can be done atomically.
B-Tree Optimizations
B-Trees have been around for 50+ years. Naturally, many optimizations have been developed:
Copy-on-Write (COW)
Instead of modifying pages in place (requiring WAL), write a new copy of any modified page. Then update parent pages to point to the new copy. The old pages remain untouched until all readers are done. LMDB uses this approach.
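Here is a minimal sketch of the copy-on-write idea, using a hypothetical CowPage class. It shows only the leaf copy; a full copy-on-write tree copies every page on the path from the modified leaf up to the root and then publishes the change by atomically swapping the root pointer.
import copy

class CowPage:
    """A page that is never modified once published; changes create a new copy."""
    def __init__(self, keys=None, values=None):
        self.keys = keys or []
        self.values = values or []

def cow_update_leaf(leaf: CowPage, key, value) -> CowPage:
    """Return a modified copy of a leaf; the original page stays untouched."""
    new_leaf = copy.deepcopy(leaf)
    if key in new_leaf.keys:
        new_leaf.values[new_leaf.keys.index(key)] = value
    else:
        new_leaf.keys.append(key)
        new_leaf.values.append(value)
    return new_leaf

# Readers holding the old page keep seeing a consistent old tree
old_leaf = CowPage(keys=[1, 2], values=["a", "b"])
new_leaf = cow_update_leaf(old_leaf, 2, "B")
print(old_leaf.values, new_leaf.values)  # ['a', 'b'] ['a', 'B']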
Key Abbreviation
Interior nodes only need keys to act as separators; they don't need to store the full key. By abbreviating keys, we can fit more of them in each page, increasing the branching factor and reducing tree depth.
def abbreviate_key(key: str, previous_key: str) -> str:
"""
Create shortest prefix that distinguishes key from previous_key.
Used in B-Tree internal nodes to save space.
"""
# Find first differing character
min_len = min(len(key), len(previous_key))
for i in range(min_len):
if key[i] != previous_key[i]:
return key[:i + 1]
return key[:min_len + 1]
# Example
print(abbreviate_key("application", "apple")) # "appl"
print(abbreviate_key("banana", "apple")) # "b"
print(abbreviate_key("application", "application")) # "application"Sequential Leaf Pages
For efficient range scans, it helps if leaf pages are physically close together on disk. Many B-Tree implementations try to allocate nearby leaf pages in adjacent disk locations. However, as the tree grows and pages split, maintaining this layout becomes difficult. LSM-Trees have an advantage here: they rewrite entire segments, naturally keeping sorted data together.
Sibling Pointers
Adding pointers between leaf pages allows efficient range scans without backtracking to parent nodes. To scan from key A to key Z, find the leaf containing A, then follow sibling pointers until you pass Z.
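A small sketch of that scan pattern, assuming hypothetical leaf objects that each carry a pointer to their right sibling (the DiskPage class later in this section keeps a similar next_leaf field):
class Leaf:
    """A leaf page with a pointer to its right sibling."""
    def __init__(self, keys, values, next_leaf=None):
        self.keys = keys
        self.values = values
        self.next_leaf = next_leaf

def scan_range(first_leaf: Leaf, start_key, end_key):
    """Walk sibling pointers from the leaf containing start_key until past end_key."""
    results = []
    leaf = first_leaf
    while leaf is not None:
        for key, value in zip(leaf.keys, leaf.values):
            if key > end_key:
                return results          # passed the end of the range, stop
            if key >= start_key:
                results.append((key, value))
        leaf = leaf.next_leaf           # follow the sibling pointer
    return results

# Three leaves linked left to right
c = Leaf([50, 60], ["e", "f"])
b = Leaf([30, 40], ["c", "d"], next_leaf=c)
a = Leaf([10, 20], ["a", "b"], next_leaf=b)
print(scan_range(a, 15, 45))  # [(20, 'b'), (30, 'c'), (40, 'd')]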
B-Tree vs LSM-Tree Summary
Both structures keep data sorted and support range queries. The choice depends on your workload:
| Aspect | B-Tree | LSM-Tree |
| ------ | ------ | -------- |
| Write Pattern | In-place updates | Append-only |
| Write Performance | Slower (random I/O) | Faster (sequential I/O) |
| Read Performance | Faster (single location) | Slower (check multiple levels) |
| Concurrency | Requires latches | Simpler (immutable segments) |
| Crash Recovery | WAL required | WAL + immutable segments |
| Space Amplification | Lower | Higher (during compaction) |
| Maturity | 50+ years | ~20 years |
Complete Disk-Based B-Tree Example
Here's a more realistic implementation that simulates disk pages:
import struct
from typing import Optional, Dict, Any
import os
class DiskPage:
"""Simulates a fixed-size disk page."""
PAGE_SIZE = 4096 # 4 KB
def __init__(self, page_id: int, is_leaf: bool = True):
self.page_id = page_id
self.is_leaf = is_leaf
self.keys: list = []
self.values: list = [] # For leaf pages
self.children: list = [] # For internal pages (page IDs)
self.next_leaf: Optional[int] = None # Sibling pointer
def is_full(self, max_keys: int) -> bool:
return len(self.keys) >= max_keys
class DiskBTree:
"""
B-Tree with disk-page simulation.
Demonstrates page-based storage and access patterns.
"""
def __init__(self, directory: str, order: int = 100):
self.directory = directory
self.order = order # Max children per page
self.max_keys = order - 1
os.makedirs(directory, exist_ok=True)
self.page_cache: Dict[int, DiskPage] = {}
self.next_page_id = 0
self.root_id: Optional[int] = None
# Statistics
self.disk_reads = 0
self.disk_writes = 0
# Create root
root = self._allocate_page(is_leaf=True)
self.root_id = root.page_id
def _allocate_page(self, is_leaf: bool = True) -> DiskPage:
"""Allocate a new page."""
page = DiskPage(self.next_page_id, is_leaf)
self.next_page_id += 1
self.page_cache[page.page_id] = page
return page
def _read_page(self, page_id: int) -> DiskPage:
"""Read a page from 'disk' (simulated with cache)."""
self.disk_reads += 1
if page_id not in self.page_cache:
raise ValueError(f"Page {page_id} not found")
return self.page_cache[page_id]
def _write_page(self, page: DiskPage):
"""Write a page to 'disk'."""
self.disk_writes += 1
self.page_cache[page.page_id] = page
def search(self, key) -> Optional[Any]:
"""Search for a key, counting disk reads."""
self.disk_reads = 0
result = self._search(self.root_id, key)
print(f"Search required {self.disk_reads} disk reads")
return result
    def _search(self, page_id: int, key) -> Optional[Any]:
        page = self._read_page(page_id)
        # Binary search for the key position
        i = self._binary_search(page.keys, key)
        if page.is_leaf:
            if i < len(page.keys) and page.keys[i] == key:
                return page.values[i]
            return None
        # In internal pages, a key equal to the separator lives in the right child
        if i < len(page.keys) and page.keys[i] == key:
            i += 1
        # Follow the child pointer
        child_id = page.children[i]
        return self._search(child_id, key)
def _binary_search(self, keys: list, target) -> int:
"""Find insertion point for target in sorted keys."""
left, right = 0, len(keys)
while left < right:
mid = (left + right) // 2
if keys[mid] < target:
left = mid + 1
else:
right = mid
return left
def insert(self, key, value):
"""Insert a key-value pair."""
root = self._read_page(self.root_id)
if root.is_full(self.max_keys):
# Root is full, need to split
new_root = self._allocate_page(is_leaf=False)
new_root.children.append(self.root_id)
self._split_child(new_root, 0)
self.root_id = new_root.page_id
self._write_page(new_root)
self._insert_non_full(self.root_id, key, value)
    def _split_child(self, parent: DiskPage, child_index: int):
        """Split a full child page."""
        child = self._read_page(parent.children[child_index])
        mid = self.max_keys // 2
        # Create new page for the right half
        new_page = self._allocate_page(is_leaf=child.is_leaf)
        promoted_key = child.keys[mid]
        if child.is_leaf:
            # Leaf split: the separator is copied up; key and value stay in the right leaf
            new_page.keys = child.keys[mid:]
            new_page.values = child.values[mid:]
            child.keys = child.keys[:mid]
            child.values = child.values[:mid]
            # Maintain the sibling chain for range scans
            new_page.next_leaf = child.next_leaf
            child.next_leaf = new_page.page_id
        else:
            # Internal split: the separator moves up and out of the child
            new_page.keys = child.keys[mid + 1:]
            new_page.children = child.children[mid + 1:]
            child.keys = child.keys[:mid]
            child.children = child.children[:mid + 1]
        # Update parent
        parent.keys.insert(child_index, promoted_key)
        parent.children.insert(child_index + 1, new_page.page_id)
        self._write_page(child)
        self._write_page(new_page)
        self._write_page(parent)
    def _insert_non_full(self, page_id: int, key, value):
        """Insert into a page that has room."""
        page = self._read_page(page_id)
        i = self._binary_search(page.keys, key)
        if page.is_leaf:
            page.keys.insert(i, key)
            page.values.insert(i, value)
            self._write_page(page)
        else:
            # A key equal to a separator belongs in the right child
            if i < len(page.keys) and page.keys[i] == key:
                i += 1
            child_id = page.children[i]
            child = self._read_page(child_id)
            if child.is_full(self.max_keys):
                self._split_child(page, i)
                if key >= page.keys[i]:
                    child_id = page.children[i + 1]
            self._insert_non_full(child_id, key, value)
def stats(self):
"""Print statistics about the tree."""
def count_levels(page_id: int) -> int:
page = self._read_page(page_id)
if page.is_leaf:
return 1
return 1 + count_levels(page.children[0])
def count_pages(page_id: int) -> int:
page = self._read_page(page_id)
if page.is_leaf:
return 1
return 1 + sum(count_pages(c) for c in page.children)
levels = count_levels(self.root_id)
pages = count_pages(self.root_id)
print(f"Tree depth: {levels} levels")
print(f"Total pages: {pages}")
print(f"Branching factor: up to {self.order}")
# Demonstration
if __name__ == "__main__":
tree = DiskBTree("/tmp/btree_demo", order=50)
# Insert a lot of data
print("Inserting 10,000 keys...")
for i in range(10000):
tree.insert(i, f"value_{i}")
tree.stats()
# Search
print("\nSearching for key 5000:")
print(f"Found: {tree.search(5000)}")
print("\nSearching for key 9999:")
print(f"Found: {tree.search(9999)}")Key Takeaways
| Concept | Description |
| ------- | ----------- |
| Page | Fixed-size block (typically 4 KB) that forms the unit of disk I/O |
| Branching Factor | Number of child pointers per page; higher = shallower tree |
| O(log n) Depth | A tree with 500-way branching stores billions of keys in 4 levels |
| In-Place Updates | Pages are overwritten, unlike LSM's append-only approach |
| Page Splitting | When a page overflows, split it and update the parent |
| WAL | Write-ahead log for crash recovery |
| Latches | Lightweight locks for concurrent access |
B-Trees have earned their place as the dominant index structure for good reason. Their predictable performance, efficient reads, and mature implementations make them the safe default choice for most database workloads. Understanding both B-Trees and LSM-Trees gives you the knowledge to choose the right tool for your specific needs.