Transaction Processing or Analytics
A comprehensive guide to understanding the fundamental differences between OLTP and OLAP workloads, data warehousing, ETL pipelines, and schema design patterns like star and snowflake schemas.
Two Very Different Ways to Use a Database
Databases originally existed for one purpose: recording business transactions. A sale was made, an order was placed, an employee got paid. Each of these events was a "transaction" that needed to be recorded. Over time, the term "transaction" expanded beyond money-changing-hands to mean any logical group of reads and writes to a database.
But here's the interesting part: as companies grew, they started using databases for two fundamentally different purposes. On one hand, they needed to process transactions: customers buying products, users updating their profiles, orders being placed. On the other hand, they wanted to analyze their data: understanding sales trends, customer behavior, and business performance.
These two use cases have such different characteristics that they've evolved into separate disciplines with their own names, tools, and optimizations.
OLTP: Online Transaction Processing
When a customer adds an item to their shopping cart, that's a transaction. When they check out, that's another transaction. When the warehouse marks the order as shipped, that's yet another. This pattern of small, fast operations on individual records is called Online Transaction Processing (OLTP).
The key characteristics of OLTP workloads are:
- Small, targeted queries: Look up one customer, one order, one product
- Low latency is critical: Users are waiting for responses
- Random access patterns: Queries jump around based on user actions
- Current state matters: You need to know what's in stock right now
```python
class OLTPDatabase:
    """
    Simulates typical OLTP operations.
    Notice: each operation touches few records and needs to be fast.
    """
    def __init__(self):
        self.users = {}
        self.products = {}
        self.orders = {}

    # OLTP Pattern: Fetch single record by key
    def get_user(self, user_id: int) -> dict:
        """Look up a single user - must be fast!"""
        return self.users.get(user_id)

    # OLTP Pattern: Update single record
    def update_cart(self, user_id: int, product_id: int, quantity: int):
        """Add item to cart - immediate response needed."""
        user = self.users.get(user_id)
        if user:
            if 'cart' not in user:
                user['cart'] = {}
            user['cart'][product_id] = quantity

    # OLTP Pattern: Insert new record
    def place_order(self, user_id: int, items: list) -> int:
        """Create order - user is waiting at checkout."""
        order_id = len(self.orders) + 1
        self.orders[order_id] = {
            'user_id': user_id,
            'items': items,
            'status': 'pending',
            'timestamp': '2024-01-15 10:30:00'
        }
        return order_id

    # OLTP Pattern: Check current state
    def check_inventory(self, product_id: int) -> int:
        """How many in stock RIGHT NOW? User needs to know."""
        product = self.products.get(product_id, {})
        return product.get('stock', 0)


# Example OLTP operations
db = OLTPDatabase()
db.users[1] = {'name': 'Alice', 'email': 'alice@example.com'}
db.products[100] = {'name': 'Laptop', 'stock': 50, 'price': 999}

# These are all fast, targeted operations
user = db.get_user(1)                       # Single lookup
db.update_cart(1, 100, 2)                   # Single update
stock = db.check_inventory(100)             # Single check
order_id = db.place_order(1, [(100, 2)])    # Single insert
```
OLAP: Online Analytical Processing
Now imagine you're the CEO, and you want to know: "What was our total revenue last quarter, broken down by region?" This question requires scanning millions of sales records, aggregating them, and computing statistics. This is Online Analytical Processing (OLAP).
OLAP queries have completely different characteristics:
- Massive scans: Read millions or billions of rows
- Aggregate operations: SUM, COUNT, AVG, GROUP BY
- Historical data: Analyze trends over months or years
- Complex queries: Join multiple tables, apply filters
```python
class OLAPQueries:
    """
    Simulates typical OLAP operations.
    Notice: each query scans huge amounts of data and computes aggregates.
    """
    def __init__(self, sales_data: list):
        self.sales = sales_data  # Millions of rows in a real system

    def total_revenue_by_store(self, month: str) -> dict:
        """
        OLAP Pattern: Aggregate over a large dataset.
        In a real system, this scans terabytes of data.
        """
        revenue = {}
        for sale in self.sales:
            if sale['date'].startswith(month):
                store = sale['store_id']
                revenue[store] = revenue.get(store, 0) + sale['amount']
        return revenue

    def top_products_by_quantity(self, n: int = 10) -> list:
        """
        OLAP Pattern: Aggregate, sort, limit.
        """
        product_sales = {}
        for sale in self.sales:
            pid = sale['product_id']
            product_sales[pid] = product_sales.get(pid, 0) + sale['quantity']
        sorted_products = sorted(
            product_sales.items(),
            key=lambda x: x[1],
            reverse=True
        )
        return sorted_products[:n]

    def promotional_effectiveness(self, promo_dates: list) -> dict:
        """
        OLAP Pattern: Complex analysis across time periods.
        "How many more bananas did we sell during promotions?"
        """
        promo_sales = 0
        normal_sales = 0
        promo_days = set()    # distinct dates, not individual sale rows
        normal_days = set()
        for sale in self.sales:
            if sale['date'] in promo_dates:
                promo_sales += sale['quantity']
                promo_days.add(sale['date'])
            else:
                normal_sales += sale['quantity']
                normal_days.add(sale['date'])
        promo_avg = promo_sales / max(len(promo_days), 1)
        normal_avg = normal_sales / max(len(normal_days), 1)
        return {
            'promo_avg': promo_avg,
            'normal_avg': normal_avg,
            'lift': promo_avg / normal_avg - 1 if normal_avg else 0.0
        }


# Example OLAP queries
sales_data = [
    {'date': '2024-01-15', 'store_id': 1, 'product_id': 100, 'quantity': 5, 'amount': 4995},
    {'date': '2024-01-15', 'store_id': 2, 'product_id': 100, 'quantity': 3, 'amount': 2997},
    {'date': '2024-01-16', 'store_id': 1, 'product_id': 101, 'quantity': 10, 'amount': 500},
    # ... imagine millions more rows
]

analytics = OLAPQueries(sales_data)
print(analytics.total_revenue_by_store('2024-01'))
print(analytics.top_products_by_quantity(5))
```
OLTP vs OLAP: Side by Side
The differences between these two workloads are so stark that optimizing a single system for both at the same time is nearly impossible.
| Property | OLTP | OLAP |
|----------|------|------|
| Read pattern | Few records fetched by key | Aggregate over millions of records |
| Write pattern | Random, low-latency writes | Bulk imports (ETL) |
| Primary users | Customers, end users | Business analysts |
| Data represents | Current state | Historical events |
| Query complexity | Simple lookups/updates | Complex joins & aggregates |
| Dataset size | Gigabytes to terabytes | Terabytes to petabytes |
| Latency requirement | Milliseconds | Seconds to minutes |
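To make the read-pattern row concrete, here is a toy timing sketch using plain in-memory Python structures (not a benchmark of any real database): a point lookup by key versus a full scan and aggregation over the same data.

```python
import random
import time

# Toy illustration only: in-memory Python, not a real database.
rows = [{'id': i, 'amount': random.random()} for i in range(1_000_000)]
by_id = {row['id']: row for row in rows}           # index for point lookups

start = time.perf_counter()
_ = by_id[123_456]                                 # OLTP-style: fetch one record by key
lookup_ms = (time.perf_counter() - start) * 1000

start = time.perf_counter()
total = sum(row['amount'] for row in rows)         # OLAP-style: scan and aggregate everything
scan_ms = (time.perf_counter() - start) * 1000

print(f"Point lookup: {lookup_ms:.4f} ms, full scan: {scan_ms:.1f} ms")
```

Even at this tiny scale the scan takes orders of magnitude longer than the lookup; real warehouses face the same gap at terabyte scale.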
The Problem: Why Can't We Just Use One Database?
Early on, companies tried to run both OLTP and OLAP on the same database. This seemed sensible: SQL works for both, right? But it led to serious problems.
When an analyst runs a query that scans 100 million rows, it consumes:
- Disk I/O: Reading huge amounts of data
- Memory: Caching and sorting results
- CPU: Aggregating and computing statistics
- Locks: Potentially blocking concurrent transactions
Meanwhile, customers trying to check out experience slow responses. The database administrator gets angry calls from both sides.
The Solution: Data Warehousing
The solution is to separate these workloads entirely. Keep your fast, lean OLTP databases for transactions, and create a separate Data Warehouse for analytics.
The data warehouse is a read-only copy of data from all your OLTP systems, transformed into a format that's optimized for analysis. Analysts can run any query they want without impacting production systems.
ETL: Extract, Transform, Load
Getting data into the warehouse is a process called ETL:
- Extract: Pull data from source OLTP systems
- Transform: Clean, validate, denormalize, and restructure
- Load: Write to the data warehouse
```python
class ETLPipeline:
    """
    Simplified ETL pipeline from OLTP systems to a data warehouse.
    """
    def __init__(self):
        self.staging = []
        self.warehouse = []

    def extract(self, oltp_systems: dict) -> list:
        """
        Extract: Pull data from multiple OLTP sources.
        In real systems: database connections, APIs, log files.
        """
        extracted = []
        for system_name, data in oltp_systems.items():
            for record in data:
                extracted.append({
                    'source': system_name,
                    'raw_data': record,
                    'extracted_at': '2024-01-15 02:00:00'
                })
        print(f"Extracted {len(extracted)} records from {len(oltp_systems)} systems")
        return extracted

    def transform(self, raw_records: list) -> list:
        """
        Transform: Clean and restructure data for analytics.
        - Standardize formats
        - Handle missing values
        - Denormalize for query performance
        - Add computed fields
        """
        transformed = []
        for record in raw_records:
            raw = record['raw_data']
            # Standardize date format
            date = raw.get('date', raw.get('timestamp', raw.get('created_at')))
            # Denormalize: include the store name, not just its ID
            store_name = self._lookup_store_name(raw.get('store_id'))
            # Compute derived fields
            profit = raw.get('revenue', 0) - raw.get('cost', 0)
            transformed.append({
                'date': date,
                'store_id': raw.get('store_id'),
                'store_name': store_name,
                'product_id': raw.get('product_id'),
                'quantity': raw.get('quantity', 0),
                'revenue': raw.get('revenue', 0),
                'cost': raw.get('cost', 0),
                'profit': profit,
                'source_system': record['source']
            })
        print(f"Transformed {len(transformed)} records")
        return transformed

    def load(self, transformed_records: list):
        """
        Load: Insert into the data warehouse.
        Often uses bulk loading for efficiency.
        """
        self.warehouse.extend(transformed_records)
        print(f"Loaded {len(transformed_records)} records to warehouse")
        print(f"Warehouse now contains {len(self.warehouse)} total records")

    def _lookup_store_name(self, store_id: int) -> str:
        stores = {1: 'Downtown', 2: 'Mall', 3: 'Airport'}
        return stores.get(store_id, 'Unknown')

    def run(self, oltp_systems: dict):
        """Run the complete ETL pipeline."""
        print("=" * 50)
        print("Starting ETL Pipeline")
        print("=" * 50)
        raw = self.extract(oltp_systems)
        transformed = self.transform(raw)
        self.load(transformed)
        print("ETL Complete!")
        return self.warehouse


# Example ETL run
oltp_sources = {
    'sales_system': [
        {'date': '2024-01-15', 'store_id': 1, 'product_id': 100,
         'quantity': 5, 'revenue': 4995, 'cost': 3000},
        {'date': '2024-01-15', 'store_id': 2, 'product_id': 101,
         'quantity': 10, 'revenue': 500, 'cost': 250},
    ],
    'inventory_system': [
        {'timestamp': '2024-01-15', 'store_id': 1, 'product_id': 100,
         'quantity': -5},  # Shipped
    ]
}

etl = ETLPipeline()
warehouse_data = etl.run(oltp_sources)
```
ETL Frequency
ETL can run:
- Nightly batch: Most common; data is up to 24 hours old
- Hourly/frequent batch: More timely, more complex
- Streaming/real-time: Near-instant updates via event streams
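As a rough sketch of how an hourly batch avoids re-reading everything, the snippet below extracts only records changed since a high-watermark timestamp. The extract_since helper and the updated_at field are illustrative assumptions, not part of the ETLPipeline above.

```python
# Sketch: incremental (hourly) extract using a high-watermark timestamp.
# Assumption: each source record carries an 'updated_at' field; the field
# name and this helper are illustrative, not taken from the text above.

def extract_since(oltp_systems: dict, last_run: str) -> list:
    """Pull only records modified after the previous run."""
    extracted = []
    for system_name, data in oltp_systems.items():
        for record in data:
            if record.get('updated_at', '') > last_run:  # ISO strings compare correctly
                extracted.append({'source': system_name, 'raw_data': record})
    return extracted

# Usage: run every hour, advancing the watermark after each successful load.
watermark = '2024-01-15T09:00:00'
sources = {
    'sales_system': [
        {'updated_at': '2024-01-15T08:30:00', 'order_id': 1},   # already loaded
        {'updated_at': '2024-01-15T09:45:00', 'order_id': 2},   # new since last run
    ]
}
batch = extract_since(sources, watermark)
print(len(batch))  # 1 -> only the new record is extracted
```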
Why Data Warehouses Exist (and Why Small Companies Don't Have Them)
You might notice that small companies don't have data warehouses. Why?
- Few OLTP systems: Maybe just one e-commerce platform
- Small data: Can be analyzed in a regular database or spreadsheet
- Simple queries: Don't need specialized optimization
But large enterprises have:
- Dozens of OLTP systems: Sales, inventory, HR, shipping, finance...
- Petabytes of data: Years of transaction history
- Complex analytics: Cross-system analysis, ML training
The overhead of maintaining a data warehouse only makes sense at scale.
The Star Schema: Organizing Data for Analytics
Data warehouses typically use a schema design called the Star Schema, which is optimized for the kinds of queries analysts actually run.
The star schema has two types of tables:
- Fact Table: The central table containing events/transactions (what happened)
- Dimension Tables: Surrounding tables with descriptive attributes (who, what, where, when, why)
The schema is called a "star" because when you draw it, the fact table sits in the center with dimension tables radiating outward like rays.
```python
from dataclasses import dataclass
from typing import List

@dataclass
class DimDate:
    """Date dimension: all attributes about a date."""
    date_id: int
    full_date: str
    day: int
    month: int
    year: int
    quarter: int
    day_of_week: str
    is_weekend: bool
    is_holiday: bool

@dataclass
class DimProduct:
    """Product dimension: everything about a product."""
    product_id: int
    name: str
    brand: str
    category: str
    subcategory: str
    supplier: str
    unit_cost: float

@dataclass
class DimStore:
    """Store dimension: everything about a location."""
    store_id: int
    name: str
    address: str
    city: str
    region: str
    country: str
    square_feet: int
    has_bakery: bool
    opened_date: str

@dataclass
class DimCustomer:
    """Customer dimension: everything about a customer."""
    customer_id: int
    name: str
    email: str
    segment: str  # 'retail', 'wholesale', 'premium'
    signup_date: str
    lifetime_value: float

@dataclass
class FactSales:
    """
    Fact table: Each row is a sale EVENT.
    Contains foreign keys to dimensions + measurable facts.
    """
    sale_id: int
    # Foreign keys to dimensions
    date_id: int
    product_id: int
    store_id: int
    customer_id: int
    # Measurable facts (numbers you can aggregate)
    quantity: int
    unit_price: float
    discount: float
    total_amount: float
    profit: float

class StarSchemaWarehouse:
    """
    A data warehouse organized as a star schema.
    """
    def __init__(self):
        # Dimension tables (relatively small)
        self.dim_dates: dict = {}
        self.dim_products: dict = {}
        self.dim_stores: dict = {}
        self.dim_customers: dict = {}
        # Fact table (very large - millions/billions of rows)
        self.fact_sales: List[FactSales] = []

    def query_revenue_by_region_and_quarter(self) -> dict:
        """
        Typical OLAP query: aggregate facts, group by dimensions.
        SQL equivalent:
            SELECT d.region, dd.quarter, SUM(f.total_amount)
            FROM fact_sales f
            JOIN dim_store d ON f.store_id = d.store_id
            JOIN dim_date dd ON f.date_id = dd.date_id
            GROUP BY d.region, dd.quarter
        """
        results = {}
        for sale in self.fact_sales:
            store = self.dim_stores.get(sale.store_id)
            date = self.dim_dates.get(sale.date_id)
            if store and date:
                key = (store.region, date.quarter)
                results[key] = results.get(key, 0) + sale.total_amount
        return results

    def query_top_products_by_profit(self, n: int = 10) -> list:
        """
        Find the most profitable products.
        SQL equivalent:
            SELECT p.name, SUM(f.profit) AS total_profit
            FROM fact_sales f
            JOIN dim_product p ON f.product_id = p.product_id
            GROUP BY p.product_id, p.name
            ORDER BY total_profit DESC
            LIMIT n
        """
        profit_by_product = {}
        for sale in self.fact_sales:
            product = self.dim_products.get(sale.product_id)
            if product:
                profit_by_product[product.name] = \
                    profit_by_product.get(product.name, 0) + sale.profit
        sorted_products = sorted(
            profit_by_product.items(),
            key=lambda x: x[1],
            reverse=True
        )
        return sorted_products[:n]

    def query_holiday_vs_regular_sales(self) -> dict:
        """
        Compare sales on holidays vs regular days.
        This is why we have is_holiday in dim_date!
        """
        holiday_sales = 0
        regular_sales = 0
        for sale in self.fact_sales:
            date = self.dim_dates.get(sale.date_id)
            if date:
                if date.is_holiday:
                    holiday_sales += sale.total_amount
                else:
                    regular_sales += sale.total_amount
        return {
            'holiday_total': holiday_sales,
            'regular_total': regular_sales,
            'holiday_lift': holiday_sales / max(regular_sales, 1)
        }


# Example usage
warehouse = StarSchemaWarehouse()

# Load dimensions
warehouse.dim_dates[1] = DimDate(1, '2024-01-15', 15, 1, 2024, 1,
                                 'Monday', False, False)
warehouse.dim_dates[2] = DimDate(2, '2024-12-25', 25, 12, 2024, 4,
                                 'Wednesday', False, True)  # Holiday!
warehouse.dim_products[100] = DimProduct(100, 'Laptop', 'TechBrand',
                                         'Electronics', 'Computers',
                                         'SupplierA', 500.0)
warehouse.dim_stores[1] = DimStore(1, 'Downtown', '123 Main St',
                                   'New York', 'Northeast', 'USA',
                                   50000, True, '2010-01-01')
warehouse.dim_customers[1] = DimCustomer(1, 'Alice', 'alice@example.com',
                                         'premium', '2020-01-01', 5000.0)

# Load facts
warehouse.fact_sales.append(
    FactSales(1, date_id=1, product_id=100, store_id=1, customer_id=1,
              quantity=2, unit_price=999, discount=0, total_amount=1998,
              profit=998)
)

# Run analytics
print(warehouse.query_revenue_by_region_and_quarter())
```
Facts vs Dimensions: What Goes Where?
Understanding what belongs in fact tables vs dimension tables is crucial:
Fact Table (The Center)
- Contains events that happened
- Each row = one transaction/event
- Contains measures (numbers you aggregate): quantity, amount, profit
- Contains foreign keys to dimension tables
- Huge: Can have billions of rows
- Grows by rows, not columns: every new event adds a row, so row counts dwarf column counts
Dimension Tables (The Rays)
- Contains descriptive attributes
- Answers who, what, where, when, why, how
- Contains context for analysis
- Smaller: Thousands to millions of rows
- Wide: Can have hundreds of columns of attributes
The Snowflake Schema: Normalized Dimensions
The Snowflake Schema is a variation where dimension tables are further normalized into sub-dimensions.
In a snowflake schema, instead of storing "brand_name" directly in dim_product, you'd have a separate dim_brand table that dim_product references.
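As a rough sketch in the same dataclass style as the star schema code above, here is what snowflaking the product dimension could look like; the DimBrand and DimCategory tables and their fields are illustrative assumptions, not a prescribed layout.

```python
from dataclasses import dataclass

# Snowflake sketch: dim_product no longer stores brand/category text directly;
# it references smaller sub-dimension tables instead. Table and field names
# here are illustrative assumptions.

@dataclass
class DimBrand:
    brand_id: int
    brand_name: str
    manufacturer: str

@dataclass
class DimCategory:
    category_id: int
    category_name: str
    department: str

@dataclass
class SnowflakeDimProduct:
    product_id: int
    name: str
    brand_id: int       # FK to DimBrand (was a plain 'brand' string in the star schema)
    category_id: int    # FK to DimCategory (was 'category'/'subcategory' strings)
    unit_cost: float

# Resolving a product's brand name now takes an extra lookup:
brands = {1: DimBrand(1, 'TechBrand', 'TechCorp')}
categories = {10: DimCategory(10, 'Computers', 'Electronics')}
laptop = SnowflakeDimProduct(100, 'Laptop', brand_id=1, category_id=10, unit_cost=500.0)

print(brands[laptop.brand_id].brand_name)           # 'TechBrand'
print(categories[laptop.category_id].department)    # 'Electronics'
```

Notice the extra lookup at the end: that is the additional join a snowflake schema adds to every query that needs the brand or category name.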
Star Schema Pros:
- Simpler queries (fewer joins)
- Faster query performance
- Easier for analysts to understand
Snowflake Schema Pros:
- Less data redundancy
- More normalized (database theory approved!)
- Smaller storage footprint
In practice, star schemas are preferred for data warehouses because query simplicity and performance matter more than strict normalization.
Wide Tables in Data Warehouses
A surprising characteristic of data warehouse tables: they're wide. Really wide.
A fact table might have 100+ columns:
- Foreign keys to 10-20 dimension tables
- 80+ measure columns (quantities, amounts, counts, flags)
A dimension table like dim_store might include:
- Basic info: name, address, phone
- Geographic: city, region, country, timezone
- Physical: square footage, floors, parking
- Services: bakery, deli, pharmacy, gas station
- History: opened, remodeled, expanded dates
- Management: district, region manager, territory
- And much more...
This width is intentional: analysts need all this context for their queries.
Real-World Data Warehouse Systems
The data warehouse market has many players:
| Category | Examples |
|----------|----------|
| Commercial | Teradata, Vertica, SAP HANA, Oracle Exadata |
| Cloud-native | Snowflake, Amazon Redshift, Google BigQuery, Azure Synapse |
| Open source | Apache Hive, Apache Spark SQL, ClickHouse, DuckDB |
Key Takeaways
| Concept | Description |
|---------|-------------|
| OLTP | Transaction processing: small, fast operations on current data |
| OLAP | Analytical processing: large scans, aggregates, historical data |
| Data Warehouse | Separate database optimized for analytics |
| ETL | Extract-Transform-Load: the process that populates the warehouse |
| Star Schema | Central fact table surrounded by dimension tables |
| Fact Table | Events/transactions with measures (numbers) |
| Dimension Table | Descriptive context (who, what, where, when) |
| Snowflake Schema | Star schema with normalized dimensions |
The key insight is that OLTP and OLAP workloads are so different that they require fundamentally different system designs. The indexes we discussed earlier in this chapter (B-Trees, LSM-Trees) are optimized for OLTP. In the next section, we'll explore column-oriented storage, the secret weapon that makes data warehouses so fast at analytics.