Transaction Processing or Analytics
A comprehensive guide to understanding the fundamental differences between OLTP and OLAP workloads, data warehousing, ETL pipelines, and schema design patterns like star and snowflake schemas.
Two Very Different Ways to Use a Database
Databases originally existed for one purpose: recording business transactions. A sale was made, an order was placed, an employee got paid. Each of these events was a "transaction" that needed to be recorded. Over time, the term "transaction" expanded beyond money-changing-hands to mean any logical group of reads and writes to a database.
But here's the interesting part: as companies grew, they started using databases for two fundamentally different purposes. On one hand, they needed to process transactions: customers buying products, users updating their profiles, orders being placed. On the other hand, they wanted to analyze their data: understanding sales trends, customer behavior, and business performance.
These two use cases have such different characteristics that they've evolved into separate disciplines with their own names, tools, and optimizations.
OLTP: Online Transaction Processing
When a customer adds an item to their shopping cart, that's a transaction. When they check out, that's another transaction. When the warehouse marks the order as shipped, that's yet another. This pattern of small, fast operations on individual records is called Online Transaction Processing (OLTP).
The key characteristics of OLTP workloads are:
- Small, targeted queries: Look up one customer, one order, one product
- Low latency is critical: Users are waiting for responses
- Random access patterns: Queries jump around based on user actions
- Current state matters: You need to know what's in stock right now
```python
class OLTPDatabase:
    """
    Simulates typical OLTP operations.
    Notice: each operation touches few records and needs to be fast.
    """
    def __init__(self):
        self.users = {}
        self.products = {}
        self.orders = {}

    # OLTP Pattern: Fetch single record by key
    def get_user(self, user_id: int) -> dict:
        """Look up a single user - must be fast!"""
        return self.users.get(user_id)

    # OLTP Pattern: Update single record
    def update_cart(self, user_id: int, product_id: int, quantity: int):
        """Add item to cart - immediate response needed."""
        user = self.users.get(user_id)
        if user:
            if 'cart' not in user:
                user['cart'] = {}
            user['cart'][product_id] = quantity

    # OLTP Pattern: Insert new record
    def place_order(self, user_id: int, items: list) -> int:
        """Create order - user is waiting at checkout."""
        order_id = len(self.orders) + 1
        self.orders[order_id] = {
            'user_id': user_id,
            'items': items,
            'status': 'pending',
            'timestamp': '2024-01-15 10:30:00'
        }
        return order_id

    # OLTP Pattern: Check current state
    def check_inventory(self, product_id: int) -> int:
        """How many in stock RIGHT NOW? User needs to know."""
        product = self.products.get(product_id, {})
        return product.get('stock', 0)


# Example OLTP operations
db = OLTPDatabase()
db.users[1] = {'name': 'Alice', 'email': 'alice@example.com'}
db.products[100] = {'name': 'Laptop', 'stock': 50, 'price': 999}

# These are all fast, targeted operations
user = db.get_user(1)                       # Single lookup
db.update_cart(1, 100, 2)                   # Single update
stock = db.check_inventory(100)             # Single check
order_id = db.place_order(1, [(100, 2)])    # Single insert
```
OLAP: Online Analytical Processing
Now imagine you're the CEO, and you want to know: "What was our total revenue last quarter, broken down by region?" This question requires scanning millions of sales records, aggregating them, and computing statistics. This is Online Analytical Processing (OLAP).
OLAP queries have completely different characteristics:
- Massive scans: Read millions or billions of rows
- Aggregate operations: SUM, COUNT, AVG, GROUP BY
- Historical data: Analyze trends over months or years
- Complex queries: Join multiple tables, apply filters
```python
class OLAPQueries:
    """
    Simulates typical OLAP operations.
    Notice: each query scans huge amounts of data and computes aggregates.
    """
    def __init__(self, sales_data: list):
        self.sales = sales_data  # Millions of rows in a real system

    def total_revenue_by_store(self, month: str) -> dict:
        """
        OLAP Pattern: Aggregate over a large dataset.
        In a real system, this scans terabytes of data.
        """
        revenue = {}
        for sale in self.sales:
            if sale['date'].startswith(month):
                store = sale['store_id']
                revenue[store] = revenue.get(store, 0) + sale['amount']
        return revenue

    def top_products_by_quantity(self, n: int = 10) -> list:
        """
        OLAP Pattern: Aggregate, sort, limit.
        """
        product_sales = {}
        for sale in self.sales:
            pid = sale['product_id']
            product_sales[pid] = product_sales.get(pid, 0) + sale['quantity']
        sorted_products = sorted(
            product_sales.items(),
            key=lambda x: x[1],
            reverse=True
        )
        return sorted_products[:n]

    def promotional_effectiveness(self, promo_dates: list) -> dict:
        """
        OLAP Pattern: Complex analysis across time periods.
        "How many more bananas did we sell during promotions?"
        """
        promo_sales = 0
        normal_sales = 0
        promo_days = set()    # distinct dates, not individual sale rows
        normal_days = set()
        for sale in self.sales:
            if sale['date'] in promo_dates:
                promo_sales += sale['quantity']
                promo_days.add(sale['date'])
            else:
                normal_sales += sale['quantity']
                normal_days.add(sale['date'])
        promo_avg = promo_sales / max(len(promo_days), 1)
        normal_avg = normal_sales / max(len(normal_days), 1)
        return {
            'promo_avg': promo_avg,
            'normal_avg': normal_avg,
            'lift': promo_avg / normal_avg - 1 if normal_avg else 0.0
        }


# Example OLAP queries
sales_data = [
    {'date': '2024-01-15', 'store_id': 1, 'product_id': 100, 'quantity': 5, 'amount': 4995},
    {'date': '2024-01-15', 'store_id': 2, 'product_id': 100, 'quantity': 3, 'amount': 2997},
    {'date': '2024-01-16', 'store_id': 1, 'product_id': 101, 'quantity': 10, 'amount': 500},
    # ... imagine millions more rows
]

analytics = OLAPQueries(sales_data)
print(analytics.total_revenue_by_store('2024-01'))
print(analytics.top_products_by_quantity(5))
```
OLTP vs OLAP: Side by Side
The differences between these two workloads are so stark that optimizing a single system for both at the same time is nearly impossible.
| Property | OLTP | OLAP |
|----------|------|------|
| Read pattern | Few records fetched by key | Aggregate over millions of records |
| Write pattern | Random, low-latency writes | Bulk imports (ETL) |
| Primary users | Customers, end users | Business analysts |
| Data represents | Current state | Historical events |
| Query complexity | Simple lookups/updates | Complex joins & aggregates |
| Dataset size | Gigabytes to terabytes | Terabytes to petabytes |
| Latency requirement | Milliseconds | Seconds to minutes |
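To make the read-pattern row concrete, here is a toy timing sketch using plain in-memory Python structures (not a benchmark of any real database): a point lookup by key versus a full scan and aggregation over the same data.

```python
import random
import time

# Toy illustration only: in-memory Python, not a real database.
rows = [{'id': i, 'amount': random.random()} for i in range(1_000_000)]
by_id = {row['id']: row for row in rows}           # index for point lookups

start = time.perf_counter()
_ = by_id[123_456]                                 # OLTP-style: fetch one record by key
lookup_ms = (time.perf_counter() - start) * 1000

start = time.perf_counter()
total = sum(row['amount'] for row in rows)         # OLAP-style: scan and aggregate everything
scan_ms = (time.perf_counter() - start) * 1000

print(f"Point lookup: {lookup_ms:.4f} ms, full scan: {scan_ms:.1f} ms")
```

Even at this tiny scale the scan takes orders of magnitude longer than the lookup; real warehouses face the same gap at terabyte scale.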
The Problem: Why Can't We Just Use One Database?
Early on, companies tried to run both OLTP and OLAP on the same database. This seemed sensible: SQL works for both, right? But it led to serious problems.
When an analyst runs a query that scans 100 million rows, it consumes:
- Disk I/O: Reading huge amounts of data
- Memory: Caching and sorting results
- CPU: Aggregating and computing statistics
- Locks: Potentially blocking concurrent transactions
Meanwhile, customers trying to check out experience slow responses. The database administrator gets angry calls from both sides.
The Solution: Data Warehousing
The solution is to separate these workloads entirely. Keep your fast, lean OLTP databases for transactions, and create a separate Data Warehouse for analytics.
The data warehouse is a read-only copy of data from all your OLTP systems, transformed into a format that's optimized for analysis. Analysts can run any query they want without impacting production systems.
ETL: Extract, Transform, Load
Getting data into the warehouse is a process called ETL:
- Extract: Pull data from source OLTP systems
- Transform: Clean, validate, denormalize, and restructure
- Load: Write to the data warehouse
```python
class ETLPipeline:
    """
    Simplified ETL pipeline from OLTP systems to a data warehouse.
    """
    def __init__(self):
        self.staging = []
        self.warehouse = []

    def extract(self, oltp_systems: dict) -> list:
        """
        Extract: Pull data from multiple OLTP sources.
        In real systems: database connections, APIs, log files.
        """
        extracted = []
        for system_name, data in oltp_systems.items():
            for record in data:
                extracted.append({
                    'source': system_name,
                    'raw_data': record,
                    'extracted_at': '2024-01-15 02:00:00'
                })
        print(f"Extracted {len(extracted)} records from {len(oltp_systems)} systems")
        return extracted

    def transform(self, raw_records: list) -> list:
        """
        Transform: Clean and restructure data for analytics.
        - Standardize formats
        - Handle missing values
        - Denormalize for query performance
        - Add computed fields
        """
        transformed = []
        for record in raw_records:
            raw = record['raw_data']
            # Standardize date format
            date = raw.get('date', raw.get('timestamp', raw.get('created_at')))
            # Denormalize: include the store name, not just its ID
            store_name = self._lookup_store_name(raw.get('store_id'))
            # Compute derived fields
            profit = raw.get('revenue', 0) - raw.get('cost', 0)
            transformed.append({
                'date': date,
                'store_id': raw.get('store_id'),
                'store_name': store_name,
                'product_id': raw.get('product_id'),
                'quantity': raw.get('quantity', 0),
                'revenue': raw.get('revenue', 0),
                'cost': raw.get('cost', 0),
                'profit': profit,
                'source_system': record['source']
            })
        print(f"Transformed {len(transformed)} records")
        return transformed

    def load(self, transformed_records: list):
        """
        Load: Insert into the data warehouse.
        Often uses bulk loading for efficiency.
        """
        self.warehouse.extend(transformed_records)
        print(f"Loaded {len(transformed_records)} records to warehouse")
        print(f"Warehouse now contains {len(self.warehouse)} total records")

    def _lookup_store_name(self, store_id: int) -> str:
        stores = {1: 'Downtown', 2: 'Mall', 3: 'Airport'}
        return stores.get(store_id, 'Unknown')

    def run(self, oltp_systems: dict):
        """Run the complete ETL pipeline."""
        print("=" * 50)
        print("Starting ETL Pipeline")
        print("=" * 50)
        raw = self.extract(oltp_systems)
        transformed = self.transform(raw)
        self.load(transformed)
        print("ETL Complete!")
        return self.warehouse


# Example ETL run
oltp_sources = {
    'sales_system': [
        {'date': '2024-01-15', 'store_id': 1, 'product_id': 100,
         'quantity': 5, 'revenue': 4995, 'cost': 3000},
        {'date': '2024-01-15', 'store_id': 2, 'product_id': 101,
         'quantity': 10, 'revenue': 500, 'cost': 250},
    ],
    'inventory_system': [
        {'timestamp': '2024-01-15', 'store_id': 1, 'product_id': 100,
         'quantity': -5},  # Shipped
    ]
}

etl = ETLPipeline()
warehouse_data = etl.run(oltp_sources)
```
ETL Frequency
ETL can run:
- Nightly batch: Most common; data is up to 24 hours old
- Hourly/frequent batch: More timely, more complex
- Streaming/real-time: Near-instant updates via event streams
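As a rough sketch of how an hourly batch avoids re-reading everything, the snippet below extracts only records changed since a high-watermark timestamp. The extract_since helper and the updated_at field are illustrative assumptions, not part of the ETLPipeline above.

```python
# Sketch: incremental (hourly) extract using a high-watermark timestamp.
# Assumption: each source record carries an 'updated_at' field; the field
# name and this helper are illustrative, not taken from the text above.

def extract_since(oltp_systems: dict, last_run: str) -> list:
    """Pull only records modified after the previous run."""
    extracted = []
    for system_name, data in oltp_systems.items():
        for record in data:
            if record.get('updated_at', '') > last_run:  # ISO strings compare correctly
                extracted.append({'source': system_name, 'raw_data': record})
    return extracted

# Usage: run every hour, advancing the watermark after each successful load.
watermark = '2024-01-15T09:00:00'
sources = {
    'sales_system': [
        {'updated_at': '2024-01-15T08:30:00', 'order_id': 1},   # already loaded
        {'updated_at': '2024-01-15T09:45:00', 'order_id': 2},   # new since last run
    ]
}
batch = extract_since(sources, watermark)
print(len(batch))  # 1 -> only the new record is extracted
```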
Why Data Warehouses Exist (and Why Small Companies Don't Have Them)
You might notice that small companies don't have data warehouses. Why?
- Few OLTP systems: Maybe just one e-commerce platform
- Small data: Can be analyzed in a regular database or spreadsheet
- Simple queries: Don't need specialized optimization
But large enterprises have:
- Dozens of OLTP systems: Sales, inventory, HR, shipping, finance...
- Petabytes of data: Years of transaction history
- Complex analytics: Cross-system analysis, ML training
The overhead of maintaining a data warehouse only makes sense at scale.
The Star Schema: Organizing Data for Analytics
Data warehouses typically use a schema design called the Star Schema, which is optimized for the kinds of queries analysts actually run.
The star schema has two types of tables:
- Fact Table: The central table containing events/transactions (what happened)
- Dimension Tables: Surrounding tables with descriptive attributes (who, what, where, when, why)
The schema is called a "star" because when you draw it, the fact table sits in the center with dimension tables radiating outward like rays.
```python
from dataclasses import dataclass
from typing import List

@dataclass
class DimDate:
    """Date dimension: all attributes about a date."""
    date_id: int
    full_date: str
    day: int
    month: int
    year: int
    quarter: int
    day_of_week: str
    is_weekend: bool
    is_holiday: bool

@dataclass
class DimProduct:
    """Product dimension: everything about a product."""
    product_id: int
    name: str
    brand: str
    category: str
    subcategory: str
    supplier: str
    unit_cost: float

@dataclass
class DimStore:
    """Store dimension: everything about a location."""
    store_id: int
    name: str
    address: str
    city: str
    region: str
    country: str
    square_feet: int
    has_bakery: bool
    opened_date: str

@dataclass
class DimCustomer:
    """Customer dimension: everything about a customer."""
    customer_id: int
    name: str
    email: str
    segment: str  # 'retail', 'wholesale', 'premium'
    signup_date: str
    lifetime_value: float

@dataclass
class FactSales:
    """
    Fact table: Each row is a sale EVENT.
    Contains foreign keys to dimensions + measurable facts.
    """
    sale_id: int
    # Foreign keys to dimensions
    date_id: int
    product_id: int
    store_id: int
    customer_id: int
    # Measurable facts (numbers you can aggregate)
    quantity: int
    unit_price: float
    discount: float
    total_amount: float
    profit: float

class StarSchemaWarehouse:
    """
    A data warehouse organized as a star schema.
    """
    def __init__(self):
        # Dimension tables (relatively small)
        self.dim_dates: dict = {}
        self.dim_products: dict = {}
        self.dim_stores: dict = {}
        self.dim_customers: dict = {}
        # Fact table (very large - millions/billions of rows)
        self.fact_sales: List[FactSales] = []

    def query_revenue_by_region_and_quarter(self) -> dict:
        """
        Typical OLAP query: aggregate facts, group by dimensions.
        SQL equivalent:
            SELECT d.region, dd.quarter, SUM(f.total_amount)
            FROM fact_sales f
            JOIN dim_store d ON f.store_id = d.store_id
            JOIN dim_date dd ON f.date_id = dd.date_id
            GROUP BY d.region, dd.quarter
        """
        results = {}
        for sale in self.fact_sales:
            store = self.dim_stores.get(sale.store_id)
            date = self.dim_dates.get(sale.date_id)
            if store and date:
                key = (store.region, date.quarter)
                results[key] = results.get(key, 0) + sale.total_amount
        return results

    def query_top_products_by_profit(self, n: int = 10) -> list:
        """
        Find the most profitable products.
        SQL equivalent:
            SELECT p.name, SUM(f.profit) AS total_profit
            FROM fact_sales f
            JOIN dim_product p ON f.product_id = p.product_id
            GROUP BY p.product_id, p.name
            ORDER BY total_profit DESC
            LIMIT n
        """
        profit_by_product = {}
        for sale in self.fact_sales:
            product = self.dim_products.get(sale.product_id)
            if product:
                profit_by_product[product.name] = \
                    profit_by_product.get(product.name, 0) + sale.profit
        sorted_products = sorted(
            profit_by_product.items(),
            key=lambda x: x[1],
            reverse=True
        )
        return sorted_products[:n]

    def query_holiday_vs_regular_sales(self) -> dict:
        """
        Compare sales on holidays vs regular days.
        This is why we have is_holiday in dim_date!
        """
        holiday_sales = 0
        regular_sales = 0
        for sale in self.fact_sales:
            date = self.dim_dates.get(sale.date_id)
            if date:
                if date.is_holiday:
                    holiday_sales += sale.total_amount
                else:
                    regular_sales += sale.total_amount
        return {
            'holiday_total': holiday_sales,
            'regular_total': regular_sales,
            'holiday_lift': holiday_sales / max(regular_sales, 1)
        }


# Example usage
warehouse = StarSchemaWarehouse()

# Load dimensions
warehouse.dim_dates[1] = DimDate(1, '2024-01-15', 15, 1, 2024, 1,
                                 'Monday', False, False)
warehouse.dim_dates[2] = DimDate(2, '2024-12-25', 25, 12, 2024, 4,
                                 'Wednesday', False, True)  # Holiday!
warehouse.dim_products[100] = DimProduct(100, 'Laptop', 'TechBrand',
                                         'Electronics', 'Computers',
                                         'SupplierA', 500.0)
warehouse.dim_stores[1] = DimStore(1, 'Downtown', '123 Main St',
                                   'New York', 'Northeast', 'USA',
                                   50000, True, '2010-01-01')
warehouse.dim_customers[1] = DimCustomer(1, 'Alice', 'alice@example.com',
                                         'premium', '2020-01-01', 5000.0)

# Load facts
warehouse.fact_sales.append(
    FactSales(1, date_id=1, product_id=100, store_id=1, customer_id=1,
              quantity=2, unit_price=999, discount=0, total_amount=1998,
              profit=998)
)

# Run analytics
print(warehouse.query_revenue_by_region_and_quarter())
```
Facts vs Dimensions: What Goes Where?
Understanding what belongs in fact tables vs dimension tables is crucial:
Fact Table (The Center)
- Contains events that happened
- Each row = one transaction/event
- Contains measures (numbers you aggregate): quantity, amount, profit
- Contains foreign keys to dimension tables
- Huge: Can have billions of rows
- Grows by rows, not columns: every new event adds a row, so row counts dwarf column counts
Dimension Tables (The Rays)
- Contains descriptive attributes
- Answers who, what, where, when, why, how
- Contains context for analysis
- Smaller: Thousands to millions of rows
- Wide: Can have hundreds of columns of attributes
The Snowflake Schema: Normalized Dimensions
The Snowflake Schema is a variation where dimension tables are further normalized into sub-dimensions.
In a snowflake schema, instead of storing "brand_name" directly in dim_product, you'd have a separate dim_brand table that dim_product references.
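As a rough sketch in the same dataclass style as the star schema code above, here is what snowflaking the product dimension could look like; the DimBrand and DimCategory tables and their fields are illustrative assumptions, not a prescribed layout.

```python
from dataclasses import dataclass

# Snowflake sketch: dim_product no longer stores brand/category text directly;
# it references smaller sub-dimension tables instead. Table and field names
# here are illustrative assumptions.

@dataclass
class DimBrand:
    brand_id: int
    brand_name: str
    manufacturer: str

@dataclass
class DimCategory:
    category_id: int
    category_name: str
    department: str

@dataclass
class SnowflakeDimProduct:
    product_id: int
    name: str
    brand_id: int       # FK to DimBrand (was a plain 'brand' string in the star schema)
    category_id: int    # FK to DimCategory (was 'category'/'subcategory' strings)
    unit_cost: float

# Resolving a product's brand name now takes an extra lookup:
brands = {1: DimBrand(1, 'TechBrand', 'TechCorp')}
categories = {10: DimCategory(10, 'Computers', 'Electronics')}
laptop = SnowflakeDimProduct(100, 'Laptop', brand_id=1, category_id=10, unit_cost=500.0)

print(brands[laptop.brand_id].brand_name)           # 'TechBrand'
print(categories[laptop.category_id].department)    # 'Electronics'
```

Notice the extra lookup at the end: that is the additional join a snowflake schema adds to every query that needs the brand or category name.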
Star Schema Pros:
- Simpler queries (fewer joins)
- Faster query performance
- Easier for analysts to understand
Snowflake Schema Pros:
- Less data redundancy
- More normalized (database theory approved!)
- Smaller storage footprint
In practice, star schemas are preferred for data warehouses because query simplicity and performance matter more than strict normalization.
Wide Tables in Data Warehouses
A surprising characteristic of data warehouse tables: they're wide. Really wide.
A fact table might have 100+ columns:
- Foreign keys to 10-20 dimension tables
- 80+ measure columns (quantities, amounts, counts, flags)
A dimension table like dim_store might include:
- Basic info: name, address, phone
- Geographic: city, region, country, timezone
- Physical: square footage, floors, parking
- Services: bakery, deli, pharmacy, gas station
- History: opened, remodeled, expanded dates
- Management: district, region manager, territory
- And much more...
This width is intentional: analysts need all this context for their queries.
Real-World Data Warehouse Systems
The data warehouse market has many players:
| Category | Examples |
|----------|----------|
| Commercial | Teradata, Vertica, SAP HANA, Oracle Exadata |
| Cloud-native | Snowflake, Amazon Redshift, Google BigQuery, Azure Synapse |
| Open source | Apache Hive, Apache Spark SQL, ClickHouse, DuckDB |
Key Takeaways
| Concept | Description |
|---------|-------------|
| OLTP | Transaction processing: small, fast operations on current data |
| OLAP | Analytical processing: large scans, aggregates, historical data |
| Data Warehouse | Separate database optimized for analytics |
| ETL | Extract-Transform-Load: the process that populates the warehouse |
| Star Schema | Central fact table surrounded by dimension tables |
| Fact Table | Events/transactions with measures (numbers) |
| Dimension Table | Descriptive context (who, what, where, when) |
| Snowflake Schema | Star schema with normalized dimensions |
The key insight is that OLTP and OLAP workloads are so different that they require fundamentally different system designs. The indexes we discussed earlier in this chapter (B-Trees, LSM-Trees) are optimized for OLTP. In the next section, we'll explore column-oriented storage, the secret weapon that makes data warehouses so fast at analytics.