CDMP Fundamentals • 100 Questions • 90 Minutes

Data Warehousing and Business Intelligence

Chapter 11 • 10% of exam

Overview

Data Warehousing and Business Intelligence (DW/BI) encompasses the planning, implementation, and control of processes to provide decision-support data and support knowledge workers with reporting and analysis capabilities. A data warehouse is a centralized repository of integrated data from multiple sources, organized and optimized for analytical querying rather than operational transaction processing. The discipline covers the full analytics pipeline: extracting data from operational sources, transforming and integrating it into a consistent format, loading it into analytical data stores, and delivering insights through reports, dashboards, OLAP cubes, and advanced analytics. Two foundational architects shaped the field: Bill Inmon (the 'father of data warehousing'), who advocates a top-down approach with a normalized enterprise data warehouse, and Ralph Kimball, who advocates a bottom-up approach building dimensional data marts that together form the warehouse. Modern DW/BI has evolved significantly with cloud technologies, real-time analytics, self-service BI, data lakes, and the lakehouse architecture that combines the flexibility of data lakes with the performance and structure of data warehouses. Despite these advances, the fundamental principles — integrated data, historical perspective, subject orientation, and non-volatile storage — remain the cornerstones of effective analytical data management.

Key Concepts

Data Warehouse Characteristics (Inmon's Definition)

Bill Inmon defined a data warehouse as a collection of data that is: (1) SUBJECT-ORIENTED — organized around key business subjects (customers, products, sales) rather than operational applications; (2) INTEGRATED — data from multiple sources is cleansed, standardized, and combined into a consistent format (resolving naming conflicts, coding differences, format variations); (3) TIME-VARIANT — maintains historical data with timestamps, enabling trend analysis and comparisons over time (unlike operational systems that typically maintain only current state); (4) NON-VOLATILE — data is loaded and accessed but not modified or deleted (append-only or overwrite with history preservation). These four characteristics distinguish a warehouse from an operational database.

Inmon vs Kimball Approaches

Two foundational approaches to data warehouse architecture: INMON (TOP-DOWN) — build a normalized (3NF) Enterprise Data Warehouse (EDW) first as the single source of truth, then create dependent data marts for specific departments. The EDW stores data at the atomic level in normalized form. Advantages: enterprise-wide consistency, single version of truth, handles complex integration. Disadvantages: longer time to first delivery, higher upfront cost, more complex. KIMBALL (BOTTOM-UP) — build dimensional data marts first, organized around business processes (sales, inventory, finance). Data marts share conformed dimensions, forming an enterprise 'bus' architecture. No separate normalized EDW. Advantages: faster time to value, incremental, business-aligned. Disadvantages: risk of inconsistency if conformance isn't maintained. Many organizations use a HYBRID approach combining elements of both.

ETL vs ELT

Two approaches to moving data from source to warehouse: ETL (Extract, Transform, Load) — data is extracted from sources, transformed in a separate staging area (data cleansing, type conversion, business rules, aggregation), then loaded into the warehouse. Traditional approach for on-premises warehouses. Transformations happen OUTSIDE the warehouse using tools like Informatica, Talend, DataStage. ELT (Extract, Load, Transform) — data is extracted and loaded RAW into the target first, then transformed INSIDE the target using the target's compute engine. Common in cloud data warehouses (Snowflake, BigQuery, Redshift) which have massive parallel processing power. Tools like dbt, Matillion, and cloud-native SQL are used. ELT advantages: leverages cloud compute scalability, faster loading, preserves raw data for reprocessing. ETL advantages: handles complex transformations better, reduces load on warehouse, mature tooling.
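The ETL pattern can be sketched in a few lines: transformation (cleansing, coding standardization, type conversion) happens in application code before anything reaches the warehouse. This is a minimal illustration using the Python standard library with an in-memory SQLite database standing in for the warehouse; the source rows, table, and column names are all invented for the example.

```python
# Minimal ETL sketch: Extract -> Transform (outside the warehouse) -> Load.
import sqlite3
from datetime import date

# Extract: rows as they might arrive from a hypothetical operational source,
# with inconsistent coding, whitespace, and string-typed values
source_rows = [
    {"cust": " Alice ", "country": "us", "amount": "19.99", "sold": "2024-01-05"},
    {"cust": "Bob",     "country": "US", "amount": "5.00",  "sold": "2024-01-06"},
]

def transform(row):
    # Cleansing and standardization happen here, BEFORE loading
    return (
        row["cust"].strip(),                 # cleanse whitespace
        row["country"].upper(),              # resolve coding differences
        round(float(row["amount"]), 2),      # type conversion
        date.fromisoformat(row["sold"]).isoformat(),
    )

# Load: only already-conformed rows reach the warehouse
wh = sqlite3.connect(":memory:")
wh.execute("CREATE TABLE fact_sales (customer TEXT, country TEXT, amount REAL, sold_date TEXT)")
wh.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
               [transform(r) for r in source_rows])

print(wh.execute("SELECT customer, country, amount FROM fact_sales").fetchall())
# → [('Alice', 'US', 19.99), ('Bob', 'US', 5.0)]
```

In the ELT variant, the raw `source_rows` would be loaded into the target unchanged, and the cleansing in `transform` would instead be expressed as SQL executed by the warehouse's own engine.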

OLAP vs OLTP

Two fundamentally different database usage patterns: OLTP (Online Transaction Processing) — operational systems handling high volumes of small, fast transactions. Characteristics: many users, frequent read/write, small amounts of data per transaction, current data, normalized schema, response time in milliseconds. Examples: order entry, banking transactions, point-of-sale. OLAP (Online Analytical Processing) — analytical systems supporting complex queries across large volumes of historical data. Characteristics: fewer users, mostly read-only, large data scans, historical data, denormalized schema, response time in seconds. Examples: sales analysis, trend reporting, forecasting. OLAP operations: SLICE (filter one dimension), DICE (filter multiple dimensions), DRILL-DOWN (more detail), DRILL-UP (less detail, summarize), PIVOT (rotate dimensions). MOLAP = multidimensional storage, ROLAP = relational storage, HOLAP = hybrid.
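The OLAP operations above can be demonstrated on a tiny in-memory "cube" with plain Python — the regions, products, and sales figures are illustrative only.

```python
# Toy cube: each row is (region, product, year, sales)
facts = [
    ("East", "Widget", 2023, 100), ("East", "Gadget", 2023, 80),
    ("West", "Widget", 2023, 120), ("West", "Widget", 2024, 140),
    ("East", "Widget", 2024, 110),
]

# SLICE: filter one dimension (year = 2023)
slice_2023 = [f for f in facts if f[2] == 2023]

# DICE: filter multiple dimensions (region = East AND year = 2024)
dice = [f for f in facts if f[0] == "East" and f[2] == 2024]

# DRILL-UP (roll-up): summarize away dimensions — total sales per region
rollup = {}
for region, _product, _year, sales in facts:
    rollup[region] = rollup.get(region, 0) + sales

print(len(slice_2023), dice, rollup)
# → 3 [('East', 'Widget', 2024, 110)] {'East': 290, 'West': 260}
```

Drill-down is the reverse of the roll-up step: starting from the per-region totals and expanding back out to the per-product, per-year detail rows.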

Data Lake vs Data Warehouse

DATA LAKE — a centralized repository storing raw data in its NATIVE format (structured, semi-structured, unstructured) using SCHEMA-ON-READ (structure applied when data is read, not when stored). Technologies: Hadoop HDFS, cloud object storage (S3, Azure ADLS, GCS). Advantages: stores all data types, cheap storage, flexible for diverse analytics. Risk: can become a 'data swamp' without governance. DATA WAREHOUSE — stores STRUCTURED, PROCESSED data using SCHEMA-ON-WRITE (data must conform to defined schema before loading). Technologies: Snowflake, Redshift, BigQuery, Teradata. Advantages: fast query performance, guaranteed quality, well-understood structure. DATA LAKEHOUSE — emerging architecture combining lake flexibility with warehouse performance. Uses open table formats (Delta Lake, Apache Iceberg, Apache Hudi) on data lake storage to provide ACID transactions, schema enforcement, and SQL analytics on lake data.
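The schema-on-write vs. schema-on-read distinction can be made concrete with a small sketch: the "warehouse" path validates records against a schema before storing them, while the "lake" path stores raw JSON as-is and imposes structure only when reading. The schema, field names, and sample record are invented for illustration.

```python
import json

# Hypothetical schema enforced by the warehouse path
SCHEMA = {"id": int, "amount": float}

def write_to_warehouse(record, table):
    # SCHEMA-ON-WRITE: reject anything that doesn't conform before storage
    for field, typ in SCHEMA.items():
        if not isinstance(record.get(field), typ):
            raise ValueError(f"schema violation on {field!r}")
    table.append(record)

def read_from_lake(raw_line):
    # SCHEMA-ON-READ: raw data was stored in native format;
    # structure is applied only now, at read time
    rec = json.loads(raw_line)
    return {"id": int(rec["id"]), "amount": float(rec.get("amount", 0.0))}

lake = ['{"id": "7", "amount": "12.5", "extra": "kept as-is"}']  # raw JSON
warehouse = []
write_to_warehouse({"id": 7, "amount": 12.5}, warehouse)

print(read_from_lake(lake[0]))
# → {'id': 7, 'amount': 12.5}
```

Note the trade-off the sketch exposes: the lake happily retained the unexpected `extra` field (flexibility), while the warehouse would have rejected the same raw record until it was conformed (guaranteed quality).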

Data Warehouse Design Concepts

Key design concepts for dimensional data warehouses: GRAIN — the level of detail in a fact table; the MOST CRITICAL design decision. Example: 'one row per product per store per day.' Always declare the grain before adding dimensions and measures. CONFORMED DIMENSIONS — shared dimension tables used consistently across multiple fact tables. Enable enterprise-wide analysis (Date, Customer, Product dimensions shared by Sales, Marketing, Finance). BUS MATRIX — a planning tool showing which dimensions are used by which business processes/fact tables. Guides the order of data mart development. STAGING AREA — an intermediate storage area for ETL processing. Data lands here first for cleansing and transformation before loading to the warehouse.
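A bus matrix is easy to represent as a mapping from business processes to the dimensions they use; the processes and dimension names below are illustrative. Dimensions that appear across every process are the strongest candidates to conform first.

```python
# Bus matrix sketch: business processes (rows) vs dimensions (columns)
bus_matrix = {
    "Retail Sales": {"Date", "Product", "Store", "Customer"},
    "Inventory":    {"Date", "Product", "Store"},
    "Promotions":   {"Date", "Product", "Customer"},
}

# Dimensions shared by every process — conform these first, since all
# planned data marts will depend on them
conformed = set.intersection(*bus_matrix.values())
print(sorted(conformed))
# → ['Date', 'Product']
```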

Types of Fact Tables

Fact tables come in three primary varieties, plus a fourth special pattern: (1) TRANSACTION FACT TABLE — records individual business events at the atomic grain. One row per transaction (each sale, each click, each shipment). Most common type. Contains measures that are additive across all dimensions. (2) PERIODIC SNAPSHOT FACT TABLE — records the state of something at regular intervals. One row per entity per period (account balance per month, inventory level per week). Contains semi-additive measures (can be averaged but not meaningfully summed across time). (3) ACCUMULATING SNAPSHOT FACT TABLE — records the lifecycle of a process with key milestones. One row per entity instance, updated as milestones are reached. Example: order processing with columns for order_date, ship_date, delivery_date, return_date. Multiple date dimension foreign keys. (4) FACTLESS FACT TABLE — records events without numeric measures (promotions, attendance).
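The semi-additive behaviour of periodic snapshots is a frequent point of confusion, so here is a small worked example with invented account balances: summing a balance across accounts for one period is meaningful, but summing the same balance across time is not — average it instead.

```python
# Periodic snapshot fact table: month-end account balances (semi-additive)
snapshots = [
    {"account": "A-1", "month": "2024-01", "balance": 1000.0},
    {"account": "A-1", "month": "2024-02", "balance": 1200.0},
    {"account": "A-2", "month": "2024-01", "balance": 500.0},
    {"account": "A-2", "month": "2024-02", "balance": 500.0},
]

# Additive across the ACCOUNT dimension: total balance held in January
jan_total = sum(r["balance"] for r in snapshots if r["month"] == "2024-01")

# NOT additive across TIME: summing A-1's balances (2200.0) is meaningless —
# average over the periods instead
a1 = [r["balance"] for r in snapshots if r["account"] == "A-1"]
a1_avg = sum(a1) / len(a1)

print(jan_total, a1_avg)
# → 1500.0 1100.0
```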

Slowly Changing Dimensions in DW

Handling changes to dimension attributes over time is crucial for accurate historical analysis. The most common types in DW context: TYPE 1 (Overwrite) — update the dimension record in place. History is lost. Use when: corrections (fixing a typo), attributes where history doesn't matter. TYPE 2 (Add Row) — insert a new dimension row with the new value, add effective_date, expiration_date, and is_current flag. Surrogate key changes; natural key stays the same. Use when: business needs to track attribute changes over time (customer moved from New York to California — both should appear in historical analyses). TYPE 3 (Add Column) — add 'previous_value' and 'current_value' columns. Tracks only one level of change. Use when: only the most recent change matters (previous region vs. current region). TYPE 2 is the most commonly used for important changes. The surrogate key is essential for Type 2 — it differentiates the multiple rows for the same entity.
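The Type 2 mechanics described above — expire the current row, insert a new row with a new surrogate key while the natural key stays the same — can be sketched as follows. The dimension is modeled as a list of dicts, and the column names (effective_date, expiration_date, is_current) follow the text; everything else is illustrative.

```python
from datetime import date

# Dimension table: one row per VERSION of the customer, each with its own
# surrogate key ("sk"); the natural key ("customer_id") never changes
dim_customer = [
    {"sk": 1, "customer_id": "C42", "state": "New York",
     "effective_date": date(2020, 1, 1), "expiration_date": None, "is_current": True},
]

def apply_scd2(dim, customer_id, new_state, change_date):
    """Type 2 change: expire the current row, insert a new versioned row."""
    current = next(r for r in dim if r["customer_id"] == customer_id and r["is_current"])
    if current["state"] == new_state:
        return                                  # nothing changed; nothing to track
    current["expiration_date"] = change_date    # close out the old version
    current["is_current"] = False
    dim.append({
        "sk": max(r["sk"] for r in dim) + 1,    # NEW surrogate key
        "customer_id": customer_id,             # natural key stays the same
        "state": new_state,
        "effective_date": change_date,
        "expiration_date": None,
        "is_current": True,
    })

apply_scd2(dim_customer, "C42", "California", date(2024, 6, 1))
print([(r["sk"], r["state"], r["is_current"]) for r in dim_customer])
# → [(1, 'New York', False), (2, 'California', True)]
```

Facts loaded before the change keep pointing at surrogate key 1, so historical sales still roll up under New York — exactly the behaviour Type 2 exists to provide.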

Business Intelligence Delivery Methods

Multiple ways to deliver analytical insights: (1) STANDARD REPORTS — pre-formatted, scheduled delivery (daily sales report, monthly P&L). Known content, known audience. (2) AD-HOC REPORTS — user-created, on-demand analysis. Users build their own queries and reports. (3) DASHBOARDS — visual displays of key KPIs and metrics with interactive elements (drill-down, filtering). Focus on monitoring and alerting. (4) SCORECARDS — performance tracking against defined targets (Balanced Scorecard methodology). Red/yellow/green indicators. (5) SELF-SERVICE BI — tools that enable business users to explore data independently (Tableau, Power BI, Looker). Reduces IT bottleneck. (6) ADVANCED ANALYTICS — predictive models, machine learning, statistical analysis. Descriptive (what happened), Diagnostic (why), Predictive (what will happen), Prescriptive (what should we do). (7) DATA VISUALIZATION — charts, graphs, maps, infographics for communicating insights effectively.

Data Warehouse Architecture Layers

A typical data warehouse has multiple architectural layers: (1) SOURCE LAYER — operational systems, external data feeds, flat files that provide raw data; (2) STAGING LAYER — temporary storage for data during ETL processing (extraction, cleansing, transformation); (3) INTEGRATION LAYER — the core data warehouse where data is integrated, standardized, and stored in either normalized (Inmon) or dimensional (Kimball) form; (4) ACCESS/PRESENTATION LAYER — data marts, OLAP cubes, and semantic layers that present data to users in business-friendly formats; (5) DELIVERY LAYER — BI tools, reports, dashboards, and applications that end users interact with. Each layer has specific roles in data quality, transformation, and access control.

Data Mart Types

Data marts are subsets of the data warehouse focused on specific business areas: (1) DEPENDENT DATA MART — sourced FROM the enterprise data warehouse. Data is already integrated and consistent. Aligned with Inmon's approach. Advantages: guaranteed consistency with enterprise data. (2) INDEPENDENT DATA MART — sourced DIRECTLY from operational systems, bypassing any enterprise data warehouse. Aligned with early Kimball implementations. Risks: inconsistency between marts, duplicate ETL effort, incompatible definitions. (3) HYBRID DATA MART — sourced primarily from the warehouse but supplemented with data directly from other sources when needed. Most organizations aim for dependent data marts to ensure consistency. Independent data marts should generally be avoided because they create data silos and 'stovepipe' reporting.

Real-Time and Near-Real-Time DW/BI

Traditional DW uses batch processing (nightly loads). Modern requirements demand fresher data: NEAR-REAL-TIME — data updated every few minutes to hours using micro-batch processing, change data capture (CDC), or trickle-feed loading. Example: operational dashboards refreshed every 15 minutes. REAL-TIME (Streaming) — data arrives and is processed continuously using streaming platforms (Kafka, Kinesis, Flink). Example: fraud detection, live monitoring dashboards. LAMBDA ARCHITECTURE — combines batch and stream processing: batch layer for comprehensive processing, speed layer for real-time updates, serving layer that merges both views. KAPPA ARCHITECTURE — simplifies Lambda by using only stream processing for all data. The choice depends on business requirements for data freshness vs. complexity and cost.
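One common near-real-time technique mentioned above, micro-batch loading driven by a high-watermark, can be sketched in a few lines: each run pulls only the rows whose change timestamp exceeds the watermark from the previous run, then advances the watermark. Source rows, field names, and timestamps are invented for the example.

```python
# Micro-batch CDC sketch using a high-watermark on updated_at
source = [
    {"id": 1, "value": "a", "updated_at": 10},
    {"id": 2, "value": "b", "updated_at": 25},
    {"id": 3, "value": "c", "updated_at": 40},
]

target = {}
watermark = 0

def micro_batch(src, tgt, since):
    """Load only rows changed after `since`; return the new watermark."""
    changed = [r for r in src if r["updated_at"] > since]
    for r in changed:
        tgt[r["id"]] = r["value"]            # upsert into the target
    return max((r["updated_at"] for r in changed), default=since)

watermark = micro_batch(source, target, watermark)   # first run loads all 3 rows
source[1]["value"] = "b2"; source[1]["updated_at"] = 55
watermark = micro_batch(source, target, watermark)   # second run picks up only id 2
print(watermark, target)
# → 55 {1: 'a', 2: 'b2', 3: 'c'}
```

A real CDC feed would typically read a transaction log rather than re-scan the source table, but the incremental load-and-advance-watermark loop is the same shape.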

Best Practices

  • Define the grain of fact tables precisely BEFORE designing dimensions — grain is the single most important decision
  • Use conformed dimensions to enable cross-process analysis and prevent inconsistent reporting
  • Build incremental ETL/ELT processes for scalability — avoid reloading entire datasets when possible
  • Implement appropriate SCD strategies based on business requirements (Type 2 for tracking history)
  • Design for query performance, not just data storage — use star schemas, proper indexes, and materialized views
  • Establish a data staging area for ETL processing to separate extraction from transformation and loading
  • Create a semantic/metadata layer to help business users understand warehouse contents without technical knowledge
  • Consider the Inmon/Kimball spectrum based on your organization's needs — hybrid approaches often work best
  • Govern the data lake to prevent it from becoming a data swamp — require metadata, quality checks, and access controls
  • Use Kimball's Bus Matrix to plan dimensional model development and identify conformed dimensions
  • Implement data quality checks at each ETL stage — validate row counts, key metrics, and business rules
  • Consider cloud data warehouse or lakehouse architecture for new implementations — they offer scalability and lower operational burden

💡 Exam Tips

  • DW/BI is 10% of the exam — expect 10 questions
  • Inmon (top-down, normalized EDW first) vs Kimball (bottom-up, dimensional data marts) is a CLASSIC exam topic — know both thoroughly
  • ETL vs ELT: ETL transforms before loading (traditional); ELT loads then transforms (cloud-native)
  • OLAP vs OLTP: understand ALL key differences (read vs write, historical vs current, denormalized vs normalized)
  • Grain is the MOST IMPORTANT decision in dimensional modeling — define it first and never compromise it
  • Know types of facts: Additive (can sum across ALL dimensions), Semi-additive (some dimensions), Non-additive (can't sum)
  • Data Lake (schema-on-read, raw data) vs Data Warehouse (schema-on-write, processed data) — know the trade-offs
  • Conformed dimensions enable enterprise-wide analysis — they are shared across fact tables
  • Three types of fact tables: Transaction, Periodic Snapshot, Accumulating Snapshot
  • Know SCD Types 1 (overwrite), 2 (add row), 3 (add column) — Type 2 is most common for history tracking
  • Dependent data marts (from warehouse) are preferred over independent data marts (from sources directly)