Overview
Data Modelling and Design is the process of discovering, analyzing, representing, and communicating data requirements in a precise form called a data model. Data models serve as blueprints for designing and building databases and are the primary means of communicating data structure requirements between business stakeholders and technical teams. Effective data modelling bridges the gap between business requirements and technical implementation.

The discipline encompasses several modelling approaches: Entity-Relationship (ER) modelling for relational databases, dimensional modelling for data warehouses, object-oriented modelling for applications, graph modelling for network data, and NoSQL modelling for non-relational data stores. The Three Schema Architecture — Conceptual, Logical, and Physical models — provides a framework for progressively refining data requirements from business concepts to technical implementation.

Data modelling is both an art and a science: it requires deep understanding of business requirements, normalization theory, design patterns, and technology constraints. A well-designed data model reduces storage costs, improves query performance, ensures data integrity, and makes systems easier to maintain and extend; poor data models lead to data quality issues, performance problems, and expensive rework. DMBOK2 emphasizes that models should be treated as enterprise assets, versioned, governed, and maintained throughout their lifecycle.
Key Concepts
Three Schema Architecture (Conceptual — Logical — Physical)
The three-level modelling approach progresses from abstract to concrete: CONCEPTUAL MODEL — the highest-level business view. Contains entities and relationships only — NO attributes, keys, or data types. Purpose: communicate business concepts to stakeholders. LOGICAL MODEL — detailed business view with full attributes, primary/foreign keys, and normalization — but NO technology-specific details. Independent of any particular database platform. PHYSICAL MODEL — technology-specific implementation. Includes table names, column data types, indexes, partitions, tablespaces, storage parameters. Audience: database administrators and developers.
Entity-Relationship (ER) Modelling
The most widely used modelling technique for relational databases. Core components: ENTITY — a thing of significance about which data is stored (Customer, Order, Product); ATTRIBUTE — a property of an entity (CustomerName, OrderDate); RELATIONSHIP — an association between entities (Customer places Order). Notation styles: CHEN — uses rectangles (entities), ovals (attributes), diamonds (relationships); CROW'S FOOT (IE/Information Engineering) — most common; uses lines with crow's feet for many, circles for optional, bars for mandatory; IDEF1X — used in government and defense; BARKER — Oracle's notation. M:M relationships are always resolved into two 1:M relationships through an associative/junction entity in the logical model.
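The M:M resolution described above can be sketched in plain Python. This is an illustrative Student/Course example, not a real schema; all table and column names are hypothetical:

```python
# Sketch: resolving a many-to-many Student-Course relationship with a
# junction (associative) entity, as required in the logical model.

students = {1: "Alice", 2: "Bob"}
courses = {10: "Databases", 20: "Statistics"}

# Junction entity: one row per (student, course) pair, turning one M:M
# into two 1:M relationships. It can also carry its own attributes (grade).
enrollments = [
    {"student_id": 1, "course_id": 10, "grade": "A"},
    {"student_id": 1, "course_id": 20, "grade": "B"},
    {"student_id": 2, "course_id": 10, "grade": "A"},
]

def courses_for(student_id):
    """Traverse Student 1:M Enrollment M:1 Course."""
    return [courses[e["course_id"]] for e in enrollments
            if e["student_id"] == student_id]

print(courses_for(1))  # ['Databases', 'Statistics']
```

In a relational implementation the junction table would carry a composite primary key (student_id, course_id) and two foreign keys, one to each parent.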
Normalization (1NF through BCNF)
Normalization reduces redundancy and ensures data integrity through progressive normal forms: FIRST NORMAL FORM (1NF) — all attributes contain atomic (indivisible) values; no repeating groups or arrays. SECOND NORMAL FORM (2NF) — must be in 1NF AND every non-key attribute must depend on the ENTIRE primary key (no partial dependencies). Partial dependencies can only occur with composite keys, so a 1NF table with a single-column primary key is automatically in 2NF. THIRD NORMAL FORM (3NF) — must be in 2NF AND no non-key attribute depends on another non-key attribute (no transitive dependencies). BOYCE-CODD NORMAL FORM (BCNF) — every determinant is a candidate key. Stricter than 3NF. 3NF is the standard for operational/OLTP databases. Remember: The key (1NF), the whole key (2NF), and nothing but the key (3NF), so help me Codd.
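A transitive dependency and its 3NF fix can be shown with a small sketch. The data and names here are invented for illustration: customer_city depends on customer_id, not on the order key, so it moves to its own table:

```python
# Unnormalized (violates 3NF): the customer's city is repeated on every
# order row, so one customer move requires many updates.
orders_flat = [
    {"order_id": 1, "customer_id": 7, "customer_city": "Oslo", "total": 100},
    {"order_id": 2, "customer_id": 7, "customer_city": "Oslo", "total": 250},
    {"order_id": 3, "customer_id": 8, "customer_city": "Bergen", "total": 80},
]

# 3NF decomposition: non-key facts about the customer move to a
# customer table keyed by customer_id.
customers = {row["customer_id"]: {"city": row["customer_city"]}
             for row in orders_flat}
orders = [{"order_id": r["order_id"], "customer_id": r["customer_id"],
           "total": r["total"]} for r in orders_flat]

# The city is now stored once per customer; an update touches one row.
customers[7]["city"] = "Trondheim"
print(customers[7]["city"])  # Trondheim
```

The redundancy is gone: no order row carries a city, so there is no way for two orders to disagree about where customer 7 lives.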
Dimensional Modelling (Star and Snowflake Schemas)
Design technique optimized for analytical querying in data warehouses. Core components: FACT TABLE — stores measurable business events/transactions. Contains numeric measures and foreign keys to dimension tables. Types of facts: ADDITIVE (can be summed across all dimensions — revenue), SEMI-ADDITIVE (can be summed across some dimensions — account balance), NON-ADDITIVE (cannot be summed — ratios, percentages). DIMENSION TABLE — stores descriptive context for facts. GRAIN — the level of detail stored in the fact table; the most critical design decision. STAR SCHEMA — fact table surrounded by denormalized dimension tables (simpler, faster queries). SNOWFLAKE SCHEMA — dimension tables are normalized into sub-tables (saves storage, more complex queries). Star schema is generally preferred for its query simplicity.
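A minimal star-schema sketch, with invented data: one fact table holding foreign keys into two denormalized dimensions, and an aggregation of an additive measure (revenue) by a dimension attribute:

```python
dim_date = {1: {"date": "2024-01-01", "month": "2024-01"},
            2: {"date": "2024-02-01", "month": "2024-02"}}
dim_product = {100: {"name": "Widget", "category": "Hardware"},
               200: {"name": "Manual", "category": "Books"}}

# Grain: one row per product per day (decided before anything else).
fact_sales = [
    {"date_key": 1, "product_key": 100, "revenue": 500.0},
    {"date_key": 1, "product_key": 200, "revenue": 40.0},
    {"date_key": 2, "product_key": 100, "revenue": 700.0},
]

def revenue_by_category():
    """Join fact rows to the product dimension and sum the additive
    measure; revenue can be safely summed across every dimension."""
    totals = {}
    for row in fact_sales:
        cat = dim_product[row["product_key"]]["category"]
        totals[cat] = totals.get(cat, 0.0) + row["revenue"]
    return totals

print(revenue_by_category())  # {'Hardware': 1200.0, 'Books': 40.0}
```

In a snowflake variant, category would sit in a separate normalized sub-table keyed from dim_product, adding one more lookup per fact row.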
Slowly Changing Dimensions (SCD Types)
Techniques for handling changes to dimension attributes over time: TYPE 0 — RETAIN ORIGINAL: Never change; keep the original value forever. TYPE 1 — OVERWRITE: Update the current value, losing history (e.g., correct a misspelled name). TYPE 2 — ADD NEW ROW: Insert a new row with the new value, keeping historical rows. Uses effective_date/expiration_date and is_current flag. Preserves full history. Most complex but most common for important changes. TYPE 3 — ADD NEW COLUMN: Add a previous_value column. Tracks only one change. TYPE 6 (HYBRID 1+2+3) — Combines type 2 (new rows with history) with type 1 (overwrite current values in historical rows) and type 3 (previous value column). The exam heavily tests Types 1, 2, and 3. Type 2 is the most commonly used for tracking changes.
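The Type 2 mechanics (expire the current row, insert a new one) can be sketched as follows. Column names follow the convention above (effective_date, expiration_date, is_current); the customer data is invented:

```python
from datetime import date

dim_customer = [
    {"sk": 1, "customer_id": "C7", "city": "Oslo",
     "effective_date": date(2020, 1, 1), "expiration_date": None,
     "is_current": True},
]

def scd2_update(rows, customer_id, new_city, change_date):
    """SCD Type 2: close out the current version and append a new row,
    preserving the full history of the attribute."""
    for row in rows:
        if row["customer_id"] == customer_id and row["is_current"]:
            if row["city"] == new_city:
                return  # no change detected, nothing to do
            row["expiration_date"] = change_date
            row["is_current"] = False
    rows.append({"sk": max(r["sk"] for r in rows) + 1,
                 "customer_id": customer_id, "city": new_city,
                 "effective_date": change_date, "expiration_date": None,
                 "is_current": True})

scd2_update(dim_customer, "C7", "Bergen", date(2024, 6, 1))
print(len(dim_customer))  # 2: one history row plus one current row
```

A Type 1 change would simply assign the new city over the old value in place; a Type 3 change would copy the old city into a previous_city column before overwriting.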
Surrogate Keys vs Natural Keys
NATURAL KEY (Business Key) — a real-world attribute that uniquely identifies a record (SSN, email, product SKU). Advantages: meaningful to users, no translation needed. Disadvantages: can change, may contain sensitive data, may be composite. SURROGATE KEY — a system-generated artificial identifier (auto-increment integer, UUID) with no business meaning. Advantages: stable (never changes), simple, efficient for joins, protects against business key changes. Disadvantages: meaningless to users, requires lookups. BEST PRACTICE for data warehouses: always use surrogate keys for dimension tables while preserving the natural key as a non-key attribute. This decouples the warehouse from source system key changes.
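Surrogate key assignment during a dimension load can be sketched like this (the class name and SKU values are illustrative):

```python
import itertools

class SurrogateKeyGenerator:
    """Mints a stable, system-generated integer for each natural
    (business) key; the natural key survives as a non-key attribute."""

    def __init__(self):
        self._next = itertools.count(1)   # auto-increment style
        self._by_natural = {}             # natural key -> surrogate key

    def key_for(self, natural_key):
        # First sight of a natural key mints the next surrogate; repeat
        # lookups return the same value, insulating facts from source
        # system key changes.
        if natural_key not in self._by_natural:
            self._by_natural[natural_key] = next(self._next)
        return self._by_natural[natural_key]

gen = SurrogateKeyGenerator()
print(gen.key_for("SKU-123"))  # 1
print(gen.key_for("SKU-999"))  # 2
print(gen.key_for("SKU-123"))  # 1 (stable on repeat lookups)
```

If the source system renames SKU-123, only this mapping needs a new entry; every fact row keyed on surrogate 1 is untouched.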
Denormalization
The intentional introduction of controlled redundancy to improve read/query performance. Denormalization is the OPPOSITE of normalization and should be a conscious design decision, NOT accidental redundancy. Techniques include: (1) ADDING REDUNDANT COLUMNS — storing a frequently joined column directly in the table; (2) PRE-COMPUTED AGGREGATES — storing calculated totals; (3) MERGED TABLES — combining tables that are always queried together; (4) REPEATING GROUPS — storing arrays when the number of items is small and fixed. Used primarily in: dimensional models (star schemas), read-heavy reporting databases, NoSQL databases, and performance-critical operational systems. Trade-off: faster reads but slower writes, more storage, and risk of data inconsistency.
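Technique (2), a pre-computed aggregate, illustrates the read/write trade-off directly. This sketch uses invented names; the redundant lifetime_total lets reads skip scanning orders, at the cost of extra work on every write:

```python
customers = {7: {"name": "Alice", "lifetime_total": 0.0}}
orders = []

def add_order(customer_id, amount):
    orders.append({"customer_id": customer_id, "amount": amount})
    # The write path is slower and riskier: the redundant aggregate
    # must be maintained in the same transaction, or it drifts out of
    # sync with the detail rows (the inconsistency risk noted above).
    customers[customer_id]["lifetime_total"] += amount

add_order(7, 100.0)
add_order(7, 42.5)
# The read path is fast: no join or scan over orders is needed.
print(customers[7]["lifetime_total"])  # 142.5
```

The deliberate part is the pairing: the redundant column exists only because add_order is the single, controlled place that maintains it.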
Data Model Governance and Standards
Data models should be managed as enterprise assets. Key practices: (1) NAMING CONVENTIONS — consistent, meaningful names for entities, attributes, tables, and columns; (2) VERSION CONTROL — maintain history of model changes with reasons; (3) MODEL REPOSITORY — centralized storage for all data models (enterprise, logical, physical); (4) REVIEW AND APPROVAL — formal review process before models are implemented; (5) IMPACT ANALYSIS — assess downstream effects before changing models; (6) DOCUMENTATION — business definitions, design decisions, and assumptions captured in the model metadata. Tools like ERwin, PowerDesigner, and ER/Studio support enterprise data modelling.
Cardinality and Optionality
CARDINALITY defines the numerical relationship between entities: ONE-TO-ONE (1:1) — each instance of Entity A relates to exactly one instance of Entity B (Person:Passport). ONE-TO-MANY (1:M) — each instance of A relates to multiple instances of B (Customer:Orders). MANY-TO-MANY (M:M) — multiple instances of A relate to multiple instances of B (Students:Courses). M:M relationships must be resolved into two 1:M relationships through a junction/associative entity. OPTIONALITY defines whether participation is mandatory or optional: MANDATORY — every instance must participate. OPTIONAL — an instance may or may not participate. In Crow's Foot notation: bar = mandatory, circle = optional, crow's foot = many.
Conformed Dimensions
Shared dimension tables that are used consistently across multiple fact tables and data marts. A core concept in Kimball's dimensional modelling methodology. Example: a Date dimension and Customer dimension used by both Sales and Marketing fact tables, ensuring that reports from different business areas can be combined and compared. Requirements: same keys, same attributes, same definitions, same values. Conformed dimensions enable the enterprise data warehouse bus architecture where independently developed data marts can be integrated. Without conformed dimensions, each department creates its own incompatible definitions leading to inconsistent reporting.
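The Sales/Marketing example can be sketched with a single conformed Date dimension feeding both fact tables (all data invented):

```python
dim_date = {1: {"month": "2024-01"}, 2: {"month": "2024-02"}}

fact_sales = [{"date_key": 1, "revenue": 900.0},
              {"date_key": 2, "revenue": 1100.0}]
fact_marketing = [{"date_key": 1, "spend": 200.0},
                  {"date_key": 2, "spend": 300.0}]

def by_month(fact, measure):
    """Aggregate any fact table by the shared dimension's attribute."""
    out = {}
    for row in fact:
        month = dim_date[row["date_key"]]["month"]
        out[month] = out.get(month, 0.0) + row[measure]
    return out

# Because both facts conform to the same Date dimension (same keys,
# same attributes, same values), the results line up by month.
revenue = by_month(fact_sales, "revenue")
spend = by_month(fact_marketing, "spend")
print({m: revenue[m] - spend[m] for m in revenue})  # net by month
```

If Marketing had used its own date table with different keys or month labels, the final dictionary comprehension would silently misalign or fail, which is exactly the inconsistent-reporting problem conformance prevents.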
Factless Fact Tables
A fact table that records the occurrence of an event without any associated numeric measures. Examples: (1) COVERAGE FACT TABLE — records which products were on promotion in which stores on which dates (allows analysis of what DIDN'T sell during a promotion); (2) EVENT TRACKING — records student attendance, employee check-in/check-out, or website page views where the count of rows is the measure. The absence of measures doesn't mean the table lacks value — the relationships between dimensions and the row count itself provide analytical capability.
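The attendance example can be sketched in a few lines; note the fact rows carry only dimension keys, and counting rows is the analysis (data is invented):

```python
fact_attendance = [  # grain: one row per student per class per date
    {"student_key": 1, "class_key": 10, "date_key": 1},
    {"student_key": 2, "class_key": 10, "date_key": 1},
    {"student_key": 1, "class_key": 10, "date_key": 2},
]

def attendance_count(class_key, date_key):
    """COUNT(*) is the measure; there is no numeric fact column to sum."""
    return sum(1 for r in fact_attendance
               if r["class_key"] == class_key and r["date_key"] == date_key)

print(attendance_count(10, 1))  # 2
```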
Junk Dimensions and Degenerate Dimensions
JUNK DIMENSION — combines miscellaneous low-cardinality flags, indicators, and codes into a single dimension table to avoid cluttering the fact table. Example: combining OrderType (online/store), PaymentMethod (cash/credit/debit), GiftWrap (yes/no), and RushShipping (yes/no) into a single TransactionProfile dimension. Creates all valid combinations as rows. DEGENERATE DIMENSION — a dimension key in the fact table that has no corresponding dimension table because all useful attributes are already in other dimensions. Example: Invoice Number or Order Number stored directly in the fact table. No separate dimension table is needed because the number itself is the only useful attribute.
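Building a junk dimension is essentially enumerating every valid flag combination, which a Cartesian product does directly. The attribute names mirror the example above and are illustrative:

```python
import itertools

order_types = ["online", "store"]
payment_methods = ["cash", "credit", "debit"]
gift_wrap = [True, False]

# All valid combinations become rows of a single dimension, so the fact
# table stores one small key instead of three separate flag columns.
junk_dim = {
    i + 1: {"order_type": ot, "payment_method": pm, "gift_wrap": gw}
    for i, (ot, pm, gw) in enumerate(
        itertools.product(order_types, payment_methods, gift_wrap))
}

print(len(junk_dim))  # 2 * 3 * 2 = 12 rows

def key_for(order_type, payment_method, gw):
    """Look up the junk-dimension key for a flag combination."""
    for k, v in junk_dim.items():
        if v == {"order_type": order_type,
                 "payment_method": payment_method, "gift_wrap": gw}:
            return k

print(key_for("online", "credit", False))  # 4
```

This works because the flags are low-cardinality: 12 rows is trivial, whereas junking a high-cardinality attribute would explode the combination count.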
Graph Data Modelling
Models data as nodes (vertices) and edges (relationships) in a graph structure. Ideal for highly connected data: social networks, fraud detection, knowledge graphs, recommendation engines, supply chain networks. Components: NODES — entities (Person, Product, Company); EDGES — relationships between nodes (KNOWS, PURCHASED, WORKS_FOR); PROPERTIES — attributes on both nodes and edges. Advantages over relational models: natural representation of relationships, efficient traversal of connected data, flexible schema. Common databases: Neo4j, Amazon Neptune, ArangoDB. Uses graph query languages like Cypher (Neo4j) or Gremlin.
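A property graph can be sketched with adjacency lists; in Neo4j the two-hop query below would be a Cypher MATCH over [:KNOWS] edges, and this is only a plain-Python analogue with invented data:

```python
from collections import defaultdict

# NODES with properties.
nodes = {"alice": {"label": "Person"}, "bob": {"label": "Person"},
         "carol": {"label": "Person"}}

# EDGES are typed and can carry their own properties.
edges = defaultdict(list)   # source -> [(edge_type, target, properties)]

def add_edge(src, edge_type, dst, **props):
    edges[src].append((edge_type, dst, props))

add_edge("alice", "KNOWS", "bob", since=2019)
add_edge("bob", "KNOWS", "carol", since=2021)

def neighbors(node, edge_type):
    """Traverse outgoing edges of one type: the cheap core operation
    that graph databases optimize for."""
    return [dst for etype, dst, _ in edges[node] if etype == edge_type]

# Friends of friends: two hops from alice, no join tables required.
fof = {n2 for n1 in neighbors("alice", "KNOWS")
       for n2 in neighbors(n1, "KNOWS")}
print(fof)  # {'carol'}
```

In a relational model the same query needs a self-join on a relationship table per hop; in the graph model each hop is a direct edge lookup.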
Data Vault Modelling
An approach designed for enterprise data warehouses that need to handle multiple source systems, changing business rules, and full audit trails. Three core entity types: HUBS — store unique business keys (CustomerHub: CustomerID); LINKS — store relationships between hubs (CustomerOrderLink: CustomerID, OrderID); SATELLITES — store descriptive attributes and history, always linked to a hub or link (CustomerSatellite: name, address, effective_date). Advantages: highly adaptable to change, full audit trail, parallel loading from multiple sources, supports automation. Developed by Dan Linstedt as an alternative to Kimball and Inmon approaches for the raw data layer of a warehouse.
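The three entity types can be sketched for the customer/order example. These structures are deliberately simplified and illustrative; real Data Vault implementations also carry record sources and typically hash keys:

```python
from datetime import date

# HUBS: unique business keys only, plus load metadata.
hub_customer = {"C7": {"load_date": date(2024, 1, 1)}}
hub_order = {"O1": {"load_date": date(2024, 1, 2)}}

# LINK: a relationship between hubs, nothing else.
link_customer_order = [{"customer_id": "C7", "order_id": "O1",
                        "load_date": date(2024, 1, 2)}]

# SATELLITE: descriptive attributes with history, hung off one hub.
sat_customer = [
    {"customer_id": "C7", "name": "Alice", "city": "Oslo",
     "effective_date": date(2024, 1, 1)},
    {"customer_id": "C7", "name": "Alice", "city": "Bergen",
     "effective_date": date(2024, 6, 1)},
]

def current_attributes(customer_id):
    """Latest satellite row for a hub key; earlier rows remain as the
    full audit trail."""
    rows = [r for r in sat_customer if r["customer_id"] == customer_id]
    return max(rows, key=lambda r: r["effective_date"])

print(current_attributes("C7")["city"])  # Bergen
```

The separation is what makes the model adaptable: a new source system adds satellite rows, and a new relationship adds a link table, without restructuring existing hubs.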
Best Practices
- ✓ Always start with a conceptual model to align business understanding before moving to logical and physical
- ✓ Use consistent naming conventions across all models — document and enforce the standards
- ✓ Normalize operational databases to at least 3NF to eliminate redundancy and ensure data integrity
- ✓ Use dimensional models (star/snowflake schemas) for analytical workloads — star schema for simplicity
- ✓ Define the grain of fact tables PRECISELY before designing the rest of the dimensional model
- ✓ Use surrogate keys for dimension tables in data warehouses to insulate from source system changes
- ✓ Document all design decisions, business rules, and assumptions within the model or its metadata
- ✓ Involve business stakeholders in conceptual and logical modelling — they validate business correctness
- ✓ Use conformed dimensions to enable cross-process analysis and enterprise-wide reporting
- ✓ Review and validate models through formal walkthroughs before implementation begins
- ✓ Choose SCD type based on business requirements: Type 1 for corrections, Type 2 for tracking history
- ✓ Maintain model versioning and change history — treat models as enterprise assets
💡 Exam Tips
- ★ Data Modelling is 11% of the exam — one of the Big Four — expect 11 questions
- ★ Know the three levels COLD: Conceptual (entities/relationships only) → Logical (+ attributes, keys, normalized) → Physical (+ data types, indexes, technology-specific)
- ★ Normal forms are HEAVILY tested: 1NF (atomic values), 2NF (no partial dependencies), 3NF (no transitive dependencies), BCNF (every determinant is a candidate key)
- ★ Remember: The key, the whole key, and nothing but the key — covers 1NF, 2NF, 3NF
- ★ SCD Types 1, 2, 3 are FREQUENTLY tested: Type 1 = overwrite, Type 2 = new row with history, Type 3 = new column
- ★ Star schema vs Snowflake: Star = denormalized dimensions (simpler), Snowflake = normalized dimensions (smaller)
- ★ Grain is the MOST IMPORTANT decision in dimensional modelling — define it first
- ★ Fact types: Additive (sum across all dimensions), Semi-additive (some dimensions), Non-additive (can't sum)
- ★ Know Crow's Foot notation: bar = mandatory, circle = optional, crow's foot = many
- ★ Surrogate keys vs Natural keys — surrogate is preferred in warehouses
- ★ M:M relationships must be resolved through a junction/associative entity
- ★ Kimball (bottom-up, dimensional) vs Inmon (top-down, normalized) — know both approaches