Overview
Data Architecture defines the blueprint for managing data assets by aligning data structures, policies, and integration mechanisms with the organization's business strategy. It is the foundational layer that describes how data is collected, stored, arranged, integrated, and consumed across the enterprise. Data Architecture sits within the broader discipline of Enterprise Architecture and works alongside business, application, and technology architectures to form a cohesive whole. In DMBOK2, Data Architecture is covered in Chapter 4 and carries 6% weight on the CDMP exam.
A well-designed Data Architecture provides several critical benefits: it enables data sharing and reuse across business units, reduces redundancy by identifying and eliminating duplicate data stores, guides technology selection by establishing standards for databases and integration platforms, supports regulatory compliance by defining data flows and retention policies, and accelerates project delivery by providing reusable patterns and models.
Data architects create multiple types of artifacts, including conceptual, logical, and physical data models, data flow diagrams, data landscape maps, enterprise data models, and technology architecture diagrams. These artifacts serve different audiences — conceptual models communicate with business stakeholders, while physical models guide database developers.
Data Architecture frameworks such as TOGAF (The Open Group Architecture Framework) and Zachman provide structured approaches to developing and maintaining architecture. TOGAF's Architecture Development Method (ADM) offers an iterative cycle for architecture creation, while the Zachman Framework provides a two-dimensional matrix organizing architecture artifacts by perspective (planner, owner, designer, builder, implementer) and interrogative (what, how, where, who, when, why).
Modern data architecture must also address emerging patterns such as data mesh (domain-oriented decentralized data ownership), data fabric (intelligent integration across hybrid and multi-cloud environments), data lakes (schema-on-read repositories for raw data), and cloud-native architectures. Architecture governance ensures that projects conform to architectural standards, technology decisions are evaluated consistently, and the architecture evolves in alignment with business needs.
Key Concepts
Enterprise Data Architecture (EDA)
Enterprise Data Architecture is the master plan that defines an organization's entire data landscape — what data exists, where it resides, how it flows between systems, and how it supports business processes. EDA operates at the enterprise level, looking across all business units and systems rather than focusing on a single application or project. Key deliverables of EDA include the enterprise data model (a high-level conceptual model showing major data subject areas and their relationships), data flow diagrams (showing how data moves between systems and processes), and technology architecture documents (specifying database platforms, integration tools, and storage standards). EDA differs from project-level data architecture in that it prioritizes cross-functional consistency and reusability over individual project optimization. Successful EDA requires ongoing governance to ensure new systems conform to enterprise standards.
Data Architecture Artifacts
Data architects produce multiple artifact types that serve different purposes and audiences. KEY ARTIFACTS include: (1) Enterprise Data Model — high-level view of major data subject areas (Customer, Product, Order) and their relationships; (2) Conceptual Data Models — business-focused models showing entities and relationships without technical detail; (3) Logical Data Models — normalized, technology-independent models with entities, attributes, keys, and relationships; (4) Physical Data Models — technology-specific implementations including tables, columns, data types, indexes, and partitioning; (5) Data Flow Diagrams — showing how data moves between systems, processes, and data stores; (6) Data Lineage Maps — tracing data from source to consumption; (7) Technology Architecture Diagrams — showing databases, integration platforms, and infrastructure. Artifacts should be maintained in an architecture repository and version-controlled.
Conceptual, Logical, and Physical Data Models
These three model levels represent progressive refinement from business concepts to technical implementation. CONCEPTUAL DATA MODEL: highest abstraction level, shows major entities (Customer, Product, Order) and their relationships, no attributes or keys, used for executive and business communication, typically fits on one page. LOGICAL DATA MODEL: detailed business representation, includes all entities, attributes, primary/foreign keys, relationships with cardinality, fully normalized (usually to 3NF), technology-independent — does NOT specify data types, indexes, or physical storage. PHYSICAL DATA MODEL: technology-specific implementation, includes tables, columns, platform-specific data types, indexes, partitions, views, stored procedures, denormalization decisions for performance. The critical exam distinction: logical models are TECHNOLOGY-INDEPENDENT while physical models are TECHNOLOGY-SPECIFIC. Moving from conceptual to physical is called 'forward engineering'; reverse engineering goes from physical back to logical.
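The forward-engineering step can be sketched in code. This is a minimal illustration, not a real modeling tool: the entities, attributes, and keys below are invented examples, and the generated DDL is deliberately simplified (every column becomes TEXT, and foreign keys and indexes are omitted). The point is the transition from a technology-independent logical model to technology-specific physical DDL.

```python
import sqlite3

# A tiny logical model: entities, attributes, and keys, technology-independent.
# Entity and attribute names are illustrative, not from any standard model.
logical_model = {
    "Customer": {"attributes": ["customer_id", "name", "email"], "pk": "customer_id"},
    "Order": {"attributes": ["order_id", "customer_id", "order_date"], "pk": "order_id"},
}

def forward_engineer(model):
    """Generate physical DDL (SQLite dialect) from the logical model.

    The physical step adds technology-specific detail: data types and a
    primary-key constraint. Real tools also handle FKs, indexes, partitioning."""
    ddl = []
    for entity, spec in model.items():
        cols = ", ".join(
            f"{a} TEXT PRIMARY KEY" if a == spec["pk"] else f"{a} TEXT"
            for a in spec["attributes"]
        )
        # Quote table names: "Order" is a reserved word in SQL.
        ddl.append(f'CREATE TABLE "{entity}" ({cols});')
    return ddl

conn = sqlite3.connect(":memory:")
for stmt in forward_engineer(logical_model):
    conn.execute(stmt)

tables = {row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")}
print(tables)  # {'Customer', 'Order'}
```

Reverse engineering would run the opposite direction: reading `sqlite_master` (or a platform's catalog views) and stripping the physical detail back out to recover a logical model.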
TOGAF and the Architecture Development Method (ADM)
TOGAF (The Open Group Architecture Framework) is the most widely used enterprise architecture framework. Its core component is the Architecture Development Method (ADM), an iterative cycle with phases: Preliminary Phase (establish architecture capability), Phase A (Architecture Vision), Phase B (Business Architecture), Phase C (Information Systems Architecture — includes DATA architecture), Phase D (Technology Architecture), Phase E (Opportunities and Solutions), Phase F (Migration Planning), Phase G (Implementation Governance), Phase H (Architecture Change Management), and Requirements Management (ongoing, central). Data Architecture is specifically addressed in Phase C alongside Application Architecture. TOGAF uses the concept of architecture building blocks (ABBs) for reusable architecture components and solution building blocks (SBBs) for specific implementations. The Architecture Repository stores all architecture artifacts, standards, and reference models.
Zachman Framework
The Zachman Framework is a two-dimensional classification scheme for organizing enterprise architecture artifacts. ROWS represent perspectives: Scope/Planner (executive/business context), Business Model/Owner (business concepts), System Model/Designer (logical design), Technology Model/Builder (physical design), Detailed Representation/Implementer (component specifications), and Functioning Enterprise. COLUMNS represent interrogatives: What (data/entities), How (functions/processes), Where (network/locations), Who (people/roles), When (time/schedules), Why (motivation/strategies). Each cell in the matrix represents a unique artifact — for example, Row 2 (Owner) x Column 1 (What) produces the conceptual data model. Unlike TOGAF, Zachman is a classification taxonomy, NOT a methodology — it does not prescribe a process for creating artifacts. It helps ensure completeness by identifying what artifacts exist or are needed.
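The matrix nature of Zachman can be made concrete with a small lookup table. This sketch is illustrative only: the cell contents are example artifact names, not the framework's official cell labels, and it deliberately classifies rather than creates anything, mirroring the taxonomy-not-methodology point.

```python
# The Zachman matrix as a lookup table: perspective rows x interrogative
# columns. Cell contents here are invented examples for the "What" column.
perspectives = ["Planner", "Owner", "Designer", "Builder", "Implementer"]
interrogatives = ["What", "How", "Where", "Who", "When", "Why"]

cells = {
    ("Owner", "What"): "Conceptual data model",
    ("Designer", "What"): "Logical data model",
    ("Builder", "What"): "Physical data model",
}

def artifact_for(perspective, interrogative):
    """Classify a cell. Zachman identifies WHICH artifact belongs where;
    it prescribes no process for producing it (that is TOGAF's ADM)."""
    if perspective not in perspectives or interrogative not in interrogatives:
        raise ValueError("outside the matrix")
    # An empty cell signals a completeness gap: an artifact not yet produced.
    return cells.get((perspective, interrogative), "artifact to be identified")

print(artifact_for("Owner", "What"))  # Conceptual data model
```

Iterating over every (perspective, interrogative) pair and listing the empty cells is exactly the completeness check the framework is used for.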
Data Flow Architecture
Data Flow Architecture describes how data moves through an organization's systems and processes. Key components include: SOURCE SYSTEMS (where data originates — transactional databases, external feeds, IoT devices), INTEGRATION LAYER (ETL/ELT pipelines, APIs, message queues that move and transform data), STORAGE LAYER (data warehouses, data lakes, operational data stores), CONSUMPTION LAYER (BI tools, analytics platforms, operational applications), and GOVERNANCE LAYER (metadata management, data quality, security controls that span all layers). Data flow diagrams (DFDs) visually depict these movements using standard notation: external entities (rectangles), processes (circles/rounded rectangles), data stores (parallel lines), and data flows (arrows). Understanding data flow is essential for impact analysis, data lineage, regulatory compliance, and system integration planning.
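The layered flow above can be sketched end to end in a few lines. This is a toy pipeline under invented record shapes, not a real integration stack: each layer is a plain function or list standing in for the systems named in the text.

```python
# Toy data flow: source -> integration (transform) -> storage -> consumption.
# Record fields and amounts are hypothetical.

def source_system():
    """SOURCE layer: raw transactional records, amounts still strings."""
    yield {"order_id": 1, "amount": "20.00"}
    yield {"order_id": 2, "amount": "5.00"}

def integration_layer(records):
    """INTEGRATION layer: an ETL-style transformation (type conversion)."""
    for r in records:
        yield {**r, "amount": float(r["amount"])}

storage_layer = []  # stands in for a warehouse table / data store

def consumption_layer(store):
    """CONSUMPTION layer: a BI-style aggregate over stored data."""
    return sum(r["amount"] for r in store)

storage_layer.extend(integration_layer(source_system()))
total_revenue = consumption_layer(storage_layer)
print(total_revenue)  # 25.0
```

Tracing a field like `amount` through these functions is a miniature version of data lineage: the same walk an architect performs across real systems for impact analysis.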
Data Lake vs Data Warehouse vs Data Lakehouse
Three major patterns for analytical data storage. DATA WAREHOUSE: structured, schema-on-write repository using dimensional modeling (star/snowflake schemas). Data is cleaned, transformed, and loaded via ETL. Optimized for SQL queries and BI reporting. Strong data quality and governance. Examples: Snowflake, Amazon Redshift, Google BigQuery. DATA LAKE: raw data repository using schema-on-read approach. Stores structured, semi-structured, and unstructured data in native format. Data is loaded as-is (ELT) and transformed when read. Supports diverse analytics including machine learning. Risk of becoming a 'data swamp' without governance. Examples: Amazon S3, Azure Data Lake Storage, Hadoop HDFS. DATA LAKEHOUSE: hybrid combining data lake flexibility with warehouse governance and performance. Adds a metadata/governance layer (like Delta Lake or Apache Iceberg) on top of data lake storage to enable ACID transactions, schema enforcement, and time travel. Supports both BI and data science workloads from a single platform.
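The schema-on-write versus schema-on-read distinction can be shown with two loading styles. This sketch uses an invented "event" record and mirrors no specific product's API: the warehouse path validates before storing, while the lake path stores raw bytes and applies structure only at read time.

```python
import json

REQUIRED = {"event_id", "user_id"}  # hypothetical schema for an event record

def warehouse_load(table, record):
    """Schema-on-WRITE: validate before storing; nonconforming data is rejected."""
    if not REQUIRED <= record.keys():
        raise ValueError(f"missing fields: {REQUIRED - record.keys()}")
    table.append(record)

def lake_load(lake, raw_bytes):
    """Schema-on-READ: store raw bytes as-is; no structure is imposed on load."""
    lake.append(raw_bytes)

def lake_read(lake):
    """The schema is applied when reading; unusable blobs are skipped here
    (an ungoverned pile of such blobs is the 'data swamp' risk)."""
    for blob in lake:
        record = json.loads(blob)
        if REQUIRED <= record.keys():
            yield record

warehouse, lake = [], []
warehouse_load(warehouse, {"event_id": 1, "user_id": "a"})
lake_load(lake, b'{"event_id": 2, "user_id": "b"}')
lake_load(lake, b'{"clickstream": "raw"}')  # accepted as-is: no upfront schema

print(len(warehouse), len(lake), len(list(lake_read(lake))))  # 1 2 1
```

A lakehouse layer such as Delta Lake or Iceberg effectively moves the validation back toward write time while keeping the lake's open storage underneath.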
Data Mesh
Data Mesh is a decentralized sociotechnical approach to data architecture proposed by Zhamak Dehghani, built on four core principles: (1) DOMAIN-ORIENTED DECENTRALIZED DATA OWNERSHIP — data is owned and managed by the business domains that produce it, not a central data team. Each domain treats its data as a product. (2) DATA AS A PRODUCT — each domain publishes its data with product-quality standards including SLAs, documentation, discoverability, and defined interfaces. (3) SELF-SERVE DATA INFRASTRUCTURE PLATFORM — a centralized platform team provides tools and infrastructure that domain teams use to build, deploy, and manage their data products without needing deep infrastructure expertise. (4) FEDERATED COMPUTATIONAL GOVERNANCE — governance policies are defined centrally but enforced computationally through automation, ensuring compliance without bottlenecking domain teams. Data Mesh contrasts with centralized architectures (monolithic data warehouses/lakes) by distributing responsibility to those who best understand the data.
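The fourth principle, federated computational governance, is the most concrete: central policy, automated enforcement. The sketch below invents a minimal data-product descriptor and policy check for illustration; the field names and rules are assumptions, not part of the Data Mesh definition.

```python
from dataclasses import dataclass

@dataclass
class DataProduct:
    """Hypothetical data-product descriptor published by a domain team."""
    domain: str
    name: str
    owner: str = ""
    documentation_url: str = ""
    sla_hours: float = 0.0
    pii_masked: bool = False

def governance_check(product):
    """Centrally defined policy, computationally enforced: returns violations
    instead of routing every product through a human review bottleneck."""
    violations = []
    if not product.owner:
        violations.append("every data product must name an owner")
    if not product.documentation_url:
        violations.append("every data product must be documented")
    if product.sla_hours <= 0:
        violations.append("every data product must declare a freshness SLA")
    if not product.pii_masked:
        violations.append("PII must be masked before publication")
    return violations

orders = DataProduct(domain="sales", name="orders", owner="sales-data-team",
                     documentation_url="https://example.com/docs/orders",
                     sla_hours=24, pii_masked=True)
clicks = DataProduct(domain="marketing", name="clickstream")

print(len(governance_check(orders)), len(governance_check(clicks)))  # 0 4
```

In a real mesh such checks would run in the self-serve platform's deployment pipeline, so a domain cannot publish a product that fails central policy.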
Data Fabric
Data Fabric is an architecture approach that provides a unified, intelligent data management layer across distributed data environments (on-premises, cloud, multi-cloud, hybrid). Unlike Data Mesh which is primarily an organizational approach, Data Fabric is a technology-driven pattern that uses active metadata, knowledge graphs, machine learning, and automation to discover, integrate, and govern data wherever it resides. Key capabilities include: automated data discovery and cataloging, intelligent data integration that recommends connections, active metadata management that learns usage patterns, unified security and governance enforcement across all data stores, and self-service data access for consumers. Data Fabric addresses the challenge of data sprawl across hybrid environments by creating a virtual integration layer rather than physically moving all data to one location. Gartner has identified Data Fabric as a top technology trend for modern data management.
Architecture Governance
Architecture governance ensures that data architecture standards, principles, and models are followed across projects and initiatives. Key governance activities include: ARCHITECTURE REVIEW BOARDS — formal bodies that review project designs for compliance with architecture standards before implementation; TECHNOLOGY STANDARDS — approved lists of databases, integration tools, and platforms that projects must select from; ARCHITECTURE PRINCIPLES — guiding statements like 'data is a shared asset' or 'design for reuse' that drive decisions; EXCEPTION MANAGEMENT — formal process for granting variances when projects cannot comply with standards; COMPLIANCE MONITORING — tracking actual implementations against approved architecture; ARCHITECTURE ROADMAPS — planned evolution of the data landscape over 3-5 years. Without governance, architecture degrades over time as individual projects make expedient but inconsistent technology choices.
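Two of these activities, technology standards and exception management, lend themselves to automation. The sketch below is illustrative: the approved list, the project names, and the variance register are all invented.

```python
# Hypothetical technology standard plus an exception (variance) register.
APPROVED_DATABASES = {"postgresql", "snowflake", "sqlite"}
GRANTED_EXCEPTIONS = {("project-apollo", "mongodb")}  # formally granted variances

def review_technology(project, technology):
    """Automated first pass of an architecture review: a choice is approved,
    covered by a granted exception, or escalated to the review board."""
    tech = technology.lower()
    if tech in APPROVED_DATABASES:
        return "approved"
    if (project, tech) in GRANTED_EXCEPTIONS:
        return "approved-by-exception"
    return "escalate-to-review-board"

print(review_technology("project-apollo", "MongoDB"))  # approved-by-exception
print(review_technology("project-zeus", "MongoDB"))    # escalate-to-review-board
```

Logging every `escalate-to-review-board` result over time is a simple form of the compliance monitoring the text describes: it shows where the published standard and actual project behavior diverge.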
Cloud Data Architecture
Cloud data architecture addresses design patterns specific to cloud environments. Key considerations include: MULTI-CLOUD vs HYBRID strategies — using multiple cloud providers versus combining cloud and on-premises; CLOUD-NATIVE SERVICES — leveraging managed database services (Amazon RDS, Azure SQL, Google Cloud SQL), serverless computing, and auto-scaling rather than simply lifting and shifting on-premises designs; DATA SOVEREIGNTY — ensuring data resides in geographic regions required by regulations (for example, GDPR restricts transfers of EU personal data outside the EU unless adequate safeguards are in place); COST OPTIMIZATION — cloud pricing models (pay-per-query, storage tiers, reserved capacity) require different architectural thinking than on-premises; SECURITY — shared responsibility model where the cloud provider secures infrastructure while the customer secures data and access. Cloud-native architectures typically separate storage and compute, enabling independent scaling and cost optimization.
Technology Evaluation and Selection
Data architects evaluate and recommend technologies using structured criteria including: FUNCTIONAL REQUIREMENTS — does the technology support required use cases (OLTP, OLAP, streaming, ML); NON-FUNCTIONAL REQUIREMENTS — performance, scalability, availability, security, compliance capabilities; TOTAL COST OF OWNERSHIP — licensing, infrastructure, administration, training, migration costs over 3-5 years; VENDOR VIABILITY — financial stability, market position, product roadmap, support quality; INTEGRATION CAPABILITY — ability to work with existing tools and platforms in the architecture; SKILLS AVAILABILITY — can the organization recruit or train staff to manage the technology; STANDARDS COMPLIANCE — adherence to industry standards and open formats to avoid vendor lock-in. Technology decisions should be documented in Architecture Decision Records (ADRs) that capture the context, decision, options considered, and rationale.
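A common way to apply these criteria is a weighted scoring matrix. The weights, candidate names, and scores below are invented for illustration; a real evaluation would agree on weights up front and record the outcome in an ADR.

```python
# Hypothetical weights over the evaluation criteria (must sum to 1.0).
criteria_weights = {
    "functional_fit": 0.30,
    "total_cost_of_ownership": 0.25,  # lower cost is scored higher
    "vendor_viability": 0.15,
    "integration_capability": 0.15,
    "skills_availability": 0.15,
}

# Invented candidate scores on a 1-10 scale.
candidates = {
    "Platform A": {"functional_fit": 9, "total_cost_of_ownership": 5,
                   "vendor_viability": 8, "integration_capability": 7,
                   "skills_availability": 6},
    "Platform B": {"functional_fit": 7, "total_cost_of_ownership": 8,
                   "vendor_viability": 7, "integration_capability": 8,
                   "skills_availability": 9},
}

def weighted_score(scores):
    """Weighted sum across criteria, rounded for reporting."""
    return round(sum(criteria_weights[c] * s for c, s in scores.items()), 2)

ranked = sorted(candidates, key=lambda n: weighted_score(candidates[n]),
                reverse=True)
for name in ranked:
    print(name, weighted_score(candidates[name]))
```

Note how the weighting changes the outcome: Platform A wins on functional fit alone, but Platform B's cost and skills scores carry it once all criteria are weighed, which is exactly the trade-off an ADR should document.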
Best Practices
- ✓ Align data architecture with enterprise and business architecture — data architecture should never exist in isolation from business strategy
- ✓ Maintain architecture artifacts at all three levels (conceptual, logical, physical) and keep them synchronized as systems evolve
- ✓ Establish an architecture review board that evaluates all new projects for compliance with data architecture standards
- ✓ Use standard frameworks (TOGAF, Zachman) as starting points and tailor them to your organization rather than inventing from scratch
- ✓ Document architecture decisions using Architecture Decision Records (ADRs) capturing context, options considered, and rationale
- ✓ Create and maintain an enterprise data model that provides a common vocabulary across business units and projects
- ✓ Evaluate cloud, hybrid, and on-premises options based on total cost of ownership, data sovereignty, and performance requirements
- ✓ Implement data architecture governance incrementally — start with high-impact standards and expand as maturity grows
- ✓ Separate storage and compute in modern architectures to enable independent scaling and cost optimization
- ✓ Design for data sharing and reuse from the start rather than building isolated, project-specific data stores
- ✓ Publish approved technology standards and maintain a technology radar showing which platforms are approved, trial, hold, or retired
- ✓ Plan for data architecture evolution with a 3-5 year roadmap that accounts for business growth and technology changes
💡 Exam Tips
- ★ Data Architecture is 6% of the exam — expect approximately 6 questions
- ★ Know the THREE model levels: Conceptual (business entities, no attributes), Logical (attributes, keys, normalized, technology-independent), Physical (tables, columns, data types, indexes, technology-specific)
- ★ TOGAF places Data Architecture in Phase C of the ADM alongside Application Architecture — remember this specific phase
- ★ Zachman is a CLASSIFICATION FRAMEWORK (taxonomy), NOT a methodology — it does not prescribe a process, unlike TOGAF which has the ADM
- ★ Data Lake uses SCHEMA-ON-READ; Data Warehouse uses SCHEMA-ON-WRITE — this distinction is frequently tested
- ★ Data Mesh has FOUR principles: domain ownership, data as a product, self-serve infrastructure, federated computational governance
- ★ Data Fabric is TECHNOLOGY-DRIVEN (automated integration); Data Mesh is ORGANIZATION-DRIVEN (domain ownership) — know the distinction
- ★ Architecture governance includes review boards, technology standards, principles, exception management, and compliance monitoring
- ★ Enterprise Data Architecture operates at the ORGANIZATION level, not the project level — it focuses on cross-functional consistency
- ★ Forward engineering goes from conceptual to physical; reverse engineering goes from physical back to logical
- ★ Understand that data architecture ARTIFACTS serve different AUDIENCES: conceptual for business, logical for analysts, physical for developers
- ★ Cloud architecture key concepts: shared responsibility model, data sovereignty, separation of storage and compute, multi-cloud vs hybrid