Overview
Metadata Management involves the planning, implementation, and control activities for managing metadata — which is most simply defined as data about data. Metadata provides the context that makes data understandable, usable, and trustworthy. Without metadata, data is just numbers and text with no meaning. Metadata management is often called the foundation that supports ALL other data management disciplines because every discipline depends on knowing what data exists, what it means, where it comes from, and how it can be used.

Metadata encompasses a vast range of information: business definitions and rules, technical schemas and data types, data lineage and transformation logic, quality metrics, access policies, usage statistics, and more. The challenge is that metadata is scattered across many systems — databases, ETL tools, BI platforms, data catalogs, business glossaries, and even tribal knowledge in people's heads. Effective metadata management brings this information together into an accessible, governed, and trustworthy resource.

The business value of metadata management includes: enabling data discovery (users can find and understand data), supporting data governance (policies can reference and apply to defined data elements), improving data quality (quality rules reference metadata definitions), enabling impact analysis (understanding what breaks when something changes), supporting regulatory compliance (lineage proves where data came from), and building trust in data (users can verify data's origin and quality). DMBOK2 positions metadata management as a critical enabler of data-driven organizations.
Key Concepts
Three Types of Metadata
DMBOK2 defines three primary categories of metadata: (1) BUSINESS METADATA — describes the business meaning, context, and rules for data. Includes: business glossary definitions, data ownership information, data classification, business rules, data quality expectations, and usage policies. Audience: business users, analysts, governance teams. (2) TECHNICAL METADATA — describes the physical structure and technical characteristics of data. Includes: database schemas, table/column definitions, data types and sizes, ETL transformation logic, data lineage (source-to-target mappings), stored procedures, indexes, partitions, APIs, and system configurations. Audience: IT staff, developers, DBAs. (3) OPERATIONAL METADATA — describes how data is processed and used in day-to-day operations. Includes: ETL job logs (start/end times, row counts, error counts), data freshness timestamps, access logs, query performance statistics, storage utilization, and batch processing schedules. Audience: operations teams, IT support.
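A quick way to drill the three categories is to sort concrete metadata items into them. The mapping below is an illustrative study aid, not an official DMBOK2 artifact; the item names are invented examples:

```python
# Illustrative (not DMBOK2-official) examples of metadata items
# sorted into the three DMBOK2 categories.
METADATA_EXAMPLES = {
    "business": ["glossary_definition", "data_owner", "classification", "business_rule"],
    "technical": ["column_data_type", "etl_mapping", "index_definition", "api_schema"],
    "operational": ["etl_job_log", "freshness_timestamp", "access_log", "query_stats"],
}

def category_of(item: str) -> str:
    """Return the metadata category an example item belongs to."""
    for category, items in METADATA_EXAMPLES.items():
        if item in items:
            return category
    raise KeyError(item)

print(category_of("etl_job_log"))       # operational
print(category_of("column_data_type"))  # technical
```

The rule of thumb: meaning is business, structure is technical, processing is operational.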
Metadata Repository vs Data Catalog
These are related but distinct concepts: METADATA REPOSITORY — a centralized database that stores metadata models and instances. Typically maintains the authoritative technical metadata: schemas, lineage, relationships. Often integrated with modelling tools and ETL platforms. More IT-focused. DATA CATALOG — a user-friendly discovery and collaboration platform that enables business users to find, understand, trust, and use data assets. Features include: search and browse, business glossary, data previews, user ratings and reviews, collaboration (comments, annotations), automated metadata harvesting, and data quality indicators. More business-focused. Modern data catalogs (Collibra, Alation, Informatica Axon, Apache Atlas) combine both capabilities.
Data Lineage (Forward and Backward)
The end-to-end tracking of data from its origin through all transformations to its consumption points. Two directions: BACKWARD LINEAGE (also called provenance or upstream lineage) — traces data from a report or dashboard BACK to its original sources. Answers: Where did this number come from? Critical for debugging, auditing, and regulatory compliance. FORWARD LINEAGE (also called impact analysis or downstream lineage) — traces data from a source system FORWARD to all its downstream uses. Answers: What will be affected if I change this? Critical for change management and impact analysis. Lineage should capture: source systems, transformations/calculations applied, intermediate staging, final targets, and timing/scheduling.
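The two directions are just the same lineage graph traversed opposite ways. A minimal sketch, using an invented toy pipeline (system and table names are hypothetical): backward lineage walks predecessor edges from a report, forward lineage walks successor edges from a source.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: (source, target) pairs from a toy pipeline.
EDGES = [
    ("crm.customers", "staging.customers"),
    ("erp.orders", "staging.orders"),
    ("staging.customers", "dw.dim_customer"),
    ("staging.orders", "dw.fact_sales"),
    ("dw.dim_customer", "bi.sales_dashboard"),
    ("dw.fact_sales", "bi.sales_dashboard"),
]

downstream = defaultdict(list)  # forward lineage (impact analysis)
upstream = defaultdict(list)    # backward lineage (provenance)
for src, tgt in EDGES:
    downstream[src].append(tgt)
    upstream[tgt].append(src)

def trace(graph, start):
    """Breadth-first walk returning every node reachable from `start`."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Backward: where did the dashboard's numbers come from?
print(trace(upstream, "bi.sales_dashboard"))
# Forward: what is affected if crm.customers changes?
print(trace(downstream, "crm.customers"))
```

Note how the forward trace answers the change-management question (everything touching `crm.customers` downstream) while the backward trace answers the audit question (every source feeding the dashboard).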
Business Glossary
A controlled vocabulary of business terms with agreed-upon definitions. One of the most important and visible metadata management deliverables. A business glossary should include: TERM — the business name; DEFINITION — clear, unambiguous description; SYNONYMS — alternative names used in different departments; RELATED TERMS — connections to other glossary entries; OWNER/STEWARD — who is responsible for the definition; SOURCE — the authoritative system of record; USAGE CONTEXT — how and where the term is used. The glossary resolves semantic conflicts where different departments use the same term differently. It should be easily accessible to all employees and governed through a formal change process.
Data Dictionary
A repository of definitions, data types, valid values, and business rules for data elements. While similar to a business glossary, a data dictionary is typically more technical and detailed. Contents include: COLUMN/FIELD NAME — the technical name; BUSINESS NAME — the user-friendly name; DEFINITION — what the element represents; DATA TYPE — varchar, integer, date, etc.; SIZE/FORMAT — length, precision, pattern; VALID VALUES — enumerated list or range constraints; DEFAULT VALUE — what's used when no value is provided; NULLABLE — whether empty values are allowed; SOURCE — originating system; DERIVATION — calculation or transformation logic; CONSTRAINTS — foreign keys, unique constraints, check constraints; EXAMPLES — sample valid and invalid values.
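A data dictionary entry can be modeled directly from the contents listed above. This sketch uses invented field names and an example column; the `is_valid` check shows how dictionary metadata (nullable flag, valid values) drives validation:

```python
from dataclasses import dataclass, field

@dataclass
class DataDictionaryEntry:
    # Fields mirror a subset of the data-dictionary contents listed above;
    # names and the example column are illustrative, not from a standard.
    column_name: str
    business_name: str
    definition: str
    data_type: str
    nullable: bool = True
    valid_values: list = field(default_factory=list)

    def is_valid(self, value) -> bool:
        """Check one value against the nullable and valid-value rules."""
        if value is None:
            return self.nullable
        return not self.valid_values or value in self.valid_values

status = DataDictionaryEntry(
    column_name="cust_status_cd",
    business_name="Customer Status",
    definition="Lifecycle state of the customer relationship",
    data_type="varchar(1)",
    nullable=False,
    valid_values=["A", "I", "P"],  # Active, Inactive, Prospect
)

print(status.is_valid("A"))   # True
print(status.is_valid(None))  # False: column is NOT NULL
```

This is the metadata-drives-quality idea in miniature: the rule engine never hard-codes "A/I/P", it reads the dictionary.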
Metadata Standards
Industry standards for metadata representation and exchange: (1) DUBLIN CORE — 15 basic metadata elements for describing documents and digital resources (title, creator, subject, description, date, etc.). Widely used in libraries and web resources; (2) ISO 11179 — standard for metadata registries, defines how to describe data elements with naming, identification, and definition rules; (3) CWM (Common Warehouse Metamodel) — OMG standard for exchanging metadata between data warehouse tools; (4) XMI (XML Metadata Interchange) — format for exchanging metadata between modelling tools; (5) JSON-LD and Schema.org — metadata for web content and structured data; (6) DCAT (Data Catalog Vocabulary) — W3C standard for describing datasets in data catalogs. Using standards enables interoperability between tools and systems.
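For a feel of what a standards-based record looks like, here is a minimal Dublin Core description of a dataset. The element names (`dc:title`, `dc:creator`, etc.) come from the 15-element Dublin Core set; the values and the dataset itself are invented for illustration:

```python
import json

# A Dublin Core description using a subset of the 15 core elements.
# Element names are standard; the values are invented examples.
record = {
    "dc:title": "Monthly Sales Summary",
    "dc:creator": "Finance Data Team",
    "dc:subject": "sales; revenue; reporting",
    "dc:description": "Aggregated monthly sales figures by region.",
    "dc:date": "2024-01-31",
    "dc:format": "text/csv",
    "dc:identifier": "dataset:sales-monthly-v2",
}

print(json.dumps(record, indent=2))
```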
Metadata Architecture Approaches
Three approaches to organizing metadata across the enterprise: (1) CENTRALIZED — single metadata repository containing all metadata from all systems. Provides consistency but can become a bottleneck, difficult to keep current, and creates a single point of failure. (2) DISTRIBUTED (Federated) — metadata remains in source systems with no central store. Each tool (ETL, BI, DBMS) maintains its own metadata. Flexible but makes cross-system discovery and lineage difficult. (3) HYBRID (Recommended) — a central data catalog harvests and links metadata from distributed sources, providing a unified view without requiring all metadata to be physically centralized. The catalog acts as a metadata hub that connects metadata across tools and systems using automated crawling, APIs, and manual curation.
Automated Metadata Discovery and Harvesting
Using tools and technology to automatically discover, extract, and maintain metadata from source systems. Techniques include: (1) SCHEMA CRAWLING — automatically scanning databases to extract table/column definitions, data types, and relationships; (2) ETL LINEAGE EXTRACTION — parsing ETL tool metadata to build source-to-target mappings automatically; (3) MACHINE LEARNING CLASSIFICATION — using ML to automatically classify data (PII, sensitive, etc.) and suggest business definitions based on column names and data patterns; (4) USAGE ANALYSIS — tracking query logs to identify popular datasets, frequent joins, and actual data usage patterns; (5) PROFILING — automatically generating statistical metadata (null counts, distinct values, patterns). Automation is critical because manual metadata management can't keep pace with the volume and velocity of data in modern organizations.
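Schema crawling is the most approachable of these techniques. The sketch below harvests table/column metadata from an in-memory SQLite database via its system catalog; production harvesters do the same against enterprise databases through JDBC/ODBC drivers or vendor APIs, and the table here is an invented example:

```python
import sqlite3

# Build a toy database to crawl (the table is an invented example).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        email       TEXT NOT NULL,
        created_at  TEXT
    )
""")

def crawl_schema(conn):
    """Extract table and column metadata from SQLite's system catalog."""
    harvested = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, default, pk)
        cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
        harvested[table] = [
            {"name": c[1], "type": c[2], "nullable": not c[3]} for c in cols
        ]
    return harvested

print(crawl_schema(conn))
```

A catalog tool runs crawls like this on a schedule, diffs the result against its stored metadata, and flags anything new or changed, which is exactly why automation keeps metadata current where manual curation cannot.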
Metadata Integration and Linking
Connecting metadata from different sources to create a unified view. Challenges include: different tools use different metadata models, same concepts have different names, and metadata freshness varies across sources. Integration approaches: (1) METAMODEL MAPPING — defining equivalences between different metadata models; (2) SEMANTIC LINKING — connecting business terms to technical elements; (3) LINEAGE STITCHING — connecting ETL lineage from multiple tools into end-to-end lineage; (4) KNOWLEDGE GRAPHS — representing metadata as a graph of interconnected concepts. Tools like Apache Atlas, Collibra, and Informatica EDC specialize in metadata integration.
Metadata Governance
The framework for managing metadata as an enterprise asset. Key components: (1) METADATA STRATEGY — vision, goals, and roadmap for metadata management; (2) METADATA STANDARDS — naming conventions, definition templates, quality requirements for metadata itself; (3) METADATA STEWARDSHIP — roles responsible for creating, maintaining, and governing metadata; (4) METADATA QUALITY — ensuring metadata is accurate, complete, current, and consistent (metadata can have quality issues just like data!); (5) METADATA ACCESS — policies for who can view and edit metadata; (6) CHANGE MANAGEMENT — processes for updating definitions, adding new terms, deprecating old ones. The business glossary, in particular, requires governance to prevent duplicate or conflicting definitions.
Data Lineage for Regulatory Compliance
Regulators increasingly require organizations to demonstrate data lineage: BCBS 239 (banking) — banks must show the complete lineage of risk data from source systems through aggregation to regulatory reports; GDPR — requires understanding where personal data is stored and processed (data mapping/inventory); SOX — requires audit trails for financial data from source to financial statements; IFRS 9/17 — financial reporting standards requiring traceable data; FDA (pharmaceuticals) — requires documented data lineage for clinical trial data. Lineage provides the evidence trail that data in reports and decisions is accurate, complete, and derived from authoritative sources.
Metadata and Data Quality
Metadata management and data quality are deeply interconnected: (1) Metadata provides the DEFINITIONS needed to write quality rules (you can't measure accuracy without knowing what correct means); (2) Metadata provides the CONTEXT for interpreting quality measurements (a 95% completeness rate means different things for different fields); (3) Quality METRICS are themselves metadata (data quality scores, profiling results, trend data); (4) Data LINEAGE helps identify root causes of quality issues (tracing errors back to their source); (5) The BUSINESS GLOSSARY prevents semantic quality issues (misinterpretation of data). Without good metadata, data quality programs operate in the dark.
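Point (1) can be made concrete with a tiny metadata-driven quality check. The completeness expectations below stand in for business metadata (the thresholds, column names, and sample rows are all invented); the measurement only becomes a pass/fail judgment because that metadata exists:

```python
# Hypothetical business-metadata expectations: required completeness
# per column. The thresholds and sample data are invented examples.
EXPECTATIONS = {"email": 1.00, "phone": 0.80}

rows = [
    {"email": "a@x.com", "phone": "555-0100"},
    {"email": "b@x.com", "phone": None},
    {"email": "c@x.com", "phone": "555-0102"},
]

def completeness(rows, column):
    """Fraction of rows where the column is populated."""
    filled = sum(1 for r in rows if r[column] is not None)
    return filled / len(rows)

for column, threshold in EXPECTATIONS.items():
    score = completeness(rows, column)
    status = "PASS" if score >= threshold else "FAIL"
    print(f"{column}: {score:.2f} (expected >= {threshold:.2f}) {status}")
```

Here `phone` measures about 0.67 against an expectation of 0.80 and fails, while `email` passes; without the metadata, 0.67 is just a number with no verdict attached.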
Metadata in the Cloud and Modern Data Stack
Cloud and modern data tools create new metadata challenges and opportunities: (1) CLOUD METADATA — cloud data platforms (Snowflake, Databricks, BigQuery) generate rich operational metadata (query history, access patterns, costs); (2) DATA MESH — decentralized data ownership creates distributed metadata that still needs to be discoverable; (3) OPEN METADATA STANDARDS — initiatives like OpenMetadata and Marquez provide open-source metadata management; (4) ACTIVE METADATA — using metadata to automate data management tasks (auto-tagging, auto-classification, anomaly detection); (5) METADATA OBSERVABILITY — monitoring metadata to detect unexpected changes in data schemas, volumes, and freshness. The modern data stack treats metadata as a first-class concern, not an afterthought.
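Metadata observability in point (5) often reduces to diffing schema snapshots over time. A minimal sketch, assuming snapshots are simple column-to-type mappings (the columns and the snapshot shape are invented for illustration):

```python
# Hypothetical schema snapshots taken on two consecutive days
# (column name -> declared type).
yesterday = {"customer_id": "INTEGER", "email": "TEXT", "created_at": "TEXT"}
today     = {"customer_id": "INTEGER", "email": "VARCHAR", "signup_ts": "TEXT"}

def schema_drift(old, new):
    """Return added, removed, and retyped columns between two snapshots."""
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "retyped": sorted(c for c in set(old) & set(new) if old[c] != new[c]),
    }

print(schema_drift(yesterday, today))
# {'added': ['signup_ts'], 'removed': ['created_at'], 'retyped': ['email']}
```

An observability tool runs this comparison continuously and alerts downstream consumers before the dropped `created_at` column breaks their reports, an example of metadata actively driving operations rather than sitting in a catalog.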
Best Practices
- ✓ Maintain a business glossary with agreed-upon definitions — it's the most visible and valuable metadata deliverable
- ✓ Implement automated data lineage tracking from source systems through transformations to reports and dashboards
- ✓ Automate metadata collection wherever possible — manual metadata management cannot scale to modern data volumes
- ✓ Create a metadata management strategy before selecting tools — understand requirements first
- ✓ Use a hybrid metadata architecture: central catalog with federated source harvesting
- ✓ Make metadata accessible to business users through a user-friendly data catalog with search and collaboration
- ✓ Keep metadata current — stale or inaccurate metadata erodes trust and is worse than no metadata
- ✓ Link business metadata to technical metadata so users can trace from business terms to physical implementations
- ✓ Use metadata to support data quality, governance, and compliance — it enables all other data management disciplines
- ✓ Establish metadata stewardship roles with clear accountability for maintaining definitions and classifications
- ✓ Govern the business glossary with a formal review and approval process to prevent conflicting definitions
- ✓ Measure metadata quality — metadata itself can have accuracy, completeness, and currency issues
💡 Exam Tips
- ★ Metadata Management is 11% of the exam — one of the Big Four — expect 11 questions
- ★ Know the THREE types cold: Business metadata (meaning), Technical metadata (structure), Operational metadata (processing)
- ★ Data lineage is HEAVILY tested — understand both forward lineage (impact analysis) and backward lineage (provenance)
- ★ Metadata repository (technical, IT-focused) vs Data catalog (user-friendly, business-focused) — know the distinction
- ★ Metadata management supports ALL other data management disciplines — it's the foundation
- ★ Know common metadata standards: Dublin Core, ISO 11179, CWM, XMI
- ★ Business glossary is the KEY deliverable — resolves semantic conflicts across departments
- ★ Automated metadata harvesting is essential for keeping metadata current at scale
- ★ Lineage is required by regulators: BCBS 239 (banking), GDPR (privacy), SOX (financial)
- ★ Three architecture approaches: Centralized, Distributed, Hybrid (recommended)
- ★ Metadata quality matters — metadata itself needs to be accurate, complete, and current
- ★ The data dictionary is more technical/detailed than the business glossary but they complement each other