Overview
Data Quality Management is the planning, implementation, and control of activities that apply quality management techniques to data, in order to assure it is fit for consumption and meets the needs of data consumers. The fundamental principle is that high-quality data is 'fit for purpose': it meets the specific requirements of the people and processes that use it. Data quality is not an absolute; the same data may be high quality for one use case and inadequate for another.

Data Quality encompasses the entire lifecycle of quality management: defining data quality requirements with business stakeholders, profiling data to understand its current state, measuring quality against defined dimensions and rules, implementing improvement processes (prevention at the source and correction of existing issues), and monitoring quality on an ongoing basis. The DMBOK2 emphasizes that prevention is always preferred over correction: designing quality into data capture processes is more effective than cleansing data after the fact.

The cost of poor data quality (COPDQ) can be enormous — Gartner estimates it at $12.9 million per year for the average organization. Costs include incorrect decisions, regulatory fines, customer churn, operational inefficiency, and lost revenue. A mature data quality program quantifies these costs to justify investment and demonstrate ROI. Data Quality is closely linked to Data Governance (which provides the policy framework) and Metadata Management (which provides the definitions and context needed to assess quality).
Key Concepts
Data Quality Dimensions — The Complete Set
Data quality is measured across multiple dimensions. The most commonly tested dimensions on the CDMP exam are: (1) ACCURACY — data correctly represents the real-world entity or event it describes; (2) COMPLETENESS — all required data values are present and populated (no missing fields or records); (3) CONSISTENCY — data values agree across different systems and datasets; (4) TIMELINESS — data is available when needed and reflects the current state within acceptable latency; (5) VALIDITY — data conforms to defined formats, ranges, and business rules; (6) UNIQUENESS — each real-world entity is represented only once (no duplicate records); (7) INTEGRITY — referential and structural integrity is maintained; (8) REASONABLENESS — data values make logical sense in context; (9) ACCESSIBILITY — authorized users can access data when they need it through appropriate channels.
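Several of these dimensions can be measured directly with simple arithmetic. The sketch below, with illustrative field names and records (not from DMBOK2), computes completeness and uniqueness scores for one field:

```python
# Sketch: measuring two dimensions (completeness, uniqueness) over a small
# record set. Field names and sample data are illustrative.
records = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 3, "email": "a@example.com"},  # duplicate email
]

def completeness(rows, field):
    """Share of rows where the field is populated (COMPLETENESS dimension)."""
    populated = sum(1 for r in rows if r.get(field) not in (None, ""))
    return populated / len(rows)

def uniqueness(rows, field):
    """Share of populated values that are distinct (UNIQUENESS dimension)."""
    values = [r[field] for r in rows if r.get(field) not in (None, "")]
    return len(set(values)) / len(values) if values else 1.0

print(completeness(records, "email"))  # 2 of 3 rows populated
print(uniqueness(records, "email"))    # 1 distinct value among 2 populated
```

Dimensions like accuracy cannot be computed this way; they require comparison against an authoritative source.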
Accuracy vs Validity — Critical Distinction
This distinction is frequently tested on the CDMP exam. ACCURACY means the data correctly describes the real world. VALIDITY means the data conforms to the expected format and rules. Data can be VALID but NOT ACCURATE: (555) 000-0000 is a valid phone number format but may not be the customer's actual number. Conversely, data can be ACCURATE but NOT VALID: a customer's correct phone number stored as 2125557341 fails a rule requiring the (NNN) NNN-NNNN format. You need BOTH accuracy and validity for high-quality data.
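The distinction can be made concrete in code: validity is a format check that needs nothing but the value, while accuracy requires an authoritative source to compare against. The lookup table below is a hypothetical stand-in for real-world verification:

```python
import re

# Validity: a pure format check against the (NNN) NNN-NNNN pattern.
PHONE_PATTERN = re.compile(r"^\(\d{3}\) \d{3}-\d{4}$")

def is_valid(phone: str) -> bool:
    return bool(PHONE_PATTERN.match(phone))

# Accuracy: comparison against an authoritative source (hypothetical here).
def is_accurate(customer_id: str, phone: str, truth: dict) -> bool:
    return truth.get(customer_id) == phone

truth = {"C001": "(212) 555-7341"}           # assumed authoritative record
candidate = "(555) 000-0000"

print(is_valid(candidate))                    # True  — format conforms
print(is_accurate("C001", candidate, truth))  # False — not the real number
```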
Data Profiling
Data profiling is the examination of data to collect statistics and information about its current state. It is typically the FIRST step in any data quality improvement initiative. Three types of profiling: (1) COLUMN ANALYSIS — examines individual columns: data types, null counts, distinct values, min/max values, patterns, frequency distributions. (2) CROSS-COLUMN ANALYSIS (Dependency Analysis) — examines relationships between columns within a table: functional dependencies, correlations. (3) CROSS-TABLE ANALYSIS (Redundancy Analysis) — examines relationships across tables: orphan records, duplicate records, referential integrity. Profiling should be done BEFORE designing quality rules and improvement programs.
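Column analysis, the first profiling type above, can be sketched in plain Python; the statistics gathered (null counts, distinct values, min/max, frequency distribution) match the list above, while the sample values are made up:

```python
from collections import Counter

def profile_column(values):
    """Column analysis: nulls, distinct count, min/max, top frequencies."""
    non_null = [v for v in values if v is not None]
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "top_values": Counter(non_null).most_common(3),
    }

stats = profile_column([10, 20, 20, None, 35])
print(stats)
```

Cross-column and cross-table analysis follow the same idea but operate on pairs of columns and on keys shared across tables.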
Data Cleansing (Data Remediation)
The process of detecting and correcting or removing corrupt, inaccurate, or irrelevant records. Key cleansing activities: (1) STANDARDIZATION — converting data to consistent formats; (2) DEDUPLICATION — identifying and merging or removing duplicate records using matching algorithms; (3) VALIDATION — checking data against business rules and correcting violations; (4) ENRICHMENT — supplementing data with additional information from external sources; (5) PARSING — breaking compound fields into components; (6) CORRECTION — fixing known errors. Important: cleansing treats symptoms, not root causes. Always combine cleansing with root cause analysis and prevention.
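Two of the cleansing activities above, standardization and deduplication, can be sketched together: standardizing first makes the duplicate detectable. The match key here (normalized name plus ZIP) is a deliberately simplified matching rule:

```python
def standardize_name(name: str) -> str:
    """STANDARDIZATION: collapse whitespace and normalize casing."""
    return " ".join(name.split()).title()

def deduplicate(rows):
    """DEDUPLICATION: keep the first record per normalized match key."""
    seen, unique = set(), []
    for row in rows:
        key = (standardize_name(row["name"]), row["zip"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

rows = [
    {"name": "jane  DOE", "zip": "10001"},
    {"name": "Jane Doe",  "zip": "10001"},  # same person, different formatting
    {"name": "John Roe",  "zip": "10002"},
]
print(deduplicate(rows))  # 2 records survive
```

Real matching engines use fuzzy and probabilistic comparisons; this exact-key approach only illustrates the pipeline order.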
Root Cause Analysis for Data Quality
Identifying WHY data quality problems occur, not just fixing symptoms. Common root causes include: incorrect data entry (human error), system migration issues, integration errors between systems, inadequate validation rules, changed business rules not reflected in systems, missing or unclear data standards, poor system design, and lack of training. Root cause analysis techniques: (1) FISHBONE DIAGRAM (Ishikawa) — categorizes causes into People, Process, Technology, Data categories; (2) FIVE WHYS — repeatedly asking why to drill down to the fundamental cause; (3) PARETO ANALYSIS — identifying the 20% of causes responsible for 80% of quality issues. The goal is to fix the root cause so problems don't recur.
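Pareto analysis, the third technique above, is easy to automate over an issue log: sort causes by frequency and keep the smallest set covering 80% of issues. The cause labels and counts below are hypothetical:

```python
from collections import Counter

def pareto_causes(issues, threshold=0.8):
    """Return the 'vital few' causes covering `threshold` of all issues."""
    counts = Counter(issues).most_common()  # causes sorted by frequency
    total, running, vital_few = len(issues), 0, []
    for cause, n in counts:
        running += n
        vital_few.append(cause)
        if running / total >= threshold:
            break
    return vital_few

log = (["entry error"] * 50 + ["migration"] * 30
       + ["integration"] * 15 + ["other"] * 5)
print(pareto_causes(log))  # the two causes covering 80% of issues
```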
Data Quality Rules and Business Rules
Data quality rules define the specific criteria data must meet. Types include: (1) VALIDITY RULES — format, range, and domain constraints; (2) COMPLETENESS RULES — which fields must be populated under what conditions; (3) CONSISTENCY RULES — cross-field and cross-system agreement; (4) UNIQUENESS RULES — duplicate prevention criteria; (5) TIMELINESS RULES — freshness requirements; (6) ACCURACY RULES — verification against authoritative sources. Rules should be defined by business stakeholders (who understand requirements) and implemented by technical teams.
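One common implementation pattern is to express each business-defined rule as a named predicate and evaluate records against the whole set. The rule names, fields, and thresholds below are illustrative, not from DMBOK2:

```python
# Each rule: a human-readable name (from business stakeholders) mapped to a
# predicate (implemented by the technical team).
RULES = {
    "validity: age in 0-120": lambda r: 0 <= r["age"] <= 120,
    "completeness: email populated": lambda r: bool(r.get("email")),
    "consistency: end >= start": lambda r: r["end_date"] >= r["start_date"],
}

def evaluate(record):
    """Run every rule against one record; return pass/fail per rule."""
    return {name: rule(record) for name, rule in RULES.items()}

record = {"age": 34, "email": "",
          "start_date": "2024-01-01", "end_date": "2024-06-30"}
results = evaluate(record)
print([name for name, ok in results.items() if not ok])  # failed rules
```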
Data Quality SLAs (Service Level Agreements)
Formal agreements between data producers and data consumers that define acceptable quality thresholds. A DQ SLA should specify: (1) SCOPE — which data elements and datasets are covered; (2) DIMENSIONS — which quality dimensions are measured; (3) THRESHOLDS — minimum acceptable levels; (4) MEASUREMENT METHOD — how quality will be measured and how often; (5) REMEDIATION PROCESS — what happens when thresholds are breached; (6) ESCALATION — who is notified and what actions are triggered; (7) REPORTING — how results are communicated. SLAs create accountability and make quality expectations explicit rather than assumed.
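The threshold check at the heart of an SLA is mechanical once dimensions are measured: compare scores to agreed floors and route breaches into the remediation process. Thresholds here are illustrative:

```python
# Agreed minimum scores per dimension (SLA thresholds, illustrative values).
SLA = {"completeness": 0.98, "validity": 0.95, "timeliness": 0.90}

def check_sla(measured: dict) -> list:
    """Return the dimensions whose measured score breaches the SLA floor."""
    return [dim for dim, floor in SLA.items()
            if measured.get(dim, 0.0) < floor]

measured = {"completeness": 0.99, "validity": 0.91, "timeliness": 0.97}
breaches = check_sla(measured)
print(breaches)  # breached dimensions trigger remediation and escalation
```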
Data Quality Monitoring and Dashboards
Continuous monitoring is essential for maintaining data quality. Components include: (1) AUTOMATED QUALITY CHECKS — scheduled jobs that run quality rules against data and flag violations; (2) QUALITY SCORECARDS — summary metrics showing current state across dimensions and domains; (3) TREND ANALYSIS — tracking quality metrics over time to identify improvement or degradation; (4) EXCEPTION REPORTING — automated alerts when quality drops below thresholds; (5) DATA QUALITY DASHBOARDS — visual displays for stakeholders showing key quality KPIs. Best practice is to embed quality monitoring into data pipelines (ETL/ELT processes) so issues are caught early. Statistical Process Control (SPC) techniques can detect quality drift before it becomes a major problem.
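The SPC idea mentioned above can be sketched with a Shewhart-style control check: compute limits of mean ± 3 standard deviations from a baseline period and flag new scores that fall outside them. The scores below are made up:

```python
import statistics

def control_limits(baseline):
    """Shewhart control limits: mean +/- 3 sample standard deviations."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    return mu - 3 * sigma, mu + 3 * sigma

def out_of_control(baseline, new_scores):
    """Flag scores outside the control limits (possible quality drift)."""
    lo, hi = control_limits(baseline)
    return [s for s in new_scores if not (lo <= s <= hi)]

baseline = [0.97, 0.98, 0.96, 0.97, 0.98, 0.97, 0.96, 0.98]
print(out_of_control(baseline, [0.97, 0.95, 0.80]))  # only 0.80 flags
```

Run rules (trend lines, consecutive points drifting one direction) catch slower degradation than this single-point check.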
Cost of Poor Data Quality (COPDQ)
Quantifying the financial impact of poor data quality to justify improvement investments. Categories of cost: (1) PREVENTION COSTS — investment in processes and tools to prevent quality issues (training, validation rules, governance); (2) DETECTION/APPRAISAL COSTS — costs of finding quality issues (profiling, auditing, monitoring); (3) INTERNAL FAILURE COSTS — costs of quality issues found before they reach external parties (rework, data cleansing, workarounds); (4) EXTERNAL FAILURE COSTS — costs of quality issues that reach customers or regulators (customer complaints, regulatory fines, lost business). IBM estimates poor data quality costs the US economy $3.1 trillion annually. The 1-10-100 rule suggests it costs $1 to prevent an error, $10 to correct it, and $100 if it reaches the customer.
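The 1-10-100 rule reduces to simple arithmetic, which makes it useful for quick back-of-envelope ROI cases. The error counts below are hypothetical:

```python
# Per-error cost at each stage, per the 1-10-100 rule.
COST_PER_ERROR = {"prevent": 1, "correct": 10, "external_failure": 100}

def copdq(prevented: int, corrected: int, escaped: int) -> int:
    """Total cost given how many errors were handled at each stage."""
    return (prevented * COST_PER_ERROR["prevent"]
            + corrected * COST_PER_ERROR["correct"]
            + escaped * COST_PER_ERROR["external_failure"])

# 1,000 errors: preventing all costs $1,000; letting all reach customers
# costs $100,000 — a 100x difference that motivates prevention investment.
print(copdq(1000, 0, 0))  # 1000
print(copdq(0, 0, 1000))  # 100000
```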
Data Quality Assessment
A formal, structured evaluation of data quality at a point in time. Steps include: (1) DEFINE SCOPE — which data, which systems, which quality dimensions; (2) GATHER BUSINESS REQUIREMENTS — what does good quality mean for each dataset; (3) PROFILE THE DATA — collect statistics and identify patterns; (4) DEFINE QUALITY RULES — translate business requirements into measurable rules; (5) MEASURE — run quality rules and calculate dimension scores; (6) ANALYZE — identify patterns, root causes, and priorities; (7) REPORT — produce a quality scorecard with findings and recommendations; (8) ACTION PLAN — create a prioritized improvement roadmap. Assessments should be repeated periodically (quarterly or annually) to track progress.
Prevention vs Correction (Design Quality In)
The DMBOK2 strongly emphasizes that PREVENTION is preferred over CORRECTION. This means: designing data entry forms with validation rules that reject bad data at the source; implementing dropdown menus and auto-complete to reduce manual entry errors; using reference data lookups to enforce standard values; building quality checks into ETL/integration processes to catch issues in transit; training data entry staff on quality expectations and common errors. The manufacturing analogy: it's better to design a process that produces good products than to inspect finished products and discard defective ones. Total Quality Management (TQM) and Six Sigma principles apply directly to data quality.
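One prevention technique named above, reference data lookups, can be sketched as a check that rejects a nonstandard value at the point of entry instead of cleansing it later. The country list is a deliberately truncated illustrative subset:

```python
# Reference data domain (illustrative subset of ISO 3166-1 alpha-2 codes).
VALID_COUNTRY_CODES = {"US", "GB", "DE", "FR", "JP"}

def accept_entry(record: dict) -> bool:
    """Prevention at the source: value must come from the reference domain."""
    return record.get("country") in VALID_COUNTRY_CODES

print(accept_entry({"country": "US"}))      # True  — accepted
print(accept_entry({"country": "U.S.A."}))  # False — rejected at entry
```

In a real entry form the same domain would back a dropdown, so the invalid value could never be typed at all.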
PDCA Cycle for Continuous Improvement
Data quality improvement follows the Plan-Do-Check-Act cycle: (1) PLAN — identify quality issues, analyze root causes, define improvement targets, design solutions; (2) DO — implement improvements (new validation rules, cleansing processes, training); (3) CHECK — measure results against targets, verify improvement; (4) ACT — standardize successful improvements, adjust unsuccessful ones, identify next priorities. This cycle repeats continuously. Also known as the Deming Cycle. Six Sigma DMAIC (Define-Measure-Analyze-Improve-Control) follows a similar pattern and is another framework frequently referenced in DMBOK2 for data quality improvement.
Data Quality and Data Integration
Data quality challenges multiply during data integration. Common integration quality issues: (1) SCHEMA CONFLICTS — different systems use different structures for the same data; (2) SEMANTIC CONFLICTS — same term means different things in different systems; (3) DUPLICATE RECORDS — same entity represented differently across systems; (4) STALE DATA — different systems updated at different times; (5) TRANSFORMATION ERRORS — data corrupted during ETL processing. Best practice: implement quality checks at each stage of the integration pipeline (extract, transform, load) and reconcile record counts and key aggregates between source and target.
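The reconciliation step recommended above, comparing record counts and key aggregates between source and target, can be sketched as a post-load check. Field names and rows are illustrative:

```python
def reconcile(source_rows, target_rows, amount_field="amount"):
    """Post-load reconciliation: row counts and a control total must match."""
    return {
        "row_count": len(source_rows) == len(target_rows),
        "amount_total": (
            sum(r[amount_field] for r in source_rows)
            == sum(r[amount_field] for r in target_rows)
        ),
    }

src = [{"amount": 100}, {"amount": 250}]
tgt = [{"amount": 100}, {"amount": 250}]
print(reconcile(src, tgt))  # both checks pass when the load was clean
```

A failed check does not say which rows are wrong, only that the pipeline dropped or corrupted something; row-level comparison then narrows it down.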
Data Quality Roles
Key roles in a DQ program: (1) DATA QUALITY ANALYST — profiles data, defines quality rules, measures quality, identifies root causes; (2) DATA QUALITY STEWARD — business representative who defines quality requirements, prioritizes issues, and validates improvements; (3) DATA QUALITY MANAGER — oversees the DQ program, sets strategy, manages resources and budget; (4) DATA QUALITY DEVELOPER — builds and maintains quality monitoring tools, automated checks, and cleansing processes; (5) DATA OWNER — ultimate business accountability for the quality of a data domain. Everyone who creates or modifies data has a quality responsibility, but these roles have formal accountability.
Best Practices
- ✓ Define data quality rules aligned with business requirements — quality means fit for purpose not perfect
- ✓ Profile data BEFORE designing quality improvement programs to understand the current state objectively
- ✓ Address root causes, not just symptoms — use fishbone diagrams and 5 Whys to find underlying problems
- ✓ Implement quality checks at the point of data entry (prevention is cheaper and more effective than correction)
- ✓ Establish data quality dashboards for continuous monitoring with automated alerts when thresholds are breached
- ✓ Create data quality SLAs between data producers and consumers with explicit thresholds and remediation processes
- ✓ Use statistical process control to detect quality drift over time before it becomes a major problem
- ✓ Involve business users in defining quality expectations — they understand what good data means for their work
- ✓ Track the cost of poor data quality (COPDQ) to justify investments and demonstrate ROI to leadership
- ✓ Automate data quality measurement wherever possible — manual quality checks don't scale
- ✓ Build quality checks into ETL/integration pipelines to catch issues before they propagate
- ✓ Maintain a data quality issue log with root causes, resolution status, and prevention measures
💡 Exam Tips
- ★ Data Quality is 11% of the exam — one of the Big Four — expect 11 questions
- ★ MEMORIZE the data quality dimensions: Accuracy, Completeness, Consistency, Timeliness, Validity, Uniqueness, Integrity, Reasonableness, Accessibility
- ★ The Accuracy vs Validity distinction is HEAVILY tested: Accuracy = correct in real world; Validity = conforms to format/rules
- ★ Know the difference between data profiling (understanding current state) and data quality assessment (evaluating against requirements)
- ★ Prevention is ALWAYS preferred over correction — design quality in, don't inspect it in
- ★ Root cause analysis techniques: Fishbone (Ishikawa), Five Whys, Pareto (80/20) — know all three
- ★ Data quality is EVERYONE's responsibility, but specific roles have formal accountability
- ★ Know the PDCA cycle (Plan-Do-Check-Act) and how it applies to continuous DQ improvement
- ★ The 1-10-100 rule: $1 to prevent, $10 to correct, $100 if it reaches the customer
- ★ Three types of profiling: Column analysis, Cross-column (dependency), Cross-table (redundancy)
- ★ Data Quality SLAs create accountability between data producers and consumers
- ★ COPDQ (Cost of Poor Data Quality) is the primary way to justify DQ investment to management