Overview
Big Data and Data Science, covered in DAMA-DMBOK2 Chapter 14, addresses the technologies, techniques, and management challenges associated with extremely large, fast-moving, and diverse data sets, as well as the analytical methods used to extract insights from them. Big Data is commonly characterized by the 5 V's: Volume (massive scale, from terabytes to petabytes and beyond), Velocity (data generated and processed at high speed, including real-time streaming), Variety (diverse formats including structured, semi-structured, and unstructured data), Veracity (uncertainty and trustworthiness of data quality), and Value (the business insights and outcomes derived from analysis). These characteristics demand specialized technologies and architectural approaches that go beyond traditional relational database systems.
Data Science is the interdisciplinary field that combines statistics, mathematics, computer science, and domain expertise to extract knowledge and actionable insights from data. The data science process is often described using the CRISP-DM (Cross-Industry Standard Process for Data Mining) framework, which includes six phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment. Machine learning — a core technique in data science — is categorized into supervised learning (classification and regression using labeled training data), unsupervised learning (clustering and pattern discovery without labels), and reinforcement learning (learning optimal actions through trial and reward). Deep learning, a subset of machine learning using neural networks with many layers, has enabled breakthroughs in image recognition, natural language processing, and speech synthesis.
From a data management perspective, Big Data introduces significant challenges for governance, quality, metadata, security, and architecture.
Data lakes — large repositories storing raw data in its native format until needed for analysis — require careful governance to avoid becoming 'data swamps' where data is dumped without documentation or quality controls. Data engineering has emerged as a critical discipline focused on building and maintaining the pipelines, infrastructure, and architectures that enable data science. DMBOK2 emphasizes that traditional data management principles (governance, quality, security, metadata) still apply to Big Data and data science environments, even though the technologies and approaches differ. Organizations must also address ethical considerations including algorithmic bias, model transparency, privacy implications of large-scale data collection, and the responsible use of predictive and prescriptive analytics.
Key Concepts
The 5 V's of Big Data
Big Data is characterized by five defining dimensions: (1) VOLUME — the sheer quantity of data generated and stored, measured in terabytes, petabytes, or exabytes. Traditional databases cannot efficiently handle this scale. (2) VELOCITY — the speed at which data is generated, ingested, and processed. Includes batch processing (periodic bulk loads), near-real-time (minutes to hours), and real-time streaming (milliseconds). (3) VARIETY — the diversity of data formats and sources: structured (database tables), semi-structured (JSON, XML, log files), and unstructured (text, images, video, audio, sensor data). (4) VERACITY — the reliability, accuracy, and trustworthiness of data. Big data sources often have variable quality, missing values, noise, and inconsistencies that must be addressed. (5) VALUE — the business benefit derived from analyzing big data. Volume, velocity, and variety are meaningless without extracting actionable insights. Some frameworks cite additional V's (variability, visualization), but the 5 V's are the standard DMBOK2 reference.
Big Data Technologies (Hadoop, Spark, MapReduce)
HADOOP is an open-source distributed computing framework that enables storage and processing of large datasets across clusters of commodity hardware. Its core components are: HDFS (Hadoop Distributed File System) — stores data across multiple nodes with built-in replication for fault tolerance; MapReduce — a programming model that processes data in two phases: Map (splits and processes data in parallel across nodes) and Reduce (aggregates partial results into final output). However, MapReduce is batch-oriented and relatively slow for iterative or interactive workloads. APACHE SPARK has largely supplanted MapReduce for many use cases because it performs in-memory processing (up to 100x faster than MapReduce for certain tasks), supports real-time streaming (Spark Streaming), machine learning (MLlib), graph processing (GraphX), and SQL queries (Spark SQL). Other key technologies include: Apache Kafka (real-time data streaming and messaging), Apache Hive (SQL-like queries on Hadoop data), HBase (NoSQL database on HDFS), and various cloud-native alternatives (AWS EMR, Azure HDInsight, Google Dataproc).
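The Map and Reduce phases described above can be sketched in-process. This is a hypothetical single-machine illustration of the programming model only; the actual framework distributes the map, shuffle, and reduce work across cluster nodes.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit (word, 1) pairs for each word in a document split."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(mapped_pairs):
    """Shuffle: group all emitted values by key across map outputs."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the grouped values into final counts."""
    return {word: sum(values) for word, values in groups.items()}

documents = ["big data big insight", "data lake data swamp"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(mapped))
print(counts["data"])  # → 3
```

The same word-count logic is the canonical first example for both Hadoop MapReduce and Spark; Spark's advantage is keeping the intermediate data in memory across such steps.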
CRISP-DM (Data Science Process)
CRISP-DM (Cross-Industry Standard Process for Data Mining) is the most widely adopted methodology for data science projects. It defines six iterative phases: (1) BUSINESS UNDERSTANDING — define the business problem, objectives, and success criteria. What decision will the model support? (2) DATA UNDERSTANDING — explore available data, assess quality, discover initial patterns, and determine data sufficiency. (3) DATA PREPARATION — the most time-consuming phase (often 60-80% of effort). Includes data cleaning, feature engineering, handling missing values, normalization, and creating training/test splits. (4) MODELING — select and apply algorithms (regression, classification, clustering), tune hyperparameters, and train models. (5) EVALUATION — assess model performance against business objectives using metrics like accuracy, precision, recall, F1-score, AUC-ROC. Determine if the model meets the success criteria defined in phase 1. (6) DEPLOYMENT — put the model into production, establish monitoring, and plan for maintenance. The process is iterative: evaluation may reveal the need for better data preparation or a different modeling approach.
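The Evaluation phase's metrics (accuracy, precision, recall, F1) can be computed directly from a confusion matrix. A minimal sketch with toy churn predictions (labels and predictions are illustrative):

```python
def evaluate(y_true, y_pred):
    """Compute binary classification metrics from true/predicted labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy churn example: 1 = churned, 0 = retained
metrics = evaluate(y_true=[1, 1, 0, 0, 1, 0], y_pred=[1, 0, 0, 0, 1, 1])
print(metrics)
```

Whether these numbers are "good enough" is decided against the success criteria set in Business Understanding, not in the abstract — which is why Evaluation loops back to phase 1.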
Machine Learning Types
Machine learning is categorized into three primary types: (1) SUPERVISED LEARNING — the algorithm learns from labeled training data (input-output pairs) to predict outcomes for new data. Subtypes: CLASSIFICATION (predicting categorical outcomes — e.g., spam vs. not-spam, customer churn yes/no) using algorithms like logistic regression, decision trees, random forests, and support vector machines; REGRESSION (predicting continuous numeric values — e.g., house prices, revenue forecasts) using linear regression, polynomial regression, or gradient boosting. (2) UNSUPERVISED LEARNING — the algorithm finds patterns in unlabeled data without predetermined outcomes. Subtypes: CLUSTERING (grouping similar items — e.g., customer segmentation) using k-means, hierarchical clustering, or DBSCAN; DIMENSIONALITY REDUCTION (simplifying data while preserving structure) using PCA or t-SNE; ASSOCIATION RULES (finding co-occurrence patterns — e.g., market basket analysis). (3) REINFORCEMENT LEARNING — an agent learns optimal behavior through trial-and-error interactions with an environment, receiving rewards or penalties. Used in robotics, game playing, recommendation engines, and autonomous systems.
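As an unsupervised-learning illustration, here is a minimal one-dimensional k-means for customer-spend segmentation. The data, k=2, and fixed initial centroids are illustrative choices made for determinism; real uses rely on library implementations and careful initialization.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Minimal k-means on 1-D values with fixed initial centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid's cluster
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

spend = [10, 12, 11, 95, 99, 102]  # two obvious customer segments
centroids, clusters = kmeans_1d(spend, centroids=[0.0, 50.0])
print(sorted(round(c) for c in centroids))  # → [11, 99]
```

Note that no labels were supplied: the algorithm discovered the two spend segments from the data's structure alone, which is exactly what distinguishes clustering from classification.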
Data Lake
A data lake is a centralized repository that stores raw data in its native format — structured, semi-structured, and unstructured — at any scale, until it is needed for analysis. Unlike a data warehouse (which stores processed, structured data in a predefined schema), a data lake uses a SCHEMA-ON-READ approach: data is ingested without transformation and structure is applied when the data is accessed for analysis. Benefits: flexibility to store any data type, low-cost storage for massive volumes, support for diverse analytics (SQL, machine learning, streaming). Risks: without proper governance and metadata management, data lakes become 'DATA SWAMPS' — repositories where data is dumped without documentation, quality controls, or discoverability, making it useless. Best practices to prevent data swamps include: implementing data cataloging and metadata management, establishing data quality checks at ingestion, applying access controls and security, defining retention policies, and assigning data stewards. Modern DATA LAKEHOUSE architectures combine data lake flexibility with data warehouse reliability and performance.
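Schema-on-read can be made concrete with a small sketch: raw events land in the lake untouched, and a schema (the field names and types here are hypothetical) is applied only at access time, with non-conforming records quarantined rather than silently dropped.

```python
import json

# Raw zone: events ingested as-is, with no transformation at write time
raw_zone = [
    '{"user": "u1", "amount": "42.50", "ts": "2023-01-05"}',
    '{"user": "u2", "amount": "not-a-number", "ts": "2023-01-06"}',
]

READ_SCHEMA = {"user": str, "amount": float, "ts": str}

def read_with_schema(raw_records, schema):
    """Apply structure at read time; records that fail the schema are quarantined."""
    valid, rejected = [], []
    for line in raw_records:
        record = json.loads(line)
        try:
            valid.append({field: cast(record[field]) for field, cast in schema.items()})
        except (KeyError, ValueError):
            rejected.append(line)
    return valid, rejected

valid, rejected = read_with_schema(raw_zone, READ_SCHEMA)
print(len(valid), len(rejected))  # → 1 1
```

The rejected record illustrates the veracity problem: without quality checks and cataloging around such reads, bad records accumulate invisibly and the lake drifts toward a swamp.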
Data Engineering
Data engineering is the discipline focused on designing, building, and maintaining the infrastructure and pipelines that collect, store, transform, and serve data for analytics and data science. Data engineers build and operate: ETL/ELT PIPELINES — automated workflows that extract data from sources, transform it, and load it into target systems; DATA INFRASTRUCTURE — databases, data lakes, data warehouses, streaming platforms, and cloud services; DATA QUALITY CHECKS — automated validation and monitoring within pipelines; ORCHESTRATION — scheduling and managing dependencies between pipeline tasks (using tools like Apache Airflow, Luigi, or Prefect). Data engineering is distinct from data science: engineers build the infrastructure, scientists build the models. The two roles are complementary and must collaborate closely. Without reliable data engineering, data science models have no trustworthy data to train on.
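The extract-transform-load flow with an embedded quality check can be sketched as follows. The source rows, the email-presence rule, and the in-memory target are all illustrative stand-ins for real source systems, validation suites, and warehouses.

```python
def extract():
    """Extract: pull rows from a (hypothetical) source system."""
    return [
        {"id": 1, "email": "a@example.com", "age": "34"},
        {"id": 2, "email": "", "age": "29"},   # will fail the quality check
        {"id": 3, "email": "c@example.com", "age": "41"},
    ]

def transform(rows):
    """Transform: clean, type-cast, and apply quality checks inside the pipeline."""
    cleaned = []
    for row in rows:
        if not row["email"]:   # quality check at the pipeline, not at consumption
            continue
        cleaned.append({"id": row["id"], "email": row["email"], "age": int(row["age"])})
    return cleaned

def load(rows, target):
    """Load: upsert cleaned rows into the target store, keyed by id."""
    for row in rows:
        target[row["id"]] = row
    return target

warehouse = load(transform(extract()), target={})
print(sorted(warehouse))  # → [1, 3]
```

In practice each of these functions would be a task in an orchestrator such as Airflow, which handles scheduling, retries, and the dependencies between them.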
Predictive and Prescriptive Analytics
Analytics maturity progresses through four stages: (1) DESCRIPTIVE ANALYTICS — what happened? (reports, dashboards, KPIs); (2) DIAGNOSTIC ANALYTICS — why did it happen? (root cause analysis, drill-down analysis); (3) PREDICTIVE ANALYTICS — what will happen? Uses statistical models and machine learning to forecast future outcomes based on historical patterns. Examples: customer churn prediction, demand forecasting, equipment failure prediction (predictive maintenance), credit risk scoring. (4) PRESCRIPTIVE ANALYTICS — what should we do? Goes beyond prediction to recommend optimal actions. Uses optimization algorithms, simulation, and decision models to evaluate alternatives and suggest the best course of action. Examples: dynamic pricing, supply chain optimization, treatment recommendation systems. The progression from descriptive to prescriptive represents increasing analytical sophistication, data science capability, and business value.
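A minimal predictive-analytics sketch: fitting a least-squares trend line to historical monthly demand and forecasting the next period. The demand figures are toy numbers chosen to make the trend obvious.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit: return (slope, intercept) for y = slope*x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

months = [1, 2, 3, 4, 5]
demand = [100, 110, 120, 130, 140]  # historical pattern: steady growth
slope, intercept = fit_line(months, demand)
forecast = slope * 6 + intercept    # predictive: what will month 6 look like?
print(round(forecast))  # → 150
```

A prescriptive layer would go one step further, e.g. feeding this forecast into an optimization model that recommends how much inventory to order.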
Data Visualization
Data visualization is the graphical representation of data and analytical results to facilitate understanding, communication, and decision-making. Effective visualization applies principles of human visual perception and cognitive psychology. Key concepts include: CHART SELECTION — choosing the right visual form for the data type and message (bar charts for comparisons, line charts for trends, scatter plots for correlations, heat maps for density, geographic maps for spatial data); DESIGN PRINCIPLES — minimize chartjunk, maximize data-ink ratio (Tufte's principle), use appropriate scales, avoid misleading representations (truncated axes, 3D distortion); INTERACTIVITY — allowing users to drill down, filter, and explore data dynamically (dashboards); STORYTELLING — structuring visualizations into a narrative that guides the audience to insights and actions. Common tools include Tableau, Power BI, D3.js, and Python libraries (matplotlib, seaborn, plotly). Visualization is the primary interface between data science and business decision-makers.
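To illustrate chart selection, here is a text-based bar chart for a categorical comparison. This stdlib sketch stands in for a plotting library such as matplotlib; the regional revenue figures are illustrative.

```python
def bar_chart(values, width=20):
    """Render a horizontal bar chart as text, scaled to the largest value."""
    peak = max(values.values())
    lines = []
    for label, value in values.items():
        bar = "#" * round(value / peak * width)  # bar length proportional to value
        lines.append(f"{label:<10}{bar} {value}")
    return "\n".join(lines)

revenue = {"North": 120, "South": 80, "East": 60, "West": 100}
print(bar_chart(revenue))
```

Bars support comparison because length is one of the visual encodings humans judge most accurately; the same data as a pie chart would force the audience to compare angles, which is harder.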
Data Mining
Data mining is the process of discovering patterns, correlations, anomalies, and insights in large datasets using statistical, mathematical, and computational techniques. It differs from traditional query and reporting because it seeks previously unknown patterns rather than answering predefined questions. Core data mining techniques include: CLASSIFICATION — assigning items to predefined categories based on attributes; CLUSTERING — identifying natural groupings in data without predefined categories; ASSOCIATION RULE MINING — finding relationships between variables (e.g., customers who buy bread often buy butter); ANOMALY DETECTION — identifying unusual patterns that deviate from expected behavior (fraud detection, network intrusion detection); REGRESSION — modeling relationships between variables to predict numeric outcomes; SEQUENCE ANALYSIS — discovering sequential patterns in time-ordered data. Data mining is closely related to machine learning but is more focused on the discovery process, while machine learning emphasizes building models for prediction.
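As a concrete anomaly-detection sketch, the classic statistical approach flags values whose z-score exceeds a threshold. The transaction amounts and the threshold of two standard deviations are illustrative; production fraud detection uses far richer models.

```python
import statistics

def find_anomalies(values, threshold=2.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / stdev > threshold]

transactions = [52, 48, 50, 51, 49, 50, 250]  # one suspicious amount
print(find_anomalies(transactions))  # → [250]
```

The point matches the definition above: no one asked "is 250 fraudulent?" in advance — the technique surfaces the deviation from expected behavior on its own.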
Model Deployment and Monitoring (MLOps)
Deploying machine learning models into production and maintaining them over time is a critical challenge. MLOps (Machine Learning Operations) is the discipline that applies DevOps principles to machine learning. Key concepts: MODEL DEPLOYMENT — moving a trained model from a development environment into production where it makes real-time or batch predictions. Deployment options include REST APIs, embedded models, edge deployment, and batch scoring. MODEL MONITORING — tracking model performance over time to detect degradation. MODEL DRIFT — when the statistical properties of the data the model was trained on change over time, causing predictions to become less accurate. Types include CONCEPT DRIFT (the relationship between input features and the target variable changes) and DATA DRIFT (the distribution of input features changes). MODEL RETRAINING — periodically updating the model with new data to maintain accuracy. A/B TESTING — comparing a new model against the current production model. MLOps tools include MLflow, Kubeflow, SageMaker, and Vertex AI.
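Data drift monitoring can be sketched by comparing a live feature window against its training baseline. The tolerance of two training standard deviations is an illustrative choice; real monitoring often uses distribution-level tests rather than a mean comparison.

```python
import statistics

def detect_data_drift(training_values, live_values, tolerance=2.0):
    """Flag drift when the live mean strays too far from the training baseline."""
    baseline_mean = statistics.mean(training_values)
    baseline_stdev = statistics.stdev(training_values)
    live_mean = statistics.mean(live_values)
    return abs(live_mean - baseline_mean) > tolerance * baseline_stdev

training_ages = [34, 36, 35, 33, 37, 35]  # feature distribution seen during training
drifted_ages = [61, 63, 60, 62, 64, 59]   # live inputs have shifted markedly
print(detect_data_drift(training_ages, [34, 35, 36, 33]))  # → False
print(detect_data_drift(training_ages, drifted_ages))      # → True
```

A drift alert like the second case would typically trigger investigation and, if confirmed, model retraining on fresher data.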
Ethical Considerations in AI/ML and Algorithmic Bias
AI and machine learning systems can perpetuate, amplify, or create unfair outcomes if not carefully designed and monitored. ALGORITHMIC BIAS occurs when a model systematically produces results that are unfairly prejudiced toward or against certain groups. Sources of bias: TRAINING DATA BIAS — historical data reflecting past discrimination (e.g., hiring data biased against women); SELECTION BIAS — non-representative training samples; MEASUREMENT BIAS — features that serve as proxies for protected characteristics (zip code proxying for race); AGGREGATION BIAS — one model applied to groups with different underlying patterns. Mitigation strategies: diverse and representative training data, bias audits and fairness metrics (disparate impact ratio, equalized odds, demographic parity), algorithmic fairness techniques (reweighting, adversarial debiasing), explainability tools (LIME, SHAP) to understand model decisions, and human oversight for high-stakes decisions. Organizations must establish ethical AI frameworks, conduct impact assessments, and maintain transparency about how models are used.
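The disparate impact ratio mentioned above compares favorable-outcome rates between groups; under the common "four-fifths rule," a ratio below 0.8 is flagged for review. A minimal sketch on toy loan-approval data (the decisions and group labels are illustrative):

```python
def disparate_impact(outcomes, groups, protected, favorable=1):
    """Ratio of the protected group's favorable-outcome rate to the other group's."""
    def rate(group):
        selected = [o for o, g in zip(outcomes, groups) if g == group]
        return sum(1 for o in selected if o == favorable) / len(selected)
    other = next(g for g in groups if g != protected)
    return rate(protected) / rate(other)

# Toy loan decisions: 1 = approved, 0 = denied
outcomes = [1, 1, 0, 0, 1, 1, 1, 0]
groups   = ["A", "A", "A", "A", "B", "B", "B", "B"]
ratio = disparate_impact(outcomes, groups, protected="A")
print(round(ratio, 2), ratio < 0.8)  # → 0.67 True
```

Here group A's 50% approval rate against group B's 75% yields a ratio of about 0.67, below the four-fifths threshold — the kind of signal a bias audit is designed to surface before deployment.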
Natural Language Processing (NLP)
NLP is a branch of artificial intelligence that enables computers to understand, interpret, generate, and interact with human language. Core NLP tasks include: TEXT CLASSIFICATION — assigning categories to text (sentiment analysis, spam detection, topic classification); NAMED ENTITY RECOGNITION (NER) — identifying and extracting entities like person names, organizations, locations, and dates from text; MACHINE TRANSLATION — converting text from one language to another; TEXT SUMMARIZATION — generating concise summaries of longer documents; QUESTION ANSWERING — extracting or generating answers from a knowledge base or text corpus; CHATBOTS AND CONVERSATIONAL AI — enabling human-machine dialogue. NLP is relevant to data management because it enables processing and analysis of unstructured text data, which constitutes a large percentage of organizational data. Modern NLP relies on transformer-based deep learning models (BERT, GPT) that have dramatically improved accuracy across all NLP tasks.
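Text classification, the first task listed, can be sketched with a tiny Naive Bayes spam filter using Laplace smoothing. The training examples are toy data, and modern systems use the transformer models mentioned above rather than this classical approach.

```python
from collections import Counter, defaultdict
import math

def train(labeled_docs):
    """Count word and label frequencies from (text, label) training pairs."""
    word_counts, label_counts, vocab = defaultdict(Counter), Counter(), set()
    for text, label in labeled_docs:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return word_counts, label_counts, vocab

def predict(text, word_counts, label_counts, vocab):
    """Pick the label with the highest log posterior (Laplace-smoothed)."""
    total = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        log_prob = math.log(label_counts[label] / total)      # prior
        n_words = sum(word_counts[label].values())
        for token in text.lower().split():                    # smoothed likelihoods
            log_prob += math.log((word_counts[label][token] + 1)
                                 / (n_words + len(vocab)))
        scores[label] = log_prob
    return max(scores, key=scores.get)

model = train([("win money now", "spam"), ("free prize win", "spam"),
               ("meeting agenda today", "ham"), ("project status report", "ham")])
print(predict("win free money", *model))  # → spam
```

Even this naive bag-of-words model shows the core idea: unstructured text becomes countable features that a statistical model can classify.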
Best Practices
- ✓ Apply the same core data management principles (governance, quality, security, metadata) to big data environments — scale does not eliminate the need for discipline
- ✓ Implement robust metadata management and data cataloging for data lakes to prevent them from becoming ungoverned 'data swamps'
- ✓ Use the CRISP-DM methodology to structure data science projects with clear business objectives, iterative development, and rigorous evaluation before deployment
- ✓ Establish data quality checks at the point of data ingestion into big data systems, not just at the point of consumption or analysis
- ✓ Build cross-functional data science teams that include business domain experts alongside data scientists, data engineers, and ML engineers
- ✓ Implement MLOps practices for model deployment, monitoring, and retraining to ensure models remain accurate and reliable in production
- ✓ Conduct bias audits and fairness assessments on machine learning models before deploying them for decisions that affect people
- ✓ Document all models with clear descriptions of training data, features, assumptions, limitations, and known biases to support transparency and reproducibility
- ✓ Start with simpler models and analytics techniques before investing in complex machine learning — descriptive and diagnostic analytics often provide significant value
- ✓ Design data architectures that support both real-time streaming and batch processing to address the full range of velocity requirements
- ✓ Establish clear data retention and disposal policies for big data repositories, as the low cost of storage often leads to indefinite hoarding of data with no business value
- ✓ Ensure data visualization and communication of results are part of every data science project — insights that cannot be clearly communicated do not drive business decisions
💡 Exam Tips
- ★ Big Data and Data Science is 2% of the exam — expect approximately 2 questions, but understand the concepts well as they intersect with other knowledge areas
- ★ Know the 5 V's of Big Data: Volume, Velocity, Variety, Veracity, and Value — be able to define each and provide examples
- ★ CRISP-DM has SIX phases (Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment) — Data Preparation is the most time-consuming, often 60-80% of project effort
- ★ Know the THREE types of machine learning: Supervised (labeled data, classification/regression), Unsupervised (unlabeled data, clustering/association), Reinforcement (trial and reward)
- ★ Data Lake uses SCHEMA-ON-READ (structure applied at query time) vs Data Warehouse uses SCHEMA-ON-WRITE (structure applied at load time) — this is a frequently tested distinction
- ★ A data lake without governance becomes a 'data swamp' — the exam may test what prevents this (metadata management, cataloging, quality controls, stewardship)
- ★ The four levels of analytics maturity: Descriptive (what happened), Diagnostic (why), Predictive (what will happen), Prescriptive (what to do) — know the progression and examples
- ★ Algorithmic bias sources include: training data bias, selection bias, measurement bias, and aggregation bias — the exam may test your understanding of how bias enters models
- ★ Model drift (concept drift and data drift) is the degradation of model performance over time — know that monitoring and retraining are required for production models
- ★ Hadoop/MapReduce is batch-oriented; Spark provides in-memory processing and supports streaming, ML, and SQL — know the key differences between these technologies
- ★ DMBOK2 emphasizes that traditional data management principles (governance, quality, security, metadata) STILL APPLY to big data environments — new technology does not eliminate management requirements
- ★ Know that data engineering (building pipelines and infrastructure) is distinct from data science (building models) — both are essential and complementary disciplines