Overview
Data Storage and Operations encompasses the design, implementation, management, and monitoring of systems that store and provide access to data throughout its lifecycle. DMBOK2 Chapter 6 covers this knowledge area, which is concerned with ensuring data is physically stored in reliable, performant, and cost-effective infrastructure while remaining available to authorized users when needed. This knowledge area bridges the gap between logical data architecture (what data we need and how it relates) and the physical reality of databases, file systems, storage hardware, and the operational processes that keep them running.

Core activities include database design and implementation (selecting appropriate database technologies, creating physical schemas, and optimizing for workload characteristics), database administration (managing database instances, user accounts, security, performance, and availability), data storage management (selecting storage technologies, managing storage tiers, capacity planning), backup and recovery (protecting against data loss through regular backups and tested recovery procedures), and database performance monitoring and tuning (ensuring systems meet performance SLAs through query optimization, indexing, and resource management).

Modern data storage has evolved far beyond traditional relational databases. Organizations now manage diverse storage technologies including relational databases (OLTP and OLAP), NoSQL databases (document, key-value, columnar, graph), cloud-managed database services, object storage, data lakes, time-series databases, in-memory databases, and distributed file systems. Each technology has distinct strengths and trade-offs, characterized by the CAP theorem (Consistency, Availability, Partition tolerance) and ACID properties (Atomicity, Consistency, Isolation, Durability).
Effective data storage operations require understanding these characteristics and matching technology choices to workload requirements while maintaining robust operational procedures for reliability, security, and disaster recovery.
Key Concepts
Database Types and Use Cases
Modern data management uses diverse database technologies optimized for different workloads: (1) RELATIONAL (RDBMS) — structured data in tables with SQL access, ACID transactions, enforced schemas. Best for transactional processing (OLTP), financial systems, ERP. Examples: Oracle, SQL Server, PostgreSQL, MySQL. (2) DOCUMENT (NoSQL) — semi-structured data stored as JSON/BSON documents. Flexible schema, nested data. Best for content management, user profiles, catalogs. Examples: MongoDB, Couchbase, Amazon DocumentDB. (3) KEY-VALUE — simple key-to-value mappings with extremely fast lookups. Best for caching, session management, real-time applications. Examples: Redis, Amazon DynamoDB, Riak. (4) COLUMNAR — data stored by column rather than row, enabling analytical queries to read only the columns they need. Best for data warehousing and analytics. Examples: Amazon Redshift, ClickHouse, Vertica. Wide-column stores (Apache Cassandra, HBase, Google Bigtable) are often grouped under this heading, though they organize rows into column families and are tuned for high-volume writes rather than analytics. (5) GRAPH — stores entities (nodes) and relationships (edges) as first-class citizens. Best for social networks, fraud detection, recommendation engines. Examples: Neo4j, Amazon Neptune, TigerGraph. (6) TIME-SERIES — optimized for timestamped sequential data. Best for IoT, monitoring, financial ticks. Examples: InfluxDB, TimescaleDB.
ACID Properties
ACID properties define the guarantees that database transactions must provide to ensure data reliability. (1) ATOMICITY — a transaction is all-or-nothing; either all operations within the transaction complete successfully, or none of them do. If a bank transfer debits one account and credits another, either both happen or neither happens. (2) CONSISTENCY — a transaction brings the database from one valid state to another valid state, enforcing all defined rules (constraints, triggers, cascades). The database never violates its integrity constraints. (3) ISOLATION — concurrent transactions execute as if they were running sequentially; one transaction's intermediate state is invisible to other transactions. Isolation levels range from Read Uncommitted (weakest, allows dirty reads) to Serializable (strongest, full isolation but lowest concurrency). (4) DURABILITY — once a transaction is committed, it persists even if the system crashes. Achieved through write-ahead logging (WAL) and other persistence mechanisms. Traditional relational databases fully support ACID. Many NoSQL databases relax ACID properties (particularly isolation and consistency) for improved performance and scalability.
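The bank-transfer example above can be sketched with Python's built-in sqlite3 module, whose connection context manager commits a transaction on success and rolls it back on any exception. Table and account names are illustrative only.

```python
# Minimal sketch of atomicity: both UPDATEs commit together, or the
# whole transaction rolls back and balances are unchanged.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Debit src and credit dst atomically; roll back on any failure."""
    try:
        with conn:  # commits on normal exit, rolls back on exception
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            # enforce a consistency rule: no negative balances
            row = conn.execute("SELECT balance FROM accounts WHERE name = ?",
                               (src,)).fetchone()
            if row[0] < 0:
                raise ValueError("insufficient funds")
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
        return True
    except ValueError:
        return False  # transaction rolled back; neither account changed

transfer(conn, "alice", "bob", 30)   # succeeds: alice 70, bob 80
transfer(conn, "alice", "bob", 500)  # fails: rolled back, balances unchanged
```

Note that SQLite is a single-node database, so this illustrates the A and C in ACID; isolation levels and distributed durability are where server databases differ.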
CAP Theorem
The CAP theorem (Brewer's theorem) states that a distributed data system can provide at most TWO of these three guarantees simultaneously: (1) CONSISTENCY — every read receives the most recent write (all nodes see the same data at the same time); (2) AVAILABILITY — every request receives a non-error response (the system always responds); (3) PARTITION TOLERANCE — the system continues to operate despite network partitions (communication failures between nodes). Since network partitions are unavoidable in distributed systems, the real choice is between CP (Consistency + Partition tolerance — may refuse requests during partitions to maintain consistency, e.g., HBase, MongoDB in default mode) and AP (Availability + Partition tolerance — always responds but may serve stale data during partitions, e.g., Cassandra, DynamoDB, CouchDB). Traditional single-node relational databases are CA (Consistent + Available) because they don't face partition issues. The CAP theorem is fundamental for understanding trade-offs in distributed database design.
Data Partitioning and Sharding
Partitioning divides a large dataset into smaller, more manageable segments to improve performance, availability, and manageability. VERTICAL PARTITIONING — splits a table by columns, placing frequently accessed columns together and large/infrequently accessed columns separately. Reduces I/O for queries that access only some columns. HORIZONTAL PARTITIONING — splits a table by rows based on a partition key (e.g., date ranges, geographic regions, hash of customer ID). Each partition contains a subset of rows. SHARDING is horizontal partitioning across multiple database instances (servers), where each shard is an independent database containing a portion of the data. Sharding strategies: RANGE-BASED (shard by date ranges or alphabetical ranges — can create uneven distribution), HASH-BASED (shard by hash of partition key — provides even distribution but makes range queries difficult), DIRECTORY-BASED (a lookup table maps partition keys to shards — flexible but adds a lookup overhead). Benefits: improved query performance (queries scan only relevant partitions), parallel processing, and independent backup/maintenance per partition.
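Hash-based sharding can be sketched in a few lines. A stable hash (MD5 here, chosen because Python's built-in hash() is not stable across runs) maps each partition key to one of N shards; shard names and customer IDs are illustrative.

```python
# Sketch of hash-based sharding: route each partition key to a shard.
import hashlib

SHARDS = ["shard_0", "shard_1", "shard_2", "shard_3"]

def shard_for(key: str) -> str:
    """Route a partition key (e.g. a customer ID) to a shard."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# Distribution is even in aggregate, but a range query (e.g. all
# customers 'A'-'C') now touches every shard -- the trade-off noted above.
for customer in ["cust-1001", "cust-1002", "cust-1003"]:
    print(customer, "->", shard_for(customer))
```

Note also that `% len(SHARDS)` means adding a shard remaps most keys; production systems typically use consistent hashing to limit that movement.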
Backup and Recovery Strategies
Backup and recovery protect against data loss from hardware failure, human error, corruption, disasters, and cyberattacks. Backup types: (1) FULL BACKUP — complete copy of all data. Slowest to create but fastest to restore. Typically done weekly. (2) INCREMENTAL BACKUP — copies only data changed since the LAST BACKUP (full or incremental). Fast to create but restoration requires replaying all incrementals since the last full. (3) DIFFERENTIAL BACKUP — copies all data changed since the LAST FULL backup. Faster to restore than incremental (only needs last full + last differential) but larger than incremental. Recovery objectives: RPO (Recovery Point Objective) — maximum acceptable data loss measured in time (e.g., RPO of 1 hour means you can lose at most 1 hour of data). Determines backup frequency. RTO (Recovery Time Objective) — maximum acceptable downtime (e.g., RTO of 4 hours means the system must be restored within 4 hours). Determines recovery infrastructure. The 3-2-1 rule: maintain 3 copies of data, on 2 different media types, with 1 copy offsite. Regular restore testing is critical — untested backups are unreliable.
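The restore implications of incremental versus differential backups can be made concrete with a small sketch over a hypothetical backup catalog: restoring from incrementals needs the last full plus every incremental since it, while differentials need only the last full plus the most recent differential.

```python
def restore_chain(backups):
    """backups: chronological list of (label, type); type is 'full',
    'incremental', or 'differential'. Returns the minimal list of
    backups needed to restore to the most recent point."""
    last_full = max(i for i, (_, t) in enumerate(backups) if t == "full")
    after = backups[last_full + 1:]
    if any(t == "incremental" for _, t in after):
        # incremental scheme: replay every incremental since the full
        return [backups[last_full]] + [b for b in after if b[1] == "incremental"]
    # differential scheme: only the most recent differential is needed
    diffs = [b for b in after if b[1] == "differential"]
    return [backups[last_full]] + diffs[-1:]

# Weekly full + daily incrementals: all four backups are needed.
inc_week = [("sun", "full"), ("mon", "incremental"),
            ("tue", "incremental"), ("wed", "incremental")]
print(restore_chain(inc_week))

# Weekly full + daily differentials: only full + Wednesday are needed.
diff_week = [("sun", "full"), ("mon", "differential"),
             ("tue", "differential"), ("wed", "differential")]
print(restore_chain(diff_week))
```

This also shows why incremental chains are fragile: losing any one incremental breaks the restore, whereas a differential scheme depends on only two files.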
Disaster Recovery Planning
Disaster Recovery (DR) planning ensures business continuity when catastrophic events (natural disasters, cyberattacks, infrastructure failures) render primary data systems unavailable. DR strategies vary by RTO/RPO requirements and cost: (1) COLD SITE — backup facility with space and power but no pre-installed equipment. Cheapest but slowest recovery (days to weeks). (2) WARM SITE — facility with pre-installed hardware and replicated data that is periodically updated. Moderate cost and recovery time (hours to a day). (3) HOT SITE — fully operational mirror of the primary site with real-time data replication and automatic failover. Most expensive but near-instant recovery (minutes). (4) ACTIVE-ACTIVE — both sites process traffic simultaneously, providing zero downtime and instant failover. DR plans must define: critical system prioritization (which systems recover first), communication procedures, roles and responsibilities, vendor contacts, and testing schedules. DR plans must be tested regularly through tabletop exercises, failover tests, and full disaster simulations.
Database Replication Strategies
Replication maintains copies of data across multiple locations for high availability, disaster recovery, and read scaling. KEY STRATEGIES: (1) SYNCHRONOUS REPLICATION — writes are confirmed only after both primary and replica acknowledge. Guarantees zero data loss (RPO=0) but increases write latency because every write waits for the replica. Used for critical financial data. (2) ASYNCHRONOUS REPLICATION — writes are confirmed on the primary immediately; changes propagate to replicas with a delay. Better write performance but risk of data loss if primary fails before propagation. (3) SEMI-SYNCHRONOUS — write confirms after at least one replica acknowledges, balancing performance and durability. TOPOLOGIES: primary-replica (one writer, multiple readers), multi-primary (multiple writers, requires conflict resolution), chain replication (primary → replica1 → replica2), and peer-to-peer (all nodes are equal). Replication lag — the delay between a write on the primary and its visibility on replicas — is a key metric to monitor. Applications reading from replicas must handle potential staleness.
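The latency cost of each acknowledgment strategy can be illustrated with a toy model (delays in milliseconds are illustrative, not measurements from any real system): async confirms after the primary's local write, semi-sync additionally waits for the fastest replica, and sync waits for the slowest.

```python
# Toy model of write-acknowledgment latency under the three strategies.
def ack_latency(primary_ms, replica_ms_list, mode):
    """primary_ms: local write time; replica_ms_list: per-replica
    round-trip times; mode: 'sync', 'async', or 'semi-sync'."""
    if mode == "async":
        return primary_ms                         # confirm immediately; replicas catch up later
    if mode == "semi-sync":
        return primary_ms + min(replica_ms_list)  # wait for one replica
    if mode == "sync":
        return primary_ms + max(replica_ms_list)  # wait for every replica
    raise ValueError(mode)

replicas = [5, 40, 120]  # e.g. same rack, same region, cross-region
print(ack_latency(2, replicas, "async"))      # 2   -- fastest, but RPO > 0
print(ack_latency(2, replicas, "semi-sync"))  # 7   -- one durable copy guaranteed
print(ack_latency(2, replicas, "sync"))       # 122 -- RPO = 0, highest latency
```

The cross-region replica dominating the sync case is exactly why synchronous replication is usually confined to nearby replicas and critical data.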
Service Level Agreements (SLAs)
SLAs define measurable commitments for data storage and operations performance. Key metrics include: AVAILABILITY — percentage of time the system is operational (99.9% = 8.76 hours downtime/year, 99.99% = 52.6 minutes/year, 99.999% = 5.26 minutes/year — these 'nines' are commonly tested). PERFORMANCE — query response time targets (e.g., 95th percentile read latency < 100ms), throughput targets (e.g., 10,000 transactions per second). RECOVERY — RPO (maximum data loss) and RTO (maximum downtime) targets. CAPACITY — storage growth provisions, performance at projected data volumes. SLAs should define: what is measured, how it is measured, measurement frequency, remediation procedures when SLAs are breached, and escalation paths. Internal SLAs between IT and business units are as important as external SLAs with vendors. SLA monitoring requires automated tools that collect metrics continuously and alert when thresholds are approached.
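The downtime figures behind the availability 'nines' can be derived directly: allowed downtime per year = (1 − availability) × hours in a year. Using a 365-day (8,760-hour) year reproduces the numbers quoted above.

```python
# Derive annual downtime allowances from availability percentages.
def downtime_per_year(availability_pct, year_hours=24 * 365):
    """Hours of allowed downtime per year for a given availability %."""
    return (1 - availability_pct / 100) * year_hours

for nines in (99.9, 99.99, 99.999):
    h = downtime_per_year(nines)
    print(f"{nines}% -> {h:.2f} h/year ({h * 60:.1f} minutes)")
# 99.9%  -> 8.76 h/year, 99.99% -> 52.6 min/year, 99.999% -> 5.26 min/year
```

Each added nine divides the allowance by ten, which is why every extra nine is disproportionately expensive to engineer.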
Database Performance Tuning
Performance tuning optimizes database systems to meet response time and throughput requirements. Key tuning areas: (1) QUERY OPTIMIZATION — rewriting inefficient SQL, using EXPLAIN plans to analyze query execution paths, eliminating full table scans, optimizing joins. (2) INDEXING — creating appropriate indexes (B-tree for range queries, hash for equality lookups, composite for multi-column conditions, covering indexes to avoid table lookups). Too many indexes slow writes; too few slow reads. (3) SCHEMA OPTIMIZATION — strategic denormalization for read-heavy workloads, appropriate data types, table partitioning. (4) MEMORY TUNING — configuring buffer pools and cache sizes so frequently accessed data stays in memory. (5) I/O OPTIMIZATION — separating data files and log files onto different storage, using SSDs for high-performance needs. (6) CONNECTION MANAGEMENT — using connection pooling to reduce overhead of establishing database connections. (7) STATISTICS MAINTENANCE — keeping query optimizer statistics up to date so the optimizer makes good execution plan decisions. Performance tuning is iterative: measure, identify bottlenecks, apply changes, measure again.
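The measure → change → re-measure loop can be sketched with sqlite3's EXPLAIN QUERY PLAN, which plays the role of the EXPLAIN plans mentioned above. The table, columns, and index name are illustrative.

```python
# Sketch: inspect a query plan before and after adding a covering index.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(i, i % 100, i * 1.5) for i in range(10_000)])

query = "SELECT total FROM orders WHERE customer_id = 42"

# Before: no usable index, so the optimizer scans the whole table.
before = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(before)   # plan detail contains a SCAN of orders

# Add a covering index on (customer_id, total): the query is answered
# entirely from the index, avoiding table lookups.
conn.execute("CREATE INDEX idx_orders_cust ON orders (customer_id, total)")
conn.execute("ANALYZE")  # refresh optimizer statistics, as noted above
after = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(after)    # plan detail shows a SEARCH using the covering index
```

The same workflow applies to any RDBMS; only the EXPLAIN syntax and plan format differ.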
Data Archiving and Storage Tiers
Data archiving moves data that is no longer actively used but must be retained to lower-cost storage, reducing primary storage costs and improving performance. Storage tiers provide different cost-performance trade-offs: (1) HOT TIER — high-performance storage (SSD, in-memory) for frequently accessed, latency-sensitive data. Most expensive per GB. (2) WARM TIER — moderate-performance storage (standard HDD, lower-tier SSD) for data accessed occasionally. (3) COLD TIER — low-cost, high-latency storage for data rarely accessed but retained for compliance or historical analysis. Examples: Amazon S3 Glacier, Azure Cool Storage. (4) ARCHIVE TIER — lowest-cost storage with very high retrieval latency (hours). For long-term retention only. Data lifecycle policies automatically move data between tiers based on age, access patterns, or classification. Archiving must preserve data integrity, maintain searchability and retrievability, comply with retention requirements, and support legal hold requests. The distinction between archiving (moving to cheaper storage with continued access) and purging (permanent deletion) is important for compliance.
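A lifecycle policy that moves data between the four tiers can be sketched as a simple rule over data age; the age thresholds are hypothetical, not any cloud provider's defaults, and the legal-hold check reflects the requirement above.

```python
# Hypothetical lifecycle policy: select a storage tier by data age.
def tier_for(age_days, on_legal_hold=False):
    """Return the storage tier for data of a given age; a legal hold
    exempts data from lifecycle movement."""
    if on_legal_hold:
        return "hold"      # pinned: must remain retrievable as-is
    if age_days <= 30:
        return "hot"       # SSD / in-memory, frequent access
    if age_days <= 180:
        return "warm"      # standard disk, occasional access
    if age_days <= 365 * 7:
        return "cold"      # e.g. Glacier-class, rare access
    return "archive"       # lowest cost, hours-long retrieval

print(tier_for(10))    # hot
print(tier_for(90))    # warm
print(tier_for(2000))  # cold
print(tier_for(4000))  # archive
```

A real policy would also weigh access patterns and data classification, not age alone, as the section notes.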
Capacity Planning
Capacity planning ensures that data storage and processing infrastructure can meet current and future demands. Key activities: (1) BASELINE MEASUREMENT — documenting current storage utilization, I/O throughput, CPU usage, memory consumption, and growth rates. (2) GROWTH FORECASTING — projecting future needs based on historical trends, planned business initiatives, new data sources, and seasonal patterns. (3) THRESHOLD MANAGEMENT — setting alerts at capacity thresholds (e.g., alert at 80% storage utilization, critical at 90%). (4) PROCUREMENT PLANNING — aligning hardware/cloud resource procurement timelines with projected needs. (5) PERFORMANCE MODELING — predicting how workload changes will affect system performance. Cloud environments simplify capacity management through elastic scaling (automatically adding resources as demand increases) but require careful cost monitoring to prevent unexpected expenses. Capacity planning should cover compute, storage, network bandwidth, and database connections.
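Growth forecasting and threshold management combine naturally: given current utilization and a monthly growth rate (both figures below are illustrative), project when the alert and critical thresholds will be crossed.

```python
# Sketch: months until storage utilization crosses a threshold,
# assuming compound monthly growth (monthly_growth > 0).
def months_until(threshold_pct, current_tb, capacity_tb, monthly_growth):
    """Months until utilization first reaches threshold_pct of capacity."""
    months, used = 0, current_tb
    while used / capacity_tb * 100 < threshold_pct:
        used *= 1 + monthly_growth
        months += 1
    return months

# 40 TB used of 100 TB capacity, growing 5% per month:
print(months_until(80, 40, 100, 0.05))  # months to the 80% alert
print(months_until(90, 40, 100, 0.05))  # months to the 90% critical level
```

The gap between the alert and critical dates is the procurement window; if hardware lead time exceeds it, the alert threshold is set too high.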
Database Administration (DBA) Functions
Database Administration encompasses the operational activities required to keep databases running reliably and efficiently. Core DBA functions: (1) INSTALLATION AND CONFIGURATION — deploying database software, configuring parameters for workload characteristics, setting up high availability. (2) SECURITY MANAGEMENT — creating and managing user accounts, implementing access controls, encryption, auditing. (3) BACKUP AND RECOVERY — implementing backup schedules, testing restores, managing backup storage. (4) PERFORMANCE MONITORING AND TUNING — tracking response times, identifying slow queries, optimizing indexes and configurations. (5) PATCH MANAGEMENT — applying database vendor patches and updates while minimizing downtime. (6) CAPACITY MANAGEMENT — monitoring growth and planning storage/compute expansions. (7) CHANGE MANAGEMENT — coordinating schema changes, migration scripts, and application releases. (8) INCIDENT MANAGEMENT — responding to outages, performance degradations, and security incidents. DBAs work closely with data architects (who design schemas), developers (who write queries), and operations teams (who manage infrastructure). In cloud environments, managed database services shift some DBA responsibilities to the cloud provider.
Best Practices
- ✓ Match database technology to workload characteristics — relational for OLTP, columnar for analytics, document for flexible schema, graph for relationships
- ✓ Implement the 3-2-1 backup rule: 3 copies of data, on 2 different media types, with 1 copy stored offsite or in a different cloud region
- ✓ Define and document RPO and RTO for every database, and test backup restoration procedures at least quarterly
- ✓ Use connection pooling to reduce the overhead of establishing and tearing down database connections
- ✓ Implement automated monitoring with alerts for storage capacity, query performance, replication lag, and availability SLA thresholds
- ✓ Design data archiving policies that automatically move aging data to lower-cost storage tiers based on access patterns and retention requirements
- ✓ Use read replicas to offload reporting and analytics workloads from primary transactional databases
- ✓ Keep index strategies aligned with actual query patterns — review slow query logs regularly and remove unused indexes
- ✓ Implement automated failover for critical databases to minimize downtime when primary instances fail
- ✓ Separate database environments (production, staging, development) with different access controls and never test with production data
- ✓ Conduct regular capacity planning reviews that project growth 6-12 months ahead and align with procurement timelines
- ✓ Document all database configurations, architecture decisions, and operational procedures in a runbook for the operations team
💡 Exam Tips
- ★ Data Storage and Operations is 6% of the exam — expect approximately 6 questions
- ★ Know ACID properties: Atomicity (all or nothing), Consistency (valid state to valid state), Isolation (transactions independent), Durability (committed data persists)
- ★ CAP theorem: distributed systems can guarantee only TWO of Consistency, Availability, and Partition tolerance — since partitions are inevitable, the real choice is CP vs AP
- ★ RPO (Recovery Point Objective) = maximum acceptable DATA LOSS; RTO (Recovery Time Objective) = maximum acceptable DOWNTIME — these are frequently confused on exams
- ★ Full backup copies everything; Incremental copies changes since LAST backup; Differential copies changes since last FULL backup — know the restore implications of each
- ★ Availability 'nines': 99.9% = 8.76 hours/year downtime, 99.99% = 52.6 minutes/year, 99.999% = 5.26 minutes/year
- ★ Know the main database types: relational (SQL, ACID), document (JSON, flexible schema), key-value (fast lookups), columnar (analytics), graph (relationships), time-series (temporal data)
- ★ Synchronous replication guarantees zero data loss but increases latency; asynchronous has better performance but risks data loss on failure
- ★ Hot site (instant failover, most expensive), warm site (hours to recover, moderate cost), cold site (days to recover, cheapest) — know the trade-offs
- ★ Horizontal partitioning splits by ROWS (sharding); vertical partitioning splits by COLUMNS — these are distinct concepts
- ★ Database performance tuning involves query optimization, indexing, memory tuning, I/O optimization, and statistics maintenance
- ★ Data archiving moves data to cheaper storage while maintaining accessibility; purging permanently deletes data — compliance requires knowing the difference