1. Introduction
- The CAP theorem’s impact on modern distributed database system design is more limited than is often perceived. Another tradeoff, between consistency and latency, has had a more direct influence on several well-known DDBSs. A new formulation, PACELC, unifies this tradeoff with CAP.
- Recent Industry Adoption of DDBSs: Despite decades of research, the extensive use of Distributed Database Systems (DDBSs) by the industry is a relatively recent trend.
- Two Primary Drivers: This trend is driven by the need for elastically scalable database systems to handle increased data and transactional throughput, and the requirement for worldwide data accessibility due to globalization.
- Examples of Recent DDBSs: The past decade has seen the development of numerous DDBSs aiming for high scalability or global reach, including SimpleDB/Dynamo/DynamoDB, Cassandra, Voldemort, Sherpa/PNUTS, Riak, HBase/BigTable, MongoDB, VoltDB/H-Store, and Megastore.
- Complexity and the Value of Understanding Tradeoffs: Building DDBSs is complex and difficult, making tools that help designers understand the inherent tradeoffs valuable.
- The CAP Theorem's Usefulness and Misunderstanding: The CAP theorem has been helpful in reasoning about system capabilities and exposing marketing hype, but it is increasingly misunderstood and misapplied.
- Misconception about CAP's Restrictions: Many designers incorrectly believe CAP imposes restrictions during normal system operation, leading to unnecessarily limited implementations.
- CAP's Actual Limitations: In reality, CAP only defines limitations in the presence of certain types of failures and does not constrain normal operation.
- Consistency/Latency Tradeoff's Influence: A different tradeoff, between consistency and latency, has arguably been more influential on DDBS design during normal operation than CAP.
- Importance of Both Tradeoffs: Both CAP-related tradeoffs (during failures) and the consistency/latency tradeoff (during normal operation) are significant.
- PACELC as a Unifying Formulation: Unifying CAP and the consistency/latency tradeoff into a single framework called PACELC can lead to a deeper understanding of modern DDBS design.
2. CAP is for failures
- CAP Theorem Basics: The CAP theorem states that a Distributed Database System (DDBS) can only guarantee two out of three properties: Consistency (C), Availability (A), and Partition Tolerance (P), leading to CA, CP, or AP system possibilities.
- Modern DDBSs and Consistency: Many modern DDBSs (like SimpleDB/Dynamo, Cassandra, Voldemort, Sherpa/PNUTS, and Riak) initially did not guarantee strong consistency as defined by CAP.
- Definition of Consistency in CAP: The CAP proof uses atomic/linearizable consistency: every operation appears to take effect instantaneously at some point between its invocation and its response, yielding a single total order, as if all operations ran on a single node. (A small sketch of a violation appears at the end of this section.)
- Initial Assumption about CAP's Influence: It's often assumed that CAP heavily influenced modern architects to build eventually consistent systems, reasoning that partition tolerance necessitates a choice between consistency and availability, and high availability is often prioritized.
- Flaw in the Assumption: This reasoning is flawed because the consistency/availability tradeoff according to CAP only occurs during an actual network partition, not simply due to the requirement of partition tolerance.
- Network Partition Probability: The likelihood of a network partition depends on factors like network scope (WAN vs. local), hardware quality, configuration management, and redundancy levels. Generally, network partitions are less frequent than other serious failures.
- CAP's Restriction is Conditional: CAP imposes no restrictions on a DDBS during normal operation (in the absence of partitions).
- CAP Doesn't Justify Reduced Consistency in Normal Operation: Therefore, DDBSs that reduce consistency by default, even when there are no network partitions, are not necessarily doing so because of CAP.
- CAP Allows ACID and High Availability without Partitions: CAP permits a system to provide full ACID guarantees alongside high availability when network partitions are not present.
- Conclusion on Default Consistency Reduction: The CAP theorem does not fully justify the default configuration of many DDBSs that reduce consistency and other ACID properties.
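To make the linearizability definition above concrete, here is a minimal sketch of a history that violates it. This is a restricted check for a single write followed by a single read, not a general linearizability checker; all names and structures are illustrative.

```python
# Minimal sketch of the consistency definition used in the CAP proof:
# operations must appear to take effect atomically, in an order that
# respects real time. Here a write is acknowledged before a read even
# begins, yet the read returns the old value -- a violation.

def violates_linearizability(write, read):
    """True if a read that started after a write's ack missed its value."""
    return read["start"] > write["end"] and read["value"] != write["value"]

write_op = {"value": 1, "start": 0.0, "end": 1.0}   # ack received at t=1.0
read_op  = {"value": 0, "start": 2.0, "end": 3.0}   # began after the ack

print(violates_linearizability(write_op, read_op))  # True
```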
3. Consistency/Latency tradeoff
- Context of Modern DDBS Design: To understand modern Distributed Database System (DDBS) design, it's crucial to consider the use cases for which these systems were initially built.
- Examples of DDBS Use Cases:
- Dynamo was designed by Amazon for core e-commerce services (e.g., shopping cart).
- Cassandra was created by Facebook to power its Inbox Search.
- Voldemort was built by LinkedIn to handle online updates from write-intensive features.
- PNUTS was developed by Yahoo to store user data for webpage views, shopping page listings, and social networking applications.
- Riak was motivated by use cases similar to Amazon's.
- Common Use Case Characteristics: These systems typically serve data for webpages generated dynamically and delivered to active website users, and they handle online updates.
- Latency Sensitivity: Studies show that latency is critical for online interactions, with even small increases (e.g., 100ms) significantly impacting user engagement and retention.
- Consistency, Availability, and Latency Tradeoff: There's a fundamental tradeoff between consistency, availability, and latency in these systems.
- Availability vs. Latency: Availability and latency are closely related; unavailability is, in effect, extreme latency. For this discussion, "high latency" means approaching hundreds of milliseconds, while "unavailability" means exceeding a typical request timeout (e.g., a few seconds). The paper later simplifies this three-way tradeoff to a consistency vs. latency tradeoff.
- Tradeoff Independent of Partitions: This consistency/latency tradeoff exists even without network partitions and is distinct from the CAP theorem's tradeoffs.
- Reason for the Tradeoff: Replication and Failure: High availability requires replicating data, because system components inevitably fail over time, and only replication keeps data accessible when they do.
4. Data replication
- Replication and the Consistency/Latency Tradeoff: As soon as a DDBS replicates data, a fundamental tradeoff between consistency and latency emerges.
- Three Basic Replication Strategies: There are three primary ways to implement data replication:
- Sending updates to all replicas simultaneously.
- Sending updates to an agreed-upon master node first.
- Sending updates to a single (arbitrary) node first.
1. Data updates sent to all replicas simultaneously:
- Consistency Issues (without preprocessing/agreement): If updates are sent directly to all replicas without a coordination mechanism, concurrent updates submitted by different clients may arrive at different replicas in different orders, causing the replicas to diverge. Even when updates commute, strict consistency definitions might not be met.
- Consistency via Preprocessing/Agreement: To ensure consistency, updates can pass through a preprocessing layer or involve an agreement protocol among all involved nodes to decide the order of operations.
- Increased Latency (with preprocessing/agreement): Achieving consistency through these methods introduces increased latency due to:
- The overhead of the agreement protocol itself.
- Routing updates through an additional preprocessor component.
- The preprocessor potentially requiring its own agreement protocol if it's composed of multiple machines.
- Forcing all updates to route to a single preprocessor machine, regardless of the initiator's location, even if a closer replica exists.
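As a minimal sketch of the divergence problem above: two replicas receive the same pair of non-commutative updates in different orders and end up with different values. All names here are illustrative.

```python
# Minimal sketch: without a preprocessor or agreement protocol, two
# replicas can apply the same pair of updates in different orders
# and diverge.

def apply_updates(initial, updates):
    """Apply a sequence of update functions to a value, in order."""
    value = initial
    for update in updates:
        value = update(value)
    return value

# Two concurrent, non-commutative updates to the same data item.
set_to_five = lambda v: 5
increment = lambda v: v + 1

# Network delays deliver the updates to each replica in a different order.
replica_1 = apply_updates(0, [set_to_five, increment])  # 0 -> 5 -> 6
replica_2 = apply_updates(0, [increment, set_to_five])  # 0 -> 1 -> 5

assert replica_1 != replica_2  # the replicas have diverged
```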
2. Data updates sent to an agreed-upon location first (Master Node):
- A designated "master node" handles all update requests for a specific data item, determining the order of updates which is then followed by all replicas.
- After processing updates, the master node replicates them to other replicas.
- a. Synchronous Replication:
- The master node waits for confirmation that all replicas have received the update.
- Consistency: Ensures replicas remain consistent.
- Latency: Increases latency, especially over a WAN, due to the need for message passing and the limitation imposed by the slowest entity.
- b. Asynchronous Replication:
- The update is acknowledged to the initiator as complete (e.g., once written to stable storage at the master) before all replicas have received it.
- Consistency/Latency Tradeoff depends on read handling:
- i. All reads routed to the master node:
- Consistency: No reduction in consistency as reads always go to the source of truth.
- Latency: Read latency increases because every request must travel to the master node, even when a closer replica exists; and because reads cannot be load-balanced, they queue behind a busy master and stall if it fails.
- ii. Reads can be served from any node:
- Latency: Read latency is generally much better as reads can be served locally.
- Consistency: Can lead to inconsistent reads as different replicas might have different versions of the data during update propagation. While techniques like sequence numbers and sequential/read-your-writes consistency can bound this inconsistency, they are still forms of reduced consistency. Additionally, write latency can be high if the master node is geographically distant from the write requester.
- c. Combination of Synchronous and Asynchronous Replication
- Updates are sent to a subset of replicas synchronously and the rest asynchronously.
- The consistency/latency tradeoff in this scenario will depend on the size and selection of the synchronous subset.
- The consistency/latency tradeoff is again determined by how the system handles read requests.
- i. Reads routed to at least one synchronously updated node (e.g., quorum-based with R + W > N):
- Consistency: Consistency can be preserved by ensuring reads involve at least one replica that has received the synchronous update.
- Latency: The latency problems of full synchronous replication (2)(a) and of routing all reads to a master (2)(b)(i) are still present, though potentially to a lesser extent: fewer nodes participate in the synchronous step, and more than one node may be able to serve reads.
- ii. Reads can be served from nodes not yet synchronously updated (e.g., quorum-based with R + W ≤ N):
- Consistency: Inconsistent reads are possible, similar to the fully asynchronous read scenario (b)(ii), as reads might hit replicas that haven't received the latest updates.
- Technical Consistency Note: While quorum protocols aim for a degree of consistency, simply using a quorum is technically insufficient to guarantee the strict linearizability consistency defined by Gilbert and Lynch. However, the additional protocol complexities needed for full consistency are not the focus here; latency is already a factor in quorum protocols.
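A minimal sketch of the quorum condition above: a write is acknowledged by W of N replicas, a read consults R replicas and takes the highest-versioned value, and only R + W > N guarantees that the read set overlaps the write set. All names are illustrative, and per the note above, real quorum systems need additional machinery for full linearizability.

```python
# Minimal sketch of quorum reads/writes over N replicas. With
# R + W > N the read set must intersect the write set, so the latest
# write is always observed; with R + W <= N a read can miss it.
import random

N = 5
replicas = [{"version": 0, "value": None} for _ in range(N)]

def write(value, version, w):
    """Synchronously update W replicas; the rest lag behind."""
    for replica in random.sample(replicas, w):
        replica["version"], replica["value"] = version, value

def read(r):
    """Consult R replicas and return the freshest value among them."""
    sampled = random.sample(replicas, r)
    return max(sampled, key=lambda rep: rep["version"])["value"]

write("v1", version=1, w=3)
assert read(r=3) == "v1"   # R + W = 6 > N = 5: always sees the write
print(read(r=1))           # R + W = 4 <= N: may print "v1" or None
```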
3. Data updates sent to an arbitrary location first:
- The system accepts an update at the location where it's initiated and then propagates it to other replicas.
- Unlike the master node approach (2), the initial update location for a specific data item can vary between different updates. Simultaneous updates for the same item can originate from different locations.
- a. Synchronous Replication:
- Latency: Suffers from the same latency issues as synchronous replication with a master node (2)(a).
- Additional Latency: Can incur extra latency to detect and resolve conflicts arising from simultaneous updates initiated at different locations.
- b. Asynchronous Replication:
- Consistency: Faces consistency problems similar to those in sending updates to all replicas simultaneously (1) and asynchronous replication with a master node (2)(b), as updates might not be immediately reflected across all replicas, leading to potential inconsistencies during reads.
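When simultaneous updates to the same item originate at different nodes, the system must eventually reconcile the divergent versions. Below is a minimal sketch of last-writer-wins reconciliation, one simple technique; the systems discussed here may instead use vector clocks or application-level merges, and all names are illustrative.

```python
# Minimal sketch of last-writer-wins (LWW) reconciliation: when two
# replicas hold conflicting versions of the same item, keep the one
# with the later timestamp. The concurrent loser is silently dropped.
from dataclasses import dataclass

@dataclass
class Version:
    value: str
    timestamp: float  # wall-clock time assigned at the accepting node

def reconcile(a: Version, b: Version) -> Version:
    """Keep the version with the later timestamp; the other is lost."""
    return a if a.timestamp >= b.timestamp else b

# Two replicas accepted different updates for the same key at nearly
# the same time; asynchronous propagation later surfaces the conflict.
v_east = Version("shipped", timestamp=1000.002)
v_west = Version("cancelled", timestamp=1000.001)

print(reconcile(v_east, v_west).value)  # "shipped" -- "cancelled" is dropped
```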
5. Tradeoff examples
- Consistency/Latency Tradeoff is Inherent in Replication: Regardless of the replication strategy, DDBSs face a fundamental tradeoff between consistency and latency. This is especially pronounced for WAN replication due to network communication delays.
- High-Availability DDBSs Sacrifice Consistency for Latency: Dynamo, Cassandra, PNUTS, and Riak, designed for low-latency web interactions, all prioritize latency over strong consistency in their baseline configurations.
- Dynamo, Cassandra, and Riak's Replication: These systems use a combination of (2)(c), master-based updates replicated synchronously to only a subset of nodes (permitting potentially inconsistent reads), and (3), updates accepted at arbitrary nodes (further increasing the consistency risk).
- PNUTS Prioritizes Latency with Reduced Consistency: PNUTS uses asynchronous replication from a master to replicas over a WAN, serving reads from any replica to achieve low latency at the cost of consistency.
- Evidence from Cassandra Study: A study showed a significant latency increase (four times or more) when switching from potentially inconsistent "weak reads" to more consistent "quorum reads" in Cassandra.
- SimpleDB Study and Local Replication: A study on SimpleDB found no significant latency increase for consistent reads, likely due to its use of master-slave replication within a single, geographically close Amazon region. Amazon's own documentation warns of increased latency for consistent reads, especially across regions.
- Configurable Consistency: All four DDBSs allow users to adjust parameters to favor consistency over latency (e.g., increasing R + W in quorum systems).
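As an illustration of this tunability, here is a sketch of per-request consistency levels using the DataStax Python driver for Cassandra; the contact point, keyspace, and table names are hypothetical placeholders.

```python
# Sketch of per-request consistency tuning in Cassandra via the
# DataStax Python driver (pip install cassandra-driver).
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("example_keyspace")

query = "SELECT balance FROM accounts WHERE id = 42"

# Weak read: answered by a single replica -- lowest latency, but it
# may return stale data while an update is still propagating.
weak_read = SimpleStatement(query, consistency_level=ConsistencyLevel.ONE)

# Quorum read: a majority of replicas must respond, trading latency
# (the cited study saw a 4x or greater increase) for consistency.
quorum_read = SimpleStatement(query, consistency_level=ConsistencyLevel.QUORUM)

rows = session.execute(quorum_read)
```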
- Consistency/Latency Tradeoff During Normal Operation: The consistency/latency tradeoff exists continuously during normal system operation, even without network partitions, and is amplified over WANs. This suggests runtime latency is a key driver for reduced consistency.
- PNUTS as Evidence Against CAP as the Primary Driver: PNUTS, configured as CP (chooses consistency over availability during a partition by making the master's data unavailable for updates), still sacrifices consistency in its baseline operation for better latency. This highlights the consistency/latency tradeoff as a more obvious reason for reduced consistency.
- CAP's Influence on Dynamo, Cassandra, and Riak: These systems, being AP, likely considered network partitions in their design by implementing mechanisms for data reconciliation upon divergence. The flexibility built for handling partitions might then be reused to choose a point in the baseline consistency/latency tradeoff.
- Conclusion: CAP is only one reason for reduced consistency in modern DDBSs. The consistency/latency tradeoff in replicated systems is a major factor, present continuously, and potentially more influential on baseline operations than CAP, which only applies during (arguably less frequent) network partitions.
6. PACELC
- PACELC Framework: This framework reinterprets CAP by considering two scenarios:
- Partition (P): How does the system trade off Availability (A) and Consistency (C)?
- Else (E): When there's no partition (normal operation), how does the system trade off Latency (L) and Consistency (C)?
- ELC Applies to Replicated Systems: The Latency/Consistency (ELC) tradeoff primarily applies to systems that replicate data. Non-replicated systems face availability issues (extreme latency) upon failures. The decision to replicate or not can be considered part of the ELC tradeoff.
- PA/EL Systems (Dynamo, Cassandra, Riak): These systems prioritize Availability over Consistency during a Partition (PA) and prioritize Low Latency over Consistency during normal operation (EL). Their design is simplified by consistently sacrificing Consistency in both scenarios. They offer user-adjustable settings for the ELC tradeoff (e.g., adjusting R + W), but cannot achieve full linearizability even with stricter settings.
- PC/EC Systems (VoltDB/H-Store, Megastore, BigTable/HBase): These fully ACID systems prioritize Consistency in both scenarios: during a partition they give up Availability rather than Consistency (PC), and during normal operation they accept higher Latency in exchange for Consistency (EC).
- PA/EC System (MongoDB): In normal operation (EC), MongoDB guarantees consistent reads and writes. However, during a partition involving the master node (PA), it prioritizes Availability by electing a new master, leading to potential inconsistencies that require manual reconciliation. While technically not fully Available according to CAP during a minority partition, PACELC classifies it as PA/EC due to the primary tradeoff being between Availability and Consistency during partitions.
- PC/EL System (PNUTS): In normal operation (EL), PNUTS sacrifices Consistency for Latency. During a partition (PC), it sacrifices Availability for Consistency (by making data unavailable for updates if the master is in a minority partition). The PC designation doesn't mean full consistency during partitions, but rather that consistency isn't reduced further than the baseline level; availability is reduced instead.
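The four classifications above can be summarized compactly; a sketch as a lookup table, with each entry pairing the partition-time tradeoff with the else (normal-operation) tradeoff:

```python
# PACELC classifications of the systems discussed above:
# (partition tradeoff, else/normal-operation tradeoff).
PACELC = {
    "Dynamo":         ("PA", "EL"),
    "Cassandra":      ("PA", "EL"),
    "Riak":           ("PA", "EL"),
    "VoltDB/H-Store": ("PC", "EC"),
    "Megastore":      ("PC", "EC"),
    "BigTable/HBase": ("PC", "EC"),
    "MongoDB":        ("PA", "EC"),
    "PNUTS":          ("PC", "EL"),
}
```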
- Complexity of DDBS Tradeoffs: Building DDBSs involves complex tradeoffs that neither CAP nor PACELC can fully explain.
- Importance of Considering Consistency/Latency: Incorporating the Consistency/Latency tradeoff into architectural discussions for modern DDBS design is crucial.
7. References
- Daniel J. Abadi, "Consistency Tradeoffs in Modern Distributed Database System Design: CAP Is Only Part of the Story," IEEE Computer, 2012. https://www.cs.umd.edu/~abadi/papers/abadi-pacelc.pdf