Tuesday, April 15, 2025

Replication


  • Replication means keeping a copy of the same data on multiple machines that are connected via a network.
  • There are three popular replication algorithms:
    • Single-leader
    • Multi-leader
    • Leaderless replication

     Leader-based replication (also known as active/passive or master-slave replication)

  • One of the replicas is designated as the leader. Clients must send all writes to the leader.
  • The other replicas are known as followers. Whenever the leader writes data to its local storage, it sends the update to all of its followers as part of a replication log or change stream.
  • When a client wants to read from the database, it can query either the leader or any of the followers.

        Synchronous Versus Asynchronous Replication

  • The advantage of synchronous replication is that the follower is guaranteed to have an up-to-date copy of the data that is consistent with the leader. The major disadvantage is that the leader has to block all writes and wait until the synchronous replica is available again.
  • In practice, if synchronous replication is enabled on a database, then one of the followers is synchronous and the rest are asynchronous (sometimes called semi-synchronous replication).
  • Although asynchronous leader-based replication doesn't guarantee write durability, it has the advantage that the leader can continue processing writes even if all of its followers have fallen behind. Despite this disadvantage, it is widely used.
  • Note: Losing data when the leader fails can be a serious problem for systems using asynchronous replication. Chain replication (a variant of synchronous replication) addresses this and has been successfully implemented in a few systems, such as Microsoft Azure Storage.

        Setting Up New Followers

  • There is often a need to set up new followers, either to increase the number of replicas or to replace failed nodes.
  • Steps:
    • Take a consistent snapshot of the leader's database at some point in time.
    • Copy the snapshot to the new follower node.
    • The follower connects to the leader and requests all data changes that have happened since the snapshot was taken.
    • The follower processes the backlog of data changes since the snapshot until it has caught up with the leader.
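The bootstrap procedure above can be sketched in a few lines. The `Leader` and `Follower` classes below are hypothetical in-memory stand-ins, not a real database API:

```python
# Minimal sketch of bootstrapping a new follower from a leader snapshot.

class Leader:
    def __init__(self):
        self.data = {}
        self.log = []  # ordered list of (position, key, value) changes

    def write(self, key, value):
        self.data[key] = value
        self.log.append((len(self.log), key, value))

    def snapshot(self):
        # A consistent snapshot plus the log position it corresponds to.
        return dict(self.data), len(self.log)

    def changes_since(self, position):
        return self.log[position:]

class Follower:
    def __init__(self):
        self.data = {}
        self.position = 0

    def bootstrap(self, leader):
        # Steps 1-2: take a snapshot and copy it to the follower.
        self.data, self.position = leader.snapshot()
        # Steps 3-4: request and apply changes made since the snapshot.
        self.catch_up(leader)

    def catch_up(self, leader):
        for pos, key, value in leader.changes_since(self.position):
            self.data[key] = value
            self.position = pos + 1

leader = Leader()
leader.write("a", 1)
follower = Follower()
follower.bootstrap(leader)
leader.write("b", 2)          # a write that happens after the snapshot
follower.catch_up(leader)     # the follower catches up on the backlog
assert follower.data == leader.data
```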

        Handling Node Outages

  • Each follower keeps a log of the data changes it has received from the leader. If the follower crashes or experiences a temporary interruption, it requests the data changes that occurred while it was disconnected.
  • In case of leader failure, one of the followers needs to be promoted to be the new leader. An automatic failover process consists of the following steps -
    • 1. Determining that the leader has failed, usually via timeouts (typically set to around 30 seconds).
    • 2. Choosing a new leader, usually through an election process.
    • 3. Reconfiguring the system to use the new leader. Clients now need to send their write requests to the new leader.
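Step 1 is usually implemented with heartbeats and a timeout. A minimal sketch, assuming a simple wall-clock heartbeat check (the 30-second timeout is the illustrative value mentioned above):

```python
# Hedged sketch of timeout-based failure detection for step 1 of failover.
import time

TIMEOUT_SECONDS = 30

def leader_has_failed(last_heartbeat, now=None, timeout=TIMEOUT_SECONDS):
    """Declare the leader dead if no heartbeat was seen within the timeout."""
    now = time.time() if now is None else now
    return (now - last_heartbeat) > timeout

# A follower that last heard from the leader 45 seconds ago starts an election;
# one that heard from it 10 seconds ago does nothing.
assert leader_has_failed(last_heartbeat=0, now=45) is True
assert leader_has_failed(last_heartbeat=0, now=10) is False
```

Note that choosing the timeout is a trade-off: too short and a slow leader triggers unnecessary failovers; too long and recovery is delayed.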

        Implementation of Replication Logs

  • There are several different replication methods used in practice.
        i. Statement-based replication
  • The leader logs every write statement that it executes and sends that statement log to its followers.
  • For a relational database, every INSERT, UPDATE or DELETE is forwarded to the followers, and each follower parses and executes that SQL statement.
  • This approach can break down if a statement uses a nondeterministic function such as NOW() or RAND(), since each replica would compute a different value.
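The nondeterminism problem can be illustrated with a toy example; `execute_statement` below is a hypothetical stand-in for a replica executing a statement containing RAND():

```python
# Why statement-based replication breaks with nondeterministic functions:
# each replica evaluates RAND() independently when it executes the statement,
# so the replicas diverge.
import random

def execute_statement(db):
    # Hypothetical stand-in for replaying "UPDATE t SET value = RAND()".
    db["value"] = random.random()

leader, follower = {}, {}
execute_statement(leader)    # the leader executes the statement
execute_statement(follower)  # the follower re-executes the same statement
# The two replicas have almost certainly diverged, since RAND() ran twice.
# Logical (row-based) replication avoids this by shipping the concrete value
# the leader computed, not the statement.
```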
        ii. Write-Ahead log (WAL) shipping
  • The log is an append-only sequence of bytes containing all writes to the database. We can use the exact same log to build a replica on another node: besides writing the log to disk, the leader also sends it across the network to its followers.
  • Because the WAL records data at a very low level (which bytes changed in which disk blocks), it is closely coupled to the storage engine. If the leader and followers run incompatible database software versions, replication breaks, which can force downtime during upgrades.

        iii. Logical (row-based) log replication
  • If a different log format is used for replication than for the storage engine's internals, the replication log is decoupled from the storage engine. Such a log is called a logical log.
        iv. Trigger-based replication
  • A trigger lets us register custom application code that is automatically executed when a data change (write transaction) occurs in a database system.

        Problems with Replication Lag

  • Leader-based replication requires all writes to go through a single node, but read-only queries can go to any replica.
  • This is useful in web applications, as workloads typically have many more reads than writes. However, for this read-scaling approach to work, we need to use asynchronous replication.
  • However, reading from asynchronous followers can lead to read results that are inconsistent between a follower and the leader. This inconsistency is only a temporary state - eventual consistency (if all writes stop, the followers will eventually catch up to the leader).
        Reading Your Own Writes
  • With asynchronous replication there is a problem: the follower a user reads from might not have caught up to the leader, so to the user who just made a write it looks as though the data they submitted was lost.
  • In this case we need read-after-write consistency, also known as read-your-writes consistency. This is a guarantee that if the user reloads the page, they will always see any updates they submitted themselves.
  • Steps to implement read-after-write consistency -
    • When reading something the user may have modified, read it from the leader; otherwise, read it from a follower.
    • This approach doesn't scale if most things in the application are potentially editable by the user. So we might track the time of last update and, for one minute after the last update, make all reads from the leader.
    • The client can remember the timestamp of its most recent write - then the system can ensure that the replica serving any reads for that user reflects updates at least until that timestamp (logical timestamp).
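The logical-timestamp approach can be sketched as follows; the replica list and its `applied_ts` positions are illustrative assumptions:

```python
# Route a read to a replica that has applied the client's last write.
# Each replica is a (name, applied_ts) pair, where applied_ts is the
# logical timestamp of the last update the replica has applied.

def choose_replica(replicas, last_write_ts):
    for name, applied_ts in replicas:
        if applied_ts >= last_write_ts:
            return name
    # No follower has caught up far enough: fall back to the leader,
    # which always has the latest writes.
    return "leader"

replicas = [("follower-1", 5), ("follower-2", 9)]
assert choose_replica(replicas, last_write_ts=7) == "follower-2"
assert choose_replica(replicas, last_write_ts=12) == "leader"
```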
        Monotonic Reads
  • When you read data, you may see an old value; monotonic reads only means that if one user makes several reads in sequence, they will not see time go backward, i.e., they will not read older data after having previously read newer data.
  • It's a lesser guarantee than strong consistency, but a stronger guarantee than eventual consistency.
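One common way to achieve monotonic reads is to route all of a user's reads to the same replica, for example by hashing the user ID. A minimal sketch (the replica names are hypothetical):

```python
# Monotonic reads via sticky routing: a given user always reads from the
# same replica, so they never see an older replica after a newer one.
import hashlib

def replica_for_user(user_id, replicas):
    # Hash the user ID to pick a stable replica for that user.
    digest = hashlib.sha256(user_id.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(replicas)
    return replicas[index]

replicas = ["replica-1", "replica-2", "replica-3"]
# The same user always maps to the same replica:
assert replica_for_user("alice", replicas) == replica_for_user("alice", replicas)
```

If that replica fails, reads must be rerouted, and the monotonicity guarantee needs other means to be preserved.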
        Consistent Prefix Reads
  • If some partitions are replicated slower than others, an observer may see answer before they see the question.
  • The consistent prefix reads guarantee says that if a sequence of writes happens in a certain order, then anyone reading those writes will see them appear in the same order.
  • One solution is to make sure that any writes that are causally related to each other are written to the same partition.
        Solutions for Replication Lag
  • In an eventually consistent system, it is worth thinking about how the application behaves if the replication lag increases to several minutes or even hours. If it might cause an issue then we need to provide stronger guarantee like read-after-write.
  • Ideally, application developers shouldn't have to worry about subtle replication issues and could just trust their databases. This is why transactions exist: to provide stronger guarantees.

    Multi-Leader Replication

  • Allows more than one node to accept writes. Replication still happens in the same way: each node that processes a write must forward that data change to all other nodes.

        Use Cases for Multi-Leader Replication

  • Multi-datacenter operation
    • Multi-leader configuration allows us to have a leader in each datacenter.
    • Multi-leader has better performance compared to single-leader configuration as the writes can be processed locally in the datacenter and replicated to other replicas asynchronously.
    • In multi-leader each datacenter can continue operating independently of the others and replication catches up when the failed datacenter comes back online.
    • A multi-leader configuration with asynchronous replication can usually tolerate network problems between datacenters better than a single-leader configuration.
  • Clients with offline operation
    • Multi-leader replication is suitable for applications that need to remain fully functional (read and write) even when disconnected from the internet. The example of calendar apps on various devices illustrates this: each device acts as a local leader, accepting changes offline, and these changes are asynchronously synced with other devices and a server when connectivity is restored. This architecture, while challenging to implement correctly (as evidenced by past calendar sync issues), is essentially an extreme form of multi-datacenter multi-leader replication where each device is treated as an unreliable "datacenter." Tools like CouchDB are designed to simplify this type of operation.
  • Collaborative editing
    • Real-time collaborative editing is essentially a form of multi-leader replication. Each user's local client acts as a leader, accepting their edits immediately. These changes are then asynchronously replicated to a central server and other concurrent editors. While locking can prevent conflicts (similar to single-leader replication), faster, lock-free collaboration necessitates handling the complexities of multi-leader replication, including the need for conflict resolution due to simultaneous edits from multiple "leaders."

        Handling Write Conflicts

  • The biggest problem with multi-leader replication is that write conflicts can occur, which means that conflict resolution is required.
        Synchronous versus asynchronous conflict detection
    • In multi-leader replication, conflicts are typically detected asynchronously, unlike single-leader where the second write either blocks or aborts. While synchronous conflict detection is possible in multi-leader, it negates the primary advantage of independent write acceptance.

        Conflict avoidance: 
    • The simplest way to handle conflicts is to prevent them by ensuring all writes for a specific record go through the same leader. Routing user requests to their "home" datacenter is an example, effectively making it a single-leader setup from the user's perspective. However, this breaks down when the designated leader needs to change.

        Converging toward a consistent state
    • In multi-leader setups with no defined write order, replicas must employ convergent conflict resolution to ensure all replicas eventually reach the same final data state despite conflicting updates.

        Various ways of achieving convergent conflict resolution: 
            
    • Last Write Wins (LWW): Choosing the write with the highest unique ID (often a timestamp), which can lead to data loss.
    • Replica ID Precedence: Prioritizing writes based on the originating replica's ID, also risking data loss.
    • Value Merging: Combining conflicting values (e.g., alphabetical concatenation).
    • Explicit Conflict Recording: Preserving all conflicting writes in a data structure for later resolution by application code or the user.
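The trade-off between the first and last strategies can be shown in a few lines; the `(value, timestamp)` pairs are illustrative:

```python
# Last-write-wins silently drops data; recording conflicts preserves it.
# Each write is a hypothetical (value, timestamp) pair.

def last_write_wins(writes):
    # Keep only the write with the highest timestamp; the rest are lost.
    return max(writes, key=lambda w: w[1])[0]

def record_conflicts(writes):
    # Preserve every concurrent value for later resolution.
    return sorted(w[0] for w in writes)

writes = [("B", 2), ("C", 3), ("A", 1)]
assert last_write_wins(writes) == "C"              # "A" and "B" are discarded
assert record_conflicts(writes) == ["A", "B", "C"] # everything is kept
```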
        Custom conflict resolution logic: 
    • Many multi-leader systems allow developers to implement application-specific conflict resolution logic, which can be executed either:
    • On write: When a conflict is detected in the replication log (often automated, without direct user interaction).
    • On read: When conflicting versions are retrieved, allowing the application (or user) to resolve them before writing back to the database.
    • Row-level conflict resolution: Conflict resolution in multi-leader replication typically happens at the level of individual rows or documents, not at the level of entire atomic transactions. Each write within a transaction is treated separately for conflict resolution.
        Automatic Conflict Resolution:
  • While custom conflict resolution can be complex and error-prone (as illustrated by the Amazon shopping cart example), researchers are exploring approaches like Conflict-Free Replicated Datatypes (CRDTs), mergeable persistent data structures, and Operational Transformation (used in collaborative editing tools) to resolve conflicts in a more sensible and automated way. Although these techniques are still relatively new in database implementations, their future integration promises to make multi-leader data synchronization significantly easier for applications.
        What is a conflict?
  • A conflict in a multi-leader replication system occurs when concurrent operations on different leaders lead to inconsistencies. While some conflicts are obvious (like two different writes to the same field), others can be more subtle and application-specific, such as two overlapping bookings for the same resource in a reservation system. 
    Multi-Leader Replication Topologies
  • A replication topology describes the communication paths along which writes are propagated from one node to another. If you have two leaders, there is only one plausible topology: leader 1 must send all of its writes to leader 2, and vice versa. With more than two leaders, various different topologies are possible.
  • All-to-all is the most common topology. Circular and star topologies have an inherent flaw: the failure of a single node can interrupt the flow of replication messages between other nodes until it is fixed.

Leaderless Replication

  • Leaderless replication is an alternative approach to single-leader and multi-leader replication where no single "leader" node dictates the order of writes. Instead, any replica can directly accept write requests from clients (either directly or via a coordinator). This design, popularized by Amazon's Dynamo and adopted by systems like Riak, Cassandra, and Voldemort (collectively known as Dynamo-style databases), fundamentally changes how the database handles writes and has significant implications for its usage.

    Writing to the Database When a Node Is Down

  • Leaderless replication handles node outages gracefully without failover: Unlike leader-based systems that might require a failover process when a replica is unavailable, leaderless systems can continue processing writes as long as a sufficient number of replicas are available.
  • Writes are sent to multiple replicas: Clients in a leaderless system typically send write requests to several replicas in parallel. The write is considered successful once a quorum (a predefined number) of replicas acknowledge it, even if some replicas are down.
  • Unavailable nodes miss writes, leading to stale reads: When a previously unavailable replica comes back online, it will be missing the writes that occurred during its downtime. Reading from such a node can result in stale or outdated data.
  • Reads are also sent to multiple replicas to detect staleness: To address the issue of stale reads, clients in leaderless systems often send read requests to multiple replicas in parallel. This allows them to receive potentially different versions of the data.
  • Version numbers are used to determine the latest data: To reconcile the potentially different responses from multiple replicas during a read, version numbers are employed to identify the most recent and correct version of the data.
        Read repair and anti-entropy
  • Read repair updates stale replicas during reads: When a client reads from multiple replicas and detects outdated data on some, it writes the latest version back to those stale replicas, ensuring eventual consistency for frequently read data.
  • Anti-entropy processes backfill missing data in the background: Some systems use a background process to periodically compare data across replicas and copy missing information, ensuring eventual consistency for data that might not be read frequently (though not all systems implement this, potentially impacting durability for rarely read data).
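Read repair can be sketched as follows, with plain dicts standing in for replicas that map each key to a `(value, version)` pair:

```python
# Hedged sketch of read repair: after reading from several replicas, the
# client writes the newest (value, version) back to any stale replica.

def read_with_repair(key, replicas):
    # Collect each replica's (value, version) response for the key.
    responses = [r.get(key, (None, 0)) for r in replicas]
    latest = max(responses, key=lambda vv: vv[1])
    # Repair any replica that returned an older version.
    for r in replicas:
        if r.get(key, (None, 0))[1] < latest[1]:
            r[key] = latest
    return latest[0]

fresh = {"x": ("new", 2)}
stale = {"x": ("old", 1)}
assert read_with_repair("x", [fresh, stale]) == "new"
assert stale["x"] == ("new", 2)   # the stale replica was repaired
```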
        Quorums for Reading and Writing
  • Quorums (w for writes, r for reads) ensure consistency and fault tolerance: By requiring a minimum number of successful acknowledgments (w) for writes and successful responses (r) for reads, where w + r > n (total replicas), the system aims to provide up-to-date reads despite some node failures.
  • Configurable quorums balance performance and resilience: The values of n, w, and r are tunable, allowing optimization for different workloads (e.g., faster reads with r=1, w=n at the cost of write availability). If fewer than w or r nodes respond successfully, the operation fails.
  • Quorums enable tolerance of unavailable nodes: The w < n condition allows writes to succeed with some down nodes, and r < n allows reads to succeed similarly. The specific values of w and r determine the number of simultaneous node failures the system can tolerate (e.g., n=3, w=2, r=2 tolerates one failure).
  • Read and write requests are sent to all replicas, but success depends on quorum response: Clients send requests to all n replicas in parallel, but the operation is considered successful only after receiving the required number of successful responses (w for writes, r for reads). The specific reason for a node's unavailability (crash, error, network issue) is less important than whether it returns a successful response within the timeout period.
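The reason w + r > n yields up-to-date reads is that every write quorum must then intersect every read quorum in at least one replica. For small n this can be checked exhaustively:

```python
# Verify the quorum-overlap property by brute force over all quorum choices.
import itertools

def always_overlaps(n, w, r):
    """True iff every write quorum of size w intersects every read quorum
    of size r among n replicas."""
    nodes = range(n)
    return all(set(ws) & set(rs)
               for ws in itertools.combinations(nodes, w)
               for rs in itertools.combinations(nodes, r))

# w + r > n guarantees overlap (e.g. n=3, w=2, r=2):
assert always_overlaps(3, 2, 2) is True
# w + r <= n permits disjoint quorums, so a read can miss the latest write:
assert always_overlaps(4, 2, 2) is False
```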

    Limitations of Quorum Consistency

  • Quorums (w + r > n) aim for read-after-write consistency but aren't absolute guarantees: While the overlap between read and write quorums generally increases the likelihood of reading the latest value, various edge cases (like sloppy quorums, concurrent writes, concurrent read/write, partial write failures, node recovery from stale data, and timing issues) can still lead to stale reads even when the quorum condition is met.
  • Smaller quorums (w + r ≤ n) prioritize availability and lower latency over strong consistency: Reducing the required number of successful read and write responses increases the system's tolerance to node failures and network issues, allowing operations to proceed with fewer available replicas. However, this significantly increases the probability of reading stale data.
  • Leaderless systems lack inherent read-your-writes and monotonic reads guarantees, making staleness monitoring crucial: Unlike leader-based replication with a defined write order and replication lag metrics, leaderless systems with eventual consistency make staleness harder to track. Monitoring the likelihood and extent of stale reads is essential for operational awareness, especially since read repair alone doesn't guarantee an upper bound on staleness for infrequently accessed data.

    Sloppy Quorums and Hinted Handoff

  • In a leaderless replication system, a sloppy quorum is a relaxed form of the traditional quorum. Instead of requiring a strict majority or a predefined number (w for writes, r for reads) of the designated n "home" nodes to be available and acknowledge an operation, a sloppy quorum allows the operation to succeed if the required number of any reachable nodes in the cluster respond.  
  • Hinted handoff is a mechanism often used in conjunction with sloppy quorums to ensure eventual consistency. When a write is accepted by a node that is not one of the designated "home" nodes for that data (due to a temporary failure or network partition), the accepting node stores a "hint." 
  • Leaderless databases with quorums offer high availability and low latency by tolerating individual node failures and slowness without failover. They can return results as soon as a quorum (w for writes, r for reads) of nodes responds.
  • Sloppy quorums and hinted handoff enhance availability during network partitions by accepting writes on any reachable w nodes (even outside the designated n "home" nodes) and later transferring these writes to the correct nodes once the network is restored. However, this weakens consistency guarantees, as reads might not see the latest writes until the handoff completes.
  • Leaderless replication is well-suited for multi-datacenter operation due to its tolerance for network issues and concurrent writes. Implementations vary: Cassandra and Voldemort extend their leaderless model across data-centers with configurable replica distribution and local quorum reads/writes, while Riak uses a more localized approach within each datacenter with asynchronous cross-datacenter replication.

    Detecting Concurrent Writes

  • Inherent Concurrency Leads to Conflicts:
    • Multiple Independent Writers: Unlike single-leader systems where writes are serialized, Dynamo-style databases allow any replica to accept writes. This inherently means multiple clients can be writing to the same data (same key) simultaneously, without awareness of each other.
    • Asynchronous Replication: Even with quorums, the replication between nodes is often asynchronous. A write acknowledged by the quorum might not have reached all other replicas immediately, increasing the window for concurrent modifications.
    • Read Repair and Hinted Handoff Introduce Conflict Potential: While these mechanisms ensure eventual consistency, they can also introduce conflicts. For example, read repair might reconcile different versions of data, and hinted handoff might deliver delayed writes that conflict with more recent ones.
  • The Problem of Divergent Replicas:
    • Different Order of Operations: Due to network latency variations and partial failures, the same sequence of write operations might arrive at different replicas in different orders.
    • Simple Overwriting Leads to Permanent Inconsistency: If each replica simply overwrites the value based on the last write it receives, the replicas will end up with different final values, violating the principle of eventual consistency (where all replicas should eventually agree on the same data).
  • Last Write Wins (LWW): A Simple but Flawed Solution:
    • Timestamp-Based Resolution: LWW resolves conflicts by assigning a timestamp to each write and choosing the write with the latest timestamp as the "winner." Older writes are discarded.
    • Achieves Eventual Convergence (Potentially): If all writes are eventually propagated, all replicas will eventually converge to the value of the write with the latest timestamp.
    • Significant Durability Risks: Silent Data Loss of Concurrent Writes: If multiple clients write concurrently, only the one with the "latest" timestamp (which might be arbitrary due to clock skew) will be preserved. The other successful writes are silently lost.
    • Loss of Non-Concurrent Writes Due to Clock Skew: Even non-concurrent writes can be lost if the clocks on different clients or servers are not perfectly synchronized. A write that logically happened later might have an earlier timestamp.
    • Limited Use Cases: LWW is generally only acceptable for scenarios where data loss is tolerable, such as caching or session management with short lifespans.
  • Tracking Causality with Version Numbers (Single Replica):
    • Server-Managed Versioning: The server assigns an incrementing version number to each write for a given key.
    • Client Must Read Before Write: Clients must first read the current value and its version number before attempting to write.
    • Write Includes Prior Version: When writing, the client sends the new value along with the version number it last saw.
    • Server Determines Overwrite vs. Concurrency:
      • When a write includes the version number from the client's prior read, the server can safely overwrite all values with that version number or below, since the client's write supersedes them.
      • Any values with a higher version number were written concurrently (the client could not have seen them), so the server keeps them alongside the new value as "siblings."
    • Client-Side Merging Required: When a client reads a key with multiple sibling values, it needs to implement logic to merge these concurrent versions into a single, coherent value.
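The single-replica scheme above can be sketched as follows; the class and method names are assumptions for illustration:

```python
# Server-side version tracking on a single replica: a write based on the
# version the client last read overwrites what it saw; anything written
# concurrently is kept as a sibling.

class VersionedKey:
    def __init__(self):
        self.version = 0
        self.siblings = []   # list of (value, version) not yet superseded

    def read(self):
        return self.version, [v for v, _ in self.siblings]

    def write(self, value, based_on):
        # 'based_on' is the version the client saw when it last read.
        self.version += 1
        # Drop values the client's write supersedes; keep values it could
        # not have known about (those with a higher version: concurrent).
        self.siblings = [(v, ver) for v, ver in self.siblings if ver > based_on]
        self.siblings.append((value, self.version))
        return self.version

k = VersionedKey()
k.write("milk", based_on=0)            # first write, no siblings
v2 = k.write("eggs", based_on=0)       # concurrent with "milk"
assert sorted(k.read()[1]) == ["eggs", "milk"]   # two siblings to merge
k.write("milk+eggs", based_on=v2)      # merged write supersedes both
assert k.read()[1] == ["milk+eggs"]
```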
  • Version Vectors (Multi-Replica Causality Tracking):
    • Version per Replica: In a distributed setting with multiple replicas, a single version number is insufficient. Version vectors are used, where each replica maintains its own version number for each key and also tracks the versions it has seen from other replicas.
    • Capturing the History of Updates: A version vector essentially represents the history of updates to a key across all replicas.
    • Distinguishing Overwrites from Concurrency: By comparing version vectors, replicas can determine if one write causally precedes another (an overwrite) or if they are concurrent (leading to siblings).
    • Client-Side Merging Remains Necessary: Similar to the single-replica case with version numbers, clients might still need to merge sibling values when reading.
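Comparing two version vectors (represented here as plain dicts mapping replica ID to a per-replica counter, an illustrative encoding) distinguishes a causal overwrite from a concurrent write:

```python
# A causally precedes B if every counter in A is <= the corresponding
# counter in B, and the vectors differ somewhere.

def happens_before(a, b):
    replicas = set(a) | set(b)
    return (all(a.get(r, 0) <= b.get(r, 0) for r in replicas)
            and a != b)

def concurrent(a, b):
    # Neither write knew about the other.
    return not happens_before(a, b) and not happens_before(b, a)

v1 = {"r1": 1, "r2": 0}
v2 = {"r1": 2, "r2": 0}   # r1 built on v1: a safe overwrite
v3 = {"r1": 1, "r2": 1}   # r2 wrote without having seen v2
assert happens_before(v1, v2) is True
assert concurrent(v2, v3) is True   # siblings: both values must be kept
```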
  • The Concept of "Happens-Before" and Concurrency:
    • Happens-Before Relationship: Operation A happens before operation B if B knows about A, depends on A's result, or builds upon A. This implies a causal relationship.
    • Concurrency Defined by Lack of Causality: Two operations are considered concurrent if neither happens before the other – neither operation had knowledge of the other when it started.
    • Importance for Conflict Resolution: If B happens after A, B should overwrite A. If A and B are concurrent, there's a conflict that needs resolution.
  • Merging Siblings: The Application's Role:
    • Client Responsibility: When concurrent writes result in sibling values, the burden of merging these values often falls on the application logic on the client side.
    • Simple Merges (e.g., Union for Sets): For some data types, simple merging strategies like taking the union of elements (as in a shopping cart example) might be sufficient.
    • Complex Merges and Tombstones: For more complex scenarios (like allowing removals from a set), simple union might lead to incorrect results. Tombstones (special markers indicating a deletion) are often used to handle removals correctly during merging.
    • Error-Prone and Complex: Implementing correct and robust merging logic in application code can be challenging and error-prone, as highlighted by the Amazon shopping cart issue.
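A sketch of merging two sibling shopping carts with tombstones; representing tombstones as a separate set of deleted items is an assumption for illustration:

```python
# Merge siblings as a union of items, minus anything either sibling has
# explicitly deleted (tombstoned). Without tombstones, a plain union would
# resurrect removed items.

def merge_siblings(a, b):
    # Each sibling: {"items": set of items, "tombstones": set of deletions}
    tombstones = a["tombstones"] | b["tombstones"]
    items = (a["items"] | b["items"]) - tombstones
    return {"items": items, "tombstones": tombstones}

s1 = {"items": {"milk", "eggs"}, "tombstones": set()}
s2 = {"items": {"milk", "flour"}, "tombstones": {"eggs"}}  # eggs was removed
merged = merge_siblings(s1, s2)
assert merged["items"] == {"milk", "flour"}   # union alone would resurrect eggs
```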
  • Automatic Conflict Resolution as a Future Direction:
    • CRDTs (Conflict-Free Replicated Datatypes): These are specialized data structures (like counters, sets, maps, ordered lists) designed to be concurrently edited without explicit conflict resolution. They have built-in logic to automatically merge concurrent updates in a predictable and sensible way.
    • Mergeable Persistent Data Structures: These track history explicitly (like Git) and use three-way merges to reconcile concurrent changes.
    • Operational Transformation: Used in collaborative editing, this algorithm transforms operations based on concurrently executed operations to maintain consistency in ordered lists (like characters in a document).
    • Potential to Simplify Development: The integration of automatic conflict resolution techniques into databases promises to significantly ease the complexity of handling multi-leader and leaderless replication for application developers.
