Wednesday, April 16, 2025

Leetcode 2179. Count Good Triplets in an Array

2179. Count Good Triplets in an Array

You are given two 0-indexed arrays nums1 and nums2, both of length n and both permutations of [0, 1, ..., n - 1]. A good triplet is a set of 3 distinct values which are present in increasing order by position in both nums1 and nums2. In other words, if we consider pos1_v as the index of the value v in nums1 and pos2_v as the index of the value v in nums2, then a good triplet is a set (x, y, z), where 0 <= x, y, z <= n - 1, such that pos1_x < pos1_y < pos1_z and pos2_x < pos2_y < pos2_z.

Return the total number of good triplets.

Example 1:

Input: nums1 = [2,0,1,3], nums2 = [0,1,2,3]

Output: 1

Explanation: 

There are 4 triplets (x, y, z) such that pos1_x < pos1_y < pos1_z. They are (2,0,1), (2,0,3), (2,1,3), and (0,1,3).

Out of those triplets, only the triplet (0,1,3) satisfies pos2_x < pos2_y < pos2_z. Hence, there is only 1 good triplet.

Example 2:

Input: nums1 = [4,0,1,3,2], nums2 = [4,1,0,2,3]

Output: 4

Explanation: The 4 good triplets are (4,0,3), (4,0,2), (4,1,3), and (4,1,2).


  • There are multiple approaches one could take to solve this problem. Let us explore them in order of difficulty:

        1. Brute Force:

  • We can generate all the increasing-position triplets from both lists and then count the triplets common to both, i.e., the value sets satisfying both pos1_x < pos1_y < pos1_z and pos2_x < pos2_y < pos2_z. An equivalent direct check is sketched below.
  • The TC of this solution is O(n^3), and the SC is also O(n^3) if we materialize all the generated triplets (the direct triple-loop check below needs only O(n) extra space).
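
  • A minimal illustrative sketch of that direct O(n^3) check (the helper name good_triplets_bruteforce is mine, not LeetCode's): map each value of nums1 to its index in nums2, then test every index triple.

from typing import List

def good_triplets_bruteforce(nums1: List[int], nums2: List[int]) -> int:
    n = len(nums1)
    # arr[i] = position in nums2 of the value sitting at nums1[i]
    pos_in_nums2 = {v: i for i, v in enumerate(nums2)}
    arr = [pos_in_nums2[v] for v in nums1]
    count = 0
    # A triplet is good iff its positions increase in both arrays,
    # i.e. i < j < k (positions in nums1) and arr[i] < arr[j] < arr[k]
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(j + 1, n):
                if arr[i] < arr[j] < arr[k]:
                    count += 1
    return count

# good_triplets_bruteforce([2,0,1,3], [0,1,2,3])     -> 1
# good_triplets_bruteforce([4,0,1,3,2], [4,1,0,2,3]) -> 4
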
        2. Hash Map + Binary Search (sorted list)

  • We need to follow these steps -
    1. Build a position map for nums2 so we can easily look up where each value from nums1 appears in nums2.
    2. Loop through each element nums1[i], and for each:
      1. Find its position in nums2
      2. Count how many elements we’ve already seen (before i) that appear before it in nums2.
        1. This is done using bisect.bisect() on a sorted list of seen positions (seen_positions)
      3. Count how many elements will come after i in nums1 that will also come after it in nums2.
        1. This ensures the triplet will be increasing in both arrays
      4. Multiply count_before * count_after to get the number of triplets where nums1[i] is in the middle.
    3. Insert the current position (from nums2) into seen_positions for the next round.




from typing import List
import bisect

class Solution:
    def goodTriplets(self, nums1: List[int], nums2: List[int]) -> int:
        n = len(nums1)
        res = 0
        
        # Map each value in nums2 to its index
        index_in_nums2 = [0] * n
        for i in range(n):
            index_in_nums2[nums2[i]] = i
        
        seen_positions = []  # Holds the positions (in nums2) of elements seen so far in nums1
        
        for i in range(n):
            # Find the corresponding index of nums1[i] in nums2
            pos_in_nums2 = index_in_nums2[nums1[i]]
            
            # Find how many seen elements are less than this position (binary search)
            count_before = bisect.bisect(seen_positions, pos_in_nums2)
            
            # Insert current position in sorted order
            bisect.insort(seen_positions, pos_in_nums2)
            
            # Elements that come after index i in nums1 AND after pos_in_nums2 in nums2:
            # total values with a larger nums2-position, minus the ones already seen to the left
            count_after = (n - 1 - pos_in_nums2) - (i - count_before)
            
            # Multiply the number of valid 'before' and 'after' elements
            res += count_before * count_after
        
        return res

TC is O(n^2) in the worst case - each bisect lookup is O(log n), but bisect.insort inserts into a Python list, which costs O(n) per insertion. (The Fenwick tree approach below brings this down to O(n log n).)
SC is O(n)

        3. Best approach is using a Fenwick Tree (Binary Indexed Tree)

    1. Goal
        Count triplets of values that appear in increasing order of position in both nums1 and nums2
        (the same three values must occur earlier-to-later in both arrays).

    2. Key Idea
        Map the problem to:
        Count increasing triplets in a transformed array
        (i.e., after mapping nums1's values to their positions in nums2)
        arr[i] = position of nums1[i] in nums2

        Then, count index triplets (i, j, k) where:

        i < j < k and arr[i] < arr[j] < arr[k]

  • We use a Fenwick Tree (Binary Indexed Tree) to efficiently:
    • Count how many numbers less than arr[i] have appeared so far (to the left)
    • Deduce how many numbers greater than arr[i] will appear later (to the right)
  • For each element at index i in arr:
    • left = count of numbers < arr[i] before i → use tree.query(arr[i] - 1)
    • right = count of numbers > arr[i] after i
    • → total numbers greater than arr[i] = n - 1 - arr[i]
    • → of those, already seen to the left of i = i - left
    • → so right = (n - 1 - arr[i]) - (i - left)
    • Total triplets contributed by arr[i] as the middle element: res += left * right



from typing import List

class FenwickTree:
    def __init__(self, size):
        self.tree = [0] * (size + 1)

    def update(self, index, delta):
        index += 1  # 1-based indexing
        while index < len(self.tree):
            self.tree[index] += delta
            index += index & -index

    def query(self, index):
        index += 1  # 1-based indexing
        res = 0
        while index > 0:
            res += self.tree[index]
            index -= index & -index
        return res

class Solution:
    def goodTriplets(self, nums1: List[int], nums2: List[int]) -> int:
        n = len(nums1)

        # Map each number to its index in nums2
        pos_in_nums2 = [0] * n
        for i, val in enumerate(nums2):
            pos_in_nums2[val] = i

        # Create array where nums1[i] is mapped to its position in nums2
        arr = [0] * n
        for i, val in enumerate(nums1):
            arr[i] = pos_in_nums2[val]

        tree = FenwickTree(n)
        res = 0

        for i, val in enumerate(arr):
            left = tree.query(val - 1)  # Count of elements < val so far
            right = (n - 1 - val) - (i - left)  # Remaining elements > val
            res += left * right
            tree.update(val, 1)  # Mark val as seen

        return res
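
  • A quick sanity check against the two examples above (assuming the classes are defined as in this post):

# Expected outputs: 1 and 4, matching Example 1 and Example 2
print(Solution().goodTriplets([2,0,1,3], [0,1,2,3]))        # 1
print(Solution().goodTriplets([4,0,1,3,2], [4,1,0,2,3]))    # 4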

Tuesday, April 15, 2025

Replication

 Replication

  • Replication means keeping a copy of the same data on multiple machines that are connected via a network.
  • There are three popular variants of the replication algorithms - 
    • Single-leader
    • Multi-leader
    • Leaderless replication

     Leader-based replication (also known as active/passive or master-slave replication)

  • One of the replicas is designated as the leader. Clients must send their write requests to the leader.
  • The other replicas are known as followers. Whenever the leader writes data to its local storage, it sends the update to all of its followers as part of a replication log or change stream.
  • When a client wants to read from the database, it can query either the leader or any of the followers.

        Synchronous Versus Asynchronous Replication

  • Advantage of synchronous replication is that the follower is guaranteed to have up to date copy of the data that is consistent with the leader. But the major disadvantage is the leader has to block all the writes and wait until the synchronous replica is available again.
  • In practice, if synchronous replication is enabled on a database, then one of the followers is synchronous and the rest of the followers are asynchronous.
  • Although asynchronous leader-based replication doesn't guarantee write durability, it has the advantage that the leader can continue processing writes even if all of its followers have fallen behind. It is widely used despite this disadvantage.
  • Note: It can be a serious problem for systems to lose data if the leader fails in asynchronous replication. So there has been great success with chain replication (a variant of synchronous replication), which has been implemented in a few systems such as Microsoft Azure Storage.

        Setting Up New Followers

  • There is often a need to set up new followers, either to increase the number of replicas or to replace failed nodes.
  • Steps:
    • Take a consistent snapshot of the leader's database at some point in time.
    • Copy the snapshot to the new follower node.
    • The follower connects to the leader and requests all data changes that have happened since the snapshot was taken.
    • Follower catches up to the backlog of data changes.

        Handling Node Outages

  • Each follower keeps a log of the data changes it has received from the leader. If the follower crashes or experiences a temporary interruption, it requests the data changes that occurred while it was disconnected.
  • In case of leader failure, one of the followers needs to be promoted to be the new leader. An automatic failover process consists of the following steps -
    • 1. Determining that the leader has failed. Mostly done using timeouts, typically set to around 30 seconds.
    • 2. Choosing a new leader. Usually done through an election process.
    • 3. Reconfiguring the system to use the new leader. Clients now need to send their write requests to the new leader.

        Implementation of Replication Logs

  • There are several different replication methods used in practice.
        i. Statement-based replication
  • Leader logs every write request that it executes and sends that statement log to its followers.
  • For a relational database, every INSERT, UPDATE or DELETE is forwarded to followers and each follower parses and executes that SQL statement
  • This approach can break down if the statement uses a nondeterministic function such as NOW() or RAND(), since each replica may execute it with a different result.
        ii. Write-Ahead log (WAL) shipping
  • The log is an append-only sequence of bytes containing all writes to the database. We can use the exact same log to build a replica on another node: besides writing the log to disk, the leader also sends it across the network to its followers.
  • Because the WAL is tied closely to the storage engine's on-disk format, replication can require downtime for upgrades if the leader and followers run incompatible versions.

        iii. Logical (row-based) log replication
  • If a different log format is used for replication than for the storage engine's internal log, the replication log is decoupled from the storage engine internals. Such a replication log is called a logical log.
        iv. Trigger-based replication
  • A trigger lets us register custom application code that is automatically executed when the data change (write transaction) occurs in a DB system.

        Problems with Replication Lag

  • Leader-based replication requires all writes to go through a single node, but read-only queries can go to any replica.
  • This is useful in web applications, where the workload typically consists of mostly reads and only a small percentage of writes. However, this read-scaling approach only works realistically with asynchronous replication.
  • However reading from asynchronous followers can lead to inconsistent read results from the follower and leader. This inconsistency is just a temporary state - eventual consistency (if all writes are stopped then the follower will eventually catch up to leader).
        Reading Your Own Writes
  • With asynchronous replication there is a problem: if the user reads from a follower that hasn't yet received their write, it looks to the user who made the write as though the data they submitted was lost.
  • In this case we need read-after-write consistency, also known as read-your-writes consistency. This is a guarantee that if the user reloads the page, they will always see any updates they submitted themselves.
  • Steps to implement read-after-write consistency (a minimal sketch follows this list) -
    • When reading something the user may have modified, read it from the leader; otherwise, read it from a follower.
    • This approach doesn't scale if most things in the application are potentially editable by the user. So we might track the time of last update and, for one minute after the last update, make all reads from the leader.
    • The client can remember the timestamp of its most recent write - then the system can ensure that the replica serving any reads for that user reflects updates at least until that timestamp (logical timestamp).
        Monotonic Reads
  • When you read data, you may see an old value; monotonic reads only means that if one user makes several reads in sequence, they will not see time go backward i.e, they will not read older data after having previously read newer data.
  • It's a lesser guarantee than strong consistency, but a stronger guarantee than eventual consistency.
        Consistent Prefix Reads
  • If some partitions are replicated slower than others, an observer may see the answer before they see the question.
  • Consistent prefix reads guarantee says that if a sequence of writes happen in a certain order, then anyone reading those writes will see them appear in the same order.
  • One solution is to make sure that any writes that are causally related to each other are written to the same partition.
        Solutions for Replication Lag
  • In an eventually consistent system, it is worth thinking about how the application behaves if the replication lag increases to several minutes or even hours. If that would cause a problem, the system needs to provide a stronger guarantee such as read-after-write consistency.
  • It would be better if application developers didn't have to worry about subtle replication issues and could just trust their databases. This is why transactions exist - they let the database provide stronger guarantees.
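
  • A minimal, in-memory sketch of the "remember the timestamp of your most recent write" idea from the Reading Your Own Writes steps above (Replica, read_your_writes and the storage dict are hypothetical names for illustration):

from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    applied_up_to: int  # logical timestamp of the last write this replica has applied

def read_your_writes(key, client_last_write_ts, leader, follower, storage):
    # If the follower hasn't yet applied the client's most recent write,
    # route the read to the leader; otherwise the follower is safe to read from.
    target = follower if follower.applied_up_to >= client_last_write_ts else leader
    return storage[target.name][key]

# The client wrote at logical timestamp 42, but the follower has only applied up to 40.
leader = Replica("leader", applied_up_to=42)
follower = Replica("follower", applied_up_to=40)
storage = {"leader": {"bio": "updated"}, "follower": {"bio": "stale"}}
print(read_your_writes("bio", 42, leader, follower, storage))  # "updated" - served by the leader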

    Multi-Leader Replication

  • Allows more than one node to accept writes. Replication still happens in the same way: each node that processes a write must forward that data change to all other nodes.

        Use Cases for Multi-Leader Replication

  • Multi-datacenter operation
    • Multi-leader configuration allows us to have a leader in each datacenter.
    • Multi-leader has better performance compared to single-leader configuration as the writes can be processed locally in the datacenter and replicated to other replicas asynchronously.
    • In multi-leader each datacenter can continue operating independently of the others and replication catches up when the failed datacenter comes back online.
    • A multi-leader configuration with asynchronous replication can usually tolerate network problems better
  • Clients with offline operation
    • Multi-leader replication is suitable for applications that need to remain fully functional (read and write) even when disconnected from the internet. The example of calendar apps on various devices illustrates this: each device acts as a local leader, accepting changes offline, and these changes are asynchronously synced with other devices and a server when connectivity is restored. This architecture, while challenging to implement correctly (as evidenced by past calendar sync issues), is essentially an extreme form of multi-datacenter multi-leader replication where each device is treated as an unreliable "datacenter." Tools like CouchDB are designed to simplify this type of operation.
  • Collaborative editing
    • Real-time collaborative editing is essentially a form of multi-leader replication. Each user's local client acts as a leader, accepting their edits immediately. These changes are then asynchronously replicated to a central server and other concurrent editors. While locking can prevent conflicts (similar to single-leader replication), faster, lock-free collaboration necessitates handling the complexities of multi-leader replication, including the need for conflict resolution due to simultaneous edits from multiple "leaders."

        Handling Write Conflicts

  • The biggest problem with multi-leader replication is that write conflicts can occur, which means that conflict resolution is required.
        Synchronous versus asynchronous conflict detection
    • In multi-leader replication, conflicts are typically detected asynchronously, unlike single-leader where the second write either blocks or aborts. While synchronous conflict detection is possible in multi-leader, it negates the primary advantage of independent write acceptance.

        Conflict avoidance: 
    • The simplest way to handle conflicts is to prevent them by ensuring all writes for a specific record go through the same leader. Routing user requests to their "home" datacenter is an example, effectively making it a single-leader setup from the user's perspective. However, this breaks down when the designated leader needs to change.

        Converging toward a consistent state
    • In multi-leader setups with no defined write order, replicas must employ convergent conflict resolution to ensure all replicas eventually reach the same final data state despite conflicting updates.

        Various ways of achieving convergent conflict resolution: 
            
    • Last Write Wins (LWW): Choosing the write with the highest unique ID (often a timestamp), which can lead to data loss.
    • Replica ID Precedence: Prioritizing writes based on the originating replica's ID, also risking data loss. (A minimal sketch of these two rules follows this list.)
    • Value Merging: Combining conflicting values (e.g., alphabetical concatenation).
    • Explicit Conflict Recording: Preserving all conflicting writes in a data structure for later resolution by application code or the user.
        Custom conflict resolution logic: 
    • Many multi-leader systems allow developers to implement application-specific conflict resolution logic, which can be executed either:
    • On write: When a conflict is detected in the replication log (often automated, without direct user interaction).
    • On read: When conflicting versions are retrieved, allowing the application (or user) to resolve them before writing back to the database.
    • Row-level conflict resolution: Conflict resolution in multi-leader replication typically happens at the level of individual rows or documents, not at the level of entire atomic transactions. Each write within a transaction is treated separately for conflict resolution.
        Automatic Conflict Resolution:
  • While custom conflict resolution can be complex and error-prone (as illustrated by the Amazon shopping cart example), researchers are exploring approaches like Conflict-Free Replicated Datatypes (CRDTs), mergeable persistent data structures, and Operational Transformation (used in collaborative editing tools) to resolve conflicts in a more sensible and automated way. Although these techniques are still relatively new in database implementations, their future integration promises to make multi-leader data synchronization significantly easier for applications.
        What is conflict?
  • A conflict in a multi-leader replication system occurs when concurrent operations on different leaders lead to inconsistencies. While some conflicts are obvious (like two different writes to the same field), others can be more subtle and application-specific, such as two overlapping bookings for the same resource in a reservation system. 
    Multi-Leader Replication Topologies
  • A replication topology describes the communication paths along which writes are propagated from one node to another. If you have two leaders, there is only one plausible topology: leader 1 must send all of its writes to leader 2, and vice versa. With more than two leaders, various different topologies are possible.
  • All-to-all is the most common type of topology. Circular and star topologies have inherent flaws: a single node failure can interrupt the flow of replication messages between other nodes.
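
  • A minimal sketch of the first two convergence rules listed above - last write wins with replica ID as a tiebreaker (resolve_lww and the sample writes are illustrative; every discarded write is silently lost, which is exactly LWW's weakness):

def resolve_lww(writes):
    # writes: list of (timestamp, replica_id, value)
    # Pick the write with the highest timestamp; break ties by replica ID.
    # Every other write is discarded - this is where LWW can lose data.
    return max(writes, key=lambda w: (w[0], w[1]))[2]

conflicting = [
    (1700000001, "replica-a", "title = B"),
    (1700000001, "replica-c", "title = C"),   # same timestamp, higher replica ID wins
    (1699999999, "replica-b", "title = A"),
]
print(resolve_lww(conflicting))  # "title = C"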

Leaderless Replication

  • Leaderless replication is an alternative approach to single-leader and multi-leader replication where no single "leader" node dictates the order of writes. Instead, any replica can directly accept write requests from clients (either directly or via a coordinator). This design, popularized by Amazon's Dynamo and adopted by systems like Riak, Cassandra, and Voldemort (collectively known as Dynamo-style databases), fundamentally changes how the database handles writes and has significant implications for its usage.

    Writing to the Database When a Node Is Down

  • Leaderless replication handles node outages gracefully without failover: Unlike leader-based systems that might require a failover process when a replica is unavailable, leaderless systems can continue processing writes as long as a sufficient number of replicas are available.
  • Writes are sent to multiple replicas: Clients in a leaderless system typically send write requests to several replicas in parallel. The write is considered successful once a quorum (a predefined number) of replicas acknowledge it, even if some replicas are down.
  • Unavailable nodes miss writes, leading to stale reads: When a previously unavailable replica comes back online, it will be missing the writes that occurred during its downtime. Reading from such a node can result in stale or outdated data.
  • Reads are also sent to multiple replicas to detect staleness: To address the issue of stale reads, clients in leaderless systems often send read requests to multiple replicas in parallel. This allows them to receive potentially different versions of the data.
  • Version numbers are used to determine the latest data: To reconcile the potentially different responses from multiple replicas during a read, version numbers are employed to identify the most recent and correct version of the data.
        Read repair and anti-entropy
  • Read repair updates stale replicas during reads: When a client reads from multiple replicas and detects outdated data on some, it writes the latest version back to those stale replicas, ensuring eventual consistency for frequently read data.
  • Anti-entropy processes backfill missing data in the background: Some systems use a background process to periodically compare data across replicas and copy missing information, ensuring eventual consistency for data that might not be read frequently (though not all systems implement this, potentially impacting durability for rarely read data).
        Quorums for Reading and Writing
  • Quorums (w for writes, r for reads) ensure consistency and fault tolerance: By requiring a minimum number of successful acknowledgments (w) for writes and successful responses (r) for reads, where w + r > n (total replicas), the system aims to provide up-to-date reads despite some node failures.
  • Configurable quorums balance performance and resilience: The values of n, w, and r are tunable, allowing optimization for different workloads (e.g., faster reads with r=1, w=n at the cost of write availability). If fewer than w or r nodes respond successfully, the operation fails.
  • Quorums enable tolerance of unavailable nodes: The w < n condition allows writes to succeed with some down nodes, and r < n allows reads to succeed similarly. The specific values of w and r determine the number of simultaneous node failures the system can tolerate (e.g., n=3, w=2, r=2 tolerates one failure). This arithmetic is sketched after this list.
  • Read and write requests are sent to all replicas, but success depends on quorum response: Clients send requests to all n replicas in parallel, but the operation is considered successful only after receiving the required number of successful responses (w for writes, r for reads). The specific reason for a node's unavailability (crash, error, network issue) is less important than whether it returns a successful response within the timeout period.
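
  • A small sketch of that quorum arithmetic (the helper names are mine): w + r > n guarantees that any read quorum and any write quorum share at least one replica, and n - max(w, r) is the number of simultaneously unavailable replicas the configuration tolerates.

def quorum_overlaps(n, w, r):
    # Any write quorum of size w and any read quorum of size r must share at
    # least one replica whenever w + r > n (pigeonhole over the n replicas).
    return w + r > n

def tolerated_failures(n, w, r):
    # A write still needs w nodes and a read still needs r nodes, so at most
    # n - max(w, r) replicas can be unavailable.
    return n - max(w, r)

for n, w, r in [(3, 2, 2), (5, 3, 3), (3, 1, 1)]:
    print(f"n={n}, w={w}, r={r}: overlap guaranteed={quorum_overlaps(n, w, r)}, "
          f"tolerates {tolerated_failures(n, w, r)} node(s) down")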

    Limitations of Quorum Consistency

  • Quorums (w + r > n) aim for read-after-write consistency but aren't absolute guarantees: While the overlap between read and write quorums generally increases the likelihood of reading the latest value, various edge cases (like sloppy quorums, concurrent writes, concurrent read/write, partial write failures, node recovery from stale data, and timing issues) can still lead to stale reads even when the quorum condition is met.
  • Smaller quorums (w + r ≤ n) prioritize availability and lower latency over strong consistency: Reducing the required number of successful read and write responses increases the system's tolerance to node failures and network issues, allowing operations to proceed with fewer available replicas. However, this significantly increases the probability of reading stale data.
  • Leaderless systems lack inherent read-your-writes and monotonic reads guarantees, making staleness monitoring crucial: Unlike leader-based replication with a defined write order and replication lag metrics, leaderless systems with eventual consistency make staleness harder to track. Monitoring the likelihood and extent of stale reads is essential for operational awareness, especially since read repair alone doesn't guarantee an upper bound on staleness for infrequently accessed data.

    Sloppy Quorums and Hinted Handoff

  • In a leaderless replication system, a sloppy quorum is a relaxed form of the traditional quorum. Instead of requiring a strict majority or a predefined number (w for writes, r for reads) of the designated n "home" nodes to be available and acknowledge an operation, a sloppy quorum allows the operation to succeed if the required number of any reachable nodes in the cluster respond.  
  • Hinted handoff is a mechanism often used in conjunction with sloppy quorums to ensure eventual consistency. When a write is accepted by a node that is not one of the designated "home" nodes for that data (due to a temporary failure or network partition), the accepting node stores a "hint." 
  • Leaderless databases with quorums offer high availability and low latency by tolerating individual node failures and slowness without failover. They can return results as soon as a quorum (w for writes, r for reads) of nodes responds.
  • Sloppy quorums and hinted handoff enhance availability during network partitions by accepting writes on any reachable w nodes (even outside the designated n "home" nodes) and later transferring these writes to the correct nodes once the network is restored. However, this weakens consistency guarantees, as reads might not see the latest writes until the handoff completes.
  • Leaderless replication is well-suited for multi-datacenter operation due to its tolerance for network issues and concurrent writes. Implementations vary: Cassandra and Voldemort extend their leaderless model across data-centers with configurable replica distribution and local quorum reads/writes, while Riak uses a more localized approach within each datacenter with asynchronous cross-datacenter replication.

    Detecting Concurrent Writes

  • Inherent Concurrency Leads to Conflicts:
    • Multiple Independent Writers: Unlike single-leader systems where writes are serialized, Dynamo-style databases allow any replica to accept writes. This inherently means multiple clients can be writing to the same data (same key) simultaneously, without awareness of each other.
    • Asynchronous Replication: Even with quorums, the replication between nodes is often asynchronous. A write acknowledged by the quorum might not have reached all other replicas immediately, increasing the window for concurrent modifications.
    • Read Repair and Hinted Handoff Introduce Conflict Potential: While these mechanisms ensure eventual consistency, they can also introduce conflicts. For example, read repair might reconcile different versions of data, and hinted handoff might deliver delayed writes that conflict with more recent ones.
  • The Problem of Divergent Replicas:
    • Different Order of Operations: Due to network latency variations and partial failures, the same sequence of write operations might arrive at different replicas in different orders.
    • Simple Overwriting Leads to Permanent Inconsistency: If each replica simply overwrites the value based on the last write it receives, the replicas will end up with different final values, violating the principle of eventual consistency (where all replicas should eventually agree on the same data).
  • Last Write Wins (LWW): A Simple but Flawed Solution:
    • Timestamp-Based Resolution: LWW resolves conflicts by assigning a timestamp to each write and choosing the write with the latest timestamp as the "winner." Older writes are discarded.
    • Achieves Eventual Convergence (Potentially): If all writes are eventually propagated, all replicas will eventually converge to the value of the write with the latest timestamp.
    • Significant Durability Risks: Silent Data Loss of Concurrent Writes: If multiple clients write concurrently, only the one with the "latest" timestamp (which might be arbitrary due to clock skew) will be preserved. The other successful writes are silently lost.
    • Loss of Non-Concurrent Writes Due to Clock Skew: Even non-concurrent writes can be lost if the clocks on different clients or servers are not perfectly synchronized. A write that logically happened later might have an earlier timestamp.
    • Limited Use Cases: LWW is generally only acceptable for scenarios where data loss is tolerable, such as caching or session management with short lifespans.
  • Tracking Causality with Version Numbers (Single Replica):
    • Server-Managed Versioning: The server assigns an incrementing version number to each write for a given key.
    • Client Must Read Before Write: Clients must first read the current value and its version number before attempting to write.
    • Write Includes Prior Version: When writing, the client sends the new value along with the version number it last saw.
    • Server Determines Overwrite vs. Concurrency:
      • When a write arrives with the version number from the client's prior read, the server can overwrite all values at or below that version number, since the client has already seen (and merged) them.
      • Any values with a higher version number are concurrent with the incoming write, so the server keeps them alongside the new value as "siblings."
    • Client-Side Merging Required: When a client reads a key with multiple sibling values, it needs to implement logic to merge these concurrent versions into a single, coherent value.
  • Version Vectors (Multi-Replica Causality Tracking):
    • Version per Replica: In a distributed setting with multiple replicas, a single version number is insufficient. Version vectors are used, where each replica maintains its own version number for each key and also tracks the versions it has seen from other replicas.
    • Capturing the History of Updates: A version vector essentially represents the history of updates to a key across all replicas.
    • Distinguishing Overwrites from Concurrency: By comparing version vectors, replicas can determine if one write causally precedes another (an overwrite) or if they are concurrent (leading to siblings). A comparison sketch appears after this list.
    • Client-Side Merging Remains Necessary: Similar to the single-replica case with version numbers, clients might still need to merge sibling values when reading.
  • The Concept of "Happens-Before" and Concurrency:
    • Happens-Before Relationship: Operation A happens before operation B if B knows about A, depends on A's result, or builds upon A. This implies a causal relationship.
    • Concurrency Defined by Lack of Causality: Two operations are considered concurrent if neither happens before the other – neither operation had knowledge of the other when it started.
    • Importance for Conflict Resolution: If B happens after A, B should overwrite A. If A and B are concurrent, there's a conflict that needs resolution.
  • Merging Siblings: The Application's Role:
    • Client Responsibility: When concurrent writes result in sibling values, the burden of merging these values often falls on the application logic on the client side.
    • Simple Merges (e.g., Union for Sets): For some data types, simple merging strategies like taking the union of elements (as in a shopping cart example) might be sufficient.
    • Complex Merges and Tombstones: For more complex scenarios (like allowing removals from a set), simple union might lead to incorrect results. Tombstones (special markers indicating a deletion) are often used to handle removals correctly during merging.
    • Error-Prone and Complex: Implementing correct and robust merging logic in application code can be challenging and error-prone, as highlighted by the Amazon shopping cart issue.
  • Automatic Conflict Resolution as a Future Direction:
    • CRDTs (Conflict-Free Replicated Datatypes): These are specialized data structures (like counters, sets, maps, ordered lists) designed to be concurrently edited without explicit conflict resolution. They have built-in logic to automatically merge concurrent updates in a predictable and sensible way.
    • Mergeable Persistent Data Structures: These track history explicitly (like Git) and use three-way merges to reconcile concurrent changes.
    • Operational Transformation: Used in collaborative editing, this algorithm transforms operations based on concurrently executed operations to maintain consistency in ordered lists (like characters in a document).
    • Potential to Simplify Development: The integration of automatic conflict resolution techniques into databases promises to significantly ease the complexity of handling multi-leader and leaderless replication for application developers.
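
  • A minimal sketch of comparing version vectors to distinguish overwrites from concurrency (referenced from the "Distinguishing Overwrites from Concurrency" bullet above; the dict-of-replica-counters encoding is one common representation, not any specific database's format):

def dominates(a, b):
    # a "happened after or equals" b if a has seen at least as many updates
    # from every replica as b has.
    return all(a.get(replica, 0) >= counter for replica, counter in b.items())

def compare(a, b):
    if dominates(a, b) and dominates(b, a):
        return "equal"
    if dominates(a, b):
        return "a overwrites b (b happened before a)"
    if dominates(b, a):
        return "b overwrites a (a happened before b)"
    return "concurrent - keep both as siblings and merge in the application"

v1 = {"replica1": 2, "replica2": 1}
v2 = {"replica1": 1, "replica2": 1}                  # v2 happened before v1
v3 = {"replica1": 1, "replica2": 1, "replica3": 1}   # concurrent with v1

print(compare(v1, v2))  # a overwrites b (b happened before a)
print(compare(v1, v3))  # concurrent - keep both as siblings and merge in the application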

Monday, April 7, 2025

The Chubby lock service for loosely-coupled distributed systems - Mike Burrows (Google)

 Abstract

  • Purpose: Chubby is designed to provide coarse-grained locking and reliable, low-volume storage for loosely-coupled distributed systems.
  • Interface: It offers an interface similar to a distributed file system with advisory locks.
  • Design Emphasis: Its primary design goals are availability and reliability, rather than high performance.
  • Real-World Usage: Many instances of Chubby have been in use for over a year, with some handling tens of thousands of concurrent clients.

1. Introduction

  • Chubby's Purpose: A lock service designed for synchronization and sharing basic environmental information within loosely-coupled distributed systems of moderately large numbers of small machines on high-speed networks (e.g., ten thousand 4-processor machines on 1Gbit/s Ethernet).
  • Deployment Scope: Primarily intended for single data centers or machine rooms, though at least one instance spans thousands of kilometers.
  • Primary Goals: Reliability, availability to a moderately large number of clients, and easy-to-understand semantics. Throughput and storage capacity were secondary concerns.
  • Client Interface: Similar to a simple file system with whole-file reads/writes, augmented with advisory locks and notifications for events like file modification.
  • Intended Use Cases: Primarily for coarse-grained synchronization, particularly leader election among equivalent servers (e.g., GFS master, Bigtable master discovery). Also used by GFS and Bigtable as a reliable, well-known location for storing small amounts of metadata, acting as the root of their distributed data structures. Some services use locks for coarse-grained work partitioning.
  • Benefits Over Existing Methods: Before Chubby, Google's distributed systems used ad hoc primary election (saving computing effort with Chubby) or required operator intervention (Chubby significantly improved availability by eliminating this need on failure).
  • Consensus Problem: Primary election is recognized as an instance of the distributed consensus problem, requiring a solution that works with asynchronous communication (like Ethernet or the Internet, where packets can be lost, delayed, and reordered).
  • Paxos Protocol: Asynchronous consensus is solved by the Paxos protocol, which forms the core of all working protocols encountered so far (including Oki and Liskov's viewstamped replication). Paxos ensures safety without timing assumptions, and clocks are needed for liveness (overcoming the Fischer et al. impossibility result).
  • Engineering Effort, Not Research: Building Chubby was an engineering effort to meet specific needs, not a research project with new algorithms or techniques.
  • Paper's Focus: The paper describes Chubby's design and implementation, how it evolved based on experience, unexpected uses, and identified mistakes. It omits details of well-established concepts like consensus protocols or RPC systems.

2. Design

    2.1 Rationale

        Why a Lock Service Over a Paxos Library:

  • Easier Adoption for Existing Systems: Developers often don't initially design for high availability with consensus protocols. A lock service allows them to integrate synchronization (like leader election) with minimal changes to existing program structure and communication patterns (e.g., acquiring a lock and passing a lock acquisition count).
  • Integrated Name Service Functionality: Many services needing leader election or data partitioning also need a way to advertise the results. Chubby's file system-like interface allows storing and fetching small amounts of data, reducing the need for a separate name service and leveraging the shared consistency features with consistent client caching (superior to time-based caching like DNS TTL).
  • Familiar Interface: A lock-based interface is more familiar to programmers, lowering the barrier to adopting a reliable distributed decision-making mechanism, even if they don't fully understand the complexities of distributed locks.
  • Reduced Server Dependency for Clients: A client using a lock service can make progress safely even with only one active client process, as the lock service (with its own replicas) acts as a generic electorate. This reduces the number of client-side servers needed for reliability compared to a client-side Paxos library.

        Key Design Decisions:

  • Lock Service Chosen: Instead of a client-side Paxos library or a separate consensus service.
  • Small-File Serving Implemented: To allow elected primaries to advertise themselves and parameters, avoiding the need for a separate name service.

        Decisions Based on Expected Use and Environment:

  • Scalable Read Access for Advertised Primaries: Allowing thousands of clients to observe the primary's file efficiently without excessive servers.
  • Event Notification: Useful for clients and replicas to avoid constant polling.
  • Client-Side File Caching: Desirable due to the likelihood of clients polling even if not strictly necessary.
  • Consistent Caching: Preferred over non-intuitive caching semantics to avoid developer confusion.
  • Security Mechanisms (Access Control): Necessary to prevent financial loss and security breaches.

        Decision for Coarse-Grained Locking:

  • Expected Use Case: Primarily for coarse-grained locks held for longer durations (hours or days) for tasks like leader election, rather than fine-grained locks for short-duration transaction control.
  • Benefits of Coarse-Grained Locks: Lower load on the lock server (acquisition rate less tied to client transaction rate), less impact from temporary lock server unavailability, and the ability to survive lock server failures (important for long-held locks).
  • Fine-Grained Locking Left to Clients: Clients can implement their own application-specific fine-grained locks, potentially using Chubby's coarse-grained locks to manage groups of these fine-grained locks, simplifying the client's responsibility for consensus implementation and server provisioning for high load.

    2.2 System structure:


  • Three Main Components:
    • Server (Replicas): The core Chubby instances.
    • Client Library: Linked with client applications, mediating all communication with Chubby servers.
    • Optional Proxy Server.
  • Chubby Cell: Consists of a small number (typically five) of replicas placed to minimize correlated failures.
  • Master Election via Consensus: Replicas use a distributed consensus protocol to elect a master. The master needs a majority of votes and promises (master lease) from those replicas not to elect another master for a short period.
  • Master Lease Renewal: The master periodically renews its lease by winning a majority vote.
  • Database Replication: Replicas maintain a simple database copy, but only the elected master initiates reads and writes to it. Updates from the master are copied to other replicas via the consensus protocol.
  • Client Discovery of Master: Clients query the replicas listed in DNS for the master's location. Non-master replicas forward the master's identity.
  • Client Communication: Once the master is found, clients direct all requests to it until it becomes unresponsive or indicates it's no longer the master.
  • Write Request Handling: Write requests are propagated through the consensus protocol to all replicas and acknowledged when a majority of replicas have received the write.
  • Read Request Handling: Read requests are served by the master alone, which is safe as long as its lease is valid (preventing other masters).
  • Master Failover: If the master fails, the remaining replicas initiate a new election when their master leases expire, typically resulting in a new master within a few seconds (though times can vary).
  • Replica Replacement: If a replica fails for an extended period, a replacement system provisions a new machine, starts the Chubby server, and updates the DNS. The current master periodically polls DNS, detects the change, and updates the cell's membership list in the replicated database. The new replica then catches up by retrieving backups and updates from active replicas and can vote in future elections once it processes a pending commit.

    2.3 Files, directories and handles

  • Hierarchical Namespace: Chubby presents a strict tree structure of files and directories, with names separated by slashes (e.g., /ls/foo/wombat/pouch). The /ls prefix is standard, and the second component is the Chubby cell name resolved via DNS (with local being a special name for the nearest cell).
  • UNIX-like Structure: Directories contain lists of children, and files contain sequences of bytes, making it familiar to users and allowing the use of existing file system tools via specialized APIs or interfaces like GFS.
  • Distribution-Friendly Design Differences from UNIX:
    • No file moving between directories.
    • No directory modified times.
    • Path-independent permissions (access controlled by file permissions, not path).
    • No last-access times (for easier metadata caching).
  • Nodes: The namespace contains only files and directories, collectively called nodes. Each node has a single name within its cell and no symbolic or hard links.
  • Permanent and Ephemeral Nodes: Nodes can be permanent (deleted explicitly) or ephemeral (deleted if no client has them open and, for directories, if they are empty). Ephemeral files serve as temporary files and client liveness indicators.
  • Advisory Reader/Writer Locks: Any node can function as an advisory lock.
  • Metadata with Access Control Lists (ACLs): Each node has metadata including names of three ACLs (read, write, change ACL). ACLs are inherited from the parent directory by default. ACLs are themselves files in a well-known /acl directory, containing lists of principal names (similar to Plan 9's groups). User authentication is handled by the RPC system. This file-based ACL system is also accessible to other services.
  • Monotonically Increasing Change Detection Numbers: Each node has four 64-bit numbers to easily detect changes:
    • Instance number (increments on renaming/recreation).
    • Content generation number (files only, increments on write).
    • Lock generation number (increments on lock state change).
    • ACL generation number (increments on ACL name change).
  • File Content Checksum: A 64-bit checksum is provided for file content comparison.
  • Handles: Clients open nodes to get handles (analogous to UNIX file descriptors) containing:
    • Check digits (preventing forgery, so access control is primarily at handle creation).
    • Sequence number (master can identify handles from previous masters).
    • Mode information (for master to reconstruct state on restarts with old handles).

    2.4 Locks and sequencers

  • Reader-Writer Locks: Files and directories can act as reader-writer locks. Exclusive (writer) mode allows one holder, shared (reader) mode allows multiple holders.
  • Advisory Locks:
    • Locks are advisory, meaning they only conflict with other lock acquisition attempts. Holding a lock doesn't necessitate accessing the associated file nor prevent others from doing so.
  • Rejection of Mandatory Locks: Mandatory locks (making locked objects inaccessible to non-holders) were rejected because:
    • Chubby locks often protect resources managed by other services, requiring extensive modifications to enforce mandatory locking there.
    • They didn't want to force application shutdowns for debugging or administrative access to locked files.
    • Developers rely on assertions for error checking, diminishing the value of mandatory checks. Buggy processes have other ways to corrupt data even without mandatory locking.
  • Write Permission for Lock Acquisition: Acquiring a lock (in either mode) requires write permission to prevent unprivileged readers from blocking writers.
  • Challenges of Distributed Locking: Distributed systems with uncertain communication and independent process failures make locking complex (e.g., delayed or reordered requests after a lock holder fails).
  • Sequencers for Ordering: Chubby provides "sequencers" – opaque byte-strings describing the lock's state upon acquisition (lock name, mode, generation number). Clients pass sequencers to servers to indicate lock protection. Servers verify the sequencer's validity and mode before acting on the request. This mechanism adds ordering information only to lock-protected interactions.
  • Lock-Delay for Non-Sequencer Aware Systems: For systems not using sequencers, Chubby implements a "lock-delay." If a lock becomes free due to holder failure, the server prevents other clients from acquiring it for a specified duration (up to one minute). This imperfectly mitigates issues caused by delayed or reordered requests and client restarts, protecting unmodified systems.
  • Normal Lock Release: When a client releases a lock normally, it becomes immediately available.

    2.5 Events

  • Event Subscription: Clients can subscribe to various events when creating a handle to a Chubby node.
  • Asynchronous Delivery: Events are delivered to the client via an up-call from the Chubby client library.
  • Types of Events:
    • File contents modified: Used to monitor service locations advertised in files.
    • Child node added, removed, or modified: Used for mirroring and monitoring ephemeral files without affecting their reference counts.
    • Chubby master failed over: Warns clients of potential lost events, requiring data rescanning.
    • Handle/lock invalidation: Indicates a communication problem.
    • Lock acquired: Signals when a primary has been elected.
    • Conflicting lock request from another client: Theoretically allows for caching data based on Chubby locks.
  • Delivery Guarantee: Events are delivered after the corresponding action has occurred. Clients are guaranteed to see the new data (or more recent data) upon subsequent reads after a file modification event.
  • Underutilized Events: The "lock acquired" and "conflicting lock request" events are rarely used and could have potentially been omitted. Clients typically need to communicate with a new primary (requiring a file modification event with the address) rather than just knowing an election occurred. The conflicting lock event for cache consistency has not been adopted by users.

    2.6 API

  • Handle Management:
    • Open(name, options): Opens a named file or directory (name relative to an existing directory handle, with "/" being the root). Takes options for usage mode (read, write+lock, ACL change), event subscriptions, lock-delay, and creation flags (with optional initial content/ACLs). Returns a handle and indicates if a new node was created.
    • Close(handle): Closes an open handle, making it unusable. Never fails.
    • Poison(handle): Causes outstanding and subsequent operations on the handle to fail without closing it (for canceling calls from other threads).
  • Data Access:
    • GetContentsAndStat(handle): Atomically reads the entire content and metadata of a file. Encourages small files.
    • GetStat(handle): Returns only the metadata of a node.
    • ReadDir(handle): Returns the names and metadata of the children of a directory.
    • SetContents(handle, content, [generation]): Atomically writes the entire content of a file. Optionally takes a content generation number for compare-and-swap semantics.
    • SetACL(handle, acl_names, [generation]): Similar to SetContents, but operates on the ACL names of a node.
  • Node Management:
    • Delete(handle): Deletes the node if it has no children.
  • Locking:
    • Acquire(handle, mode, [blocking]): Acquires a lock in exclusive or shared mode (can be blocking or non-blocking via TryAcquire()).
    • TryAcquire(handle, mode): Attempts to acquire a lock without blocking.
    • Release(handle): Releases a held lock.
    • GetSequencer(handle): Returns a sequencer for a held lock.
    • SetSequencer(handle, sequencer): Associates a sequencer with a handle; subsequent operations fail if the sequencer is invalid.
    • CheckSequencer(handle, sequencer): Checks if a given sequencer is valid.
  • General Operation Parameter: All calls take an operation parameter for:
    • Asynchronous call with a callback.
    • Waiting for asynchronous call completion.
    • Obtaining extended error and diagnostic information.
  • Handle Validity: Handles are associated with a specific instance of a file/directory, not just the name. Operations on a handle to a deleted (even if recreated) node will fail. Access control checks can occur on any call, but always on Open().
  • Primary Election Example: Clients attempt to acquire a lock on a designated file. The winner becomes the primary, writes its identity to the file using SetContents(), and clients/replicas read it using GetContentsAndStat() (often triggered by a file modification event). The primary ideally uses GetSequencer() and passes it to servers, which verify it with CheckSequencer(). Lock-delay can be used for non-sequencer-aware services.
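
  • A rough Python-style sketch of that primary-election flow, written against a hypothetical client-library binding of the section 2.6 calls (chubby.open, handle.try_acquire, etc. are illustrative names; the real library is C++ and its exact signatures are not reproduced here):

def run_primary_election(chubby, my_address, serve_requests):
    # Every candidate tries to grab the same well-known lock file
    # (the path below is made up for illustration).
    handle = chubby.open("/ls/local/myservice/leader", mode="write+lock")

    if not handle.try_acquire(mode="exclusive"):
        # Lost the election: read the winner's address (GetContentsAndStat) and
        # watch the file-modification event to learn about future primaries.
        primary_address, _metadata = handle.get_contents_and_stat()
        return primary_address

    # Won the election: advertise ourselves via SetContents so clients and
    # replicas can discover the primary by reading this small file.
    handle.set_contents(my_address)

    # Pass the sequencer to downstream servers, which call CheckSequencer to
    # reject requests issued under a stale primary's lock.
    sequencer = handle.get_sequencer()
    serve_requests(sequencer)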

2.7 Caching

  • Consistent Write-Through Cache: Clients maintain an in-memory cache of file data and metadata (including absence) for read performance.
  • Lease-Based Consistency: Cache consistency is maintained using a lease mechanism and invalidations from the master.
  • Master Tracks Cached Data: The master keeps track of what each client is caching.
  • Invalidation on Modification: When data changes, the master blocks the modification, sends invalidations to all caching clients (via KeepAlive RPCs), and proceeds only after all clients acknowledge (via next KeepAlive) or their lease expires.
  • Single Round of Invalidation: Only one invalidation round is needed as the master treats the node as uncachable until acknowledgements are received.
  • Read Performance Priority: This approach prioritizes read performance by allowing reads to proceed without delay.
  • Simple Invalidation Protocol: Cached data is invalidated on change, never updated. Update-only protocols were deemed inefficient due to potentially unbounded unnecessary updates.
  • Strict Consistency Rationale: Weaker consistency models were rejected for being harder for programmers to use. Complex client-side sequencing (like virtual synchrony) was also deemed inappropriate for the diverse existing environments.
  • Handle Caching: Clients also cache open handles to avoid redundant Open() RPCs. This is restricted for ephemeral files and concurrent locking handles to maintain correct semantics.
  • Lock Caching: Clients can cache locks (hold them longer than needed) and are notified via an event if another client requests a conflicting lock.

    2.8 Sessions and KeepAlives:

  • Session Definition: A time-bound relationship between a Chubby cell and a client, maintained by periodic KeepAlive handshakes.
  • Session Validity: Handles, locks, and cached data remain valid as long as the session is valid (subject to acknowledging invalidations).
  • Session Initiation and Termination: Sessions are requested on first contact and end explicitly on client termination or after a minute of inactivity (no open handles/calls).
  • Session Lease: Each session has a lease (time interval) during which the master guarantees not to unilaterally terminate it. The end of this interval is the session lease timeout.
  • Lease Timeout Advancement: The master extends the lease timeout on session creation, master failover, and in response to KeepAlive RPCs.
  • KeepAlive Mechanism: Clients send KeepAlive RPCs. The master typically blocks the reply until the current lease is near expiration, then returns with a new lease timeout (default 12s extension). Clients send a new KeepAlive immediately after receiving a reply, ensuring a near-constant blocked KeepAlive at the master.
  • Piggybacking Events and Invalidations: KeepAlive replies also carry events and cache invalidations, forcing clients to acknowledge invalidations to maintain their session. This also ensures all Chubby RPCs flow from client to master.
  • Local Lease Timeout: Clients maintain a conservative local estimate of the master's lease timeout, accounting for network latency and potential clock drift.
  • Jeopardy and Grace Period: If the local lease timeout expires, the session is in jeopardy, the cache is emptied and disabled. The client enters a grace period (45s default) to re-establish communication.
  • Session Recovery or Expiration Events: The library informs the application via jeopardy, safe (session recovered), and expired events. This allows applications to quiesce during uncertainty and recover without restarting.
  • Handle Invalidation on Session Expiration: If a session expires during an operation on a handle, all subsequent operations (except Close() and Poison()) on that handle will fail, ensuring that network outages cause only a suffix of operations to be lost, aiding in marking complex changes as committed.
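
  • A small sketch of the client-side lease bookkeeping described above (constants and names are mine; the real client library's internals differ):

import time

KEEPALIVE_EXTENSION_S = 12   # default lease extension granted by the master
GRACE_PERIOD_S = 45          # default grace period while searching for a new master
SAFETY_MARGIN_S = 1          # conservative allowance for network latency / clock drift

class SessionState:
    def __init__(self):
        self.local_lease_deadline = 0.0
        self.jeopardy_deadline = float("inf")
        self.in_jeopardy = False

    def on_keepalive_reply(self, lease_extension_s=KEEPALIVE_EXTENSION_S):
        # The local estimate is deliberately conservative: assume the reply spent
        # time in flight and that the client's clock may run fast.
        self.local_lease_deadline = time.time() + lease_extension_s - SAFETY_MARGIN_S
        self.in_jeopardy = False  # "safe" event: the session is (re)confirmed

    def check(self):
        now = time.time()
        if not self.in_jeopardy and now > self.local_lease_deadline:
            # Local lease expired: session is in jeopardy - the cache is emptied
            # and disabled while the client keeps trying to reach a master.
            self.in_jeopardy = True
            self.jeopardy_deadline = now + GRACE_PERIOD_S
        if self.in_jeopardy and now > self.jeopardy_deadline:
            return "expired"   # the application should treat the session as lost
        return "jeopardy" if self.in_jeopardy else "valid"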

    2.9 Fail-overs:




  • Master State Loss: When a master fails, it loses its in-memory state about sessions, handles, and locks.
  • Session Lease Timer Stops: The authoritative session lease timer on the master stops during failover, effectively extending client leases.
  • Client Behavior During Failover:
    • Quick Election: Clients might contact the new master before their local lease timers expire.
    • Long Election: Clients flush caches and enter the grace period while trying to find the new master.
  • Grace Period Importance: The grace period allows session maintenance across master failovers exceeding the normal lease timeout.
  • Event Sequence (Figure 2): Illustrates a lengthy failover where the client relies on the grace period to maintain its session, showing lease expirations and KeepAlive exchanges.
  • Client Illusion of No Failure: Once a new master is contacted, the client library and master cooperate to present an illusion of continuous operation to the application.
  • New Master's Reconstruction: The new master reconstructs a conservative approximation of the previous master's in-memory state by:
    • Picking a new client epoch number to reject old packets.
    • Initially only responding to master-location requests.
    • Building in-memory structures for sessions and locks from the replicated database, extending leases conservatively.
    • Allowing KeepAlives but no other session operations initially.
    • Emitting a failover event, causing clients to flush caches and warn applications.
    • Waiting for all sessions to acknowledge the failover or expire.
    • Allowing all operations to proceed.
    • Recreating in-memory handles created before the failover (using sequence numbers) and preventing their re-creation after closing within the same epoch.
    • Deleting orphaned ephemeral files (without open handles) after a delay (e.g., one minute), requiring clients to refresh ephemeral file handles after failover.
  • Failover Code Complexity: The less frequently executed failover code has been a source of many bugs.

    2.10 Database Implementation:

  • Initial Version (Berkeley DB): Used the replicated version of Berkeley DB, a B-tree mapping byte-string keys to values. A custom key comparison function kept sibling nodes adjacent. Its replicated logs aligned well with Chubby's design after master leases were added.
  • Reason for Rewrite: While Berkeley DB's B-tree was mature, the replication code was newer and less widely used, posing a perceived higher risk.
  • Current Version (Simple Custom Database): A simple database using write-ahead logging and snapshotting (similar to Birrell et al.) was implemented. The database log is still distributed using a consensus protocol.
  • Simplification Benefits: The rewrite allowed significant simplification of Chubby as it used few of Berkeley DB's advanced features (e.g., needing atomic operations but not general transactions).

    2.11 Backup:

  • Periodic Snapshots: Every few hours, the master of each Chubby cell writes a snapshot of its database to a GFS file server in a separate building.
  • Benefits: Provides disaster recovery and a way to initialize new replicas without loading active ones.
  • Location Choice: Using a separate building ensures survival against local damage and avoids cyclic dependencies with GFS master election.

    2.12 Mirroring:

  • Fast Replication: Allows mirroring a collection of small files between Chubby cells. The event mechanism ensures near-instantaneous replication (under a second worldwide with good network).
  • Unreachable Mirrors: Unreachable mirrors remain unchanged until connectivity is restored, with checksums used to identify updated files.
  • Common Use Case: Copying configuration files to globally distributed computing clusters.
  • Global Cell: A special "global" cell with geographically distributed replicas mirrors its /ls/global/master subtree to /ls/cell/slave in all other Chubby cells, ensuring high accessibility.
  • Mirrored Content: Includes Chubby's own ACLs, service presence announcements for monitoring, pointers to large datasets (like Bigtable cells), and configuration files for other systems.
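
A checksum-driven mirror pass could be sketched as below. The dict-like source and mirror stores and the MD5 choice are assumptions for illustration; the paper only says checksums are compared to find files that changed while a mirror was unreachable.

import hashlib

def checksum(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()

def mirror_pass(source: dict, mirror: dict) -> list:
    """Copy only the files whose content differs between source and mirror.

    An unreachable mirror simply misses passes and catches up later:
    comparing checksums identifies which files changed in the meantime.
    """
    updated = []
    for name, content in source.items():
        if name not in mirror or checksum(mirror[name]) != checksum(content):
            mirror[name] = content
            updated.append(name)
    # Files deleted at the source are removed from the mirror as well.
    for name in list(mirror):
        if name not in source:
            del mirror[name]
            updated.append(name)
    return updated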

3 Mechanisms for Scaling:

  • Challenge: Chubby must handle a large number of individual client processes (potentially 90,000+ per master).
  • Ineffectiveness of Minor Master Optimizations: Assuming the master has no serious performance bug, minor improvements to its request processing have little effect on scalability; the techniques that matter are those that reduce communication with the master.
  • Scaling Approaches:
    • Multiple Chubby Cells: Clients use nearby cells (via DNS) to avoid reliance on remote machines. Typical deployment: one cell per data center of thousands of machines.
    • Increased Lease Times: Master can increase lease times (up to ~60s from 12s) under heavy load to reduce KeepAlive frequency (the dominant request type and common overload failure mode).
    • Client-Side Caching: Clients cache file data, metadata, absence, and handles to reduce server calls.
    • Protocol Conversion Servers: Translate Chubby protocol to less complex ones (like DNS).
    • Proxies (Planned): Trusted processes that forward client requests to a Chubby cell. Can significantly reduce master load by handling KeepAlives and read requests (write traffic is low). However, they add an extra RPC and potentially increase temporary unavailability risk. Failover for proxies needs improvement.
    • Partitioning (Planned): Partitioning the cell's namespace by directory across multiple master/replica sets. Each partition handles most calls independently, reducing load per partition but not necessarily KeepAlive traffic. Cross-partition communication is needed for some operations (ACLs, directory deletion) but is expected to have modest impact. A combination of proxies and partitioning is the strategy for further scaling.
  • Current Scaling Needs: No immediate need for >5x scaling due to data center limits and hardware improvements increasing both client and server capacity.
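
For the planned partitioning, the namespace is split by directory so that all nodes under a directory land on the same partition. A hash-of-parent-directory rule, sketched below with made-up helper names and partition count, is one way to express that assignment:

import hashlib

NUM_PARTITIONS = 4   # illustrative; each partition would have its own master and replicas

def partition_of(node_path: str, n: int = NUM_PARTITIONS) -> int:
    """Assign a node to a partition based on its parent directory, so that all
    children of a directory are served by the same partition."""
    parent = node_path.rsplit("/", 1)[0] or "/"
    digest = hashlib.sha1(parent.encode()).hexdigest()
    return int(digest, 16) % n

# Siblings map to the same partition; operations that span directories
# (e.g., checking a parent's ACL, deleting a directory) may cross partitions.
assert partition_of("/ls/cell/app/lock-a") == partition_of("/ls/cell/app/lock-b")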

4 Use, Surprises, and Design Errors

    4.1 Use and Behaviour:

  • Many files used for naming, configuration, ACLs, and metadata.
  • Significant negative caching (caching file absence).
  • High average client count per cached file (~10).
  • Few clients hold locks, mostly exclusive (for primary election/partitioning).
  • RPC traffic dominated by KeepAlives, few reads/writes/lock acquisitions.
  • Most outages are short (<30s) and often caused by network issues, maintenance, software errors, and overload.
  • Data loss has been rare (6 times in many cell-years), mainly due to database/operator errors.
  • Chubby data fits in RAM, leading to low mean request latency until overload (high active sessions or exceptional read/write patterns).
  • Server scaling relies more on reducing communication than optimizing server code paths. Client-side cache performance is critical.

    4.2 Java Clients:

  • Reluctance of Java programmers to use JNI led to the development and maintenance of a protocol-conversion server for Java clients.

    4.3 Use as a Name Service:

  • Most popular use case, surpassing locking.
  • Chubby's explicit invalidation caching is superior to DNS's time-based (TTL) caching, especially for high-churn environments with many clients.
  • Load spikes from large job startups were mitigated by batching name entries.
  • A simple protocol-conversion server for name lookups was created.
  • A Chubby DNS server bridges Chubby names to DNS clients.
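
The difference between TTL-based and invalidation-based caching can be sketched with two toy caches; the classes and the resolve() callback below are hypothetical, not DNS's or Chubby's real interfaces.

import time

class TTLCache:
    """DNS-style caching: entries are trusted for a fixed TTL, so a low TTL
    means heavy refresh load and a high TTL means stale names."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.entries = {}

    def get(self, name, resolve):
        value, stored_at = self.entries.get(name, (None, 0.0))
        if value is None or time.time() - stored_at > self.ttl:
            value = resolve(name)                     # hits the name server
            self.entries[name] = (value, time.time())
        return value

class InvalidationCache:
    """Chubby-style caching: entries stay valid until the server explicitly
    invalidates them, so idle clients generate no refresh traffic."""

    def __init__(self):
        self.entries = {}

    def get(self, name, resolve):
        if name not in self.entries:
            self.entries[name] = resolve(name)
        return self.entries[name]

    def invalidate(self, name):
        # Called when the server notifies the client of a change.
        self.entries.pop(name, None)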

    4.4 Problems with Fail-over:

  • Original design of writing sessions to the database on creation caused overload during mass client starts.
  • Optimization to write sessions on first modification/lock/ephemeral open led to potential loss of read-only sessions during failover.
  • New design avoids writing sessions to disk, recreating them on the new master (like handles), requiring the new master to wait a full lease timeout.
  • This new design enables proxies to manage sessions unknown to the master.

    4.5 Abusive Clients:

  • Shared Chubby cells require isolation from misbehaving clients.
  • Lack of Quotas: Naive assumption that Chubby wouldn't be used for large data led to problems with a service rewriting a large file on every user action. File size limits were introduced.
  • Publish/Subscribe Misuse: Attempts to use the event mechanism for general pub/sub were inefficient due to Chubby's strong guarantees and invalidation-based caching.

    4.6 Lessons Learned:

  • Developers often underestimate failure probabilities and treat Chubby as always available.
  • Developers often confuse service uptime with application availability (e.g., global cell availability vs. local).
  • API choices can unintentionally influence how developers handle outages (e.g., crashing on master failover event).
  • Proactive review of Chubby usage plans and providing higher-level libraries help mitigate availability risks.
  • Coarse-grained locking is often sufficient and avoids unnecessary communication.
  • Poor API choices (like tying cancellation to handle invalidation) can have unexpected limitations. A separate Cancel() RPC might be better.
  • Combining session lease refresh with event/invalidation delivery via KeepAlives (initially via TCP, then UDP due to TCP backoff issues) has implications for transport protocol choice. A potential solution is a separate TCP-based GetEvent() RPC alongside UDP KeepAlives.
  • Lack of aggressive caching (especially for file absence and handle reuse) led to inefficient client behavior. Initial countermeasures (exponential backoff on repeated Open()) were less effective than making repeated Open() calls cheap.
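
The last point, making repeated Open() calls cheap through handle reuse and negative caching, might look like this on the client side. The CachingClient class and the raw_open callback are assumed names, not Chubby's actual client code.

class CachingClient:
    """Sketch: make repeated Open() calls on the same (possibly missing) name
    cheap by caching open handles and negative results."""

    def __init__(self, raw_open):
        self.raw_open = raw_open      # assumed underlying RPC: returns a handle or None
        self.handles = {}             # name -> open handle
        self.absent = set()           # names known not to exist

    def open(self, name):
        if name in self.absent:
            return None               # served from the negative cache
        if name in self.handles:
            return self.handles[name] # reuse the existing handle
        handle = self.raw_open(name)  # only now talk to the server
        if handle is None:
            self.absent.add(name)
        else:
            self.handles[name] = handle
        return handle

    def invalidate(self, name):
        # Driven by server invalidations, e.g. when the file is created or deleted.
        self.absent.discard(name)
        self.handles.pop(name, None)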

5 Comparison with related work

  • Based on Established Ideas: Chubby's design builds upon well-established concepts in distributed systems.
  • Cache Design: Inspired by distributed file system caching mechanisms.
  • Sessions and Cache Tokens: Similar to concepts in the Echo distributed file system and the V system (reducing lease overhead).
  • General-Purpose Lock Service Idea: Found in VMS, though VMS initially had a high-speed interconnect.
  • File System-like API: Chubby's API, including the namespace, borrows from file system models, recognizing the convenience for more than just files.
  • Differences from Distributed File Systems (Echo, AFS):
    • Performance and Storage Aspirations: Chubby doesn't aim for high throughput or low latency for uncached data, nor does it expect clients to read/write/store large amounts of data.
    • Emphasis on Consistency, Availability, and Reliability: These are easier to achieve with less focus on raw performance.
    • Small Database and Replication: Allows for many online copies (replicas and backups) and frequent integrity checks.
  • Scalability: The relaxed performance/storage requirements enable serving tens of thousands of clients per master.
  • Central Coordination Point: Solves a class of problems for system developers by providing a shared information and coordination hub.
  • Comparison with Boxwood's Lock Server: Chosen for comparison due to its recent design for loosely-coupled environments, yet with different design choices.
  • Chubby's Integrated Approach: Combines locks, reliable small-file storage, and a session/lease mechanism in a single service.
  • Boxwood's Separated Approach: Divides these functionalities into three independent services: a lock service, a Paxos service (for reliable state), and a failure detection service.
  • Target Audience Difference: Chubby targets a diverse audience (experts to novices) with a familiar API for a large-scale shared service. Boxwood provides a toolkit for potentially smaller groups of more sophisticated developers working on projects that might share code but aren't necessarily used together.
  • Interface Level: Chubby often provides a higher-level interface (e.g., combined lock and file namespaces). Boxwood's Paxos service client might implement caching using the lock service.
  • Default Parameter Differences: Boxwood uses more aggressive failure detection (frequent pings) with shorter timeouts, while Chubby has longer default lease and KeepAlive intervals, reflecting different scale and failure expectations.
  • Replication Factor: Boxwood subcomponents typically use 2-3 replicas, while Chubby uses 5 per cell, likely due to scale and shared infrastructure uncertainties.
  • Chubby's Grace Period: A key difference, allowing clients to survive long master outages without losing sessions/locks, which Boxwood lacks (its "grace period" is equivalent to Chubby's session lease). This reflects different expectations about the cost of lost locks.
  • Lock Purpose: Chubby locks are heavier-weight and require sequencers for external resource protection. Boxwood locks are lighter-weight and primarily intended for internal Boxwood use.

6 Summary

  • Chubby is a distributed lock service used for coarse-grained synchronization and widely adopted as a name and configuration service within Google. 
  • Its design combines distributed consensus, consistent client-side caching, timely notifications, and a familiar file system interface.
  • Scaling is achieved through caching, protocol conversion, and load adaptation, with plans for proxies and partitioning.
  • Chubby is a central component for various Google systems (name service, MapReduce rendezvous, GFS/Bigtable master election, high-availability configuration).
  • Operating within a single company minimizes malicious DoS, but mistakes and differing developer expectations pose similar challenges.
  • Heavy-handed remedies (like usage reviews) are used to prevent abuse, but predicting future usage is difficult.
  • Lack of performance advice in documentation can lead to misuse.
  • Lessons learned highlight the importance of robust caching, careful API design, understanding developer behavior, and proactive guidance on using distributed services.

Sunday, April 6, 2025

Consistency Tradeoffs in Modern Distributed Databases System Design - Abadi PACELC

 1. Introduction

  • The CAP theorem's impact on modern distributed database system design is more limited than is often perceived. Another tradeoff, between consistency and latency, has had a more direct influence on several well-known DDBSs. A proposed new formulation, PACELC, unifies this tradeoff with CAP.
  • Recent Industry Adoption of DDBSs: Despite decades of research, the extensive use of Distributed Database Systems (DDBSs) by the industry is a relatively recent trend.
  • Two Primary Drivers: This trend is driven by the need for elastically scalable database systems to handle increased data and transactional throughput, and the requirement for worldwide data accessibility due to globalization.
  • Examples of Recent DDBSs: The past decade has seen the development of numerous DDBSs aiming for high scalability or global reach, including SimpleDB/Dynamo/DynamoDB, Cassandra, Voldemort, Sherpa/PNUTS, Riak, HBase/BigTable, MongoDB, VoltDB/H-Store, and Megastore.
  • Complexity and the Value of Understanding Tradeoffs: Building DDBSs is complex and difficult, making tools that help designers understand the inherent tradeoffs valuable.  
  • The CAP Theorem's Usefulness and Misunderstanding: The CAP theorem has been helpful in reasoning about system capabilities and exposing marketing hype, but it is increasingly misunderstood and misapplied.  
  • Misconception about CAP's Restrictions: Many designers incorrectly believe CAP imposes restrictions during normal system operation, leading to unnecessarily limited implementations.
  • CAP's Actual Limitations: In reality, CAP only defines limitations in the presence of certain types of failures and does not constrain normal operation.
  • Consistency/Latency Tradeoff's Influence: A different tradeoff, between consistency and latency, has arguably been more influential on DDBS design during normal operation than CAP.
  • Importance of Both Tradeoffs: Both CAP-related tradeoffs (during failures) and the consistency/latency tradeoff (during normal operation) are significant.
  • PACELC as a Unifying Formulation: Unifying CAP and the consistency/latency tradeoff into a single framework called PACELC can lead to a deeper understanding of modern DDBS design.

2. CAP is for failures

  • CAP Theorem Basics: The CAP theorem states that a Distributed Database System (DDBS) can only guarantee two out of three properties: Consistency (C), Availability (A), and Partition Tolerance (P), leading to CA, CP, or AP system possibilities.
  • Modern DDBSs and Consistency: Many modern DDBSs (like SimpleDB/Dynamo, Cassandra, Voldemort, Sherpa/PNUTS, and Riak) initially did not guarantee strong consistency as defined by CAP.
  • Definition of Consistency in CAP: The proof of CAP uses the definition of atomic/linearizable consistency, where operations appear to complete instantaneously in a total order, as if on a single node.
  • Initial Assumption about CAP's Influence: It's often assumed that CAP heavily influenced modern architects to build eventually consistent systems, reasoning that partition tolerance necessitates a choice between consistency and availability, and high availability is often prioritized.
  • Flaw in the Assumption: This reasoning is flawed because the consistency/availability tradeoff according to CAP only occurs during an actual network partition, not simply due to the requirement of partition tolerance.
  • Network Partition Probability: The likelihood of a network partition depends on factors like network scope (WAN vs. local), hardware quality, configuration management, and redundancy levels. Generally, network partitions are less frequent than other serious failures.
  • CAP's Restriction is Conditional: CAP imposes no restrictions on a DDBS during normal operation (in the absence of partitions).
  • CAP Doesn't Justify Reduced Consistency in Normal Operation: Therefore, DDBSs that reduce consistency by default, even when there are no network partitions, are not necessarily doing so because of CAP.
  • CAP Allows ACID and High Availability without Partitions: CAP permits a system to provide full ACID guarantees alongside high availability when network partitions are not present.
  • Conclusion on Default Consistency Reduction: The CAP theorem does not fully justify the default configuration of many DDBSs that reduce consistency and other ACID properties.

3. Consistency/Latency tradeoff

  • Context of Modern DDBS Design: To understand modern Distributed Database System (DDBS) design, it's crucial to consider the use cases for which these systems were initially built.
  • Examples of DDBS Use Cases:
    • Dynamo was designed by Amazon for core e-commerce services (e.g., shopping cart).
    • Cassandra was created by Facebook to power its Inbox Search.
    • Voldemort was built by LinkedIn to handle online updates from write-intensive features.
    • PNUTS was developed by Yahoo to store user data for webpage views, shopping page listings, and social networking applications.
    • Riak was motivated by use cases similar to Amazon's.
  • Common Use Case Characteristics: These systems typically serve data for webpages generated dynamically and delivered to active website users, and they handle online updates.
  • Latency Sensitivity: Studies show that latency is critical for online interactions, with even small increases (e.g., 100ms) significantly impacting user engagement and retention.
  • Consistency, Availability, and Latency Tradeoff: There's a fundamental tradeoff between consistency, availability, and latency in these systems.
  • Availability vs. Latency: Availability and latency are closely related; unavailability is essentially extreme latency. For this discussion, "high latency" means approaching hundreds of milliseconds, while "unavailability" means exceeding a typical request timeout (e.g., a few seconds). The paper later simplifies this to a consistency vs. latency tradeoff.
  • Tradeoff Independent of Partitions: This consistency/latency tradeoff exists even without network partitions and is distinct from the CAP theorem's tradeoffs.
  • Reason for the Tradeoff: Replication and Failure: High availability necessitates data replication because system component failures are inevitable over time. Replication is required to maintain availability when failures occur.

4. Data replication

  • Replication and the Consistency/Latency Tradeoff: As soon as a DDBS replicates data, a fundamental tradeoff between consistency and latency emerges.
  • Three Basic Replication Strategies: There are three primary ways to implement data replication:
    • Sending updates to all replicas simultaneously.
    • Sending updates to an agreed-upon master node first.
    • Sending updates to a single (arbitrary) node first.

    1. Sending Updates to All Replicas Simultaneously:

  • Consistency Issues (without preprocessing/agreement): If updates are sent directly to all replicas without a coordination mechanism, concurrent updates from different clients can lead to replica divergence and inconsistency due to different replicas potentially applying updates in different orders. Even with commutative updates, strict consistency definitions might not be met.
  • Consistency via Preprocessing/Agreement: To ensure consistency, updates can pass through a preprocessing layer or involve an agreement protocol among all involved nodes to decide the order of operations.
  • Increased Latency (with preprocessing/agreement): Achieving consistency through these methods introduces increased latency due to:
    • The overhead of the agreement protocol itself.
    • Routing updates through an additional preprocessor component.
    • The preprocessor potentially requiring its own agreement protocol if it's composed of multiple machines.
    • Forcing all updates to route through a single preprocessor machine, regardless of where the update originates, even if a closer replica exists.
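
A toy version of the preprocessing approach is sketched below; it shows why consistency costs latency here (an extra hop plus ordering work before any replica applies the update). All names are made up for illustration.

import itertools

class Preprocessor:
    """Assigns a global sequence number to each update and fans it out, so
    every replica can apply updates in the same order."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.seq = itertools.count()

    def submit(self, update):
        ordered = (next(self.seq), update)   # extra hop: every write passes through here
        for replica in self.replicas:
            replica.deliver(ordered)

class Replica:
    def __init__(self):
        self.pending = {}
        self.next_seq = 0
        self.state = []

    def deliver(self, ordered_update):
        seq, update = ordered_update
        self.pending[seq] = update
        # Apply strictly in sequence-number order, even if delivery is reordered.
        while self.next_seq in self.pending:
            self.state.append(self.pending.pop(self.next_seq))
            self.next_seq += 1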

    2. Data updates sent to an agreed-upon location first (Master Node):

  • A designated "master node" handles all update requests for a specific data item, determining the order of updates which is then followed by all replicas.
  • After processing updates, the master node replicates them to other replicas.
    • a. Synchronous Replication:
      • The master node waits for confirmation that all replicas have received the update.
      • Consistency: Ensures replicas remain consistent.
      • Latency: Increases latency, especially over a WAN, due to the need for message passing and the limitation imposed by the slowest entity.
    • b. Asynchronous Replication:
      • The update is considered complete (at least written to stable storage at the master) before replication to other nodes is confirmed to the initiator.
      • Consistency/Latency Tradeoff depends on read handling:
        • i. All reads routed to the master node:
          • Consistency: No reduction in consistency as reads always go to the source of truth.
          • Latency: Read latency increases because every request must travel to the master, even when a closer replica exists, and there is no read load balancing: reads must wait if the master is busy or has failed.
        • ii. Reads can be served from any node:
          • Latency: Read latency is generally much better as reads can be served locally.
          • Consistency: Can lead to inconsistent reads as different replicas might have different versions of the data during update propagation. While techniques like sequence numbers and sequential/read-your-writes consistency can bound this inconsistency, they are still forms of reduced consistency. Additionally, write latency can be high if the master node is geographically distant from the write requester.
    • c. Combination of Synchronous and Asynchronous Replication
      • Updates are sent to a subset of replicas synchronously and the rest asynchronously.
      • The consistency/latency tradeoff in this scenario will depend on the size and selection of the synchronous subset.
      • The consistency/latency tradeoff is again determined by how the system handles read requests.
        •  i. Reads routed to at least one synchronously updated node (e.g., quorum-based with R + W > N):
          • Consistency: Consistency can be preserved by ensuring reads involve at least one replica that has received the synchronous update.
          • Latency: The latency problems of full synchronous replication (2)(a) and of routing all reads to the master (2)(b)(i) are still present, though potentially to a lesser extent, because fewer nodes participate in the synchronous step and more than one node may be able to serve reads.
        • ii. Reads can be served from nodes not yet synchronously updated (e.g., quorum-based with R + W ≤ N):
          • Consistency: Inconsistent reads are possible, similar to the fully asynchronous read scenario (b)(ii), as reads might hit replicas that haven't received the latest updates.
          • Technical Consistency Note: While quorum protocols aim for a degree of consistency, simply using a quorum is technically insufficient to guarantee the strict linearizability consistency defined by Gilbert and Lynch. However, the additional protocol complexities needed for full consistency are not the focus here; latency is already a factor in quorum protocols.
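
A minimal sketch of the R + W > N overlap argument, with integer versions standing in for real timestamps (and ignoring the extra machinery the note above says full linearizability would require):

N, W, R = 5, 3, 3           # R + W > N, so every read quorum overlaps every write quorum

replicas = [{"value": None, "version": 0} for _ in range(N)]

def write(value, version):
    # Synchronously update W replicas; the other N - W would be updated
    # asynchronously (omitted in this sketch).
    for replica in replicas[:W]:
        replica["value"], replica["version"] = value, version

def read():
    # Contact any R replicas and return the value with the highest version.
    contacted = replicas[N - R:]    # deliberately chosen from the "other end"
    freshest = max(contacted, key=lambda r: r["version"])
    return freshest["value"]

write("v1", version=1)
assert read() == "v1"               # the quorums overlap in at least R + W - N = 1 replica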

    3. Data updates sent to an arbitrary location first:

  • The system accepts an update at the location where it's initiated and then propagates it to other replicas.
  • Unlike the master node approach (2), the initial update location for a specific data item can vary between different updates. Simultaneous updates for the same item can originate from different locations.
  • a. Synchronous Replication:
    • Latency: Suffers from the same latency issues as synchronous replication with a master node (2)(a).
    • Additional Latency: Can incur extra latency to detect and resolve conflicts arising from simultaneous updates initiated at different locations.
  • b. Asynchronous Replication:
    • Consistency: Faces consistency problems similar to those in sending updates to all replicas simultaneously (1) and asynchronous replication with a master node (2)(b), as updates might not be immediately reflected across all replicas, leading to potential inconsistencies during reads.
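
One standard way (not prescribed by the paper) to detect and resolve the conflicts mentioned above is to attach a version vector to each update; concurrent updates are surfaced to the application for reconciliation. A minimal sketch:

def dominates(vv_a: dict, vv_b: dict) -> bool:
    """True if version vector vv_a reflects every update recorded in vv_b."""
    return all(vv_a.get(node, 0) >= count for node, count in vv_b.items())

def reconcile(update_a, update_b):
    """Each update is (value, version_vector). Returns the newer value, or
    both values when neither history contains the other (a true conflict)."""
    value_a, vv_a = update_a
    value_b, vv_b = update_b
    if dominates(vv_a, vv_b):
        return value_a
    if dominates(vv_b, vv_a):
        return value_b
    return (value_a, value_b)   # concurrent updates: expose both for resolution

# Example: replicas X and Y each accepted a write independently.
a = ("cart+book", {"X": 2, "Y": 1})
b = ("cart+pen",  {"X": 1, "Y": 2})
print(reconcile(a, b))          # neither dominates -> conflict, both returned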

5. Tradeoff examples

  • Consistency/Latency Tradeoff is Inherent in Replication: Regardless of the replication strategy, DDBSs face a fundamental tradeoff between consistency and latency. This is especially pronounced for WAN replication due to network communication delays.
  • High-Availability DDBSs Sacrifice Consistency for Latency: Dynamo, Cassandra, PNUTS, and Riak, designed for low-latency web interactions, all prioritize latency over strong consistency in their baseline configurations.
  • Dynamo, Cassandra, and Riak's Replication: These systems use a mix of master-based updates with synchronous replication to a few nodes (leading to potential inconsistent reads) and arbitrary node updates (further increasing consistency risks).
  • PNUTS Prioritizes Latency with Reduced Consistency: PNUTS uses asynchronous replication from a master to replicas over a WAN, serving reads from any replica to achieve low latency at the cost of consistency.
  • Evidence from Cassandra Study: A study showed a significant latency increase (four times or more) when switching from potentially inconsistent "weak reads" to more consistent "quorum reads" in Cassandra.
  • SimpleDB Study and Local Replication: A study on SimpleDB found no significant latency increase for consistent reads, likely due to its use of master-slave replication within a single, geographically close Amazon region. Amazon's own documentation warns of increased latency for consistent reads, especially across regions.
  • Configurable Consistency: All four DDBSs allow users to adjust parameters to favor consistency over latency (e.g., increasing R + W in quorum systems).
  • Consistency/Latency Tradeoff During Normal Operation: The consistency/latency tradeoff exists continuously during normal system operation, even without network partitions, and is amplified over WANs. This suggests runtime latency is a key driver for reduced consistency.
  • PNUTS as Evidence Against CAP as the Primary Driver: PNUTS, configured as CP (chooses consistency over availability during a partition by making the master's data unavailable for updates), still sacrifices consistency in its baseline operation for better latency. This highlights the consistency/latency tradeoff as a more obvious reason for reduced consistency.
  • CAP's Influence on Dynamo, Cassandra, and Riak: These systems, being AP, likely considered network partitions in their design by implementing mechanisms for data reconciliation upon divergence. The flexibility built for handling partitions might then be reused to choose a point in the baseline consistency/latency tradeoff.
  • Conclusion: CAP is only one reason for reduced consistency in modern DDBSs. The consistency/latency tradeoff in replicated systems is a major factor, present continuously, and potentially more influential on baseline operations than CAP, which only applies during (arguably less frequent) network partitions.

6. PACELC

  • PACELC Framework: This framework reinterprets CAP by considering two scenarios:
    • Partition (P): How does the system trade off Availability (A) and Consistency (C)?
    • Else (E): When there's no partition (normal operation), how does the system trade off Latency (L) and Consistency (C)?
  • ELC Applies to Replicated Systems: The Latency/Consistency (ELC) tradeoff primarily applies to systems that replicate data. Non-replicated systems face availability issues (extreme latency) upon failures. The decision to replicate or not can be considered part of the ELC tradeoff.
  • PA/EL Systems (Dynamo, Cassandra, Riak): These systems prioritize Availability over Consistency during a Partition (PA) and prioritize Low Latency over Consistency during normal operation (EL). Their design is simplified by consistently sacrificing Consistency in both scenarios. They offer user-adjustable settings for the ELC tradeoff (e.g., adjusting R + W), but cannot achieve full linearizability even with stricter settings.
  • PC/EC Systems (VoltDB/H-Store, Megastore, BigTable/HBase): These fully ACID systems prioritize Consistency in both scenarios. They are Partition-Tolerant and Consistent (PC), accepting the potential costs in Availability and exhibit Consistency with higher Latency during normal operation (EC).
  • PA/EC System (MongoDB): In normal operation (EC), MongoDB guarantees consistent reads and writes. However, during a partition involving the master node (PA), it prioritizes Availability by electing a new master, leading to potential inconsistencies that require manual reconciliation. While technically not fully Available according to CAP during a minority partition, PACELC classifies it as PA/EC due to the primary tradeoff being between Availability and Consistency during partitions.
  • PC/EL System (PNUTS): In normal operation (EL), PNUTS sacrifices Consistency for Latency. During a partition (PC), it sacrifices Availability for Consistency (by making data unavailable for updates if the master is in a minority partition). The PC designation doesn't mean full consistency during partitions, but rather that consistency isn't reduced further than the baseline level; availability is reduced instead.
  • Complexity of DDBS Tradeoffs: Building DDBSs involves complex tradeoffs that neither CAP nor PACELC can fully explain.
  • Importance of Considering Consistency/Latency: Incorporating the Consistency/Latency tradeoff into architectural discussions for modern DDBS design is crucial.
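
The classifications above can be collected in one place; the assignments below restate the paper's examples, and the dictionary and describe() helper are just a convenient presentation.

PACELC = {
    # system            : (if Partition, Else during normal operation)
    "Dynamo":             ("PA", "EL"),
    "Cassandra":          ("PA", "EL"),
    "Riak":               ("PA", "EL"),
    "VoltDB/H-Store":     ("PC", "EC"),
    "Megastore":          ("PC", "EC"),
    "BigTable/HBase":     ("PC", "EC"),
    "MongoDB":            ("PA", "EC"),
    "PNUTS":              ("PC", "EL"),
}

def describe(system: str) -> str:
    p, e = PACELC[system]
    partition = "availability" if p == "PA" else "consistency"
    normal = "low latency" if e == "EL" else "consistency"
    return (f"{system}: during a partition favours {partition}; "
            f"otherwise favours {normal}.")

print(describe("PNUTS"))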

7. References

  • Original IEEE computer society paper by Daniel Abadi - https://www.cs.umd.edu/~abadi/papers/abadi-pacelc.pdf
