Abstract
- Purpose: Chubby is designed to provide coarse-grained locking and reliable, low-volume storage for loosely-coupled distributed systems.
- Interface: It offers an interface similar to a distributed file system with advisory locks.
- Design Emphasis: Its primary design goals are availability and reliability, rather than high performance.
- Real-World Usage: Many instances of Chubby have been in use for over a year, with some handling tens of thousands of concurrent clients.
1. Introduction
- Chubby's Purpose: A lock service designed for synchronization and sharing basic environmental information within loosely-coupled distributed systems of moderately large numbers of small machines on high-speed networks (e.g., ten thousand 4-processor machines on 1Gbit/s Ethernet).
- Deployment Scope: Primarily intended for single data centers or machine rooms, though at least one instance spans thousands of kilometers.
- Primary Goals: Reliability, availability to a moderately large number of clients, and easy-to-understand semantics. Throughput and storage capacity were secondary concerns.
- Client Interface: Similar to a simple file system with whole-file reads/writes, augmented with advisory locks and notifications for events like file modification.
- Intended Use Cases: Primarily for coarse-grained synchronization, particularly leader election among equivalent servers (e.g., GFS master, Bigtable master discovery). Also used by GFS and Bigtable as a reliable, well-known location for storing small amounts of metadata, acting as the root of their distributed data structures. Some services use locks for coarse-grained work partitioning.
- Benefits Over Existing Methods: Before Chubby, Google's distributed systems handled primary election either with ad hoc methods (when duplicated work was harmless) or with operator intervention (when correctness was essential). In the first case Chubby saves a little computing effort; in the second it removes the need for human intervention on failure, significantly improving availability.
- Consensus Problem: Primary election is recognized as an instance of the distributed consensus problem, requiring a solution that works with asynchronous communication (like Ethernet or the Internet, where packets can be lost, delayed, and reordered).
- Paxos Protocol: Asynchronous consensus is solved by the Paxos protocol, which forms the core of all working protocols encountered so far (including Oki and Liskov's viewstamped replication). Paxos ensures safety without timing assumptions, and clocks are needed for liveness (overcoming the Fischer et al. impossibility result).
- Engineering Effort, Not Research: Building Chubby was an engineering effort to meet specific needs, not a research project with new algorithms or techniques.
- Paper's Focus: The paper describes Chubby's design and implementation, how it evolved based on experience, unexpected uses, and identified mistakes. It omits details of well-established concepts like consensus protocols or RPC systems.
2. Design
2.1 Rationale
Why a Lock Service Over a Paxos Library:
- Easier Adoption for Existing Systems: Developers often don't initially design for high availability with consensus protocols. A lock service allows them to integrate synchronization (like leader election) with minimal changes to existing program structure and communication patterns (e.g., acquiring a lock and passing a lock acquisition count).
- Integrated Name Service Functionality: Many services that elect a leader or partition data also need a way to advertise the result. Chubby's file system-like interface lets them store and fetch small amounts of data in the same place, reducing the need for a separate name service, and its consistent client-side caching gives cleaner semantics than time-based caching such as DNS TTLs.
- Familiar Interface: A lock-based interface is more familiar to programmers, lowering the barrier to adopting a reliable distributed decision-making mechanism, even if they don't fully understand the complexities of distributed locks.
- Reduced Server Dependency for Clients: A client using a lock service can make progress safely even with only one active client process, as the lock service (with its own replicas) acts as a generic electorate. This reduces the number of client-side servers needed for reliability compared to a client-side Paxos library.
Key Design Decisions:
- Lock Service Chosen: Instead of a client-side Paxos library or a separate consensus service.
- Small-File Serving Implemented: To allow elected primaries to advertise themselves and parameters, avoiding the need for a separate name service.
Decisions Based on Expected Use and Environment:
- Scalable Read Access for Advertised Primaries: Allowing thousands of clients to observe the primary's file efficiently without excessive servers.
- Event Notification: Useful for clients and replicas to avoid constant polling.
- Client-Side File Caching: Desirable due to the likelihood of clients polling even if not strictly necessary.
- Consistent Caching: Preferred over non-intuitive caching semantics to avoid developer confusion.
- Security Mechanisms (Access Control): Necessary to prevent financial loss and security breaches.
Decision for Coarse-Grained Locking:
- Expected Use Case: Primarily for coarse-grained locks held for longer durations (hours or days) for tasks like leader election, rather than fine-grained locks for short-duration transaction control.
- Benefits of Coarse-Grained Locks: Lower load on the lock server (acquisition rate less tied to client transaction rate), less impact from temporary lock server unavailability, and the ability to survive lock server failures (important for long-held locks).
- Fine-Grained Locking Left to Clients: Clients can implement their own application-specific fine-grained locks, potentially using Chubby's coarse-grained locks to manage groups of these fine-grained locks, simplifying the client's responsibility for consensus implementation and server provisioning for high load.
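The last point lends itself to a small illustration. The Go sketch below is an assumption-laden sketch (Chubby ships no such library; all names here are hypothetical) of how an application lock server might map fine-grained locks onto groups, each group guarded by one coarse-grained Chubby lock:

```go
// Sketch: layering application-managed fine-grained locks on top of a
// coarse-grained Chubby lock. The grouping scheme and types are
// illustrative assumptions, not part of Chubby's API.
package finelocks

import "hash/fnv"

// groupFor maps a fine-grained lock name to one of nGroups lock groups.
// An application lock server would hold a single coarse-grained Chubby
// lock per group and serve acquire/release requests for that group's
// fine-grained locks out of its own memory.
func groupFor(lockName string, nGroups uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(lockName))
	return h.Sum32() % nGroups
}

// Fine-grained lock state kept by the application lock server. Stamping
// each lock with the generation number of the group's Chubby lock lets
// the server invalidate every fine-grained lock in a group if it ever
// loses (and later reacquires) the coarse-grained lock.
type fineLock struct {
	holder          string
	groupGeneration uint64
}
```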
2.2 System structure
- Three Main Components:
- Server (Replicas): The core Chubby instances.
- Client Library: Linked with client applications, mediating all communication with Chubby servers.
- Optional Proxy Server.
- Chubby Cell: Consists of a small number (typically five) of replicas placed to minimize correlated failures.
- Master Election via Consensus: The replicas use a distributed consensus protocol to elect a master. The master must obtain votes from a majority of the replicas, plus promises (the master lease) that those replicas will not elect a different master for an interval of a few seconds.
- Master Lease Renewal: The master periodically renews its lease by winning a majority vote.
- Database Replication: Replicas maintain a simple database copy, but only the elected master initiates reads and writes to it. Updates from the master are copied to other replicas via the consensus protocol.
- Client Discovery of Master: Clients find the master by querying the replicas listed in DNS; non-master replicas respond by returning the master's identity.
- Client Communication: Once the master is found, clients direct all requests to it until it becomes unresponsive or indicates it's no longer the master.
- Write Request Handling: Write requests are propagated through the consensus protocol to all replicas and acknowledged when a majority of replicas have received the write.
- Read Request Handling: Read requests are served by the master alone, which is safe as long as its lease is valid (preventing other masters).
- Master Failover: If the master fails, the remaining replicas initiate a new election when their master leases expire, typically resulting in a new master within a few seconds (though times can vary).
- Replica Replacement: If a replica fails for an extended period, a replacement system provisions a new machine, starts the Chubby server, and updates the DNS. The current master periodically polls DNS, detects the change, and updates the cell's membership list in the replicated database. The new replica then catches up by retrieving backups and updates from active replicas and can vote in future elections once it processes a pending commit.
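As a rough illustration of the master-location step described above, the Go sketch below assumes a hypothetical askForMaster RPC; only the overall shape (resolve the cell's DNS name, ask any listed replica, follow the returned master address) comes from the paper:

```go
// A minimal sketch of client discovery of the master. The replica RPC and
// its response are hypothetical stand-ins for Chubby's internal protocol.
package discovery

import (
	"errors"
	"net"
)

// askForMaster is a placeholder for the request a client sends to a
// replica; a non-master replica answers with the current master's address.
type askForMaster func(replicaAddr string) (masterAddr string, err error)

// findMaster resolves the cell's DNS name to its replicas and queries
// them in turn until one reports the master's location.
func findMaster(cellDNSName string, ask askForMaster) (string, error) {
	addrs, err := net.LookupHost(cellDNSName)
	if err != nil {
		return "", err
	}
	for _, a := range addrs {
		if master, err := ask(a); err == nil && master != "" {
			return master, nil // client now directs all requests here
		}
	}
	return "", errors.New("no replica reported a master")
}
```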
2.3 Files, directories and handles
- Hierarchical Namespace: Chubby presents a strict tree structure of files and directories, with names separated by slashes (e.g., /ls/foo/wombat/pouch). The /ls prefix is standard, and the second component is the Chubby cell name resolved via DNS (with local being a special name for the nearest cell).
- UNIX-like Structure: Directories contain lists of children, and files contain sequences of bytes, making it familiar to users and allowing the use of existing file system tools via specialized APIs or interfaces like GFS.
- Distribution-Friendly Design Differences from UNIX:
- No file moving between directories.
- No directory modified times.
- Path-independent permissions (access controlled by file permissions, not path).
- No last-access times (for easier metadata caching).
- Nodes: The namespace contains only files and directories, collectively called nodes. Each node has a single name within its cell and no symbolic or hard links.
- Permanent and Ephemeral Nodes: Nodes can be permanent (deleted explicitly) or ephemeral (deleted if no client has them open and, for directories, if they are empty). Ephemeral files serve as temporary files and client liveness indicators.
- Advisory Reader/Writer Locks: Any node can function as an advisory lock.
- Metadata with Access Control Lists (ACLs): Each node has metadata including names of three ACLs (read, write, change ACL). ACLs are inherited from the parent directory by default. ACLs are themselves files in a well-known /acl directory, containing lists of principal names (similar to Plan 9's groups). User authentication is handled by the RPC system. This file-based ACL system is also accessible to other services.
- Monotonically Increasing Change Detection Numbers: Each node has four 64-bit numbers to easily detect changes:
- Instance number (increments on renaming/recreation).
- Content generation number (files only, increments on write).
- Lock generation number (increments on lock state change).
- ACL generation number (increments on ACL name change).
- File Content Checksum: A 64-bit checksum is provided for file content comparison.
- Handles: Clients open nodes to get handles (analogous to UNIX file descriptors) containing:
- Check digits (preventing forgery, so access control is primarily at handle creation).
- Sequence number (master can identify handles from previous masters).
- Mode information (for master to reconstruct state on restarts with old handles).
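The metadata and handle fields listed above can be pictured as plain structs. The Go sketch below is an illustrative reconstruction of those fields, not Chubby's actual storage or wire format:

```go
// Per-node metadata and handle contents as Go structs (assumed layout).
package meta

// NodeMetadata holds the per-node bookkeeping described in section 2.3.
type NodeMetadata struct {
	ReadACL, WriteACL, ChangeACL string // names of ACL files under /acl

	InstanceNumber    uint64 // bumped when a name is deleted and recreated
	ContentGeneration uint64 // files only: bumped on every write
	LockGeneration    uint64 // bumped when the node's lock changes state
	ACLGeneration     uint64 // bumped when the node's ACL names change

	ContentChecksum uint64 // lets clients compare file contents cheaply
}

// Handle is what Open() returns to a client.
type Handle struct {
	CheckDigits uint64 // hard to guess, so full ACL checks happen at Open()
	Sequence    uint64 // lets a master tell its handles from a predecessor's
	OpenMode    uint32 // recorded so a new master can rebuild handle state
}
```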
2.4 Locks and sequencers
- Reader-Writer Locks: Files and directories can act as reader-writer locks. Exclusive (writer) mode allows one holder, shared (reader) mode allows multiple holders.
- Advisory Locks:
- Locks are advisory, meaning they conflict only with other attempts to acquire the same lock. Holding a lock neither obliges the holder to access the associated file nor prevents other clients from doing so.
- Rejection of Mandatory Locks: Mandatory locks (making locked objects inaccessible to non-holders) were rejected because:
- Chubby locks often protect resources managed by other services, requiring extensive modifications to enforce mandatory locking there.
- They didn't want to force application shutdowns for debugging or administrative access to locked files.
- Developers rely on assertions for error checking, diminishing the value of mandatory checks. Buggy processes have other ways to corrupt data even without mandatory locking.
- Write Permission for Lock Acquisition: Acquiring a lock (in either mode) requires write permission to prevent unprivileged readers from blocking writers.
- Challenges of Distributed Locking: Distributed systems with uncertain communication and independent process failures make locking complex (e.g., delayed or reordered requests after a lock holder fails).
- Sequencers for Ordering: Chubby provides "sequencers" – opaque byte-strings describing the lock's state upon acquisition (lock name, mode, generation number). Clients pass sequencers to servers to indicate lock protection. Servers verify the sequencer's validity and mode before acting on the request. This mechanism adds ordering information only to lock-protected interactions.
- Lock-Delay for Non-Sequencer Aware Systems: For systems not using sequencers, Chubby implements a "lock-delay." If a lock becomes free due to holder failure, the server prevents other clients from acquiring it for a specified duration (up to one minute). This imperfectly mitigates issues caused by delayed or reordered requests and client restarts, protecting unmodified systems.
- Normal Lock Release: When a client releases a lock normally, it becomes immediately available.
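The sketch below shows how a resource server might guard requests with sequencers as described above. The Sequencer fields and the validation strategy (compare against the most recent sequencer seen, or call back to Chubby's CheckSequencer()) follow the description; the concrete Go types are assumptions:

```go
// Sketch of server-side sequencer checking.
package seqcheck

import "errors"

type LockMode int

const (
	Shared LockMode = iota
	Exclusive
)

// Sequencer is the opaque state a client obtains when it acquires a lock
// and passes along with lock-protected requests.
type Sequencer struct {
	LockName   string
	Mode       LockMode
	Generation uint64
}

// validate rejects a request whose sequencer names the wrong lock, holds
// too weak a mode, or comes from an older lock generation than the server
// has already observed. A real server could instead (or additionally) ask
// the Chubby cell via CheckSequencer().
func validate(req Sequencer, lockName string, need LockMode, latestSeen uint64) error {
	switch {
	case req.LockName != lockName:
		return errors.New("sequencer is for a different lock")
	case need == Exclusive && req.Mode != Exclusive:
		return errors.New("lock not held in exclusive mode")
	case req.Generation < latestSeen:
		return errors.New("stale sequencer: lock has changed hands")
	}
	return nil
}
```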
2.5 Events
- Event Subscription: Clients can subscribe to various events when creating a handle to a Chubby node.
- Asynchronous Delivery: Events are delivered to the client via an up-call from the Chubby client library.
- Types of Events:
- File contents modified: Used to monitor service locations advertised in files.
- Child node added, removed, or modified: Used for mirroring and monitoring ephemeral files without affecting their reference counts.
- Chubby master failed over: Warns clients of potential lost events, requiring data rescanning.
- Handle/lock invalidation: Indicates a communication problem.
- Lock acquired: Signals when a primary has been elected.
- Conflicting lock request from another client: Theoretically allows for caching data based on Chubby locks.
- Delivery Guarantee: Events are delivered after the corresponding action has occurred. Clients are guaranteed to see the new data (or more recent data) upon subsequent reads after a file modification event.
- Underutilized Events: The "lock acquired" and "conflicting lock request" events are rarely used and could have potentially been omitted. Clients typically need to communicate with a new primary (requiring a file modification event with the address) rather than just knowing an election occurred. The conflicting lock event for cache consistency has not been adopted by users.
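The event list above maps naturally onto a callback-style subscription. The Go sketch below is an assumed shape for such an interface; the event names come from the paper, but the subscription API shown is not Chubby's real client-library signature:

```go
// Sketch of the event up-call interface.
package events

type EventType int

const (
	FileContentsModified EventType = iota
	ChildNodeChanged
	MasterFailedOver
	HandleInvalid
	LockAcquired
	ConflictingLockRequest
)

// Event is delivered asynchronously by the client library, after the
// corresponding action has taken place, so a read issued from the
// callback is guaranteed to see the new (or newer) data.
type Event struct {
	Type EventType
	Node string // node the event refers to
}

// OpenOptions sketches how a client might ask for events at Open() time.
type OpenOptions struct {
	Events  []EventType
	OnEvent func(Event) // up-call from the client library
}
```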
2.6 API
- Handle Management:
- Open(name, options): Opens a named file or directory (name relative to an existing directory handle, with "/" being the root). Takes options for usage mode (read, write+lock, ACL change), event subscriptions, lock-delay, and creation flags (with optional initial content/ACLs). Returns a handle and indicates if a new node was created.
- Close(handle): Closes an open handle, making it unusable. Never fails.
- Poison(handle): Causes outstanding and subsequent operations on the handle to fail without closing it (for canceling calls from other threads).
- Data Access:
- GetContentsAndStat(handle): Atomically reads the entire content and metadata of a file. Encourages small files.
- GetStat(handle): Returns only the metadata of a node.
- ReadDir(handle): Returns the names and metadata of the children of a directory.
- SetContents(handle, content, [generation]): Atomically writes the entire content of a file. Optionally takes a content generation number for compare-and-swap semantics.
- SetACL(handle, acl_names, [generation]): Similar to SetContents, but operates on the ACL names of a node.
- Node Management:
- Delete(handle): Deletes the node if it has no children.
- Locking:
- Acquire(handle, mode): Acquires a lock in exclusive or shared mode, blocking until the lock is available.
- TryAcquire(handle, mode): Attempts to acquire a lock without blocking.
- Release(handle): Releases a held lock.
- GetSequencer(handle): Returns a sequencer for a held lock.
- SetSequencer(handle, sequencer): Associates a sequencer with a handle; subsequent operations fail if the sequencer is invalid.
- CheckSequencer(handle, sequencer): Checks if a given sequencer is valid.
- General Operation Parameter: All calls take an operation parameter for:
- Asynchronous call with a callback.
- Waiting for asynchronous call completion.
- Obtaining extended error and diagnostic information.
- Handle Validity: Handles are associated with a specific instance of a file/directory, not just the name. Operations on a handle to a deleted (even if recreated) node will fail. Access control checks can occur on any call, but always on Open().
- Primary Election Example: Clients attempt to acquire a lock on a designated file. The winner becomes the primary, writes its identity to the file using SetContents(), and clients/replicas read it using GetContentsAndStat() (often triggered by a file modification event). The primary ideally uses GetSequencer() and passes it to servers, which verify it with CheckSequencer(). Lock-delay can be used for non-sequencer-aware services.
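The primary-election pattern above can be made concrete with a short sketch. The Go interface below is a hypothetical binding of the calls listed in this section; only the call names and the overall flow come from the paper:

```go
// Compact sketch of primary election against an assumed Chubby binding.
package election

type Handle interface {
	Acquire(exclusive bool) error                // blocks until the lock is held
	SetContents(data []byte, gen ...uint64) error
	GetContentsAndStat() ([]byte, uint64, error) // content + generation
	GetSequencer() ([]byte, error)
	Close()
}

type Cell interface {
	Open(name string) (Handle, error)
}

// runForPrimary has every candidate try to take the exclusive lock on a
// well-known file; the winner writes its own address into the file and
// hands the returned sequencer to the servers it talks to.
func runForPrimary(cell Cell, lockFile, myAddr string) ([]byte, error) {
	h, err := cell.Open(lockFile)
	if err != nil {
		return nil, err
	}
	if err := h.Acquire(true /* exclusive */); err != nil {
		return nil, err
	}
	// Advertise ourselves; clients and replicas read this file (typically
	// after a file-contents-modified event) to find the primary.
	if err := h.SetContents([]byte(myAddr)); err != nil {
		return nil, err
	}
	// Pass this sequencer with requests to other servers, which verify it
	// with CheckSequencer() or fall back on lock-delay if unmodified.
	return h.GetSequencer()
}
```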
2.7 Caching
- Consistent Write-Through Cache: Clients maintain an in-memory cache of file data and metadata (including absence) for read performance.
- Lease-Based Consistency: Cache consistency is maintained using a lease mechanism and invalidations from the master.
- Master Tracks Cached Data: The master keeps track of what each client is caching.
- Invalidation on Modification: When data changes, the master blocks the modification, sends invalidations to all caching clients (via KeepAlive RPCs), and proceeds only after all clients acknowledge (via next KeepAlive) or their lease expires.
- Single Round of Invalidation: Only one invalidation round is needed as the master treats the node as uncachable until acknowledgements are received.
- Read Performance Priority: This approach prioritizes read performance by allowing reads to proceed without delay.
- Simple Invalidation Protocol: Cached data is invalidated on change, never updated. Update-only protocols were deemed inefficient due to potentially unbounded unnecessary updates.
- Strict Consistency Rationale: Weaker consistency models were rejected for being harder for programmers to use. Complex client-side sequencing (like virtual synchrony) was also deemed inappropriate for the diverse existing environments.
- Handle Caching: Clients also cache open handles to avoid redundant Open() RPCs. This is restricted for ephemeral files and concurrent locking handles to maintain correct semantics.
- Lock Caching: Clients can cache locks (hold them longer than needed) and are notified via an event if another client requests a conflicting lock.
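The invalidate-never-update rule above is simple enough to sketch. The Go code below is an illustrative client-side cache, with assumed types; the only behaviour taken from the paper is that entries (including cached absence) are discarded on invalidation and the acknowledgement rides on the next KeepAlive:

```go
// Sketch of the client cache's invalidation path.
package cache

import "sync"

type entry struct {
	data   []byte
	absent bool // negative caching: "this file does not exist"
}

type clientCache struct {
	mu      sync.Mutex
	entries map[string]entry
	pending []string // invalidations to acknowledge on the next KeepAlive
}

// invalidate is called by the client library when a KeepAlive reply
// carries an invalidation for a node the client may have cached.
func (c *clientCache) invalidate(node string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.entries, node) // discard, never update in place
	c.pending = append(c.pending, node)
}

// ackAndClear returns the invalidations to piggyback on the next
// KeepAlive request and resets the pending list.
func (c *clientCache) ackAndClear() []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	acks := c.pending
	c.pending = nil
	return acks
}
```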
2.8 Sessions and KeepAlives
- Session Definition: A time-bound relationship between a Chubby cell and a client, maintained by periodic KeepAlive handshakes.
- Session Validity: Handles, locks, and cached data remain valid as long as the session is valid (subject to acknowledging invalidations).
- Session Initiation and Termination: Sessions are requested on first contact and end explicitly on client termination or after a minute of inactivity (no open handles/calls).
- Session Lease: Each session has a lease (time interval) during which the master guarantees not to unilaterally terminate it. The end of this interval is the session lease timeout.
- Lease Timeout Advancement: The master extends the lease timeout on session creation, master failover, and in response to KeepAlive RPCs.
- KeepAlive Mechanism: Clients send KeepAlive RPCs. The master typically blocks the reply until the current lease is near expiration, then returns with a new lease timeout (default 12s extension). Clients send a new KeepAlive immediately after receiving a reply, ensuring a near-constant blocked KeepAlive at the master.
- Piggybacking Events and Invalidations: KeepAlive replies also carry events and cache invalidations, forcing clients to acknowledge invalidations to maintain their session. This also ensures all Chubby RPCs flow from client to master.
- Local Lease Timeout: Clients maintain a conservative local estimate of the master's lease timeout, accounting for network latency and potential clock drift.
- Jeopardy and Grace Period: If the local lease timeout expires, the client cannot be sure the master has not terminated its session; the session is said to be in jeopardy, and the client empties and disables its cache. It then waits through a grace period (45s by default) while trying to re-establish contact with the master.
- Session Recovery or Expiration Events: The library informs the application via jeopardy, safe (session recovered), and expired events. This allows applications to quiesce during uncertainty and recover without restarting.
- Handle Invalidation on Session Expiration: If a session expires during an operation on a handle, all subsequent operations on that handle (except Close() and Poison()) fail. Clients can use this to ensure that a network outage loses only a suffix of a sequence of operations, rather than an arbitrary subsequence, which makes it possible to mark a complex change as committed with its final write.
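The session machinery above reduces to a fairly small client-side loop. The Go sketch below is an assumed simplification (the RPC signature and exact timing bookkeeping are not Chubby's); it only illustrates the blocking KeepAlive, the conservative local lease estimate, and the jeopardy / safe / expired transitions, using the default 12s lease and 45s grace period mentioned above:

```go
// Sketch of the client-side session (KeepAlive) loop.
package session

import (
	"errors"
	"time"
)

// keepAlive blocks at the master until the lease is close to expiring,
// then returns the new lease timeout.
type keepAlive func() (newLease time.Duration, err error)

const gracePeriod = 45 * time.Second

func sessionLoop(ka keepAlive, onJeopardy, onSafe, onExpired func()) error {
	localLease := 12 * time.Second // conservative local estimate
	for {
		type result struct {
			lease time.Duration
			err   error
		}
		done := make(chan result, 1)
		go func() {
			l, err := ka() // master holds this RPC until near lease expiry
			done <- result{l, err}
		}()
		select {
		case r := <-done:
			if r.err != nil {
				return r.err
			}
			localLease = r.lease // lease extended; loop and send another
		case <-time.After(localLease):
			onJeopardy() // local lease timed out: empty and disable cache
			select {
			case r := <-done:
				if r.err == nil {
					onSafe() // session recovered within the grace period
					localLease = r.lease
					continue
				}
				onExpired()
				return r.err
			case <-time.After(gracePeriod):
				onExpired() // grace period elapsed: session is gone
				return errors.New("session expired")
			}
		}
	}
}
```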
2.9 Fail-overs
- Master State Loss: When a master fails, it loses its in-memory state about sessions, handles, and locks.
- Session Lease Timer Stops: The authoritative session lease timer on the master stops during failover, effectively extending client leases.
- Client Behavior During Failover:
- Quick Election: Clients might contact the new master before their local lease timers expire.
- Long Election: Clients flush caches and enter the grace period while trying to find the new master.
- Grace Period Importance: The grace period allows session maintenance across master failovers exceeding the normal lease timeout.
- Event Sequence (Figure 2): Illustrates a lengthy failover where the client relies on the grace period to maintain its session, showing lease expirations and KeepAlive exchanges.
- Client Illusion of No Failure: Once a new master is contacted, the client library and master cooperate to present an illusion of continuous operation to the application.
- New Master's Reconstruction: The new master reconstructs a conservative approximation of the previous master's in-memory state by:
- Picking a new client epoch number to reject old packets.
- Initially only responding to master-location requests.
- Building in-memory structures for sessions and locks from the replicated database, extending leases conservatively.
- Allowing KeepAlives but no other session operations initially.
- Emitting a failover event, causing clients to flush caches and warn applications.
- Waiting for all sessions to acknowledge the failover or expire.
- Allowing all operations to proceed.
- Recreating, on first use, in-memory handles that were created before the failover (honoring the sequence numbers embedded in the handles); the one exception is that a handle closed during the new master's epoch is never recreated in that epoch, so a delayed or duplicated network packet cannot resurrect a closed handle.
- Deleting orphaned ephemeral files (without open handles) after a delay (e.g., one minute), requiring clients to refresh ephemeral file handles after failover.
- Failover Code Complexity: The less frequently executed failover code has been a source of many bugs.
2.10 Database Implementation
- Initial Version (Berkeley DB): Used the replicated version of Berkeley DB, a B-tree mapping byte-string keys to values. A custom key comparison function kept sibling nodes adjacent. Its replicated logs aligned well with Chubby's design after master leases were added.
- Reason for Rewrite: While Berkeley DB's B-tree was mature, the replication code was newer and less widely used, posing a perceived higher risk.
- Current Version (Simple Custom Database): A simple database using write-ahead logging and snapshotting (similar to Birrell et al.) was implemented. The database log is still distributed using a consensus protocol.
- Simplification Benefits: The rewrite allowed significant simplification of Chubby as it used few of Berkeley DB's advanced features (e.g., needing atomic operations but not general transactions).
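The write-ahead-log-plus-snapshot design mentioned above is conventional; the Go sketch below shows the basic idea under heavy simplification (all types are illustrative, and the consensus-based replication of each log record is omitted):

```go
// Minimal write-ahead log + snapshot sketch.
package simpledb

import "sync"

type mutation struct {
	key   string
	value []byte
}

type db struct {
	mu    sync.Mutex
	state map[string][]byte
	log   []mutation // stands in for an on-disk write-ahead log
}

func newDB() *db {
	return &db{state: make(map[string][]byte)}
}

// apply records the mutation in the log before touching in-memory state,
// so the state can be rebuilt after a crash by replaying the log.
func (d *db) apply(m mutation) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.log = append(d.log, m) // in reality: made durable and replicated via consensus
	d.state[m.key] = m.value
}

// snapshot copies the current state. This sketch truncates the log as
// soon as the copy is taken; a real system would truncate only after the
// snapshot itself is durable.
func (d *db) snapshot() map[string][]byte {
	d.mu.Lock()
	defer d.mu.Unlock()
	out := make(map[string][]byte, len(d.state))
	for k, v := range d.state {
		out[k] = v
	}
	d.log = nil
	return out
}
```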
2.11 Backup
- Periodic Snapshots: Every few hours, the master of each Chubby cell writes a snapshot of its database to a GFS file server in a separate building.
- Benefits: Provides disaster recovery and a way to initialize new replicas without loading active ones.
- Location Choice: Using a separate building ensures survival against local damage and avoids cyclic dependencies with GFS master election.
2.12 Mirroring
- Fast Replication: Allows mirroring a collection of small files between Chubby cells. The event mechanism ensures near-instantaneous replication (under a second worldwide with good network).
- Unreachable Mirrors: Unreachable mirrors remain unchanged until connectivity is restored, with checksums used to identify updated files.
- Common Use Case: Copying configuration files to globally distributed computing clusters.
- Global Cell: A special "global" cell with geographically distributed replicas mirrors its /ls/global/master subtree to /ls/cell/slave in all other Chubby cells, ensuring high accessibility.
- Mirrored Content: Includes Chubby's own ACLs, service presence announcements for monitoring, pointers to large datasets (like Bigtable cells), and configuration files for other systems.
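The mirroring flow above splits naturally into an event-driven push and a checksum-driven resync. The Go sketch below is an assumption about how that could look; only the use of file-modification events and per-file checksums comes from the paper:

```go
// Sketch of event-driven mirroring with checksum-based catch-up.
package mirror

type file struct {
	name     string
	content  []byte
	checksum uint64 // Chubby exposes a 64-bit content checksum
}

type mirrorCell interface {
	checksums() map[string]uint64 // checksum of every mirrored file
	write(f file) error
}

// push is invoked from a file-contents-modified event on the source
// subtree (e.g. /ls/global/master); failures simply leave a mirror stale
// until resync runs.
func push(mirrors []mirrorCell, f file) {
	for _, m := range mirrors {
		_ = m.write(f) // best effort; unreachable mirrors stay unchanged
	}
}

// resync brings a previously unreachable mirror up to date by copying
// only the files whose checksums differ from the source's.
func resync(m mirrorCell, source []file) {
	have := m.checksums()
	for _, f := range source {
		if have[f.name] != f.checksum {
			_ = m.write(f)
		}
	}
}
```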
3. Mechanisms for Scaling
- Challenge: Chubby must handle a large number of individual client processes (potentially 90,000+ per master).
- Ineffectiveness of Minor Master Optimizations: Because each cell has a single master serving tens of thousands of clients, minor improvements in the master's request processing have little effect (assuming no serious performance bugs); effective scaling comes from reducing communication with the master.
- Scaling Approaches:
- Multiple Chubby Cells: Clients use nearby cells (via DNS) to avoid reliance on remote machines. Typical deployment: one cell per data center of thousands of machines.
- Increased Lease Times: Master can increase lease times (up to ~60s from 12s) under heavy load to reduce KeepAlive frequency (the dominant request type and common overload failure mode).
- Client-Side Caching: Clients cache file data, metadata, absence, and handles to reduce server calls.
- Protocol Conversion Servers: Translate Chubby protocol to less complex ones (like DNS).
- Proxies (Planned): Trusted processes that forward client requests to a Chubby cell. Can significantly reduce master load by handling KeepAlives and read requests (write traffic is low). However, they add an extra RPC and potentially increase temporary unavailability risk. Failover for proxies needs improvement.
- Partitioning (Planned): Partitioning the cell's namespace by directory across multiple master/replica sets. Each partition handles most calls independently, reducing load per partition but not necessarily KeepAlive traffic. Cross-partition communication is needed for some operations (ACLs, directory deletion) but is expected to have modest impact. A combination of proxies and partitioning is the strategy for further scaling. (A sketch of the directory-hash scheme appears after this list.)
- Current Scaling Needs: No immediate need for >5x scaling due to data center limits and hardware improvements increasing both client and server capacity.
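The Go sketch below illustrates directory-based partitioning as described in the list above: a node is assigned to a partition by hashing its enclosing directory, so a directory's children are served together while the directory's own metadata may live on another partition, which is why a few operations (such as directory deletion and ACL checks) still need cross-partition calls. The exact hash choice here is an assumption of this sketch:

```go
// Sketch of namespace partitioning by enclosing directory.
package partition

import (
	"hash/fnv"
	"path"
)

// partitionOf maps a node name such as "/ls/cell/foo/bar" to one of n
// partitions by hashing its parent directory ("/ls/cell/foo").
func partitionOf(node string, n uint32) uint32 {
	dir := path.Dir(node)
	h := fnv.New32a()
	h.Write([]byte(dir))
	return h.Sum32() % n
}
```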
4. Use, Surprises, and Design Errors
4.1 Use and Behaviour
- Many files used for naming, configuration, ACLs, and metadata.
- Significant negative caching (caching file absence).
- High average client count per cached file (~10).
- Few clients hold locks, mostly exclusive (for primary election/partitioning).
- RPC traffic dominated by KeepAlives, few reads/writes/lock acquisitions.
- Most outages are short (<30s) and often caused by network issues, maintenance, software errors, and overload.
- Data loss has been rare (6 times in many cell-years), mainly due to database/operator errors.
- Chubby data fits in RAM, leading to low mean request latency until overload (high active sessions or exceptional read/write patterns).
- Server scaling relies more on reducing communication than optimizing server code paths. Client-side cache performance is critical.
4.2 Java Clients
- Reluctance of Java programmers to use JNI led to the development and maintenance of a protocol-conversion server for Java clients.
4.3 Use as a Name Service
- Most popular use case, surpassing locking.
- Chubby's explicit invalidation caching is superior to DNS's time-based (TTL) caching, especially for high-churn environments with many clients.
- Load spikes caused by the start-up of large jobs were mitigated by grouping related name entries into batches, so that a single lookup returns and caches the mappings for an entire job.
- A simple protocol-conversion server for name lookups was created.
- A Chubby DNS server bridges Chubby names to DNS clients.
4.4 Problems with Fail-over
- Original design of writing sessions to the database on creation caused overload during mass client starts.
- Optimization to write sessions on first modification/lock/ephemeral open led to potential loss of read-only sessions during failover.
- New design avoids writing sessions to disk, recreating them on the new master (like handles), requiring the new master to wait a full lease timeout.
- This new design enables proxies to manage sessions unknown to the master.
4.5 Abusive Clients
- Shared Chubby cells require isolation from misbehaving clients.
- Lack of Quotas: Naive assumption that Chubby wouldn't be used for large data led to problems with a service rewriting a large file on every user action. File size limits were introduced.
- Publish/Subscribe Misuse: Attempts to use the event mechanism for general pub/sub were inefficient due to Chubby's strong guarantees and invalidation-based caching.
4.6 Lessons Learned
- Developers often underestimate failure probabilities and treat Chubby as always available.
- Developers often confuse a service being up with that service being available to their applications (e.g., the global cell is almost always up, yet to a given client it is often less available than the local cell, because partitions from a distant cell are more likely).
- API choices can unintentionally influence how developers handle outages (e.g., crashing on master failover event).
- Proactive review of Chubby usage plans and providing higher-level libraries help mitigate availability risks.
- Coarse-grained locking is often sufficient and avoids unnecessary communication.
- Poor API choices (like tying cancellation to handle invalidation) can have unexpected limitations. A separate Cancel() RPC might be better.
- Combining session lease refresh with event/invalidation delivery via KeepAlives (initially via TCP, then UDP due to TCP backoff issues) has implications for transport protocol choice. A potential solution is a separate TCP-based GetEvent() RPC alongside UDP KeepAlives.
- Lack of aggressive caching (especially for file absence and handle reuse) led to inefficient client behavior. Initial countermeasures (exponential backoff on repeated Open()) were less effective than making repeated Open() calls cheap.
5. Comparison with related work
- Based on Established Ideas: Chubby's design builds upon well-established concepts in distributed systems.
- Cache Design: Inspired by distributed file system caching mechanisms.
- Sessions and Cache Tokens: Similar to concepts in the Echo distributed file system and the V system (reducing lease overhead).
- General-Purpose Lock Service Idea: Appeared earlier in the VMS operating system, though that system assumed a high-speed interconnect permitting low-latency locking.
- File System-like API: Chubby's API, including the namespace, borrows from file system models, recognizing the convenience for more than just files.
- Differences from Distributed File Systems (Echo, AFS):
- Performance and Storage Aspirations: Chubby doesn't aim for high throughput or low latency for uncached data, nor does it expect clients to read/write/store large amounts of data.
- Emphasis on Consistency, Availability, and Reliability: These are easier to achieve with less focus on raw performance.
- Small Database and Replication: Allows for many online copies (replicas and backups) and frequent integrity checks.
- Scalability: The relaxed performance/storage requirements enable serving tens of thousands of clients per master.
- Central Coordination Point: Solves a class of problems for system developers by providing a shared information and coordination hub.
- Comparison with Boxwood's Lock Server: Chosen for comparison due to its recent design for loosely-coupled environments, yet with different design choices.
- Chubby's Integrated Approach: Combines locks, reliable small-file storage, and a session/lease mechanism in a single service.
- Boxwood's Separated Approach: Divides these functionalities into three independent services: a lock service, a Paxos service (for reliable state), and a failure detection service.
- Target Audience Difference: Chubby targets a diverse audience (experts to novices) with a familiar API for a large-scale shared service. Boxwood provides a toolkit for potentially smaller groups of more sophisticated developers working on projects that might share code but aren't necessarily used together.
- Interface Level: Chubby often provides a higher-level interface (e.g., combined lock and file namespaces). Boxwood's Paxos service client might implement caching using the lock service.
- Default Parameter Differences: Boxwood uses more aggressive failure detection (frequent pings) with shorter timeouts, while Chubby has longer default lease and KeepAlive intervals, reflecting different scale and failure expectations.
- Replication Factor: Boxwood subcomponents typically use 2-3 replicas, while Chubby uses 5 per cell, likely due to scale and shared infrastructure uncertainties.
- Chubby's Grace Period: A key difference, allowing clients to survive long master outages without losing sessions/locks, which Boxwood lacks (its "grace period" is equivalent to Chubby's session lease). This reflects different expectations about the cost of lost locks.
- Lock Purpose: Chubby locks are heavier-weight and require sequencers for external resource protection. Boxwood locks are lighter-weight and primarily intended for internal Boxwood use.
6. Summary
- Chubby is a distributed lock service used for coarse-grained synchronization and widely adopted as a name and configuration service within Google.
- Its design combines distributed consensus, consistent client-side caching, timely notifications, and a familiar file system interface.
- Scaling is achieved through caching, protocol conversion, and load adaptation, with plans for proxies and partitioning.
- Chubby is a central component for various Google systems (name service, MapReduce rendezvous, GFS/Bigtable master election, high-availability configuration).
- Operating within a single company minimizes malicious DoS, but mistakes and differing developer expectations pose similar challenges.
- Heavy-handed remedies (like usage reviews) are used to prevent abuse, but predicting future usage is difficult.
- Lack of performance advice in documentation can lead to misuse.
- Lessons learned highlight the importance of robust caching, careful API design, understanding developer behavior, and proactive guidance on using distributed services.