Dynamo - Amazon's Highly Available Key-value Store [sosp2007]

Amazon.com faces significant reliability challenges due to its massive scale and distributed infrastructure, where continuous component failures can lead to financial losses and erode customer trust. To combat this, Amazon developed Dynamo, a highly available key-value storage system. Dynamo prioritizes continuous availability over strict consistency during failures, employing object versioning and application-assisted conflict resolution to ensure an "always-on" customer experience.

1. Introduction

Amazon's global e-commerce platform, serving millions of customers, demands extreme reliability, scalability, and performance, as even minor outages carry significant financial and trust implications. Recognizing that system resilience hinges on effective application state management within its decentralized, service-oriented architecture, Amazon developed Dynamo.
Dynamo is a highly available and scalable distributed data store designed for core Amazon services with stringent reliability needs, offering flexible control over tradeoffs between availability, consistency, cost, and performance. Unlike traditional relational databases, Dynamo provides a simple primary-key interface, proving highly efficient for services like shopping carts and session management.
Dynamo achieves its scalability and availability by combining several techniques: consistent hashing for data partitioning and replication, object versioning for consistency, a quorum-like technique and decentralized synchronization protocol for replica consistency during updates, and a gossip-based protocol for failure detection. This decentralized system allows for seamless node additions and removals without manual intervention.
Successfully deployed in Amazon's core e-commerce services, Dynamo efficiently handled extreme peak loads during holiday seasons without downtime, demonstrating that an eventually-consistent storage system can meet the demands of critical production applications. The paper highlights the effectiveness of combining various techniques to build a highly available system and offers insights into optimizing these techniques for real-world performance requirements.

2. Background

Simple Query Model: Dynamo focuses on basic read/write operations for small, unique key-identified binary objects, accommodating a significant portion of Amazon's services that don't need relational schemas.
Prioritizes Availability over Strong Consistency: Recognizing that ACID (Atomicity, Consistency, Isolation, Durability) guarantees often compromise availability, Dynamo embraces weaker consistency, specifically eventual consistency, to ensure high availability, even during failures. This means all updates will eventually propagate to all replicas.
"Always Writeable": Dynamo prioritizes successful writes. To achieve this, conflict resolution is performed during reads rather than writes. This ensures that customer actions, like adding items to a shopping cart, are never rejected due to system issues.
Application-Assisted Conflict Resolution: While Dynamo offers a simple "last write wins" policy, it primarily empowers applications to resolve conflicts. This allows services, like the shopping cart, to implement domain-specific merge logic for a better customer experience.
Efficiency and Performance: Designed for commodity hardware, Dynamo meets Amazon's stringent latency requirements, measured at the 99.9th percentile of request distribution, rather than averages. This focus ensures a consistent, high-quality experience for nearly all customers. Services can configure Dynamo to balance performance, cost, availability, and durability.
Incremental Scalability: Dynamo allows for the addition of single storage nodes with minimal disruption, supporting continuous growth.
Symmetry and Decentralization: Every Dynamo node has equal responsibilities, and the system favors decentralized peer-to-peer techniques to avoid single points of failure and simplify operations.
Heterogeneity: Dynamo can leverage diverse hardware capabilities, distributing work proportionally to utilize higher-capacity nodes efficiently without requiring a complete infrastructure overhaul.

3. Related Work

3.1 Peer-to-Peer (P2P) Systems

First-generation P2P systems (e.g., Freenet, Gnutella):
- Used unstructured overlays — arbitrary peer connections.
- Relied on flooding to discover shared files.
Structured P2P networks (e.g., Chord, Pastry):
- Employed consistent global protocols.
- Enabled bounded-hop routing to locate data efficiently.
- Some systems introduced O(1) routing to reduce latency (constant-time routing with local info).
Storage Systems on P2P Overlays:
- Oceanstore:
  - Global, transactional, persistent storage.
  - Uses conflict resolution and total-order updates for concurrent writes.
  - Designed for untrusted infrastructure.
- PAST:
  - Built on Pastry for persistent, immutable objects.
  - Applications layer handles mutability semantics.

3.2 Distributed File Systems and Databases

Key differences from P2P systems:
- Support hierarchical namespaces (vs flat in P2P).
Examples of Distributed File Systems:
- Ficus, Coda:
  - Use replication for high availability.
  - Handle update conflicts via system-level conflict resolution.
- Farsite:
  - Fully distributed (no central server).
  - Achieves availability and scalability via replication.
- Google File System (GFS):
  - Central master server for metadata.
  - Data stored in chunks on chunkservers.
Distributed Databases:
- Bayou:
  - Supports disconnected operations, application-level conflict resolution, and eventual consistency.
- All (Bayou, Coda, Ficus) provide eventual consistency and handle network partitions.
Dynamo (Amazon):
- Inspired by above systems.
- Allows read/write during partitions.
- Uses different conflict resolution methods.
Other Distributed Storage Models:
- FAB:
  - Block storage; splits large objects into smaller blocks.
- Key-value stores:
  - Ideal for small objects (<1MB).
  - Easy to configure per application.
- Antiquity:
  - Ensures data integrity using secure logs and Byzantine fault tolerance.
  - Dynamo trades off these guarantees in favor of trusted infrastructure.
Bigtable:
- Stores structured data using a sparse, multi-dimensional sorted map.
- Allows access via multiple attributes.
- Dynamo differs by focusing on simple key/value access and high availability.
Traditional replicated relational databases:
- Prioritize strong consistency.
- Struggle with scalability and availability, especially under network partitions.

3.3 Discussion (Dynamo Design Trade-offs)

Dynamo’s primary goals:
1. Provide an “always writeable” store — no rejected writes.
2. Operate in a trusted environment (single administrative domain).
3. Skip support for hierarchical namespaces or complex schemas.
4. Meet strict latency SLAs (e.g., 99.9% of ops < few hundred ms).
Zero-hop DHT:
- Dynamo avoids multi-hop routing (as in Chord/Pastry).
- Each node stores enough routing info to directly forward requests → reduces latency variability.

4. System Architecture

Beyond data persistence, Dynamo requires robust solutions for load balancing, membership, failure detection/recovery, replica synchronization, overload handling, state transfer, concurrency, job scheduling, request marshalling/routing, monitoring, and configuration management. The paper focuses on core techniques: partitioning, replication, versioning, membership, failure handling, and scaling, with advantages summarized in Table 1.

4.1 System Interface:

Operations: Supports get(key) (retrieves object(s) with context) and put(key, context, object) (stores replicas with context).
Key Handling: Keys and objects are treated as opaque byte arrays; MD5 hashing generates 128-bit identifiers for node assignment.

4.2 Partitioning Algorithm:

Uses consistent hashing on a ring, where each node is responsible for a key range between itself and its predecessor.
Virtual Nodes: Assigns multiple tokens to nodes to address non-uniform data/load distribution and node heterogeneity.
Advantages:
- Even load dispersion during node failures.
- Balanced load acceptance for new nodes.
- Accounts for node capacity differences.

4.3 Replication:

Replicates data on N hosts (configurable) for high availability and durability.
Coordinator node stores keys locally and at N-1 clockwise successors.
Preference List: Lists nodes responsible for a key, ensuring distinct physical nodes despite virtual nodes.

4.4 Data Versioning:

Offers eventual consistency, with asynchronous replica updates.
Uses vector clocks to track causality, allowing multiple versions of an object.
Reconciliation:
- Syntactic: System resolves versions when possible.
- Semantic: Clients reconcile conflicting versions (e.g., merging shopping cart versions).
Vector Clock Truncation: Limits size by removing oldest entries (with timestamps) when exceeding a threshold, potentially impacting reconciliation.

4.5 Execution of get() and put() Operations:

Any node can coordinate, selected via load balancer or partition-aware client library (latter reduces latency).
Uses quorum-like protocol with R (read) and W (write) values, where R + W > N ensures consistency.
Put: Coordinator writes locally, sends to N reachable nodes, succeeds if W-1 respond.
Get: Coordinator retrieves versions from N reachable nodes, returns causally unrelated versions for client reconciliation.

4.6 Handling Failures: Hinted Handoff:

Uses sloppy quorum, accessing the first N healthy nodes for read/write operations.
During node failures, replicas are sent to fallback nodes with hints, stored separately, and delivered upon recovery.
Replicates across multiple data centers to handle data center failures.

4.7 Handling Permanent Failures: Replica Synchronization:

Employs Merkle trees for anti-entropy, detecting inconsistencies with minimal data transfer.
Each node maintains Merkle trees per key range for efficient synchronization.

4.8 Membership and Failure Detection:

4.8.1 Ring Membership: Explicit node addition/removal via admin tools, propagated via gossip-based protocol.
4.8.2 External Discovery: Seed nodes prevent logical partitions, ensuring membership reconciliation.
4.8.3 Failure Detection: Local, decentralized detection based on node responsiveness; nodes retry failed peers periodically.

4.9 Adding/Removing Storage Nodes:

New nodes receive random tokens, taking key ranges from existing nodes.
Key transfers include confirmation to avoid duplicates, ensuring uniform load distribution and fast bootstrapping.

This architecture ensures Dynamo’s high availability, scalability, and durability, tailored for applications like Amazon’s shopping cart, balancing consistency and performance.

5. Implementation

Each storage node in Dynamo has three main software components:

Request Coordination
Membership and Failure Detection
Local Persistence Engine

All components are implemented in Java.

Pluggable Local Persistence

Supports multiple storage engines:
- Berkeley DB (BDB) Transactional Data Store (default)
- BDB Java Edition
- MySQL
- In-memory buffer with persistent backing
Engine is selected based on application’s object size and access patterns:
- BDB: suited for small (tens of KB) objects
- MySQL: better for larger objects

Request Coordination

Built on event-driven messaging (inspired by SEDA architecture)
Uses Java NIO channels
Each client request creates a state machine on the coordinating node
State machine handles:
- Node selection
- Request dispatch
- Response collection
- Retry logic
- Final client response packaging

Read Operation Flow (State Machine)

Send read requests to relevant nodes
Wait for minimum required responses
If insufficient responses: fail request
Reconcile versions (if enabled) using vector clocks
Return latest data
Perform read repair for stale replicas post-response

Write Coordination

Write coordinator is selected from the top N nodes in the preference list
Not always the first node to avoid load imbalance
Optimized by choosing the node that responded fastest to the preceding read
- Improves read-your-writes consistency
- Reduces latency variability
- Enhances 99.9 percentile performance

6. Experiences & Lessons Learned
Dynamo is used by various Amazon services with different configurations, varying in reconciliation logic and quorum characteristics (N, R, W). Key experiences and lessons are summarized below.

Usage Patterns:
- Business Logic Specific Reconciliation: Applications reconcile divergent versions using custom logic (e.g., merging shopping cart versions).
- Timestamp-Based Reconciliation: Uses “last write wins” based on the latest timestamp (e.g., customer session information service).
- High Performance Read Engine: Configured for high read rates with few updates (R=1, W=N), acting as a scalable persistence cache (e.g., product catalog, promotional items).
- Tunable Parameters: Clients adjust N (durability), R (read quorum), and W (write quorum) to balance performance, availability, and consistency. Common configuration: (N=3, R=2, W=2).

6.1 Balancing Performance and Durability:

Performance Goals: Services target 99.9% of read/write requests within 300ms on commodity hardware.
Challenges: Multiple node involvement limits performance to the slowest R/W replica; diurnal request patterns cause latency variations.
Optimization: Write buffering stores objects in memory, reducing 99.9th percentile latency by 5x during peak traffic. Reads check buffer first.
Trade-off: Buffering risks data loss on server crashes; mitigated by one replica performing a durable write without impacting performance.

6.2 Ensuring Uniform Load Distribution:

Load Imbalance: Measured over 24 hours; imbalance ratio (nodes deviating >15% from average load) is higher at low loads (20%) than high loads (10%).

Partitioning Strategies:
- Strategy 1: T Random Tokens per Node: Random token assignment causes uneven ranges, slow bootstrapping (up to a day), complex Merkle tree recalculation, and inefficient archiving.
- Strategy 2: T Random Tokens, Equal Partitions: Divides hash space into Q equal partitions (Q >> N), decouples partitioning/placement, but has poor load balancing.
- Strategy 3: Q/S Tokens, Equal Partitions: Assigns Q/S tokens per node, preserves equal partitions, and achieves best load balancing efficiency. Reduces membership data size by three orders of magnitude, simplifies bootstrapping/recovery, and enables easy archiving.
Evaluation: Strategy 3 outperforms others in load balancing efficiency (ratio of average to maximum node requests), tested with S=30, N=3.

6.3 Divergent Versions: When and How Many?:

Causes: Divergent versions arise from failures (node, data center, network partitions) or concurrent writes.
Impact: Minimizing divergent versions reduces semantic reconciliation load. Over 24 hours, 99.94% of shopping cart requests saw one version, 0.00057% saw 2, 0.00047% saw 3, and 0.00009% saw 4.
Observation: Concurrent writes (often by automated clients) cause more divergence than failures.

6.4 Client-Driven or Server-Driven Coordination:

Server-Driven: Load balancer assigns requests; coordinators (top N nodes for writes) handle versioning, incurring extra network hops.
Client-Driven: Clients download membership state every 10 seconds, directly coordinate requests, reducing latency by ≥30ms (99.9th percentile) and 3-4ms (average).
Advantages: Client-driven eliminates load balancer overhead, leverages uniform key distribution, but depends on fresh membership data.

6.5 Balancing Background vs. Foreground Tasks:

Issue: Background tasks (replica synchronization, data handoff) competed with foreground put/get operations, impacting performance.
Solution: Admission control mechanism allocates resource slices to background tasks, adjusted via feedback from foreground task performance (e.g., 99th percentile read latency).
Monitoring: Tracks disk latencies, lock contention, and queue wait times to limit background task intrusiveness.

6.6 Discussion:

Success: Dynamo achieves 99.9995% request success rate with no data loss over two years.
Flexibility: (N, R, W) tuning simplifies application porting for Amazon’s failure-tolerant services, though new applications require careful conflict resolution design.
Scalability Limitation: Full membership model (gossip-based routing table) suits hundreds of nodes but may not scale to tens of thousands without hierarchical extensions or O(1) DHT approaches.

7. Summary of Dynamo’s Strengths

Highly Available and Scalable
- Used for storing the state of core Amazon.com services.
Robustness
- Handles server failures, data center outages, and network partitions effectively.
Incremental Scalability
- Allows service owners to scale up or down based on real-time load.
Customizable Trade-offs
- Parameters N (replication factor), R (read quorum), and W (write quorum) let users tune for:
  - Performance
  - Durability
  - Consistency (according to their SLA requirements)
Real-World Proven
- Proven effective over a year in Amazon’s demanding production environment.
Demonstrates Viability of Decentralized Techniques
- Shows that decentralized, eventually consistent systems can successfully support highly available applications.

thetinkerer

Thursday, May 29, 2025