Friday, April 4, 2025

Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency (SOSP Paper)

1. Abstract and Introduction

  • Windows Azure Storage (WAS) is a cloud storage system that lets customers store a seemingly limitless amount of data while being charged only for the storage they use.
  • WAS storage comes in the form of Blobs (files), Tables (structured storage), and Queues (message delivery).
  • WAS has been in production since 2008, and this paper was presented at SOSP 2011.
  • One of the use cases for this storage was making Facebook updates and Twitter (now X) posts searchable on Bing within roughly 15 seconds of being posted. The way this happens is explained as follows -
    • Facebook and Twitter send the raw public content to WAS (e.g., user postings, user status updates, etc.) to be made publicly searchable. This content is stored in WAS Blobs.
    • The ingestion engine annotates this data with user auth, spam, and adult scores; content classification; and classification for language and named entities. In addition, the links within the content are crawled and expanded. This causes a high rate of access to WAS Tables.
    • These Blobs are then folded into the Bing search engine to make the content publicly searchable. 
    • The ingestion engine uses Queues to manage the flow of work, the indexing jobs, and the timing of folding the results into the search engine. 
  • Other facts about this use case: the Facebook and Twitter data kept in WAS amounts to around 350TB (before replication). In terms of transactions, the ingestion engine has a peak traffic load of around 40,000 transactions per second and performs between two and three billion transactions per day.
  • It is claimed that WAS provides three properties that the CAP theorem claims are difficult to achieve at the same time: strong consistency, high availability, and partition tolerance.
  • WAS aims to provide a global namespace with exabyte-scale capacity.
  • It also provides redundancy and multi-tenancy, which allows data from many customers to be stored on the same hardware.

2. Global Partitioned Namespace

  • Storage uses DNS to provide a global namespace and scaling capability, dividing the URI into this form - http(s)://AccountName.<service>.core.windows.net/PartitionName/ObjectName, where <service> is blob, table, or queue.
  • AccountName is the customer selected account name for accessing storage and is part of the DNS host name. The AccountName DNS translation is used to locate the primary storage cluster and data center where the data is stored.  
  • PartitionName locates the data once a request reaches the storage cluster. The PartitionName is used to scale out access to the data across storage nodes based on traffic needs.
  • ObjectName identifies individual objects within that partition. The system supports atomic transactions across objects with the same PartitionName value.
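To illustrate the three-part namespace, here is a small hypothetical helper (the function and names are mine, not from the paper) that splits a WAS URI into its AccountName, PartitionName, and ObjectName parts:

```python
from urllib.parse import urlparse

def parse_was_uri(uri):
    """Split a WAS URI into the three parts of the global namespace.

    Illustrative only: in the real system, DNS resolves the AccountName
    and the storage cluster routes on PartitionName/ObjectName.
    """
    parsed = urlparse(uri)
    # AccountName is the first label of the DNS host name,
    # e.g. "myaccount" in myaccount.blob.core.windows.net.
    account = parsed.hostname.split(".")[0]
    # The URI path carries PartitionName and (optionally) ObjectName.
    parts = parsed.path.strip("/").split("/", 1)
    partition = parts[0]
    obj = parts[1] if len(parts) > 1 else None
    return account, partition, obj

print(parse_was_uri("https://myaccount.blob.core.windows.net/photos/vacation.jpg"))
# → ('myaccount', 'photos', 'vacation.jpg')
```

Note that for Queues the ObjectName is absent, since the PartitionName alone identifies the queue.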

3. High Level Architecture

  • WAS is built on top of the Fabric Controller, which provides features such as node management, network management, health monitoring, and service start/stop.
  • The WAS production system consists of the Location Service and Storage Stamps, as shown in the paper's high-level architecture figure.
  • A storage stamp is a cluster of N racks of storage nodes, where each rack is built out as a separate fault domain with redundant networking and power.

  • To provide low-cost cloud storage, the storage provisioned in production must be kept highly utilized; the goal is to keep a storage stamp around 70% utilized in terms of capacity, transactions, and bandwidth.
  • The headroom above 70% is kept to (a) allow disk short-stroking, gaining better seek time and higher throughput by utilizing the outer tracks of the disks, and (b) continue providing storage capacity and availability in the presence of a rack failure within a stamp.

  • Location Service (LS): The Location Service manages all the storage stamps and is responsible for managing the account namespace across all stamps. The LS itself is replicated across two geographic locations to survive an outage in one of them.

  • As the architecture shows, the LS is connected to two stamps. When an application requests a new storage account, it specifies a location affinity (e.g., US East), and the LS allocates a primary stamp in that location.

    • The LS then stores the account metadata information in the chosen storage stamp, which tells the stamp to start taking traffic for the assigned account. 
    • The LS then updates DNS to allow requests to route from the name https://AccountName.service.core.windows.net/ to that storage stamp's virtual IP (VIP, an IP address the storage stamp exposes for external traffic).
  • A storage stamp has three layers within it -
    • Stream layer -  This layer stores the bits on disk and is in charge of distributing and replicating the data across many servers to keep data durable within a storage stamp. The stream layer can be thought of as a distributed file system layer within a stamp.
    • Partition layer -  The partition layer is built for (a) managing and understanding higher level data abstractions (Blob, Table, Queue), (b) providing a scalable object namespace, (c) providing transaction ordering and strong consistency for objects, (d) storing object data on top of the stream layer, and (e) caching object data to reduce disk I/O. This is also supposed to be responsible for scalability by partitioning all data objects.
    • Frontend layer -  The Front-End (FE) layer consists of a set of stateless servers that take incoming requests. Upon receiving a request, an FE looks up the AccountName, authenticates and authorizes the request, then routes the request to a partition server in the partition layer (based on the PartitionName). The system maintains a Partition Map that keeps track of the PartitionName ranges and which partition server is serving which PartitionNames.
  • WAS has two types of replication engines with these features -
    • Intra-Stamp Replication (stream layer) – This system provides synchronous replication and is focused on making sure all the data written into a stamp is kept durable within that stamp. Also this is in critical path of user writes. Success is returned back to the user after this replication completes.
    • Inter-Stamp Replication (partition layer) – This system provides asynchronous replication and is focused on replicating data across stamps. Inter-stamp replication is done in the background and is off the critical path of the customer’s request. This replication is at the object level, where either the whole object is replicated or recent delta changes are replicated for a given account. Inter-stamp replication is used for (a) keeping a copy of an account’s data in two locations for disaster recovery and (b) migrating an account’s data between stamps. 
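The two replication engines can be caricatured in a small sketch (the class and method names below are hypothetical): a write is acknowledged only after every intra-stamp replica has the data, while the inter-stamp copy is merely queued for a background worker:

```python
import queue

class StampWritePath:
    """Toy model of WAS's two replication engines (all names hypothetical).

    Intra-stamp replication is synchronous and on the critical path:
    success is returned only after every replica in the stamp has the data.
    Inter-stamp replication is asynchronous: changes are queued and shipped
    to the secondary stamp in the background, off the critical path.
    """
    def __init__(self, replicas, secondary):
        self.replicas = replicas          # in-stamp replica stores (dicts)
        self.secondary = secondary        # remote stamp store (dict)
        self.geo_queue = queue.Queue()    # pending inter-stamp deltas

    def write(self, key, value):
        # Critical path: commit to all intra-stamp replicas first.
        for replica in self.replicas:
            replica[key] = value
        # Off the critical path: enqueue for geo-replication.
        self.geo_queue.put((key, value))
        return "success"                  # ack the user now

    def drain_geo_queue(self):
        # A background worker would run this continuously.
        while not self.geo_queue.empty():
            key, value = self.geo_queue.get()
            self.secondary[key] = value

replicas = [{}, {}, {}]
secondary = {}
path = StampWritePath(replicas, secondary)
assert path.write("blob1", b"data") == "success"
assert all(r["blob1"] == b"data" for r in replicas)  # durable in-stamp
path.drain_geo_queue()
assert secondary["blob1"] == b"data"                 # eventually geo-replicated
```

The point of the split is visible in the sketch: user-visible latency depends only on the intra-stamp loop, while geo-durability arrives later.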

4. Stream Layer

  • Blocks are the fundamental data unit for reads and writes. They are appended to extents, their size is client-controlled (up to N bytes), and a block is always read in its entirety so its checksum can be validated; checksum validation happens on every read and is also run periodically over all blocks for data integrity.
  • Extents, the unit of replication (defaulting to three replicas), are NTFS files comprising block sequences with a target size of 1GB, used by the partition layer to store objects, either by appending many small ones to the same extent/block or by breaking large ones across multiple extents, with object locations tracked in the partition layer's index.
  • Streams, identified by a hierarchical name, appear as large appendable and randomly readable files to the partition layer, are managed by the Stream Manager as ordered lists of pointers to extents, allowing for fast stream creation by concatenating existing extents, with only the last extent being mutable.
  • The paper shows an example stream with four extents, where E1, E2, and E3 are sealed and E4 can still be appended to.
  • The stream layer's architecture (a Stream Manager coordinating many Extent Nodes) is shown in the paper's figure.
  • The Stream Manager (SM), a Paxos cluster off the critical path, manages the stream namespace, extent-to-stream mapping, and extent allocation across Extent Nodes (ENs), handling tasks like namespace maintenance, EN health monitoring, extent creation/assignment, lazy re-replication, garbage collection of unused extents, and erasure coding scheduling, by periodically polling EN state and ensuring desired replication levels through random placement across fault domains, while remaining unaware of blocks and scaling by tracking stream and extent state within a single stamp.
  • Extent Nodes (ENs) manage storage for assigned extent replicas (comprising data blocks and checksums in files with an offset-to-block index) on their attached disks, are unaware of streams, maintain a cached view of their owned extents and peer replicas from the SM, communicate with other ENs for block write replication and SM-directed replica creation, and reclaim space upon SM notification of garbage collected extents.
  • 4.3.1 Replication Flow: When a stream's first extent is created, the Stream Manager assigns three replicas (one primary, two secondary) on different Extent Nodes across fault domains, with writes always going to the fixed primary for coordination, and this process repeats for new extents when the current one is sealed.
  • 4.3.2 Sealing: The Stream Manager coordinates extent sealing by querying ENs for their current length and choosing the smallest commit length among reachable replicas to ensure no acknowledged data is lost, forcing unreachable replicas to synchronize later, guaranteeing bitwise identical sealed replicas.
  • 4.3.3 Interaction with Partition Layer: The partition layer reads row/blob data at known locations (based on successful append acknowledgements, ensuring consistency) and sequentially iterates metadata/commit log streams on partition load, with a commit length check on the last extent's primary EN to ensure a consistent view even if reading from different sealed replicas.
  • 4.4 Erasure Coding Sealed Extents: To reduce storage costs and increase durability, sealed extents for Blob storage are broken into fragments and encoded with error-correcting codes, reducing storage overhead to 1.3x-1.5x while improving resilience.
  • 4.5 Read Load-Balancing: Read requests for replicated extents include a deadline, allowing clients to retry with a different EN if the deadline cannot be met; for erasure-coded data, reconstruction from other fragments is used if reading a specific fragment is too slow.
  • 4.6 Spindle Anti-Starvation: To ensure fairness and prevent throughput-optimized disks from starving non-sequential I/O, the system uses custom I/O scheduling that limits new I/O when pending I/O exceeds thresholds or when requests are delayed beyond a certain time.
  • 4.7 Durability and Journaling: The stream layer guarantees at least three durable copies of acknowledged writes by using a dedicated journal drive (SSD or disk) on each Extent Node for sequential write logging before queuing to the data disk, significantly reducing append latency and variance by decoupling writes from data disk contention.
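The sealing rule from 4.3.2 — seal at the smallest commit length reported by reachable replicas — can be sketched as follows (the function name and input layout are assumptions, not the paper's API):

```python
def choose_seal_length(replica_lengths):
    """Pick the commit length at which to seal an extent (illustrative).

    The Stream Manager asks all Extent Nodes for their current length and
    seals at the smallest length among reachable replicas. No acknowledged
    byte can be lost this way: an append is only acknowledged once *all*
    replicas have it, so every acknowledged byte lies below the minimum.
    replica_lengths maps EN name -> length, with None for unreachable ENs.
    """
    reachable = [n for n in replica_lengths.values() if n is not None]
    if not reachable:
        raise RuntimeError("no reachable replicas; cannot seal")
    return min(reachable)

# EN3 is ahead because a failed append reached it but was never
# acknowledged to the client; EN2 is unreachable during sealing.
lengths = {"EN1": 1024, "EN2": None, "EN3": 1536}
print(choose_seal_length(lengths))  # → 1024
```

After sealing, the unreachable replica (EN2 here) is forced to synchronize to the chosen length, making all sealed replicas bitwise identical.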

5. Partition Layer

  • Stores object types (Blob, Table, Queue), understands transactions for each, and provides data models, processing logic, a scalable namespace, load balancing, and strong consistency.
  • 5.1 Partition Layer Data Model: Utilizes Object Tables (OTs), massive tables dynamically sharded into non-overlapping RangePartitions based on load and spread across Partition Servers, with specific OTs like Account Table, Blob Table, Entity Table, Message Table, Schema Table, and Partition Map Table, each having a fixed schema and primary keys (often AccountName, PartitionName, ObjectName for data tables).
  • 5.1.1 Supported Data Types and Operations: Supports standard simple data types and two special types (DictionaryType for flexible properties and BlobType for storing large data in a separate stream via pointers), along with standard row operations (insert, update, delete, query/get), batch transactions within a PartitionName, and snapshot isolation for concurrent reads and writes.
  • 5.2 Partition Layer Architecture: Comprises three main components: a Partition Manager (PM), Partition Servers (PS), and a Lock Service.

  • Partition Manager (PM): Tracks and splits Object Tables into RangePartitions, assigns each to a Partition Server, stores assignments in the Partition Map Table, ensures each RangePartition has one active server without overlap, performs load balancing, and uses a Lock Service for leader election among multiple PM instances.
  • Partition Server (PS): Serves requests for assigned RangePartitions, stores partition state persistently in streams, maintains a memory cache for efficiency, uses leases from the Lock Service to guarantee exclusive service of a RangePartition for strong consistency and transaction ordering, and can concurrently serve multiple RangePartitions.
  • Lock Service: A Paxos Lock Service used for PM leader election and for PSs to maintain leases for serving partitions, ensuring no two PSs serve the same RangePartition simultaneously.
  • On partition server failure, the PM reassigns all RangePartitions served by the failed PS to available PSs based on their load and updates the Partition Map Table, allowing the Front-End layer to locate RangePartitions.
  • 5.3 RangePartition Data Structures: A PS serves a RangePartition by maintaining a set of in-memory data structures and a set of persistent data structures in streams.
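The Lock Service's lease guarantee (no two PSs ever serve the same RangePartition at once) can be modeled with a toy lease manager; the API below is hypothetical and only illustrates the invariant:

```python
import time

class LockService:
    """Toy lease manager (hypothetical API) mimicking the Paxos Lock
    Service's role: at most one partition server holds the lease for a
    given RangePartition at any moment."""
    def __init__(self, lease_seconds=10.0):
        self.lease_seconds = lease_seconds
        self.leases = {}  # partition -> (holder, expiry_time)

    def try_acquire(self, partition, server, now=None):
        """Grant (or renew) the lease if it is free, expired, or already
        held by this server; refuse otherwise."""
        now = time.monotonic() if now is None else now
        holder = self.leases.get(partition)
        if holder is None or holder[1] <= now or holder[0] == server:
            self.leases[partition] = (server, now + self.lease_seconds)
            return True
        return False  # another PS still holds a live lease

svc = LockService(lease_seconds=10.0)
assert svc.try_acquire("range[a-m)", "PS1", now=0.0)      # PS1 gets the lease
assert not svc.try_acquire("range[a-m)", "PS2", now=5.0)  # PS2 is blocked
assert svc.try_acquire("range[a-m)", "PS2", now=11.0)     # expired -> failover
```

This is why a crashed PS's partitions can be safely reassigned by the PM: once the old lease expires, no stale server can still be serving the range.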


  • 5.3.1 Persistent Data Structure: A RangePartition uses a Log-Structured Merge-Tree, with each Object Table's RangePartition having its own set of streams (Metadata Stream, Commit Log Stream, Row Data Stream, Blob Data Stream), though extents can be shared due to splitting.
  • Metadata Stream: The root stream for a RangePartition, assigned by the PM, containing information to load the partition (names of other streams, pointers to starting positions) and the status of split/merge operations.
  • Commit Log Stream: Stores recent insert, update, and delete operations applied to the RangePartition since the last checkpoint.
  • Row Data Stream: Stores the checkpoint row data and index for the RangePartition.
  • Blob Data Stream: Used only by the Blob Table to store the blob data bits separately.
  • Each RangePartition in an Object Table has only one data stream, except the Blob Table which has a row data stream (for the blob index) and a separate blob data stream (for the blob data bits).
  • 5.3.2 In-Memory Data Structures: A partition server maintains a Memory Table (in-memory commit log), an Index Cache (checkpoint indexes), a Row Data Cache (checkpoint row data), and Bloom Filters (to efficiently check for row presence in checkpoints).
  • Memory Table: The in-memory version of the commit log, holding recent un-checkpointed updates.
  • Index Cache: Stores the checkpoint indexes of the row data stream.
  • Row Data Cache: A read-only memory cache of checkpoint row data pages.
  • Bloom Filters: Probabilistic data structures for each checkpoint, indicating if a row might be present.
  • 5.4 Data Flow: Write requests are appended to the commit log and placed in the memory table, with success returned to the client; when thresholds are met, the memory table is checkpointed to the row data stream, and the corresponding commit log portion can be removed; Blob data bits are directly added to the commit log (and later concatenated to the Blob data stream during checkpoint); loading a partition involves reading the metadata stream and replaying the commit log to rebuild in-memory state.
  • 5.5 RangePartition Load Balancing: The PM performs Load Balance (reassigning partitions from heavily loaded PSs), Split (dividing overloaded partitions), and Merge (combining lightly loaded adjacent partitions) to manage load and the total number of partitions, aiming to keep the partition count within a range proportional to the number of partition servers.
  • 5.5.1 Load Balance Operation Details: The PM tracks load metrics (transactions/second, pending transactions, throttling, CPU/network usage, latency, data size) for RangePartitions and PSs via heartbeats, and reassigns RangePartitions from overloaded PSs to less loaded ones after an offload checkpoint.
  • 5.5.2 Split Operation: The PM instructs a PS to split a RangePartition based on load or the size of its row/blob data streams; the PS chooses the split key and uses a "MultiModify" stream operation to create new sets of streams for the two new partitions, appends new key ranges to their metadata, starts serving, and notifies the PM for Partition Map Table update.
  • 5.5.3 Merge Operation: The PM moves two lightly loaded, adjacent RangePartitions to the same PS and instructs it to merge them into a new RangePartition; the PS checkpoints both, uses "MultiModify" to create a new commit log and data streams by concatenating the extents of the merged partitions, constructs a new metadata stream with the combined key range and pointers, and starts serving the new partition, with the PM updating the Partition Map Table.
  • 5.6 Partition Layer Inter-Stamp Replication: Accounts have a primary and potentially secondary stamps (assigned by the Location Service) for inter-stamp replication (e.g., geo-replication for disaster recovery or migration); writes to the primary stamp are replicated within the stamp, and then asynchronously replicated to the secondary stamp by the partition layer, with DNS pointing to the primary stamp's VIP, and failover involving the Location Service promoting a secondary to primary and switching DNS.
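The 5.4 data flow is essentially a log-structured merge scheme. Here is a minimal sketch of the write, checkpoint, and read paths (all names and thresholds are made up for illustration):

```python
class RangePartition:
    """Minimal LSM-style sketch of the partition-layer data flow.

    Writes are appended to the commit log and applied to the memory
    table; when the memory table grows past a threshold it is written
    out as a checkpoint and the covered commit log portion is removed.
    Reads check the memory table first, then checkpoints, newest first.
    """
    def __init__(self, checkpoint_threshold=2):
        self.commit_log = []      # stands in for the commit log stream
        self.memory_table = {}
        self.checkpoints = []     # stands in for the row data stream
        self.checkpoint_threshold = checkpoint_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))   # durable append first
        self.memory_table[key] = value         # then visible in memory
        if len(self.memory_table) >= self.checkpoint_threshold:
            self._checkpoint()
        return "success"

    def _checkpoint(self):
        self.checkpoints.append(dict(self.memory_table))
        self.memory_table = {}
        self.commit_log = []      # checkpointed updates can be removed

    def read(self, key):
        if key in self.memory_table:
            return self.memory_table[key]
        for cp in reversed(self.checkpoints):  # newest checkpoint wins
            if key in cp:
                return cp[key]
        return None

p = RangePartition(checkpoint_threshold=2)
p.write("row1", "v1")
p.write("row2", "v2")        # triggers a checkpoint
assert p.memory_table == {} and len(p.checkpoints) == 1
p.write("row1", "v1b")       # newer value lives in the memory table
assert p.read("row1") == "v1b" and p.read("row2") == "v2"
```

Loading a partition after failover is the inverse of the write path: replay the commit log over the last checkpoints to rebuild the memory table.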

6. Application Throughput

  • Tests on live production stamps show near-linear scaling of WAS Table operation throughput (entities/second) as the number of VMs grows, for random gets/puts and batch inserts. Blob throughput (MB/s) scales linearly up to eight VMs, after which network capacity becomes the limit. Batch Table puts offer significantly higher throughput than single-entity puts because they reduce network roundtrips and stream writes, and Table reads show slightly lower throughput because the random access pattern minimizes caching benefits.

7. Workload Profiles

  • Various internal Microsoft services demonstrate diverse usage patterns of WAS: XBox GameSaves uses Blob and Table storage, XBox Telemetry uses Blobs for data and Tables for metadata with Queues for coordination, and the Zune backend uses Blobs for media. Across these services, Blobs consume the most capacity, Tables handle the most requests, and Queues are used mainly for communication with a minimal storage footprint.

8. Design Choices and Lessons Learned

  •  Separating computation from storage enables independent scaling and load balancing for each, providing isolation in the multi-tenant environment, and necessitates advancements in data center networking for efficient high-bandwidth access.
  • Choosing range-based partitioning over hashing for Object Tables facilitates performance isolation and efficient enumeration by keeping an account's objects together, allowing for throttling and isolation of abusive accounts, despite posing challenges for scaling sequential access patterns (which can be mitigated by hashing the PartitionName).
  • Implementing a Sample-Hold algorithm for tracking request rates of top AccountNames and PartitionNames allows for selective throttling of overloaded accounts while protecting well-behaved ones, especially when load balancing cannot address access pattern issues.
  • Automatic load balancing of partitions is crucial for maintaining high availability and handling traffic spikes, requiring adaptive profile information, effective metrics (now including request, CPU, and network loads), and split triggers (based on throttling, timeouts, size) with a swappable algorithm and scripting support for customization.
  • Using separate log streams per RangePartition provides performance isolation by limiting load time to recent updates of that specific partition.
  • Implementing journaling (a single sequential log file per EN) significantly reduced latency and provided consistent performance for appends by decoupling them from data disk contention.
  • Adopting an append-only system and sealing extents upon failure simplified replication and failure handling, ensured consistency via commit lengths, enabled low-cost snapshots/versioning and efficient erasure coding, and greatly benefited diagnostics and recovery.
  • Employing end-to-end checksums for user data at each layer prevents data corruption and helps identify faulty hardware.
  • Spreading servers across different fault and upgrade domains and using rolling upgrades with PM/SM coordination ensures high availability during planned maintenance.
  • Supporting multiple data abstractions (Blobs, Tables, Queues) from a single storage stack reduces costs by running all services on the same hardware and leveraging shared improvements.
  • Using system-defined Object Tables simplifies management and upgrades by focusing on a small set of internal schemas, isolating them from end-user data abstractions.
  •  Limiting storage per account to 100TB (initially due to stamp size) led large customers to use multiple accounts, a reasonable tradeoff for those already partitioning data across regions, but a limitation planned for future increase.
  • WAS achieves high availability with strong consistency within a storage stamp (seemingly defying CAP theorem) through layering (append-only stream layer for availability, partition layer for consistency) and designing around a specific fault model (node and TOR switch failures).
  • Utilizing a high-performance debug logging infrastructure with distributed search capabilities, optimized for performance via tokenization and compression, is critical for diagnosing production issues without requiring special deployments or reproduction.
  • Employing Pressure Point testing, a programmable interface for triggering various operations and injecting faults, has been instrumental in finding and reproducing complex, rare issues.
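The Sample-Hold idea mentioned in the throttling bullet can be sketched as follows. The helper name and parameters are made up, and this conveys only the spirit of the algorithm (sample new keys with low probability, then count every hit for sampled keys), not WAS's actual implementation:

```python
import random

def sample_and_hold(requests, sample_prob=0.01, seed=42):
    """Heavy-hitter sketch in the spirit of Sample-Hold.

    Each request is sampled with probability sample_prob; once a key is
    sampled it gets a counter, and every later request for that key is
    counted ("held"). Busy keys are almost certain to be sampled early,
    so their counts are close to exact, while rare keys usually never
    acquire a counter - keeping memory small.
    """
    rng = random.Random(seed)
    counters = {}
    for key in requests:
        if key in counters:
            counters[key] += 1          # "hold": count every hit
        elif rng.random() < sample_prob:
            counters[key] = 1           # "sample": start tracking
    return counters

# One abusive account dominates the request stream.
stream = ["hot-account"] * 10_000 + ["acct-%d" % i for i in range(500)]
counts = sample_and_hold(stream, sample_prob=0.01)
assert counts.get("hot-account", 0) > 5_000  # hot key tracked nearly exactly
```

This is how the system can identify the top AccountNames/PartitionNames to throttle without keeping a counter per key.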

9. Related Work

  • Existing research highlights the difficulty of achieving strong consistency and high availability over unreliable networks, and some systems sacrifice consistency; WAS aims for strong consistency and high availability with partition tolerance. It draws inspiration from systems like Chain Replication, but implements asynchronous geo-replication (unlike systems with synchronous approaches) for better write latency. It shares similarities with GFS and BigTable, with key differences: guaranteed bitwise consistency across replicas, journaling instead of parallel log writes, integrated scalable Blob storage and batch Table transactions, and automatic partition management.

10. Conclusions

  • Windows Azure Storage provides essential cloud services with strong consistency, a global namespace, and disaster recovery, efficiently running diverse customer workloads on shared hardware, offering Blob, Table, and Queue abstractions that empower developers for various storage and workflow needs, as exemplified by the rapid development of the Facebook/Twitter search ingestion engine.

Reference:

  1. Brad Calder et al., "Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency," SOSP 2011 - https://sigops.org/sosp/sosp11/current/2011-Cascais/printable/11-calder.pdf
  2. SOSP 2011 talk by Brad Calder - https://www.youtube.com/watch?v=QnYdbQO0yj4&ab_channel=sosp2011
