Sunday, November 23, 2025

Horizontal scaling of Chroma DB


  • Chroma DB has become a critical component in our ads infrastructure: it serves ads to end users from our indirect ads source.
  • Our Ads Search service runs cosine similarity against the indirect ads source to find the most closely related ads.
  • Growth in user volume has pushed us to explore options for scaling the Chroma DB infrastructure.
  • Our Chroma DB holds ads from third-party advertisers that are published once per day, so it is a mostly read-heavy system.

Challenges in scaling -

  1. In the v2 APIs the collection id is generated as a UUID on each Chroma instance, which defeats the idea of simply putting a load balancer in front of a group of Chroma nodes.
  2. We also need high availability for the Chroma infrastructure: it sits on a critical path, and any downtime means lost revenue.
  3. We currently run east / west Chroma instances serving the east / west Search service traffic in the US. We also want to improve resiliency by giving the Search services cross-region fallbacks.

Infrastructure


  • Our infra in a particular region looks like the picture above.
  • The indirect ads pipeline ingests ad data and filters out undesired traffic using various classifiers. In its final stages it embeds the ad data and publishes it to one of our Chroma write replicas.
  • Once publishing is complete, the written Chroma replica compresses the SQLite and other metadata files and uploads them to object storage.
  • The pipeline then kicks off download and restore on each read replica, currently sequentially. We plan to move to a batched approach where replication runs in parallel across the east / west clusters; a rough sketch of that is below.
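  • A minimal sketch of the planned batched restore, assuming hypothetical download_snapshot / restore_replica helpers and a plain list of read-replica hosts (not our actual tooling):

    from concurrent.futures import ThreadPoolExecutor, as_completed

    def download_snapshot(host: str, snapshot_uri: str) -> None:
        # Placeholder: pull the compressed SQLite + metadata bundle from object storage onto `host`.
        print(f"downloading {snapshot_uri} to {host}")

    def restore_replica(host: str) -> None:
        # Placeholder: unpack the bundle and reload the Chroma collection on `host`.
        print(f"restoring {host}")

    def restore_in_batches(hosts: list[str], snapshot_uri: str, batch_size: int = 4) -> None:
        # Restore read replicas in parallel batches instead of strictly one by one.
        for start in range(0, len(hosts), batch_size):
            batch = hosts[start:start + batch_size]
            with ThreadPoolExecutor(max_workers=len(batch)) as pool:
                futures = {pool.submit(download_snapshot, h, snapshot_uri): h for h in batch}
                for fut in as_completed(futures):
                    fut.result()  # surface any download failure for that host
            for host in batch:
                restore_replica(host)

    restore_in_batches(["east-read-1", "east-read-2", "west-read-1", "west-read-2"],
                       "object-storage://ads-snapshots/latest.tar.gz", batch_size=2)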

Accomplishments

  • We achieved the goals we set at the beginning.
  • Along the way we learned that multi-region replication backed by storage in a single geography becomes the bottleneck (the slowest part of the infra) when doing multi-gigabyte transfers. We plan to create separate storage accounts per region.
  • Average latency for the Search API, the actual consumer of the ads data, dropped. This gives us confidence that we can scale our services to close to 5M requests per day, which is what the team hopes to achieve.






 

Friday, June 13, 2025

Design a Load Balancer


  • General system design usually treats a load balancer as a given component that distributes traffic horizontally across multiple backend nodes. It is atypical, but there are teams (typically inside cloud companies) that actually build load balancers, and they may ask this gotcha-style system design question.

Requirements

  • Some clarifying questions -
    • Candidate: Is it L4/ L7 load balancer?
    • Interviewer: Great question, let's keep both the options. Can you also explain the difference between them?
      • Candidate: Sure. An L4 load balancer balances traffic at the transport layer (Layer 4 of the OSI model, i.e., TCP/UDP), whereas an L7 load balancer balances application-level traffic at Layer 7. Another difference is that an L4 load balancer can handle more traffic than an L7 one, since L7 does deeper inspection to route traffic and is therefore slower.
      • Candidate: What traffic can I assume our LB will be handling?
      • Interviewer: You can expect the traffic to be millions of requests per second.
      • Candidate: OK, so there have to be multiple instances to handle that much traffic, and I estimate we will need multiple Gbps of network bandwidth for our L4 load balancer. In our design we should favor keeping the load balancer highly available over keeping it strictly consistent across instances after changes to the traffic configuration.
      • Candidate: What other features should our load balancer provide?
      • Interviewer: You tell me - what features do modern load balancers provide?
      • Candidate: I believe modern load balancers offer a variety of features -
        • Load balancing
        • TLS termination
        • Authentication gateway
        • Proxy / Reverse proxy
        • Rate limiting
        • IP Whitelisting / ACLs
      • Interviewer: What sort of load balancing algorithms does your load balancer provide?
      • Candidate: Load balancing algorithms -
    Algorithm | Layer | Description | When to Use
    Round Robin | L4/L7 | Evenly distributes across all backends | Simple, stateless backends
    Least Connections | L4/L7 | Chooses backend with fewest active connections | Long-lived connections (e.g., WebSocket)
    Weighted Round Robin | L4/L7 | Like RR, but favors higher-weight servers | Heterogeneous backend capacity
    Random | L4/L7 | Random selection | When backends are stateless and symmetrical
    Hash-Based | L4/L7 | Hashes client IP, session ID, etc. | Sticky sessions
    Consistent Hashing | L7 | Used in sharded systems | Distributed caches or sharded DBs
    Least Latency / Performance-based | L7 | Route based on observed response times | Adaptive load balancing
    IP Hash | L4 | Route based on source IP hash | Basic stickiness without cookies
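    For illustration, a minimal sketch of two of these policies (round robin and least connections) over an in-memory backend list; the class and method names are just for this sketch:

    import itertools
    from collections import defaultdict

    class RoundRobin:
        def __init__(self, backends):
            self._cycle = itertools.cycle(backends)

        def pick(self):
            # Hand out backends in a fixed rotation; fine for stateless, symmetric pools.
            return next(self._cycle)

    class LeastConnections:
        def __init__(self, backends):
            self.active = defaultdict(int, {b: 0 for b in backends})

        def pick(self):
            # Choose the backend with the fewest in-flight connections (long-lived traffic).
            backend = min(self.active, key=self.active.get)
            self.active[backend] += 1
            return backend

        def release(self, backend):
            self.active[backend] -= 1

    rr = RoundRobin(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
    print([rr.pick() for _ in range(4)])  # cycles back to 10.0.0.1 on the fourth pick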


      • Interviewer: What sort of routing algorithms does your load balancer provide?
      • Candidate: Routing algorithms -
    Algorithm / Method | Layer | Description | Example
    Path-based Routing | L7 | Route based on URL path | /api/v1/* → service-A
    Host-based Routing | L7 | Route based on Host header | api.example.com → API LB
    Header-based Routing | L7 | Inspect request headers | X-Tenant-ID → tenant-specific svc
    Cookie-based Routing | L7 | Sticky routing or A/B testing | cookie=variant-B → version-B backend
    Canary Routing | L7 | Send % of traffic to new version | 90% → v1, 10% → v2
    Geo-based Routing | L7 (sometimes L3/L4) | Route based on client location | Client from EU → EU datacenter
    User/Session Routing | L7 | Route based on JWT token/user ID | userID % 10 → shard-2
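    Similarly, a rough sketch of L7 path- and host-based rule matching; the rule table shape is an assumption for this sketch:

    import fnmatch

    RULES = [  # first match wins
        {"host": "api.example.com", "path": "/api/v1/*", "pool": "service-A"},
        {"host": "api.example.com", "path": "*", "pool": "api-default"},
        {"host": "*", "path": "*", "pool": "web-default"},
    ]

    def route(host: str, path: str) -> str:
        # Walk the rules in order and return the backend pool of the first match.
        for rule in RULES:
            if fnmatch.fnmatch(host, rule["host"]) and fnmatch.fnmatch(path, rule["path"]):
                return rule["pool"]
        raise LookupError("no matching rule")

    print(route("api.example.com", "/api/v1/users"))  # service-A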


    • Since, the problem is very vague we can continue asking about -
      • Resilience & Fault Tolerance
      • Observability
      • Scalability
      • Configuration Management
      • Zero downtime deployments
      • CDN Integration
      • Rate aware routing

    Estimation

    • We need to store the load balancer's routing, load-balancing, and resource configuration in our storage.
      • We can expect configuration files in JSON to be about 250 KB on average.
      • If we support roughly 1 million such configurations, we need about 250 GB of storage (1,000,000 × 250 KB).
      • We also want to replicate this across multiple datacenter (DC) regions.

    Design the service

    Configurations: Allow service owners to set the Load balancer configuration

    API

    POST, PUT, GET /api/v1/user/{userid}/loadbalancer/{loadbalancerid}

    RequestBody
    {
      "loadbalancerid": "{loadbalancerid}",
      "type": "l4 | l7",
      "resources": [
        "dns-name / containerid / ip"
      ],
      "lb-algorithm": "ecmp | geo-based",
      "health-check": "heartbeat | latency"
    }
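    A hedged example of calling the configuration API sketched above; the host name and field values are the ones assumed in this design, not a real service:

    import requests

    payload = {
        "loadbalancerid": "lb-12345",
        "type": "l7",
        "resources": ["app-1.internal.example.com", "app-2.internal.example.com"],
        "lb-algorithm": "geo-based",
        "health-check": "heartbeat",
    }

    # Create or update the load balancer configuration for one owner.
    resp = requests.put(
        "https://lb-control.example.com/api/v1/user/u-42/loadbalancer/lb-12345",
        json=payload,
        timeout=5,
    )
    resp.raise_for_status()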


    Data Model

    1. Load balancer configuration
    {
      "lb_id": "lb-12345",
      "name": "prod-app-lb",
      "type": "L7",  // or L4
      "algorithm": "round_robin", // or least_conn, hash, etc.
      "listener_port": 443,
      "protocol": "HTTPS",
      "ssl_cert_id": "cert-abc",
      "backend_pool_ids": ["pool-1", "pool-2"],
      "health_check_id": "hc-001",
      "created_at": "2025-06-09T09:00:00Z"
    }

    2. Backend pools
    {
      "pool_id": "pool-1",
      "targets": [
        {"ip": "10.0.0.12", "port": 8080, "zone": "us-west1-a"},
        {"ip": "10.0.0.15", "port": 8080, "zone": "us-west1-b"}
      ],
      "max_connections": 1000,
      "health_check_id": "hc-001"
    }

    3. Health checks
    {
      "hc_id": "hc-001",
      "protocol": "HTTP",
      "path": "/health",
      "interval_secs": 5,
      "timeout_secs": 2,
      "unhealthy_threshold": 3,
      "healthy_threshold": 2
    }

    4. Routing rules (for L7)
    {
      "rule_id": "rule-77",
      "path_pattern": "/api/v1/*",
      "methods": ["GET", "POST"],
      "forward_to_pool": "pool-1"
    }

    5. Observability 
    {
      "log_id": "req-8817",
      "lb_id": "lb-12345",
      "timestamp": "2025-06-09T09:20:00Z",
      "source_ip": "192.168.1.77",
      "target_ip": "10.0.0.15",
      "status_code": 200,
      "latency_ms": 123,
      "region": "us-west1"
    }
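    Tying back to the health-check model (3) above, a minimal sketch of how a probe loop might apply the healthy/unhealthy thresholds; the probe transport and config shape are assumptions:

    import time
    import urllib.request

    HC = {"path": "/health", "interval_secs": 5, "timeout_secs": 2,
          "unhealthy_threshold": 3, "healthy_threshold": 2}

    def probe(target: str) -> bool:
        # One HTTP probe against the target's health path.
        try:
            with urllib.request.urlopen(f"http://{target}{HC['path']}", timeout=HC["timeout_secs"]) as resp:
                return 200 <= resp.status < 300
        except OSError:
            return False

    def monitor(target: str) -> None:
        healthy, ok_streak, fail_streak = True, 0, 0
        while True:
            if probe(target):
                ok_streak, fail_streak = ok_streak + 1, 0
                if not healthy and ok_streak >= HC["healthy_threshold"]:
                    healthy = True   # re-admit the target into its backend pool
            else:
                fail_streak, ok_streak = fail_streak + 1, 0
                if healthy and fail_streak >= HC["unhealthy_threshold"]:
                    healthy = False  # eject the target from its backend pool
            time.sleep(HC["interval_secs"])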

    Architecture 


    Scaling

    • Horizontally scalable
      • To achieve horizontal scalability we need to replicate the configuration across all regional datacenters, or at minimum to a datacenter in each availability zone.
      • The most feasible replicated databases for these key-value configs are etcd or Consul.
      • Feature comparison
          Feature | etcd | Consul
          Persistence | BoltDB on disk | Persistent storage on disk
          Consensus Algo | Raft | Raft
          Usage | Kubernetes, config storage | Service mesh, discovery, config
          Strong Consistency | Yes | Yes (Raft)
          APIs | gRPC/HTTP | HTTP RESTful
          Built-in UI | No | Yes
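      • As a rough sketch of how the control plane might store a config in etcd (using the python-etcd3 client's basic put/get; the key layout is an assumption):

    import json
    import etcd3  # python-etcd3 client; assumes a reachable etcd cluster

    client = etcd3.client(host="etcd.internal.example.com", port=2379)

    config = {"lb_id": "lb-12345", "algorithm": "round_robin", "listener_port": 443}

    # Write the config under a per-LB key; etcd replicates it to peers via Raft.
    client.put("/lb/configs/lb-12345", json.dumps(config))

    # Data-plane nodes read (or watch) the same key to pick up configuration changes.
    value, _meta = client.get("/lb/configs/lb-12345")
    print(json.loads(value))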



    Thursday, May 29, 2025

    Dynamo - Amazon's Highly Available Key-value Store [sosp2007]


    • Amazon.com faces significant reliability challenges due to its massive scale and distributed infrastructure, where continuous component failures can lead to financial losses and erode customer trust. To combat this, Amazon developed Dynamo, a highly available key-value storage system. Dynamo prioritizes continuous availability over strict consistency during failures, employing object versioning and application-assisted conflict resolution to ensure an "always-on" customer experience.

    1. Introduction

    • Amazon's global e-commerce platform, serving millions of customers, demands extreme reliability, scalability, and performance, as even minor outages carry significant financial and trust implications. Recognizing that system resilience hinges on effective application state management within its decentralized, service-oriented architecture, Amazon developed Dynamo.
    • Dynamo is a highly available and scalable distributed data store designed for core Amazon services with stringent reliability needs, offering flexible control over tradeoffs between availability, consistency, cost, and performance. Unlike traditional relational databases, Dynamo provides a simple primary-key interface, proving highly efficient for services like shopping carts and session management.
    • Dynamo achieves its scalability and availability by combining several techniques: consistent hashing for data partitioning and replication, object versioning for consistency, a quorum-like technique and decentralized synchronization protocol for replica consistency during updates, and a gossip-based protocol for failure detection. This decentralized system allows for seamless node additions and removals without manual intervention.
    • Successfully deployed in Amazon's core e-commerce services, Dynamo efficiently handled extreme peak loads during holiday seasons without downtime, demonstrating that an eventually-consistent storage system can meet the demands of critical production applications. The paper highlights the effectiveness of combining various techniques to build a highly available system and offers insights into optimizing these techniques for real-world performance requirements.

    2. Background



    • Simple Query Model: Dynamo focuses on basic read/write operations for small, unique key-identified binary objects, accommodating a significant portion of Amazon's services that don't need relational schemas.
    • Prioritizes Availability over Strong Consistency: Recognizing that ACID (Atomicity, Consistency, Isolation, Durability) guarantees often compromise availability, Dynamo embraces weaker consistency, specifically eventual consistency, to ensure high availability, even during failures. This means all updates will eventually propagate to all replicas.
    • "Always Writeable": Dynamo prioritizes successful writes. To achieve this, conflict resolution is performed during reads rather than writes. This ensures that customer actions, like adding items to a shopping cart, are never rejected due to system issues.
    • Application-Assisted Conflict Resolution: While Dynamo offers a simple "last write wins" policy, it primarily empowers applications to resolve conflicts. This allows services, like the shopping cart, to implement domain-specific merge logic for a better customer experience.
    • Efficiency and Performance: Designed for commodity hardware, Dynamo meets Amazon's stringent latency requirements, measured at the 99.9th percentile of request distribution, rather than averages. This focus ensures a consistent, high-quality experience for nearly all customers. Services can configure Dynamo to balance performance, cost, availability, and durability.
    • Incremental Scalability: Dynamo allows for the addition of single storage nodes with minimal disruption, supporting continuous growth.
    • Symmetry and Decentralization: Every Dynamo node has equal responsibilities, and the system favors decentralized peer-to-peer techniques to avoid single points of failure and simplify operations.
    • Heterogeneity: Dynamo can leverage diverse hardware capabilities, distributing work proportionally to utilize higher-capacity nodes efficiently without requiring a complete infrastructure overhaul.

    3. Related Work




    3.1 Peer-to-Peer (P2P) Systems

    • First-generation P2P systems (e.g., Freenet, Gnutella):

      • Used unstructured overlays — arbitrary peer connections.
      • Relied on flooding to discover shared files.
    • Structured P2P networks (e.g., Chord, Pastry):

      • Employed consistent global protocols.
      • Enabled bounded-hop routing to locate data efficiently.
      • Some systems introduced O(1) routing to reduce latency (constant-time routing with local info).
    • Storage Systems on P2P Overlays:

      • Oceanstore:

        • Global, transactional, persistent storage.
        • Uses conflict resolution and total-order updates for concurrent writes.
        • Designed for untrusted infrastructure.
      • PAST:

        • Built on Pastry for persistent, immutable objects.
        • Applications layer handles mutability semantics.

    3.2 Distributed File Systems and Databases

    • Key differences from P2P systems:

      • Support hierarchical namespaces (vs flat in P2P).
    • Examples of Distributed File Systems:

      • Ficus, Coda:

        • Use replication for high availability.
        • Handle update conflicts via system-level conflict resolution.
      • Farsite:

        • Fully distributed (no central server).
        • Achieves availability and scalability via replication.
      • Google File System (GFS):

        • Central master server for metadata.
        • Data stored in chunks on chunkservers.
    • Distributed Databases:

      • Bayou:

        • Supports disconnected operations, application-level conflict resolution, and eventual consistency.
      • All (Bayou, Coda, Ficus) provide eventual consistency and handle network partitions.

    • Dynamo (Amazon):

      • Inspired by above systems.
      • Allows read/write during partitions.
      • Uses different conflict resolution methods.
    • Other Distributed Storage Models:

      • FAB:

        • Block storage; splits large objects into smaller blocks.
      • Key-value stores:

        • Ideal for small objects (<1MB).
        • Easy to configure per application.
      • Antiquity:

        • Ensures data integrity using secure logs and Byzantine fault tolerance.
        • Dynamo trades off these guarantees in favor of trusted infrastructure.
    • Bigtable:

      • Stores structured data using a sparse, multi-dimensional sorted map.
      • Allows access via multiple attributes.
      • Dynamo differs by focusing on simple key/value access and high availability.
    • Traditional replicated relational databases:

      • Prioritize strong consistency.
      • Struggle with scalability and availability, especially under network partitions.

    3.3 Discussion (Dynamo Design Trade-offs)

    • Dynamo’s primary goals:

      1. Provide an “always writeable” store — no rejected writes.
      2. Operate in a trusted environment (single administrative domain).
      3. Skip support for hierarchical namespaces or complex schemas.
      4. Meet strict latency SLAs (e.g., 99.9% of ops < few hundred ms).
    • Zero-hop DHT:

      • Dynamo avoids multi-hop routing (as in Chord/Pastry).
      • Each node stores enough routing info to directly forward requests → reduces latency variability.


    4. System Architecture

    Beyond data persistence, Dynamo requires robust solutions for load balancing, membership, failure detection/recovery, replica synchronization, overload handling, state transfer, concurrency, job scheduling, request marshalling/routing, monitoring, and configuration management. The paper focuses on core techniques: partitioning, replication, versioning, membership, failure handling, and scaling, with advantages summarized in Table 1.





    4.1 System Interface:

    • Operations: Supports get(key) (retrieves object(s) with context) and put(key, context, object) (stores replicas with context).
    • Key Handling: Keys and objects are treated as opaque byte arrays; MD5 hashing generates 128-bit identifiers for node assignment.

    4.2 Partitioning Algorithm:

    • Uses consistent hashing on a ring, where each node is responsible for a key range between itself and its predecessor.
    • Virtual Nodes: Assigns multiple tokens to nodes to address non-uniform data/load distribution and node heterogeneity.
    • Advantages:
      • Even load dispersion during node failures.
      • Balanced load acceptance for new nodes.
      • Accounts for node capacity differences.

    4.3 Replication:

    • Replicates data on N hosts (configurable) for high availability and durability.
    • Coordinator node stores keys locally and at N-1 clockwise successors.
    • Preference List: Lists nodes responsible for a key, ensuring distinct physical nodes despite virtual nodes.
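    • A compact sketch of 4.2-4.3: a hash ring with virtual nodes, plus a preference list that walks clockwise and skips duplicate physical nodes (MD5 stands in for the key hashing; node names are illustrative):

    import bisect
    import hashlib

    def h(key: str) -> int:
        # 128-bit MD5 position on the ring, as in Dynamo's key handling.
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    class Ring:
        def __init__(self, nodes, vnodes=8):
            # Each physical node owns several tokens (virtual nodes) on the ring.
            self.tokens = sorted((h(f"{n}#{v}"), n) for n in nodes for v in range(vnodes))

        def preference_list(self, key, n=3):
            # Walk clockwise from the key's position, collecting N distinct physical nodes.
            start = bisect.bisect(self.tokens, (h(key), ""))
            seen, prefs = set(), []
            for i in range(len(self.tokens)):
                _, node = self.tokens[(start + i) % len(self.tokens)]
                if node not in seen:
                    seen.add(node)
                    prefs.append(node)
                    if len(prefs) == n:
                        break
            return prefs

    ring = Ring(["A", "B", "C", "D"])
    print(ring.preference_list("cart:alice", n=3))  # e.g. ['C', 'A', 'D']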

    4.4 Data Versioning:

    • Offers eventual consistency, with asynchronous replica updates.
    • Uses vector clocks to track causality, allowing multiple versions of an object.
    • Reconciliation:
      • Syntactic: System resolves versions when possible.
      • Semantic: Clients reconcile conflicting versions (e.g., merging shopping cart versions).
    • Vector Clock Truncation: Limits size by removing oldest entries (with timestamps) when exceeding a threshold, potentially impacting reconciliation.
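    • A small sketch of the vector-clock bookkeeping described above (one dict per object version; truncation omitted):

    def increment(clock: dict, node: str) -> dict:
        # Record one more update coordinated by `node`.
        out = dict(clock)
        out[node] = out.get(node, 0) + 1
        return out

    def descends(a: dict, b: dict) -> bool:
        # True if version `a` has seen everything version `b` has (a is newer or equal).
        return all(a.get(n, 0) >= c for n, c in b.items())

    def concurrent(a: dict, b: dict) -> bool:
        # Neither descends from the other -> conflicting versions that need
        # syntactic or semantic (client-side) reconciliation.
        return not descends(a, b) and not descends(b, a)

    v1 = increment({}, "Sx")     # {'Sx': 1}
    v2 = increment(v1, "Sy")     # written after v1 at node Sy
    v3 = increment(v1, "Sz")     # written after v1 at node Sz
    print(descends(v2, v1), concurrent(v2, v3))  # True True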

    4.5 Execution of get() and put() Operations:

    • Any node can coordinate, selected via load balancer or partition-aware client library (latter reduces latency).
    • Uses quorum-like protocol with R (read) and W (write) values, where R + W > N ensures consistency.
    • Put: Coordinator writes locally, sends to N reachable nodes, succeeds if W-1 respond.
    • Get: Coordinator retrieves versions from N reachable nodes, returns causally unrelated versions for client reconciliation.
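    • A toy sketch of the R/W quorum rule (in-memory stand-ins for the N replicas; real Dynamo issues the requests in parallel and carries versions and hints):

    N, R, W = 3, 2, 2            # R + W > N gives overlapping read and write quorums

    replicas = [dict() for _ in range(N)]

    def put(key, value, alive):
        # Coordinator writes to every reachable replica; succeed once W acknowledge.
        acks = 0
        for i in range(N):
            if alive[i]:
                replicas[i][key] = value
                acks += 1
        return acks >= W

    def get(key, alive):
        # Read from reachable replicas; succeed once R respond with their versions.
        versions = [replicas[i].get(key) for i in range(N) if alive[i]]
        return versions[:R] if len(versions) >= R else None

    alive = [True, True, False]   # one node unreachable
    print(put("k", "v1", alive))  # True: 2 acks >= W
    print(get("k", alive))        # ['v1', 'v1']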

    4.6 Handling Failures: Hinted Handoff:

    • Uses sloppy quorum, accessing the first N healthy nodes for read/write operations.
    • During node failures, replicas are sent to fallback nodes with hints, stored separately, and delivered upon recovery.
    • Replicates across multiple data centers to handle data center failures.

    4.7 Handling Permanent Failures: Replica Synchronization:

    • Employs Merkle trees for anti-entropy, detecting inconsistencies with minimal data transfer.
    • Each node maintains Merkle trees per key range for efficient synchronization.
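    • A rough sketch of the anti-entropy idea: hash each key range, compare a root over the range hashes, and only exchange data for ranges whose hashes differ (flattened to one level for brevity):

    import hashlib

    def digest(items) -> str:
        md = hashlib.md5()
        for k, v in sorted(items):
            md.update(f"{k}={v};".encode())
        return md.hexdigest()

    def range_hashes(store: dict, num_ranges: int = 4) -> dict:
        # Bucket keys into ranges and hash each bucket (the leaf level of the tree).
        buckets = {r: [] for r in range(num_ranges)}
        for k, v in store.items():
            buckets[hash(k) % num_ranges].append((k, v))
        return {r: digest(items) for r, items in buckets.items()}

    def ranges_to_sync(a: dict, b: dict):
        ha, hb = range_hashes(a), range_hashes(b)
        if digest(ha.items()) == digest(hb.items()):  # root hashes match: replicas agree
            return []
        return [r for r in ha if ha[r] != hb[r]]      # only these ranges need transfer

    node1 = {"k1": "v1", "k2": "v2", "k3": "v3"}
    node2 = {"k1": "v1", "k2": "stale", "k3": "v3"}
    print(ranges_to_sync(node1, node2))  # just the range containing k2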

    4.8 Membership and Failure Detection:

    • 4.8.1 Ring Membership: Explicit node addition/removal via admin tools, propagated via gossip-based protocol.
    • 4.8.2 External Discovery: Seed nodes prevent logical partitions, ensuring membership reconciliation.
    • 4.8.3 Failure Detection: Local, decentralized detection based on node responsiveness; nodes retry failed peers periodically.

    4.9 Adding/Removing Storage Nodes:

    • New nodes receive random tokens, taking key ranges from existing nodes.
    • Key transfers include confirmation to avoid duplicates, ensuring uniform load distribution and fast bootstrapping.

    This architecture ensures Dynamo’s high availability, scalability, and durability, tailored for applications like Amazon’s shopping cart, balancing consistency and performance.




    5. Implementation

    Each storage node in Dynamo has three main software components:

    1. Request Coordination
    2. Membership and Failure Detection
    3. Local Persistence Engine

    All components are implemented in Java.


    Pluggable Local Persistence

    • Supports multiple storage engines:

      • Berkeley DB (BDB) Transactional Data Store (default)
      • BDB Java Edition
      • MySQL
      • In-memory buffer with persistent backing
    • Engine is selected based on application’s object size and access patterns:

      • BDB: suited for small (tens of KB) objects
      • MySQL: better for larger objects

    Request Coordination

    • Built on event-driven messaging (inspired by SEDA architecture)

    • Uses Java NIO channels

    • Each client request creates a state machine on the coordinating node

    • State machine handles:

      • Node selection
      • Request dispatch
      • Response collection
      • Retry logic
      • Final client response packaging

    Read Operation Flow (State Machine)

    1. Send read requests to relevant nodes
    2. Wait for minimum required responses
    3. If insufficient responses: fail request
    4. Reconcile versions (if enabled) using vector clocks
    5. Return latest data
    6. Perform read repair for stale replicas post-response

    Write Coordination

    • Write coordinator is selected from the top N nodes in the preference list

    • Not always the first node to avoid load imbalance

    • Optimized by choosing the node that responded fastest to the preceding read

      • Improves read-your-writes consistency
      • Reduces latency variability
      • Enhances 99.9 percentile performance


    6. Experiences & Lessons Learned
    Dynamo is used by various Amazon services with different configurations, varying in reconciliation logic and quorum characteristics (N, R, W). Key experiences and lessons are summarized below.
    • Usage Patterns:
      • Business Logic Specific Reconciliation: Applications reconcile divergent versions using custom logic (e.g., merging shopping cart versions).
      • Timestamp-Based Reconciliation: Uses “last write wins” based on the latest timestamp (e.g., customer session information service).
      • High Performance Read Engine: Configured for high read rates with few updates (R=1, W=N), acting as a scalable persistence cache (e.g., product catalog, promotional items).
      • Tunable Parameters: Clients adjust N (durability), R (read quorum), and W (write quorum) to balance performance, availability, and consistency. Common configuration: (N=3, R=2, W=2).

    6.1 Balancing Performance and Durability:

    • Performance Goals: Services target 99.9% of read/write requests within 300ms on commodity hardware.
    • Challenges: Multiple node involvement limits performance to the slowest R/W replica; diurnal request patterns cause latency variations.
    • Optimization: Write buffering stores objects in memory, reducing 99.9th percentile latency by 5x during peak traffic. Reads check buffer first.
    • Trade-off: Buffering risks data loss on server crashes; mitigated by one replica performing a durable write without impacting performance.

    6.2 Ensuring Uniform Load Distribution:

    • Load Imbalance: Measured over 24 hours; imbalance ratio (nodes deviating >15% from average load) is higher at low loads (20%) than high loads (10%).
    • Partitioning Strategies:
      • Strategy 1: T Random Tokens per Node: Random token assignment causes uneven ranges, slow bootstrapping (up to a day), complex Merkle tree recalculation, and inefficient archiving.
      • Strategy 2: T Random Tokens, Equal Partitions: Divides hash space into Q equal partitions (Q >> N), decouples partitioning/placement, but has poor load balancing.
      • Strategy 3: Q/S Tokens, Equal Partitions: Assigns Q/S tokens per node, preserves equal partitions, and achieves best load balancing efficiency. Reduces membership data size by three orders of magnitude, simplifies bootstrapping/recovery, and enables easy archiving.
    • Evaluation: Strategy 3 outperforms others in load balancing efficiency (ratio of average to maximum node requests), tested with S=30, N=3.

    6.3 Divergent Versions: When and How Many?:

    • Causes: Divergent versions arise from failures (node, data center, network partitions) or concurrent writes.
    • Impact: Minimizing divergent versions reduces semantic reconciliation load. Over 24 hours, 99.94% of shopping cart requests saw one version, 0.00057% saw 2, 0.00047% saw 3, and 0.00009% saw 4.
    • Observation: Concurrent writes (often by automated clients) cause more divergence than failures.

    6.4 Client-Driven or Server-Driven Coordination:

    • Server-Driven: Load balancer assigns requests; coordinators (top N nodes for writes) handle versioning, incurring extra network hops.
    • Client-Driven: Clients download membership state every 10 seconds, directly coordinate requests, reducing latency by ≥30ms (99.9th percentile) and 3-4ms (average).
    • Advantages: Client-driven eliminates load balancer overhead, leverages uniform key distribution, but depends on fresh membership data.

    6.5 Balancing Background vs. Foreground Tasks:

    • Issue: Background tasks (replica synchronization, data handoff) competed with foreground put/get operations, impacting performance.
    • Solution: Admission control mechanism allocates resource slices to background tasks, adjusted via feedback from foreground task performance (e.g., 99th percentile read latency).
    • Monitoring: Tracks disk latencies, lock contention, and queue wait times to limit background task intrusiveness.

    6.6 Discussion:

    • Success: Dynamo achieves 99.9995% request success rate with no data loss over two years.
    • Flexibility: (N, R, W) tuning simplifies application porting for Amazon’s failure-tolerant services, though new applications require careful conflict resolution design.
    • Scalability Limitation: Full membership model (gossip-based routing table) suits hundreds of nodes but may not scale to tens of thousands without hierarchical extensions or O(1) DHT approaches.



    7. Summary of Dynamo’s Strengths

    1. Highly Available and Scalable

      • Used for storing the state of core Amazon.com services.
    2. Robustness

      • Handles server failures, data center outages, and network partitions effectively.
    3. Incremental Scalability

      • Allows service owners to scale up or down based on real-time load.
    4. Customizable Trade-offs

      • Parameters N (replication factor), R (read quorum), and W (write quorum) let users tune for:

        • Performance
        • Durability
        • Consistency (according to their SLA requirements)
    5. Real-World Proven

      • Proven effective over a year in Amazon’s demanding production environment.
    6. Demonstrates Viability of Decentralized Techniques

      • Shows that decentralized, eventually consistent systems can successfully support highly available applications.



    Tuesday, May 20, 2025

    Leetcode 256 Minimum Cost to Paint House (and additionally return the colors used in the minimum-cost painting)

    There is a row of n houses, where each house can be painted one of three colors: red, blue, or green. The cost of painting each house with a certain color is different. You have to paint all the houses such that no two adjacent houses have the same color.


    The cost of painting each house with a certain color is represented by an n x 3 cost matrix costs.


    For example, costs[0][0] is the cost of painting house 0 with the color red; costs[1][2] is the cost of painting house 1 with color green, and so on...

    Return the minimum cost to paint all houses.


     


    Example 1:


    Input: costs = [[17,2,17],[16,16,5],[14,3,19]]

    Output: 10

    Explanation: Paint house 0 into blue, paint house 1 into green, paint house 2 into blue.

    Minimum cost: 2 + 5 + 3 = 10.

    Example 2:


    Input: costs = [[7,6,2]]

    Output: 2

     


    Constraints:


    costs.length == n

    costs[i].length == 3

    1 <= n <= 100

    1 <= costs[i][j] <= 20 


    Top Down approach 

    
    from functools import lru_cache
    from typing import List, Tuple
    
    class Solution:
        def minCost(self, costs: List[List[int]]) -> Tuple[int, List[int]]:
            n = len(costs)
            if n == 0:
                return 0, []
    
            @lru_cache(None)
            def dp(i: int, prev_color: int) -> int:
                if i == n:
                    return 0
    
                min_cost = float('inf')
                for color in range(3):
                    if color != prev_color:
                        cost = costs[i][color] + dp(i + 1, color)
                        if cost < min_cost:
                            min_cost = cost
                return min_cost
    
            # Reconstruct the path
            result_path = []
            prev_color = -1
            for i in range(n):
                min_cost = float('inf')
                best_color = -1
                for color in range(3):
                    if color != prev_color:
                        cost = costs[i][color] + dp(i + 1, color)
                        if cost < min_cost:
                            min_cost = cost
                            best_color = color
                result_path.append(best_color)
                prev_color = best_color
    
            total_min_cost = dp(0, -1)
            return total_min_cost, result_path
    

    Bottoms up approach 

    
    from typing import List, Tuple

    class Solution:
        def minCost(self, costs: List[List[int]]) -> Tuple[int, List[int]]:
            if not costs:
                return 0, []
            
            n = len(costs)
            dp = [[0] * 3 for _ in range(n)]
            path = [[-1] * 3 for _ in range(n)]  # To reconstruct the color path
    
            # Base case
            for j in range(3):
                dp[0][j] = costs[0][j]
    
            # DP fill
            for i in range(1, n):
                for j in range(3):
                    min_cost = float('inf')
                    for k in range(3):
                        if j != k:
                            if dp[i-1][k] < min_cost:
                                min_cost = dp[i-1][k]
                                path[i][j] = k  # Store color used for previous house
                    dp[i][j] = costs[i][j] + min_cost
    
            # Find the minimum cost and the color used at the last house
            min_total = min(dp[n-1])
            last_color = dp[n-1].index(min_total)
    
            # Reconstruct path
            result_path = [0] * n
            color = last_color
            for i in reversed(range(n)):
                result_path[i] = color
                color = path[i][color] if i > 0 else -1
    
            return min_total, result_path
            
    '''
    costs = [
        [17, 2, 17],
        [16, 16, 5],
        [14, 3, 19]
    ]
    
    sol = Solution()
    print(sol.minCost(costs))
    
    (10, [1, 2, 1])
    
    TC = O(n)
    SC = O(n)
    '''
    
    

    Wednesday, April 16, 2025

    Leetcode 2179. Count Good Triplets in an Array

    2179. Count Good Triplets in an Array

    A good triplet is a set of 3 distinct values which are present in increasing order by position both in nums1 and nums2. In other words, if we consider pos1v as the index of the value v in nums1 and pos2v as the index of the value v in nums2, then a good triplet will be a set (x, y, z) where 0 <= x, y, z <= n - 1, such that pos1x < pos1y < pos1z and pos2x < pos2y < pos2z.

    Return the total number of good triplets.

    Example 1:

    Input: nums1 = [2,0,1,3], nums2 = [0,1,2,3]

    Output: 1

    Explanation: 

    There are 4 triplets (x,y,z) such that pos1x < pos1y < pos1z. They are (2,0,1), (2,0,3), (2,1,3), and (0,1,3). 

    Out of those triplets, only the triplet (0,1,3) satisfies pos2x < pos2y < pos2z. Hence, there is only 1 good triplet.

    Example 2:

    Input: nums1 = [4,0,1,3,2], nums2 = [4,1,0,2,3]

    Output: 4

    Explanation: The 4 good triplets are (4,0,3), (4,0,2), (4,1,3), and (4,1,2).


    • There are multiple approaches that one could take to solve this problem. Let us explore this by order of difficulty

            1. Brute Force:

    • We can generate all the triplets in both the lists and then see if there are common triplets formed from both the lists that satisfy the condition pos1x < pos1y < pos1z
    • The TC of this solution is O(n^3) and SC of this approach is also O(n^3) to store all the triplets we generate
           2. Hash Map + Brute Force

    • We need to follow these steps -
      1. Build a position map for nums2 so we can easily look up where each value from nums1 appears in nums2.
      2. Loop through each element nums1[i], and for each:
        1. Find its position in nums2
        2. Count how many elements we’ve already seen (before i) that appear before it in nums2.
          1. This is done using bisect.bisect() on a sorted list of seen positions (seen_positions)
        3. Count how many elements will come after i in nums1 that will also come after it in nums2.
          1. This ensures the triplet will be increasing in both arrays
        4. Multiply count_before * count_after to get the number of triplets where nums1[i] is in the middle.
      3. Insert the current position (from nums2) into seen_positions for the next round.


    
    
    
    from typing import List
    import bisect
    
    class Solution:
        def goodTriplets(self, nums1: List[int], nums2: List[int]) -> int:
            n = len(nums1)
            res = 0
            
            # Map each value in nums2 to its index
            index_in_nums2 = [0] * n
            for i in range(n):
                index_in_nums2[nums2[i]] = i
            
            seen_positions = []  # Holds the positions (in nums2) of elements seen so far in nums1
            
            for i in range(n):
                # Find the corresponding index of nums1[i] in nums2
                pos_in_nums2 = index_in_nums2[nums1[i]]
                
                # Find how many seen elements are less than this position (binary search)
                count_before = bisect.bisect(seen_positions, pos_in_nums2)
                
                # Insert current position in sorted order
                bisect.insort(seen_positions, pos_in_nums2)
                
                # Elements after i in nums1 that also come after this position in nums2
                count_after = (n - 1 - pos_in_nums2) - (i - count_before)
                
                # Multiply the number of valid 'before' and 'after' elements
                res += count_before * count_after
            
            return res
    

    TC is O(n log n)
    SC is O(n)

        3. Best approach: Fenwick tree (Binary Indexed Tree)

        1. Goal
            Count triplets of values (x, y, z) whose positions are increasing in nums1 (pos1x < pos1y < pos1z)
            and also increasing in nums2 (pos2x < pos2y < pos2z)

        2. Key Idea
            Map the problem to:
            Count increasing triplets in a transformed array
            (i.e., after mapping nums1's values to their positions in nums2)
            arr[i] = position of nums1[i] in nums2

            Then, count triplets where:

            arr[i] < arr[j] < arr[k] and i < j < k

    • We use a Fenwick Tree (Binary Indexed Tree) to efficiently:
      • Count how many numbers less than arr[i] have appeared so far (to the left)
      • Deduce how many numbers greater than arr[i] will appear later (to the right)
    • For each element at index i in arr:
      • left = count of numbers < arr[i] before i → use tree.query(arr[i] - 1)
      • right = count of numbers > arr[i] after i
      • → total remaining = n - 1 - arr[i]
      • → already counted = i - left
      • → so right = (n - 1 - arr[i]) - (i - left)
      • Total triplets contributed by arr[i] as the middle element: res += left * right


    
    from typing import List

    class FenwickTree:
        def __init__(self, size):
            self.tree = [0] * (size + 1)
    
        def update(self, index, delta):
            index += 1  # 1-based indexing
            while index < len(self.tree):
                self.tree[index] += delta
                index += index & -index
    
        def query(self, index):
            index += 1  # 1-based indexing
            res = 0
            while index > 0:
                res += self.tree[index]
                index -= index & -index
            return res
    
    class Solution:
        def goodTriplets(self, nums1: List[int], nums2: List[int]) -> int:
            n = len(nums1)
    
            # Map each number to its index in nums2
            pos_in_nums2 = [0] * n
            for i, val in enumerate(nums2):
                pos_in_nums2[val] = i
    
            # Create array where nums1[i] is mapped to its position in nums2
            arr = [0] * n
            for i, val in enumerate(nums1):
                arr[i] = pos_in_nums2[val]
    
            tree = FenwickTree(n)
            res = 0
    
            for i, val in enumerate(arr):
                left = tree.query(val - 1)  # Count of elements < val so far
                right = (n - 1 - val) - (i - left)  # Remaining elements > val
                res += left * right
                tree.update(val, 1)  # Mark val as seen
    
            return res
    
    

    Tuesday, April 15, 2025

    Replication


    • Replication means keeping a copy of the same data on multiple machines that are connected via a network.
    • There are three popular variants of the replication algorithms - 
      • Single-leader
      • Multi-leader
      • Leaderless replication

         Leader-based replication (also known as active/passive or master-slave replication)

    • One of the replicas is designated as the leader. Clients must send their writes to the leader.
    • Other replicas are known as followers. Whenever the leader writes data to its local storage it sends the update to all of its followers as part of replication log or change stream.
    • When a client wants to read from the database, it can query either the leader or any of the followers.

            Synchronous Versus Asynchronous Replication

    • The advantage of synchronous replication is that the follower is guaranteed to have an up-to-date copy of the data consistent with the leader. The major disadvantage is that the leader has to block all writes and wait until the synchronous replica is available again.
    • In practice, if synchronous replication is enabled on a database, then one of the followers is synchronous and the rest are asynchronous.
    • Although asynchronous leader-based replication doesn't guarantee write durability, it has the advantage that the leader can continue processing writes even if all of its followers have fallen behind. It is widely used despite this disadvantage.
    • Note: losing data when the leader fails can be a serious problem with asynchronous replication. Chain replication (a variant of synchronous replication) has been implemented successfully in a few systems such as Microsoft Azure Storage.

            Setting Up New Followers

    • There is often a need to set up new followers, either to increase the number of replicas or to replace failed nodes.
    • Steps:
      • Take a consistent snapshot of the leader's database at some point in time.
      • Copy the snapshot to the new follower node.
      • The follower connects to the leader and requests all data changes that have happened since the snapshot was taken.
      • Follower catches up to the backlog of data changes.

            Handling Node Outages

    • Each follower keeps a log of the data changes it has received from the leader. If the follower crashes or experiences a temporary interruption, it requests the data changes that occurred while it was disconnected.
    • In case of leader failure, one of the followers needs to be promoted to leader. An automatic failover process consists of the following steps -
      • 1. Determining that the leader has failed, mostly using timeouts (typically around 30 seconds).
      • 2. Choosing a new leader, usually through an election process.
      • 3. Reconfiguring the system to use the new leader. Clients now need to send their write requests to the new leader.

            Implementation of Replication Logs

    • There are several different replication methods used in practice.
            i. Statement-based replication
    • Leader logs every write request that it executes and sends that statement log to its followers.
    • For a relational database, every INSERT, UPDATE or DELETE is forwarded to followers and each follower parses and executes that SQL statement
    • This approach can break down if a statement uses a nondeterministic function such as NOW() or RAND(), since each replica would compute a different value.
            ii. Write-Ahead log (WAL) shipping
    • The log is an append-only sequence of bytes containing all writes to the database. We can use the exact same log to build a replica on another node: besides writing the log to disk, the leader also sends it across the network to its followers.
    • This can cause downtime during upgrades if the WAL format is incompatible between the leader's and followers' database versions.

            iii. Logical (row-based) log replication
    • If a different log format is used for replication than for the storage engine's internals, the replication log is decoupled from the storage engine; such a log is called a logical log.
            iv. Trigger-based replication
    • A trigger lets us register custom application code that is automatically executed when the data change (write transaction) occurs in a DB system.

            Problems with Replication Lag

    • Leader-based replication requires all writes to go through a single node, but read-only queries can go to any replica.
    • This is attractive for web applications, where reads typically far outnumber writes. However, this read-scaling approach only realistically works with asynchronous replication.
    • Reading from asynchronous followers can return results that are inconsistent with the leader. This inconsistency is a temporary state - eventual consistency (if writes stop, the followers will eventually catch up to the leader).
            Reading Your Own Writes
    • With asynchronous replication there is a problem: a follower serving a read may not yet have caught up with the leader, so to the user who just made a write it looks as though the data they submitted was lost.
    • In this case we need read-after-write consistency, also known as read-your-writes consistency. This is a guarantee that if the user reloads the page, they will always see any updates they submitted themselves.
    • Steps to implement read-after-write consistency -
      • When reading something the user may have modified, read it from the leader; otherwise, read it from a follower.
      • This approach doesn't scale if most things in the application are potentially editable by the user. So we might track the time of last update and, for one minute after the last update, make all reads from the leader.
      • The client can remember the timestamp of its most recent write - then the system can ensure that the replica serving any reads for that user reflects updates at least until that timestamp (logical timestamp).
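    • A small sketch of the second idea above: remember each user's last write time and pin their reads to the leader for a short window (the routing helpers are hypothetical):

    import time

    LEADER, FOLLOWERS = "leader-db", ["follower-1", "follower-2"]
    PIN_TO_LEADER_SECS = 60
    last_write_at: dict[str, float] = {}     # user_id -> time of that user's last write

    def record_write(user_id: str) -> str:
        # All writes go to the leader; note when this user last wrote.
        last_write_at[user_id] = time.time()
        return LEADER

    def pick_read_replica(user_id: str) -> str:
        # Recent writers read from the leader so they always see their own writes;
        # everyone else can be served by a (possibly lagging) follower.
        if time.time() - last_write_at.get(user_id, 0.0) < PIN_TO_LEADER_SECS:
            return LEADER
        return FOLLOWERS[hash(user_id) % len(FOLLOWERS)]

    record_write("alice")
    print(pick_read_replica("alice"))   # leader-db (inside the 60-second window)
    print(pick_read_replica("bob"))     # one of the followers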
            Monotonic Reads
    • When you read data, you may see an old value; monotonic reads only means that if one user makes several reads in sequence, they will not see time go backward i.e, they will not read older data after having previously read newer data.
    • It's a lesser guarantee than strong consistency, but a stronger guarantee than eventual consistency.
            Consistent Prefix Reads
    • If some partitions replicate more slowly than others, an observer may see the answer before they see the question.
    • Consistent prefix reads guarantee says that if a sequence of writes happen in a certain order, then anyone reading those writes will see them appear in the same order.
    • One solution is to make sure that any writes that are causally related to each other are written to the same partition.
            Solutions for Replication Lag
    • In an eventually consistent system, it is worth thinking about how the application behaves if the replication lag increases to several minutes or even hours. If it might cause an issue then we need to provide stronger guarantee like read-after-write.
    • Ideally, application developers shouldn't have to worry about subtle replication issues and could just trust their databases; this is why transactions exist - to provide stronger guarantees.

        Multi-Leader Replication

    • Allows more than one node to accept writes. Replication still happens in the same way: each node that processes a write must forward that data change to all other nodes.

            Uses Cases for Multi-Leader Replication

    • Multi-datacenter operation
      • Multi-leader configuration allows us to have a leader in each datacenter.
      • Multi-leader has better performance compared to single-leader configuration as the writes can be processed locally in the datacenter and replicated to other replicas asynchronously.
      • In multi-leader each datacenter can continue operating independently of the others and replication catches up when the failed datacenter comes back online.
      • A multi-leader configuration with asynchronous replication can usually tolerate network problems better
    • Clients with offline operation
      • Multi-leader replication is suitable for applications that need to remain fully functional (read and write) even when disconnected from the internet. The example of calendar apps on various devices illustrates this: each device acts as a local leader, accepting changes offline, and these changes are asynchronously synced with other devices and a server when connectivity is restored. This architecture, while challenging to implement correctly (as evidenced by past calendar sync issues), is essentially an extreme form of multi-datacenter multi-leader replication where each device is treated as an unreliable "datacenter." Tools like CouchDB are designed to simplify this type of operation.
    • Collaborative editing
      • Real-time collaborative editing is essentially a form of multi-leader replication. Each user's local client acts as a leader, accepting their edits immediately. These changes are then asynchronously replicated to a central server and other concurrent editors. While locking can prevent conflicts (similar to single-leader replication), faster, lock-free collaboration necessitates handling the complexities of multi-leader replication, including the need for conflict resolution due to simultaneous edits from multiple "leaders."

            Handling Write Conflicts

    • The biggest problem with multi-leader replication is that write conflicts can occur, which means that conflict resolution is required.
            Synchronous versus asynchronous conflict detection
      • In multi-leader replication, conflicts are typically detected asynchronously, unlike single-leader where the second write either blocks or aborts. While synchronous conflict detection is possible in multi-leader, it negates the primary advantage of independent write acceptance.

            Conflict avoidance: 
      • The simplest way to handle conflicts is to prevent them by ensuring all writes for a specific record go through the same leader. Routing user requests to their "home" datacenter is an example, effectively making it a single-leader setup from the user's perspective. However, this breaks down when the designated leader needs to change.

            Converging toward a consistent state
      • In multi-leader setups with no defined write order, replicas must employ convergent conflict resolution to ensure all replicas eventually reach the same final data state despite conflicting updates.

            Various ways of achieving convergent conflict resolution: 
                
      • Last Write Wins (LWW): Choosing the write with the highest unique ID (often a timestamp), which can lead to data loss.
      • Replica ID Precedence: Prioritizing writes based on the originating replica's ID, also risking data loss.
      • Value Merging: Combining conflicting values (e.g., alphabetical concatenation).
      •  Explicit Conflict Recording: Preserving all conflicting writes in a data structure for later resolution by application code or the user.
            Custom conflict resolution logic: 
      • Many multi-leader systems allow developers to implement application-specific conflict resolution logic, which can be executed either:
      • On write: When a conflict is detected in the replication log (often automated, without direct user interaction).
      • On read: When conflicting versions are retrieved, allowing the application (or user) to resolve them before writing back to the database.
      • Row-level conflict resolution: Conflict resolution in multi-leader replication typically happens at the level of individual rows or documents, not at the level of entire atomic transactions. Each write within a transaction is treated separately for conflict resolution.
            Automatic Conflict Resolution:
    • While custom conflict resolution can be complex and error-prone (as illustrated by the Amazon shopping cart example), researchers are exploring approaches like Conflict-Free Replicated Datatypes (CRDTs), mergeable persistent data structures, and Operational Transformation (used in collaborative editing tools) to resolve conflicts in a more sensible and automated way. Although these techniques are still relatively new in database implementations, their future integration promises to make multi-leader data synchronization significantly easier for applications.
            What is conflict?
    • A conflict in a multi-leader replication system occurs when concurrent operations on different leaders lead to inconsistencies. While some conflicts are obvious (like two different writes to the same field), others can be more subtle and application-specific, such as two overlapping bookings for the same resource in a reservation system. 
        Multi-Leader Replication Topologies
    • A replication topology describes the communication paths along which writes are propagated from one node to another. If you have two leaders, there is only one plausible topology: leader 1 must send all of its writes to leader 2, and vice versa. With more than two leaders, various different topologies are possible.
    • All-to-all is the most common type of topology. There are inherent flaws with Circular or star topology.

    Leaderless Replication

    • Leaderless replication is an alternative approach to single-leader and multi-leader replication where no single "leader" node dictates the order of writes. Instead, any replica can directly accept write requests from clients (either directly or via a coordinator). This design, popularized by Amazon's Dynamo and adopted by systems like Riak, Cassandra, and Voldemort (collectively known as Dynamo-style databases), fundamentally changes how the database handles writes and has significant implications for its usage.

        Writing to the Database When a Node Is Down

    • Leaderless replication handles node outages gracefully without failover: Unlike leader-based systems that might require a failover process when a replica is unavailable, leaderless systems can continue processing writes as long as a sufficient number of replicas are available.
    • Writes are sent to multiple replicas: Clients in a leaderless system typically send write requests to several replicas in parallel. The write is considered successful once a quorum (a predefined number) of replicas acknowledge it, even if some replicas are down.
    • Unavailable nodes miss writes, leading to stale reads: When a previously unavailable replica comes back online, it will be missing the writes that occurred during its downtime. Reading from such a node can result in stale or outdated data.
    • Reads are also sent to multiple replicas to detect staleness: To address the issue of stale reads, clients in leaderless systems often send read requests to multiple replicas in parallel. This allows them to receive potentially different versions of the data.
    • Version numbers are used to determine the latest data: To reconcile the potentially different responses from multiple replicas during a read, version numbers are employed to identify the most recent and correct version of the data.
            Read repair and anti-entropy
    • Read repair updates stale replicas during reads: When a client reads from multiple replicas and detects outdated data on some, it writes the latest version back to those stale replicas, ensuring eventual consistency for frequently read data.
    • Anti-entropy processes backfill missing data in the background: Some systems use a background process to periodically compare data across replicas and copy missing information, ensuring eventual consistency for data that might not be read frequently (though not all systems implement this, potentially impacting durability for rarely read data).
            Quorums for Reading and Writing
    • Quorums (w for writes, r for reads) ensure consistency and fault tolerance: By requiring a minimum number of successful acknowledgments (w) for writes and successful responses (r) for reads, where w + r > n (total replicas), the system aims to provide up-to-date reads despite some node failures.
    • Configurable quorums balance performance and resilience: The values of n, w, and r are tunable, allowing optimization for different workloads (e.g., faster reads with r=1, w=n at the cost of write availability). If fewer than w or r nodes respond successfully, the operation fails.
    • Quorums enable tolerance of unavailable nodes: The w < n condition allows writes to succeed with some down nodes, and r < n allows reads to succeed similarly. The specific values of w and r determine the number of simultaneous node failures the system can tolerate (e.g., n=3, w=2, r=2 tolerates one failure).
    • Read and write requests are sent to all replicas, but success depends on quorum response: Clients send requests to all n replicas in parallel, but the operation is considered successful only after receiving the required number of successful responses (w for writes, r for reads). The specific reason for a node's unavailability (crash, error, network issue) is less important than whether it returns a successful response within the timeout period.

        Limitations of Quorum Consistency

    • Quorums (w + r > n) aim for read-after-write consistency but aren't absolute guarantees: While the overlap between read and write quorums generally increases the likelihood of reading the latest value, various edge cases (like sloppy quorums, concurrent writes, concurrent read/write, partial write failures, node recovery from stale data, and timing issues) can still lead to stale reads even when the quorum condition is met.
    • Smaller quorums (w + r ≤ n) prioritize availability and lower latency over strong consistency: Reducing the required number of successful read and write responses increases the system's tolerance to node failures and network issues, allowing operations to proceed with fewer available replicas. However, this significantly increases the probability of reading stale data.
    • Leaderless systems lack inherent read-your-writes and monotonic reads guarantees, making staleness monitoring crucial: Unlike leader-based replication with a defined write order and replication lag metrics, leaderless systems with eventual consistency make staleness harder to track. Monitoring the likelihood and extent of stale reads is essential for operational awareness, especially since read repair alone doesn't guarantee an upper bound on staleness for infrequently accessed data.

        Sloppy Quorums and Hinted Handoff

    • In a leaderless replication system, a sloppy quorum is a relaxed form of the traditional quorum. Instead of requiring a strict majority or a predefined number (w for writes, r for reads) of the designated n "home" nodes to be available and acknowledge an operation, a sloppy quorum allows the operation to succeed if the required number of any reachable nodes in the cluster respond.  
    • Hinted handoff is a mechanism often used in conjunction with sloppy quorums to ensure eventual consistency. When a write is accepted by a node that is not one of the designated "home" nodes for that data (due to a temporary failure or network partition), the accepting node stores a "hint" recording which node the write was really intended for, so it can forward the write once that node is reachable again.
    • Leaderless databases with quorums offer high availability and low latency by tolerating individual node failures and slowness without failover. They can return results as soon as a quorum (w for writes, r for reads) of nodes responds.
    • Sloppy quorums and hinted handoff enhance availability during network partitions by accepting writes on any reachable w nodes (even outside the designated n "home" nodes) and later transferring these writes to the correct nodes once the network is restored. However, this weakens consistency guarantees, as reads might not see the latest writes until the handoff completes (a toy sketch follows this list).
    • Leaderless replication is well-suited for multi-datacenter operation due to its tolerance for network issues and concurrent writes. Implementations vary: Cassandra and Voldemort extend their leaderless model across data-centers with configurable replica distribution and local quorum reads/writes, while Riak uses a more localized approach within each datacenter with asynchronous cross-datacenter replication.
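
    To make the hint bookkeeping concrete, here is a toy sketch of a sloppy write with hinted handoff. The node objects, their put method, and the node_id attribute are assumptions for illustration, not the API of any particular database.

```python
# Hinted-handoff sketch: when a "home" replica for a key is unreachable, the
# write lands on another reachable node together with a hint naming the
# intended home; hints are replayed once the home node comes back.

from collections import defaultdict

class HintStore:
    """Buffer of writes accepted on behalf of unreachable home nodes."""
    def __init__(self):
        self.hints = defaultdict(list)   # home_node_id -> [(key, value), ...]

    def record(self, home_node_id, key, value):
        self.hints[home_node_id].append((key, value))

    def handoff(self, home_node):
        """Replay buffered writes to the recovered home node, then drop them."""
        for key, value in self.hints.pop(home_node.node_id, []):
            home_node.put(key, value)

def sloppy_write(key, value, home_nodes, fallback_nodes, w, hint_store):
    """Try the designated home nodes first; if fewer than w accept, spill the
    write onto other reachable nodes and leave hints for the missing homes."""
    acks = 0
    unreachable_homes = []
    for node in home_nodes:
        try:
            node.put(key, value)
            acks += 1
        except Exception:
            unreachable_homes.append(node)

    fallbacks = iter(fallback_nodes)
    while acks < w and unreachable_homes:
        home = unreachable_homes.pop()
        fallback = next(fallbacks, None)
        if fallback is None:
            break
        fallback.put(key, value)
        hint_store.record(home.node_id, key, value)   # remember who should own this write
        acks += 1

    return acks >= w
```
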

        Detecting Concurrent Writes

    • Inherent Concurrency Leads to Conflicts:
      • Multiple Independent Writers: Unlike single-leader systems where writes are serialized, Dynamo-style databases allow any replica to accept writes. This inherently means multiple clients can be writing to the same data (same key) simultaneously, without awareness of each other.
      • Asynchronous Replication: Even with quorums, the replication between nodes is often asynchronous. A write acknowledged by the quorum might not have reached all other replicas immediately, increasing the window for concurrent modifications.
      • Read Repair and Hinted Handoff Introduce Conflict Potential: While these mechanisms ensure eventual consistency, they can also introduce conflicts. For example, read repair might reconcile different versions of data, and hinted handoff might deliver delayed writes that conflict with more recent ones.
    • The Problem of Divergent Replicas:
      • Different Order of Operations: Due to network latency variations and partial failures, the same sequence of write operations might arrive at different replicas in different orders.
      • Simple Overwriting Leads to Permanent Inconsistency: If each replica simply overwrites the value based on the last write it receives, the replicas will end up with different final values, violating the principle of eventual consistency (where all replicas should eventually agree on the same data).
    • Last Write Wins (LWW): A Simple but Flawed Solution:
      • Timestamp-Based Resolution: LWW resolves conflicts by assigning a timestamp to each write and choosing the write with the latest timestamp as the "winner." Older writes are discarded.
      • Achieves Eventual Convergence (Potentially): If all writes are eventually propagated, all replicas will eventually converge to the value of the write with the latest timestamp.
      • Significant Durability Risks: Silent Data Loss of Concurrent Writes: If multiple clients write concurrently, only the one with the "latest" timestamp (which might be arbitrary due to clock skew) will be preserved. The other successful writes are silently lost.
      • Loss of Non-Concurrent Writes Due to Clock Skew: Even non-concurrent writes can be lost if the clocks on different clients or servers are not perfectly synchronized. A write that logically happened later might have an earlier timestamp.
      • Limited Use Cases: LWW is generally only acceptable for scenarios where data loss is tolerable, such as caching or session management with short lifespans.
    • Tracking Causality with Version Numbers (Single Replica):
      • Server-Managed Versioning: The server assigns an incrementing version number to each write for a given key.
      • Client Must Read Before Write: Clients must first read the current value and its version number before attempting to write.
      • Write Includes Prior Version: When writing, the client sends the new value along with the version number it last saw.
      • Server Determines Overwrite vs. Concurrency:
        • The version number included with a write tells the server which state the client had seen when it issued the write. The server can overwrite all values with that version number or below, since the new write is based on them.
        • Any values the server holds with a higher version number are concurrent with the incoming write, so the server keeps them alongside the new value as "siblings" rather than discarding either.
      • Client-Side Merging Required: When a client reads a key with multiple sibling values, it needs to implement logic to merge these concurrent versions into a single, coherent value (toy sketches of this scheme and of version-vector comparison follow this list).
    • Version Vectors (Multi-Replica Causality Tracking):
      • Version per Replica: In a distributed setting with multiple replicas, a single version number is insufficient. Version vectors are used, where each replica maintains its own version number for each key and also tracks the versions it has seen from other replicas.
      • Capturing the History of Updates: A version vector essentially represents the history of updates to a key across all replicas.
      • Distinguishing Overwrites from Concurrency: By comparing version vectors, replicas can determine if one write causally precedes another (an overwrite) or if they are concurrent (leading to siblings).
      • Client-Side Merging Remains Necessary: Similar to the single-replica case with version numbers, clients might still need to merge sibling values when reading.
    • The Concept of "Happens-Before" and Concurrency:
      • Happens-Before Relationship: Operation A happens before operation B if B knows about A, depends on A's result, or builds upon A. This implies a causal relationship.
      • Concurrency Defined by Lack of Causality: Two operations are considered concurrent if neither happens before the other – neither operation had knowledge of the other when it started.
      • Importance for Conflict Resolution: If B happens after A, B should overwrite A. If A and B are concurrent, there's a conflict that needs resolution.
    • Merging Siblings: The Application's Role:
      • Client Responsibility: When concurrent writes result in sibling values, the burden of merging these values often falls on the application logic on the client side.
      • Simple Merges (e.g., Union for Sets): For some data types, simple merging strategies like taking the union of elements (as in a shopping cart example) might be sufficient.
      • Complex Merges and Tombstones: For more complex scenarios (like allowing removals from a set), simple union might lead to incorrect results. Tombstones (special markers indicating a deletion) are often used to handle removals correctly during merging.
      • Error-Prone and Complex: Implementing correct and robust merging logic in application code can be challenging and error-prone, as highlighted by the Amazon shopping cart issue.
    • Automatic Conflict Resolution as a Future Direction:
      • CRDTs (Conflict-Free Replicated Datatypes): These are specialized data structures (like counters, sets, maps, ordered lists) designed to be concurrently edited without explicit conflict resolution. They have built-in logic to automatically merge concurrent updates in a predictable and sensible way.
      • Mergeable Persistent Data Structures: These track history explicitly (like Git) and use three-way merges to reconcile concurrent changes.
      • Operational Transformation: Used in collaborative editing, this algorithm transforms operations based on concurrently executed operations to maintain consistency in ordered lists (like characters in a document).
      • Potential to Simplify Development: The integration of automatic conflict resolution techniques into databases promises to significantly ease the complexity of handling multi-leader and leaderless replication for application developers.
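
    To make the version-number scheme and client-side merging above concrete, here is a toy in-memory sketch: a server-side object that keeps concurrent siblings for a single key, and a client-side merge that takes the union of shopping-cart siblings. It is illustrative only (no real database API) and deliberately omits tombstones, so removals would need the extra handling noted above.

```python
# Single-key causality tracking with version numbers plus client-side sibling
# merging. A toy in-memory model, not a real database API.

class VersionedKey:
    """Server side: keeps siblings for one key and a monotonically increasing version."""
    def __init__(self):
        self.version = 0
        self.siblings = []     # list of (version, value) pairs the client must merge

    def read(self):
        return self.version, [value for _, value in self.siblings]

    def write(self, value, based_on_version):
        """`based_on_version` is the version the client saw when it last read.
        Values at or below that version are superseded by this write; values
        with a higher version are concurrent and survive as siblings."""
        self.version += 1
        self.siblings = [(v, val) for v, val in self.siblings if v > based_on_version]
        self.siblings.append((self.version, value))
        return self.version

def merge_carts(siblings):
    """Client side: merge concurrent shopping-cart siblings by set union."""
    merged = set()
    for cart in siblings:
        merged |= set(cart)
    return merged

# Two clients read version 0, then write concurrently without seeing each other.
key = VersionedKey()
key.write({"milk"}, based_on_version=0)
key.write({"eggs"}, based_on_version=0)            # concurrent with the first write
version, siblings = key.read()
print(merge_carts(siblings))                        # {'milk', 'eggs'}
key.write(merge_carts(siblings) | {"ham"}, based_on_version=version)
print(key.read())                                   # back to a single sibling
```

    For the multi-replica case, the overwrite-versus-concurrent decision comes down to comparing version vectors. The helper below is a small illustrative sketch that treats a version vector as a dict mapping replica id to counter and reports whether one vector causally precedes the other or whether the two are concurrent.

```python
def compare_version_vectors(a, b):
    """Compare two version vectors (dicts of replica_id -> counter).
    Returns 'equal', 'a_before_b', 'b_before_a', or 'concurrent'."""
    keys = set(a) | set(b)
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "a_before_b"     # a happened before b, so b may overwrite a
    if b_le_a:
        return "b_before_a"
    return "concurrent"         # neither dominates: keep both as siblings

print(compare_version_vectors({"r1": 2, "r2": 1}, {"r1": 2, "r2": 3}))  # a_before_b
print(compare_version_vectors({"r1": 3, "r2": 1}, {"r1": 2, "r2": 3}))  # concurrent
```
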
