Key Points about Replication in Distributed Systems
The author of this publication is EPAM Lead Software Engineer, Jai Ludhani.
In contemporary enterprise-level software systems, there is a significant shift towards distributed architecture. This goal of this transition is to enhance scalability and serve customers more efficiently, compared to traditional monolithic architectures. Scalability introduces several challenges, however, particularly in the replication of data across different nodes. This article discusses data replication and its associated terminologies.
What is Data Replication?
In distributed systems, multiple data nodes are used to fetch and store customer data. To prevent data loss in case of node failures, data replication ensures that copies of the same data are maintained across multiple machines connected via a network. This concept is detailed in the book titled Designing Data-Intensive Applications: The Big ideas Behind Reliable, Scalable, and Maintainable Systems, by Martin Kleppman (DDIA, Chapter 5).
Benefits of Data Replication
- Single Point of Failure (SPOF) Mitigation: Enhances system availability.
- Reduced Access Latency: Provides faster access for users across different geographical locations.
- Increased Read Throughput: Allows scaling out by adding more machines to handle read queries.
Challenges of Data Replication
Replicating static data is straightforward, but most customer data changes frequently. This makes timely replication across all nodes challenging. Customers expect their queries to return the latest data, necessitating efficient replication mechanisms.
Common Data Replication Algorithms
Several algorithms are used for data replication. Three are particularly popular and I discuss two of them below.
Single Leader Approach
In this approach:
- Leader Node: Handles all write (update) requests.
- Follower Nodes: Handle read requests and replicate data from the leader.
- The leader can also handle read requests, while followers are restricted to read-only operations.
Synchronous vs. Asynchronous Replication
Consider a scenario in which a user updates their email address in a distributed application with one leader and two follower nodes:
- Synchronous Replication: The leader waits for acknowledgment from a follower before completing the update. This ensures consistency, but can be delayed by the follower's state (e.g., system updates).
- Asynchronous Replication: The leader sends the update request to the follower without waiting for acknowledgment, improving performance but risking temporary inconsistency.
In practice, a semi-synchronous approach is often used, balancing performance and consistency.
A brief about handling of the different scenarios in Distributed Architecture where number of data nodes are there in which some are leader and others are followers nodes.
Adding Followers in Distributed Environments
Adding a new follower node involves:
- Snapshotting: Taking a snapshot of the leader node’s data.
- Copying: Transferring the snapshot to the new follower node.
- Catching Up: The new follower retrieves changes from the leader’s replication log that occurred after the snapshot.
- Synchronization: Once caught up, the new follower can process new changes.
Handling Follower Node Outages
If a follower node goes down due to an issue such as network failure or system updates, it recovers by:
- Identifying the last successful transaction from its logs.
- Requesting from the leader updates that occurred since the last successful transaction.
- Applying these updates to catch up and synchronize with the leader.
Handling Leader Node Failures
Leader failures are more complex and involve a process called failover:
- Failover: A new leader is selected from the existing followers, either manually or through consensus algorithms like Raft or Zab.
- Reconfiguration: The remaining followers begin replicating data from the new leader.