Beyond the success of Kotlin: a documentary about how and why Kotlin succeeded in the world of Android development.

Key Points about Replication in Distributed Systems

The author of this publication is EPAM Lead Software Engineer, Jai Ludhani.

Lead Software Engineer, Jai Ludhani

Published in Tech matters26 June 20242 min read

In contemporary enterprise-level software systems, there is a significant shift towards distributed architecture. This goal of this transition is to enhance scalability and serve customers more efficiently, compared to traditional monolithic architectures. Scalability introduces several challenges, however, particularly in the replication of data across different nodes. This article discusses data replication and its associated terminologies.

Grow your tech career with generative AI
Explore courses, discover new tools, and read the latest news.
View offer

What is Data Replication?

In distributed systems, multiple data nodes are used to fetch and store customer data. To prevent data loss in case of node failures, data replication ensures that copies of the same data are maintained across multiple machines connected via a network. This concept is detailed in the book titled Designing Data-Intensive Applications: The Big ideas Behind Reliable, Scalable, and Maintainable Systems, by Martin Kleppman (DDIA, Chapter 5).

Benefits of Data Replication

  1. Single Point of Failure (SPOF) Mitigation: Enhances system availability.
  2. Reduced Access Latency: Provides faster access for users across different geographical locations.
  3. Increased Read Throughput: Allows scaling out by adding more machines to handle read queries.

Challenges of Data Replication

Replicating static data is straightforward, but most customer data changes frequently. This makes timely replication across all nodes challenging. Customers expect their queries to return the latest data, necessitating efficient replication mechanisms.

Common Data Replication Algorithms

Several algorithms are used for data replication. Three are particularly popular and I discuss two of them below.

Single Leader Approach

In this approach:

  • Leader Node: Handles all write (update) requests.
  • Follower Nodes: Handle read requests and replicate data from the leader.
  • The leader can also handle read requests, while followers are restricted to read-only operations.

Synchronous vs. Asynchronous Replication

Consider a scenario in which a user updates their email address in a distributed application with one leader and two follower nodes:

  • Synchronous Replication: The leader waits for acknowledgment from a follower before completing the update. This ensures consistency, but can be delayed by the follower's state (e.g., system updates).
  • Asynchronous Replication: The leader sends the update request to the follower without waiting for acknowledgment, improving performance but risking temporary inconsistency.

In practice, a semi-synchronous approach is often used, balancing performance and consistency.

A brief about handling of the different scenarios in Distributed Architecture where number of data nodes are there in which some are leader and others are followers nodes.

Adding Followers in Distributed Environments

Adding a new follower node involves:

  1. Snapshotting: Taking a snapshot of the leader node’s data.
  2. Copying: Transferring the snapshot to the new follower node.
  3. Catching Up: The new follower retrieves changes from the leader’s replication log that occurred after the snapshot.
  4. Synchronization: Once caught up, the new follower can process new changes.

Handling Follower Node Outages

If a follower node goes down due to an issue such as network failure or system updates, it recovers by:

  1. Identifying the last successful transaction from its logs.
  2. Requesting from the leader updates that occurred since the last successful transaction.
  3. Applying these updates to catch up and synchronize with the leader.

Handling Leader Node Failures

Leader failures are more complex and involve a process called failover:

  1. Failover: A new leader is selected from the existing followers, either manually or through consensus algorithms like Raft or Zab.
  2. Reconfiguration: The remaining followers begin replicating data from the new leader.
The views expressed in the articles on this site are solely those of the authors and do not necessarily reflect the opinions or views of Anywhere Club or its members.
Related posts
Get the latest updates on the platforms you love