2.9 Leader Election

Electing a leader is one of the fundamental problems in distributed computing [Pel90]. Also called coordinator election or leader finding, a leader election selects a single node among the nodes of a distributed system, according to some election criterion.

The leader election problem was first proposed in 1977 by Gérard Le Lann [Le 77], a French computer science researcher.

In this thesis, two leader election problems are considered: the classical leader election problem (Section 2.9.1) and the eventual leader election problem (Section 2.9.2).

2.9.1 Classical Leader Election

In 1980, Angluin showed that there is no deterministic leader election algorithm in anonymous and uniform networks [Ang80], where anonymous means that processes do not have a unique identifier, so there is no way to distinguish a process i from another process j, and uniform means that the number of processes in the system is not known in advance by the processes, i.e. it is not hardcoded in the algorithm [AW04]. To circumvent this impossibility, the symmetry, i.e. the fact that all processes have the same local state, which makes a leader election impossible [Ray13], needs to be broken. To this end, the literature proposes two solutions:

1. Non-anonymous networks, where each process is given a unique identifier that can be used to break the symmetry.
2. Randomization, where processes break the symmetry by drawing random values.

This thesis focuses on the first solution, called non-anonymous, where nodes have unique identifiers in the system but initially do not know the identities of the other nodes.

At the beginning of the leader election algorithm, all nodes in the system are unaware of which node is the leader [Hal15]. At the end of the election, exactly one correct node has been elected as the leader, and every correct node throughout the network recognizes this same unique node as the leader. The number of nodes in the system can be known or unknown to the algorithm, as seen in Section 2.6.

Formally, the classical leader election problem has the two following properties [MWV00; AW04]:

1. Uniqueness: exactly one correct node is elected as the leader.
2. Agreement: every correct node eventually knows the identity of this unique leader.

During the execution of a distributed system, failures may arise and, if the system is dynamic, processes may join and leave the system over time, as seen in Section 2.6.2. Therefore, the system may be partitioned into multiple connected components. Malpani et al. [MWV00] thus modify the definition of a classical leader election to take multiple components into account: every connected component of the system has a unique leader.

In a classical leader election algorithm, during the election, nodes are aware that they do not currently know the identity of the leader and are therefore in an unstable state. The end of this state can be indicated by a done flag, as suggested by Raynal [Ray13], or by returning from the election procedure, as in the algorithm of Vasudevan et al. [VKT04].
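To make this concrete, the following sketch illustrates a classical election in a simple setting: a static, connected network operating in synchronous rounds, where the node with the highest identifier is elected and every node raises a done flag once a round brings it no new information. This is a generic illustration written in Python, not the algorithm of Raynal [Ray13] or of Vasudevan et al. [VKT04].

    # Minimal sketch of a classical leader election in a static, connected,
    # synchronous network: the node with the highest identifier is elected.
    # Generic illustration only, not one of the cited algorithms.

    def elect_leader(adjacency: dict[int, set[int]]) -> dict[int, int]:
        """adjacency maps each node identifier to its neighbors' identifiers."""
        # Initially, every node only knows its own identifier and is not done.
        best = {node: node for node in adjacency}
        done = {node: False for node in adjacency}

        while not all(done.values()):
            # One synchronous round: every node sends the best identifier it
            # knows to all its neighbors, then keeps the maximum it received.
            received = {node: [best[nbr] for nbr in adjacency[node]]
                        for node in adjacency}
            for node in adjacency:
                new_best = max([best[node]] + received[node])
                # The done flag signals a stable state: the last round brought
                # no new information to this node.
                done[node] = (new_best == best[node])
                best[node] = new_best

        return best  # every node ends up with the same (maximum) identifier

    if __name__ == "__main__":
        # A small path topology 2 -- 1 -- 3: node 3 is elected everywhere.
        print(elect_leader({1: {2, 3}, 2: {1}, 3: {1}}))  # {1: 3, 2: 3, 3: 3}

Once all done flags are raised, the election has terminated and all nodes agree on the same leader, matching the uniqueness and agreement properties stated above.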

2.9.2 Eventual Leader Election

Eventual leader election is a key component of many fault-tolerant services in asynchronous distributed systems. An eventual leader election only requires the uniqueness property of the elected leader to be eventually satisfied for all correct nodes in the system. Several consensus algorithms, such as Paxos [Lam98] or Raft [OO14], adopt a leader-based approach: they rely on an eventual leader election service, also known as the Ω failure detector [CHT96]. Consensus is a fundamental problem of distributed computing [Pel90], used as a building block by many other problems in the literature, such as state machine replication or atomic broadcast.

Problem Definition

A first definition of the eventual leader election problem is given. Then, a second definition, based on the Ω failure detector, is presented. Finally, this second definition is extended to asynchronous dynamic systems.

Formally, an eventual leader election algorithm has the two following properties:

1. Eventual agreement: there is a time after which all correct nodes trust the same node as the leader.
2. Validity: this eventually elected leader is a correct node.

The Ω failure detector, introduced by Chandra et al. in 1996 [CHT96] and considering a static distributed system with reliable communication links and known membership, satisfies the following property [Cal15]: there is a time after which every correct process always trusts the same correct process.

In a dynamic system, as processes can join and leave the system, the size of the system may increase or decrease over time. Therefore, Larrea et al. [Lar+12] have defined the Dynamic Omega failure detector class, denoted ΔΩ, which enables eventual leader election in an asynchronous dynamic system (denoted Δ𝒜𝒮). Assuming that there is at least one process in the system at any time, ΔΩ has two properties, denoted EL_NI and EL_ND and formalized in [Lar+12].

Note that, by definition, Ω ⊆ ΔΩ [Cal15].

According to Larrea et al. [Lar+12], and considering dynamic systems (denoted Δ𝒮), the class ΔΩ of failure detectors includes all failure detectors that satisfy both properties EL_NI and EL_ND.

Note that the dynamics of the system may imply that no process stays "long enough" in the system, in which case it is possible that no common process will ever be elected [Lar+12]. Therefore, a leader, which can be temporary, is only required to be elected when the size of the system no longer increases or decreases during a "long enough" period of time. The two properties EL_NI and EL_ND capture this requirement [Lar+12].

Implementing Eventual Leader Election

The eventual leader election problem can be solved by implementing an eventual leader service, also called a leader oracle [FJR06]. Such a service provides the processes with a primitive, denoted Leader(), that [Ray07]:

1. Returns the identity of a process of the system each time it is called.
2. Ensures that there exists a time after which it always returns the identity of the same correct process.

As there is no knowledge of when the leader is elected, several leaders can coexist at a given time t, i.e. two processes may consider different processes as their leader; eventually, however, the same correct process is elected as the leader and its identity is known by all correct nodes. The Ω failure detector provides such a Leader() primitive satisfying these properties [CHT96].
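As an illustration of how such an oracle is used, the following sketch shows a process that acts as coordinator only while the oracle designates it; oracle.leader() is a hypothetical primitive assumed to follow the contract above.

    import time

    def coordinator_loop(my_id: int, oracle, act_as_coordinator) -> None:
        """Generic use of a leader oracle: the process takes the coordinator
        role only while the oracle designates it. oracle.leader() is a
        hypothetical call that always returns some process identity and, after
        some unknown time, always returns the same correct process."""
        while True:
            if oracle.leader() == my_id:
                # Before the oracle stabilizes, several processes may act as
                # coordinator at the same time; the algorithm built on top must
                # keep its safety properties nonetheless (see indulgence below).
                act_as_coordinator()
            time.sleep(0.1)  # query the oracle periodically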

A distributed algorithm based on Ω is indulgent if it never produces incorrect outputs, i.e. it never violates its safety properties, regardless of the behavior of Ω [GR04; GL08; Lar+12]. When Ω behaves correctly, i.e. its behavior corresponds to its specification, the algorithm also guarantees liveness.

Ω cannot be implemented in purely asynchronous distributed systems prone to process crashes; otherwise, it would allow solving consensus, which is proven impossible in such systems [FLP85; MRT04]. The literature provides two main approaches to implement Ω:

1. Timer-based: it relies on additional synchrony assumptions, where some or all links are eventually synchronous [LFA00; Agu+04]. Other assumptions are usually included as well, such as the maximum number of processes that can crash, the number of eventually synchronous links, etc. A minimal sketch of this approach is given after this list.
2. Time-free: it relies on a property of the message exchange pattern based on query-response, with a maximum number of processes that can crash [MMR03; Ara+13]. It assumes that the responses from some processes eventually and permanently arrive among the first ones.
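To illustrate the timer-based approach, the following sketch outlines a simple heartbeat-based eventual leader oracle for a static set of processes; it is a didactic simplification under the assumptions stated in the comments, not one of the cited algorithms. Each process periodically broadcasts a heartbeat, suspects processes whose heartbeats time out, increases a timeout after each false suspicion, and trusts the smallest non-suspected identifier.

    import time

    class TimerBasedOmega:
        """Didactic sketch of a timer-based eventual leader oracle (not one of
        the cited algorithms). It assumes a static, known membership and a
        broadcast callback used to send heartbeats; on_heartbeat must be called
        by the network layer whenever a heartbeat is received."""

        def __init__(self, my_id: int, processes: list, broadcast):
            self.my_id = my_id
            self.processes = sorted(processes)
            self.broadcast = broadcast
            self.timeout = {p: 1.0 for p in self.processes}  # per-process timeout (s)
            self.last_heartbeat = {p: time.time() for p in self.processes}
            self.suspected = set()

        def tick(self):
            """Called periodically: send a heartbeat and update suspicions."""
            self.broadcast(("HEARTBEAT", self.my_id))
            now = time.time()
            for p in self.processes:
                if p != self.my_id and now - self.last_heartbeat[p] > self.timeout[p]:
                    self.suspected.add(p)

        def on_heartbeat(self, sender):
            """A heartbeat arrived: stop suspecting the sender and, if it was
            wrongly suspected, increase its timeout so that false suspicions
            eventually stop once the links behave in a timely manner."""
            self.last_heartbeat[sender] = time.time()
            if sender in self.suspected:
                self.suspected.discard(sender)
                self.timeout[sender] += 1.0

        def leader(self):
            """Leader() primitive: the smallest identifier not currently
            suspected. A crashed process stops sending heartbeats and remains
            suspected, so eventually all correct processes trust the same
            correct process."""
            alive = [p for p in self.processes if p not in self.suspected]
            return min(alive) if alive else self.my_id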

Many algorithms use the maximum number of faults, i.e. the number of processes that can crash, denoted f, with 1 ≤ f < n, where n = |Π| is the number of processes [MOZ05; FJR06; Hut+08]. An important result by Aguilera et al. [Agu+04] shows that implementing the Ω failure detector in a partially synchronous system with n processes and up to f process crashes requires the existence of a correct process, called an f-source, whose identity does not have to be known and which has f outgoing links that are eventually timely. Ω can be implemented as soon as there is at least one correct f-source; in particular, if f = 1, a single eventually timely link suffices.

Leader-Based Consensus

Ω allows solving the consensus problem under the weakest assumptions on process failures, provided that a majority of processes are correct. In the consensus problem, each process proposes a value, and all correct processes should eventually decide on a common value among the proposed ones [FLP85; CT96].

Formally, the consensus problem is defined by the following properties:

1. Termination: every correct process eventually decides a value.
2. Agreement: no two correct processes decide different values.
3. Validity: a decided value is a value proposed by some process.

Fischer et al. [FLP85] proved that it is impossible to deterministically achieve consensus in a completely asynchronous system if at least one node is prone to crash failures, due to the inherent difficulty of determining whether a process has crashed or is merely slow to reply or compute.

Several consensus algorithms, such as Paxos [Lam98] and Raft [OO14], use a leader election to solve consensus.

Lamport introduced Paxos in 1989 [Lam98]; it relies on an eventual leader election to make progress while guaranteeing its safety properties under weak assumptions. The uniqueness property of the leader is required for the protocol to make progress; otherwise, two processes believing they are the leader may stall the protocol by continuously issuing conflicting proposals. Once a single leader is elected, it is the only process that issues proposals.

In 2014, Ongaro and Ousterhout introduced Raft [OO14], a leader-based consensus algorithm that uses randomized timers to elect a leader, which is responsible for replicating the log to the follower nodes. A leader election is triggered at the initialization of the algorithm, when the existing leader fails or disconnects, or when a follower receives no heartbeat from the leader before its timeout expires. Note that Raft, like other leader-based consensus algorithms, is not tolerant to Byzantine faults, as the nodes trust the elected leader.
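As an illustration of these randomized timers, the following sketch outlines the follower-side timeout logic only; start_election is a hypothetical callback that would increment the term and send RequestVote messages, and log replication as well as vote counting are omitted.

    import random
    import threading

    class RaftElectionTimer:
        """Sketch of Raft's randomized election timeout on the follower side
        (simplified). start_election is a hypothetical callback that would
        increment the current term and request votes from the other nodes."""

        def __init__(self, start_election, low=0.15, high=0.30):
            self.start_election = start_election
            self.low, self.high = low, high  # timeout range, in seconds
            self.timer = None
            self.reset()                     # arm the timer at start-up

        def reset(self):
            """Called at start-up and whenever a heartbeat (AppendEntries) from
            the current leader arrives: re-arm a randomized timeout. If it ever
            fires, the follower becomes a candidate and starts an election.
            Randomizing the delay makes split votes between candidates unlikely."""
            if self.timer is not None:
                self.timer.cancel()
            delay = random.uniform(self.low, self.high)
            self.timer = threading.Timer(delay, self.start_election)
            self.timer.daemon = True
            self.timer.start()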

