Electing a leader is one of the fundamental problems in distributed computing [Pel90]. Also called coordinator election or leader finding, a leader election selects a single node among the ones of a distributed system, according to some election criterion.
The leader election problem was proposed for the first time 2 in 1977 by Gérard Le Lann [Le 77], a French Computer Researcher.
In this thesis, two leader election problems are considered, denoted the classical leader election problem (Section 2.9.1), and the eventual leader election problem (Section 2.9.2).
In 1980, Angluin shows that there is no deterministic leader election algorithm in anonymous and uniform networks [Ang80], where anonymous means that processes do not have a unique identifier, so there is no way to distinguish a process i from another process j, and uniform means that the number of processes in the system is not known in advance by processes, i.e. not hardcoded in the algorithm [AW04]. To circumvent this impossibility, the symmetry, i.e. the fact that all processes have the same local state which makes a leader election impossible [Ray13], needs to be broken. Therefore, the literature proposes two solutions:
This thesis focuses on the first solution called non-anonymous, where nodes have unique identifiers in the system, but initially do not have knowledge of the identity of the other nodes in the system.
At the beginning of the leader election algorithm, all nodes in the system are unaware of which node is the leader [Hal15]. At the end of the election, there is exactly one correct node that has been elected as the leader among the set of nodes, and each correct node throughout the network recognizes this same and unique node as the leader. The number of nodes in the system can be known or unknown by the algorithm, as seen in Section 2.6.
Formally, the classical leader election problem has the two following properties [MWV00; AW04]:
During the execution of a distributed system, failures may arise and if the system is dynamic, processes may join and leave the system overtime, as seen in Section 2.6.2. Therefore, the system may be partitioned into multiple connected components. Every connected component of the system has a unique leader. Therefore, Malpani et al. [MWV00] modify the definition of a classical leader election to take into account multiple components:
In a classical leader election algorithm, during the election, nodes are aware that they do not currently have the identify of the leader, and therefore, are in an unstable state. This state can be indicated by a done flag, as suggested by Raynal [Ray13], or by returning ⊥ like in the algorithm of Vasudevan et al. [VKT04].
The eventual leader election is a key component for many fault-tolerant services in asynchronous distributed systems. An eventual leader election considers that the uniqueness property of the elected leader is eventually satisfied for all correct nodes in the system. Several consensus algorithms such as Paxos [Lam98] or Raft [OO14], adopt a leader-based approach. They rely on an eventual leader election service, also known as the Ω failure detector [CHT96]. Consensus is a fundamental problem of distributed computing [Pel90], used by many other problems in the literature, like state machine replication or atomic broadcast.
A first definition of the eventual leader election problem is given. Then, a second definition is presented based on the Ω failure detector. Finally, this second definition is enhanced to tolerate asynchronous dynamic systems.
Formally, an eventual leader election algorithm has the two following properties:
The Ω failure detector introduced by Chandra et al. in 1996 [CHT96], considering a static distributed system with reliable communication links and known membership, satisfies the following property [Cal15]:
In a dynamic system, as processes can join and leave the system, the size of the system may increase or decrease overtime. Therefore, Larrea et al. [Lar+12] have defined the Dynamic Omega failure detector class denoted ΔΩ, which provides an eventual leader election algorithm in an asynchronous dynamic system (denoted Δ𝒜𝒮). ΔΩ have the two following properties, assuming that there is at least one process in the system at any time:
Note that, by definition, Ω ⊂ ΔΩ [Cal15].
According to Larrea et al. [Lar+12] and considering dynamic (denoted Δ𝒮) systems, the class ΔΩ of failure detectors includes all failure detectors which satisfies both properties EL_NI and EL_ND.
Note that the dynamics of the system may imply that no process stays ”long enough” in the system, in which case it is possible that no common process will ever be elected [Lar+12]. Therefore, a leader, which can be temporary, is required to be elected only when the size of the system does no longer increase or decrease, during ”long enough” period of time. The two properties EL_NI and EL_ND satisfy this requirement [Lar+12].
Solving the eventual leader election problem is possible by implementing an eventual leader service, also called a leader oracle [FJR06]. Such a service consists in providing the processes with a primitive denoted Leader() that [Ray07]:
As there is no knowledge of when the leader is elected, several leaders can coexist at time t, such as two processes can have different leaders, but eventually, the same correct process is elected as the leader and its identity is known by all correct nodes. The Ω failure detector provides such a Leader() primitive satisfying these properties [CHT96].
A distributed algorithm based on Ω is indulgent, i.e. it never violates the safety property of the consensus, if the algorithm never produces incorrect outputs regardless of the behavior of Ω [GR04; GL08; Lar+12]. Therefore, if Ω behaves correctly (its behavior corresponds to its specification), the algorithm produces correct outputs.
Ω cannot be implemented in pure asynchronous distributed systems prone to process crashes. Otherwise, it would solve the consensus proven to be impossible in such systems [FLP85; MRT04]. There exist two main approaches to implement Ω in the literature:
Many algorithms use the maximum number of faults, i.e. the number of processes that can crash, denoted f, where 1 ≤ f < n (n being the number of processes n = |Π|) [MOZ05; FJR06; Hut+08]. An important result by Aguilera et al. [Agu+04] shows that implementing the Ω failure detector in a partially synchronous system with n process and up to f process crashes, requires the existence of some correct processes called ♢f-source, (whose identity does not have to be known) with f outgoing links that are eventually timely. If f = 1, implementing Ω requires only one eventual timely link. Ω can be implemented if there is at least one correct ♢f-source (the identity of the ♢f-source does not have to be known).
Ω allows to solve the consensus problem with the weakest assumptions on process failures considering a majority of correct processes. In consensus problems, each process proposes a value, and all correct processes should eventually decide on a common value among the proposed ones [FLP85; CT96].
The following properties are defined:
Fisher et al. [FLP85] proved the impossibility to deterministically achieve consensus in a completely asynchronous system, if at least one node is prone to crash failure, due to the inherent difficulty of determining whether a process has crashed, or is slow to reply/compute.
Several consensus algorithms use a leader election to solve the consensus, such as Paxos [Lam98] and Raft [OO14].
Lamport introduced in 1989 Paxos [Lam98], which uses an eventual leader election to guarantee the safety properties with weak assumptions. The uniqueness property of the leader is required to ensure the protocol to make progress, otherwise, two processes thinking they are leaders may stall the protocol by continuously proposing conflicting updates. Once a single leader is elected, it is the only one that tries to issue proposals.
In 2004, Ongaro and Ousterhout introduced Raft [OO14], a leader-based consensus algorithm which uses randomized timers to elect leaders, responsible for log replication to the follower nodes. A leader election is triggered at the initialization of the algorithm, when the existing leader fails or disconnects, or if no heartbeat is received by the followers of the leader after a timeout. Note that Raft, like other leader-based consensus algorithms, is not tolerant to Byzantine fault, as the nodes trust the elected leader.