3.1 Classical Leader Election Algorithms

The classical leader election problem defined in Section 2.9.1 has the two following properties: all processes agree on the identity of a unique leader (agreement), and a leader is eventually elected (termination).

In this section, the presented works assume a unique identifier for every process in the system; n and e denote the number of processes and the number of edges in the network, respectively.

3.1.1 Static Systems

There exist several works in the literature on the classical leader election problem, which essentially differ in the topologies they consider and in their complexity. Usually, they are extrema-finding, i.e. the highest or lowest process identifier is used as the election criterion.

As a reminder, the complexity of an algorithm is expressed using the mathematical Big O notation, which describes the limiting behavior of a function as its argument tends towards infinity [Cor+09].
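Stated formally, following the standard definitions of [Cor+09] (c, c1, c2 and n0 below denote positive constants):

```latex
% Standard asymptotic bounds (upper, lower, and tight), following [Cor+09].
\begin{align*}
O(g(n))      &= \{\, f(n) : \exists\, c, n_0 > 0,\ 0 \le f(n) \le c\, g(n) \ \forall n \ge n_0 \,\}\\
\Omega(g(n)) &= \{\, f(n) : \exists\, c, n_0 > 0,\ 0 \le c\, g(n) \le f(n) \ \forall n \ge n_0 \,\}\\
\Theta(g(n)) &= \{\, f(n) : \exists\, c_1, c_2, n_0 > 0,\ 0 \le c_1\, g(n) \le f(n) \le c_2\, g(n) \ \forall n \ge n_0 \,\}
\end{align*}
```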

Note that the Ω(g(n)) complexity should not be confused with the Ω failure detector introduced in Section 2.9.2.

Ring Topologies

The leader election problem has been studied for ring topologies since 1977, when the first solution was proposed by Le Lann [Le 77], requiring O(n^2) messages. In 1979, Chang and Roberts [CR79] improved the algorithm of Le Lann by reducing the number of messages to O(n log n) on average, still with a worst case of O(n^2) messages. Both solutions consider unidirectional communication. In the Chang and Roberts algorithm, a process compares its identifier with the one received in a message from its right neighbor in the ring. If its own identifier is lower than the identifier in the message, the process forwards the message to its left neighbor; otherwise, it discards the message. A process that receives a message carrying its own identifier, i.e. a message that has travelled over the whole ring, is elected as the leader.
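This message rule can be illustrated with a minimal round-based simulation; the sketch below is illustrative only (the function name, the ring order, and the highest-identifier-wins convention are assumptions of this sketch, not the authors' original formulation).

```python
# Minimal sketch of the Chang-Roberts message rule on a unidirectional ring,
# assuming the highest identifier wins; identifiers and ring order are illustrative.
def chang_roberts(ids):
    """Simulate the election given the clockwise list of (unique) identifiers."""
    n = len(ids)
    # Every process initially sends its own identifier to its next neighbor.
    in_transit = [(i, ids[i]) for i in range(n)]  # (last holder, candidate identifier)
    leader = None
    while leader is None:
        next_hop = []
        for sender, candidate in in_transit:
            receiver = (sender + 1) % n
            if candidate == ids[receiver]:
                leader = candidate                     # own identifier came back: elected
            elif candidate > ids[receiver]:
                next_hop.append((receiver, candidate))  # forward the higher identifier
            # otherwise the message is discarded
        in_transit = next_hop
    return leader

assert chang_roberts([3, 7, 2, 9, 5]) == 9
```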

Hirschberg and Sinclair proposed in 1980 [HS80] an algorithm requiring O(n log n) messages in the worst case, by assuming bidirectional communication on the ring, i.e. processes can send messages to both their left and right sides. The algorithm executes in phases. In phase i, processes send messages along paths of length 2^i. If the process identifier is lower than the one received in a message from its left (right) neighbor, it forwards the message to its right (left) neighbor. Otherwise, it drops the message. If the message of a process travels over all the processes of the ring, that process has won the election. Hirschberg and Sinclair also conjectured that any unidirectional solution must be Ω(n^2). However, Dolev et al. [DKR82] and Peterson [Pet82] both showed in 1982 that this conjecture is false, by presenting unidirectional algorithms requiring at most O(n log n) messages. By improving the algorithm of Peterson, Dolev et al. obtained an algorithm using 1.356 n log n + O(n) messages, later improved to 1.271 n log n + O(n) by Higham et al. [HP93], while Pachl et al. [PKR82] proved an average-case lower bound of 0.693 n log n messages for unidirectional algorithms.
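The phase structure can be sketched as follows. This simulation only tracks which candidates survive each phase (a candidate survives phase k if its identifier is maximal within distance 2^k on both sides) rather than reproducing the exact message exchanges; identifiers and the termination shortcut are assumptions of the sketch.

```python
# Sketch of the phase structure of Hirschberg-Sinclair on a bidirectional ring,
# simulated at the level of surviving candidates instead of individual messages.
def hirschberg_sinclair(ids):
    n = len(ids)
    active = set(range(n))
    k = 0
    while True:
        d = 2 ** k
        if d >= n:
            # a probe now travels over the whole ring: only the maximum survives
            return max(ids[i] for i in active)
        survivors = set()
        for i in active:
            neighbourhood = [ids[(i + j) % n] for j in range(-d, d + 1)]
            if ids[i] == max(neighbourhood):   # probes of length 2^k returned on both sides
                survivors.add(i)
        active = survivors
        k += 1

assert hirschberg_sinclair([3, 7, 2, 9, 5]) == 9
```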

Burns [Bur80] and Franklin [Fra82] improved the algorithm of Chang and Roberts. Burns proposed an algorithm that saves messages by alternating the direction in which messages are sent. He also formally defined the model and the problem, giving an Ω(n log n) lower bound for bidirectional communication.

Van Leeuwen et al. proposed in 1987 [LT87] a deterministic algorithm with a message complexity of 1.441 n log n + O(n) on bidirectional, but unoriented, rings of size n. Frederickson et al. [FL84] make a distinction between algorithms doing computation on the values of process identifiers and algorithms only comparing them, in synchronous rings. They presented a comparison-based algorithm matching the Ω(n log n) lower bound on messages. Gafni [Gaf85] improved the time complexity of the Frederickson algorithm in synchronous rings, with two algorithms: a first one with Θ(n) messages and Θ(n 2^n + |T|^2) time, where |T| is the cardinality of the set of process identifiers, and a second one achieving O(n log n) messages and O(α1(log n)|T|) time, where α1() is the functional inverse of the log function.

General Networks and Spanning Trees

The problem of finding a leader is reducible to the problem of finding a spanning tree [Awe87], as any distributed algorithm that constructs a spanning tree can be transformed into an election algorithm [Afe85], where the root vertex of the spanning tree is the leader of the system. Spanning tree algorithms are usually based on the Echo algorithm, which is a wave algorithm for networks of arbitrary topology [Tel00]. In the Echo algorithm, defined by Chang [Cha82] in 1982, an initiator node sends messages to all its neighbors, which forward messages to all their neighbors (except the sender node, i.e. the father), and so on. After receiving a message from all its neighbors, each node sends an echo message back to its father. When the initiator has received an echo message from all its neighbors, it decides.
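A minimal sketch of the forward wave on an arbitrary graph is given below; the graph, node names, and the FIFO processing order are illustrative assumptions, and in a real execution the resulting tree depends on message timing. The echo phase is only summarized in a comment, since the tree itself is the useful result here.

```python
# Minimal sketch of the Echo wave: flood from the initiator, adopt the first
# sender as father; graph and node names are illustrative.
from collections import deque

def echo_spanning_tree(graph, initiator):
    """graph: dict node -> set of neighbours (undirected). Returns each node's father."""
    father = {initiator: None}
    pending = deque([initiator])
    while pending:                              # forward wave: explore messages
        node = pending.popleft()
        for neighbour in graph[node]:
            if neighbour not in father:         # first explore message received fixes the father
                father[neighbour] = node
                pending.append(neighbour)
    # In the full algorithm, every node then echoes back to its father once all
    # its neighbours have answered, and the initiator decides when all echoes arrive.
    return father

graph = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(echo_spanning_tree(graph, "a"))   # e.g. {'a': None, 'b': 'a', 'c': 'a', 'd': 'c'}
```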

In 1977, Spira investigated distributed algorithms to find the Minimum Spanning Tree (MST), based on an asynchronous algorithm of Dalal [Dal77] by which the MST can be found and maintained in a completely distributed manner. Spira then proposed an algorithm using O(n log_2 n + e) messages, where e is the number of edges [Spi77].

Humblet [Hum83] presented in 1983 a distributed algorithm for minimum weight directed spanning trees, taking into account costs (i.e. weights) associated with links in the network, achieving a complexity of O(n^2) in both messages and time. For instance, the cost could represent latency or some physical distance between two processes.

One notable distributed algorithm for minimum weight spanning trees in static networks was proposed by Gallager et al. in 1983 [GHS83]; it requires at most 5n log n + 2e messages, i.e. O(n log n + e), and O(n log n) time in a connected undirected graph. The algorithm of Chin et al. [CT85] for an undirected weighted connected graph improved the time complexity of this algorithm, with 5n log n + 2e messages and O(nG(n)) time, where G(n) is the number of times that the log function must be applied to n to get a result smaller than or equal to 1. Gafni [Gaf85] also presented in 1985 an algorithm for asynchronous general networks with Θ(e + n log n) messages and Θ(n log n) time. Awerbuch [Awe87] proposed in 1987 a distributed algorithm, optimal in both messages and time, to find a MST in asynchronous networks, requiring O(e + n log n) messages and O(n) time. Peleg suggested an alternative to MST in 1990 [Pel90], with an algorithm for general networks inspired by Afek [Afe85] that achieves a complexity of O(de) messages and O(d) time, where d is the diameter of the network.

Garcia-Molina [Gar82] proposed the fault-tolerant bully algorithm for synchronous general networks, which requires O(n^2) messages in the worst case, and in which the process with the highest identifier forces processes with smaller identifiers into accepting it as the leader. Initially, a process p attempts to contact all processes with a higher identifier: if any of these processes responds, process p waits until a process with a higher identifier becomes the new leader. Otherwise, if no process with a higher identifier responds within a time limit t, i.e. they have all failed, process p elects itself as the leader.
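The bully rule can be sketched as follows, assuming a synchronous round in which every contacted process either answers or has crashed. The function name, the alive set, and the shortcut of returning the eventual winner directly (instead of simulating the subsequent message exchanges) are assumptions of this sketch.

```python
# Minimal sketch of the bully rule: contact all higher identifiers; if any of
# them answers, the highest non-crashed identifier will end up as leader,
# otherwise the process elects itself. Identifiers and the alive set are illustrative.
def bully_election(process_id, all_ids, alive):
    """Return the leader from the point of view of `process_id` (assumed alive)."""
    higher = [p for p in all_ids if p > process_id]
    responders = [p for p in higher if p in alive]
    if responders:
        # Some higher process answered: it runs the same rule in turn, so the
        # highest alive identifier eventually announces itself as leader.
        return max(responders)          # shortcut for that eventual outcome
    return process_id                   # nobody higher answered: elect itself

all_ids = [1, 2, 3, 4, 5]
print(bully_election(2, all_ids, alive={1, 2, 3, 4}))  # 4: highest non-crashed identifier
```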

As shown by Afek [Afe85] and Awerbuch [Awe87], Ω(e) is a lower bound to construct a spanning tree in asynchronous general networks (including ring topologies), since an algorithm must send at least one message over each edge to traverse the network. Burns [Bur80] proved an Ω(n log n) lower bound on the worst-case number of messages sent to find a leader in an asynchronous ring. Therefore, Ω(e + n log n) messages are required to construct a spanning tree in an asynchronous general network that solves the election problem [Afe85; Awe87], which is optimal, as observed by Gallager et al. [GHS83].

Complete Networks

Korach et al. [KMZ84] showed in 1984 that leader election in asynchronous complete networks has a lower bound of Ω(n log n) messages, and presented an algorithm requiring 5n log k + O(n) messages, where k is the number of processes starting the algorithm, and O(n log n) time.

Afek et al. [AG85] proposed in 1985 two leader election algorithms for synchronous and asynchronous complete networks, with O(log n) and O(n) time complexity for the synchronous and asynchronous algorithms respectively, and with O(n log n) messages for both algorithms. In synchronous complete networks, the authors also proved a lower bound of Ω(n log n) messages, and that Ω(log n) time is required for any message-optimal synchronous algorithm.


Figure 3.1: Representation of a ring topology, a spanning tree and a complete graph.

The presented algorithms are summarized in Table 3.1.

Table 3.1: Comparison of some classical leader election algorithms in static systems.

Article   Topology          Complexity
[Le 77]   Ring              O(n^2) messages
[CR79]    Ring              O(n log n) messages on average, O(n^2) messages in the worst case
[HS80]    Ring              O(n log n) messages in the worst case
[Bur80]   Ring              Lower bound: Ω(n log n) messages
[Fra82]   Ring              O(n log n) messages in the worst case
[Pet82]   Ring              1.44 n log n + O(n) messages
[DKR82]   Ring              1.356 n log n + O(n) messages
[HP93]    Ring              1.271 n log n + O(n) messages
[PKR82]   Ring              Lower bound (average): 0.693 n log n messages
[FL84]    Ring              O(n log n) messages
[Gaf85]   Ring              Θ(n) messages, Θ(n 2^n + |T|^2) time
[Gaf85]   Ring              O(n log n) messages, O(α1(log n)|T|) time
[Spi77]   MST               O(n log_2 n + e) messages on average
[Hum83]   MST               O(n^2) messages in the worst case, O(n^2) time
[GHS83]   MST               O(n log n + e) messages, O(n log n) time
[CT85]    MST               5n log n + 2e messages, worst case O(nG(n)) time
[Gaf85]   MST               Θ(e + n log n) messages, Θ(n log n) time
[Awe87]   MST               O(e + n log n) messages, O(n) time
[Pel90]   General Networks  O(de) messages, O(d) time
[Gar82]   General Networks  O(n^2) messages in the worst case
[KMZ84]   Complete async.   5n log k + O(n) messages, O(n log n) time
[AG85]    Complete sync.    O(n log n) messages, O(log n) time
[AG85]    Complete async.   O(n log n) messages, O(n) time

3.1.2 Dynamic Systems

As presented in Section 2.6.2, the communication graph of dynamic systems evolves over time since nodes can fail, recover, join or leave the system at an arbitrary time during execution.

In this type of system, network partitions can happen, resulting in disjoint connected components, and algorithms have to handle communication changes.

Election Criterion

Like in static systems, many of the leader election algorithms are extrema-finding, meaning that they use the highest (or lowest) node identifier as the election criterion [Hat+99; MWV00; RAC08]. Others use different criteria to elect a leader, such as the election time [Ing+09; Ing+13], or some feature of the node such as its remaining battery or computation power [VKT04; KW13].

Hatzis et al. [Hat+99] presented two leader election algorithms, where both elect the node with the highest identifier. They solve a stronger problem by specifying that once elected, the leader should also be aware of the size of the system. In the algorithm of Malpani et al. [MWV00], the leader is the node with the lowest identifier of the component.

Rahman et al. [RAC08] proposed an algorithm for static and dynamic systems aiming at reducing the number of leader election rounds and therefore saving energy: each node maintains a list of leaders, and the leader is the node with the highest identifier. Each time a new node joins the system, after recovering from a crash for example, it starts a new leader election round.

Algorithms proposed by Ingram et al. [Ing+09; Ing+13] use clocks to record the time the election took place, and the leader is the process that wins the most recent election.

Vasudevan et al. [VKT04] proposed an algorithm where the election criterion is some value related to the node, i.e. a performance-related characteristic, such as remaining battery life or computation capabilities. The authors suggest the idea of electing as the leader the node with the minimum average distance to other nodes, but no implementation is given.

Similarly, Masum et al. [MA+06] proposed an algorithm for electing a local extremum among the nodes participating in the election, based on an arbitrary value called priority. Each node has a priority indicating its attractiveness to become the leader, which can be a performance-related attribute of the node, such as its battery life or computational capabilities.

By exploiting the structure of a spanning tree, the algorithm of Kim et al. [KW13] elects a centrally positioned leader, according to the average depth of nodes in the tree. They called their centrality measure tree-based centrality and compared it with other centrality measures such as degree, closeness, and betweenness. Compared to the closeness centrality, the tree-based central leader is not always optimal because it depends on the node that initiates the election algorithm.
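As an illustration of the average-depth idea (a hypothetical reading of the measure, not the authors' exact definition), the following sketch computes the average depth of a spanning tree given as father pointers; node names and the tree are illustrative.

```python
# Illustrative sketch: average depth of the nodes of a tree rooted at `root`,
# with the tree represented by father pointers. A more central root yields a
# smaller average depth.
def average_depth(father, root):
    def depth(node):
        d = 0
        while node != root:
            node = father[node]
            d += 1
        return d
    nodes = list(father)
    return sum(depth(n) for n in nodes) / len(nodes)

father = {"a": None, "b": "a", "c": "a", "d": "c"}   # illustrative tree rooted at "a"
print(average_depth(father, "a"))                     # 1.0
```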

Information Spreading

Dynamic classical leader election algorithms rely on different communication structures, which are mainly either a leader-oriented Directed Acyclic Graph (DAG), where each node has a directed path to the leader, or a spanning tree directed towards the initiator node of the election. Some other algorithms use a different communication structure, based on counters for example.

Directed Acyclic Graph. The algorithm proposed by Malpani et al. [MWV00] is based on a routing algorithm for mobile wireless networks called TORA [PC97], which creates a leader-oriented DAG. A mechanism is used to detect network partitions, and nodes that no longer have a path to a destination stop sending unnecessary messages. Each node creates a 6-tuple from the 5-tuple used in TORA, adding the identifier of the current assumed leader of the node's partition. This 6-tuple is modified upon topological changes and is used for the election.

Ingram et al. [Ing+09] improved the algorithm of Malpani et al. [MWV00]; their algorithm requires three waves [Tel00] over all nodes of the system to elect a new leader (two waves to search for a potential leader and one confirmation wave). Like Malpani et al. [MWV00], a leader-oriented DAG structure is used in each connected component, where every node has a directed path to the leader. Leader stability is also studied, in order to avoid new leader elections while a path to the old leader still exists. Their algorithm requires that nodes have perfectly synchronized clocks, which is made possible by using a global time accessible by all nodes in the system.

A more complete version of the previous work was published in 2013 by Ingram et al. [Ing+13], using causal clocks instead of a global clock accessible to all nodes; causal clocks can be implemented with perfect clocks or logical clocks. The performance of the algorithm depends on the type of clocks used to implement causal time, and the authors specify that their algorithm is not correct with approximately synchronized clocks unless they preserve causality.

Spanning Tree. Vasudevan et al. [VKT04] use a wave mechanism to build a spanning tree, where each node sends back to its parent the identifier of the node having the highest value in its subtree. Their algorithm returns a leader identity only once it is accepted by all nodes in the network, and returns no leader the rest of the time. Note that links are assumed to be bidirectional and that heartbeat messages are periodically sent by the leader node. Based on Vasudevan et al. [VKT04], the algorithm of Rahman et al. [RAC08] also builds a spanning tree, where nodes periodically send probe messages to their neighbors and get reply messages from them. As in the Echo algorithm previously presented, an acknowledgement system is used where nodes reply to election messages with ack messages carrying their identifier. The leader sends heartbeat messages every twenty seconds, and after a timeout of six missed heartbeat messages, a new leader election is triggered. Like in Vasudevan et al., the algorithm also assumes that communication links are bidirectional.
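The heartbeat rule can be sketched as follows, assuming the timeout corresponds to six missed heartbeat periods; the class and method names, and the monitoring logic itself, are illustrative assumptions rather than the published protocol.

```python
# Illustrative sketch of the heartbeat rule: the leader emits a heartbeat every
# HEARTBEAT_PERIOD seconds, and a node suspects the leader (triggering a new
# election) after MAX_MISSED consecutive heartbeats have been missed.
import time

HEARTBEAT_PERIOD = 20      # seconds between two heartbeats of the leader
MAX_MISSED = 6             # missed heartbeats tolerated before a new election

class HeartbeatMonitor:
    def __init__(self):
        self.last_heartbeat = time.monotonic()

    def on_heartbeat(self):
        self.last_heartbeat = time.monotonic()

    def leader_suspected(self):
        elapsed = time.monotonic() - self.last_heartbeat
        return elapsed > MAX_MISSED * HEARTBEAT_PERIOD   # time to start a new election

monitor = HeartbeatMonitor()
monitor.on_heartbeat()
print(monitor.leader_suspected())   # False right after a heartbeat
```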

Kim et al. [KW13] exploit a three-wave algorithm to build a spanning tree, used to elect a central leader. Each node periodically broadcasts hello messages in order to check connectivity with its neighbors and to create a neighbor information table. Election messages are used to dynamically build the spanning tree and are propagated when an election is triggered.

Counter-based. Hatzis et al. [Hat+99] presented an algorithm with a different communication structure, where each node maintains a local counter representing the number of other mobile nodes met. When two mobile nodes meet, they exchange their identities: the one with the higher identifier wins and receives the counter of the loser, which is added to its local counter. The loser node changes its state to inactive and no longer responds to messages about the execution of the algorithm. Information is only transmitted when a new node joins a component, thereby reducing the number of broadcasts needed by the algorithm and therefore saving battery power consumed by message transmissions. Each node also keeps a local list of the node identifiers that it has defeated. When two nodes meet, the winner concatenates its local list with the list transmitted by the loser node. That way, the final winner knows the network size and the identifiers of the nodes in the network.
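The pairwise meeting rule can be sketched as follows; the class structure, field names, and identifiers are illustrative assumptions of this sketch.

```python
# Sketch of the pairwise meeting rule of the counter-based approach: the higher
# identifier absorbs the counter and defeated list of the lower one.
class MobileNode:
    def __init__(self, ident):
        self.ident = ident
        self.active = True
        self.counter = 0          # number of other nodes met, directly or transitively
        self.defeated = []        # identifiers this node has defeated

def meet(a, b):
    if not (a.active and b.active):
        return
    winner, loser = (a, b) if a.ident > b.ident else (b, a)
    winner.counter += loser.counter + 1
    winner.defeated += loser.defeated + [loser.ident]
    loser.active = False          # the loser stops taking part in the election

nodes = [MobileNode(i) for i in (3, 7, 5)]
meet(nodes[0], nodes[1])          # 7 defeats 3
meet(nodes[1], nodes[2])          # 7 defeats 5
print(nodes[1].counter, nodes[1].defeated)   # 2 [3, 5]: 2 met nodes + itself = size 3
```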

Other approaches. Masum et al. [MA+06] consider that communication links are reliable, bidirectional, and FIFO. Their algorithm does not rely on a specific communication structure and tolerates intermittent failures, such as link failures, sudden crashes, and recoveries of mobile nodes. A node is considered faulty if the communication links between the node and each of its neighbors have failed, while a node recovery is the recovery of the communication links between the recovering node and its neighbors. Message delivery is only guaranteed if the sender and receiver both remain connected (not partitioned) for the entire duration of the message transfer.

Connectivity Assumption

To handle frequent topology changes, election algorithms must tolerate arbitrary, concurrent changes and should eventually terminate by electing a unique leader within each connected component. Due to the dynamics of the network, it is impossible to guarantee a unique leader at all times, because when a network partition occurs or when two components merge, it takes some time to elect a new leader. In order to satisfy the agreement property of leader election and eventually elect a single leader, algorithms designed for dynamic systems make different assumptions about the connectivity of the nodes of the system.

The two leader election algorithms of Hatzis et al. [Hat+99] are designed to handle dynamic topology changes with mobility of some or all nodes. Both leader election algorithms require that nodes know in advance the type and dimensions of the area in which they move, and nodes need to measure the distance that they cover when moving. The first algorithm might never elect a single leader if nodes never meet to exchange information, while the second one assumes nodes with no sense of orientation that follow random walks, and is based on a Las Vegas algorithm. Nodes have a unique identifier, but a variation of the second algorithm allows anonymous nodes.

Malpani et al. [MWV00] assumed a synchronous system, where executions take place in stages during finite phases. When a partition occurs, the system is separated into two or more components, and the authors consider that any component whose topology remains static long enough will eventually have exactly one leader.

Similarly, Ingram et al. [Ing+09; Ing+13] consider an asynchronous system where messages are delivered in FIFO order through reliable asynchronous communication channels, and where nodes are assumed to be completely reliable. If topology changes cease, then eventually each connected component of the network has a unique leader.

Vasudevan et al. [VKT04] modified the requirement that eventually every connected component has a unique leader, by proposing an algorithm that ensures that, after a finite number of topology changes, eventually each node has the same leader.

Masum et al. [MA+06] assume one or multiple simultaneous topological changes can occur, but each node remains in the network for a sufficiently long time. Any connected component of the network whose topology remains static sufficiently long will eventually have exactly one unique leader.

The algorithm of Rahman et al. [RAC08] does not make the assumption that topology changes eventually stop. Nodes permanently send periodic probe messages and wait for reply messages from their neighbors in the spanning tree, in order to maintain connectivity.

Note that most of the presented works do not specify which mobility model or pattern they use.

The presented algorithms are summarized in Table 3.2.

Table 3.2: Comparison of classical leader election algorithms in dynamic systems.

Article    Election Criterion        Information Spreading      Connectivity Assumption
[Hat+99]   Highest identifier        Counter of met nodes       Nodes need to meet (random walks)
[MWV00]    Lowest identifier         Leader-oriented DAG        Component static long enough
[Ing+09]   Most recent election      Leader-oriented DAG        Topology changes cease
[Ing+13]   Most recent election      Leader-oriented DAG        Topology changes cease
[VKT04]    Highest arbitrary value   Spanning tree with waves   Finite number of topology changes
[MA+06]    Highest arbitrary value   (not specified)            Component static long enough
[RAC08]    Highest identifier        Spanning tree with waves   (not specified)
[KW13]     Tree-based centrality     Spanning tree with waves   (not specified)

