Eventual Leader Election Algorithms

As presented in Section 2.9.2.0, Chandra et al. [CHT96] introduced in 1996 the Ω failure detector, which provides a primitive Leader() satisfying the following property: there is a time after which all correct processes trust the same correct process p, i.e. the process returned by the Leader() primitive.

As Ω cannot be implemented in an asynchronous distributed system where process crashes arise (otherwise it would solve the consensus in such system, which is known to be impossible as seen in Section 2.9.2.0), any asynchronous system has to be enriched with some assumptions in order to implement Ω [Lar⁺12].

Note that an algorithm is communication efficient if after some time, the eventual leader is the only process to send messages to all other processes, forever [Agu⁺01].

3.2.1 Static Systems

Eventual leader election algorithms designed for static networks can be divided in two categories, known and unknown networks, based on assumptions about network knowledge, as presented in Section 2.6. The literature contains two main approaches: timer-based and time-free as defined in Section 2.9.2.0.

Known Networks

Timer-based. Chandra et al. [CT96], as well as Larrea et al. [LFA00], proposed the first Ω solutions by considering a fully-connected network and reliable eventually timely links.

Aguilera et al. [Agu⁺01] presented in 2001 a weaker condition on channel reliability and synchrony, by relaxing the need of time constraints on all links. In their algorithm, the authors assume that there is at least one process whose links are eventually timely with all the other processes of the system. They later proposed, in 2003, an algorithm [Agu⁺03] implementing Ω in weak systems, where all processes may be arbitrarily slow or may crash, except for some timely processes s whose identities are not known. Only the output links of s are eventually timely, and all other links can be arbitrarily slow and lossy. Then in 2004, the authors presented an article where there exist some correct processes, whose identities are not known, with f outgoing links that are eventually timely [Agu⁺04]. They show that when f = 1, it is sufficient to have only one eventually timely link to implement Ω and solve consensus. The authors also give a simple communication efficient algorithm that implements Ω for systems with up to f crashes, and with at least one correct ◇f − source, i.e. a correct process with eventually f outgoing links that are timely.

Malkhi et al. [MOZ05] proposed an alternative solution without having any eventual timely links, by considering eventually accessible links. Their algorithm considers that eventually one process can send messages such that every message obtains f timely responses, denoted ◇f −accessible. This property implies that the f responders do not need to be fixed, and may change from one message to another, i.e. to vary in time. Compared to Aguilera et al. [Agu⁺04], where in ◇f − source, the set of f links is fixed throughout the execution, Malkhi et al. algorithm presents a weaker assumption with ◇f − accessible that allows the f links to vary in time.

By unifying the assumptions made in [Agu⁺03], [Agu⁺04] and [MOZ05], an even weaker model was proposed by Hutle [Hut⁺08] in 2008. The authors show that Ω can also be implemented in a system with at least one process with f outgoing moving eventually timely links (moving means that the set of neighbors may change over time), assuming either unicast or broadcast steps. They assume a model of a fully-connected network where processes are partially synchronous. Their model requires at least one correct process that is a ◇moving − f − source, i.e. it must have timely outgoing links with an unknown delay bound to a moving set of f receivers at any time.

As seen in Section 2.3, in some systems, a crashed process can recover with the same identifier and goes on executing. The algorithm of Fernández-Campusano et al. [Fer⁺17] implements Ω in a crash-recovery and omissive system. They consider a totally ordered set of n processes which are known by each node, and nodes have access to a stable storage. Every pair of nodes is connected by two unidirectional communication links, one in each direction. Each node elects as leader the node with the smallest penalty value among the nodes that communicate with a majority of processes. A node increases its penalty value when it recovers after a crash, or if it is not well connected with a majority of processes. Each node keeps track of its communication with every other node by periodically exchanging messages, which includes its penalty value, its current leader, and information about its connectivity with the receiver node.

Time-free. Mostéfaoui et al. [MMR03] use a query-response mechanism, which assumes that the responses from some processes to a query arrive among the (n − f) first ones. The authors show that the algorithm implements the ♢𝒮 failure detector, which is known to be equivalent to Ω when the membership is known. In [MRT06], Mostéfaoui et al. extend the previous algorithm to implement Ω, considering a star communication structure if the following property is satisfied: there is a correct process p and a set Q of f processes q (p

Q, and processes in Q can crash), such that eventually, either 1) each time q broadcasts a query, q receives a response from p among the (n−f) first responses to this query (and such a first response is called a winning response), or 2) the channel from p to q is timely.

Unknown Networks

Other works aim to implement Ω in unknown networks and look for models with the weakest possible assumptions on the knowledge and communication graph. They share a common assumption on reachability communication between every pair of correct processes. The presented works consider a timer-based approach, as defined in Section 2.9.2.0.

The work of Jimenez et al. [JAF06] shows that it is possible to implement Ω without knowledge of the system membership. To this end, they present an algorithm implementing a Ω failure detector, which requires minimal reliability and synchrony assumptions in systems whose links are only of two types: either eventually timely (where messages are received by time t + Δ after an unknown GST, with Δ being an unknown bound) or lossy asynchronous (where messages can be lost or arbitrarily delayed), as seen in Section 2.4.

Two algorithms were proposed by Fernandez et al. [FJR06; AJR10] to implement Ω with weak assumptions on the initial knowledge of each process and the behavior of the underlying network. The first one considers a partial unknown network, where each process knows the lower bound α on the number of correct processes (α = n − f). This algorithm assumes fair-lossy links and a strongly connected graph, where there is a correct process connected to f − c other correct processes through eventually timely paths (with c being the actual number of crashes in the considered run, and eventually timely paths are paths made up of correct processes and eventually timely links). Note that this first algorithm is not communication-efficient, as each correct process has to send messages forever by fair-lossy links (one in each direction). The second algorithm considers an unknown network with a complete communication graph, where each pair of correct processes is connected by fair-lossy links (one in each direction). It also assumes that there is a correct process with output links to every correct process that are eventually timely. The authors also present an important impossibility result which consists of a lower bound theorem: in a system where processes know neither α (a lower bound on the number of correct processes) nor c (a lower bound on the actual number of crashes) in their initial knowledge, there is no eventual leader election algorithm with less than n − c − 1 eventually timely links [AJR10].

Martín et al. [MLJ09] proposed some algorithms in 2009 to implement Ω in systems not necessarily fully connected, where two algorithms of them assume an unknown membership. The system does not require that every pair of processes is connected by a direct communication link, and some links can be lossy asynchronous. The first algorithm assumes that eventually all processes are reachable timely from the process that crashes and recovers a minimum number of times. The second algorithm assumes that all processes are eventually reachable timely from some correct process. The eventual leader is the process with the lowest identifier in the set of processes that no longer crashes after some time.

3.2.2 Dynamic Systems

The eventual leader election problem in dynamic systems has been studied in the literature, particularly for MANET networks. Since this thesis proposes eventual leader election algorithms for mobile networks, this section aims to discuss some related work organized following the approach used by the related algorithms: timer-based or time-free.

Timer-based

The algorithm presented by Larrea et al. [Lar⁺12] elects the node that has been in the system for the longest period of time, i.e. the node that joined the system first and has not yet left nor crashed. Each node has a local clock and the time when a node joins the system is timestamped using its current local clock value. If two nodes have the same timestamp, the lower identifier breaks the tie. Clocks are not required to be synchronized, but the local clock value of a node joining the system should be as large as the clock value of the nodes already in the system. A new election is started when a timer related to the current leader expires. The authors assume that communication links are eventually timely among nodes in the system [Agu⁺08], and their algorithm is communication-efficient. The proposed algorithm is of class ΔΩ and satisfies the EL-NI and EL-ND properties introduced in Section 2.9.2.0, i.e. no more nodes join or leave the system during long enough periods of time.

Similarly to the previous algorithm, the algorithm of Gómez-Calzado et al. [Góm⁺13] uses the timer-based approach and elects as the leader the oldest node of the connected component with the highest identifier. Each node maintains a set with the identity of the nodes that belong to its component. When a node does not belong to any component, it periodically broadcasts join messages. On the other hand, if it is the leader of a component, it sends a leader message with the size of its component. When a node receives a join message, it adds the new node to its component, and propagates this new information over the network. When it receives a leader message, it changes its leader if the received component size is higher than its current leader component size (when components merge, for example), and propagates the result over the network. Communication channels are bidirectional, and since a minimum stability of the graph is required, they use a notion of link duration, i.e. a time interval that ensures communication among nodes. Periodically, all links act as eventually timely channels. Nodes are mobile and the system alternates periods of good and bad behaviors. During a good period, i.e. in a stable and long enough connected interval, the communication paths guarantee that graph connectivity corresponds to a spanning tree. The authors also consider the EL-NI and EL-ND properties which guarantee stability of the system after some time, extending them to tolerate nodes mobility, i.e. graph partitioning and merging. To this end, they formally define Mobile Dynamic Omega, denoted Δ^∗Ω. In short, Δ^∗Ω requires that eventually and for a sufficiently long time, there is a bidirectional path between the leader and the rest of nodes in case of fragmentation of the network or merging of multiple partitions (which then act as independent networks). They also introduced a formalism with a framework to classify mobile and dynamic distributed systems [Cal15].

Tucci-Piergiovanni et al. [TB10] consider an unknown system where nodes are up or down, with a bounded number of concurrently up nodes. Nodes have a local oracle called HB^∗ which provides a list of up nodes, used to implement Ω. HB^∗ sorts nodes according to a sequence number of heartbeats, which are periodically sent by up nodes. Nodes that are eventually and permanently up, i.e. the oldest nodes, are in the first positions of the sorted list. HB^∗ permanently exchange the set of the first nodes of their respective lists, updating them. The leader is the node with the lowest identifier among the b first nodes in the list. The authors do not consider any specific communication structure, assuming unknown but finite bounds on message losses and message delay. Network partitions are temporary. Each node periodically sends heartbeats with its identifier, so that the other up nodes consider it as a participant of the system. An infinite number of nodes may join and leave over time. Each node has a local clock, which is not synchronized among each other nodes.

Melit et al. [MB12] proposed an algorithm that also elects the node which has the highest priority value among all nodes within its connected component. The lowest node identifier is used to break ties in case of equal priority value. A timer is used to detect loss of communication with the leader, i.e. when no more leader messages are received. At the expiration of the timer of a node, the node elects itself as the new leader and starts sending leader message periodically. Links are either fair-lossy or eventual timely. The algorithm uses the flooding approach for the leader messages, which is initiated by the leader and contains the identifier and the priority value of the leader. Nodes forward received leader messages to all direct neighbors, except the sender node. A finite number of topological changes can take place, i.e. after a time t, the topology does not change anymore and becomes static.

Time-free

Mostefaoui et al. [Mos⁺05] extends the algorithm for static systems [MRT04], for dynamic systems. The system can have infinitely many nodes, but at each run has only a finite number of nodes. Nodes are asynchronous, i.e. there is no bound on the execution of a computation step, and they enter the system by executing a join() operation that provides them an identity. The system is characterized by a succession of unstable and stable periods, where progress can only be guaranteed during the stable periods if they last long enough. The set of nodes which are correct after entering the system, i.e. neither crash nor leave, is called stable, and the elected leader is a stable node. The following progress condition must be satisfied: |stable| > α, where α = n − f. The leader is the node with the lowest identifier among a set of trusted nodes. This set of trusted nodes eventually contains all nodes of the component that replied among the α first response to queries. Each node maintains a logical date and a set of trusted node identifiers, such that eventually, the set of all nodes have the same value, hence allowing each node to elect the same leader from its set. Nodes use gossiping to disseminate their set of trusted nodes as well as a logical date defining the age of set. This logical date indicates if the received set of trusted nodes is more recent than the current one. A query-response mechanism is used to get knowledge about nodes in the system. Upon receiving α responses, the query terminates.

Arantes et al. [Ara⁺13] propose a time-free algorithm for unknown network where nodes can move, fail by crashing, join and leave the system. The authors consider the finite arrival model, i.e. the network is a dynamic system composed of infinitely many mobile nodes but each run consists of a finite set of n nodes. Communication channels are fair-lossy and the dynamics of the network is modeled by using the recurrent connectivity class of Time-Varying Graph (TVG) that ensures that at any point in time the communication graph remains connected over time [Cas⁺11]. For diffusing information, the algorithm uses local query-response mechanism [MMR03]: at each query-response round, node p_i systematically broadcasts a query message to the nodes in its neighborhood until it possibly crashes or leaves the system. When p_i receives α_i messages, the query-response terminates. α_i is defined locally as a function of the expected number of correct known neighbors with whom p_i may communicate at the time t in which the query is issued. The algorithm elects the leader based on a punishment procedure and on the periodic exchange of query-response messages. If a neighbor p_j of p_i does not respond within the α_i responses, p_i punishes it by incrementing a counter associated to it. The algorithm thus will eventually elect a correct process that has the smallest punish counter. To ensure that all the nodes will elect the same leader, query messages of p_i carry information about the view it has of the system and the values of its process punished counters. In order to tolerate mobility of nodes and avoid false suspicions in case of mobility, messages are timestamped: as soon as p_i gets the information (by the contents of a received message) that another node has received a message from p_j with a greater timestamp, p_i stops punishing p_j, which was falsely suspected as faulty by p_i. To ensure eventual stability of the algorithm, the system assumes both the Stabilized Responsiveness Property (𝒮ℛ𝒫), which defines the ability of a correct process to eventually always reply to a query sent among the first processes, and the Stable Termination Property (𝒮at𝒫), which guarantees that information from/to a process is going to be sent/received to/from at least one correct other process in its neighborhood.

1 Introduction
1.1 Contributions
1.1.1 Topology Aware Leader Election Algorithm for Dynamic Networks
1.1.2 Centrality-Based Eventual Leader Election in Dynamic Networks
1.2 Manuscript Organization
1.3 Publications
1.3.1 Articles in International Conferences
1.3.2 Articles in National Conferences
2 Background
2.1 Properties of Distributed Algorithms
2.2 Timing Models
2.3 Process Failures
2.4 Communication Channels
2.5 Failures of Communication Channels
2.6 Distributed Systems
2.6.1 Static Systems
2.6.2 Dynamic Systems
2.7 Centralities
2.8 Messages Dissemination
2.9 Leader Election
2.9.1 Classical Leader Election
2.9.2 Eventual Leader Election
2.10 Conclusion
3 Related Work
3.1 Classical Leader Election Algorithms
3.1.1 Static Systems
3.1.2 Dynamic Systems
3.2 Eventual Leader Election Algorithms
3.2.1 Static Systems
3.2.2 Dynamic Systems
3.3 Conclusion
4 Topology Aware Leader Election Algorithm for Dynamic Networks
4.1 System Model and Assumptions
4.1.1 Node states and failures
4.1.2 Communication graph
4.1.3 Channels
4.1.4 Membership and nodes identity
4.2 Topology Aware Leader Election Algorithm
4.2.1 Pseudo-code
4.2.2 Data structures, variables, and messages (lines 1 to 6)
4.2.3 Initialization (lines 7 to 11)
4.2.4 Periodic updates task (lines 12 to 16)
4.2.5 Connection (lines 20 to 23)
4.2.6 Disconnection (lines 24 to 27)
4.2.7 Knowledge reception (lines 28 to 38)
4.2.8 Updates reception (lines 39 to 53)
4.2.9 Pending updates (lines 54 to 65)
4.2.10 Leader election (lines 17 to 19)
4.2.11 Execution examples
4.3 Simulation Environment
4.3.1 Algorithms
4.3.2 Algorithms Settings
4.3.3 Mobility Models
4.4 Evaluation
4.4.1 Metrics
4.4.2 Instability
4.4.3 Number of messages sent per second
4.4.4 Path to the leader
4.4.5 Fault injection
4.5 Conclusion
5 Centrality-Based Eventual Leader Election in Dynamic Networks
5.1 System Model and Assumptions
5.1.1 Node states and failures
5.1.2 Communication graph
5.1.3 Channels
5.1.4 Membership and nodes identity
5.2 Centrality-Based Eventual Leader Election Algorithm
5.2.1 Pseudo-code
5.2.2 Data structures, messages, and variables (lines 1 to 4)
5.2.3 Initialization (lines 5 to 7)
5.2.4 Node connection (lines 8 to 17)
5.2.5 Node disconnection (lines 18 to 23)
5.2.6 Knowledge update (lines 24 to 34)
5.2.7 Neighbors update (lines 35 to 41)
5.2.8 Information propagation (lines 42 to 47)
5.2.9 Leader election (lines 48 to 52)
5.3 Simulation Environment
5.3.1 Algorithms Settings
5.3.2 Mobility Models
5.4 Evaluation
5.4.1 Metrics
5.4.2 Average number of messages sent per second per node
5.4.3 Average of the median path to the leader
5.4.4 Instability
5.4.5 Focusing on the 60 meters range over time
5.4.6 A comparative analysis with Topology Aware
5.5 Conclusion
6 Conclusion and Future Work
6.1 Contributions
6.2 Future Directions
A Appendix
A.1 Energy consumption per node
A.1.1 Simulation environment
A.1.2 Algorithms settings
A.1.3 Mobility Models
A.1.4 Metric
A.1.5 Performance Results