Distributed computing is present everywhere in our world and is at the heart of the information processing of our modern society. From the World Wide Web to Peer-to-Peer applications to blockchains like Ethereum [Woo+14], many applications revealed the importance of distributed computing in our daily life.
A distributed 1 system can be defined 2 as follows [VT17]:
A distributed system is a collection of computing elements that interact with one another in order to achieve a common goal.
Computing elements are autonomous computational entities, each of them has its own local memory. They can be multiple software processes located on the same physical computer, or different hardware devices interconnected together, such as computer networks [VT17; And00]. Computing elements are also called processes or nodes, and to achieve their common goal, they collaborate and coordinate their actions by communicating with each other. There are two main communication paradigms: shared memory [Abr88] and messages passing [Tel00]. This thesis only considers communication by messages passing which uses a communication channel, i.e. a link between two processes of the system on which information is transmitted. Through a communication channel, a process sends and receives messages to/from another process [And00].
Since a distributed system is composed of many interconnected processes, some of them may become faulty during the execution of the system, and consequently, induce failures in the system. A failure is a deviation from the expected behavior of the system. A distributed system is considered reliable if it respects its specification, and if the system is able to provide correct services regardless of failures. Therefore, to run a reliable system in the presence of faults, the distributed system needs to be fault tolerant.
With the ongoing advent of mobile computers such as smartphones, intelligent vehicular, drones or mobile sensors, nodes with mobility compose a dynamic distributed system. These nodes move over a physical area and communicate directly with each other through wireless links. They can also fail as well as join and leave the system. Thus, the system evolves over time and the network is, therefore, highly dynamic.
A common type of dynamic network is the Mobile Ad Hoc Network (MANET), which does not rely on any pre-existing infrastructure to communicate, such as router or access-point. A MANET is a self-configuring network in which nodes are connected without wires and are free to move. Nodes are equipped with a radio module that enables wireless communications and participate in the network by forwarding messages to other nodes.
Many distributed applications and services, such as the distributed database Casandra [LM10] and blockchains like Bitcoin [Nak08], require that processes reach a consensus, by choosing a single value among proposed ones. The consensus problem is one of the most fundamental problems of distributed computing. A consensus algorithm requires that every process proposes a value to other processes in the system, and all non-faulty process eventually agree on a single value among the proposed ones. Consensus is used by many other distributed algorithms in the literature, such as state machine replication or atomic broadcast. However, Fischer, Lynch and Paterson have proved (the well-known FLP theorem [FLP85]) that it is impossible to deterministically achieve consensus in a completely asynchronous system where at least one node is prone to crash failure.
Several existing well-known algorithms in the literature solve consensus in failure-prone distributed systems by using a leader election algorithm. Examples of such algorithm are Paxos [Lam98], used by Google, Amazon and Microsoft, or Raft [OO14]. Also known as the Ω failure detector [CHT96], an eventual leader election allows to deterministically solve the consensus problem with the weakest assumptions on process failures considering a majority of correct processes. Ω provides a primitive called Leader(), which, when invoked, returns the identifier of a process in the system and guarantees that there is a time after which it always returns the identifier of the same correct process, i.e. the leader.
Many leadership protocols were proposed to implement Ω. Most of them consider static distributed systems and assume that the membership of the system is either known in advance [LFA00; Agu+04], or unknown [FJR06; JAF06]. Among the ones that tolerate system dynamics [Góm+13; Ara+13], only a few of them take into account the characteristics and the lack of knowledge of highly dynamic systems. Furthermore, most of these algorithms do not choose the leader according to a topological criterion, i.e. the position of the node in the network, but rather on the highest or lowest node identifier. The topological position of the leader has a strong impact on the performance of algorithms using the leader election service, since the leader must collect information from the other nodes, such as from a majority of processes in the case of consensus. Thus, the average number of hops to reach the leader has a direct impact on the performance of consensus algorithms. A representation of a central leader is given in 1.1.
This thesis studies the eventual leader election problem considering the above described dynamic evolving networks and performance issues of leader-based algorithms.