2.3 Process Failures

Processes are prone to failures during the execution of the system. A failure can be defined as a deviation from correct behavior [HT94], i.e., a behavior that does not comply with the specification. More specifically, a fault is a flaw in the system, originating at a node, that may cause unexpected behavior and result in an error. If the error is not handled and propagates through the system, it can lead to a failure. Figure 2.2 illustrates the steps of a failure in a distributed system.


Figure 2.2: Steps of a failure in a distributed system.
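To make the fault, error, and failure terminology concrete, the toy Python sketch below shows a dormant programming fault that, once activated, corrupts the process state (an error); left unhandled, the error becomes an observable failure, i.e., a reply that violates the specification. All names are hypothetical illustrations, not code from this manuscript.

    # Toy illustration of the fault -> error -> failure chain of [HT94].
    class Process:
        def __init__(self):
            self.total = 0  # specification: total == sum of delivered values

        def deliver(self, value):
            self.total += value
            # FAULT: a programming flaw, negative values are added twice.
            # The fault stays dormant until such an input arrives.
            if value < 0:
                self.total += value  # fault activated: the state is now erroneous

        def read_total(self):
            # The unhandled ERROR propagates and becomes a FAILURE: the reply
            # observed by other processes deviates from the specification.
            return self.total

    p = Process()
    p.deliver(5)
    p.deliver(-2)          # activates the fault: internal error
    print(p.read_total())  # prints 1 instead of 3: observable failure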

Failures can stem from flaws in algorithm design and programming (which can be difficult to detect, especially in distributed algorithms, since processes can be at different steps of the execution), from hardware faults, or from malicious attacks. A process that fails during the execution of the system is said to be faulty, whereas a process that never fails during the whole execution of the system is correct.

The literature distinguishes several modes of process failures. In addition, failures can be permanent, where a component stays in the faulty state forever; transient, where a component is in the faulty state for a finite duration; or intermittent, where a component alternates between correct and faulty behavior [Dub11]. Failure modes can be ordered from the weakest to the strongest, i.e., each mode includes the weaker ones and is a subset of the stronger ones [Ber04; GR06; CGR11; Cal15; Cam20].
As shown in Figure 2.3, crash failures are included in omission failures, which are included in crash-recovery failures, which are included in arbitrary failures.


Figure 2.3: Failure modes of a process, by [GR06].
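A minimal sketch of this inclusion hierarchy, assuming a simple integer ranking of the four modes (the helper names are hypothetical and only mirror Figure 2.3):

    # Toy encoding of the failure-mode hierarchy:
    # crash ⊂ omission ⊂ crash-recovery ⊂ arbitrary.
    from enum import IntEnum

    class FailureMode(IntEnum):
        CRASH = 1
        OMISSION = 2
        CRASH_RECOVERY = 3
        ARBITRARY = 4

    def includes(stronger: FailureMode, weaker: FailureMode) -> bool:
        """A stronger failure mode includes every weaker one."""
        return stronger >= weaker

    # An algorithm tolerating arbitrary failures also tolerates crashes:
    assert includes(FailureMode.ARBITRARY, FailureMode.CRASH)
    # The converse does not hold:
    assert not includes(FailureMode.CRASH, FailureMode.OMISSION)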

The different modes of process failure are summarized in Table 2.2. Note that this list is not exhaustive; other failure modes exist, such as fail-safe, fail-stop, and authenticated Byzantine.

Table 2.2: Process failure modes.

Failure              Component   Description
Crash                Process     Permanent halt of the process.
Omission             Process     Temporary halt of the process.
Crash-recovery       Process     Halts, but may recover and resume execution afterwards.
Arbitrary/Byzantine  Process     Arbitrarily omits steps or replies, sets or returns wrong values.
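As a rough illustration of how these modes manifest at the message level, the sketch below simulates one request/reply step of a process under each mode. The drop probability, recovery pattern, and function names are modeling assumptions for illustration only; only the mode names come from Table 2.2.

    # Toy simulation of the four failure modes, applied to a single
    # request/reply interaction.
    import random
    from typing import Optional

    def reply(mode: str, request: int, step: int) -> Optional[int]:
        """Reply of a process in the given failure mode, or None if the
        process does not answer at this step."""
        if mode == "crash":
            return None                     # permanent halt: never answers again
        if mode == "omission":
            if random.random() < 0.3:       # assumed drop probability
                return None                 # temporarily omits the reply
            return request + 1
        if mode == "crash-recovery":
            if step % 4 == 0:               # assumed halt/recovery pattern
                return None                 # halted, but will resume later
            return request + 1
        if mode == "arbitrary":
            return random.randint(-10, 10)  # Byzantine: any value at all
        return request + 1                  # correct process: follows the spec

    for step in range(1, 5):
        print(step, reply("crash-recovery", 41, step))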

