Autonomic Web Services Based on Different Adaptive Quasi-Asynchronous Checkpointing Techniques

Companies, organizations and individuals use Web services to build complex business functionalities. Web services must operate properly over the unreliable Internet infrastructure even in the presence of failures. To increase system dependability, organizations, including service providers, adapt their systems to the autonomic computing paradigm. Strategies vary from adopting one to all of the S-CHOP features (self-configuration, self-healing, self-optimization and self-protection). Regarding self-healing, a closely related tool is communication-induced checkpointing (CiC), where a checkpoint contains the state (heap, registers, stack, kernel state) of each process in the system. CiC is based on quasi-synchronous checkpointing, in which processes take checkpoints relying on control information piggybacked on application messages while avoiding dangerous patterns such as Z-paths and Z-cycles; to this end, the system takes forced checkpoints and avoids inconsistent states. Unlike other tools, CiC does not degrade system performance; our proposal does not incur high overhead (as results show), and it has the advantage of being scalable. As we have shown in previous work, CiC can be used to address dependability problems when dealing with Web services, as CiC mechanisms work in a distributed and efficient manner. Therefore, in this work we propose an adaptable and dynamic generation of checkpoints to support fault tolerance. We present an alternative that considers Quality of Service (QoS) criteria and the different impact applications have on it. We propose taking checkpoints dynamically in case of failure or QoS degradation. Experimental results show that our approach significantly reduces the number of checkpoints generated by various well-known tools from the literature.


Introduction
We find distributed systems everywhere (they are ubiquitous) and people use them in their everyday life. Organizations and users benefit from distributed systems because they make it possible to complete tasks that could not otherwise be carried out by a single computer: people all over the world can collaborate on projects to solve difficult tasks, and organizations benefit from one another by sharing resources and building large-scale distributed systems. Distributed systems [...] the number of forced checkpoints further while maintaining system consistency. Detecting degradation problems via QoS monitoring is also carried out in accordance with process or service requirements, including Service Level Agreements (SLAs).
A summary of our contributions is as follows:
• A dynamic and adaptable generation of checkpoints that monitors system performance through non-functional requirements and a fuzzy approach, which reduces the overhead and system overload.
• The design of an autonomic web services (MAPE) control loop that allows an easy implementation of diverse checkpointing techniques.
• The adaptation of diverse checkpointing techniques to offer consistency (checkpoints free of dangerous patterns such as zigzag paths and z-cycles) and non-functional requirements (system evaluation for individual web services).
• A comparative study that shows the performance of different fault tolerance techniques that rely on the most commonly used checkpointing techniques.
We hypothesize that a fault tolerance technique can be used under diverse applications and that, by adopting one technique, many more can easily be adapted for other computer-related areas.
Using web services as a black box containing all relevant MAPE features, we implemented three different checkpointing techniques, namely Delayed Checkpointing Fully Informed (DCFI), credited to Simon et al. [9]; HMNR, also called Fully Informed (FI), presented by Helary et al. [14]; and Fully Informed and Efficient (FINE) of Luo and Manivannan [15]. In order to detect a faulty component of a system, we propose that each web service implements the MAPE control loop individually; doing so can lead to discarding useless checkpoints or to updating the frequency at which they are taken. We evaluate web services QoS at runtime using a fuzzy approach; in particular, we propose a fuzzy-consistency system evaluation (FCSE) function which, as previously reported in [1], identifies useless checkpoints with respect to QoS and system consistency. To the best of our knowledge, this is the first mechanism that combines a fuzzy approach for system assessment purposes with CiC mechanisms. We used ChkSim [16] and Jfuzzylogic (http://jfuzzylogic.sourceforge.net/html/index.html) for simulation, as both are compatible with the Java programming language. We show that our proposal is more efficient than current solutions. To this end, we have compared our approach against three of the most efficient CiC-based solutions reported in the state of the art.
We organized the rest of the paper as follows. In Section 2, we give an overview of related work. Then, in Section 3, we analyze the mathematical model for checkpointing and briefly introduce fuzzy logic. In Section 4, we provide an analysis of the environment and the assumptions upon which our solution builds, and present a case study. Next, in Section 5, we elaborate on the results that support our approach and show its feasibility. Finally, conclusions drawn from this study appear in Section 7, along with suggestions for further work.

Related Work
In this section, we review some of the most efficient fault tolerance techniques that rely on checkpointing. In our previous work [1], we list some of the work that merges autonomic computing and checkpointing techniques (work that addresses some known fault tolerance techniques for Web services and business processes). Besides, in [1], we ensure dependability by merging CiC mechanisms and autonomic computing. In [17], we implemented a fuzzy evaluation model in the MAPE loop to measure system performance without degrading the system. In contrast, in this paper, we show how to generate checkpoints dynamically with less overhead and system overload.
There exist many methods for checkpointing such as synchronous checkpointing, asynchronous checkpointing and quasi-asynchronous checkpointing; we focus on quasi-asynchronous checkpointing. A complete survey can be found in [3], where we give a new taxonomy for business processes that rely on checkpointing mechanisms for both orchestration and choreography.
State-of-the-art literature further classifies quasi-asynchronous checkpointing into communication-induced and index-based protocols, which in turn implement a variant of Lamport's logical clock. One instance is the HMNR protocol [14], also called Fully Informed (FI) [18] because it carries specific information about the causal past of processes. New versions of the FI method appear regularly in the literature, each proposing a more efficient strategy. For example, Tsai introduces the LazyFI approach [19], which applies a lazy strategy to increment FI's logical clocks. Fully Informed and Efficient (FINE) is another FI variant, introduced by Luo and Manivannan [15,20]. In FINE, the authors establish a stronger checkpointing condition using the same control information preserved by FI. An optimized version of FINE, called LazyFINE, applies a lazy strategy based on the work of Luo and Manivannan [21,22]. Finally, Simon et al. [9,12,23] propose another FI variant, which addresses system scalability and is aimed at large-scale systems. Simon et al. reduce the number of forced checkpoints by delaying non-forced checkpoints.
Fault tolerance techniques relying on checkpointing protocols can apply rollback recovery when they detect a system failure. However, these techniques are known to have a high computational cost, and since more and more resource-constrained devices need them, new strategies must evolve to meet these constraints. Checkpointing protocols are compared in terms of the number of forced checkpoints; the fewer forced checkpoints, the better. Therefore, fault tolerance techniques that rely on checkpointing must reduce the message overhead exchanged by each process, because failing to do so can require a large amount of storage space to store checkpoints [3,24]. Currently, there is no optimal checkpointing protocol for all checkpoint and communication patterns, regardless of how each mechanism triggers forced checkpoints [25].

System Model
Distributed systems have specific characteristics: processes communicate by exchanging messages, there is no global time (each process can have its own time), and processes do not share a common memory. A process can be one or more nodes and can consist of a single thread or any number of threads (a process can be a computer, a service, a cell phone, a video game console, etc.). In a distributed computation, $P = \{P_1, P_2, \ldots, P_n\}$ represents a finite set of processes. Communication channels can be unpredictable, presenting finite delays; nevertheless, these channels are considered asynchronous and reliable. In a distributed system, there exist two types of events: internal and external.
An internal event reflects a change in the state of a process, for example, when checkpointing. A checkpoint contains the state of a process, such as the heap, registers, stack, and kernel state. We denote by $C_i^x$ the checkpoints that contain the state of a process $P_i$, and $E_i$ denotes all checkpoints previously taken. External events are those that affect the state of the system globally, for instance, the sending and delivery of messages, where $m$ is a message; $send(m)$ denotes the sending of $m$ by any process $P_i \in P$; and $delivery(P_j, m)$ denotes the delivery of message $m$ to another process $P_j \in P$, where $P_i \neq P_j$. The set of external events associated with the set of messages $M$ is given by the following expression:
$$E_m = \{send(m) : m \in M\} \cup \{delivery(P_j, m) : m \in M \wedge P_j \in P\}$$
where $E_m$ is the set of events associated with $M$. The set of all events in the system, $E$, is:
$$E = E_i \cup E_m$$
Therefore, a distributed computation is modelled by a partially-ordered set $\hat{E} = (E, \rightarrow)$, where $\rightarrow$ denotes Lamport's happened-before relationship [26].

Background
The Happened Before Relation (HBR) determines the causal precedence dependencies for a set of events. The HBR is a strict partial order (i.e., it is transitive, irreflexive and antisymmetric). Lamport defined the HBR in the following way [26]: Definition 1. The HBR, denoted by →, is the smallest relationship on a set of events E that satisfies the following three rules:

1. Let a and b be events in the same process. If a occurs before b, then a → b.
2. If a stands for the event send(m), and b for the event delivery(P i , m), then a → b.
3. If a → b and b → c, then a → c.
In practice, the HBR is expensive to track, since it accounts for the precedence of every pair of events. We mitigate this cost by introducing a stronger relationship that identifies and attaches the minimal amount of control information per message needed to ensure causal ordering.

Immediate Dependency Relation (IDR).
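The IDR is expressed by ↓, and it is the transitive reduction of the HBR. We define the IDR in the following way [27]: Definition 2. Two events, $a, b \in E$, are related by the IDR, $a \downarrow b$, if:
$$a \downarrow b \iff a \rightarrow b \;\wedge\; \forall c \in E,\ \neg(a \rightarrow c \rightarrow b)$$
where $E$ stands for a set of events.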
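The two relations can be illustrated with a small sketch. It assumes a toy run with three processes and two messages; the event names (`s1`, `d1`, ...) and the helper functions are hypothetical and only serve to show how the three HBR rules and the transitive reduction interact.

```python
from itertools import product


def happened_before(events, local_order, messages):
    """Build the HBR as a set of (a, b) pairs, meaning a -> b."""
    hb = set()
    # Rule 1: consecutive events of the same process are ordered.
    for sequence in local_order:
        hb.update(zip(sequence, sequence[1:]))
    # Rule 2: send(m) precedes delivery(P_i, m).
    hb.update(messages)
    # Rule 3: transitive closure (Warshall; the pivot k is the outermost index).
    for k, a, b in product(events, repeat=3):
        if (a, k) in hb and (k, b) in hb:
            hb.add((a, b))
    return hb


def immediate_dependency(events, hb):
    """Keep a -> b only when no event c satisfies a -> c -> b."""
    return {
        (a, b)
        for (a, b) in hb
        if not any((a, c) in hb and (c, b) in hb for c in events)
    }


# Hypothetical run: P1 sends m1 to P2, which then sends m2 to P3.
local_order = [["s1", "e1"], ["d1", "s2"], ["d2"]]
messages = [("s1", "d1"), ("s2", "d2")]
events = [e for sequence in local_order for e in sequence]

hb = happened_before(events, local_order, messages)
idr = immediate_dependency(events, hb)
print(sorted(hb - idr))  # pairs that hold only through transitivity
```

In this run the IDR keeps only the direct dependencies, which is why attaching IDR information to messages can be cheaper than tracking the full HBR.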

Checkpoint and communication pattern (CCP):
A CCP contains all information about the processes' local checkpoints as well as their transitive messages, i.e., incoming and outgoing messages towards other processes. The CCP has the following definition [28]: Definition 3. A CCP is a pair $(\hat{E}, E_i)$, where $\hat{E}$ is a partially-ordered set modelling a distributed computation and $E_i$ is a set of local checkpoints defined on $\hat{E}$. Figure 1 illustrates a CCP and all associated checkpoints, intervals, and exchanged messages. For example, a checkpoint interval is denoted $I_k^x$ ($x \neq 0$), indicating that it belongs to process $P_k$. A checkpoint interval contains the events that happened in $P_k$ between two consecutive checkpoints, from $C_k^{x-1}$ to $C_k^x$ ($x \neq 0$).

How Consistent Global Snapshots (CGS) Are Built
There exist diverse ways to take checkpoints in a distributed system, such as synchronous checkpointing, asynchronous checkpointing and quasi-asynchronous checkpointing, in which processes take checkpoints independently and systems gain the ability to restart their computation from those checkpoints. However, not every set of saved checkpoints is guaranteed to form a consistent global snapshot (CGS). Netzer and Xu [13] gave the following definition: Definition 4. A CGS cannot contain any causally related checkpoints; for any pair of checkpoints $C_i$ and $C_j$ we have:
$$\neg(C_i \rightarrow C_j) \;\wedge\; \neg(C_j \rightarrow C_i)$$
where → is now extended over checkpoints in a conventional manner.
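A minimal sketch of a consistency check follows. It relies on an assumption that is not spelled out in the text: for a global snapshot with one checkpoint per process, the absence of causally related checkpoints is checked through the equivalent absence of orphan messages (messages received before the receiver's checkpoint but sent after the sender's). The `Message` type and the interval numbering (interval $x$ ends at checkpoint $x$) are hypothetical conventions chosen for the example.

```python
from typing import NamedTuple


class Message(NamedTuple):
    sender: str
    send_interval: int   # interval of the sender in which the message is sent
    receiver: str
    recv_interval: int   # interval of the receiver in which it is delivered


def is_consistent(cut, messages):
    """cut maps each process to the index x of its chosen checkpoint.

    An event in interval s happens before checkpoint x when s <= x.
    The cut is consistent when no message is an orphan.
    """
    return not any(
        m.send_interval > cut[m.sender] and m.recv_interval <= cut[m.receiver]
        for m in messages
    )


# Hypothetical pattern: P1 sends m after its checkpoint 1,
# and P2 delivers m before its own checkpoint 1.
pattern = [Message("P1", 2, "P2", 1)]

print(is_consistent({"P1": 1, "P2": 1}, pattern))
print(is_consistent({"P1": 2, "P2": 1}, pattern))
```

The first cut records the delivery of a message whose sending is not recorded, so the two checkpoints are causally related and cannot belong to the same CGS; moving P1 forward to its next checkpoint removes the orphan.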
A CGS must avoid dangerous patterns, as stipulated by Netzer and Xu [13]. A zigzag path (or z-path) generalizes the Happened-Before Relation (HBR) in the following way: a z-path exists from $C_i^x$ to $C_j^y$ if there are messages $m_1, m_2, \ldots, m_l$ such that:
1. Process $P_i$ sends $m_1$ after $C_i^x$,
2. if some process $P_r$ receives $m_k$ ($1 \le k < l$), then $m_{k+1}$ is sent by $P_r$ in the same or at a later checkpoint interval (although $m_{k+1}$ can be sent before or after $m_k$ is received), and
3. Process $P_j$ receives $m_l$ before $C_j^y$.
A causal z-path was defined by Helary et al. [14] in the following way: a z-path $[m_1, m_2, \ldots, m_l]$ is a causal z-path if for each pair of consecutive messages $m_k$ and $m_{k+1}$, $delivery(P_i, m_k) \rightarrow send(m_{k+1})$, with $P_i \in P$. Otherwise, it is a non-causal z-path.
there is a z-path from $C_i^x$ to $C_j^y$
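The z-path conditions can be checked mechanically. The sketch below is a hedged illustration, not the detection procedure of any of the protocols discussed here: it searches the message pattern breadth-first, and it treats a z-cycle as a z-path from a checkpoint to itself, following the usual Netzer and Xu notion. The `Message` type and the interval numbering (interval $x$ ends at checkpoint $x$) are assumptions made for the example.

```python
from collections import deque
from typing import NamedTuple


class Message(NamedTuple):
    sender: str
    send_interval: int
    receiver: str
    recv_interval: int


def has_z_path(messages, src, dst):
    """src = (P_i, x) and dst = (P_j, y) identify checkpoints C_i^x and C_j^y."""
    (p_i, x), (p_j, y) = src, dst
    # Condition 1: m_1 is sent by P_i after C_i^x.
    frontier = deque(m for m in messages if m.sender == p_i and m.send_interval > x)
    seen = set(frontier)
    while frontier:
        m = frontier.popleft()
        # Condition 3: the last message is received by P_j before C_j^y.
        if m.receiver == p_j and m.recv_interval <= y:
            return True
        # Condition 2: the next message leaves the receiver in the same
        # or a later interval (it may be sent before the delivery itself).
        for nxt in messages:
            if (nxt not in seen and nxt.sender == m.receiver
                    and nxt.send_interval >= m.recv_interval):
                seen.add(nxt)
                frontier.append(nxt)
    return False


def in_z_cycle(messages, checkpoint):
    """A checkpoint is in a z-cycle when a z-path leads from it to itself."""
    return has_z_path(messages, checkpoint, checkpoint)


# Hypothetical pattern: P1 sends m1 after C_1^1; P2 sends m2 in the interval
# where it receives m1; P1 receives m2 before C_1^1.
pattern = [Message("P1", 2, "P2", 1), Message("P2", 1, "P1", 1)]
print(in_z_cycle(pattern, ("P1", 1)))
```

A checkpoint caught in a z-cycle cannot belong to any consistent global snapshot, which is the situation that forced checkpoints are meant to prevent.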

Fuzzy Logic
Definition 9. Zadeh [29] gives the following definition: a fuzzy set $A$ is defined by a membership function $f_A(x)$ that maps the elements of a universe $X$ to elements in the interval $[0, 1]$, which represent the degree of membership of $x$ in $A$.
This means that the closer $f_A(x)$ is to 1, the higher the degree of membership of $x$ in $A$.
The fuzzy set $A$ can be represented as a set of pairs, each element $x \in X$ together with its degree of membership in $A$: $A = \{(x, f_A(x)) \mid x \in X\}$.
Definition 10. Fuzzification converts a value or quantity to a fuzzy quantity.
One of the most common fuzzifiers is the triangular function, described next:
$$f_A(x) = \begin{cases} 0, & x \le L \\ \dfrac{x - L}{C - L}, & L < x \le C \\ \dfrac{R - x}{R - C}, & C < x < R \\ 0, & x \ge R \end{cases}$$
where the values $L$, $C$ and $R$ delimit the fuzzy set $A$, and $C$ is the most substantial input value to $A$.
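As a minimal sketch, the triangular fuzzifier can be written as a small function. The parameter names `left`, `center` and `right` stand for L, C and R, and the sample breakpoints (0, 2, 4) are hypothetical values chosen only for illustration.

```python
def triangular(x, left, center, right):
    """Degree of membership of x in a triangular fuzzy set (left < center < right)."""
    if x <= left or x >= right:
        return 0.0
    if x <= center:
        return (x - left) / (center - left)   # rising edge
    return (right - x) / (right - center)     # falling edge


# Hypothetical set with L = 0, C = 2, R = 4.
for value in (0.0, 1.0, 2.0, 3.0, 5.0):
    print(value, triangular(value, 0.0, 2.0, 4.0))
```

The membership peaks at C and falls linearly to zero at L and R, so values near C belong to the set more strongly than values near its borders.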

Definition 11. Defuzzification converts a fuzzy quantity into a crisp value or quantity.
Defuzzification is done by applying the weighted average method, using the following algebraic equation:
$$a^{*} = \frac{\sum f_A(a_c)\, a_c}{\sum f_A(a_c)}$$
where $\sum$ denotes algebraic summation and $a_c$ is the centroid of each symmetric membership function.
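The weighted average method reduces to a few lines of code. In this sketch the function name and the sample memberships and centroids are hypothetical; each centroid is weighted by the membership degree reached by its symmetric membership function.

```python
def weighted_average(memberships, centroids):
    """Crisp value: centroids weighted by their membership degrees."""
    total = sum(memberships)
    if total == 0:
        raise ValueError("all membership degrees are zero")
    return sum(m * c for m, c in zip(memberships, centroids)) / total


# Two output sets with centroids 1.0 and 3.0, activated equally.
print(weighted_average([0.5, 0.5], [1.0, 3.0]))
# The second set dominates, so the crisp value moves towards its centroid.
print(weighted_average([0.25, 0.75], [0.0, 4.0]))
```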
A Brief Description of FIS
A fuzzy inference system (FIS) uses fuzzy logic to map an input space to an output space. Using a FIS, one can formalize human language into a simple mathematical model. A FIS consists of four modules, described next:

• Fuzzification module: transforms inputs modelled by crisp numbers into fuzzy sets, using a fuzzification function and the corresponding membership functions.
• Knowledgebase: stores the rules that are given by experts in the form of if-then rules.
• Inference engine: makes inferences from the inputs and tries to simulate human reasoning by using if-then rules.
• Defuzzification module: gives as output a crisp value that can be easily interpreted by humans or computers; it transforms fuzzy sets into crisp values.
The best-known types of FIS are the Mamdani and the Sugeno systems [30].
• A Mamdani system takes fuzzy inputs and gives fuzzy outputs as a result.
- If x is A and y is B then z is V
• A Sugeno system takes fuzzy inputs from the inference engine and gives "crisp" values as outputs.
- If x is A and y is B then z = f(x, y)
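A Sugeno-style rule base can be sketched in a few lines. The sketch assumes that AND is taken as the minimum of the two membership degrees and that rule outputs are combined by a weighted average, both common choices rather than something prescribed by the text; the membership functions `low` and `high` and the two rules are hypothetical.

```python
def sugeno_output(rules, x, y):
    """Each rule is (mu_A, mu_B, f): if x is A and y is B then z = f(x, y)."""
    strengths = [min(mu_a(x), mu_b(y)) for mu_a, mu_b, _ in rules]
    total = sum(strengths)
    if total == 0:
        raise ValueError("no rule fired for this input")
    outputs = [f(x, y) for _, _, f in rules]
    return sum(w * z for w, z in zip(strengths, outputs)) / total


def low(v):
    return max(0.0, 1.0 - v)


def high(v):
    return max(0.0, min(1.0, v))


rules = [
    (low, low, lambda x, y: 0.0),       # if x is low and y is low then z = 0
    (high, high, lambda x, y: x + y),   # if x is high and y is high then z = x + y
]
z = sugeno_output(rules, 0.5, 0.5)
print(z)
```

A Mamdani rule would instead produce a fuzzy set V as its consequent, which then has to be defuzzified to obtain a crisp value.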

Architecture
We propose a general architecture for distributed and heterogeneous environments, dependable enough to guarantee system fault tolerance upon failures, as illustrated in Figure 2. In previous work, we merged checkpointing mechanisms (CiC) with autonomic computing (MAPE control loop) [1]. In this work, we add an evaluation mechanism for each transaction that takes place in the system: we evaluate all Web services in play, which allows us to evaluate the overall QoS, as shown in Figure 2. By evaluating individual QoS values for Web services, we can decide to take or to discard checkpoints depending on the system's current behaviour. Our approach decreases the number of forced checkpoints when the system is maintained in good or fair condition, thereby reducing the overhead of the system.
Next, we briefly describe the MAPE control loop and establish how this work complements it. The Monitoring phase collects data from each of the processes of the system in play. Monitoring initiates the requests, sending and receiving the processes' requests/responses. This stage also computes QoS parameters such as response time and CPU and memory percentage usage. The Analysis layer is in charge of detecting abnormal situations and of decision making based on information extracted from messages and logs, i.e., it checks whether the behaviour of the system is normal, suffers any abnormality or is in a faulty state. To predict the immediate future state of the system, artificial intelligence and soft computing may be used, for example, based on autonomic computing [1] or any other approach that supports predictions, like Hidden Markov Models (HMM) [31], Bayesian Networks [32,33] or Fuzzy Logic. The Analysis layer computes the overall QoS value at a given time, using soft computing, during any Web service transaction. We found that using fuzzy logic for QoS evaluation does not degrade the system, i.e., it does not affect its performance, as shown in the results section.
We consider the order of messages as well as their timing in communication events, as these are relevant for system consistency. Therefore, we extrapolate the principles of the communication-induced checkpointing (CiC) mechanism to the Fault Tolerance layer. CiC not only guarantees the order of messages and consistency, but also establishes uniformity when building consistent global snapshots (CGSs). Distributed systems are dynamic; diverse parties compete for shared resources, for instance, many clients requesting to buy a product from a website, which in turn requests the availability of the product from several service providers. Autonomic computing addresses such issues by introducing the notion of an autonomic manager, which is in charge of taking corresponding actions upon detecting anomalies in the system. Autonomic computing is proactive: it includes a knowledgebase that generates information about previous experiences from monitored and managed elements. A fuzzy inference system can be used along with managed elements for diagnosis and learning purposes; in autonomic computing, monitored elements are known as managed elements. We represent non-functional requirements (QoS) as fuzzy variables. By monitoring QoS at runtime, we easily know the behaviour of individual Web services and also of Web service compositions. Take, for instance, Figure 3, for autonomic Web services, where Autonomic Managers control and manage Web services. The Autonomic Manager has the following functions:
• Monitoring: the monitoring module aims at detecting, at runtime, changes known as symptoms in managed elements, for instance, when monitoring the Service Level Agreements (SLA) and detecting deviations in response times.
• Analysis: the analysis module gives a diagnosis of the current state of the system based on monitored data, for instance, when QoS parameters exceed a certain threshold. We use FIS to evaluate and give a diagnosis for Web services.
• Knowledgebase: the system's common knowledgebase stores policies; these policies can change over time. The policies define configurations for monitored components, and there exist predefined policies. New policies can appear when changes occur in the system; policies trigger corrective actions. Web services base their autonomic behaviour on these rules to achieve self-management functions.
Autonomic computing establishes that systems have to learn and adapt to changes; we achieve such behavior by relying on FIS. FIS mimics human reasoning by adapting to abrupt changes in the system environment (changes that are usually not considered in the predefined policies). Therefore, FIS complements the Autonomic Manager by achieving an efficient diagnosis of the system's behavior, as it contemplates all its possibilities.
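The control loop described in this section can be sketched as follows. This is an illustrative skeleton under stated assumptions, not the implementation evaluated in this work: the class and method names are hypothetical, the Plan and Execute phases are reduced to choosing and logging an action, and the Analysis phase uses a plain response-time threshold (4 s, the ITU G.1010 "unacceptable" bound) in place of the FIS.

```python
from dataclasses import dataclass, field


@dataclass
class Knowledge:
    """Shared knowledgebase: policies plus previously monitored samples."""
    rt_limit_s: float = 4.0  # hypothetical policy: RT above 4 s is unacceptable
    history: list = field(default_factory=list)


class AutonomicManager:
    def __init__(self, knowledge):
        self.knowledge = knowledge

    def monitor(self, sample):
        # Collect QoS data from the managed element.
        self.knowledge.history.append(sample)
        return sample

    def analyze(self, sample):
        # Simplified diagnosis; the paper uses a FIS at this point.
        degraded = sample["rt_s"] > self.knowledge.rt_limit_s
        return "degraded" if degraded else "normal"

    def plan(self, diagnosis):
        # Checkpoint only when the diagnosis reports degradation.
        return "take_checkpoint" if diagnosis == "degraded" else "discard_checkpoint"

    def execute(self, action, actions_log):
        actions_log.append(action)
        return action

    def step(self, sample, actions_log):
        diagnosis = self.analyze(self.monitor(sample))
        return self.execute(self.plan(diagnosis), actions_log)


manager = AutonomicManager(Knowledge())
log = []
manager.step({"rt_s": 1.2}, log)
manager.step({"rt_s": 6.0}, log)
print(log)
```

Keeping the policy in the knowledgebase rather than in the analysis code reflects the idea that policies can change over time without touching the loop itself.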

Dynamically Checkpointing Autonomic Web Services
On the one hand, the Fault Tolerance layer relies on the communication-induced checkpointing (CiC) mechanism [1]. On the other hand, we propose QoS assessment and system performance evaluation for generating checkpoints dynamically, at any given time, thus countering the notion that CiC mechanisms have a high implementation cost. Web services performance can be determined using the simple if-then rules provided by fuzzy logic. It is easy to evaluate a system, and for it to adapt, by establishing a set of rules from the relationships between inputs (monitored data) and outputs (the system's actual behaviour). We exploit QoS criteria by measuring the system's non-functional requirements and simulating under different workloads. Therefore, managing the QoS threshold gives the ability to checkpoint dynamically under different CiC algorithms.
By measuring the QoS parameters associated with messages exchanged between different processes, we propose a QoS temporal association. We define a fuzzy-consistency system between messages, from which we know the actual system performance; we compute the overall QoS by retrieving information from the exchanged messages and computing their individual QoS parameters.
We formulate the FCSE based on the fuzzy-consistency system (FCS). Let the FCS relate monitored QoS parameters such as CPU, memory percentage usage and response time (RT) as follows: "The quality of system behaviour at a certain time t relative to the QoS parameters." We consider three linguistic variables:
• RT indicates the time that elapses from when a process sends a request until it receives its response back.
• CPU % indicates the CPU usage; it is used as a behavioural metric to know the performance at a certain time.
• Memory % indicates the memory percentage usage; it is used as a behavioural metric to know the performance at a particular time.
We use FIS to determine the state of the FCSE; next we give a brief definition:
(a) Fuzzification: To fuzzify the crisp values of inputs and outputs, we use a symmetric triangular fuzzifier within a Mamdani FIS (refer to Definition 10). Each variable was defined as follows:
• RT: VG(e) associated with very good, G(e) associated with good, A(e) associated with average, B(e) associated with bad, and VB(e) associated with very bad.
• CPU: CPU_L(e) associated with low, CPU_M(e) associated with medium, CPU_H(e) associated with high, and CPU_VH(e) associated with very high.
• Memory: MEM_L(e) associated with low, MEM_M(e) associated with medium, MEM_H(e) associated with high, and MEM_VH(e) associated with very high.
Table 1 lists each linguistic variable and gives its corresponding fuzzy set. Here σ represents all the values for the response time, varying from the smallest value to the biggest value, based on the ITU G.1010 standard values for Web services: preferred ∈ [0, 2) seconds, acceptable ∈ (2, 4) seconds and unacceptable ∈ (4, ∞) seconds. The CPU usage takes values ∈ [0, 100]. Finally, ϕ represents all the values for the memory used at a specific time ∈ [0, 100].

[Table 1. Columns: Variable; Values of Membership Functions.]
(b) Inference: We determine the FCSE by using Mamdani inference rules; thus, we compute the performance of the system (based on QoS parameters) at any particular time.
We defined 80 combinations for inference rules [17]; therefore, computational complexity is not considered. We argue that our evaluation does not incur high system overhead, as experimental results show.
(c) Defuzzification: The output (FCSE) of the Mamdani FIS is a fuzzy variable; we convert it to a scalar value that is easily interpreted by a human. We defuzzify the output of the FIS using the weighted average method (please refer to Definition 11).
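The three steps (a) to (c) can be put together in a compact sketch. Everything numeric in it is an assumption made for illustration: the membership breakpoints are evenly spaced triangles rather than the values of Table 1, the response-time range is capped at 8 s, the output centroids are spread over [0, 5], and the rule consequents follow a simple "worst input decides" choice instead of the rule base of [17]. Only the structure is taken from the text: five RT labels, four CPU labels and four Memory labels (5 × 4 × 4 = 80 rule combinations), minimum for AND, and weighted average defuzzification.

```python
from itertools import product


def tri(x, left, center, right):
    """Triangular membership with peak at center."""
    if x <= left or x >= right:
        return 0.0
    if x <= center:
        return (x - left) / (center - left)
    return (right - x) / (right - center)


def partition(labels, lo, hi):
    """Evenly spaced symmetric triangles over [lo, hi] (hypothetical breakpoints)."""
    n = len(labels) - 1
    return {
        lab: tuple(lo + i * (hi - lo) / n for i in (k - 1, k, k + 1))
        for k, lab in enumerate(labels)
    }


RT = partition(["VG", "G", "A", "B", "VB"], 0.0, 8.0)   # seconds, assumed range
CPU = partition(["L", "M", "H", "VH"], 0.0, 100.0)       # percent
MEM = partition(["L", "M", "H", "VH"], 0.0, 100.0)       # percent
FCSE_LABELS = ["VH", "H", "A", "L", "VL"]
FCSE_CENTROID = dict(zip(FCSE_LABELS, [0.0, 1.25, 2.5, 3.75, 5.0]))  # assumed


def severity(sets, label):
    """Position of a label inside its variable, scaled to [0, 1]."""
    labels = list(sets)
    return labels.index(label) / (len(labels) - 1)


# One rule per label combination. The consequent is an illustrative choice.
RULES = {}
for combo in product(RT, CPU, MEM):
    worst = max(severity(s, lab) for s, lab in zip((RT, CPU, MEM), combo))
    RULES[combo] = FCSE_LABELS[round(4 * worst)]


def clamp(value, lo, hi):
    return min(max(value, lo), hi)


def fcse(rt, cpu, mem):
    inputs = ((clamp(rt, 0.0, 8.0), RT), (clamp(cpu, 0.0, 100.0), CPU),
              (clamp(mem, 0.0, 100.0), MEM))
    # (a) Fuzzification of the three crisp inputs.
    mu = [{lab: tri(v, *abc) for lab, abc in sets.items()} for v, sets in inputs]
    # (b) Inference: AND as minimum, aggregation per output label as maximum.
    strength = dict.fromkeys(FCSE_LABELS, 0.0)
    for combo, out in RULES.items():
        fired = min(m[lab] for m, lab in zip(mu, combo))
        strength[out] = max(strength[out], fired)
    # (c) Defuzzification with the weighted average method.
    total = sum(strength.values())
    return sum(strength[lab] * FCSE_CENTROID[lab] for lab in FCSE_LABELS) / total


print(len(RULES))
print(fcse(0.5, 10.0, 20.0), fcse(4.0, 50.0, 50.0), fcse(7.5, 95.0, 90.0))
```

With these assumed sets, a lightly loaded service yields a crisp value near 0 and a saturated one a value near 5, matching the orientation of the FCSE scale used in the experiments.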

Results
The results are twofold. First, we show the performance of our approach; results show that it does not degrade the system (as a first step, we only measure the fuzzy implementation). Second, we evaluate QoS non-functional requirements between all messages exchanged in the system, dynamically reducing the number of forced checkpoints that diverse algorithms generate, depending on the system's actual behavior.

Experimental Setup
The first aim of the experiments is to show that using our solution does not incur high system overhead. We established a simulation environment with the following characteristics: we deployed Web services to carry out dummy services, yet interactive enough to intercommunicate with other Web services and service providers; we simulated different workload conditions for diverse system performance; and we measured our model of fuzzy non-functional parameters.
We experimented on a Dell Precision Workstation with the following characteristics: 16 GB RAM and a Windows 7 64-bit operating system. We hosted the WSO2 Application Server for Web services deployment, and we stress-tested those Web services with concurrent requests coming from diverse Java-based clients, to approximate a real-world situation, for instance, when diverse clients are trying to buy something online from different service providers. We used the Jfuzzylogic framework (http://jfuzzylogic.sourceforge.net/html/index.html) for the FCSE and fuzzy-related programming compatible with Java.

Fuzzy Setup
We performed two experiments. The first experiment consisted of measuring system performance with and without the proposed evaluation method. The second experiment consisted of stress-tests for diverse system performance and of the overall QoS evaluation to dynamically checkpoint using three different communication-induced tools, namely DCFI, FI and FINE.
First, we described the corresponding membership functions for inputs and outputs. The response time (RT) took values ∈ (0, 100) seconds. ITU G.1010 aims to standardize Web services performance for real-life scenarios and recommends the following values: preferred ∈ (0, 2) seconds, acceptable ∈ (2, 4) seconds and unacceptable ∈ (4, ∞) seconds. CPU % and Memory % usage took values ∈ (0, 100), depending on actual usage. The FCSE, in charge of the overall QoS, took values ∈ (0, 5); we associated a membership function with each of the variables above, as shown in Table 2.
Just to clarify, the FCSE set is associated to the following linguistic variables: • FCSE: FCSE_VH(e) associated to very high, FCSE_H(e) associated to high, FCSE_A(e) associated to average, FCSE_L(e) associated to low, and FCSE_VL(e) associated to very low.
The FCSE evaluation gave crisp output values ∈ (0, 5). An FCSE output value close to zero meant that no degradation of the system was found: the FCSE was very high, the RT crisp value corresponded to very good, and the CPU % and Memory % usage crisp values were very low. Conversely, an FCSE crisp value close to five meant that the system suffered degradation: the FCSE was very low, indicating performance problems, and Web services had very high RT and very high CPU % and Memory % usage.

Using QoS Criteria for Diagnostics
From the first experiment, recall that the purpose was to stress-test Web services with many concurrent requests and to measure the impact of our proposed approach on system performance. We obtained the following results: Figures 4 and 5 show the same number of messages sent from 10 clients up to 5000 clients. In both figures we report the average response time, i.e., the time from when a client sends its request (a message) until it gets its response back (from a service provider). From Figures 4 and 5 we observed that the average RT and its corresponding standard deviation were quite similar in both graphs (without QoS evaluation and with the QoS evaluation based on FIS) for all cases, even after highly increasing the number of clients; the QoS evaluation performed better by reducing abrupt increments in the average response time as the number of clients grows. In collaborative environments where intercommunication takes place, resource usage and storage are important issues. Therefore, proposed solutions must remain constant and scalable even in large-scale environments; our solution aims to address these issues, as illustrated by Figure 5. Besides, our proposal remained constant for a high number of interactions; therefore, it is scalable. To have a significant statistical sample, we executed each run 100 times. For instance, when 20 concurrent clients requested a Web service, we collected 2000 samples and computed their average RT and standard deviation, as shown in Figure 5. For 50 concurrent clients, we collected 5000 samples, and so on. In addition, each scenario comprised a dedicated server that hosted Web services, and each Web service instance computed the overall QoS for each request/response, that is, it computed the fuzzy-consistency system evaluation (FCSE) at runtime, at any point in time.
The statistical sample size followed Cochran's sample size formula [34]:
$$n_0 = \frac{t^2\,(p)(q)}{d^2}$$
where $t$ = value for the selected alpha level of 0.05 in each tail = 1.65, $(p)(q)$ = estimate of variance = 0.25, and $d$ = acceptable margin of error for the proportion being estimated = 0.0825 (the error the researcher was willing to accept, 8.25% in our case). From Figures 4 and 5 we can conclude that the proposed QoS evaluation approach slightly incremented the average RT; however, it remained constant, and it provided a fast evaluation of the system at any time. The evaluation helped us to diagnose the system and take proactive actions, in our case to decide whether a checkpoint was essential or not. Hence, we could compute the evaluation of the system at runtime; our next goal was to leverage this calculation. By adapting to changes in the environment, one can take advantage of policies for QoS non-functional requirements and reduce the number of forced checkpoints taken. In the next section, we show how, by evaluating QoS non-functional parameters, we can discard useless checkpoints according to the Web services' behavior at any given time.
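As a quick check of the arithmetic, Cochran's formula can be evaluated with the parameter values stated above; the function name is hypothetical. The result is approximately 100, which matches the 100 executions used for each run.

```python
def cochran_sample_size(t, variance, d):
    """n0 = t^2 * (p)(q) / d^2, with variance standing for the product (p)(q)."""
    return (t ** 2) * variance / (d ** 2)


n0 = cochran_sample_size(t=1.65, variance=0.25, d=0.0825)
print(round(n0))
```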

Dynamic Checkpointing for CiC Algorithms
We ran different scenarios with diverse values for the QoS parameters, which followed an exponential distribution to simulate real-world conditions; see Figures 6-8. The crisp values for the FCSE varied from 0 to 5 after evaluation by the FIS; we named this range the FCSE window. This window can be very rigorous; for instance, we took forced checkpoints when the computed crisp value of the FCSE was equal to 0.625 or lower. We refer to a relaxed window as one that allows a more degraded system, with an FCSE crisp output value of 3.75. Also, we [...]
In order to show that our proposal can be adapted to simulate several system requirements, we also ran different scenarios that include four QoS non-functional parameters, namely CPU (Good, Medium, Bad), Memory (Good, Medium, Bad), Response Time (Good, Average, Bad) and Reliability (Good, Average, Bad); see Figures 9-11.
We compared the performance of three different CiC algorithms, namely DCFI, FI and FINE. These were chosen because they are recent algorithms and the most efficient reported in the literature. We simulated and analyzed these three algorithms using the distributed checkpointing simulator ChkSim [16] and JFuzzyLogic. The ChkSim simulator models distributed systems in a deterministic manner, reproducing the same behavior for two or more algorithms, and allows running a simulation as often as necessary. We took the number of forced checkpoints as the key performance indicator to compare the checkpointing algorithms regarding the overhead each one of them produces.
Figure 11. The number of forced checkpoints for 5000 messages sent, evaluating four QoS non-functional parameters.
In Figures 6-8, we analyzed the performance of three different communication-induced checkpointing mechanisms (DCFI, FINE, and FI) in terms of forced checkpoints. We used the following configuration: scenario one consisted of 0 to 160 concurrent clients, each sending 1000 messages; scenario two consisted of 0 to 160 concurrent clients, each sending 2500 messages; scenario three consisted of 0 to 160 concurrent clients, each sending 5000 messages. For each scenario, we varied the QoS window. Again, to obtain a statistically significant sample, we executed each point of each scenario for 100 iterations with different checkpointing patterns as well as different communication patterns (messages exchanged). Figures 6-8 show that, of all the checkpointing mechanisms, DCFI generated the lowest number of forced checkpoints; the number generated is related to the QoS window. For example, for a rigorous QoS window the number of forced checkpoints was similar to the one observed for DCFI without our approach. Yet, as the window relaxed, the number of forced checkpoints decreased drastically. In all cases the best-performing checkpointing mechanism was DCFI, because it generated fewer checkpoints than FI and FINE. FI was the second-best checkpointing mechanism; FINE also reduced its checkpoint generation, but compared to the other two mechanisms it presented the worst cases.
Experiments presented in Figures 9-11 show that when the number of non-functional parameters used to evaluate the system performance increased, the evaluation could be relaxed and smaller changes in the system behavior could be detected; as a consequence, fewer forced checkpoints were generated. For all the cases presented in these figures, the number of checkpoints generated was lower because of the finer granularity with which the system behavior was monitored.
To conclude this section, based on the results presented in Figures 6-8, we can argue that our approach correlated both cost (QoS parameters: response time, CPU, memory, reliability) and benefit (how well the system behaves) in the following way:

• When QoS was very high, it implied saving checkpoints, because there were enough resources available and the cost of storing checkpoints was minimal.
• When QoS was very low, it implied not saving many checkpoints, because there were not enough resources available; yet our approach saved some checkpoints to preserve consistency upon a failure.
• The granularity with which the system performance was monitored through non-functional parameters allowed us to detect small changes, and the system evaluation was enriched; hence, fewer forced checkpoints could be generated when the system behavior was good.

Comparison between Static Approaches and Our Proposed Dynamic Approaches
We designed a case study to observe the differences between static CiC approaches, which do not adapt to the degradation of the system, and our proposal. The case study considered the most demanding QoS conditions, so it was necessary to guarantee the highest QoS for serving users. Tables 3-5 show that, as the number of messages and processes increased under idealized conditions with the most rigorous QoS window of 0.625, there was a small decrease in the number of forced checkpoints, and in some cases the number of forced checkpoints taken was the same. We reduced the number of forced checkpoints by 0.2% to 4% with respect to static CiC methods; however, Tables 3-5 only show the rigorous QoS case. The reduction in the number of forced checkpoints is clearly expressed in Figures 6-8, and it would be more significant in environments where the quality of service can be relaxed without compromising system efficiency, if applications allow it. Our approach can adapt its behavior and maintain system consistency and efficiency with respect to the QoS parameters established by the user. To conclude, adaptability goes hand in hand with the allowed system degradation, since the generation of forced checkpoints is adjusted for situations where the quality of service can be relaxed.
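The percentages quoted above can be read as relative reductions with respect to the static CiC baseline. A minimal sketch of that arithmetic follows, assuming the reduction is computed as (static - dynamic) / static; the function name and the checkpoint counts are hypothetical placeholders chosen for illustration and are not values from Tables 3-5.

```python
def reduction_percent(static_count: int, dynamic_count: int) -> float:
    """Relative reduction (in %) of forced checkpoints versus a static baseline."""
    if static_count <= 0:
        raise ValueError("static_count must be positive")
    return 100.0 * (static_count - dynamic_count) / static_count


# Hypothetical counts, NOT taken from the paper's tables.
static_forced = 1000
dynamic_forced = 980
print(f"{reduction_percent(static_forced, dynamic_forced):.1f}% fewer forced checkpoints")
```

A result of zero corresponds to the cases in which the static and dynamic approaches took the same number of forced checkpoints.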

Conclusions
We present the first solution that allows a dynamic and adaptable generation of forced checkpoints. This is carried out through a fuzzy consistency system evaluation that identifies useless forced checkpoints. We compared three different communication-induced checkpointing-based mechanisms; experimental results demonstrate the efficiency of our approach, as we managed to reduce the number of forced checkpoints by taking into account the quality of service (QoS) of the system at a given time. We consider the system's non-functional requirements for the generation of checkpoints, and we dynamically adapt to changes in the system, since we consider both good and bad system performance in our measurements; we consider a rigorous quality-of-service window as well as a relaxed window. For these windows, the delayed checkpointing algorithm (DCFI) outperforms the other algorithms and takes the fewest forced checkpoints. Experimental results show that the DCFI mechanism performs better than FI and FINE, although only by 3%. All compared algorithms reduce their generation of forced checkpoints; however, as shown in the results, the reduction depends on the actual system behavior. We can also argue that CiC-based solutions are scalable because they continue to work correctly even in the presence of a high number of interactions. In addition, these kinds of solutions increase system dependability without harming system performance.
As future work, we will assign a weight to each input and check whether this yields a gain or a reduction, in terms of the number of forced checkpoints, for diverse communication-induced checkpointing mechanisms. In addition, we will extend the proposed solution by including a self-healing mechanism in order to detect malicious attacks and carry out system recovery through a fuzzy evaluation of the consistency of the system.

Funding:
Mariano Vargas-Santiago especially thanks El Instituto Nacional de Astrofísica Óptica y Electrónica for the grant awarded for this research. Luis Morales-Rosales thanks CONACYT for research project 613, Cyber-physical Systems for the Development of Intelligent Transport Systems, which supported this contribution.

Acknowledgments:
The authors thank Rebekah Hosse Sullivan for taking the time to review this work.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript: