Communication-Induced Checkpointing with Message Logging beyond the Piecewise Deterministic (PWD) Model for Distributed Systems

This paper introduces an effective communication-induced checkpointing protocol that uses message logging to make the number of extra (forced) checkpoints far lower than in previous protocols. Even when the protocol decides that a process receiving a message has to perform forced checkpointing, the process may skip the forced checkpointing action if it recognizes that the state of its sender right before the receipt of the message is recoverable. Additionally, the protocol is not required to assume the piecewise deterministic model, despite being combined with message logging. The protocol maintains these features by piggybacking a one-bit variable and an n-size vector on each message sent. Our simulation results verify that the presented protocol performs much better than the representative optimized protocol with respect to the forced checkpointing frequency, regardless of the communication pattern.


Introduction
As parallel algorithms perform many operations on a cluster of independent computing nodes, even a single node crash can cause the execution of an algorithm to halt [1]. This undesirable property may make large-scale distributed systems more vulnerable to failures [2]. For this reason, effective fault-tolerance techniques are essential in such systems. Rollback recovery is one such technique; it restores the current erroneous state of a distributed system to a previous failure-free state from the stable storage [3,4]. To achieve this goal, the system's recovery information must be occasionally saved to the storage during normal operation [5].
Among the techniques used for rollback recovery, checkpoint-based recovery depends solely upon the local states of processes maintained in the stable storage, called checkpoints, to support fault tolerance [5]. In the case of a failure, the system state is recovered by using the most recent consistent global checkpoint kept in the storage. According to when and how consistent sets of checkpoints are formed, checkpoint-based recovery protocols are categorized into coordinated, independent, and communication-induced checkpointing [5]. To balance the trade-offs between independent and coordinated checkpointing, communication-induced checkpointing (CIC) prevents any local checkpoint that has already been taken from becoming useless by performing forced checkpointing, while attempting to increase the degree of checkpointing independence as much as possible [6][7][8][9][10][11][12][13][14][15]. The CIC protocols include HMNR [8], which greatly reduces the number of extra checkpoints by effectively using the control information contained in each sent message. An improved HMNR protocol [12], LazyHMNR, uses a lazy indexing strategy [13] to alleviate the high frequency of forced checkpointing that may occur in some particular cases. Two further protocols, FINE [9] and LazyFINE [10], were designed to generate fewer forced checkpoints than HMNR and LazyHMNR with the same number of variables. However, they cannot ensure the property of never having useless checkpoints [7]. An adaptive CIC protocol [11] was then developed to delay forced checkpointing actions as long as possible by evaluating some safety predicates. One of the most recent CIC protocols [14] exploits the one-to-many transmission of broadcast links to lower the number of extra checkpoints. However, this dependence on broadcast links may greatly degrade the applicability of the CIC protocol. Lastly, several recent works [6,15] exploited one or more of the protocols mentioned above to raise the dependability of systems in the fields of distributed database management systems and web services.
Generally, checkpointing-based recovery, including CIC, is not subject to the piecewise deterministic (PWD) model; thus, it is less restrictive and more realistic for application in distributed systems than message-logging-based recovery [16][17][18]. However, the former may not ensure the recovery of the system to its pre-failure state, which generally makes the rollback distance of each process much longer than that in the latter [5]. Thus, hybrid protocols that combine the two techniques were proposed to compensate for the drawbacks of the former [5]. However, these protocols cannot be freed from the PWD assumption. For this reason, none of the existing CIC protocols developed so far, including the family of HMNR protocols [6,8-15], can utilize the benefits of message logging on real-world distributed systems without assuming the PWD model. Whenever a process receives a message, these protocols cause the process to perform forced checkpointing if they decide that one or more checkpoints that have already been taken may become useless. Due to this inherent shortcoming, even the HMNR protocols force each process to take extra checkpoints amounting to more than twice the number of basic checkpoints [7]. However, the property of never having useless checkpoints is ensured without the forced checkpointing action if the process knows precisely that the current state of the sender of the message can be deterministically restored by replaying logged messages from the stable storage in the case of a failure of the sender. This observation can drive a large reduction in the number of forced extra checkpoints that all of the previous CIC protocols can incur. This paper introduces a CIC protocol with message logging, S-CIC, that exploits this observation without being subject to the PWD model. The protocol achieves this goal by piggybacking only a one-bit variable and an n-size vector on every message transmitted. In the checkpointing and communication patterns where the existing CIC protocols, including the family of HMNR protocols, force the receiver of a message to take extra checkpoints, the proposed protocol allows the receiver to skip forced checkpointing if the sender information piggybacked on the message indicates that the sender's state right before sending is recoverable.

Fundamentals
In this paper, we assume a distributed system that has no global clock or global memory and is immune to network partitions [5]. Every process can crash according to the fail-stop failure model [19], and the processes collaborate with one another only by exchanging messages reliably through asynchronous transmission channels [5,8]. Each process p begins its execution from its initial state and performs a sequence of internal, message-sending, and message-delivering events [5]. Here, internal events perform a process's individual computation with no interaction with others. All events that occur during normal operation are ordered according to Lamport's "happened before" relation [20].
Ck i p represents the ith local checkpoint of p, and Ck i p .lc is the local timestamp assigned to Ck i p when it is taken. Assume that each process p records its first checkpoint, Ck 0 p , containing its initial state, on the storage when it begins its own computation. A global checkpoint is a set of local checkpoints that contains exactly one checkpoint per process in the system [5,7]. A pair of local checkpoints (Ck i p ,Ck j q ) is mutually consistent if and only if there is no message m that is delivered before Ck j q but is sent after Ck i p . A global checkpoint is consistent if and only if every pair of local checkpoints belonging to it satisfies the mutual consistency condition [21]. The concept of the Z-path [22] is exploited to check whether the mutual consistency condition is satisfied for an ordered pair of checkpoints by finding the causal and non-causal (NC) sub-paths through which the two checkpoints can be connected to each other. A Z-path that forms a cycle from a local checkpoint Ck i p to itself is called a Z-cycle [22].
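The consistency definitions above can be illustrated with a minimal sketch. It assumes that each message records the index of the last checkpoint its sender had taken when it was sent and the index of the last checkpoint its receiver had taken when it was delivered; the names Message, mutually_consistent, and globally_consistent are hypothetical and do not come from the paper. The sketch also checks both directions of a pair, which is an assumption about how the definition applies to an unordered pair.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Dict, List


@dataclass(frozen=True)
class Message:
    sender: str
    receiver: str
    send_interval: int     # index of the sender's last checkpoint before the send
    deliver_interval: int  # index of the receiver's last checkpoint before delivery


def mutually_consistent(msgs: List[Message], p: str, i: int, q: str, j: int) -> bool:
    """True if no message is sent after one checkpoint but delivered before the other."""
    def orphan(src: str, a: int, dst: str, b: int) -> bool:
        # Sent after checkpoint a of src and delivered before checkpoint b of dst.
        return any(
            m.sender == src and m.receiver == dst
            and m.send_interval >= a and m.deliver_interval < b
            for m in msgs
        )
    return not orphan(p, i, q, j) and not orphan(q, j, p, i)


def globally_consistent(msgs: List[Message], cut: Dict[str, int]) -> bool:
    """A global checkpoint (one index per process) is consistent if every pair is."""
    return all(
        mutually_consistent(msgs, p, cut[p], q, cut[q])
        for p, q in combinations(cut, 2)
    )
```

For instance, a message sent by p after its checkpoint 1 and delivered by q before its checkpoint 1 makes the global checkpoint {p: 1, q: 1} inconsistent, while {p: 2, q: 1} remains consistent.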

Related Work
HMNR is an optimized protocol that aims to have no useless local checkpoints while decreasing the number of extra checkpoints. To this end, each process p in HMNR always maintains the following five state variables [8]:
• lc p : a non-negative integer that holds the present value of p's local timestamp.
• sent_to p : a vector in which sent_to p [q] keeps a boolean value used to detect a non-causal path from p to q.
• ckpt p : a vector in which ckpt p [q] contains the total number of checkpoints that, to p's current knowledge, q has recorded in the stable storage since its initial execution.
• taken p : a vector in which taken p [q] keeps a boolean value indicating whether there is at least one causal Z-path from the latest checkpoint of q that p perceives to the next checkpoint of p.
• greater p : a vector in which greater p [q] keeps a boolean value indicating whether p's current timestamp lc p is greater than the most recent timestamp of q perceived by p (true) or not (false).
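As a rough illustration, the five variables can be grouped into one per-process record. This is only a sketch: the class name HMNRState, the piggyback method, and the initial values (zeros and false) are placeholders chosen for illustration, not values stated in the text. The choice of piggybacked fields follows the fields of m that the text refers to (m.lc, m.ckpt, m.taken, m.greater).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HMNRState:
    n: int                                               # number of processes
    lc: int = 0                                          # local logical timestamp
    sent_to: List[bool] = field(default_factory=list)    # sent_to[q]
    ckpt: List[int] = field(default_factory=list)        # ckpt[q]
    taken: List[bool] = field(default_factory=list)      # taken[q]
    greater: List[bool] = field(default_factory=list)    # greater[q]

    def __post_init__(self) -> None:
        # Placeholder initial values (an assumption of this sketch).
        self.sent_to = [False] * self.n
        self.ckpt = [0] * self.n
        self.taken = [False] * self.n
        self.greater = [False] * self.n

    def piggyback(self) -> Dict[str, object]:
        """Copy of the control information attached to an outgoing message."""
        return {
            "lc": self.lc,
            "ckpt": list(self.ckpt),
            "taken": list(self.taken),
            "greater": list(self.greater),
        }
```

The vectors are copied when piggybacked so that later local updates do not alter a message that has already been sent.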
This protocol includes a checkpoint-timestamping mechanism that uses Lamport's logical clock [21] to satisfy Theorem 1, implying that increasing the timestamp flow along any Z-path always ensures that no checkpoint becomes useless. The mechanism is sufficient for ensuring that no causal Z-path includes a Z-cycle formation [8,22].
However, two kinds of non-causal (NC) Z-paths [8,22], as shown in Figure 1, can violate the theorem, even if the timestamping mechanism is used. To prevent Z-cycles from forming in these cases, HMNR forces each process p after receiving a message m to save an additional checkpoint if the following condition C HMNR is satisfied.
The first case is an NC Z-path pattern connecting two checkpoints, Ck i p and Ck k+1 r , as shown in Figure 1a. In this example, three processes, p, q, and r, exchange three messages: m 1 , m 2 , and m 3 . As Ck i p .lc = Ck k+1 r .lc, the path violates the theorem. When q sends m 2 to r, sent_to q [r] becomes true. However, p can only learn the value of lc r from before Ck k+1 r through m 3 , so m 3 .lc(=lc r ) < lc p . Thus, greater p [r] remains true. This value is then carried to q by m 1 , so m 1 .greater[r] = true and m 1 .lc(=lc p ) > lc q when q receives m 1 . As the first sub-condition of C HMNR , C 1 , is satisfied, HMNR forces q to record an extra checkpoint Ck j+1 q in the storage before delivering m 1 to the target application.
The second case is another NC Z-path pattern with Ck i p and Ck k+1 r , as shown in Figure 1b. In the figure, the path incurs a Z-cycle involving Ck k+1 r because r sends m 3 to p after Ck k+1 r , and then q receives m 1 , depending on m 3 , from p. In this case, as m 3 is transmitted to p from r, greater p [r](← greater p [r] ∧ m 3 .greater[r]) becomes false as m 3 .lc(=lc r ) = lc p and m 3 .greater[r] = false. As m 1 .greater[r] = false when q receives m 1 , C 1 is not sufficient for detecting the violation. Therefore, the second sub-condition of C HMNR , C 2 , needs to be checked. When taking Ck k+1 r , taken r [q] changes to true while the value of ckpt r [q] is still the same as the number of checkpoints associated with Ck
However, in both examples, if p's state right before sending m 1 is recoverable and q knows this, q does not need to take Ck j+1 q before delivering m 1 , even though C HMNR is satisfied. In other words, when q recognizes that both m 2 and m 3 can always be replayed in case of failure and that p's internal execution before sending m 1 is deterministic, Ck k+1 r will not be a useless checkpoint, even though Ck j+1 q is taken after delivering m 1 . Based on this new observation, we present a low-overhead CIC protocol, S-CIC, that uses message logging to detect this type of recoverability in an efficient manner without assuming the PWD constraint.
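The two sub-conditions discussed in these examples can be sketched as a predicate evaluated by the receiver. The exact statement of C HMNR is not reproduced in this text, so the form below is reconstructed from the worked examples (sent_to q [r], m.greater[r], and m.lc > lc q for C 1 ; m.ckpt[q] = ckpt q [q] and m.taken[q] for C 2 ) and should be read as an assumption; the names Piggyback, c1, c2, and c_hmnr are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Piggyback:
    lc: int
    ckpt: List[int]
    taken: List[bool]
    greater: List[bool]


def c1(sent_to: List[bool], lc: int, m: Piggyback) -> bool:
    """Assumed C1: some k with sent_to[k] and m.greater[k], and m.lc > lc."""
    return m.lc > lc and any(s and g for s, g in zip(sent_to, m.greater))


def c2(me: int, ckpt: List[int], m: Piggyback) -> bool:
    """Assumed C2: the sender knows the receiver's current checkpoint count and m.taken[me]."""
    return m.ckpt[me] == ckpt[me] and m.taken[me]


def c_hmnr(me: int, sent_to: List[bool], lc: int, ckpt: List[int], m: Piggyback) -> bool:
    """Assumed C_HMNR: a forced checkpoint is needed if C1 or C2 holds."""
    return c1(sent_to, lc, m) or c2(me, ckpt, m)
```

With processes indexed p = 0, q = 1, r = 2, the Figure 1a situation corresponds to sent_to q [r] = true, m 1 .greater[r] = true, and m 1 .lc > lc q , which satisfies C 1 ; in the Figure 1b situation m 1 .greater[r] = false, so only C 2 can detect the violation.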
The enhanced version of HMNR [12], LazyHMNR, attempts to lessen the forced checkpointing frequency in unconventional cases that may take place due to asymmetries in the rates of increase of logical timestamps. It fulfills this requirement by delaying the swift growth of the logical timestamps of some processes-which is caused by repeated checkpointing actions on them-as long as possible.
Another CIC protocol [9], FINE, attempts to intensify the optimality of the consistency predicate of HMNR with only its mandatory state variables in order to ensure the property of never having useless checkpoints. The advanced version of FINE [10], LazyFINE, was designed to incorporate the laziness of logical timestamp increases into FINE. However, it was proved that the two protocols can create useless checkpoints because the Z-consistent timestamping rule cannot be enforced [7].
A delayed CIC protocol [11], DCFI, was introduced to lower the forced checkpointing frequency by applying several safety rules that enable checkpointing enforcements to be postponed. This feature may yield a much lower total number of checkpoints taken in the system. However, the protocol does not incorporate a method for significantly lowering the frequency of extra checkpointing actions by exploiting the rollback distance reduction benefit of message logging.
Another CIC protocol [14], BN-FI, was recently presented to curtail the number of extra checkpoints by exploiting a functional strength that broadcast networks generally hold, namely the effectiveness of one-to-many transmission. This capability of lightweight group dissemination can speed up the propagation of each transmitter's latest logical timestamp to the other processes on the network. This property enables each process to detect whether the ongoing Z-path has at least one checkpoint that has become useless much earlier than in the previous protocols. However, the performance gain of the protocol is confined to broadcast network environments, which limits its applicability.
One of the most recent CIC algorithms [15] was developed in order to maintain a globally consistent state of each transaction in a distributed database management system while making the delay in failure-free transaction execution as short as possible. The algorithm attempts to enhance the recoverability of the system states with a far lower number of extra checkpoints by recording only the states of the completely committed transactions in the stable storage. However, the algorithm has the same shortcomings as those of the original HMNR protocol mentioned above.
The authors of [6] proposed an adaptive checkpoint generation algorithm in order to decrease the frequency of forced checkpointing actions by considering the system's behavior in comparison with the static algorithm, which did not reflect the environmental changes onto web services being operated. The algorithm made decisions on whether forced checkpoints should be taken based on the quality of the service parameters and the policies currently applied to their corresponding web services. In order to improve the dependability of the web services, three kinds of CIC protocols were exploited: HMNR [8], DCFI [11], and FINE [9]. Among them, HMNR and DCFI performed better than FINE on the system in terms of the number of forced checkpoints. However, the system still bore the respective limitations of the three CIC protocols stated above.
However, when any of the CIC protocols mentioned above is combined with message logging to shorten the rollback distance of each process during recovery as much as possible, it can be applied only to deterministic services and systems, which greatly narrows the scope of its application areas. Table 1 summarizes a comparison of some primary features of the representative CIC protocols.

The Proposed Protocol
S-CIC was devised to maintain the following three behavioral properties.
• Similarly to HMNR, each process attaches the state information related to other processes, as well as to itself, to every outgoing message so that the number of extra checkpoints decreases as much as possible.
• Even if either of the two cases in which HMNR requires a process to perform forced checkpointing before the delivery of a message occurs, S-CIC does not have the process perform the task if it knows that the same message can be replayed in spite of any future failures.
• Although pessimistic message logging is used to satisfy the second requirement, S-CIC is not subject to the PWD model.
Initially, each process deterministically performs its computation in a certain interval and, if a non-deterministic (ND) event occurs in this interval, the process begins its ND execution interval. In this research field, ND events can be classified into two types of events. The first type includes loggable ND events, of which there is sufficient support for forcing the replay at the same point in case of failure. Message receipt is one type of loggable ND event that most message-logging protocols detect and save in the stable storage for recovery. Aside from this, there are other types of loggable ND events, such as software interrupts or signals, which some other works [2] attempted to detect in order to make it possible to replay them in case of failure. This effort may raise the rate of the deterministic (DM) execution intervals. The second type comprises unloggable ND events, for which there is no support for taking action to enable a form of repeatable execution in case of failure.
To maintain all of these properties, each process p in S-CIC always keeps the following additional state information:
• SSNmV p : A vector that saves an element composed of two variables, ssn and ND, for each process q. SSNmV p [q].ssn keeps the send sequence number (ssn) of the latest message m that q has transmitted and that is known by p. SSNmV p [q].ND is a boolean value that indicates whether q has executed at least one internal unloggable ND event before m since q's latest checkpoint. The two variables are initialized to (0,false).
First, we aimed to understand how to identify the recoverable state of each process with checkpointing and message logging without assuming the PWD model. For this purpose, an execution mode detection method was introduced that considers three typical cases that can occur in CIC, as shown in Figure 2.
As shown in Figure 2a, the first case is that a process performs its computation without any communication with others. In this example, a process p first executes only internal DM or loggable ND events in a certain interval from its checkpointed state Ck i p (SSNmV p [p].ND = false), which is called the DM mode (ND-mode p = false). Then, when the first internal unloggable ND event occurs (SSNmV p [p].ND = true), p's ND interval begins and its execution mode changes to non-deterministic (ND-mode p = true). When p takes its next checkpoint Ck i+1 p , its state becomes recoverable (SSNmV p [p].ND = false); thus, its execution mode returns to deterministic (ND-mode p = false). Then, it performs its computation in a similar way. Therefore, in this case, if p fails at a certain execution point after Ck i+1 p , it can restart from the latest checkpoint and recover to the state right before the first unloggable ND event after Ck i+1 p without considering any dependency relation with others.
As shown in Figure 2b, the second case is an execution of a process q that is affected by messages transmitted from a single other process. In this case, q first executes in its DM mode from its checkpointed state Ck j q (SSNmV q [q].ND = false, ND-mode q = false). When receiving a message m x from p, whose current mode is ND (m x .SSNmV [p].ND = true, m x .ND-mode = true), q's execution mode also changes to ND (ND-mode q = true). Then, q can execute internal unloggable ND events, although this is not shown in the figure (SSNmV q [q].ND = true). However, even if q takes its local checkpoint Ck j+1 q (SSNmV q [q].ND = false), its execution mode still remains ND because q does not yet know that p's current state is recoverable (SSNmV p [p].ND = false, ND-mode p = false). When receiving m y from p, q can recognize that p's current mode is DM from the information piggybacked on m y (m y .SSNmV [p].ND = false, m y .ND-mode = false). Then, q's mode becomes DM (ND-mode q = false).
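The first two cases can be mimicked with a small sketch. It is one possible reading of the narrative, not the paper's algorithm: here the execution mode is derived as "ND if any known ND flag is true", and a received vector entry replaces the local one only when it carries a larger ssn. Both rules, as well as the names ModeTracker, on_send, and on_receive, are assumptions made for illustration.

```python
from typing import List, Tuple


class ModeTracker:
    """Tracks (ssn, ND) per process and derives the execution mode from it."""

    def __init__(self, me: int, n: int) -> None:
        self.me = me
        self.ssn: List[int] = [0] * n      # SSNmV[k].ssn
        self.nd: List[bool] = [False] * n  # SSNmV[k].ND

    @property
    def nd_mode(self) -> bool:
        # Assumption: the mode is ND while any known process has an
        # unloggable ND event that is not yet covered by a checkpoint.
        return any(self.nd)

    def on_unloggable_event(self) -> None:
        self.nd[self.me] = True

    def on_checkpoint(self) -> None:
        self.nd[self.me] = False

    def on_send(self) -> Tuple[List[int], List[bool], bool]:
        self.ssn[self.me] += 1
        return list(self.ssn), list(self.nd), self.nd_mode

    def on_receive(self, m_ssn: List[int], m_nd: List[bool]) -> None:
        # Assumption: adopt an entry only if the message carries newer information.
        for k in range(len(self.ssn)):
            if k != self.me and m_ssn[k] > self.ssn[k]:
                self.ssn[k], self.nd[k] = m_ssn[k], m_nd[k]
```

Under these assumptions, a receiver that has learned of a sender's unloggable ND event stays in ND mode across its own checkpoint and returns to DM mode only after a later message shows that the sender has checkpointed, as in the Figure 2b narrative.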
As shown in Figure 2c, the third case is that a process q performs its computation depending on messages received from more than one process, p and r, which execute internal unloggable ND events. In this case, q first executes in its DM mode from its checkpointed state Ck j q , and then in its ND mode with an unloggable ND event (SSNmV q [q].ND = true, ND-mode q = true). Next, it receives two messages, m x and m z , from p and r, which are currently both in ND mode (m z .SSNmV [r].ND = true, m z .ND-mode = true). When taking its next checkpoint Ck
Let us examine how, by using the mode detection method, S-CIC can lower its number of extra checkpoints in the two types of NC Z-path patterns in comparison with HMNR, as shown in Figure 3. An example of the first pattern of NC paths, which violates the theorem in HMNR, is illustrated in Figure 3a. Let the ssns of p, q, and r be α, β, and γ for Ck i p , Ck j q , and Ck k r , respectively. When q sends m 2 to r, its mode is DM (ND-mode q = false) and its ssn is (β + 1) (SSNmV q [q] = (β + 1,false)). The two pieces of information are piggybacked on m 2 . On receiving m 2 , r is in the DM mode (SSNmV r [r] = (γ,false), ND-mode r = false). Then, it increments its rsn, rsn r , and saves a log element of the form e (sender's identifier, receiver's identifier, ssn, rsn, data) in the stable storage, for example, e (q, r, (β + 1), rsn r , m 2 .data) for m 2 . Afterwards, it updates its mode-detection-related information as follows: ND-mode r ← m 2 .ND-mode ∨ ND-mode r = false, SSNmV r = {(0,false),(β + 1,false),(γ,false)}. After p receives m 3 , whose ssn is (γ + 1), from r and logs it, p's mode remains DM, as r and p are both in DM mode (ND-mode p ← m 3 .ND-mode ∨ ND-mode p = false), and its ssn vector SSNmV p is updated to {(α,false),(β + 1,false),(γ + 1,false)}. When q receives m 1 from p and logs it, q is in ND mode (SSNmV q [q] = (β + 1,true), ND-mode q = true), and C HMNR becomes true, as sent_to q [r], m 1 .greater[r], and m 1 .lc > lc q are all true. However, q does not need to take any forced checkpoints (m 1 .ND-mode ∧ C HMNR = false) because it knows that p's state, including the sending of m 1 , is recoverable (m 1 .ND-mode = false); thus, m 1 can always be regenerated even if p crashes. When taking its next checkpoint Ck j+1 q , q's mode changes to DM and its vector SSNmV q is updated to {(α + 1,false),(β + 1,false),(γ + 1,false)}. Therefore, in this example, p's recoverable state up to the sending of m 1 after Ck i p , together with Ck j+1 q and Ck k+1 r , comprises a globally consistent state.
Figure 3b illustrates an example of the second pattern of NC paths that violate the theorem in HMNR. After r receives m 2 from q, whose mode is DM (m 2 .ND-mode = false, m 2 .SSNmV = {(0,false),(β + 1,false),(0,false)}), and logs it, r's mode is ND (m 2 .ND-mode ∨ ND-mode r = true) because of its internal unloggable ND event (SSNmV r [r] = (γ,true), ND-mode r = true). However, when taking a checkpoint Ck k+1 r (SSNmV r [r] = (γ,false)), r's mode changes to DM; thus, m 3 can always be replayed (m 3 .ND-mode = false). On receiving and logging m 3 , p keeps its DM mode and updates its variables as follows: ND-mode p ← m 3 .ND-mode ∨ ND-mode p = false, and SSNmV p = {(α,false),(β + 1,false),(γ + 1,false)}. When q receives m 1 from p and logs it, C HMNR becomes true, as m 1 .ckpt[q] = ckpt q [q] and m 1 .taken[q] are both true. However, q does not need to take any forced checkpoints (m 1 .ND-mode ∧ C HMNR = false) because r and p can deterministically reproduce and send m 3 and m 1 in order, respectively, even in the case of p's failure (m 1 .ND-mode = false). Therefore, Ck j+1 q , which is taken after delivering m 1 (ND-mode q = false, SSNmV q [q] = (β + 1,false)), is consistent with the states of the others right after sending m 1 and m 3 .
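The receive-side behavior in these two examples can be summarized in a minimal sketch: the message is logged with a fresh rsn, and a forced checkpoint is taken only if m.ND-mode ∧ C HMNR holds. The class and field names (LogEntry, Receiver, forced_checkpoints) are hypothetical, the in-memory list only stands in for the stable storage, and the value of C HMNR is passed in as a boolean rather than computed.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LogEntry:
    sender: str
    receiver: str
    ssn: int
    rsn: int
    data: bytes


class Receiver:
    def __init__(self, pid: str) -> None:
        self.pid = pid
        self.rsn = 0
        self.stable_log: List[LogEntry] = []  # stand-in for stable storage
        self.forced_checkpoints = 0

    def receive(self, sender: str, ssn: int, data: bytes,
                m_nd_mode: bool, c_hmnr: bool) -> bool:
        """Log the message, then decide whether a forced checkpoint is needed."""
        self.rsn += 1
        self.stable_log.append(LogEntry(sender, self.pid, ssn, self.rsn, data))
        # The HMNR condition forces a checkpoint only when the sender's state
        # is not known to be recoverable (m.ND-mode = true).
        must_checkpoint = m_nd_mode and c_hmnr
        if must_checkpoint:
            self.forced_checkpoints += 1
        return must_checkpoint
```

In both Figure 3 examples, C HMNR is true but m 1 .ND-mode is false, so the receiver q logs m 1 and delivers it without a forced checkpoint.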

Module EXEC-NDEVENT(event)
and before p sends m 1 or q receives it. In this case, before p sends m 1 to q, greater p [r] is true because p knows that lc p > lc r . When q receives m 1 , it has to perform a forced checkpointing action before delivery of m 1 because C 1 is satisfied.
Case 2.2: A causal sub-path, u, from r to p or q exists after r receives m 2 and before p sends m 1 or q receives it. At this point, two sub-cases must be checked.
Case 2.2.1: u occurs at r before Ck k+1 r . If lc p ≤ u.lc, Ck i p .lc < Ck k+1 r .lc and Ck k+1 r never becomes a useless checkpoint, regardless of whether u's destination is p or q. If p receives u, it updates greater p [r](=m 1 .greater[r]) to false. If u goes to q, lc q is updated with u.lc and is greater than or equal to m 1 .lc. In both cases, S-CIC can recognize that C 1 is not satisfied, and q does not perform a forced checkpointing action. Otherwise, Ck i p .lc ≮ Ck k+1 r .lc and Ck k+1 r may become useless in both cases. If p receives u, this condition causes greater p [r](=m 1 .greater[r]) to remain unchanged (true). In addition, when m 1 is transmitted to q, m 1 .lc > lc q , as u.lc ≥ lc q . If u goes to q, m 1 .greater[r] is true and m 1 .lc > lc q . In both cases, S-CIC can recognize that C 1 is satisfied and causes q to perform a forced checkpointing action before delivering m 1 .
Case 2.2.2: u occurs at r after Ck k+1 r . On receiving m 2 , r can keep the value of the latest checkpoint index of q in ckpt r [q]. As it takes Ck k+1 r , taken r [q] is set to true. When r sends u, ckpt r [q] and taken r [q] are eventually brought to q by a directed path, either <u> or <u, m 1 >. In both cases, S-CIC can recognize that C 2 is satisfied and causes q to perform a forced checkpointing action before the delivery of the message received by q, as in HMNR.
p i )) = l + 1. The following case is similar to the base case that was mentioned earlier. Through induction, our protocol ensures that no checkpoint becomes useless.

Performance Evaluation
We conducted extensive simulations to compare the performance of the two protocols, LazyHMNR and S-CIC, using the discrete-event simulation language PARSEC [23]. LazyHMNR is one of the most recently developed versions of HMNR, which is intended to decrease the high frequency of forced checkpointing [7]. S-CIC is our improved version of LazyHMNR with the advantageous features mentioned in the previous sections.
In this comparison, we examine one important performance index, NOFC, which indicates the total number of forced checkpoints taken under each protocol. The simulated system is a cluster of N computers on a broadcast network. All processes running on each computer begin and finish their individual execution together. Whenever a process transmits an application message, the message is destined for a single recipient. The link capacity and propagation delay of the simulated network are 100 Mbps and 1 ms, respectively. Every process takes basic local checkpoints at intervals that follow an exponential distribution with a mean of CI bc = 5 min. In addition, one of the N processes is selected at random and sends a message at intervals that follow an exponential distribution with a mean of TI send = 3 s. Furthermore, to measure the sensitivity of the two protocols to the communication pattern, further experiments were conducted by splitting the applications into four groups: serial, circular, hierarchical, and irregular [24]. The result figures show the NOFC for both LazyHMNR and S-CIC as the number of processes, denoted by NOP, scales from 6 to 12 when the percentage of internal unloggable ND events in each process (UND) is 20%, 40%, 60%, and 80% for the four different communication patterns, respectively. In these figures, UND never changed the NOFC of LazyHMNR because, unlike S-CIC, LazyHMNR has no method for skipping forced checkpointing actions if the state of the sender of each message right before the receipt of the message is recoverable. As NOP increased, the ratio of the NOFC of LazyHMNR to that of S-CIC increased for all four patterns, ranging from 1.3 to 6.5 depending on UND. The main reason is that there was an increased possibility of forming the two kinds of NC Z-path patterns and inducing forced checkpointing in LazyHMNR when C 1 or C 2 was satisfied.
In addition, fewer unloggable ND events per process (i.e., a lower UND) led to a significant decrease in the forced checkpointing overhead of S-CIC relative to that of LazyHMNR, owing to the advantageous features of S-CIC. These results indicate that, unlike LazyHMNR, S-CIC frequently skips forced checkpoints for each process by checking the recoverability of the dependent states of other processes. In addition, these features yield a large reduction in the number of forced checkpoints, regardless of the communication pattern. However, the degree of their effectiveness may fluctuate in the irregular pattern because its irregularity can cause the formation of Z-cycles and of the first type of NC Z-path to differ in every run. From the results, we can see that, with these features, S-CIC can alleviate the shortcomings of the family of HMNR protocols, including LazyHMNR.
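The timing part of the workload described in this section can be approximated with a short sketch. It is not the PARSEC model used in the experiments: the function names, the simulated horizon, and the uniformly random choice of the recipient are assumptions made for illustration, while the two means (CI bc = 5 min, TI send = 3 s) come from the text.

```python
import random
from typing import List, Tuple

CI_BC_MEAN_S = 5 * 60.0  # mean basic checkpoint interval (CI_bc = 5 min)
TI_SEND_MEAN_S = 3.0     # mean interval between message sends (TI_send = 3 s)


def exponential_arrivals(mean_s: float, horizon_s: float,
                         rng: random.Random) -> List[float]:
    """Event times whose gaps are exponentially distributed with the given mean."""
    times: List[float] = []
    t = rng.expovariate(1.0 / mean_s)
    while t < horizon_s:
        times.append(t)
        t += rng.expovariate(1.0 / mean_s)
    return times


def send_events(n: int, horizon_s: float,
                rng: random.Random) -> List[Tuple[float, int, int]]:
    """(time, sender, receiver) triples; each message has a single recipient."""
    events = []
    for t in exponential_arrivals(TI_SEND_MEAN_S, horizon_s, rng):
        sender = rng.randrange(n)
        # Assumption: the recipient is drawn uniformly among the other processes.
        receiver = rng.choice([k for k in range(n) if k != sender])
        events.append((t, sender, receiver))
    return events
```

A call such as send_events(6, 3600.0, random.Random(1)) produces roughly one send every 3 s over a simulated hour, and exponential_arrivals(CI_BC_MEAN_S, 3600.0, rng) gives the basic checkpoint times of one process.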

Conclusions
The proposed protocol, S-CIC, was developed to incorporate the following advantageous features. First, even in situations in which HMNR or LazyHMNR decides that a process has to perform forced checkpointing on receiving a message, S-CIC does not cause the process to perform this action when it recognizes that the state of its sender right before the receipt of the message is recoverable, leading to a large reduction in the number of forced checkpoints compared with the family of HMNR protocols. Second, S-CIC is not required to assume the PWD model, despite being combined with message logging. These features are realized by piggybacking the sender's recoverability status and a vector containing the last send sequence number and the unloggable event occurrence status of every process onto each sent message. Our simulation results verified that the protocol outperforms the representative optimized protocol, LazyHMNR, with respect to the forced checkpointing frequency, regardless of the communication pattern used.

Data Availability Statement:
The data that support the findings of this study are available from the corresponding author, J.A., upon reasonable request.

Conflicts of Interest:
The author declares no conflict of interest.