THE PROBLEM OF SYSTEM FAULT-TOLERANCE

System-level self-diagnosis (SLSD) has been investigated extensively in the literature. It aims at diagnosing systems composed of units that are required to be able to test each other by exchanging information over the available links. The article describes a simplified state-transition diagram model that gives a general impression of how checking, diagnosis and recovery can “conjointly” influence system reliability and fault-tolerance. The model uses integrated parameters and can serve as a starting point and a basis for further refinements.


Introduction
System-level self-diagnosis (SLSD) was introduced by Preparata et al. [6] and has been investigated extensively in the literature. It aims at diagnosing systems composed of units (also called processing elements), which are required to be able to test each other by exchanging information over the available links. At this level of diagnosis, each test is considered atomic: the details of a test are abstracted away, and only its result is taken into consideration. Each test result is expressed as a binary variable that takes the value 0 or 1. The set of test results is called a syndrome. A syndrome contains information about the states of the system units in coded form. One of the tasks of SLSD is to decode a syndrome by using a diagnosis algorithm.
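To illustrate what decoding a syndrome can mean, the following minimal sketch enumerates all fault sets that are consistent with a given syndrome. It is not the diagnosis algorithm of any cited work. It assumes that a result of 1 means "the tested unit is found faulty", that a fault-free tester always reports the true state of the tested unit, and that a faulty tester may report anything. The function name, the ring-shaped testing assignment and the example syndrome are hypothetical.

```python
from itertools import combinations

def consistent_fault_sets(n_units, syndrome, max_faults):
    """Return every fault set of size <= max_faults consistent with the syndrome.

    syndrome maps (tester, tested) -> 0 or 1.
    Assumption (not from the text): a fault-free tester always reports the true
    state of the tested unit; the result of a faulty tester is ignored because
    it may be arbitrary.
    """
    result = []
    for size in range(max_faults + 1):
        for candidate in combinations(range(n_units), size):
            faulty = set(candidate)
            if all(
                outcome == (1 if tested in faulty else 0)
                for (tester, tested), outcome in syndrome.items()
                if tester not in faulty
            ):
                result.append(faulty)
    return result

# Hypothetical example: five units in a ring, unit i tests unit (i + 1) % 5.
# Unit 2 is faulty, so unit 1 reports 1; the result produced by unit 2 is arbitrary.
syndrome = {(0, 1): 0, (1, 2): 1, (2, 3): 0, (3, 4): 0, (4, 0): 0}
print(consistent_fault_sets(5, syndrome, max_faults=1))  # prints [{2}]
```

Exhaustive enumeration is only practical for small systems; it is used here because it makes the notion of a syndrome being consistent with a fault set explicit.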

System level self-diagnosis
To provide system-level self-diagnosis, the tests among system units can be performed in one of the following ways:
1) in accordance with a pre-set schedule (i.e., defined a priori),
2) in an adaptive manner, where the tests are initially performed in accordance with an a priori testing assignment. Once a unit is diagnosed as fault-free, the tests it performs are considered reliable, and therefore any other unit needs to be tested only once by this fault-free unit to determine its status correctly. Thus, the testing assignment is adapted so that units diagnosed as fault-free perform all the testing in the system [1],
3) entirely at random (i.e., from the beginning to the end of testing).
In all cases, the intention is to minimize the time needed to perform the set of tests.
Random performance of tests is considered in the context of both system self-checking and system self-diagnosis. Self-checking is the process that aims at discriminating between two states of a system: fault-free and faulty. The result of self-checking does not indicate which of the system units has failed; it only testifies to the presence of fault(s) in the system. Self-checking may require a small number of tests. When $P_{AT} = 1$ and $P_S = P_F = 1$ (see Table 1), it is only necessary to establish that each of the system units has been tested at least once. It may happen that N tests are sufficient for system self-checking (see Fig. 1), where N is the number of system units.

Fig. 1. Cases when each unit is tested
To provide system self-checking, it is not necessary to form the syndrome at all costs and, consequently, to analyze it. A message or signal indicating that the system is fault-free (resp. faulty) is sufficient. Such a message can be produced, for example, by the unit that has obtained a test result equal to 1. In what follows, we consider the case when tests are performed during system operation. Hence, it is not possible to determine in advance which of the system units will be idle at a given moment and will thus be able to test (or be tested by) another system unit. It follows that not only the pair of units involved in a test but also the instant at which the test is performed is random. The number of tests performed in the system during a certain period of time is also a random value.
At the beginning, the self-checking procedure is performed to find out whether the system contains faulty unit(s). The duration of self-checking depends on the required credibility of the self-checking result. If no test result equal to 1 is obtained during self-checking (i.e., all test results are equal to 0), the self-checking procedure ends, and the respective message or signal is delivered to the system environment. The self-checking procedure and the subsequent delivery of information about the state of the system can be repeated at certain intervals as long as the system is operating. Otherwise, that is, when a test result indicating the presence of a faulty unit in the system is obtained, the self-checking procedure is terminated immediately, and the self-diagnosis procedure is started. The aim of the self-diagnosis procedure is to identify the faulty unit(s). As research results show, one of the most difficult tasks is determining the duration of self-checking when all test results indicate that there are no faulty units in the system (i.e., all test results are equal to 0). Fig. 2 depicts the cycle of self-checking (SSC) and the possible self-diagnosis. Fig. 2 also helps to elucidate important features of self-checking. It shows that a fault occurrence does not immediately terminate the self-checking procedure. Self-checking, as a rule, continues until the fault is detected (captured) by one of the tests. After normal termination of each SSC, the result of self-checking is delivered to the system environment. This result indicates that the system is fault-free. Only in the case of anomalous termination of an SSC is no self-checking result delivered to the system environment. Thus, normally, the same information is delivered to the system environment every time.
Consequently, the idea suggests itself that self-checking could be organized in such a way that its result is not delivered at all. In this case, the absence of information about the system state would mean that the system is fault-free. However, this proposition has not been sufficiently researched from either the theoretical or the practical point of view. Nevertheless, it is worth noting that this situation can be treated within the present consideration as a particular case in which the duration of the self-checking cycle approaches infinity.
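The self-checking cycle described above can be sketched in a few lines of code. This is only an illustration under stated assumptions: the cycle ends normally after a pre-set number of tests, a fault-free tester always reports the true state of the tested unit, and a faulty tester reports 0 or 1 at random. The names run_ssc and random_tests are hypothetical.

```python
import random
from itertools import islice

def run_ssc(tests, faulty, max_tests, rng=None):
    """Run one self-checking cycle over a stream of (tester, tested) pairs.

    Returns ("fault-free", k) when max_tests results equal to 0 were obtained
    (normal termination) or ("fault-detected", k) at the first result equal
    to 1 (anomalous termination); k is the number of tests performed.
    Assumptions: a fault-free tester reports the true state of the tested
    unit; a faulty tester reports 0 or 1 at random.
    """
    rng = rng or random.Random()
    performed = 0
    for performed, (tester, tested) in enumerate(islice(tests, max_tests), start=1):
        if tester in faulty:
            outcome = rng.randint(0, 1)
        else:
            outcome = 1 if tested in faulty else 0
        if outcome == 1:
            # Self-checking stops here; self-diagnosis would start next.
            return "fault-detected", performed
    return "fault-free", performed

def random_tests(n_units, rng):
    """Endless stream of random ordered pairs of distinct units."""
    while True:
        tester, tested = rng.sample(range(n_units), 2)
        yield tester, tested

rng = random.Random(1)
print(run_ssc(random_tests(5, rng), faulty=set(), max_tests=20, rng=rng))
# prints ('fault-free', 20)
```

Passing the tests as a stream keeps the random choice of unit pairs separate from the termination logic, so a fixed list of pairs can be supplied to reproduce a particular scenario, including one in which a fault is present but is not captured within the cycle.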

Fig. 2. Self-checking cycles and fault occurrence
Several solutions have been suggested for organizing the SSC (mainly, for defining its duration) [2,3,4]. Basically, the SSC continues until one of the following conditions is met:
1) a pre-set time has expired. The duration of the SSC is a constant value fixed in advance,
2) a certain number of tests has been performed. The SSC continues until a pre-set number of tests has been performed, so its duration is random,
3) a certain diagnosis graph (DG) has been formed. The SSC continues until the tests form a certain diagnosis graph (resp. a DG that belongs to a subset of diagnosis graphs defined a priori). The duration of the SSC is random.
The cases in which the duration of the SSC is fixed or defined by a certain number of performed tests can be further distinguished according to whether the obtained diagnosis graph has to be analyzed or not. When such analysis does not have to be performed, the task arises of computing the probability that all system units have been tested at least once. In practice, however, the opposite approach can be applied: the duration of the SSC (resp. the required number of tests) is computed from the required probability that all system units will be tested. Analysis of the obtained DG aims at checking whether all system units have been tested or whether the formed DG belongs to the predefined subset of diagnosis graphs, depending on the required credibility of the self-checking result. When the analysis shows that not all of the system units have been tested, the SSC can be continued for a predefined period of time (the so-called extended period). After this extended period expires, the analysis is repeated, this time taking into account all of the tests performed during both the main and the extended periods.
Determining the optimal number of possible extended periods of SSC and the time of their duration is a separate problem.

System fault-tolerance
System tolerance to the failure of its units can be modeled with different mathematical models. Most often, the system state-transition diagram (Markov model) is used for this purpose.
The Markov model is analyzed in order to determine the probability of the system being in a given state at a given point in time, the amount of time the system is expected to spend in a given state, and the expected number of transitions between states. On the basis of these quantities, it is possible to quantify and estimate the system reliability and fault-tolerance.
For systems capable of graceful degradation, the state-transition diagram includes the following states:
$S_0$: all of the system units (i.e., N units) are actively engaged in performing system and diagnosis tasks. In other words, the system is fully operational,
$S_1$: one of the system units is isolated (i.e., it does not perform system tasks). The system is slightly degraded but continues to deliver degraded (although acceptable) services,
$S_2$: two system units are isolated, so N-2 active units remain in the system. The system is significantly degraded but is still able to deliver acceptable services,
$S_3$: total failure.
For simplicity, only systems that can tolerate the presence of no more than two faulty units are considered here. Transitions of the system from one state to another are depicted in Fig. 3.

Fig. 3. Model of system fault-tolerance
The rates of system transitions from one state to another are denoted by $\lambda_0$, $\lambda_1$, $\lambda_2$, and the probabilities of the corresponding transitions by $q_0$, $q_1$. The values of $\lambda_i$, $i = 0, 1, 2$, depend on the reliability of the system units, and the values of $q_i$, $i = 0, 1$, depend on the efficiency of the self-checking, self-diagnosis and recovery procedures. Transitions between particular states can be considered to follow the Poisson model. The Poisson model has proven suitable for describing many natural and technical processes. Palm [5] pointed out that in many cases the superposition of a large number of independent stationary processes can be approximated by a Poisson process. This gives us reason to apply the Poisson model to the system state-transition diagram under consideration. Since in the Poisson model the waiting time until the next occurrence of the event follows an exponential distribution, the time the system spends in a given state is also exponentially distributed.
Let $P_i(t)$ be the probability of the system being in state $S_i$ at time $t$. When transitions from one state to another follow the Poisson model, the sought probabilities $P_i(t)$, $i = 0, \dots, 3$, can be determined from the Kolmogorov equations. The Kolmogorov equations describe the dynamics of entering, resp. leaving, a particular state. For example, for state $S_1$ these dynamics are expressed by the differential equation
$$\frac{dP_1(t)}{dt} = -\lambda_1 P_1(t) + \lambda_0 (1 - q_0) P_0(t).$$
This means that the system leaves the state $S_1$ (minus sign) with intensity $\lambda_1$ and enters it (plus sign) with intensity $\lambda_0 (1 - q_0)$. The state $S_0$ is the initial state, that is, $P_0(0) = 1$ and $P_i(0) = 0$ for $i = 1, 2, 3$. Taking Laplace transforms of the Kolmogorov equations yields a system of algebraic equations for the transforms of $P_i(t)$. The probabilities of the system being in states $S_0$, $S_1$, $S_2$ and $S_3$, i.e., $P_0(t)$, $P_1(t)$, $P_2(t)$ and $P_3(t)$, are functions of time and of the parameters $\lambda$ and $q$. In turn, the probabilities $q_0$ and $q_1$ depend considerably on the efficiency of the checking, diagnosis and recovery procedures. Fig. 4 shows the impact of the values of $q_0$ and $q_1$ on the probability $P_3(t)$.
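The Kolmogorov system for all four states can be sketched as follows. The transition structure is an assumption that is consistent with the description of state $S_1$: a unit failure in $S_0$ (resp. $S_1$) is taken to lead directly to total failure $S_3$ with probability $q_0$ (resp. $q_1$) and to the next degraded state otherwise. The notation $P_i^*(s)$ for the Laplace transform of $P_i(t)$ is likewise introduced here for illustration.

```latex
\begin{align}
\frac{dP_0(t)}{dt} &= -\lambda_0 P_0(t), \\
\frac{dP_1(t)}{dt} &= -\lambda_1 P_1(t) + \lambda_0 (1 - q_0) P_0(t), \\
\frac{dP_2(t)}{dt} &= -\lambda_2 P_2(t) + \lambda_1 (1 - q_1) P_1(t), \\
\frac{dP_3(t)}{dt} &= \lambda_0 q_0 P_0(t) + \lambda_1 q_1 P_1(t) + \lambda_2 P_2(t).
\end{align}
% Laplace transforms with P_0(0) = 1 and P_i(0) = 0 for i = 1, 2, 3:
\begin{align}
(s + \lambda_0) P_0^*(s) &= 1, \\
(s + \lambda_1) P_1^*(s) &= \lambda_0 (1 - q_0) P_0^*(s), \\
(s + \lambda_2) P_2^*(s) &= \lambda_1 (1 - q_1) P_1^*(s), \\
s P_3^*(s) &= \lambda_0 q_0 P_0^*(s) + \lambda_1 q_1 P_1^*(s) + \lambda_2 P_2^*(s).
\end{align}
```

The right-hand sides of the four differential equations sum to zero, so $P_0(t) + P_1(t) + P_2(t) + P_3(t) = 1$ is preserved for all $t$, as it must be for a probability distribution over the states.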
The function $P_3(t)$ was calculated for a homogeneous system with five units, each with $\lambda = 10^{-4}$ 1/h. The case $q_0 = q_1 = 0$ corresponds to "absolutely perfect" checking, diagnosis and recovery. The probability $P_3(t)$ also makes it possible to estimate the amount of time the system is expected to spend in states other than $S_3$ (i.e., the time to failure). Mostly, the time during which the system operates without maintenance is short relative to its mean time to failure. Hence, the impact of checking, diagnosis and recovery on the reliability of the system is essential. For systems with a great number of units, it is difficult to examine their state-transition diagrams in detail in order to determine all the above-mentioned probabilities. Usually, only the main reliability and fault-tolerance parameters are determined. The most common reliability parameter is the mean time to failure (MTTF), which can also be specified as the failure rate or the number of failures during a given period. The MTTF is usually specified in hours but can also be expressed in other units of measurement (e.g., in cycles). The system reliability can be expressed as the sum of the probabilities of the system being in all states except the state of total failure, that is, as $P_0(t) + P_1(t) + P_2(t) = 1 - P_3(t)$. For a system that is unable to tolerate the failure of a single unit, leaving the state $S_0$ leads immediately to system failure (i.e., a direct transition into state $S_3$). From this we can deduce that the period of time during which the system is in states $S_1$ and $S_2$ reflects the ability of the system to tolerate failures of its units. The mean duration of this period can be calculated from the probabilities $P_1(t)$ and $P_2(t)$. As an indicator of system fault-tolerance, the total number of failed units that the system can tolerate while continuing to deliver acceptable services is normally used.
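A calculation of $P_3(t)$ of the kind described above can be sketched numerically. The sketch below integrates an assumed form of the Kolmogorov system with the classical fourth-order Runge-Kutta scheme. Two assumptions are made that do not come from the text: a unit failure in $S_0$ or $S_1$ leads directly to $S_3$ with probability $q_0$ or $q_1$, and the transition rates of a homogeneous system are proportional to the number of active units. The chosen values of $q$, the time horizon and the function names are illustrative only.

```python
import math

def derivatives(p, lam, q):
    """Right-hand side of the assumed Kolmogorov system for states S0..S3."""
    p0, p1, p2, _ = p
    l0, l1, l2 = lam
    q0, q1 = q
    return (
        -l0 * p0,
        -l1 * p1 + l0 * (1 - q0) * p0,
        -l2 * p2 + l1 * (1 - q1) * p1,
        l0 * q0 * p0 + l1 * q1 * p1 + l2 * p2,
    )

def state_probabilities(t_end, lam, q, steps=1000):
    """Integrate from P(0) = (1, 0, 0, 0) to t_end with the classical RK4 scheme."""
    p = (1.0, 0.0, 0.0, 0.0)
    h = t_end / steps
    for _ in range(steps):
        k1 = derivatives(p, lam, q)
        k2 = derivatives(tuple(x + 0.5 * h * k for x, k in zip(p, k1)), lam, q)
        k3 = derivatives(tuple(x + 0.5 * h * k for x, k in zip(p, k2)), lam, q)
        k4 = derivatives(tuple(x + h * k for x, k in zip(p, k3)), lam, q)
        p = tuple(
            x + h / 6 * (a + 2 * b + 2 * c + d)
            for x, a, b, c, d in zip(p, k1, k2, k3, k4)
        )
    return p

# Five units with a unit failure rate of 1e-4 per hour (values from the text).
# Assumption: lambda_i equals the number of active units times the unit rate.
unit_rate = 1e-4
lam = (5 * unit_rate, 4 * unit_rate, 3 * unit_rate)
for q in ((0.0, 0.0), (0.1, 0.1), (0.3, 0.3)):
    p = state_probabilities(1000.0, lam, q)
    # P0(t) has the closed form exp(-lambda_0 * t), which serves as a sanity check.
    error = abs(p[0] - math.exp(-lam[0] * 1000.0))
    print(f"q0 = q1 = {q[0]:.1f}: P3(1000 h) = {p[3]:.5f}, P0 error = {error:.1e}")
```

A fixed-step scheme is sufficient here because the rates are small compared with the step size; for stiff or much larger models a library solver would be the more usual choice.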
As another indicator of system fault-tolerance, a ratio, denoted $Q$, can be used. In order to elucidate how the indicator $Q$ characterizes the system fault-tolerance, let us consider two systems. Assume that both systems have equal values of MTTF, i.e., $T_0^{(1)} = T_0^{(2)}$. Assume also that the first system has $Q = 0.2$ and the second one has $Q = 0.8$. In this case, we can conclude that the first system has reliable units but not very effective means of checking, diagnosis and recovery. In contrast, the second system has not very reliable units but very effective means of checking, diagnosis and recovery. In the case of $T_0^{(1)} \neq T_0^{(2)}$, the system fault-tolerance can be evaluated by the mean time the system spends in states $S_1$ and $S_2$. However, in this case only a rough estimate can be made.
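The indicators discussed above can be illustrated with closed-form expressions. The sketch rests on assumptions that are not stated in the text: $Q$ is taken to be the ratio of the mean time spent in states $S_1$ and $S_2$ to the MTTF, and a unit failure in $S_0$ (resp. $S_1$) is taken to lead directly to $S_3$ with probability $q_0$ (resp. $q_1$). Under these assumptions, the mean times spent in $S_0$, $S_1$ and $S_2$ are $1/\lambda_0$, $(1-q_0)/\lambda_1$ and $(1-q_0)(1-q_1)/\lambda_2$. The function names and the example rates are hypothetical.

```python
def mean_times(lam, q):
    """Mean time spent in S0, S1 and S2 before total failure (assumed model)."""
    l0, l1, l2 = lam
    q0, q1 = q
    t_s0 = 1 / l0
    t_s1 = (1 - q0) / l1                # S1 is reached with probability 1 - q0
    t_s2 = (1 - q0) * (1 - q1) / l2     # S2 is reached with probability (1 - q0)(1 - q1)
    return t_s0, t_s1, t_s2

def fault_tolerance_indicators(lam, q):
    """Return (MTTF, mean time in S1 and S2, Q), with Q assumed to be their ratio."""
    t_s0, t_s1, t_s2 = mean_times(lam, q)
    mttf = t_s0 + t_s1 + t_s2
    degraded = t_s1 + t_s2
    return mttf, degraded, degraded / mttf

# Example rates for five units with a unit failure rate of 1e-4 per hour,
# assuming rates proportional to the number of active units.
lam = (5e-4, 4e-4, 3e-4)
for q in ((0.0, 0.0), (0.5, 0.5), (1.0, 1.0)):
    mttf, degraded, ratio = fault_tolerance_indicators(lam, q)
    print(f"q = {q}: MTTF = {mttf:.0f} h, time in S1 and S2 = {degraded:.0f} h, Q = {ratio:.3f}")
```

With this definition, $Q$ is 0 for a system that cannot tolerate any unit failure and grows as checking, diagnosis and recovery become more effective, which matches the interpretation of the two systems compared above.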

Conclusions
It should be noted that the model considered above (the state-transition diagram) is very much simplified and only gives a general impression of how checking, diagnosis and recovery can "conjointly" influence system reliability and fault-tolerance. The model uses integrated parameters (e.g., the probabilities $q_i$). This means that, using this model, it is difficult to decide what specific measures should be undertaken in order to bring these probabilities to a required value. The model does not make it possible to estimate to what extent increasing the efficiency of each procedure (checking, diagnosis, recovery) improves the system reliability and fault-tolerance. Nevertheless, this simplified model is useful as a starting point and as a basis for further refinements.