A novel reliability estimation method of multi-state system based on structure learning algorithm

Traditional reliability models, such as fault tree analysis (FTA) and reliability block diagrams (RBD), are typically constructed with reference to the function principle graph produced by system engineers, which requires substantial time and effort. In addition, the quality and correctness of the models depend on the ability and experience of the engineers, and the models are difficult to verify. With the development of data acquisition, data mining and system modeling techniques, the operational data of complex systems with multi-state, dependent behavior can be obtained and analyzed automatically. In this paper, we present a method based on the K2 algorithm for establishing a Bayesian network (BN) to estimate the reliability of a multi-state system with dependent behavior. Facilitated by BN tools, reliability modeling and reliability estimation can be conducted automatically. An illustrative example is used to demonstrate the performance of the method.


Introduction
For estimating the reliability of a complex system, constructing an accurate reliability model of the system is essential. A variety of popular reliability models are available, such as fault trees, reliability block diagrams, and Bayesian networks [28]. Unfortunately, they require professional knowledge and experience in modeling, along with a detailed understanding of the system structure and of how the system operates. Moreover, constructing these models requires substantial effort for complex systems, even medium-scale ones. Two main difficulties are encountered in building these models:
(1) Modern systems consist of hardware and software with complex interactions, which are becoming increasingly difficult to model.
(2) Reliability models and function principle graphs describe different aspects of the system. Although designers know how their systems work, they often lack the skills and experience needed for reliability modeling and analysis. Thus, there is a gap between system functional models and reliability models.
To overcome these problems, researchers have proposed automatic transformation methods for converting function principle graphs into fault trees [1,20] or RBDs [11]. In addition, Bucci attempted to construct dynamic fault trees and event trees from corresponding Markov models so that time-dependent failures can be considered [2]. Moreover, dynamic reliability models such as the dynamic fault tree (DFT) or the dynamic reliability block diagram can be transformed into a dynamic Bayesian network to estimate the reliability of a dynamic system [17,19,21]. However, even if reliability models could be generated automatically from function principle graphs, they would not reflect the changes that occur while the system is operating. In addition, when human factors are involved, it is difficult to incorporate these factors into the reliability models.
Methods such as the GO methodology and the Altarica project have been proposed to overcome the problem of models not matching the specifications of the systems under study. The GO methodology uses a straightforward inductive logic to construct system models [18]. Enhanced methods improve the performance of the GO methodology and provide additional analysis results about the systems: a new quantification algorithm based on binary decision diagrams (BDD) is proposed in [5], a quality analysis method combined with FMEA is discussed in [12], and dynamic behavior can be modeled via the extended GO methodology [25]. Altarica is a high-level modeling language that can describe the hierarchy of a system and the behaviors of its components [7,23]. The model can be compiled into lower-level formalisms for analyzing the reliability and dependability of the systems. The methods discussed above can provide models that are similar to the function principle graph; however, they are difficult to use in practice due to their complexity.
Data-driven system reliability estimation methods such as prognostics and health management (PHM) were proposed many years ago. However, traditional PHM methods are designed for specified operational conditions, and the analysis results are limited to the systems in operation [30]. Hence, researchers studied how to obtain a generic model from data. Zaitseva [31] presented an approach based on a decision tree that learns, from the source data, the structure function of the system, which represents the states of the system and its components. The method is used not only to construct reliability models of general systems but also to estimate the reliability of the corresponding human factor system [13]. Zaitseva extended this method to multi-state systems [32]. However, the approach is only suitable for relatively simple systems without complicated interactions among components. Doguc [4] proposed a structure learning method based on Bayesian networks for constructing the reliability model. However, that work considered only binary systems and did not account for the interactive effects between components.
Dependent failure is an important behavior that can substantially affect the reliability of a system. Two types of dependent failure were identified in [27]: the first type can be described by the functional dependency gate in DFT, and the second type is found in multi-state systems. Thus, dependent failure should be considered in reliability models. Unfortunately, it is difficult to identify the dependencies between components and the interactive effects between components and subsystems.
In this paper, we propose a method based on a structure learning algorithm for modeling and estimating the reliability of multi-state systems while considering dependent failures. We focus on the dependency behavior between components.
The remainder of this paper is organized as follows. Section 2 briefly summarizes multi-state systems with dependency behavior. Section 3 presents a methodology, based on the K2 algorithm [3], for modeling the cause-and-effect relationships of multi-state systems and for evaluating conditional probability tables (CPTs). In Section 4, an example is used to demonstrate the entire modeling process. Section 5 presents experimental results on the accuracy and performance of the methodology. Finally, Section 6 presents and discusses the conclusions of this work.

Multi-state system with dependency behavior
Multi-state systems were introduced in 1968. Since then, many researchers have contributed to the reliability theory of multi-state systems and developed a variety of methods for evaluating their reliability. Related works are reviewed in [14,29].
For simplicity, traditional reliability modeling methods assume that components are independent. However, this assumption often does not hold for real engineering systems. In [27], the authors described several dependent failure scenarios, such as common cause failure, load-sharing, cascade failure [24], sequential failure, cross-system dependencies and interactions between components. These dependent failures can severely affect the reliability estimates of the systems.
Thus, researchers improved the traditional models to include dependent failures. A stochastic process [16] is an intuitive model for representing the correlations between components. Song [26] presented stochastic multivalued models for evaluating the reliability of a multi-state system (MSS) with dependent multi-state components. However, the state-space explosion problem poses substantial challenges for large-scale multi-state systems. Researchers therefore explored combinatorial methods for modeling multi-state systems with dependent failures. Levitin [15] extended the universal generating function (UGF) approach to multi-state systems with dependent elements; the dependent failure considered there is of the second type [26]. Nagayama [22] analyzed the reliability of a multi-state system with partially dependent components based on multi-valued decision diagrams.
In the works discussed above, the dependencies between components are formulated as conditional probabilities. This problem can be represented more easily by a Bayesian network, which can describe complicated relationships. Thus, a BN is an alternative way of modeling a multi-state system with dependency behavior. Various BN tools, such as BNT [6], support BN inference and implement many learning algorithms. Consequently, a BN is well suited to modeling multi-state systems with consideration of dependency behavior.

Methodology
In this section, a reliability estimation method for multi-state systems based on the K2 algorithm [3] and point estimation is proposed. The dependent relationships among components, subsystems and the system are learned from data via the K2 algorithm, and the parameters of the BN are estimated via point estimation. Then, the reliability of the multi-state system with consideration of customer demand can be evaluated. Facilitated by BN tools, the whole process can be conducted automatically.

Structure learning algorithm K2
The K2 algorithm was proposed in [3] as a heuristic-search method. It assumes that 1) the variables are ordered and 2) all structures are equally likely a priori. Under these assumptions, a variable can only have parents that precede it in the ordering; that is, high-rank variables cannot be the parents of low-rank variables. Hence, the search space of the parent sets of a variable is reduced substantially.
The K2 algorithm consists of two main components:
1) A scoring function that quantifies the associations and ranks the parent sets according to their scores:
$$g(i,\pi_i)=\prod_{j=1}^{q_i}\frac{(n_i-1)!}{(N_{ij}+n_i-1)!}\prod_{k=1}^{n_i}\alpha_{ijk}!$$
where
$i$: index of the components;
$\pi_i$: set of parents of component $x_i$;
$q_i=|\phi_i|$: number of parent instantiations;
$\phi_i$: list of all possible instantiations of the parents of $x_i$ in database D; namely, if $x_1,\dots,x_s$ are the parents of $x_i$, then $\phi_i$ is the Cartesian product $\{v_1^1,\dots,v_{n_1}^1\}\times\cdots\times\{v_1^s,\dots,v_{n_s}^s\}$, where $n_1,\dots,n_s$ represent the numbers of states of the parent components;
$V_i$: list of all possible state values of component $x_i$, with $n_i=|V_i|$;
$\alpha_{ijk}$: the number of cases in D in which component $x_i$ is instantiated with its kth value and the parents of $x_i$ in $\pi_i$ are instantiated with the jth instantiation in $\phi_i$;
$N_{ij}=\sum_{k=1}^{n_i}\alpha_{ijk}$: the number of instances in the database in which the parents of $x_i$ in $\pi_i$ are instantiated with the jth instantiation in $\phi_i$.
To improve the run-time speed of K2, the logarithmic version of the equation above is implemented in this paper:
$$\log g(i,\pi_i)=\sum_{j=1}^{q_i}\left[\log\Gamma(n_i)-\log\Gamma(N_{ij}+n_i)+\sum_{k=1}^{n_i}\log\Gamma(\alpha_{ijk}+1)\right]$$
where $\Gamma(\cdot)$ denotes the gamma function and the other parameters are as defined above.
2) A greedy-search method that incrementally adds nodes to the parent set to reduce the search space.
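As a rough illustration of the logarithmic score in item 1), the following Python sketch computes $\log g(i,\pi_i)$ from a list of cases by counting $\alpha_{ijk}$ and $N_{ij}$. It is a minimal sketch under stated assumptions: `log_k2_score` is a hypothetical helper (not taken from [3] or from any BN library), and every case is assumed to be a tuple of integer state indices $0,\dots,n-1$. It relies on `math.lgamma`, which returns $\log\Gamma$, so no large factorials are ever formed.

```python
import math
from collections import Counter


def log_k2_score(data, i, parents, n_states):
    """Logarithmic K2 score log g(i, pi_i) of node i for a candidate parent set.

    data:     list of cases; each case is a tuple of integer states (0..n-1)
    i:        index of the node being scored
    parents:  list of parent indices (the candidate pi_i)
    n_states: n_states[v] is the number of states of variable v
    """
    n_i = n_states[i]
    # N_ij: number of cases in which the parents take their j-th instantiation
    n_ij = Counter(tuple(case[p] for p in parents) for case in data)
    # alpha_ijk: number of cases with parent instantiation j and x_i in state k
    alpha = Counter((tuple(case[p] for p in parents), case[i]) for case in data)

    score = 0.0
    # Parent instantiations that never occur in the data contribute
    # lgamma(n_i) - lgamma(0 + n_i) + n_i * lgamma(1) = 0, so they are skipped.
    for j, count in n_ij.items():
        score += math.lgamma(n_i) - math.lgamma(count + n_i)
        for k in range(n_i):
            score += math.lgamma(alpha[(j, k)] + 1)
    return score


# Usage: x1 always copies x0, so adding x0 as a parent raises the score.
cases = [(0, 0), (1, 1)] * 10
print(log_k2_score(cases, 1, [], [2, 2]))   # no parents
print(log_k2_score(cases, 1, [0], [2, 2]))  # x0 as parent (higher score)
```

Working in log space keeps the score finite for data sets with thousands of cases, where the factorials in $g(i,\pi_i)$ would overflow.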
With this heuristic, the K2 algorithm does not need to consider all possible parent sets; it incrementally adds the parent whose addition most increases the probability of the resulting structure. If no single parent addition can increase the probability, no further parents are added to the node. Initially, node $x_i$ has no parents, and the nodes $x_1, x_2, \dots, x_{i-1}$ that precede it in the ordering are the candidate parents, so the parent set of $x_i$ can only be drawn from $\{x_1,\dots,x_{i-1}\}$. Hence, the parent set space is reduced substantially. In addition, to increase the efficiency of the algorithm, the K2 algorithm uses a parameter u to restrict the maximum number of parents. The pseudocode of the K2 algorithm can be found in [3].
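The greedy parent search can be outlined in Python as follows. This is a hypothetical sketch, not the pseudocode of [3] verbatim: `k2_parents` and its arguments are illustrative names, candidate parents are restricted to the nodes that precede node i in the ordering, the best single addition is accepted only while it improves the score, and at most u parents are kept. The `score` argument stands for any decomposable score; in the method described here it would be the logarithmic K2 score $\log g(i,\pi_i)$.

```python
from typing import Callable, List, Sequence


def k2_parents(i: int, order: Sequence[int],
               score: Callable[[int, List[int]], float], u: int) -> List[int]:
    """Greedy K2 parent search for node i (illustrative sketch).

    order: variable ordering; only nodes before i are candidate parents
    score: score(i, parents) -> float, e.g. the logarithmic K2 score
    u:     maximum number of parents allowed
    """
    order = list(order)
    candidates = order[:order.index(i)]  # x_1, ..., x_{i-1} in the ordering
    parents: List[int] = []
    best = score(i, parents)  # score of the empty parent set
    while len(parents) < u:
        remaining = [c for c in candidates if c not in parents]
        if not remaining:
            break
        # Try adding each remaining candidate and keep the best single addition.
        z = max(remaining, key=lambda c: score(i, parents + [c]))
        new = score(i, parents + [z])
        if new <= best:
            break  # no single addition increases the score: stop for this node
        best = new
        parents.append(z)
    return parents


def toy(i, ps):
    """Toy score that peaks when the parent set is exactly {0, 2}."""
    return -len(set(ps) ^ {0, 2})


print(k2_parents(3, [0, 1, 2, 3], toy, u=3))  # [0, 2]
print(k2_parents(3, [0, 1, 2, 3], toy, u=1))  # [0]
```

Because `max` returns the first maximal candidate, ties are broken in favor of nodes earlier in the ordering; with u = 1 the search stops after one addition even though a better parent set exists, which illustrates the efficiency/correctness trade-off controlled by u.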
According to the description above, to learn the correct Bayesian network with the K2 algorithm, the variable ordering should correspond to the practical operational mode of the system. For example, a subsystem should come later in the ordering than its constituent components. In addition, the parameter u should be set reasonably to balance the efficiency of the algorithm and the correctness of the learned Bayesian network. Several suggestions for obtaining the most probable structure are given in [3].

Probability distribution estimation
Via the K2 algorithm, the relationships among the components, subsystems and system can be obtained.To evaluate the reliability of MSS, the parameters of the Bayesian network should be estimated using the data.
Traditional reliability methods typically assume that the failure times of the components follow a specified probability distribution. However, this assumption may not hold in practice. Researchers have considered integrating data from various levels of the system to reduce the uncertainty in system reliability assessment. These methods [8,9] can use data to estimate the parameters of multi-state components. However, they require prior knowledge about the system structure and about the cause-and-effect relationships between components. These conditions are difficult to satisfy when only the function principle graph and the operational data of the system are available. Consequently, statistical estimation theory is used here to determine the probability distributions.
There are two main types of estimation procedures in statistics: point estimation and interval estimation. For convenience, point estimation is applied in this paper, although interval estimation is also applicable in our method.
First, the nodes without parents are identified from the BN model learned from the data. Their probability distributions can be calculated easily. For node $X_i$, denote by $H_{ij}$ the number of times that state j of component $X_i$ occurs in the data and by $H_i$ the total number of instances in the data; the point estimate of the state probability is then $P\{X_i = j\} = H_{ij}/H_i$, $j = 1,\dots,n_i$. The state probability distribution of $X_i$ is presented in Table 1, where $n_i$ denotes the number of states of component $X_i$. Second, the conditional probability distributions of the nodes that have parents are estimated. For example, the structure of nodes $X_i$, $X_j$ and $X_k$ is illustrated in Fig. 1.

Fig. 1. Illustrative example
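The CPT of $X_k$ can be estimated via the following formula:
$$P\{X_k = s_k \mid X_i = s_i, X_j = s_j\} = \frac{H\{X_i = s_i, X_j = s_j, X_k = s_k\}}{H\{X_i = s_i, X_j = s_j\}}$$
where $H\{\cdot\}$ denotes the number of instances in the data in which the indicated combination of states occurs.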
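The counting estimators above can be sketched in a few lines of Python. In this hypothetical helper (`estimate_cpt` is an illustrative name, not part of any BN tool), each case is a dictionary mapping node names to observed states. With an empty parent list it reduces to the root-node estimate $H_{ij}/H_i$; with parents it returns the relative frequency of each child state within each observed parent combination. The usage example assumes, as the CPT formula suggests, that $X_i$ and $X_j$ are the parents of $X_k$, and its data are made up.

```python
from collections import Counter


def estimate_cpt(data, child, parents):
    """Point (relative-frequency) estimate of P(child | parents) from data.

    data:    list of cases, each a dict {node name: observed state}
    child:   name of the node whose (conditional) distribution is estimated
    parents: list of parent node names; [] gives the marginal of a root node
    Returns {parent states (tuple): {child state: probability}}. Parent
    combinations that never occur in the data get no entry.
    """
    # Count H{parents = pa, child = s} and H{parents = pa}
    joint = Counter((tuple(case[p] for p in parents), case[child]) for case in data)
    marginal = Counter(tuple(case[p] for p in parents) for case in data)
    cpt = {}
    for (pa, s), count in joint.items():
        cpt.setdefault(pa, {})[s] = count / marginal[pa]
    return cpt


# Usage on a tiny hypothetical data set with X_i, X_j -> X_k
cases = [
    {"Xi": 0, "Xj": 0, "Xk": 0},
    {"Xi": 0, "Xj": 0, "Xk": 1},
    {"Xi": 1, "Xj": 0, "Xk": 1},
    {"Xi": 0, "Xj": 1, "Xk": 1},
]
print(estimate_cpt(cases, "Xi", []))  # {(): {0: 0.75, 1: 0.25}}
print(estimate_cpt(cases, "Xk", ["Xi", "Xj"]))
```

Parent combinations absent from the data receive no estimate, which is one reason small data sets (such as the 200 samples discussed later) yield incomplete CPTs.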
Using the method above, a BN that represents the dependent relationships, together with all parameters that represent the logical relationships, can be obtained. Hence, the reliability of the MSS can be evaluated. Assume that the system has M unique states $s_1,\dots,s_M$. The reliability of the multi-state system can be expressed as
$$R(t, w) = P\{S(t) \ge w(t)\} = \sum_{i=1}^{M} p_i\,\mathbf{1}\left[s_i \ge w(t)\right]$$
where $S(t)$ denotes the current state of the system, $w(t)$ denotes the demand for the system at time t, and $\mathbf{1}[\cdot]$ is the indicator function. According to the model obtained above, the joint probability distribution of all nodes, $P(X_1,\dots,X_L)=\prod_{l=1}^{L}P(X_l \mid \pi_l)$, can be calculated. Thus, the marginal probability distribution $P\{S(t) = s_i\} = p_i$ is also obtained.
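To make the last two steps concrete, the following Python sketch marginalizes a chain-rule joint distribution to obtain $p_i = P\{S = s_i\}$ and then sums the probabilities of the states that satisfy $s_i \ge w$. The two-component network and all of its numbers are hypothetical (they are not the distributions of Tables 2 to 6), and the system node is assumed to be deterministic, with S = min(G1, G2) as for two elements in series.

```python
# Hypothetical toy BN (not the paper's data): root G1, child G2 | G1,
# and a deterministic system node S = min(G1, G2) for two elements in series.
P_G1 = {0: 0.1, 40: 0.9}                 # P(G1)
P_G2 = {0: {0: 0.5, 15: 0.5},            # P(G2 | G1 = 0)
        40: {0: 0.2, 15: 0.8}}           # P(G2 | G1 = 40)


def system_distribution():
    """Marginal P{S = s}: sum the chain-rule joint P(G1) * P(G2 | G1) over all
    component states that lead to the same system state s."""
    dist = {}
    for g1, p1 in P_G1.items():
        for g2, p2 in P_G2[g1].items():
            s = min(g1, g2)
            dist[s] = dist.get(s, 0.0) + p1 * p2
    return dist


def mss_reliability(dist, demand):
    """R = P{S >= w}: total probability of the system states meeting demand w."""
    return sum(p for s, p in dist.items() if s >= demand)


dist = system_distribution()  # approximately {0: 0.28, 15: 0.72}
for w in (0, 10, 20):
    print(w, round(mss_reliability(dist, w), 4))  # R(0)=1.0, R(10)=0.72, R(20)=0
```

Brute-force enumeration is only practical for small networks; for a BN the size of Fig. 3, the marginal of S would normally be obtained with the inference engine of a BN tool instead.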

Classification of Failure Conditions
In this paper, the task processing system example from [15] is used to demonstrate our method. The system logic diagram is shown in Fig. 2. The system consists of three independent computing blocks, A, B and C. Blocks A and B are arranged in parallel, and the resulting parallel structure P is arranged in series with block C. Each block is composed of two processing units with different priorities: elements 1, 3 and 5 are the high-priority units in blocks A, B and C, respectively. When the high-priority unit accesses a database, the low-priority unit must wait for the operation to be completed. Thus, the processing speed of the low-priority unit is affected by the load of the high-priority unit. The performance distributions of elements 1, 3 and 5 and the conditional performance distributions of elements 2, 4 and 6 can be found in [15].
The Bayesian network model of the task processing system is illustrated in Fig. 3. Nodes G1, G2, G3, G4, G5 and G6 represent processing units 1, 2, 3, 4, 5 and 6, respectively. Nodes A, B and C represent computing blocks A, B and C. Node P represents the subsystem that consists of blocks A and B, and node S represents the task processing system.
This BN serves as the reference for evaluating the performance of learning the model from data. In the next section, the logic sampling method [10] is used to randomly generate virtual instances in order to examine the relationship between the accuracy of the learned BN and the number of observations.

Fig. 3. Bayesian network of the task processing system
The performance of each element is discrete; hence, we treat the performance as the state of the element, namely, the performance and the state of the element have the same meaning in this paper.
The performance distributions of elements 1, 3 and 5 are presented in Table 2. The conditional performance distributions of elements 2, 4 and 6 are presented in Tables 3, 4 and 5, respectively.
The conditional performance distributions of block A are presented in Table 6.Similarly, the conditional performance distributions of nodes B, C, P and S can be obtained.
Table 7 presents the data obtained from monitoring the task processing system, which comprise 200 samples. The data set contains some combinations of component states and the corresponding performance levels of the system. Although the data set is small, the relationships among the components and the system can be partially inferred. The detailed process is illustrated in the next subsection.

Reliability estimation of the task processing system
According to the system logic diagram in Fig. 2, the ranking of the nodes follows their hierarchy in the system; this ordering affects the efficiency of the K2 algorithm. The ranks and the number of performance values of the components are listed in Table 10.
Using the K2 algorithm, the associations among the components, subsystems and system can be identified, and the score function values that correspond to the selected parents can be obtained. The results are listed in Table 11, and the raw system structure inferred from the data is shown in Fig. 4. The inferred structure lacks the association between block C and the system because the data in Table 7 represent only some of the state combinations of the system. The correlation between accuracy and data volume is analyzed in the next section. The performance distribution parameters of the components and subsystems can be estimated from the monitoring data. The approximate performance distributions of elements G1 and G2 estimated from the data are presented in Tables 12 and 13. The performance distributions of the other elements can be obtained via the same approach.
Then, the reliability of the task processing system with consideration of the demand can be evaluated; the system reliability as a function of the demand is plotted in Fig. 5, where it is compared with the reliability calculated via the UGF method.

Generating random instances
Logic sampling generates an instance by randomly selecting values from the probability tables or conditional probability tables. The nodes are visited from the roots to the leaves, so that the parents of each node are instantiated before the node itself; hence, every node in the BN is instantiated once the traversal is complete. The detailed algorithm is available in [10].
Next, let us work through one round of simulation for the task processing system. Consider the root node G1 of this network as an example. The random number generator produces a value between 0 and 1; assume that the random number is 0.56. Reading the cumulative performance distribution of unit 1 at 0.56 gives a performance of 40, as illustrated in Fig. 6. The performance values of the other root nodes can be obtained in the same way. The performance instances of all root nodes are listed in Table 7.
Next, the value of child node G2 is generated. According to Table 3, the conditional cumulative performance curve of processing unit 2, given the sampled performance of unit 1, can be calculated. The random number is 0.72; hence, the performance value of unit 2 is 15. The performance instances of units 2, 4 and 6 are listed in Table 8.
Then, the performance values of other nodes in the BN can be computed via the method that is demonstrated above.Finally, the full combinations of values in this simulation round are listed in Table 9.
The simulation is repeated 20000 times, generating 20000 cases as the operational data of the system. These cases form the training data set.
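The sampling round described above can be sketched in Python as follows. The two-node network, its performance levels and all probabilities are hypothetical stand-ins, not Tables 2 and 3, and the helper names `sample_state` and `logic_sample` are illustrative. Each value is read off the cumulative distribution as in the worked example; here we assume the first state whose cumulative probability exceeds the random number is chosen. Nodes are visited in topological order so that parents are always sampled before their children.

```python
import random
from itertools import accumulate

# Hypothetical two-node fragment (G1 -> G2); numbers are illustrative only.
NODES = ["G1", "G2"]  # topological order: roots first
PARENTS = {"G1": [], "G2": ["G1"]}
CPT = {
    "G1": {(): {0: 0.1, 25: 0.3, 40: 0.6}},
    "G2": {(0,): {0: 1.0},
           (25,): {0: 0.2, 15: 0.8},
           (40,): {0: 0.1, 15: 0.9}},
}


def sample_state(dist, r):
    """Return the first state whose cumulative probability exceeds r in [0, 1)."""
    states = list(dist)
    for state, cum in zip(states, accumulate(dist[s] for s in states)):
        if r < cum:
            return state
    return states[-1]  # guard against round-off when probabilities sum to ~1


def logic_sample(rng):
    """Generate one case by visiting the nodes from the roots to the leaves."""
    case = {}
    for node in NODES:
        parent_states = tuple(case[p] for p in PARENTS[node])
        case[node] = sample_state(CPT[node][parent_states], rng.random())
    return case


print(sample_state(CPT["G1"][()], 0.56))  # 40 (cumulative: 0.1, 0.4, 1.0)
rng = random.Random(1)                    # fixed seed for reproducibility
training = [logic_sample(rng) for _ in range(20000)]
```

The generated cases can then be passed to the structure and parameter learning steps exactly like monitored data.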

The correlation between accuracy and data volume
The number of observations used to discover the associations among nodes affects both the efficiency of the K2 algorithm and the accuracy of the constructed BN. The error rate defined in [4] is used here to analyze the relationship between the accuracy of the BN and the number of observations. The error rate can be calculated as follows:
$$E = \frac{FP_A + FN_A}{T_A}$$
where $FP_A$ is the number of associations in the constructed BN that do not exist in the actual BN, $FN_A$ is the number of associations in the actual BN that are missed, and $T_A$ is the number of associations in the constructed BN. As the amount of data increases, the structure learned from the data approximates the actual BN with increasing accuracy. The BN models that correspond to various numbers of instances are illustrated in Fig. 6.
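A minimal sketch of the error-rate computation, assuming the error rate is computed as $(FP_A + FN_A)/T_A$ and that associations are compared as directed (parent, child) pairs. The function `error_rate` and the example edge sets are hypothetical; the example merely mimics the situation discussed below, where a spurious G5 to S link appears and the C to S link is missed.

```python
def error_rate(learned, actual):
    """Error rate (FP_A + FN_A) / T_A of a learned BN structure.

    learned, actual: sets of associations given as (parent, child) pairs
    """
    if not learned:
        raise ValueError("the constructed BN has no associations (T_A = 0)")
    fp = len(learned - actual)  # FP_A: associations absent from the actual BN
    fn = len(actual - learned)  # FN_A: associations of the actual BN that are missed
    return (fp + fn) / len(learned)  # divide by T_A


# Hypothetical fragment: the learned model misses C -> S and adds G5 -> S
actual = {("G1", "A"), ("G2", "A"), ("P", "S"), ("C", "S")}
learned = {("G1", "A"), ("G2", "A"), ("P", "S"), ("G5", "S")}
print(error_rate(learned, actual))  # (1 + 1) / 4 = 0.5
```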
The last Bayesian network has the same structure as the model illustrated in Fig. 3. The BN constructed from a dataset with 1000 instances is similar to the actual BN of the task processing system. The BN built with 2500 instances has the same associations as the BN with 1000 instances; hence, the additional data do not provide more information regarding the association between block C and the system. As the amount of data increases further, node G5 becomes associated with node S. However, the performance of the system depends on block C, not on element 5. Although the additional data provide new information about the relation between the system and block C, these associations do not represent the real relationship. Moreover, the error rates of the BNs with 3000 and 12000 instances are higher than that of the BN with 2500 instances. During training, the error rate may fluctuate as the amount of data increases before finally converging. The error rate curve is plotted in Fig. 7. According to Fig. 6 and Fig. 7, the dependent relationships between computing blocks A and B and the system are easily constructed; however, more data are required to accurately establish the association between computing block C and the system. This is because the task processing system is a series system whose performance depends on the minimal performance of subsystem P and block C. Hence, the accuracy may vary with the structure of the system.

Conclusions and Discussion
Traditional reliability methods strongly depend on the ability and experience of the engineers, which leads to differences among reliability models of the same system. With the development of data acquisition and data mining techniques, the operational data of systems can be monitored and analyzed for reliability estimation and system optimization. In this paper, a new method based on the K2 algorithm is proposed for constructing a reliability model of a system and for estimating the parameters of its components from data. In the illustrative example, the Bayesian network model learned from the data is the same as the Bayesian network model that we constructed. Moreover, the reliability obtained with our method is very close to that obtained with the universal generating function method, which indicates that this approach is effective. According to the experimental results, the efficiency of the method may depend on the structure of the system. Thus, integrating the structure of the system as prior knowledge into structure learning will constitute our future work. This paper does not consider missing data, which are common in practice; handling missing data is also left for future work.
In conclusion, this method is suitable for reliability estimation of complex systems while considering multi-state and dependency behavior.Facilitated by BN tools, reliability modeling and reliability estimation can be conducted automatically without human intervention.

Fig. 2. System logic diagram of the task processing system

Fig. 4. The system structure inferred from the data

Fig. 5. Reliability comparison between our method and the UGF method

$FN_A$: the number of associations in the actual BN that are missed; $T_A$: the number of associations in the constructed BN.

Fig. 7. BN models for various numbers of instances

Table 1. Probability distribution of component

Table 3. Conditional performance distribution of element 2

Table 4. Conditional performance distribution of element 4

Table 5. Conditional performance distribution of element 6

Table 6. Conditional probability distribution of block A

Table 7. Data obtained based on the monitoring of the task processing system

Table 10. Ranks and the number of performance value of the components

Table 11. Associations and score function values

Table 12. Probability estimator of component G1

Table 13. Conditional probability distribution estimator of component G2

Table 7. Performance values of all root nodes

Table 9. Full combinations of values in this simulation round