Diagnosis strategy for complex systems based on reliability analysis and MADM under epistemic uncertainty

Fault-tolerant technology has greatly improved the reliability of the train-ground wireless communication system (TWCS). However, this high reliability leads to a lack of sufficient fault data and to epistemic uncertainty, which significantly increases the challenges of system diagnosis. This paper proposes a novel diagnosis method for TWCS that addresses these challenges by combining reliability analysis, fuzzy sets theory and multiple attribute decision making (MADM). Specifically, it adopts a dynamic fault tree to model the dynamic fault modes of the system and evaluates the failure rates of the basic events using fuzzy sets theory and expert elicitation to handle epistemic uncertainty. Furthermore, it calculates the quantitative parameters provided by reliability analysis using an algebraic technique and a Bayesian network, which overcomes some disadvantages of the traditional methods. The diagnostic importance factor, sensitivity index and heuristic information values are considered together to obtain the optimal diagnostic ranking order of TWCS using an improved TOPSIS. The proposed method takes full advantage of the dynamic fault tree for modelling, fuzzy sets theory for handling uncertainty and MADM for finding the best fault search scheme, which makes it especially suitable for fault diagnosis of complex systems.


Introduction
The train-ground wireless communication system (TWCS) is a safety-critical subsystem of urban rail transit, and its reliability has a direct effect on the stability and safety of the train operation system. Owing to fast technological innovation, the performance of TWCS has been greatly improved by the wide application of high-dependability safeguard techniques; on the other hand, the increasing complexity of its technology and structure raises significant challenges in system maintenance and diagnosis. These challenges are as follows. (1) Lack of sufficient fault samples. The integrity of the fault samples has a significant influence on diagnostic performance. However, it is extremely difficult in practice to obtain a large number of fault samples, which would require many case studies. One reason is imprecise knowledge in the early stage of a new product design. The other is that changes in environmental conditions may mean that historical fault data cannot represent future failure behaviours. (2) Failure dependency of components. TWCS adopts many redundancy units and fault tolerance techniques to improve its reliability. The behaviours of components in the system and their interactions, such as failure priority, sequentially dependent failures, functionally dependent failures and dynamic redundancy management, should therefore be taken into account. (3) Uncertainty of diagnostic test cost for components. Different components usually have different diagnostic test costs, and it is very difficult to estimate a precise diagnostic test cost owing to the lack of sufficient data, especially for new components.

Aiming at these challenges, many efficient diagnostic methods have been proposed. Assaf et al. proposed a reliability-based approach to determine the diagnosis order of components using the diagnostic importance factor (DIF), which uses the dynamic fault tree to model the failure dependency of components and can, to some extent, alleviate the fault data acquisition bottleneck [1,19]. However, the solution for the dynamic fault tree was based on a Markov chain (MC) model, which is ineffective in handling larger dynamic fault trees and limited in modelling power. For this reason, Duan et al. proposed a hybrid diagnosis method using the dynamic fault tree and a discrete-time Bayesian network (DTBN) [17]. Dynamic logic gates were converted to a DTBN and the reliability results were calculated by a standard Bayesian network (BN) inference algorithm. However, this is an approximate solution for the dynamic fault tree and requires huge memory resources to obtain the probabilities of the query variables accurately. Furthermore, these diagnostic methods usually assume that the failure rates of the components are crisp values describing their reliability characteristics, and they have been found inadequate to deal with challenge (1) above. Therefore, fuzzy sets theory has been introduced as a useful tool to handle challenges (1) and (3). The fuzzy fault tree analysis model employs fuzzy sets and possibility theory, and deals with ambiguous, qualitatively incomplete and inaccurate information [8,12,13]. However, these approaches use the static fault tree to model the system fault behaviours and cannot cope with challenge (2). So fuzzy dynamic fault tree (FDFT) analysis has been introduced [7,22], which takes into account not only the combination of failure events but also the order in which they occur. Nonetheless, the solution for FDFT is still an MC-based approach, which suffers from the well-known state space explosion problem. To overcome these difficulties and limitations, Duan et al. proposed a new diagnosis method using fuzzy sets and the dynamic fault tree, which uses fuzzy sets to evaluate the failure rates of the basic events and a dynamic fault tree model to capture the dynamic failure mechanisms [18]. But the solution for the dynamic fault tree is still based on DTBN and cannot avoid the aforementioned problems. Assaf et al. first introduced the cost diagnostic importance factor (CDIF) to incorporate the diagnostic test cost into the diagnosis process in order to optimize fault diagnosis [2]. They assumed the test cost of the components to be a crisp value, which is impractical in application, so this approach cannot deal with challenge (3). In addition, all these diagnosis algorithms are based on minimal cut sets and DIF or CDIF, which is in essence single attribute decision making, and they usually cause minimal cut sets with a smaller DIF to be diagnosed first, thereby influencing the diagnosis result.
Motivated by the problems mentioned above, this paper presents a novel diagnosis strategy for TWCS based on fuzzy sets, dynamic fault tree and MADM, shown in Figure 1. It pays particular attention to meeting the three challenges above. We adopt expert elicitation and fuzzy sets theory to deal with insufficient fault data, and handle the uncertainty problem by treating the diagnostic test cost as fuzzy numbers. Furthermore, we use a dynamic fault tree model to capture the dynamic behaviours of the TWCS failure mechanisms and calculate the quantitative parameters provided by reliability analysis using a BN and an algebraic technique in order to avoid the aforementioned problems. In addition, the components' DIF, sensitivity index (SI) and heuristic information values (HIV) are considered together to design a novel diagnosis strategy that can locate the fault with the objective of fast and low-cost diagnosis.
The aim of this work is to provide a scientific basis for decisions in the fault diagnosis of TWCS and to offer a new idea for fault diagnosis in complex systems. The rest of this paper is organized as follows. Section 2 provides a brief introduction to TWCS and its dynamic fault tree model. The estimation of failure rates for the basic events is described in Section 3. Section 4 presents a novel dynamic fault tree solution that uses a BN and an algebraic technique. Section 5 presents a new diagnosis algorithm that combines the components' DIF, SI and HIV using MADM. The outcomes of the research and recommendations for future research are presented in the final section.

Dynamic fault tree of TWCS
Credible wireless communication technology is one of the development directions of communication-based train control because it can meet the demand for real-time transmission of large amounts of information between train and ground. TWCS based on orthogonal frequency division multiplexing adopts redundancy techniques to ensure higher reliability and is widely applied in train control systems, where it transmits real-time data between train and ground. TWCS mainly consists of the train-ground communication access devices and the train-ground communication transmission system. The train-ground communication access devices are responsible for information acquisition, composition, decomposition, encoding and decoding, and for the security mechanism of information transmission. This guarantees safe, reliable and real-time information transmission. Specifically, the train-ground communication access devices include the decentralized radio control unit (DRCU) and the mobile radio control unit (MRCU). The DRCU, situated in the decentralized control center, offers the interfaces between the decentralized control system and the traction power supply system and controls the information transmission of the decentralized train-ground communication devices. It also performs the most challenging tasks, such as information acquisition, composition, decomposition, encoding and decoding among the decentralized control system, the vehicle control system, the localization system and the traction power supply system. The MRCU, located at opposite ends of the train, not only offers the interfaces between the vehicle control system and the localization system, but also implements information processing among the vehicle control system, the localization system, the decentralized control system and the traction power supply system. The train-ground communication transmission system includes ground radio transceiver equipment, mobile radio transceiver equipment and the wireless communication channel. It is responsible for reliable, transparent data transmission between train and ground devices.
TWCS is a typical complex system and adopts redundancy techniques to ensure higher reliability. For example, hardware redundancy is employed in the design of the DRCU and MRCU. These modules are highly coupled and have complicated logical relationships. The dynamic behaviours of components in these modules and their interactions, such as failure priority, sequentially dependent failures, functionally dependent failures and dynamic redundancy management, should therefore be taken into consideration. Obviously, the traditional static fault tree is unsuitable for modelling these dynamic fault behaviours. Therefore, we use the dynamic fault tree model to capture the dynamic behaviours of system failure mechanisms such as

Estimation of failure rates for TWCS
In order to calculate the reliability parameters for diagnosis, the failure rates of the basic events must be known. However, fault-tolerant technology has greatly improved the system reliability, and this high reliability leads to a lack of sufficient fault data and to epistemic uncertainty. For this reason, it is very difficult to estimate the failure rates of the basic events precisely, especially for new equipment.
In this study, expert elicitation, conducted through several interviews and questionnaires, and fuzzy sets theory are used to estimate the failure rates of the basic events through qualitative data processing. The overall architecture of the estimation of failure rates for TWCS is shown in Figure 3.

Expert evaluation
Experts are people who are familiar with the system and understand its working environment and operation. Experts can therefore be selected from different fields, such as the design, installation, maintenance, operation and management of the system, to judge the failure rates of the basic events. They are more comfortable judging the likelihood of event failure in qualitative natural language, based on their experience and knowledge of the system, which captures uncertainty, than expressing judgments quantitatively. The set of linguistic values commonly used in engineering system safety has a granularity of four to seven terms. In this paper, the component failure rate is described by seven linguistic values: very high, high, reasonably high, moderate, reasonably low, low and very low.

Fuzzification module
Expert evaluations expressed in qualitative natural language should be converted into the operational format of fuzzy numbers, for example trapezoidal fuzzy numbers. This is done by the fuzzification module, whose objective is to quantify the qualitative data of the basic events into corresponding quantitative data in the form of membership functions of fuzzy numbers. Each predefined linguistic value has a corresponding mathematical representation, and the shapes of the membership functions that represent the linguistic variables in engineering systems are illustrated in Figure 4. To eliminate the bias of any single expert, six experts are asked to judge how likely each basic event is to fail in the system under investigation. It is therefore necessary to aggregate these opinions into a single one. There are many approaches to aggregating fuzzy numbers. An appealing approach is the linear opinion pool [6]:

$$M_i = \sum_{j=1}^{n} \omega_j A_{ij}, \quad i = 1, 2, \ldots, m \tag{1}$$

where $m$ is the number of basic events; $A_{ij}$ is the linguistic expression of basic event $i$ given by expert $j$; $n$ is the number of experts; $\omega_j$ is the weighting factor of expert $j$; and $M_i$ is the combined fuzzy number of basic event $i$.
Usually, an α-cut addition followed by an arithmetic averaging operation is used to aggregate the membership functions of fuzzy numbers of different types. In this computation, $f_n(x)$ is the membership function of the fuzzy number from expert $n$ and $f(z)$ is the membership function of the total fuzzy number obtained from the $n$ experts' opinions.
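As an illustration of the aggregation step, the following minimal sketch applies a linear opinion pool to trapezoidal fuzzy numbers by taking the weighted sum of each of the four parameters. The trapezoids in `LINGUISTIC`, the equal expert weights and the function name `linear_opinion_pool` are assumptions made for this example only; the membership functions actually used in the paper are those of Figure 4.

```python
from typing import Sequence, Tuple

Trapezoid = Tuple[float, float, float, float]  # (a, b, c, d) with a <= b <= c <= d

# Hypothetical trapezoids for three linguistic values (not taken from Figure 4).
LINGUISTIC = {
    "low": (0.1, 0.2, 0.3, 0.4),
    "moderate": (0.3, 0.45, 0.55, 0.7),
    "high": (0.6, 0.7, 0.8, 0.9),
}


def linear_opinion_pool(opinions: Sequence[Trapezoid],
                        weights: Sequence[float]) -> Trapezoid:
    """Weighted sum of trapezoidal fuzzy numbers, computed parameter by parameter."""
    if len(opinions) != len(weights):
        raise ValueError("one weight per expert is required")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    return tuple(
        sum(w * op[k] for w, op in zip(weights, opinions)) for k in range(4)
    )


# Six equally weighted experts judging one basic event.
judgements = ["low", "low", "moderate", "low", "moderate", "high"]
opinions = [LINGUISTIC[j] for j in judgements]
combined = linear_opinion_pool(opinions, [1 / 6] * 6)
print(combined)
```

Because the weights are non-negative, the result is again a trapezoid, so it can be passed directly to a defuzzification step.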

Calculating fuzzy fault rates of the basic events
The final quantitative data produced by the fuzzification module are still fuzzy numbers and cannot be used directly for fault tree analysis because they are not crisp values. Each fuzzy number must therefore be converted to a crisp score, called the fuzzy possibility score (FPS), which represents the expert's belief in the most likely possibility that a basic event occurs. This step is usually called defuzzification. There are several defuzzification techniques, and it is important to choose one suitable for the specific application. We use an area defuzzification technique, which has the lowest relative errors and the closest match with real data [16], and apply it to each trapezoidal fuzzy number (a, b, c, d; 1). The FPS of an event is then converted into the corresponding fuzzy failure rate (FFR), which is analogous to the failure rate. Based on the logarithmic function proposed by Onisawa [14], which utilizes the concept of error possibility and likely fault rate, the fuzzy failure rate is obtained by equation (4):

$$FFR = \begin{cases} 1/10^{K}, & FPS \neq 0 \\ 0, & FPS = 0 \end{cases}, \qquad K = \left[\frac{1 - FPS}{FPS}\right]^{1/3} \times 2.301 \tag{4}$$

Table 1 shows the fuzzy failure rates of the basic events for TWCS.
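The conversion from a possibility score to a failure rate can be sketched in a few lines. The function below assumes the form of the logarithmic conversion commonly attributed to Onisawa [14], with the constant 2.301; the function name `fuzzy_failure_rate` is hypothetical, and the defuzzification step that produces the FPS is not reproduced here.

```python
def fuzzy_failure_rate(fps: float) -> float:
    """Convert a fuzzy possibility score in [0, 1] to a fuzzy failure rate.

    Assumed form: FFR = 1 / 10**K with K = 2.301 * ((1 - FPS) / FPS) ** (1/3),
    and FFR = 0 when FPS = 0.
    """
    if not 0.0 <= fps <= 1.0:
        raise ValueError("FPS must lie in [0, 1]")
    if fps == 0.0:
        return 0.0
    k = 2.301 * ((1.0 - fps) / fps) ** (1.0 / 3.0)
    return 10.0 ** (-k)


for score in (0.2, 0.5, 0.8):
    print(score, fuzzy_failure_rate(score))
```

The mapping is monotonic: a higher possibility score always gives a higher failure rate, and a score of 0.5 maps to roughly 5e-3.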

Calculating reliability parameters using BN and algebraic technique
After the dynamic fault tree is constructed and all basic events have been assigned failure rates under the exponential distribution, the reliability results of TWCS can be calculated by solving the dynamic fault tree. The traditional solution for the dynamic fault tree is based on the MC model [11], which suffers from the well-known state space explosion problem and cannot solve larger dynamic fault trees. Therefore, DTBN was proposed to solve the dynamic fault tree in [3,4]. Dynamic logic gates are converted to a DTBN and the reliability results are calculated using a standard BN inference algorithm. However, this is an approximate solution and requires huge memory resources to obtain the probability distribution accurately. In addition, as the number of time intervals increases, both the accuracy and the execution time increase greatly. An innovative algorithm has been introduced to reduce the dimension of the conditional probability tables by an order of magnitude [9]. However, this method cannot perform posterior probability updating. In the following section, we present an improved method that calculates the reliability parameters using a BN and an algebraic technique to overcome the disadvantages mentioned above.

Mapping static fault tree into BN
There is a clear correspondence between the static fault tree and the BN. The fault tree can be seen as a particular deterministic case of the BN. Conceptually, it is straightforward to map a fault tree into a BN: one only needs to "re-draw" the nodes and connect them while correctly enumerating reliabilities. Figure 5 shows the conversion of an OR gate and an AND gate into equivalent nodes in a BN. Parent nodes A and B are assigned prior probabilities, which coincide with the probability values assigned to the corresponding basic nodes in the fault tree, and child node C is assigned its conditional probability table (CPT).
Since the OR and AND gates represent deterministic causal relationships, all the entries of the corresponding CPT are either 0 or 1. The detailed algorithm for converting a fault tree into a BN was proposed in [3,15].
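The deterministic CPTs of the two static gates can be written out explicitly. The sketch below builds them for binary parents A and B and marginalizes over the parents to obtain the failure probability of the child C; the names `CPT_OR`, `CPT_AND` and `gate_failure_probability` and the prior values are illustrative assumptions, and the parents are assumed independent.

```python
from itertools import product

# Deterministic CPTs: P(C = 1 | A = a, B = b) for each gate type.
CPT_OR = {(a, b): float(a or b) for a, b in product((0, 1), repeat=2)}
CPT_AND = {(a, b): float(a and b) for a, b in product((0, 1), repeat=2)}


def gate_failure_probability(p_a: float, p_b: float, cpt: dict) -> float:
    """Marginal P(C = 1) for independent parents with failure priors p_a and p_b."""
    total = 0.0
    for a, b in product((0, 1), repeat=2):
        prior = (p_a if a else 1.0 - p_a) * (p_b if b else 1.0 - p_b)
        total += prior * cpt[(a, b)]
    return total


print(gate_failure_probability(0.1, 0.2, CPT_OR))
print(gate_failure_probability(0.1, 0.2, CPT_AND))
```

The results agree with the usual fault tree formulas, 1 - (1 - p_a)(1 - p_b) for the OR gate and p_a p_b for the AND gate.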

Fault Probability of a Module with Sequence Dependence
Let us consider an event sequence composed of $n$ events, $e_1, e_2, \ldots, e_n$, including several spare events. An event in the sequence is denoted by $e_j^i$, which means that the event that failed in the $j$-th order of the sequence is designated a spare of the event that failed in the $i$-th order; $e_j^0$ denotes an event that was originally in active mode. The probability of the sequence $\langle e_1, e_2, \ldots, e_n \rangle$ can be calculated by the $n$-tuple integration given in [20], where $S_0$ is the set of events that were originally in active mode and $S_{sa}$ ($S_{ss}$) is the set of spare events that fail in active (standby) mode.
When the failure time of $e_j^i$ in active mode follows an exponential distribution with rate $\lambda_j$, the sequence probability $\Pr\{\langle e_1, e_2, \ldots, e_n \rangle (t)\}$ can be expressed as an inverse Laplace transform, where $L^{-1}$ is the inverse Laplace transform operator and the coefficients $a_i$ are determined by the failure rates $\lambda$ and the factors $\alpha$ of the spare events. If every $a_i$ is distinct from the others, the sequence probability has a closed form as a sum of exponential terms in $a_i t$, where $a_0 = 0$.
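The simplest instance of such a sequence probability involves two events that are both originally in active mode, with no spares. Under the assumption of independent exponential failure times with rates `lam1` and `lam2`, the probability that the first event fails before the second and both fail by time t has a closed form, which the sketch below compares with a direct numerical evaluation of the double integral. This is only the all-active special case, not the general formula of [20]; the function names are hypothetical.

```python
import math


def sequence_prob_closed_form(lam1: float, lam2: float, t: float) -> float:
    """P(T1 < T2 <= t) for independent exponential failure times T1 and T2."""
    total = lam1 + lam2
    return (1.0 - math.exp(-lam2 * t)) - lam2 / total * (1.0 - math.exp(-total * t))


def sequence_prob_numeric(lam1: float, lam2: float, t: float, steps: int = 400) -> float:
    """Midpoint-rule evaluation of the double integral over 0 < tau1 < tau2 <= t."""
    h1 = t / steps
    acc = 0.0
    for i in range(steps):
        tau1 = (i + 0.5) * h1
        h2 = (t - tau1) / steps
        # Inner integral of the density of T2 over (tau1, t].
        inner = sum(
            lam2 * math.exp(-lam2 * (tau1 + (j + 0.5) * h2)) for j in range(steps)
        ) * h2
        acc += lam1 * math.exp(-lam1 * tau1) * inner * h1
    return acc


print(sequence_prob_closed_form(1e-3, 2e-3, 1000.0))
print(sequence_prob_numeric(1e-3, 2e-3, 1000.0))
```

For large t the closed form tends to lam1 / (lam1 + lam2), the probability that the first event is the earlier of the two.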

Mapping dynamic fault tree into BN
The dynamic fault tree extends the traditional fault tree by defining special gates to capture the components' sequential and functional dependencies. Currently there are six types of dynamic gates defined: the functional dependency gate (FDEP), the cold, hot and warm spare gates (CSP, HSP, WSP), the priority AND gate (PAND) and the sequence enforcing gate (SEQ). Here, we briefly discuss the FDEP and WSP gates as they will be used later in our examples.

(1) WSP Gate
The WSP gate has one primary input and one or more alternate inputs. The primary input is initially powered on and the alternate inputs are in standby mode. When the primary fails, it is replaced by an alternate input and, in turn, when this alternate input fails, it is replaced by the next available alternate input, and so on. In standby mode, the component failure rate is reduced by a factor $\alpha$, called the dormancy factor, which is a number between 0 and 1. A cold spare has a dormancy factor $\alpha = 0$ and a hot spare has a dormancy factor $\alpha = 1$. The WSP gate output is true when the primary and all the alternate inputs have failed. Figure 6 shows the WSP gate and its equivalent BN. Table 2 shows the CPT of the node A, supposing that A and S follow the same exponential distribution with rate $\lambda$. The entries $p_1(t)$ and $p_2(t)$ in this table are conditional probabilities relating the failures of A and S up to time $t$. The output of node WSP is an AND gate whose CPT is shown in Figure 5.

(2) FDEP Gate
FDEP is used to model situations where one component's correct operation depends on the correct operation of some other component. It has a single trigger input, which can be another basic event or the output of another gate, a non-dependent output reflecting the status of the trigger, and one or more dependent basic events. Figure 7 shows the FDEP gate and its equivalent BN. Table 3 shows the CPT of the node A, in which the entry $p_3(t)$ depends on time $t$. The CPT of the output node FDEP is shown in Table 4.
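The behaviour of the two gates can be illustrated with closed-form probabilities. The sketch below is not the CPT derivation of Tables 2 to 4; it assumes one primary and one spare with the same active failure rate `lam`, a standby rate `alpha * lam`, and, for FDEP, independent exponential failure times for the trigger and the dependent event. The function names are hypothetical.

```python
import math


def warm_spare_unreliability(lam: float, alpha: float, t: float) -> float:
    """P(primary and spare have both failed by time t) for a two-unit WSP gate."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("dormancy factor must lie in [0, 1]")
    if alpha == 0.0:
        # Cold spare: limit of the warm-spare expression as alpha -> 0.
        reliability = math.exp(-lam * t) * (1.0 + lam * t)
    else:
        reliability = math.exp(-lam * t) * (
            1.0 + (1.0 - math.exp(-alpha * lam * t)) / alpha
        )
    return 1.0 - reliability


def fdep_dependent_unreliability(lam_dep: float, lam_trigger: float, t: float) -> float:
    """P(dependent event occurred by t): it fails itself or is forced by the trigger."""
    return 1.0 - math.exp(-(lam_dep + lam_trigger) * t)


for alpha in (0.0, 0.5, 1.0):
    print(alpha, warm_spare_unreliability(1e-3, alpha, 1000.0))
```

With `alpha = 1` the expression reduces to two units in parallel, and with `alpha = 0` to a cold standby pair, so the warm spare lies between the two.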

Calculating reliability parameters
According to the dynamic fault tree shown in Figure 2 and the basic failure data shown in Table 1, we can map the dynamic fault tree into an equivalent BN using the proposed method. The equivalent BN is given in Figure 8. Once the structure of a BN is known and all the probability tables are filled, it is straightforward to calculate the reliability parameters of TWCS using the inference algorithm. These reliability parameters mainly include the system reliability, DIF and SI.

(1) System reliability
Assume the mission time of TWCS is 1000 hours. We can calculate the system unreliability at this mission time as the probability that the top event has occurred.

(2) DIF
DIF is defined conceptually as the probability that an event has occurred given that the top event has also occurred. DIF is the cornerstone of the reliability-based diagnosis methodology. This quantitative measure allows us to discriminate between components by their importance from a diagnostic point of view. Components with larger DIF are checked first, which reduces the number of system checks needed while fixing the system:

$$DIF_i = P(i \mid S)$$

where $i$ is a component in system $S$.
Suppose the system has failed at the mission time of 1000 hours. We enter the evidence that TWCS has failed, i.e. $P(S_{\mathrm{state}} = 1) = 1$, and calculate DIF using the jointree algorithm.

(3) SI
Sensitivity analysis allows the designer to quantify the importance of each of the system's components and the impact that an improvement in component reliability will have on the overall system reliability.
Here we show how sensitivity analysis can be performed through the use of SI [10]. The SI of the $i$-th basic event is defined in terms of $P(S)$, the probability of the top event failure, and $P(S \mid \bar{i})$, the probability that the top event has occurred given that basic event $i$ has not occurred.
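To make the definition of DIF concrete, the following sketch computes it by brute-force enumeration for a small hypothetical system in which the top event occurs if A fails or if both B and C fail. It stands in for the jointree inference on the real BN of Figure 8, which is not reproduced here; the structure function `top_event`, the failure probabilities and the assumption of independent components are illustrative only.

```python
from itertools import product


def top_event(a: int, b: int, c: int) -> int:
    """Hypothetical structure function: failure if A fails, or if B and C both fail."""
    return int(a or (b and c))


def diagnostic_importance(p_fail: dict) -> dict:
    """DIF of each component, P(component failed | top event), by enumeration."""
    names = list(p_fail)
    p_top = 0.0
    joint = {name: 0.0 for name in names}  # P(component failed AND top event)
    for states in product((0, 1), repeat=len(names)):
        prob = 1.0
        for name, s in zip(names, states):
            prob *= p_fail[name] if s else 1.0 - p_fail[name]
        if top_event(*states):
            p_top += prob
            for name, s in zip(names, states):
                if s:
                    joint[name] += prob
    return {name: joint[name] / p_top for name in names}


dif = diagnostic_importance({"A": 0.01, "B": 0.1, "C": 0.2})
print(sorted(dif.items(), key=lambda item: item[1], reverse=True))
```

In this toy case C would be checked first, then B, then A, even though A alone is sufficient to cause the top event, because A is much less likely to have failed.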

Diagnosis strategy based on MADM
MADM models try to answer the question 'what is the best alternative?' given a set of selection attributes and a set of alternatives. Generally, there are three independent steps in MADM models to obtain the ranking of alternatives [23]: (1) determine the relevant attributes and alternatives; (2) attach numerical measures to the relative importance of the attributes and to the impacts of the alternatives on these attributes; (3) process the numerical values to determine a ranking score for each alternative. The Technique for Order Preference by Similarity to Ideal Solution (TOPSIS), developed by Hwang and Yoon [5], is one of the classical methods for solving MADM problems. It is based on the concept that the chosen alternative should have the shortest distance from the positive ideal solution (PIS) and the farthest distance from the negative ideal solution (NIS). In TOPSIS, the performance ratings and the weights of the attributes are usually given as crisp values. Under many conditions, however, crisp data are not sufficient to model real-life situations, since human judgments are often vague and cannot be expressed as exact numerical values. A more realistic approach may be to use linguistic assessments instead of numerical values, that is, to suppose that the ratings of the attributes are assessed by means of linguistic variables. In this paper, we treat the optimal diagnostic sequence problem as a MADM problem and propose an improved TOPSIS to solve it.

Constructing diagnostic decision table for TWCS
DIF enables us to discriminate between components by their importance from a diagnostic point of view. SI allows the designer to quantify the importance of each of the system's components and the impact that an improvement in component reliability will have on the overall system reliability. We therefore treat DIF and SI as attributes v1 and v2, respectively. Because components differ in complexity, their test costs differ, and a balance should be struck between DIF and test cost. Therefore, we introduce a new importance measure called HIV, which allows us to optimize the cost of diagnosis. This measure is simply the DIF per unit cost, as given in equation (15):

$$HIV_i = \frac{DIF_i}{T_i} \tag{15}$$

where $DIF_i$ is the DIF of component $i$ and $T_i$ is the test cost of component $i$.
Test costs of the components are usually very difficult to express as crisp values because of uncertainty, so we introduce fuzzy linguistic expressions to assess them. Tables 5 and 6 show the evaluation standards of the test costs and the components' test costs for TWCS, respectively. HIV has an important effect on the diagnostic sequence and is treated as attribute v3. Table 7 shows the diagnostic decision table for TWCS.
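When the test cost is a fuzzy number, the HIV becomes fuzzy as well. The sketch below assumes that each linguistic cost is a strictly positive triangular fuzzy number (lower, modal, upper) and approximates the quotient of a crisp DIF and that cost by a triangular number with reversed bounds. The cost scale `COST_SCALE` and the function name are hypothetical; the scale actually used in the paper is the one in Table 5.

```python
from typing import Tuple

Triangular = Tuple[float, float, float]  # (lower, modal, upper), all > 0

# Hypothetical linguistic cost scale (not the values of Table 5).
COST_SCALE = {
    "low": (0.1, 0.3, 0.5),
    "medium": (0.3, 0.5, 0.7),
    "high": (0.5, 0.7, 0.9),
}


def heuristic_information_value(dif: float, cost: Triangular) -> Triangular:
    """DIF per unit fuzzy cost; dividing by a positive fuzzy number reverses its bounds."""
    lower, modal, upper = cost
    if lower <= 0:
        raise ValueError("test cost must be strictly positive")
    return (dif / upper, dif / modal, dif / lower)


hiv = heuristic_information_value(0.35, COST_SCALE["medium"])
print(hiv)
```

A component with a high DIF and a low test cost therefore receives a large HIV across its whole range, which is what moves it forward in the diagnostic sequence.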

Normalizing diagnostic decision table
Different attributes usually have different values and dimensions, which are not always directly comparable, so the diagnostic decision table should be normalized [21]. The quantitative data are normalized attribute by attribute, where $a_{ij}$ is the $j$-th attribute value of the $i$-th component.
For the fuzzy numbers, a separate normalization is applied. Using equations (16) to (18), we obtain the normalized diagnostic decision table for TWCS shown in Table 8. Assuming that all attributes are equally important, we can construct the weighted normalized diagnostic decision table shown in Table 9.

Determining the optimal diagnosis sequence
Attributes can be divided into two groups: beneficial attributes, for which higher values are preferable, and non-beneficial attributes, for which lower values are preferable. The three attributes in the diagnostic decision table are all beneficial attributes. When the attribute $a_{ij}$ is a beneficial attribute, the positive and negative ideal solutions are calculated as:

$$X_j^+ = r_j^+ = \max_i r_{ij}, \qquad X_j^- = r_j^- = \min_i r_{ij}$$

where $r_{ij}$ is the value of the $j$-th attribute for the $i$-th component in the decision table, $r_j^+$ is the maximal value of the $j$-th attribute and $r_j^-$ is the minimal value of the $j$-th attribute. When the attribute $a_{ij}$ is a non-beneficial attribute, the positive and negative ideal solutions are calculated as:

$$X_j^+ = \min_i r_{ij}, \qquad X_j^- = \max_i r_{ij}$$

The distances $D_i^+$ and $D_i^-$ of each alternative from $X_j^+$ and $X_j^-$ can then be calculated. Once $D_i^+$ and $D_i^-$ of each alternative have been calculated, a closeness coefficient is defined to determine the ranking order of all alternatives. The closeness coefficient of each alternative is calculated as:

$$C_i = \frac{D_i^-}{D_i^+ + D_i^-}$$

Table 10 shows the distance of each alternative from the positive and negative ideal solutions together with the corresponding closeness coefficient. Obviously, an alternative comes closer to the PIS and farther from the NIS as $C_i$ approaches 1. Therefore, we can determine the ranking order of all alternatives and choose the best one from among a set of feasible alternatives. According to Table 10, we obtain the optimal diagnostic ranking order of TWCS: X3, X6(X7), X8(X9), X12(X13), X14(X15), X4(X5), X10(X11), X2, X1, which considers the DIF, SI and HIV together.
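The ranking procedure can be sketched with the classical, crisp form of TOPSIS. This is a simplification: it uses vector normalization and Euclidean distances, which are assumptions of this example, and it does not reproduce the fuzzy normalization or the improved TOPSIS of the paper. The decision table values are invented for illustration and are not taken from Tables 7 to 10.

```python
import math
from typing import Dict, List, Tuple


def topsis_rank(table: Dict[str, List[float]],
                weights: List[float]) -> List[Tuple[str, float]]:
    """Rank alternatives by closeness coefficient; all attributes are beneficial."""
    names = list(table)
    n_attr = len(weights)
    # Vector normalization of each attribute column, followed by weighting.
    norms = [math.sqrt(sum(table[n][j] ** 2 for n in names)) for j in range(n_attr)]
    v = {n: [weights[j] * table[n][j] / norms[j] for j in range(n_attr)] for n in names}
    pis = [max(v[n][j] for n in names) for j in range(n_attr)]  # positive ideal
    nis = [min(v[n][j] for n in names) for j in range(n_attr)]  # negative ideal
    scores = []
    for n in names:
        d_plus = math.dist(v[n], pis)
        d_minus = math.dist(v[n], nis)
        scores.append((n, d_minus / (d_plus + d_minus)))
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical rows: [DIF, SI, HIV] for three components.
decision_table = {
    "X1": [0.2, 0.1, 0.3],
    "X2": [0.6, 0.5, 0.4],
    "X3": [0.9, 0.8, 0.9],
}
ranking = topsis_rank(decision_table, [1 / 3, 1 / 3, 1 / 3])
print(ranking)
```

In this toy table X3 is best on every attribute, so it coincides with the PIS and is ranked first, while X1 coincides with the NIS and is ranked last.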

Conclusion
In this paper, we have discussed the use of the dynamic fault tree, fuzzy sets theory and MADM to diagnose faults in complex systems. Specifically, the paper has emphasized three important issues that arise in engineering diagnostic applications, namely the challenges of insufficient fault data, uncertainty and failure dependency of components. To address insufficient fault data and uncertainty, we adopt expert elicitation and fuzzy sets theory to evaluate the failure rates of the basic events for TWCS. To address failure dependency, we use a dynamic fault tree to model the dynamic behaviours of system failure mechanisms. Furthermore, we calculate the reliability parameters used for fault diagnosis using a BN and an algebraic technique in order to avoid the aforementioned disadvantages. In addition, we treat the optimal diagnostic sequence problem as a MADM problem, propose an improved TOPSIS to solve it and obtain the optimal diagnostic ranking order of TWCS. The proposed method makes full use of the dynamic fault tree for modelling, fuzzy sets theory for handling uncertainty and MADM for finding the best fault search scheme, which makes it especially suitable for fault diagnosis of complex systems.
In future work, we will focus on how to determine the attribute weights and how to take full advantage of previous fault diagnosis results to dynamically update the diagnostic decision table, thereby improving diagnosis efficiency.

Acknowledgement
This work was supported by the National Natural Science Foundation of China (71461021), the Natural Science Foundation of Jiangxi Province (20142BAB207022), the Science and Technology Foundation of Department of Education in Jiangxi Province (GJJ14166) and the Postdoctoral Science Foundation of Jiangxi Province (2014KY36).

Fig. 3. Structure of the estimation of failure rates for TWCS


Fig. 5. The equivalent BN of OR and AND gates

Table 2. The CPT of the node A

Table 3. The CPT of the node A

Table 6. Components' test costs for TWCS

Table 10. The corresponding closeness coefficient of components