Diagnosis based on reliability analysis using monitors and sensors

https://doi.org/10.1016/j.ress.2006.10.024

Abstract

We develop a process for using monitors or sensors to optimize diagnostic decision trees (DDTs) generated for large systems. We present algorithms for optimizing the diagnosis process, which combine evidence captured from monitors or sensors with the diagnostic tree generation process to produce DDTs. Since evidence can be extracted from monitors and sensors, we developed a method for sensor modeling. Our method allows monitors or sensors to be modeled as an abstract layer on top of a system's fault tree model. This allows the designer to graphically link monitors or sensors to the components that they monitor, without impacting the reliability analysis. We use a real industrial system to demonstrate the practicality and effectiveness of our algorithms and methods.

Introduction

Diagnosis is the process by which the root cause of the failure of a complex system is determined. Generally, a set of possible “suspects” is postulated, analyzed and reduced, as a result of the application of test procedures, observations or other evidence. The order in which the tests are applied, the amount of information gained as a result, and the cost of the test all affect the efficiency of the diagnosis process. Computer-based diagnosis automates the process of test ordering by generating a testing strategy to address all potential test results and evidence.

The diagnosis problem can thus be stated as follows. System S, which is constructed from a set of components C=(C1, C2, …, Cn), has failed. A set of tests T=(T1, T2, …, Tm) can be used to determine which of the components in C have caused the failure of S. We use the term “test” in a generic sense, where a test may be an observation (“is the fan blowing?”), a symptom (“response time is unusually long”), an indicator (“is the warning light illuminated?”) or an actual test on a subsystem. Each test has a binary outcome (yes/no, test passed/test failed). Each test outcome partitions the set of components into two subsets: those components that remain suspected as possible causes, and those components that are exonerated by the test outcome. The set of suspect components contains those which have failed, and some which may or may not have failed. Subsequent applications of different tests can similarly reduce the set of suspect components until all remaining suspects are known to have failed, and the cause of the system failure has been diagnosed.
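The partitioning step described above can be sketched in a few lines of Python. This is only an illustration under a simplifying single-fault assumption: each test is modeled by the hypothetical set of components it exercises (`covered`), a failed test keeps only the covered suspects, and a passed test exonerates them. The component and test names are invented for the example.

```python
from typing import FrozenSet, Set


def apply_test(suspects: Set[str], covered: FrozenSet[str], failed: bool) -> Set[str]:
    """Return the suspects that remain after one binary test outcome.

    Assumption (not from the paper): a single fault, and a test fails
    exactly when the faulty component is among those it covers.
    """
    if failed:
        return suspects & covered   # fault must be among the covered components
    return suspects - covered       # covered components are exonerated


# Hypothetical system with four components and two tests.
suspects = {"C1", "C2", "C3", "C4"}
t1_covers = frozenset({"C1", "C2"})
t2_covers = frozenset({"C1", "C3"})

after_t1 = apply_test(suspects, t1_covers, failed=True)    # T1 failed
after_t2 = apply_test(after_t1, t2_covers, failed=False)   # T2 passed
```

Each call shrinks the suspect set, which is the reduction the problem statement relies on; systems with redundancy, where several components must fail together, need the cutset-based treatment discussed later.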

The test selection problem answers the related question “given the set of suspected components, which test should be applied next?” The solution of the diagnosis problem produces a decision tree that provides a map for the application of tests in a predefined sequence, based on the results of previously applied tests. Each node in the decision tree is associated with a set of suspects and a test. The two possible outcomes of the test correspond to edges pointing to child nodes with the associated reduced sets of suspects (Fig. 1).
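One possible in-memory representation of such a decision tree is sketched below. The class name `DDTNode`, its fields, and the `diagnose` helper are illustrative choices rather than the data structures used by ADORA; the tree itself is a hand-built toy example.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Optional


@dataclass
class DDTNode:
    """A decision-tree node: a suspect set plus the test to apply next."""
    suspects: FrozenSet[str]
    test: Optional[str] = None            # None marks a leaf (diagnosis reached)
    on_fail: Optional["DDTNode"] = None   # child followed when the test fails
    on_pass: Optional["DDTNode"] = None   # child followed when the test passes


def diagnose(root: DDTNode, test_failed: Callable[[str], bool]) -> FrozenSet[str]:
    """Walk the tree, applying each node's test, and return the final suspects."""
    node = root
    while node.test is not None:
        node = node.on_fail if test_failed(node.test) else node.on_pass
    return node.suspects


# Hypothetical three-component tree: T1 separates {C1, C2} from {C3},
# then T2 separates C1 from C2.
root = DDTNode(
    suspects=frozenset({"C1", "C2", "C3"}),
    test="T1",
    on_fail=DDTNode(
        suspects=frozenset({"C1", "C2"}),
        test="T2",
        on_fail=DDTNode(frozenset({"C1"})),
        on_pass=DDTNode(frozenset({"C2"})),
    ),
    on_pass=DDTNode(frozenset({"C3"})),
)

outcomes = {"T1": True, "T2": False}   # True means the test failed
result = diagnose(root, lambda name: outcomes[name])
```

Because the tree is built ahead of time, the run-time cost of diagnosis is just one test per level along a single root-to-leaf path.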

Several criteria have been used to select the next test. Entropy-based methods [2] select the test that comes closest to balancing the outcomes. That is, the best test provides the most information by exonerating half of the suspects. Pattipati and Alexandridis [3] incorporate component failure probability into the selection, and choose the test that most closely balances the probabilities of the two resulting sets of suspects. Instead of striving to reduce the number of suspects by half, they strive to reduce the probability of the suspects by half. This approach tends to select tests for the most likely causes of failure earlier.
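The difference between balancing the number of suspects and balancing their probability can be illustrated with a small sketch. It is a simplified balance criterion written for this example, not the exact heuristics of [2] or [3]; the function names, the test coverage sets, and the failure probabilities are all assumed values.

```python
from typing import Dict, FrozenSet, Set


def split_imbalance(suspects: Set[str], covered: FrozenSet[str],
                    weight: Dict[str, float]) -> float:
    """Distance of a test's split from a perfect 50/50 division of total weight."""
    total = sum(weight[c] for c in suspects)
    inside = sum(weight[c] for c in suspects if c in covered)
    return abs(inside / total - 0.5)


def pick_test(suspects: Set[str], tests: Dict[str, FrozenSet[str]],
              weight: Dict[str, float]) -> str:
    """Choose the test whose outcome split is closest to balanced."""
    return min(sorted(tests), key=lambda t: split_imbalance(suspects, tests[t], weight))


suspects = {"C1", "C2", "C3", "C4"}
tests = {
    "T1": frozenset({"C1", "C2"}),   # splits the suspects two against two
    "T2": frozenset({"C1"}),         # isolates the most failure-prone component
}

# Equal weights: balance the *number* of suspects.
uniform = {c: 1.0 for c in suspects}
# Assumed failure probabilities: balance the *probability mass* instead.
probability = {"C1": 0.5, "C2": 0.3, "C3": 0.1, "C4": 0.1}

by_count = pick_test(suspects, tests, uniform)
by_probability = pick_test(suspects, tests, probability)
```

With equal weights the two-against-two test T1 is preferred, while with the assumed probabilities T2 is preferred because C1 alone carries half of the probability mass, which matches the observation that probability balancing tests the most likely causes earlier.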

Current approaches are not adequate for the diagnosis of dependable computer-based systems for several reasons. Systems that are designed for dependability generally utilize redundancy for fault tolerance, which implies that several component failures are necessary to fail the system. Most diagnosis approaches assume a single component failure. The use of shared spare components, reconfiguration, error masking and other redundancy management techniques makes diagnosis more difficult than in simpler single-string systems. Reliability analysis techniques have been developed that include these dependencies. For example, the dynamic fault tree (DFT) model and the Galileo reliability analysis tool provide qualitative and quantitative analysis of dependable systems.

Until now, computer-based diagnosis has not taken advantage of quantitative and qualitative reliability analysis. Our automated diagnostics based on reliability analysis (ADORA) methodology combines diagnosis approaches based on entropy with quantitative and qualitative information produced by reliability analysis. ADORA produces a diagnostic decision tree (DDT) that designates the best test to apply next, based on the evidence gathered thus far, cost and information gained by each test, and the quantitative and qualitative analysis provided by the reliability analysis model. ADORA can thus carry the reliability analysis results that supported the design process into the diagnosis process.

In this paper, we expand the ADORA methodology to incorporate more complex statements of evidence. Until now, we have considered test results as a simple partition of the set of suspects. We generalize the statement of evidence by allowing the evidence to be expressed as a Boolean function, and show how these more complex statements of evidence can further improve the diagnosis process. The incorporation of more general evidence supports the use of sensors or other monitors that can report on the status of some combinations of components. The ability to incorporate monitors in the system model supports analysis of monitor placement during the design phase, with the aim of improving the diagnosability of the system. We further generalize the ADORA model to consider the effects of false or misleading evidence, for example when a monitor fails.

The second section of this paper presents the active heat rejection system as an example to demonstrate the various methods presented in this paper. In this section, diagnosis without evidence incorporation is performed. The third section describes a new method for modeling monitors as a separate layer on top of DFTs. The fourth section shows how to use the evidence function captured by the monitor layer to deduce the cutsets under examination (CUE) function from the characteristic function of the system and use it to construct a DDT. The fifth section describes how the DDT accounts for potential monitor or sensor failure. The sixth section discusses monitor placement as part of the design for diagnosability process. The seventh section demonstrates our methodology using a storage area network (SAN) by Dell. The last section concludes this paper.

Section snippets

Example of a DDT

We use the active heat rejection system [4], called the AB system, to explain the diagnosis methodology. This system is an abstraction of a real system used on the International Space Station by NASA. The system design of the AB system is presented in Fig. 2. The AB system consists of two sets of components (A1 and A2) and (B1 and B2). A2 is a backup (cold spare) for A1 and B2 is a backup (cold spare) for B1. At least one of (A1 and A2) and at least one of (B1 and B2) are required for system
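The redundancy structure just described can be written down as a small static sketch: the system is taken to be down when both A components or both B components have failed. This view deliberately ignores the ordering behavior of the cold spares that a DFT would capture, and the function names are illustrative.

```python
from itertools import combinations
from typing import FrozenSet, List

COMPONENTS = ("A1", "A2", "B1", "B2")


def system_failed(failed: FrozenSet[str]) -> bool:
    """Static structure function: at least one A and at least one B are required."""
    a_down = {"A1", "A2"} <= failed
    b_down = {"B1", "B2"} <= failed
    return a_down or b_down


def minimal_cutsets() -> List[FrozenSet[str]]:
    """Enumerate minimal cutsets by brute force, smallest sets first."""
    found: List[FrozenSet[str]] = []
    for size in range(1, len(COMPONENTS) + 1):
        for combo in combinations(COMPONENTS, size):
            candidate = frozenset(combo)
            # Keep only failing sets that contain no smaller cutset already found.
            if system_failed(candidate) and not any(c <= candidate for c in found):
                found.append(candidate)
    return found


cutsets = minimal_cutsets()
```

Under this static assumption the enumeration yields two minimal cutsets, {A1, A2} and {B1, B2}, so no single component failure brings the system down.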

Incorporating monitors and sensors into the DFT

Evidence is acquired from monitors or sensors. Monitors integrated into a system's design usually do not impact the reliability of the system. Thus for a system that undergoes reliability analysis, a fault tree is constructed without including monitors or sensors. To enhance our diagnosis process, we incorporated monitors into the reliability model.

A monitor layer for capturing evidence is appended onto the DFT. Enhancing DFTs allows using one model for both reliability analysis and for

CUE from evidence

The general objective of this section is to develop an algorithm for using evidence to reduce the number of suspected minimal cutsets. Because examining a cutset that caused the system to fail and then repairing the failed components in that cutset should bring the system back up, we can enhance diagnosis by reducing the number of cutsets examined. The CUE is the set of all essential minimal cutsets obtained after evidence eliminates some cutsets. In the presence of evidence, the getCUE algorithm for
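The idea of evidence eliminating cutsets can be illustrated with a short sketch. It is not the getCUE algorithm, whose description is cut off in this excerpt: here evidence is simply a Boolean predicate evaluated under the assumption that exactly the components of a candidate cutset have failed, and the cutsets are those of the simplified AB example. The notion of "essential" cutsets is not modeled.

```python
from typing import Callable, FrozenSet, List

Cutset = FrozenSet[str]


def cutsets_under_examination(cutsets: List[Cutset],
                              evidence: Callable[[Cutset], bool]) -> List[Cutset]:
    """Keep only the minimal cutsets that are consistent with the evidence.

    Assumption for this sketch: `evidence(failed)` returns True when the
    observed monitor readings are possible if `failed` is the set of
    failed components.
    """
    return [c for c in cutsets if evidence(c)]


# Minimal cutsets of the simplified AB example.
all_cutsets = [frozenset({"A1", "A2"}), frozenset({"B1", "B2"})]


# Hypothetical evidence: a monitor reports that A1 is still working.
def a1_is_working(failed: Cutset) -> bool:
    return "A1" not in failed


cue = cutsets_under_examination(all_cutsets, a1_is_working)
```

With this evidence only {B1, B2} remains to be examined, which is the kind of reduction that shortens the resulting DDT.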

Managing monitor and sensor reliability

Monitors and sensors might not be perfectly reliable. A monitor that provides false information can misguide the diagnosis process; thus a monitor failure can jeopardize the correctness of the DDT. The monitor layer does not impact the fault tree solution, so monitors do not appear in the CUE function. We augment the CUE function by adding monitors as cutsets, since a monitor failure produces a faulty diagnosis process just as a minimal cutset failure causes a system failure. The DIF for a
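A minimal sketch of the augmentation step, assuming each monitor is added as a singleton cutset alongside the component cutsets. The monitor name `M_A1` and the function name are hypothetical, and the paper's actual treatment (including how the DIF is computed for a monitor) is not visible in this excerpt.

```python
from typing import FrozenSet, Iterable, List

Cutset = FrozenSet[str]


def augment_with_monitors(cue: List[Cutset], monitors: Iterable[str]) -> List[Cutset]:
    """Return the CUE extended with one single-element cutset per monitor.

    Assumption for this sketch: a failed monitor is treated like a cutset of
    its own, so the diagnosis can also consider "the monitor lied".
    """
    augmented = list(cue)
    for monitor in monitors:
        single = frozenset({monitor})
        if single not in augmented:   # avoid duplicate entries
            augmented.append(single)
    return augmented


cue = [frozenset({"B1", "B2"})]
augmented_cue = augment_with_monitors(cue, ["M_A1"])
```

The original list is left unchanged, so the same CUE can be reused to compare diagnoses with and without the monitor-failure hypothesis.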

Design for diagnosability: monitor and sensor placement

In this paper, we demonstrated how incorporating monitors and sensors into fault trees enhances DDTs; good monitor placement at the design phase can therefore be very effective. The CDIF can be used to decide between candidate monitor locations. To decide between two monitor locations, one can evaluate the DDT for a few design models and choose the DDT with the least diagnostic cost [12], see Eq. (7). However, building several DDTs and evaluating each prior to selecting a location can be avoided by

Case study: a Dell storage system

This is a case study of a real commercial network system. In this section, we diagnose a reliable static configuration of a SAN, which was designed by Dell. The SAN is intended to provide a highly reliable and highly available storage service at the back end of a network, which is the most important requirement for such networks. In [13] a reliable SAN structure is designed and analyzed. Other case studies are presented in [14], which cover a wider range of systems and failure scenarios.

The configuration

Summary and conclusions

In this paper, we showed how ADORA was enhanced by introducing a systematic method of obtaining information from qualitative data about the system and using it to derive the quantitative DDT models. We used the concept of CUE to reduce the produced DDT, thus enhancing the DDT capabilities and offering a significant reduction in the diagnostic cost of the diagnostic tree.

The CDIF, which was used to generate DDTs, has played a significant role in assisting the process of incorporating monitors

References (14)

  • Assaf T, Dugan JB. Approximation of diagnostic importance factors using Markov models for diagnostic test sequencing....
  • Moret B. Decision trees and diagrams. ACM Comput Surveys (1982)
  • Pattipati KR, et al. Application of heuristics search and information theory to sequential fault diagnosis. IEEE Trans Syst Man Cybern (1990)
  • Assaf T, Dugan JB. Diagnostic expert systems from dynamic fault trees. In: Proceedings of the annual reliability and...
  • Dugan JB, Sullivan K, Coppit D. Developing a low-cost, high-quality software tool for dynamic fault tree analysis. IEEE...
  • Fussell JB. How to hand calculate system reliability and safety characteristics. IEEE Trans Reliab (1975)
  • Vesely WE. Fault tree handbook. Technical report NUREG-0492, US Nuclear Regulatory Commission, Washington,...
There are more references available in the full text version of this article.
