On the Language of Reliability: A System Engineer Perspective

Abstract In its classical definition, risk is defined by three elements: what can go wrong, what are its consequences, and how likely is it to occur? This definition makes sense in a regulatory-based framework in which, for the current fleet of operating light water reactors (LWRs), the risks associated with nuclear power plants typically are characterized in terms of core damage frequency and large early release frequency (LERF). However, this approach does not provide a useful snapshot of the health of the plant from a broader perspective. This is due to the very narrow context in which the term “risk” typically is defined, namely nuclear safety aspects that have the potential to impact public health. In this paper, we take the viewpoint of nuclear safety that is reflective of the current fleet of operating LWRs, for which core damage frequency and LERF are appropriate metrics. For advanced reactor designs, other, more applicable technology-neutral reactor safety metrics would be specified. A possible alternate path would start by redefining the word risk with a broader meaning that better reflects the needs of a system health and asset management decision-making process. Rather than asking how likely an event is to occur (in probabilistic terms), we can ask how far this event is from occurring. Our approach starts by defining and quantifying component and system health in terms of a “distance” between actual and limiting conditions, i.e., the margin that exists between the current state/condition and the state where the component/system is no longer capable of achieving its intended function. A margin is a measure that is more reflective of the current state or performance of a component and is therefore more closely tied to decisions that are made on an ongoing basis.
We will show how, given the equipment reliability, monitoring (e.g., pump vibration), and prognostic (e.g., component remaining useful life estimation) data available from the plant, a margin can be described and determined for all types of maintenance approaches (e.g., corrective or predictive maintenance). We show how classical reliability models (e.g., fault trees) can be used to quantify the system margin given component margin values. In the approach described in this paper, the propagation of margin values through classical reliability models is not performed using classical probabilistic calculations applied to sets (as performed in a typical plant probabilistic risk assessment). Instead, we show how it is possible to propagate margin values through Boolean logic gates (i.e., AND and OR operators) using distance-based operations.


I. INTRODUCTION
In order to reduce operation and maintenance costs, nuclear power plants (NPPs) are moving from corrective and periodic maintenance to predictive maintenance (PdM) strategies. Such a transition requires changes in the data that need to be retrieved and the type of decision processes to be employed. Additionally, advanced monitoring and data analysis technologies are essential for supporting the effective application of predictive strategies. These technologies can provide precise information about the health of a component or system, track performance, identify potential degradation trends, and provide an estimate of its expected failure time. With such information, maintenance operations for a component can be planned and performed at a convenient time before failure is expected to occur. This dynamic operation and maintenance context requires new methods to analyze data, propagate component health information from the component to the system level, and optimize plant resources. We refer here to plant health in a prognostic and health management 1 context where component aging and degradation are constantly monitored (e.g., through nondestructive measurement techniques), analyzed (e.g., to identify anomalous behaviors and assess the state of degradation), and forecasted into the future [e.g., estimation of the remaining useful life (RUL) of the monitored component].
Currently, the propagation of quantitative health data from the component to the system level is a challenge given the diverse nature and structure of the data. In general, component health evaluations are performed using qualitative methods to assess equipment health and performance and to specify maintenance strategies to provide assurance that the components are capable of achieving their intended functions. Specific guidance on ensuring plant structures, systems, and components (SSCs) achieve the desired levels of performance is provided in the Institute of Nuclear Power Operations Report AP-913 (Ref. 2). From a U.S. regulatory perspective, license requirements to monitor and manage performance of plant SSCs that are important to safety are specified in the maintenance rule, 3 and industry implementation guidance is specified in NUMARC 93-01 (Ref. 4) with U.S. Nuclear Regulatory Commission endorsement of the approach provided in Regulatory Guide 1.160 (Ref. 5). Over the more than 25 years since the implementation of this rule, U.S. NPP performance and safety have shown substantial improvements; for example, see reports published by the Electric Power Research Institute 6 and the Nuclear Energy Institute. 7 Quantitative plant reliability models and methods [which are typically based on fault trees (FTs) or reliability block diagrams] can effectively propagate data from the component to the system level, but values of failure rates or failure probabilities are an approximate integral representation of past industry-wide operational experience. As an example, pump failure rates are available from equipment reliability (ER) sources (e.g., refer to Ref. 8); failure events that have been recorded for a specific pump can be used to update the failure rate used in a quantitative reliability model through Bayesian updating processes. 9 On the other hand, quantitative plant risk models are average models based on historical data in which the actual current status of pump performance and condition is not considered in the failure rate estimation.
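The Bayesian updating of a failure rate mentioned in this paragraph can be illustrated with a conjugate gamma-Poisson model. This is only a minimal sketch: the text does not state which prior family is used, so the gamma prior, the function name `update_failure_rate`, and all numerical values below are assumptions chosen for illustration.

```python
def update_failure_rate(prior_shape, prior_rate, failures, exposure_hours):
    """Gamma-Poisson conjugate update for a constant failure rate.

    prior_shape, prior_rate: parameters of the assumed gamma prior
        (prior mean = prior_shape / prior_rate, in failures per hour).
    failures: number of failures recorded for the specific pump.
    exposure_hours: operating time over which those failures were recorded.
    Returns the posterior shape, posterior rate, and posterior mean.
    """
    post_shape = prior_shape + failures
    post_rate = prior_rate + exposure_hours
    return post_shape, post_rate, post_shape / post_rate


# Hypothetical industry-wide prior with a mean of 1e-5 failures per hour,
# updated with one failure observed over 20,000 operating hours.
shape, rate, mean_rate = update_failure_rate(0.5, 5.0e4, failures=1, exposure_hours=2.0e4)
print(f"posterior mean failure rate: {mean_rate:.3e} per hour")
```

Note that the updated value still summarizes recorded failure events only; no measurement of the present pump condition enters the calculation.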
A critical challenge for the management of plant reliability is the ability to integrate plant health data and to utilize this information to support decision making (e.g., determine the optimal maintenance activity schedule for a component) that ensures safety (via equipment/system reliability and availability) in a manner that minimizes operational and maintenance costs. To achieve this, plant ER programs primarily rely on qualitative methods described in the industry guidance previously referenced along with quantitative reliability models focusing on regulatory and safety aspects [such as the plant probabilistic risk assessment (PRA) and configuration risk management (CRM) models]. In this paradigm, condition-based data and diagnostic and prognostic information generally are not considered in quantitative plant reliability models, and thus are not used to inform system engineers on the most critical components in decisions made on an ongoing basis.
Thus, the current modeling approach uses average values that do not account for the present component health status (e.g., as obtained from measured diagnostic and condition-based data) and health projections (when available from the prognostic data). Our first claim is that system reliability models should propagate health information from the component to the system and plant level in order to provide a quantitative snapshot of system and plant health and identify the most critical components given the actual conditions and status of the plant. Such an approach would provide a more comprehensive and accurate evaluation of the condition of the plant and support improved and accurate decisions. Our second claim is that, when used to support operational decision making, such as in CRM programs, it is more appropriate to assess risks based on the current status of SSC performance and health and that a reliance on historical performance data may be misleading and result in less than optimal risk-informed decision making.
This paper directly supports these two claims by proposing an alternative approach to performing reliability modeling and plant health management that more directly integrates available component diagnostic, prognostic, and condition-based data to measure component health, propagate this information through quantitative reliability (e.g., FT) models, and provide a more accurate assessment of the plant risk profile to support decision making. In the approach described in this paper, the propagation of health data from the component to the system level is performed, not in terms of probability, but in terms of margins, where margin is defined as the "distance" between the present actual status of the component/system and the expected occurrence of an undesired event (e.g., failure or unacceptable performance). A margin value is quantified for each component based on available monitoring data and the condition at which failure of the SSC is expected to occur; note that, depending on the available monitoring data, component margin is defined over the same dimension as the data itself (e.g., temperature, time, vibration spectrum). Component margin values are then propagated through a reliability model to estimate the margin of the system.
Therefore, system margin is a quantitative measure of system health based on the contribution of constituent component monitoring data. This information is then used to assess which components have the largest impact on system health (either positive or negative); in this respect, we show how a margin-based analysis can be employed to identify the possible ways a system can be successfully operated rather than the ways in which it can fail. At this point, responsible plant system engineers can more effectively prioritize those maintenance activities that provide the largest increases in system health (system health and asset management) in a manner that maximizes plant economics.
In this context, understanding both the positive and negative impacts of margin is important in the decision-making process. Plant SSCs that are observed to operate with small margins represent vulnerabilities where additional attention and resources may be warranted to improve performance and prevent the possible occurrence of SSC failures. Conversely, plant SSCs that are observed to provide excessively large margins may indicate where resources are not being employed in an economical manner. Because these resources are limited, these SSCs are candidates for reduced attention as the marginal benefit being provided by the excess margin is small and the resources could be better applied to other objectives.
Through a cause-effect lens, while classical reliability models target the effects associated with component performance, a margin-based approach focuses on the observations and causes of an undesired component performance (i.e., assessment of component health). Hence, thinking of reliability in terms of margins implies decision making based on causal reasoning. In this paper, we show how FT models can be solved using a margin language and how this process can effectively assist system engineers in identifying the most critical components based on current performance and condition. We note that a margin-based approach has long been applied in structural reliability engineering with methods that are well established and understood within that domain. 10,12-14 Our work described in this paper builds and expands upon this work.

II. SYSTEM ENGINEER DEFINITION OF RISK
In its classical definition, risk is defined by three elements: what can go wrong, what are its consequences, and how likely is it to occur? The likelihood of occurrence is typically described in probabilistic terms. While this choice makes sense in a regulatory-based framework to estimate risks associated with NPPs [e.g., as represented by core damage frequency (CDF) and large early release frequency (LERF) for light water reactor (LWR) plants], this approach, as currently implemented across the industry, relies on static models and thus does not provide an actual snapshot of risk that reflects the current health and performance of plant SSCs. Several observations support this statement:
1. Plant PRA models are based on Boolean logic structures 15 [e.g., event trees (ETs) and FTs] that describe the deterministic functional relationships among plant SSCs, postulated sequences of events, and human interventions. Each basic event (BE) in a PRA model represents a specific elemental occurrence (e.g., failure of a component, failure to perform an action by the plant operators, recovery of a safety system). In these models, a BE is a binary occurrence defined using Boolean logic (i.e., the event can either occur or not occur).
2. A probability value is associated with each BE, which represents the probability that the BE can occur. Maintenance and surveillance operations typically are not completely integrated into a PRA structure (i.e., they are typically accounted for via the historical or estimated average impact on SSC unavailability). Probability values used in the plant models are updated at least every 4 years [as required by 10 CFR 50.71(h)(2) (Ref. 16) and the American Society of Mechanical Engineers and American Nuclear Society PRA standard, 17 which specifies requirements for PRA model upgrades] based on past operational experience through use of a Bayesian statistical process.
3. The modeled availability and reliability probability values associated with plant SSCs are thus an integral representation of the past operational experience for these components, and they do not incorporate information on the present health status of SSCs [e.g., from diagnostic and condition-based maintenance (CBM) data] or health projections (when available from prognostic data) on anticipated changes in SSC condition and performance in the near future.
Given these conditions, the decision-making process related to health and asset management is not well suited to being evaluated and managed using classical PRA models and tools. From the perspective of personnel responsible for the performance of these plant SSCs (i.e., the system engineer), the question becomes: What can we change to improve decision making related to these SSCs? Our starting point is the type of decisions that we want to support. As indicated in Sec. I, we are targeting a system engineer context where, in order to optimize plant operation and maintenance costs, component maintenance activities are performed only when they are needed based on component monitoring data. The objective of this perspective is to ensure limited resources are efficiently applied to the greatest extent practicable while ensuring adequate performance 18 and levels of plant safety are maintained. Figure 1 shows this alternate perspective where, rather than asking how likely an event is to occur (in probabilistic terms), we ask how far this event is from occurring given the current state of the plant and its various constituent elements (i.e., SSCs). We noted earlier that this viewpoint is prevalent in the structural reliability analysis field. 10 It also reflects a more holistic viewpoint on risk for use as an input to decision making than is typically applied in engineering studies (including the commercial nuclear power industry; see, for example, Aven 19 ).
This alternative interpretation of risk transforms the concept from one that focuses on the probability of occurrence to one that focuses on assessing how far away from (or close to) an unacceptable level of performance or failure a component is. This transformation has the advantage that it provides a direct link between the component health evaluation process and standard plant processes used to manage plant performance (e.g., the plant maintenance and budgeting processes). 2,4 The transformation also places the question into a form that is more familiar and readily understandable to plant system engineers and management decision makers. We note that this alternative viewpoint possesses an additional advantage over the standard approach using static models. The evaluation of margins, when combined with the integration of other predictive analyses, such as trending and prognostics, provides information related to the timeframe in which actions should be taken to address identified degradations in performance. As indicated previously, because the current PRA approaches use static models, an assessment of the time element on the impact of these conditions on plant safety risk (e.g., in terms of potential CDF and LERF impacts for the current fleet of LWR NPPs) is either not possible or is incorporated in an arbitrary manner.
At this point, to apply the alternative approach we have proposed here, models need the capability to quantify component margins as a distance between the component's existing condition and the point where its performance or condition becomes unacceptable (i.e., action is required). Note that the concept of distance does not necessarily need to be measured in terms of time, as we will show in the next sections. In other terms, rather than using "failure language," we measure component health in terms of margin. Such "margin language" relies only on the available component monitoring data. Note that multiple monitoring data sources might be available for a specific component; hence, in this situation component margin is defined over multiple dimensions. Given the data available from plant data sources and monitoring, diagnostic, and prognostic centers, a margin can be described in either probabilistic (through a probability distribution function) or point value forms. Additionally, since the evaluation of margin is dependent on the existing state of the SSC being evaluated, the concept of margin is time dependent. The application of this framework to system health as proposed here is centered on the integration and evaluation of available data to assess component condition and performance. Component monitoring data (e.g., pump vibration data) are the main ingredients used to determine component margin; hence, a margin is an observable quantity that can be estimated given information related to the current condition and performance of plant SSCs and defined operational criteria. Thus, this framework requires the definition of the following metrics:

III. RELIABILITY MARGIN: A DEFINITION
1. Space: The space definition should be based solely on the type of data that can be directly measured and obtained by the system engineer.
2. Distance: Once the space is defined, the framework needs a measure of the distance between two points located in this space.
The goal now is to define appropriate space and distance metrics for two classes of plant health monitoring-based maintenance approaches 20 : CBM and PdM. We have targeted these maintenance approaches since they have been identified as strategies that can reduce operation and maintenance costs (see Sec. I). In order to show the bridge between current reliability methods (which focus on a component probability of failure) and the proposed margin-based method, we also considered a corrective maintenance (CM) approach. For each maintenance approach, the objective is to develop measures that provide useful indicators of component condition and performance that plant system engineers can use to cost-effectively assess and manage their assigned components. The following sections identify the available data that are relevant to each maintenance class, the space in which the relevant ER data are applicable, and the relevant margin measure for each class.

III.A. Corrective Maintenance
This class represents the simplest maintenance class, as the components assigned to this class have been identified as not critical to plant production or safety. Since the components in this class are allowed to run to a point in time at which maintenance is required to restore the equipment to proper operation (referred to as the "run to maintenance" condition 21 ), the only information available is the time a degraded condition was detected or a failure occurred that resulted in the generation of the CM work order to restore component functionality and condition. For a particular component under this maintenance regime, the margin can be obtained from the difference in time between when the component was placed in service and the mean time to failure (MTTF) for a population of similar SSCs operated under similar conditions, as observed in Fig. 3:
1. data: observed failure times for similar components situated within similar boundary conditions (e.g., operational environment and usage)
2. space: time
3. margin: distance between actual operation time since installation or most recent refurbishment and the set of observed failure times.
In this and the following subsections, it is noted that the depictions provided are intentionally simplistic in order to provide a clear indication of the fundamental concepts. Additionally, because the concept of margin is time dependent, monitoring the evolution of margin becomes an important input to the evaluation and asset management decision process. For this first category, these SSCs have been determined to have a negligible impact on plant operation and safety. Therefore, for this case, the use of MTTF provides a relatively simple representation of the acceptable performance criterion, and estimating the time evolution of margin for the purposes of decision making is particularly simple (i.e., it reduces to the time since the SSC was last maintained, assuming the MTTF is constant).
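As a minimal sketch of the CM margin described above, the following snippet measures the margin in the time space as the difference between an MTTF estimated from observed failure times and the time a component has been in service. The function name `cm_margin` and the failure times are hypothetical, and using the sample mean as the MTTF estimate is an assumption.

```python
from statistics import mean


def cm_margin(time_in_service, observed_failure_times):
    """Margin in the time space: estimated MTTF minus time in service.

    observed_failure_times: failure times of similar components operated
        under similar conditions (same time unit as time_in_service).
    A negative value means the component has outlived the estimated MTTF.
    """
    mttf = mean(observed_failure_times)  # sample mean used as MTTF estimate
    return mttf - time_in_service


# Hypothetical failure times (hours) for a population of similar components.
failure_times_h = [41_000, 52_000, 47_500, 55_500]
print(cm_margin(30_000, failure_times_h))
```

Because the MTTF is treated as constant, the margin simply decreases one-for-one with operating time until the component is maintained.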

III.B. CBM: Diagnostic
For this (and the following) maintenance class, data are available that evaluate the condition and performance of the monitored components. Often the data are available in real time. In this category, the data evaluation approaches are diagnostic in nature but limited in the extent to which the applied monitoring can predict the course of future degradation over time (Fig. 4):
1. data: actual component condition and past condition data for similar SSCs
2. space: component condition (e.g., oil temperature, vibration spectrum)
3. margin: distance between actual component condition and assessed limiting conditions. Such limiting conditions can be observed past conditions that led to failure or can be specified by the component technical specification (such as the required operating point on a pump curve).
Note that in this paradigm, there exists a greater level of knowledge associated with the current condition of the monitored SSC as well as an estimate of the conditions under which performance would be considered unacceptable and for which corrective actions would need to be implemented. However, similar to the previous class, the time at which these conditions would be reached is estimated in a manner that possesses a high degree of uncertainty (e.g., based on historical experience at the plant or engineering judgment). Hence, decisions made under this paradigm will incorporate a greater degree of conservatism than those that are made under the class in Sec. III.C, in which explicit prognostic models are available to provide quantitative predictions of the timing and impacts related to the progress of the observed degradations.

III.C. PdM: Diagnostic + Prognostic
Similar to the preceding class, data for components in this maintenance class are available to evaluate their condition and performance, again often in real time. However, as described previously, the approaches are prognostic in nature in this category and can provide an accurate prediction of the course of future degradation over time. Therefore, once a degraded performance is detected for a particular component, the appropriate action is to apply the relevant models using the given data to obtain margin predictions for the components and to integrate these evaluations (including any identified contingency actions such as the conduct of additional or more sophisticated, targeted monitoring), as shown in Fig. 5:
1. data: actual SSC condition, past condition data for similar SSCs, and predictive model of future degradation that provides the estimated RUL for the monitored SSC
2. space: time
3. margin: distance between actual time and estimated RUL.
For components where a prognostic analysis is available, it is possible to estimate the component's RUL when component degradation starts to emerge. Typically, RUL is quantified in probabilistic terms where a probability distribution function defined over the time axis $t$ is generated, $PDF_{RUL}(t)$. In such a case, the margin can be estimated using two approaches. The first defines the margin as $M = 1 - CDF_{RUL}(t)$, where $CDF_{RUL}$ indicates the cumulative distribution function corresponding to $PDF_{RUL}$. (Note that in this paper, the abbreviation $CDF_X$ with a subscript always should be interpreted as a cumulative distribution function, while the abbreviation CDF without a subscript is an acronym for core damage frequency.) The second approach estimates the margin as the distance between the actual component life and a point estimate of the RUL distribution (e.g., the fifth percentile).
A graphical representation of the margins for both approaches is provided in Fig. 6 for an estimated RUL that is normally distributed, as shown in red. Note that the proposed approach updates the margin value whenever component health is measured and whenever a better RUL estimate (i.e., one with less uncertainty) is available from the corresponding prognostic model.
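The two margin estimates for a probabilistic RUL can be sketched as follows for a normally distributed RUL. This is an illustration under stated assumptions: the distribution parameters are hypothetical, the function names are not from the paper, and the current time is assumed to be expressed on the same time axis as the RUL distribution.

```python
from statistics import NormalDist


def margin_survival(rul_dist, t):
    """First approach: M = 1 - CDF_RUL(t)."""
    return 1.0 - rul_dist.cdf(t)


def margin_percentile(rul_dist, t, q=0.05):
    """Second approach: time distance between t and a point estimate
    of the RUL distribution (here its q-th quantile, 5% by default)."""
    return rul_dist.inv_cdf(q) - t


# Hypothetical RUL estimate: mean 1000 h, standard deviation 100 h.
rul = NormalDist(mu=1000.0, sigma=100.0)
t_now = 800.0

print(f"M = 1 - CDF_RUL(t): {margin_survival(rul, t_now):.3f}")
print(f"distance to 5th percentile: {margin_percentile(rul, t_now):.1f} h")
```

The first estimate is dimensionless and lies between 0 and 1, while the second is expressed in time units; a narrower RUL distribution (smaller `sigma`) moves the fifth percentile closer to the mean and therefore increases the second margin.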

IV. NORMALIZED MARGIN
The definitions of margin indicated in the previous sections are defined over the different spaces that are reflective of the different maintenance classes and available data. The goal now is to answer this question: Is there a way to transform the definitions of margin indicated in Secs. III.A, III.B, and III.C in such a way that they can be compared? A possible approach is to normalize the margin definitions in Secs. III.A, III.B, and III.C, where we use a normalized margin $\tilde{M}$ in which each margin is normalized over its specific attributes. This normalization results in a standardized evaluation set such that $0 \le \tilde{M} \le 1$. When $\tilde{M} = 1$, the component is considered healthy; similarly, if $\tilde{M} = 0$, the component is considered failed. Intermediate cases where $\tilde{M}$ is between these values are indicative of some level of observed degradation in performance or condition (more degradation as one approaches 0) and can be viewed as representing a level that reflects a normalized estimate of the SSC's RUL. In this framework, $\tilde{M}$ thus provides a relative assessment criterion that can be compared across the different maintenance classes and for which consistent evaluation and decision criteria can be developed.
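One way to obtain a margin between 0 and 1 is a linear rescaling between a healthy reference value and the limiting condition. The sketch below is an assumption made for illustration, not the normalization used by the authors: the function `normalized_margin`, the linear form, and the clipping to [0, 1] are all hypothetical choices that merely satisfy the stated properties (1 for a healthy component, 0 for a failed one).

```python
def normalized_margin(actual, healthy, limit):
    """Linearly rescale a measured condition to [0, 1].

    Returns 1.0 when actual equals the healthy reference and 0.0 when it
    reaches (or passes) the limiting condition. Works whether the limit
    lies above or below the healthy reference.
    """
    if healthy == limit:
        raise ValueError("healthy reference and limit must differ")
    m = (limit - actual) / (limit - healthy)
    return min(1.0, max(0.0, m))  # clip values outside the [0, 1] range


# Hypothetical CBM example: oil temperature, healthy 60 C, limit 90 C.
print(normalized_margin(actual=75.0, healthy=60.0, limit=90.0))

# Hypothetical CM example: time in service against an MTTF of 49,000 h.
print(normalized_margin(actual=30_000, healthy=0, limit=49_000))
```

With such a rescaling, a temperature-based margin and a time-based margin become directly comparable even though they are defined over different spaces.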

IV.A. Examples of Component Margin Estimation
The scope of this section is to provide practical examples of how component margins can be estimated using available condition-based data. As indicated in Sec. III, the margin can be calculated as the distance between actual and limiting conditions. In practical settings, limiting conditions can be represented by technical criteria specific to the component. As an example, for induction motors, oil viscosity may be required to be below a specified limiting condition to ensure proper motor function. Hence, the margin can be calculated as the difference between the specified limiting condition and the currently measured oil viscosity.
As another example, for centrifugal pumps, typical degradation processes affect pump mechanical seals. The pump vibration signal is constantly monitored, and statistical indicators, such as the root mean square (RMS), of the signal are determined. In such scenarios, RMS analyses observed when seals are degraded beyond their limit are available from the manufacturer for different pump rotation speeds. In this case, an applicable margin could be defined as the difference between the manufacturer-specified acceptance level and observed RMS data.
As another example, cage winding issues for three-phase induction motors may emerge within a few years due to premature aging 22 and are usually caused by the degradation of the electrical insulation in the rotor (if present) and stator windings. In this context, current signature analysis could be used to detect cage winding issues. This is performed by identifying two sideband currents centered around frequency $f_2$:
Fig. 6. Margin values obtained from the two proposed approaches (green and blue lines) given an estimate of a component's RUL (red line).

ON THE LANGUAGE OF RELIABILITY • MANDELLI et al. 1643
where $f_1$ indicates the supply frequency (i.e., 60 Hz) and $s$ indicates the slip factor. The locations of the two sideband currents $f_{sb}$ are shown in Fig. 7. If significant sideband currents $I_{f_{sb}}$ are present (dB difference between the current $I_{f_1}$ at $f_1$ and the average sideband $f_{sb}$ height ≤ 45 dB), cage winding breaks are likely to occur. Given this information, a margin can be defined for this component.
As a final example, for motor-driven centrifugal pumps, the pump vibration signal is constantly monitored using standard accelerometers, and data for normal and failed conditions are often available from manufacturers. Statistical indicators, such as the RMS, of the signal can be measured. Examples of RMS analyses observed when seals are degraded beyond their limit for different pump rotation speeds are given in Ref. 23 and shown in Fig. 8. In this context, margin $\tilde{M}$ can be defined in terms of $T$, where $T$ represents the RMS value measured under different conditions: normal, damaged, and observed.
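The two examples above can be sketched in code under explicit assumptions. The sideband locations use the expression commonly quoted in motor current signature analysis, $f_{sb} = f_1 (1 \pm 2s)$, which is assumed here rather than taken from the text. The dB margin relative to the 45 dB threshold, the averaging of sideband amplitudes, and the linear RMS normalization are likewise hypothetical choices; all function names and numbers are illustrative.

```python
import math


def sideband_frequencies(f1_hz, slip):
    """Assumed sideband locations f_sb = f1 * (1 -/+ 2s)."""
    return f1_hz * (1.0 - 2.0 * slip), f1_hz * (1.0 + 2.0 * slip)


def db_difference(i_f1, i_sidebands):
    """dB difference between the current at f1 and the average sideband
    amplitude (amplitudes averaged before conversion: an assumption)."""
    avg_sideband = sum(i_sidebands) / len(i_sidebands)
    return 20.0 * math.log10(i_f1 / avg_sideband)


def sideband_margin_db(i_f1, i_sidebands, threshold_db=45.0):
    """Hypothetical margin: distance in dB above the 45 dB threshold."""
    return db_difference(i_f1, i_sidebands) - threshold_db


def rms_margin(t_observed, t_normal, t_damaged):
    """Hypothetical normalized margin from RMS values T measured under
    normal, damaged, and observed conditions (1 healthy, 0 damaged)."""
    m = (t_damaged - t_observed) / (t_damaged - t_normal)
    return min(1.0, max(0.0, m))


lower, upper = sideband_frequencies(60.0, slip=0.02)
print(f"sidebands at {lower:.1f} Hz and {upper:.1f} Hz")
print(f"sideband margin: {sideband_margin_db(10.0, [0.01, 0.01]):.1f} dB")
print(f"RMS margin: {rms_margin(1.5, t_normal=1.0, t_damaged=2.0):.2f}")
```

A sideband margin at or below zero corresponds to the condition in which cage winding breaks are considered likely.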

V. MARGIN-BASED RELIABILITY CALCULATIONS
Current reliability models used in NPP PRAs are based on Boolean logic structures 15 (e.g., ETs and FTs) and are solved probabilistically using classical probabilistic calculations applied to sets, as observed in Fig. 9. Given a set $A$ in a sample space $S$, we can associate a probability value $P(A)$ with such a set such that $0 \le P(A) \le 1$, where $P(\emptyset) = 0$ (here $\emptyset$ indicates the empty set) and $P(S) = 1$. From a set theoretic perspective, it is of interest to measure (in a probabilistic sense) the probability associated with the occurrence of both (i.e., the intersection of $A$ and $B$) or either (i.e., the union of $A$ and $B$) events. In reliability modeling, the occurrence of both or either event is associated with the logical AND and OR operators, respectively. While the union of events maps the region in $S$ covered by $A$ and $B$ combined, their intersection maps the overlap region between $A$ and $B$.
Given this, it is possible to calculate the probability of the union and intersection of two events $A$ and $B$ as follows:
$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$
$$P(A \cap B) = P(A)\,P(B \mid A)$$
where the second expression reduces to $P(A)\,P(B)$ when $A$ and $B$ are independent.
The goal now is to solve the AND and OR operators using margin $\tilde{M}$ values as inputs. The rationale is to propagate margin values from the component level to the system level, where the system reliability model is still represented by classical reliability modeling approaches (e.g., FTs). We want to assess the margin at a level where the degraded performance or failures of SSCs would result in actual consequences to system performance or economics (i.e., at the system or train level). Note that a margin value is an observable variable that is based on available monitoring data for each component. When such information is propagated to the system level through classical reliability models, the obtained system margin value is an observable variable that represents system health. Even though the definition of $\tilde{M}$ results in a value between 0 and 1, it is not appropriate to interpret margin values as probabilities in this paradigm.
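For comparison with the margin-based propagation, the classical probabilistic evaluation of AND and OR gates can be sketched as below. The sketch assumes statistically independent basic events, and the small fault tree and its probabilities are hypothetical.

```python
def p_and(p_a, p_b):
    """AND gate: probability that both independent events occur."""
    return p_a * p_b


def p_or(p_a, p_b):
    """OR gate: probability that at least one of two independent
    events occurs (inclusion-exclusion)."""
    return p_a + p_b - p_a * p_b


# Hypothetical fault tree: TOP = (A AND B) OR C, with independent A, B, C.
p_a, p_b, p_c = 0.1, 0.2, 0.05
p_top = p_or(p_and(p_a, p_b), p_c)
print(f"P(TOP) = {p_top:.4f}")
```

The inputs here are failure probabilities derived from historical data, whereas the margin-based approach feeds the same gate structure with observable margin values.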
Consider two components, A and B. The $\tilde{M}$ for both components can be visualized in a two-dimensional (2-D) space, as shown in Fig. 10. Starting with brand new components (i.e., $\tilde{M}_A = \tilde{M}_B = 1$), the effect of aging degradations that affect each can be represented by the blue line in Fig. 10, which parametrically represents the combination of the normalized margins $(\tilde{M}_A, \tilde{M}_B)$ at a point in time $t$. Note that if no maintenance (preventive or corrective) is ever performed on either component, this path would move from the coordinates (1,1), representing components A and B at the beginning of life, to the coordinates (0,0) where both components have failed. Similar to the set-based visualization of Fig. 9, we can identify the following regions in Fig. 10:
1. the occurrence of both failure events when $\tilde{M}_A = 0$ and $\tilde{M}_B = 0$ (i.e., the intersection point between the two axes of Fig. 10).
At this point, we can calculate the margin $\tilde{M}$ for these events. This is accomplished by following the margin definition, that is, by measuring the distance between the actual condition of components A and B and the conditions identified by the event under consideration (e.g., the occurrence of both or either event). Thus, the distance metric should be selected based on the intended analysis, either conservative (Euclidean metric) or optimistic (Manhattan metric).
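The two bounding metrics can be sketched as follows. For the AND gate, the margin is taken as the distance from the current point to the origin, where both components have failed. For the OR gate, the sketch assumes the event region is the pair of axes (where at least one margin is zero), so that the distance is the smallest component margin; this OR definition is an assumption made for illustration, and all function names are hypothetical.

```python
import math


def margin_and_euclidean(margins):
    """Distance to the all-failed point (origin), Euclidean metric."""
    return math.sqrt(sum(m * m for m in margins))


def margin_and_manhattan(margins):
    """Distance to the all-failed point (origin), Manhattan metric."""
    return sum(margins)


def margin_or(margins):
    """Assumed OR-gate margin: distance to the nearest axis, i.e., to the
    closest state in which at least one component has failed."""
    return min(margins)


m_a, m_b = 0.6, 0.8  # hypothetical normalized component margins
print(f"AND, Euclidean: {margin_and_euclidean([m_a, m_b]):.2f}")
print(f"AND, Manhattan: {margin_and_manhattan([m_a, m_b]):.2f}")
print(f"OR: {margin_or([m_a, m_b]):.2f}")
```

Since the Euclidean distance never exceeds the Manhattan distance for non-negative margins, the Euclidean value is the smaller (conservative) estimate and the Manhattan value the larger (optimistic) one. The functions accept any number of margins, so they apply unchanged in an n-dimensional space.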
Given these two bounds, which metric should be chosen? This question can be answered by referring to how condition-based data are collected. From a practical standpoint, condition-based data are collected from plant components on a regular basis (either continuously in time or at prescribed time intervals). Hence, at any specific time we can quantify not only the margins for components A and B (i.e., $\tilde{M}_A$, $\tilde{M}_B$), but also how these margin values change over time. An example of this process can be visualized with the graphics in Fig. 11, where the margin value for both components uniformly decreases as a function of time (i.e., $\partial \tilde{M}_A / \partial t$ and $\partial \tilde{M}_B / \partial t$ are constant and do not change with time), but where the degradation of component A occurs at a faster rate than component B (i.e., $\partial \tilde{M}_B / \partial t < \partial \tilde{M}_A / \partial t$). In such conditions, the temporal evolution of $\tilde{M}_A$ and $\tilde{M}_B$ is represented by the continuous blue line in Fig. 11. Starting from point α (where components do not show any degradation, that is, brand new components or components recently refurbished to as-good-as-new condition), the blue line progresses up to the actual measured conditions (point β). Given the estimate of $\partial \tilde{M}_A / \partial t$ and $\partial \tilde{M}_B / \partial t$, it is now possible to estimate the progression of $\tilde{M}_A$ and $\tilde{M}_B$ in the future (i.e., the dashed blue line in Fig. 11). This information can be used to estimate $\tilde{M}(A \text{ AND } B)$ using basic trigonometry rules as the length of the segment $\overline{\beta 0} = \overline{\beta\gamma} + \overline{\gamma 0}$. Note that this estimate of $\tilde{M}(A \text{ AND } B)$ is still bounded by the Euclidean and Manhattan metrics but provides a more accurate estimate. Last, note that all these distance operations can be extended from a 2-D to a generic n-dimensional space. In this analysis, once point γ is reached (reflecting the condition where component B has failed), estimating the time required to reach the point where both components A and B have failed becomes a one-dimensional problem, dependent only on the condition of component A at the time of the component B failure and on the rate of degradation of component A (which was linear in this simple example). It is noteworthy that the information used in the margin approach also provides the capability to estimate the time required to reach additional failure states that may be of interest for maintenance planning.
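A minimal sketch of the path-length estimate is given below, assuming constant (linear) degradation rates as in the example of Fig. 11. The function name, the sign convention (rates are negative margin changes per unit time), and the numeric values are assumptions made for illustration.

```python
import math

def trajectory_margin(m_a, m_b, rate_a, rate_b):
    """Length of the path beta -> gamma -> 0 under constant degradation rates.

    m_a, m_b       : current normalized margins (point beta)
    rate_a, rate_b : constant rates of change of the margins (must be negative)
    Returns (path_length, time_until_both_components_have_failed).
    """
    if rate_a >= 0 or rate_b >= 0:
        raise ValueError("degradation rates must be negative")
    t_a = m_a / -rate_a            # time for A to reach zero margin
    t_b = m_b / -rate_b            # time for B to reach zero margin
    t_first = min(t_a, t_b)        # time at which point gamma is reached
    # Leg beta -> gamma: both margins decrease together along a straight line.
    leg_beta_gamma = math.hypot(rate_a * t_first, rate_b * t_first)
    # Leg gamma -> 0: one-dimensional, only the surviving component has margin left.
    leg_gamma_zero = max(m_a + rate_a * t_first, m_b + rate_b * t_first)
    return leg_beta_gamma + leg_gamma_zero, max(t_a, t_b)

# Hypothetical current margins and rates (per unit time).
path, t_both = trajectory_margin(0.8, 0.6, -0.01, -0.02)
euclidean = math.hypot(0.8, 0.6)
manhattan = 0.8 + 0.6
print(f"{euclidean:.4f} <= {path:.4f} <= {manhattan:.4f}; both failed at t = {t_both:.1f}")
```

The same function also returns the time at which both components are predicted to have failed, which is the kind of additional information mentioned above for maintenance planning.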
Note that the historic information of $\partial \tilde{M}_A / \partial t$ and $\partial \tilde{M}_B / \partial t$ can be used to predict future $\tilde{M}_A$ and $\tilde{M}_B$ evolutions. As an example, Fig. 12 shows the historic evolution of $\tilde{M}_A$ and $\tilde{M}_B$ and their predicted evolution based on the derivative information. The predicted evolution has been calculated by generating a random walk out of the distribution of $\partial \tilde{M}_A / \partial t$ and $\partial \tilde{M}_B / \partial t$. The same information can be plotted on the temporal scale, as indicated in Fig. 13 where, given existing $\tilde{M}_A$ and $\tilde{M}_B$ data, the value of $\tilde{M}(A \text{ AND } B)$ using $\partial \tilde{M}_A / \partial t$ and $\partial \tilde{M}_B / \partial t$ is calculated (see the green line in Fig. 13). The predicted evolution of $\tilde{M}(A \text{ AND } B)$ can be calculated given the predicted evolution of $\tilde{M}_A$ and $\tilde{M}_B$ (see the random walks plotted in red, blue, and purple for $\tilde{M}_A$, $\tilde{M}_B$, and $\tilde{M}(A \text{ AND } B)$, respectively).
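One possible way to generate such a random walk is sketched below. It resamples the observed step-to-step changes of each margin history, which is only one reading of "the distribution of" the derivatives; the function name, the histories, and the use of the Euclidean bound for the AND margin are all assumptions made for illustration.

```python
import math
import random

def random_walk_forecast(history, n_steps, rng):
    """Extend a margin history by resampling its past per-step changes.

    The forecast starts at the last observed value and is clipped at zero,
    since a margin cannot become negative.
    """
    changes = [later - earlier for earlier, later in zip(history, history[1:])]
    path = [history[-1]]
    for _ in range(n_steps):
        path.append(max(0.0, path[-1] + rng.choice(changes)))
    return path

rng = random.Random(42)  # fixed seed so the sketch is repeatable
# Hypothetical margin histories sampled at equal time intervals.
hist_a = [1.00, 0.96, 0.93, 0.88, 0.85]
hist_b = [1.00, 0.98, 0.97, 0.95, 0.94]

pred_a = random_walk_forecast(hist_a, 20, rng)
pred_b = random_walk_forecast(hist_b, 20, rng)
# AND margin along the forecast, here using the Euclidean (conservative) bound.
pred_and = [math.hypot(a, b) for a, b in zip(pred_a, pred_b)]
```

Repeating the forecast with different seeds would give a family of trajectories from which a spread of predicted margins could be read.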

VI. MARGIN-BASED SYSTEM RELIABILITY CALCULATIONS
In the previous sections, we indicated that margin values can be propagated through Boolean logic gates (i.e., AND and OR operators). The objective of this section is to show how the proposed margin-based operations can be applied to classical reliability models such as FTs. In a classical probabilistic setting, the FT calculation proceeds through the following steps:
1. Construct the system FT.
2. Generate the cut sets and minimal cut sets (MCSs) from the FT.
3. Assign a probability to each BE.
4. Calculate the probability of the union of the MCSs (Ref. 15).
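To make the cost of the last step concrete, the sketch below evaluates the probability of the union of MCSs by inclusion-exclusion. It assumes statistically independent BEs, which is an assumption of this sketch and not a statement from the text; the BE names, the MCS list, and the probabilities are hypothetical.

```python
from itertools import combinations
from math import prod

def union_probability(mcss, p):
    """Exact P(MCS_1 or ... or MCS_n) by inclusion-exclusion.

    mcss : list of frozensets of BE names
    p    : dict mapping BE name -> failure probability (BEs assumed independent)
    The number of terms grows as 2**n - 1 with the number n of MCSs.
    """
    total = 0.0
    for k in range(1, len(mcss) + 1):
        sign = 1.0 if k % 2 == 1 else -1.0
        for group in combinations(mcss, k):
            # Intersection of cut sets = all BEs appearing in any of them occur.
            events = frozenset().union(*group)
            total += sign * prod(p[be] for be in events)
    return total

# Hypothetical BE probabilities and MCSs.
p = {"A": 0.01, "B": 0.05, "C": 0.02, "D": 0.04, "E": 0.03}
mcss = [frozenset("A"), frozenset("BD"), frozenset("BE"),
        frozenset("CD"), frozenset("CE")]
p_top = union_probability(mcss, p)
n_terms = 2 ** len(mcss) - 1  # 31 intersection terms for only five MCSs
```

With five MCSs the sum already has 31 terms; for models with very many MCSs the exact sum becomes impractical.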
Step 4 is typically time consuming since the probability of the union of the MCSs involves the calculation of the probability of the intersection between 1, 2, 3, …, MCSs as indicated in Eq. (4), and the current generation of nuclear plant PRA models possesses a very large number of MCSs. In Sec. V, we showed how margin-based reliability calculations are not based on classical set theory, but instead on metric space (i.e., distance-based) operations. Hence, exact solutions can be obtained extremely fast. More precisely, reliability calculations using $\tilde{M}$-based data can be performed by completing these four steps:
1. Construct the FT. At this point, an FT only contains deterministic information about the architecture of the system under consideration (i.e., it simply models how the BEs are related to each other from a functional perspective).
2. Generate the cut sets and the MCSs from the FT. As also indicated in step 1, an MCS still represents the minimal combinations of BEs that lead to the top event.
3. Assign a margin value $\tilde{M}$ to each BE.
4. Calculate $\tilde{M}$ of the union of the MCSs (see also Sec. IV).
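The four steps above can be sketched as follows. The sketch assumes the Euclidean metric for the AND combination of BEs inside an MCS and the minimum for the OR combination across MCSs; both choices, the function names, and the BE margins are assumptions made for illustration, not values taken from the paper's figures.

```python
import math

def mcs_margin(mcs, margins):
    """AND of the BEs in one MCS: Euclidean distance to the all-failed state."""
    return math.sqrt(sum(margins[be] ** 2 for be in mcs))

def top_event_margin(mcss, margins):
    """OR across MCSs (assumed rule): the MCS closest to occurring dominates."""
    return min(mcs_margin(mcs, margins) for mcs in mcss)

# Steps 1-2: MCSs of a hypothetical FT (one single-BE cut set, four pairs).
mcss = ["A", "BD", "BE", "CD", "CE"]
# Step 3: hypothetical normalized margins assigned to each BE.
margins = {"A": 0.9, "B": 0.3, "C": 0.7, "D": 0.4, "E": 0.8}
# Step 4: margin of the union of the MCSs.
per_mcs = {mcs: mcs_margin(mcs, margins) for mcs in mcss}
m_top = top_event_margin(mcss, margins)
for mcs, value in per_mcs.items():
    print(f"M({mcs}) = {value:.4f}")
print(f"M(top event) = {m_top:.4f}")
```

Each MCS is evaluated once and the top-event value is a single minimum, so the cost grows linearly with the number of MCSs rather than with the number of their intersections.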
A relevant point here is the fact that personnel directly responsible for plant operation (i.e., plant operators and system engineers) are generally more interested in the possible ways plant systems can be successfully operated rather than the ways in which they can fail. This viewpoint also is typically the way in which plant operations are conducted; for example, plant emergency operating procedures are structured to provide alternative paths to achieving a successful outcome during a plant event with the objective of placing the plant in a safe and stable condition regardless of the number of equipment failures that may have occurred. Hence, the viewpoint of plant personnel typically occurs within a "success space" rather than in the "failure space" in which plant PRAs are performed. Thus, rather than determining system MCSs, for plant system engineers it is more relevant to obtain the system path sets (Ref. 24). Given that a margin value is designed to quantify the health of plant SSCs, the evaluation of the margin of a path set (using the same operations described in Sec. V) provides a system engineer with information about the health of such a path set.

VII. RELIABILITY IMPORTANCE MEASURE
As part of system reliability modeling, it is always important to determine the importance of each BE. In a PRA setting that evaluates nuclear safety from the perspective of CDF and LERF (for LWRs), this process relies on risk importance measures (Ref. 15), such as Birnbaum or Fussell-Vesely. Given the different nature of margin values, it is possible to perform a reliability importance ranking by relying on a classical sensitivity measure (derivative based) for each BE defined as follows:

$$\mathrm{RIM}_{BE} = \frac{\partial \tilde{M}_{sys}}{\partial \tilde{M}_{BE}}$$

where RIM is the reliability importance measure. In other words, the margin-based $\mathrm{RIM}_{BE}$ indicates how a small variation of the margin $\tilde{M}_{BE}$ of a particular BE under consideration directly affects the margin of the system $\tilde{M}_{sys}$. Interpreted this way, the $\mathrm{RIM}_{BE}$ value can support system engineer questions, such as what is the added value of allocating operations and maintenance funds to reduce the impact of component failure on a BE? Similarly, $\mathrm{RIM}_{BE}$ values can be used to support future decisions, such as when would it be most appropriate to evaluate component BE health status? The first question is directly answered by $\mathrm{RIM}_{BE}$ as follows. If a specific maintenance operation can restore component margin by a value of $\Delta \tilde{M}_{BE}$, then the system margin improves by a quantity equal to $\Delta \tilde{M}_{sys} = \mathrm{RIM}_{BE} \times \Delta \tilde{M}_{BE}$. The second question can be answered by measuring the temporal evolution $\partial \tilde{M}_{BE} / \partial t$ and estimating the time at which $\tilde{M}_{BE} = 0$ using the same trigonometry rules indicated in Sec. V and graphically shown in Fig. 11.
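A derivative-based importance measure of this kind can be approximated numerically, as in the sketch below. It uses a forward finite difference and reuses the assumed gate rules (Euclidean distance for AND, minimum for OR); the function names, the step size, and the BE margins are hypothetical.

```python
import math

def system_margin(margins, mcss):
    """Top-event margin: min over MCSs of the Euclidean AND margin (assumed rules)."""
    return min(math.sqrt(sum(margins[be] ** 2 for be in mcs)) for mcs in mcss)

def rim(be, margins, mcss, h=1e-6):
    """Forward finite-difference estimate of d(M_sys)/d(M_BE)."""
    bumped = dict(margins)
    bumped[be] += h
    return (system_margin(bumped, mcss) - system_margin(margins, mcss)) / h

# Hypothetical MCSs and BE margins.
mcss = ["A", "BD", "BE", "CD", "CE"]
margins = {"A": 0.9, "B": 0.3, "C": 0.7, "D": 0.4, "E": 0.8}

rims = {be: rim(be, margins, mcss) for be in margins}
ranking = sorted(rims, key=rims.get, reverse=True)
for be in ranking:
    print(f"RIM_{be} = {rims[be]:.3f}")
```

In this hypothetical case only B and D, the BEs of the dominating MCS, have a nonzero value, while BEs dominated by a lower-margin BE get zero. Because the system margin is nonlinear in the BE margins, the product of the importance measure and a margin change is a first-order estimate that holds for small changes.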
Note that the definition of $S_{BE}$ does not account for any temporal evolution information. However, to address this aspect of SSC performance, an alternative RIM could be the following:

VIII. LINK BETWEEN RELIABILITY MODELING APPROACHES
At this point, it is relevant to present the structural differences between classical reliability models (i.e., based on probability of failure) and a margin-based approach. These differences can be described through a cause-effect lens, as in Fig. 14. Classical reliability models focus on the effect node (i.e., they model component failure), and component reliability data are used to assess the system failure probability. Such models are used to monitor plant risk (as currently done by plant risk monitors) and to set "offline" decisions, such as setting periodic surveillance and maintenance activities (i.e., a preventive maintenance context), or to set the duration of planned system maintenance outages (either as part of a plant CRM program or a plant risk-managed technical specification program) (Ref. 25). In a preventive maintenance context, maintenance and surveillance activities are set on a fixed frequency, and they are intended to address identified degradation mechanisms before they can result in a failure. Such fixed frequency is determined based on past experience with similar components (e.g., by considering MTTF).
Over the past several decades, plants have been moving from a reliance on periodic to more comprehensive diagnostic and prognostic (i.e., predictive) strategies where the goal is to only perform intrusive maintenance operations when needed.Advanced monitoring and data analysis technologies are essential to support predictive strategies.This is where margin-based reliability approaches can be applied to support this type of "online" decision making where, based on current condition-based data, component health data are employed to assess component and system health (i.e., the focus is now shifted to the cause node of Fig. 14).
A margin-based approach focuses on estimating component health based on available monitoring data. Such an approach directly targets a maintenance approach where activities are performed only when needed (i.e., when component health is approaching an undesired status). As a final remark, note the following:
1. The margin value of a component reflects the status of a component given actual monitored data.
2. The temporal evolution of a margin value implicitly includes all nonlinear behaviors behind component degradation.
3. Sudden component performance degradation is mirrored by an equivalent step decrease of the corresponding margin value.

IX. EXAMPLE OF SYSTEM RELIABILITY CALCULATION
As an example, consider the very simple system shown in Fig. 15, which is composed of two redundant pumps and a motor-operated control valve. For each component, Table II shows the failure modes, examples of maintenance tasks to be performed, the related classification of the maintenance activity, and the data used to determine the margin associated with each component. Note that this is intended to be a simplified (textbook-type toy) example to demonstrate the approach and does not include all of the failure mechanisms or planned maintenance activities that would be applicable to the SSCs in this system. In an actual plant application, the systems and their representative models would be more complex; however, the basic principles remain applicable.
For this system, the following notations are used for the five BEs: The FT for the considered system is shown in Fig. 16 along with the numeric value of $\tilde{M}$ associated with each BE. The set of MCSs for this FT is {A, BD, BE, CD, CE}. By applying the rules from Sec. IV for the AND and OR gates using the data provided in Fig. 16, we obtain the results displayed in Table III. From these intermediate results, we calculated $\tilde{M}$ for the top event, $\tilde{M}(TE) = 0.5385$. Given the reliability importance definition in Sec. VII, a summary of the sensitivities for each BE is provided in Table IV. Basic events can be ranked by ordering them based on their $\mathrm{RIM}_{BE}$ value (from largest to smallest) as shown in the table.
The results presented in Table IV indicate that, at the present time, the pump bearing failure modes for both pumps (i.e., BEs B and D) are the ones that are most risk significant. It is important to recognize that these results reflect the current operational condition of the system components (note the different $\mathrm{RIM}_{BE}$ values for the bearing failure mode of the two pumps, which reflect the different $\tilde{M}$ values indicative of differences in the condition and performance of these two pumps at the time the measurements were made). This situation is different from what would normally be encountered in a plant reliability model (such as a PRA), where average failure values are used so that the reliability data for these identical components are generally equal. Also, note that some BEs are characterized by $S_{BE} = 0$; this occurs when the BE is dominated by another BE (in an OR gate) with a lower margin. In other words, improving the margin of a BE characterized by $S_{BE} = 0$ (e.g., BE = C: pump 1 motor rotor cage winding failure) does not provide any benefit to the overall system margin since a BE that provides a support function to the same component (e.g., BE = B: pump 1 bearings failure) has a lower margin. From a decision-making standpoint, maintenance activities related to plant SSCs can use their calculated $S_{BE}$ values as an input that provides a quantitative measure of their impact on system reliability given their current condition and performance.

X. CONCLUSIONS
In this paper we presented an alternative approach to reliability modeling by moving from a "probability of occurrence of an event" language to a "margin of occurrence of an event" language. The main goal is to better integrate ER data in order to assess system health. We still employ current reliability models (e.g., FTs), but we are deploying a different calculation engine based on metric spaces rather than on set theory (typically used in the probability of occurrence of an event language). The most important point of our proposed method is how ER data generated by a component under different maintenance approaches (i.e., CM, CBM, and PdM) can be used directly to measure the component margin and support more effective decision making by plant system engineers. Note that this method is not designed to be an alternative approach to current PRA methods, but instead a complementary approach designed for different kinds of decisions. We are, in fact, aiming to support decisions that target plant resources and asset management activities, such as work scheduling and project prioritization.

TABLE III
Margin Calculation for the MCSs Generated by the FT Shown in Fig. 16 (columns: MCS, Margin)

Fig. 1. Regulatory and system engineer definition of risk.

Fig. 4. Margin in a CBM context: evolution of SSC condition as a function of time and margin definition.

Fig. 5. Margin in a PdM context.

Fig. 7. Typical current signature analysis where sideband currents $I_{fsb}$ are centered around the frequency of the supply current (Ref. 26).

Fig. 8. Vibration data for different mass flow rates and margin representation given actual pump conditions (adapted from Ref. 23).

MANDELLI et al. • ON THE LANGUAGE OF RELIABILITY

where $P(A \mid B)$ indicates the conditional probability of the occurrence of A given that B has occurred.

2. the occurrence of either failure event when either $\tilde{M}_A = 0$ (representative of component A failure) or $\tilde{M}_B = 0$ (representative of component B failure), which corresponds to a location on the axis of whichever component has failed.

The function $\mathrm{dist}[X, Y]$ is designed to calculate the distance between two points X and Y. As for any other n-dimensional continuous space, several distance metrics can be chosen, as indicated in Table I, where the two most common metrics are shown. Note that in a setting where margin values decrease as a function of time, there is an infinite number of ways to move from the actual conditions of components A and B [i.e., the point $(\tilde{M}_A, \tilde{M}_B)$] to the point (0, 0). However, the upper bound of such a distance is represented by the Manhattan metric, $\tilde{M}_A + \tilde{M}_B$ (where we note that there is no need to apply the absolute value convention as both $\tilde{M}_A$ and $\tilde{M}_B$ are positive by definition). On the other hand, the Euclidean metric represents the actual lower bound, $\mathrm{dist}[(\tilde{M}_A, \tilde{M}_B), (0, 0)] = \sqrt{\tilde{M}_A^2 + \tilde{M}_B^2}$.

Fig. 10. Graphical representation of event occurrences based on a margin framework.

Fig. 11. Graphical representation of the margin calculation for $\tilde{M}(A \text{ AND } B)$ when considering the temporal evolution of $\tilde{M}_A$ and $\tilde{M}_B$.

Fig. 12. Plot of the historic evolution of $\tilde{M}_A$ and $\tilde{M}_B$ and their predicted evolution based on the derivative information.

Fig. 14. Comparison between margin-based and probability of failure-based reliability modeling approaches.

1. A: valve failure due to stress corrosion cracking
2. B, D: pump 1 and 2 failure, respectively, due to bearings failure
3. C, E: pump 1 and 2 failure, respectively, due to rotor cage winding failure.

Fig. 15. Simplified reliability model for a two-pump system with a common flow control valve on the discharge line.

Fig. 16. Reliability data expressed in terms of the margin and FT model for the system shown in Fig. 15.

TABLE II
ER Activities and Data for the System Shown in Fig. 15

NUCLEAR TECHNOLOGY • VOLUME 209 • NOVEMBER 2023

TABLE IV
$S_{BE}$ for the Five BEs Shown in Fig. 16