Uncertainty Quantification in Reactor Safety: Time Dependency, Available Information, and Measurability

Uncertainty Quantification (UQ) in the analysis of reactor safety is challenging. The power of UQ derives from its grounding in probability theory.1 But certain important safety-related events are not probability-measurable, which is problematic for risk-analytic methodologies that rely on UQ computations. In this note we identify why the dynamics of available engineering information play an essential role in governing the fidelity of uncertainty quantification, and why uncertainty quantification in reactor safety is limited by the un-measurability of certain critical events. We provide a historical example that gives practical context for our observations. Finally, we discuss the implications of measurability for regulatory decision-making under recent nuclear industry legislation that advocates increased use of risk-informed, performance-based regulation for advanced reactor licensing.

Keywords— Uncertainty Quantification, FMEA, Failure Modes, Stopping Times, Filtration

1 We hold that probability theory is the only viable theory of uncertainty.


Introduction
Engineered designs are created with an intent to avoid failures, or at least to understand the uncertainty surrounding when and what failures may occur. This is especially true when protections are designed for hazardous systems (e.g., nuclear reactors). Engineers are well aware that any device or system can break down unexpectedly due to previously undiscovered failure modes. Failure modes that are first discovered under circumstances that, if not immediately addressed, might escalate to catastrophe are especially troubling. Extensive records are kept on such "near miss" incidents, for example in the Nuclear Regulatory Commission (NRC) Licensee Event Report (LER) database, which documents many such failures found in many different designs. Following near miss incidents, root cause analysis is used to isolate the immediate cause of the breakdown. Engineers proficient in root cause analysis can determine whether a breakdown arose from something attributable to physics, improper operation, or improper maintenance. Knowing that unanticipated failure modes will be experienced, safety margin and defense in depth are the primary protections against dangers that may be present. Engineers use logic structures, such as Failure Mode and Effects Analysis (FMEA), HazOp, fault trees, and event trees, to help study scenarios that have little or no backup and may require additional consideration of safety margin or defense in depth. If a single point of failure cannot be avoided, they are reluctant to accept the design unless the device has been tested thoroughly and includes substantial safety margin. Regular inspections can help ensure the design retains safety margin over the device's service life.
We use the terminology unanticipated failure modes to indicate failure modes that can only be discovered at a future time. The potential for their appearance in service is well understood by engineers. Prior to the release of potentially hazardous processes or potentially hazardous products, good engineering practice dictates a cautious approach with a gradual ramp up to full production.
Effective maintenance organizations address unexpected failures in critical equipment by aggressively pursuing root cause determination followed by prompt corrective action. Although quantitative risk assessments work with probabilities of functional failures, the functional failures they work with are collections of one or more failure modes, including any unexpected ones. The significance of unanticipated failure modes is that they cannot be included in quantitative assessments such as Probabilistic Risk Assessment (PRA).
The 2019 Nuclear Energy Innovation and Modernization Act (NEIMA) asks for, in part, "risk-informed" regulation by the NRC. While the NRC currently uses risk-informed decision-making to guide certain policy decisions as well as to prioritize inspection and enforcement, it defines risk-informed to include prudent engineering practices such as defense in depth and safety margin. Reliance on such engineering practices is captured in the NRC definition of risk-informed decision-making (NRC, 2013): "A 'risk-informed' approach to regulatory decision-making represents a philosophy whereby risk insights are considered together with other factors to establish requirements that better focus licensee and regulatory attention on design and operational issues commensurate with their importance to health and safety. A 'risk-informed' approach enhances the traditional approach by: (a) allowing explicit consideration of a broader set of potential challenges to safety, (b) providing a logical means for prioritizing these challenges based on risk significance, operating experience, and/or engineering judgment, (c) facilitating consideration of a broader set of resources to defend against these challenges, (d) explicitly identifying and quantifying sources of uncertainty in the analysis, and (e) leading to better decision-making by providing a means to test the sensitivity of the results to key assumptions. Where appropriate, a risk-informed regulatory approach can also be used to reduce unnecessary conservatism in deterministic approaches, or can be used to identify areas with insufficient conservatism and provide the bases for additional requirements or regulatory actions." The un-measurability of the arrival times of undiscovered failure modes reinforces the need for defense in depth and safety margin, to help ensure that the risk not captured by uncertainty quantification is taken into account in protective system regulation.
2 We find a relatively large literature that would inform investigators on the nature of predictive modeling under uncertainty induced by random variables and by probabilities that are themselves random variables. A few of the authoritative studies and texts that we have reviewed over time are .
In the following, we briefly review a relevant, non-consequential, example of an unexpected failure mode from an operating nuclear power plant and then provide the analytical formalism clarifying why catastrophes arising from unanticipated failure modes defy uncertainty quantification. We conclude with a brief discussion asserting that over-reliance on UQ should be avoided in risk analyses of safety-critical protections.

An example
On December 18, 1995, the South Texas Project (STP) Unit 1 commercial nuclear reactor experienced an unexpected reactor trip due to a main and auxiliary transformer lockout while operating at 100% power, as described in Head (1996). The transformer lockouts triggered a series of protective system actuations:
1. The generator output breaker tripped open,
2. The main turbine tripped, causing a reactor trip, and
3. All four 13.8 kV auxiliary bus breakers opened, resulting in a loss of offsite power to the A Train Engineered Safeguards Features Bus.
The operators entered the emergency operating procedure for reactor/turbine trip, which required verification that the reactor was subcritical and the control rods fully inserted. However, three control rods, designated F10, C9, and N7 (3 out of a total of 57 control rods required to fully insert), indicated that they had instead stopped about 6 steps above the bottom of the core rather than fully inserting as required.3

3 All the control rods that had stopped prior to complete insertion were operating in a new fuel design.
Proper operation of the reactor control rods is verified prior to full power operation after the reactor core is replaced during a refueling outage. In protective operation, the rods fall due to the force of gravity in two basic stages: a free fall stage followed by a braking stage. They are tested to verify that they fall within a prescribed length of time, from the time of release to the time they start braking. This test was performed successfully, as normally done for the new core installed after the refueling outage, prior to the event on December 18.
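For intuition on the timed drop test, here is a minimal back-of-the-envelope sketch of the free-fall stage, assuming idealized constant-gravity fall and neglecting hydraulic drag; the 3.6 m span is a hypothetical illustration, not STP plant data:

```python
import math

def free_fall_time(drop_m, g=9.81):
    """Idealized time to free-fall through drop_m metres, neglecting
    hydraulic drag: t = sqrt(2 d / g)."""
    return math.sqrt(2.0 * drop_m / g)

# Hypothetical 3.6 m free-fall span before the braking (dashpot) section
# begins; illustrative only, not plant data:
t_free = free_fall_time(3.6)   # roughly 0.86 s
```

In practice the measured time-to-braking also folds in release latency and drag, which is why the acceptance criterion is an empirically prescribed time window rather than this idealized value.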

Root cause
Other similar events began occurring at other reactor plants after the STP December 18 event (Crutchfield, 1996). Root cause was investigated at the STP over the next few days, and it was concluded that the fuel assemblies, which had been exposed to radiation in at least one fuel cycle, were becoming susceptible to buckling, as partially described by Kee (1996). The root cause was found by developing a physical process theory model that was fit to data. Investigators developed a new understanding of certain aspects of the control rod fluid shear and of a fuel assembly annealing process, similar to a spring and viscous damper with hysteresis in series, to explain the physics (Kee et al., 2005; Kee and Björnkvist, 1997). Various unrelated control rod insertion events have been observed and their root causes isolated, and the knowledge base for these kinds of events continues to grow, for example (Lagiewsk, 1982; Jordan, 1986).

Corrective and compensatory actions
Because the root cause investigation revealed that the control rods would insert to a point where the incremental reactivity control was very small, the consequence analysis showed that, due to the safety margin included in their design, the control rods would continue to meet their reactivity control design requirements with continued margin to safety. To validate the root cause and consequence analysis, the STP proposed a compensatory testing plan to the NRC, which was accepted and implemented. To fully address the root cause and consequence analysis, the STP suggested a modification to future fuel assembly design purchases, which was implemented by the vendor.

Observations
As described by Head (1996), the reactor trip revealed the existence of a new failure mode in the fuel assemblies, which was itself triggered by a previously unknown failure mode: a pinched pilot wire induced by maintenance activity. The timing of the pilot wire failure revealed the fuel assembly failure mode at an unexpected time during full power operation. The annealing of in-vessel components is a well-known process, and aspects of it were always included in, for example, fuel assembly and Boiling Water Reactor (BWR) channel box designs. However, the process of annealing on the approach to Euler buckling, discovered at the Ringhals plant, was not known to be a problem in Pressurized Water Reactor (PWR) fuel assemblies. The fluid shear process in the braking section of the fuel assembly had never been investigated, at least in the academic literature, until STP made empirical observations and Kee et al. derived the necessary equations, now published in (Kee et al., 2005, Section 4.2.1).
It can be said that the safety margin included in the as-designed reactivity requirements of the control rods prevented a return to critical with the control rods not fully inserted. This observation helps support the need for such prescriptive regulations as those included in Title 10 CFR Part 50 that require safety margins and defense in depth for safety-critical functions. Root cause analysis wisely acknowledges the possible existence of unanticipated failure modes with plans to learn from their discovery.

Protection Availability
Our observations about uncertainty quantification for protections are made with respect to a filtered probability space (Ω, F, {F_t}_{t≥0}, P), where Ω is the set of possible outcomes for a protective system over its lifecycle, F is a σ-algebra on Ω capturing all events of predictive modeling interest, and P is a probability measure on the measurable space (Ω, F). The filtration {F_t}_{t≥0} is the analytical construct used to capture the flow of engineering information over time. {F_t}_{t≥0} contains all P-measurable collections of outcomes in Ω known to modelers by time t ≥ 0. This filtration has the following properties for all s, t ≥ 0:
1. F_0 contains all P-null sets (completeness),
2. F_t = ∩_{u>t} F_u (right-continuity),
3. F_s ⊆ F_t whenever s ≤ t (monotonicity),
4. F = σ(∪_{t≥0} F_t) (convergence only in the limit).
These properties are normative and intuitively understandable. Property 2 indicates that engineering information is not necessarily pre-visible (e.g., it is not possible to know that a failure will occur the instant before it actually occurs). Property 3 asserts that information gained through discovery is not lost over time. Properties 3 and 4 acknowledge that one cannot be convinced that all useful modeling information will be revealed in finite time.5 Consider, now, the random variable X_t : Ω → {0, 1} indicating the state of system protections, where

X_t = 1 if protections are available at time t, and X_t = 0 otherwise.
We call (X_t)_{t≥0} the protection availability process, and we normatively take its trajectories to be right-continuous.6 We also take (X_t)_{t≥0} to be adapted to the filtration {F_t}_{t≥0}. That is to say, at any time t ≥ 0, X_t is F_t-measurable, and we have σ((X_s)_{0≤s≤t}) ⊆ F_t.7 Of course, at any time t ≥ 0, there is far more engineering information in F_t than simply the partial protection availability trajectories that comprise σ((X_s)_{0≤s≤t}). Information related to protection design, maintenance records, historical environmental conditions, etc., also appears as events in the filtration {F_t}_{t≥0}.

5 Even though it is reasonable to assert by definition that any consequential failure mode will be discovered in finite time, there is no way to recognize that all consequential failure modes will have been discovered by any historically observable time.

6 Because (X_t)_{t≥0} is right-continuous with left limits, it belongs to the class of càdlàg processes, ensuring that it has a left-continuous version with respect to the probability measure P.

7 In other words, the "natural history" of the protection availability process is contained within the filtration capturing the flow of engineering information.
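The protection availability process can be pictured with a small simulation. The sketch below is a modeling assumption rather than part of the formalism: it draws right-continuous trajectories of X_t from an alternating failure/repair model with exponential dwell times, and the function names and mtbf/mttr values are all hypothetical:

```python
import random

def simulate_availability(horizon, mtbf, mttr, rng):
    """One right-continuous trajectory of X_t, stored as (jump time, new
    state) pairs, starting available (X_0 = 1).  Exponential failure and
    repair dwell times are an illustrative modeling assumption."""
    t, state = 0.0, 1
    path = [(0.0, 1)]
    while True:
        t += rng.expovariate(1.0 / (mtbf if state == 1 else mttr))
        if t >= horizon:
            return path
        state = 1 - state
        path.append((t, state))

def state_at(path, t):
    """X_t read off the jump path: right-continuity means each new state
    holds from its jump time onward."""
    x = path[0][1]
    for s, v in path:
        if s <= t:
            x = v
        else:
            break
    return x

rng = random.Random(0)
paths = [simulate_availability(1000.0, 100.0, 5.0, rng) for _ in range(2000)]
unavail = sum(state_at(p, 500.0) == 0 for p in paths) / len(paths)
# Point-in-time unavailability settles near the alternating-renewal
# steady state mttr / (mtbf + mttr), about 0.048 for these parameters.
```

Note that the simulation presumes every failure mode of the model is known in advance; it is exactly this presumption that the remainder of the section examines.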
For the purposes of predictive modeling, we must be careful to stipulate the status of engineering information. Note that the likelihood that system protections are failed at time t is given by

P(X_t = 0) = E[1 − X_t].    (1)

F_0 is assumed to be complete and contains all information characterizing the protection design space, with t = 0 taken as the time of deployment. So, all events associated with failure modes identified in an original design FMEA are included in F_0.
Equation (1) reveals that quantifying the uncertainty about the operational integrity of protections at time t requires information that is not necessarily available. To see this, examine the set difference A_t ≡ F \ F_t. Intuitively, A_t contains all as yet undiscovered engineering information at time t. It follows from eq. (1) that P(X_t = 0) can be quantified with respect to the probability measure P if and only if P(A_t) = 0. That is, we must be sure that there is no remaining undiscovered information that would be valuable in predicting the unavailability of protections. P(A_t) = 0 is guaranteed only in the limit, as stipulated in Property 4, and a numerical value for P(A_t) is available only to a clairvoyant. Hence, engineering predictive modeling must be content with quantifying uncertainty only up to the currently available engineering information, or

P(X_t = 0 | F_t) = E[1 − X_t | F_t].    (2)

It is important to appreciate that system protections are useful only if they are available exactly when needed. Suppose that T is the time of arrival of some initiating event that can possibly lead to catastrophe. Here, T : (Ω, F) → (R_+, B(R_+)). Catastrophe can occur only if X_T = 0.
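The gap between eq. (1) and eq. (2) can be given a small Monte Carlo illustration: a system with one failure mode known at deployment and one not yet discovered. The rates below are hypothetical; the point is that an estimate conditioned only on the information in F_0 omits the undiscovered mode and is biased low:

```python
import math
import random

rng = random.Random(1)
N = 100_000
t = 1.0
# Hypothetical rates: one mode known at deployment, one not yet discovered.
rate_known, rate_hidden = 0.10, 0.05

# "True" probability that protections are failed by time t (no repair,
# for simplicity): either mode fires before t.
true_fail = sum(
    1 for _ in range(N)
    if min(rng.expovariate(rate_known), rng.expovariate(rate_hidden)) <= t
) / N

# The modeler's estimate, conditioned on deployment-time knowledge, sees
# only the known mode:
model_fail = 1.0 - math.exp(-rate_known * t)

# Analytically: true P = 1 - exp(-0.15) ≈ 0.139, model = 1 - exp(-0.10)
# ≈ 0.095.  Conditioning on incomplete information biases the estimate low.
```

The size of the bias depends entirely on the undiscovered rate, which is precisely the quantity no amount of F_t-measurable data can reveal.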
Remark 1. Simply characterizing the state of protections with respect to the flow of engineering information over time is generally inadequate to inform the likelihood of successful protection; protections must hold at the random times when initiating events occur. In practical circumstances, P(X_T = 0) ≠ P(X_t = 0) when T is the arrival time of an initiating event.
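Remark 1 can be illustrated with a hypothetical two-regime calculation in which a common cause (say, severe weather) raises both the initiating event rate and the protection unavailability; every number below is an illustrative assumption:

```python
# Hypothetical two-regime numbers: a common cause (severe weather) raises
# both the initiating-event rate and protection unavailability.
p_storm_time = 0.10                        # fraction of calendar time stressed
p_unavail = {"calm": 0.01, "storm": 0.20}  # P(X = 0 | regime), assumed
p_storm_at_T = 0.80                        # fraction of initiating events in storms

# Unavailability at a fixed (uniformly sampled) time t:
p_xt = (1 - p_storm_time) * p_unavail["calm"] + p_storm_time * p_unavail["storm"]

# Unavailability at the arrival time T of an initiating event:
p_xT = (1 - p_storm_at_T) * p_unavail["calm"] + p_storm_at_T * p_unavail["storm"]

# p_xt = 0.029 while p_xT = 0.162: P(X_T = 0) ≠ P(X_t = 0).
```

Because the conditions producing demand on the protections also stress the protections, the fixed-time unavailability understates unavailability at the moment of demand by more than a factor of five in this toy example.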

Initiating event arrivals that are stopping times.
When the events {T ≤ t} are included in F_t for all t ≥ 0, T is said to be an {F_t}_{t≥0} stopping time. It is well known that any {F_t}_{t≥0} stopping time T must also be F_T-measurable, where by definition

F_T = {A ∈ F : A ∩ {T ≤ t} ∈ F_t for all t ≥ 0}.

It follows directly that F_T ⊆ F. Thus, P(X_T = 0 | F_t) = E[1 − X_T | F_t] is well defined.
Clearly, stopping times play an important role in UQ supporting risk analysis and regulatory oversight. It should be appreciated that, generally, F_T contains engineering information that is not necessarily included in the protection hardware design/maintenance space (e.g., information derived from prior understandings of weather patterns, political unrest, economic activity, etc.). Under the circumstance that the arrival time of an initiating event is an {F_t}_{t≥0} stopping time, we know that P(X_T = 0 | F_t) is well defined. However, being well defined does not imply that numerical values for P(X_T = 0 | F_t) are attainable (Wortman et al., 2021).
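The stopping time case can be sketched by simulation, under assumed models: T is the first time an observed cumulative-demand walk crosses a threshold (so {T ≤ t} depends only on recorded history), and protections degrade with a hypothetical per-step failure probability. Since X_T is then measurable with respect to the available information, P(X_T = 0) can be estimated from that same information:

```python
import random

def first_passage(rng, threshold=5.0):
    """T = first step n at which an observed cumulative-demand walk exceeds
    `threshold`.  {T <= n} depends only on the walk's history up to n, so T
    is a stopping time of the observed filtration."""
    level, n = 0.0, 0
    while level <= threshold:
        level += rng.expovariate(1.0)  # observed, recorded increments
        n += 1
    return n

rng = random.Random(2)
q = 0.01          # hypothetical per-step protection failure probability
N = 50_000
down_at_T = 0
for _ in range(N):
    T = first_passage(rng)
    # With no repair, protections are down at T if any of the T steps failed:
    if rng.random() < 1.0 - (1.0 - q) ** T:
        down_at_T += 1
p_xT0 = down_at_T / N
# Because T is decided by observed history, this estimate of P(X_T = 0)
# is computable from available information (about 0.06 here).
```

The contrast with the next subsection is the point: here every ingredient of the estimate is adapted to the observed history, which is exactly what fails when the initiating event is a heretofore undiscovered failure mode.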

Initiating event arrivals that are not stopping times.
There are a host of practical reasons why the arrival time of an initiating event might not be measurable with respect to the filtration {F_t}_{t≥0}. In particular, if it happens that the initiating event reveals a previously undiscovered system protection failure mode, a catastrophe can ensue. Since the failure mode is heretofore unknown, its arrival time T is not F_t-measurable for any t ≥ 0. This is to say that E[1 − X_T | F_t] is not well defined, which implies that neither P(X_T = 0 | F_T) nor P(X_T = 0 | F_t) can be quantified. Thus, undiscovered protection failure modes are problematic for any regulatory oversight strategy that relies completely on UQ.
It is reasonable, at this juncture, to consider the extent to which protection failure modes not identified in F_0 are problematic. We begin with some observations that are normatively justified in the engineering pedagogy:
• The discovery time T of a specific heretofore undiscovered failure mode is random and measurable with respect to the probability space (Ω, F, P).
• T is not an {F_t}_{t≥0} stopping time.
• It is impossible to quantify the probability of events that depend on T.
• There exists some finite time τ < ∞ such that P(T ≤ τ) = 1. That is, any consequential failure mode will almost surely be found over the lifecycle of protections, else it cannot be consequential.
• It is possible that a failure mode is first discovered through catastrophe postmortem analysis (i.e., the failure mode caused catastrophe upon its first appearance).
• For any time t ≥ 0 a non-clairvoyant cannot rule out the possibility of remaining undiscovered protection failure modes.
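The observations above can be given quantitative texture with a simple worked calculation under assumed numbers: a latent failure mode with a hypothetical rate of 0.002 per plant-year can easily remain undiscovered across substantial fleet experience, and until its first occurrence it contributes nothing to any F_t-measurable estimate:

```python
import math

lam = 0.002        # hypothetical true rate of a latent failure mode, per plant-year
exposure = 100 * 5 # assumed fleet experience: 100 plants over 5 years

# Probability the mode has never occurred, i.e. remains undiscovered
# (T > t), after this much experience (Poisson occurrences at rate lam):
p_undiscovered = math.exp(-lam * exposure)
# exp(-1), about 0.368: even 500 plant-years leaves a ~37% chance the mode
# has never appeared, and until it does it is invisible to any F_t-based
# estimate.
```

Note the asymmetry: the calculation is only possible for a clairvoyant who knows lam; the non-clairvoyant modeler, working from F_t, sees zero occurrences and has no basis for assigning any rate at all.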
If undiscovered consequential failure modes are in play, the filtration {F_t}_{t≥0} cannot yet have converged. From the perspective of predictive modeling, it is understood that A_t must contain the undiscovered failure mode information, and thus P(A_t | F) > 0. But, clearly, A_t ∉ F_t, and thus P(A_t | F_t) is not well defined; that is, a non-clairvoyant is unable to quantify the inaccessible event probabilities related to undiscovered failure modes. However, if a modeler is sufficiently confident in a non-clairvoyant belief that all consequential failure modes have been discovered, then all information impacting protection design will be found in the tail σ-algebra T, where

T ≡ ∩_{t≥0} σ((X_s)_{s>t}).
By definition, T ⊂ F, and T characterizes design information in the remote future of (X_t)_{t≥0}. In the remote future of protections, there can be no undiscovered consequential failure modes.
Typically, UQ methodologies (e.g., PRA, Quantitative Risk Analysis (QRA), Probabilistic Safety Analysis (PSA)) implicitly rely on the assumption that events associated with protections are T-measurable, because this assumption guarantees the existence of the tail-measurable limits X̄ ≡ lim sup_{t→∞} X_t and X̲ ≡ lim inf_{t→∞} X_t. Expected values of X̄ and X̲ and their corresponding statistical estimators play essential roles in UQ. Informally, when time t = 0 is taken to be in the remote future of the protection availability process (X_t)_{t≥0}, then T ⊂ F_t, X̄ and X̲ each become F_t-measurable, and numerical estimates of their respective expected values can (in principle) be computed using historical data collected up through time t. Importantly, modelers should appreciate that enabling computation of UQ statistics nearly always requires assuming that (X_t)_{t≥0} is T-measurable. And assuming that (X_t)_{t≥0} is T-measurable implicitly ignores the possibility of undiscovered consequential failure modes that might lead to catastrophe. This leaves open the engineering question, "How much time must elapse before one might reasonably trust that all consequential protection failure modes have been discovered?" Mathematics dictates that only a clairvoyant can answer this question with complete confidence. Of course, no one is clairvoyant, and undiscovered protection failure modes present especially difficult problems for engineers. Hence, because protection failure mode discovery times defy UQ, care must be exercised when applying UQ methodologies to studies of protection efficacy.
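The role of the remote-future assumption can be sketched numerically: treating t = 0 as the remote future amounts to assuming the long-run availability has converged. The simulation below, a hypothetical alternating failure/repair model with assumed mtbf/mttr values, estimates that long-run average; renewal-reward theory gives mtbf/(mtbf + mttr), about 0.952, for these assumed parameters:

```python
import random

def availability_fraction(rng, horizon, mtbf, mttr):
    """Fraction of [0, horizon] a protection is available under an
    alternating exponential failure/repair model (hypothetical)."""
    t, state, up_time = 0.0, 1, 0.0
    while t < horizon:
        dwell = rng.expovariate(1.0 / (mtbf if state == 1 else mttr))
        dwell = min(dwell, horizon - t)
        if state == 1:
            up_time += dwell
        t += dwell
        state = 1 - state
    return up_time / horizon

rng = random.Random(4)
a_hat = availability_fraction(rng, horizon=1e5, mtbf=100.0, mttr=5.0)
# Long-run availability mtbf / (mtbf + mttr), about 0.952.  Taking t = 0
# "in the remote future" assumes this limit has been reached and that no
# undiscovered failure mode will later shift it.
```

The estimator is trustworthy only under the stationarity it assumes; the discovery of a new consequential failure mode changes the process and invalidates the historical average, which is the mathematical content of the clairvoyance objection.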

Discussion
Unknown-unknowns, including undiscovered protection failure modes, are nothing new to engineering design. Engineering practices such as stress testing, accelerated life testing, burn-in, and advanced physics-based simulations have been developed and refined with the specific objective of discovering failure mechanisms. Additionally, engineers routinely adopt non-probabilistic strategies, including defense in depth and layers of protection, for mitigating the consequences of as yet undiscovered failure modes. Quantitative risk methodologies have been increasing in popularity in many areas of cost-benefit decision-making and regulatory oversight. Further, elaborate databases are constructed and maintained so that discoveries can be shared across various designs. In particular, the NRC has developed and implemented data recording and sharing protocols specifically intended to capture failure discoveries and alert the entire industry.

Nonetheless, quantitative risk analysis unconditionally produces optimistic estimates of predicted protection performance, an optimism that cannot be overcome. An assessment that estimates forward cost as the product of initiating event frequency, protection failure probability, and consequence will, of course, underestimate the cost. Engineers tasked with developing processes and designs that may pose a hazard as a result of a failure in use routinely compensate for unmeasurable uncertainty by adding defense in depth and safety margin in areas where either experience informs them of the level of uncertainty or they lack knowledge. Although the example we show is from nuclear power, our observations apply equally to other hazardous industrial sectors such as chemical processing, oil and gas exploration and production, and public and freight transportation. They also apply to repairable as well as unrepairable systems, although understanding the behaviors is nuanced.