Optimal Maintenance Policy for a Technical System Subject to Hidden Faults and Randomly Occurring Hazards

The paper presents a method of finding the optimal time between inspections for a system subject to degradation-related faults that make it vulnerable to randomly occurring external hazards, which may damage it. Since faults are assumed to be hidden, periodic inspections and repairs have to be performed in order to detect and remove them; leaving the faulty system unmaintained would eventually lead to very costly damage. It is also assumed that the time to occurrence of a fault is exponentially distributed and that hazardous events constitute a Poisson process. The fault rate, the intensity of the Poisson process, and the probability with which a hazardous event results in system damage are the known parameters. The author presents two main results achieved by analyzing this maintenance model. First, the criteria that the system parameters must fulfill for preventive maintenance to be cost-effective are given in the form of simple inequalities. These criteria must be met for operating the system with preventive maintenance in place to be less costly than operating it until damage occurs and replacing it thereafter. Second, fairly simple equations are obtained from which the optimal time between inspections can be found numerically by the Newton-Raphson method. The analytical derivation of both the criteria and the equations is presented in detail and is the author's original work. To the best of his knowledge, the obtained results are new in the area of maintenance modeling and analysis. For better understanding, theoretical considerations are illustrated by an example of a generic explosion prevention system.

Keywords: System fault, System damage, Damage-triggering factor, Long-run operating cost, Periodic maintenance, Optimal time between inspections.


Introduction with the Literature Review
This paper analyses a maintenance model of a technical system subject to degradation-related non-self-revealing faults and random occurrences of a hazardous event that, if the system is faulty, may trigger a catastrophic incident resulting in system damage. For example, an occasional spark from the electric installation in the vicinity of an LPG tank (a randomly occurring hazardous event) may set off an explosion (a catastrophic incident) if there is a leak in the tank (a hidden system fault). In order to reduce the chance of a catastrophe, the system should undergo periodic inspections and, whenever a fault is detected, repairs or renewals should be carried out. A natural problem that arises here is how often the inspections should take place so that the system operating cost per unit time is minimal. Clearly, in order to cope with this problem, we need certain data about the system, i.e. the average frequency of hazardous events, the distribution function of the system's TTF (time-to-fault), the costs of inspection and preventive maintenance, the probability that a hazardous event causes damage to the faulty system, and the cost of such damage. For the sake of analytical tractability it is assumed that TTF is exponentially distributed with the parameter μ and damage-triggering events constitute a homogeneous Poisson process with the intensity λ.
The system's operation proceeds as follows. An inspection is planned to take place after time T from the last inspection or from the start of operation of a new or replaced system. If a fault is detected during an inspection, the system undergoes repair and is put back into operation. However, it can happen that an occurrence of a fault is followed by a hazardous event that precedes the next inspection and causes the system damage. Such a situation results in a very high damage cost, significantly higher than that of an inspection or repair. Each system damage is self-revealing, unlike a system fault, and is followed by replacement of the damaged system. The total operating cost involves inspection, repair, damage, and replacement costs. The research objective is to find the optimum T which minimizes the long run total operating cost per unit time. It turns out that such a T exists provided that the system parameters fulfill certain criteria formulated as simple inequalities, and it can be found from analytically obtained equations. The derivations of both the criteria and the equations for T are the author's main results constituting the main body of the paper.
The problem considered in this paper can be classified as that of optimizing an inspection and maintenance policy for a repairable technical system, one of the main issues in reliability engineering, which has long been a subject of extensive research. A comprehensive overview of this topic, along with a wide literature survey, can be found in the following books: Gertsbakh (2000), Zequiera and Berenguer (2005), Jardine and Tsang (2006), and Duffuaa and Raouf (2015). Also, Vasili et al. (2011), Wang (2012, 2013), and Zhao et al. (2017) give a broad insight into many aspects of inspection and maintenance scheduling for various types of systems and policies. For some recent results in this area see Mendes et al. (2017), Badía et al. (2018), Sun et al. (2018), and Peng et al. (2019).
There exist numerous inspection/maintenance policies depending on the adopted models of the considered systems. They can be divided into two main categories: policies applied to single-unit systems, and those applied to multi-unit systems with a particular reliability structure. Systems in the first category are often assumed to have a set (discrete or continuous) of degradation levels, and the undertaken maintenance action is degradation-dependent. See Abdel-Hameed (1987, 1995), Nakagawa et al. (2010), Le and Tan (2013), Wang (2013), Guo et al. (2015), and Alaswad and Xiang (2017) for different models of one-unit systems and reviews of the respective literature. Comprehensive surveys of policies for multi-unit systems can be found in Nicolai and Dekker (2008), and Cao et al. (2018). As regards types of maintenance, Azadeh and Zadeh (2015) lay out the taxonomy of various maintenance policies, according to which the policy investigated in the current paper can be classified as age-based preventive maintenance.
The maintenance model proposed in this paper comprises the features of both a safety system and a delay-time model. In general terms, a safety system is a subsystem of a larger system whose task is to safeguard the latter against malfunction or damage. In our model, a system fault can be regarded as the safety system failure which, in the case of a hazardous event, can be detrimental to the main system. The issue of safety systems maintenance is thoroughly outlined in Pascual et al. (2011), along with the analysis of a versatile model of such a system developed by the authors. In turn, the delay-time maintenance model is a concept introduced in Christer and Waller (1984). Delay-time is a (random) time between the first symptoms of the system degradation and the system failure. If an inspection performed during the delay-time reveals a degraded state, then the appropriate maintenance action is taken in order to avoid the impending failure. However, in most delay-time models the delay-time distribution is a property of the system itself, whereas in this paper the time from the occurrence of a system fault to the next triggering event depends on a factor external to and independent of the system. Besides, the system damage does not occur as a single event, but as two combined ones (a system fault and a triggering event). Nevertheless, the problem considered in the current paper can be defined in terms of a delay-time model, where the delay time elapses from an occurrence of a system fault to the next triggering event causing the system damage. As shown further in the paper, this delay time has exponential distribution with parameter pλ. The issue of delay-time modeling is comprehensively treated in Werbinska-Wojciechowska (2019), a recently published monograph. For surveys of recent and older results concerning this topic, refer to Wang (2012), and Cha and Finkelstein (2019).
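The exponential form of this delay time follows from thinning the Poisson process of triggering events. A short derivation is sketched below, written with λ for the intensity of triggering events, p for the probability that an event damages a faulty system, and D as a hypothetical name for the delay time (D is not the paper's notation). By the lack-of-memory property the clock can be started at the moment of the fault.

```latex
\begin{align*}
\Pr(D > t)
&= \sum_{k=0}^{\infty} \Pr(\text{$k$ events in } (0,t])\,(1-p)^k
 = \sum_{k=0}^{\infty} e^{-\lambda t}\,\frac{(\lambda t)^k}{k!}\,(1-p)^k \\
&= e^{-\lambda t}\, e^{(1-p)\lambda t} = e^{-p\lambda t}.
\end{align*}
```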
Problems similar to the one studied here have been investigated by several authors, e.g. Lienhardt et al. (2008), Huynh et al. (2011), and Badía et al. (2018). The first of these papers addresses the issue of choosing an appropriate maintenance policy to detect hidden failures of warning devices or backup components that do not interrupt normal aircraft operation. In the second one the optimal time between inspections is found numerically for a system with hidden failures caused by random shocks and gradual degradation modeled by a gamma process. The third paper investigates a system with two types of failures: minor (followed by repairs) and catastrophic (followed by replacements), where repairs are performed according to a General Polya process. The assumptions adopted in the above cited papers, although encompassing a wide spectrum of technical systems, do not allow for an analytical solution of the optimal inspection policy problem. The possibility of an analytical approach to this problem is the main advantage of the inspection/maintenance model proposed in this paper. To the best of its author's knowledge, the solution presented here cannot be found in the relevant literature, hence it is a new development in the area of maintenance modeling and optimization.
It should be noted that the proposed model must not be confused with age replacement models, where a system is replaced (or renewed) when it reaches age T or upon failure, whichever occurs first. For such models, preventive renewals make sense if the distribution of the system's time-to-failure (TTF) has an increasing failure rate (IFR). If TTF is exponentially distributed (constant failure rate), then, due to the lack-of-memory property, the only plausible maintenance policy is to replace the system upon its failure.
For better understanding, theoretical considerations will be illustrated by an example of a generic explosion prevention system (EPS) which, in order to perform its protective function reliably, has to undergo periodic inspection and maintenance. An explosion can be caused by the presence of combustible gas, dust or vapor, and triggered by an electric spark. The main functions of an EPS are suppressing an explosion and providing proper venting. Guidelines for installing and operating EPSs are given in the standard NFPA 69 (2019), published by the National Fire Protection Association.

Detailed Assumptions and Notation
Main assumptions:
(i) A fault occurs after time S from putting the system into operation.
(ii) S is exponentially distributed with parameter μ.
(iii) Damage can only happen to a faulty system and is caused by a triggering event.
(iv) A triggering event causes damage with probability p.
(v) Triggering events occur according to a Poisson process with rate λ.
(vi) Occurrences of system faults are independent of triggering events.
(vii) System faults are not self-revealing, contrary to damages which manifest themselves immediately.
(viii) In order to decrease the possibility of system damage, periodic inspections are carried out for the purpose of detecting and removing system faults.
(ix) An inspection is planned after time T from each moment when the system is put into operation.
(x) If a system fault is detected during an inspection, the fault is removed and the system is again put into operation.
(xi) A damaged system is replaced by a new one.
It is assumed that the time unit for λ, μ and T is one month, i.e. λ is the average number of triggering events per month and 1/μ is the expected value of S in months.
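The assumptions above translate directly into a simulation of successive operation cycles. The following is a minimal sketch and not part of the paper: the function name, argument names, number of cycles and seed are hypothetical choices. Here `lam` is the triggering-event rate, `mu` is the fault rate, and `p` is the probability that a triggering event damages a faulty system. The sketch relies on the delay from a fault to the damaging event being exponential with rate `p * lam`, and it charges CD when damage occurs before the planned inspection, and otherwise CI for the inspection plus CR if a fault is found.

```python
import random


def simulate_cost_rate(T, lam, mu, p, c_i, c_r, c_d, n_cycles=200_000, seed=0):
    """Monte Carlo estimate of the long-run operating cost per unit time.

    Each cycle starts when the system is put into operation and ends at the
    planned inspection (time T) or at system damage, whichever comes first.
    """
    rng = random.Random(seed)
    total_cost = 0.0
    total_time = 0.0
    for _ in range(n_cycles):
        fault_time = rng.expovariate(mu)  # hidden fault occurs at time S
        # First damaging trigger after the fault: thinned Poisson process.
        damage_time = fault_time + rng.expovariate(p * lam)
        if damage_time <= T:
            # Damage is self-revealing and ends the cycle with a replacement.
            total_cost += c_d
            total_time += damage_time
        else:
            # The inspection takes place; repair only if a fault is present.
            total_cost += c_i
            if fault_time <= T:
                total_cost += c_r
            total_time += T
    return total_cost / total_time


if __name__ == "__main__":
    # Hypothetical split of p*lam = 0.2 into lam = 0.5 and p = 0.4.
    estimate = simulate_cost_rate(2.0, lam=0.5, mu=0.2, p=0.4,
                                  c_i=5.0, c_r=15.0, c_d=150.0)
    print(f"estimated cost per month for T = 2: {estimate:.2f}")
```

Dividing total cost by total time over many cycles mirrors the renewal-reward argument used later for ca(T), so the estimate can serve as an independent check of any closed-form expression.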
Additional notation:
T: planned time to next inspection
AT: the event "system damage occurs up to time T", where time is measured from the moment when the system is last put into operation
BT: the event "system fault occurs up to time T", where time is measured as above
¬AT, ¬BT: negations of the events AT and BT
τ: a random time that elapses from putting the system into operation to its damage, provided that the system is left unmaintained
LT: the length of one operation cycle, i.e. the time that elapses from putting the system into operation to the next planned inspection or system damage, whichever occurs first
CI: the cost of inspection
CR: the cost of repair of a faulty system
CD: the cost of system damage and the ensuing replacement
ca(T): the long-run operating cost per unit time
As shown further, the considered maintenance model yields four optimal maintenance policies related to four different configurations of the parameters p, λ, μ, CI, CR and CD. This is due to the fact that the cost functional ca(T) is different in the cases pλ=μ and pλ≠μ. In consequence, the values of the above parameters assumed for the example system will vary depending on the currently analyzed configuration.

General Auxiliary Formulas
Pr(AT), E(LT) and ca(T) defined in the previous section will be given by formulas (6), (17) and (18) respectively. Their derivations are presented below, starting with Pr(AT). Since BT  AT, we have: System faults and triggering events are independent, hence for sT we have: Since Pr(BT) = 1 -FS(T), formula (1) converts to The main assumption 2 states that hence Pr[(AT)BT)], equal to the integral in (3), is computed as follows: Our next goal is to compute E(LT). For this purpose, we will need a formula for the distribution function of , i.e. Pr(t). We have: Where, X1 is the time between s and the first event occurring after s, and Xi -the time between the (i-1)-th and i-th event, i2 . By main assumption 5, the sum X1+…+Xk has Erlang distribution with parameters  and k, i.e.
From (7) and (8) we obtain: Let us convert the expression under the integral in (9) to a simpler form.
The second and the fourth of the above equalities follow from the formula for the sum of a geometric series. Substituting the sum under the integral in (9) according to (10) yields:
It is convenient to compute E(LT) using Pr(τ>t) which, in view of (11), is given by
The definition of an operation cycle and formula (12)
In consequence, we obtain
In the last part of this section we will construct a formula for the long-run operating cost per unit time, using Pr(AT) and E(LT) given by (6) and (17). Clearly, the time points at which the system is put into operation (following an inspection or damage) constitute a renewal process. From the elementary renewal theorem (see Gertsbakh, 2000) it follows that
Indeed, the above theorem implies that ca(T) is equal to the expected cost in one cycle divided by its expected length. To obtain the numerator in (18) let us note that 1) an inspection is performed at time T provided that damage does not occur up to T; 2) if, prior to an inspection, the system becomes faulty, it undergoes a repair; 3) if damage occurs up to time T, the cost CD is incurred.
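The quantities discussed in this section admit closed forms that follow directly from the main assumptions. The block below is a reconstruction offered as a cross-check rather than a reproduction of formulas (6), (17) and (18). It writes λ for the triggering-event rate, μ for the fault rate, τ for the time to damage of an unmaintained system, and uses the shorthand a = pλ, which is not the paper's notation. The cost numerator follows the three observations listed above.

```latex
% a = p\lambda: rate of damaging triggering events; \mu: fault rate
\begin{align*}
\Pr(\tau > t) &=
\begin{cases}
\dfrac{a\,e^{-\mu t} - \mu\,e^{-a t}}{a - \mu}, & a \neq \mu,\\[2ex]
(1 + \mu t)\,e^{-\mu t}, & a = \mu,
\end{cases}
\qquad \Pr(A_T) = 1 - \Pr(\tau > T),\\[1ex]
E(L_T) &= \int_0^T \Pr(\tau > t)\,dt =
\begin{cases}
\dfrac{1}{a-\mu}\left[\dfrac{a}{\mu}\bigl(1 - e^{-\mu T}\bigr)
 - \dfrac{\mu}{a}\bigl(1 - e^{-a T}\bigr)\right], & a \neq \mu,\\[2ex]
\dfrac{1}{\mu}\bigl[2 - (2 + \mu T)\,e^{-\mu T}\bigr], & a = \mu,
\end{cases}\\[1ex]
c_a(T) &= \frac{C_I\,\Pr(\overline{A_T}) + C_R\,\bigl[\Pr(B_T) - \Pr(A_T)\bigr]
 + C_D\,\Pr(A_T)}{E(L_T)},
\qquad \Pr(B_T) = 1 - e^{-\mu T}.
\end{align*}
```

Letting T tend to infinity gives E(L_T) → 1/μ + 1/a, the mean of the sum of two independent exponential times, which is a quick consistency check on the expressions.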
In the next two sections we will formulate conditions to be fulfilled so that ca(T)<ca(∞) for certain T>0, i.e. the inspection policy with period T is better than the "wait until damage and then replace" policy. We will then derive equations for the optimal T minimizing ca(T). Due to different expressions for Pr(AT) and E(LT), the cases pλ=μ and pλ≠μ have to be considered separately.

Analysis of the Case p=
From (5)
Differentiating ca(T) we obtain:

Lemma 1
If CD ≤ 2(CI + CR) then ca(T) decreases in T for T>0.
Proof: If CD ≤ 2(CI + CR) then
The inequalities in (22) hold due to the fact that exp(-μT)<2 and exp(-μT) ≥ 1-μT. Thus, in view of (20), the first derivative of ca(T) is strictly negative, which ends the proof.
Corollary: Under the assumption of Lemma 1 the optimal maintenance policy is "do not perform inspections or repairs and replace the system only after its damage".
Lemma 2 is illustrated in Figure 2, by the shape of f(T), for the example system with the following parameters: CI=5, CR=15, CD=150, pλ=μ=0.2.
Corollary: Under the assumption of Lemma 2, ca(T) decreases for 0<T<T*, attains its minimum at T*, and increases for T>T*, where T* is the only solution of the equation f(T)=0 (see Figure 5).
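The threshold 2(CI + CR) can be cross-checked by looking at how ca(T) approaches its limit for large T. The expansion below is a heuristic sanity check and not the paper's proof. It assumes the cost structure described for formula (18) (inspection cost if no damage occurs, repair cost if a fault is present at inspection, damage cost otherwise), writes μ for the common value of the fault rate and the damaging-event rate, and uses the shorthand x = μT, which is not the paper's notation.

```latex
\begin{align*}
c_a(T) &= \mu\,\frac{C_D - e^{-x}\bigl[(C_D - C_I)(1+x) - C_R\,x\bigr]}
                   {2 - (2+x)\,e^{-x}}, \qquad x = \mu T,\\[1ex]
c_a(T) - \frac{\mu C_D}{2}
&= \frac{\mu}{2}\,e^{-x}\Bigl[C_I + x\Bigl(C_I + C_R - \frac{C_D}{2}\Bigr)\Bigr]
 + O\bigl(x^2 e^{-2x}\bigr), \qquad x \to \infty .
\end{align*}
```

For large T the bracket is eventually negative exactly when CD > 2(CI + CR), in which case ca(T) approaches its limit from below and, since ca(T) grows without bound as T tends to 0, must attain an interior minimum. For CD ≤ 2(CI + CR) the bracket stays positive, which is consistent with ca(T) decreasing towards its limit.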

Analysis of the Case pλ≠μ
Using again (5), (6), (17), and (18), this time for pλ≠μ, we obtain:
Let f1(T) = (pλ-μ) f(T) and g1(T) = (pλ-μ) g(T). We have:
As in the previous section, we will differentiate ca(T), and the numerator of the resulting quotient will be the object of further analysis. Let h1(T) = f1'(T)g1(T) - f1(T)g1'(T), where f1' and g1' denote the first derivatives of f1 and g1. It thus holds that
We will now compute h1(T), which requires some effort.
It will now be shown that h(T) fulfills the following two lemmas:

Lemma 3
The function h(T) is negative for T≥0 if the following condition holds:
Proof: To begin with, let us compute h(0). Putting CX = CD - CI - (pλ+μ)CR/(pλ) yields:
Since pλ≠μ, it is natural to consider the cases pλ>μ and pλ<μ. Let us assume that pλ>μ and CD fulfills an inequality stronger than (30), i.e.

CD ≤ (pλ + μ)CR/(pλ)    (32)

We will show that dh(T)/dT<0 for T>0. Indeed, from (29) we obtain:
The inequality in (33) holds because the expressions in parentheses and brackets are positive for T>0, and CX ≤ -CI < 0 due to (32). From (33) it follows that h(T) is decreasing in T for T>0, which, in view of (31), means that the lemma holds provided that (32) is fulfilled.
From the above argument it follows that in either case (the auxiliary quantity being ≤1 or >1) there exists exactly one T*>0 such that h(T)<0 for T∈[0,T*), h(T*)=0, and h(T)>0 for T>T*. The proof for the case pλ<μ is analogous.
Lemma 4 is illustrated in Figure 4, by the shape of h(T), for the example system with the following parameters: CI=5, CR=15, CD=150, pλ=0.4, μ=0.2. Let us note that the auxiliary quantity defined in the proof of Lemma 4 exceeds 1 in this example, hence h(T) first decreases to its minimum value, and then increases to infinity.
Corollary: Clearly, Lemma 4 also holds for h1(T) and, in consequence, for dca(T)/dT. This means that, under the assumption of Lemma 4, ca(T) decreases for 0<T<T*, attains its minimum at T*, and increases for T>T*, where T* is the only solution of the equation h(T)=0 (see Figure 6).
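A large-T expansion gives a quick plausibility check of when ca(T) can drop below its limiting value in this case. The block below is a heuristic sketch and not the paper's argument: it only describes the sign of ca(T) - ca(∞) for large T and does not reproduce the conditions of Lemmas 3 and 4. It assumes the cost structure described for formula (18), writes μ for the fault rate, and uses the shorthand a = pλ for the rate of damaging events, which is not the paper's notation.

```latex
% a = p\lambda (damaging-event rate), \mu = fault rate, a \neq \mu
\begin{align*}
c_a(\infty) &= \frac{a\,\mu\,C_D}{a + \mu},\\[1ex]
c_a(T) - c_a(\infty) &\sim \frac{a\,\mu}{a+\mu}\cdot
\frac{(a+\mu)(a\,C_I + \mu\,C_R) - a\,\mu\,C_D}{a^2 - \mu^2}\; e^{-\mu T}
&& (a > \mu),\\[1ex]
c_a(T) - c_a(\infty) &\sim \frac{a\,\mu}{a+\mu}\cdot
\frac{(a+\mu)(C_I + C_R) - a\,C_D}{\mu^2 - a^2}\;\mu\, e^{-a T}
&& (a < \mu).
\end{align*}
```

The difference is therefore eventually negative when CD > (a+μ)(CI/μ + CR/a) for a > μ, and when CD > (a+μ)(CI + CR)/a for a < μ. Both thresholds tend to 2(CI + CR) as a tends to μ. For the example parameters a = 0.4, μ = 0.2, CI = 5, CR = 15 the first threshold equals 0.6·(25 + 37.5) = 37.5, well below CD = 150.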

Main Results
The results of the two previous sections can be summarized as Theorems 1 and 2 which hold for the cases pλ=μ and pλ≠μ respectively.

Theorem 1
If CD ≤ 2(CI + CR) then ca(T) decreases in T for T>0, i.e. the optimal maintenance policy is "do not perform inspections or repairs and replace the system when damaged". In turn, if CD > 2(CI + CR) then the optimal T minimizing ca(T) is found by solving the equation f(T)=0, where f(T) is given by (21).

Theorem 2
If CD ≤ max(…) (42) then ca(T) decreases in T for T>0, i.e. the optimal maintenance policy is "do not perform inspections or repairs and replace the system after its damage". In turn, if "≤" is replaced with ">" in (42) then the optimal T minimizing ca(T) is found by solving the equation h(T)=0, where h(T) is given by (29).
The above theorems are illustrated in Figures 5 and 6. For ease of comparison, the example system parameters are the same as for Figures 2 and 4 respectively. Let us note that, as T tends to infinity, ca(T) converges to a constant value, equal to μCD/2 if pλ=μ, or pλμCD/(pλ+μ) if pλ≠μ. This can be easily calculated from (19) or (26).
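The behaviour described by the two theorems can be explored numerically. The sketch below is an illustration under stated assumptions and not the author's procedure: it evaluates a closed-form cost rate reconstructed from the main assumptions (with `a` standing for pλ and `mu` for the fault rate) and locates the minimizer by golden-section search on ca(T) itself, a derivative-free technique swapped in here for the Newton-Raphson solution of f(T)=0 or h(T)=0. All function names and the search bracket [t_min, t_max] are hypothetical.

```python
import math


def cost_rate(T, a, mu, c_i, c_r, c_d):
    """Long-run cost per unit time; a = p*lam is the rate of damaging events."""
    if math.isclose(a, mu):
        no_damage = (1.0 + mu * T) * math.exp(-mu * T)           # Pr(no damage by T)
        cycle = (2.0 - (2.0 + mu * T) * math.exp(-mu * T)) / mu  # E(L_T)
    else:
        no_damage = (a * math.exp(-mu * T) - mu * math.exp(-a * T)) / (a - mu)
        cycle = (a * (1.0 - math.exp(-mu * T)) / mu
                 - mu * (1.0 - math.exp(-a * T)) / a) / (a - mu)
    # Probability that a fault is present at the inspection and no damage occurred.
    fault_no_damage = no_damage - math.exp(-mu * T)
    expected_cost = c_i * no_damage + c_r * fault_no_damage + c_d * (1.0 - no_damage)
    return expected_cost / cycle


def limit_cost_rate(a, mu, c_d):
    """Cost rate of the 'wait until damage, then replace' policy (T -> infinity)."""
    return c_d * a * mu / (a + mu)


def golden_section_min(func, lo, hi, tol=1e-8):
    """Minimize a unimodal function on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    x1 = hi - inv_phi * (hi - lo)
    x2 = lo + inv_phi * (hi - lo)
    f1, f2 = func(x1), func(x2)
    while hi - lo > tol:
        if f1 < f2:
            # The minimum lies in [lo, x2]; reuse x1 as the new right probe.
            hi, x2, f2 = x2, x1, f1
            x1 = hi - inv_phi * (hi - lo)
            f1 = func(x1)
        else:
            # The minimum lies in [x1, hi]; reuse x2 as the new left probe.
            lo, x1, f1 = x1, x2, f2
            x2 = lo + inv_phi * (hi - lo)
            f2 = func(x2)
    return 0.5 * (lo + hi)


def optimal_period(a, mu, c_i, c_r, c_d, t_min=1e-3, t_max=60.0):
    """Approximate the T minimizing the cost rate on the assumed bracket."""
    return golden_section_min(
        lambda t: cost_rate(t, a, mu, c_i, c_r, c_d), t_min, t_max)


if __name__ == "__main__":
    # Example parameters used for Figures 2 and 4 (time unit: one month).
    for a, mu in [(0.2, 0.2), (0.4, 0.2)]:
        t_star = optimal_period(a, mu, 5.0, 15.0, 150.0)
        print(f"a={a}, mu={mu}: T* ~ {t_star:.3f}, "
              f"cost at T* ~ {cost_rate(t_star, a, mu, 5.0, 15.0, 150.0):.3f}, "
              f"limit = {limit_cost_rate(a, mu, 150.0):.3f}")
```

Golden-section search relies only on ca(T) first decreasing and then increasing on the bracket, which is the shape stated in the corollaries to Lemmas 2 and 4. When the condition of the first part of either theorem holds, no interior minimum exists and the search simply drifts to the upper end of the bracket.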

Conclusion
The paper provides a detailed reliability analysis of a technical system that can become faulty during operation, and is subject to repeatedly occurring hazardous events each of which can trigger damage to a faulty system. Assuming that time-to-fault is exponentially distributed and damage-triggering events constitute a Poisson process, it was possible to formulate conditions for the system parameters under which a periodic maintenance policy is better (with regard to the long-run operating cost per unit time) than the policy "wait until damage happens, then replace". The respective conditions, valid for the cases pλ=μ and pλ≠μ, are given by Lemmas 2 and 4, and repeated in Theorems 1 and 2. Also, it was possible to analytically derive fairly simple equations for the optimal time between inspections, i.e. the equations f(T)=0 and h(T)=0, where f(T) and h(T) are given by (21) and (29) respectively. These equations can easily be solved by a numerical root-finding method, e.g. the Newton-Raphson method. It can be seen in the proofs of Lemmas 2 and 4 that both f(T) and h(T) fulfill the conditions under which the Newton-Raphson procedure can be applied.
The model analyzed in this paper can be applied to optimize the maintenance process of many real-world technical systems, in particular a wide range of safety systems. Apart from the explosion prevention systems mentioned in the introduction, the results of the paper can be implemented for power system protection devices (line protection relays or generator protection equipment) subject to hidden faults which, in some randomly occurring circumstances, may cause widespread power outages. This problem is addressed in Bae and Thorp (1999). Another example of the model's application is electronic equipment with a protection device whose failures are hidden and, if undetected, lead to damage of the protected equipment. Such a system is considered in Jiang et al. (2015). Last but not least, as discussed in the introduction, the model can also be applied to systems with delayed failures.
The author believes that it is possible to add some complexity to the maintenance model considered above without losing its analytical tractability. For example, we can assume non-homogeneity of the Poisson process of the damage-triggering events, as well as multilevel faults and/or multiple fault modes.
Analysis of the extended model will be a subject of future work.