Optimal replacement policy for a periodically inspected system subject to the competing soft and sudden failures



Diyin TANG, Jinsong YU

Optymalna polityka wymiany do zastosowania w systemach poddawanych przeglądom okresowym narażonych na konkurujące uszkodzenia parametryczne i nagłe

This paper analyzes a replacement problem for a continuously degrading system which is periodically inspected and subject to the competing risk of soft and sudden failures. The system should be correctively replaced by a new one upon failure, or it could be preventively replaced before failure due to safety and economic considerations. Dependent soft and sudden failures are considered. The degradation process of the system observed by inspections exhibits a monotone increasing pattern and is described by a gamma process. The failure rate of the sudden failure is characterized by its dependency on the system age and the degradation state. By formulating the optimization problem in a semi-Markov decision process framework, the specific form of the optimal replacement policy which minimizes the long-run expected average cost per unit time is found, considering a cost structure that includes the cost for inspections, the cost for preventive replacement, and the costs for different failure modes. The corresponding computational algorithm to obtain the optimal replacement policy is also developed. A real data set is utilized to illustrate the application of the optimal replacement policy.

TANG D, YU J. Optimal Replacement Policy for a Periodically Inspected System Subject to the Competing Soft and Sudden Failures. Eksploatacja i Niezawodnosc - Maintenance and Reliability 2015; 17 (2): 228-235, http://dx.doi.org/10.17531/ein.2015.2.9.

Introduction
Pressure from the market spurs the advancement of maintenance strategies. Traditional maintenance strategies, such as corrective maintenance, which takes place only when a failure is observed, and age-based or length-of-usage-based policies, which are performed only at scheduled intervals, are not always able to meet the ever-growing demand for a high level of system reliability while balancing the operating costs. In recent years, advances in instrumentation and measurement technology have brought about the idea of condition-based maintenance (CBM). In contrast with any predetermined maintenance strategy, CBM enables maintenance actions based on condition monitoring information collected in real time. Some successful examples of implementing CBM in real systems have demonstrated its efficiency and effectiveness in preventing catastrophic failures and improving maintenance performance (e.g. [1, 4, 6, 11, 12, 14]).
Most CBM policies have focused on a single failure mode, most commonly the failure due to degradation, which is usually referred to as the soft failure. A soft failure is identified when the system state, defined by the degradation level, exceeds a predetermined threshold. When this happens, the system is no longer assumed to be able to function satisfactorily or safely, and it should be stopped and replaced even if no physical failure is observed. The most frequently used CBM strategy for a continuous-state degrading system subject to the soft failure is the control-limit policy. Wang [19] jointly optimized the inspection interval and a control limit of preventive maintenance for a degradation process described by a linear growth model with random coefficients. Liao et al. [13] investigated a condition-based availability limit policy for a gamma-process-based degrading system, considering imperfect maintenance and an availability constraint. Policies with multiple control limits have also been proposed to improve CBM effectiveness and to satisfy different requirements. In [7], a CBM policy using multi-level control limits was proposed and optimized. The degrading system was modeled by a gamma process, and multiple control limits were used to determine the current maintenance actions and future inspection times. Elwanly et al. [6] analyzed a replacement problem for an exponentially increasing degradation system. They demonstrated that the optimal replacement policy was a multi-level control-limit policy with monotonically increasing control limits.
However, considering only the soft failure seems inadequate for degrading systems that are also subject to sudden failures. In many practical situations, sudden failures are very likely to interrupt the graceful degradation and then result in more serious consequences. Therefore, in this paper, we consider a competing risk maintenance situation. The system is regarded as failed when the degradation process reaches a critical threshold, or when a sudden failure occurs although the degradation process has not reached the threshold. Most existing papers that deal with such a failure scenario assume that the degradation process and the sudden failure are independent of each other. Nevertheless, even if independence is demonstrated to be appropriate for certain types of competing risks, e.g. [2, 8, 20, 21], in many situations the dependence structure between the two failure modes is important and should not be neglected. We assume that the dependence is described by the failure rate of the sudden failure, which is influenced by both the age and the degradation state of the system. Similar assumptions were made in Huynh et al. [9], Liu et al. [15], Castro et al. [3] and Huynh et al. [10] to deal with competing risk maintenance situations. A preventive threshold in terms of the degradation state was optimized in Huynh et al. [9] and Liu et al. [15] with different monitoring strategies: the former focused on periodically inspected systems and the latter dealt with continuously monitored ones. Castro et al. [3] also considered a degradation-state-based control limit as the alarm for preventive maintenance, but multiple degradation processes were involved in the failure mechanism. In Huynh et al. [10], a novel CBM strategy was proposed using the mean residual life as the control limit. That work was the first to investigate the effectiveness and potential, in CBM decision-making problems, of condition indices that are not strictly limited to the failure mechanism.
In this paper, we focus on analyzing the optimal replacement policy for a periodically inspected system subject to the competing soft and sudden failures. This policy performs the preventive replacement only at inspection instants, and correctively replaces the system at the time of failure. We describe the degrading system using a gamma degradation process, which implies that the soft failure of the system results from a gradual and irreversible accumulation of deterioration. The failure rate corresponding to the sudden failure is described by a proportional hazards model, which means that it is influenced by both the age and the degradation state of the system. Using the above models, we show that for a certain group of maintenance situations the optimal replacement policy minimizing the average cost has a specific form, and that it is in fact a monotonically non-increasing multi-level control-limit policy. A computational algorithm to calculate the optimal multi-level control limits is developed as well, using a semi-Markov decision process framework. Finally, we present a case study using a real laser data set from Meeker and Escobar [16] to illustrate the proposed policy.
The paper is organized as follows. Section 2 describes the model for system degradation and sudden failures. In Section 3, we present the replacement problem. In Section 4, we examine the structure of the optimal replacement policy. The computational algorithm to calculate the optimal replacement policy is developed in Section 5. Section 6 gives an example based on a real data set. Conclusions are in Section 7.

Model of system degradation
We consider a continuously deteriorating system subject to periodic inspections. Due to limitations such as the difficulty of placing sensors, the cost of condition monitoring, and the internal structure of the system, not all systems can be continuously monitored. Periodic monitoring is therefore a typical approach in many real applications. The degradation state of the system is hidden and can only be known through inspections. Because of the physical nature of most degradation processes, the degradation state usually presents a monotonically increasing (or decreasing) trend. Even though some fluctuations may occur due to measurement errors, self-recovery mechanisms, etc., when the inspection interval is long enough the observed increment between two inspections is still very likely to be non-negative. For a degradation process with s-independent and non-negative increments, the gamma process is an appropriate stochastic model (see e.g. [18]). We assume that the system degradation $Y(t)$ follows a gamma process and is observed at the inspection times $t_n = nh$ as $Y_n = Y(nh)$, where $h$ is the inspection interval. Then the increment $Y_{n+1} - Y_n$ is gamma distributed with shape parameter $\eta_{n+1} - \eta_n$, where $\eta(t)$ is a given, monotone increasing function and $\eta_n = \eta(nh)$. If the soft failure threshold $D_f$ is given, the probability of soft failure for the next inspection interval $(nh, (n+1)h]$ is the probability that this increment carries the degradation from $Y_n$ past $D_f$, that is, $\Pr(Y_{n+1} \ge D_f \mid Y_n)$. In the following, we first use the homogeneous gamma process, i.e. $\eta(t) = \eta t$, for a rigorous mathematical derivation of the optimal replacement policy. Extensions of the policy to a more general form of the gamma degradation model are discussed as well.
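As an illustration of the soft failure probability discussed above, the following Python sketch evaluates it for a homogeneous gamma process. It assumes that the increment over one inspection interval is gamma distributed with shape $\eta h$ and that $\gamma$ acts as a rate parameter; the rate convention, the function names and the series-based incomplete gamma routine are assumptions made for this sketch, not details taken from the paper.

```python
import math


def regularized_lower_gamma(a: float, x: float, tol: float = 1e-12) -> float:
    """Regularized lower incomplete gamma function P(a, x), via its power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a          # k = 0 term of sum_k x^k / (a (a+1) ... (a+k))
    total = term
    k = 1
    while term > tol * total:
        term *= x / (a + k)
        total += term
        k += 1
    log_prefactor = a * math.log(x) - x - math.lgamma(a)
    return min(1.0, total * math.exp(log_prefactor))


def soft_failure_probability(y_n: float, d_f: float, eta: float,
                             gamma_rate: float, h: float) -> float:
    """Pr(Y_{n+1} >= D_f | Y_n = y_n) for a homogeneous gamma process.

    Assumption: the increment over one interval is Gamma(shape=eta*h, rate=gamma_rate).
    """
    if y_n >= d_f:
        return 1.0  # threshold already exceeded
    shape = eta * h
    return 1.0 - regularized_lower_gamma(shape, gamma_rate * (d_f - y_n))
```

For example, `soft_failure_probability(4.0, d_f=5.0, eta=0.02, gamma_rate=10.0, h=100.0)` gives the chance that a unit observed at level 4.0 crosses a threshold of 5.0 before the next inspection, for these hypothetical parameter values.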

Model of sudden failure
Sudden failure is a common failure mode that may interrupt a graceful degradation. In many practical situations, the sudden failure rate is very likely to be influenced by the degradation process: for example, the higher the degradation, the more prone the system is to sudden failures. It is therefore reasonable to assume that the failure rate of sudden failures depends on the degradation process. In this paper, we assume that the failure rate of sudden failures is described by the proportional hazards (PH) model (see e.g. [5]), which explicitly includes the effects of both the age and the degradation state. It can be expressed as the product $\lambda_0(t; \alpha)\,\theta[Y(t); \beta]$, where $\lambda_0(t; \alpha)$ denotes the baseline failure rate at time $t$ with unknown parameter vector $\alpha$, and $\theta[Y(t); \beta]$, with unknown parameter vector $\beta$, is a positive function that depends only on the value of the degradation state $Y(t)$.
Because of the periodic inspection policy, the values of $Y(t)$ are known only at discrete points of time. Thus, for $nh \le t < (n+1)h$, we approximate the failure rate at time $t$ by $\lambda_0(t; \alpha)\,\theta[Y_n; \beta]$, that is, by holding the degradation state at its most recently observed value $Y_n = Y(nh)$.
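The following Python sketch shows how such a piecewise approximation leads to a closed-form conditional reliability over one inspection interval. It rests on assumptions that go beyond the text at this point: a Weibull baseline hazard with hypothetical `scale` and `shape` parameters, an exponential link $\theta(y) = \exp(cy)$, and the degradation state held at its last observed value between inspections. The paper's own parametrization may differ.

```python
import math


def weibull_cumulative_hazard(t: float, scale: float, shape: float) -> float:
    """Cumulative baseline hazard of a Weibull distribution: (t / scale) ** shape."""
    return (t / scale) ** shape


def conditional_reliability(y_n: float, n: int, t: float, h: float,
                            scale: float, shape: float, c: float) -> float:
    """Probability of no sudden failure in (nh, nh + t], given survival to nh.

    Assumptions: hazard = baseline(t) * exp(c * y_n), with y_n frozen at the
    value observed at the n-th inspection, and 0 <= t <= h.
    """
    if not 0.0 <= t <= h:
        raise ValueError("t must lie within one inspection interval")
    start = n * h
    baseline = (weibull_cumulative_hazard(start + t, scale, shape)
                - weibull_cumulative_hazard(start, scale, shape))
    return math.exp(-math.exp(c * y_n) * baseline)
```

With `c > 0`, a more degraded unit of the same age has a lower reliability over the next interval, which is the dependence between the two failure modes that the PH model is meant to capture.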

The optimal replacement problem
Consider a non-repairable single-unit system as described in Section 2. The degradation state of the system is hidden and the soft failure is non-self-announcing: nothing reveals the degradation state except an inspection. The system starts working at time $t = 0$ and is inspected every $h$ time units. The inspections are assumed to be perfect, and each incurs a cost $C_0$. Three kinds of replacement actions are available on the system:
1. If the system's degradation state identified by inspection exceeds its soft failure threshold $D_f$, a corrective replacement is performed with the expected cost $C + C_2$.
2. If a sudden failure happens before the degradation state reaches the threshold $D_f$, the system is also correctively replaced, but with a possibly more expensive expected cost $C + C_1$.
3. At the time of inspection, if the system still operates and its degradation state observed by inspection is below the soft failure threshold $D_f$, a preventive replacement may be performed on the system instantaneously at the expected cost $C$.
After the replacement, the system is back in the as-good-as-new state. Even though both the preventive and the corrective maintenance actions bring the system back to the as-good-as-new state, they generally differ in practice because unplanned maintenance actions (i.e. corrective replacements) may involve a larger economic loss. Moreover, the corrective replacement for a sudden failure is quite likely to be more expensive than that for a soft failure because of its unexpected nature and the damage resulting from physical breakdown.
In addition, we introduce further assumptions, among them that any replacement, whether corrective or preventive, takes negligible time. We note that the continuous degradation state is discretized in the following proofs in order to guarantee a finite state space for the policy optimization model; the details of the discretization scheme are explained in Section 5. Let $V(n, Y_n)$ be the relative value function (see e.g. [17]) that formulates the relative cost in the infinite-horizon decision process when the system is currently in state $(n, Y_n) \in \Omega$. The optimality equations, Eq. (7) and Eq. (8), are expressed in terms of $V(n, Y_n)$.
The first line of Eq. (7) describes the situation in which the degradation state observed at the current inspection exceeds the soft failure threshold $D_f$: we perform an immediate corrective replacement at the expected cost $C + C_2$ and put the system back in the as-good-as-new state. The second line of Eq. (7) gives the maintenance rule for the case in which the observed degradation state is below the soft failure threshold $D_f$: we can either preventively replace the system or do nothing and continue operation, depending on the relative costs of the two maintenance actions.
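To make the replacement problem concrete, the long-run expected average cost per unit time of any given policy can be estimated by simulating renewal cycles. The Python sketch below is a plain Monte Carlo estimate, not the semi-Markov decision process computation developed in the paper. It assumes a homogeneous gamma process with rate parameter `gamma_rate`, a Weibull baseline hazard with an exponential link `exp(c * y)`, and a cost structure read as $C_0$ per inspection, $C$ for preventive replacement, $C + C_2$ for soft failure and $C + C_1$ for sudden failure. All names and parameter values are hypothetical.

```python
import math
import random


def simulate_average_cost(control_limit, d_f, h, eta, gamma_rate,
                          scale, shape, c, c0, c_prev, c1, c2,
                          cycles=10000, seed=0):
    """Renewal-reward estimate: total cost over all cycles / total elapsed time.

    control_limit: callable n -> preventive replacement limit at inspection n.
    """
    rng = random.Random(seed)
    total_cost = 0.0
    total_time = 0.0
    for _ in range(cycles):
        y, n = 0.0, 0                      # new unit: no degradation, age 0
        while True:
            start = n * h
            theta = math.exp(c * y)        # PH multiplier at the last observed state
            h0_start = (start / scale) ** shape
            h0_end = ((start + h) / scale) ** shape
            u = 1.0 - rng.random()         # uniform on (0, 1], avoids log(0)
            if u > math.exp(-theta * (h0_end - h0_start)):
                # Sudden failure inside the interval: invert the survival function.
                fail_time = scale * (h0_start - math.log(u) / theta) ** (1.0 / shape)
                total_cost += c_prev + c1
                total_time += fail_time
                break
            n += 1
            y += rng.gammavariate(eta * h, 1.0 / gamma_rate)  # shape, scale
            total_cost += c0               # inspection at time n * h
            if y >= d_f:                   # soft failure found at inspection
                total_cost += c_prev + c2
                total_time += n * h
                break
            if y >= control_limit(n):      # preventive replacement
                total_cost += c_prev
                total_time += n * h
                break
    return total_cost / total_time
```

Comparing the estimate for several candidate control limits, for instance `lambda n: 3.0` against `lambda n: 4.0`, gives a rough picture of the trade-off between replacing too early and risking a failure.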

Structure of the optimal replacement policy
In this section, we will examine the structure of the optimal replacement policy for the replacement problem defined in Section 3.
Theorem 1. [...], the relative value function $V(n, Y_n)$ defined by the second line of Eq. (7) is non-decreasing for $(n, Y_n) \in \Omega$ and for any positive constant $g$.
Proof. Let $V^k(n, Y_n)$ denote the relative value function at the $k$th iteration of the value iteration algorithm. The argument is by induction on $k$, starting from the initial value: if the required properties are true at the $k$th iteration, then from Eq. (7) the expression for $V^{k+1}(n, Y_n)$ involves only terms that are non-decreasing for any positive constant $g$. Since both terms in Eq. (10), including $W(n, Y_n)$, are non-decreasing, $V^{k+1}(n, Y_n)$ is non-decreasing as well, which completes the proof.
Based on Theorem 1, we are able to find the form of optimal replacement policy by analyzing the optimality equations Eq. ( 7) and Eq. ( 8).
Theorem 2. Let 0 represent the immediate preventive replacement upon observation of the system state, 1 represent the immediate corrective replacement for the soft failure, and [...]. [...], the optimal replacement policy has the following form: [...]
Proof. Consider first the case in which [...]. In this case the optimal decision is no replacement. Consider next the case in which [...]. Then the optimal decision is an immediate preventive replacement. This establishes the result.
Next, we will show in Theorem 3 that for the replacement problem defined in Section 3, the optimal replacement policy is a monotonically non-increasing multi-level control-limit policy in terms of the degradation state.
Theorem 3. [...], for all inspection times $t_n = nh$, if the observed system state $Y_n$ is below the soft failure threshold $D_f$, the optimal decision is to preventively replace the system if and only if $Y_n \ge w_n^*$, where $w_n^*$ is the optimal control limit at $t_n = nh$. The control limit $w_n^*$ is monotonically non-increasing in $n$.
Proof. By Theorem 2, consider the condition for the preventive replacement, Eq. (13). For any $t_n = nh$, the left-hand side of Eq. (13) is non-decreasing in $Y_n$, so if the optimal decision at a degradation state $w_n^*$ is to preventively replace, the optimal decision for any $Y_n \ge w_n^*$ is also to preventively replace. Thus, the optimal replacement policy at time $t_n = nh$ is a control-limit policy with control limit $w_n^*$. On the other hand, the left-hand side of Eq. (13) is non-decreasing in $n$, so for any $Y_n$ there exists a starting time $n^* h$ such that for any $nh \ge n^* h$ the optimal decision is to preventively replace the system. From the existence of a control limit $w_n^*$ for each inspection time and of a starting time $n^* h$ for each degradation state, it follows that the control limit $w_n^*$ is monotonically non-increasing in $n$.
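The decision rule of Theorem 3 can be sketched in a few lines of Python. The function below is only an illustration of how a multi-level control-limit policy is applied at an inspection; the action labels, the list of control limits and the choice to reuse the last limit beyond the end of the list are assumptions of this sketch.

```python
from typing import Sequence


def decide(n: int, y_n: float, d_f: float, control_limits: Sequence[float]) -> str:
    """Action at the n-th inspection (n = 1, 2, ...) under a control-limit policy.

    control_limits[n - 1] is the limit w_n; the last entry is reused for later
    inspections (an assumption of this sketch).
    """
    if n < 1:
        raise ValueError("inspections are numbered from 1")
    if y_n >= d_f:
        return "corrective"      # soft failure identified by the inspection
    w_n = control_limits[min(n, len(control_limits)) - 1]
    if y_n >= w_n:
        return "preventive"      # degradation at or above the control limit
    return "continue"            # keep operating until the next inspection


def is_non_increasing(limits: Sequence[float]) -> bool:
    """Check the monotonicity property stated for the optimal control limits."""
    return all(a >= b for a, b in zip(limits, limits[1:]))
```

Because the limits decrease with age, the same observed degradation level can lead to continued operation early in life and to preventive replacement later on.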
Remark. We use only the homogeneous gamma process to derive the above optimal replacement policy. However, the optimality still holds for other degradation models, provided that the probability of soft failure during the next inspection interval is non-decreasing for $(n, Y_n) \in \Omega$. This condition indicates two characteristics of a degradation process. Firstly, at the same inspection time, a degradation process with more severe deterioration has a larger probability of soft failure during the next inspection interval; secondly, for the same degradation state, a system which is "older" is more likely to suffer soft failure during the next inspection interval. These two phenomena can be found in many real situations.
Even if this probability is not strictly non-decreasing for $(n, Y_n) \in \Omega$, the above replacement policy might still be optimal, since $R(Y_n, n, h)$, the conditional reliability function of sudden failure, may dominate the overall trend. [...] is also quite a strong assumption for ensuring the monotonicity; in fact, if $R(Y_n, n, h)$ decreases quickly enough as the system ages, the theorems still hold.

Computation of the Optimal Replacement Policy
We demonstrated in Section 4 that the optimal replacement policy has a specific form, and that it is a multi-level control-limit policy with non-increasing control limits $w_n^*$. To apply this policy, it is necessary to compute the minimum long-run expected average cost per unit time $g$. We thus develop a computational procedure using a semi-Markov decision process (SMDP) to obtain $g$. The SMDP computation requires discretizing the possible range of values of the degradation state. Thus, we approximate the state space $\Omega = (H, K)$ by a non-decreasing homogeneous Markov chain with countable state space, considering states $(n, k)$. Based on this discretization scheme, for a system which has not failed by time $t_n$, if $Y_n$ is known, the conditional reliability function defined in Eq. (5) is calculated as $R(Y_n, n, t) = R(k, n, t)$, with $Y_n$ set to the value $y_k$ of its discretized state $k$.
The expected sojourn time defined in Eq. (6) is then calculated from this discretized reliability function. Next, we derive the one-step transition probabilities in the SMDP. Using the homogeneous gamma process, the probability that the value of $Y_{n+1}$ falls in a given state, given the current state of $Y_n$, follows from the gamma distribution of the increment over one inspection interval; the probability that the value of $Y_{n+1}$ falls in the failure region is obtained similarly. With the definition of the state space, for a fixed control limit $\chi$, the warning state at each inspection time can be approximated by a discrete state. Thus, the determination of a control limit $\chi$ can be transformed into determining the integer $\chi_n$; for a fixed integer $\chi_n$, the corresponding control limit $\chi$ is then determined. Once all of the quantities above are defined, for a fixed integer $\chi_n$, the long-run expected average cost per unit time $g(\chi)$ for the competing risk of soft and sudden failures can be obtained by solving a system of linear equations (see e.g. [17]). In the accompanying equations, $U$ is a positive integer and $\delta$ is a selected small number.
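A minimal Python sketch of the one-step transition probabilities for the discretized degradation state is given below. It assumes that $[0, D_f)$ is split into equal-width intervals, that each interval is represented by its midpoint, that the last column is the failure state, and that the gamma increment has shape $\eta h$ and rate $\gamma$. The midpoint representation and the series-based gamma CDF are choices made for this illustration, not details recovered from the paper.

```python
import math


def gamma_cdf(shape: float, rate: float, x: float, tol: float = 1e-12) -> float:
    """CDF of Gamma(shape, rate), via the series for the incomplete gamma function."""
    if x <= 0.0:
        return 0.0
    z = rate * x
    term = 1.0 / shape
    total = term
    k = 1
    while term > tol * total:
        term *= z / (shape + k)
        total += term
        k += 1
    return min(1.0, total * math.exp(shape * math.log(z) - z - math.lgamma(shape)))


def transition_matrix(d_f: float, num_states: int, eta: float,
                      gamma_rate: float, h: float) -> list:
    """Rows k = 0..L-1 (degradation states); columns 0..L-1 plus failure state L."""
    width = d_f / num_states
    shape = eta * h
    rows = []
    for k in range(num_states):
        y_k = (k + 0.5) * width                  # midpoint represents state k
        row = [0.0] * (num_states + 1)           # degradation never decreases
        prev = 0.0
        for j in range(k, num_states):
            cur = gamma_cdf(shape, gamma_rate, (j + 1) * width - y_k)
            row[j] = cur - prev                  # increment lands in interval j
            prev = cur
        row[num_states] = 1.0 - prev             # threshold D_f is crossed
        rows.append(row)
    return rows
```

Each row sums to one by construction, and entries below the diagonal are zero because the gamma process has non-negative increments.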

For the discretization, define $[D_f, +\infty)$ as the failure state $F$. The continuous state space $[y_0, D_f]$ can then be divided into $L$ equal intervals, giving a finite set of states. The probability that the value of $Y_{n+1}$ will be in the failure region determined by state $F$, given the current value of $Y_n$, follows from the gamma distribution of the increment.

Case study
In this section, we use real degradation data presented by Meeker and Escobar [16] (Chapter 13, Example 13.5) to illustrate the application of our proposed optimal replacement policy. The data set consists of 20 degradation histories describing the degradation process of GaAs lasers subject to the competing soft and sudden failures. During the life of a GaAs laser, degradation causes an increase in the laser's operating current. The laser is considered to have failed if the operating current increases by $D_f$ percent of its original value. On the other hand, physical breakdown due to sudden failures may also occur and consequently interrupt the graceful degradation. For illustration purposes, we assume $D_f = 5$ as the soft failure threshold, and the operating currents were inspected every $h = 100$ hours up to $\tau = 2000$ hours or until the sudden failure happens. The degradation paths are plotted in Fig. 1, in which 13 out of 20 degrade gradually until the censoring time, while the other 7 samples show a sharp increase in operating current when the sudden failure occurs. First, we fit the data using the model described in Section 2. We assume that the baseline hazard function is a Weibull hazard function and that the function quantifying the effect of the system degradation state on the failure rate is exponential, of the form $\theta[Y(t); c] = \exp(c\,Y(t))$, where $c$ is a real coefficient. Using the joint likelihood function and the interior-reflective Newton method, we obtain the ML estimates for the model parameters, as shown in Table 1.
In order to validate the fitted model, we use a probability plot to assess whether the increments of laser degradation follow the gamma distribution $Ga(\eta h, \gamma)$, and we compare the ML estimates with the non-parametric Kaplan-Meier estimates for the PH model. Fig. 2 and Fig. 3 show the results, demonstrating that the fitted model is well suited to the laser data. Note that in Fig. 3 the cumulative distribution function is approximated.
The optimal replacement policy proposed in Section 4 is applicable to this case. We partition the continuous state space of the degradation level into $L = 128$ intervals. The long-run expected average costs per thousand hours $g(\chi)$ for different control limits $\chi$ are plotted in Fig. 4, in which we find that the minimum long-run expected average cost per thousand hours is $g(\chi^*) = 1586.052$, with the optimal control limit $\chi^* = 1587.573$. Using the control limit $\chi^* = 1587.573$, the optimal control limits $w_n^*$ in terms of the degradation level at each inspection time $t_n = nh$ present a decreasing trend in time, as shown in Fig. 5. When the observed degradation level $Y_n$ exceeds $w_n^*$, the optimal maintenance policy is to preventively replace the system. We demonstrate the appropriateness of $L = 128$ using the stopping rule of Eq. (23). The results are shown in Table 2.

Conclusion
In this paper, we have investigated the optimal replacement policy for a periodically inspected system subject to the competing risk of soft and sudden failures. The policy addresses systems whose degradation process can be described by a gamma process and whose sudden failure rate can be described by a PH model. If preventive replacement is performed only at inspection times and the sudden failure dominates the failure mechanism, it is demonstrated that the optimal replacement policy has a specific form and is in fact a multi-level control-limit policy with control limits in terms of the degradation level. A computational algorithm based on an SMDP framework has also been developed to obtain the optimal replacement policy. The entire procedure of applying this policy is illustrated by a real laser example.

Fig. 3. ML estimates and the Kaplan-Meier estimates for the marginal CDF of sudden failure times
The inspection cost $C_0$ is assumed to be very small compared to any replacement cost. This is reasonable because condition monitoring loses its significance if the monitoring cost is too high. The objective of our replacement policy is to minimize the long-run expected average cost per unit time $g$ by performing appropriate preventive replacements.

Table 1. ML estimates for the model parameters