Availability analysis of series redundancy models with imperfect switchover and interrupted repairs

This paper considers N+1 series redundancy, where N components are active and one component is standby in the normal state. The active components execute the service, while the standby component is ready to take over the active role if an active component fails. When an active component fails, the standby, if available, automatically takes over system operations. However, the automatic switchover of the standby component to active mode may fail because of hardware or software issues. When a component failure or an imperfect switchover occurs, its repair begins immediately. However, the repair process may be interrupted. Most of the existing literature on redundancy models has focused on Markovian systems with uninterrupted repairs. This paper considers a non-Markovian redundancy model with interrupted repairs, where the repair time, the non-automatic switchover time, and the interrupted time are generally distributed. Using the supplementary variable method and integro-differential equations, we obtain the steady-state availability of the redundancy model.


Introduction
The availability of a system is defined as the probability that the system is operational at a point in time [16,25]. High availability is becoming a must in various domains such as telecommunication networks, power plants, and industrial and manufacturing systems [5,16,27]. In recent years, many efforts have been made to improve system availability.
Redundancy is a common approach to improving system availability [6]. A redundancy service can offer different levels of availability depending on its redundancy model. The Availability Management Framework [16] defines the following four redundancy models: 2N, N+M, N-way, and N-way active. The 2N redundancy model ensures one standby replica for each component in active mode. The active components execute the service, while the standby components are ready to take over the active role if the active components fail. The N+M redundancy model extends the 2N redundancy by allowing more than two components to be active or standby. In the N+M redundancy, N represents the number of components in active mode and M represents the number of components in standby. The N-way redundancy model extends the N+M redundancy by allowing a component to be simultaneously active and standby for different services. Lastly, the N-way active redundancy model differs from the 2N, N+M, and N-way redundancy, as it does not support standby service assignments, but allows a service to be assigned active to several components [16].
The availability analysis of a redundancy model is based on analyzing the various states that the model undergoes during its lifespan [21]. The analysis mainly focuses on capturing the failures that cause the system to switch to a faulty state and the repairs that shift the system back to a healthy state [6]. Since the occurrence of failures is erratic by nature, stochastic models have been used to conduct the availability analysis [20]. Markovian models have been extensively used for this purpose because of their expressiveness and their capability of capturing the complexity of real systems [1,23-25]. One of the major problems of using Markovian models is that a large number of states are required to represent the model accurately [1]. As an alternative, Kanso et al. [6] used Stochastic Reward Nets (SRNs) to model various redundancy models and evaluated the availability by using the Stochastic Petri Net Package (SPNP). Kim et al. [9] analyzed the networking service availability of a 2N redundancy model with nonstop forwarding by using the SPNP. The analytic-numeric methods of SPNP can solve Markovian SRNs but fail for non-Markovian SRNs. In fact, there is no reason to assume the Markov property when modeling repairable systems [11]. Recently, the modeling and analysis of repairable systems with general repair time have drawn a lot of attention. Kuznetsov [11] evaluated the availability of repairable networks with a general repair time distribution by a simulation method. However, because of the mathematical complexity of non-Markovian redundancy models, closed-form solutions of the models are extremely difficult to obtain.
In redundancy models with standby components, one of the standby components takes over the active role if an active component fails [16]. This process is called a switchover from standby mode into active mode. However, the switchover process is not always perfect [14]. That is, the switchover process may fail during the transition of a standby component to active mode. Lewis [14] first introduced the concept of imperfect switchover in the availability analysis of redundancy models. Wang et al. [22] studied the availability of four different repairable systems with imperfect switchover. Ke et al. [7] used a Laplace transform method to study the availability of a Markovian repairable system. Hsu et al. [3] considered the profit analysis of a repairable system with imperfect switchover. Sadjadi and Soltani [18] considered a series-parallel system with the choice of redundancy strategy. In the above-mentioned works, the repair times have been assumed to be exponentially distributed. However, the assumption of an exponential repair time distribution limits their use for solving real problems. In this paper, we consider a non-Markovian model with imperfect switchover.
We consider an N+1 redundancy model. This is a special case of the N+M redundancy [16]. The classic "k-out-of-N" model [26], which is a very popular type of redundancy in fault-tolerant systems, can be seen as a special case of N+M redundancy if it is assumed that the switchover is perfect and instantaneous and the failure rate of a standby component is equal to the failure rate of an active component. In practice, the standby component may differ from the active components in normal operation and may have a different failure rate in the operational mode [6]. In the N+1 redundancy model, a single component acts as a standby for all components in active mode. In operation, the active components provide their service while the standby component is prepared to become a backup to any of the active components, should one of them fail [16]. The N+1 redundancy itself has many real-world applications. One example is a network device, the DSLAM (Digital Subscriber Line Access Multiplexer), which connects the customer's end to the Internet through NICs (Network Interface Cards) [4,10]. There may be multiple primary NICs and one standby NIC on a DSLAM. When one of the primary NICs is faulty, services can be switched to the standby NIC. The switchover of the standby NIC to primary mode may fail due to hardware or software issues [10]. The failed NICs can be fixed through the remote server, which may also malfunction [10]. Other examples include servers designed with multiple power supplies with one reserved as a cold backup [15]; a bank website deployed to a cloud platform, which has a dynamic number of active instances with a running backup always ready to replace a failed instance [19]; and a factory having multiple industrial robots and one backup [17].
When a component failure or an imperfect switchover occurs, its repair begins immediately. In realistic environments the repair process may be interrupted [10]. Therefore, considering interrupted repairs in a repairable system is practical and imperative. Most of the existing literature on redundancy models has focused on uninterrupted repairs with exponentially distributed repair time. Little attention has been given to redundancy models with interrupted repairs and generally distributed repair time. Lee [12] analyzed the steady-state availability of a simple parallel 1+1 redundancy model with one active and one standby component. Bosse et al. [2] estimated the availability of a redundancy model with imperfect switchovers and interrupted repairs by using a Petri net Monte Carlo simulation. Kuo and Ke [10] and Lee [13] studied the steady-state availability of series systems with switching failures, interrupted repairs, and generally distributed repair time. However, they did not distinguish between the repairs of the component failures and the imperfect switchovers.
This paper focuses on the analytical expression of the availability for the N+1 series redundancy model with imperfect switchovers, generally distributed repair times, and interrupted repairs. Furthermore, we distinguish between the repairs of the component failures and the imperfect switchovers. Using the supplementary variable method and the integro-differential equations governing the steady-state behavior of the model, we obtain the analytical expression of the steady-state availability. Some numerical examples for the steady-state availability of the redundancy model are presented.

Models
We describe a redundant system with one repairer and N+1 components, among which N components are active and one component is standby in the normal state. The components in active mode operate normally and the component in standby mode is ready to assume the active role should an active component fail. The system is available only when there are N active components. It is assumed that the components in active mode operate independently of each other and that their position in the serial structure of the system is not important. When the system is available, each component may fail independently of the state of the others. When the system is unavailable, it is shut down and no additional failures occur. Components are repaired on a 'first come, first served' basis. After the repair of a component is completed, the fixed component becomes standby if there are already N active components; otherwise, it becomes active. If one of the active components fails and there is a standby component, then the standby component automatically takes over system operations with negligible switchover time and becomes active. The automatic switchover from standby to active may fail due to hardware or software issues. In this case, the repairer first switches the standby component over to active non-automatically, then repairs the failed component. Moreover, the repairer may malfunction or fail during its busy period, i.e., while it is repairing a failed component or switching over a standby component non-automatically. When the repairer is not available, its ongoing repair or non-automatic switchover process is interrupted. Once the repairer becomes available again, it resumes the interrupted process.
Let the times-to-failure of the active and the standby components be exponentially distributed with rates λ and μ, respectively. The repair time X is generally distributed with probability density function (PDF) f(x) and cumulative distribution function (CDF) F(x). The automatic switchover is assumed to fail with probability p, and the non-automatic switchover time Y is generally distributed with PDF g(y) and CDF G(y). Moreover, the repairer may fail in its busy period with an exponential failure rate δ. The interrupted time Z is generally distributed with PDF h(z) and CDF H(z).
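The model described above can also be explored by simulation, which gives a rough cross-check when the repair, switchover, and interruption times are not exponential. The following is a minimal event-driven sketch under one reading of the model description; it is not the supplementary-variable analysis used in this paper. The function names, the sampler helpers, and all parameter values are assumptions made for illustration, and the component-state codes 3, 2, 1, 0 stand for "N active and 1 standby", "N active and 1 failed", "non-automatic switchover in progress", and "two failed components".

```python
import math
import random

# Hypothetical sampler factories: each returns a function drawing one duration.
def exponential(mean):
    return lambda rng: rng.expovariate(1.0 / mean)

def deterministic(mean):
    return lambda rng: mean

def weibull2(mean):
    scale = mean / math.gamma(1.5)  # mean of Weibull(shape 2) is scale * Gamma(1.5)
    return lambda rng: rng.weibullvariate(scale, 2.0)

def simulate(n, lam, mu, p, delta, draw_x, draw_y, draw_z, horizon, seed=1):
    """Estimate the long-run fraction of time with n active components."""
    rng = random.Random(seed)
    inf = float("inf")
    t = up = 0.0
    m = 3            # component state: 3, 2 (system up) or 1, 0 (system down)
    work = 0.0       # remaining time of the repair or switchover in progress
    z_left = 0.0     # remaining interruption time
    down = False     # True while the repairer itself is failed
    while True:
        if m == 3:
            rate = n * lam + mu      # any active or the standby may fail
        elif m == 2:
            rate = n * lam           # no standby left, only actives can fail
        else:
            rate = 0.0               # system is shut down: no further failures
        # Exponential clocks are memoryless, so they are redrawn at every step.
        t_fail = rng.expovariate(rate) if rate > 0 else inf
        running = m != 3 and not down
        t_done = work if running else inf
        t_break = rng.expovariate(delta) if running and delta > 0 else inf
        t_back = z_left if down else inf
        dt = min(t_fail, t_done, t_break, t_back)
        if t + dt >= horizon:
            if m >= 2:
                up += horizon - t
            return up / horizon
        t += dt
        if m >= 2:
            up += dt
        if running:
            work -= dt
        if down:
            z_left -= dt
        if dt == t_fail:
            if m == 3:
                active = rng.random() < n * lam / (n * lam + mu)
                if active and rng.random() < p:
                    m, work = 1, draw_y(rng)   # imperfect switchover
                else:
                    m, work = 2, draw_x(rng)   # repair starts at once
            else:
                m = 0                          # second failure; repair continues
        elif dt == t_done:
            if m == 2:
                m = 3                          # repaired component becomes standby
            else:
                m, work = 2, draw_x(rng)       # next repair starts (from m = 1 or 0)
        elif dt == t_break:
            down, z_left = True, draw_z(rng)   # repairer fails; job is interrupted
        else:
            down = False                       # repairer is back; job resumes

if __name__ == "__main__":
    # Illustrative values only; they are not taken from Table 1.
    for name, dist in (("MMM", exponential), ("DDD", deterministic), ("WWW", weibull2)):
        av = simulate(2, 0.01, 0.005, 0.1, 0.02, dist(1.0), dist(0.5), dist(2.0), 50000.0)
        print(name, round(av, 4))
```

The estimate is a time average over a finite horizon, so it carries sampling error; a longer horizon or several seeds would be needed before comparing it with analytical values.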
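For mathematical analysis, we define some supplementary variables. The random process X_(t) denotes the amount of repair time already received by a failed component in repair at time t. We call X_(t) the elapsed repair time. The random processes Y_(t) and Z_(t) denote the elapsed non-automatic switchover time and the elapsed interrupted time, respectively, at time t. We also introduce:

$$\alpha(x) = \frac{f(x)}{1 - F(x)}, \qquad \beta(y) = \frac{g(y)}{1 - G(y)}, \qquad \gamma(z) = \frac{h(z)}{1 - H(z)}.$$

The function α(x) is the PDF for the repair time X on condition that X > x. Note that the function α(x) is called the hazard rate or the age-specific failure rate in renewal theory. The functions β(y) and γ(z) are the hazard rates of the random variables Y and Z, respectively. In this paper, b*(s) is the Laplace transform of a function b(t).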
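As a small illustration of the hazard rate of a generally distributed time, the sketch below evaluates the ratio PDF / (1 - CDF) for two of the distributions used later in the numerical examples. The helper name `hazard` and the rate and scale values are assumptions chosen for illustration. For an exponential time the hazard rate is constant, while for a Weibull time with shape parameter 2 it grows linearly with the elapsed time.

```python
import math

def hazard(pdf, cdf, x):
    """Hazard rate pdf(x) / (1 - cdf(x)); assumes cdf(x) < 1."""
    return pdf(x) / (1.0 - cdf(x))

# Exponential time with rate r (hypothetical value).
r = 2.0
exp_pdf = lambda x: r * math.exp(-r * x)
exp_cdf = lambda x: 1.0 - math.exp(-r * x)

# Weibull time with shape parameter 2 and scale c (hypothetical value).
c = 1.5
wei_pdf = lambda x: (2.0 * x / c ** 2) * math.exp(-((x / c) ** 2))
wei_cdf = lambda x: 1.0 - math.exp(-((x / c) ** 2))

for x in (0.5, 1.0, 2.0):
    # The exponential column stays at r; the Weibull column equals 2 * x / c**2.
    print(x, hazard(exp_pdf, exp_cdf, x), hazard(wei_pdf, wei_cdf, x))
```

A constant hazard rate is what makes the exponential case Markovian; any other shape requires tracking the elapsed time, which is the role of the supplementary variables.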

Availability analysis
The state K(t) of the repairer at time t is:

$$K(t) = \begin{cases} 0, & \text{if the repairer is idle at time } t, \\ 1, & \text{if the repairer is busy at time } t, \\ 2, & \text{if the repairer is in failed state at time } t. \end{cases}$$

Note that when M(t) = 0, the system is unavailable and the repairer, if available, is repairing one of the two failed components; when M(t) = 1, the system is unavailable and the repairer, if available, is switching over non-automatically the standby component to active; when M(t) = 2, the system is available and the repairer, if available, is repairing the failed component; and when M(t) = 3, the system is available and the repairer is idle.

By using supplementary variables, we construct the integro-differential equations governing the steady-state behavior of the model and solve them with the boundary conditions. From (20)-(29), the resulting quantities, including Q_3, can be clearly expressed by Q_2(0). Now we need to find the expression of Q_2(0). It follows from (20)-(22), (25), (26), and (29)-(32), together with the normalization condition, after which P_m, m = 0, 1, 2, are obtained.
Then, the steady-state availability Av_{N+1} can be obtained.

Numerical examples
For numerical examples, we consider three different models: 1 active and 1 standby (1+1); 2 active and 1 standby (2+1); and 3 active and 1 standby (3+1). As shown in Table 1, nine cases are provided for illustration purposes. We consider three different distributions: Exponential (M), Deterministic (D), and Weibull (W) with shape parameter 2. We compare the steady-state availability among the three redundancy models with five different triads of the repair time, the non-automatic switchover time, and the interrupted time distribution: MMM, DDD, DDW, WWD, and WWW, where the notation ABC represents that the repair time distribution is A, the non-automatic switchover time distribution is B, and the interrupted time distribution is C. For example, MMM represents that the three random variables are all exponentially distributed, and WWD represents that the repair time and the non-automatic switchover time follow a Weibull distribution with shape parameter 2 and the interrupted time is deterministic. Note that all parameters can be modified to reflect other situations.
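For the MMM triad all three times are exponential, so the model reduces to a finite continuous-time Markov chain whose balance equations can be solved directly as a sanity check. The sketch below is one reading of the model description and not the supplementary-variable derivation of this paper. The state labels (for example "2b" and "2f" for component state 2 with the repairer working or failed), the function names, and the parameter values are assumptions made for illustration.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda i: abs(m[i][col]))  # partial pivoting
        m[col], m[piv] = m[piv], m[col]
        for i in range(n):
            if i != col:
                f = m[i][col] / m[col][col]
                m[i] = [u - f * v for u, v in zip(m[i], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def availability_mmm(n, lam, mu, p, delta, ex, ey, ez):
    """Steady-state availability when X, Y, Z are exponential with means ex, ey, ez."""
    rx, ry, rz = 1.0 / ex, 1.0 / ey, 1.0 / ez
    states = ["3", "2b", "2f", "1b", "1f", "0b", "0f"]
    rates = {
        ("3", "2b"): (1.0 - p) * n * lam + mu,  # perfect switchover or standby failure
        ("3", "1b"): p * n * lam,               # imperfect switchover
        ("1b", "2b"): ry, ("1b", "1f"): delta, ("1f", "1b"): rz,
        ("2b", "3"): rx, ("2b", "0b"): n * lam, ("2b", "2f"): delta,
        ("2f", "2b"): rz, ("2f", "0f"): n * lam,
        ("0b", "2b"): rx, ("0b", "0f"): delta, ("0f", "0b"): rz,
    }
    idx = {name: i for i, name in enumerate(states)}
    k = len(states)
    a = [[0.0] * k for _ in range(k)]
    for (src, dst), q in rates.items():
        a[idx[dst]][idx[src]] += q   # probability flow into dst
        a[idx[src]][idx[src]] -= q   # probability flow out of src
    a[-1] = [1.0] * k                # replace one balance equation by normalization
    pi = solve(a, [0.0] * (k - 1) + [1.0])
    return pi[idx["3"]] + pi[idx["2b"]] + pi[idx["2f"]]

if __name__ == "__main__":
    # Illustrative values only; they are not taken from Table 1.
    for n in (1, 2, 3):
        print(n, availability_mmm(n, 0.01, 0.005, 0.1, 0.02, 1.0, 0.5, 2.0))
```

Summing the probabilities of states "3", "2b", and "2f" reflects the assumption that the system is available exactly when there are N active components, whether or not the repairer is working.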
Table 2 shows the effect of parameter λ on the steady-state availability for the three models with five different triads of the repair time, the non-automatic switchover time, and the interrupted time distribution, under the numerical environment given in Case 1. Table 3 shows the effect of parameter δ on the steady-state availability for Case 2, for the 1+1 model and for the 2+1 and 3+1 models. Table 4 shows the effect of parameter E(Z) on the steady-state availability for Case 3, for the 1+1, 2+1, and 3+1 models. Tables 5, 7, and 9 show the effect of p, E(X), and E(Y), respectively, on the steady-state availability under the numerical environments given in Cases 4, 6, and 8.

As expected, we also find in the tables that Av_{1+1} > Av_{2+1} > Av_{3+1} for all parameter values given in Cases 1-9.

Conclusions
By using supplementary variables and integro-differential equations, we have obtained the analytical expression of the steady-state availability for the N+1 series redundancy model with imperfect switchovers, generally distributed repair times, and interrupted repairs. Numerical examples have been provided for the 1+1, 2+1, and 3+1 models.
The drawback of this paper is that, because of the non-Markovian assumptions, it focuses only on computing the steady-state availability. Although the study of steady-state availability is important for understanding the characteristics of redundancy models, it is more interesting to have the system availability at any time than the steady-state availability. However, it is difficult to obtain a transient solution in explicit form for the N+1 model because of the complex structure due to the non-Markovian assumptions.

Let M(t) and K(t) be the state of the N+1 components and the state of the repairer, respectively, at time t:

$$M(t) = \begin{cases} 0, & \text{if there are } N-1 \text{ active and 2 failed components at time } t, \\ 1, & \text{if there are } N-1 \text{ active, 1 standby, and 1 failed component at time } t, \\ 2, & \text{if there are } N \text{ active and 1 failed component at time } t, \\ 3, & \text{if there are } N \text{ active and 1 standby component at time } t. \end{cases}$$

Table 1. Values of parameters

Table 2. Steady state availability versus λ for Case 1

Table 3. Steady state availability versus δ for Case 2

Table 4. Steady state availability versus E(Z) for Case 3

Table 5. Steady state availability versus p for Case 4

Table 7. Steady state availability versus E(X) for Case 6

Table 8. Steady state availability versus E(X) for Case 7

Table 9. Steady state availability versus E(Y) for Case 8

The analysis of transient availability may constitute a challenging research topic and draw research interest. Steady-state analysis of the availability of N+M redundancy models with more than one standby component is also not an easy task. Further studies are necessary in order to obtain the transient availability of N+M redundancy models.

Acknowledgement
This work was partially supported by the ICT R&D program of MSIP/IITP [R0101-16-0070, Development of The High Availability Network Operating System for Supporting NonStop Active Routing] and by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2017R1A2B1009504).

Table 10. Steady state availability versus E(Y) for Case 9