Optimizing preventive replacement schedule in standby systems with time consuming task transfers

https://doi.org/10.1016/j.ress.2020.107227

Highlights

  • Warm standby systems with predetermined preventive replacements are considered.

  • The replacement process includes task transfer during which two elements must operate.

  • The replacement time depends on the amount of work completed before the replacement.

  • The mission succeeds if all its parts are completed and no operating elements fail.

  • The optimal preventive replacement scheduling problem is solved.

Abstract

In many industrial and technological systems, due to factors such as deterioration, corrosion, attacks, etc., preventive replacement is typically performed according to a predetermined schedule to renew the worn or aged online element using a standby element, enhancing the mission success probability. This paper models such a standby system with its mission time being divided into a certain number of mission parts (MPs). During different MPs, according to a pre-specified sequence, different system elements are activated to perform the mission operation. Upon completing each MP, following the preventive element replacement, a time-consuming transfer procedure must be conducted to start the new MP. The mission is successful when the last MP can be accomplished by an available system element. An event transition-based method is proposed to evaluate the mission success probability (MSP). The optimal preventive replacement scheduling problem is then solved, which finds the number of MPs and their durations maximizing the MSP. In the case of heterogeneous system elements, a combined optimization problem that finds the optimal element activation sequence and the optimal preventive replacement schedule to maximize the MSP is formulated and solved. Effects of several model parameters (the number of MPs, the number of system elements, task transfer time) are investigated through examples.

Introduction

In many industrial and technological systems (e.g., power, aviation, healthcare, railway), it is a common practice to perform preventive maintenance to remove or reduce accumulated deterioration of system elements, thus improving the overall system reliability [1], [2], [3]. The preventive maintenance can be triggered based on a certain pre-determined schedule or by the presence of some system condition [4,5]. In this work, we focus on the scheduled preventive maintenance for systems with standby elements.

There is a rich body of work on modeling and optimizing the preventive maintenance (PM) policy. For example, in [6], a reliability-based periodic PM model was proposed for systems with deteriorating elements. In [7], the optimal PM scheduling problem was solved for a composite power system, balancing maintenance, reliability, and failure costs. In [8], the optimal PM policy was determined for a k-out-of-n system considering the threshold number of malfunctioning elements to avoid the entire system failure. In [9], the optimal periodic PM policy was investigated for a single-element system working under time-varying operating conditions modeled by a continuous-time Markov process. In [10], the optimal PM policy was investigated for a parallel system subject to common-cause failures. In [11], a hybrid age-based and condition-based PM policy was proposed for systems operating in random environments undergoing Poisson-distributed external shocks. In [12], periodic PM and warranty policies were co-optimized for repairable products. In [13], the optimum preventive replacement interval was studied for a parallel redundant system with a damage self-healing mechanism. In [14], three different age-based PM models (replace first, replace last, and replace next) were studied and compared.

Recently, some researchers have investigated preventive maintenance planning for standby systems, where one or multiple elements are initially online and operating while extra elements stay in the standby mode; when an online element fails, an available standby element is activated to take over the mission task [15], [16], [17]. For example, in [18], a degradation level-based PM model was suggested for a two-unit standby cooling system. In [19], the optimal PM interval was studied for a two-unit priority standby system to maximize the expected profit value per unit time. In [20], a backup-based PM policy was modeled and optimized for a standby system undergoing periodic backups (checkpointing), preventive replacements, and corrective replacements. In [21], the model of [20] was extended by allowing different overheads (cost and time) incurred by corrective and preventive replacement actions. In [22], a standby system undergoing random inspections and state-based PM was modeled and optimized. In [23], the periodic inspection and PM policies were investigated to maintain degrading standby elements in acceptable conditions, maximizing the overall system reliability. In [24], a shock-based preventive replacement policy was studied for a heterogeneous standby system to maximize the mission availability over a finite time horizon.

Despite the rich literature on modeling and optimizing PM policies, to the best of our knowledge, none of the existing works has explicitly addressed the task transfer incurred by the preventive replacement. The task transfer often takes a significant amount of time and may fail due to the malfunction of the system elements involved, greatly affecting the system reliability. For example, consider cargo that must be transported over a certain distance. The transportation units (e.g., ships) deteriorate and can fail during the transportation, which causes the loss of the cargo (the ship sinks). To reduce the failure probability, the route is divided into several parts, and different units cover different parts of the route (as in a relay race). To change units, a time-consuming, failure-prone cargo reloading procedure must be performed. Several units can wait for the cargo at each reloading point (port) in the standby mode. When one of the standby units turns out to be unavailable, the next one can take over the transportation task. The mission succeeds when the cargo is delivered to the destination point.

As another example, a product must undergo a technological process in reactors operating in a corrosive environment. Each reactor can deteriorate during the process, and the failure of a reactor causes the loss of the product. To reduce the time during which each reactor is exposed to the corrosive environment, the process is divided into several stages performed in different reactors. Between stages, the product must be reloaded from the previous reactor to the next one. The amount of product can change during the process; thus, the reloading time depends on the elapsed process time.

Time-consuming task transfers also take place in distributed systems performing computational tasks. Each processor performing the task attracts the attention of attackers and can be corrupted with a probability that increases with operation time. The intensity of attacks increases with the time elapsed since the beginning of the attack, as more attackers obtain information about the operating processor's location. The idle processors can also be attacked and corrupted, but with lower probabilities. To increase the chance of task completion, the software migrates among processors. The processor completing its part of the task must transfer the software and/or the produced data to the next one. The amount of data to be transferred depends on the number of computation operations performed since the beginning of the mission. If any operating processor is corrupted during processing or data transfer, the computational task fails.

In this paper, we explicitly model the time-consuming task transfer procedure in an event transition-based reliability analysis of standby systems undergoing preventive replacements per a predetermined mission time schedule. Based on the reliability (mission success probability) evaluation, we make a further contribution by solving the optimal preventive replacement scheduling problem to maximize the mission success probability (MSP). In the case of heterogeneous standby elements, the element activation sequence matters. Therefore, we also solve a joint optimization problem that finds the optimal replacement schedule and element activation sequence, maximizing MSP.

The remainder of the paper is organized as follows. Section 2 presents the standby system model and assumptions made by the proposed solution method. Section 3 presents the MSP evaluation method for the considered system. Section 4 describes the optimal element replacement scheduling problem and the optimization method. Section 5 presents examples to illustrate the proposed method and optimization. Section 6 concludes the paper and points out a few directions for future research.

Section snippets

System model

The system consists of K heterogeneous elements characterized by increasing failure rates. It has to perform a mission that requires time T of operation. To reduce the mission failure probability, preventive element replacements are implemented at predetermined time instants. Specifically, the mission time is divided into J parts where different elements perform the mission operation in a predetermined sequence. The duration of the jth mission part (MP) is $\tau_j$, such that $\sum_{j=1}^{J}\tau_j = T$. When a

Evaluation of MSP

An event transition-based method is suggested in this section to evaluate the successful completion probability of a multi-phase mission with each phase/MP performed by a different element and time-consuming task transfers in-between MPs.
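The event transition-based method itself is not reproduced in this snippet. As a rough illustration of why the MP durations and the transfer time both affect the MSP, the following sketch multiplies per-element reliabilities under strong simplifying assumptions that are not taken from the source: identical elements with Weibull lifetimes (scale `eta`, shape `beta`), a constant transfer time, no deterioration in standby, and no fallback to another standby element. All function and parameter names are hypothetical.

```python
import math


def weibull_reliability(t, eta, beta):
    """Probability that a new element survives operating time t (Weibull assumption)."""
    return math.exp(-((t / eta) ** beta))


def simplified_msp(durations, transfer_time, eta, beta):
    """Product-of-reliabilities estimate of the mission success probability.

    Assumptions (not from the source): every element is new when activated,
    standby elements never fail, and each transfer takes a constant time
    during which both the outgoing and the incoming element must operate.
    """
    last = len(durations) - 1
    msp = 1.0
    for j, tau in enumerate(durations):
        receive = transfer_time if j > 0 else 0.0      # transfer into element j
        hand_over = transfer_time if j < last else 0.0  # transfer out of element j
        msp *= weibull_reliability(receive + tau + hand_over, eta, beta)
    return msp


if __name__ == "__main__":
    # One element for the whole mission versus two elements with a transfer.
    print(simplified_msp([100.0], 5.0, eta=100.0, beta=2.0))
    print(simplified_msp([50.0, 50.0], 5.0, eta=100.0, beta=2.0))
```

Under these assumptions, splitting the mission helps when failure rates increase (`beta > 1`), while a longer transfer time erodes the gain because two elements are exposed during each transfer.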

Optimal replacement scheduling

To obtain the optimal element replacement schedule, which includes the number of MPs and their durations, we apply the genetic algorithm (GA) heuristic [29,30]. The GA requires representing solutions in the form of strings. Given the number J of MPs, the optimal replacement scheduling problem becomes the optimal MP durations problem and the solution can be encoded by a string consisting of J integer numbers $(x_1,\ldots,x_J)$ ranging from 0 to H each. The duration $\tau_j$ is determined as $\tau_j = T x_j / \sum_{i=1}^{J} x_i$. It
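The decoding rule above can be sketched in a few lines of Python. The function names are hypothetical, and two choices are assumptions rather than statements from the source: an all-zero string is rejected, and a zero entry simply yields a zero-length MP (how the paper treats zero entries is not visible in this snippet).

```python
from itertools import accumulate


def decode_schedule(x, T):
    """Map an integer string (x_1, ..., x_J) to MP durations tau_j = T * x_j / sum(x)."""
    total = sum(x)
    if total == 0:
        # Assumption: an all-zero string does not encode a valid schedule.
        raise ValueError("all-zero string encodes no schedule")
    return [T * xj / total for xj in x]


def mp_end_times(durations):
    """Cumulative mission time completed at the end of each MP (transfer time excluded)."""
    return list(accumulate(durations))


if __name__ == "__main__":
    tau = decode_schedule([2, 0, 3, 5], T=100.0)
    print(tau)                # MP durations, summing to T
    print(mp_end_times(tau))  # mission time completed after each MP
```

Because the durations are normalized by the sum of the string, any string with at least one positive entry decodes to a schedule whose durations sum to T, so the GA does not need a separate feasibility repair step for that constraint.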

Illustrative example

Consider a tank that must contain a pressurized aggressive liquid during time T = 100. The liquid can cause corrosion of the tank, leading to penetration of its shell and serious damage to the environment. To reduce the risk of such damage, the liquid is periodically transferred to other tanks. During the transfer the pressure is reduced, which creates milder conditions for the tank. The empty tanks waiting in the standby mode can also be corroded by the ambient factors (air humidity, pollution

Conclusion and future directions

This paper evaluates and optimizes the success probability of a mission composed of multiple MPs performed by different system elements. At the end of each MP, a preventive element replacement is performed followed by a time-consuming task transfer to the newly activated system element before starting the next MP. An event transition-based method has been suggested to evaluate the MSP. Applying the GA heuristics, the optimal preventive replacement scheduling problem has been solved to determine

Author statement

The paper has been revised according to the reviewers’ comments.

Declaration of Competing Interest

There is no conflict of interest associated with this paper.
