Integrated Optimization of Design, Storage Sizing, and Maintenance Policy as a Markov Decision Process Considering Varying Failure Rates

Various strategies can be applied to improve reliability at certain costs, including equipment redundancy, product storage, and maintenance, which gives rise to the problem of optimally allocating the reliability improvement costs among the strategies and balancing them against the potential losses due to unavailability. Motivated by the reliability concerns of air separation units, we use a Markov Decision Process (MDP) to model the stochastic dynamic decision-making process of condition-based maintenance, assuming bathtub-shaped failure rate curves for single units. The MDP is then embedded into a non-convex MINLP (DMP) that considers the trade-offs among all the decisions. An initial attempt to directly solve the MINLP (DMP) for a mid-sized problem with several global solvers reveals severe computational difficulties. In response, we propose a custom two-phase algorithm that greatly reduces the required computational effort. The algorithm also shows consistent performance over randomly generated problems around the original example of 4 processing stages and over problems of larger size.

For air separation units that supply customers via pipeline, reliability is at an even higher stake, since an interruption of the pipeline supply will also result in production interruption at the customer site.
If we focus on the system design and the reliability of the process itself, there is a clear trade-off between higher expected availability of the process and higher capital investment for backup units.
A few works have addressed this trade-off. Pistikopoulos (1994, 1995) incorporates the reliability index of each unit as part of process design optimization considering flexibility and reliability. Aguilar et al. (2008) address reliability in utility plant design and operation by considering a few pre-specified redundancy selection alternatives and failure scenarios. Ye et al. (2018) propose a general mixed-integer programming framework for the optimal selection of redundant units.
Another strategy to improve product availability for air separation units is to provide buffer storage of the liquefied products, which can be evaporated to sustain the pipeline supply when the air separation unit fails. This strategy incurs the costs of building the tanks and maintaining the stock, which increase with the size of the storage tanks (Terrazas-Moreno et al., 2010). Our previous work (Ye et al., 2020) has looked into the exact modeling of the stochastic process of liquid storage consumption.
Quantifying and optimizing the maintenance effort after the commissioning of a plant is a well-recognized topic in industry (Van Rijn, 1987) as well as in academia. Tan and Kramer (1997) present a general framework for preventive maintenance optimization utilizing Monte Carlo simulation and genetic algorithms, overcoming certain drawbacks of analytics-based and Markov-based methods. With regard to the timing and resource allocation of maintenance tasks, Pistikopoulos et al. (2001) optimize maintenance alongside normal production tasks. Cheung et al. (2004) address a short-term scheduling problem of plant shutdown, overhaul, inspection, and startup actions within one plant. Amaran et al. (2016) focus on optimizing the maintenance turnaround planning of integrated chemical sites for minimum cost, accounting for workload and manpower uncertainties. Achkar et al. (2019) optimize the scheduling of general maintenance tasks on oil and gas wells and surface facilities.
However, only a limited number of works have been reported on the simultaneous optimization of design and maintenance schedules (Nápoles-Rivera et al. (2013), Liang and Chang (2008), Godoy et al. (2015), Wibisono et al. (2014)), especially ones that model the stochastic process of failure and maintenance in detail (Redutskiy, 2017). Another example of the latter is our previous work (Ye et al., 2019), where the process is modeled with a Markov Chain, the failure rates are considered constant over the entire horizon, and the rates are subject to the inspection frequency decision. That paper showed that incorporating maintenance decisions into the economic trade-off can have a major impact on the overall cost. As an improvement, in this work we incorporate the more realistic assumption that the failure rates of single units vary with time, and we model the condition-based maintenance decisions as the action policy of a Markov Decision Process, largely following Amari et al. (2006); the latter work, however, only considers a single unit and has no design component.
The Markov Decision Process is a one-step look-ahead framework for dynamic decision making under uncertainty that was first proposed by Bellman (1957). It has since received wide interest and has had broad applications (White, 1993), such as inventory management (Ahiska et al. (2013), Yin et al. (2002)), planning and scheduling (Shin et al., 2017), investment (Bäuerle and Rieder, 2011), and maintenance (Byon and Ding (2010), Chen and Trivedi (2005)), as well as the rapidly growing area of reinforcement learning (Powell, 2004), where the MDP is used as the basic framework for describing the behavior of various systems, including human beings. The optimality condition of a Markov Decision Process is given by the Bellman equation. Basic ways of finding a solution that satisfies the Bellman equation include linear programming, policy iteration, and value iteration. In this work, we employ a slightly modified version of the linear programming formulation, which is mixed-integer and has a more specific objective function to suit our need for simultaneous design and operations optimization.
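To make these concepts concrete, the following is a minimal value-iteration sketch for a generic finite MDP; the data layout and function name are illustrative and are not taken from this paper.

```python
# Minimal value iteration for a finite MDP; illustrative only.
# P[s][a][s2] is the transition probability, R[s][a][s2] the instant reward.
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-8):
    V = {s: 0.0 for s in states}
    while True:
        V_new = {
            s: max(
                sum(P[s][a].get(s2, 0.0) * (R[s][a].get(s2, 0.0) + gamma * V[s2])
                    for s2 in states)
                for a in actions[s]
            )
            for s in states
        }
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new  # satisfies the Bellman equation up to tol
        V = V_new
```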
In section 2, we describe the problem scope, including the decisions to be made and their correspondence to the modeling components. Section 4 presents the mathematical model: we start by introducing the basics of the Markov Decision Process in section 4.1; then, in section 4.2, we propose the MINLP model described in the last paragraph. In order to keep the model tractable, each processing stage is modeled as an MDP, and the stage interdependency is captured via functional relationships of the MDP parameters. In section 4.3, we describe the interaction between the MDPs of all processing stages and its impact on the entire system.
In section 5.1, we propose an exact linearization of the bilinear terms in the model. Furthermore, in section 5.2, we show a reformulation of the objective function that helps to tighten the objective bounds.
In section 6, we present an example with 4 processing stages to discuss how to determine the basic parameters of the MDPs based on the reliability and price specifications of individual units, and we show that directly solving the original MINLP model faces considerable computational difficulties. This motivates us, in section 7, to propose an algorithm with two phases: Enumeration and Bounding, and Rewards Iteration. The objective reformulation proposed in section 5.2 also supports the proof of a proposition that establishes the validity of the algorithm. Finally, in section 8, we show that the proposed algorithm is efficient on the example introduced in section 6 as well as on 20 problems randomly generated around it. The performance of the algorithm on larger problems is slightly less stable, as reflected by solving another group of 20 randomly generated problems with 6 processing stages.

Problem statement
The decisions to be made include the selection of redundant units and the sizes of storage tanks, as well as basic maintenance policies for a chemical process. As our motivating example, the general flowsheet of an air separation unit is shown in Figure 1, where the failure of one stage results in the failure of the entire process. Following common practice in the air separation industry (Linde, plc (2020), Chart (2020)), the sizes of the storage tanks are selected from a few discrete standard options. If more redundant units are selected for critical processing stages, such as compressors, the plant will be less likely to fail. Also, as discussed in the introduction, the liquid oxygen and liquid nitrogen can be evaporated to sustain the pipeline supply when the air separation unit fails, and the larger the storage tanks, the longer the downtime periods that can be covered. Furthermore, in the operation phase, placing more effort into maintenance can increase the process reliability.
Each pipeline interruption incurs a fixed amount of penalty, which is to be balanced against the costs of the above strategies for increasing availability. To address this problem, we propose a mixed-integer programming model based on the Markov Decision Process framework. Figure 2 shows the conceptual modeling structure. Assuming that the base flowsheet of the process is given, the exact design and maintenance policy decisions are described in sections 2.1 and 2.2, respectively.

Design decisions
There are two major design decisions:
• The number and selection of redundant units of different prices and reliability specifications for each processing stage. The binary variable $z_{k,h} = 1$ indicates that in processing stage k (e.g., the compressor stage), design h (e.g., compressor 1 and compressor 3, each of 100% capacity) is selected.
• The sizes of end-product storage tanks. The binary variable $x_{p,n} = 1$ means that for product p (e.g., nitrogen), tank size n (e.g., 100,000 gallons) is selected.

Maintenance policy decisions
The bathtub curve shown in Figure 3(a) is widely accepted as describing how the failure rate of a unit varies with time (Henley and Kumamoto (1981), Barlow and Proschan (1975), etc.). We propose to capture the bathtub-like deterioration/failure process of the equipment and the condition-based maintenance with a discrete-time Markov Decision Process. As shown in Figure 3(b), the bathtub curve is discretized into three "working states" of the unit: 1. Infant, 2. Stable, and 3. Worn-out.
When a unit is not working, there are three other possible states: 4. Stand-by, 5. Stopped, and 6. Failed. At each state, there is a finite number of possible actions: Inspect-$T^{ins}_m$ (which means to carry out an inspection after $T^{ins}_m$ days, where $m \in M$ indexes the possible inspection intervals), Stop, Maintain, or Repair. Exactly one action is to be assigned to each state, which is the main decision to make in terms of the maintenance policy.

Figure 4 shows the state space and action space of a processing stage with only one unit. In the Infant state, no action is taken except waiting for the unit to either fail or proceed to the Stable state (blue arcs). In the Stable state and the Worn-out state, the unit can be stopped (green arcs), which leads to the Stopped state with probability 1. From the Stable state, the action can also be Inspect-$T^{ins}_m$, $m \in M$ (blue arcs), in which case the next state will be revealed at the next inspection, and the probability of reaching each of the possible next states depends on the inspection interval $T^{ins}_m$. In the Stopped state and the Failed state, the only applicable actions are to maintain (the yellow arc) or to repair (the red arc), respectively, which leads back to the Stable state if the unit is needed instantly, or to the Stand-by state if the other units in the stage are working properly. With one unit, the Stand-by state is always skipped. When there are multiple units, there is no choice of action at the Stand-by state, denoted as None, which leads back to the Infant state with probability 1. That is because the only transition out of a Stand-by state is to the Infant state, and it is always accompanied by the stopping or failing of other units, which are already accounted for. The Stand-by state being followed by the Infant state accounts for the higher likelihood of failure of a unit after a period of idling.
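As a concrete illustration of this state and action space, the snippet below encodes the single-unit case of Figure 4; the inspection intervals are hypothetical values, and arcs we could not recover from the figure are noted in comments.

```python
# Illustrative encoding of the single-unit state/action spaces of Figure 4.
T_INS = [15, 90, 365]  # candidate inspection intervals in days (assumed values)

STATES = ["Infant", "Stable", "Worn-out", "Stand-by", "Stopped", "Failed"]

ACTIONS = {
    "Infant":   ["None"],          # wait for failure or transition to Stable
    "Stable":   [f"Inspect-{t}" for t in T_INS] + ["Stop"],
    "Worn-out": ["Stop"],          # Figure 4 may carry further arcs (e.g. a wait/None action)
    "Stand-by": ["None"],          # no choice; leads back to Infant w.p. 1
    "Stopped":  ["Maintain"],      # back to Stable, or Stand-by with redundancy
    "Failed":   ["Repair"],        # back to Stable, or Stand-by with redundancy
}
```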

Mathematical Model
As mentioned above, we propose to model the process with an MDP (Markov Decision Process), in preparation for which the state space representation is formulated as in the example shown in Figure 4. Arguably, the exact modeling of a system with multiple processing stages would require construction of the entire joint state space and all state-action pairs, which becomes intractable as the number of stages and potential redundant units increases, as shown in Table 1. Therefore, we take a simplified approach: each processing stage is modeled as its own MDP, and the stage interdependency is captured via functional relationships of the MDP parameters. In section 4.1, we introduce the basic principles of MDPs, especially the optimality condition and how to compute the stationary distribution of the reduced Markov Chain given the optimal actions, which constitute the theoretical foundation of this paper and of the mathematical programming model that follows. In section 4.2, we derive the optimization formulation of the MDP of each stage. In section 4.3, the interaction between the MDPs of the processing stages is modeled.
Then, in section 5, we propose to improve the model formulation by linearizing each bilinear term into a set of linear constraints, and by reformulating the objective function into an expression with tighter lower bounds for the relaxations that arise when dropping the discrete requirements. However, we will show later that directly solving the original MINLP model still faces considerable computational difficulties, which motivates us to propose a customized two-phase algorithm. The objective reformulation proposed in section 5 also helps with the proofs of the supporting propositions of the algorithm.

Figure 5: In the Markov Decision Process shown, actions A1 or A2 can result in different transition probabilities P between states S1 and S2, and the rewards R depend on the current state, the action, and the next state.

Markov Decision Process and the Optimality Condition
A policy δ of a Markov Decision Process is a mapping from its state space S to the action space A:

$$\delta : S \to A$$

For the example in Figure 5, a possible policy is δ(S1) = A1, δ(S2) = A2, in which case the transition diagram becomes as shown in Figure 6, which reduces back to a Markov Chain. Under a given policy δ, a value function is defined for each state s as in equation (1). The first term is the reward associated with s and its designated action according to δ. The second term is the weighted sum of the value functions of the possible next states s′ ∈ S, discounted by a future factor γ. In this paper, we assign a 10% discount factor for γ.
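Equation (1) is not reproduced legibly in this copy; a standard form consistent with the two-term description above is

$$V_\delta(s) = R\big(s, \delta(s)\big) + \gamma \sum_{s' \in S} P\big(s, \delta(s), s'\big)\, V_\delta(s') \qquad (1)$$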
With the reward function R, the transition matrix P, and the discount factor γ given, the optimal policy for the maximum overall value $V^*_s$ is defined as in equation (2), where $A_s$ is the set of possible actions for state s.
In our case, the instant rewards R(s, a, s′) are the negatives of the costs. Therefore, we let $U_s = -V_s$ and flip the sign of the conventional expression. We then remove the negative sign before R(s, a, s′) and let R(s, a, s′) be the instant cost of going from state s to state s′ by action a. We still refer to it as the instant reward to distinguish it from the breakdown cost parameters. With that, (3) is equivalent to (2).
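Equations (2) and (3) are likewise garbled in this copy; standard forms consistent with the description are

$$V^*_s = \max_{a \in A_s} \sum_{s' \in S} P(s,a,s')\big[R(s,a,s') + \gamma\, V^*_{s'}\big] \qquad (2)$$

$$U^*_s = \min_{a \in A_s} \sum_{s' \in S} P(s,a,s')\big[R(s,a,s') + \gamma\, U^*_{s'}\big] \qquad (3)$$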
The equivalent linear programming model is shown in (4) (d'Epenoux (1963), Puterman (2014)), where $c_s$ are arbitrary positive weights. With $P(s, a, s')$ and $R(s, a, s')$ given, problem (4) solves for the optimal value of each state and the corresponding policy. Notice that if the optimal policy selects action a for state s, then the optimal dual variable of the constraint for the pair (s, a) is greater than 0; otherwise, it is 0. As mentioned above, the weights $c_s$ in the objective function do not affect the optimal actions in the above LP model. However, in order for the objective function to reflect the operational cost, the weights are set equal to the stationary probability distribution $\pi^*_s$ of the reduced Markov Chain under the optimal policy, where $\pi^*_s$ satisfies the following constraints.
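The LP (4) and the stationary distribution conditions (5)-(6) do not survive legibly in this copy; standard forms consistent with the description (here in the cost orientation of (3)) are

$$\max_{U}\ \sum_{s \in S} c_s\, U_s \quad \text{s.t.}\quad U_s \le \sum_{s' \in S} P(s,a,s')\big[R(s,a,s') + \gamma\, U_{s'}\big], \ \ \forall s \in S,\ a \in A_s \qquad (4)$$

$$\pi^*_{s'} = \sum_{s \in S} \pi^*_s\, P\big(s, a^*(s), s'\big), \quad \forall s' \in S \qquad (5)$$

$$\sum_{s \in S} \pi^*_s = 1 \qquad (6)$$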
Below we show how to enforce the LP formulation (4), which minimizes the operational cost, as well as the constraints (5) and (6), in the overall optimization model.

Mixed integer programming formulation based on the MDP optimal condition of each stage
As indicated in the problem statement, we index the processing stages by k, the potential designs of individual stages by h, the products by p, and the storage tank sizes by n. The binary variable $z_{k,h}$ indicates the selection of design h of stage k, while the binary variable $x_{p,n}$ indicates the selection of tank size n for product p. These binary variables satisfy equations (7) and (8).
Based on the above, we define the binary variable $y_{p,n,k,h}$, which indicates the overall design decision and satisfies the logical relationship (9), whose logical clauses are true when design h of stage k and size n of product p are selected simultaneously.
Another way of expressing the above logical relationship is (10) and (11), which can be directly translated into the algebraic constraints (12) and (13); these are tighter than those translated from (9) (see Castro et al. (2008)).

The state space of design h in stage k is denoted as $S_{k,h}$; similarly, the transition probability matrix is denoted as $P_{k,h}(s, a, s')$. The constraint stated in the general formulation (4) is adapted into constraint (14), which involves the design decision $y_{p,n,k,h}$. If $\sum_{p \in P, n \in N_p} y_{p,n,k,h} = 0$, design h in stage k is not selected; the first term on the right-hand side of (4) is then zero, making $v_{k,h}(s) = 0,\ s \in S_{k,h}$ a feasible solution, which matters because $v_{k,h}(s)$ is minimized in the objective function below.

The objective function is shown in (15). It minimizes the capital cost of the selected processing units ($C^U_{k,h}$) and storage tanks ($C^T_{p,n}$), plus the weighted state values of the MDPs ($\pi_{k,h} \cdot v_{k,h}$), which accounts for both the operational costs and the reliability values.

Notice that in the original LP form shown in (4), the weighted sum of the value functions ($v_{k,h}(s)$ in the model) is maximized in the objective function, which enforces equality at the action that yields the minimum value. However, as shown in (15), the weighted sum of the value functions has to be minimized together with the investment costs. Therefore, we need another set of constraints to make sure that $v_{k,h}(s)$ does not become unbounded. In disjunction (16), the Boolean variable $W_{k,h}(s, a)$ is true if action a is the optimal action for state s, in which case it enforces that $v_{k,h}(s)$ be greater than or equal to the right-hand side.
It can be translated via a big-M reformulation into the algebraic constraint (17).
Logic condition (18) requires that if design h is not selected, meaning $z_{k,h} = 0$, then no action for that design is valid. Constraint (19) is the corresponding algebraic form of (18).
$\pi_{k,h}(s)$ is the stationary probability of state s in the state space of the MDP of design h in stage k, which satisfies (20), where $a^*(s)$ is the optimal action at state s.
To represent (20) by algebraic inequalities, we define $\pi^{sub}_{k,h}(s, a)$ as the disaggregated variables of $\pi_{k,h}(s)$ with respect to action a, as shown in (21); their values are related to the Boolean variables $W_{k,h}(s, a)$ as shown in (22), which translates into the algebraic constraint (23).
Furthermore, as shown in (25), we require that $\pi_{k,h}(s)$ be 0 if design h is not selected for stage k, and that $\pi_{k,h}(s)$ sum up to 1 otherwise. This condition can be expressed with the algebraic constraint shown in (26).
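To illustrate how constraints of this shape might be written in practice, here is a minimal Pyomo sketch of the Bellman upper-bounding constraint, the big-M lower-bounding constraint, and the action-selection logic for a single stage with the design fixed; the state space, data, and big-M value are toy assumptions, not the paper's.

```python
import pyomo.environ as pyo

GAMMA, BIG_M = 0.9, 1e4   # discount factor and an assumed big-M bound
S = ["Stable", "Stopped", "Failed"]
SA = [("Stable", "Inspect"), ("Stable", "Stop"),
      ("Stopped", "Maintain"), ("Failed", "Repair")]
# Toy transition probabilities P[(s, a)] -> {s2: prob} and instant costs R[(s, a)]
P = {("Stable", "Inspect"):   {"Stable": 0.95, "Failed": 0.05},
     ("Stable", "Stop"):      {"Stopped": 1.0},
     ("Stopped", "Maintain"): {"Stable": 1.0},
     ("Failed", "Repair"):    {"Stable": 1.0}}
R = {("Stable", "Inspect"): 1.0, ("Stable", "Stop"): 0.5,
     ("Stopped", "Maintain"): 2.0, ("Failed", "Repair"): 10.0}

m = pyo.ConcreteModel()
m.v = pyo.Var(S, within=pyo.NonNegativeReals)   # state values v_{k,h}(s)
m.w = pyo.Var(SA, within=pyo.Binary)            # action selection W_{k,h}(s, a)

# Analogue of (14): v(s) bounded above by every action's one-step value
def bell_ub(m, s, a):
    return m.v[s] <= sum(p * (R[s, a] + GAMMA * m.v[s2]) for s2, p in P[s, a].items())
m.bell_ub = pyo.Constraint(SA, rule=bell_ub)

# Analogue of the big-M constraint (17): equality enforced at the chosen action
def bell_lb(m, s, a):
    rhs = sum(p * (R[s, a] + GAMMA * m.v[s2]) for s2, p in P[s, a].items())
    return m.v[s] >= rhs - BIG_M * (1 - m.w[s, a])
m.bell_lb = pyo.Constraint(SA, rule=bell_lb)

# Exactly one action is assigned to each state
def one_action(m, s):
    return sum(m.w[s2, a] for (s2, a) in SA if s2 == s) == 1
m.one_action = pyo.Constraint(S, rule=one_action)

# In the full model the objective weighs v by the stationary distribution pi;
# here a placeholder objective keeps the sketch a complete model.
m.obj = pyo.Objective(expr=sum(m.v[s] for s in S), sense=pyo.minimize)
```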

Stage interdependency
As mentioned in section 2, the processing stages impact each other's performance. In particular, a stage k can only transition into its next state when the other stages are not in the Failed state.
Therefore, as shown in equation (27), the reward of transitioning from state s into another state s′, $R_{p,n,k,h}(s, a, s')$, is corrected with the fraction of time during which the other stages are available, based on the reward parameters of stage k standing alone, $R^{idv}_{k,h}(s, a, s')$ and $R^{idv\,pen}_{p,n,k,h}(s, a, s')$. Notice that $R^{idv}_{k,h}(s, a, s')$ has to be evenly distributed over the products.
Here, $t^{ratio\,A}_k$ equals the probability-weighted time spent in the available states divided by the probability-weighted time length of all states of stage k, as shown in (28), where $t^{res}(s, a)$ is the expected residence time in state s when taking action a.
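Equation (28) is not legible in this copy; a form consistent with the verbal description, writing $S^A_k$ for the set of available states of stage k and suppressing the design index h, is

$$t^{ratio\,A}_k = \frac{\displaystyle\sum_{s \in S^A_k} \pi_k(s)\, t^{res}\big(s, a^*(s)\big)}{\displaystyle\sum_{s \in S_k} \pi_k(s)\, t^{res}\big(s, a^*(s)\big)} \qquad (28)$$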

Reformulations
As will be shown in this section, the model can be improved through several reformulation steps. In section 5.1, we perform an exact linearization of the bilinear terms in the model, and in section 5.2, we propose to reformulate the objective function taking advantage of the functional relationships specified by the constraints.

Standard linearization of the bilinear constraints
We use Glover's linearization scheme (Glover, 1975), a special case of the McCormick envelopes (McCormick, 1976), to linearize equations (14) and (17), which also involves the reformulation of equation (27). We first introduce a new variable $R^{pen}_{p,n,k,h}(s, a, s')$ and rewrite the three constraints mentioned above as (29)-(31). We then replace the bilinear term $y_{p,n,k,h}\, R^{pen}_{p,n,k,h}(s, a, s')$ with the new variable $R^{y\,pen}_{p,n,k,h}(s, a, s')$, as shown in (32) and (33), and use equations (34)-(37) to require that the two quantities be equal:
$$R^{y\,pen}_{p,n,k,h}(s,a,s') \le y_{p,n,k,h}\, R^{idv\,pen}_{p,n,k,h}(s,a,s'), \quad \forall p \in P,\ n \in N_p,\ k \in K,\ h \in H_k,\ s \in S_{k,h},\ a \in A(s),\ s' \in S_{k,h}$$

$$R^{y\,pen}_{p,n,k,h}(s,a,s') \le R^{pen}_{p,n,k,h}(s,a,s') + (y_{p,n,k,h} - 1)\big(R^{pen}_{p,n,k,h}(s,a,s')\big)^L, \quad \forall p \in P,\ n \in N_p,\ k \in K,\ h \in H_k,\ s \in S_{k,h},\ a \in A(s),\ s' \in S_{k,h}$$

$$R^{y\,pen}_{p,n,k,h}(s,a,s') \ge R^{pen}_{p,n,k,h}(s,a,s') + (y_{p,n,k,h} - 1)\, R^{idv\,pen}_{p,n,k,h}(s,a,s'), \quad \forall p \in P,\ n \in N_p,\ k \in K,\ h \in H_k,\ s \in S_{k,h},\ a \in A(s),\ s' \in S_{k,h} \qquad (37)$$

The element-wise lower bounds of $R^{pen}_{p,n,k,h}(s,a,s')$, denoted $\big(R^{pen}_{p,n,k,h}(s,a,s')\big)^L$, are found by first solving for the lowest possible portion of available time $(t^{ratio\,A}_k)^L$ of each stage k, as shown in (38), and then applying (31), as shown in (39). Notice that all the constraints of problem (38) are linear, and the objective function is linear fractional, which is nonlinear but both pseudo-convex and pseudo-concave (Avriel, 2003). A pseudo-convex function behaves like a convex function with respect to finding its local minima. Therefore, when we relax the integrality requirements in problem (38), it becomes a convex NLP.
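For reference, the generic Glover scheme for the product of a binary y and a bounded continuous r can be sketched as follows; the bounds are assumed toy values, and this is not the paper's exact set of constraints (34)-(37).

```python
import pyomo.environ as pyo

m = pyo.ConcreteModel()
rL, rU = -5.0, 0.0          # assumed element-wise bounds on the reward term
m.y = pyo.Var(within=pyo.Binary)   # plays the role of y_{p,n,k,h}
m.r = pyo.Var(bounds=(rL, rU))     # plays the role of R^pen
m.w = pyo.Var(bounds=(rL, 0))      # linearization variable, role of R^{y,pen}

# Glover constraints: together they force w = y * r when y is binary
m.g1 = pyo.Constraint(expr=m.w <= rU * m.y)
m.g2 = pyo.Constraint(expr=m.w >= rL * m.y)
m.g3 = pyo.Constraint(expr=m.w <= m.r - rL * (1 - m.y))
m.g4 = pyo.Constraint(expr=m.w >= m.r - rU * (1 - m.y))
```

With y = 0, the first two constraints pin w to 0 while the last two are slack; with y = 1, the last two pin w to r. This is the exact linearization that makes the bilinear design-reward products tractable.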

Reformulation of the objective function
In this section we show a valid reformulation of the objective function (15) that potentially provides tighter lower bounds, derived based on the MDP optimality conditions implied by the constraints. Thus, the MINLP (referred to as (DMP)) consists of the objective function (40), which represents the net present value of the system and is minimized over the variables $(x, y, z, w, \pi, \pi^{sub}, \pi^a, v, R^{pen})$, together with constraints (7), (8), and the remaining constraints derived in sections 4 and 5. In Appendix A, we prove in Proposition 1 that the new objective function in (40) has the same value as the previous one in (15) when all the constraints are satisfied. The new objective function in (40) has two important properties. First, compared to the original objective function in (15), which also contains a sum of bilinear terms, the new objective function breaks down into more terms, which gives rise to potentially stronger lower bounds when it is directly relaxed (dropping the discrete requirements). Second, one factor of the bilinear terms in the new objective function is the complicating variable $R^{y\,pen}_{p,n,k,h}$ that reflects the stage interactions, so when the complicating variable is fixed, the objective function becomes linear.

Illustrative example
The motivating example of an air separation unit has a few critical processing stages: the main air compressor (MAC), the pre-purifier, the booster air compressor (BAC), and the LO2 pump (LOP). In this example, three redundant unit candidates are considered for each processing stage, and the pre-purifier stage has to have at least 2 units to function. The superstructure is shown in Figure 7. There are two products, O2 and N2. The capital costs of the units range from $30k to $150k. Maintenance takes 6 hours. Repair costs range from $4k to $20k per occurrence, and maintenance costs range from $2k to $10k per occurrence. Table 2 shows the penalty rates and pipeline flow rates used in the model. Table 3 shows the standard tank size options in terms of the number of days of demand each can cover, and the corresponding costs normalized to the costs of the smallest tanks. In section 6.1, we show how the equipment and contract specifications above are transformed into the parameters used in the MINLP described in sections 4 and 5. In section 6.2, we show the preliminary computational results of directly solving the problem.

Transforming the specifications into the MDP parameters
The transition probability matrix $P_{k,h}(s, a, s')$, $s, s' \in S_{k,h}$, $a \in A(s)$, and the reward matrices are derived from the equipment specifications as follows. For the working states on the bathtub curve, we assume Weibull lifetime distributions, whose cumulative distribution function is denoted by $F(x; \lambda, \beta) = 1 - e^{-(\lambda x)^{\beta}}$. For the Infant state, the shape factor is $\beta_1$; for the Worn-out state, the shape factor is $\beta_3$; and for the Stable state, the shape factor is $\beta_2 = 1$, for which the Weibull distribution reduces to the exponential distribution.
For example,

$$P(\text{Worn-out}, \text{None}, \text{Worn-out}) = 1 - F\big(T^{infant} + T^{stable} + T^{worn};\ \lambda, \beta_3\big)$$

The expected residence time in state s when taking action a is denoted as $t^{res}(s, a)$. For example, for the Stable state, the next point of observation is either the next inspection, depending on the choice of inspection interval, or a failure occurring before the inspection; the expected residence time is therefore calculated as shown in (52). Similarly, for the Infant and Worn-out states, the expected residence times are calculated as shown in (53) and (54) based on the Weibull lifetime distributions, where $\gamma(s, x) = \int_0^x t^{s-1} e^{-t}\, dt$ is the lower incomplete gamma function.
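As a sketch of these computations with illustrative parameter values, the survival-function integral behind (52)-(54), $\int_0^T e^{-(\lambda t)^\beta} dt = \gamma(1/\beta, (\lambda T)^\beta)/(\beta\lambda)$, can be evaluated via the regularized incomplete gamma function in SciPy:

```python
import math
from scipy.special import gammainc, gamma as gamma_fn

def weibull_cdf(x, lam, beta):
    """F(x; lam, beta) = 1 - exp(-(lam*x)**beta)."""
    return 1.0 - math.exp(-(lam * x) ** beta)

def expected_min_with_inspection(T_ins, lam, beta):
    """E[min(T_ins, X)] for X ~ Weibull(lam, beta): the integral of the
    survival function from 0 to T_ins via the lower incomplete gamma."""
    # gammainc is the *regularized* lower incomplete gamma; rescale by Gamma(1/beta)
    u = (lam * T_ins) ** beta
    return gammainc(1.0 / beta, u) * gamma_fn(1.0 / beta) / (beta * lam)

# e.g. Stable state with exponential lifetimes (beta = 1) and a 90-day interval:
lam, T_ins = 1.0 / 365.0, 90.0
p_fail_before_inspection = weibull_cdf(T_ins, lam, 1.0)
t_res = expected_min_with_inspection(T_ins, lam, 1.0)  # = (1 - exp(-lam*T))/lam
```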
The basic instant reward of transitioning from state s to state s′ by taking action a is denoted as $R^{idv}(s, a, s')$, which is the corresponding operational cost of taking the Inspection, Maintenance, or Repair action.
The penalty instant reward is 0 if the destination state s′ is not a failure state.
If the destination state is Failed or Stopped, the penalty instant reward is the outage penalty associated with the repair or maintenance rate of the failure scenario, $\mu_r(s)$ or $\mu_m(s)$, the storage size of each product $V_{p,n}$, and the consumption rate of each product $\delta_p$, as shown in equations (60) and (61).
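Equations (60) and (61) are not reproduced legibly here. As one plausible instantiation (our assumption, not necessarily the paper's exact expression): with exponentially distributed repair times of rate $\mu_r(s)$, a tank of size $V_{p,n}$ drained at rate $\delta_p$ covers outages of up to $V_{p,n}/\delta_p$, so the expected penalty of a transition into Failed would take the form

$$R^{idv\,pen}_{p,n,k,h}(s, a, \text{Failed}) = C^{pen}_p\, e^{-\mu_r(s)\, V_{p,n} / \delta_p},$$

where $C^{pen}_p$ is the fixed interruption penalty of product p, with $\mu_m(s)$ in place of $\mu_r(s)$ for transitions into Stopped.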

Preliminary computational results
The MINLP model (DMP) for the system in Figure 7 has 39,555 equations and 47,559 variables, of which 1,762 are binary. A first attempt is made to solve the MINLP model (DMP), with the objective function (40) and constraints (7), (8), and the remaining constraints derived above, using several global solvers; none of them closes the optimality gap in a reasonable amount of time. In order to overcome this computational difficulty, we propose below an algorithm that takes advantage of the problem structure.

A two-phase algorithm
Considering that the storage tank size decision has a rather small search space but impacts the entire system (the storage tanks are "shared" by the processing stages), we propose an algorithm with two major execution phases. The first phase is called Enumeration and Bounding: we exhaustively screen all possible tank size decision nodes by solving two MILPs (denoted (DMP-relax) and (DMP-diseng)) at each node for the respective objective function bounds, and prune all nodes whose lower bounds are greater than the upper bound of any other node. The second phase is called Rewards Iteration and is carried out for each remaining node: a sequence of MILPs (DMP-diseng) with iteratively updated reward parameters is solved until the decisions converge. Figure 8 provides a more detailed description of the algorithm.
Phase 1: Enumeration and Bounding

1. Enumerate over all possible tank size selections (e.g., 8 days of LO2 and 14 days of LN2); each is defined as a node.
2. Calculate the lower and upper bounds of the objective at each node by solving the two MILPs (DMP-relax) and (DMP-diseng).
3. As the calculation proceeds, prune nodes whose lower bounds are greater than the upper bound of any other node.

Phase 2: Rewards Iteration

For each remaining node, while $(R^{pen})^{(i)}$ has not converged:

1. Solve the MILP (DMP-diseng) with $R^{pen}$ fixed to $(R^{pen})^{(i)}$, and obtain solution $\xi^{(i)}$.
2. Update $(R^{pen})^{(i+1)}$ from $\xi^{(i)}$.

The optimal solution of the node is the MILP solution of the final iteration.
Among the remaining nodes, the converged solution with the lowest objective function value is the overall optimal solution. In section 7.1, we introduce the bounding problems and prove that the bounds are valid. In section 7.2, we explain the iterative procedure for solving each node.

Enumeration and bounding
In this section, we show that valid upper and lower bounds of the objective function for a given storage tank size can be obtained by solving two MILP models in which $R^{pen}$ is fixed to its upper and lower bounds, respectively, and we describe how to determine these variable bounds.
First, let us represent the MINLP (DMP) in the compact form (62). For simplicity, we let ξ stand for the binary and continuous variables other than the storage tank selection variables x and the penalty rewards $R^{pen}$, which are subject to the multilinear constraint (31); thus $\xi = (y, z, w, \pi, \pi^{sub}, t^{ratio\,A}, v)$, and we let $(x^*, (R^{pen})^*, \xi^*)$ be the true optimizer of the problem.

$$\min_{x \in X,\ 0 \le R^{pen} \le R^{idv\,pen},\ \xi \in \Xi} f(x, R^{pen}, \xi) \qquad (62)$$

For a given node of storage tank size $\bar{x}$, we let $(\bar{R}^{pen}, \bar{\xi})$ be the minimizer of the restricted problem, as shown in (63).

$$(\bar{R}^{pen}, \bar{\xi}) = \arg\min_{0 \le R^{pen} \le R^{idv\,pen},\ \xi \in \Xi} f(\bar{x}, R^{pen}, \xi) \qquad (63)$$

At each node of given storage tank size $\bar{x}$, if we go further by removing the multilinear constraint (31), which couples the $R^{pen}$ of each stage with the stationary probability distributions of the other stages, and fixing $R^{pen}$ to certain valid values $\hat{R}^{pen}$, the MINLP model (DMP) reduces to an MILP, which we call (DMP-diseng). The compact form of (DMP-diseng) is written in (64).
Based on the MILP model (DMP-diseng), a slight relaxation can be derived to form another MILP model (DMP-relax). Given the element-wise lower and upper bounds of $R^{pen}$, $(R^{pen})^L$ and $(R^{pen})^U$, the lower and upper bounds of the objective value $f(\bar{x}, \bar{R}^{pen}, \bar{\xi})$ at each node of storage tank size $\bar{x}$ can be obtained by solving (DMP-relax) and (DMP-diseng), as shown in (67) and (68). Note that in (DMP-diseng) in (68), the $\hat{R}^{pen}$ in the objective function and in the constraints are fixed to different values. In Appendix B, we show the exact forms of (DMP-diseng) and (DMP-relax), and prove that inequalities (67) and (68) hold based on the problem structure.
The element-wise upper bounds of $R^{pen}$, $(R^{pen})^U$, are set to the original penalty parameters $R^{idv\,pen}$. The element-wise lower bounds, $(R^{pen})^L$, are set to $\big(R^{pen}_{p,n,k,h}(s,a,s')\big)^L$, which were derived in section 5.1.

The algorithm of the Enumeration and Bounding phase is described in Algorithm 1.
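In outline, the Enumeration and Bounding phase can be sketched as follows, assuming helper callables solve_dmp_relax and solve_dmp_diseng (our names) that solve the two MILPs at a fixed tank-size node and return objective values:

```python
# High-level sketch of the Enumeration and Bounding phase (Algorithm 1);
# illustrative, assuming the two MILP solves are provided as callables.
def enumerate_and_bound(tank_nodes, solve_dmp_relax, solve_dmp_diseng):
    bounds = {}
    best_ub = float("inf")
    for node in tank_nodes:                 # each node = one tank-size selection
        lb = solve_dmp_relax(node)          # lower bound: R^pen relaxed to its bounds
        ub = solve_dmp_diseng(node)         # upper bound: R^pen fixed to valid values
        bounds[node] = (lb, ub)
        best_ub = min(best_ub, ub)
    # prune nodes whose lower bound exceeds the best upper bound found
    return [n for n, (lb, _) in bounds.items() if lb <= best_ub]
```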

Rewards iteration
For each node with storage tank size selection $\bar{x}$ that was not pruned in the bounding phase, a rewards iteration algorithm that guarantees convergence is performed. We first illustrate the algorithm on a case with two stages (k = 1, 2), as shown in Figures 9(a) and 9(b). Each dot in Figure 9(a) stands for a pair of availability values $(t^{ratio\,A}_1, t^{ratio\,A}_2)$ that results from a distinct combination of redundancy selection and maintenance policy $(z_1, z_2, w_1, w_2)$ of the two stages. The number of these combinations grows geometrically with the number of processing stages and design alternatives, but it is finite.
Starting from an initial point $(t^{ratio\,A}_1, t^{ratio\,A}_2)^{(0)}$, we first substitute it into (31) and calculate $(R^{pen}_1, R^{pen}_2)^{(0)}$, then solve the MILP (DMP-diseng) defined in (64) with $\hat{R}^{pen}$ fixed to $(R^{pen}_1, R^{pen}_2)^{(0)}$ to obtain the optimal design and maintenance policy for this case, $(z_1, z_2, w_1, w_2)^{(1)}$, which corresponds to the next blue point that the arrow leads to in Figure 9(a). The exact form of the MILP (DMP-diseng) is shown in Appendix B. As the iterations proceed, it is possible to get trapped in loops, as also shown in Figure 9(a). In that case, the optimization step is repeated at the point before the loop, with the solutions corresponding to points in the loop excluded by integer cuts. As shown in Figure 9(b), the algorithm stops when it converges to a stationary point $(t^{ratio\,A}_1, t^{ratio\,A}_2)$, where the optimization step leads back to the point itself. It is worth mentioning that the loop points are excluded from the feasible region of the MILP (DMP-diseng) only once, when re-optimizing at the previous point to explore a different path, but not in future iterations, because the stationarity of a point is only proven when the optimization step leads back to itself with all points admissible.
The algorithm is guaranteed to converge because, as discussed above, the number of dots is finite, and the global optimum, which is a stationary point, is among them. As there can be several stationary points, the algorithm does not guarantee global optimality. However, as shown in section 7.1, the lower bounds obtained in phase 1 of the algorithm are rigorous. We will also show later that, for the examples we consider, the algorithm converges quickly to optimal or near-optimal solutions. Algorithm 2 gives a detailed description of the procedure. The convergence criterion is based on w, the binary action selection variables, instead of $R^{pen}$, because the same action selection leads to exactly the same probability distribution π and $R^{pen}$, and w, being binary, suffers less from numerical precision issues.
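A simplified sketch of this loop, with the one-shot integer cuts and the stationarity test on w, is shown below; solve_dmp_diseng and compute_R_pen are assumed helpers (our names), and sol.w stands for the optimal action-selection variables. This is an illustration of the logic, not the exact Algorithm 2.

```python
# Sketch of the Rewards Iteration at one remaining node.
def rewards_iteration(node, R_pen0, compute_R_pen, solve_dmp_diseng):
    R_pen, path, cuts = R_pen0, [], []
    while True:
        sol = solve_dmp_diseng(node, R_pen, cuts)      # optimal (z, w) for fixed R^pen
        if not cuts and path and sol.w == path[-1].w:
            return sol                                 # stationary: maps back to itself
        cuts = []                                      # loop cuts are applied only once
        seen = [i for i, p in enumerate(path) if p.w == sol.w]
        if seen:                                       # a loop was closed at path[seen[0]]
            cuts = [p.w for p in path[seen[0]:]]       # exclude all loop points via integer cuts
            path = path[:seen[0]]                      # re-optimize at the pre-loop point
        else:
            path.append(sol)
        R_pen = compute_R_pen(path[-1]) if path else R_pen0  # substitute t^ratio into (31)
```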

Algorithm 2: Reward parameters iteration

The Rewards Iteration of the candidate nodes should be carried out in parallel, since the objective value at the stationary point of a node is a valid upper bound for that node. Therefore, when a stationary point is found, any other node whose lower bound is greater than the objective value of this point should stop its Rewards Iteration and be pruned.

Illustrative example revisited
In this section, we solve the example problem of section 6 again with the proposed algorithm to minimize the total cost in (40). The MILPs are solved with CPLEX 12.8.1.1 in Pyomo 5.6.9, and the algorithm is implemented in Python 3.6 on an Intel(R) Core(TM) i5 @ 1.60GHz. Table 4 summarizes the bounds obtained at each node. Some nodes whose LBs and UBs are calculated are pruned by the node(s) examined after them. For example, the node with 2 days of LO2 and 2 days of LN2 is pruned after the node with 8 days of LO2 and 2 days of LN2 is examined, which is later itself pruned by the node with 8 days of LO2 and 8 days of LN2.
It is guaranteed that the global optimal solution of the problem lies in the only highlighted node. Figure 10 shows the optimal design, which selects the two cheapest units for each of the first three stages, and the single cheapest unit for the LOP. The total capital cost is $766k, and the expected operational cost is $130.89k according to the Rewards Iteration, and $130.21k when the node is solved directly with BARON. Table 5 shows the maintenance policy for the booster air compressor (BAC) as an example, which is the same as that of the main air compressor (MAC). The table has four main columns.
In the first column are the states with no unit being maintained or repaired. In the second and third columns are the states with one unit being maintained or repaired, respectively. In the last column are the states where no unit is available. With the other redundant unit on stand-by, the best action for the Stable state is to inspect once a year. With the other unit Failed or Stopped, the best action for the Stable state is to inspect every half month.

Demonstration of the algorithm's efficiency
In order to test the efficiency of the algorithm, we randomly generate 20 problems around the example shown above; in particular, the reliability parameters are randomly perturbed within a range of ±10%. Another group of 20 problems with 6 processing stages is randomly generated in a similar fashion and solved, where the two additional stages use data of similar orders of magnitude. The computational statistics are shown in Figures 11(a) and 11(b). The bounding step prunes most of the nodes for the problems with 4 stages (Figure 11(a)), and the rewards iterations tend to converge quickly at the remaining nodes. However, for the 6-stage problems (Figure 11(b)), generally more than half of the nodes have to go through the Rewards Iteration phase, and the CPU times fluctuate more.
Figure 11: Computational results of the randomly generated cases of two different sizes. (a) Number of nodes to solve for each randomly generated problem; (b) total CPU time for each randomly generated problem.

Conclusion
This paper considers redundancy selection and storage tank size selection, as well as basic maintenance policies, for a chemical process at the conceptual design phase. A Markov Decision Process is used as the fundamental framework to model the stochastic dynamic decision-making process of condition-based maintenance. We embed the optimality conditions of the Markov Decision Processes and the stationary probability distribution conditions of the reduced Markov Chains into an MINLP (DMP) that considers the economic trade-off among all major decisions. In order to make the model more tractable, we propose a standard linearization of the bilinear terms between binary and continuous variables, and a reformulation of the objective function that potentially provides a stronger relaxation of the objective.
An example based on the reliable design of an air separation unit is used to demonstrate how to extract the model parameters from the raw data. We attempted to solve the MINLP (DMP) directly with several global solvers and found that it could not be solved in a reasonable amount of time. Therefore, we propose an algorithm that consists of two phases, Enumeration and Bounding, and Rewards Iteration. The validity of the bounding is based on the reformulation of the MDP objective function introduced earlier in the paper. Resolving the example shows that the two-phase algorithm greatly reduces the required computational effort. The algorithm also shows consistent performance over 20 randomly generated problems around the original example of 4 processing stages. Another group of 20 random problems with 6 processing stages is also solved and shows good computational results.

Appendix A Proof of Proposition 1

Proof. First we focus on the optimality condition of a given Markov Decision Process, using the notation of section 4.1, where $U^*_s$ is the optimal value of state s and $a^*(s)$ is the optimal action of state s:

$$(1 - \gamma)\, U^* = \gamma\, (P^* - I)\, U^* + \mathrm{diag}\big(P^* (R^*)^T\big) \qquad \text{(A.3)}$$

Next we rewrite equation (20) in the matrix form (A.4), which specifies the conditions that the stationary distribution π has to satisfy.
Since equations (21) and (23) require that $\pi^{sub}_{k,h}(s, a^*(s)) = \pi_{k,h}(s)$ and that $\pi^{sub}_{k,h}(s, a) = 0$ if $a \neq a^*(s)$, (A.6) is equal to (A.7) when (21) and (23) hold. As defined in section 5.1, $R^{y\,pen}_{p,n,k,h}(s, a, s')$ is zero for non-selected storage sizes. Therefore, (A.7) can be further written as (A.8), which is the right-hand side of (A.1). Thus, the proposition is proved.

In the following, we display the exact formulations of (DMP-diseng) and (DMP-relax), where $\hat{R}^{pen}$ can be replaced with $(R^{pen})^L$ or $(R^{pen})^U$ as needed, and prove in Propositions 2 and 3 that the MILPs provide valid bounds.

Appendix B Objective bounding regarding key parameters
(B.3) shows the exact form of $c(\bar{x}, \hat{R}^{pen}) \cdot \xi$.
Proof. As shown in section 7, assuming that $\bar{R}^{pen}$ is known, the optimal value of (B.7) is $f(\bar{x}, \bar{R}^{pen}, \bar{\xi})$.
$$\min_{\xi \in \Xi} \Big\{ c(\bar{x}, \bar{R}^{pen}) \cdot \xi \ \Big|\ A^{diseng} \cdot \xi \le b^{diseng}(\bar{x}, \bar{R}^{pen}) \Big\} \qquad \text{(B.7)}$$

As a constraint in (B.1), (B.5) relaxes the requirement that the state value be less than or equal to the right-hand side for every possible action, and only requires that it be no greater than that of the selected action. This makes all possible action decisions (all possible solutions of variable w) feasible in (B.1), which automatically includes the action decision $\hat{w}$ corresponding to the optimal solution $\hat{\xi}$ of (B.7).
Admittedly, substituting $\hat{w}$ into (B.1) may lead to state values different from $\hat{v}$, but $\hat{\pi}^{sub}$ will stay the same, since a given action decision $\hat{w}$ leads to a unique reduced Markov Chain for each stage, and hence to unique stationary probability distributions $\hat{\pi}$ and $\hat{\pi}^{sub}$. Since only $\hat{\pi}^{sub}$ enters the objective function, we can still insert $\hat{\pi}^{sub}$ as part of $\hat{\xi}$ into the objective function of (B.1) as a feasible solution, which by the definition of optimality is no less than the optimum of (B.1), and let (B.8) hold with a slight abuse of notation.

$$c\big(\bar{x}, (R^{pen})^L\big) \cdot \hat{\xi} \ \le\ c(\bar{x}, \bar{R}^{pen}) \cdot \hat{\xi} \ \equiv\ f(\bar{x}, \bar{R}^{pen}, \bar{\xi}) \qquad \text{(B.9)}$$

Therefore, we have (B.10), and the proposition is proved.
Proof. As shown in section 7, assuming that $\bar{R}^{pen}$ is known, the optimal value of (B.7) is $f(\bar{x}, \bar{R}^{pen}, \bar{\xi})$.
We also denote $\xi^L$ as the optimal solution of (B.2).
Constraints (B.4) and (B.6) are the only constraints in (B.2) that contain $R^{pen}$. Since $(R^{pen})^L \le \bar{R}^{pen}$, the solution $\xi^L$, which satisfies (B.4) in $A^{diseng} \cdot \xi \le b^{diseng}(\bar{x}, (R^{pen})^L)$, also satisfies (B.4) in $A^{diseng} \cdot \xi \le b^{diseng}(\bar{x}, \bar{R}^{pen})$. As for (B.6), since exactly one action is selected for each state, v is always the unique solution of a square linear system and is bounded in terms of the problem parameters.
Therefore, the big-M constant can be made large enough such that (B.6) is always satisfied.
Therefore, $\xi^L$ is a feasible solution of (B.7), and it gives an objective value no less than the optimum, as shown in (B.11).

$$c(\bar{x}, \bar{R}^{pen}) \cdot \xi^L \ \ge\ c(\bar{x}, \bar{R}^{pen}) \cdot \hat{\xi} \ \equiv\ f(\bar{x}, \bar{R}^{pen}, \bar{\xi}) \qquad \text{(B.11)}$$

On the other hand, since $\bar{R}^{pen} \le (R^{pen})^U$, (B.12) holds given the exact formulation of the objective function shown in (B.3).