Computing Optimal (R, s, S) Policy Parameters by a Hybrid of Branch-and-Bound and Stochastic Dynamic Programming

A well-known control policy in stochastic inventory control is the (R, s, S) policy, in which inventory is raised to an order-up-to-level S at a review instant R whenever it is at or below the reorder level s. To date, little or no work has been devoted to developing approaches for computing (R, s, S) policy parameters. In this work, we introduce a hybrid approach that exploits tree search to compute optimal replenishment cycles, and stochastic dynamic programming to compute (s, S) levels for a given cycle. Up to 99.8% of the search tree is pruned by a branch-and-bound technique with bounds generated by dynamic programming. A numerical study shows that the method can solve instances of realistic size in a reasonable time.


Introduction
Single-item, single-stocking location, stochastic inventory systems have long been investigated under various operational assumptions, and the associated literature is large. Scarf's seminal paper, Scarf (1959), addressed this problem over a finite planning horizon comprising discrete time periods, non-stationary stochastic demands, a fixed ordering cost, and linear holding and shortage costs.
Scarf proved that the (s, S) policy (more precisely, the (s_t, S_t) policy) is cost-optimal. In this policy the decision-maker checks the current inventory position at review epochs (the start of each time period) and, if the inventory position is at or below s_t, an order is placed to raise it to S_t. For a planning horizon of T periods the optimal (s_t, S_t) policy requires 2T policy parameters, computed in a here-and-now fashion at the start of the planning horizon. Actual replenishment timings and associated order quantities are instead determined in a wait-and-see manner.
In this paper, we address a more general form of the inventory control problem described by Scarf. According to Silver (1981), the (R, s, S) policy is one of the most commonly adopted inventory control strategies (it is also called (T, s, S) or (s, S, T) in the literature, e.g. Babai et al. (2010); Lagodimos et al. (2012a)). In an (R_t, s_t, S_t) system the inventory level is checked only at review epochs R_t, which are policy parameters fixed at the start of the planning horizon.
After a review, an order may be placed to raise the inventory level up to S_t if it is at or below s_t.
Two special cases of the (R, s, S) policy naturally arise. Firstly, it reduces to the (s, S) case if there is no explicit cost involved in carrying out inventory reviews. Inventory review (also known as stock-taking) is costly in practice, so we consider the case in which a fixed system control cost, Silver (1981), is incurred whenever the inventory is reviewed, e.g. Fathoni et al. (2019); Christou et al. (2020). The (R, s, S) policy relaxes the cost accounting assumption that the fixed cost of replenishment covers both review and delivery costs, and separates the fixed cost of conducting a review from the fixed ordering cost. One practical implication of this relaxation is that order cancellations and the associated costs can be explicitly incorporated into inventory planning.
Secondly, the (R, s, S) policy reduces to the (R, S) policy (the replenishment cycle policy) if the reorder levels s_t are equal to the order-up-to-levels S_t. In a replenishment cycle policy, the replenishment periods are fixed at the beginning of the planning horizon, and replenishment orders are placed only in these periods, with order quantities decided after the demands observed so far.
Although (R, s, S) is one of the most general and frequently used inventory policies, as pointed out by Silver (1981) the determination of the exact best values of the three parameters is extremely difficult. To the best of our knowledge, no approach to computing them has been presented in the literature. We fill this important gap by making the following contributions:
• we introduce an efficient hybrid of branch-and-bound and stochastic dynamic programming (SDP) to compute optimal policy parameters;
• we improve the branch-and-bound by using tighter bounds computed through a separate dynamic programming (DP) method;
• we show empirically that the new algorithm performs significantly better than a baseline method and that it is able to solve problems of realistic size in a reasonable time.
The paper is structured as follows. Section 2 surveys related literature. Section 3 provides a problem description. Section 4 introduces a simple SDP formulation. Section 5 introduces a branch-and-bound strategy. Section 6 carries out a comprehensive numerical study. Finally, Section 7 concludes the paper.

Literature review
The problem of computing policy parameters for an inventory control system under stochastic demand has received a great deal of attention. In this section, we survey the relevant literature on the classic stochastic inventory control problem. We then survey variants of the problem. Finally, we survey real-world applications of the (R, s, S) policy.
An important class of these problems is single-item, single-location, non-stationary stochastic lot-sizing under linear holding costs, penalty costs, and both linear and fixed ordering costs. Different policies can be used to determine the size and timing of orders in such a setting. In his seminal work, Scarf characterises the structure of the optimal policy for this problem. The framework proposed by Bookbinder & Tan (1988) compares the performance of different policies in a receding horizon; Sani & Kingsman (1997) and Babai et al. (2010) provide comparative studies of the performance of (s, S) heuristics.
Modifications of the original inventory model have been proposed to allow a closer representation of real-world problems. Dillon et al. (2017) propose an (R, S) policy solution to manage the blood supply chain, which includes perishable products. The model of Alvarez et al. (2020) considers both quantity and quality decay of the inventory product; the quality can be improved by mixing it with a higher-quality product. A set of heuristics for the lot-sizing problem with remanufacturing of returned products is presented in Kilic & Tunc (2019). An all-units discount (s, S) policy is analysed in Wang et al. (2019).
Uncertainty can also involve other aspects of the inventory system. Bashyam & Fu (1998), for example, use a heuristic based on Schneider et al. (1995) for the (R, s_t, S_t) policy. Other works study control policies with service-level constraints: starting from the (s, S) SDP introduced by Scarf (1959), they define an optimal policy and present a value-iteration algorithm to find the (R, s, S) parameters that minimise the inventory cost subject to service constraints. As the parameters are fixed, such a solution is unsuitable for a non-stationary setting.
The analysis of the state of the art confirms the novelty of our solution and the practitioners' interest in using the (R, s, S) policy in stochastic environments.

Problem description
We consider the single-item, single-stocking location, stochastic inventory control problem over a T-period planning horizon. Without loss of generality, we assume that orders are placed at the start of each period and that the lead time is zero, as is common in the literature, e.g. Scarf (1959); Bollapragada & Morton (1999); Tarim & Kingsman (2004). An inventory control policy defines the timing and quantities of orders over the planning horizon. We define a review moment, or review period, as a period in which the inventory level is assessed and an order can be placed. A replenishment cycle is the interval between two review moments. We denote by Q_t the quantity ordered in period t, and by W the cost of an inventory review. Ordering costs comprise a fixed value K and a linear cost, but we shall assume without loss of generality that the linear cost is zero. Extending our solution to the case of a non-zero production/purchasing cost is straightforward, as this cost can be reduced to a function of the expected closing inventory level at the final period, see Tarim & Kingsman (2004). At the end of each period, a linear holding cost h is charged for every unit carried from one period to the next.
Demands d_t in each period t are independent random variables with known probability distributions. Excess demand is backlogged: if the demand in a period exceeds the on-hand inventory, the unmet portion is carried over to the next period, and a linear penalty cost b is incurred for every unit of back-ordered demand at the end of a period.
Under the non-stationarity assumption the (R, s, S) policy takes the form (R_t, s_t, S_t), where R_t denotes the length of the t-th replenishment cycle, while s_t and S_t denote the reorder level and order-up-to-level associated with the t-th inventory review. We consider the problem of computing the (R, s, S) policy parameters that minimize the expected total cost over the planning horizon.
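To make the cost accounting concrete, the following minimal sketch (ours, not part of the formal model) simulates a single demand realisation under a given (R_t, s_t, S_t) parameterisation and accumulates the review, ordering, holding and penalty costs defined above; the function name and the numbers in the commented call are illustrative assumptions.

```python
def simulate_policy(review, s, S, demands, K, W, h, b, I0=0):
    """Cost of one demand realisation under given (R_t, s_t, S_t) parameters.

    review[t] is True if period t is a review period, s[t]/S[t] are the
    reorder and order-up-to levels of that review, demands[t] is the realised
    demand.  Excess demand is backlogged (inventory may become negative).
    """
    inventory, cost = I0, 0.0
    for t, d in enumerate(demands):
        if review[t]:
            cost += W                      # fixed review (stock-taking) cost
            if inventory <= s[t]:
                cost += K                  # fixed ordering cost
                inventory = S[t]           # raise inventory to S_t (zero lead time)
        inventory -= d                     # demand of the period
        cost += h * max(inventory, 0)      # holding cost on closing stock
        cost += b * max(-inventory, 0)     # penalty cost on backlog
    return cost

# Illustrative call with hypothetical data:
# simulate_policy([True, False, True], [20, 0, 15], [70, 0, 60],
#                 demands=[45, 50, 38], K=30, W=10, h=1, b=10)
```

The expected total cost minimised in this paper is the expectation of this quantity over the demand distributions.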

Stochastic dynamic programming formulation
In this section, we provide a simple technique to compute the optimal (R, s, S) policy parameters. It can be considered the state of the art in computing such parameters in the presence of stochastic non-stationary demand. Moreover, it constitutes the basis of the branch-and-bound technique introduced later.
We represent the review moments by binary variables γ_t (t = 1, . . . , T), which take value 1 if a review takes place in period t and 0 otherwise. We assume Q_t = 0 whenever γ_t = 0, so that no order is placed outside a review moment. The optimal (R, s, S) policy for our problem is represented by the parameters γ_t, s_t, S_t that minimize the expected total cost.
Consider an arbitrary review plan in which each γ_t is a parameter, not a decision variable. We denote the closing inventory level of period t by I_t, and the given initial inventory level by I_0. We assume that orders are placed at the beginning of each time period and delivered instantaneously. The problem can then be formulated and solved to optimality as an SDP, Bellman (1966). Let f_t(I_{t-1}, Q_t) denote the expected immediate cost of period t, combining review, ordering, holding and penalty costs:

f_t(I_{t-1}, Q_t) = W γ_t + K 1(Q_t > 0) + E[ h max(I_{t-1} + Q_t − d_t, 0) + b max(d_t − I_{t-1} − Q_t, 0) ],    (1)

where 1(·) is the indicator function and the expectation is taken over the demand d_t. Let C_t(I_{t-1}) represent the expected total cost of an optimal policy over periods t, . . . , T; the opening inventory levels are the states of the DP formulation. We model the problem with the functional equation

C_t(I_{t-1}) = min_{0 ≤ Q_t ≤ M γ_t} { f_t(I_{t-1}, Q_t) + E[ C_{t+1}(I_{t-1} + Q_t − d_t) ] },    (2)

where M is a sufficiently large number, so that Q_t = 0 whenever γ_t = 0. The boundary condition is C_{T+1}(·) = 0. The value C_1(I_0), where I_0 is the initial inventory level, is the expected cost of the optimal (s, S) policy associated with the given γ assignment. To reduce the computational time we can exploit the property of K-convexity, Scarf (1959), when solving the SDP.
Let Ĉ_1(I_0) represent the expected total cost of the optimal (R, s, S) policy, given the initial inventory level I_0 at period 1. It can be defined as

Ĉ_1(I_0) = min_{γ_1, . . . , γ_T ∈ {0, 1}} C_1(I_0);

that is, evaluating the optimal (s, S) policy for every possible assignment of γ_1, . . . , γ_T yields the optimal (R, s, S) policy. The model works with any demand distribution, as long as it is finite and discretisable. This is the baseline method on which we aim to improve.
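The baseline can be sketched as a backward recursion over a discretised inventory grid, wrapped in an exhaustive enumeration of the 2^T review plans. The Python sketch below is a minimal illustration of Equations (1)-(2) under assumptions of our own (demand given as discrete probability mass functions, a bounded inventory grid [Imin, Imax], and a maximum order quantity Qmax); it is not the authors' implementation and omits the K-convexity speed-up.

```python
from itertools import product

def sdp_cost(gamma, demand_pmfs, I0, K, W, h, b, Qmax, Imin, Imax):
    """Expected cost C_1(I0) of the optimal (s, S) policy for a fixed review
    plan gamma (a tuple of 0/1 values); cf. Equations (1)-(2)."""
    T = len(gamma)
    states = range(Imin, Imax + 1)
    C_next = {i: 0.0 for i in states}                  # boundary: C_{T+1}(.) = 0
    for t in reversed(range(T)):                       # backward over periods
        C_t = {}
        for i in states:                               # opening inventory I_{t-1}
            q_range = range(Qmax + 1) if gamma[t] else (0,)
            best = float("inf")
            for q in q_range:
                cost = W * gamma[t] + (K if q > 0 else 0.0)
                for d, p in demand_pmfs[t].items():
                    j = max(min(i + q - d, Imax), Imin)    # closing inventory, clamped to grid
                    cost += p * (h * max(j, 0) + b * max(-j, 0) + C_next[j])
                best = min(best, cost)
            C_t[i] = best
        C_next = C_t
    return C_next[I0]

def baseline_optimal_plan(demand_pmfs, I0, K, W, h, b, Qmax, Imin, Imax):
    """Enumerate all 2^T review plans and return (cost, plan) of the cheapest."""
    T = len(demand_pmfs)
    return min((sdp_cost(g, demand_pmfs, I0, K, W, h, b, Qmax, Imin, Imax), g)
               for g in product((0, 1), repeat=T))
```

The branch-and-bound of Section 5 replaces the exhaustive enumeration in baseline_optimal_plan with a pruned depth-first search.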

Unit cost
The algorithm can be extended to model a per-unit ordering cost. There are two options: reducing it to a function of the expected closing inventory, e.g. Tarim & Kingsman (2004), or including it in the immediate cost function.
Let v be the per-unit ordering/production cost. Equation (1) is then replaced by

f_t(I_{t-1}, Q_t) = W γ_t + K 1(Q_t > 0) + v Q_t + E[ h max(I_{t-1} + Q_t − d_t, 0) + b max(d_t − I_{t-1} − Q_t, 0) ].
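In terms of the sketch of Section 4, the extension only changes the immediate cost; a hedged one-function illustration (the name immediate_cost is ours):

```python
def immediate_cost(q, gamma_t, K, W, v):
    """Review and ordering component of Equation (1) extended with a per-unit cost v:
    the fixed cost K is still charged only when q > 0, while v * q is linear in the quantity."""
    return W * gamma_t + (K if q > 0 else 0.0) + v * q
```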

Lost sales
Complete backlogging of the demand is a limiting assumption in many real-world settings. Studies analysing customer behaviour show that, in case of a stock-out, only a minority of customers delay the purchase, Verhoef & Sloot (2006). According to Bijvank & Vis (2012a), the lost sales configuration is under-represented in the lot-sizing literature, even though it models customers' behaviour more accurately. Approximating a lost sales model with a backlog model results in a non-negligible increase in costs, Zipkin (2008).
The SDP formulation can be extended to model lost sales. We consider the partial backorder configuration presented in dos Santos & Oliveira (2019).
They define β (β ∈ [0, 1]) as the fraction of the unmet demand that is carried over to the next period, while the remainder is lost. This parameter gives the flexibility to model backlogging (β = 1), lost sales (β = 0), or a combination of the two.
The functional equation (2) becomes

C_t(I_{t-1}) = min_{0 ≤ Q_t ≤ M γ_t} { f_t(I_{t-1}, Q_t) + E[ C_{t+1}( max(I_{t-1} + Q_t − d_t, 0) + β min(I_{t-1} + Q_t − d_t, 0) ) ] },

that is, only a fraction β of any end-of-period shortage is carried over as negative inventory into the next period.
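Only the state transition of the SDP changes under this extension. Below is a minimal sketch, assuming the usual partial-backlogging transition in which on-hand stock carries over in full and only a fraction β of a period's shortage is carried over (the remainder is lost); the exact cost accounting for the lost fraction is not spelled out here.

```python
def next_inventory(opening, q, d, beta):
    """Partial-backorder transition: closing = opening + q - d if non-negative,
    otherwise only a fraction beta of the shortage is carried to the next period.
    (In a discretised SDP this value would be rounded back onto the inventory grid.)"""
    closing = opening + q - d
    return closing if closing >= 0 else beta * closing
```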

Example
We use a simple example with a 3-period planning horizon to illustrate the application of our method. We assume an initial inventory level of zero, and we consider an ordering cost K = 30, a review cost W = 10, and holding and penalty costs of h = 1 and b = 10 per unit per period, respectively.
The algorithm must choose the review moments γ = (γ_1, γ_2, γ_3) that minimize the expected cost of the policy. Table 1 shows the expected cost of the optimal (s, S) policy for each possible set of review periods. The optimal solution is γ = (1, 0, 1), with expected cost 142.7. However, exhaustive search becomes impractical as the planning horizon grows, so in Section 5 we develop a more efficient method.

A hybrid of branch-and-bound and SDP
In this section, we present a hybrid technique that combines SDP and branch-and-bound. The algorithm obtains the optimal (R, s, S) policies associated with specific review plans at the leaf nodes. The search tree (defined in Section 5.1) is explored by depth-first search (DFS). The subproblems associated with the nodes are defined in Section 5.2. Section 5.3 introduces the pruning condition and the lower bound computed with DP. Section 5.4 presents the node resolution process, and Section 5.5 describes a guided variant of the tree search.

Search tree
The goal of the branch-and-bound is to find the review plan with the minimum expected cost. When branching on γ_t, its value is fixed to 1 or 0. The search tree has T + 1 levels, and the branching at its root fixes the value of γ_T. At level ℓ, branching involves the variable γ_{T−ℓ+1}. The path from the root to a node at level ℓ represents a fixed assignment of the suffix γ_{T−ℓ+2}, . . . , γ_T. A leaf node represents a complete assignment of the γ values. Figure 1 shows the search tree of a 3-period problem, as in the example presented in the previous section.

Subproblems
Given a period t and a partial assignment of the suffix of review moments γ_t, . . . , γ_T, the problem at a node is to find the assignment of γ_1, . . . , γ_{t−1} that minimizes the expected cost of the optimal policy. We denote this problem by BnB-SDP(t, [γ_t, . . . , γ_T]). For each subproblem, Equation (2) can be used to compute the expected cost C_t(i) of the optimal policy starting at period t with inventory level i. This is possible because all review moments from period t onwards are fixed, and because of the SDP stage structure presented in Section 4.

Bounds and pruning
If all the solutions in the subtree rooted in a node are suboptimal then we can prune that node without compromising optimality.
Proposition 1. Given a fixed assignment of γ, min_i C_t(i) ≥ min_i C_{t+1}(i) for t = 1, . . . , T. From the functional equation (2) it is clear that C_t equals the expected value of C_{t+1} plus some non-negative immediate costs, so the minimum cost at each stage increases monotonically with the depth of the search tree.

During the tree search, C̄ records the expected cost of the best plan computed so far, that is, the minimum C_1(I_0) among all leaves already computed. This value is used as an upper bound on the expected cost of the optimal plan as follows.

Consider the subproblem BnB-SDP(t, [γ_t, . . . , γ_T]) with its associated expected costs C_t(i), and suppose that

min_i C_t(i) ≥ C̄.

By the monotonicity of the cost function (Proposition 1), min_i C_1(i) ≥ min_i C_t(i), and since the expected cost of any plan that completes this suffix, C_1(I_0), is one of the values of C_1, we obtain

C_1(I_0) ≥ min_i C_1(i) ≥ min_i C_t(i) ≥ C̄.

Hence, if the condition above holds, the subproblem BnB-SDP(t, [γ_t, . . . , γ_T]) cannot lead to an optimal solution and the corresponding subtree can be pruned.
However, this pruning condition makes no assumption about the costs incurred in periods 1, . . . , t − 1, and a lower bound on the costs of those periods leads to more effective pruning. Let MC_t(I_t) denote a lower bound on the cost incurred in periods 1, . . . , t when the closing inventory in period t is I_t. The pruning condition can then be refined to

min_{I_{t−1}} ( C_t(I_{t−1}) + MC_{t−1}(I_{t−1}) ) ≥ C̄.

Because the bound is independent of the review periods, it can be computed only once, before running the branch-and-bound algorithm.
The bounds can be computed by a DP whose stages and states are equivalent to those of the SDP presented in Section 4. Its functional equation defines MC_t(I_t) as the minimum, over two cases, of the previous-period bound MC_{t−1}(I_{t−1}) plus the ordering-holding-penalty cost of period t (the immediate cost of Section 4 without the review cost, so that the bound is independent of the review plan). In the first case an order has been placed in period t, so the inventory level of the previous period is less than or equal to the current level; in the second case no order has been placed, so the previous inventory level is greater than or equal to the current level. The boundary condition is MC_0(I_0) = 0, where I_0 is the initial inventory. For finite demand, the DP has a number of states equal to the number of periods multiplied by the maximum inventory level. Each state requires a single evaluation of the functional equation, which is itself pseudo-polynomial in the maximum inventory level. The overall complexity of a DP is the number of states multiplied by the complexity of solving one state, so the overall complexity is pseudo-polynomial.

Node computation
Algorithm 1 summarises the branch-and-bound procedure. In line 1, the SDP stage t is solved. In line 7, the pruning condition is evaluated: if pruning occurs, the branching phase is skipped. In lines 8 and 9, the DFS continues recursively. Lines 3-6 handle leaf nodes: if the policy represented by the leaf is better than the best found so far, the value of C̄ is updated. The algorithm starts by invoking BnB-SDP(T + 1, ∅), and at the end the expected cost of the optimal policy is given by C̄.
As presented, the algorithm always branches by first assigning γ_t = 0, but its performance can be improved by randomisation. If, during each branching phase, we randomly order lines 8-9, a good solution may be found earlier, leading to stronger pruning of the search tree. We evaluate the effect of this randomisation in Section 6.
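Algorithm 1 itself is not reproduced in this text; the sketch below is a hedged Python rendering of the recursion it describes. It assumes a helper solve_stage(t, suffix) that returns the C_t(i) values of Equation (2) for the fixed suffix, a precomputed table MC of the lower bounds of Section 5.3, and the problem data gathered in a state dictionary; none of these names come from the original. The random shuffle corresponds to the randomised ordering of lines 8-9.

```python
import random

def bnb_sdp(t, gamma_suffix, state):
    """Depth-first branch-and-bound over review plans (cf. Algorithm 1).

    t            : stage being solved (T + 1 at the root, 1 at the leaves)
    gamma_suffix : the fixed values (gamma_t, ..., gamma_T)
    state        : problem data, lower bounds 'MC', incumbent 'best'/'best_plan',
                   and the stage solver 'solve_stage'.
    """
    C_t = state["solve_stage"](t, gamma_suffix)        # C_t(i) for this suffix
    if t == 1:                                         # leaf: complete review plan
        cost = C_t[state["I0"]]
        if cost < state["best"]:
            state["best"], state["best_plan"] = cost, gamma_suffix
        return
    # Refined pruning condition: min_i ( C_t(i) + MC_{t-1}(i) ) >= incumbent.
    MC_prev = state["MC"][t - 1]
    if min(C_t[i] + MC_prev[i] for i in C_t) >= state["best"]:
        return                                         # subtree cannot improve the incumbent
    children = [0, 1]
    random.shuffle(children)                           # randomised branching order
    for g in children:
        bnb_sdp(t - 1, (g,) + gamma_suffix, state)

# Invocation: state["best"] = float("inf"); bnb_sdp(T + 1, (), state);
# the optimal expected cost is then in state["best"].
```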

Guided tree search
The random descent can get stuck in inferior branches of the search tree; in such cases it takes considerable time to reach a reasonable review plan and, with it, a good upper bound on the cost. Computing a near-optimal review plan beforehand and using it to guide the search immediately yields a policy with a low expected cost. This tighter bound increases the number of nodes that the pruning condition proves to be suboptimal.
A reasonable review plan can be computed using the (R_t, S_t) policy. As mentioned in the introduction, this policy places an order at each review moment. The replenishment cycles R_t can be used as a review plan, while the order-up-to-levels S_t are ignored. During the first descent of the branch-and-bound search tree, the γ values are selected following this review plan; thus, the first leaf to be computed is the one whose review moments match R_t.
This leaf represents the optimal (R_t, s_t, S_t) policy for that review plan, and it should have a low expected cost. After computing the first leaf of the tree, the search proceeds in the neighbourhood of this replenishment plan using the randomised approach.
The experimental section shows the resulting improvement in pruning efficacy and computational time. Good computational performance and implementation simplicity make the MILP formulation presented in Rossi et al. (2015) a suitable choice for computing the (R_t, S_t) policy; this is the formulation used in our experiments.
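A hedged sketch of the guided ordering, expressed in terms of the bnb_sdp sketch above: during the first descent the child matching the precomputed (R_t, S_t) review plan is explored first, so the first leaf reached is exactly that plan; afterwards the ordering reverts to the randomised scheme of Section 5.4. The name guide_plan is illustrative.

```python
def guided_child_order(period, guide_plan):
    """Branching order for gamma_period during the first descent.

    guide_plan[p - 1] holds the review decision (0/1) that the precomputed
    (R_t, S_t) policy takes in period p; that value is tried first, so the
    first leaf of the search corresponds exactly to the guiding review plan."""
    preferred = guide_plan[period - 1]
    return (preferred, 1 - preferred)

# In bnb_sdp, replace the shuffled `children` with
#   children = guided_child_order(t - 1, guide_plan)
# until the first leaf has been reached, then fall back to random.shuffle.
```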

Example
The search tree with the DP bounds for the example of Section 4.3 is shown in Figure 2. Each internal node is labelled with the value of the refined pruning condition that uses the DP bounds. An internal node is underlined if pruning occurs at that node. A leaf is shown in bold if it improves on the previous best solution C̄. Pruned nodes are indicated by an asterisk (*).
We define pruning percentage as the percentage of nodes that are proved to be suboptimal by the pruning condition during tree search. In this example, the number of computed nodes is 10 and 4 nodes have been pruned, so the pruning percentage is 4/14 = 28.57%.

Computational study
In this section, we evaluate the new methods empirically, including an assessment of the effects of branching randomisation and of the problem (cost) parameters.
We conduct two sets of experiments. In Section 6.1, we analyse the scalability of the new approaches by increasing the number of periods until no method is able to consistently solve the problem within a 1-hour time limit. In Section 6.2, we analyse the impact of the instance parameters. We compare the following methods:
• SDP, the SDP technique described in Section 4, which we consider the current state of the art;
• BnB, the branch-and-bound approach introduced in Section 5;
• BnB-Rand, the branch-and-bound with the randomised branching order described in Section 5.4;
• BnB-Guided, the branch-and-bound with guided tree search (Section 5.5).
We compare these methods in terms of computational time, pruning percentage and average number of review periods (but not expected cost, because the solutions are optimal in each case). All experiments are executed on an Intel(R) Xeon E5620 processor (2.40 GHz) with 32 GB of RAM. For the sake of reproducibility, we have made the code publicly available.
We base our numerical studies on the set of instances originally proposed by Berry (1972).

Scalability
This experiment aims to assess the improvement provided by the branch-and-bound approach compared to what we consider the state of the art.
Furthermore, we aim to assess how the randomisation and the guided search affect computational performance and the pruning percentage. For the scalability analysis, we use randomly generated parameter values and progressively increase the number of periods. We fix the holding cost per unit at h = 1, while the other cost parameters are uniform random variables: the ordering cost is in the range K ∈ [80, 320], the review cost is in the range W ∈ [80, 320] and the penalty cost per unit is in the range b ∈ [4, 16]. Demands per period are uniform random variables in the range [30, 70]. We generate 100 different instances for planning horizons in the range of 4 to 20 periods.

Figure 3 compares the computational times of the methods: the new method is able to solve instances almost twice as large in a reasonable time. Though it still has exponential behaviour, its slope is considerably less than that of the SDP. The randomisation reduces the computational effort needed. The guided search requires the computation of an (R, S) policy before the BnB approach; for small instances, this added computational effort outweighs the improvement provided by a higher pruning percentage, but for medium and large instances the improvement is considerable. Figure 4 shows the range of minimum and maximum computational times for increasing planning horizon lengths; we omitted BnB-Rand to improve the readability of the plot. The SDP solution has low variability in the required computational time, while BnB-Guided presents the highest variability among the different solutions. This is because in some instances the precomputed replenishment plan is the optimal one, which leads to strong pruning of the tree and reduces the computational time considerably. Overall, we can see that BnB-Guided provides a considerable improvement.

Instance type analysis
In this analysis, we aim to understand how the cost parameters affect the computational effort required to find the policy and the pruning percentage. We use a testbed of 324 instances, in which the average demand values follow seasonal patterns with different trends. Tables 2 and 3 give an overview of the computational time, the pruning percentage and the average number of reviews of the methods for the 10- and 20-period experiments. They show that SDP is not strongly affected by the cost parameters and that the main differences are caused by the demand patterns.
This is because the maximum average demand per period is lower for STA, LCY1 and RAND. The stationary pattern is fastest to compute, as its maximum is 50; the second fastest is the first life-cycle pattern, with a maximum of 75; the erratic pattern is slowest. All the other patterns have a maximum of 100.
The pruning percentage gives an indication of the efficacy of the branch-and-bound. Our algorithms perform particularly well with high review costs. For instance, with 20 periods and W = 320 the pruning percentage reaches an average of 99.83% for BnB-Guided, which solves an instance in less than 13 minutes on average, while the baseline is expected to take more than six weeks. For BnB, the percentage is 98.52%, so it visits more than twice as many nodes as the guided version. The randomised search (not shown in the table for the sake of readability) reaches an average of 98.92%. We note that the penalty cost also affects performance: a higher penalty cost reduces pruning.
The average number of review moments in the optimal policies decreases as the ordering and review costs increase. A higher penalty cost leads to more frequent reviews, which reduces the probability of demand exceeding the available stock and mitigates the uncertainty of the inventory level. We also observe that the decreasing demand pattern requires fewer review periods than the others, owing to its decreasing tail, which reduces the number of orders needed.
Our best-proposed method outperforms the baseline by factors of 50 and 1300 on 10-and 20-period instances, respectively.

Conclusion and Future Work
In this paper, we considered a single-item, single-stocking location inventory lot-sizing problem with non-stationary stochastic demand, fixed and linear ordering costs, a review cost, holding costs and penalty costs. We presented the first algorithm to compute optimal (R, s, S) policy parameters. This policy has high practical value, but the computation of optimal or near-optimal parameters has long been considered extremely difficult. Our technique is a hybrid of branch-and-bound and stochastic dynamic programming, enhanced by ad hoc bounds computed with dynamic programming and by a randomised depth-first exploration of the search tree.
In an extensive numerical study, we first investigated the scalability of the proposed methods and then analysed the impact of the instance parameters on their performance. This technique opens up multiple research directions in the determination of (R, s, S) policy parameters: it can lead to new optimal solutions for the same problem, it can be improved with tighter bounds, and it is useful for computing the optimality gaps of new heuristics.
In future studies, the approach presented here can be extended to overcome some of the limitations of the problem setting. Considering multiple items with joint shipping, or modelling a more complex supply chain with multiple echelons, are generalisations of this problem that would increase the applicability of the (R, s, S) policy.