Stochastic Dynamic Programming Heuristic for the (R, s, S) Policy Parameters Computation

The (R, s, S) is a stochastic inventory control policy widely used by practitioners. In an inventory system managed according to this policy, the inventory is reviewed at instant R; if the observed inventory position is lower than the reorder level s, an order is placed. The order's quantity is set to raise the inventory position to the order-up-to-level S. This paper introduces a new stochastic dynamic program (SDP) based heuristic to compute the (R, s, S) policy parameters for the non-stationary stochastic lot-sizing problem with backlogging of the excess demand, fixed order and review costs, and linear holding and penalty costs. In a recent work, Visentin et al. (2021) present an approach to compute optimal policy parameters under these assumptions. Our model combines a greedy relaxation of the problem with a modified version of Scarf's (s, S) SDP. A simple implementation of the model requires a prohibitive computational effort to compute the parameters. However, we can speed up the computations by using the K-convexity property and memoisation techniques. The resulting algorithm is considerably faster than the state-of-the-art, extending its adoptability by practitioners. An extensive computational study compares our approach with the algorithms available in the literature.


Introduction
The computation of solutions for the non-stationary stochastic lot-sizing problem is a well-developed branch of inventory control. The stochasticity of the demand allows modelling the uncertainty of real-world problems, while its non-stationarity allows considering seasonality or life cycle of products. Under this setting, the inventory must satisfy a demand represented by a set of stochastic variables of known probability distributions, generally considered independent. Arrow, Harris, and Marschak (1951) is considered to be the first known work on stochastic inventory models.
The single-item, single-echelon, non-stationary lot-sizing under ordering, holding and penalty cost is an important class of inventory problems. The problem considers a time horizon split into T periods. A wide variety of policies has been developed to manage these systems (Silver (1981)). A policy defines when an order has to be placed and its quantity. According to the classification proposed in Bookbinder and Tan (1988), three policies have been used to deal with stochasticity: static, dynamic and static-dynamic strategies.
In the static uncertainty strategy, also known as (R, Q), the timing and quantity of the orders are fixed at the beginning of the time horizon; this policy does not react to demand realisations.
The dynamic strategy, (s, S) policy, checks the inventory level in each period. If the inventory is lower than the reorder level s, an order to raise it to the order-up-to-level S is placed. This policy reacts quickly to unforeseen demand realisations; Scarf (1959) proves its optimality when the review costs are ignored. However, this policy suffers from a high degree of setup-oriented nervousness (Tunc et al. (2011); De Kok and Inderfurth (1997)), meaning that the order timings frequently change, limiting its practical applicability.
The static-dynamic strategy, (R, S) policy, aims to tackle this issue by fixing the replenishment times at the beginning of the time horizon. In this policy, an order that raises the inventory up to S is placed every R periods. Knowing the replenishment times in advance makes it possible to negotiate better prices and schedule joint deliveries.
When the demand is non-stationary or the time horizon is finite, these policy parameters vary across the planning horizon, assuming the $(R_t, Q_t)$, $(s_t, S_t)$ and $(R_t, S_t)$ form, for $t \in [1, \ldots, T]$.
The (R, s, S) policy is a generalisation of the dynamic and static-dynamic strategies. In the (R, s, S) policy, the inventory level is assessed at review intervals R; if it falls under the s level, an order is placed that raises the inventory level to S. If the cost of reviewing the inventory is null, the policy reviews the inventory in each period, behaving as the (s, S) one. If the s level is set equal to S, an order is placed at each review. In the non-stationary stochastic problem configuration, the policy parameters change across the time horizon, assuming the vector form (R, s, S).
The (R, s, S) is widely used by practitioners (Silver (1981)). In the case of stochastic non-stationary problems, three sets of parameters have to be jointly optimised to minimise the expected cost. This task has been considered extremely difficult. In a recent work, Visentin et al. (2021) introduce the first algorithm to compute the optimal policy parameters. They apply a branch-and-bound approach to explore the possible replenishment plans while computing the reorder levels and order-up-to-levels using stochastic dynamic programming (SDP). While their method computes the optimal set of parameters, it struggles to scale to big problems, limiting its applicability by practitioners. In this work, we fill this gap in the literature by:
• presenting a relaxation that allows a greedy computation of the replenishment cycles. We combine it with an SDP formulation for the computation of (R, s, S) policy parameters;
• introducing computational enhancements that make the model computable in reasonable time;
• analysing an extensive numerical study showing that the heuristic requires significantly less computational effort than the optimal method;
• investigating the problem configurations for which the policy computed by the heuristic differs from the optimal one.
The paper is structured as follows. A survey of the literature is presented in Section 2. Section 3 provides the description of the problem and of the best-known solution, later used as a comparison. Section 4 introduces the relaxation used, the greedy approach and the computational enhancements. Section 5 shows a comprehensive numerical study. Finally, Section 6 concludes the paper.

Literature review
This section surveys the relevant stochastic lot-sizing literature. In the first part, we position our approach in comparison to other inventory control policies. We then analyse recent practical applications of the (R, s, S) policy.
An inventory control policy defines when to assess the inventory, when to place an order, and the size of the order. The problem of computing policy parameters to satisfy a stochastic demand appears in a wide variety of industrial settings, and it has been extensively investigated in the literature (Silver (1981)). Bookbinder and Tan (1988) propose a broad framework of inventory control strategies: static uncertainty ((R, Q) policy), dynamic uncertainty ((s, S) policy), and static-dynamic uncertainty ((R, S) policy). It classifies the approaches based on when the replenishment decisions are taken: at the beginning of the planning horizon or after realising a period demand. In the (R, Q) policy, the full replenishment plan is fixed at the beginning. A fixed ordering plan is preferred in industrial settings where rigid production/shipment plans are needed. For these reasons, the computation of this policy under uncertainty has been widely investigated, e.g. Sox (1997); Meistering and Stadtler (2017); Tunc (2021). Scarf (1959) proves that the (s, S) policy (dynamic strategy) is cost-optimal. In this policy, the decision to place an order and its quantity are taken after observing the demand. This policy is particularly effective in dealing with unexpected demand realisations. Recent works involving this policy are Jiao, Zhang, and Yan (2017); Xiang et al. (2018); Azoury and Miyaoka (2020). The static-dynamic uncertainty ((R, S) policy) fixes the replenishment moments at the beginning of the planning horizon and decides the size when placing the order. This policy is preferred because it reduces the setup-oriented nervousness (Tunc et al. (2011)); a known order schedule also allows better deals with the carriers. We refer the readers to relevant studies on this policy, e.g. Tarim and Kingsman (2004); Rossi, Kilic, and Tarim (2015); Tunc et al. (2018). Ma, Rossi, and Archibald (2019) present a survey of stochastic inventory control policy computation. However, the Bookbinder and Tan (1988) classification does not take into account the stock-taking cost commonly present in real-world problems, e.g. Fathoni, Ridwan, and Santosa (2019); Christou, Skouri, and Lagodimos (2020). The (R, s, S) policy has a lower expected cost compared to the (s, S) one when a cost for assessing the inventory is considered. As mentioned in the introduction, the (R, s, S) can be seen as a generalisation of the (s, S) and (R, S) policies.
The (R, s, S) policy has a vast number of applications in the literature, due to a reduced nervousness compared to the (s, S) and a better cost-performance than the (R, S). These policies have also been studied for different problem configurations. Schneider and Rinks (1991); Schneider, Rinks, and Kelle (1995) introduce two heuristics to compute (R, s, S) parameters in a two-echelon inventory system with one warehouse and multiple retailers. Strijbosch, Moors et al. (2002) propose a technique to simulate an (R, s, S) inventory system in which the parameters remain constant. It can compute fill rates or find parameter values to achieve a prescribed service level. Chen and Lin (2009) adopt a hedge-based (R, s, S) policy portfolio, with constant parameters in the short term, for a multi-product inventory control problem. In Cabrera et al. (2013) the (R, s, S) policy is used to manage the inventory of multiple warehouses. Göçken et al. (2015) use a simulation optimisation technique to determine the optimal policy for distribution centres in a two-echelon inventory system with lost sales. Johansson et al. (2020) use an (R, s, S) policy for controlling one-warehouse, multiple-retailer inventory systems; their configuration is motivated by a real problem faced by a company selling metal sheet products. In the surveyed papers, the policy parameters are optimised and kept constant across the time horizon, or the R value is a given fixed value, reducing the problem to an (s, S) policy. For example, Lagodimos, Christou, and Skouri (2012) solve the continuous-time problem with stationary demand; Christou, Skouri, and Lagodimos (2020) extend their work to consider the order quantity as a multiple of a given batch size. This is due to the complexity of jointly optimising the three sets of parameters. The additional cost of using a stationary policy when the demand varies is well known (Tunc et al. (2011)). Visentin et al. (2021) introduce the first optimal approach to compute (R, s, S) policy parameters with stochastic non-stationary demand. However, their approach requires considerable effort to solve big instances, limiting its usability for practitioners.
The survey presented in this section places our work in the stochastic lot-sizing literature. The (R, s, S) policy has a wide variety of applications due to clear advantages over other policies. The algorithm presented herein aims to boost its adoption by providing a heuristic that computes near-optimal policies using a fraction of the computational effort compared to the state-of-the-art.

Problem description
This work considers the single-item, single-stocking location, stochastic inventory control problem over a T-period planning horizon. The (R, s, S) policy defines three aspects of inventory management: the timing of inventory reviews, when an order is placed, and the order's size. A review takes place when the inventory level in the warehouse is assessed; these moments are fixed at the beginning of the time horizon. An order can only be placed after a review takes place. The interval between two review moments represents a replenishment cycle.
The demand's stochasticity and non-stationarity in period t are modelled through the random variable $d_t$. Demands are independent variables with a known probability distribution. The cumulative demand from period t to the beginning of period j is denoted by $d_{t,j}$, with j > t. If the demand in a given period exceeds the on-hand inventory, the excess is backlogged and carried to the next period. In Section 4.4, we extend the model to the lost-sales configuration, where the excess demand is lost; a common approach when competitors' products are available. Under these assumptions, the (R, s, S) policy takes the vector form (R, s, S), with $R = (R_1, \ldots, R_T)$, where $R_t$, $s_t$ and $S_t$ denote respectively the length, the reorder level and the order-up-to-level associated with the t-th inventory review.
Policies are compared based on their expected cost. Stocktaking has a fixed cost of W. We denote by $Q_t$ the quantity of the order placed in period t. Ordering costs are represented by a fixed value K and a linear cost, but we shall assume that the variable cost is zero without loss of generality. The extension of our solution to the case of a variable production/purchasing cost is straightforward, as this cost can be reduced to a function of the expected closing inventory level at the final period (Tarim and Kingsman (2004)). At the end of each period, a holding cost h is charged for every unit carried from one period to the next. In case of a stockout, a penalty cost b is charged for each item and period. We denote with $I_t$ the closing inventory level for period t, making $I_0$ the initial inventory.
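To make the cost structure concrete, the following minimal sketch costs a single demand realisation under an (R, s, S) policy. The function name, the boolean `review` encoding of the review moments, the zero lead time and the default cost values are assumptions made for illustration only.

```python
def simulate_policy(demands, review, s, S, I0=0, W=10.0, K=50.0, h=1.0, b=10.0):
    """Cost of one demand realisation under an (R, s, S) policy.

    demands: realised demand per period
    review:  review[t] is True when period t is a review moment (assumed encoding)
    s, S:    reorder and order-up-to levels per period (read only at reviews)
    """
    inventory = I0
    total = 0.0
    for t, d in enumerate(demands):
        if review[t]:
            total += W                  # stock-taking cost
            if inventory < s[t]:
                total += K              # fixed ordering cost
                inventory = S[t]        # order arrives at once (assumed zero lead time)
        inventory -= d                  # demand is served; any excess is backlogged
        total += h * max(inventory, 0) + b * max(-inventory, 0)
    return total, inventory
```

With reviews in the first and third of four periods, the function returns the total cost together with the closing inventory of the last period.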
We consider the problem of computing the (R, s, S) policy parameters that minimize the expected total cost over the planning horizon. The order quantity $Q_t$ is fixed at every review moment before the demand realisation using:

$$Q_t = q_t(S_t, s_t, I_{t-1}) = \begin{cases} S_t - I_{t-1} & \text{if } t \text{ is a review period and } I_{t-1} < s_t, \\ 0 & \text{otherwise;} \end{cases} \qquad (1)$$

that is, the order is placed only if t is a review period and the opening inventory is below the reorder level $s_t$. For the sake of brevity, in the following formulas we use $Q_t$ as a replacement for $q_t(S_t, s_t, I_{t-1})$. The problem of computing the optimal (R, s, S) parameters is that of minimising $C_1(I_0)$, the expected cost of the optimal policy parameters starting at period 1 with the initial inventory $I_0$. In general, $C_t(I_{t-1})$ represents the expected inventory cost of starting at period t with opening inventory $I_{t-1}$. The expected cost of a review cycle starting in period t and ending in period $t + R_t$ comprises the review, ordering, holding and penalty costs for that cycle.
$C_t(I_{t-1})$ values can be computed recursively when all the policy parameters are known: the cost at period t is the expected cost of the current review cycle plus the expected cost of starting at period $t + R_t$ with the resulting opening inventory, until the base case $C_{T+1}(I_T) = 0$ is reached. For a given (R, s, S) parameter set, this formulation makes it possible to compute the expected policy cost. However, the number of parameter combinations is exponential, making this approach unusable for the computation of the optimal ones.
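One way to write the cycle cost and the optimisation problem described above is sketched below. The symbol $f_t$ for the expected cycle cost and the indicator notation are assumptions introduced for this sketch; $d_{t,t+i}$ is the cumulative demand defined earlier.

```latex
f_t(I_{t-1}, Q_t, R_t) = W + K\,\mathbb{1}\{Q_t > 0\}
  + \sum_{i=1}^{R_t} \mathrm{E}\big[\, h \max(I_{t-1} + Q_t - d_{t,t+i},\, 0)
  + b \max(d_{t,t+i} - I_{t-1} - Q_t,\, 0) \,\big]

C_1(I_0) = \min_{(R, s, S)} \Big\{ f_1(I_0, Q_1, R_1)
  + \mathrm{E}\big[ C_{1+R_1}(I_0 + Q_1 - d_{1,\,1+R_1}) \big] \Big\}
```

The sum charges holding or penalty cost at the end of each of the $R_t$ periods of the cycle, and the expectation term carries the closing inventory of the cycle into the next review.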
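The evaluation of a fixed parameter set can be sketched as follows. This is a minimal illustration that unrolls the recursion period by period rather than cycle by cycle, which yields the same expected cost; the function name, the dictionary representation of the demand distributions and the boolean `review` vector are assumptions, and demands are taken to have finite support.

```python
from functools import lru_cache


def expected_policy_cost(pmfs, review, s, S, I0, W, K, h, b):
    """Expected cost of a fixed (R, s, S) policy.

    pmfs[t] maps each demand value of period t to its probability (assumed format).
    """
    T = len(pmfs)

    @lru_cache(maxsize=None)
    def cost(t, inv):
        if t == T:                      # base case: no cost after the horizon
            return 0.0
        c = 0.0
        if review[t]:
            c += W                      # review cost
            if inv < s[t]:
                c += K                  # fixed ordering cost
                inv = S[t]              # raise the inventory to S_t
        for d, p in pmfs[t].items():
            closing = inv - d
            c += p * (h * max(closing, 0) + b * max(-closing, 0)
                      + cost(t + 1, closing))
        return c

    return cost(0, I0)
```

Evaluating one parameter set is cheap; the difficulty discussed above lies in the number of parameter sets that would have to be evaluated.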

Branch-and-bound approach
Visentin et al. (2021) present the first algorithm for computing the optimal parameters for the (R, s, S) problem. Their work is based on the following lemma:
Lemma 3.1. If the replenishment cycles (R) are fixed, the problem is reduced to a particular version of the (s, S) policy computation and can be solved to optimality using Scarf's SDP (Scarf (1959)).
In this case, the problem is formulated as a minimisation over R in which s and S are dependent on R. The proposed baseline computes the optimal replenishment cycles by testing all possible R combinations and computing the optimal policy cost for each of them. Their best technique, our comparison in the experimental section, uses BnB to avoid recomputations and prune sub-optimal R assignments. Optimal $s_t$ and $S_t$ levels can be computed by considering only future periods when $R_t$ is fixed, ignoring the expected opening inventory level; this is not valid for the computation of the R vector.

Heuristic technique
The heuristic introduced in this work aims to compute locally optimal $R_t$ values to produce a near-optimal (R, s, S) policy. The main idea is to move the assignment of the decision variable $R_t$ to period t, rather than fixing all of them at the beginning of the time horizon as in Equation 6. This can be done by transforming the recursive Equation 4 so that the minimisation over $R_t$ takes place at period t. Solving this recursion could lead to different optimal $R_t$ for different opening inventory levels $I_{t-1}$. For example, if the opening inventory level is slightly higher than $s_t$ but considerably lower than $S_t$, an order is not placed, but the next review cycle might be shortened. However, in the (R, s, S) policy, the review cycles are fixed at the beginning of the time horizon and not after the demand realisation. This is the reason why we need to know the probability of the opening inventory level to determine the optimal $R_t$.
Our heuristic consists of choosing a locally optimal $R_t$, assuming that an order is placed in period t and allowing the possibility of placing a negative order. We define these locally optimal replenishment cycles as $R^a_t$. The independence of the replenishment cycles is similar to the (R, S) policy, and the negative order relaxation is widely used, e.g. Özen, Dogru, and Tarim (2012).
Knowing the expected cost of future periods $C_j$ with j > t, it is possible to compute the optimal $s_t$ and $S_t$ for that specific replenishment cycle $R_t$ using SDP. The best $S_t$ is the value that minimizes $C_t(S_t)$, since we place an order to reach the point with the lowest future expected cost:

$$S_t = \arg\min_{S_t} C_t(S_t)$$
So, assuming that an order is placed, the best replenishment cycle is the one that has the lowest cost after the inventory level is topped up to $S_t$. As mentioned above, the computation of $C_t$ requires the expected costs of future periods $C_j$ with j > t, which are dependent on the optimal $R_j$. We relax the cost function by defining $C^a_t$ as the expected cost of using the locally optimal $R^a_j$ for all periods j after t. Given $C^a_{T+1}(I_T) = 0$, it is possible to compute the relaxed cost function backwards using an approximate SDP functional equation. This computes a near-optimal replenishment schedule $R^a$, and the set of reorder and order-up-to levels optimal for that given schedule. Due to the relaxation, $R^a$ can differ from the optimal R; however, as the experimental section shows, this event is rare.
The resulting approximate SDP formulation is more complex than the (s, S) one, making the computational effort required to solve it prohibitive. This is mainly due to the computation of the expected cycle cost (Equation 3); its computation involves three variables in each period: current inventory, order size and length of the replenishment cycle. This computational effort can be considerably reduced by applying the K-convexity property (Scarf (1959)) used in the (s, S) SDP formulation. The deployment of search reduction and memoisation techniques further improves the performance and has a crucial impact on the applicability of this model. In the next subsections, we present the pseudocode for the solution and how these enhancements affect it.
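The relaxed choice of the cycle length and the resulting cost function can be sketched as follows. The notation is an assumption made for illustration: $f_t(I, Q, R)$ stands for the expected review, ordering, holding and penalty cost of a cycle of length R, and $C_t(S_t \mid R_t)$ for the expected cost once the inventory has been raised to $S_t$ under cycle length $R_t$.

```latex
R^a_t = \arg\min_{R_t} \; C_t(S_t \mid R_t)

C^a_t(I_{t-1}) = \min_{Q_t \ge 0} \Big\{ f_t(I_{t-1}, Q_t, R^a_t)
  + \mathrm{E}\big[ C^a_{t+R^a_t}(I_{t-1} + Q_t - d_{t,\,t+R^a_t}) \big] \Big\},
\qquad C^a_{T+1}(I_T) = 0
```

The first line ignores the opening inventory of period t, which is what makes the cycle lengths computable greedily in a single backward pass.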

Pseudocode
Algorithm 1 shows the procedure to compute the heuristic backwards. Lines 1-2 contain the boundary condition. Line 3 goes through all the periods in backward order. Line 5 searches through all the possible replenishment cycles, line 6 through all the inventory levels and line 7 through all the possible order quantities. Lines 12-13 save the current value of $R^a_t$ according to Equation 9, while line 14 updates the relative expected costs, Equation 10.

Algorithm 1 RsS-SDP()
1: for i from min inventory to max inventory do
   ⋮
   for i from min inventory to max inventory do
7: for q from 0 to max order do
   ⋮
   $R^a_t$ ← r
   ⋮

For clarity and for the sake of the enhancements, we separate the computation of the immediate cost. Let $\zeta_{t,t+j}$ be a value of the random variable $d_{t,t+j}$ and $P(\zeta_{t,t+j})$ be the probability of assuming that value. Algorithm 2 computes the immediate cost, Equation 3.

Algorithm 2
   ⋮
   for each $\zeta_{t,t+j}$ value of $d_{t,t+j}$ do
   ⋮
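The backward procedure described for Algorithm 1 can be sketched in Python as follows. This is an illustrative reading of the description, not the original listing: the function names, the dictionary representation of the demand distributions, the bounded integer inventory grid `lo..hi` and the clamping of states below `lo` are all assumptions, and the reorder level is taken as the lowest level at which not ordering is no worse than ordering up to S.

```python
def convolve(p, q):
    """Distribution of the sum of two independent discrete demands."""
    out = {}
    for a, pa in p.items():
        for c, pc in q.items():
            out[a + c] = out.get(a + c, 0.0) + pa * pc
    return out


def rss_sdp(pmfs, W, K, h, b, lo, hi):
    """Backward pass; returns (plan, C) with plan[t] = (R_t, s_t, S_t)."""
    T = len(pmfs)
    levels = range(lo, hi + 1)
    C = {T: {i: 0.0 for i in levels}}            # boundary condition
    plan = {}
    for t in range(T - 1, -1, -1):               # periods, backwards
        best = None
        for r in range(1, T - t + 1):            # candidate cycle lengths
            # G[y]: expected holding/penalty cost of the cycle plus future
            # cost, given inventory y right after the ordering decision
            G = {y: 0.0 for y in levels}
            cum = {0: 1.0}
            for j in range(r):
                cum = convolve(cum, pmfs[t + j])
                for y in levels:
                    G[y] += sum(p * (h * max(y - d, 0) + b * max(d - y, 0))
                                for d, p in cum.items())
            for y in levels:
                G[y] += sum(p * C[t + r][max(y - d, lo)]
                            for d, p in cum.items())
            cost = {}
            for i in levels:                     # opening inventory
                order = min((G[i + q] for q in range(1, hi - i + 1)),
                            default=float("inf"))  # naive order-size search
                cost[i] = W + min(G[i], K + order)
            S = min(levels, key=G.get)
            s = min(i for i in levels if G[i] <= K + G[S])
            # relaxation: compare cycle lengths after topping up to S
            if best is None or G[S] < best[0]:
                best = (G[S], r, s, S, cost)
        _, r, s, S, cost = best
        plan[t], C[t] = (r, s, S), cost
    return plan, C
```

The review plan is read by starting at the first period and jumping ahead by the stored cycle length each time. The nested loops over cycle length, inventory level and order size are what make this naive version expensive.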

K-convexity
We can exploit the property of K-convexity presented in Scarf (1959) in solving the dynamic program. This approach is widely used to optimise the (s, S) SDP computation.
The property is defined as:

$$K + g(x + a) - g(x) - a\,\frac{g(x) - g(x - b)}{b} \ge 0$$

for all positive a, b and x. Scarf (1959) shows that, considering $s^*_t$ and $S^*_t$ the optimal reorder level and order-up-to level for period t, it is optimal to order up to $S^*_t$, the minimiser of the cost function, whenever the inventory is below $s^*_t$, the lowest level whose cost does not exceed the minimum by more than K. Both levels are found by computing $C_t(y)$ for different values of y, starting from an upper bound of $S_t$. The value y is then decremented, and the lowest value of $C_t$ is remembered. When the cost is greater than this lowest value plus K, the search terminates. $S_t$ is the inventory level at which the cost assumes the minimum value, and $s_t$ is the one at which we stop the search. This approach greatly speeds up the computation of the SDP.
Similarly to the computation of the (s, S) policy, we can use the K-convexity property for the (R, s, S). Considering Equation 10, for a fixed $R_t$ the problem is reduced to an (s, S) one in which no order can be placed during the next $R_t − 1$ periods.
Algorithm 3 shows the pseudocode of the enhanced SDP, clarifying the reason for the improvement. For a fixed review cycle length $R_t$, there is no need to search for the best order quantity $Q_t$. When the reorder level $s_t$ is determined, the lower inventory levels assume the same expected cost.
Algorithm 3 RsS-SDP-KConv()
1: for i from min inventory to max inventory do
   ⋮
   for i from max inventory down to min inventory do
   ⋮
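The descending search can be sketched as follows. This is a minimal illustration assuming a K-convex cost function supplied as a callable `cost(y)`; the function name is hypothetical, and the returned reorder level follows the convention that an order is placed only when the inventory is strictly below it, so it sits one level above the point where the search stops.

```python
def find_s_S(cost, K, lo, hi):
    """Descending search for (s, S) on a K-convex cost function."""
    S, best = hi, cost(hi)          # start from an upper bound of S
    y = hi - 1
    while y >= lo:
        c = cost(y)
        if c < best:
            S, best = y, c          # new minimum: candidate order-up-to level
        elif c > best + K:
            return y + 1, S         # ordering pays off at y and below: stop
        y -= 1
    return lo, S                    # threshold not reached on this grid
```

Levels below the stopping point are never evaluated; each of them would order up to S at the same cost, which is where the saving in the enhanced SDP comes from.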

Cycle Cost Memoisation
The calculation of the cycle cost is particularly time demanding, as it involves a summation of expected costs over multiple periods. However, it is possible to identify situations in which the same computations occur multiple times. Let $l_t(I_t, R_t)$ be the function that computes the expected holding and penalty cost of starting at the end of period t with closing inventory $I_t$ and with the next review moment in $R_t$ periods. This new function sums the expected holding and penalty costs over the periods until the next review, considering $d_{i,j} = 0$ when i = j. Equation 3 can be rewritten in terms of $l_t$, and the $l_t(I_t, R_t)$ function can be computed in a recursive way; this can be considered as the functional equation of an SDP, where the holding/penalty cost of period t is the immediate cost. There are two boundary conditions. The states are represented by the tuple $(t, I_t, R_t)$ and are computed in a forward manner. To avoid recomputations, we store the computed tuples in a dictionary with constant access time.
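The memoisation idea can be sketched with Python's `functools.lru_cache`, which stores each computed state in a dictionary-like cache. This is an illustrative variant rather than the exact $l_t$ function: the state is indexed by the inventory at the start of period t instead of the closing inventory, and the function names and the dictionary representation of the demand distributions are assumptions.

```python
from functools import lru_cache


def make_cycle_cost(pmfs, h, b):
    """Build a memoised expected holding/penalty cost function."""

    @lru_cache(maxsize=None)            # cache keyed by the state (t, level, r)
    def cycle_cost(t, level, r):
        """Expected holding/penalty cost of periods t .. t+r-1, with no order."""
        if r == 0:                      # the next review has been reached
            return 0.0
        total = 0.0
        for d, p in pmfs[t].items():
            closing = level - d
            total += p * (h * max(closing, 0) + b * max(-closing, 0)
                          + cycle_cost(t + 1, closing, r - 1))
        return total

    return cycle_cost
```

States shared between cycles of different lengths, such as the tail of a longer cycle, are then computed once and looked up afterwards.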

Unit cost and lost sales extensions
Similarly to Visentin et al. (2021), unit ordering cost can be easily modelled as a function of the expected closing inventory or included in the immediate cost function.
In the case of a stockout, the lost sales model is more common than a delay of the demand (Verhoef and Sloot (2006)), especially in a retail setting. Lost sales models have been underrepresented in the inventory control literature (Bijvank and Vis (2012)); however, many recent works consider mixed lost-sales and backorder configurations (ElHafsi, Fang, and Hamouda (2021)). The model presented herein can be adapted to include partially lost sales. Dos Santos and Oliveira (2019) define β as the percentage of unmet demand that is backlogged; the remainder is lost. The functional equation 10 is modified accordingly.

Experimental Results
This section conducts an extensive computational study of the heuristic presented in this paper. We aim to evaluate the quality of the policies computed by the heuristic and the computational effort required. In Section 5.1, we assess the computational effort required to compute a policy and the quality of the policy itself under an increasing time horizon. An analysis of the heuristic's behaviour under different demand patterns and cost parameters is presented in Section 5.2. Finally, we analyse an example in which the algorithm computes a near-optimal replenishment plan.
For the experiments, we use as a comparison the branch-and-bound (BnB) technique presented in Visentin et al. (2021). This is the only (R, s, S) solver for this problem configuration available in the literature. We use the same solver to compute the optimality gap. The solvers are:
• BnB-Guided, the fastest branch-and-bound approach presented in Visentin et al. (2021). It pre-computes an initial replenishment plan using Rossi, Kilic, and Tarim (2015) to improve the computational performance.
• SDP, the basic implementation of the SDP heuristic model presented in Algorithm 1. We include this to appreciate the impact of the optimisation techniques deployed.
• SDP-Opt, the heuristic implementation deployed using the K-convexity property (Algorithm 3) and the immediate cost memoisation.
All experiments are executed on an Intel(R) Xeon E5640 Processor (2.66GHz) with 12 GB RAM. For the sake of reproducibility, we made the implementation of all the techniques and the data generators available. Since our approach is a heuristic, we use the optimality gap to measure the quality of the computed policy. The optimality gap is the estimated extra cost of using the policy instead of the cost-optimal one for a particular problem. It is defined as:

$$\text{Optimality gap} = \frac{\text{Policy cost} - \text{Optimal cost}}{\text{Optimal cost}} \qquad (18)$$

Better policy parameters exhibit a lower optimality gap. It can be used to estimate the inventory cost of deploying a non-optimal system.
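As a small worked example of Equation 18, the snippet below applies the definition to the expected costs reported later in the paper for the example instance (1793 for the BnB policy and 1845 for the SDP policy). The function name is an assumption.

```python
def optimality_gap(policy_cost, optimal_cost):
    """Relative extra cost of a policy with respect to the optimal one."""
    return (policy_cost - optimal_cost) / optimal_cost


gap = optimality_gap(1845, 1793)
print(f"{gap:.1%}")  # prints 2.9%
```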

Scalability
We used the same testbed presented in Visentin et al. (2021). Figure 1 shows the logarithm of the average computational time over the 100 instances in comparison with the fastest technique available in the literature. The simple implementation of the heuristic can barely solve tiny instances before the time limit, making it useless for practical purposes. The reduction of computational effort provided by K-convexity and memoisation is massive. The guided BnB slightly outperforms the optimised SDP for small instances up to 8 periods; then the gap between the two strongly increases, making the SDP able to solve instances more than twice as big in the same amount of time. The performance improvement due to K-convexity is more significant than the one due to memoisation. Moreover, it generally avoids the computation of all the DP states associated with a negative inventory (line 13 of Algorithm 3). The memoisation offers a great speed-up in the computational times, which is more significant in bigger instances. For bigger instances, the physical memory needed grows to the point of requiring memory swap, which slows down the computation.
In this testbed, the heuristic always computes the optimal replenishment plan.

Instance type analysis
These experiments aim to analyse the performance of the heuristic under different instance parameters. We want to analyse which cost parameters affect the computational performance and the optimality gap of the heuristic. We use a modified version of the instances used in Section 6.2 of Visentin et al. (2021). The algorithm proposed herein computes the optimal policy parameters for all the instances used therein.
Our extension aims to find problem settings where the heuristic under-performs the optimal approach. We do it by examining a wider range of review and ordering costs and increasing the demand's uncertainty. Poisson distributed demand does not have a parameter to increase the uncertainty over its expected value; for this reason, we included normal demand in our experiments. We use two different planning horizons: 10 and 20 periods. For the cost parameters, we use all the possible combinations of review and ordering cost values K, W ∈ {20, 40, 80, 160, 320}, with holding and penalty costs fixed respectively at h = 1 and b = 10. We consider Poisson demand and normally distributed demand with σ ∈ {0.1, 0.2, 0.3, 0.4}. In the literature, the standard deviation used is generally not higher than 0.3; we also include a higher value. The demand follows several patterns, including an erratic one (RAND); more details on these patterns can be found in Visentin et al. (2021). The combinations of the parameters mentioned above lead to the creation of 1 500 instances.
Table 1 and Table 2 show the results for the 10 and 20-period instances. Regarding the policies' quality, we consider the average optimality gap, the percentage of computed policies that differ from the optimal one and their optimality gap. We also compare the time required to compute the policies and the average number of reviews.
The cost factors suggest that the algorithm does not compute the optimal policy in situations with a high ordering cost and a low review cost. Due to the relaxation, the approximate SDP computes the parameters of the cycles based only on the state values, considering the uncertainty of the future periods but ignoring the one related to the period's opening inventory. When the review cost is low, the BnB uses more review periods compared to the SDP to counteract this uncertainty. Up to 16.67% and 18.67% (for the 10 and 20-period instances) of the policies computed differ from the optimal one; however, their gap averages less than 1%. The average optimality gap across all the instances with the lowest review cost is 0.22% and 0.29%. A higher ordering cost leads to longer intervals between orders, and so to a higher uncertainty on the opening inventory level of a period. This leads to a maximum of 10.67% and 13.33% of near-optimal policies. While for these particular settings the percentage of non-optimal policies is relatively high, their optimality gap is low.
The direct correlation between the demand uncertainty and the optimality gap is evident. In the literature, the standard deviation used is in the range [0.1, 0.3]; we used expected demand with a higher degree of uncertainty to show more clearly the situations in which our approach struggles. With realisations of the demand that strongly differ from their expected value, it is more likely that the opening inventory level is higher than the order-up-to-level at a review moment. In these cases, our approach relaxes the problem by placing a negative order and setting the inventory to S, so the policy is not optimal. The majority of the instances in which the SDP computes a near-optimal policy have σ = 0.4.
The pattern analysis provides interesting insights. The approach performs better with the increasing (INC) pattern regardless of the other instance parameters; in 10-period instances, it always computes the optimal policy. The worst performance is observed with the decreasing (DEC) pattern. This is in line with Özen, Dogru, and Tarim (2012), who consider a similar problem relaxation. If we have increasing demands, the base stock levels likely increase as well to satisfy higher demands. If the base stock levels increase monotonically, the relaxation generally computes the optimal policy. The second-worst pattern is the random one, due to randomly generated decreasing patterns. The biggest gap in the number of reviews is observed in the decreasing pattern, with 0.2 and 0.5 fewer reviews on average.
On average, the optimality gap between the two approaches is only 0.04% and 0.05%, with 4.4% and 5.87% of the policies computed being near-optimal. This shows the quality of the heuristic in computing policies.
Regarding the computational time, our approach is 4 and 300 times faster on the 10 and 20-period instances, respectively. Moreover, the cost parameters do not affect the SDP performance, while they affect the BnB pruning efficacy. For example, 20-period instances with a low review cost take six times more effort than those with a high review cost. Uncertainty on the forecast affects the performance of both approaches since it makes the computation of a state's expected cost more expensive. However, the SDP manages to reduce this impact using memoisation. The SDP's computational effort increases by a smaller factor for σ = 0.4 compared to σ = 0.1 than that of the BnB approach, whose increment is higher than 40 times.

Non-optimality of the relaxation
In this section, we analyse a single instance to better understand the differences between the computed policies.This example shows a situation in which the heuristic computes a non-optimal policy.When computing the solution, it considers only the expected demand for future periods.On the other hand, the BnB approach presented in Visentin et al. (2021) tests all the possible replenishment combinations of the previous periods during the search process.Not considering the previous demands means ignoring the possibility of having such a low demand that at a period t, the opening inventory level I t is higher than s t , and that this will strongly affect future decisions.This difference worsens the heuristics performances for high values of uncertainty and the decreasing pattern (DEC).In these instances, the high demand with high uncertainty at the beginning of the time horizon makes unexpected high inventory levels at a replenishment moment more likely.In this situation, the BnB solution adds more review moments (especially when the cost associated W is low) to assess the inventory level and react to the uncertainty.For example, considering the instance of Table 1 with K = 320, W = 20, σ = 0.4 and decreasing demand pattern.Table 3 shows the two policies.The BnB approach considers the higher uncertainty at the beginning of the time horizon; it also reviews the inventory level at periods 5 and 6.While these reviews add an extra cost in an almost deterministic system, they allow a better reaction to unexpected demand.At the end of the time horizon, the uncertainty on the inventory level is lower, and the two policies are identical from period 7 on when a lower demand leads to lower absolute variations of the realised demand.
The BnB policy has an expected cost of 1793, the SDP of 1845; a difference of 52 that leads to an optimality gap of 2.9%.

Table 3. (R, s, S) policy parameters for the K = 320, W = 20, σ = 0.4, DEC pattern instance.
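The effect of an extra review moment can be explored with a small simulation of one demand realisation under an (R, s, S) policy. This is a minimal sketch under stated assumptions, not the evaluation procedure of the study: replenishment is assumed to arrive immediately, costs are charged on the closing inventory of each period, and `simulate_policy` with its argument names is hypothetical.

```python
def simulate_policy(reviews, s, S, demands, K, W, h, b, opening_inventory=0):
    """Total cost of one demand realisation under an (R, s, S) policy.

    reviews[t] is True when period t is a review moment; s[t] and S[t] are the
    reorder level and order-up-to level used at that review.
    Assumptions: zero lead time, backlogging of unmet demand.
    """
    inventory = opening_inventory
    total = 0.0
    for t, demand in enumerate(demands):
        if reviews[t]:
            total += W                 # review cost is paid at every review
            if inventory < s[t]:       # below the reorder level: order up to S[t]
                total += K
                inventory = S[t]
        inventory -= demand            # negative inventory represents backlog
        total += h * max(inventory, 0) + b * max(-inventory, 0)
    return total
```

With `reviews=[True, False, True]`, `s=[10, 0, 10]` and `S=[30, 0, 30]`, a low realisation such as `demands=[5, 5, 5]` leaves the inventory above the reorder level at the second review, so no order is placed there. This is the kind of situation that a plan built only on expected demand does not anticipate.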

Conclusions
This paper presented a heuristic for the non-stationary stochastic lot-sizing problem with ordering, review, holding and penalty costs, a well-known and widely used inventory control problem. Computing (R, s, S) policy parameters is computationally hard due to the three sets of parameters that must be jointly optimised. We presented the first pure SDP formulation for such a problem. The algorithm introduced solves to optimality a relaxation of the original problem, in which review cycles are considered independently, and items can be returned/discarded at no additional cost. A similar relaxation has been previously used in (R, S) policy computation works.
The extensive numerical study showed a reduction of the computational effort needed to compute a policy. The basic formulation requires a prohibitive computational effort. Two enhancements, based on K-convexity (Scarf 1959) and memoisation, strongly improve the computational performance, making it possible to solve instances twice as big as those solved by the state-of-the-art. This allows practitioners to use such a policy in a wider range of real-world situations. We then investigated the SDP performance under different types of instances. We measured the computational effort to compute the policy and how much the relaxation affects its quality. The heuristic's computational effort is less affected by the instance configuration. The proposed algorithm rarely computes a non-optimal policy when demand uncertainty is low and in instances with high review cost and low fixed ordering cost. For Poisson distributed demand, the SDP always computes the optimal policy. The average optimality gap is 0.04% and 0.05%, with 95.6% and 94.13% of computed policies identical to the optimal, for the 10- and 20-period instances, respectively; more than half of the non-optimal policies are related to extremely high uncertainty of the demand (σ = 0.4), a configuration hardly considered in the lot-sizing literature. These differences are caused by a reduced number of review moments in the SDP computed policies.
In future studies, we plan to extend the method's applicability by considering more complex supply chains, such as multiple items, multiple echelons, and different cost structures. We also plan to further enhance the current formulation to improve the non-optimal computed policies, similarly to what Rossi et al. (2011) did with state space augmentation.
for t from T down to 1 do
    best review cost ← ∞
    for r from 1 to T − t + 1 do

Figure 1. Computational time of the (R, s, S) SDP over the number of periods, time limit 1 hour.

The holding cost per unit is fixed at h = 1. The other cost factors are sampled from uniform random variables: fixed ordering cost K ∈ [80, 320], fixed review cost W ∈ [80, 320] and linear penalty cost b ∈ [4, 16]. The demand is modelled as a series of Poisson random variables. The average demand per period is drawn from a uniform random variable ranging from 30 to 70. We generate 100 different instances. We replicate the experiments for increasing values of the number of periods.
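The sampling scheme described above can be sketched as follows. This is an illustrative generator, not the one used for the experiments: it assumes continuous uniform draws (the text does not say whether values are rounded), and the `Instance` class, `generate_instances` function and `seed` parameter are hypothetical names.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Instance:
    holding_cost: float        # h, fixed at 1
    ordering_cost: float       # K, uniform in [80, 320]
    review_cost: float         # W, uniform in [80, 320]
    penalty_cost: float        # b, uniform in [4, 16]
    mean_demands: List[float]  # Poisson mean of each period, uniform in [30, 70]


def generate_instances(n_periods: int, n_instances: int = 100,
                       seed: int = 0) -> List[Instance]:
    """Draw random test instances following the ranges stated in the text."""
    rng = random.Random(seed)  # seeded so that a test bed can be reproduced
    return [
        Instance(
            holding_cost=1.0,
            ordering_cost=rng.uniform(80, 320),
            review_cost=rng.uniform(80, 320),
            penalty_cost=rng.uniform(4, 16),
            mean_demands=[rng.uniform(30, 70) for _ in range(n_periods)],
        )
        for _ in range(n_instances)
    ]
```

Calling `generate_instances(n_periods)` for increasing values of `n_periods` yields one test bed of 100 instances per horizon length, matching the replication over the number of periods.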

Table 1. Optimality gap and pruning percentage for the techniques for instances of 10 periods.

Table 2. Optimality gap and pruning percentage for the techniques for instances of 20 periods.