Deep Reinforcement Learning for One-Warehouse Multi-Retailer inventory management

The One-Warehouse Multi-Retailer (OWMR) system is the prototypical distribution and inventory system. Many OWMR variants exist, e.g. demand in excess of supply may be completely back-ordered, partially back-ordered, or lost. Prior research has focused on the study of heuristic reordering policies such as echelon base-stock levels coupled with heuristic allocation policies. Constructing well-performing policies is time-consuming and must be redone for every problem variant. By contrast, Deep Reinforcement Learning (DRL) is a general purpose technique for sequential decision making that has yielded good results for various challenging inventory systems. However, applying DRL to OWMR problems is nontrivial, since allocation involves setting a quantity for each retailer: the number of possible allocations grows exponentially in the number of retailers. Since each action is typically associated with a neural network output node, this renders standard DRL techniques intractable. Our proposed DRL algorithm instead outputs a multi-discrete action distribution, for which the number of output nodes grows only linearly in the number of retailers. Moreover, when total retailer orders exceed the available warehouse inventory, we propose a random rationing policy that substantially improves the ability of standard DRL algorithms to train good policies because it promotes the learning of feasible retailer order quantities. The resulting algorithm outperforms general-purpose benchmark policies by ∼1−3% for the lost sales case and by ∼12−20% for the partial back-ordering case. For complete back-ordering, the algorithm cannot consistently outperform the benchmark.


Introduction
Supply Chain Management (SCM) is responsible for efficiently delivering goods and services from suppliers to consumers. A typical supply chain may include multiple stages (echelons) of locations through which the goods move from the supplier to the consumer. The supplier may, for example, be a manufacturer of spare parts or raw materials, while the consumer can be a machine at a production facility or a customer at a retail shop. Since supply chains involve multiple locations spread geographically, efficient decisions about positioning goods in time and space are crucial for competitive advantage (Mentzer et al., 2001). The overall increase in the demand for goods and highly competitive markets force companies to adopt innovative supply chain management strategies to ensure cost-efficient availability of products.
A key objective within supply chain management is inventory optimization. The goal of inventory optimization is to find the best possible replenishment decisions for each location. A range of variants arises in practice, e.g. demand in excess of supply may be back-ordered, lost, or partially back-ordered and partially lost; demand may have continuous or discrete distributions, etc. In general, the problem is challenging due to its combinatorial complexity and the dependencies between inventory locations. As a result, exact solution methods are available only for a handful of problem variants under restrictive assumptions, and the focus of prior studies has mostly been on heuristic policies and numerical methods (de Kok et al., 2018). However, constructing well-performing policies is a difficult and time-consuming task that must be redone for every problem variant. As a consequence, well-performing policies are lacking for ill-researched problem variants. For example, the case where demand is completely backlogged has received a lot of attention (cf. de Kok et al. (2018); for policies and results for that case see e.g. Axsäter et al. (2002) and Marklund and Rosling (2012)), while the partial back-ordering and lost sales cases have received comparatively little attention, even though partial backordering "is an accurate description of customer behavior for many retail items" (Nahmias and Smith, 1994), while lost sales are common for many seasonal items (Li et al., 2021).
Deep Reinforcement Learning (DRL) is a general purpose technique for sequential decision making that is promising for inventory control, as underlined by a study by Gijsbrechts et al. (2022) finding that the A3C algorithm yields competitive policies for three challenging inventory problems, including the dual sourcing problem and the single-location lost sales inventory problem. Deep Reinforcement Learning is equally promising for the OWMR problem. Advances in deep learning have established the remarkable ability of deep neural networks to process multi-dimensional inputs. Since neural networks commonly represent policies in DRL (cf. Boute et al., 2021), this flexibility may enable DRL to accurately learn good OWMR policies, since the optimal OWMR policy depends not only on the inventory positions of the stock-points, but also on the vector of outstanding orders for all stock-points. Moreover, applications of neural networks and deep reinforcement learning have been shown to yield around 10% cost savings compared to heuristic benchmarks for OWMR problems with partial back-ordering (Gijsbrechts et al., 2022; Van Roy et al., 1997). These results have been obtained using algorithms that assume symmetric OWMR instances, i.e., problems where the lead-time, cost parameters, and demand distribution are the same for all retailers.
We contribute DRL algorithms and performance tests for several variants of general OWMR systems, i.e., including systems where retailers have differing demand distributions, lead-times, and cost parameters. General OWMR systems are multi-action inventory systems: each period, they require an action for each stock-point, i.e., for the warehouse and each retailer. Typically, applications of DRL to multi-action inventory systems have assigned one output node of the neural network to each combination of actions (Boute et al., 2021). For example, to apply A3C to the dual sourcing inventory problem, Gijsbrechts et al. (2022) assign an output node to each combination of an order for the slow source and an order for the fast source. Similarly, Vanvuchelen et al. (2020) apply DRL to the joint replenishment problem, and let each output node correspond to the probability that a certain combination of orders is placed by the two coordinating companies, forming a so-called discrete action distribution. When applied to a general supply chain, e.g. an OWMR problem with $N$ locations, this would require the last layer of the neural network to have $A^N$ outputs, where $A$ is the number of available order quantities. Indeed, each combination of orders for every location would be an action, which becomes prohibitive for larger $N$. Symmetric OWMR systems, i.e., systems where all retailers have identical parameters, facilitate the dynamic specification of a single order-up-to level for all retailers, thereby side-stepping this issue (cf. Gijsbrechts et al., 2022; Van Roy et al., 1997).
Our proposed approach tackles the underlying problem using a multi-discrete action distribution, which reduces the size of the last layer to the order of $N \times A$. The output of the neural network forms several probability distributions, one for each stock-point individually, and the corresponding order quantities are sampled independently. This allows for simultaneous decision making across all nodes in the system by performing only one forward pass through the neural network. Total shipments to retailers are, however, restricted by the available inventory at the warehouse, and the order quantities resulting from the neural network may exceed that amount. In the literature on OWMR systems, it is common to ration the orders placed by individual stock-points, e.g. with the goal of minimizing imbalances. While clever rationing may improve a given ordering policy, we find that it hampers the learning of a good ordering policy: it seems to promote over-ordering. To instead incentivize the learning of exact mappings from states to feasible actions, we propose to allocate sequentially. That is, we ship the individual amounts in full if possible, where a randomization process is applied to determine the sequence in which retailer shipments are executed each period. We find that this provides an incentive for the model not to over-estimate the order quantity, which in turn leads to substantially better policies.
To apply DRL to the problem, we first formulate it as a Markov Decision Process. To subsequently train the neural network, we adopt the Proximal Policy Optimization (PPO) algorithm (Schulman et al., 2017), a popular policy gradient Reinforcement Learning (RL) method. We present results across 14 instances with up to 11 stock-points. To underline the general applicability of the developed approach while testing the DRL performance vis-à-vis relevant benchmarks, we include instances with complete back-ordering, with partial back-ordering, and with lost sales. As benchmarks, we adopt echelon base-stock policies with two representative general-purpose rationing/allocation policies in case of shortage at the warehouse. Without hyper-parameter tuning, we train policies that respectively outperform the benchmark by 1−3% (for lost sales), outperform the benchmark by 10−20% (for partial lost sales), and are outperformed by the benchmark by a 0−3% margin (for complete back-ordering). Our results demonstrate the added value of DRL also for the important class of multi-action inventory problems, especially for model variants where it is difficult to construct appropriate heuristics. The developed approach may be more generally applicable to multi-echelon supply chains beyond the OWMR problem, as the ideas underlying its development are quite generic.
The remainder of this paper is structured as follows. We review the literature in Section 2. Section 3 introduces and formalizes the OWMR model, and presents our PPO approach for multi-echelon decision making. We report the results of our numerical experiments in Section 4, and conclude in Section 5.

Literature
We first discuss the literature on multi-echelon distribution systems, with a focus on periodic review One-Warehouse Multi-Retailer (OWMR) systems. We then discuss key literature on DRL, with a focus on studies that apply DRL to inventory problems.

Inventory management for multi-echelon inventory problems
Various classifications can be applied to multi-echelon inventory problems. The number of echelons and the network topology are the most important factors; see de Kok et al. (2018) for a typology and extensive review of multi-echelon inventory systems.
Multi-echelon systems are typically difficult to analyze because of the dependence between locations. Uncapacitated serial systems under full backlogging form an exception: for those systems base-stock policies are optimal, and the optimal base-stock policy can be identified using dynamic programming, working backwards through the network topology (Clark and Scarf, 1960). Under similar assumptions, pure assembly systems form a second exception; these can be reduced to an equivalent serial system (Rosling, 1989) and analyzed accordingly. For divergent systems, however, the common practice is to use approximation methods, where the underlying dynamic program is relaxed by one or more manipulations (relaxations, restrictions, projections, cost approximations; Geoffrion, 1970). Our focus is on the OWMR inventory system, which is a prototypical multi-echelon divergent inventory system under periodic review with constant positive lead times between the nodes and stochastic demand at the retailers.
The balance assumption (see Clark and Scarf, 1960; Eppen and Schrage, 1981) is the most common relaxation technique for divergent inventory systems; it works in the context of complete back-ordering. Assuming balanced inventories leads to a complete characterization of the optimal policy, which can even be extended to multi-echelon systems (Diks and De Kok, 1998). The assumption effectively relaxes the physical constraint of strictly positive allocation quantities and has various interpretations, such as allowing negative allocation quantities, lateral transshipments between retailers, and permitting inventory's immediate return to the warehouse without delay. These interpretations lead to the same result: the retailers' inventory positions become irrelevant, and the allocation decisions are based solely on the warehouse echelon stock. The balance assumption may be rather effective in some cases, but its effectiveness depends on the system parameters (Doğru et al., 2009). Despite the balance assumption being widely used to derive close-to-optimal policies, it may not be suitable when the retailers are prone to imbalance (Axsäter, 2015), and several works have provided analytic insights into distribution systems without the balance assumption, which in turn have yielded improved allocation heuristics (e.g. Axsäter et al., 2002; Marklund and Rosling, 2012).
Alternatively, the policy space can be restricted to a more convenient policy class. In this case, the goal is to find a best-performing policy in this policy class, either with respect to the original problem or its approximation. Some of the most common policy classes are the base-stock policies (Park, 1998; de Kok et al., 2018). Base-stock policies are reorder point policies that follow a simple rule: if the (echelon) inventory position falls below a reorder point, an order is placed to raise the (echelon) inventory level to a specified order-up-to level. Therefore, in this case, the optimization objective is to find the policy parameters: the reorder point and the order-up-to level. Since such a policy is fully determined by a few parameters, cost estimation becomes far simpler. Numerous papers provide methods to discover policy parameters or calculate the associated costs (e.g. Van der Heijden et al., 1997; Huh et al., 2016; Gayon et al., 2016).
A limitation of most prior studies on OWMR systems is their assumption of complete back-ordering, even though lost sales and partial back-ordering occur frequently in practice. It would be desirable to have a more generic approach that does not lean on such assumptions, and that additionally can be easily adapted to the numerous variations of OWMR systems that appear in practice. Deep Reinforcement Learning may offer such an approach.

Deep reinforcement learning
In recent years, Deep Reinforcement Learning (DRL) has stood out as a general framework that facilitates the solving of complex sequential decision-making problems (Bertsekas, 2019). Due to its computational requirements it is mostly used in combination with simulation. DRL utilizes Reinforcement Learning to solve a temporal credit assignment problem and deep neural networks for compressing the state representation. DRL has already advanced the state-of-the-art in a multitude of challenging domains. Some of the most notable examples include playing Atari 2600 games directly from pixels (Bellemare et al., 2013), performing continuous robotic control (Schulman et al., 2017), and winning against human champions in complex strategic games such as Go and chess (Silver et al., 2017) and the team-based video game Dota 2 (Berner et al., 2019).
Early DRL applications in supply chain management include work on the beer game (Oroojlooyjadid et al., 2022) and the application of the A3C algorithm to three challenging inventory problems (Gijsbrechts et al., 2022). In recent years, the application of DRL to inventory problems has gained significant attention (Boute et al., 2021), mostly because DRL is a general purpose technology that can be applied to a wide range of problems with limited adjustments. Apart from the two early works mentioned above, notable works include applications to joint ordering problems (Vanvuchelen et al., 2020) and capacitated lot sizing (van Hezewijk et al., 2022).
To scale DRL applications to real-life problems, it is crucial to address current limitations in applying multi-action/multi-echelon models (see Boute et al., 2021, for extensive discussions), and this topic is the subject of another recent paper: Vanvuchelen and Boute (2022) propose the use of continuous output nodes to efficiently represent policies for multi-echelon joint replenishment problems. Their work underlines the increasing interest in multi-echelon DRL, while their approach and problem are rather different from the developments in the work presented here. Pirhooshyaran and Snyder (2020) and Wang and Hong (2023) are also related to our work, in the sense that they propose methods that utilize neural networks and simulation to solve large-scale inventory systems. However, these studies consider stationary base-stock policies, and their aim is to find scalable algorithms for identifying close-to-optimal base-stock levels for each node. In contrast, we apply DRL to seek dynamic policies, i.e., state-dependent base-stock levels.
Applications of DRL to divergent multi-echelon systems are rather scarce; we are aware of only two examples: (1) the application of neurodynamic programming (Bertsekas and Tsitsiklis, 1996) to OWMR (Van Roy et al., 1997), and (2) Asynchronous Advantage Actor-Critic (A3C) (Mnih et al., 2016) on dual sourcing, lost sales, and OWMR problems (Gijsbrechts et al., 2022). Both papers report an approximate 10% improvement over base-stock policies, but sidestep the issue that multi-action systems pose by crucially relying on the assumption that the OWMR system is symmetric (cf. Section 1), i.e., that all retailers are identical.
In our paper, we directly address the issue that each period requires the simultaneous selection of replenishment quantities for all stock locations. Our neural network outputs multi-discrete action distributions, and is paired with a randomized sequential allocation rule to ensure effective learning. Our approach represents the first application of DRL to general OWMR systems, and it yields competitive results. It is also a step towards applying DRL to larger supply chains.

Methods
The One-Warehouse Multi-Retailer (OWMR) system that is the focus of this study is a key example of a supply chain network. We introduce supply chain networks and the OWMR system in Section 3.1. To apply Deep Reinforcement Learning (DRL), these systems must be formalized as Markov Decision Processes (MDPs). MDPs are introduced in Section 3.2 and used to formalize OWMR problems in Section 3.3. In Section 3.4, we discuss our multi-echelon DRL algorithm; Section 3.5 presents the allocation rule applied in case of shortages at the warehouse, and general-purpose benchmarks for the OWMR system are introduced in Section 3.6.

Multi-echelon inventory models
A supply chain network delivers goods from the supplier of the raw materials to the final consumer. A node in a multi-echelon supply chain network is a location that can hold inventory. See Fig. 1 for a schematic depiction of an inventory system consisting of a single node/location.
Supply chain networks consist of a number of locations. Downstream locations face customer demand, while other locations face internal demand, i.e., demand originating from other nodes. The most upstream nodes receive goods from external suppliers, while other (intermediate and downstream) nodes receive goods from their direct predecessors in the network. Supply chain networks may be studied in a periodic review setting: in each period $t \in \mathbb{N}$, first the inventory level and pipeline of all nodes is reviewed, and replenishment orders are placed accordingly. Subsequently, orders shipped a lead-time earlier for the various nodes arrive at those nodes. Then the replenishment orders are processed, taking into account that actual shipments are constrained by the availability of inventory at the corresponding predecessor nodes. After shipment quantities are determined, the shipped orders enter the pipeline. At the end of each period, customer demand is processed, and costs are incurred. Note that the pipeline corresponding to each node (see Fig. 1) serves to keep track of orders that have been shipped towards a node but have not arrived there yet.
The focus of this paper is on the OWMR system. It has $N$ nodes/locations. Locations $1, \ldots, N-1$ represent the retailers, and face customer demand. All retailers receive goods from node/location 0, referred to as the warehouse. Fig. 2 depicts the OWMR system.
Inventory pooling at the warehouse brings multiple benefits. Risk pooling is a key advantage: note that surplus inventory at one retailer cannot compensate for a shortage at another. Therefore, by pooling inventory at the warehouse we effectively postpone the time at which we need to allocate inventory to specific retailers. Additionally, holding inventory at the warehouse is typically less expensive than holding it at the retailers. This creates two opposing incentives. On the one hand, the warehouse should allocate as much to the retailers as possible to reduce potential shortages. On the other hand, positive on-hand inventory at the warehouse can be beneficial if the inventory levels among the retailers become unbalanced. In that case, the warehouse can address the imbalance by shipping additional inventory to those retailers that are low on inventory, relative to the other retailers.
The stochastic demand at the retailers is realized every period and satisfied with on-hand inventory. Each unit of demand represents a customer request for products. In case the on-hand inventory at a retailer is not sufficient to fully satisfy the demand at that retailer, a shortage occurs. We consider three possible models regarding customer behavior in case of shortages (a code sketch of the three cases follows the list):

1. Complete back-ordering. Unmet demand at the retailers is backlogged. Backlogged demand is satisfied as soon as the items become available at the retailer. This customer behavior model underlies most studies of multi-echelon inventory systems, mainly because of mathematical tractability (see de Kok et al., 2018).
2. Lost sales. Unmet demand is lost and a full penalty cost is incurred; this customer behavior model is for example common for seasonal items, see e.g. Li et al. (2021) for a real-life example.
3. Partial lost sales (see e.g. Nahmias and Smith, 1994). When there is insufficient inventory, with probability $1 - p_{\text{em}}$ demand is immediately lost and a lost sales penalty is incurred. With probability $p_{\text{em}}$, a customer agrees to an emergency shipment from the warehouse. If warehouse inventory is sufficient, demand is immediately satisfied; otherwise demand is lost and a penalty is incurred.
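As a concrete illustration, the following minimal Python sketch processes a retailer's unmet demand under the three behavior models. The function and its interface are hypothetical (not from the paper), and modeling emergency acceptance as a per-unit Bernoulli draw is one plausible reading of the partial lost sales model.

```python
import random

def process_unmet_demand(shortfall, warehouse_stock, mode, p_em=0.8):
    """Resolve a retailer's unmet demand under the three behavior models.

    Returns (backlogged, lost, emergency_shipped, new_warehouse_stock).
    """
    if mode == "backorder":        # 1. complete back-ordering
        return shortfall, 0, 0, warehouse_stock
    if mode == "lost_sales":       # 2. lost sales: full penalty, demand gone
        return 0, shortfall, 0, warehouse_stock
    # 3. partial lost sales: each unmet unit accepts an emergency shipment
    # from the warehouse with probability p_em, otherwise it is lost
    accepted = sum(random.random() < p_em for _ in range(shortfall))
    shipped = min(accepted, warehouse_stock)   # limited by warehouse stock
    lost = shortfall - shipped                 # refusals plus unmet emergencies
    return 0, lost, shipped, warehouse_stock - shipped
```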
Note that the warehouse faces internal demand that originates from the orders placed by the retailers. Its inventory may not be sufficient to satisfy the orders from all retailers. In that case, not all orders can be shipped, and to distribute the available inventory an allocation strategy is required; we discuss details in Sections 3.5 and 3.6.

Markov decision process
For the application of Reinforcement Learning, the OWMR problem needs to be formulated as a Markov Decision Process (MDP); these are briefly introduced next. Mathematically, an MDP is represented as a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ the action space, $P$ the transition function, $R$ the reward function, and $\gamma \in (0, 1]$ the discount factor. The Reinforcement Learning agent observes a state $s_t \in \mathcal{S}$ and produces an action $a_t \in \mathcal{A}$ that feeds back into the environment. The environment transitions to $s_{t+1} \in \mathcal{S}$ and emits a scalar reward $r_t = R(s_t, a_t, s_{t+1})$. This loop continues indefinitely, or until the episode terminates, e.g. after $T$ time-steps. The sequence of states, actions, and rewards constitutes a rollout or trajectory of the policy. Every trajectory accumulates rewards from the environment, $G = \sum_{t=1}^{T} \gamma^t r_t$. The objective is discovering a policy $\pi: \mathcal{S} \times \mathcal{A} \to [0, 1]$ that specifies the actions to take in each state; following policy $\pi$ implies that $\mathbb{P}(a_t = a \mid s_t) = \pi(s_t, a)$. We seek to maximize the expected cumulative reward $G$. We can express this objective in terms of finding an optimal policy $\pi^* \in \arg\max_\pi \mathbb{E}[G \mid \pi]$, which involves solving the recursive optimality equations (Bellman, 1953). In the case where a neural network represents a policy, the goal is formulated as finding the parameters $\theta$ of the policy $\pi_\theta$ that maximize the expected cumulative sum of rewards, i.e., $\theta^* \in \arg\max_\theta \mathbb{E}[G \mid \pi_\theta]$.
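To make the loop concrete, here is a minimal Python sketch of a rollout that accumulates the discounted return $G$. The `env` interface (`reset`/`step` returning a state and a reward) is a simplifying assumption for illustration, not the paper's implementation.

```python
def rollout(env, policy, T=100, gamma=0.99):
    """Simulate one trajectory and accumulate the discounted return G."""
    s = env.reset()
    G, discount = 0.0, gamma
    for t in range(T):
        a = policy(s)        # a_t sampled from pi(s_t, .)
        s, r = env.step(a)   # environment returns s_{t+1} and r_t
        G += discount * r    # G = sum_t gamma^t * r_t
        discount *= gamma
    return G
```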

MDP formulation of one-warehouse multi-retailer systems
For any supply chain network, let $N$ denote the number of nodes, and index the nodes from 0 to $N-1$. Actions are vectors $a_t = (a_0(t), a_1(t), \ldots, a_{N-1}(t))$. Denote by $q_i(t)$ the order that will arrive in period $t$ at node $i$. Recall that orders arrive after a lead-time; we denote the lead-time for node $i$ by $L_i$, which implies that the order placed in period $t$ is denoted by $q_i(t + L_i)$. (We assume that $L_i$ is integer, i.e., the lead-time is a multiple of the review period; see Bijvank and Johansen (2012) for a discussion of the limitations of this assumption.) The state of the system is comprised of the states of each of the nodes in the system. The state observed for node $i \in \{0, 1, \ldots, N-1\}$ at the start of period $t$ consists of the inventory level $I_i(t)$ and the pipeline vector $Q_i(t)$ of outstanding shipments due to arrive in periods $t, \ldots, t + L_i - 1$, i.e., $Q_i(t) = (q_i(t + L_i - 1), \ldots, q_i(t))$. The state of the network is then given by
$$s_t = \big(I_0(t), Q_0(t), I_1(t), Q_1(t), \ldots, I_{N-1}(t), Q_{N-1}(t)\big).$$
The order of events for period $t$ is as follows:
1. Determine action $a_t$ based on state $s_t$.
2. Shipments $q_i(t)$ arrive, and orders $a_i(t)$ are processed to yield shipments $q_i(t + L_i)$ that will arrive in period $t + L_i$.
3. External and internal demands are processed by the corresponding nodes, and costs are incurred.
Accordingly, processing an action for a node $i$ involves assigning it to the first position of the pipeline $Q_i(t)$, receiving the order in the last pipeline position, and shifting the remaining orders in the pipeline forward; see Fig. 3 for an illustration. For an OWMR system, we assign index 0 to the warehouse, while indices $1, \ldots, N-1$ correspond to the $N-1$ retailers. We next detail the dynamics for that system. Since retailers order from the warehouse, the inventory level of the warehouse is first updated as follows:
$$I_0(t+1) = I_0(t) + q_0(t) - \sum_{i=1}^{N-1} q_i(t + L_i). \quad (1)$$
(Recall that $q_i(t + L_i)$ denotes orders shipped to retailer $i$ in period $t$.) For the inventory update of the retailers, we must distinguish between the various models of customer behavior. With $d_i(t)$ the demand at retailer $i$ in period $t$, the complete backlogging case yields the following update:
$$I_i(t+1) = I_i(t) + q_i(t) - d_i(t).$$
In the case of lost sales, the retailers' inventory levels cannot be negative, thus:
$$I_i(t+1) = \max\big(I_i(t) + q_i(t) - d_i(t),\, 0\big).$$
Let $l_i(t)$ denote the amount of lost demand in period $t$, then:
$$l_i(t) = \max\big(d_i(t) - I_i(t) - q_i(t),\, 0\big).$$
A momentary reward is calculated as the negative of all the costs incurred during the time-step $t$ at all locations. Each retailer incurs holding costs and backlogging or penalty costs depending on the customer behavior model; the warehouse incurs only holding costs. Denote the holding costs for the warehouse in period $t$ by $c_0(t) = h_0 I_0(t)$, where $h_i$ is the holding cost for node $i$. Denote by $c_i(t)$ the costs for retailer $i$ for period $t$: for lost sales we find $c_i(t) = p_i l_i(t) + h_i I_i(t)$, with $p_i$ the lost sales penalty. For complete backlogging we find $c_i(t) = p_i \max(-I_i(t), 0) + h_i \max(I_i(t), 0)$, with $p_i$ the per-period backlogging costs. The total reward per period becomes:
$$r_t = -\sum_{i=0}^{N-1} c_i(t).$$
For partial back-ordering, unmet retailer demand may additionally be satisfied directly from warehouse inventory via emergency shipments (cf. Van Roy et al., 1997; Gijsbrechts et al., 2022). Also, (1) turns to:
$$I_0(t+1) = I_0(t) + q_0(t) - \sum_{i=1}^{N-1} q_i(t + L_i) - \sum_{i=1}^{N-1} e_i(t),$$
where $e_i(t)$ denotes the emergency shipments for retailer $i$ in period $t$. We next briefly discuss the processing of the orders $a_i(t)$ for $i \in \{1, \ldots, N-1\}$. When $\sum_{i=1}^{N-1} a_i(t) \le I_0(t) + q_0(t)$, sufficient inventory is available to ship all orders and accordingly $q_i(t + L_i) = a_i(t)$ for all retailers. Conversely, when $\sum_{i=1}^{N-1} a_i(t) > I_0(t) + q_0(t)$ the inventory at the warehouse is insufficient. In that case, an allocation policy that takes the orders as input decides upon the final shipment quantities $q_i(t + L_i)$. In particular, internal orders that cannot be satisfied in full are processed, and in doing so they are effectively reduced to an amount that can be shipped in full. The approach for doing so will be discussed in more detail in Section 3.5.
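The dynamics above can be summarized in a short simulation step. The sketch below is illustrative only: the array layouts and names (`inv`, `pipes`) are our own, and rationing is deferred to Section 3.5 by shipping greedily in index order.

```python
def owmr_step(inv, pipes, actions, demand, lost_sales=True):
    """One OWMR period: arrivals, order processing, then external demand.

    inv[i] is I_i(t); pipes[i] is the pipeline Q_i(t) as a list with the
    newest shipment at index 0 and the next arrival at the end.
    """
    N = len(inv)
    # shipments q_i(t) arrive at every node
    for i in range(N):
        inv[i] += pipes[i].pop()
    # the warehouse order a_0(t) always ships (ample outside supplier)
    pipes[0].insert(0, actions[0])
    # retailer orders ship from warehouse stock; greedily in index order
    # here (Section 3.5 replaces this with randomized sequential allocation)
    for i in range(1, N):
        shipped = min(actions[i], inv[0])
        inv[0] -= shipped
        pipes[i].insert(0, shipped)
    # external demand is processed at the retailers
    for i in range(1, N):
        inv[i] -= demand[i]
        if lost_sales:
            inv[i] = max(inv[i], 0)  # unmet demand is lost, not backlogged
    return inv, pipes
```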
To apply DRL, it is common to adopt pragmatic or analytic upper bounds for the orders placed by RL agents (cf. Boute et al., 2021). We adopt such an approach in the present paper. In particular, the initial upper bound for retailer actions corresponds to the 99.9th percentile of per-period demand, and the lower bound equals 0. For the warehouse, the upper bound is based on system demand instead. The procedure then trains the PPO agent and simulates a trajectory with the trained agent. If the agent chooses actions that lie on the boundary of the search space, the limits are adapted and another agent is trained, and so on.
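A minimal sketch of this bound-adaptation loop follows, with hypothetical `train` and `simulate` callables standing in for the actual PPO training and evaluation routines; the widening factor is our assumption.

```python
import numpy as np

def train_with_adaptive_bounds(train, simulate, demand_sampler, rounds=5):
    """Widen the action upper bound and retrain while the trained agent
    keeps choosing actions on the boundary of the search space."""
    upper = int(np.percentile(demand_sampler(100_000), 99.9))  # initial bound
    for _ in range(rounds):
        agent = train(action_upper_bound=upper)   # hypothetical PPO training
        actions = np.asarray(simulate(agent))     # one evaluation trajectory
        if actions.max() < upper:                 # boundary never reached
            break
        upper = int(np.ceil(1.5 * upper))         # adapt the limit, retrain
    return agent, upper
```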

Deep reinforcement learning for multi-echelon decision making
PPO (Schulman et al., 2017) is a policy gradient method that enhances the well-known REINFORCE algorithm (Williams, 1992). It does so by incorporating elements from Trust Region Policy Optimization (TRPO), importance sampling, and value methods. TRPO ensures monotonic policy improvement by leveraging second-order optimization methods (Schulman et al., 2015). PPO achieves similar results by clipping the optimization objective (see Schulman et al., 2017, for details). Unlike REINFORCE, however, PPO reuses training data multiple times. In combination with a value function used to improve training stability, PPO is characterized by stable training with good wall-time performance.
It has been reported that an expensive hyper-parameter tuning process is required for achieving good performance on inventory optimization problems (Gijsbrechts et al., 2022). PPO has empirically been shown to be less sensitive to hyper-parameter tuning. Additionally, PPO was chosen for the stochastic inventory optimization tasks in this paper due to its good convergence properties and good sample complexity. Moreover, an efficient implementation is available from the RLlib library (Liang et al., 2018), which is used in this work.
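For illustration, a minimal RLlib training loop might look as follows. This is a sketch only: the exact API differs across Ray versions (written here against the Ray 2.x configuration style), and the registered `CartPole-v1` environment is merely a stand-in for the custom OWMR environment used in this work.

```python
from ray.rllib.algorithms.ppo import PPOConfig

# "CartPole-v1" is a placeholder; the paper uses a custom OWMR environment
# with a MultiDiscrete action space (cf. Section 3.4).
config = (
    PPOConfig()
    .environment(env="CartPole-v1")
    .training(gamma=0.99, lr=3e-4)
)
algo = config.build()
for _ in range(10):
    result = algo.train()   # one training iteration per call
```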
Our MDP formulation of the OWMR system has a multi-dimensional action space in order to provide individual control over the replenishment decisions, which enables us to solve asymmetric problem instances, i.e., instances with non-identical retailers. A common approach for multi-action MDPs is to enumerate each possible combination of individual actions (see Gijsbrechts et al., 2022; Vanvuchelen et al., 2020).
However, the activation function of the last layer of the neural network is typically a softmax that forms a discrete probability distribution over all possible action combinations, and for multi-action MDPs the number of action combinations grows very quickly in the number of retailers. Beyond $\sim 10^3$ action combinations, DRL becomes increasingly unwieldy and training effective policies becomes prohibitively time-consuming; this limit may already be reached for $N = 3$, i.e., two retailers.
To overcome these limitations, the approach proposed in this paper differs in two important aspects from the standard approach. First, we adopt a multi-discrete action distribution, i.e., our neural networks output several probability distributions, one for each stock-point individually, and the corresponding order quantities are sampled independently. Second, we adopt a custom allocation policy that enables the DRL algorithm to efficiently learn well-performing multi-discrete policies.
We explain the multi-discrete action distribution and its advantages by example. Suppose the replenishment orders for the warehouse are restricted to lie in the $[0, 20)$ range. Similarly, suppose each retailer is restricted to order in the $[0, 10)$ range. Then the dimension of the output of the actor-network equals the sum of all possible actions for each location, i.e., $20 + 10(N-1)$ values. The softmax activation function is applied to the first 20 elements of the output vector to form a probability distribution over the warehouse's actions. Similarly, elements 21 to 30 are transformed into a probability distribution for retailer 1, etc. To sample an action using the network, we independently sample for the warehouse and each retailer based on the corresponding probabilities. The final output of the network becomes a vector $a_t = [a_0(t), a_1(t), \ldots, a_{N-1}(t)]$, where $N$ is the total number of locations. This allows us to specify an exact ordering decision for each of the stock-points, while the size of the output layer grows only linearly in the number of actions. (Note that the enumeration of action combinations would yield $2 \times 10^N$ outputs for this case.) To visualize the network, we instead consider a case with only 2 retailers, each of which can place an order $a \in \{0, 1\}$; we also omit the warehouse actions for clarity. Fig. 4 illustrates the resulting network.
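The following sketch mirrors this construction in plain NumPy: the flat output vector is split into per-location blocks, a softmax is applied per block, and the order quantities are sampled independently. The function name and the toy logits are ours, for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_multidiscrete(logits, sizes):
    """Split one flat output vector into per-location blocks, apply a
    softmax per block, and sample each order quantity independently."""
    actions, start = [], 0
    for size in sizes:
        block = logits[start:start + size]
        probs = np.exp(block - block.max())
        probs /= probs.sum()                        # softmax over this block
        actions.append(rng.choice(size, p=probs))   # independent draw
        start += size
    return np.array(actions)  # a_t = [a_0(t), ..., a_{N-1}(t)]

# warehouse orders in [0, 20), two retailers in [0, 10): 40 output nodes
a_t = sample_multidiscrete(rng.normal(size=40), [20, 10, 10])
```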

Allocation in case of shortages at the warehouse
We next discuss the need for an allocation policy, as well as the allocation policy adopted for DRL training. Since the multi-discrete action distribution that is part of our approach samples orders for each location independently, there is no a priori guarantee that the orders placed by the retailers can be met in full by warehouse inventory, i.e., they may violate the constraint
$$\sum_{i=1}^{N-1} a_i(t) \le I_0(t) + q_0(t). \quad (7)$$
When the actions outputted by the neural network violate (7), the shipments $q_i(t + L_i)$ must be lower than the orders $a_i(t)$. To this end, the order quantities are reduced to the amount that the warehouse can deliver, but since there are multiple retailers there are multiple ways to achieve this. This raises two questions: (1) how to adjust the actions to become feasible, and (2) how to communicate to the RL agent that these actions were not feasible. Question 1 can be resolved by adopting an allocation rule (cf. Section 3.3). A proportional allocation rule is attractive because it is intuitive and general-purpose. It can be formalized as follows:
$$q_i(t + L_i) = a_i(t)\,\frac{I_0(t) + q_0(t)}{\sum_{j=1}^{N-1} a_j(t)}. \quad (8)$$
This rule is widely used in combination with echelon base-stock policies as it facilitates analysis. However, as empirically validated, proportional allocation does not provide an incentive for the DRL agent to output feasible actions. When trained to output orders that will be processed with a proportional allocation rule, the agent tends to suggest high order quantities at the retailers (at the top of the available range), while limiting on-hand inventory at the warehouse. This results in undesired behavior: the agent does not learn to ration the resources, leading to poor performance. The problem persists even when a penalty for infeasible actions is added; this could be a consequence of the highly stochastic environment, which makes it challenging to learn which actions are considered infeasible. Additionally, penalties may have undesirable side-effects, and it is not easy to define appropriate penalties.
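Below is a sketch of rule (8) in Python. The paper does not specify how fractional shipments are rounded to integers; the largest-remainder step is our assumption.

```python
import numpy as np

def proportional_allocation(orders, available):
    """Rule (8): scale retailer orders by the available warehouse inventory
    I_0(t) + q_0(t), then round with a largest-remainder step so that the
    integer shipments still sum to the available amount."""
    orders = np.asarray(orders, dtype=float)
    if orders.sum() <= available:
        return orders.astype(int)          # (7) holds: ship everything in full
    scaled = orders * available / orders.sum()
    ships = np.floor(scaled).astype(int)
    leftover = int(available) - int(ships.sum())
    for i in np.argsort(scaled - ships)[::-1][:leftover]:
        ships[i] += 1                      # leftovers to the largest fractions
    return ships
```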
For training the RL agent, we will instead rely on a custom randomized sequential allocation rule. For this rule, the sequence of retailers is randomized at every time step, and the orders $a_i(t)$, $i \in \{1, \ldots, N-1\}$, provided by the actor-network are executed in full, one by one, until the available inventory at the warehouse is depleted. When the inventory at the warehouse is insufficient to satisfy an order in full, the order is truncated so that the warehouse inventory falls to 0.
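A minimal sketch of this randomized sequential rule, with hypothetical names:

```python
import random

def random_sequential_allocation(orders, available):
    """Execute orders in full, one by one, in a freshly randomized retailer
    sequence; the order that exhausts the warehouse is truncated, and any
    retailers after it in the sequence receive nothing."""
    ships = [0] * len(orders)
    sequence = list(range(len(orders)))
    random.shuffle(sequence)          # new random sequence every period
    for i in sequence:
        ships[i] = min(orders[i], available)
        available -= ships[i]
    return ships
```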
For infeasible actions, i.e., those violating (7), the last retailer(s) in the sequence will not see their orders shipped in full, which will increase their costs. We indeed find empirically that RL agents trained to output orders processed using this rule learn to output feasible actions, in contrast to agents trained using proportional allocation; see Fig. 5.
Apart from the intuitive appeal of learning to output exact order quantities rather than relying on allocation to render retailer orders feasible, one may expect that learning exact order quantities leads to tangible performance benefits. Indeed, if a DRL policy relies on allocation for outputting feasible order quantities, actions taken for one retailer can influence the orders dispatched to other retailers, creating complex interdependencies. Arguably, retailer-specific decisions decoupled from the decisions of other retailers are more effective: they reduce dependencies and allow for granular, adaptable strategies, fostering stability and performance in the overall network. Indeed, in Section 4 we will see that training with sequential allocation leads to substantial performance improvements, which are maintained even when switching to the proportional allocation rule when evaluating the trained agent.

Benchmark policies
Our aim in this section is to come up with reasonable general-purpose benchmarks, i.e., policies that are appropriate for all three consumer choice models: lost sales, complete back-ordering, and partial back-ordering. The optimal policy is unknown for all three cases. However, base-stock policies are optimal for single-node systems with complete back-ordering, and are intuitive and widely used in research and in practice (Axsäter, 2015). We adopt echelon base-stock policies, which we found to perform better than regular base-stock policies. Note that our benchmarks are stationary replenishment policies: our investigations aim to gain insight into the potential value of more complex state-dependent (DRL) policies.

Table 1
Parameters for instances 1-10; divided between complete back-ordering and lost sales.
We pair these policies with two allocation rules. The first rule is the proportional allocation rule formalized in (8). For the second rule, we consider the shortfall of retailers after allocation, i.e., the difference between their respective base-stock levels and inventory positions. When the warehouse has sufficient inventory, the policy ensures that these differences are zero. In case of shortage at the warehouse, we allocate inventories to minimize the maximum shortfall; we refer to this heuristic as the min shortage allocation.
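For concreteness, the following sketch realizes the min shortage allocation as a greedy unit-by-unit scheme, which is one way to minimize the maximum shortfall; the paper does not spell out the algorithm, so the implementation details are our assumption.

```python
import numpy as np

def min_shortage_allocation(base_stock, inv_position, available):
    """Allocate warehouse inventory unit by unit, always to the retailer
    whose shortfall S_i - IP_i is currently largest, thereby minimizing
    the maximum shortfall after allocation."""
    shortfall = np.asarray(base_stock) - np.asarray(inv_position)
    ships = np.zeros(len(shortfall), dtype=int)
    for _ in range(int(available)):
        i = int(np.argmax(shortfall))
        if shortfall[i] <= 0:          # every retailer is at its base-stock
            break
        ships[i] += 1
        shortfall[i] -= 1
    return ships
```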

Numerical experiments
In this section, we test the developed training algorithms on a wide range of cases. We discuss the experimental setup in Section 4.1, and the results in Section 4.2.

Experimental setup
To evaluate the performance of the PPO algorithm applied to OWMR systems, we consider 14 problem instances distributed over the 3 categories of customer behavior models. We also consider 6+6 instance variants to analyze sensitivities. Throughout all instances, demand processes at retailers are assumed to be i.i.d., and they will be either Poisson distributed ($\mathcal{P}(\lambda)$) or Gaussian ($\mathcal{N}(\mu, \sigma)$). The values sampled from the Gaussian distribution are rounded to the nearest integer, or set to zero if the value is negative. The instances are further differentiated based on customer behavior (CB): we consider lost sales (L), back-ordering (B), and partial back-ordering, where the first 10 instances fall in the first two categories. For these instances, the CB categories as well as lead-times and demand distributions are listed in Table 1. Except where explicitly noted, all instances share the same cost structure ($p_i = 9$ and $h_i = 1$ for each retailer $i$, and $h_0 = 0.5$ for the warehouse).
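The demand generation described above can be sketched as follows; the function and its interface are ours, for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_demand(dist, periods, **p):
    """Per-period retailer demand: Poisson(lam), or Gaussian(mu, sigma)
    rounded to the nearest integer and clipped at zero."""
    if dist == "poisson":
        return rng.poisson(p["lam"], size=periods)
    draws = rng.normal(p["mu"], p["sigma"], size=periods)
    return np.clip(np.rint(draws), 0, None).astype(int)
```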
Instances 11-14 use partial backlogging with $p_{\text{em}} = 0.8$ and emergency ordering costs of 0; for the other parameters see Table 2. For instances 11 and 12, the warehouse lead-time is $L_0 = 2$ and the retailer lead-times are $L_i = 1$, $i = 1, \ldots, N-1$, and the standard cost structure is adopted. For instances 13 and 14, we use $L_0 = L_i = 2$, and the cost structure $h_0 = h_i = 3$ and $p_i = 60$ for $i = 1, \ldots, N-1$. This brings instance 13 in line with Setting 1 in Gijsbrechts et al. (2022) and the corresponding setting in Van Roy et al. (1997), while instance 14 presents similar parameters but with highly asymmetric retailer demand distributions.
To assess the impact of the number of retailers, we conduct a sensitivity analysis on instances 1, 7 and 11. In particular, we create two variants for each of these instances where all parameters are kept fixed, except the number of retailers, which is changed (from 3) to 5 and 7, respectively. To similarly assess the sensitivity to penalty cost asymmetry, we create two additional variants for each of the instances 1, 7, and 11. The original instances have a penalty cost of 9 for each of the three retailers; for the first variant we let $p_1 = 4$, $p_2 = 9$, and $p_3 = 19$, and for the second variant we let $p_1 = 3$, $p_2 = 9$, and $p_3 = 39$.
Together, the 14 instances and 12 variants cover a range of problems, including symmetric and asymmetric cases. Instances 13 and 14 are by far the most challenging, since the size of the state vector equals $\sum_{i=0}^{N-1} (L_i + 1) = 33$ and the size of the action vector is 11.
As a benchmark, we adopt an echelon base-stock policy with the two allocation policies outlined in Section 3.6, with echelon base-stock levels set using a heuristic procedure. For all instances except instance 14, the base-stock vector is set via exhaustive search over the base-stock levels. In particular, we try all combinations of echelon base-stock levels for each of the nodes, such that each base-stock level lies within a reasonable range for that node, guided by heuristic lower and upper bounds. Since $p > h$, demand during the lead-time serves as a lower bound. Since $p/(p + h)$ is typically below 0.9 and always below 0.975, demand during the lead-time plus 3 standard deviations serves as an upper bound. For the echelon base-stock level of the warehouse, we adopt a similar approach, but with system demand and the longest cumulative lead-time from the outside supplier to a retailer. For instances where the retailer parameters are symmetric (i.e., the same demand distribution, lead-time, and cost parameters), we consider only those combinations for which the retailer base-stock levels are equal.
For each combination of base-stock levels obtained in this fashion, we employ simulation of 1000 trajectories of length 100 to identify the combination of base-stock levels that leads to the lowest cost. Each trajectory corresponds to a simulation run with a specific sequence of demands; we use common random numbers when assessing different combinations of base-stock levels for improved convergence. As a partial validation of our heuristic bounds, it was verified that none of the resulting base-stock vectors lie on the boundary of the search space.
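A sketch of this exhaustive search with common random numbers follows; `simulate_cost` is a hypothetical stand-in for the trajectory simulator.

```python
import itertools
import numpy as np

def grid_search_base_stock(ranges, simulate_cost, n_traj=1000, horizon=100):
    """Exhaustive search over echelon base-stock vectors; common random
    numbers are realized by reusing the same seeds for every candidate."""
    best, best_cost = None, float("inf")
    for levels in itertools.product(*ranges):
        costs = [simulate_cost(levels, np.random.default_rng(seed), horizon)
                 for seed in range(n_traj)]   # seed fixes the demand stream
        mean_cost = float(np.mean(costs))
        if mean_cost < best_cost:
            best, best_cost = levels, mean_cost
    return best, best_cost
```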
For instance 14, the large number of locations in combination with the lack of symmetry renders exhaustive search over all combinations of base-stock levels intractable. Instead, for that instance we search exhaustively over the base-stock level $S_0$ for the warehouse and a parameter $z$ that is used to resolve the retailer base-stock levels $S_i$ for $i = 1, 2, \ldots, N-1$. In particular, $z$ yields retailer base-stock levels as follows:
$$S_i = (L_i + 1)\,\mu_i + z\,\sigma_i\sqrt{L_i + 1},$$
where $\mu_i$ and $\sigma_i$ are the mean and standard deviation of the single-period demand distribution for retailer $i$, calculated after rounding and clipping. That is, we set base-stock levels to an approximate percentile of the lead-time demand distribution.

For each of the instances, we train a neural network policy/agent using the methods described in Section 3.4. It was found that PPO converges after between 50 and 100 million observations, which translates to 0.5 to one million episodes since we have 100 observations per episode. Convergence was found to be relatively stable; during early stages of the research it was observed that several training runs on the same scenario consistently converged to the same performance. Hyper-parameters and other implementation details are discussed in Appendix B. The resulting agents are evaluated on 1000 simulation trajectories of length 100, generated independently from the trajectories used for training. During evaluation, we use proportional allocation. We also compute the performance of benchmarks corresponding to optimized echelon base-stock policies under the two allocation policies, using the methods discussed above. The benchmarks are likewise evaluated on 1000 simulation trajectories of length 100, generated independently from the trajectories used for training.
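Under the safety-stock form reconstructed above (our reading of the garbled source equation), the parameterization can be computed as:

```python
import numpy as np

def retailer_base_stock(mu, sigma, lead_time, z):
    """S_i = (L_i + 1) * mu_i + z * sigma_i * sqrt(L_i + 1): an approximate
    percentile of the (L_i + 1)-period demand distribution."""
    return int(round((lead_time + 1) * mu + z * sigma * np.sqrt(lead_time + 1)))

# e.g. z = 1.5 with Poisson-like demand (mu = 5, sigma = sqrt(5)) and L_i = 2
print(retailer_base_stock(5, 5 ** 0.5, 2, 1.5))  # -> 21
```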

Results
We present the results on the relative performance of the various policies in this section, while Appendix A in addition contains the estimated absolute costs obtained for all instances and policies. The relative performance for instances 1-10 is presented in Fig. 6. The figure shows that for the complete back-ordering case, results for each of the benchmarks are on average in line with DRL performance, but the best benchmark tends to outperform DRL, except for instance 3. For the lost sales case, DRL tends to outperform both benchmarks by ∼1−3%. The results for instances 11-14 are presented in Fig. 7. The figure shows that the proposed DRL algorithm convincingly outperforms the benchmarks, also for the highly asymmetric cases 12 and 14. Note that on instance 13 our proposed approach attains an improvement of more than 20%, which is substantially higher than the improvements reported for that instance by Gijsbrechts et al. (2022) and Van Roy et al. (1997), who applied the A3C algorithm and neurodynamic programming, respectively. Based on the information available to us, we attempted to eliminate any differences in testing procedure or in the implementation of the base-stock policies, and are inclined to believe that our proposed approach may indeed perform better than the approaches of Gijsbrechts et al. (2022) and Van Roy et al. (1997) for this case. Possibly, our proposed algorithm benefits from the ability to freely set orders for each of the 10 retailers at every time step, whereas for both Gijsbrechts et al. (2022) and Van Roy et al. (1997) the agent is restricted to dynamically specifying the same target inventory position for all retailers, which may be inappropriate since excess sales are lost for the retailer.
The proposed algorithm thus convincingly outperforms the benchmark for partial back-ordering, but only marginally outperforms the benchmarks for lost sales and is outperformed by the benchmark in the complete back-ordering case. This may either imply that the benchmarks perform comparatively better in the complete back-ordering case than in the partial back-ordering case, or that our DRL approach somehow does much worse (compared to the optimum) in the complete backlogging case. We suspect that for all cases, our proposed approach learns a policy that approximates the optimal policy reasonably well. The benchmarks, on the other hand, perform very well in case of complete back-ordering, not so well for lost sales, and rather poorly for partial back-ordering. Indeed, echelon base-stock policies impose fixed inventory position targets, which are known to under-perform in lost sales settings even in the single-echelon case (see e.g. Zipkin, 2008). Similarly, partial backlogging features lost sales, and in addition the emergency shipments call for keeping some inventory at the central warehouse at all times, which echelon policies are poorly equipped to do.
In an attempt to find support for this hypothesis, we perform further experiments on variations of instance 13, where we vary the standard deviation of retailer demands while keeping other parameters fixed. The absolute performance of the best-performing echelon base-stock policy and our proposed DRL algorithm for these variants is shown in Fig. 8. We observe that the performance benefit of our proposed approach decreases as the demand variance decreases, and eventually even turns into an advantage for the benchmark. Since low demand variances result in low inventory imbalances and consequently in close-to-optimal performance of our benchmark, these results are consistent with the hypothesis outlined above.
Our sensitivity analysis continues with an investigation of the impact of varying the number of retailers, see Fig. 9. The figure shows that DRL performance tends to deteriorate as the number of retailers increases, but continues to convincingly beat the benchmark for the partial backlogging case. Furthermore, Fig. 10 shows the results for the variants with asymmetric penalty costs. For the full backlogging instance 1, penalty asymmetry does not qualitatively change the picture, but it does seem to deteriorate the performance of DRL relative to the benchmark. Interestingly, for the lost sales instance 7, as penalty cost asymmetry increases, the relative performance of DRL improves; it changes from performing on par with the benchmark to outperforming it by 1.4%. For instance 11, penalty cost asymmetry causes a slight deterioration of the relative performance of DRL, but DRL still convincingly beats the benchmark.
In an attempt to gain insights into the ability of DRL to beat the stationary base-stock benchmarks for the partial backlogging instance 11, we analyze the actions generated by the DRL and benchmark policies for a range of states, and plot the results in Fig. 11. The figure shows that the benchmark policy has a very regular structure, in line with its definition. Note that states visited by the DRL policy would typically have retailer inventory levels below 10 (top part of the figure) and warehouse inventory levels between 5 and 25. It seems that for retailer inventory levels of about 5, the model tends to place orders of 9, enough to maintain the inventory in the system since the average system demand per period is 9. When the inventory level becomes too small or too large, the policy switches to orders of 12-13 or 4-6, respectively. Thus the trained policy favors orders that match the expected system demand, rather than always keeping the echelon inventory level at some fixed number.
Finally, the overall structure of the policy is rather different from the base-stock benchmark; considering the performance gap, this might indicate that the base-stock policy is inappropriate for this problem and that a different logic altogether is needed. Our findings provide a weak indication that ideas adapted from constant-order policies, which have been shown to be asymptotically optimal in various single-location settings (Goldberg et al., 2016; Bu et al., 2020), could perhaps underlie new policy structures able to mimic the optimal policy.
In Section 3.5 we discuss the benefits of using a sequential allocation rule while learning good policies. Those benefits also translate to the performance of the trained policies, as illustrated by Fig. 12. The figure shows that training the neural network to set orders that are processed using a sequential allocation policy yields substantially better performance than training while processing orders with the proportional allocation policy.

Conclusion
In this work, we have presented an application of a new method for stochastic control in multi-action inventory management. Our approach solves instances with non-identical retailers by applying multi-discrete action distributions. As a result, the last layer of the neural network requires in the order of $N \times A$ outputs, where $A$ denotes the number of possible actions and $N$ the number of nodes. This contrasts with the $A^N$ outputs that would be needed in previously applied approaches, and enables scaling to problems with a large number of nodes. Additionally, we have demonstrated that to ensure effective reinforcement learning, it may help to pair the learning algorithm with a custom allocation policy, even though this allocation policy yields very poor allocations in general.
We compared our approach with the widely accepted echelon base-stock heuristics. PPO is shown to be a suitable candidate for finding control policies in multi-echelon inventory control, though it cannot outperform the benchmarks in all cases. We suspect that for some of the complete back-ordering cases where DRL failed to outperform the base-stock policy benchmark, the benchmark may be close to optimal while our proposed policy still performs reasonably well. Still, better DRL algorithms tailored to inventory decision making would be welcome.
For all instances, carefully handcrafted heuristics should be able to outperform our proposed approach. The main benefit of our approach may lie in it being general-purpose and easy to adapt to different models. As such, it may be a useful benchmark for researchers working on new heuristics, and for cases where appropriate hand-crafted heuristics are not available, such as the partial back-ordering case.
Our approach may be more generally applicable beyond OWMR, as underlined by our development of a fairly generic MDP formulation that could be extended to other supply chain networks. The multi-discrete action distribution in combination with tailored rationing may be a valuable building block to tackle larger supply chains. An interesting direction for future research is continuous review systems. Continuous review OWMR systems have received a lot of attention in research (e.g. Axsäter, 2000; Axsäter and Marklund, 2008; Marklund, 2011; Stenius et al., 2016, 2018; Malmberg and Marklund, 2023; Berling et al., 2023), and this literature stream could provide a good source for appropriate benchmarks. On the other hand, applying DRL in a continuous review setting comes with new challenges, such as deciding when exactly to query the neural network for potentially placing an order.

Fig. 1.
Fig. 1. Depiction of a single node/location inventory system with positive lead-time. The squares represent the pipeline inventory, i.e., the inventory that has been ordered but has not yet arrived at the location, which is represented by the circle. In each period, an order is placed in the leftmost square of the pipeline. When the system transitions to the next time-step, the goods located in each of these squares move to the square to their right until they reach the final location. Thus, after a lead time delay of $L$ periods, the goods are delivered at the store and can be used to satisfy the customer demand.

Fig. 2.
Fig. 2. Illustration of the nodes and pipeline inventories for the OWMR system.

Fig. 3.
Fig. 3. Example of the evolution of the state for a single-node system with lead time $L = 3$ and zero demand.

Fig. 4.
Fig. 4. Simplified visualization of the actor-network for a case with 2 retailers, each with 2 possible actions: placing an order of 0 and placing an order of 1. Dashed arrows represent inputs, while solid arrows symbolize weights (actions for the warehouse, biases, and the hidden layers' activation functions are omitted for clarity).

Fig. 5.
Fig. 5. Comparison between actions made by PPO trained with Proportional Allocation vs Sequential Allocation, for typical trajectories.

Fig. 6.
Fig. 6. (Figures best viewed in color.) Comparison of our multi-action PPO trained with randomized allocation and two heuristic policies using different allocation strategies. The echelon stock policy with proportional allocation is normalized to 0; other bars represent the percentage increase or decrease in cost with respect to that policy. Black error bars indicate a 95% confidence interval.

Fig. 7.
Fig. 7. As Fig. 6, for instances 11-14.

Fig. 8.
Fig. 8. Performance of the best-performing echelon base-stock policy and the proposed DRL algorithm for variants of instance 13: we vary the standard deviation of demand while keeping other parameters fixed. The figure reports absolute costs incurred over 100 periods.

Fig. 9.
Fig. 9. Results of the sensitivity analysis on the number of retailers, with variants of instances 1, 7, and 11. The original instances are in boldface and have 3 retailers; for the variants, 1−5 denotes the variant of instance 1 with 5 retailers, etc.

Fig. 10.
Fig. 10. Results of the sensitivity analysis on penalty cost asymmetry, with variants of instances 1, 7, and 11.

Fig. 11.
Fig. 11. Policy visualization for scenario 11. The warehouse level and the Retailer 1 level are varied (x and y axes, respectively). Inventory levels of the other retailers are equal to lead-time demand, while pipelines are empty. The values inside the heat-map indicate the action selected by the agent for the warehouse replenishment order.

Fig. 12.
Fig. 12. Relative performance of training with proportional allocation vs training with Random Sequential Allocation, as proposed in this paper. In both cases, policies are evaluated using proportional allocation.

Table A.3
Mean absolute cost across instances; the table corresponds to the results shown in Figs. 6 and 7.

Table A.4
Mean absolute cost across the variants constructed to assess the impact of the number of retailers; the table corresponds to Fig. 9.

Table A.5
Mean absolute cost across the variants constructed to assess the impact of penalty asymmetry; the table corresponds to Fig. 10.