Solving the Order Batching and Sequencing Problem using Deep Reinforcement Learning

In e-commerce markets, on-time delivery is of great importance to customer satisfaction. In this paper, we present a Deep Reinforcement Learning (DRL) approach for deciding how and when orders should be batched and picked in a warehouse to minimize the number of tardy orders. In particular, the technique facilitates making decisions on whether an order should be picked individually (pick-by-order) or picked in a batch with other orders (pick-by-batch), and, if so, with which other orders. We approach the problem by formulating it as a semi-Markov decision process and develop a vector-based state representation that includes the characteristics of the warehouse system. This allows us to create a deep reinforcement learning solution that learns a strategy by interacting with the environment, and to solve the problem with a proximal policy optimization algorithm. We evaluate the performance of the proposed DRL approach by comparing it with several batching and sequencing heuristics in different problem settings. The results show that the DRL approach is able to develop a strategy that produces consistent, good solutions and performs better than the proposed heuristics.


Introduction
Warehouse management systems play a pivotal role in the supply chain strategy. They mainly focus on storing and moving goods within a warehouse by performing several operations (including shipping, receiving, and picking). Order picking is the process of retrieving items from their locations in a warehouse [25,26,6]. This process is a significant operation in a warehouse and, according to [6], [2], and [4], order picking costs are estimated to account for 55% of the total warehouse operating cost.
DRL is able to capture a high-dimensional state space, which is beneficial to the studied OBSP with the combination of the PtG and GtP system. In e-commerce, and in our case, uncertainties come from order uncertainty and seasonal peaks. With DRL, the learning agent might learn during which period of the day picking-by-order, picking-by-batch, or the combination of the two is preferable, and what to do in case of small disruptions in demand peaks. A DRL agent is trained on sets of problem instances (or data), and once a good model is created, it can adapt to new situations. Therefore, the model learns to respond to different situations in a fast and reliable way. This property of the approach is highly desirable in the current, highly dynamic e-commerce setting. However, in contrast to heuristics or metaheuristics, DRL is much more complex to model and executed actions can be difficult to interpret. Furthermore, the complexity of the environment also has a negative effect on the computation time to train the model. In this paper, we focus on the modelling and application of DRL in the studied order batching and sequencing problem.
Our work in successfully creating a learning-based order batching and sequencing algorithm contributes both to the field of warehousing and to that of applying deep reinforcement learning to find strategies in this domain. We are the first to apply DRL to find strategies that solve the OBSP.
Besides approaching the OBSP with DRL, we are also the first to apply DRL in the warehousing domain itself. We incorporate the combination of two storage areas with two picking strategies: picking-by-order and picking-by-batch. Current applications mainly focus on solving the Order Batching Problem either for a PtG system or a GtP system, and only a few studies include picking-by-order.
We apply an Actor-Critic method called the Proximal Policy Optimization algorithm to obtain a strategy for our problem. We evaluate the DRL algorithm against several heuristics as a practically relevant benchmark, on problem instances derived from a real dataset.
In the remainder of this paper, a brief overview of OBSP- and DRL-related literature is presented in Section 2. Subsequently, the dynamics of the OBSP and the simulation model that represents the warehouse are described in Section 3. In Section 4, the DRL model and environment are discussed and the Proximal Policy Optimization algorithm is explained. In Section 5, the experimental evaluation of our work is described. Here, the performance of our DRL approach is compared to heuristics that are proposed by practitioners. In Section 6, we elaborate on our conclusions.

Related work
This section presents related work on Order Batching and Sequencing Problems (OBSP) in warehousing and Deep Reinforcement Learning (DRL) approaches to solving industrial optimization problems. We refer to [4] and [30] for overviews of warehousing in e-commerce and reinforcement learning, respectively.

Order Batching and Sequencing
The order batching problem is known to be NP-hard if the number of orders per batch is larger than two [11]. Due to its complexity, the existing algorithms for the OBSP are mainly heuristics: constructive solution approaches and metaheuristics. In this section, we discuss the existing heuristics for the OBSP where the objective is to minimize the tardiness.
The author of [8] was one of the first to assume distinct due dates for orders in an OBSP, where seed and savings algorithms are developed to batch and sequence the orders into tours such that the total travel time and the total tardiness of retrievals per group of orders are minimized.
In a subsequent work ([10]), the authors focus on minimizing the tardiness for an OBSP in a GtP system. They consider picking-by-batch and reduce the number of orders per batch towards one to analyze its effects. When the average number of orders per batch is close to one, the authors state that batching does not provide significant improvements when compared with picking-by-order.
In [17], the authors extend the problem of [10] with penalties for early retrieval and suggest a sequencing procedure. They show which of the four heuristics are superior for sequencing the storage and retrieval requests to improve the due-date-related performance. The authors of [9] also consider the OBSP for a GtP system, where they include the incoming as well as the outgoing goods (r-orders).
Furthermore, they introduce and evaluate several heuristic rules for both picking-by-batch and picking-by-order in their problem. In [31], two algorithms based on the Genetic Algorithm (GA) are developed to solve the OBSP in a PtG system. One algorithm finds the optimal batch picking plan by minimizing the sum of the travel cost and the earliness and tardiness penalties. The second algorithm searches for an optimal travel path in a batch by minimizing the travel distance.
In [14], the authors state that since the number of possible batches grows exponentially with the number of orders, the use of heuristics for OBSP seems unavoidable. They propose an iterated local search (ILS) heuristic and an attribute-based hill climber (ABHC) heuristic based on a tabu search.
The proposed approaches are compared with the Earliest Due Date (EDD) heuristic, and show a 46% improvement. Later, in [5], a Variable Neighborhood Search algorithm is proposed that exploits the idea of neighborhood change in a systematic way and outperforms the ILS proposed by [14]. The authors of [18] improve the total tardiness over that in [14] using a more aggressive improvement strategy. The proposed algorithm improves the current solution by selecting only those batches that contain at least one order with associated tardiness. For those batches, the procedure only focuses on orders without a tardiness value, since they could be retrieved later without affecting the objective function. Indeed, the move of these orders to a different batch could positively affect those that currently have an associated tardiness.
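The Earliest Due Date (EDD) rule used as a baseline in this line of work is simple enough to sketch directly. The order tuples below are illustrative placeholders, not the data model used in [14]:

```python
# Sketch of the Earliest Due Date (EDD) baseline: orders are sequenced by
# ascending due date, so the most urgent order is always picked first.
# The (id, due_date) tuples are illustrative, not the data model of [14].

def edd_sequence(orders):
    """Return order ids sequenced by earliest due date first."""
    return [oid for oid, due in sorted(orders, key=lambda o: o[1])]

orders = [("o1", 17.0), ("o2", 5.5), ("o3", 9.0)]
print(edd_sequence(orders))  # → ['o2', 'o3', 'o1']
```

The batching heuristics discussed above all improve on this simple sequencing rule by additionally deciding which orders to combine into a batch.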
In Table 1, we provide an overview of the scope and approaches of the existing work for solving the OBSP. To summarize, all the discussed related work considers either a Person-to-Goods storage system or a Goods-to-Person storage system. To the best of our knowledge, there are no algorithms available that solve the OBSP with two types of storage systems. Moreover, although almost every study takes into account the capacity constraints of batches and pickers, few studies consider workstation constraints as we do in this paper.

Deep Reinforcement Learning for solving optimization problems
Recently, there has been great progress in developing machine learning (ML) methods to solve NP-hard problems. A popular line of ML methods is Deep Reinforcement Learning (DRL), which is the integration of Reinforcement Learning (RL) and Deep Neural Networks (DNN).

Reinforcement Learning
In an RL approach, an agent interacts with an environment (see Figure 1). At each time step, the agent observes the current state S_t of the environment and selects an action A_t to perform. Following the action, the agent receives a reward R_t and the environment moves to a new state S_{t+1}. The state transitions are assumed to satisfy the Markov property, that is, the state transition probabilities depend only on the state S_t and the action A_t taken by the agent, independent of all previous states and actions. The agent has no prior knowledge of the environment in terms of state transitions or rewards; it acquires this knowledge by interacting with the environment. The learning goal of the agent is to maximize the expected cumulative reward over the relevant time horizon. For more details on RL, we refer to [30].

Figure 1: The agent-environment interaction in a Markov decision process, from [30].
The agent chooses a certain action based on a policy, which is a probability distribution over actions in states: π : S × A → [0, 1]. Since there are many possible states and actions in most real-world problems, a function approximator is often used to enable generalization from seen states to unseen states. Deep Neural Networks have been successfully used as function approximators for solving large-scale optimization tasks.
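The notion of a stochastic policy π(s, a) can be illustrated with a small sketch. Here a softmax over per-action scores stands in for the deep neural network that would produce those scores in a DRL approach; the scores themselves are illustrative:

```python
import math

# A stochastic policy pi(s, a) maps each state to a probability
# distribution over actions. A softmax over (hypothetical) action scores
# stands in here for the deep neural network used in DRL.

def softmax_policy(action_scores):
    """Turn raw action scores into action probabilities that sum to 1."""
    exps = [math.exp(s - max(action_scores)) for s in action_scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax_policy([2.0, 1.0, 0.1])
print(probs)  # three probabilities summing to 1, largest first
```

Subtracting the maximum score before exponentiating is the standard numerically stable form of the softmax.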
DRL applications The Traveling Salesman Problem (TSP) and the closely related Vehicle Routing Problem (VRP) have become popular problems that are approached using DRL in the AI community. The authors of [3] use policy gradient and a variant of the Asynchronous Advantage Actor-Critic (A3C) algorithm [19] to train a DNN. In [22], the authors study the VRP with the objective of minimizing the total route length while satisfying the demand of all customers. They employ the Asynchronous Advantage Actor-Critic (A3C) algorithm [19]. The authors of [16] investigate a TSP where the pointer network is incorporated with attention layers, and they train the model with the REINFORCE algorithm. The authors state that the operational constraints in the problem often lead to many variants of combinatorial optimization problems for which no good heuristics are available. This is the case for the OBSP. In [7], the authors propose a deep reinforcement learning algorithm trained via Policy Gradient to learn improvement heuristics based on 2-opt moves for the TSP.
In supply chain management, the authors of [23] examine the applicability of DRL to the beer game, which is a simplified model of a serial supply chain with four echelons and is played to demonstrate the bullwhip effect in supply chains. The Deep Q-learning (DQN) algorithm [21] is used in their approach. In [12], the authors solve a dual sourcing inventory problem wherein an inventory can be replenished from a fast but expensive source or from a regular, cheaper source with a longer lead time. The authors show how the A3C algorithm can be trained to produce policies that match the performance of several existing approaches and conclude that DRL provides solid inventory policies for environments for which no heuristics have been designed, and can inspire new policy insights.
In manufacturing systems, the authors of [32] simulate a DRL approach in a dynamic manufacturing environment to allocate waiting jobs to available machines. They apply the DQN algorithm and have an agent for each work center. In [28], a job scheduling problem in a production control environment is considered. The authors model their system as a Semi-Markov Decision Process (SMDP) and interact with a simulation model. By applying the Proximal Policy Optimization (PPO) algorithm [29], the authors show good performance of the DRL approach in the experimental application.
They also state that DRL is best suited to applications where no existing control methods are satisfactory, which is also the case for our OBSP.

Summary
We first investigate the existing algorithms for solving the OBSP. We show that although many heuristics have already been developed, they are problem-specific. Moreover, the problem we consider, a combination of PtG and GtP storage systems, has not been investigated in the literature. Hence, no existing algorithms can be directly applied to solve our OBSP. In the experiment section, we fine-tune some existing heuristics to compare the performance of our DRL approach. Furthermore, due to the problem's complexity and the strategies the agent is required to learn, reinforcement learning seems very well suited to this problem.
Deep learning makes Reinforcement Learning applicable to more complex environments such as our OBSP. More and more research has shown the effectiveness of applying DRL to complex decision and optimization problems. To the best of our knowledge, no DRL-based approach exists yet that tackles order batching and sequencing problems. Among the different DRL algorithms, value-based methods such as DQN are powerful in learning difficult strategies as they consider each state-action pair.
However, when the action and observation spaces become large, the required computational power for DQN can become a challenge. Policy search methods such as PPO require less computational power, since only the policy is updated and not individual state-action pairs. Inspired by the work of [28], we apply PPO to solve the OBSP.
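The key ingredient that distinguishes PPO from plain policy gradient is its clipped surrogate objective [29], which can be sketched per sample as follows; the ratio and advantage values below are illustrative:

```python
# Sketch of PPO's clipped surrogate objective [29]: the probability
# ratio between the new and old policy is clipped to [1 - eps, 1 + eps]
# so that a single update cannot move the policy too far.

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Per-sample clipped surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# A large ratio with a positive advantage is capped at (1 + eps) * A:
print(ppo_clip_objective(1.5, advantage=2.0))  # 2.4
```

In practice, this objective is maximized over mini-batches of trajectories collected from the simulation environment.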

The warehouse setting with two picking decisions
We consider a warehouse setting with a PtG and a GtP storage system to store SKUs, and consider three different types of workstations to consolidate the orders; see Figure 2 for an overview.
In the PtG storage area, a person (picker) has to walk through the warehouse to pick the goods.
This system is constrained by the number of pickers that can collect items and the number of items that can be included on the picking cart. Each picker can collect 50 items on a picking cart. In the GtP system, goods are automatically transported to people at picking stations in unit loads (totes or bins).
The GtP system is constrained by the number of shuttles. Shuttles are automated vehicles that bring totes with items to one of the lifts of the GtP system. Lifts are connected to a conveyor of one of the workstations. Shuttles have the advantage of traveling through every aisle at a relatively high speed.
Therefore, the vehicles offer a much higher retrieval capacity and are also significantly more flexible in capacity compared to a crane-based AS/RS that is mostly fixed within one aisle [1].
For both storage systems, orders can be picked-by-batch or picked-by-order and are released to the workstations once there is sufficient capacity available. Depending on the picking decision and storage system, a workstation is required to consolidate the order. Totes or bins that contain items or orders can be released to three different types of workstations: Direct-to-Order (DtO) workstations, Sort-to-Order (StO) workstations, and pack stations. The number of workstations of each type is constrained. At a DtO workstation, a picker removes items from a product tote and collects them into a carton box. At this workstation, only one order at a time is packed and sent for shipping; thus, this workstation is used for picking-by-order. If an order consists of multiple order lines (thus multiple different SKUs), multiple product totes are provided. The product totes arrive in sequence and, once each item is picked from a product tote, it is placed in the carton box.
An StO workstation can be used for picking-by-order and picking-by-batch and is divided into three processes: sorting, buffering, and packing. During sorting, a picker removes the items from a batch tote and sorts them into a put wall. A put wall is a simple rack with shelves separated into multiple locations with a maximum of approximately 50 order locations. The locations are temporarily assigned to a unique order. Once the put wall is filled with all the items for each order, the put wall is placed in the buffering area. Here, it waits until an operator is available to pack all the orders. At packing, the operator first requests a put wall and drives the remote put wall to the packing area. The operator then places all the items of each order into a carton box and sends the carton box via a takeaway conveyor to shipping.
The pack station is the simplest workstation of the three and is only used for the picking-by-batch decision. The batch totes arriving here do not require sorting, because each item in the batch constitutes one complete order. The orders packed at this station thus only contain one SKU.
With the two storage systems and the three types of workstations, five picking routes are available. Figure 3 shows a schematic overview of these five routes through the warehouse. In total there are ten ways to pick orders (see Table 2). Besides the picking decision and the storage location, the order type also influences which workstation is required to consolidate the order. As mentioned earlier, orders can consist of one or more order lines. The number of order lines classifies an order either as a Single Item Order (SIO) or a Multiple Item Order (MIO). An MIO contains multiple items of multiple SKUs; an SIO contains only one item and thus one SKU.
Whether an order is an MIO or an SIO has a strong influence on the subsequent processes.
To model our system we make a number of assumptions and simplifications. First, we assume that each SKU is either stored in the PtG or in the GtP system and not in both. Second, we do not consider the exact storage location of an SKU. Instead, we simulate the transportation time for each SKU. Third, we assume that transporting totes between processes is always possible; possible deadlocks or queues on these conveyors are not taken into account. Fourth, replenishment of SKUs and transportation of empty totes is left out of scope.
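The warehouse setting and its capacity constraints can be summarized in a small configuration object. Apart from the 50-item picking cart mentioned above, all numbers below are hypothetical placeholders, not the values used in our experiments:

```python
from dataclasses import dataclass

# Minimal configuration sketch of the warehouse setting described above.
# Only the 50-item picking-cart capacity comes from the text; the other
# numbers are hypothetical placeholders.

@dataclass(frozen=True)
class WarehouseConfig:
    num_pickers: int        # PtG: pickers walking the aisles
    cart_capacity: int      # items per picking cart (50 in the text)
    num_shuttles: int       # GtP: automated shuttles
    num_dto_stations: int   # Direct-to-Order workstations (pick-by-order)
    num_sto_stations: int   # Sort-to-Order workstations (order or batch)
    num_pack_stations: int  # pack stations (pick-by-batch only)

cfg = WarehouseConfig(num_pickers=10, cart_capacity=50, num_shuttles=25,
                      num_dto_stations=4, num_sto_stations=4,
                      num_pack_stations=2)
print(cfg.cart_capacity)  # 50
```

Freezing the dataclass reflects that these capacities are fixed constraints of an episode, not decision variables.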

Problem formulation
The OBSP studied in this paper can be formulated as follows. Given a warehouse setting with pickers, shuttles, pack, StO, and DtO workstations, and order arrival events, the proposed algorithm needs to sequentially (1) assign orders to pickers and shuttles, and (2) decide whether each order is picked individually or batched with other orders, with the objective of minimizing the number of tardy orders.

Deep reinforcement learning approach
We solve the OBSP using deep reinforcement learning, by modeling it as a Semi-Markov Decision Process (SMDP). In the SMDP, the state of the system is primarily composed of the orders of each type that must be picked and the current capacity that is available for picking those orders.
The actions of the system are the decisions to pick a particular order in a particular way. When an action is taken, this results in a new state and a reward for taking the action. The reward is primarily determined by the number of orders that are picked on time. The DRL agent learns from those rewards what the optimal picking action is in a particular state of the warehouse.
The new state and reward are computed using a simulation model of the warehouse. In that sense the simulation model forms the environment of the DRL agent. The simulation model computes the state and reward that result from a particular choice of action of the DRL agent, by simulating the effect of that action on the state of the warehouse, the duration of the picking action and, consequently, the orders that are not picked on time.
In the remainder of this section, we present the SMDP that models our OBSP (see Section 4.1).
We then present the simulation model that simulates the warehouse (see Section 4.2) and finally the algorithm (see Section 4.3) that the DRL agent uses to learn which actions are optimal in a particular state.

SMDP formulation of the order batching and sequencing problem
To create a model in which the DRL agent can train and learn a strategy, we define an SMDP which consists of four components: (1) the time to transition τ; (2) the set of states S; (3) the set of actions A; and (4) a reward function R. Another main component of an SMDP is P, the transition probabilities, which represent the uncertainty about what the next state will be. In our system, this uncertainty is the result of not knowing which capacities become free and when orders arrive into the system. In DRL, these transition probabilities do not need to be specified explicitly; the agent learns about them through interaction.

Time to transition τ
Since our environment heavily depends on time, an SMDP is used instead of a Markov Decision Process (MDP). In an MDP, the time to transition τ is fixed and, in each state, the agent can choose a picking action or to "do nothing", after which it waits for a time τ to take the next decision. In that case, we run the risk of setting τ too low, thus increasing the number of "do nothing" actions, because it does not allow enough time for the environment to change state. We also run the risk of setting τ too high, possibly leading to cases in which the environment changes state before the agent is ready to take the next decision. As a result, the environment remains idle until a new action is performed. This is undesirable because it may negatively affect the throughput time of orders, which are waiting idly for a decision to be taken by the agent. This, in turn, negatively affects the number of orders that are delivered on time and, consequently, the reward function.
In comparison, in an SMDP, the time to transition τ is not fixed. In our system, it is determined by the simulator: the time to transition is the time until the next order arrives or the next order is picked. As a consequence, the DRL agent is always able to perform an action and there are no unnecessary idle times. This approach is similar to the approach taken by [28], who also make use of a simulation model that contains capacity that can become idle.
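The event-driven time advance described above can be sketched with a priority queue of pending events: instead of a fixed step τ, the simulation jumps to the next event (an order arrival or a pick completion) and only then asks the agent for a decision. The event times below are illustrative:

```python
import heapq

# Sketch of the SMDP time advance: the simulation jumps to the next
# event rather than stepping by a fixed tau, so the agent is consulted
# exactly when the state changes. Event times are illustrative.

events = []  # priority queue of (time, event_name)
heapq.heappush(events, (4.0, "pick_completed"))
heapq.heappush(events, (1.5, "order_arrival"))
heapq.heappush(events, (2.5, "order_arrival"))

now = 0.0
decision_times = []
while events:
    now, event = heapq.heappop(events)  # tau = time until this event
    decision_times.append(now)          # the agent decides at each event

print(decision_times)  # [1.5, 2.5, 4.0]
```

In the full simulation model, handling an event would also schedule future events (e.g., the completion time of a newly started pick).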

State Space S
To model a state in such a way that it captures all the relevant information of our warehouse setting, we define three main components: (1) Current remaining orders for picking; (2) Capacity availability in the warehouse system; and (3) Extra information beneficial for learning, i.e., number of tardy orders, number of processed orders and current simulation time. We now describe these components in detail.
Current remaining orders for picking keeps track of the number of orders O_{c_i l_j e_k} for each order type (indexed by c_i, l_j, and e_k).

Capacity availability in the warehouse system is included as the second part of the state representation. All capacities are represented by a pipeline variable that includes the currently available capacity plus a (virtual) queue variable. This queue variable is included to make sure that the resource does not become idle too quickly between states. If no queue were included and capacity became available, the agent could choose to process an order that requires the capacity that has just become available. However, if the agent chooses a different order that requires different capacity, the capacity that has just become available stays unused and the resource thus stays idle. With a queue variable, more orders are put into the system and end up in a queue.
When orders are waiting in the queue and the resource capacity becomes available, the resource can directly start processing these orders from the queue without having to wait for the agent.
The pipeline variable p = r + R represents the capacity availability of the PtG area in our state, where r is the current number of available pickers and R is a constant representing the queue length of the PtG area. The length of the queue R is equal to the maximum number of pickers available, to ensure that when pickers become available, sufficient orders are available for processing.
The pipeline variable g = h + H represents the capacity availability of the GtP area in our state, where h denotes the current number of available shuttles and H is a constant representing the queue length of the GtP area. Picked batches in the PtG area are placed in the GtP queue. This increases the average occupation of the GtP queue. However, to ensure that there are always enough orders/batches in the queue, we also include the constant H. This is especially advantageous at the beginning of an episode, since processing batches can take some time before they become available to the shuttles.
The pipeline variables d, v, and b represent the capacity availabilities of the DtO workstation, StO workstation, and pack station, respectively. At the workstations, capacities are expressed as the number of totes that can be placed in the queue plus a virtual queue. Orders that require capacities at the workstations have to be picked first. Therefore, the queue length is only increased once orders arrive in the queue. The maximum queue length of each workstation is based on the tote processing time at the workstation and the time it takes to place a tote in the queue. However, if only the physical queue size were included in the state, without a virtual queue, slots of the queue would be reserved for specific totes, and once all slots are reserved, no more totes can be placed. Due to the picking time, the workstation would then be waiting for the totes to arrive. By including a virtual queue, more orders can be released into the system, such that the throughput is increased.
Therefore, the second part of the state representation regarding the capacities is represented by the following variables: (p, g, d, v, b).
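The pipeline variables can be sketched as follows: each resource's capacity in the state is its currently available units plus a constant virtual queue, so the resource can keep working between agent decisions. The queue constants and availability values below are illustrative:

```python
# Sketch of the pipeline variables described above: the state component
# for each resource is the currently available capacity plus a constant
# virtual queue. The queue constants and availabilities are illustrative.

def pipeline(available, queue_length):
    """Pipeline value, e.g. p = r + R (pickers) or g = h + H (shuttles)."""
    return available + queue_length

R, H = 10, 25                              # hypothetical queue constants
p = pipeline(available=3, queue_length=R)  # PtG pickers
g = pipeline(available=7, queue_length=H)  # GtP shuttles
print((p, g))  # (13, 32)
```

The workstation variables d, v, and b are formed the same way, with queue constants derived from tote processing and placement times.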
As extra information beneficial for learning, we include the variables t, n, and u in the last part of the state representation. Variable t represents the number of tardy orders: an order becomes tardy when it has not been shipped before its cut-off time. Variable n represents the number of processed orders and u the current simulation time; both increase as we get closer to the end of the episode. Variables t, n, and u are updated whenever the state changes due to capacity changes or order arrivals, ensuring that the state only transitions to the next state based on changes in capacities or order arrivals.
Note that many other components could be included in the state. Components such as the number of items of each order, the processing times of the next orders/batches, or the number of orders waiting in the queues could also be included in the state representation. However, there is a challenge in representing these variables. Including information such as processing times or the number of items for each individual order would increase the size of the state vector. As a result, the number of possible actions is also increased and, consequently, more computation time is required. Which capacities an action requires depends on the picking decision and the type of order.
For example, when choosing to process one order of the first index in the vector, O_{c_1 l_1 e_1}, only one picker is required. In the next state, the variable O_{c_1 l_1 e_1} will represent 179 orders instead of 180, and p will represent nine available pickers instead of ten. When performing a batch action for O_{c_1 l_1 e_1}, one picker and one tote at the pack station are required: O_{c_1 l_1 e_1} is reduced from 179 orders to 169 orders in the next state, and p and b are both reduced by 1.
After performing these two actions, the agent ends up in the resulting state; note that the current simulation time u has also increased. When assigning orders to pickers or shuttles, the state changes. Assigning these orders takes only a very small fraction of time, after which the agent is directly provided with a new state.
Since the number of states is large, applying state reduction methods to improve the learning speed is beneficial. In [27], the authors use a tabular reinforcement learning algorithm to find the best aggregation strategy for reducing the state space. In our state, p is de-normalized to 25, such that p = 25 is represented in the state. The same is applied to the shuttle capacity variable g, as both the picker and shuttle capacity variables can take relatively small values while orders are being processed. Processing orders is one of the main goals of the algorithm; as a result, p and g are desired to be fully utilized and therefore only show small values when capacity becomes available. By de-normalizing p and g, their importance is emphasized.
We also de-normalize the variables t, u, and n, as these variables increase over time and can become greater than M. Note the difference with the previous example, in which the vector is capped to a maximum constant M of 25: here the picker and shuttle variables are maximized, such that recognizing these capacity availabilities is encouraged.
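The capping and de-normalization of state components can be sketched as follows. Capping at the constant M = 25 follows the text; the rescaling form of the de-normalization is one possible reading, an assumption rather than the authors' exact procedure:

```python
# Sketch of the state-vector scaling described above. Capping at M = 25
# follows the text; the rescaling interpretation of de-normalization is
# an assumption, not the authors' exact procedure.

M = 25

def cap(value, m=M):
    """Cap a raw state component at the maximum constant M."""
    return min(value, m)

def denormalize(value, max_value, m=M):
    """Rescale a value from [0, max_value] onto [0, m] so that small
    capacity values remain visible in the state vector."""
    return m * value / max_value

print(cap(180))            # 180 orders capped to 25
print(denormalize(2, 10))  # 2 of 10 pickers rescaled to 5.0
```

Either way, the effect is that all state components live on a comparable scale, which tends to stabilize neural-network training.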

Action Space A
We formulate an action space that directly maps to the orders available in the state. In total, there are 15 different types of orders O_{c_i l_j e_k}, which can all be picked-by-order and picked-by-batch. Therefore, for each type O_{c_i l_j e_k} there are two possible actions, resulting in 30 different actions. We also include a wait action that allows the environment to wait until capacity becomes available. This action must be added because no pick actions can be taken when no capacity is available for picking.
Consequently, in total there are 31 actions available in our action space.
When a picking-by-batch action is performed, a batch of ten orders is created. The sequence of the orders and batches for the picking-by-batch and picking-by-order actions is based on tardiness: orders that have the lowest tardiness are picked first.
Depending on the available orders O_{c_i l_j e_k} in the state and on the capacities (p, g, d, v, b), the agent can choose to process a certain order. Choosing an order for which no capacity is available, or selecting an action for which no order is available, is considered an infeasible action. The wait action is only feasible when no capacities are available; in other words, if the wait action is chosen while there are capacities available to process at least one of the available orders, the wait action is also considered infeasible. When a feasible action is chosen, the order with the earliest cut-off time is processed. When an infeasible action is performed, the state does not change, no orders are processed, and the agent can immediately take another decision. In addition, an infeasible action leads to a penalty, which is discussed in more detail in the next section.
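The feasibility rules above can be sketched as an action mask: a pick action is feasible only if orders of that type remain and the required capacity is free, and the wait action is feasible only when no pick action is. The per-action capacity requirements below are illustrative:

```python
# Sketch of the feasibility check described above. The capacity
# requirements per action type are illustrative, not the paper's exact
# mapping of the 31 actions.

def feasible_actions(order_counts, capacity, requirements):
    """Return feasible pick actions; only when none exist is wait allowed."""
    picks = [a for a, need in requirements.items()
             if order_counts.get(a, 0) > 0
             and all(capacity[res] >= n for res, n in need.items())]
    return picks if picks else ["wait"]

capacity = {"p": 1, "g": 0, "b": 0}
requirements = {"pick_by_order": {"p": 1},
                "pick_by_batch": {"p": 1, "b": 1}}
orders = {"pick_by_order": 5, "pick_by_batch": 5}
print(feasible_actions(orders, capacity, requirements))  # ['pick_by_order']
```

In our approach, the agent is not given this mask: infeasible choices are penalized instead, so the agent must learn feasibility from the reward signal.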

Reward function R
We construct a reward function r(s, a, s') such that it provides a relatively high reward at the end of the episode and small penalties during the episode. We include penalties for orders that become tardy and for infeasible actions in the reward function as follows.
r(s, a, s') =
    -0.005, if a is an infeasible action;
    -0.0075, for each order that becomes tardy;
    an end-of-episode reward that increases exponentially with the ratio 1 - w/N of orders shipped on time.

In the constructed reward function, w = t + m, where t denotes the number of tardy orders and m the number of orders that have not been processed at the end of an episode. m is greater than zero when an episode terminates too early, which can happen when too many orders have become tardy; m then indicates the number of orders that remained in the state and have not been processed. When an episode terminates, we determine the ratio of processed orders by dividing w by the total number of orders N and subtracting the result from 1. The end reward therefore reflects the ratio of orders that have been shipped before their cut-off times, and the reward increases exponentially as this ratio increases. As a result, the importance of fewer tardy orders to the agent is also increased.
The penalty for infeasible actions (i.e., 0.005) is relatively small compared to the end reward, so that the end reward dominates and influences the learning process the most. However, it is large enough that the agent becomes less likely to choose the same action again when the same state is encountered. Making the penalty for infeasible actions too large would result in a strategy in which the agent performs as few actions as possible to complete the episode, so as to minimize the total penalty from infeasible actions; in that case, the penalty for infeasible actions would influence learning more than the penalty for tardy orders (i.e., 0.0075).
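A minimal sketch of this reward structure follows. The exponential base used for the end reward is a made-up illustration: the paper states only that the end reward grows exponentially in the processed-order ratio, not its exact form.

```python
TARDY_PENALTY = 0.0075       # per order that becomes tardy
INFEASIBLE_PENALTY = 0.005   # per infeasible action

def step_reward(n_new_tardy: int, infeasible: bool) -> float:
    """Small penalties accumulated during the episode."""
    return -TARDY_PENALTY * n_new_tardy - (INFEASIBLE_PENALTY if infeasible else 0.0)

def end_reward(w: int, n_total: int, base: float = 10.0) -> float:
    """End-of-episode reward, growing exponentially in the ratio 1 - w/N.

    w = t + m (tardy plus unprocessed orders); `base` is an assumed value,
    chosen here only to illustrate the exponential shape.
    """
    ratio = 1.0 - w / n_total
    return base ** ratio

assert abs(step_reward(2, True) + 0.02) < 1e-12   # two tardy orders + one infeasible action
assert end_reward(0, 100) == 10.0                 # all orders shipped on time
assert end_reward(100, 100) == 1.0                # nothing processed on time
```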

Interaction with the environment: the simulation model
The DRL agent learns which action is best in a particular state by trying actions and observing how the environment responds (see Figure 1). In our approach, the environment is a simulation model that simulates how a picking action, with random arrivals of orders, leads to a new state of the warehouse and a reward for the agent. The state changes when capacities change or when an order arrives in the system. Feedback is provided to the agent when orders have become tardy; this is determined once the order has been consolidated at one of the workstations and is ready for shipping. While the agent waits for capacities to become available or orders to arrive, time τ increases and tardy orders may leave the system. When capacities are available and the agent selects an order to process, the state changes directly as capacities are diminished, so the transition time τ between these two states is almost zero. When an order is received by the simulation model, the model assigns a picker or a shuttle to the order; either p or g then changes, feedback is directly provided to the DRL agent, and the simulation model is frozen until it receives another action. In short, τ is very small between states in which capacities are available, and larger when the wait action is chosen; in the latter case, the duration of τ depends on arriving orders or on capacity becoming available.
When the state changes, the new state is passed on to the DRL agent, which checks whether the state represents the end of an episode. The policy is represented by a neural network with parameters θ; these parameters are called policy parameters, and the policy can then be denoted by π_θ(a | s).
We use the proximal policy optimization (PPO) algorithm [29], a policy-based algorithm that learns by performing stochastic gradient ascent on an objective with respect to the policy parameters θ.
Policy gradient methods seek the optimal policy parameters online: the algorithm does not use a replay buffer to store past experiences, but learns directly from the experiences encountered by the agent. Once a batch of experiences has been used for a gradient update, the experiences are discarded and the updated policy moves on to perform new actions. Policy gradient methods work by computing an estimator of the policy gradient and plugging it into a stochastic gradient ascent algorithm. The most commonly used gradient estimator has the form

ĝ = Ê_t [∇_θ log π_θ(a_t | s_t) Â_t],

where π_θ(a_t | s_t) is a stochastic policy and Â_t is an estimator of the advantage function at time step t, which we explain below. Ê_t indicates the empirical average over a finite batch of samples, used in an algorithm that alternates between sampling and optimization. When implementing this in a neural network whose objective is the policy gradient, the estimator ĝ is obtained by differentiating the loss

L^PG(θ) = Ê_t [log π_θ(a_t | s_t) Â_t],

where π_θ is our policy: the neural network takes the observed state as input and suggests the action to take as output, expressed as log probabilities. The second factor is the advantage function Â_t, which estimates the relative value of the selected action a_t in the current state s_t. Â_t can be computed by subtracting the baseline estimate V_t from the cumulative discounted reward G_t. G_t is the weighted sum of all rewards the agent receives from time step t onwards in the current episode, computed as follows:

G_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ... = Σ_{k=1}^{T−t} γ^{k−1} r_{t+k},

where the discount factor γ ∈ [0, 1] is usually a value between 0.9 and 0.99, representing the value of future rewards: a reward received k time steps in the future is worth only γ^{k−1} times what it would be worth if it were received immediately. The advantage Â_t is calculated after the episode sequence has been collected from the environment, i.e., after all rewards are known.
The second part of the advantage function is the baseline estimate, or value function, V_t. V_t estimates the discounted sum of rewards given s_t; in other words, the value function attempts to predict what the final reward of the episode will be when the agent is in state s_t. Note that this value estimate is an output of the neural network. The advantage function thus measures how much better the action a_t taken by the agent was, compared to the expectation for s_t. Finally, by multiplying Â_t with the log probabilities of the policy actions, we obtain the optimization objective used in basic policy gradient methods.
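The computation of G_t and Â_t = G_t − V_t described above can be sketched as follows; the function and variable names are illustrative:

```python
def discounted_returns(rewards, gamma=0.9999):
    """Backwards pass computing G_t for every step of one collected episode;
    rewards[t] is the reward that follows time step t."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]

def advantages(rewards, values, gamma=0.9999):
    """A_t = G_t - V_t, with V_t the baseline estimate from the value head."""
    return [g - v for g, v in zip(discounted_returns(rewards, gamma), values)]

# Toy episode: one infeasible-action penalty followed by the end reward.
rewards = [0.0, -0.005, 1.0]
values = [0.9, 0.9, 0.9]                       # baseline estimates V_t
adv = advantages(rewards, values, gamma=1.0)   # gamma = 1 for easy checking
expected = [0.095, 0.095, 0.1]                 # G_t - V_t at each step
assert all(abs(a - e) < 1e-9 for a, e in zip(adv, expected))
```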
PPO builds on Trust Region Policy Optimization (TRPO) to improve training stability. TRPO avoids parameter updates that change the old policy too much in one step: the objective function is maximized subject to a constraint on the size of the policy update. Specifically,

maximize_θ Ê_t [(π_θ(a_t | s_t) / π_θ_old(a_t | s_t)) Â_t]  subject to  Ê_t [KL[π_θ_old(· | s_t), π_θ(· | s_t)]] ≤ δ,

where θ_old is the vector of policy parameters before the update. The KL-divergence constraint ensures that the updated policy does not move too far away from the current policy. TRPO can guarantee a monotonic improvement over policy iterations; we refer to [29] for more details.
In addition, PPO uses a clipping operation to improve learning. The new objective function with clipping is defined as

L^CLIP(θ) = Ê_t [min(r_t(θ) Â_t, clip(r_t(θ), 1 − ε, 1 + ε) Â_t)],

where r_t(θ) = π_θ(a_t | s_t) / π_θ_old(a_t | s_t) is the ratio between the new policy and the old policy, measuring how different the two policies are. In this way, the objective function clips the estimated advantage if the new policy moves far away from the old one. The algorithm computes the expectation over batches of samples, and the expectation operator is taken over the minimum of two terms. The first term is r_t(θ) multiplied by the advantage estimate Â_t; this is the default objective of normal policy gradients, which pushes the policy towards actions that yield a high positive advantage over the baseline. The second term is similar, but applies a clipping operation that keeps the ratio between 1 − ε and 1 + ε, where ε is typically set between 0 and 0.2. Finally, the min operator is applied to the two terms to obtain the final expectation.
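A minimal sketch of the clipped surrogate objective, computing the ratio from log-probabilities as is common in practice (all names are illustrative):

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """L^CLIP = mean over t of min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t),
    with the ratio r_t = exp(log pi_new - log pi_old)."""
    total = 0.0
    for ln, lo, a in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        total += min(ratio * a, clipped * a)
    return total / len(advantages)

adv = [1.0, -0.5]
logp_old = [math.log(0.4), math.log(0.6)]
# Identical policies: ratio is 1 everywhere, so the objective is just mean(A_t).
assert abs(ppo_clip_objective(logp_old, logp_old, adv) - 0.25) < 1e-9
# A ratio of 0.9/0.4 = 2.25 on a positive advantage is clipped at 1 + eps = 1.2.
logp_new = [math.log(0.9), math.log(0.6)]
assert abs(ppo_clip_objective(logp_new, logp_old, adv) - 0.35) < 1e-9
```

The clipping removes the incentive to move the ratio outside [1 − ε, 1 + ε], which is what keeps each update close to the old policy.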
We use a deep neural network architecture with shared parameters for the policy and value functions. In this case, a combined loss function is applied:

L_t^{CLIP+VF+S}(θ) = Ê_t [L_t^CLIP(θ) − c_1 L_t^VF(θ) + c_2 S[π_θ](s_t)].

This final loss function is used to train the agent. In addition to the clipped PPO objective L_t^CLIP(θ), it contains the value-function loss L_t^VF(θ), a squared-error loss (V_θ(s_t) − V_t^targ)^2 that is in charge of updating the baseline network, i.e., estimating the discounted reward over the remainder of the run while being in state s_t. The second additional term, S, denotes an entropy bonus that encourages exploration. c_1 and c_2 are hyperparameter constants that weigh the contributions of these terms.
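The combined loss can be sketched as follows; the default values of c_1 and c_2 used here are illustrative assumptions, not the paper's settings:

```python
def combined_loss(clip_obj, values, value_targets, entropy, c1=0.5, c2=0.01):
    """L = L^CLIP - c1 * L^VF + c2 * S (maximized during training).

    L^VF is the squared-error baseline loss mean((V(s_t) - V_t^targ)^2) and
    S is the entropy bonus; c1 = 0.5 and c2 = 0.01 are illustrative defaults.
    """
    n = len(values)
    value_loss = sum((v - vt) ** 2 for v, vt in zip(values, value_targets)) / n
    return clip_obj - c1 * value_loss + c2 * entropy

# Perfect value predictions: only the clip objective and the entropy bonus remain.
assert abs(combined_loss(0.25, [1.0, 2.0], [1.0, 2.0], entropy=1.0) - 0.26) < 1e-12
```

Because the parameters are shared, a single gradient step on this combined quantity updates the policy head and the value head simultaneously.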
To train the PPO algorithm to learn which action to take in a given state, we construct a training procedure, see Algorithm 1. The algorithm starts by initializing the environment, creating a connection with the simulation model. The first state is then received, containing orders and initial capacities. Given this state, the agent estimates the advantage Â_t. At each step, the agent performs an action provided by the policy. Feasible actions are sent to the simulation model and simulated in the environment; while the chosen action is infeasible, the agent keeps predicting until a feasible action is chosen. Subsequently, the action is simulated in the environment and the resulting reward r_t is received by the agent. After the state has changed, the simulation model provides the next state s_{t+1}. Based on s_t, s_{t+1}, and r_t, the policy is updated by maximizing the PPO objective via stochastic gradient ascent with Adam [15].
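The interaction loop of Algorithm 1 can be sketched with a toy stand-in environment; the class, method names, and toy dynamics below are assumptions for illustration, not the paper's simulation model:

```python
import random

class ToyWarehouseEnv:
    """Toy stand-in for the simulation model: reset/is_feasible/step mirror
    the interface assumed in the text (hypothetical names)."""
    def __init__(self, episode_len=5):
        self.episode_len = episode_len

    def reset(self):
        self.t = 0
        return {"orders": [1] * 15, "capacities": (1, 1, 1, 1, 1)}

    def is_feasible(self, action):
        return action != 30          # toy rule: everything but 'wait' is feasible

    def step(self, action):
        self.t += 1
        done = self.t >= self.episode_len
        reward = 1.0 if done else 0.0   # end reward on the final step
        return {"orders": [1] * 15, "capacities": (1, 1, 1, 1, 1)}, reward, done

def run_episode(env, policy, rng):
    """Collect one episode; infeasible actions are penalized and re-sampled,
    leaving the state unchanged, as described in the text."""
    state, trajectory, done = env.reset(), [], False
    while not done:
        action = policy(state, rng)
        if not env.is_feasible(action):
            trajectory.append((state, action, -0.005))   # penalty, state unchanged
            continue
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward))
        state = next_state
    return trajectory

rng = random.Random(0)
traj = run_episode(ToyWarehouseEnv(), lambda s, r: r.randrange(31), rng)
assert traj[-1][2] == 1.0   # the episode ends with the end reward
```

In the full algorithm, the collected trajectory would then be used to compute advantages and perform the PPO update before the next episode is gathered.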

Experimental evaluation
This section describes the experimental evaluation of the proposed DRL approach for solving the OBSP. First, we present the experimental setup and define two scenarios based on problem instances from practice. Second, we propose several heuristic rules for the OBSP, which serve as performance comparisons for our DRL approach. Finally, we present and discuss the experimental results.

Experimental setup
The experiment instances are derived from a dataset from practice. We used a dataset that contains 257,585 orders with a total of 376,522 items. Moreover, we took samples from the dataset to vary the parameters that we expect to have an impact on the performance of the algorithm. In particular, we vary the following parameters.
1. Order throughput. Throughput rates between 300 orders per hour and 500 orders per hour are analyzed.
2. Resource settings. In the resource settings, the number of pickers, shuttles, StO workstations, DtO workstations, and packing workstations can be adjusted. These settings are adjusted to the desired throughput rate such that the number of tardy orders is minimized.
3. Distribution of SKUs in storage systems. In the experimental setup, 70% of the orders are picked in the PtG area and 30% in the GtP area. In the data used for this experiment, 80% of the requested orders can be fulfilled by approximately 20% of the SKUs; these are the fast movers. The fast movers are therefore stored in the PtG area, where additional operators can easily be deployed on peak days to pick them, whereas deploying additional shuttles in the GtP area is less flexible. The slow movers are stored in the GtP area, and thus 30% of the orders are picked in this area: 20% slow movers and 10% fast movers. These ratios can be adjusted to specific needs, but this is left out of scope in this paper.

4. Length of run. The length of a run is set to a simulation run time of 60 minutes. In these 60 minutes, we analyze the number of tardy orders for a certain throughput with specified resource settings. Consequently, only the hours that include cut-off times are analyzed in the experiment, i.e., every hour before 19:00 is not included.

5. Order releasing moments. We examine two settings for releasing orders into the system: orders can be released every 60 minutes or every 15 minutes. With more release moments per hour, a more dynamic environment is tested.
6. Cut-off moments. We also examine two settings with different cut-off times, similar to the order releasing setting: cut-off times can occur every 60 minutes or every 15 minutes, following the distribution shown in Table 3. We use this distribution for our experiments.

Training Parameters
The model parameters for this experiment consist of the parameters used to train the PPO algorithm; their values are similar to those in the original work of [29], with some adjustments to fit the OBSP environment. The algorithm is trained for 750,000 steps; depending on the size of the problem instance, an episode requires 50 to 100 actions.
We initialize a neural network similar to [29] with two hidden layers of 64 units and tanh nonlinearities, with an output layer of 31 units covering our action space. The clipping parameter ε is set to 0.2, the value that showed the best performance in [29]; this ensures that the probability ratio between the updated policy and the old policy is clipped to the interval [1 − 0.2, 1 + 0.2]. The discount factor γ is set to 0.9999 instead of 0.99, which means that rewards received in the future are valued almost as highly as immediate rewards. Both the DRL agent and the simulation model run on the same machine, with an Intel(R) Core(TM) i7 Processor CPU @ 2.80GHz and 32GB of RAM.

The proposed heuristics for the OBSP

The least slack time (LST) batching rule is an extension to the greedy algorithm that considers the slack time of orders: for orders with little slack time, picking-by-order is then performed. Slack time is the amount of time left until the order's due date is reached, taking into account the processing time required to pick and consolidate the order/batch.
The picking-by-order small batches (POSB) rule is also an extension to the greedy algorithm and is created for batches with four or fewer orders. When a batch consists of four or fewer orders in the PtG area, picking-by-order is applied instead of picking-by-batch. This is possible because a picker can carry at most four carton boxes on a picking cart.
Finally, the LST+POSB rule combines the LST and POSB batching rules.
After the orders and batches have been created, the orders can be sequenced with the following 5 sequencing rules.
The Earliest Due Date (EDD) rule sequences all orders based on cut-off time, i.e., earliest due date first. The greedy algorithm already does this when releasing orders and batches to both storage systems. However, batches that are transported from PtG to GtP are placed at the end of the queue, the same queue in which all orders wait that only require GtP picking. In case these batches have an earlier due date than some GtP orders, the batches with the earlier due date are re-sequenced such that they are released and picked earlier than the batches/orders already awaiting in the GtP area. When MAXTP is applied, more priority could be assigned to orders that have a later due date but a longer processing time.
The LST rule does not give this priority to those orders, but to the orders that have a longer processing time and also have to be shipped earlier: orders with the least slack time are processed first.
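The EDD and LST sequencing rules amount to sorting the queue by different keys; a small sketch with assumed field names:

```python
from dataclasses import dataclass

@dataclass
class Order:
    name: str
    due: float          # cut-off time, e.g. minutes from now
    processing: float   # time to pick and consolidate the order

def edd_sequence(orders):
    """Earliest Due Date: sort by cut-off time."""
    return sorted(orders, key=lambda o: o.due)

def lst_sequence(orders):
    """Least Slack Time: sort by (due date - processing time)."""
    return sorted(orders, key=lambda o: o.due - o.processing)

orders = [Order("a", due=30, processing=25), Order("b", due=20, processing=5)]
assert [o.name for o in edd_sequence(orders)] == ["b", "a"]
# Order "a" has less slack (30 - 25 = 5) than "b" (20 - 5 = 15), so LST picks it first
# even though its due date is later.
assert [o.name for o in lst_sequence(orders)] == ["a", "b"]
```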

The experimental results
The objective of the OBSP is to minimize the number of tardy orders. In this section, we report the performance of the DRL approach and the proposed heuristics in terms of the number of tardy orders produced for a particular experimental setup.
Why is PPO superior in scenario A but not in scenario B? On the basis of the strategy analysis, we found that PPO applies a different strategy in scenario A than the heuristics do.
Whereas the heuristics apply picking-by-order in the GtP area, the PPO agent applies both picking strategies: picking-by-order and picking-by-batch. We can conclude that the use of DRL has provided new strategy insights that prove beneficial for the OBSP. Initially, this strategy was considered disadvantageous and was therefore not applied in the heuristics; the DRL approach showed the opposite.
When comparing the strategies of the agents in scenario A with those of the agents in scenario B, we conclude that the agents for scenario B apply a more varied strategy. Agents in scenario B perform more picking-by-order actions than the agents for scenario A, and include 22 of the 31 actions in their strategy, whereas the agents for scenario A eventually consider only 9 actions. This is a result of the larger number of cut-off times in scenario B: in this scenario there are more critical orders than in scenario A, and therefore more individual orders are picked-by-order. As a result, more actions are considered by the agents for scenario B.
Furthermore, we investigated the generalizability of the DRL approach with two cases: two agents trained on specific instance sizes are tested on different instance sizes, and an agent trained on a fixed time period (i.e., from 19:00 to 20:00) is tested on a random time period (somewhere between 19:00 and 00:00). For the first case, we trained two agents for scenario A, a400 and a500, which considered problem instances of 400 and 500 orders respectively during training. The trained a400 agent is then tested on a problem instance with a throughput of 500 orders, and vice versa: the trained a500 agent is tested on a problem instance with a throughput of 400 orders. The results, in terms of the percentage of tardy orders, are shown in Figure 5. It can be concluded that both agents tested on unfamiliar instances perform as well as the agents trained on those instances; the trained agents show approximately the same percentages of tardy orders and standard deviations, see Table 8 for a complete overview.
For the second case, we train an agent on a problem instance that starts at 19:00 and then test this agent on random hours between 19:00 and 00:00.

Instance size   a400            a500
400 orders      1.08% (2.00%)   0.75% (1.68%)
500 orders      3.57% (2.52%)   3.66% (2.59%)

Table 8: The average percentage of tardy orders for the a400 agent and the a500 agent on instance sizes of 400 and 500 orders. The results are the mean and standard deviation (between brackets) over a training duration of 1000 episodes.

Figure 6 shows the performance of the a400 agent tested on random hours between 19:00 and 00:00 and on random hours between 19:00 and 23:00. The a400 agent does not show stabilized performance across all episodes when the last hour is included: after approximately 180, 410, and 590 episodes, the agent does not seem able to solve the instance and many orders become tardy. When the last hour is excluded, however, the generalization of the DRL approach improves. The instances that include the last hour process the last remaining orders that have to be shipped just before 24:00, and it can be the case that only 100 to 200 orders are requested during this hour. As a result of this significantly smaller instance, the agent can end up in states that it has not encountered during training.
The overall performance is nevertheless similar, with an average percentage of tardy orders of 0.66% and 0.94% for the agent that included and excluded the last hour respectively.

Conclusion
In this work, we have shown how to benefit from recent advancements in Deep Reinforcement Learning (DRL) to solve a challenging decision-making problem in warehousing: the order batching and sequencing problem (OBSP). We apply the Proximal Policy Optimization (PPO) algorithm by first translating the OBSP into a semi-Markov decision process formulation. Based on this formulation, we create a DRL agent that takes OBSP decisions and interacts with a simulated environment that computes the effects of these decisions. We have compared the performance of the DRL approach to that of several proposed heuristics in a set of experiments with a variety of settings. The experimental results demonstrate that the DRL agent is able to find good strategies to solve the OBSP in all tested settings and outperforms the heuristics in most of them. In addition, the results demonstrate some level of robustness and generalizability of the DRL approach. More specifically, agents trained on a particular set of orders (e.g., 400 orders) learn strategies that produce good results in settings with fewer or more orders. Moreover, agents trained to handle the orders between 19:00 and 20:00 achieve consistently good performance when applied to different hours.
In this paper, we have shown a successful application of deep reinforcement learning to solving the OBSP. Deep reinforcement learning remains a promising approach for sequential decision-making problems in complex, dynamic environments, and is particularly well suited to problems for which no existing strategy is satisfactory, not only in the context of the OBSP, but also for many real-world decision-making problems that are extensively studied in the Operations Research and Management community.