Innovative Applications of O.R. Approximate dynamic programming for container stacking

typically


Introduction
Intermodal containers are used to transport various goods efficiently across different modes of transport.Containers exist in various shapes and sizes.Commonly the term Twenty-foot Equivalent Unit (TEU) is used as a unit of measure for a container.Worldwide, the annual container throughput increased from 560 million TEU in 2015 to 793 million TEU in 2017 ( UNCTAD, 2018 ).The increasing throughput requires extended capacity or more efficient use of the existing capacity of container terminals.To increase the capacity and minimize the size of the container yard, containers are stacked on one another.Stacking containers impose a Last-In-First-Out (LIFO) retrieving sequence within a stack of containers.If one wants to retrieve a container, first all containers above need to be moved to a different stack.Such a move is called a reshuffle and is often regarded as an unproductive move ( Gharehgozli, Yu, De Koster, & Udding, 2014 ).Preventing reshuffles has two main benefits.First, it avoids the costs of executing the reshuffle.Second, it decreases the time needed to fetch a container.This decrease, in turn, lessens the time needed to retrieve all containers for a car-rier, resulting in higher throughput for the terminal and less waiting time for the carrier.
Various studies dedicate their attention to some flavor of the problem of preventing reshuffles.The most common problem is known as the Container Relocation Problem (CRP), also known as the Block Relocation Problem (BRP) ( Caserta, Schwarze, & Voß, 2009 ).In this problem, one tries to empty a given terminal configuration, with given stacks of containers, using as few reshuffles as possible.A different problem is the pre-marshaling problem.Here one tries to proactively rearrange the containers in such a way that the number of reshuffles needed to empty the terminal is minimized.Besides the reshuffles and pre-marshaling moves, two other movements can be identified: loading and unloading .To prevent confusion, we refer to the placement of containers in the terminal as handling of inbound containers (loading) and the retrieval of containers from the terminal as handling of outbound containers (unloading).Commonly, only outbound containers are considered in the CRP, BRP, and pre-marshaling problem.
To perform the reshuffle, loading, and unloading moves, various equipment is available ( Stahlbock & Voß, 2008 ).On the quayside, where ships are loaded and unloaded, typically quay cranes are being used.On the landside, where containers typically arrive and depart via train or truck, containers are transported from the quay crane to the container yard and from there to the hinterland.The container moves on the landside are typically performed by straddle carriers, reach stackers, or a combination of trucks and gantry cranes.Depending on the equipment used, different stacking restrictions may apply, e.g., gantry cranes are generally able to reach all topmost locations whereas the reachability of a container for a reach stacker depends on neighboring locations given the diagonal lifting.Also the container itself might impose stacking restrictions (e.g., dedicated areas for dangerous goods and reefer containers that need access to electricity).
In this paper, we study the CRP considering both the handling of in-and outbound containers, allowing for different types of equipment posing different stacking restrictions, and considering uncertainty in the arrival and departure times of containers.This research is motivated by our collaboration with seven small to medium-sized Dutch container terminals where, possibly in combination with gantry cranes, also reach stackers are being used, which is typically not considered in existing CRP studies.Furthermore, various studies ( Caserta, Voß, & Sniedovich, 2011;Güven & Türsel Eliiyi, 2019;Kim & Hong, 2006 ) assume that the arrival and departure times of containers at/from the terminal are known, whereas in reality these times are unpredictable ( Gambardella et al., 1996 ).Although the exact arrival and departure times of containers are hard to predict, the uncertainty in the sequence of arriving transport modalities (ships, trains, trucks) is limited as plans are typically made long in advance and delays are communicated with the terminals.Still, the sequence in which, e.g., containers are unloaded from a ship might be fully unknown before arrival as they depend on the stowage plan.Hence, only partial knowledge of the arrival process can be included within the CRP.This partial knowledge makes it difficult to control the sequence in which containers are handled ( Gharehgozli et al., 2014 ).Studies that explicitly consider uncertainty are limited to either the inbound problem ( Gharehgozli et al., 2014;Kim, Park, & Ryu, 20 0 0;Zhang, Wu, Kim, & Miao, 2014 ) or the outbound problem ( Galle, Manshadi, Boroujeni, Barnhart, & Jaillet, 2018b;Ku & Arthanari, 2016 ).
The resulting combined in-and outbound stochastic CRP, which involves sequential decision-making under uncertainty, lends itself to be modeled as a Markov Decision Process (MDP), and can theoretically be solved using Stochastic Dynamic Programming (SDP).Because real-life settings are too large to solve exactly using modern-day computers, we solve the problem heuristically using Approximate Dynamic Programming (ADP).The contribution of our work is threefold.First, we formulate an MDP that covers both the in-and outbound handling of containers.Second, we present an ADP approach to efficiently solve the problem, considering both an aggregated lookup table and a linear value function approximation.Third, we provide insight into the performance of these approaches on various sized problem instances, based on the characteristics of seven small to medium-sized Dutch terminals connected to our research.
The remainder of this paper is structured as follows.Section 2 discusses various related works.Section 3 explains and defends the assumptions made in this paper, and formally defines the problem in terms of an MDP.The proposed ADP approaches are presented in Section 4 .Section 5 applies the ADP approaches to various problem instances and reports their performance.We end with conclusions in Section 6 .

Related work
Various problems related to the stacking of containers have been addressed in literature.This section describes these studies by dividing them into two categories: deterministic problems and stochastic problems .

Deterministic problems
In deterministic problems, the arrival and departure times for every container are assumed to be known in advance.As this assumption is common in the CRP literature, we hereby only provide a brief overview.Güven & Türsel Eliiyi (2019) propose heuristics for the inbound problem.The location for a stack is chosen based on departure time, size, trade type, and weight.Also, the receiver, owner, and destined vessel are taken into account.The outbound-only problem is considered by Zeng, Feng, & Yang (2019) , who propose five heuristics for this problem.In the described model, container departures are based on the group arrival information of external trucks.This information is based on how the appointments in a Truck Appointment System (TAS) work.In such a system, trucks are given time slots in which they can bring or retrieve a container.Containers are grouped using these slots.The order among slots is fixed, while the order within a group may be arbitrary.The proposed heuristics are adjusted from Caserta, Schwarze, & Voß (2012) and Tang, Jiang, Liu, & Dong (2015) .
Both Kim & Hong (2006) and Tanaka & Takii (2016) use Branch & Bound (B&B) to solve the outbound problem optimally.Kim & Hong (2006) use the number of blocking containers as a lower bound, while Tanaka & Takii (2016) use a more sophisticated lower bound function based on Kim & Hong (2006) and Zhu, Qin, Lim, & Zhang (2012) .As the B&B approach of Kim & Hong (2006) is not suitable for solving realistically sized problems, a heuristic is proposed.Caserta et al. (2011) use Dynamic Programming in combination with the Corridor Method (CM) for the outbound problem.CM does not evaluate all stacks for a reshuffle but only those within a certain distance from the original stack.Li, Xiaoning, & Zhengyu (2019) tackle the combined problem in rail-truck transshipment terminals.They aim to minimize reshuffles as well as crane moving distances.Wan, Liu, & Tsai (2009) and Tang et al. (2015) also address the combined problem using Integer Programming (IP).As solving an IP model is impractical for real-life instances, they propose various heuristics.Caserta et al. (2009) and Galle, Barnhart, & Jaillet (2018a) present an IP model where all stacks are represented using a single matrix.They also provide an efficient way to see which containers are located above a certain container.
This section showed various instances of deterministic stacking problems and different methods used to solve them.The next section discusses stochastic problems.

Stochastic problems
Stochastic approaches include uncertainty.Arrival and/or departure times are not fully known in advance and decisions are made based on incomplete information.To the best of our knowledge, all studies on the stochastic CRP are included in this section.Kim et al. (20 0 0) consider the inbound problem and divide containers into weight groups (heavy, medium, and light).A state in their model is represented by the number of empty slots in a stack and the highest weight group of that stack.It is assumed that containers from the heavier weight group are to be placed on a vessel earlier than lighter ones.This precedence is used to evaluate how good a certain state is.A decision tree based on the exact solution of smaller instances is used to solve larger problems.Gharehgozli et al. (2014) define an SDP model for the inbound problem.Containers are divided into families based on the ship to be loaded on, their weight, and their port destination.They assume that containers of a family are to be retrieved before those of the next family and have an equal chance to arrive next.As solving the SDP is not feasible for realistically sized instances, they also propose a decision tree heuristic.The proposed solution uses the results of the SDP applied to small-scale instances to generate generalized decision trees that can solve large-scale problems.
The model of Ku & Arthanari (2016) is based on truck retrievals.Trucks are assigned to a certain time interval.The sequence of truck arrivals that fall within the same time interval is unknown.The probability of the next truck arrival (and thus which container needs to be retrieved next) is assumed to be uniformly distributed (i.e., every truck within the same time interval has the same probability of arriving next).Information on the k th container to be retrieved is known when the (k − 1) th container is retrieved.The described problem is referred to as the CRP with Time Windows (CRPTW).To solve the proposed SDP model, Ku & Arthanari (2016) use a search-based algorithm that models the SDP as a decision tree with chance nodes.The decision tree is searched using depth-first search (DFS) with backtracking to save memory consumption.To reduce the size of the search space, aggregation is used.A stack representation is transformed into an abstract one by removing all empty stacks and reordering the columns based on the pickup priorities.Although the aggregation structure reduces the required computation time, it still does not allow it to be used in real-life settings.Therefore, an index-based heuristic is proposed that can be applied to the CRPTW.Galle et al. (2018b) introduce the batch model for the outbound problem.In this model, like the CRPTW, the sequence of outbound containers is not fully known in advance.However, some knowledge is available about the relative order between groups of containers, coming from, e.g., a Truck Appointment System.Therefore the notion of batches is introduced.Containers are grouped into batches according to their departure times.The sequence among batches is known, but the sequence within a batch is not.Where the CRPTW model reveals the next container to depart per container retrieval, the batch model does this per batch, i.e., the sequence of handling containers within a batch is revealed before handling the first container of the batch.From this model, a decision tree is created.Two search methods are proposed: Pruning Best-First Search (PBFS) and PBFS-approximate (PBFSA).PBFS considers all possible retrieval orders within a batch and calculates the number of reshuffles needed to optimally retrieve the batch.It does this bread-first by first considering states with the lowest lower-bound and disregarding states that have a lower lowerbound than the current best solution.PBFSA works the same as PBFS except that it only considers a subset of the possible outcomes.Zhang et al. (2014) tackle the outbound problem by extending the work of Kim et al. (20 0 0) .They propose two adaptations to the original model.The first adaptation is a new way of calculating the punishment coefficient (value function) of the SDP model.The second adaptation changes the representation of the state.The old model had no information on which weight groups were present in a stack, only the heaviest was known; this information is included in the new model.Both adaptations are compared to the original model and they both yield better solutions.As the SDP model with the second adaptation can only solve smaller instances exactly, a rollout ADP algorithm is proposed.The rollout algorithm did not outperform the approach of Kim et al. (20 0 0) but is still regarded as promising when a good base policy can be found.Lehnfeld & Knust (2014) propose a classification scheme that applies to the handling of in-and outbound containers.It also applies to the pre-marshaling problem and combinations with the handling of in-and outbound containers.The classification scheme forms an abstraction of the most common problems considered in the literature.All above-described related works are classified using this scheme and summarized in Table 1 .In addition, the prob-lem considered in this paper is included in the bottom row.When papers discuss multiple alternatives to the problem, the cell in the table is split into multiple rows.

Overview
One part of the scheme involves notation to define what kind of container streams are handled.The basis of this notation is formed by π and I, where π indicates a single sequence of containers and I indicates a sequence of batches of containers ( Lehnfeld & Knust, 2014 denote these batches by sets).The superscripts in and out denote an inbound or outbound stream of containers, respectively.The symbol ∼ indicates an unknown sequence with an underlying probability distribution.A subscript K denotes a sequence of the defined pattern.
Another part of the scheme indicates the metrics minimized or, in other words, the goals.The goals identified in the described works are the number of reshuffles ( # RS), transportation costs ( T C), and the number of adjacent unordered stackings ( US adj ).If a stochastic version is tackled, the expected value is minimized.It may also happen that multiple goals are followed.
As can be observed, there is a clear lack of focus on the combined problem of handling both in-and outbound containers, with uncertain arrival and departure times.In addition, no research was found that addresses the stacking restrictions imposed by a reachstacker.This paper, consequently, seeks to fill this gap by addressing both.A common modeling methodology used in the stochastic CRP is SDP.However, solving such a model exactly for larger instances is typically not feasible.To be able to cope with larger instances, we propose a solution based on the framework of ADP.

Problem description
The problem considered in this paper represents that of a terminal handling inbound and outbound containers.It is assumed that the terminal is characterized by predefined places where containers are allowed to be stored.These places are referred to as stacks and are grouped together to form a bay.The height of a stack is referred to as its tier.An overview of these terms is given in Fig. 1 .
During the container handling process, the terminal needs to make various stack assignments for the incoming containers.Besides, when a container blocks another one, i.e., a container is situated above one that needs to be retrieved, also a stack assignment needs to be sought for the blocking one.Moving this blocking container is commonly referred to as reshuffling.Containers arrive and depart via various possible transportation modes, and they typically arrive and depart in batches.For example, multiple containers arrive via a vessel and are stored somewhere at the terminal.In turn, containers currently residing at the terminal are retrieved from the terminal and placed on the vessel.It can also happen that a transportation mode only delivers or only retrieves containers.The delivery and retrieval of containers is a continuous process.During the retrieval, it might happen that containers are blocking the ones needing to be retrieved.New stack assignments need to be sought to reshuffle these containers.The goal of this research is to find stack assignments, for both inbound and reshuffled containers, that minimize the total number of reshuffles.To formally define the model, we first state and explain the assumptions made.Seven assumptions are made, referred to as A 1 to A 7 .
A 1 The terminal consists of a pre-determined fixed number of stacks and tiers.This research focuses on the operational side of the stacking problem.Therefore, the layout of the terminal is pre-defined.Furthermore, due to equipment limitations and safety concerns, stacks cannot exceed a given height.This assumption is also used in, e.g., Güven & Türsel Eliiyi (2019) , Wan et al. (2009) and Tang et al. (2015) .A 2 Parallel execution of container moves is disregarded.The handling of containers can be seen as a list of tasks that need to be executed.Usually, a terminal has multiple resources available to carry out these tasks partly in parallel.Some of these tasks may require another one to be completed first, i.e., storing two containers on the same stack.However, we assume a list of tasks that needs to be executed in a given sequence.The possibility of optimizing this sequence is seen as a problem on its own and therefore disregarded in this paper.Note that this assumption is only used to learn the value of container positions (see Section 4 ); hence, parallel execution of moves based on the outcomes of our model is still possible in practice.A 3 Containers arrive and depart in batches.This research considers the combined inbound and outbound problem, where containers are grouped into batches and the sequence of handling containers within a batch is unknown.A batch is created by grouping containers that are expected to arrive or depart roughly around the same time.For example, a batch might consist of the inbound and outbound containers corresponding with a single incoming modality.However, in principle, an incoming modality might contain multiple batches or a batch could consist of containers from multiple incoming modalities.The uncertainty within a batch could relate to the stowage planning of vessels or the uncertainty in the arrival sequence of trucks.This assumption is similar to the one made in Galle et al. (2018a) .A 4 The sequence among all batches is known.The exact arrival and departure times of containers are uncertain ( Gambardella et al., 1996 ).However, usually some information is known about the sequence of arriving batches, e.g., the arrival sequence of large vessels is roughly known in advance (see Section 1 ).As argued by Gambardella et al. (1996) , this assumption is less realistic for trucks, but recent technological advancements such as truck appointment systems and GPS tracking might help predicting relative truck arrival times.For those situations where multiple modalities are expected to arrive roughly at the same time and where we are not able to provide a good prediction of the sequence, we might artificially model the containers from these multiple incoming modalities as one batch for which the sequence will be revealed upon arrival of the first modality.A 5 A container may only be reshuffled when it is blocking the current target container.This is a common assumption in literature ( Lehnfeld & Knust, 2014 ) and is often referred to as the restricted CRP .It decreases the dimensionality of the problem while not sacrificing the solution too much ( Petering & Hussein, 2013 ).Note that this assumption is only used to support the positioning decision; in practice, the terminal operator still has the freedom to proactively reshuffle containers (and our algorithm tries to circumvent these reshuffles).A 6 The cost of reshuffling is the same for every reshuffle.
The goal of this research is to minimize the number of reshuffles.Therefore only the reshuffles are counted and, for instance, the equipment costs and travel distances of the reshuffles are ignored.This assumption allows stacks to be interchangeable, thereby reducing the state space.A 7 All in-and outbound containers are of the same size and type.This assumption is also used by, e.g., Gharehgozli et al. (2014) , and is only introduced to ease the presentation and numerical experiments.However, additional stacking restrictions, e.g., restrictions on where a container may be stored, can easily be added to the model.To show this, we provide an example of such an extension in Section 5.3 , where some containers are only allowed to be stored at specific bays.
As mentioned before, a terminal could consist of multiple bays where each bay consists of one or more stacks.All topmost containers and locations can be accessed if the terminal is operated by gantry cranes.However, this is no longer the case for a terminal operated by reach-stackers.The use of reach-stackers imposes additional constraints on which containers can be retrieved and where containers can be placed.For example, a reach-stacker cannot reach a stack behind a higher one.In addition, it is not uncommon that a bay resides against a wall (or some other immovable structure), meaning that the bay can only be accessed via one side.A more detailed description of the reach-stacker restrictions can be found in Appendix A .As our focus is on small to mediumsized terminals, we include the possibility of both gantry cranes and reach-stackers.
We formulate our problem as a Markov Decision Process (MDP).
Let B be a list of all batches, such that B t , t ∈ { 1 , . . ., | B |} represents the tth batch.The number of containers within a batch may be any natural number including zero.Assumptions A 3 and A 4 state that the batches are known at the time of planning.Therefore, a problem instance is specified by the batches and the containers present within these batches.As we include knowledge of the planning horizon in our MDP, we are not looking for a stationary policy; hence, it makes sense to model the problem as a finite-horizon one.In the MDP, during each stage t ∈ { 1 , . . ., | B |} , a single batch is handled at the terminal.Hence, the time between stages may vary and t does not refer to a specific time but to a given batch to handle.The goal of this problem is to handle all containers in a given terminal layout using as few reshuffles as possible.
Following the classification scheme of Lehnfeld & Knust (2014) , the problem is classified as: Section 2.3 .Here S i, j denotes the used stacking restrictions.These stacking restrictions, as well as the state, decision, and transition function of our MDP, are discussed in the following subsections.
State .The state S t at stage t is defined by the set of all containers that reside at the terminal, their assigned locations, as well as the realized sequence of the batch to handle at the current stage t.Let L t describe the current residing containers in state S t .Let L t,x , x ∈ { 1 , . . ., # bays } denote the set of containers in an individual bay x and let L t,x,y , y ∈ { 1 , . . ., # stacks _ in _ bay _ x } denote the sequence of containers stacked in the y th stack of bay x (from bottom to top).Thus, L t can be defined by a three-dimensional array, with dimensions given by the number of bays, the maximum number of stacks within a bay, and the maximum tier.Finally, let ˆ B t denote the realized sequence of batch B t .Hence, S t is defined by the tuple < L t , ˆ B t > .Let c b,o denote an individual container being part of batch b and having order number o.The batch number may be bigger than the total number of batches ( b > | B | ).In this case, the container does not leave the terminal within the current planning horizon.The order number indicates when a container within a batch is handled.If o = 1 , it will be the first container handled in a given batch, and will be the last container handled.The order number o is to be revealed just before handling batch b.This assumption typically holds for the terminals under study as the stowage plan of incoming ships (inland barges) is generally revealed upon arrival.In general, some information might already be available before the arrival of batch b or the terminal might have some freedom in deciding about the sequence of loading/unloading batch b.Note that this assumption does not affect the terminal operations, only the calculation of the expected future impact of a container placement or reshuffle.Furthermore, when some information is known before the arrival of the batch, it can be included in the exogenous information ζ t,k introduced later on.
Decision .A single decision a t represents a sequence of container moves to handle batch B t at time t.In our model, three types of container moves are identified: the inbound move (i.e., placing a container in the terminal), the outbound move (i.e., retrieving a container from the terminal), and a reshuffle move (i.e., moving a container within the terminal).The inbound move of container c in stack y of bay x , given L t , results in adding c as the last element to the sequence L t,x,y .The outbound move of container c from the top of stack y of bay x , given L t , results in removing the last element of the sequence L t,x,y .The reshuffle move of the top container of stack y from bay x to the top of stack y of bay x , given L t , results in moving the last element of the sequence L t,x,y to the sequence L t,x ,y .
The decision a t involves handling all containers in batch B t at time t, where the handling of a single container might already require multiple moves in case of a reshuffle.Therefore, the decision is described by a vector of vectors: the decision a t is a vector In the remainder of this paper, we use the term partial decision to refer to a single element within a t,k , i.e., one container move.Due to assumption A 5 , if B t,k is an inbound container or an outbound container that is not blocked by another container, then | a t,k | = 1 (the number of partial decisions).If there are blocking containers, then | a t,k | = # nr _ blocking + 1 .The notion of blocking containers used in this paper as well as the reach-stacker restrictions are further discussed in Appendix A .
Information and Transition Function .Two types of states can be identified in a single stage.The first type of state is the one where the exogenous information just has been revealed (i.e., the sequence of containers with which the current batch has to be handled) and where we have to make a decision on how to handle each container in the current batch (i.e., where to position the inbound containers and what reshuffle moves to make).The second type of state is the one directly after the decision has been made and before the new exogenous information has arrived.
These states are referred to as the pre-and post-decision states, respectively ( Powell, 2013 ).In the remainder of this paper, for the sake of clarity, the pre-decision state is referred to as S t and the post-decision state as S a t .Let W t+1 denote the exogenous information made available after handling B t .In the proposed model, this information solely consists of the sequence of containers in the next batch.Let ζ t,k be a random variable taking values in { 1 , . . ., | B t |} , such that { ζ t,k = i } is the event that c t,i is the k th container to be in handled in B t .Assuming no prior information on the sequence is available, we use the same probability for each container of being the k th container within a batch.For those cases where some prior information is available at the time of decision-making, we can simply replace this probability distribution function with something else.The distribution of ζ t,k is given by: Let S M denote the transition function.The transition function is split into two parts.Let S M,a be the transition function that transforms a pre-decision state into a post-decision one (i.e., handling a batch) and let S M,W be the transition function that transforms a post-decision state into a pre-decision state (i.e., the disclosure of exogenous information).The transition functions are defined as: Using the transition function, the value functions can be defined as: Here γ denotes the discount factor and C(S t , a t ) is defined as the total number of reshuffles taking place while executing action a t .Formally, it can be defined as the sum of the number of partial decisions minus 1 for each container k in the current batch (minus 1 as the move of container k is not considered a reshuffle): The total number of possible realizations for a single batch of size n is equal to n ! .This means that the total number of unique realizations over the planning horizon T is equal to the product of the number of realizations of each batch ( T t=1 | B t | ! ).This number quickly becomes too large to evaluate all possible realizations.Therefore, we are going to solve it approximately using Approximate Dynamic Programming, as introduced in the next section.

Approximate dynamic programming solution
Approximate Dynamic Programming (ADP) is a suitable methodology for solving large-scale discrete-time multistage stochastic control processes ( Mes & Pérez Rivera, 2017 ).It can be seen as a framework containing various approaches and techniques for dealing with large sequential decision problems under uncertainty.Powell (2013) describes three "curses of dimensionality".These curses make it difficult to come up with an exact solution for these problems.The first curse is that the state space S is too large to evaluate the value functions V t (S t ) for each state in a reasonable amount of time.The second curse is that the decision space A t may be too large to evaluate all possible decisions.Third, computing the expected future costs may be intractable when the outcome space is too large.These three curses are present in the problem tackled in this paper.
To overcome the curses of dimensionality, one needs to introduce approximations.Various types of approximations exist within the framework of ADP.In this paper, we rely on approximate value iteration, with the aim of finding a solution to the following equation: Here V t+1 denotes the approximated value function.We refer the reader to Powell (2013) for an algorithmic outline of this approach relying on approximate value iteration.As this paper is the first to consider the combined problem of handling both in-and outbound containers, with uncertain arrival and departure times, our main interest is to study the potential of ADP.Hence, we limit ourselves to the basic forms of approximating the value function, i.e., using a lookup table with aggregated states and the basis function approach , discussed in Section 4.1 and Section 4.2 , respectively.With this, we pave the way for further research on more advanced approximation architectures.

Lookup table with aggregated states
The first approach tested is the lookup table approach.In this approach, we store the estimated value for each state in a lookup table.However, given that we are only interested in the number of reshuffles, some states can be seen as equivalent ( Ku & Arthanari, 2016 ).Take for instance a terminal with n empty stacks operated using gantry cranes.It does not matter in which stack a container is placed, as the result obtained from one location may be applied to the others.Therefore, the number of states that need to be explored, can be reduced by reusing the solution found for equivalent states.It now remains how to determine whether two states A and B are equivalent.For this, an aggregation function is introduced.
Galle et al. ( 2018b) introduce an aggregation structure based on the sorting of individual stacks.In this aggregation, stacks are ranked according to their height.For stacks with equal height, ties are broken by looking at the departure sequence of containers within the stack starting from bottom to top.A container with an earlier departure will receive a lower rank.Stacks are rearranged from the lowest rank on the left to the highest on the right.This approach will not work for our problem, as in our problem a state consists of bays with multiple stacks.Therefore we add a ranking rule that first considers the ranking between bays.Bays are ranked based on three features.First, it is checked whether a bay is reachable from both sides.A bay that is reachable from both sides is put after one that is not.Second, the size of the bays is compared.A bay with fewer stacks is given a lower rank than one that has more.Third, the stacks within the bay are compared pair-wise using the outbound batch numbers of the containers.Finally, if a bay is reachable from both sides, it is mirrored and compared to itself.Using the same rules, the layout that comes first is chosen in the aggregated state.An aggregated state is a terminal where all bays are sorted according to their rank.An example of this aggregation is given in Fig. 2 .

Basis function approach
The lookup table approach using aggregated states needs to explore all unique aggregated states in order to work properly.This is infeasible for larger problem instances.To solve this, a value function that generalizes across states is needed.A way of accomplishing this is using basis functions.The idea behind basis functions is that they extract certain features (i.e., characteristics) from a state and that the values of these features are combined to come up with an estimate of the value of the state.The efficiency of basis functions relies on the identification of useful features that together accurately describe the value of a state.Let F denote the set of features.The value of a state for a feature f ∈ F can be obtained by evaluating the basis function φ f (S t ) .These values, in turn, can be combined in a linear combination using a weight θ n f for each feature f as follows: The weight θ n f is updated in each iteration n .As we consider a finite horizon problem, each t has its own set of weights.Several methods are available to tune the weight θ n t, f for each feature f ∈ F. We use the recursive least-squares method as it has been shown to be an effective approach for related problems ( Mes & Pérez Rivera, 2017 ).The weights θ n t are updated each iteration using: where H n is a matrix and where B n −1 is a |F| by |F| matrix that is updated using: Matrix B n is initialized using B 0 = ρI , where I is the identity matrix and ρ is a small constant.The value of α n determines the weight on prior observations and is defined as: where δ is a tunable constant.Setting δ = 0 yields equal weights on all observations, whereas a lower value decreases the weight on later observations.For a detailed description on RLS, we refer to Powell (2013) .
The used features will be explained in Section 4.2.1 .Next, two adaptations of the basis function approach are discussed to make it applicable to larger problem instances.Section 4.2.2 discusses a reduction in the decision space by using partial decisions and Section 4.2.3 describes a way to reduce the decision space using the concept of a corridor.

Table 2
Overview of VFAs and the features they use.

Features
In this paper, we consider nine features.Besides using features derived from the literature, we also propose additional ones to improve the approximation.Using these features, five value function approximations (VFAs) are defined.Table 2 shows the combination of features used for each of the VFAs and their corresponding name used in the remainder of this paper.Each feature is discussed in more detail below.
Blocking Lower Bound .The Blocking Lower Bound feature (BLB) is based on the lower bound described by Galle et al. (2018b) .It describes the minimum number of expected reshuffles to empty a given stack.It serves as a lower bound on the total number of reshuffles and can be calculated using the locations of the containers currently residing at the terminal.Let H(s ) denote the height of stack s and s i the container at tier i (where i = 1 represents the lowest container), then BLB (L ) as a function of the current collection of containers in our state is defined as: A high BLB value indicates a bad state as the number of expected reshuffles is high.
Unordered Stacks .The Unordered Stacks feature (US) counts the total number of stacks that (may) cause a reshuffle.This feature is similar to BLB, only now the stack is counted rather than the blocking containers.A good state most likely has a low value for the US feature.
Composite Measure .The Composite Measure (CM) is proposed by Scholl, Boywitz, & Boysen (2018) .They found that a combination of an underestimate and overestimate function gives decent approximations on the number of reshuffles in the classic CRP.They found that C M − (# C RL, 0 .65) yielded the best results.
Although the explored methods specifically target the outbound problem, they may give insights into the combined problem.In this paper, we use the best-found solution by Scholl et al. (2018) .
Batch Label Difference .The Batch Label Difference f eature (BLD) gives insights into the difference in batch labels among the containers within a stack.A value around zero indicates a stack of which the batch labels lay close to each other (i.e., these containers depart relatively soon after each other).An extreme value, both positive and negative, indicates a stack where the batch labels vary a lot.Most likely a good state will have a measure close to zero.BLD (L ) is defined as: Average Stack Height .The Average Stack Height feature (ASH) indicates the average stack height of all non-empty stacks.It is calculated by summing all heights of each stack and dividing it by the total number of non-empty stacks.This feature plays a big role in the leveling heuristic.The leveling heuristic has proven to produce competitive results in the setting where the order of retrieval is completely unknown in advance ( Zehendner, Feillet, & Jaillet, 2017 ).
Non-Reachable Stacks .The Non-Reachable Stacks feature (NRS) indicates the number of stacks that cannot be reached.This feature is only applicable to a terminal that uses a reach-stacker, as a gantry crane will have access to every stack at all times.Whether a stack is reachable depends on whether the topmost container of the stack can be retrieved using the reach-stacker restrictions described in Appendix B .If the topmost container cannot be retrieved, the stack is counted as a non-reachable stack.Minimizing the number of non-reachable stacks enforces a way of building stacks in such a way that they remain accessible.
Non-Reachable Containers .The Non-Reachable Containers feature (NRC) is an extension of NRS.This feature counts the number of containers that cannot be directly reached.If a stack is reachable, the containers below the topmost are counted as not reachable.If a stack is not reachable, all containers in that stack are counted.Minimizing the number of non-reachable containers minimizes the likelihood that a reshuffle is needed to retrieve a container.
Future Blocking Stacks .The Future Blocking Stacks feature (FBS) indicates the number of stacks that may become blocking in the future.Whether a stack may become blocking in the future is based on the containers currently residing in the stack and the remaining inbound containers.For every remaining inbound container, the following is checked.First, all containers in the stack that depart in a batch earlier than the current inbound container that is being handled, are removed.Then for the remaining containers, the batch numbers are compared with the inbound container.If an inbound container is departing earlier than one of the remaining containers in the stack, the stack may cause a reshuffle in the future.This stack is then counted for this feature and ignored in the remainder of the calculation (i.e., a stack can only be counted once for this feature).Minimizing the number of future blocking stacks reduces the chance of causing a future blocking stack and thereby the number of reshuffles.
Future Blocking Containers .The Future Blocking Containers feature (FBC) indicates the number of containers that may become blocking in the future.Similar to FBS, this feature is calculated by looking at the remaining inbound containers.For every remaining inbound container, it is checked whether it may cause a reshuffle in the future.This is done by first ignoring all containers that depart in a batch before the inbound container arrives.The inbound container is counted as future blocking if there is a stack where a container remains that departs earlier than the inbound one.FBC works similarly to FBS, only now the inbound containers are counted rather than the stacks.
Constant .Apart from the above-mentioned features, also a constant is introduced in every VFA.This constant returns always the same value independent of the state.When there is no independent constant in a linear regression model, the model is forced to go through the origin.This may cause a wrong bias in the model.In the remainder of the paper, the value of the constant is referred to as κ.

Optimized basis function approach
The decision space A t still grows exponentially in terms of the batch and terminal size.Therefore, when considering real-life instances, the presented ADP approach might still be impractical, as the number of possible decisions is too big to evaluate within a reasonable time.Therefore an optimized ADP approach (in terms of |A t | ) needs to be found.
As mentioned in Section 3 , the sequence of container moves a t,k for handling the k th container in batch t requires partial decisions.These partial decisions involve retrieving a target container, placing a container on a specific stack, or reshuffling a container from one stack to another.The latter two of these decisions have various possible outcomes.Evaluating a single partial decision is still manageable, even for bigger instances.The problem arises when evaluating all possible combinations of multiple of these partial decisions.Evaluating only a single partial decision at a time, and making these decisions sequentially, eliminates exponential growth.
Limiting the partial decision to a single decision can be done in various ways.One option is to use a heuristic resulting in one decision instead of evaluating all possible decisions.This approach would, however, mean that the ADP approach most likely will not perform better than said heuristic.Therefore, we decided to use the VFA itself to make the heuristic decision, as the VFA provides an estimate of the value of the state resulting from the decision.This can also be applied to each of the intermediate states between t and t + 1 that occur each time after making a partial decision.Choosing the intermediate state that has the best value according to the VFA yields a single possible outcome per partial decision (breaking ties by choosing the first discovered).This approach follows the same idea as A * -search, where a function is used that dictates which state should be expanded next.The major difference is that the proposed approach selects a state, whereas A * stores all reachable states in a priority queue and explores the most promising one first.This way of sequential decision-making has the advantage that the problem scales linearly in terms of batch and terminal size, making it applicable to real-life instances.

Corridor method
During experimentation, we found that the optimized basis function approach still requires large computational efforts when solving larger problem instances.Therefore another measure is considered to further decrease computation time: the corridor method, as proposed by Caserta et al. (2011) .This method limits the decision space by only allowing reshuffles to neighboring stacks.Depending on the size of the problem and computational resources, this corridor could be set to a tight region or to the terminal as a whole (i.e., not using the corridor method).From a practical point of view, it makes sense to include such a restriction, as a single reshuffle requiring a large travel time might be more costly than two reshuffles to neighboring stacks.Caserta et al. (2011) applies the method to the deterministic CRP.In this paper, the approach is applied to the stochastic CRP and adapted to serve the combined in-and outbound problem and to work with bays.In Section 5.3 , we provide a comparison of the model with and without the corridor method.
In the corridor method of Caserta et al. (2011) , the stacks are counted as neighbors.We decided to count bays as neighbors.The corridor thus forms a range of bays rather than a range of stacks.Only bays within the range are considered when making a decision.The middle of the range is the bay from which a blocking container is reshuffled.The value of the corridor indicates which bay, seen from the middle, is included.For example, a corridor value of two indicates that in total five bays are considered.
Applying the corridor method for the reshuffle decision decreases the number of bays that need to be considered.Ideally, the same is done for inbound containers.The problem, however, is that inbound containers do not have a bay from which they need to be reshuffled.Therefore, it is unknown where to place the corridor.To solve this, a random bay index is drawn (uniformly distributed) around which the corridor is set.
The presented features of the basis function approach are defined over the entire terminal.However, most features can be split into two parts.The first part calculates the value of the bays within the corridor and the second part outside the corridor.An example of this is the calculation of the Blocking Lower Bound (BLB).The number of blocking containers in the terminal is equal to the blocking containers within the corridor plus the number of blocking containers outside the corridor.As only changes within the corridor are allowed, it is known that the value outside the corridor remains constant across all considered decisions.Therefore, one only needs to calculate the value within the corridor to determine the best-valued state.
The Composite Measure (CM) is the only feature that does not adhere to the independent calculation.Simultaneously, it is the measure that takes the most time to compute.Therefore an adaptation to this feature is introduced.When using a corridor, only containers within that corridor are considered.CM now calculates the value as if the corridor was a terminal on its own.All containers and available locations outside the corridor are ignored.

Numerical experiments
This section discusses the experimental setup and the evaluation procedure used to test our approach ( Section 5.1 ).Next, parameter tuning is discussed using virtual problem instances ( Section 5.2 ).Finally, the experimental results using various reallife instances are presented ( Section 5.3 ).The source code as well as the data we used can be found at https://github.com/boschma2702/ContainerStacking .

Experimental setup
We distinguish between two experimental phases.The first phase is used for parameter tuning and consists of two relatively small terminal layouts.The second phase considers larger problems and reflects real-life terminals, motivated by seven small to medium-sized Dutch terminals connected to our research.
In the first experimental phase, two terminal layouts are considered.The first represents a terminal where all stacks can be served using a gantry crane.This can be modeled by using the stacking restrictions imposed on bays by a reach-stacker and only allowing a single stack per bay.In total, this terminal consists of 7 stacks (and thus 7 bays).The second represents a terminal where all stacking is done using reach-stackers.This terminal consists of 3 two-way bays, each with 5 stacks.In the remainder of this paper, the two terminal layouts for the first experimental phase will be referred to as terminal layout 1 and 2, respectively.Both these instances are generated using 25 inbound containers (which later on become outbound containers).
The second experimental phase reflects real-life terminals, where we limit ourselves to terminals solely operated by reach-stackers.For this, an exploratory analysis of seven small to medium-sized Dutch terminals has been carried out to get insights into the day-to-day practices of these terminals.For these terminals, on average, the number of handled inbound containers per week roughly equals the number of stacks available.The average dwell time follows an exponential distribution, with an average (non-weighted) dwell time of three days.The common bay size consists of five stacks.Most of these terminals do not operate during the weekend.Based on the characteristics of the 7 terminals, we consider four different-sized experiments consisting of 100, 250, 500, and 1000 inbound containers, denoted by tiny , small , medium , and large , respectively.The biggest of these matches with the number of inbound containers, on average per week, of the largest of the seven terminals analyzed.We decided to use a planning horizon of one week, as the planning should be roughly known over this period.Furthermore, we assume that all bays consists of five stacks, which are accessible from both sides.This results in 20, 50, 10 0, and 20 0 bays for the real-life problems.For both types of problems, we decided to limit the maximum stack height to four.
The first experimental phase considers problems with a horizon of 20 hours.For the second experimental phase, we draw upon the conclusion that the terminals operate mostly only during weekdays, using 8-hour working days.Therefore, a planning horizon of 40 hours was chosen.Containers are given a random arrival time (uniformly distributed) within the planning horizon.Then a random dwell time is sampled using an exponential distribution with a mean of 12 hours for the first phase.For the second phase, a dwell time of 15 hours is used.After generating the arrival and departure times (given by the arrival time plus dwell time), we create one inbound batch and one outbound batch for each hour in the planning horizon, where all containers scheduled to arrive in that hour are part of the inbound batch and all containers scheduled to depart in that hour are part of the outbound batch.Regarding the sequence, we first generate the inbound batch and next the outbound batch.Although our model does not require this specific setup of generating container arrivals, it allows for a compact presentation of the experimental settings and for others to replicate our experiments even without using the provided data sets.
For the batch arrivals and departures, 151 training and 150 evaluation realizations are generated1 For all problems, 16 instances using the same parameters are generated.By generating the random realizations, each algorithm can be trained and evaluated using the same set of realizations, resulting in a fair comparison.
Each ADP algorithm runs 151 iterations using the generated training realizations.For each 10th iteration, starting after the first iteration, an evaluation is executed.Every evaluation consists of 150 simulation runs of using the learned values during a certain horizon (so decisions are made exploiting what the ADP algorithm has learned so far).During each evaluation, two things are recorded: the average number of reshuffles that occurred during the 150 evaluation simulations and the approximate value of the initial state ( S 0 ).The averages over the 16 problem instances of these measures are reported.To remain concise, only the results after the final learning iterations are recorded for the tuning instances.We refer the reader to Mes & Pérez Rivera (2017) for more details on the evaluation process used.
Using Bellman's Equation, one can calculate the optimal solution for a small problem instance.However, even for a small problem instance, the size of the decision space is too big to evaluate all possible decisions in a reasonable amount of time (the set A t of all possible decisions at time t grows exponentially in terms of the number of available stacks and batch sizes).To limit the decision space, artificial stacking restrictions are introduced along the reach-stacker restrictions.These restrictions are based on common day-to-day practice.An explanation of the artificial stacking restrictions can be found in Appendix B .With this, we could solve all instances of terminal layout 1 to optimality, resulting in 0 reshuffles.However, it is still not possible to obtain solutions in a reasonable amount of time for terminal layout 2 and for all problems considered in the second experimental phase.Therefore, we consider two benchmark heuristics to compare our solution with.The first benchmark heuristic is the myopic approach.The myopic approach makes decisions solely based on the current batch, i.e., it will select those stack assignments that yield the least number of reshuffles for the current batch.In case of a draw for a certain move, it will select the first one discovered.The second benchmark heuristic is the Min-Max Heuristic (MM), introduced by Caserta et al. (2009) .As this heuristic is designed for a terminal operated by gantry cranes, we slightly adapt it.This adaptation is referred to as MM-A.MM, as well as MM-A, are described in Appendix C .

Phase 1: Parameter tuning
ADP is a flexible approach that allows for many adaptations and settings of tunable parameters.Within the scope of this research, only a few adaptations are explored and some parameters are fixed.All algorithms described in this paper are using the epsilon greedy policy, where we explore with probability = 0 .05 and exploit otherwise.Furthermore, we always use a discount factor γ = 1 .For the lookup-table approach, the initial estimate of a state is 0. Both a Fixed stepsize (with α = 0 .05 ) and a Harmonic stepsize (with a = 1 and α 0 = 0 .05 ) are tested.Each instance is run using a single pass and a double pass approach.For the basis function approach, we set the value of ρ to 0.1, and initialize the matrix B n using B 0 = 0 . 1 * I , where I is the identity matrix.For more details on the stepsizes, single versus double pass, and the matrices used by RLS in the basis function approach, we refer to Powell (2013) .
The remainder of this section presents the results of the methods proposed in Section 4 applied to the tuning problem instances.At the end of this section, the selected benchmark heuristics will be applied to the tuning instances and compared to the proposed methods.
Lookup-up Table Approach using Aggregated States .The average numbers of reshuffles after the latest learning iteration of the lookup table approach using aggregated states are shown in Table 3 .Both a single and double pass approach in combination with a fixed and harmonic step size are tested.On terminal layout 1, the algorithms perform way off the optimal solution of zero reshuffles.Also the approximated value of the initial state remains zero after all iterations for the single pass.This could be caused by two things.The first is that the costs experienced in the future are not propagated back to the initial state.The second cause could be that there are still unexplored states directly reachable from the initial state.These states have an initial value of zero, resulting in a value of zero for the initial state as well.This problem is not apparent in the double pass.Therefore, the remainder of the ADP solutions presented in this paper makes use of the double pass version Basis Function Approach .The basis function approach is tested using two different runs.The first run serves as a baseline and uses the following settings: initial weights θ 0 f = 1 for each feature f , the constant κ = 1 as used in the VFA, and δ = 0 .9 for the nonstationary RLS.The average numbers of reshuffles after the latest learning iteration of this run are shown in Table 4 .Notable is the difference in performance compared to the lookup table approach.The VFAs perform near-optimal and are also able to obtain a solution to terminal layout 2. In terms of reshuffles, VFA5 seems to perform the worst and VFA3 the best using 151 learning iterations.The difference between the performance of all VFAs, however, is relatively small.This might be caused by the dependencies between the features.For example, a high number of unordered stacks means a high number of blocking containers.In addition, it might be that a different VFA performs better when more learning iterations are used.VFA1 has the most features and should be able to learn more complex behavior than the other VFAs.However, this does take more learning iterations to accomplish.
The second run considers other settings for the initial weights and constant used: θ 0 f = 0 for each feature f , and κ = 5 .The results can be found in Table 5 .A negligible difference with the baseline run is observed.
Optimized Basis Function Approach .To distinguish between the standard and optimized basis function approach, we use VFA and VFA-O, respectively.The results of the optimized approach can be found in Table 6 .The same ADP settings as the baseline run are used.
The performance of the optimized algorithm is way worse than the non-optimized approach.This is expected as another level of approximation is introduced.A reason why the performance is poor might be simply due to the relatively low number of iterations; typically far more iterations are being used in ADP.Another reason might be that the tested VFAs are not suitable to limit the number of decisions.An argument in favor is that the VFAs are not consistent .Consistent in this sense means that the estimate is always less or equal then the estimate of a reachable state plus the costs to get there ( Pearl, 1984 ).Formally: Where V n t (S t ) is the value function approximation for S t , a t the decision that leads to S t+1 , and C(S t , a t ) the costs of making decision a t .
Corridor Method .The corridor method approaches are referred to as VFA-CM n , where n denotes the size of the corridor used.Because we did not observe a significant difference with the optimized basis function approach for the small instances, we do not show the results of the corridor method here.
Benchmark Heuristic .The results of the benchmark heuristics and the most promising versions of the basis and optimized basis function (with θ 0 f = 1 , ∀ f , δ = 0 .9 and κ = 1 ) are shown in Fig. 3 .
This figure shows the number of reshuffles achieved by these algorithms on every 10th iteration after the 1th learning iteration.In addition, the number of achieved reshuffles without learning (the 0th iteration) is also included in the graph.
Looking at the results without learning iterations (i.e., n = 0 ), it can be observed that ADP is outperformed by the myopic policy.Furthermore, the resulting larger number of reshuffles causes larger computation times.Therefore, the results for VFA3 on terminal layout 2 could not be obtained.MM yields compelling results on terminal layout 1 but performs mediocre on terminal layout 2. MM-A seems to yield a slightly better solution than MM.Due to the poor performance and large computation times, the myopic approach and the 0th iteration of the VFAs are omitted in the remainder of the experiments.
Although the focus of this paper is on terminals operated by reach-stackers, we end this tuning phase of our experimental study by performing one additional experiment on a gantry crane operated terminal.For this, we use the tiny real-life instances (see Section 5.3 ), except that we use 15 bays with a single stack; this way the additional reach stacker restrictions no longer play a role, such that it coincides with the operations of a gantry crane (see explanation at the beginning of Section 5.1 ).The results of this experiment can be found in Fig. 4 .
MM outperforms the ADP approach over all iterations, coinciding with the results on Terminal Layout 1.This performance difference can be explained by the fact that MM is specifically designed for terminals operated by gantry cranes.However, the difference is quite noticeable, e.g., also when comparing Figs. 4 with 5 from the next section (although these terminals are not exactly the same in terms of stacks and bays).Further investigation reveals that this is largely the result of the additional stacking restrictions of the reach-stacker that limits its decision freedom, which in turn drastically speeds up the learning process.Hence, it is likely that differences in the gantry crane instances will diminish with increasing learning iterations.Aside from the performance differences, this experiment merely shows that the solution for gantry crane ter-  minals can be applied without changes to a terminal operated by reach-stackers.

Phase 2: Real-life instances
From the ADP approaches, VFA3-O, VFA3-CM1, and VFA3-CM2 were selected to be applied to the real-life instances.Tiny .The tiny terminal layout consists of 20 bays, each with five stacks.In total, 100 containers arrive during the planning horizon.The result of the tiny instance can be found in Fig. 5 .VFA3-O outperformed the heuristic significantly.Even after a single iteration, the algorithm already causes fewer reshuffles than the heuristics.
An interesting observation is that VFA3-OC1 and VFA3-OC2 outperform VFA3-O.One would expect that these algorithms perform worse as the number of possible decisions is reduced by the corridor.One argument for why VFA3-OC1 and VFA3-OC2 outperform VFA3-O could be the random corridor selection for inbound containers.Our hypothesis is that the random corridor forces the algorithm to make more random decisions resulting in an overall better solution.Furthermore, after ten iterations a bump can be observed in the number of reshuffles.This behavior is present at VFA3-O.However, it is in ADP that performance initially gets worse before it improves with increasing iterations.As we initialized our value function approximation with zero, the initial policy coincides with the myopic policy.After a few iterations, we have updated our value function, and our decisions will be influenced by an unreliable value function approximation, possibly resulting in higher immediate costs in anticipation of future savings that are estimated wrongly.However, with increasing iterations, our approximation improves resulting in improved performance.
In the discussion of assumption A7 in Section 3 , we argued that our approach could easily include additional stacking restrictions.To demonstrate this, we implement the restriction where some containers (e.g., reefer containers) are only allowed to be stored on specific bays (e.g., because these bays have access to power).The instances of the tiny terminal layout are modified by marking 10 out of the 100 containers as designated.These designated containers may only be stored in designated bays, for which we use 2 out of the 20 bays.
From Fig. 6 , similar results as for the regular tiny instances can be observed.Again the Adapted MM outperforms MM, and both heuristics are by far outperformed by the basis function approaches.Also, the corridor method performs better than the optimized approach.
Small .The small terminal layout consists of 50 bays, each with five stacks.In total, 250 containers arrive during the planning horizon.The results on the small instances can be found in Fig. 7 .As can be observed, both heuristics are outperformed by the ADP algorithms.Compared to the tiny real-life instances, a larger difference in the number of reshuffles can be observed (both ab-Fig.7. The average number of reshuffles of the Min-Max Heuristic (MM), the adapted MM (MM-A), optimized VFA3 using a corridor of 1 (VFA3-OC1) and optimized VFA3 using a corridor of 2 (VFA3-OC2) on the small problem instances.solute and relative).It seems that for larger problems, MM performs worse compared to MM-A.VFA3-OC1 and VFA3-OC2 outperform both heuristics and yield similar results as in the tiny instances.After the final iteration, the algorithm yields a bit over twice the number of reshuffles compared to the tiny instance (1.39 and 2.22 vs 0.60 and 0.69).
As VFA3-O could not be computed for the small instance, we ignore this VFA for the remaining real-life instances.Furthermore, as the computation times for the medium and large terminal layouts appear to be too large to perform an evaluation every 10th iteration, we do not provide figures for the performance as a function of the number of iterations, but only report the performance after the last iteration.
Medium .The medium terminal layout consists of 100 bays, each with five stacks.In total, 500 containers arrive during the planning horizon.MM results in 188.03 reshuffles and MM-A in 126.33.Again both absolute and relative differences increase.VFA3-OC1 results in 3.14 reshuffles and VFA3-OC2 results in 3.70 reshuffles.Thereby outperforming both heuristics by far.
Large .The large terminal layout consists of 200 bays, each with five stacks.In total, 10 0 0 containers arrive during the planning horizon.MM results in 370.20 reshuffles and MM-A in 229.74.Similar to the previous instances, an increase in difference is observed.However, the relative increase gets smaller with increasing problem size.VFA3-OC1 results in 6.04 reshuffles and VFA3-OC2 in 6.12.Again, both benchmark heuristics are outperformed by our ADP approach by far.
Discussion .Let us first discuss the huge difference between the performance of our ADP approach and the chosen benchmark heuristic.MM is a heuristic that is designed for a terminal operated by gantry cranes and performed well in these instances (terminal layout 1).Using the same heuristic in a different setting and comparing it to an approach that is intended for that goal, is nothing more than testing an elephant on its ability to climb a tree.To the best of our knowledge, no literature exists that explores a heuristic that is defined for a terminal using reach-stackers.Therefore, an attempt has been made to alter the heuristic in such a way that it works well in such environments.The question then remains whether the adaptation is suitable for this problem, or whether the problem is too difficult to solve with such a simple heuristic.However, in absence of an optimal solution, the heuristic still provides insights into how well the proposed ADP method works.
Next, some remarks on the generation of real-life instances.During the testing of the algorithms, the terminal starts empty.
This does not reflect real-life practices as normally there are containers left from previous weeks.Also, these instances are tested using the entire planning horizon (i.e., a week).Knowing a week in advance for all containers roughly when they arrive and depart is highly unlikely in real-life operations.Both have the effect to cause a lower number of reshuffles than in real-life.Ideally, one wants to test the algorithms in an environment as realistically as possible.This can be achieved by first having a warmup period to create an initial filling of the terminal.Simulating multiple realizations will result in multiple starting points for the week in which the algorithms are compared against each other.However, we decided not to do this for computational reasons.As a compromise, we decided to consider an extended planning horizon, such that the effects of starting with an empty terminal are diminished as containers already depart while new ones still arrive during the horizon.In addition, by considering a longer planning horizon, we showed what can be achieved if the information is known more in advance.
Finally, every real-life instance has its own terminal layout.For simplicity, decided that all bays consists of five are all reachable from both sides, and that the number of stacks equals the average number of container arrivals per week.In reality, this may be not the case and bays of different sizes and reachability may be present within a single terminal.

Practical impact
It can be concluded that, on the explored problem instances, the proposed approach yields compelling results.The next step is to apply the proposed solution to an actual terminal.For this to work, the physical terminal needs to be recreated digitally following the proposed model, and a planning horizon needs to be chosen over which the algorithm will operate.This can be done in a similar way as we did in Section 5.1 for the second experimental phase.Likely, the ADP approach itself will be used in a rolling horizon fashion in practice, e.g., by running the algorithm every day considering a multi-day planning horizon.Here, at the beginning of each day, the planned in-and outbound containers during the selected planning horizon need to be grouped into batches, satisfying the assumptions made in this paper.Next, we train the ADP algorithm on the selected horizon and use the trained model to support our decisions on where to place the inbound containers and what reshuffles to do for the outbound containers during the first day.The next day, we repeat the procedure.
The difficulty in applying the procedure described above might be satisfying the limiting assumptions made in Section 3 .For example, the known sequence of batch arrivals is quite a strong assumption.However, it should be noted that (i) those arrivals that occur roughly at the same time and for which the sequence is hard to predict might artificially be considered as one batch (see Section 3 ), and (ii) these assumptions are only used for the training phase of the ADP algorithm.The resulting trained model (i.e., the feature weights from the basis function approach) can be used more generally even though reality turns out to behave differently, as the trained model is still able to provide reasonable estimates of the value of each state.

Conclusions
This paper considered the combined in-and outbound stochastic Container Relocation Problem (CRP).The goal was to find an approach that determines real-life applicable container stack allocations resulting in a minimum number of reshuffles.The focus was on small to medium-sized terminals, e.g., inland terminals, which usually operate using a reach-stacker.We modeled the problem as a Markov Decision Process (MDP), where containers are grouped into batches based on their estimated arrival and departure times.The sequence of batches is assumed known, but the sequence of containers within a batch is not.The model enforces stacking restrictions imposed by a reach-stacker but, as shown in this paper, can also be applied to gantry crane operated terminals as well as a combination of these.
We developed two Approximate Dynamic Programming (ADP) approaches to solve the presented MDP: a lookup table approach using aggregated states and the basis function approach.Both approaches were compared to the Min-Max benchmark heuristic.First, the algorithms were tested on tuning instances.We found that for terminals operated by reach-stackers, the basis function approach resulted in 81%-97% fewer reshuffles compared to the benchmark heuristic.On terminals operated by gantry cranes, the basis function approach scored comparably.The lookup table approach did not yield compelling results.Next, the algorithms were tested on four real-life instances reflecting the day-to-day container moves of real terminals operated by reach-stackers.The smallest of these instances contains 100 inbound containers and the largest 10 0 0. As the decision space of larger instances is too big to solve within a reasonable time, the optimized basis function approach and the corridor method were introduced.We found that the optimized basis function approach could not produce results within a reasonable time; however, when combined with the corridor method, it required less computational time and resulted in 97-98% fewer reshuffles on all instances compared to the benchmark heuristic.
Future work could explore the performance of our approach under different relaxations of our assumptions.In addition, it would be interesting to see whether a further decrease in the number of reshuffles can be achieved by alleviating some of the restrictions on the decision space.Finally, it would be interesting to explore the impact on the performance of different features, compared to the nine proposed in this paper.
All problem instances and code used to achieve the results in this paper can be found at https://github.com/boschma2702/ContainerStacking .

Appendix B. Additional stacking restrictions
Even for small problem instances, the size of the decision space is too big to calculate in a reasonable amount of time.To limit the size, an additional stacking restriction is included.This stacking restriction is based on real-life practices and is best explained with an example.Imagine a bay that is reachable from both sides with a container placed on both outer ends.Although the stacks in between these containers are empty, they cannot be used as the reach-stacker cannot reach these locations.Placing containers in such a way is inefficient in terms of space usage and consequently in practice rarely done.Therefore, an artificial restriction is added that prevents such container placements.The rule makes a distinction between bays that are reachable from one way and from both ways.
For the one-way bay, first, the check is done whether the bay is empty.If the bay is empty, the container must be placed on the rightmost stack.If the bay is not empty and the container is placed on a currently empty stack, it is not allowed to have empty stacks to the right.If the target stack is non-empty, it is a valid storage location according to this restriction.
For the two-way bay, again first a check is done whether the bay is empty.If the bay is empty, the container must be placed in the middle.If the bay is not empty and the target stack is empty, either one of the following two statements must hold: 1.The stack to the right has a container and all stacks to the left are empty.2. The stack to the left has a container and all stacks to the right are empty.
Similar to the one-way bay, if the target stack is non-empty, it is a valid storage location according to this restriction.

Appendix C. Min-Max heuristic
The Min-Max Heuristic (MM), as introduced by Caserta et al. (2009) , first chooses a stack where the first departing container departs later than the current container (i.e., a stack that causes no additional reshuffles).This choice disregards empty stacks.If multiple of such stacks exist, the stack containing the earliest departing item is chosen.In case no such stacks exist and no empty stacks are present, MM chooses the stack that has the latest earlier departing item.
Let min (s ) be the earliest departing container in stack s (if s is empty, then min (s ) = C + 1 , where C is the number of containers in the initial layout).Let c be a container from stack s that needs to be reshuffled and S be the set of all stacks.Then the reshuffle location s for c is defined as: Adapted MM-Rule .The first rule of MM states that a stack is chosen where the earliest departing container of that stack departs later than the current container being handled.This rule narrows the selection of stacks down to those that do not result in a new reshuffle.This rule does not work in a terminal where containers are handled by reach-stackers.In that case, reshuffles might be introduced by neighboring stacks.Therefore a new rule is sought to replace this one.In case a bay is accessible from both sides, it is hard to define a rule that states when a container placed on a specific location will cause a reshuffle or not (i.e., a container may become blocking or unblocked from the other side).In addition, to preserve the simplicity of the algorithm, a simplified rule is used to indicate whether a container becomes blocking.
Figure C.1 shows part of a bay.The white square marked by T indicates a target location.When a container is placed at such a location, it must be accessible from the left or the right (or both).When this location is accessible from the left, the locations marked by L are checked.A container placed on the target location may become blocking when there exists a container, located on L , that departs earlier.The same check is applied when the location is accessible from the right.Then the locations marked by R are checked.Containers directly below the target location are checked regardless of the direction of accessibility (marked by B ).
This rule causes many false positives, meaning that when this rule states a container becomes blocking it might not be the case.Therefore, this rule is referred to as potential blocking as the container might not be blocking.The first rule of MM is changed by choosing a stack that does not result in a potential blocking container.

Fig. 2 .
Fig. 2. Illustration of three different states and their corresponding state aggregation.The terminal has five bays and the arrows indicate from which side these bays can be accessed.The number on the containers indicate the outbound batch number.

Fig. 3 .
Fig. 3. Performance in terms of the number of reshuffles of the various algorithms on the smaller problem instances.

Fig. 4 .
Fig. 4. The average number of reshuffles of the Min-Max Heuristic (MM), and the optimized basis function approaches on the Gantry Terminal problem instances.

Fig. A. 1 .
Fig. A.1.Illustration of a part of a bay viewed from the side in 2D.The gray squares represent containers.The white square with the letter T represents a container to be removed or an empty spot to place a container.The arrows indicate where containers are allowed to be placed for T to be reachable.

Fig. C. 1 .
Fig. C.1.Illustration of a bay viewed from the side in 2D.The gray square represents a container.The white square with a T represents a target location.The letters indicate which locations need to be checked when the target location is accessed from the left ( L ) or the right ( R ).In addition, the letter B indicates locations that need to be checked regardless of the way of accessing them.

Table 3
The average number of reshuffles after the final learning iteration for the various lookup table approaches (Single = single pass ADP, Double = double pass ADP, F = fixed stepsize, H = harmonic stepsize).

Table 4
The average number of reshuffles after the final learning iteration for the different Value Function Approximations (VFAs) of the baseline run.For terminal layout 2, no solution could be obtained with the lookup-table approach, as the program ran out of memory.