Machine learning algorithms for the problem of optimizing the distribution of parcels in time-dependent networks: the case study

Abstract: In the paper we present machine learning algorithms for the problem of optimizing the distribution of parcels in stochastic time-dependent networks, built as part of a Distribution Optimization System. The problem solved is a modified VRPTW (Vehicle Routing Problem with Time Windows) with many warehouses, a heterogeneous fleet, travel times depending on the time of departure (a stochastic time-dependent network) and an extensive cost function as the optimization criterion. To solve the problem, a modified simulated annealing (SATM) algorithm is proposed. The paper presents the results of the algorithm learning process: the calibration of input parameters and the study of the impact of the parameters on the quality of the solution (calculation time, transport cost function value) depending on the type of input data. The idea is to divide the input data into classes according to a proposed classification rule and to propose several strategies for selecting the optimal set of calibration parameters. These strategies consist in solving multi-criteria optimization tasks with four criterion functions: the length of the designated routes, the computation time, the number of epochs used in the algorithm, and the number of designated routes. A subproblem was building a network model of travel times, used in the constructed SATM algorithm to determine the travel time between recipients depending on the time of departure from the start location. An attempt has been made to verify the research hypothesis that the travel time between two points can be estimated with sufficient accuracy depending on their geographical location and the time of departure (without reference to the micro-scale, i.e. the detailed structure of the road network). The research was conducted on two types of data for Warsaw: from transport companies and from one of the Internet traffic data providers.
Learning the network model of travel times has produced very promising results, which are described in the paper.


To cite this article:
Tarapata, Z., Kulas, W., Antkiewicz, R. (2022). Machine learning algorithms for the problem of optimizing the distribution of parcels in time-dependent networks: the case study. Archives of Transport, 61(1), 133-147.

Introduction
Routing problems are among the most important in transport and logistics from the practical point of view, and in operations research and the theory of algorithms from the theoretical point of view. One of the best-studied and most useful is the Vehicle Routing Problem with Time Windows (VRPTW) (Cordeau et al., 2002; Solomon, 1987). In the project described in (Development, 2020; Tarapata et al., 2020) we developed a modified Vehicle Routing Problem with Time Windows (VRPTW) which is part of a Distribution Optimization System (Tarapata et al., 2020; Tarapata and Antkiewicz, 2021). The modification relative to the classic VRPTW is that we include many warehouses, a heterogeneous fleet, stochastic time-dependent travel times (a stochastic time-dependent network (Vidal et al., 2020)) and an extensive cost function (the total weighted cost of delivery of goods, including the costs of: permanent use of vehicles, work of vehicles resulting from the distance travelled, and work of vehicle drivers resulting from their working time) as the optimisation criterion. Many algorithms have been proposed to solve the VRPTW. One of the first algorithm classifications was made by Solomon in (Solomon, 1987). Later papers (Cordeau et al., 2002) divide these algorithms into: heuristic approximation methods with upper bounds, heuristic methods with lower bounds, and integer-programming-based solutions. The paper (Tan et al., 2001) covers four algorithms: neighbourhood search, simulated annealing, tabu search (He and Li, 2018), and genetic algorithms. The papers (Bräysy and Gendreau, 2005) present some metaheuristics. The simulated annealing method has been classified as a metaheuristic method with upper bounds (Afifi et al., 2013). Our simulated annealing algorithm (SATM) was developed based on the paper (Woch and Łebkowski, 2009). We have extended the algorithm with several new elements: a heterogeneous fleet, modification of the search for the initial solution, different methods of cost calculation, a time-dependent travel network, parallelization, and others. The algorithms have been tested on the Solomon and Gehring-Homberger benchmarks (VRPTW, 2021), as well as on real data from courier companies. Our paper discusses two important problems that often occur when solving the VRPTW: calibration of the parameters of the randomized search algorithm (in our case: simulated annealing, SATM) and construction of the time-dependent travel time matrix, which is the basis for determining the transit time between two recipients in algorithms solving the VRPTW. Knowing the travel time between two recipients makes it possible to calculate the time of travel to each recipient and determine whether we fit into a given (required) time window. The paper is organized as follows. In section 2 we present the learning method for constructing a model of the network of travel times. In section 3 we present the learning method of our simulated annealing (SATM) algorithm for solving the VRPTW, which consists of an optimal selection of calibration parameters. Finally, we conclude the results of our research and indicate problems for future work.

Modelling of the network of travel times
One of the subproblems in time-dependent VRPTW is constructing a network model of travel times, used to determine the travel times between recipients depending on the time of departure (time-dependent network). The network of travel times is a set of locations to be visited (vertices of the network) together with the travel times between them (edges of the network and their weights). The problem of constructing the network model of travel times can be considered as estimating the parameters of the road network (mainly the travel times of network sections) based on GPS data from the monitored vehicles. The problem can be divided into three phases: (1) collecting GPS data from vehicles (location, speed, azimuth, etc.); (2) "tying" the travel data to road sections (so-called map matching (Chen et al., 2014)); (3) estimating travel times for sections on the basis of the data collected on section travel times (speeds); here we use proprietary solutions supported by the latest achievements in this area, described in (Bertsimas et al., 2019; Karurukan et al., 2018; Kim et al., 2021; Lan et al., 2019; Shi et al., 2017; Tan et al., 2001). In solving this problem, a number of specific problems arise (Chowdhury et al., 2017): processing of mass data (Big Data) from a large number of vehicles, for a large number of network sections; estimation of time parameters for each network section (network learning), or multidimensional classification of network sections or sub-areas forming virtual network vertices and estimation of parameters for the segment/edge categories of the directed graph modelling the network of recipients (the travel time of a section is a random variable whose distribution and parameters need to be estimated (Juhasz et al., 2017)); problems of efficiency and computational complexity of processing large and fast-changing data sets; and many others. Therefore, we try to verify the research hypothesis (one of the propositions was presented in (Fulton, 2015)) that the travel time between two points can be estimated with sufficient accuracy depending on their geographical location and the time of departure (without referring to the micro-scale, i.e. the detailed structure of the road network). The research was carried out on two types of data for Warsaw: from transport companies and from one of the road traffic data providers operating on the Internet. We have also conducted our research for New York (travel data from a taxi company) and the results are very interesting (Development, 2020). The results of the research (learning based on regression models) were promising. They can be used in the next stage to improve the quality (accuracy) of the developed models and their learning methods.

Reduction method of number of time intervals
One of the objectives of the research was to reduce the number of time intervals (time windows) for which the network matrices of travel times are built, without too great a loss of quality of the travel time reasoning models. The travel time network matrix is built for a set of (origin-destination) pairs of points between which the travel time must be estimated. This time depends on the month, the day of the week and the moment of starting the travel from the origin point, so for a given set of (origin-destination) pairs of points we must have as many matrices as there are travel start moments (more precisely: time intervals of starting the travel, assuming that for each start moment within a given interval the travel time between a fixed pair of points is identical). If we divided the day into 15-minute intervals, we would have 4×24=96 time intervals for which we would have to estimate the time matrix for each (origin-destination) pair of points, for each separate weekday! The idea is to reduce the number of these time intervals. The detailed aim of the research was to show in which travel start time intervals the travel times between a fixed pair of points are similar, and in which they differ significantly. It was therefore a matter of grouping (clustering) the start time intervals by similar travel times. Then, for each group of time intervals, one model can be built to estimate the travel time. This is important because, as shown in Table 2, the learning time of travel time network models increases exponentially with the number of groups of time intervals, while the quality of the models (average percentage error of the travel time estimation) changes only slightly. There were several dozen (m) pairs of points (origin-destination) in Warsaw, and for them Google Maps calculated the fastest driving times starting from the origin point in 15-minute intervals, for the selected day (20.03.2019, Wednesday). For routes calculated in the past or in the future, Google Maps gives a travel time interval [time_min; time_max], so for each pair of points a time interval was obtained for each departure (start) time of the travel: [time_min; time_max]. Clustering was calculated for the m routes (each vector had dimension m·2: time_min_i and time_max_i, i=1,...,m, and there were 96 vectors). The k-means method was used to group the time windows, with Euclidean distance as the distance measure. The results of clustering are presented in Fig. 1.
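The clustering step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 96 departure-time vectors are synthetic (a hypothetical daily traffic profile), and a plain Lloyd's-algorithm k-means with Euclidean distance stands in for whatever k-means variant was actually used.

```python
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means (Lloyd's algorithm) with Euclidean distance."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute centers; keep the old center if a cluster emptied
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# 96 departure-time intervals, each described by a 2*m vector:
# [time_min_1, ..., time_min_m, time_max_1, ..., time_max_m] over m routes
m = 30                                   # number of origin-destination pairs (illustrative)
rng = np.random.default_rng(1)
base = 10 + 15 * np.abs(np.sin(np.linspace(0, 2 * np.pi, 96)))[:, None]  # daily profile
X = np.hstack([base + rng.uniform(0, 3, (96, m)),     # time_min per route
               base + rng.uniform(3, 8, (96, m))])    # time_max per route

labels, centers = kmeans(X, k=7)   # group the 96 start-time intervals into 7 clusters
```

Each of the 7 clusters then gets a single travel time model, instead of one model per 15-minute interval.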

Assumptions for modelling network of travel times and performance evaluation
A simple linear regression model was used to determine the relationship between the travel time and the geographical distance between the origin-destination points: ŷ = α0 + α1·x, where: ŷ - estimated travel time (explained variable) [min]; x - geographical distance between origin and destination (explanatory variable) [km]; α0, α1 - model parameters.
As a measure of the regression error, mainly the Mean Absolute Error (MAE) was used, which is the average of the absolute differences, over all pairs of points, between the given value y (travel time) and the estimated value ŷ, i.e. the travel time (in minutes) calculated from the regression model: MAE = (1/n)·Σ|y_i − ŷ_i|, where n is the number of data points. The Mean Absolute Percentage Error, MAPE = (100%/n)·Σ|y_i − ŷ_i|/y_i, was also used. Wherever possible, the coefficient of determination R² was also calculated.
The study was conducted for several time windows (clusters) in the day (see Fig. 1), for one day of the week (Wednesday), without division of Warsaw into zones (see Fig. 2a) and with division into zones (Fig. 2b). There were over 270 000 learning data points (pairs of start (origin) and stop (destination) points randomly selected from Warsaw and its neighbourhoods (see Fig. 2), at different times of the day, with a probability distribution depending on the time of day: more points at peak times, fewer points at off-peak times; on average 17 origin-destination pairs for each pair of zones from Fig. 2b). For these pairs of points, the fastest route was calculated using the Google Maps API (GoogleMaps, 2021). For each origin-destination pair and each travel start moment, a travel time interval of the fastest route was obtained from Google Maps: [time_min; time_max]. We do not know the probability distribution of this travel time, so we assume that it is normal with expected value avg_time = 0.5·(time_min + time_max) and standard deviation 1/6·(time_max − time_min). For further calculations we assume that the travel time of the fastest route equals avg_time.
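The interval-to-distribution assumption and the error measures above can be written out directly. This is a sketch of the stated formulas only (the sigma = range/6 choice follows the six-sigma convention that the interval covers nearly all of the assumed normal distribution); function names are ours, not from the paper.

```python
import numpy as np

def interval_to_normal(time_min, time_max):
    """Map a Google-Maps-style travel time interval [time_min, time_max]
    to an assumed normal distribution: mean = midpoint, sigma = range/6."""
    mean = 0.5 * (time_min + time_max)
    sigma = (time_max - time_min) / 6.0
    return mean, sigma

def mae(y, y_hat):
    """Mean Absolute Error, in the units of y (here: minutes)."""
    return np.mean(np.abs(np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)))

def mape(y, y_hat):
    """Mean Absolute Percentage Error, in percent."""
    y = np.asarray(y, dtype=float)
    return 100.0 * np.mean(np.abs(y - np.asarray(y_hat, dtype=float)) / y)

mean, sigma = interval_to_normal(18.0, 30.0)   # avg_time = 24.0 min, sigma = 2.0 min
print(mean, sigma)
print(mae([10, 20], [12, 19]), mape([10, 20], [12, 19]))
```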

Analysis of the learning results of the travel time network models
Based on the determined groups (clusters) of time intervals (Table 1) and data on travel from one of the Internet providers (Google Maps), the learning of travel time network models for Warsaw and its neighbourhoods was carried out. The results of the comparison of the quality of the learned regression models (research strategies), with and without zoning, are presented in Table 2 and they are very promising. The study strategies described in Table 2 mean, respectively (strategy number - number of zones - number of clusters of time intervals): 1-11-1; 2-11-1; 3-11-1; 4-11-7; 5-11-14; 6-55-1; 7-55-4; 8-55-7; 9-55-14. On the basis of the data contained in Table 2, it can be observed that the travel time between two points can be estimated with sufficient accuracy depending on their geographical location (i.e. the geographical distance between them) and the time of departure (without reference to the micro-scale, i.e. the detailed structure of the road network). The model without learning, without dividing Warsaw into zones and without considering the moment of leaving the starting zone gave an average percentage error of travel time estimation (MAPE) of about 43%, which was unacceptable. The learning of the model consisted in the fact that for subsequent research strategies (each research strategy extended the model by dividing the area into 55 zones (Fig. 2b) and selecting the number of groups (clusters) of travel start time intervals), new parameters of the model were estimated in order to minimize the average absolute error of estimating the travel time (55·55·Number_of_clusters (1, 4, 7 or 14) regression models were learned, i.e. for each pair of zones and for each group of start time intervals, a model of the travel time between these zones, depending on the geographical distance between the start and destination points and on the start time interval, was learned). Table 2 also shows the dependence between the average absolute error of the travel time models and the geographical distance between the starting and stopping points of the travel in Warsaw and its surroundings, considering the research strategy. The best results were achieved with division into 55 zones and 14 clusters of travel start time intervals (MAPE=13%), but already for 7 clusters of travel start time intervals (as in Table 1) satisfactory results were achieved (MAPE=14.5%) and the learning time of the model was 6 times shorter (about 7 hours instead of 44). This means that the average percentage error in estimating the travel time between any pair of points in Warsaw and its vicinity, depending on the geographical distance between them, was 14.5%. Moreover, 75% of estimation errors were less than 18% and 95% of estimation errors were less than 41%. The model was verified on 300 real travels (Uber data) in Warsaw from the period IX.2016-III.2020, and we obtained MAPE=23% (too little data, and in addition only between selected zone pairs). We can iteratively improve the parameters of the models by solving an optimization problem; after solving it, we reduced the MAE from 6.95 to 3.7 and the MAPE from 23% to 13.7%. An interesting additional observation from the research is that the average length (in km) of the shortest-time route in Warsaw (despite roads of different categories and with different speed limits) between any pair of start/origin and stop/destination points is strongly correlated (the correlation coefficient is about 0.91) with the geographical distance between these points and is expressed by the formula: length [km] = 1.66 + 1.36·Geogr_distance [km].
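The scheme of learning one linear model per (zone pair, start time cluster) can be sketched as below. The training records are synthetic (a hypothetical ground-truth model plus noise), the zone and cluster counts are reduced for brevity, and `estimate_time` is an illustrative name, not the authors' API.

```python
import numpy as np

# Hypothetical training records: (origin_zone, dest_zone, time_cluster,
# geographical distance [km], observed travel time [min]).
rng = np.random.default_rng(0)
records = []
for oz in range(3):
    for dz in range(3):
        for tc in range(2):
            dist = rng.uniform(1, 20, 50)
            # synthetic ground truth: 5 min intercept, slope grows with cluster id
            t = 5 + 1.5 * (1 + 0.2 * tc) * dist + rng.normal(0, 1, 50)
            records += [(oz, dz, tc, d, y) for d, y in zip(dist, t)]

# One simple linear model  t_hat = a0 + a1 * distance  per (zone pair, time cluster)
groups = {}
for oz, dz, tc, d, y in records:
    groups.setdefault((oz, dz, tc), ([], []))
    groups[(oz, dz, tc)][0].append(d)
    groups[(oz, dz, tc)][1].append(y)
# np.polyfit returns coefficients highest degree first: [a1, a0]
models = {key: np.polyfit(d, y, deg=1) for key, (d, y) in groups.items()}

def estimate_time(origin_zone, dest_zone, time_cluster, distance_km):
    a1, a0 = models[(origin_zone, dest_zone, time_cluster)]
    return a0 + a1 * distance_km

est = estimate_time(0, 1, 0, 10.0)   # approx. 5 + 1.5*10 = 20 min for this synthetic data
```

The practical point is the lookup structure: at query time only the zone pair, the start time cluster and the geographical distance are needed, never the road network itself.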

Learning of simulated annealing (SATM) algorithm for VRPTW problem solving

General idea of SATM algorithm
Our simulated annealing algorithm (SATM) was developed based on the paper (Woch and Łebkowski, 2009). We have extended the algorithm with several new elements: a heterogeneous fleet, modification of the search for the initial solution, different methods of cost calculation, a time-dependent travel network, parallelization, and others. The SATM algorithm is described in detail in (Development, 2020; Kulas and Tarapata, 2021); in this paper, the proposed method for the optimal selection of calibration parameters is the most important. The initial step is to find an initial (of course acceptable) solution. It is advisable that this should be a fairly good solution obtained quickly. Typically, a modified Solomon algorithm is used (Solomon, 1987). Then the search for the global extremum begins. Initially, the algorithm tries to search the whole space of acceptable solutions, trying to land near the global extremum. During the calculations, the tendency of the algorithm to leave local extrema is gradually reduced (the so-called system temperature decrease - annealing): the longer the calculation takes, the more likely the current local extrema are also global. For each fixed temperature, the search for the best solution is repeated by determining an acceptable "adjacent" solution and performing a simple post-optimisation of the routes (this is another of our extensions of the standard algorithm). If the new solution is better, it is accepted. If it is worse, it is accepted with a probability depending on the distance from the best solution so far and on the degree of annealing (the greater the distance and the lower the temperature, the lower the probability). Finally, the conditions for completing the calculation for the fixed temperature and for annealing are checked (see Table 3). The key issues for the algorithm are: determination of the initial solution, the strategy for finding an adjacent solution, and the annealing strategy. The key calibration parameters, used for example in the paper (Woch and Łebkowski, 2009), are the initial temperature TZERO, the type and parameters of the annealing function A(t) (LOG), the number of epochs (EPOCH) and the number of iterations (ITER); the list of all calibration parameters is presented in Table 4 (parameters with id from 5 to 17 are additionally proposed by us in our SATM algorithm). Some recommendations for setting the initial values of the calibration parameters are presented in (Development, 2020; Woch and Łebkowski, 2009) and in the 4th column of Table 4. The determination of the first (initial) solution is based on the proposal in (Woch and Łebkowski, 2009). The parallelization of the SATM algorithm was developed considering the conclusions from the paper (Czech and Czarnas, 2002) and the research from the paper (Wieczorek, 2011).
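The loop described above can be summarized in a skeleton. This is a generic simulated annealing sketch under our own assumptions (geometric cooling, Metropolis acceptance), not the full SATM algorithm: the route post-optimisation, the Solomon-style initial solution and the parallelization are all omitted, and the parameter names merely echo Table 4.

```python
import math
import random

def satm_sketch(initial, neighbour, cost, t_zero=100.0, delta=0.95,
                epochs=50, iters=100, seed=0):
    """Skeleton of a simulated-annealing loop: geometric cooling
    t <- delta * t, and worse moves accepted with probability
    exp(-(cost difference) / t)."""
    rng = random.Random(seed)
    current, best = initial, initial
    t = t_zero
    for _ in range(epochs):
        for _ in range(iters):
            cand = neighbour(current, rng)
            d = cost(cand) - cost(current)
            # better solutions always accepted; worse ones with decaying probability
            if d <= 0 or rng.random() < math.exp(-d / t):
                current = cand
            if cost(current) < cost(best):
                best = current
        t *= delta   # annealing: system temperature decrease
    return best

# Toy demo: minimise a 1-D cost with several local minima.
cost = lambda x: (x - 3) ** 2 + 2 * math.sin(5 * x)
neighbour = lambda x, rng: x + rng.uniform(-0.5, 0.5)
best = satm_sketch(0.0, neighbour, cost)
```

In the real algorithm, `neighbour` produces an adjacent route plan and `cost` is the extensive transport cost function; the calibration question is exactly how to set `t_zero`, `delta`, `epochs` and `iters`.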

A method to optimise the selection of calibration parameters
In our research we use the Gehring-Homberger benchmarks (VRPTW, 2021), and we define input data characteristics for the VRPTW problem to classify the type of data, see Table 5. Each input data file has a set of deliveries (customers), their geo-coordinates (to calculate distances between them), demands and service time windows. Values of the input data characteristics for a few exemplary Gehring-Homberger benchmarks are presented in Table 6. The number of deliveries #DELIV = #WIND. We define the type of input data and describe it by xyz, where x is the degree of clusterization of deliveries, y describes the relation between the numbers of short and long service windows, and z the relation between the numbers of very short and medium-length time windows. For example, 201 describes a type of input data with strong clusterization of deliveries (2), the advantage of short service windows over long ones (0) and the advantage of medium-length time windows over very short time windows (1). The criteria for assessing the quality of a VRPTW solution obtained with the SATM algorithm are as follows: c1 - total length of routes from the calculation; c2 - calculation time in seconds; c3 - the number of the epoch of the calculation that already gave the best result; c4 - number of routes from the calculation. We use the following notation: N - set of input data ids for the VRPTW problem, N = {1, 2, ..., n}; S_tid - set of input data types, S_tid = {xyz: x∈{0,1,2}, y∈{0,1}, z∈{0,1}}, xyz calculated using (2)-(4); f_tid - a function that returns the type of an input data id, f_tid: N→S_tid; M - set of identifiers of sets of calibration parameters, M = {1, 2, ..., m}; SCP = {scp_1, scp_2, ..., scp_m} - tested sets of calibration parameters, where scp_j = ⟨p_1, p_2, ..., p_17⟩, j∈M, describes the vector of values of the calibration parameters from Table 4; c_k(i,j) - value of criterion c_k (k=1,...,4: total length of routes, calculation time, best-result epoch number, number of routes) for the i-th input data using the j-th set of calibration parameters, i∈N, j∈M. Because the ranges of the criteria values may differ (for example, a typical value of c4 is up to several dozen while a typical value of c1 is up to several tens of thousands), we calculate normalized values of the criteria c_k, k=1,...,4: c̄_k(i,j) = (c_k^max(i) − c_k(i,j)) / (c_k^max(i) − c_k^min(i)), where c_k^max(i) = max_{j∈M} c_k(i,j) and c_k^min(i) = min_{j∈M} c_k(i,j). Because the optimal value of c̄_k(i,j) equals 1 for each k=1,...,4, we can define the mc function as the Euclidean distance between the optimal and current values of the criteria: mc(i,j) = sqrt(Σ_{k=1}^{4} (1 − c̄_k(i,j))²). We define a few strategies to recommend a set SCP_i of calibration parameters for the i-th input data. Strategy S1 selects, for the i-th input data, the set SCP_i of calibration parameters for which the value of the mc function is minimal; using strategy S2 we select the subset SCP_i of calibration parameter sets for which the value of the mc function is at most 10% worse than the best value; using strategy S3 we determine the subset SCP_i of calibration parameter sets for which, for each criterion, the normalized value of that criterion is not less than 90% of its optimal value (recall that after normalisation the optimal value of each criterion equals 1). Let us note that we solve the problem of multi-criteria optimization of the selection of a set of calibration parameters, and the presented strategies are methods to solve this problem.
Considering the presented explanations, the procedure of recommending a set of calibration parameters for the input data types is presented in Table 7.
Interpretation: a_pr is the number of input data which are of type r and for which the set p of calibration parameters was recommended; b_pr = a_pr / Σ_p a_pr. The value b_pr·100% can be interpreted as the percentage of input data of the r-th type for which the p-th set of calibration parameters gave the best results. Another interpretation: when we solve the VRPTW problem using the SATM algorithm and have input data of type r, we should use the set p of calibration parameters with frequency ("probability") b_pr in order to obtain the best solution (depending on strategy S) from the point of view of the criteria c1, ..., c4.
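The a_pr counts and b_pr frequencies can be computed with a few lines. The recommendation list below is hypothetical (invented type/set pairs purely for illustration); only the counting scheme follows the text.

```python
from collections import Counter

# Hypothetical recommendations: for each input data, its type r ("xyz")
# and the recommended calibration-parameter set p (e.g. from strategy S1).
recommended = [("000", 4), ("000", 5), ("000", 4), ("101", 4), ("101", 1)]

a = Counter(recommended)                       # a[(r, p)] = a_pr
totals = Counter(r for r, _ in recommended)    # number of inputs of type r
b = {(r, p): n / totals[r] for (r, p), n in a.items()}   # b_pr = a_pr / sum_p a_pr

# e.g. b[("000", 4)]: fraction of type-"000" inputs best served by set 4
```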

Analysis of results
We tested n=36 Gehring-Homberger benchmarks (for #DELIV: 200, 400 and 600) for different sets of calibration parameters (m=160, see Table 4, column 4), with THREADS=5 (we have also conducted calculations for different numbers of threads, but this is not described in this paper). Criteria c3 and c4 were calculated but not considered in this analysis (i.e. c3(i,j)=c4(i,j)=0, i∈N, j∈M). All calculations were conducted on a computer with an Intel® Core™ i7-9700 processor (8 cores) and 16 GB RAM (relevant for the calculation time).
In Table 8 the best sets SCP of calibration parameter values are presented. Fig. 3 shows the "probability" distribution of using the set ids of calibration parameters from Table 8 for different input data types ("xyz") in order to obtain the best solution for the two criteria c1, c2 (strategy S1). In Fig. 3 and Table 9 we present the values, or visualizations, of the calculated matrices B for strategies S1, S2 and S3. This analysis shows that the best solutions were obtained for the 4th and 5th calibration parameter set ids. For example (see Table 9), in strategy S2, for the "000" input data type we obtained the best results for solving the VRPTW problem for (0.35 + 0.50)·100% = 85% of input data using the 4th and 5th calibration parameter set ids from Table 8. In turn, calibration parameter set id=1 gave the best results for the smallest number of input data. A very interesting result was obtained for strategy S3: for the subset IDT = {100, 101, 110, 201} of input data types, the best set of calibration parameters is only the set with id=4. It means that for 100% of input data of a type belonging to the IDT subset, we obtain a single set of calibration parameters for which the normalized values of criteria c1 (total length of routes) and c2 (calculation time) are not less than 90% of their optimal values.

Conclusions
The article discusses two problems that often occur when looking for a solution to a VRPTW problem: calibration of the parameters of the random solution search algorithm (in our case: simulated annealing, SATM) and construction of the time-dependent travel time matrix which is the basis for determining the transit time between two recipients in algorithms solving the VRPTW problem.
Because the SATM algorithm (like the whole group of randomized algorithms) is very sensitive to the values of the calibration parameters (especially when it comes to calculation time, but also to the quality of the designated routes), the next research step should be the typing (grouping) of calibration parameters and performing many more tests (many more learning sets). As we did in the case of the definition of input data types, the further aim of the research will be to divide the sets of calibration parameters into types and, ultimately, to automatically select the type of calibration parameters when starting the algorithm, based on the input data type.
A very interesting problem is the selection of the calibration parameters of the algorithm so as to obtain the best possible solution within a maximum set time, for example 5 minutes. This is very important from a practical point of view because we want to quickly have a solution acceptable from the point of view of the desired criteria. This can be the next step of research. The advantage of our proposal to construct a network travel time matrix is that, to estimate the travel time between a pair of customers, we only need their geographical locations and the moment they leave the warehouse. Having once learnt the regression models of the dependence of travel times on distance in a given area (e.g. Warsaw), we can use them many times for different groups of customers from this area. Normally, for a new set of customers to be visited, we must recalculate/receive the time matrix between each pair of them each time.
We are not able to predict in advance for which customer pairs we will need these travel times next time.
Moreover, concerning the construction of the network of travel times, we conclude that zoning has a positive impact on the quality of travel time estimates based on distance and departure time. The results of the additional research (Development, 2020) allowed us to conclude that the absolute error of estimation for 6x6 zones in relation to 5x5 zones decreased by about 15%. Generally, the conclusion is that the denser the road infrastructure (e.g. large cities), the smaller should be the size of a single zone (square); the rarer this structure (e.g. out-of-town areas), the bigger should be the size of the square (multi-resolution model, e.g. quadtree). The number of clusters of the travel start time intervals also influences the quality of the estimation of the travel time network models; in the conducted studies, trip data from one day of the week (Wednesday) were used. Other days of the week and months should be included in the "clustering". In this way we would achieve full year-round coverage and the model would be even more accurate. Moreover, reliable learning data (travel times between points, depending on the travel start time) for the construction of the network of travel times should be provided. We had data on real travels from several courier companies, but these data were unreliable (too many so-called unusual observations, difficult to filter out, which influence the built regression models), so we used Google Maps data. As a result of the project described in (Development, 2020; Tarapata et al., 2020) we have also obtained a Smart Tracking Device (IUM), and based on historical data derived from routes and IUMs installed in many vehicles, we plan to learn the network of travel times anew. The presented suggestions and comments should be a basis for further research.

Table 4 .
Type of calibration parameters in SATM algorithm and their interpretation (id, name, interpretation, default value):
- TZERO - initial temperature
- EPOCH - number of epochs (fixed for one thread; with more threads it should be increased)
- ITER - repetition of the search in one epoch (should increase with the size of the task - exponentially / polynomially)
- LOG - annealing function: 0 - geometric, 1 - logarithmic (practically only the geometric function is used); 0
- 5 TAU - weighting factor of the criterion for the total duration of deliveries; 0
- 6 ALFA - weighting factor of the criterion for a given number of routes; 10000
- 7 BETA - weighting factor of the criterion for the total length of routes; 1
- 8 GAMMA - cost (penalty) for using a vehicle exceeding the limit; 1000000
- 9 ETA - cost (penalty) for assigning a customer to a vehicle exceeding the limit; 1000
- 10 DELTA - factor of the system's temperature change function (determines the speed of narrowing the search area)
- ... - the radius of the circle in which the neighbours of a solution are searched
- ... - at the end of the epoch, the best of the existing solutions - minimal (MIN != 0); 1
- 17 THREADS - number of calculation threads (for parallelization)

Table 2 .
Results of comparison of the quality of regression models (research strategies) with and without zoning; (3) - estimated model without zoning and without clusters of time intervals: ŷ = 10.96 + 2.08·x.

Table 3 .
Pseudo-code of the SATM algorithm

Table 5 .
Input data characteristics for VRPTW problem

Table 7 .
Procedure of recommendation of a (best) set of calibration parameters for the input data types: 1. Run the SATM algorithm for each input data i∈N, for each set of calibration parameters j∈M;

Table 9 .
Matrices B for strategies S2 and S3