Cross-Temporal Forecast Reconciliation at Digital Platforms with Machine Learning

Platform businesses operate on a digital core, and their decision making requires accurate, high-dimensional forecast streams at different levels of cross-sectional (e.g., geographical regions) and temporal aggregation (e.g., minutes to days). It also necessitates coherent forecasts across all levels of the hierarchy to ensure aligned decision making across different planning units such as pricing, product, controlling and strategy. Given that platform data streams feature complex characteristics and interdependencies, we introduce a non-linear hierarchical forecast reconciliation method that produces cross-temporal reconciled forecasts in a direct and automated way through the use of popular machine learning methods. The method is sufficiently fast to allow forecast-based high-frequency decision making that platforms require. We empirically test our framework on unique, large-scale streaming datasets from a leading on-demand delivery platform in Europe and a bicycle sharing system in New York City.


Introduction
Time series to be forecasted are oftentimes naturally part of a hierarchical structure, where higher-frequency and granular series are added together to form lower-frequency aggregated series. Separate forecasts of each series rarely conserve this hierarchy, and forecast reconciliation methods are therefore required. In this paper, we consider novel forecast reconciliation methods for on-demand delivery platforms that require forecasts at different levels of cross-sectional and temporal aggregation. We introduce non-linear forecast reconciliation based on popular machine learning methods to capture the complex interdependencies of platform data.
Platforms such as Uber, Lyft, GrubHub, UberEats or DoorDash are nowadays omnipresent in the global economy. They operate at high frequency and on a large number of verticals (regions, product categories, etc.). Indeed, their marketplace is typically split into different geographical regions, so a natural cross-sectional aggregation scheme arises from many individual delivery areas over zones towards a few marketplaces. Moreover, a temporal aggregation scheme naturally arises since, on the granular end, fast operational decisions (think in terms of minutes) are needed to ensure the platform's service couriers are at the right time and location to serve consumer demand promptly, and to determine compensation schemes for couriers through dynamic pricing. On the coarser end, strategic business decisions also require long-term planning, since the budget available for each delivery area is typically set using daily demand forecasts. Accurate and coherent, i.e. reconciled, demand forecasts across all levels of the cross-sectional and temporal hierarchy are therefore key to the business' success and to support aligned decision making across different planning units.
The literature on forecast reconciliation has been pioneered by Hyndman et al. (2011) and Athanasopoulos et al. (2009); see Hyndman and Athanasopoulos (2021) for a textbook introduction and Hyndman and Athanasopoulos (2014) for an introduction for practitioners given the strong interest from industry. Typical initial applications concern tourism, where demand series are subject to a cross-sectional hierarchy across different geographical regions, though many other cases appear more recently; see for instance Caporin et al. (2023) for a financial application to realized volatility forecasting. While forecast reconciliation methods initially focused on producing coherent forecasts across cross-sectional hierarchies, the literature on forecast reconciliation in the temporal direction (Athanasopoulos et al., 2017) as well as combinations of both in cross-temporal frameworks (Kourentzes and Athanasopoulos, 2019; Di Fonzo and Girolimetto, 2023a,b) is much sparser (Petropoulos et al., 2022), though in recent years it has proliferated. We refer to Athanasopoulos et al. (2023) for an extensive review on forecast reconciliation in cross-sectional, temporal and cross-temporal hierarchies, to Hollyman et al. (2021) for the link with the forecast combination literature, and to Athanasopoulos et al. (2024) for recent innovations in hierarchical forecasting.
Methodological considerations for forecast reconciliation are discussed in Hyndman et al. (2016), whereas theoretical results are first established by Wickramasuriya et al. (2019), who put forward an optimal linear forecast reconciliation method minimizing the sum of forecast error variances across the collection of time series; see also Van Erven and Cugliari (2015) for a game-theoretic approach to forecast reconciliation and Panagiotelis et al. (2021) for a unification of the former two works. Traditionally, forecast reconciliation is thus typically achieved by linearly combining base forecasts of all the time series in the hierarchy.
Recently, however, machine learning algorithms have been applied to reconcile forecasts in a non-linear way. Spiliotis et al. (2021) use tree-based methods that non-linearly combine the available base forecasts and selectively combine them in a direct and automated way, without requiring that the information in the entire hierarchy be used for producing reconciled forecasts. Anderer and Li (2022) use deep learning and LightGBM tree-based methods to propose a variant of the bottom-up method proposed by Spiliotis et al. (2021), accounting for biases that are difficult to observe at the bottom level. The latter forecasts are adjusted to obtain higher forecasting accuracy on the upper levels of the hierarchy. Their approach works well on the M5 competition dataset consisting of over ten thousand hierarchical retail sales time series (see Makridakis et al., 2022c and Makridakis et al., 2022a for details). Abolghasemi et al. (2022) use tree-based methods and lasso regression in their dynamic top-down bottom-up method applied to fast-moving consumer goods and the M5 competition dataset. Finally, a new stream of research uses neural networks for both forecasting and reconciliation; examples of such end-to-end modeling approaches are Theodosiou and Kourentzes (2021) and Wang et al. (2022), which bring promising evidence on various datasets.
Our paper also proposes an ML-based reconciliation approach, extending the work of Spiliotis et al. (2021) in the following ways. From a methodological point of view, we generalize their ML-based forecast reconciliation method for cross-sectional hierarchies to cross-temporal hierarchies. Additionally considering reconciliation across the temporal direction requires important design choices with respect to the ML model used for reconciliation, since the inputs consist of mixed-frequency time series, namely all temporal frequencies in the temporal hierarchy. Besides, since the ML algorithm is a key backbone of our forecast reconciliation approach, we carefully investigate the effect of (i) the ML algorithm used, thereby considering not only random forest and XGBoost as in Spiliotis et al. (2021) but also LightGBM due to its successful application in the M5 competition; and (ii) tuning hyperparameters, to make careful considerations on the trade-off between improvements in forecast accuracy and computational time. Finally, we consider an extensive set of base forecasts, ranging from academic standards to industry standards (for our platform application) as well as a combination of several base forecast methods.
From an application viewpoint, we showcase the performance of our forecast reconciliation method on platform demand data streams, thereby bringing novel insights into the performance of forecast reconciliation methods on a modern and unique dataset. We use UK demand data (September 2022-September 2023) from a last-mile logistics platform where the bottom-level series are 30-minute demand series for 141 delivery areas in London and in Manchester, their two largest markets. We aggregate to hourly and daily frequency in the temporal dimension, and towards zone and market level in the cross-sectional dimension.
Platform data have several features that are distinct from the traditional datasets considered in the forecast reconciliation literature. First, they are high-frequency: our top, hence lowest-frequency, level (i.e. daily) corresponds to the bottom level considered in earlier work such as the recent M5 competition (other works even consider lower-frequency settings such as monthly to yearly data). Indeed, computational speed and fast decision making at digital platforms are key to their success, thereby requiring forecasts at the intra-day level.
A second distinctive feature of platform data is their streaming character, which means that they face non-stationarities or shifts, for example due to expansion, competition among platforms, sales of certain clients, changing customers' preferences or non-regular events such as sports events. The data on Manchester highlight the performance of the ML forecast reconciliation method in the presence of data shifts. We compare our approach with state-of-the-art reconciliation benchmarks from the recent literature.
We obtain several insights based on our application. First, our main empirical finding is that ML-based forecast reconciliation for platform data can result in substantial forecast accuracy improvements compared to existing linear reconciliation methods. However, ML-based reconciliation is not uniformly superior to linear methods. In fact, for the most important series, i.e. the high-frequency delivery area level, random forest based reconciliation typically performs best, while for city-level series LightGBM based reconciliation yields the smallest forecast errors. On the other hand, linear reconciliation methods outperform XGBoost based reconciliation on the bottom-level series. The latter result holds for XGBoost without tuning its hyperparameters to keep computing time low. However, ignoring the huge computational cost, we also study the impact of hyperparameter tuning of the ML algorithms and find, for example, that XGBoost then performs at par with random forest on the lowest levels of the hierarchy. A second empirical finding is that ML-based reconciliation seems to work equally well irrespective of the quality of the base forecasts. Hence, exploring a large model space for obtaining base forecasts is unnecessary. Finally, data streams are prone to data shifts, so we compare the forecast reconciliation methods on the Manchester market, which is impacted by a major shift in our evaluation period. The results show that, while the forecast performance of all methods suffers during a data shift, ML-based reconciliation is able to revert quickly back to pre-shift forecast accuracies.
The rest of the paper is organized as follows. Section 2 introduces the platform data we analyze in this paper. Section 3 presents our ML-based forecast reconciliation approach for cross-temporal hierarchies. Section 4 describes the forecast set-up we use to evaluate the out-of-sample forecast performance of our ML-based reconciliation approach versus state-of-the-art benchmarks, whereas Section 5 presents the results. Finally, Section 6 concludes.

Platform Data
We use a unique dataset of demand for deliveries from a leading on-demand last-mile logistics platform in Europe, Stuart, which connects businesses to a fleet of geographically independent couriers.[1] We analyze UK London demand data at the 30-minute frequency from September 5, 2022 to September 24, 2023.[2] To efficiently organize parcel deliveries, London is split into 18 zones which are constituted by the 117 bottom-level delivery areas. Planning at the delivery area level is of particular importance because it determines the remuneration paid to the couriers and therefore directly helps to optimize the platform's efficiency.
Deliveries are carried out on a daily basis between 7am and 11:30pm in all areas.[3] This creates a balanced dataset of 13,090 observations for each of the 136 (= 1 + 18 + 117) time series at the 30-minute frequency. For business planning purposes, the series also require temporal aggregation to the hourly and the daily frequency, thereby resulting in temporally aggregated time series that consist of 13,090/2 = 6,545 and 13,090/34 = 385 observations respectively. The forecast reconciliation therefore consists of three cross-sectional and three temporal layers, implying 408 base time series.
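The hierarchy counts above can be verified with a quick arithmetic sketch (the numbers are taken directly from the text; the variable names are our own):

```python
# Cross-sectional hierarchy for the London application:
# 1 market + 18 zones + 117 delivery areas.
n_series = 1 + 18 + 117
assert n_series == 136

# Balanced sample: 13,090 half-hour observations per series,
# i.e. 34 half-hour slots per business day (7am-11:30pm, with
# occasional overnight demand folded into the last slot).
n_30min = 13_090
slots_per_day = 34
n_hourly = n_30min // 2             # hourly aggregation
n_daily = n_30min // slots_per_day  # daily aggregation
print(n_hourly, n_daily)            # 6545 385

# Three temporal layers for each of the 136 series.
print(n_series * 3)                 # 408 base time series
```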
Figure 1 plots demand for the City of London delivery area, London Central zone, and London market at 30-minute, hourly and daily frequencies over the sample period. We observe some typical characteristics of platform data. First, intra-day area data is spiky because of peak demand periods triggered by events such as flash sales, food promotions and bad weather. These peaks are largely tempered at the daily frequency. Holidays can also cause large drops in delivery demand; this is particularly noticeable in the end-of-year period. Second, there are strong seasonality patterns at all levels of the hierarchy. Appendix A displays these seasonality patterns. We clearly see that the intra-day seasonality (visible from both the 30-minute and hourly time series) is mainly driven by food deliveries. The bottom row of the plots traces daily seasonality, and illustrates that there can be substantial differences between weekends and the rest of the week. In general, London faces higher weekend demand while the opposite is true for the delivery area City of London. This implies that there exists heterogeneity in demand patterns at the delivery area level. In fact, unreported plots for other delivery areas reveal that very different dynamics appear.

[1] The data are provided to us in the context of ongoing research collaborations. Due to a confidentiality agreement, we are not allowed to distribute or report actual demand data. Demand data across all figures are therefore normalized between 0-100. All analyses were carried out with the original data.
[2] In addition to the London market, we also analyze the case of Manchester in Section 5.4.
[3] Occasional overnight demand is integrated in the last 30 minutes of the business day.
Figure 2 illustrates this heterogeneity across areas. In the left panel, we contrast the intra-day demand of delivery areas; one area, for instance, has mostly zero demand throughout the entire business day. Large diversity appears in the morning and evening hours, with delivery areas that are idle and others that are active.

Machine Learning Based Forecast Reconciliation
In Section 3.1, we review the cross-temporal framework using the general notation from Athanasopoulos et al. (2023). In Section 3.2, we introduce our ML-based forecast reconciliation approach for cross-temporal hierarchies.

Cross-Temporal Hierarchies
We start with some general notation so that our methodology of the next section can be widely applied. We denote by $y_{b,t}$, for $b = 1, \ldots, n_b$ and $t > 0$, the high-frequency time series observation for bottom-level series $b$ at time $t$. The cross-sectional hierarchy is simply obtained by summing over elements in the set of the $n_b$ bottom-level time series. This yields $n_a$ additional high-frequency time series, resulting in a total of $n = n_a + n_b$ series.
Regarding the temporal hierarchy, note that in a platform setting data is streaming, but for simplicity let us consider $t = 1, \ldots, m$, with $m$ the number of observations to sum over when going from the highest to the lowest sampling frequency in the hierarchy. If we define $k \in \{k_1, \ldots, k_p\}$ as the $p$ temporal aggregation levels, with $k_1 = 1$ and $k_p = m$, then the non-overlapping temporally aggregated series, for each series $i = 1, \ldots, n$ and $k \in \{k_1, \ldots, k_p\}$, are given by
$$x^{[k]}_{i,\tau} = \sum_{t=(\tau-1)k+1}^{\tau k} y_{i,t}.$$
Next, denoting $\tau$ as the observation index at temporal aggregation level $k$, we can stack these observations in the $(m_k \times 1)$ vectors $\mathbf{x}^{[k]}_i = \left(x^{[k]}_{i,1}, \ldots, x^{[k]}_{i,m_k}\right)^{\top}$, where $\top$ denotes the transpose of a vector and $m_k = m/k$.
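The non-overlapping temporal aggregation above can be sketched as follows (a minimal illustration with $m = 34$ and $k \in \{2, 34\}$ as in the application; the function name is our own):

```python
import numpy as np

def temporally_aggregate(y, k):
    """Non-overlapping sums of k consecutive high-frequency
    observations; len(y) must be a multiple of k."""
    y = np.asarray(y, dtype=float)
    assert len(y) % k == 0
    return y.reshape(-1, k).sum(axis=1)

# One business day of 30-minute demand (m = 34 slots).
rng = np.random.default_rng(0)
y = rng.integers(0, 10, size=34)

x_hourly = temporally_aggregate(y, k=2)   # m_k = 17 hourly values
x_daily = temporally_aggregate(y, k=34)   # m_k = 1 daily value

assert len(x_hourly) == 17 and len(x_daily) == 1
# Coherence across temporal levels holds by construction:
assert x_daily[0] == y.sum() == x_hourly.sum()
```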
Finally, the cross-temporal hierarchy is obtained by considering, for each variable $i = 1, \ldots, n$ of the cross-sectional hierarchy, the entire temporal hierarchy. Notation-wise, we collect all temporally aggregated time series of variable $i$ in the $(\sum_{l=1}^{p} m_{k_l} \times 1)$ vector $\mathbf{x}_i = \left(\mathbf{x}^{[k_p]\top}_i, \ldots, \mathbf{x}^{[k_1]\top}_i\right)^{\top}$ and then stack all $n$ variables into the vector $\mathbf{x} = \left(\mathbf{x}^{\top}_1, \ldots, \mathbf{x}^{\top}_n\right)^{\top}$. In our application, regarding the cross-sectional hierarchy, the bottom level consists of the $n_b = 117$ delivery areas, which aggregate into $n_a = 19$ upper-level series (18 zones plus the London market), so that $n = 136$; regarding the temporal hierarchy, we have $p = 3$ aggregation levels $k \in \{1, 2, 34\}$, corresponding to the 30-minute, hourly and daily frequencies.

Forecast Reconciliation via Machine Learning
We are now ready to introduce our machine learning-based forecast reconciliation approach for cross-temporal hierarchies, thereby extending the work of Spiliotis et al. (2021) for cross-sectional hierarchies. Base forecasts are first obtained on an estimation sample consisting of the first $Q$ observations. Next, from the observations in the validation sample $\mathcal{V} = \{Q + 1, \ldots, Q + R \times H\}$, we construct the response vector and features matrix for the ML model, which can be generally written as
$$y_{b,\mathcal{V}} = f_b\left(\mathbf{X}_{\mathcal{V}}\right) + \varepsilon_{b,\mathcal{V}}, \qquad (1)$$
where $f_b(\cdot)$ denotes the forecast function to be trained for bottom-level series $b$ (see Section 4.2 for specifications) and $\varepsilon_{b,\mathcal{V}}$ represents the error term. Note that every bottom-level series in the cross-temporal hierarchy (i.e. 30-min delivery area demand in our application) is predicted by a separate ML model, thereby allowing the forecast reconciliation approach to adapt to different patterns in each series. The response variable $y_{b,\mathcal{V}}$ in (1) is simply given by the $m_1 \times R \times H$ observations of the bottom-level series for variable $b$; hence it stacks the high-frequency observations of series $b$ over the validation sample.
As features matrix for the ML model, we use
$$\mathbf{X}_{\mathcal{V}} = \Big[\, \underbrace{\hat{\mathbf{x}}^{[1]}_{1,\mathcal{V}}, \ldots, \hat{\mathbf{x}}^{[1]}_{n,\mathcal{V}}}_{\text{cross-sectional information}} \,,\; \underbrace{\hat{\mathbf{x}}^{[k_2]}_{b,\mathcal{V}} \otimes \mathbf{1}_{k_2}, \ldots, \hat{\mathbf{x}}^{[k_p]}_{b,\mathcal{V}} \otimes \mathbf{1}_{k_p}}_{\text{temporal information}} \,\Big],$$
which contains various base forecasts on the validation sample, where $\otimes$ denotes the Kronecker product and $\mathbf{1}_{k}$ is a $(k \times 1)$ column vector of ones. More specifically, the first set of predictors exploits information in the cross-sectional hierarchy and contains the $n_b$ highest-frequency (i.e. 30-min in our application) base forecasts for all bottom-level variables, as well as the $n_a$ highest-frequency base forecasts for the upper-level series in the cross-temporal hierarchy (i.e. zone and market in our application). The second set of predictors exploits information in the temporal hierarchy and contains the lower-frequency (i.e. hourly and daily in our application) base forecasts of the bottom-level variable $b$ for which we are constructing our ML model, so that together with its highest-frequency base forecast from the first set, all temporal levels of variable $b$ enter as predictors. Notice that due to the frequency mismatch, each of the lower-frequency base forecasts at level $k_a$, for $a = 2, \ldots, p$, is repeated $k_a$ times. We hereby make the assumption that the temporal gain in reconciling forecasts mainly comes from the focal variable's own temporally aggregated series, to keep this second set of predictors compact. Finally, note the central role of the bottom-level variable $b$ itself, which appears in both sets of predictors and whose base forecast we aim to improve upon in this reconciliation step.
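The construction of the features matrix can be sketched as follows (a simplified illustration with our own function and argument names; repeating each lower-frequency forecast $k$ times mirrors the $\otimes \mathbf{1}_k$ construction above):

```python
import numpy as np

def build_features(base_30min, base_hourly_b, base_daily_b):
    """Features matrix for one bottom-level series b.

    base_30min    : (T, n) 30-min base forecasts of ALL n series
                    in the cross-sectional hierarchy
    base_hourly_b : (T // 2,) hourly base forecasts of series b
    base_daily_b  : (T // 34,) daily base forecasts of series b
    """
    T = base_30min.shape[0]
    # Repeat each lower-frequency forecast k times so every row
    # corresponds to one 30-minute slot (frequency alignment).
    hourly_rep = np.repeat(base_hourly_b, 2)
    daily_rep = np.repeat(base_daily_b, 34)
    assert len(hourly_rep) == len(daily_rep) == T
    return np.column_stack([base_30min, hourly_rep, daily_rep])

# Toy example: n = 136 series, one week of 34 * 7 = 238 slots.
T, n = 34 * 7, 136
X = build_features(np.zeros((T, n)), np.zeros(T // 2), np.zeros(T // 34))
assert X.shape == (T, n + 2)  # n cross-sectional + 2 temporal columns
```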

Design of the Forecast Study
We start by discussing in Section 4.1 the forecast set-up we use to evaluate out-of-sample forecast performance.In Section 4.2, we summarize the base forecast models all forecast reconciliation methods rely on, together with the machine learning methods used by our procedure to obtain the reconciled forecasts.Finally, Section 4.3 outlines the benchmark forecast methods against which we compare the performance of our procedure.

Forecast Set-up
To compare the forecast performance of our forecast reconciliation method against its benchmarks, we use a standard rolling-window approach using the period February 20, 2023 to September 24, 2023 (31 weeks) as test sample. In each rolling-window iteration, we use the most recent six months of data to estimate the base models on the estimation set (first five months), and the most recent month of data to estimate the ML model on the validation set. Hence, in line with the notation used in Figure 4, we take $Q = 140$ days, $H = 7$ days and $R = 4$. Note that since the demand data in our platform application consist of nonnegative integers, we ensure this by rounding the base forecasts prior to using them as inputs in the ML model, i.e.
$$x^{[k]\star}_{i,t} = \max\left(0, \lfloor \hat{x}^{[k]}_{i,t} \rceil\right),$$
where $\lfloor \cdot \rceil$ denotes rounding to the nearest integer; we similarly round the bottom-level reconciled forecasts before aggregating them.
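The truncation-and-rounding step can be sketched as follows (note that NumPy's `round` uses round-half-to-even, one reasonable reading of the $\lfloor \cdot \rceil$ operator; the function name is our own):

```python
import numpy as np

def to_counts(forecasts):
    """Map real-valued forecasts to nonnegative integers:
    round to the nearest integer, then clip negatives at zero."""
    return np.maximum(0, np.round(forecasts)).astype(int)

print(to_counts(np.array([3.7, -1.2, 0.4])))  # [4 0 0]
```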
In a next step, we combine the estimation and validation samples ($N = 168$ days) to estimate the base models and obtain final out-of-sample base forecasts as well as reconciled forecasts on the test set. As forecast horizon, we consider one week of data (i.e. horizon $m_k \times H$ for temporal aggregation level $k$) as this corresponds to the planning horizon for the compensation schemes of couriers through dynamic pricing, which crucially relies on demand forecasts as inputs. Hence, for the 30-min series we obtain $34 \times 7 = 238$ forecasts; for the hourly series we obtain $17 \times 7 = 119$ forecasts; and for the daily series we obtain $H = 7$ forecasts. This concludes one iteration in the outer rolling window used for forecast evaluation purposes. In the next iteration, we move one week ahead and repeat the same steps. We roll forward until we reach the end of our sample, thereby resulting in 31 outer rolling windows.
As forecast accuracy measure, we use the weighted absolute percentage error
$$\mathrm{WAPE}^{[k]}_i = \frac{\sum_{j} \left| x^{[k]}_{i,j} - F^{[k]}_{i,j} \right|}{\sum_{j} \left| x^{[k]}_{i,j} \right|},$$
which measures the overall deviation of the forecasted values $F^{[k]}_{i,j}$ from the actual values $x^{[k]}_{i,j}$, for all time points $j$ in the test set of size $T_{\text{test}}/k$, where $T_{\text{test}}$ denotes the number of high-frequency (i.e. 30-min) observations in the test set. We then average these WAPEs across all cross-sectional units belonging to the same cross-temporal level; for instance, the overall WAPE for the bottom-level series (i.e. 30-min for the $n_b$ areas) is given by $\frac{1}{n_b} \sum_{b=1}^{n_b} \mathrm{WAPE}^{[1]}_b$. The WAPE is a popular forecast metric to evaluate performance at delivery platforms because (i) it is scale-independent, thereby facilitating comparisons across heterogeneous areas as well as different temporal/geographical aggregation levels; (ii) it is robust to zeros, as it is very unlikely that the denominator is zero, thereby making it a suitable metric even when the demand in an area is low or intermittent; and (iii) it emphasizes accuracy for items with larger demand, thereby prioritizing accurate forecasting of demand during busy periods. Note that as robustness check we also report results for the root mean squared scaled error as well as the symmetric mean absolute percentage error.
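A minimal implementation consistent with the WAPE definition above (the function name is our own):

```python
import numpy as np

def wape(actual, forecast):
    """Weighted absolute percentage error: total absolute error
    divided by total actual demand, so busy periods weigh more."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return np.abs(actual - forecast).sum() / np.abs(actual).sum()

# Intermittent demand: zeros in `actual` are unproblematic as
# long as total demand over the test window is positive.
print(round(wape([0, 0, 10, 30], [1, 0, 8, 33]), 3))  # 0.15
```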

Base Forecasts and Machine Learning Methods
Our procedure allows practitioners to choose their preferred base forecast models as well as their preferred ML method to perform the ML-based reconciliation.In this paper, we consider four base forecast models and three popular ML methods to this end.
Base Forecasts. We consider one popular industry forecast model for platform data, two all-round time series models for forecasting univariate time series, and a forecast combination of the former three as an ensemble base forecast model. The first base forecast model is a simple yet often-used model in the platform industry which we label "Naive", since the forecast for a specific slot (30-min, hourly or daily) of next week is simply the value observed in the same slot and day of the previous week. The following two base forecast models are oftentimes used in the hierarchical forecasting literature to obtain base forecasts, namely the class of SARIMA and exponential smoothing models. Full details on the base forecast methods are available in Appendix B.
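The "Naive" base forecast can be sketched as a seasonal random walk with a one-week period (a minimal sketch; the 34-slot business day follows the application, and the function name is our own):

```python
import numpy as np

SLOTS_PER_DAY = 34            # 30-minute slots in a business day
SLOTS_PER_WEEK = 7 * SLOTS_PER_DAY

def naive_forecast(history, horizon):
    """Forecast each future slot with the value observed in the
    same slot and weekday of the previous week."""
    history = np.asarray(history)
    assert len(history) >= SLOTS_PER_WEEK
    last_week = history[-SLOTS_PER_WEEK:]
    reps = int(np.ceil(horizon / SLOTS_PER_WEEK))
    return np.tile(last_week, reps)[:horizon]

# Two weeks of toy demand; forecast the next week (238 slots).
history = np.arange(2 * SLOTS_PER_WEEK)
fc = naive_forecast(history, horizon=SLOTS_PER_WEEK)
assert np.array_equal(fc, history[-SLOTS_PER_WEEK:])
```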
Machine Learning Methods. We consider three popular ML models, namely random forest, XGBoost and LightGBM. The former two are also considered in Spiliotis et al. (2021), whereas the latter has recently shown great success in the M5 competition (Makridakis et al., 2022a), an application with hierarchical retail sales time series, thereby making all three suitable candidates for our platform data. More details on the ML models are available in Appendix B.
For all three ML methods, we report main results based on their standard implementations with default tuning parameters (see Appendix B). We do this first to avoid excessive computing times on our streaming platform data, but second, and importantly, also because off-the-shelf implementations of tree-based methods often attain excellent performance across a variety of settings (e.g., Januschowski et al., 2022). In Section 5.3, we investigate the improvements in forecast performance one can obtain by tuning these ML models.
Finally, when reporting our main results, we use the sum of squared errors as loss function to train all ML models. However, our framework allows practitioners to use another loss function as they see fit for the data at hand. We also investigated the performance of the ML-based reconciliation methods when using the Tweedie loss function instead of the standard squared loss, since the former showed superior performance in the M5 competition (Januschowski et al., 2022; Makridakis et al., 2022a) on forecasting hierarchical time series, as it is specifically geared towards sparse (zero-inflated) target data, which some of the considered delivery areas also display (see Figure 2). However, using a Tweedie loss instead of the regular squared loss function did not improve the forecast results of our proposed ML reconciliation methods on the considered data; these results are hence omitted but available from the authors upon request.
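For reference, the unit Tweedie deviance for power parameter $1 < p < 2$, which tolerates exact zeros in the target, can be written out as follows (a textbook formula in our own code, not the paper's implementation; tree libraries expose it directly, e.g. as a `tweedie` objective):

```python
import numpy as np

def tweedie_deviance(y, mu, p=1.5):
    """Mean Tweedie deviance for 1 < p < 2; y >= 0, mu > 0.
    Zero when y == mu elementwise, positive otherwise."""
    y, mu = np.asarray(y, float), np.asarray(mu, float)
    dev = 2 * (y ** (2 - p) / ((1 - p) * (2 - p))
               - y * mu ** (1 - p) / (1 - p)
               + mu ** (2 - p) / (2 - p))
    return dev.mean()

assert tweedie_deviance([1.0, 4.0], [1.0, 4.0]) < 1e-12  # perfect fit
assert tweedie_deviance([0.0, 1.0, 4.0], [0.5, 1.0, 4.0]) > 0  # handles y = 0
```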

Benchmark Reconciliation Methods
We compare our new ML forecast reconciliation procedure against the base forecasts and five state-of-the-art linear reconciliation benchmarks for cross-temporal hierarchies, as summarized in Table 1 and implemented in the FoReco package (Girolimetto and Di Fonzo, 2023) in R (R Core Team, 2023). Apart from the Bottom-Up method, the reconciliation step requires specifying how to estimate the covariance matrix of the in-sample fit errors, since the benchmark methods are geared towards minimizing reconciliation errors as opposed to our procedure, which is geared towards minimizing forecast combination errors. For the tcs, cst and ite benchmark methods, the temporal reconciliation covariance matrix estimator is the series variance scaling matrix (thf comb = "wlsv"), which is the diagonal matrix that contains the estimated variances of the in-sample residuals at each level. The cross-sectional reconciliation covariance matrix estimator is a shrinkage covariance matrix of the in-sample residuals (hts comb = "shr"), where the off-diagonal elements are shrunk towards zero. For the oct method, the cross-temporal covariance matrix estimator simply contains the in-sample residual variances on the main diagonal (csts comb = "wlsv"). Full details on the cross-temporal benchmark methods are available in Di Fonzo and Girolimetto (2023a).
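The two covariance estimators above can be sketched generically as follows (a minimal illustration with our own names; the shrinkage intensity λ is fixed here for simplicity, whereas implementations typically choose it data-drivenly). Note that the "wlsv" estimator is the λ = 1 special case, i.e. keeping only the diagonal of residual variances:

```python
import numpy as np

def shrink_covariance(residuals, lam):
    """Shrink off-diagonal covariances toward zero:
    Sigma_shr = lam * diag(S) + (1 - lam) * S,
    with S the sample covariance of in-sample residuals.
    lam = 1 recovers the diagonal (variance-scaling) estimator."""
    S = np.cov(residuals, rowvar=False)
    return lam * np.diag(np.diag(S)) + (1 - lam) * S

rng = np.random.default_rng(1)
res = rng.normal(size=(100, 3))       # toy in-sample residuals
S = np.cov(res, rowvar=False)
S_shr = shrink_covariance(res, lam=0.5)

assert np.allclose(np.diag(S_shr), np.diag(S))  # variances are kept
assert abs(S_shr[0, 1]) <= abs(S[0, 1])         # off-diagonals shrink
```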

Results
This section is divided in four parts. In Section 5.1, we discuss the overall forecast performance of the forecast reconciliation methods, followed by Section 5.2, where we present detailed insights on our ML-based reconciliation approach. In Section 5.3, we discuss the trade-off between computing time and forecast accuracy when incorporating hyperparameter tuning into our ML-based reconciliation proposal. Finally, in Section 5.4, we discuss how shifts in the demand data streams affect the performance of the reconciliation methods.

Overall Forecast Performance
Table 2 summarizes the forecast accuracy across the complete cross-temporal hierarchy for the different combinations of base forecasts and reconciliation methods.[4] Overall, all methods find that forecasting is easier higher up in the hierarchy, i.e. at lower temporal frequency and for larger geographic areas, which confirms the general findings in the literature.
For the set of considered base forecasts and reconciliation methods, WAPE falls from around 30% at the 30-minute delivery area level to 5% at the daily London level of the hierarchy.
The ML-based forecast reconciliation results using random forest consistently outperform all benchmarks at the delivery area and zone levels at all frequencies. For example, using 30-minute frequency SARIMA base forecasts, we obtain a WAPE of 0.3013 for the cst method (i.e. the best linear forecast reconciliation method) compared to 0.2801 for random forest based reconciliation (i.e. the best non-linear forecast reconciliation method). However, ML-based forecast reconciliation does not systematically dominate linear methods, as shown for instance by XGBoost with a WAPE of 0.3195. Furthermore, ML-based forecast reconciliation seems to be rather insensitive to the choice of the base forecasts. In fact, the Naive, ETS and SARIMA results are qualitatively the same. For example, in the case of random forest, the respective WAPEs are very close at 0.2762, 0.2809 and 0.2801.
To assess the significance of our findings, we report in Figure 5 the results of the accuracy-ranking-based multiple comparisons with the best (MCB) test, a methodology popularized in the forecasting literature since Koning et al. (2005). We compute the MCB test across all levels in the cross-temporal hierarchy and report results for each of the four base forecast methods. To this end, we use the R package tsutils (Kourentzes, 2023) with function nemenyi() to compute the MCB test. The test ranks the performance of the examined methods across the series being forecast, comparing their average ranks by considering a critical difference determined through a confidence interval. For each base forecast panel in Figure 5, the methods that do not overlap with the best-ranked procedure (as highlighted by the gray zone) are significantly worse than the best. Visual inspection of the results confirms our overall finding that random forest reconciliation dominates the other procedures.

Figure 5: MCB test with confidence level 0.95, using all levels in the hierarchy. Methods that do not overlap with the gray zone indicating the confidence interval of the best method are significantly worse than the best.
The important business question is whether ML-based forecast reconciliation helps improve the base forecast methods for the high-frequency (i.e. 30-minute and hourly) delivery area levels, because these forecasts are direct inputs for automated decision making. The answer is yes, and the forecast gains are between 10 and 15%. As an illustration, for random forest reconciliation with ETS base forecasts, the WAPE is 0.2809 compared to 0.3180 for non-reconciled base forecasts. It also turns out that these high-frequency delivery area level gains become even stronger at the zone and market levels. In fact, the forecast gains mount to 40% at the market level. For example, the London 30-minute WAPEs decrease from 0.1621 to 0.0959 with ETS base forecasts and random forest reconciliation. Here, the linear reconciliation methods only deliver marginal accuracy gains, with the best (cst) WAPE being 0.1349. However, we also note that at the London level, the overall smallest WAPEs are obtained using Naive base forecasts for the 30-minute and hourly frequencies.
Finally, in the bottom panel of Table 2, we provide forecast combination results, that is, results obtained by combining the base forecasts with equal weights. We find this strategy to pay off slightly compared to the best performing base methods. For example, at the 30-min delivery area level, random forest reconciliation of the Naive, ETS and SARIMA base forecasts yields WAPEs of respectively 0.2762, 0.2809 and 0.2801, while the forecast combination yields 0.2758. The gains are overall stronger for the linear reconciliation methods compared to the non-linear ones, yet the former are, overall, outperformed by the ML-based reconciliation methods. Still, in practice we generally do not know a priori which base forecast method will perform best out-of-sample, and moreover, this likely varies across the different levels of the cross-temporal hierarchy. A forecast combination base forecast approach then forms an effective practice to obtain accurate and robust base forecasts, which can still be further improved upon via ML-based reconciliation to obtain coherent forecasts.
To sum up, random forest forecast reconciliation performs particularly well, but not all ML methods do better than linear methods. The strongest gains are obtained for the 30-minute and hourly frequencies. Forecast combination, i.e. averaging the base forecasts, is (marginally) beneficial. Note that these main findings are qualitatively the same when we measure forecast accuracy in terms of the symmetric mean absolute percentage error (SMAPE), a popular asymmetric loss function, or in terms of the root mean squared scaled error (RMSSE), a popular metric in the forecast reconciliation literature (see Hyndman and Koehler, 2006 for a discussion of these metrics), instead of WAPE. We refer to Tables C.1 and C.2 in Appendix C for the SMAPE and RMSSE results respectively.

ML-Based Reconciliation Insights
From the previous results, we find that random forest based forecast reconciliation outperforms the considered benchmarks, especially at the more granular levels in the hierarchy.
To gain more insights, we investigate the link between forecast accuracy and the size of the delivery areas. In addition, we investigate what drives forecast performance by computing SHAP values, which are considered the current state-of-the-art method for interpreting ML models; see Molnar (2022) for an introduction.
Next, for each random forest model with Naive base forecasts in the (outer) rolling-window setup, we compute SHAP values (Shapley et al., 1953; Lundberg and Lee, 2017) for all observations in the test set. To this end, we use the package treeshap (Komisarczyk et al., 2023) in R, a fast implementation for tree ensemble models (Lundberg et al., 2018).
Figure 7 shows the resulting relative variable importance plots by variable group for City of London as an illustration. Given the large number of features used in the ML model, we focus on the 36 highest ranked features in terms of variable importance (which together account for more than 80% of the overall variable importance). Overall, the proportions are stable over the test sample. While City of London's own 30-min and hourly streams account for a sizable portion in the 20-30% range, the other areas' 30-minute streams together account for almost 40%. Zone and market level information, as well as the area's own daily information, contributes only marginally.

Computational Aspects and Hyperparameter Tuning
Our main results above are obtained by training the ML algorithms with their default hyperparameters as provided by the software packages. By doing so, we save a substantial amount of computing time, which is an important consideration when deploying forecast reconciliation in a production environment. However, in this section, we investigate whether the performance of the ML-based forecast reconciliation methods can be further improved by hyperparameter tuning. To this end, we use the common practice of grid search to tune the hyperparameters of the random forest algorithm, since there are only a few hyperparameters, whereas we use Bayesian optimization to tune the hyperparameters of the XGBoost and LightGBM algorithms, as the latter two have a large number of hyperparameters. Details on the tuning process are provided in Appendix B.
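A grid search of the kind used for random forest can be sketched as follows. The grid values and the toy loss function below are hypothetical stand-ins (the parameter names follow R's randomForest package); in the actual procedure, each candidate would be scored by fitting the model and computing the validation WAPE.

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Exhaustively evaluate every hyperparameter combination and return
    the one with the lowest validation loss."""
    names = list(param_grid)
    best_params, best_loss = None, float("inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        loss = evaluate(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss

# Hypothetical random forest grid.
grid = {"ntree": [250, 500, 1000], "mtry": [10, 20, 45], "nodesize": [1, 5, 10]}

# Stand-in for "fit on the training window, score WAPE on the validation window".
toy_loss = lambda p: (abs(p["ntree"] - 500) / 1000
                      + abs(p["mtry"] - 20) / 100
                      + p["nodesize"] / 100)

print(grid_search(grid, toy_loss))  # best: ntree=500, mtry=20, nodesize=1
```

Grid search is exhaustive, so its cost grows multiplicatively with the grid; this is why Bayesian optimization is preferred for XGBoost and LightGBM, which have many more hyperparameters.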
Table 3 reports results for the ML-based reconciliation methods (default and tuned) with SARIMA base forecasts. The random forest algorithm with default hyperparameters, which was the best performer in most of our previous results, performs as well as its tuned version. This is a great advantage when implementing the forecasting procedure in a high-dimensional streaming data environment with frequent updates. In contrast to random forest, the XGBoost algorithm greatly benefits from tuning at the high-frequency delivery area level. Here, the loss decreases from 0.3196 to 0.2790, a reduction of about 13%, and XGBoost is now at the same accuracy level as random forest. For the zone and market hierarchies, the differences between default and tuned versions are marginal. Finally, the LightGBM algorithm is only mildly affected, with small tuning improvements at the delivery area level.
While hyperparameter tuning results, overall, in a marginal gain in forecast accuracy, it is important to investigate the computational burden that comes along with it. The third column of Table 3 reports the approximate computing times (in seconds) for one ML fit with fixed tuning parameters as well as for one round of hyperparameter tuning, for random forest, XGBoost and LightGBM. As is well known, both the XGBoost and LightGBM algorithms are fast with default tuning parameters, requiring only about one second to train.
Random forest is comparatively slow, taking seven seconds to train. However, the cost of tuning is prohibitive: all algorithms require large multiples of this computing time to tune the hyperparameters. In particular, the XGBoost and LightGBM algorithms, with their many tuning hyperparameters, take long to optimize. To conclude, among the considered ML algorithms, the random forest algorithm is preferred because its default tuning parameters yield accuracy levels close to the tuned ones. This is a great advantage when faced with computational constraints such as those imposed by the platform environments we consider in this paper.

Data Shifts: The Case of Manchester
The streaming platform data for the London market are reasonably stable over time.
In this section, we investigate how our ML-based forecast reconciliation procedure performs when facing shifts in the data streams. In fact, breaks or data shifts are regularly observed in platform data because of new competitors entering the marketplace and companies moving some of their locations to other on-demand platforms.
To study the impact of breaks on forecast reconciliation, we focus on Manchester, for which we have a cross-temporal hierarchy that consists cross-sectionally of 24 delivery areas and Manchester as the entire market (there is no intermediate level of zones in Manchester), and temporally of 30-minute, hourly and daily frequencies for the same sample period as London. Table 4 summarizes the forecast accuracy for the cross-temporal hierarchy for SARIMA base forecasts and all reconciliation methods; the other base forecast results are similar and available on request. We report results for the period before the data shift (i.e., prior to June 14), the period after the data shift (i.e., after July 14), and the one-month period in between characterizing the data shift. There are several interesting findings for the delivery area level of the hierarchy. First, for the "Before shift" period we see that random forest reconciliation for delivery areas outperforms the other methods with the same relative magnitudes as found in the London market, where no data shifts occur. For example, the random forest and cst WAPEs are 0.3226 and 0.3435, respectively. Second, we see that during the volatile "During shift" period, the WAPEs at the delivery area level all more than double compared to the stable period before. The LightGBM reconciliation method performs best with a WAPE of 0.7516.
Third, the "After shift" period is generally characterized by better accuracies than the shift period itself, but the gains are much stronger for the ML-based reconciliation methods.
To illustrate the magnitude of this, consider cst with WAPEs of 0.7986 and 0.7023, and random forest with WAPEs of 0.8004 and 0.4206, for the "During" and "After shift" periods, respectively. Fourth, the un-reconciled "After shift" SARIMA base forecasts are of low quality, with WAPEs still double the size of those in the "Before shift" period. This is to be expected, given that the parameter estimates are based on historical data subject to the shift. However, our ML-based reconciliation method is able to transform these low quality projections into forecasts that yield WAPEs much closer to those of the period prior to the shift. At the market level of the hierarchy, we observe only small differences in forecasting accuracy over the three periods. This is expected, since aggregating delivery areas, not all of which are subject to a data shift, results in a smoother market level time series; see Figure 8, bottom panel. The LightGBM method consistently has the smallest WAPEs at the 30-minute and hourly frequency levels.

Conclusion
Forecast reconciliation based on linear methods has been shown to be very useful and successful in a wide variety of applications. Recently, ML-based forecast reconciliation methods have been proposed for cross-sectional time series hierarchies. We extend the latter approach to cross-temporal hierarchies. We provide a general description of the methodology and an application to a unique platform dataset from an on-demand logistics company. In particular, the data consist of demand streams that constitute a rich hierarchy in the spatial and temporal dimensions, with bottom-level time series defined as delivery areas at the 30-minute frequency.
Our key empirical finding is that ML-based forecast reconciliation for our platform data can result in substantial forecast accuracy improvements compared to existing linear reconciliation methods. However, unless one can bear the large computational cost of hyperparameter tuning, ML-based reconciliation is not uniformly superior to linear methods. Another key finding is that, as platform data streams are potentially impacted by data shifts, our approach is able to react swiftly to such instability compared with linear approaches.
Several extensions should be considered. First, applications to other datasets consisting of different cross-temporal hierarchies. Second, the ML algorithms in our platform application consist of tree-based methods only, though our approach can be used with any ML algorithm. Taking computing time into account, it would be interesting to test alternative neural-network-based methods using our approach. Third, we currently retrain the ML algorithms at every iteration of the rolling window; it would be interesting to investigate whether online ML methods, which incrementally update the forecast function as new data become available, can lead to further computational efficiencies; see (2024) for an application to emergency medical services data.

We opt for equal weights in the forecast combination, mainly because of its simplicity, ease of implementation and its established track record of good performance in the forecast combination literature (e.g., Elliott and Timmermann, 2016, Chapter 14).

B.3 Machine Learning Methods
We consider three machine learning methods: (i) random forest, (ii) XGBoost and (iii) LightGBM.
Random Forest. Random forests, proposed by Breiman (2001), produce forecasts by combining regression trees. A regression tree is a nonparametric method that partitions the feature space to compute local averages as forecasts; see Efron and Hastie (2016) for a textbook treatment. The tuning parameters are the number of trees used in the forecast combination, the number of features to randomly select when constructing each regression tree split, and the minimum number of observations in each terminal node to compute the local forecasts. We use the standard implementation of the randomForest package (Liaw and Wiener, 2002) in R with default settings for the hyperparameters (i.e., number of trees: ntree = 500, number of variables sampled at each split: mtry = # of features/3, minimum size of terminal nodes: nodesize = 5) when reporting our main results.
In Section 5.3, we also investigate the performance of the random forest based forecast reconciliation method when the hyperparameters are tuned. To this end, we follow Spiliotis et al. (2021).

Figure 1: Time series plots for full sample of the area City of London belonging to the zone London Central and the market London.

Figure 2: Left: Aggregated demand in each area (circle) in the London market.Right: Average demand (top) and proportion of zeros (bottom) across areas; the median is displayed in solid black, 10% and 90% quantiles in dotted black, 25% and 75% in dashed black.

We have n_b = 117 delivery areas that are aggregated into the upper levels consisting of 18 zones (middle) and finally one market (top level), as visualized in the left panel of Figure 3. The total number of variables is thus n = 117 + 18 + 1 = 136. Regarding the temporal hierarchy, we have p = 3 frequencies (30-min, hourly, daily), as visualized in the middle panel of Figure 3, with m = 34 and k_1 = 1, k_2 = 2 and k_3 = 34, so that x^{[1]}_{i,τ} (x^{[2]}_{i,τ}) collects the m_1 = 34 (m_2 = 17) 30-min (hourly) demand observations on day τ, and x^{[34]}_{i,τ} corresponds to the single daily observation for variable i. Finally, the combined cross-temporal hierarchy is displayed in the right panel of Figure 3.
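The cross-sectional and temporal aggregation logic can be illustrated with a small Python sketch. The hierarchy below (4 hypothetical areas in 2 zones, m = 4 half-hour slots per day) is a toy version of the paper's 117-area, 18-zone, m = 34 setup.

```python
# Toy cross-temporal hierarchy: 4 delivery areas in 2 zones, one day of
# m = 4 half-hour demand observations per area (numbers are illustrative).
zones = {"Z1": ["A1", "A2"], "Z2": ["A3", "A4"]}
demand = {  # 30-minute demand per delivery area
    "A1": [3, 5, 2, 4], "A2": [1, 0, 2, 1],
    "A3": [4, 4, 3, 5], "A4": [0, 2, 1, 1],
}

def temporal_agg(series, k):
    """Aggregate a 30-minute series into non-overlapping blocks of length k."""
    return [sum(series[i:i + k]) for i in range(0, len(series), k)]

# Cross-sectional aggregation: areas -> zones -> market.
zone_series = {z: [sum(vals) for vals in zip(*(demand[a] for a in areas))]
               for z, areas in zones.items()}
market_series = [sum(vals) for vals in zip(*zone_series.values())]

print(zone_series["Z1"])               # [4, 5, 4, 5]
print(market_series)                   # [8, 11, 8, 11]
print(temporal_agg(market_series, 2))  # hourly: [19, 19]
print(temporal_agg(market_series, 4))  # daily: [38]
```

Coherence means exactly this: every zone, market, hourly and daily series equals the appropriate sum of the bottom-level 30-minute area series, which is why reconciled bottom-level forecasts can be aggregated upward to fill the whole hierarchy.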

Figure 4: Rolling-window procedure for machine-learning based forecast reconciliation.

Finally, to reconcile the actual out-of-sample demand forecasts of horizon h for forecast evaluation purposes, we use the complete sample up to N = Q + R × H to obtain the m_k × H step-ahead forecasts for each cross-sectional series at temporal aggregation level k. This is visualized in the bottom part of Figure 4, where the gray bar represents the training sample (of length N) on which the base forecast models are estimated, and the red bar (of length H) represents the test set on which we evaluate out-of-sample forecast performance. The out-of-sample base forecasts x^{[k]}_{i,τ} (i = 1, . . ., n and k = k_1, . . ., k_p) are used as inputs to the previously estimated ML models (one for each bottom-level series) to provide the reconciled bottom-level forecasts x^{[1]}_{b,τ} (b = 1, . . ., n_b) of the cross-temporal hierarchy. As a last step, these reconciled bottom-level forecasts are aggregated as explained in Section 3.1 to obtain coherent forecasts across the complete hierarchy.

Figure B.1 in Appendix B provides a visualization of iterations r and r + 1 in the outer rolling window. We hereby highlight the estimation (blue bars) and validation (green bars) sets for the inner rolling window used to construct the inputs to the ML model (see Section 3.2). Due to our inner rolling-window set-up, the training set (blue bar) and forecast horizon (green bar) of the last three (inner) rolling windows in iteration r overlap with the first three (inner) rolling windows in iteration r + 1, as indicated with the shading.

Figure 6 plots the forecast performance (WAPE, vertical axis) of each delivery area (dot) against its average demand size (horizontal axis). The left panel shows this relationship using un-reconciled base forecasts, and we find a smirk-shaped pattern with many WAPEs above 0.5. The Naive base forecasts are, overall, worse than ETS and SARIMA when average demand is below 20. The middle panel shows the random forest forecast reconciliation results; the smirk flattens out, with hardly any noticeable difference between the base forecast methods, and few areas have WAPEs above 0.5. The right panel highlights that random forest based forecast reconciliation has the capacity to "correct" bad quality base forecasts, as the relative WAPEs of the ML-reconciled forecasts over the base forecasts are mostly below 1.

Figure 7: Variable importance in random forest ML model with Naive base forecasts for the delivery area City of London.
Figure A.2 in Appendix A displays aggregated demand over the sample period in each delivery area in Manchester. Figure 8, top panel, illustrates how severe the data shift is on June 14, 2023 in all considered temporal frequencies for the Bolton Central delivery area situated North-West of Manchester.

Figure 8: Time series plots for the full sample of the area Bolton Central (top) belonging to the market Manchester (bottom), with a data shift for Bolton Central on June 14, 2023.

Figure A.2: Aggregated demand in each delivery area in Manchester.
To tune the XGBoost hyperparameters, we use Bayesian optimization via the package rBayesianOptimization (Yan, 2021) in R and consider the following intervals with lower and upper bounds for each hyperparameter, in line with Spiliotis et al. (2021): we set the values of max_depth between (2, 10), the learning rate (eta) between (0.01, 0.05), subsample values between (0.3, 1), colsample_bytree values between (0.3, 1), min_child_weight between (0, 10) and gamma between (0, 5). The values for the maximum number of boosting iterations (nrounds) range between 50 and 200.

LightGBM. LightGBM, put forward by Microsoft in 2016, is a gradient boosting framework that uses tree-based learning algorithms like XGBoost but, as the name suggests, has computational advantages with respect to training speed, memory usage and parallelization. The implementation of gradient-based one-side sampling and exclusive feature bundling techniques allows it to handle large training datasets. We use the lightgbm package (Shi et al., 2023) in R with fixed, default hyperparameters when reporting our main results: 100 boosting iterations (nrounds), 31 as the maximum number of leaves (num_leaves), 0.1 as learning rate (eta), 1 as subsample ratio (subsample), 1 as subsample ratio of columns (colsample_bytree), 0.001 as minimum sum of instance weight (hessian) (min_child_weight), 0 as ℓ1-regularization (lambda_l1), and no limit on the maximum depth (max_depth) of the tree model. Finally, to tune the hyperparameters, we also use Bayesian optimization with prior ranges similar to those for XGBoost: the maximum number of leaves (num_leaves) is set to the range of 5 to 31, the learning rate (eta) between (0.01, 0.05), subsample values between (0.3, 1), colsample_bytree values between (0.3, 1), min_child_weight between (0, 10), max_depth between (2, 10) and lambda_l1 between (0, 5). The values for the maximum number of boosting iterations (nrounds) range between 50 and 200.
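The search space above amounts to simple box constraints. The Python sketch below records the XGBoost bounds from the text and samples one candidate uniformly from them as a stand-in for the proposal step; an actual Bayesian optimizer (such as rBayesianOptimization in R) would propose candidates via an acquisition function instead.

```python
import random

# XGBoost search space as given in the text: (lower, upper) per hyperparameter.
xgb_bounds = {
    "max_depth": (2, 10),          # integer-valued
    "eta": (0.01, 0.05),
    "subsample": (0.3, 1.0),
    "colsample_bytree": (0.3, 1.0),
    "min_child_weight": (0, 10),
    "gamma": (0, 5),
    "nrounds": (50, 200),          # integer-valued
}

def sample_candidate(bounds, integer_params, rng):
    """Draw one candidate from the box-constrained search space; a real
    Bayesian optimizer would propose points via an acquisition function."""
    out = {}
    for name, (lo, hi) in bounds.items():
        out[name] = rng.randint(lo, hi) if name in integer_params else rng.uniform(lo, hi)
    return out

cand = sample_candidate(xgb_bounds, {"max_depth", "nrounds"}, random.Random(0))
print(sorted(cand))  # the seven hyperparameter names, each value within bounds
```

Each sampled (or proposed) candidate is then scored by training the model and evaluating it on the validation window, and the best-scoring configuration is retained.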

Table 1: Overview of cross-temporal forecast reconciliation benchmarks.
Notes: This table summarizes the benchmark cross-temporal forecast reconciliation methods.

Table 2: London forecast reconciliation results.

Table 3: London forecast reconciliation results with hyperparameter tuning. Notes: This table shows forecast accuracy measured in weighted absolute percentage error (WAPE) for SARIMA base forecasts. The forecast horizon is one week ahead. Computing times are approximate, measured in seconds and recorded on a Windows 10 Enterprise LTSC machine with an Intel Core i5 2.4GHz processor.


Table 4: Manchester forecast reconciliation results.

Table C.2: London forecast reconciliation results for RMSSE loss. Notes: This table shows forecast accuracy measured in root mean squared scaled error (RMSSE). The forecast horizon is one week ahead. The best forecast reconciliation results for each base forecast (Naive, ETS, SARIMA, Forecast Combination) are highlighted in gray.