Hierarchical Forecasting at Scale

Hierarchical forecasting techniques allow for the creation of forecasts that are coherent with respect to a pre-specified hierarchy of the underlying time series. This targets a key problem in e-commerce, where we often find millions of products across many product hierarchies, and forecasts need to be made for both individual products and product aggregations. However, existing hierarchical forecasting techniques scale poorly when the number of time series increases, which limits their applicability at a scale of millions of products. In this paper, we propose to learn a coherent forecast for millions of products with a single bottom-level forecast model by using a loss function that directly optimizes the hierarchical product structure. We implement our loss function using sparse linear algebra, such that the number of operations in our loss function scales quadratically rather than cubically with the number of products and levels in the hierarchical structure. The benefit of our sparse hierarchical loss function is that it provides practitioners a method of producing bottom-level forecasts that are coherent to any chosen cross-sectional or temporal hierarchy. In addition, removing the need for a post-processing step as required in traditional hierarchical forecasting techniques reduces the computational cost of the prediction phase in the forecasting pipeline, as well as its deployment complexity. In our tests on the public M5 dataset, our sparse hierarchical loss function performs up to 10% better as measured by RMSE and MAE compared to the baseline loss function. Next, we implement our sparse hierarchical loss function within an existing gradient boosting-based forecasting model at bol, a large European e-commerce platform. At bol, each day a forecast for the weekly demand of every product for the next twelve weeks is required. 
In this setting our sparse hierarchical loss resulted in an improved forecasting performance as measured by RMSE of about 2% at the product level, as compared to the baseline model, and an improvement of about 10% at the product level as measured by MAE. Finally, we found an increase in forecasting performance of about 5–10% (both RMSE and MAE) when evaluating the forecasting performance across the cross-sectional hierarchies that we defined. These results demonstrate the usefulness of our sparse hierarchical loss applied to a production forecasting system at a major e-commerce platform.


Introduction
In e-commerce, we are often faced with two forecasting challenges. First, forecasts at the lowest granularity (often the individual product level) are required, but we also need forecasts at higher granularities, for example at the category, department or regional level, as higher-level forecasts are often needed in logistics and planning. Second, the forecasts across the levels of a hierarchy should be coherent with each other. Hierarchical forecasting methods adjust the forecasts for each level in the hierarchy by minimizing the errors at each forecast level. These methods are applied as a post-processing step that requires a matrix inversion that scales cubically with the number of products or product hierarchies [1,2,5]. In settings with millions of products, such as in e-commerce, this becomes computationally expensive at prediction time. Neural network methods can optimize for the hierarchy in an end-to-end manner; however, these are either multivariate methods that scale poorly to millions of time series [6] or they can only optimize for the temporal hierarchy [3].
Sparse loss function. In order to overcome these scaling issues, we design a sparse Hierarchical Loss (HL) function that directly optimizes both cross-sectional and temporal hierarchical structures. Our corresponding sparsity-aware implementation ensures that the number of operations in our loss function scales quadratically rather than cubically with the number of products and levels in the hierarchical structure, enabling computationally efficient training. The benefit of our sparse hierarchical loss function is that it provides practitioners a method of producing bottom-level forecasts that are coherent to any chosen cross-sectional and temporal hierarchy. In addition, removing the need for a post-processing step as used in traditional hierarchical forecasting techniques reduces the computational cost of the prediction phase in the forecasting pipeline. Furthermore, this also reduces the deployment complexity of the forecasting pipeline.
Evaluation. We evaluate our sparse HL function on a gradient-boosted forecasting system on the public M5 dataset [7] and a proprietary dataset from our e-commerce partner. For the M5 dataset, we demonstrate that our implementation provides up to 10% better forecasting performance as measured by both RMSE and MAE compared with (i) reconciliation methods and (ii) baseline bottom-level forecasting methods that use a standard loss function. For the proprietary dataset, we present the results of an offline test on the product-level forecast system of bol, a European e-commerce company with a catalog of millions of unique products. We find our sparse HL function improves the forecasting performance by about 2% on RMSE and 10% on MAE as compared to the baseline forecasting system. This demonstrates the usefulness of our sparse HL function in a large-scale setting.
Contributions. In summary, the main contributions of this paper are: 1. We design a sparse hierarchical loss function that enables direct end-to-end training of cross-sectional and temporal hierarchical forecasts in large-scale settings in Section 4.
2. We empirically demonstrate that our sparse hierarchical loss function can outperform existing hierarchical forecasting reconciliation methods by up to 10% in Section 5.1.
3. We show how our sparse hierarchical loss function scales to large-scale settings and demonstrate a reduction of both training and prediction time of up to an order of magnitude compared to the best hierarchical forecasting reconciliation methods (Section 5.1).
4. We present the results of an offline test of our method for the primary product demand forecasting model at bol, a European e-commerce company with a catalogue of millions of unique products, demonstrating an improvement of 2% on RMSE and 10% on MAE as compared to the baseline forecasting system, in Section 5.2.

Related work
Forecasting for large-scale settings. Contemporary large-scale forecasting applications require forecasting many time series concurrently [8]. In academia, there has been a surge in the use of neural network-based forecasting methods, which commonly learn a single forecast model that can produce forecasts for many time series. We refer the interested reader to the recent survey of Benidis et al. [9] for an overview of these methods. However, tree-based methods topped the M5 forecasting competition [7], which is believed to be due to the strong implementations of these algorithms that are available [10], such as the LightGBM [11] or XGBoost [12] packages. Our own experience within bol confirms this view: the ease of use, execution speed and strong default performance are key reasons a tree-based method is often the default choice when creating a new forecasting model.
Hierarchical forecasting. Hierarchical forecasting [1,13,14,15,5] and temporal hierarchical forecasting techniques [16,2,3,4] aim to solve the problem of creating forecasts that are coherent with respect to a pre-specified cross-sectional and/or temporal hierarchy of the underlying time series. We divide hierarchical forecasting methods into Reconciliation methods and Other methods.
Reconciliation methods. These methods solve the hierarchical forecasting problem as a post-processing step by reconciling the forecasts to a pre-specified cross-sectional and/or temporal hierarchy [1,13,14,15,5,17,18]. Limitations of these approaches are (i) that they require a post-processing step, (ii) computing the reconciliation may be computationally expensive, as we show in Section 3.1, and (iii) approaches that are computationally less expensive tend to perform worse, as we show in Section 5. Recent work by Taieb [16], Ben Taieb and Koo [15] has improved forecasting performance of previous reconciliation approaches but at the expense of even higher computational costs, as we explain in Section 3.
Other methods. In [6,3], neural network-based end-to-end hierarchical probabilistic forecasting methods are proposed to solve the hierarchical forecasting problem. More recently and most closely related to our work, Han et al. [19] introduced SHARQ, a method that reconciles probabilistic hierarchical forecasts during training by employing a regularized loss function that aims to improve the hierarchical consistency of bottom-up forecasts. However, the regularization does not strictly enforce the cross-sectional hierarchy in this method.

Background
To understand our problem setting and the issues we identify with existing hierarchical forecasting methods, we introduce the hierarchical forecasting problem and common methods for solving it.
Suppose we have a time series y_t, where t denotes the time stamp. We are interested in finding estimates ŷ_t of future values y_t of the time series by employing a model f based on past values y_{t−1}, ..., y_{t−T} of the time series and additional attributes X:

ŷ_t = f(y_{t−1}, ..., y_{t−T}, X).  (1)

In our hierarchical forecasting setting, we aim to create forecasts for many time series concurrently, whilst adhering to pre-specified hierarchical relationships that exist between the time series. This can be formalized as [20]:

ỹ_t = S P ŷ_t,  (2)

where ŷ_t ∈ R^m denotes the vector of forecasts for all m time series in the hierarchy, S ∈ {0, 1}^{m×n} is a matrix that defines the hierarchical relationship between the n bottom-level time series and the m* = m − n aggregations, P ∈ R^{n×m} is a matrix that encapsulates the contribution of each forecast to the final estimate, and ỹ_t ∈ R^m is the vector of forecasts adjusted for the hierarchy. We can use the matrix P to define various forecast contribution scenarios. Note that we can straightforwardly extend Equation (2) to the setting of temporal hierarchies [2,3] by considering forecasts of different time granularities in our vector of base forecasts ŷ_t and using an appropriate choice of S to aggregate series of a different time granularity. We will show how cross-sectional and temporal hierarchical forecasting can be combined in Section 4.
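A minimal sketch of Equation (2) makes the roles of S and P concrete. The toy hierarchy below (two bottom-level series plus their sum) and the bottom-up choice of P are illustrative, not taken from the paper's datasets:

```python
import numpy as np

# Toy hierarchy: two bottom-level series (n = 2) plus their sum (m = 3).
S = np.array([[1, 1],   # aggregate = series 1 + series 2
              [1, 0],   # bottom-level series 1
              [0, 1]])  # bottom-level series 2

# Bottom-up choice of P: keep only the bottom-level base forecasts.
P = np.array([[0, 1, 0],
              [0, 0, 1]])

y_hat = np.array([5.0, 2.0, 4.0])  # base forecasts for all m series
y_tilde = S @ P @ y_hat            # reconciled forecasts, Equation (2)

# Coherent by construction: the aggregate equals the sum of the bottom levels.
assert y_tilde[0] == y_tilde[1] + y_tilde[2]
```

Other choices of P distribute the aggregate forecasts differently over the bottom level; S always restores coherence by re-aggregating.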
The optimal solution to the problem in Equation (2) can be found using Reconciliation methods and Other methods.
Reconciliation methods. MinTShrink [5] and variants find the optimal P matrix by solving the following minimization problem for a particular choice of W:

min_P tr(S P W Pᵀ Sᵀ)  subject to  S P S = S,  (3)

where W ∈ R^{m×m} denotes the covariance matrix of the base forecast errors. Assuming W is positive definite, this has the following solution (ref. Theorem 1 of [5]):

P = (Sᵀ W⁻¹ S)⁻¹ Sᵀ W⁻¹,  (4)

in which S is partitioned as S = [Aᵀ I_n]ᵀ, where A ∈ {0, 1}^{m*×n} describes the m* aggregations of the n bottom-level time series. In MinTShrink, W is estimated using the shrunk empirical covariance estimate of [21]. Simpler choices for W, such as the identity matrix, reduce the solution to the Ordinary Least Squares (OLS) solution of [1]. In ERM, Ben Taieb and Koo [15] note that MinTShrink and variants rely on the assumption of unbiasedness of the base forecasts. They relax this assumption by formulating the hierarchical reconciliation problem as an Empirical Risk Minimization problem, introducing the ERM method. In addition, they propose two regularized variants of ERM aimed at reducing forecast variance.
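The OLS special case (W equal to the identity) can be sketched in a few lines; the hierarchy and forecast values below are illustrative:

```python
import numpy as np

# OLS reconciliation (W = I in Equation (4)): P = (S'S)^{-1} S'.
# Toy hierarchy: two bottom-level series plus their sum.
S = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])

P_ols = np.linalg.solve(S.T @ S, S.T)   # (S' W^-1 S)^-1 S' W^-1 with W = I

y_hat = np.array([7.0, 2.0, 4.0])       # incoherent base forecasts: 7 != 2 + 4
y_tilde = S @ P_ols @ y_hat             # reconciled forecasts

# After reconciliation the hierarchy is coherent again.
assert np.isclose(y_tilde[0], y_tilde[1] + y_tilde[2])
```

Note the solve involves an n × n (here 2 × 2) system for this toy case; as the text discusses next, the required inversions become the bottleneck at scale.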
Other methods. Hier-E2E [6] solves the problem of Equation (2) by learning a neural network model that combines the forecasting and reconciliation step in a single model, resulting in an end-to-end solution that removes the need for a post-processing step. Similarly, COPDeepVAR [3] is an end-to-end neural network method that enforces temporal hierarchies; however, it is a univariate method that is not able to enforce structural hierarchies (i.e., cross-sectional hierarchies) simultaneously, and is therefore not suited to our task. SHARQ [19] also moves the reconciliation step into the training phase and achieves reconciliation using a regularized loss function, where the regularization enforces the coherency. However, this method does not strictly enforce coherence with the hierarchy.

Scaling issues of hierarchical forecasting methods
Our main motivation for this paper is the limitations of prior work in problem settings with many time series.
Scaling issues with reconciliation methods. In reconciliation methods, we encounter the following issues when scaling to many time series:
• The reconciliation is performed as a post-processing step, and thus has to be performed as an additional step after generating the base forecasts. Even though P in Eq. (2) needs to be computed only once using Eq. (4), the reconciliation still needs to be performed after each base forecast is produced. Also, P ideally is sparse [15], but no reconciliation method guarantees this, so computing Eq. (2) will generally be a dense matrix-vector product that scales with the number of time series.
• For MinTShrink [5], estimating W according to the method of [21] is computationally expensive, with a computational complexity of O(N n²), where N denotes the number of training samples used to compute the shrunk covariance estimate. In addition, the shrunk covariance estimate of [21] is not guaranteed to give consistent results in high-dimensional settings [22], making it less applicable for problem settings with many time series. Finally, the estimate for W will generally be a dense matrix, so we cannot make use of efficient sparse algorithms to solve Eq. (4). However, even for simpler, sparse choices of W (such as the identity matrix of OLS [1]), we still need to invert a matrix of size m* × m* in order to solve Eq. (4), which becomes computationally costly for problems with many aggregations, which naturally arise in retail forecasting scenarios. For example, for the M5 retail forecasting competition [23], m* = 12,350, even though there are only 3,049 unique products in this dataset.
• For ERM and its regularized variants [15], we need to either invert multiple dense matrices that scale quadratically with the number of time series, or we need to compute a Kronecker product that scales quadratically with the number of time series, followed by an expensive lasso search procedure. Improving the computational complexity of the ERM methods is also mentioned in [15] as an avenue for future work.
Scaling issues with other methods. Hier-E2E [6] is a multivariate method, which means both input and output of the neural network scale with the number of time series. For neural networks, this significantly adds to the training cost and parameter cost, as a large number of parameters is required to handle all the separate time series. This in turn requires GPUs with more memory to train these models, which increases the cost to operate them.

Sparse Hierarchical Loss
In this section we present our main technical contribution, the sparse hierarchical loss. First, we show how cross-sectional and temporal hierarchical forecasting can be combined. Then, we introduce our loss function and demonstrate it via a toy example.
Combining cross-sectional and temporal hierarchical forecasting. We are interested in finding forecasts that can be aggregated according to a pre-specified cross-sectional hierarchy S_c ∈ {0, 1}^{m_c×n_c} and temporal hierarchy S_t ∈ {0, 1}^{m_t×n_t}:

ỹ^c_t = S_c P_c ŷ^c_t,  (5)

ỹ = S_t P_t ŷ.  (6)
These equations can be interpreted as follows:
• In Eq. (5), we aggregate n_c bottom-level time series from the same timestep t across a set of m_c = n_c + m_{*,c} cross-sectional aggregations.
• In Eq. (6), we aggregate each time series consisting of n_t timesteps across a set of m_t = n_t + m_{*,t} temporal aggregations; hence we drop the subscript t.
We will only create bottom-level forecasts, thus P_c = [0_{n_c × m_{*,c}} I_{n_c}] and P_t = [0_{n_t × m_{*,t}} I_{n_t}], yielding:

ỹ^c_t = S_c ŷ^b_t,  (7)

ỹ = S_t ŷ^b,  (8)

in which ŷ^b denotes the vector of bottom-level forecasts. Considering only bottom-level forecasts has a number of benefits: (i) each forecast is coherent to any hierarchy by design, and (ii) we reduce the number of required forecasts from m to n, which can be a significant reduction (there is no need for a forecast for the m* aggregations in the hierarchy). We now construct a matrix of bottom-level forecasts Ŷ ∈ R^{n_c×n_t}, in which the columns represent the forecasts of the bottom-level time series at a timestep t. This allows us to combine (7) and (8) as follows:

Ỹ = S_c Ŷ S_tᵀ,  (9)

in which Ỹ ∈ R^{m_c×m_t} represents the matrix of forecasts aggregated according to both cross-sectional and temporal hierarchies. Equivalently, we can aggregate our bottom-level ground truth values Ȳ ∈ R^{n_c×n_t}:

Ȳ_agg = S_c Ȳ S_tᵀ.  (10)

Sparse hierarchical loss. To find the best forecasts for the hierarchical forecasting problem (9), we seek a model f(·) by gradient-based optimization of the following loss function:

L(Ŷ, Ȳ) = Σ_{i=1}^{m_c} Σ_{j=1}^{m_t} (S_c (Ŷ − Ȳ) S_tᵀ)²_{ij} / (d_c ⊗ d_t)_{ij},  (11)

in which l_c and l_t denote the number of levels in hierarchies S_c and S_t, respectively, d_c ∈ R^{m_c} and d_t ∈ R^{m_t} are denominator vectors whose entries contain, for each aggregation, the number of bottom-level elements it comprises multiplied by the number of levels (l_c and l_t, respectively), i, j are the (row, column) indices of the elements of the S-matrices, and ⊗ denotes the outer product between the denominator components in the loss function. Note that (11) shares similarities with the Weighted Root Mean Squared Error from the M5 competition [7].
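The combined aggregation and loss can be sketched on the text's own toy dimensions. This is a minimal sketch assuming the denominator vectors scale each aggregation by its number of bottom-level elements times the number of levels, as motivated later in this section; it is not the authors' implementation:

```python
import numpy as np

# Toy setting from the text: n_c = n_t = 2 bottom series/timesteps, one
# cross-sectional and one temporal aggregation, l_c = l_t = 2 levels.
S_c = np.array([[1, 1], [1, 0], [0, 1]])  # total, series 1, series 2
S_t = np.array([[1, 1], [1, 0], [0, 1]])  # total, timestep 1, timestep 2
l_c = l_t = 2
d_c = l_c * S_c.sum(axis=1)               # [4, 2, 2]
d_t = l_t * S_t.sum(axis=1)               # [4, 2, 2]
D = np.outer(d_c, d_t)                    # denominator (d_c outer d_t)

def hierarchical_loss(Y_hat, Y_bar):
    E = S_c @ (Y_hat - Y_bar) @ S_t.T     # aggregated errors, Eqs. (9)-(10)
    return np.sum(E**2 / D)               # Eq. (11)

Y_bar = np.array([[1.0, 2.0], [3.0, 4.0]])
assert hierarchical_loss(Y_bar, Y_bar) == 0.0        # perfect forecast
assert hierarchical_loss(Y_bar + 1.0, Y_bar) > 0.0   # any error is penalized
```

Every bottom-level error is counted once directly and again inside each aggregation it belongs to, which is the smoothing effect analyzed next.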
We can derive the gradient and hessian of (11) with respect to the bottom-level forecasts Ŷ. Using the chain rule through (9):

∂L/∂Ỹ = 2 (Ỹ − S_c Ȳ S_tᵀ) ⊘ (d_c ⊗ d_t),  (12)

∂L/∂Ŷ = S_cᵀ (∂L/∂Ỹ) S_t  (13)

= 2 S_cᵀ [ (S_c (Ŷ − Ȳ) S_tᵀ) ⊘ (d_c ⊗ d_t) ] S_t,  (14)

∂²L/∂Ŷ² = 2 S_cᵀ [ J ⊘ (d_c ⊗ d_t) ] S_t,  (15)

where ⊘ denotes element-wise division and J is the all-ones matrix of size m_c × m_t. Analysis. The best possible forecast is achieved when the loss (11) is minimized, or equivalently when the gradient (13) is zero, which happens when Ŷ = Ȳ. Thus, the best forecast model is found when each bottom-level forecast equals the ground truth. This is equivalent to the standard (i.e., non-hierarchical) squared error loss often used in forecasting problems. We argue that our hierarchical loss gradient can be seen as a smoothed gradient compared to the standard squared error loss gradient (i.e., proportional to Ŷ − Ȳ). For example, consider the canonical case where we have two bottom-level time series (n_c = 2) consisting of two timesteps (n_t = 2). Furthermore, suppose we have a single cross-sectional aggregation (the sum of the two time series, thus m_{*,c} = 1 and m_c = m_{*,c} + n_c = 3), and a single temporal aggregation (the sum of the two timesteps, thus m_{*,t} = 1 and m_t = m_{*,t} + n_t = 3). Finally, there are two levels in our cross-sectional hierarchy and in our temporal hierarchy, thus l_c = 2 and l_t = 2. The standard squared error loss gradient for this problem is:

[ e_11  e_12
  e_21  e_22 ],  (16)

in which e_ij denotes the bottom-level error (ŷ_ij − ȳ_ij). For our hierarchical loss, Eq. (9) reads:

Ỹ = [ ŷ_11+ŷ_12+ŷ_21+ŷ_22   ŷ_11+ŷ_21   ŷ_12+ŷ_22
      ŷ_11+ŷ_12             ŷ_11        ŷ_12
      ŷ_21+ŷ_22             ŷ_21        ŷ_22 ],

and the gradient of the loss with respect to the bottom-level time series (14) reads (ref. Appendix A for the full derivation):

(∂L/∂Ŷ)_kl = 2 [ e_kl/4 + (e_k1 + e_k2)/8 + (e_1l + e_2l)/8 + (e_11 + e_12 + e_21 + e_22)/16 ],  for k, l ∈ {1, 2}.
When we compare this result to the standard squared error loss gradient (16), we find that we smooth the bottom-level gradient by adding to it portions of the gradients of all cross-sectional and temporal aggregations the bottom-level series belongs to. This derivation also shows the motivation for adding the denominator parts d_c and d_t to the loss function (11): it is necessary to scale the aggregation gradients by the number of elements in the aggregation, otherwise the magnitude of the gradient grows with the number of time series and the number of levels in the hierarchy, which we found to be undesirable when trying to facilitate stable learning.
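One way to sanity-check this derivation is to compare the analytic gradient against a finite-difference estimate of the loss on the toy example; a sketch, under the same denominator-scaling assumption as the text's example:

```python
import numpy as np

# Toy hierarchy and denominators as in the worked example (l_c = l_t = 2).
S_c = np.array([[1, 1], [1, 0], [0, 1]], dtype=float)
S_t = S_c.copy()
D = np.outer(2 * S_c.sum(axis=1), 2 * S_t.sum(axis=1))  # d_c outer d_t

def loss(Y_hat, Y_bar):
    E = S_c @ (Y_hat - Y_bar) @ S_t.T
    return np.sum(E**2 / D)

def grad(Y_hat, Y_bar):
    E = S_c @ (Y_hat - Y_bar) @ S_t.T
    return 2 * S_c.T @ (E / D) @ S_t

rng = np.random.default_rng(0)
Y_hat, Y_bar = rng.normal(size=(2, 2)), rng.normal(size=(2, 2))
g = grad(Y_hat, Y_bar)

# Finite differences should match the analytic gradient entry-wise.
eps = 1e-6
for k in range(2):
    for l in range(2):
        Yp = Y_hat.copy()
        Yp[k, l] += eps
        fd = (loss(Yp, Y_bar) - loss(Y_hat, Y_bar)) / eps
        assert abs(fd - g[k, l]) < 1e-4
```

With all bottom-level errors equal to 1, each gradient entry evaluates to 2(1/4 + 1/4 + 1/4 + 1/4) = 2, matching the four smoothed terms in the worked example.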
Sparsity. S_c and S_t are highly sparse. For example, S_c has only n_c l_c non-zero elements: the number of bottom-level time series multiplied by the number of levels in the hierarchy. Hence, the overall sparsity of S_c is given by 1 − n_c l_c / (m_c n_c). For the M5 dataset [23], n_c = 3,049, l_c = 12, m_c = 42,840, corresponding to a sparsity of 99.97%. Next, the matrix of bottom-level ground truth values Ȳ in (10) may be sparse too, for example in the case of products that are not on sale for every timestep n_t in the dataset. All these sources of sparsity motivate the use of sparse linear algebra when computing Equations (11)–(15).
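The sparsity arithmetic above can be verified directly; the non-zero positions in the sketch below are illustrative, not the real M5 hierarchy:

```python
import numpy as np
from scipy import sparse

# Each bottom-level series contributes one non-zero per hierarchy level,
# so S_c has n_c * l_c non-zeros and its density is l_c / m_c, independent
# of n_c. The dimensions are the M5 values quoted in the text.
n_c, l_c, m_c = 3049, 12, 42840
sparsity = 1 - (n_c * l_c) / (m_c * n_c)
assert round(sparsity, 4) == 0.9997

# In CSR format only the non-zeros are stored, so S_c stays small.
rows = np.repeat(np.arange(l_c) * (m_c // l_c), n_c)  # illustrative rows
cols = np.tile(np.arange(n_c), l_c)
S_c = sparse.csr_matrix((np.ones(n_c * l_c), (rows, cols)), shape=(m_c, n_c))
assert S_c.nnz == n_c * l_c
```

A dense float64 S_c would need roughly m_c × n_c × 8 bytes (about 1 GB); the CSR version stores only about 36.6k non-zeros.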
Implementation. We implement the hierarchical loss (11), the bottom-level gradient (13)–(14) and hessian (15) in Python using the sparse library from SciPy [24]. Note that equations (13)–(14) can be rearranged as:

∂L/∂Ŷ = 2 [ S_cᵀ diag(d_c)⁻¹ S_c ] (Ŷ − Ȳ) [ S_tᵀ diag(d_t)⁻¹ S_t ],

such that the parts before and after the brackets can be precomputed, as they do not depend on the forecast values, avoiding a costly division operation inside a training iteration. Also note that the hessian (15) does not depend on the forecast values, so it can be precomputed as well. Our implementation, including the code to reproduce the experiments on public data from Section 5, is available on GitHub. 1
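The precomputation can be sketched as follows. This is an illustrative sketch of the idea, not the authors' released code: the sparse factors A and B and the constant hessian are built once, and the per-iteration objective only performs sparse matrix products, returning flattened gradient/hessian arrays in the form a LightGBM custom objective is expected to produce:

```python
import numpy as np
from scipy import sparse

# Toy 2x2 hierarchy with l_c = l_t = 2 levels.
S_c = sparse.csr_matrix(np.array([[1., 1.], [1., 0.], [0., 1.]]))
S_t = sparse.csr_matrix(np.array([[1., 1.], [1., 0.], [0., 1.]]))
d_c = 2 * np.asarray(S_c.sum(axis=1)).ravel()
d_t = 2 * np.asarray(S_t.sum(axis=1)).ravel()

# Forecast-independent factors, computed once outside the training loop.
A = (S_c.T @ sparse.diags(1.0 / d_c) @ S_c).tocsr()   # S_c' diag(d_c)^-1 S_c
B = (S_t.T @ sparse.diags(1.0 / d_t) @ S_t).tocsr()   # S_t' diag(d_t)^-1 S_t
H = 2 * (S_c.toarray().T @ (1.0 / np.outer(d_c, d_t)) @ S_t.toarray())

def hl_objective(y_pred, y_true, shape=(2, 2)):
    """Gradient and (constant) hessian, flattened as a custom
    gradient-boosting objective would return them."""
    E = y_pred.reshape(shape) - y_true.reshape(shape)
    grad = 2 * (A @ ((B.T @ E.T).T))   # = 2 * A @ E @ B, via sparse products
    return np.asarray(grad).ravel(), H.ravel()
```

Because the hessian is constant, only the two sparse products for the gradient run inside each boosting iteration.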

Experiments
In this section we empirically verify the usefulness of our sparse hierarchical loss. First, we evaluate forecasting accuracy using a set of experiments on the public M5 dataset [23]. Then, we evaluate our sparse hierarchical loss in an offline experiment on a proprietary dataset from our e-commerce partner.

Public datasets
Task & Dataset. Our task is to forecast product demand. We use the M5 dataset [23] for our offline, public dataset experiments. The M5 dataset contains product-level sales from Walmart for 3,049 products across 10 stores in the USA. Furthermore, the dataset contains 12 cross-sectional product aggregations (e.g. department, region), which allow us to test hierarchical forecasting performance. We preprocess the dataset resulting in a set of features as described in Appendix B. We forecast 28 days into the future.
Baseline models. For our baseline forecasting model, we primarily use LightGBM [11], trained to predict one day ahead. We subsequently recursively generate predictions for 28 days. Tree-based models dominated the M5 forecasting competition due to their strong performance and ease of use [7,10]. Moreover, our e-commerce partner's primary product forecasting model is LightGBM-based, so we expect results from offline experiments on public datasets to transfer to our proprietary setting when using the same base forecasting model. We compare the performance of our LightGBM models against the traditional statistical methods ARIMA [25], ETS [26], Theta [27], Seasonal-Naive [20], Naive [20] and Croston [28]. We note that deep learning based approaches are becoming more prevalent in e-commerce [29], especially with the rise of the Transformer architecture in forecasting models [30,31]. We consider this for future work, and did not consider this for our study as (i) the cloud cost to operate these models is 10x higher for our e-commerce partner as compared to a tree-based model, and (ii) none of the neural network-based methods are able to scale to the size of our e-commerce partner, as explained in Section 3.1.
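The recursive multi-step strategy can be sketched generically: a one-step-ahead model is applied repeatedly, feeding each prediction back in as the most recent lag. The stand-in model below is a simple persistence rule; the text uses a one-day-ahead LightGBM model in the same role:

```python
# Recursive multi-step forecasting with a one-step-ahead model.
def recursive_forecast(one_step_model, history, horizon=28):
    history = list(history)
    predictions = []
    for _ in range(horizon):
        next_value = one_step_model(history)  # predict one step ahead
        predictions.append(next_value)
        history.append(next_value)            # feed prediction back as input
    return predictions

persistence = lambda h: h[-1]                 # stand-in one-step model
preds = recursive_forecast(persistence, [3.0, 5.0])
assert len(preds) == 28 and preds[0] == 5.0
```

The trade-off of this strategy is that one-step errors compound over the horizon, which is relevant for the per-forecast-day analysis later in this section.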
Experimental setup. To test our sparse hierarchical loss function against baseline forecasting systems, we consider the following scenarios:
1. Bottom-up. We train a single global model on only the bottom-level time series.
2. Separate aggregations. We train separate models for every aggregation in the hierarchy, resulting in 12 models for the entire M5 dataset.
3. Global. We train a single global model on all time series in the dataset, including all the aggregations.
For the first scenario in our experiments (Bottom-up), we vary both the objective (i.e. the loss function that is optimized by LightGBM) and the evaluation metric (i.e. the loss function that governs early stopping during hyperparameter optimization). For the objective, we consider the squared error loss (SL), the Tweedie loss (TL) and our sparse hierarchical loss (HL). The Tweedie loss assumes that the time series follow a distribution somewhere in between a Poisson and a Gamma distribution, which is useful in zero-inflated settings such as retail demand forecasting. It is a loss function that was favored by contestants in the M5 forecasting competition [10], and it is also the loss used in the primary forecasting system of our e-commerce partner.
For the latter two scenarios, we will obtain non-coherent forecasts. Thus, these methods require a reconciliation post-processing step to reconcile the forecasts to the hierarchy. We employ the following cross-sectional reconciliation methods:
• Base. No reconciliation is performed.
• WLS-struct and WLS-var. Weighted Least Squares (WLS) [5], where W in Equation (3) is a diagonal matrix containing respectively the sum of the rows of S (WLS-struct) or the in-sample forecast errors (WLS-var).
• MinT-shrink. Trace Minimization [5], where W in Equation (3) is the shrunk covariance matrix of in-sample forecast errors. We also experimented with using the non-shrunk covariance matrix of the in-sample forecast errors (MinT-cov), but this produced erroneous, high-variance results, which we attribute to precisely the motivation for shrinking the covariance matrix in MinT-shrink: to reduce the variance when the number of time series considered becomes very large.
• ERM. The Empirical Risk Minimization (ERM) method [15]. Due to the computational issues explained in Section 3.1, we were only able to apply the unregularized ERM variant in our experiments, not the regularized variants.
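For concreteness, the WLS-struct weighting can be sketched on a toy hierarchy; this is an illustrative sketch of the weighting scheme, not the experiment code:

```python
import numpy as np

# WLS-struct: W is diagonal with the row sums of S, so each aggregate is
# down-weighted by the number of bottom-level series it contains.
# P then follows Equation (4) with this choice of W.
S = np.array([[1., 1.],
              [1., 0.],
              [0., 1.]])
W_inv = np.diag(1.0 / S.sum(axis=1))            # W = diag(row sums of S)
P = np.linalg.solve(S.T @ W_inv @ S, S.T @ W_inv)

y_hat = np.array([7.0, 2.0, 4.0])               # incoherent: 7 != 2 + 4
y_tilde = S @ P @ y_hat
assert np.isclose(y_tilde[0], y_tilde[1] + y_tilde[2])
```

WLS-var uses the same construction with in-sample error variances on the diagonal of W instead of the structural row sums.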
We optimize the hyperparameters of each of the LightGBM models by Bayesian hyperparameter optimization using Optuna [32]. The settings for the hyperparameter optimization can be found in Appendix C. Each model is trained for 10 different random seeds, and our results are based on the mean and standard deviation of those 10 rollouts.
Evaluation. We evaluate our results for every aggregation in the hierarchy using the Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) [20]. In the results section, we present the RMSE / MAE relative to the Bottom-up scenario using the squared-loss objective with the squared-loss metric. For absolute values and standard deviations of the results, see Appendix D.
Results: LightGBM as baseline model. For our first experiment, we only consider cross-sectional hierarchies (i.e. S_t = I_{n_t}). We present our results on relative RMSE using LightGBM as baseline model in Table 4 and conclude the following:
• The best method is the Bottom-up scenario combined with our sparse hierarchical loss as objective, outperforming the baseline by 0-20% across aggregations. This holds for both settings in which we use our sparse hierarchical loss.
• Even when we only use our sparse hierarchical loss as an evaluation metric during training whilst optimizing the standard squared loss (the SL/HL scenario), we already see a small improvement of ±5% across aggregations.
• Even though the Tweedie loss improves over the baseline loss, our sparse hierarchical loss function still outperforms it by ±5% across aggregations.
• From the reconciliation methods, MinT-shrink and WLS-var perform best in the Separate aggregations-scenario, although the performance delta across aggregations is still ±5-30% as compared to the best (our) method.
For relative MAE, we present our results in Table 5. We find that overall, our sparse hierarchical loss still performs best, by ±5% compared to other loss functions and scenarios. However, the results are more nuanced: we find that MinT-shrink in the Separate aggregations scenario performs strongly as well. In addition, we also find that the Tweedie loss (TL) performs relatively well. This finding corroborates the usefulness of the TL in intermittent demand settings, such as retail, where zero demand is often observed. Next, we compare our findings against the forecasting results when employing different baseline models in Table 1. For brevity, we only show the metrics for all time series combined (incl. aggregations). In addition, we only show a single reconciliation method for the other baseline models, as we found little difference in results when employing different reconciliation methods. We then find that on RMSE, our sparse hierarchical loss outperforms the other baseline models by at least 50%, and on MAE by at least 10%. This verifies that on this dataset and with this type of problem, using a more complex model such as LightGBM greatly improves forecasting performance, as was also shown in the M5 forecasting competition [7].
Analysis: impact of hierarchy. We investigate the impact of the choice of hierarchy.
Temporal hierarchies. As we noted before, we only used cross-sectional aggregations in our first experiments. We now also include temporal aggregations by aggregating our bottom-level time series across years, weeks and months. We ablate every setting and show the results in Table 2. Interestingly, we find that using temporal hierarchies jointly with cross-sectional hierarchies reduces forecasting performance by ±35% (RMSE) and ±17% (MAE). This setting is even worse than using only temporal hierarchies, which in turn performs worse than using only cross-sectional hierarchies by ±26% (RMSE) and ±12% (MAE). We further analyze these results by studying the RMSE across the forecast days in Table 3. As noted before, we forecast 28 days ahead, and each forecast is created by recursively applying the one-step-ahead LightGBM model. We find that as we forecast further into the future, the setting with only cross-sectional aggregations starts to perform better, by up to 20% as compared to the baseline where we do not use any aggregations. Again, the setting where we also employ temporal hierarchies shows relatively poor performance across all forecast-day buckets.
Random hierarchies. In hierarchical forecasting problems, the aggregation matrices S_c and S_t are commonly fixed a priori and considered constant during training and prediction. As we perform the reconciliation in an end-to-end fashion during training, we can modify these matrices at every iteration. This allows us to understand the robustness of our solution to possible misspecification of the hierarchy, and more generally, to what extent the choice of hierarchy affects forecasting performance. We perform an experiment by randomly sampling an S_c-matrix at every iteration of the LightGBM training process. At every iteration, we sample uniformly at random (i) a number of levels for the cross-sectional hierarchy and (ii) a maximum number of categories per level, and construct a random S_c-matrix to be used in the gradient (14) and hessian (15). We validate and test on the 'true' S_c-matrix. We present the results in Table 2 and Table 3, under 'Random'. We find that performance deteriorates by about 30% as compared to the baseline, and we also find that our results in this experiment show higher variance compared to the baseline methods. Thus, misspecification of the hierarchy can severely deteriorate forecasting performance.
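A random hierarchy of this kind can be sketched as follows. The sampling details (one random category assignment per level, on top of a global total and the identity bottom rows) are our own assumptions for illustration; the authors' exact sampling scheme is in the text only at a high level:

```python
import numpy as np
from scipy import sparse

def random_S_c(n_c, max_levels=3, max_categories=5, seed=0):
    """Sample a random cross-sectional aggregation matrix S_c."""
    rng = np.random.default_rng(seed)
    n_levels = rng.integers(1, max_levels + 1)         # (i) number of levels
    blocks = [np.ones((1, n_c)), np.eye(n_c)]          # total + bottom rows
    for _ in range(n_levels):
        k = rng.integers(2, max_categories + 1)        # (ii) categories
        assign = rng.integers(0, k, size=n_c)          # series -> category
        level = np.zeros((k, n_c))
        level[assign, np.arange(n_c)] = 1.0            # one 1 per column
        blocks.append(level)
    return sparse.csr_matrix(np.vstack(blocks))

S_c = random_S_c(n_c=10)
# Each bottom series appears in the total, its own row, and one
# aggregation per sampled level.
assert (np.asarray(S_c.sum(axis=0)).ravel() >= 2).all()
```

Resampling such a matrix each boosting iteration, then validating on the true S_c, isolates how sensitive the learned forecasts are to the hierarchy specification.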
Analysis: time complexity.We investigate the computational time complexity required to perform training and prediction for each scenario and present the results in Table 6.The training and prediction time complexity is indicated by how respectively the training time and prediction time scales with respect to the default LightGBM training and prediction time complexity.We first investigate the case where we only consider crosssectional hierarchies.This case is indicated by 'HL' in Table 6.First, we note that adding our hierarchical loss objective adds a component to the time complexity that scales with n c,3 , as we need to compute (13).However, our sparse implementation of the hierarchical loss reduces this component from n c,3 to n c,2 l c , effectively reducing the scaling from cubic to quadratic in the number of bottom-level time series, as l c is generally small.In the reconciliation scenarios, we always need to compute a matrix inversion to solve Eq. ( 4) that scales cubically with the number of cross-sectional aggregations m c, * or with the total number of time series m c .The first is not problematic as generally m c, * ≪ n c in large-scale settings, but methods with this time complexity consequently trade in performance, as we observed in Table 4.To empirically verify the differences in asymptotic time complexity, we recorded the training and prediction time for each scenario.We show timings for training and prediction for a single store of the M5 dataset (4M training samples) and for the entire M5 dataset (52M training samples), to provide an indication of scaling when the problem size increases by an order of magnitude.First, we note that using our sparse implementation of the HL reduces training time by a factor of 3× when training for all stores.Second, our sparse HL has a prediction time similar to the baseline (SL).Next, we find that the training time of our sparse hierarchical loss is two orders of magnitude faster than reconciliation methods in the 
Separate aggregations scenario. This is mainly due to the many individual models that need to be trained in this scenario, and thus shows a clear benefit of having just a single model. We observe an order of magnitude difference in prediction time when comparing the sparse hierarchical loss to the Separate aggregations scenario when predicting for all stores. Again, this shows a clear benefit of having just a single model for this forecasting task. For the Global scenario, we see that reconciliation methods require a smaller training time when training for all stores (about half), however that scenario also did not give strong forecasting performance, as we established in Table 4. Also, the prediction time using our sparse HL is an order of magnitude lower. As ML costs in production systems mainly consist of prediction costs, having a lower prediction time is beneficial. Finally, we also show the time complexity of using both cross-sectional and temporal hierarchies jointly, as indicated by 'HL+' in Table 6. Adding temporal hierarchies adds another matrix multiplication to the complexity, which scales with the number of timesteps. In our experiments, we find that adding temporal hierarchies results in a twice higher training time when training for all stores and a 50% higher prediction time when predicting for all stores. We view it as potential future work to investigate how to perform this end-to-end learning of both cross-sectional and temporal hierarchies even more efficiently.
To conclude, we showed that our sparse HL incurs some additional training overhead but no additional prediction overhead compared to the base case SL, and it does not require the additional reconciliation step that reconciliation methods require.
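To illustrate why the sparse formulation scales quadratically rather than cubically, the following sketch builds a summing matrix S for a toy 4-level hierarchy and evaluates the gradient of a hierarchical squared loss via sparse products. The hierarchy, names, and loss scaling are illustrative assumptions, not the paper's implementation:

```python
import numpy as np
import scipy.sparse as sp

# Toy 4-level hierarchy over n_c = 6 bottom-level series:
# total -> 2 product groups -> 3 seasonality groups -> 6 products.
# S stacks one row per series (aggregated + bottom); each row sums its children.
agg_rows = np.array([
    [1, 1, 1, 1, 1, 1],   # total
    [1, 1, 1, 0, 0, 0],   # product group A
    [0, 0, 0, 1, 1, 1],   # product group B
    [1, 1, 0, 0, 0, 0],   # seasonality group 1
    [0, 0, 1, 1, 0, 0],   # seasonality group 2
    [0, 0, 0, 0, 1, 1],   # seasonality group 3
])
S = sp.vstack([sp.csr_matrix(agg_rows), sp.identity(6, format="csr")]).tocsr()

def hierarchical_grad(y_true, y_pred, S):
    """Gradient of 0.5 * ||S (y_true - y_pred)||^2 w.r.t. y_pred.
    Forming and applying S^T S with sparse operations avoids the dense
    cubic cost; the work scales with the number of nonzeros, which grows
    with n_c^2 l_c rather than n_c^3."""
    StS = (S.T @ S).tocsr()   # in practice precomputed once and reused
    return StS @ (y_pred - y_true)
```

Because each bottom-level series appears in only l_c rows of S, S stays extremely sparse even for millions of products, which is what keeps the gradient evaluation tractable.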

Proprietary datasets
At our e-commerce partner bol, a LightGBM-based forecasting model is used as the primary product forecasting model. The model is used to forecast weekly product demand for 12 weeks ahead. Every day, 12 separate models are trained, each tasked with forecasting demand for a single week for every product. The model is used to forecast the majority of the products on sale at any moment, approximately 5 million unique items. We investigate the use of our sparse hierarchical loss function as a drop-in replacement for the existing Tweedie loss used within the company. The offline dataset consists of 36M training samples from the period January 2017 to the end of June 2021. We test on 55M samples from the period July 2021 to January 2022. The baseline model for every weekly forecast model is a LightGBM model with a Tweedie loss (TL). The baseline model uses 19 proprietary features.
Experimental setup. We replace the TL with our HL and investigate forecasting performance on the test set. For the HL, we use the proprietary aggregations product group and seasonality group, containing approximately 70 and 6,000 unique values, respectively. We have n_c ≈ 5M bottom-level time series and m_{c,*} ≈ 6,070 aggregated time series across l_c = 4 levels: product (bottom level), product group, seasonality group, and total.
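As a sketch of what such a drop-in replacement can look like, the snippet below builds a custom objective in the form LightGBM's `fobj` interface expects (a callable returning per-sample gradients and Hessians), swapping `objective="tweedie"` for a hierarchical squared loss defined through a summing matrix S. The diagonal Hessian approximation and the row-ordering assumption are ours, not bol's production code:

```python
import numpy as np
import scipy.sparse as sp

def make_hierarchical_objective(S):
    """Custom objective for 0.5 * ||S (y - yhat)||^2, in LightGBM's
    fobj signature: f(y_pred, train_data) -> (grad, hess).
    Assumes training rows are ordered to match the columns of S."""
    StS = (S.T @ S).tocsr()   # precomputed once; sparse, cheap to apply
    diag = StS.diagonal()     # diagonal approximation of the Hessian
    def objective(y_pred, train_data):
        e = y_pred - train_data.get_label()
        return StS @ e, diag
    return objective

# Usage sketch (older LightGBM train API):
#   model = lgb.train(params, dtrain, fobj=make_hierarchical_objective(S))
# replacing the baseline params = {"objective": "tweedie"}.
```

The key property is that nothing else in the training pipeline changes: the model, features, and prediction path stay identical, and only the gradient/Hessian computation is swapped.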
Results. On average, we find that our sparse HL outperforms the existing TL model by about 1-2% on RMSE and about 10% on MAE. We further investigate performance by examining how the RMSE and MAE vary across the 12 forecasting horizons and the weekly demand buckets, and present the results in Figure 1. We find that our sparse HL performs best on both RMSE and MAE in the lower weekly demand buckets (up to 100 products sold per week), where it outperforms the TL averaged over all forecasting horizons. The TL is clearly better for higher weekly demand buckets, commonly outperforming the HL and SL by up to 5%. Next, we investigate forecasting performance across the cross-sectional hierarchies that we defined. We show the results in Figure 2. We find that for most forecasting horizons, the HL and SL outperform the TL, with an average outperformance of the HL over the TL of about 10% at the product level, about 5% at the product group level, and about 4-7% at the seasonality group level. Hence, we are able to confirm some of the results we found in the M5 experiment, although the baseline SL performed quite strongly in this experiment as well. We believe this is because the M5 experiment has many more hierarchical levels (12, compared to the 4 we used for our proprietary dataset experiment): the HL is equal to the SL in the case of no hierarchies, and with fewer hierarchies the HL thus becomes closer to the SL. Hence, we believe our HL is most useful in settings with both many time series and many hierarchies. To conclude, this experiment demonstrates the usefulness of our sparse HL applied to a production forecasting system at a major e-commerce platform.

Conclusion
We introduced a sparse hierarchical loss function to perform hierarchical forecasting in large-scale settings. We demonstrated that we are able to outperform existing hierarchical forecasting methods both in forecasting performance, by up to 10% as measured by RMSE and MAE, and in the computational time required to perform end-to-end hierarchical forecasting in large-scale settings, reducing prediction time by an order of magnitude compared to the best hierarchical forecasting reconciliation method. We empirically verified our sparse hierarchical loss in an offline test at bol, where we confirmed the results from our offline test on the public M5 dataset.
In addition to our main contributions, one of our main learnings has been that we could not find a benefit of having multiple models for separate aggregations in the hierarchy, as the bottom-up scenario we employed consistently outperformed the other scenarios. Secondly, we did not find a benefit of training a model whilst adhering to both cross-sectional and temporal hierarchies jointly.
A limitation of our work is that we did not consider the probabilistic forecasting setting, where reconciled forecasts are required across an entire forecast distribution.
For future work, we aim to extend our work to the setting of probabilistic forecasting by combining our sparse hierarchical loss with existing probabilistic forecasting frameworks from, e.g., [33, 34, 35]. In addition, we seek to further investigate solutions for efficiently combining cross-sectional and temporal hierarchies.

Figure 1: Forecasting results for the primary product forecasting model at our e-commerce partner bol. We show RMSE (a, left) and MAE (b, right) by weekly demand bucket relative to the Tweedie loss baseline for each forecasting horizon (week). The Hierarchical loss outperforms the Tweedie loss on RMSE and MAE in the smaller weekly demand buckets.


Figure 2: Forecasting results for the primary product forecasting model at our e-commerce partner bol. We show RMSE (left column) and MAE (right column) by aggregation level relative to the Tweedie loss baseline for each forecasting horizon (week). The Hierarchical loss commonly outperforms the Tweedie loss at every aggregation level.

Table 1: Forecasting results for all time series (incl. aggregations) on the M5 dataset, using different baseline models. We show absolute and relative RMSE and MAE. Lower is better, and bold indicates the best-performing method.

Table 2: Forecasting results for all time series (incl. aggregations) on the M5 dataset, ablating for the use of cross-sectional and temporal hierarchies. We show absolute and relative RMSE and MAE, with the standard deviation in brackets. Lower is better, and bold indicates the best-performing method. Note that when using neither cross-sectional nor temporal aggregations, the hierarchical loss is equal to the standard squared error loss.

Table 3: Forecasting results for all time series (incl. aggregations) on the M5 dataset, ablating for the use of cross-sectional and temporal hierarchies. We show relative RMSE for several forecasting day buckets of the forecast. Lower is better, and bold indicates the best-performing method.

Table 4: Forecasting results for all stores on the M5 dataset, using LightGBM as the baseline model. We report relative RMSE compared to the baseline (shown in italics). Lower is better, and bold indicates the best method for the aggregation, taking into account the standard deviation of the best method across the 10 seeds. For absolute values and standard deviations of the results, see Appendix D.

Table 5: Forecasting results for all stores on the M5 dataset, using LightGBM as the baseline model. We report relative MAE compared to the baseline (shown in italics). Lower is better, and bold indicates the best method for the aggregation, taking into account the standard deviation of the best method across the 10 seeds. For absolute values and standard deviations of the results, see Appendix D.

Table 6: Computational time complexity and observed timings in seconds for all scenarios. The complexity is indicated by how the training time and prediction time, respectively, scale with respect to the default LightGBM training/prediction time T, where n_t denotes the number of timesteps per time series, n_c the number of bottom-level time series in the hierarchy, n_{c,l} the number of time series in each level of the hierarchy, l_c the number of levels in the cross-sectional hierarchy, m_c (m_t) the total number of cross-sectional (temporal) aggregations, and m_* = m − n.
For each of the scenarios of our experiments in Section 5, we construct a set of features for the LightGBM model as given in Table B.7. To facilitate the fairest possible comparison across methods, each model has the same features, and for the time series aggregations in the hierarchy we compute the features over the aggregated series.
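For instance, a lag feature defined at the product level can be recomputed for an aggregated series by first summing demand over the aggregation and then applying the same feature definition; a minimal pandas sketch (column names are illustrative, not the feature set of Table B.7):

```python
import pandas as pd

# Bottom-level data: one row per (product, week).
df = pd.DataFrame({
    "product": ["a", "a", "b", "b"],
    "group":   ["g1", "g1", "g1", "g1"],
    "week":    [1, 2, 1, 2],
    "demand":  [3, 4, 5, 6],
})

# Aggregate demand to the group level, then apply the same
# feature definition (here: a one-week lag) to the aggregated series.
agg = df.groupby(["group", "week"], as_index=False)["demand"].sum()
agg["lag_1"] = agg.groupby("group")["demand"].shift(1)
```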