A Bayesian Approach for Predicting Food and Beverage Sales in Staff Canteens and Restaurants

Accurate demand forecasting is one of the key aspects for successfully managing restaurants and staff canteens. In particular, properly predicting future sales of menu items allows a precise ordering of food stock. From an environmental point of view, this ensures maintaining a low level of pre-consumer food waste, while from the managerial point of view, this is critical to guarantee the profitability of the restaurant. Hence, we are interested in predicting future values of the daily sold quantities of given menu items. The corresponding time series show multiple strong seasonalities, trend changes, data gaps, and outliers. We propose a forecasting approach that is solely based on the data retrieved from Point of Sales systems and allows for a straightforward human interpretation. Therefore, we propose two generalized additive models for predicting the future sales. In an extensive evaluation, we consider two data sets collected at a casual restaurant and a large staff canteen consisting of multiple time series, that cover a period of 20 months, respectively. We show that the proposed models fit the features of the considered restaurant data. Moreover, we compare the predictive performance of our method against the performance of other well-established forecasting approaches.


Introduction
The total turnover in the US restaurants sector is projected to reach $863bn in 2019, contributing to 4 % of the gross domestic product [2]. Currently, the restaurant industry employs 15.3 million people at one million locations across the United States. At the same time, the industry accounts for 11.4 million tons of food waste that constitutes an estimated value of $25bn going to waste [3] every year.
Food waste is a critical socioeconomic problem considering that 11.8 % of the households in the U.S. suffered from food insecurity in 2018 [9]. Furthermore, production, transportation, and disposal of unused food significantly impact the environment [1,18]. On the other hand, food costs represent 28 % to 35 % of sales in restaurants. Hence, reducing food waste can be critical to boost profitability and to reduce the environmental impact of operating a restaurant. Furthermore, also food service contractors face similar challenges when providing meals at staff canteens, hospitals, etc.
Accurate demand forecasting is one of the key aspects for successfully managing restaurants and, from an environmental point of view, maintaining a low level of pre-consumer food waste. In this work, we are interested in predicting future values of the daily sold quantities of a given menu item. Hence, we deal with time series that show multiple strong seasonalities, trend changes, and outliers. Traditionally, judgmental forecasting techniques, based on the managers experience, are applied to estimate future demand at restaurants. However, producing high quality forecasts is a time consuming and challenging task, especially for inexperienced managers. Hence, we aim for a data-driven approach that supports restaurant managers in their decision making.
Nowadays, most restaurants and canteens use (electronic) Point of Sales (POS) systems to keep track of all their sales. Clearly, the resulting data inventories are valuable sources for various data science applications such as forecasting future sales numbers of menu items. In general, many industries rely on POS data to predict future demand. However, especially in the retailing sector, it is hard to make beneficial use of these estimates due to complicated supply chains and long lead times. In contrast to that, the restaurant industry is characterized by short lead times when ordering food stock and the absence of a complex supply chain.
Hence, we propose a forecasting approach that requires only POS data, i.e., weather information and special promotions are not considered. Moreover, in order to ensure acceptance of the approach among restaurant managers, the models ability for a straightforward human interpretation is critical.
The remainder of this paper is organized as follows. First, in Section 2, we provide a concise problem description. In Section 3, we briefly discuss related literature. Then, in Section 4, we propose our forecasting approach. In Section 5, we evaluate the prediction quality of our approach using real-world data of sales in a restaurant and a canteen. Finally, Section 6 concludes the paper and outlines future research directions.

Problem Description
This work is concerned with predicting the daily sold quantities of given menu items. This information, together with the recipes of the items, is essential to effectively manage the ordering of food stock. In our use-case, an (electronic) Point of Sales (POS) system is the exclusive data source. Typically, such systems create a new line item for each sold menu item together with a time stamp. This way, large data inventories, consisting of thousands of line items, are created. Each restaurant, let it be a small burger joint or a large company canteen, usually has many different products on its menu. While some products are offered all year long, others are only seasonally available. Moreover, the demand for certain products differs over the year.
Each menu item is usually identified through an unique ID. From the stored records, we can query the accumulated number of sales of the different menu items for each day. Hence, gathering the data required for our forecasting approach causes little to no additional administrative burden. In some cases, it is reasonable to do forecasting for product groups rather than for individual menu items. For example, a canteen may have a daily vegetarian option. Each day a different dish, identified by its own ID, is served.

Assumptions
For now, we assume that the sales of individual menu items are independent of each other. While this assumption is unlikely to be true for all types of restaurants, e.g., some sides go along better with certain main dishes than others, it is a reasonable simplifying assumption.
By design, a POS system only documents sales. The absence of recorded sales of a certain menu item on a given day may be due to several reasons: (i) The restaurant was closed. In this case, there is no recorded sale of any item at all. (ii) The item was not on the menu that day by choice or due to lack of stock. (iii) Nobody wanted to order the product. Obviously, it makes a difference if a product was not sold due to the fact that nobody wanted to order it (iii), or due to the reason that the product was not offered (i, ii). To distinguish between these to cases we treat an absence of sales caused by (iii) as zero sales and the absence of sales caused by (i) or (ii) as missing values. Setting the number of sales independently of (i), (ii), or (iii) always equal to zero can lead to inferior model fits, especially if the periods where the restaurant is closed, or does not offer special products, are aperiodic, or do not show a common pattern. For sure, this could be prevented by adding some information regarding why there are no sales to the model. However, this would require to manage and store additional data other than the POS data, which we do not want to presuppose in this work. We assume that the restaurant was opened for business on a given day if there is at least one recorded sale of any menu item. Usually, data available from a POS system contains no records about when a product was added to and when it was removed from the menu. Therefore, we assume that an item has been removed from the menu if there is no recorded sale for 60 or more days.

Related Work
In this section, we give a brief overview of related literature. We focus on forecasting approaches used in the restaurant industry as well as on the use of POS data.
A general view of forecasting methods for restaurant sales and customer demand is given in the recent review paper by Lasek et al. [29]. Ryu and Sanchez [37] compare several methods (moving average, multiple regression, exponential smoothing) in order to estimate the daily dinner counts at a university dinning center. In their case, a study shows that multiple regression gives the most accurate predictions. Also, Reynolds et al. [35] use multiple regression to predict the annual sales volume of the restaurant industry (and of certain subsegments). Forst [13] applies ARIMA and exponential smoothing models to forecast the weekly sales (in USD) at a small campus restaurant. Hu et al. [21] use ARIMA methods to predict the customer count at casino buffets. Cranage and Andrew [11] use exponential smoothing to predict the monthly sales (in USD) at a restaurant. Bujisic et al. [5] analyze the effects of weather factors onto restaurant sales. The paper also contains a review of forecasting approaches for restaurant sales (with and without weather factors). However, the authors point out that most work from the literature is concerned with aggregated forecasts, i.e., only predictions for product categories or weekly sales numbers are given, rather than forecasting sales of individual menu items per day.
Moreover, Tanizaki et al. [41] propose to use machine learning techniques in order to forecast the daily number of customers at restaurants. Their predictions are based on POS data that is enriched with external data, e.g., weather information and event data. Similarly, Kaneko and Yada [26] present a deep learning approach to construct a prediction model for the sales at supermarkets using POS data.
So far, this short review shows that existing work in the literature is mostly concerned with predicting aggregated numbers rather than the sale of individual items. Moreover, it reveals that using POS data is common practice in many industries. However, our review exposes a gap in the existing literature as, to the best of our knowledge, there exists no work concerned with predicting sales of individual menu items at restaurants (based on POS data).
Hence, we extend our review onto prediction approaches dealing with data showing similar features than ours. The problem considered in this work is clearly a time series prediction problem. Its most dominant characteristics are: multiple strong seasonalities, trend changes, data gaps, and outliers. Thus, the application of classical time series models such as ARIMA models or exponential smoothing [20] is of limited use. These models can only account for one seasonality and, moreover, require missing data points to be interpolated. In principle, multiple seasonality patterns can be dealt with in exponential smoothing, see e.g., Gould et al. [16]. However, most available ARIMA or exponential smoothing implementations, e.g., [23,22], do not include such features.
Therefore, Taylor and Letham [42] present a decomposable time series model that is similar to a generalized additive model [17]. It consists of three main components: trend, seasonality, and holidays. The model is designed in such a way that it allows for a straightforward human interpretation for each parameter allowing an analyst to adjust the model if necessary. The approach is concerned with time series having features such as multiple strong seasonalities, trend changes, outliers, and holiday effects. As a motivating example they use a time series describing the number of events that are created on Facebook every day. However, this number is quite large in compar-ison to the daily sales in restaurants and canteens. Thus, for Facebook's example it is valid to assign a normal distribution to the target variable. However, this might be inferior in our use case, since small target values are quite common. Note that the normal distribution N (λ, λ) has a shape similar to the Poisson distribution P oisson(λ) for large λ. In contrast to the normal distribution, the Poisson distribution is a classical distribution for count data.

Methodology
In this section, we propose two Bayesian generalized additive models (GAMs) for demand prediction. The first one assumes a normally distributed target (also called response) y, the second one assumes that the response y follows a negative binomial distribution. Both models include a trend function g(t) and a seasonality function s(t). Sparsity inducing priors [33,7] are assigned to the parameters of g(t) and s(t). This allows for a good model regularization, selection of significant trend changes or influential seasonal effects, and, finally, enhanced model interpretability.
Considering that restaurants and canteens commonly offer many different menu items and that new data is recorded on a daily basis, multiple models have to be trained on a frequent basis. For this reason, the training process should require only little computation time. Additionally, the inference should take place automatically without the drawback that a human analyst has to investigate the convergence of the procedure. Therefore, we consider the application of full Bayesian inference as inappropriate and focus on the comparatively easy to obtain mode of the posterior, the so-called maximum a posteriori (MAP) estimate, instead.

Generalized Additive Models
A generalized additive model writes as where y denotes the target variable corresponding to some exponential family distribution, x 1 , . . . , x k , z 1 , . . . , x k ∈ R denote the predictors (also called covariates), g denotes a link function (bijective and twice differentiable), and f 1 , . . . , f q denote some smooth functions. Typically, the functions f i are defined as weighted sums of basis functions, i.e., f i = K i k=1 γ ik B ik (z i ). Sparsity inducing priors can be used to perform variable selection within the set {x 1 , . . . , x k } and, further, appropriate priors can be used to control the smoothness of the functions f 1 , . . . , f q .

Trend Function
To model the trend of a time series, i.e., a series of data points ordered in time, we use a polynomial spline g of degree l, see Fahrmeir et al. [12].
A mapping g : [a, b] → R is called polynomial spline of degree (order) l ≥ 0 with knot points at k 1 < . . . < k m , (a < k 1 , k m < b), if 1. g is a polynomial of degree l on each of the intervals [a, k 1 ), [k 1 , k 2 ),. . . , [k m , b), and 2. g is l − 1 times continuously differentiable provided that l > 0. Let S k l denote the set of all l-th order splines with knots given by k 1 < . . . < k m . Equipped with the operations of adding two functions and taking real multiples S k l is a real vector space. One can show that each element of S k l can be uniquely written as linear combination of the d = l + m + 1 functions For this reason, the functions B 1 , . . . , B d build a basis B of the spline space S k l . B is called the truncated power series basis (TP-basis). The TP-basis allows for a simple interpretation of the trend model The trend model g consists of a global polynomial of degree l which changes at each knot k j . The amount of change at a given knot point k j is determined by the absolute value of the corresponding coefficient γ l+j+1 . Thus, the knots of the polynomial spline are interpreted as (possible) change points in the trend. As a consequence, the assignment of a sparsity inducing prior to the trend coefficients γ 1 , γ 2 , . . . , γ d would allow for an identification of truly significant trend changes. This increases the interpretability of the model and also regularizes the trend function.
According to domain experts, for most menu items the trend (regarding the number of sales) is quite constant over long periods, while significant changes appear only occasionally. Hence, we decide to specify a knot point every k-th day between the first and the last date with observations. Assuming that the trend may change on a weekly basis, a possible value for k could be k = 7. However, for restaurants that observe only slight changes once in a while also larger values for k are useful in order to reduce the size of the resulting model. Consequently, we choose k = 30 in our experimental evaluation, see Section 5.4.3.

Prior for the Trend
In this section, we propose two different priors for the trend model. The first one is mainly inspired by the Bayesian Lasso [33]. Here, we assign independent Laplace priors to the coefficients of the trend model which do not belong to the global polynomial. In case that y is assumed to be normally distributed with given variance σ 2 the priors read as The conditioning on σ 2 is important, because it guarantees a unimodal full posterior [33]. Similarly, in case of a negative binomial response y the priors are given by Assigning Laplace priors to coefficients γ l+2 , . . . , γ d results in a sparse MAP estimate of these parameters. The amount of sparsity is controlled by the tuning parameter τ 1 . As a consequence, true change points can be automatically detected while wrongly proposed ones are ignored. Moreover, as in Bayesian ridge regression we assign independent normal priors to the coefficients γ 2 , . . . , γ l+1 : Thus, in the MAP estimate each of these coefficients is shrunken towards zero according to its importance in a matter of regularization. Finally, the improper and non-informative prior p(γ 1 ) ∝ 1 is assigned to the intercept of the trend function.
Besides the usage of Laplace and normal priors we also propose to assign the horseshoe prior, see Carvalho et al. [7], to all the coefficients of the trend function except of the intercept. In case of a Gaussian target variable the prior reads as (see [31]): where C + denotes the half-Cauchy distribution and j ∈ {2, . . . , d}. If a negative binomial response is assumed, σ 2 has to be removed from above specification. The horseshoe prior is a shrinkage prior which enforces a global scale τ on the one hand and, on the other hand, allows for individual adaptions λ 2 , . . . , λ d of the degree of overall shrinkage.
The possibility to use individual shrinkage strengths sometimes is an advantage of the horseshoe prior over the Bayesian Lasso, where the shrinkage effect is uniform across all coefficients. We notice that for some menu items the trend of the corresponding time series changes drastically within a short period of time. In this case uniform shrinkage either leads to an over-shrinkage of heavy changes, or to an under-shrinkage of insignificant changes. However, it turned out that a model using the horseshoe prior requires a very good initialization of the respective optimization routine, in order for the procedure to converge towards a useful MAP-estimate. Since we could not identify an automatic approach to find such an initialization, we recommend to use the Bayesian Lasso prior as standard approach, and let an expert apply the horseshoe prior if necessary.
Taking into account that the application of the horseshoe prior leads to problems in terms of automation we have not considered other similarly sophisticated priors [36,4] for now. However, this is subject to future work.

Dates
Obviously, the trend model requires the time t to be given as numeric values. Hence, we use a function δ : {dates} → R that maps all possible dates (time points, days) to real numbers. We assume that observations are available at the dates t 1 < . . . < t n . Further, let N ≥ n denote the number of days that lie in the interval [t 1 , t n ]. Then, δ maps t 1 to zero and each other date t to the number of days that t 1 differs from t divided by N . Finally, δ(t 1 ), . . . , δ(t n ) are elements of the interval [0, 1).

Seasonality Function
The seasonality function s(t) is used to model periodic changes of the daily sold quantities y t . In order to take account of the seasonal effects day of week, day of month, and month of year diverse dummy variables (taking only the values 0 and 1 in order indicate the presence of a given season) are introduced. Let x = (x 1 , . . . , x s ) T denote the vector consisting of all dummy variables corresponding to the considered seasonal effects. The seasonality function is then given by

Prior for the Seasonality
Inspired by the Bayesian Lasso [33], we assign independent Laplace priors to the coefficients of the seasonality model. For a normally distributed target y, the priors are specified as and for a negative binomial target the priors read as The sparsity of the parameter's MAP estimate induced by the Laplace priors enables an automatic selection of influential seasonal effects. We do not consider the usage of the horseshoe prior [7] for the seasonal model. As already mentioned in Section 4.2.1, this prior results in a more complicated model optimization. Since we discovered empirically that the Bayesian Lasso prior performs quite well for the seasonal model it appears unnecessary to accept the additional complexity.

Deciding the Granularity of the Seasonality Model
Let the number of days with observations be denoted by n. Clearly, the size of the number n influences the usefulness of modeling diverse seasonal effects. Hence, we apply the following strategy: • n < 30: model only the effect of day of week, • 30 ≤ n < 120: additionally model the effect of month of year, • n >= 120: model all seasonal effects (day of week, day of month, and month of year).

Prediction Models
As already mentioned at the beginning of Section 4, we propose two Bayesian GAMs for predicting the daily sold quantities of certain menu items in restaurants and canteens. Both models include a trend function g(δ(t)) (see Section 4.2) and a seasonality function s(t) (see Section 4.3).

Normal Model
In the so-called normal model, the daily sold quantity at time t, denoted by y t , is modeled as where ε t denote noise terms that are assumed to be i.i.d. according to N (0, σ 2 ). Additionally, the non-informative and scale-invariant prior p(σ 2 ) ∝ 1 σ 2 is assigned to the unknown error variance. Suggestions for priors used to regularize g(δ(t)) and s(t) are described in Section 4.2.1 and Section 4.3.1. Assume that we have observations at the time points t 1 < . . . < t n , then Equation (5) translates to with according to Section 4.3 and according to Section 4.2.

Negative Binomial Model
As can be seen in Equation (6), the assumption of additive and normally distributed noise directly implies that the target y also follows a normal distribution. In our use-case this assumption cannot be satisfied since the target is a non-negative integer. Indeed, this can even lead to predicting negative values, in case that the trend is a monotonically decreasing function. More appropriate distributions for the target variable are the Poisson distribution and the negative binomial distribution, which are both well-established distributions for regression with count data. Nevertheless, it should be mentioned, that, for large λ, the normal distribution N (λ, λ) is a good approximation of the Poisson distribution P oisson(λ) due to the central limit theorem. However, the mean number of daily sold quantities can be quite small for some menu items in restaurants and staff canteens. Thus, we decide to model the target y, additionally to the normal distribution, also with a negative binomial distribution. This distribution can be considered as an extension of the Poisson distribution that also allows for over-dispersion. The mean and the variance of y are given by where µ, φ ∈ R + denote the parameters of the distribution.
Although, the negative binomial distribution does not allow for under-dispersion, we do not consider this limitation as a major drawback. Under-dispersed data is rather unlikely in practical applications [19]. Domain experts have confirmed this statement with regards to the application considered in this work. In other applications, where under-dispersion is common, the Conway-Maxwell-Poisson-distribution [10] is a notable alternative to the negative binomial distribution. This distribution allows for both under-dispersed and over-dispersed data, but at the price of significantly increased model complexity.
In the so-called negative binomial model the quantity y t sold on day t is modeled as with a > 0. Using the exponential function exp() as response function ensures that the expected demand µ t always stays positive, no matter how the estimates of g(δ(t)) and s(t) look like. Another side effect of this response functions is that the seasonal component acts in a multiplicative way on the trend, µ t = exp(g(δ(t))) exp(s(t)). As recommended by Gelman [14] we assign a standard half-normal prior N + (0, 1) to a = 1 √ φ : Note that by assigning the prior N + (0, 1) directly to φ, most of the prior mass would be on models with a large amount of over-dispersion. In case of limited over-dispersion, this can lead to a conflict between prior and data. Assuming that observations are available at the time points t 1 < . . . < t n model (9) translates to where y = (y t 1 , . . . , y tn ) T , X[i, :] denotes the i-th row of matrix X defined via (7), and Z[i, :] denotes the i-th row of matrix Z defined via (8).

Choice of the Tuning Parameters
The priors proposed for the seasonality function s(t) (see Section 4.3.1) depend on a hyperparameter τ 2 . Further, in case that the priors (2) -(4) are used for the trend g(δ(t)), they depend on hyperparameters denoted by τ 1 and τ 3 . These parameters can be considered as tuning parameters that control the amount of sparsity/regularization in the estimates of g(δ(t)) and s(t). Thus, a good specification of τ 1 , τ 2 , and τ 3 is essential for the model performance. Bad choices could either allow for too much flexibility within a given model (inclusion of too many seasonal effects, a trend function that nearly interpolates the data, etc.) and, thus, result in overfitting, or do not allow for enough flexibility such that some significant effects are ignored.
In principle, there exist three approaches to specify the hyperparameters τ 1 , τ 2 , τ 3 : • The first one is to determine standard settings that perform quite well on average. • Another possibility is to determine useful settings via cross-validation. While, in general, the prediction quality of the resulting model is better than for the standard settings, the process itself is computationally expensive. • The third option would be to specify hyper-priors for the tuning parameters. In this work, we focus on the first two approaches and defer the last one to future work.

Standard Settings
Certainly, good specifications of the tuning parameters differ from data set to data set, i.e., in the context of this work from menu item to menu item. Nevertheless, it is useful to have a fixed specification of them which works quite well on average. For this purpose, we have tested diverse possible specifications of these parameters with several data sets of representative menu items. The following specifications led to the best results:

Cross-Validation
Cross-validation allows for an automatic identification of good and individual tuning parameter specifications for each data set. Applying cross-validation on a given data set means that the data set is partitioned into multiple train/test splits. On each of these splits the considered model is trained and tested with different specifications of its tuning parameters. Finally, the specification which on average goes along with the best predictions (according to some pre-defined quality measure) is selected. To take account of temporal dependencies, we consider an expanding window approach for defining the train/test splits. The approach is outlined in Figure 1. Looking at the figure the amount of training data is gradually reduced while the amount of test data is kept constant.

Overall Strategy
We recommend to use the standard settings proposed in Section 4.  Figure 1: Illustration of the cross-validation approach. The data is assumed to be ordered temporarily.
restriction to the standard settings should not be considered too critical, since we discovered empirically that they perform quite well. In case that cross-validation is applied to specify τ 1 , τ 2 , τ 3 , we perform the process step by step for each single tuning parameter. Optimizing all three tuning parameters at once would require the training of substantially more models and is therefore considered as computationally too expensive. Within the step-wise procedure at first the parameter τ 3 (responsible for the global trend) is optimized, then the parameter τ 1 (responsible for trend changes) and, finally, the parameter τ 2 (responsible for seasonal effects). Parameters that are not currently optimized are either fixed with the standard settings provided in (12), or, if available, with the already optimized values.
Finally, it should be mentioned that the quality criterion used inside the cross-validation is the mean absolute deviation (MAD) between the predicted values and the corresponding true ones.

Prediction & Prediction Uncertainty
Assume that y t has to be predicted for t ∈ {t * 1 , . . . , t * p } with t * 1 < . . . < t * p and, further, assume that observations are available at the time points t 1 < . . . < t n with t 1 ≤ t * 1 . Let β, γ, σ, and a denote MAP estimates corresponding either to the normal model (6) or to the negative binomial model (11). Additionally, treat the estimates as if they were the true values. Then the expected value can be used for prediction. Note that the exponential function in Equation (13) is taken componentwise and, moreover, that X * and Z * are defined analogously to X and Z (Equations (7) and (8)), by replacing t 1 , . . . , t n with t * 1 , . . . , t * p . Besides predicting a posteriori reasonable values for y * , providing some uncertainty information regarding the forecasts is essential for planning food stock orderings. Especially, intervals that contain the y t * i (i = 1, . . . , p) with a pre-defined probability are of particular interest. Then, dependent on the importance of the availability of a given menu item, the manager can decide to go along with the prediction µ * i = E(y t * i | β, γ), or adjust her or his orders closer to the upper or lower interval boundary. Recall (Section 4.2 and Section 4.2.1) that the trend is assumed to be constant most of the times but changes occasionally. To model this assumption, the trend function g(δ(t)) includes a knot every k-th day between the first and the last date with observations. Additionally, sparsity inducing priors are assigned to the coefficients γ l+2 , . . . , γ d which represent the trend changes at the knot points k 1 , . . . , k m . The restriction of merely specifying knots in the observed time period, implies that the trend stays constant in unobserved periods, i.e., in the future. However, in case that prediction uncertainty information is required, possible future trend changes must be considered. Hence, we extend the trend function to .
. . , k m+r denote additionally introduced knot points. The additional knots extend the approach of specifying a knot point every k-th day within the observed period to the time interval In case that t * p − k m < k the expansion g * is equal to the original trend function g. Additionally, it is assumed that the coefficients γ d+1 , . . . , γ d+r are i.i.d. zero mean Laplace distributed: Assuming that the scale of future trend changes is determined by the scale of the historically observed changes γ l+2 , . . . , γ d the parameter b is estimated as Note that given i.i.d. samples x 1 , . . . , x N from a zero mean Laplace distribution with scale parameter b, the mean of the absolute values of the x i is the maximum likelihood estimator of b. Since the estimates β, γ, σ, and a are treated as if they were the true values, (1 − α) prediction intervals for the y t * i (i = 1, . . . , p) can be computed as follows: 1. Expand the trend function to g * , see Equation (14). 2. By replacing t 1 , . . . , t n with t * 1 , . . . , t * p : (a) Determine the model matrix X * analogous to the computation of X (Equation (7)). (b) Determine the model matrix Z * * analogous to the computation of Z. Note that that now g * is used to model the trend and not g. 3. If g * differs from g, sample from γ * = (γ d+1 , . . . , γ d+r ) T according to the Equations (15) and (16). Let the samples be denoted byγ * = (γ d+1 , . . . ,γ d+r ) T . 4. Draw a sample from the conditional distribution of y * = (y t * 1 , . . . , y t * p ) T conditioned onγ * . The distribution is given by: NegBinom exp X * [i, :] β + Z * * [i, :]( γ T ,γ * T ) T , 1 a 2 , neg. bin.
In case that g * equals g, the vector γ * is empty and can be ignored. 5. Repeat step 3 and step 4 a pre-defined number of times. 6. Compute the α/2 and (1 − α/2) quantiles of the samples drawn from y * |γ * component-wise in order to obtain the desired prediction intervals. It should be mentioned that the assumptions taken for the uncertainty estimation are rather strong and, therefore, one cannot expect the prediction intervals to have exact coverage. Therefore, the intervals should be considered as an indicator for the level of uncertainty.
At first, the restriction to MAP estimates implies that uncertainty is underestimated. However, as already stated at the beginning of Section 4, we consider the application of full Bayesian inference as inappropriate due to the additional complexity. Moreover, it is restrictive to assume that future trend changes are independent and of the same average magnitude as historic changes. Nevertheless, considering that restaurants and canteens commonly plan their stock orders 7-14 days in advance, the number of possible future trend changes that have to be taken account of is very limited. For this reason, using overly sophisticated approaches to model future trend changes is not required.
Besides using α% prediction intervals to measure the uncertainty of future predictions they can also be used to validate if the model fits the observed data. In a first step, the intervals are computed for the time period with observations. Then we check if the fraction of days for which the target variable y does not lie within the corresponding intervals is significantly larger than (1 − α)%. If this is the case the model does not fit the data well and must be adjusted.

Performance Evaluation
In this section, the prediction quality of our approach is evaluated. First, Subsection 5.1 provides a brief description of the test data used. Then, in Subsection 5.2 we briefly explain how the proposed models have been implemented. In Subsection 5.3, we evaluate how the models fit the features of the considered restaurant data. Finally, in Subsection 5.3.2, we compare our prediction models against other promising approaches from the literature.
Throughout this section the degree l of the polynomial spline used to model the trend is specified as l = 1. It turned out that higher degrees often cause too strong trend decreases or increases during prediction. Moreover, the trend function is designed to have a knot point every 30-th day.

Testing Data
In this study, POS data from two different restaurants, that was provided by a partnering restaurant consulting firm, is used. Restaurant A is a rather casual place located in Vienna, Austria, offering a traditional Austrian menu. Restaurant B is a large staff canteen in the Netherlands, operated by a major food services company. For both locations, the data covers a time span of about 20 months. Moreover, the most representative menu items are selected, i.e., accounting for most sales. For restaurant A the menu items are categorized. An overview of the time series data considered per category is given in Table 1. Additionally, this table provides some descriptive statistics of the time series data. In particular, mean, standard deviation (sd), minimum (min), maximum (max), and the number of missing values (miss) are presented. These statistics are computed for the time series with the highest (indicated with H) and the time series with the lowest (indicated with L) sales revenue, respectively per category. The temporal order of the data is ignored for the calculation of the mean and the standard deviation. The idea of the statistics is simply to give the reader a feeling in which range the individual series take values etc. For restaurant B a categorization is not possible, since we only obtained data for which the product names and categories have been masked. Similar to restaurant A we provide an overview of the corresponding time series data in Table 2. We notice that restaurant B shows a high number of missing values since it is closed on weekends (compare to Section 2.1).   In order to provide some structural characteristics of the POS data four representative time series A1, A2, B1, and B2 are considered. A1 and A2 correspond to restaurant A, while B1, and B2 belong to restaurant B. In Figures 2 -3, the time series are decomposed into trend and weekly, monthly, and yearly seasonal patterns. The decomposition is obtained from the proposed model that assumes a negative binomial target, compare to Section 4.4.2. This model is chosen for the decomposition since it achieved the best results in the performance evaluation presented in Section 5.4.3. Please recall that in this model the seasonal effects (for weekday, day of month, month of year) act multiplicatively on the trend. Note, that the seasonal effects are obtained by passing the coefficients of the seasonality function to the exponential function. Investigation of the Figures 2  -3 shows that the trend is often quite constant (A1, A2), while at the same time also smooth (B1) and drastic (B2) trend changes appear. Moreover, weekly, monthly, and yearly seasonal patterns are observed. For instance, time series A1 shows an increasing number of sales at the end of the year, time series A2 shows a similar behavior at the end of the month, and the time series B1 and B2 show the lowest number of sales on Fridays. Finally, the presented time series also show some outliers. It should be mentioned, that the outliers observed in these time series are not as strong as outliers we could observe in other time series from restaurant A, or restaurant B. For reasons of brevity we do not include further time series plots in this section.

Implementation of the Models
We have implemented the models proposed in Section 4 using RStan [40], the interface between the programming languages R [34] and Stan [6]. Stan is known as a state-of-the-art platform for statistical modeling and high-performance statistical computation. In particular, Stan can be used to compute the MAP-estimate of the model parameters and also to perform full Bayesian inference via Markov chain Monte Carlo (MCMC) sampling. Hence, a few lines of Stan code, complemented by a R script to compute the matrices X and Z, suffice to express our Bayesian models.
For the remainder of this section, we use the following abbreviations to identify our models: In case that the target values are assumed to be negative binomial distributed our method is abbreviated by NegBinom and in case that a normal distribution is assumed our method is denoted as Normal.

Insights on the Proposed Models
In this subsection, we illustrate how well the proposed models fit the properties of the provided realworld data. The evaluation is based on representative datasets from the above described restaurants A and B. In Section 5.3.1, the priors given by the Equations (2) -(4) are assigned to the trend functions of the considered models. Further, the standard settings specified in (12) are assigned to the tuning parameters of the models. These are our recommendations in case that the model optimization has to take place automatically and at low computational cost. In Section 5.3.2, we illustrate the superior performance of more advanced approaches for a time series associated with restaurant B.

Standard Approaches
At first, a representative time series A1 (compare to Section 5.1) corresponding to restaurant A is considered. In Figure 6, the model fit is visualized for Normal (5) and in Figure 7, the model fit of NegBinom is shown. The trend function is plotted in red, the seasonal function is colored green, the expected value E(y| β, γ), see Equation (13), is drawn in dark blue and, finally, the 95% prediction intervals for the components of y are plotted in light blue. In particular, for NegBinom, the exponential function of the trend and the season are plotted. Compared to Normal, the seasonal function of NegBinom takes small values. The reason for this is that the seasonal function acts in a multiplicative way on the trend, µ t = exp(g(δ(t))) exp(s(t)), in NegBinom and in an additive way, µ t = g(δ(t)) + s(t), in Normal. For Normal, 94.63% of the observed sales lie within the 95% prediction intervals. The prediction intervals of NegBinom cover 94.97% of the sales. Thus, both models provide a reasonable fit of the observed time series. In Figures 8 and 9 the coefficients of the trend functions are visualized. In particular, each coefficient is plotted against the time point it has a non-zero effect on the model for the first time. Since the coefficients of the global polynomial effect the model from the beginning, different symbols are used to visualize them (a square for the intercept γ 1 , a triangle for the coefficient γ 2 ). While Normal does not consider any significant trend changes, NegBinom detects a slight change in the second half of the year 2019. The coefficients of the seasonality functions are shown in Figures 10 and 11. While NegBinom considers more seasonal effects as significant than Normal, both models agree on the effects with the highest impact (December, November, August, Thursday, Friday, eleven-th day of month, . . . ). Now a representative time series B1 (compare to Section 5.1) belonging to restaurant B is considered. In Figures 12 -17 the model fit is shown for Normal and for NegBinom The figures can be interpreted in the same way it has been done for time series A1. For Normal, 93.92157% of the observed sales lie within the corresponding 95% prediction intervals. The prediction intervals of NegBinom cover 95.29412% of the sales.

Advanced Approaches
In this section, the time series B2 (compare to Section 5.1) corresponding to restaurant B is considered. This time series shows a drastic trend change within a short period of time. In Figure 18, the model fit of NegBinom is visualized with different prior specifications of the trend function. We use the prior (3) -(4), that is inspired by the Bayesian Lasso, and also the horseshoe prior. In case of prior (3) -(4) the model is trained once with the standard settings (12) and once according to the step-wise cross-validation (CV) described in Section 4.5.3. The cross-validation is performed using 6 train/test splits with the testing sets including 14 observations, respectively. In Figures 18  and 19, we observe that the model has some serious problems in case that prior (3) -(4) is applied with the standard specification of the tuning parameters. The shrinkage of the trend coefficients is too strong, such that the trend function fails to sufficiently model the abrupt decrease at the end of 2017. If cross-validation is performed, a better model fit is obtained. However, significant trend changes are detected nearly every month which is an indication of overfitting. Due to the uniform shrinkage going along with the Lasso prior, insignificant trend changes cannot be shrunken sufficiently in order to allow for the appearance of drastic changes. Finally, an application of the horseshoe prior results in a good model fit. The trend function includes few large coefficients in absolute values to model the abrupt trend change, while most of the coefficients are equal to zero.

Comparison Against Other Prediction Methods
In this section, the performance of our models is compared against other well-established time series forecasting approaches. The priors given by the Equations (2) -(4) are assigned to the trend functions of our models and, further, the standard settings specified in (12) for the model tuning parameters are used.
For a given time series and a given accuracy measure the performance of the considered model is evaluated using a 15-fold cross-validation as illustrated in Figure 1. The size of the test datasets is specified as 14.

Forecast Accuracy Measures
Clearly, some measures to compare the prediction quality of our approach against others are required. Let O = {t 1 , . . . , t n } denote the set of training time points and, further, let F = {t * 1 , . . . , t * p } denote the set of testing time points. Additionally, we denote the observed values with y t and the predicted ones with y t . A review of related work, see Section 3, shows that commonly used measures for point estimates are [43,42]: • mean absolute deviation: MAD = 1 p t∈F |y t − y t |, • mean squared error: • mean absolute percentage error: Another common measure is the mean forecast error that is positive if a model tends to under-forecast and negative otherwise. Except of MAPE all up to now mentioned measures are scale dependent, i.e., the larger the y t the larger the error. We compare our approach against others from the literature on several different time series, i.e., different menu items and restaurants, in order to obtain a broad comparison. Hence, it is desirable that the used prediction quality measures are scale invariant. The above mentioned scale dependent measures are important because in the restaurant industry the calculation of the nominal and safety stocks are based on these measures. In case of inferior estimates one possible encounters inventory shortages or increased food waste. For this reason, the scale dependent measures are considered in Section 5.4.3, but with the note that the aggregated values presented in this section are dominated by those time series that show the largest values on average. The scale invariant MAPE is not considered since it can not deal with zero values, i.e., y t must be different from zero. Some more suitable measures are the mean arctangent absolute percentage error (MAAPE), proposed by Kim and Kim [28], the weighted absolute percentage error (WAPE) [8], and the mean absolute scaled error (MASE) [24,32]: • In order to evaluate the quality of the prediction intervals the following measures are helpful: prediction interval coverage probability (PICP) [38], prediction interval normalized average width (PINAW) [38], mean scaled interval score (MSIS) [15,32]. It is desirable to have a small PINAW and a large PICP [27]. MSIS is designed to balance the relationship between PINAW and PICP.
Note that in this evaluation study always 95% prediction intervals are considered, since intervals with this confidence level are most commonly used in the business world [32].

Prediction Methods to Compare Against
Now, we briefly present the methods that we consider for comparison against our proposed model. Therefore, we adopt the selection taken by Taylor and Letham [42].
• Facebook's prophet method. The prophet R package implements a modular regression model with interpretable parameters [42]. We discuss the model in more detail in Section 3.
Within the performance comparison the model is once used with automatic season detection (prophetA) and once with yearly and weekly seasonality (prophetS). • ARIMA. The auto.arima function returns the best ARIMA model according to the AIC or BIC value. Here, we choose the Akaike information criterion (AIC). • Seasonal ARIMA. The auto.arima function can also be used to determine the best seasonal ARIMA (SARIMA) model. To this aim, we specify the seasonality to be weekly and, further, define the order of seasonal differencing equal to 1. If the seasonal order is not specified, the consideration of a seasonality is decided by the auto.arima function which might also return a non-seasonal model. • Exponential Smoothing. The ets function implements an exponential smoothing state space model. It fits a collection of exponential smoothing models and selects the best, see Hyndman et al. [25]. In this performance comparison the seasonality of the exponential smoothing model is defined to be weekly. • TBATS. Trigonometric seasonality based on Fourier series, Box-Cox transformation, ARMA errors, Trend, and Seasonal components, see Livera et al. [30]. In this comparison study, the TBATS model is specified to learn weekly, monthly, and yearly seasonalities. Note that all above mentioned forecasting models, except for prophet, are implemented in the forecast R package [23,22]. We want to point out that ARIMA, SARIMA, ExpSmooth, and TBATS assume regularly spaced data points. Since the considered testing data (compare to Section 2.1 and Section 5.1) contains missing data, the missing values are linearly interpolated and subsequently rounded before fitting one of these models.

Evaluation
Finally, we present how our approach performs compared to the ones described above. In Tables  3 -10  Considering the results for restaurant A it can be seen that both of our models show a positive MFE for nearly all of the considered categories. This indicates, that our approach tends to underforecast the sales numbers. While also tending to under-forecasting, the classical times series models (ARIMA, SARIMA, ExpSmooth, TBATS) show the lowest and, thus, the best MFEs. However, none of the models under comparison shows large MFEs, such that one could speak about a poor performance. In contrast to restaurant A, for restaurant B the non-standard approaches (i.e.,Normal, NegBinom, and prophet) obtain lower MFEs than the classical ones. Especially, our models show significantly lower values than all other methods under comparison.
Inspecting the MSEs and the MADs for restaurant A reveals that the proposed models often obtain the lowest errors. In case that another model performs better than ours, the differences are rather small. For restaurant B NegBinom obtains the lowest MAD and the second lowest MSE.
Switching the analysis from the scale-dependent metrics for point forecasts to the scale-invariant ones (WAPE, MAAPE, MASE) shows that our approach NegBinom shows the best results overall. For restaurant B NegBinom obtains significantly lower error values than all other considered approaches. This result is stronger than the result for the scale-dependent metrics, since the scaleinvariant metrics are not dominated by the time series showing the highest number of sales.
Moreover, NegBinom always achieves the lowest mean scaled interval score. Therefore, the prediction intervals corresponding to this approach can be considered as the most useful ones. The superior performance could be explained by a good coverage probability paired with tight interval widths.
Finally, it is important to link the performance of the methods to the structural characteristics of the time series data. In Section 5.1, it is shown that the individual time series show multiple seasonalities. While, weekly, monthly, and yearly seasonal effects are observed, the strength of each of these effects differs among menu items. In principle, a method being capable of incorporating each of the mentioned seasonalities should return the best results. However, in case of limited training data one has to be careful. Allowing for too much flexibility, i.e., modeling each of the seasons, can lead to over-fitting. This may be the reason why TBATS often shows inferior results than some of the simpler forecasting approaches. On the other hand, considering only some of the seasonalities can also negatively impact the model performance. For instance, the SARIMA model, that only considers weekly seasonalities, most of the times returns worse point predictions than the simple ARIMA model. We assume that neglecting the remaining seasonalities makes it difficult to extract the true weekly effects, especially if the amount of training data is limited. Thus, we conclude that an approach which considers all of the mentioned seasonal effects, but at the same time, in terms of regularization, also penalizes the model flexibility should be preferred. This observations agree with the fact that our approach leads to the best predictions overall. Normal and NegBinom use shrinkage priors in order to automatically detect significant seasonalities and, further, to guarantee that their influence does not exceed a necessary level.
Finally, another reason for the superior performance is that all other methods somehow assume Gaussianity, while the negative binomial distribution is more appropriate to model count data. This might be the reason why NegBinom provides the most useful prediction intervals. It is likely that the use of the negative binomial distribution is a key factor. Similar findings are observed by Snyder et al. [39] who also found that models based on the negative binomial distribution give the best prediction performance concerning a forecast of the intermittent demand for slow-moving inventories, i.e., low-volume items such as auto parts. The negative binomial distribution is widely used for slow-moving inventory items, presumably because of its good performance in practice.

Evaluation of the Day of Month Effect
While it is quite intuitive that sales of menu items show weekly and yearly patterns, it is less evident that there are also effects that repeat monthly. As shown in Section 5.1 the proposed models detect monthly seasonalities within the time series data considered in this study. However, there is still the question if these effects significantly influence the model prediction accuracy. To answer this question, we have repeated the evaluation performed in Section 5.4.3 for Normal and NegBinom with the day of month effect switched off. It turns out that the usage of this explanatory variable does not have a significant influence on the average model performance, whether for Restaurant A or Restaurant B. However, a more detailed analysis shows that there are some time series where the consideration the the day of month effect leads to consistent improvements. For instance, time series A2 (compare to Section 5.1) is such a time series. In Table 11, the corresponding model accuracies are reported once with the day of month effect turned on and once with this effect turned off. We notice that the consideration of monthly patterns leads to superior accuracies. Since the additional model complexity introduced by the day of month effect has only a minor influence on the computation times, we conclude that their consideration is useful. However, we found that the model performance can only be improved for a fraction of the restaurant sales time series considered in this paper.                     Table 11: Performance evaluation of the day of month effect for time series A2.

Conclusion & Future Work
In this work, a new approach for predicting future sales of menu items in restaurants and staff canteens was proposed. In particular, two Bayesian generalized additive models were presented. The first one assumes future sales to be normally distributed, while the second one uses the more appropriate negative Binomial distribution. Both approaches use shrinkage priors to automatically learn significant multiple seasonal effects and trend changes. The features learned by the models have a straightforward human interpretation which helps potential users to create the necessary trust and confidence in the methodology. The performance of our approach was extensively evaluated and compared to other well-established forecasting methods. Basis of the analysis were two data sets retrieved from (electronic) Point of Sales (POS) systems collected at a restaurant and a staff canteen. The evaluations show that our approach often provides the best and most robust point predictions overall. Moreover, the prediction intervals of our model with the negative Binomial distribution are notably more accurate than the ones obtained by the other methods. Currently, our approach only takes account of POS data. This makes it universally applicable, but also limits the prediction quality that can be achieved. In future work, we plan to enrich the data source by weather data, information regarding special events and holidays, and, last but not least, expert knowledge of restaurant managers. Additionally, we want to extend our approach such that it can also predict on hourly basis. Accurate sales predictions on hourly basis will allow an optimization of workforce planning.