Local and Global Trend Bayesian Exponential Smoothing Models

This paper describes a family of seasonal and non-seasonal time series models that can be viewed as generalisations of additive and multiplicative exponential smoothing models, to model series that grow faster than linear but slower than exponential. Their development is motivated by fast-growing, volatile time series. In particular, our models have a global trend that can smoothly change from additive to multiplicative, and is combined with a linear local trend. Seasonality when used is multiplicative in our models, and the error is always additive but is heteroscedastic and can grow through a parameter sigma. We leverage state-of-the-art Bayesian fitting techniques to accurately fit these models that are more complex and flexible than standard exponential smoothing models. When applied to the M3 competition data set, our models outperform the best algorithms in the competition as well as other benchmarks, thus achieving to the best of our knowledge the best results of per-series univariate methods on this dataset in the literature. An open-source software package of our method is available.


Introduction
Despite being introduced over 50 years ago, exponential smoothing methods remain among the most widely used methods for forecasting (Goodwin 2010).Their enduring popularity stems primarily from their relative simplicity, robustness, flexibility, and good forecasting performance (Gardner Jr. 2006).For example, exponential smoothing models were able to outperform several more sophisticated and complex algorithms in the influential M3 forecasting competition (Makridakis & Hibon 2000), and exponential smoothing models were successfully used as building blocks in many of the methods used in the recent M4 competition (Makridakis et al. 2018).
Exponential smoothing was originally created as a simple heuristic, in which the next point forecast is given by the previous point forecast plus a correction term that is proportional to the error made by the previous forecast.Over the years, the basic exponential smoothing technique has been extended to include additional aspects such as seasonality (additive and multiplicative), local trends (additive, multiplicative, and damped versions of both), and more complex stochastic error term specification (additive and multiplicative).Exponential smoothing techniques are frequently referred to in literature by the initialism "ETS", which may stand for either ExponenTial Smoothing, or Error, Trend, and Seasonality.Hyndman et al. (2002) and Hyndman et al. (2008) systematically consolidated the various exponential smoothing models, and gave them a solid theoretical base, by reformulating them within a statespace framework with a single source of error.Subsequently, this theoretical contribution was followed by the creation of the forecast package (Hyndman & Khandakar 2008) in the R programming language (R Core Team 2018), which implements ETS models in an accessible and user-friendly way within an opensource framework.The package has subsequently become the de facto standard tool for the forecasting of time series in many areas.A more recent continuation of this development is via the fable pack-age (O'Hara-Wild et al. 2021), which reimplements and extends the functionality of the forecast package.Apart from these main lines of research, other authors have developed various robust versions of exponential smoothing, e.g., (Cipra 1992, Gelper et al. 2010, Crevits & Croux 2016, Crevits et al. 2018) and Svetunkov & Kourentzes (2018) developed the complex exponential smoothing method.

Bayesian ETS
The majority of development in the ETS space has been through the lense of frequentist statistical approaches.In contrast, the application of Bayesian methods to exponential smoothing has been comparatively less studied.Important examples of Bayesian ETS approaches being proposed in the works of Andrawis & Atiya (2009) and Bermudez et al. (2010).Corberan-Vallet et al. (2013) subsequently proposed a related method for time series containing zeros, Bermudez et al. (2009) presented a model suitable for multivariate forecasting, and Corberán-Vallet et al. (2011) extended the approach to forecasting of time series with correlated random disturbances.The work of Bermudez et al. (2010) examined the topic of Bayesian estimation of basic ETS models in a holistic way.The performance of resulting model was comparable to the best forecasting models in the M3competition, as well as forecasts produced from popular automatic packages (Bermudez et al. 2010).An interesting result of this study was the demonstration that the use of Bayesian estimation meant the resulting model was able to produce accurate prediction intervals, which were found to be close to the empirical intervals.This is an important feature that the classical exponential model from the M3 competition is lacking.Interestingly, despite the promising results emerging from that work, the Bayesian approach has not found widespread adoption in the application of ETS models.

Shortcomings of classical ETS
The ETS models that are currently in use, as described in Hyndman et al. (2008), and implemented in the forecast package, are either linear or expo-nential in their trend, with the only option currently available to potentially bridge this gap being a damped trend.So, the choice is limited especially if the underlying growth rate cannot be suitably modeled by these available options.Furthermore, the ETS models assume normality of the errors, homoscedasticity of the errors for additive models, and have some further shortcomings described in details in Section 4.
There are many practical applications, e.g., in cloud infrastructure, rideshare, and other tech businesses, where characteristics of the ETS framework mentioned above do not adequately capture the time series in consideration.In many applications, time series grow faster than linear but slower than exponential, and in situations with volatile time series, the Gaussian error assumption is violated.In the following, we use as an example a monthly time series from Google trends that shows the popularity of the search term "tesla" (and thus indicates interest in the company Tesla, Inc.) over time.The series is shown in Figure 1.The sudden shifts observed in the series can be a result of new product launches, marketing activities, etc.However, often, especially in the case of new products and services, they result from major customer activities, which may be difficult to predict.
Expressing scale of the error as a function of level is important in time series with error levels that are one or two orders of magnitude smaller at the beginning than at the end of the series, so that homoscedastic models are unsuitable in this case.At the same time, while volatility grows with time series levels, it is seldom proportional to the level.

Our Contributions
In this paper we propose a number of practically motivated extensions to the ETS family of models with additive error and additive trend, to convert them into flexible nonlinear models.As the models are likely to be analytically unsolvable, and our models use non-Gaussian error distributions, we fit them using a probabilistic programming tool (Stan Development Team 2015) that employs a fast Markov Chain Monte Carlo engine (Hoffman & Gelman 2014).In sum, the main contributions of our paper are as fol- 1.We propose a number of extensions to the ETS framework that are designed to increase the flexibility of the base models.Importantly, these extensions are motivated directly by phenomena commonly observed in real-world time series.
2. We utilise Markov-chain Monte-Carlo (MCMC) sampling procedures to enable Bayesian model fitting.A key strength of this approach is that we can relax the usual Gaussian assumptions and still obtain accurate prediction intervals, even for small sample sizes.
3. We implement both a standard version of the sampler, rendering the implementation extensible and easy to understand, and a bespoke specialised Gibbs sampler that is roughly an order of magnitude faster than the standard implementation.
4. We provide documented and tested open-source software implementations of our proposed methods.
The methods presented in this paper are available in the XXX package on CRAN1 .The Rlgt package is a comprehensive implemention of models discussed in this paper, together with further documentation and illustrations regarding their use.
The remainder of this paper is structured as follows: Section 2 discusses ETS models in detail, Section 3 discusses Bayesian model fitting and MCMC sampling, Section 4 describes the shortcomings of classical ETS models, Sections 5 and 6 present the non-seasonal and seasonal versions of our new model respectively, Section 8 describes the experiments and results, and Section 9 concludes the paper.

Exponential Smoothing
The exponential smoothing algorithm is fundamentally based on the concept of time series decomposition.As discussed in Section 1, ETS performs a decomposition of the time series data into the three main components: (i) stochastic error; (ii) a trend; and (iii) a seasonal component.The trend represents the long-term increase and decrease in the time series, and is not necessarily linear over time.The seasonal component represents the perturbation in the time series due to seasonal factors, e.g., quarter of the year or day of the week.Seasonality patterns last for a fixed period of time.Finally, the error component models the random components which cannot be explained by the model.
These components can interact with each other in either additive or multiplicative ways.Other interactions are also possible in the ETS model, such as damped trends, in which the strength of the trend reduces over time (Gardner 1985).Exponential smoothing methods were proposed as a simple way of generating fast and reliable forecasts for a wide range of time series data in various application areas that do not make too many strong, parametric assumptions.

ETS model family nomenclature
A wide range of different ETS-based models have been introduced over the years.Following the nomenclature in Hyndman & Athanasopoulos (2018), ETS variants commonly model the trend as either additive (A), additive damped (Ad), multiplicative (M), or as having no trend (N).Seasonality is usually modelled as being either additive (A), multiplicative (M), or not present (N), and the error term is usually either additive (A) or multiplicative (M) in nature.The ETS model family constitutes the set of models that can be built from all possible combinations of these choices for error term, trend, and seasonality.For example, an ETS model with additive error, additive damped trend and multiplicative seasonality, which is commonly known as the Holt-Winters' damped method, would be denote as an "AAdM model" under the common nomenclature.Furthermore, if a component is not further specified, a "Z" is used as a wildcard placeholder for the missing term; for example, an AAZ model is a model with additive error, additive trend, and any form of seasonality.

Classical ETS model estimation
The classical ETS models are usually fitted via the method of maximum likelihood.This technique is the most popular technique for estimating model parameters in classical statistics.This popularity is primarily due to the strong asymptotic properties the maximum likelihood estimates are known to possess (under suitable regularity conditions), such as consistency, asymptotic efficiency, and asymptotic normality (Fahrmeir & Kaufmann 1985).The method of maximum likelihood is based on the notion of maximising the joint likelihood of the data under the model p(y 1 , . . ., Here, θ is a vector of model parameters and y is a vector of n observed values, i.e., the likelihood function p(y | θ) is defined as the probability of generating a certain set of data y given the parameter values θ.
The estimates of the parameters are defined as the values of the parameters which maximise the likeli-hood function, i.e., θ = arg max Here, θ denotes the maximum likelihood estimate (MLE) and Θ denotes the valid parameter space.For the classical ETS models, the parameter vector θ usually consists of (a mix of) the following parameters: a damping coefficient, an error variance, initial trend terms, initial seasonality factors, and smoothing factors for the level, trend, and seasonality.For the purpose of forecasting, it is usual that the estimated parameters are plugged into the relevant equations to produce future predictions (Hyndman & Athanasopoulos 2018).

Non-seasonal (damped) trend model
One particular model from the ETS model family, the non-seasonal (damped) linear trend model (AAdN), is given by where y t denotes the value of the dependent variable of interest at time t, ŷt+1 is the conditional expectation of y t+1 given the information up to time t, σ denotes the standard deviation of the error distribution, l t denotes the level at time t, and b t the local trend at time t.The parameters to be fitted are α ∈ (0, 1), a smoothing parameter for the level term; β ∈ (0, 1), a smoothing parameter for the local trend term; ϕ ∈ (0, 1), a damping factor; and initial values for level, l 1 , and slope, b 1 .Equation (2) models the distribution of y t+1 ; specifically, it assumes that the value of y t+1 is distributed as per a normal distribution with mean ŷt+1 and standard deviation σ.Equation (3) states that the one-step-ahead prediction value ŷt+1 is formed as the sum of the previous level and trend values.Equation (4) describes the evolution of the level over time; the estimate of the level at time t+1 is given by the weighted average of the observed value y t+1 and the previous estimate of the level l t .The initial level value l 1 is treated as a parameter to be estimated from the data.
Equation ( 5) describes the evolution of the trend across the time horizon.Similar to Equation (4), the updating process of the trend is based on the weighted average of the current trend value (l t+1 − l t ) and the previous estimate of the trend b t .As with the level, the initial value b 1 is treated as a parameter and estimated.The parameter ϕ is a damping factor that reduces the influence of the trend; if ϕ = 1 the model reduces to the usual AAN (i.e., undamped) model.
We note that when using a linear trend, we are making the assumption that we expect that in addition to the level changing over time, we also expect that the trend will change over time.The weighted average estimate for the trend value usually works best if the trend changes gradually over time in the corresponding time series data.More comprehensive explanations of this model can be found in Hyndman & Athanasopoulos (2018).

Seasonal (damped) trend model
The seasonal counterpart of the non-seasonal (damped) linear trend model introduces a multiplicative seasonal pattern to account for potential seasonality.This model is also known as the Holt-Winters' multiplicative method (Hyndman & Athanasopoulos 2018).
This extension introduces three new quantities: the s t represents the seasonal factor prevailing at time t, ζ ∈ (0, 1) represents a smoothing parameter for the seasonal factors, and m > 1 represents the number of seasons comprising a complete period.The main differences between the AAN model described in Section 2.3 and the seasonal extension are: • In comparison to Equation (3), Equation ( 7) introduces a multiplicative factor of s t+1−m which estimates the prevailing seasonal factor at time t + 1.This factor is estimated m periods before (at this point the seasonal factor is not yet known, so the best estimate that we have is the seasonal factor obtained at time t + 1 − m).
• The updating process of the level in Equation ( 8) is based on the weighted average of the current level value (note that the data value is modeled as a product of the level and seasonality factor, so the level value is given by y t+1 /s t+1−m ) and the one-step prediction of the current level value.
• Equation ( 10) describes the evolution of the seasonality factor.Similar to the trend and level adjustment process, the seasonality factor is estimated as a weighted average of the current seasonality value y t /l t and the most recent estimate of the corresponding seasonality value s t−m .

Bayesian estimation
The overwhelming majority of previous work on exponential smoothing utilises frequentist inferential tools, such as maximum likelihood and confidence intervals, to estimate model parameters and provide measures of uncertainty.Maximum likelihood estimation tends to suffer from high parameter variance, particularly for sample sizes and probability models with heavy tails, such as the t-distribution; similarly, due to the great difficulty in deriving exact confidence intervals for all but the simplest of models, most uncertainty quantification relies on asymptotic arguments, and may provide poor estimates of variability for small sample sizes.
In contrast to this, we use the Bayesian paradigm of statistical inference to fit our proposed ETS models.In the Bayesian approach one treats the unknown model parameters, say θ, as realisations of random variables coming from some prior distribution, π(θ), and argues conditionally on the observed data sample, y, to form the posterior distribution: where p(y | θ) denotes the likelihood function.The posterior distribution describes how likely specific values of θ are to be the unknown, population value of θ, in light of the data and prior beliefs.The Bayesian approach naturally handles complete parameter uncertainty through the mechanism of the prior distribution.This same mechanism also allows us to specify any prior information about the unknown model parameters in a very transparent manner, and as such tends to offer a form of regularisation that can stabilise parameter estimates.An additional strength of the Bayesian approach is that given the posterior distribution over the unknown model parameters, it is, at least in principle, straightforward to generate the posterior distribution over any function of the model parameters via standard transformation of random variables techniques.This can be used, for example, to generate posterior distributions over point or distributional forecasts, which naturally incorporate the uncertainty in parameter estimates.

Monte Carlo Markov Chain (MCMC) Methods
Despite the attractive theoretical advantages of the Bayesian approach, the primary weakness is in the potential difficulties in implementation.Computation of the normalising factor in (11) involves highdimensional integration, and is in general intractable.
Even if one could compute it, effectively exploring an arbitrary multidimensional probability distribution is problematic.Instead, one generally resorts to some form of approximation; one of the most popular approximation techniques is via simulation of values of θ from the posterior distribution.This is usually done via Monte Carlo Markov Chain (MCMC) techniques, that do not require knowledge of the normalising term, and are thus very convenient.Given a random sample of m values drawn from (11), say θ (1) , . . ., θ (m) , it is straightforward to compute quantities such as the posterior mean, standard deviation and intervals using the corresponding empirical (sample) quantities.Furthermore, if we have some function of the model parameters, f (θ), we can approximate the posterior distribution of this function by simply using i.e., we simply evaluate f (•) for each sample θ (i) we have drawn from the posterior distribution.In the context of this paper, the typical functions we may evaluate would be quantities such as forecasts over future data.

Probabilistic Programming
The application of Bayesian statistical techniques to practical data science problems has increased dramatically in the last two decades.This has primarily been driven by increases in computational power, coupled with the introduction of easy-touse Bayesian probabilistic programming tools such as Stan (Stan Development Team 2015) and JAGS (Plummer 2003).These tools enable non-specialist data scientists to perform Bayesian modelling and computation in a more efficient manner, specifically in terms of development time and programming effort.Stan is a probabilistic programming language that is focused on Bayesian statistical modelling.The underlying computational algorithm used by Stan to sample from the posterior distribution is a variation of the Hamiltonian MCMC algorithm (Neal 1994), known as the "No U-Turn Sampler" (NUTS).This procedure uses autodifferentiation to automatically build an MCMC sampler from the posterior distribution specified by the user.Conveniently, Stan is also integrated into R through the package Rstan (Stan Development Team 2018).
Although Stan offers a number of advantages as a state-of-the-art Bayesian tool, the primary benefit of the tool stems from the separation between modelling (through the probabilistic programming language) and computational processes (Betancourt 2017) (through the built-in Hamiltonian sampling algorithm) that it provides.This separation enables users to focus primarily on the specification of the problem, i.e., the choice of model and prior distributions, rather than the computational aspects of sampling.For most problems, users are not required to specify any computational procedures, as they can be handled adequately by the default process.However, the algorithm is not fool-proof, and in general both run-time and convergence of the resulting Hamiltonian sampler can be heavily influenced by the way in which models and priors are specified.
In addition to Stan, there are a number of other probabilistic programming tools that support Bayesian automated MCMC sampling.In particular, JAGS (and its predecessor BUGS) has also been extensively used by Bayesian data analysts (Lunn et al. 2000).In comparison to Stan, JAGS uses an automatic Gibbs sampling approach which can be less efficient than the NUTS algorithm used by Stan (Rbloggers 2014).

Weaknesses of classical ETS models
As discussed in Section 1, classical ETS models have several shortcomings that can adversely affect their performance in practice.We now describe these weaknesses in more detail.

Limited functional form of trend
Despite a variety of different forms of trends being considered in classical ETS models, the choice may frequently still be too limited; this is illustrated by the motivating example of the results of entering the search term "tesla" into Google trends, shown in Figure 1.Clearly, the trend appears to be growing faster than an additive form but slower than a multiplicative form.

Non-normal error distribution
Statistical models based on ETS models, both additive or multiplicative, usually assume that the error terms are identically and independently distributed normal random variables.This assumption is fairly restrictive for most applications, and is not necessarily appropriate for modelling volatile time series data, such as the example from Figure 1.The assumption of normality implicitly assumes that the probability of obtaining a value that highly deviates from the mean (i.e., an outlier) is extremely low.In some real data, such as the movements of stock prices, there is a general consensus that the normality assumption tends to be violated due to high variance, especially in the time of crises.Despite these empirical observations, the normality assumption tends to be retained in classical ETS models as even small deviations from the normality assumption may greatly complicate the likelihood function; this can cause considerable issues as maximisation of the likelihood function is the key step in standard maximum likelihood estimation based procedures.Consequently, the assumption of normality is usually required in classical ETS models, as otherwise the MLE may be difficult to find, and other inferential quantities derived from the likelihood, such as confidence intervals may also be very difficult to obtain.

Heteroscedastic error distribution
In addition to being non-normally distributed, the scale of the errors in time series data is often not homoscedastic, i.e., it may vary over time.This issue has been recognised even in classical ETS models, and various approaches based on innovation state space representations have been introduced to deal with it, with varying levels of success.For example, the multiplicative error form in the state space models may allow the error to depend linearly on the current value of the data.Similar to the discussion of the form of the trends in Section 4.1, the way in which the error scale is linked to the trend is usually quite limited, and may not be sufficient to accurately model more general time series data.For example, the state-space model might not be able to handle an error function which grows with the data over time (i.e., faster than an additive error form) but at a sublinear rate (i.e., slower than the multiplicative error form).

Local and Global Trend nonseasonal model
To address the shortcomings identified in Section 4, we propose the non-seasonal local and global trend (LGT) model.It is derived from the classical ETS model with additive error, additive trend, and no seasonality, with the addition of a number of distinguishing features.We define it as follows.13defines the conditional mode (i.e., the predicted value) of the Student t-distribution at the next time-step.Equation 14estimates the level adjustment via exponential smoothing, and Equation 15 estimates the local trend via exponential smoothing of the first differences of the smoothed levels, and Equation 16 defines the level dependent (heteroscedastic) scale of error distribution.
Equation 14 is very reminiscent of the Holt linear method with a slightly simpler structure.The local trend update formula is the same as in the Holt linear (AAN) method, see Equation 5.The main differences are now discussed in more detail.

Student t-distributed errors
Equation 12 models the observations via a Student t-distribution, with a time-varying scale.We note that the Student t-distribution has three free parameters as opposed to the two free parameters of the normal distribution.The additional parameter, ν, commonly referred to as the degrees-of-freedom, controls the weight of the tails of the distribution, which facilitates the graceful handling of "outliers" in data.This choice of data distribution is in contrast with the innovation state space models for classical ETS (Hyndman & Athanasopoulos 2018), in which the error distribution is assumed to be normal and identically distributed with a mean of zero and variance of σ 2 .
However, as outlined in Section 4, this assumption is often overly restrictive in practice.This motivates our choice to generalise the error assumption from normal to Student t.The Student t-distribution is a generalisation of the normal distribution, with the heaviness of the tails being smoothly controlled by the degrees-of-freedom parameter ν.When ν is relatively high (i.e., above 20), the Student t is virtually identical to a normal distribution.In contrast, when ν is small the tails of the t-distribution are heavier than the tails of a normal distribution; when ν = 1 the t-distribution reduces to the special case of the Cauchy distribution.
We note that ν, like all other model parameters, is fitted automatically within our Bayesian approach.This means that the model is able to automatically adapt to relatively calm time series (e.g., like those of the M3 competition), but can handle the more volatile series that are often found in practical applications.

One-step ahead prediction
In the classical AAN model (Equation 3), the forecasts are calculated as ŷt+1 = l t + b t , where l t is the current estimate of the level and b t is the current estimate of the trend, resulting in a linearly changing forecast.In multiplicative models the one-step-ahead prediction is instead given by ŷt+1 = l t b t (Hyndman & Athanasopoulos 2018), which results in an exponentially varying forecast.In the latter case, for a model to be reasonably stable, the trend values b t need to be close to one.Define the trend to be b t = 1 + δ t , where δ t is a small value; then ŷt+1 = l t b t = l t (1 + δ t ) = l t + l t δ t .This suggests that both the additive and multiplicative case can be generalised by the following equation: with ρ ∈ [0, 1].While this generalisation of additive and multiplicative trend model is interesting, we found in practice that its performance was typically worse than that of the alternative global trend model: where γ is constant for the whole time series.This flexible form of trend function is particularly useful for dealing with time series that grow faster than linear but slower than exponential.If the ρ parameter is allowed to be negative, then for a growing trend we can obtain a damped behaviour.In practice, we did not observe this capability to be particularly important when fitting to the M3 time series.Additionally, the model includes a local linear trend of the form λb t , where λ is a value between −1 and 1.In practice, this trend is almost always between 0 and 1. Restricting the magnitude of this coefficient means that the model is sensitive to the recent changes in slope, but to a lesser extent than the original AAN model.A negative λ coefficient is indicative of a counter-trending behaviour, but in practice this appears to be a relatively rare situation.
The one-step-ahead forecast is then: It is assumed, and enforced during the simulation (i.e., forecasting) stage, that the level and forecasting values are all strictly positive.All model parameters, such as γ, ρ, and λ, are fitted via MCMC sampling from the posterior distribution.
The one-step-ahead forecast given by Equation 19, is given by the linear combination of the level l t , the global nonlinear trend γl ρ t , and the damped local linear trend λb t .We use the term "global" to describe the nonlinear trend to stress the fact that both γ and ρ are fitted on the entire time series and do not change over time.Typically, the global trend will dominate over the short-term behaviour caused by the local linear trend.
The simple function that defines the global trend has several interesting properties.If ρ is close to zero, the contribution of the global trend to the one-stepahead prediction becomes near-constant, and the resulting trend becomes close to linear.Alternatively, if ρ is close to one, the contribution of the global trend to the one-step-ahead prediction becomes almost proportional to the level value, and the trajectory of the multistep forecast becomes exponential.This means that the global trend is flexible enough to adequately model trends typical of fast growing businesses, i.e., growth that is faster than linear but slower than exponential in nature.

Heteroscedastic error
In addition to accounting for a possible heavy-tailed error distribution, our proposed model also allows the scale of the error to vary as a function of the estimated level of the time series.This is achieved by allowing the scale parameter of the Student's tdistribution to vary according to a monotonically increasing function of the current level of the time series.This is given by Equation 16, which models the scale of the error by a power transformation of the one-step-ahead prediction plus a level-independent base scale term ξ.
This allows us to account for the common situation in which the magnitude of the error is an increasing function of value of the time series, with the use of a variable power τ allowing us to moderate the rate at which the scale increases as the value of the time series increases.In practice, the value of τ is usually limited to values between zero and one.A τ of zero corresponds to the situation of a constant scale across the series, i.e., homoscedasticity.In contrast, τ = 1 approximates the behaviour of the multiplicative error ETS model, in which the magnitude of the error is proportional to the expected value.In practice, a value between zero and one is desirable for τ as we expect the error to grow in relation to the magnitude of the time series, but in a sub-linear manner.The parameter ξ > 0 sets the base error scale.These modifications relax the usual NID assumption, as the error term is now not required to be either normally or identically distributed.

Seasonal Global Trend Model
We also consider the extension of our procedure to seasonal models.Specifically, we introduce the seasonal global trend model (SGT), which is a seasonal modification of the LGT model previously discussed.In contrast to the non-seasonal model, we remove the local linear trend term and replace it with the ability to model seasonal effects.Specifically, we utilise a multiplicative seasonality scheme, with the seasonality coefficients evolving as per the usual ETS model: Here, m > 0 is the seasonality (e.g., m = 12 for a monthly series), ζ ∈ [0, 1] is the seasonality smoothing coefficient, and s t are seasonality factors.
A key difference in our Bayesian implementation, vis à vis the conventional seasonal ETS model, is the manner in which the initial seasonality coefficients, s 1 , . . ., s m , are handled.Rather than calculate the ratio of observations against a cyclical mean for a few cycles, potentially followed by maximum likelihood estimation, they are handled in a fully Bayesian fashion.That is, we simply treat these quantities as unknown parameters and assign them appropriate prior distributions; specifically log s i ∼ C(0, b), where C(a, b) denotes a Cauchy distribution with location a and scale b.We then allow these quantities to be sampled along with the other model parameters as part of the MCMC sampling procedure.
The complete specification of the model is: The latter condition enforces the requirement that the seasonality adjustments be expressed relative to the overall scale of the signal.The complete set of parameters that need to be fitted for the SGT model are described in

Implementation
In Sections 5 and 6 we introduced several extensions to the classical ETS model, that allow for a wider range of trend and error models.We fit these models within the Bayesian framework of statistical inference.Specifically, we use MCMC approaches to simulate from the posterior distribution, and use these to approximate the posterior predictive distribution over forecasts.We provide two implementations: (i) a Stan-based implementation, and (ii) a bespoke Gibbs sampling implementation based on a simplified model.The former is more flexible and easier for users to extend, but is quite slow.The Gibbs sampler is less straightforward to extend, but is substantially faster and does not require a working Stan installation to run.In the following we detail the approaches.

Prior Distributions
A key step in Bayesian inference is the specification of prior distributions.The choice of prior distributions can substantially affect the performance of a Bayesian procedure, particularly for small sample sizes.We now detail the choices used in our implementation.
The degrees-of-freedom ν and power-coefficients ρ and τ are assigned uniform distributions over appropriate intervals; the scale σ and minimum error scale ξ are assigned half-Cauchy distributions; the smoothing parameters α, β and ζ (for seasonal models) are assigned beta distributions; the coefficients γ and λ are assigned heavy-tailed, weakly informative Cauchy distributions (Gelman 2006); and the initial values for the seasonality terms, s 1 , . . ., s m are assigned log-Cauchy prior distributions.The hyperparameters for these prior distributions are given default values in our implementation and there is no need for the user to specify these.However, due to the flexibility of Stan it is straightforward for the user to modify these prior hyperparameter values if so desired.

Gibbs Sampler
In addition to providing a Stan implementation, we also developed an MCMC sampling algorithm for our proposed model via a bespoke Gibbs sampler.For purposes of tractability, the Gibbs sampler is based on a slightly simplified form of the LGT model: The simplified model enforces homoscedasticity on the errors, which dramatically improves the convergence speed of the resulting Gibbs sampler.For the purposes of sampling, the Gibbs sampler represents the t-distribution via a scale-mixture of normal distributions (Lange et al. 1989), i.e., where IG(a, b) denotes the inverse-gamma distribution.This representation is equivalent, as integrating out the latent variables ω 2 t exactly recovers the t-distribution.Despite introducing n new latent variables that need to be sampled, this representation lets us treat ŷt as the linear predictor in a regular Gaussian linear regression, making sampling γ and λ substantially easier.We also utilised this same scalemixture form to represent the Cauchy prior distributions, as well as an inverse-gamma scale mixture of inverse-gammas to represent the half-Cauchy prior distribution (Wand et al. 2011).
A second difference between the Gibbs sampler and the Stan implementation is in the way in which the ρ and ν parameters are handled.Given the nonstandard form of the conditional distributions for these parameters, the Gibbs sampler utilises a simple grid-sampler by restricting the range of possible values these parameters can take to a finite, discrete set.We found this has little effect on the performance of the methods, as the models appear quite insensitive to the specific values of these parameters, but substantially improves convergence of the resulting algorithm.

The Sampling Algorithm
We now briefly summarise the Gibbs sampling algorithm.Gibbs sampling from a posterior distribution, say p(θ 1 , θ 2 | y) works by iteratively sampling from the conditional distributions θ 1 | θ 2 , y and θ 2 | θ 1 , y.This can be efficient when it is relatively easy to draw random samples from the conditional distributions.Let ŷ = (ŷ 1 , . . ., ŷn ) denote the vector of onestep-ahead forecasts made by the LGT model, and ω = (ω 1 , . . ., ω n ) denote the vector of latent variables.We now summarise the specific Gibbs sampling steps: 1. Sample σ 2 | y, ŷ, ω.Due to the use of the scalemixture representation for the t-distribution the half-Cauchy prior, this conditional distribution is of the inverse-gamma type.
4. Sample the coefficients γ, λ | y, ŷ, ω, σ 2 ; due to the latent-variable representation of the likelihood and prior distributions, the conditional distributions for these are just (truncated) normal distributions.
For this step we integrate the latent variables ω out of the likelihood to improve convergence of the Gibbs sampler.6. Sample the power coefficient ρ | ŷ, ν, σ 2 using a grid-sampler, again integrating out the latent variables ω.
These steps are iterated repeatedly until the desired number of samples have been generated.

Experiments
As our proposed models are extensions of the traditional exponential smoothing model family, they are univariate in nature and do not train across series, or utilise additional covariates.As such, we opt for an evaluation on the classical M3 benchmark dataset.This has been the standard benchmark of datasets in forecasting for decades, and remains an exellent and highly representative dataset for evaluating univariate methods of the type presented in this paper.Thus, the M3 dataset is the focus of our experiments.We use the M3 time series and forecasts from the "Mcomp" R package (Hyndman et al. 2013).The M3 dataset consists of four categories of time series: (i) yearly (645 series); (ii) other (174 series); (iii) quarterly (756 series); and (iv) monthly (1428 series).The first two categories are non-seasonal, and the second two categories are seasonal at different levels of seasonality.

Error measures
When evaluating our new algorithms we use the same error measures as the participants in the original M3 competition to ensure that the results are directly comparable, and to avoid any irregularities that have been reported in the literature (Boylan et al. 2015).
In particular, we use the following two metrics: (i) the symmetric mean absolute percentage error (sMAPE), which was the primary metric used in the M3 competition (Makridakis & Hibon 2000), given by sMAPE = 200 h h t=1 and (ii) the mean absolute scaled error (MASE), given by: Here, h is the maximum prediction horizon, s is the seasonality (s = 1 for non-seasonal series and s > 1 for seasonal series, e.g., s = 4 for quarterly series), and n is the length of the time series.

Results and comparison
To facilitate a fair assessment of our proposed univariate methods we restrict the scope of competitor methods to univariate techniques only, excluding from our comparison any methods that utilise acrossseries information.Apart from Bergmeir et al. (2016), the only other work that we are aware of that has claimed to achieve some level of improvement over the original competition participants, without cross-learning, and has performed comprehensive experiments on the M3 data, is the Multi Aggregation Prediction Algorithm (MAPA) presented in Kourentzes et al. (2014).The sMAPE metric used by those authors is the same as Equation 27, whereas they use a different definition of the MASE, where those authors always use s = 1 while differentiating in-sample data.However, their results reported of the benchmarks are quite different.For example, sMAPE of auto ETS of the yearly time series is reported as 18.7, while we calculate it as 17.37, which agrees with Bergmeir et al. (2016).An explanation could be changes/improvements in the default parameter configuration of auto ETS between 2014 and 2016.Nonetheless, we note that the paper's proposed algorithm (MAPA) improves on ETS only in the yearly category by 2% (sMAPE), while the corresponding improvement by LGT is 12%.
Finally, it is imporant to note that as the proposed LGT/SGT models are fitted with MCMC (i.e., sampling from the posterior), each run yields slightly different results.However, the standard deviations of metrics that are also shown in Table 3 (e.g., 0.0006 for Monthly series' MASE), are comparatively small, demonstrating that the performance is quite stable.Table 4 shows average and maximum absolute differences of the forecasted values for the same series across seven runs.Typically, the differences are small, e.g., below 0.5%.

Prediction Intervals
The forecasts are a result of step-by-step simulation (with, e.g., 5000 paths) and then aggregation (calculating quantiles).Figure 2 shows a particular quarterly time series, the first 10 of the simulation paths, the 5th and the 95th percentiles, and the 50th percentile taken as the expected value.
The prediction interval coverage appears to be category-dependent.Table 5 shows coverage ranges for our proposed methods, and analogous ranges of auto ETS and auto ARIMA with analytically calculated prediction intervals.ETS and ARIMA are known to produce too narrow intervals, which can be confirmed from the table.While also our method tends to produce too narrow intervals (for example, for yearly series forecasted, in 2.43% of cases the forecasted values are higher than 99p, but this should have happened in 1% of cases), in the large majority of cases, the model performs better than auto ETS and auto ARIMA.Interestingly, though not shown in this table, the simulated intervals of ETS and ARIMA are worse and even narrower.

Ablation study and computation times
While the LGT and SGT approaches are based on ETS methods, they differ from them in multiple ways.The main differences are concentrated between ETS (AAN) and LGT, while SGT can be considered as Table 5: Percent of cases where true values were outside of prediction intervals.Above 95p -percentage of cases when the true value turned out to be above the 95th percentile; Below 5p -percentage of cases when the true value turned out to be below the 5th percentile; etc.The results are based on 7 runs.
a direct extension of LGT into the space of seasonal data.Thus, in this ablation study we investigate differences between ETS (AAN) and LGT to find the importance of the innovations and whether the complete set of these innovations is required to achieve the reported performance.
To achieve this goal we run performance testing on different variants of our proposed LGT approach (Table 6).We remove one by one features such as Student t distribution and heteroscedasticity for errors and replace them with constant variance and normal distribution, respectively.Also, we remove the global trend feature in Table 6).We find that any of these changes decreases the performance of LGT on the M3 non-seasonal (yearly) data.
We also compare the performance of the full LGT method with ETS/ZZZ, ETS/AAN, ETS/AAN (Bayesian version) -ETS (AAN) fitted with Stan using our Bayesian approach, and an implementation of the LGT method within the forecast package in R, which is thus fitted using the standard optimisation method which is used to fit ETS methods.
Table 6 shows that all new features, such as global trend, student distribution and heteroscedasticity for errors, and the Bayesian optimisation of the model parameters brought into the LGT approach are all important to achieve the best performance.
As a part of the ablation study we also recorded average execution times (per single time series and rounded to the 0.01 second) for the investigated methods.We can see that the Bayesian based methods are orders of magnitude slower than the original exponential smoothing techniques.But with the unimpressive performance of our implementation of LGT in the forecast package, the Bayesian model fitting seems necessary to leverage the performance gains of the proposed models.Also note that the bespoke Gibbs sampler runs about an order of magni-

Conclusions
We have presented two related models, the seasonal SGT and the non-seasonal LGT model, that are derived from the AAZ exponential smoothing models.They are designed to perform well on both fast growing, and often irregular, time series found in fast-growing businesses such as cloud services or rideshare businesses, and on the M3 competition dataset, where they bested the top-performing algorithms in every category/metric pair, as well as auto ARIMA and auto ETS from the forecast R package.We believe that the very good forecasting performance of the models on the M3 dataset comes mostly from their use of flexible, non-linear functions for the trend and the size of the error.Standard Exponential Smoothing models can only be either linear or exponential in their trend -these models can deliver both of these two extremes and anything in-between.
Similarly, the function for size of the error bridges Exponential Smoothing models with additive and multiplicative errors.A Student's t-distribution for the error term is also more capable to fit to both smooth and fat-tailed time series.Probabilistic programming systems, including Stan, provide a convenient platform for development of innovative time series algorithms.
The main drawback of our new methods is that they are considerably slower than the original ETS techniques.Though we have been able to overcome this problem to a certain extent by the implementation of a bespoke Gibbs sampler, our methods are still orders of magnitudes slower.However, in the new forecasting world where large scale forecasting is typically performed very competitively by global cross-series models (often Machine Learning models), our method can have an important place in the forecaster's toolbox to tackle single series forecasting without covariates and general data scarcity.

Figure 1 :
Figure 1: Monthly series of the interest in the search term "tesla" over time (source: trends.google.com).

Figure 2 :
Figure 2: Interest in the search term "tesla" over time (source: trends.google.com),as a monthly series.LGT forecast 4 years ahead (50th percentile), ten examples of the forecasting trajectories (in light gray), and the 5th and the 95th percentiles.

Table 1 :
Parameters to be fitted for the LGT model Equation 12 models observation y t+1 by a Student t-distribution, where Equation σt+1 = σ ŷτ t+1 + ξ.(16)Here, t(v, m, s) denotes a Student t-distribution with degrees-of-freedom v, location m and scale s.Furthermore, y t denotes the value of the dependent variable of interest at time t, ŷt+1 is the predicted value (conditional mode) of y at time t + 1 given information up to time t; σt+1 denotes the scale of errors at time t + 1; l t is the level at time t; and b t the local trend at time t.The parameters that must be estimated are described in Table1.

Table 2 :
Parameters to be fitted for the SGT model

Table 3
Bergmeir et al. (2016) and monthly series, we also reproduce the sMAPE and MASE scores for the best variants of bagged algorithms discussed inBergmeir et al. (2016).The metrics in Table3are averaged over all series belonging to a particular category.Table3shows that the LGT/SGT algorithms are superior to their competitors in terms of the sMAPE and MASE metrics in almost all cases, with the exception of sMAPE for monthly time series, for which they are the second best performer.

Table 4 :
Average and Maximum of absolute deviation in percent of forecasted values across 7 runs.

Table 6 :
Performance on M3 non-seasonal data (yearly) of the ablation study methods.tude faster than the Stan implementation, with very competitive results.