A loss discounting framework for model averaging and selection in time series models

We introduce a Loss Discounting Framework for model and forecast combination which generalises and combines Bayesian model synthesis and generalized Bayes methodologies. We use a loss function to score the performance of different models and introduce a multilevel discounting scheme which allows a flexible specification of the dynamics of the model weights. This novel and simple model combination approach can be easily applied to large-scale model averaging/selection, can handle unusual features such as sudden regime changes, and can be tailored to different forecasting problems. We compare our method to both established methodologies and state-of-the-art methods in a number of macroeconomic forecasting examples. We find that the proposed method offers an attractive, computationally efficient alternative to the benchmark methodologies and often outperforms more complex techniques.


Introduction
Recent developments in econometric modelling and machine learning techniques, combined with increasingly easy access to vast computational resources and data, have led to a proliferation of forecasting models yielding either point forecasts or full forecast density functions. This trend has been met with a renewed interest in tools that can effectively use these different forecasts, such as model selection, or forecast combination, pooling or synthesis, e.g., Stock and Watson (2004), Hendry and Clements (2004), Hall and Mitchell (2007), Raftery et al. (2010), Geweke and Amisano (2011), Waggoner and Zha (2012), Koop and Korobilis (2012), Billio et al. (2013), Del Negro et al. (2016), Yao et al. (2018), McAlinn and West (2019), Diebold et al. (2022), Li et al. (2023), to mention just a few. Wang et al. (2023) provide an excellent recent review of work in this area.
Combining forecasts from different models, rather than using a forecast from a single model, is intuitively appealing and justified by improved empirical performance (see e.g. Bates and Granger, 1969; Stock and Watson, 2004). Hendry and Clements (2004) suggested that combining point forecasts provides insurance against poor performance by individual models which are misspecified, poorly estimated or non-stationary.
In density forecasting, the superiority of a combination over single models is less clear. Bayesian model averaging (BMA) (Leamer, 1978) is a simple and coherent approach to weighting forecasts in a combination, but may not be optimal under logarithmic scoring when the set of models to be combined is misspecified (Diebold, 1991). Since sets of models will usually not include the true data generating mechanism, this result has driven a substantial literature proposing alternatives to BMA. Hall and Mitchell (2007) proposed a logarithmic scoring rule for a time-invariant linear pool with weights on the simplex, which leads to a forecast density combination that minimises the Kullback-Leibler divergence to the true but unknown density. This idea has been extended to Bayesian estimation (Geweke and Amisano, 2011), Markov switching weights (Waggoner and Zha, 2012) and dynamic linear pools (Del Negro et al., 2016; Billio et al., 2013). These approaches often lead to better forecasting performance but at the cost of increased computational expense. A computationally cheaper alternative directly adjusts the model weights from BMA to allow time-variation (Raftery et al., 2005), leading to Dynamic Model Averaging (DMA) (Raftery et al., 2010), which uses exponential discounting of Bayes factors with a discount (also called forgetting or decay) factor to achieve time-varying model weights. Performance can be sensitive to the discount factor, and Koop and Korobilis (2012) suggested using logarithmic score maximisation to find an optimal discount factor for DMA. Beckmann et al. (2020) applied this idea to model selection and developed the Dynamic Model Learning (DML) method with an application to foreign exchange forecasting. Outside the formal Bayesian framework, Diebold et al. (2022) suggested a simple average of the forecasts from a team of N (or fewer) forecasters chosen using the average logarithmic scores in the previous rw periods. This can be seen as a localised and simplified version of Hall and Mitchell (2007).
Recently, McAlinn and West (2019) and McAlinn et al. (2020) proposed a broad theoretical framework called Bayesian Predictive Synthesis (BPS) which includes the majority of proposed Bayesian techniques as special cases. They proposed a novel forecast combination method using latent factor regression, cast as a Bayesian seemingly unrelated regression (SUR), and showed better performance than the BMA benchmark and an optimal linear pool. However, the approach can be computationally demanding with a large pool of models. Tallman and West (2023) use entropic tilting to expand the BPS framework to more general aims than forecast accuracy (such as return maximisation in portfolio allocation).
This paper describes our Loss Discounting Framework (LDF), which extends DMA and DML to general loss functions (in a similar spirit to Tallman and West, 2023) and more general discounting dynamics. A computationally efficient time-varying discounting scheme is constructed through a sequence of pools of meta-models, which starts with the initial pool. Meta-models at one layer are constructed by combining meta-models at the previous layer using a DMA/DML-type rule with different discount factors. We show that LDF can outperform other benchmark methods and is more robust to hyperparameter choice than DMA/DML, both in simulated data and in foreign exchange forecasting based on econometric fundamentals with a large pool of models. We also show how tailoring the approach to constructing long-short foreign exchange portfolios can lead to economic gains. A second example illustrates the limitations of our methodology in US inflation forecasting.
The paper is organised as follows. Section 2 presents some background leading into a description of the proposed methodology in Section 3. In Section 4, the performance of the LDF approach is examined in a simulated example and in applications to foreign exchange and US inflation forecasting. We discuss our approach and set out directions for further research in Section 5. The code to reproduce our study is freely available from https://github.com/dbernaciak/ldf.

Background
It is common in Bayesian analysis (Bernardo and Smith, 2009; Yao et al., 2018, and references therein) to distinguish three types of model pools M = {M_1, M_2, ..., M_K}: M-closed - the true data generating process is described by one of the models in M but is unknown to researchers; M-complete - the model for the true data generating process exists but is not in M, which is viewed as a set of useful approximating models; M-open - the model for the true data generating process is not in M and the true model cannot be constructed either in principle or due to a lack of resources, expertise etc. Model selection based on BMA only converges to the true model in the M-closed case (see e.g. Diebold, 1991) and can perform poorly otherwise.
There are several reasons to believe that econometric problems are outside the M-closed setting. Firstly, real-world forecasting applications often involve complex systems and the model pool will only include approximations at best. In fact, one might argue that econometric modellers have an inherent belief that the models they propose provide a reasonable approximation to the data generating process even if certain process features escape the capabilities of the supplied methodologies. Secondly, in many applications, the data generating process is not constant in time (Del Negro et al., 2016) and may involve regime changes and considerable model uncertainty. For example, in the foreign exchange context, Bacchetta and Van Wincoop (2004) proposed the scapegoat theory, suggesting that investors display a rational confusion about the true source of exchange rate fluctuations. If an exchange rate movement is affected by a factor which is unobservable or unknown, investors may attribute this movement to some other observable macroeconomic fundamental variable. This induces regimes where different market observables might be more or less important.
These concerns motivate a model averaging framework that is both suitable for M-complete (or even M-open) situations and incorporates time-varying model weights. (Clarke et al. (2013) give a slightly unusual example of the works of William Shakespeare as an M-open problem: the works (data) have a true data generating process (William Shakespeare) but one can argue that it makes no sense to model the mechanism by which the data were generated.) We use π_{t|s,k} to represent the weight of model k at time t using information to time s and use the forecast combination density

p(y_t | y^s) = Σ_{k=1}^{K} π_{t|s,k} p_k(y_t | y^s), (2.1)

where p_k(y_t | y^s) represents the forecast density of model k at time t using information y_1, ..., y_s, which we call the predictive likelihood. DMA (Raftery et al., 2010) assumes that s = t−1 and updates π_{t+1|t,k} using the observation at time t, y_t, and a forgetting factor, denoted by α, via the recursions

π_{t|t−1,k} = (π_{t−1|t−1,k}^α + c) / Σ_{l=1}^{K} (π_{t−1|t−1,l}^α + c), (2.2)

π_{t|t,k} = π_{t|t−1,k} p_k(y_t | y^{t−1}) / Σ_{l=1}^{K} π_{t|t−1,l} p_l(y_t | y^{t−1}), (2.3)

where c is a small positive number introduced to avoid model probabilities being brought to machine zero by aberrant observations. The log-sum-exp trick is an alternative way of handling this numerical instability which would, at least in part, eliminate the need for the constant c. We leave the role of this parameter to further research. The recursions in (2.2) and (2.3) amount to a closed-form algorithm to update the probability that model k is the best predictive model given information up to time t, for forecasting at time t. A model receives a higher weight if it performed well in the recent past, and the discount factor α controls the importance that one attaches to the recent past. For example, if α = 0.7, the forecast performance 12 periods prior to the last one receives approximately 2% of the importance of the most recent observation; if α = 0.9, this importance is as high as 31%. Therefore, lower values of α lead to large changes in the model weights. In particular, α → 0 would lead to equal model weights and α = 1 recovers standard BMA.
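One step of the DMA weight recursion can be sketched in a few lines of code (a minimal illustration of our own; we assume the standard placement of the constant c after the power step, and the function name and toy values are not from the paper):

```python
import numpy as np

def dma_update(post_prev, pred_dens, alpha, c=1e-10):
    """One step of the DMA weight recursion in (2.2)-(2.3).

    post_prev : posterior model probabilities pi_{t-1|t-1} (length K)
    pred_dens : predictive densities p_k(y_t | y^{t-1}) evaluated at y_t
    alpha     : discount/forgetting factor in (0, 1]
    c         : small constant guarding against machine-zero weights
    """
    # Prediction step: exponential discounting flattens the weights.
    pred = post_prev ** alpha + c
    pred = pred / pred.sum()
    # Update step: standard Bayesian reweighting by the predictive likelihood.
    post = pred * pred_dens
    return post / post.sum()

# Importance of an observation j periods in the past decays as alpha**j:
print(round(0.7 ** 11, 3), round(0.9 ** 11, 3))  # roughly 0.02 and 0.31, as in the text
```

Iterating `dma_update` over the sample reproduces the time-varying weights; α = 1 leaves the discounting step inert and recovers BMA up to the constant c.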
DMA has been shown to perform well in econometric applications whilst avoiding the computational burden of the large-scale Markov chain Monte Carlo (MCMC) or sequential Monte Carlo associated with methods such as Waggoner and Zha (2012). Del Negro et al. (2016) showed that DMA performed comparably to their novel dynamic prediction pooling method in forecasting inflation and output growth. It was subsequently expanded and successfully used in econometric applications by Koop and Korobilis (2012), Koop and Korobilis (2013) and Beckmann et al. (2020). In the first two papers the authors compare DMA for a few possible values of the discount factor α, whereas in the last paper the authors follow the recommendation of Raftery et al. (2010) to estimate the forgetting factor online in the context of Bayesian model selection. We find that estimating the forgetting factor is key to the performance of DMA, as we will show in our simulation study and empirical examples. Our LDF provides a general approach by combining multiple layers of discounting with time-varying discount factors to provide better performance and robustness to the hyperparameter choice.

Methodology
Our proposed loss discounting framework (LDF) provides a method of updating time-varying model weights using flexible discounting of a general measure of model performance. The flexible discounting is achieved by defining layers of meta-models using the simple discount scheme in (2.2) and (2.3). The approach can be used for both dynamic model averaging and dynamic model selection. For example, we can define a pool of forecast combination densities by applying (2.1) with different values of the discount factor. We refer to the elements of this pool as meta-models. We can subsequently find the best meta-model average (or best meta-model) by again applying exponential discounting to the past performance of the meta-models. This leads to an approach with two layers, but clearly we could continue the process by defining a pool of meta-models at one layer by applying the forecast combination in (2.1) to a pool of meta-models at the previous layer.
The method has two key features. The first is the ability of the model averaging (selection) we develop to shrink the pool of relevant models (show greater certainty across time in a single model) in times of low volatility and to encompass more models (display greater variation in model selection) when the volatility of the system is high. The second is the use of a generalised measure of model performance, which enables users to define scores/losses directly connected with their final goal.
As we show in the empirical study, aligning model scores to the final purpose leads to better performance.

Loss Discounting Framework
We first describe how a score can be used to generalize DMA and then describe our discounting scheme using meta-models in more detail. The score or loss (we will use these terms interchangeably) is defined for the prediction of an observation with predictive distribution p and observed value y, and is denoted S(p, y). This measures the quality of the predictive distribution if the corresponding observed value is y. For a set of K models, we assume that the (one-step-ahead) predictive distribution for model k at time t is p_{k,t} = p_k(y_t | y^{t−1}) and define the log-discounted predictive likelihood for the k-th model at time t using discount factor α to be

LDPL_{t,k}(α) = Σ_{i=1}^{t−1} α^{t−1−i} S(p_{k,i}, y_i).

We define a model averaged predictive density

p(y_t | y^{t−1}) = Σ_{k=1}^{K} w_{t,k} p_k(y_t | y^{t−1}), where w_{t,k} = exp(LDPL_{t,k}(α)) / Σ_{l=1}^{K} exp(LDPL_{t,l}(α)).

This generalizes the use of logarithmic scoring in DMA. The use of scoring rules for Bayesian updating of parameters (rather than inference about models in forecast combination) was pioneered by Bissiri et al. (2016) and is justified in an M-open or misspecified setting. Loaiza-Maya et al. (2021) extend this approach to econometric forecasting. Both consider equally weighted sums (i.e., α = 1 for layer 2). Miller and Dunson (2019) provide a justification for using a powered version of the likelihood of misspecified models.
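In code, the LDPL and the softmax weights can be sketched as follows (our own minimal implementation of the formulas above, not the paper's code; the log-sum-exp shift guards against overflow):

```python
import numpy as np

def ldpl(scores, alpha):
    """Log-discounted predictive likelihood at time t for every model.

    scores : (t-1, K) array of past scores S(p_{k,i}, y_i), most recent row last
    alpha  : discount factor; alpha = 1 recovers an equally weighted sum
    """
    t = scores.shape[0]
    discounts = alpha ** np.arange(t - 1, -1, -1)  # oldest score discounted most
    return discounts @ scores

def softmax_weights(ldpl_vals):
    """Model-averaging weights: softmax of the LDPL, shifted for stability."""
    z = ldpl_vals - ldpl_vals.max()
    w = np.exp(z)
    return w / w.sum()

scores = np.log(np.array([[0.4, 0.6], [0.3, 0.7], [0.5, 0.5]]))  # toy log scores
w = softmax_weights(ldpl(scores, alpha=0.9))
```

With the logarithmic score, replacing `softmax_weights` by an argmax gives the model selection variant used later in DML.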
Each meta-model is defined using a recipe for model or meta-model averaging/selection. We consider a specific type of recipe based on exponential discounting of the scores with different discount factors from a set of possible values S_α = {α_1, ..., α_M}. To lighten the notation, we write w(m) and LDPL(m) to denote weights and log-discounted predictive likelihoods evaluated at α_m.
In the first layer, each model in the model pool is scored and the i-th meta-model is defined by applying either DMA or DML with discounting α_i and the weights defined above.
Then, to construct the second layer, the meta-models in the first layer are scored and the i-th meta-model is again defined by applying either DMA or DML with discounting α_i to these scores. This iterative process can easily be extended to an arbitrary number of layers. We highlight two parallels between the methods used in LDF for time series models and concepts in Bayesian modelling. The first parallel is between the layers of meta-models in LDF and the use of hyperpriors in Bayesian hierarchical models: similarly to choosing the set-up of hyperpriors in hierarchical models, LDF allows for a varying depth and type of meta-model layers appropriate for the use case in question. We also draw an analogy between model selection and the maximum a posteriori probability (MAP) estimate of a quantity, and between model weights in model averaging and the full posterior distribution.
To provide a full description of the approach, we write the forecast densities of the K models as p_k^{(0)}(y_t | y^{t−1}) to make notation consistent. At every other layer, we define predictive meta-models which are averages of (meta-)models at the previous layer. At the first layer, we directly use the forecast combination in (2.1) and, for n ⩾ 2, we apply (2.1) to the M meta-models specified at the previous layer,

p_m^{(n)}(y_t | y^{t−1}) = Σ_{j} w^{(n)}_{t|t−1,j} p_j^{(n−1)}(y_t | y^{t−1}).

To define the weights w^{(n)}_{t|t−1,k}, we extend the log-discounted predictive likelihood for the k-th (meta-)model at the n-th layer at time t using discount factor α_m to be

LDPL^{(n)}_{t,k}(α_m) = Σ_{i=1}^{t−1} α_m^{t−1−i} S(p^{(n)}_{k,i}, y_i).

The weights in layer n are constructed using either softmax (to give a form of (meta-)model averaging) or argmax (to give a form of (meta-)model selection). We use the notation L_n to represent this operation in the n-th layer, which can take the value s (softmax) or a (argmax): if L_n = s, then w^{(n)}_{t|t−1,k} ∝ exp(LDPL^{(n−1)}_{t,k}), while if L_n = a, all weight is placed on the (meta-)model with the largest LDPL^{(n−1)}_{t,k}. The N-layer LDF with score S and with choice L_n (equal to s or a) at layer n will be written LDF^N_{L_1 L_2 ... L_N}(S). The scheme only needs a single discount factor to be chosen in the final meta-model layer. This parameter might be set by an expert or calculated on a calibration sample if the data sample is sufficiently large to permit robust estimation. In LDF, we refer to the discount factor in the final meta-model layer as α.
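A two-layer version of this construction can be sketched as follows (an illustrative simplification under the notation above, not the paper's implementation; the helper names are ours, and for brevity the sketch scores densities only at the realised observations):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def layer_weights(score_hist, alpha, rule):
    """Weights for one (meta-)model layer from a (t, K) history of log scores."""
    disc = alpha ** np.arange(score_hist.shape[0] - 1, -1, -1)
    ldpl = disc @ score_hist
    if rule == "s":                 # softmax -> (meta-)model averaging
        return softmax(ldpl)
    w = np.zeros_like(ldpl)         # argmax -> (meta-)model selection
    w[np.argmax(ldpl)] = 1.0
    return w

def ldf2_predict(dens_hist, dens_now, alphas, rules=("s", "s"), alpha_final=0.9):
    """One-step LDF^2 predictive density value.

    dens_hist : (t, K) past one-step predictive densities p_k(y_i | y^{i-1})
                evaluated at the realised y_i
    dens_now  : (K,) density values for the next step
    """
    t, K = dens_hist.shape
    log_scores = np.log(dens_hist)
    # Layer 1: one meta-model per discount factor alpha_m.
    meta_hist = np.empty((t, len(alphas)))
    meta_now = np.empty(len(alphas))
    for m, a in enumerate(alphas):
        for i in range(t):
            w = layer_weights(log_scores[:i], a, rules[0]) if i else np.full(K, 1 / K)
            meta_hist[i, m] = w @ dens_hist[i]
        w = layer_weights(log_scores, a, rules[0])
        meta_now[m] = w @ dens_now
    # Layer 2: score the meta-models and combine them with the single final alpha.
    v = layer_weights(np.log(meta_hist), alpha_final, rules[1])
    return v @ meta_now
```

Choosing `rules=("s", "s")` corresponds to LDF^2_{s,s}, `("a", "a")` to the DML-style LDF^2_{a,a}, and so on.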
As well as defining a model combination at each layer, LDF^N_{L_1 L_2 ... L_N}(S) also leads to a discount model averaging of the initial model set for any N,

p(y_t | y^{t−1}) = Σ_{k=1}^{K} π_{t|t−1,k} p_k^{(0)}(y_t | y^{t−1}). (3.3)

Given this set-up, the models and meta-models are either averaged by using the softmax function or selected by using the argmax function applied to the log-discounted predictive likelihood.

Special cases

Dynamic Model Averaging
The updates of the Dynamic Model Averaging weights in (2.3) correspond to passing LDPL_{t,k} with the logarithmic scoring function through the softmax function. In DMA we only have one level of discounting, where the p_k(y_t | y^{t−1}) are the different forecaster densities. Therefore, we can denote DMA as LDF^1_s, where the superscript indicates a single level of loss discounting and the subscript s indicates the use of the softmax function.

Dynamic Model Learning
Dynamic Model Learning (DML) (Beckmann et al., 2020) provides a way to optimally choose a single discount factor for the purposes of model selection. In DML, the logarithmic scores S(p_k^{(0)}, y_{t−i}) are passed through an argmax function to select the best model. We can refer to DML as LDF^2_{a,a} with the second layer of meta-models restricted to a single point on the grid, namely S_α = {1} for n = 2.
A similar idea for model averaging, using the softmax function for selecting an ensemble of parameters α, was developed in Zhao et al. (2016).

Two-Layer Model Averaging/Selection within Loss Discounting Framework
The Loss Discounting Framework allows us to describe more general set-ups for discounting in forecast combination, such as models with two or more meta-model levels. In this paper we focus on LDF with two layers of meta-models, such as LDF^2_{s,a}, LDF^2_{a,s}, LDF^2_{a,a} and LDF^2_{s,s}, as well as the limiting cases such as LDF^∞_{s···s}. In contrast to DMA and DML, having two (with α ≠ 1) or more layers of meta-models makes the discount factors in the other layers time dependent which, as we show in the next sections, leads to improved performance of model averaging and selection.
In terms of computation time, our proposed algorithm is very fast as it relies only on simple additions and multiplications. This is an advantage over more sophisticated forecast combination methods when the time series is long and/or we would like to incorporate a large (usually greater than 10) number of forecasters.
As mentioned before, LDF^2_{a,a} is a generalised version of the DML presented in Beckmann et al. (2020), where the authors implicitly suggest α = 1, i.e., all past performances of the forgetting factors are equally weighted. In the limit α → 0 we would choose the discount factor which performed best in the latest period, disregarding any other history. The LDF^2_{a,s} specification is a hybrid between model selection and model averaging. The first layer performs model selection for each discount factor; the second layer performs model averaging over the discount factors. Therefore, for each discount factor we select a single model, but we then take a mixture of discount factors, which results in a mixture of models.

Properties of LDF as N → ∞
It is natural to consider the impact of additional layers in an LDF model.If we use either the softmax or the argmax at all layers, the weights for each model converge as N → ∞ and so adding more layers has a diminishing effect on the sequence of predictive distributions.Intuitively, for the softmax functions, we have a diminishing impact on the final result as we take weighted averages of the weighted averages of the models, and for the argmax/model selection the LDF approach settles on a single model for any discount factor in the final layer.The detailed and rigorous proofs are provided in the technical Appendix A. We demonstrate in the empirical sections that the sequence converges to a predictive distribution which is often the best or nearly best performing set up of the LDF framework.
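A toy numerical illustration of this diminishing effect (our own sketch, not the proof in Appendix A): if each extra layer mixed the previous layer's meta-models with the same row-stochastic softmax weight matrix, the effective weights would follow a Markov-chain power iteration and converge:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)
M, K = 4, 6
# Rows of A: softmax weightings a layer places on the previous layer's M meta-models.
A = np.vstack([softmax(rng.normal(size=M)) for _ in range(M)])
# W: weights the first-layer meta-models place on the K base models.
W = np.vstack([softmax(rng.normal(size=K)) for _ in range(M)])

for _ in range(500):            # adding a layer corresponds to W_n = A @ W_{n-1}
    W = A @ W
# One more layer now leaves the effective base-model weights essentially unchanged.
print(np.abs(A @ W - W).max())
```

For the argmax case the fixed point is reached even faster, as every layer simply passes through the selected model.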

Comments
Low variation in the LDPL across time leads to model weight concentration on fewer models, while higher variation in the LDPL leads to the opposite: the model weights are more evenly spread. This is because, in the presence of high variation in the LDPL, lower discount factors will be preferred and hence faster forgetting, which accommodates regime switching.
If one believes that the data generating process (DGP) is present in the pool, LDF will not perform as well as BMA, which will asymptotically converge to the right model more quickly than LDF. Conversely, if the DGP is not among the models in the pool, LDF adapts by adjusting the weights of the models over time to approximate the DGP.
Following the argument in Del Negro et al. (2016) to interpret DMA in terms of a Markov switching model, our extension allows a time-varying transition matrix, i.e. Q_t = (q_{kl}(t)). The gradual forgetting of the performance of the discount factor α allows the optimal discount factor to change when the underlying transition matrix changes. However, we also show that our two-layer model specification outperforms the standard DMA model even when the transition matrix is not time-varying. This point is further illustrated in Appendix B.

Examples
Our methodology is best suited to data with multiple regime switches with a potentially time-varying transition matrix.As such, it is particularly useful for modelling data such as inflation levels, interest or foreign exchange rates.We illustrate our model on a simulated example and two real data examples.The supplementary materials for our examples are given in Appendix B, Appendix C, Appendix D and Appendix E.
We evaluate the performance of the models by calculating the out-of-sample mean log predictive score (MLS),

MLS = (1 / (T − s)) Σ_{t=s+1}^{T} log p_LDF(y_t | y^{t−1}),

where y_1, ..., y_s are the observations for a calibration period, T is the total number of observations and p_LDF corresponds to the selected LDF model.
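For concreteness, the MLS is a direct average of log predictive density values over the out-of-sample period (a transcription of the definition; the argument names are ours):

```python
import numpy as np

def mean_log_score(pred_dens_at_y, s):
    """Out-of-sample mean log predictive score.

    pred_dens_at_y : length-T array of one-step predictive density values
                     p(y_t | y^{t-1}) evaluated at the realised y_t
    s              : length of the calibration period; scoring starts at t = s + 1
    """
    return np.mean(np.log(pred_dens_at_y[s:]))
```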

Simulation study
In the data generating process (DGP) of Diebold et al. (2022), y_t is the variable to be forecast, x_t is the long-run component of y_t, and μ_t is the time-varying level (set to 0 in Diebold et al. (2022)). We can interpret μ_t as a piecewise-constant deterministic signal with a finite state space that accounts for regime switches. The error terms are all i.i.d. and uncorrelated.
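A simulator in this spirit can be sketched as follows (illustrative parameters of our own choosing; this is not the exact Diebold et al. (2022) specification):

```python
import numpy as np

def simulate(T=600, levels=(0.0, 2.0, -1.0), seg=200, rho=0.8, sigma=0.3, seed=0):
    """Regime-switching toy DGP: piecewise-constant level mu_t plus a
    persistent long-run component x_t and i.i.d. observation noise."""
    rng = np.random.default_rng(seed)
    mu = np.repeat(levels, seg)[:T]      # piecewise-constant level with regime switches
    x = np.zeros(T)                       # persistent long-run component of y
    y = np.zeros(T)
    for t in range(1, T):
        x[t] = rho * x[t - 1] + rng.normal(scale=sigma)
        y[t] = mu[t] + x[t] + rng.normal(scale=sigma)
    return y, mu

y, mu = simulate()
```

The abrupt jumps in `mu` are the feature that BMA struggles with and that the time-varying discounting is designed to track.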
In Figure 1 we present how the synthesised forecast level of LDF^2_{s,s} adjusts to the mean levels implied by the DGP. The model is very reactive, with the predicted mean level following the true DGP mean closely with only a small time lag.
All results are based on 10 runs (except BPS, for which we performed only 1 run due to computational cost), where the levels were fixed but the random numbers regenerated. Standard Bayesian model averaging (MLS = -4.34) fared poorly since it quickly converged to the wrong model. DeCo (MLS = -0.57), adapted to output 39 quantiles from which we calculated the log scores, did not cope well with the abrupt level changes in our numerical example and overestimated the variance, which led to poorer scores. BPS (MLS = -0.73) with normal agent predictive densities performed better than BMA but struggled to adjust quickly to the regime changes, which resulted in low log-scores at the change points. The N-average method (we chose a rolling window of 5 observations, which performed best) performed better than BMA and BPS, with an MLS of -0.52 for N = 3 and N = 4, and similarly to the standard DMA method of Raftery et al. (2010).
Crucially, we note that DMA's performance varies significantly depending on the hyperparameter choice, whereas the multilayered LDF methods' performance does not. This is clearly illustrated in Figure 2a. One could adopt various strategies for finding the hyperparameters. The most basic one would be to tune the hyperparameter on the calibration period and keep the parameter constant thereafter. In this case, for example, if we set the calibration period to 250 the methods choose discounts as follows: DMA 0.5 (MLS = -0.50); LDF^2_{s,s} 0.6 (MLS = -0.42); LDF^2_{s,a} 0.7 (MLS = -0.49). For comparison, the stable-state LDF^∞_{s,···,s} achieves MLS = -0.41. The non-LDF model averaging methods, namely BPS, DeCo and the best N-average, were tuned to achieve the best performance over the entire sample a posteriori, in contrast to the LDF models, where we select a single configuration based on the initial sample of 250 observations. Another strategy could be to select the best discount factor at each time step (online) based on an expanding window, potentially exponentially weighted; this boils down to an LDF approach with an additional argmax layer. In this case DMA simply becomes LDF^2_{s,a} and, as shown, this can lead to better results. Even better results and more robustness can be achieved using LDF^2_{s,s}, where a mix of discount factors is used.
In Figure 2b we present the LPDR for the tested models against LDF^∞_{s,···,s}. The LDF models (including DMA) generally performed better; however, the results suggest that the 2-layer LDF, which can weight multiple discount factors, is more robust to abrupt changes in the level than the other methods. For DeCo, we used 39 quantiles in increments of 0.025 and the default settings of the DeCo package with Σ = 0.09 (matching our DGP), with learning and parameter estimation; the quantiles indicated that the output can be well approximated by the normal distribution. For BPS, we used the original set of parameters (adjusting β = 0.95 and δ = 0.95 to get better results, as proposed by the authors of the paper, but adjusting the prior variance to match the σ_y parameter); the model was run for 5000 MCMC paths with a 3000-path burn-in period. All other runs of BPS (which achieved worse results) are detailed in Appendix B.
Figure 3 shows how the average parameter α in the first meta-model layer changes dynamically for LDF^2_{s,a} with α = 0.95 and LDF^2_{s,s} with α = 0.8. It is close to 1 in periods of stability and closer to 0 in times of abrupt changes. In comparison, for α = 1 the average parameter α in the first meta-model layer is rather stable, oscillating around 0.6. As mentioned before, this variation in the first-layer α may be beneficial: the lower the α parameter, the more models will be taken into consideration and the more uncertainty the final outcome may show. Additionally, a lower α facilitates the ability to quickly re-weight the models to adapt to a new regime, whereas in times of stability it may be better to narrow down the meaningful forecasts to a smaller group by increasing α. This illustrates how the two-layer model provides useful flexibility in the discount factors in the first meta-model layer. Another observation from Fig. 3 concerns the average values of the discount parameters α in the first layer across time. For LDF^2_{s,a}, the average first-layer α with α = 0.95 in the second layer is 0.75, and for LDF^2_{s,s} with α = 0.8 in the last layer it is 0.71, whereas the average α for both LDF models with α = 1 in the last layer is 0.61.

Foreign Exchange Forecasts
We consider exchange rate forecasting (see Rossi, 2013, for a comprehensive review). The random walk is a typical benchmark as it corresponds to the claim that exchange rates are unpredictable, but Rossi (2013) argues that economic variables can have time-varying predictive power. Beckmann et al. (2020) consider exploiting this predictive ability using DML with a pool of Time-Varying Parameter Bayesian Vector Autoregressive (TVP-BVAR) models with different subsets of economic fundamentals. We closely follow their set-up. Appendix D.1 and Appendix D.2 describe the model. We use a set of G10 currencies: Australian dollar (AUD), Canadian dollar (CAD), euro (EUR), Japanese yen (JPY), New Zealand dollar (NZD), Norwegian krone (NOK), Swedish krona (SEK), Swiss franc (CHF), pound (GBP) and US dollar (USD). All currencies are expressed in terms of the amount of dollars per unit of a foreign currency, i.e. the domestic price of a foreign currency. The data is monthly and runs from November 1989 to July 2020. This is a more up-to-date data set than the one used in other studies, but similar in length. We use the following macroeconomic fundamentals:
• Uncovered Interest Rate Parity (UIP), which postulates that, given the spot rate S_t, the expected rate of appreciation (or depreciation) is approximately (E_t[S_{t+h}] − S_t)/S_t ≈ i_t − i*_t, where i_t is the domestic and i*_t the foreign interest rate corresponding to the time horizon h of the return.
• Long-short interest rate difference -the difference between 10 year benchmark government yield and 1 month deposit rate.
• Stock growth -monthly return on the main stock index of each of the G10 currencies/countries.
• Gold price -monthly change in the gold price.
The data is standardised based on the mean and standard deviation calibrated to an initial training period of 10 years.
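This calibration-window standardisation avoids look-ahead bias and can be sketched as follows (a minimal illustration of our own; with monthly data, a 10-year window corresponds to `n_calib = 120`):

```python
import numpy as np

def standardise(x, n_calib):
    """Standardise a series using the mean and standard deviation computed
    on an initial calibration window only, so the out-of-sample period
    never informs its own scaling."""
    m = x[:n_calib].mean()
    s = x[:n_calib].std(ddof=1)
    return (x - m) / s
```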
We consider using all possible models as our pool (which consists of 2048 models including all possible subsets of the fundamentals). Comparison to the competing methods, namely the N-average method (Diebold et al., 2022), DeCo (Billio et al., 2013) and BPS (McAlinn and West, 2019), is not available for this pool due to computational cost, and so we also consider a small pool (which consists of the 32 models based on UIP only and time-constant parameters). An exhaustive list of model parameter settings is outlined in Appendix D.2.
We compare performance of the competing model averaging techniques using the logarithmic score and economic evaluation using Sharpe ratios of a long-short currency portfolio.We find that LDF provides superior performance according to the logarithmic score and demonstrate how these differences in scores manifest themselves in an economic evaluation.

Small model pool -analysis of scores
Figure 4 compares the logarithmic score for DMA and some specifications of LDF for the small and large pools. LDF provides better performance for an optimal choice of the hyperparameter and is more robust to the choice of the hyperparameters than DMA. Interestingly, for model selection, we note that the proposed two-layer LDF specification LDF^2_{a,a} (as well as LDF^∞_{a,...,a}) improved upon the DML method (Beckmann et al., 2020), which, as we recall, is LDF^2_{a,a} with α = 1. The best scores in model averaging/selection were achieved for the LDF^2_{s,s}/LDF^2_{a,s} specification with α = 0.9. The average value of the discount parameters α in the first meta-model layer across time for LDF^2_{s,s} with α = 0.9 is 0.77, which was very similar for α = 1. However, the variability of α in the first meta-model layer for α < 1 was much larger, i.e. α was closer to 0 during times of increased volatility and closer to 1 in calmer times (the same observation holds for either pool of models).
Figure 5 shows the LPDR for the competing methods on an expanding window, with the hyperparameters of LDF calibrated using the first 10 years of data and the competing models calibrated in sample. The LPDRs show considerable time variation, and the sudden drops in performance of LDF 2 s,a and other models correspond to periods of large increases in FX volatility as measured by the Barclays G10 FX index. In comparison to the other methods, the LDF 2 s,s method with α = 0.9 performs best (MLS = 22.16), followed by the other two-layer LDF specifications and the 4-model average (MLS = 22.10). The BPS method (MLS = 21.60) did not perform well here, and similarly the DeCo method (MLS = 18.31) using a multivariate normal approximation. In terms of model performance out of sample, LDF 2 s,s calibrated only on the initial 10 years of data (to select α), yielding α = 0.8 (MLS = 22.15), still outperforms the other non-LDF models, which were calibrated in sample. The detailed results are presented in Table D.3 in Appendix D. The stable state LDF models performed similarly to the two-layer specification: LDF ∞ s,...,s achieves MLS = 22.13 and LDF ∞ a,...,a scores MLS = 22.07.

Large model pool - analysis of scores
We can only consider the LDF methods for the large pool of models due to the run times of the other methods. Additional meta-model layers have a similar impact as in the small model pool, but with a less pronounced effect on model selection and a greater effect on model averaging (Figure 4). Again, the best model averaging scores were achieved for the LDF 2 s,s specification with α = 0.8 (MLS = 22.37), and the stable state LDF models performed similarly to the two-layer specification, with MLS = 22.35 for LDF ∞ s,...,s. For model selection, the multi-layer specification of LDF introduces robustness to the hyperparameters but does not necessarily outperform the single-layer LDF in terms of log scores. Interestingly, in the larger pool, the EWMA Random Walk (RW) model with decay factor 0.97 was not the best of all models considered (MLS = 21.77), but it performed almost on par with the a posteriori best model (MLS = 21.78), which indicates that even in a large pool of models it is hard to find a single model that outperforms the random walk.

Economic evaluation of model selection
We consider economic evaluation by constructing a portfolio of long and short currency positions targeting 10% annual volatility with 8bps transaction costs. We measure performance, applied to the smaller model pool of 32 models, by looking at the cumulative wealth over time as well as the Sharpe ratio, which captures risk-adjusted performance. To target the Sharpe ratio, we define the score to be the portfolio returns divided by the portfolio standard deviation based on a rolling twelve-month window.
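The focused score defined above can be sketched in a few lines. This is a minimal illustration, assuming monthly returns and the sample standard deviation over the previous twelve observations; the function name and the simulated return series are hypothetical, not part of the paper's implementation.

```python
import numpy as np

def focused_score(returns, window=12):
    """Focused (Sharpe-targeting) score: portfolio return divided by the
    rolling standard deviation over the previous `window` months."""
    r = np.asarray(returns, dtype=float)
    scores = np.full(r.shape, np.nan)        # undefined until a full window exists
    for t in range(window, len(r)):
        sd = r[t - window:t].std(ddof=1)     # trailing sample standard deviation
        scores[t] = r[t] / sd
    return scores

rng = np.random.default_rng(1)
monthly = rng.normal(0.005, 0.03, 60)        # five years of synthetic monthly returns
fs = focused_score(monthly)
```

Averaging this score over time yields a quantity proportional to a (monthly) Sharpe ratio, which is what makes it suitable as a goal-aligned utility for model weighting.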
We concentrate on LDF configurations that select a single model at a time, that is LDF 2 a,a and LDF 1 a. Portfolios are constructed by maximising the returns subject to a fixed risk per model, as in Beckmann et al. (2020). Model averaging cannot be used as the correlation between the investment strategies would inevitably change the target risk level of the portfolio. An alternative approach to portfolio construction, in the context of model averaging, was presented in Tallman and West (2023), where each model aims to minimise the risk subject to a fixed return target. LDF 2 a,a only narrowly outperforms LDF 1 a on the log score (Figure 4) but has a higher Sharpe ratio (Figure 6). This is in line with the observations in Beckmann et al. (2020), who note that small differences in the log scores can translate to noteworthy economic differences. The right panel of Figure 6 shows the mean focused scores (MFS), where there are clear differences (unlike the log scores), with the double discounting version of LDF achieving better scores leading to higher Sharpe ratios and higher final wealth, as seen in Figure 7. The double discounting of LDF 2 a,a allows the discount factor to drop in times of higher volatility such as during the great financial crisis, the Chinese crash or the Brexit referendum. For α = 0.7 in the second layer, the time-average value of α in the first layer is 0.71, while with α = 1 in the second layer it is 0.80. This is in contrast to the DML (LDF 2 a,a with α = 1) and LDF 1 a specifications, where in the former the discount factor settles at 0.9 and does not move, and in the latter it is simply fixed to a predetermined constant value.
In Figure 8 we show the portfolio composition through time. We note that the weights display stability in times when the portfolio value experiences periods of growth, and the sudden weight changes correspond to periods of growth plateau. The weights generally follow the carry trade strategy, which is well documented in the literature; see Della Corte and Tsiakas (2012) and references therein.

US Inflation Forecasts
The final study considers an example of McAlinn and West (2019), which involves forecasting the quarterly US inflation rate between 1961/Q1 and 2014/Q4. Here, the inflation rate corresponds to the annual percentage change in a chain-weighted GDP price index. There are four competing models: M 1 includes the one-period lagged inflation rate; M 2 includes the one-, two- and three-period lagged inflation, interest and unemployment rates; M 3 includes the one-, two- and three-period lagged inflation rate only; and M 4 includes the one-period lagged inflation, interest and unemployment rates. All four models provide Student-t distributed forecasts with around 20 degrees of freedom.
The distinguishing features of this example are the small number of models and the existence of time periods when none of the models, or model combinations lying on the simplex, provide an accurate mean forecast. In this example we see the limitation of the LDF and other simplex-based methodologies, which are unable to correct for forecasting biases if bias-corrected models are not explicitly available in the pool.
The BPS method (MLS = 0.06) dominates all other methodologies since it allows for model combinations not adhering to the simplex. In fact, there were six dates in the evaluation period where the mean of the BPS synthesised model was greater than the maximum of the underlying models. The ability to go beyond the simplex proved to be one of the key factors in its superior performance.
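The simplex restriction mentioned above can be illustrated directly: any convex combination of model means is bounded by the smallest and largest model mean, so a simplex-based combination can never produce a mean forecast outside the range spanned by the pool. A minimal numerical check (the mean forecast values are hypothetical, not taken from the study):

```python
import numpy as np

rng = np.random.default_rng(2)
means = np.array([1.8, 2.0, 2.1, 1.9])      # hypothetical model mean forecasts
w = rng.dirichlet(np.ones(4), size=1000)    # 1000 random weight vectors on the simplex
combo = w @ means                           # combined mean under each weighting
# every simplex combination stays inside [min(means), max(means)]
```

This is why a method like BPS, which permits coefficients outside the simplex, can place its synthesised mean above the maximum of the underlying model means, while simplex-based methods such as LDF or DMA cannot.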
The next most effective method was the N-model average of Diebold et al. (2022), which for N = 2 and N = 3 models had an MLS equal to -0.01 and provided better performance than the best single model (M 2, MLS = -0.02). For N = 2, out of the 100 evaluation points, the algorithm selected the pair (M 0, M 1) 35 times, the pair (M 2, M 3) 49 times and the pair (M 1, M 3) 16 times. On the other hand, both the 2-level LDF model averaging and DMA methods did not work very well in this example, although they improved upon picking just a single model. The poor performance of 2-level LDF and DMA can mostly be attributed to the highly dynamic nature of these methods, which sometimes attached too much weight to a single model that would then score poorly.

Discussion
This paper contributes to the model averaging and selection literature by introducing a Loss Discounting Framework which encompasses Dynamic Model Averaging, first presented by Raftery et al. (2010), generalises Dynamic Model Learning (Beckmann et al., 2020), and introduces additional model averaging and selection specifications. The framework allows for general dynamics for the model weights and works well with focused scores for goal-oriented decision making. The methodology offers extra flexibility which can lead to better forecast scores and yields results which are less sensitive to the choice of hyperparameters. This is particularly important in a more realistic online forecasting setting, where selection of the globally optimal hyperparameters is often unattainable. It also empowers users to choose the model specification, in terms of the number of discounting layers, which is suitable for the problem at hand.
We show that our proposed methodology performs well in both the simulation study and the empirical examples based on the exchange rate forecasts, where we show the superiority of our approach for both model averaging and model selection; for the latter we also demonstrate how the differences in the scores translate to noteworthy economic gains. We find that the LDF can be a good choice when: the number of forecasters is fairly large and sophisticated methods become burdensome; we want to have only a small number of hyperparameters to calibrate; we suspect that we are in the M-complete/open setting and different models might be optimal at different times, but there is no consistent bias to be eliminated across all models; or we believe that scoring forecasters on a joint predictive density or joint utility basis is reasonable.
The LDF is by no means a panacea for model synthesis, and the performance of the different model synthesis methods depends on the problem (as seen in the empirical studies). However, LDF is often able to achieve competitive performance with a low computational overhead, using flexible dynamics and general model scores in a framework that is easy to implement and compute.
There are multiple open avenues to explore. Many current forecast combination methods described in the literature assume that the pool of forecasters does not change over time (see e.g. Raftery et al., 2010; Diebold et al., 2022; McAlinn and West, 2019). In some situations this is a substantial limitation, for example, if the forecasts are provided by a pool of experts.
Let us first consider the situation of a new agent being added to the existing pool of forecasters. The existing forecasters already have a track record of forecasts and corresponding scores. A new forecaster could be included with an initial weight; this could be fairly easily achieved in the LDF by considering a few initial scores. It is not clear what this weight should be, however, especially in more formal methodologies which relax the simplex restriction, such as McAlinn and West (2019). Similarly, forecasters may drop out completely, or for some quarters, before providing new forecasts. Again, in general, it is hard to know how to weight these forecasters. The LDF provides a rationale: we should use an estimate of that forecaster's score when a forecast is made. This is a time series prediction problem and can be approached using standard methods.
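One simple instance of such a score prediction, sketched here under our own assumptions (the function name and decay value are illustrative, not part of the paper's specification): fill a returning forecaster's missing scores with an exponentially weighted moving average of their own past observed scores.

```python
import numpy as np

def impute_scores(scores, decay=0.9):
    """Fill missing (NaN) scores with an exponentially weighted moving
    average (EWMA) of that forecaster's own past observed scores."""
    s = np.asarray(scores, dtype=float).copy()
    est = np.nan                               # current EWMA estimate of the score
    for t in range(len(s)):
        if np.isnan(s[t]):
            s[t] = est                         # substitute the score forecast
        est = s[t] if np.isnan(est) else decay * est + (1 - decay) * s[t]
    return s

track = [-1.1, -0.9, np.nan, np.nan, -1.0]     # forecaster absent for two periods
filled = impute_scores(track)
```

Any other time series forecast of the scores could be substituted for the EWMA; the point is only that the LDF reduces the dropout problem to a standard prediction task on the score series itself.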
We noted in the empirical sections that the best performing discount factor in the second layer is larger than the time-average discount factor in the first layer. We showed that, as one keeps adding more and more layers of meta-models, the weights converge to an equilibrium; that is, adding more layers does not change the scores any more, and any choice of the discount factor in the final layer leads to the same score and the same discount factors in all other layers.
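The layering mechanism can be sketched in self-contained form. This is a rough illustration under our own simplifying assumptions, not the paper's implementation: model weights are proportional to the exponential of α-discounted cumulative log scores, each α in a grid defines a meta-model whose realised log score is the log of the resulting linear pool, and stacking repeats this construction on the meta-model scores. All names and the synthetic score matrix are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
T, J = 200, 4
# synthetic per-model log predictive scores with a regime change halfway
ls0 = rng.normal(-1.2, 0.3, (T, J))
ls0[:100, 0] += 0.5      # model 0 scores best early
ls0[100:, 2] += 0.5      # model 2 scores best late

def discounted_weights(ls, alpha):
    """w_t proportional to exp of the alpha-discounted cumulative log score."""
    T, K = ls.shape
    w = np.full((T, K), 1.0 / K)
    cum = np.zeros(K)
    for t in range(1, T):
        cum = alpha * cum + ls[t - 1]          # discount, then add latest score
        e = np.exp(cum - cum.max())            # stabilised softmax
        w[t] = e / e.sum()
    return w

def add_layer(ls, alphas):
    """Each alpha defines a meta-model; its realised log score at t is the
    log of the linear pool of the lower-level predictive density values."""
    cols = []
    for a in alphas:
        w = discounted_weights(ls, a)
        cols.append(np.log((w * np.exp(ls)).sum(axis=1)))
    return np.column_stack(cols)

alphas = np.linspace(0.5, 1.0, 6)
ls, mls = ls0, []
for _ in range(5):                             # keep stacking meta-model layers
    ls = add_layer(ls, alphas)
    w_top = discounted_weights(ls, 0.9)        # fixed top-level discount factor
    mls.append(np.log((w_top * np.exp(ls)).sum(axis=1)).mean())
```

In this toy setting the successive entries of `mls` stabilise as layers are added, mirroring the equilibrium behaviour described above.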
It would also be of interest to consider the case when the models to be combined themselves allow for sharp breaks (Gerlach et al., 2000; Huber et al., 2019). Intuitively, more flexible models will lead to weights with fewer fluctuations if the models are able to represent the true DGP (for example, if one model is correctly specified then LDF should be able to roughly replicate BMA). We believe that the use of out-of-sample log predictive scores to calculate weights in LDF will avoid the problems of overfitting found with in-sample estimation methods. Therefore, we believe that LDF can take advantage of more flexible models while providing robustness against the use of overly simple models.
As mentioned before, in most examples we use the joint predictive log-likelihood as a statistical measure of out-of-sample forecasting performance. It gives an indication of how likely the realisation of the modelled variable was, conditional on the model parameters. The logarithmic scoring rule is strictly proper, but it severely penalises low-probability events and hence is sensitive to tail or extreme cases; see Gneiting and Raftery (2007). A different proper scoring rule could be used when needed, or, if a decision is to be made based on the outcomes of model averaging/selection, a focused score (or utility) aligned with the final goal can be used, as successfully demonstrated in one of our examples.
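The tail sensitivity of the logarithmic score is easy to quantify for a Gaussian forecast density, where the score falls quadratically in the standardised error. A small worked example (values chosen for illustration only):

```python
import math

def gaussian_logscore(y, mu=0.0, sigma=1.0):
    """Log predictive density of N(mu, sigma^2) evaluated at realisation y."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (y - mu) ** 2 / (2 * sigma ** 2)

centre = gaussian_logscore(0.5)   # realisation near the forecast centre
tail = gaussian_logscore(5.0)     # a 5-sigma tail event
```

The 5-sigma realisation is penalised by roughly 12.4 additional units of log score, so a single extreme observation can dominate a long run of otherwise good forecasts, which is precisely why a focused score may be preferable for some decision problems.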
Furthermore, since the scoring function is often based on the joint forecast probability density function, our methodology is not best suited to borrow strength from forecasters who might be good at forecasting one or more variables but not the others. This is partially because our methodology does not consider any dependency structure between the expert models; the weighting is solely performance based. An extension taking the agent inter-dependencies into consideration would be of considerable interest.
More broadly, the exponential discounting recipe could be generalised and expanded by using any forecast of the scores, which could involve more parameters.

Figure 1 :
Figure 1: Simulation - True data generating process mean and mean predicted level according to LDF 2 s,s.

Figure 2 :
Figure 3 :
Figure 2: Simulation - a) The MLS versus values of α for LDF and α for DMA on the x-axis. The error bars correspond to the standard deviation of the MLS over 10 runs. b) LPDR of the competing models with a calibration period of 250.

Figure 4: FX - MLS versus values of α for LDF and α for DMA on the x-axis for the small and large model pools. The upper plots show model averaging whereas the lower plots show model selection.

Figure 5 :
Figure 5: FX - LPDR for model averaging in the small model pool. LDF 2 s,s provides the best performance, robust to increases in FX volatility.

Figure 6 :
Figure 6: FX - mean score values versus achieved Sharpe ratios. In the left-hand plot the log scores were used; in the right-hand plot the focused scores were used.

Figure 8 :
Figure 8: FX - Portfolio composition through time for LDF 2 a,a with α = 0.7. We can clearly see that there are long stretches of stable composition which correspond to the growth periods, and the periods of sudden portfolio changes correspond to times of money growth plateau.

Figure 9 :
Figure 9: US inflation - The MLS versus values of α for LDF and α for DML on the x-axis.