Using stacking to average Bayesian predictive distributions

The widely recommended procedure of Bayesian model averaging is flawed in the M-open setting, in which the true data-generating process is not one of the candidate models being fit. We take the idea of stacking from the point estimation literature and generalize it to the combination of predictive distributions, extending the utility function to any proper scoring rule, using Pareto smoothed importance sampling to efficiently compute the required leave-one-out posterior distributions, and using regularization to obtain more stability. We compare stacking of predictive distributions to several alternatives: stacking of means, Bayesian model averaging (BMA), Pseudo-BMA using AIC-type weighting, and Pseudo-BMA+, a variant of Pseudo-BMA that is stabilized using the Bayesian bootstrap. Based on simulations and real-data applications, we recommend stacking of predictive distributions, with Pseudo-BMA+ as an approximate alternative when computation cost is an issue.


Introduction
through a Gaussian mixture model, a series of linear regression simulations, two real-data examples, and an application to variational inference. We conclude in Section 5 with general recommendations. We provide the R and Stan code in the Supplementary material.

Existing approaches
In Bayesian model comparison, the relationship between the true data generator and the model list M = (M_1, ..., M_K) can be classified into three categories: M-closed, M-complete, and M-open. We adopt the following definitions from Bernardo and Smith (1994):
• M-closed means the true data-generating model is one of M_k ∈ M, although it is unknown to researchers.
• M-complete refers to the situation where the true model exists but is not in the model list M. We still wish to use the models in M because of the tractability of computations or the communication of results, compared with the actual belief model. Thus, one simply finds the member of M that maximizes the expected utility (with respect to the true model).
• M-open refers to the situation in which we know the true model M_t is not in M, but we cannot specify the explicit form p(ỹ|y) because it is too difficult conceptually or computationally, we lack the time to do so, or we do not have the expertise.
Bayesian model averaging If all candidate models are generative, the Bayesian solution is to average the separate models, weighting each by its marginal posterior probability. This is called Bayesian model averaging (BMA) and is optimal if the method is evaluated based on its frequency properties over the joint prior distribution of the models and their internal parameters (Madigan et al., 1996; Hoeting et al., 1999). If y = (y_1, ..., y_n) represents the observed data, then the posterior distribution for any quantity of interest Δ is

p(Δ|y) = Σ_{k=1}^K p(Δ|y, M_k) p(M_k|y),

where the weights p(M_k|y) ∝ p(y|M_k) p(M_k) are driven by the marginal likelihoods p(y|M_k). A well-known concern is that the marginal likelihood, and hence the BMA weights, can be highly sensitive to aspects of the prior that barely affect inference within each model. Suppose, for example, that a parameter has been assigned a normal prior distribution with center 0 and scale 10, and that its estimate is likely to be in the range (−1, 1). The chosen prior is then essentially flat, as would also be the case if the scale were increased to 100 or 1000. But such a change would divide the posterior probability of the model by roughly a factor of 10 or 100.
Stacking Stacking (Wolpert, 1992; Breiman, 1996; LeBlanc and Tibshirani, 1996) is a direct approach for averaging point estimates from multiple models. In supervised learning, where the data are ((x_i, y_i), i = 1, ..., n) and each model M_k has a parametric form ŷ_k = f_k(x|θ_k), stacking is done in two steps (Ting and Witten, 1999). First, each model is fitted separately and the leave-one-out (LOO) predictor f̂_k^(-i)(x_i) is obtained for each model k and each data point i. In the second step, a weight for each model is obtained by minimizing the LOO mean squared error,

ŵ = arg min_w Σ_{i=1}^n (y_i − Σ_{k=1}^K w_k f̂_k^(-i)(x_i))².   (1)

Breiman (1996) claims that either a positivity constraint, w_k ≥ 0 for k = 1, ..., K, or a simplex constraint, w_k ≥ 0 with Σ_{k=1}^K w_k = 1, guarantees a solution. Better predictions may be attainable using regularization (Merz and Pazzani, 1999; Yang and Dunson, 2014). Finally, the point prediction for a new data point with feature vector x̃ is

ŷ = Σ_{k=1}^K ŵ_k f_k(x̃|θ̂_{k, y_{1:n}}).
It is not surprising that stacking typically outperforms BMA when the criterion is mean squared predictive error (Clarke, 2003), because BMA is not optimized for this task. Wong and Clarke (2004) emphasize that the BMA weights reflect the fit to the data rather than evaluating the prediction accuracy. On the other hand, stacking has not been widely used in Bayesian model combination because classical stacking works only with point estimates, not the entire posterior distribution (Hoeting et al., 1999). Clyde and Iversen (2013) give a Bayesian interpretation of stacking by considering model combination as a decision problem when the true model M_t is not in the model list. When the decision takes the form of a weighted combination of the models' point predictions, a(y, w) = Σ_{k=1}^K w_k ŷ_k, Le and Clarke (2017) prove that the stacking solution is asymptotically the Bayes solution: under some mild conditions on the distributions, the leave-one-out objective is asymptotically equivalent to the posterior expected loss E[l(ỹ, a(y, w))|y], where l is the squared loss, l(ỹ, a) = (ỹ − a)². They also prove that when the action is a predictive distribution, a(y_{-i}, w) = Σ_{k=1}^K w_k p(y_i|y_{-i}, M_k), the asymptotic relation still holds for the negative logarithmic scoring rule.
However, most of the early literature limited stacking to averaging point predictions rather than predictive distributions. In this paper, we extend stacking from minimizing squared error to maximizing proper scoring rules, thereby making stacking applicable to combining a set of Bayesian posterior predictive distributions. We argue this is the appropriate version of Bayesian model averaging in the M-open situation.

Akaike weights and pseudo Bayesian model averaging
Leave-one-out cross-validation is related to various information criteria (see, e.g., Vehtari and Ojanen, 2012). In the case of maximum likelihood estimates, leave-one-out cross-validation is asymptotically equal to Akaike's information criterion (AIC; Stone, 1977). For a statistical model with k parameters and maximized likelihood L̂, AIC = −2 log L̂ + 2k. Akaike (1978) proposed to use exp(−AIC/2) for model weighting (see also Burnham and Anderson, 2002; Wagenmakers and Farrell, 2004). More recently, the Watanabe-Akaike information criterion (WAIC; Watanabe, 2010) and leave-one-out cross-validation estimates have also been used to compute model weights following the idea of AIC weights.
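As a concrete illustration (our own minimal sketch, not code from the paper), AIC-type weights normalize exp(−AIC_k/2) across the candidate models; subtracting the minimum AIC first avoids numerical underflow:

```python
import numpy as np

def akaike_weights(aic):
    """AIC-type model weights: proportional to exp(-AIC/2)."""
    aic = np.asarray(aic, dtype=float)
    # Subtracting the minimum AIC cancels in the normalization
    # but keeps the exponentials in a safe numerical range.
    rel = -0.5 * (aic - aic.min())
    w = np.exp(rel)
    return w / w.sum()

# Hypothetical AIC values for three candidate models.
weights = akaike_weights([100.0, 102.0, 110.0])
```

The same normalization reappears below for Pseudo-BMA weights, with −AIC/2 replaced by an estimate of the expected log predictive density.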
In a Bayesian setting, Geisser and Eddy (1979; see also Gelfand, 1996) proposed pseudo Bayes factors, in which marginal likelihoods p(y|M_k) are replaced by products of Bayesian leave-one-out cross-validation predictive densities, ∏_{i=1}^n p(y_i|y_{-i}, M_k). Following the naming of Geisser and Eddy, we call AIC-type weighting that uses Bayesian cross-validation predictive densities pseudo Bayesian model averaging (Pseudo-BMA).
Exact leave-one-out cross-validation can be computationally costly. In the econometric literature, for example, Geweke and Amisano (2011, 2012) suggest averaging prediction models by maximizing the predictive log score, but they consider only time series, due to the computational challenges of exact LOO for general data structures. In the present paper we demonstrate that Pareto smoothed importance sampling leave-one-out cross-validation (PSIS-LOO; Vehtari et al., 2017a,b) is a practical and efficient way to calculate the needed leave-one-out predictive densities p(y_i|y_{-i}, M_k).
In this paper we show that the uncertainty in the future data distribution should be taken into account when computing Pseudo-BMA weights. We propose an AIC-type weighting using the Bayesian bootstrap and the expected log predictive density (elpd), which we call Pseudo-BMA+ weighting. We show that although Pseudo-BMA+ weighting gives better results than regular BMA or Pseudo-BMA weighting (in M-open settings), it is still inferior to log-score stacking. Due to its simplicity, we use Pseudo-BMA+ weighting as an initial guess for the optimization procedure in log-score stacking.
Other model weighting approaches Besides BMA, stacking, and AIC-type weighting, some other methods have been introduced to combine Bayesian models. One approach places a nonparametric prior on the decision problem stated above, essentially fitting a mixture model with a Dirichlet process; this yields a posterior expected utility U_n(w, θ), and the optimal weights solve ŵ = arg max_{w, θ} U_n(w, θ). Li and Dunson (2016) propose model averaging using weights based on divergences from a reference model in M-complete settings: if the true data-generating density f* is known, an AIC-type weight can be defined from the estimated Kullback-Leibler divergence of each candidate model from f*.

Theory and methods
We label the classical stacking procedure (1) as stacking of means because it combines models by minimizing the mean squared error of the point estimate. More generally, we can use any proper scoring rule (or, equivalently, the underlying divergence) to compare distributions. Once a scoring rule is chosen, stacking can be extended to combine entire predictive distributions.

Stacking using proper scoring rules
Adapting the notation of Gneiting and Raftery (2007), we let Y be the random variable on the sample space (Ω, A) that can take values on (−∞, ∞), and let P be a convex class of probability measures on Ω. Any member of P is called a probabilistic forecast. A scoring rule is a function S : P × Ω → R̄ = [−∞, ∞] such that S(P, ·) is P-quasi-integrable for all P ∈ P. In the continuous case, every distribution P ∈ P is identified with its density function p.
The ultimate goal of stacking a set of K predictive distributions built from the models M = (M_1, ..., M_K) is to find the distribution in the convex hull C = {Σ_{k=1}^K w_k p(·|M_k) : Σ_k w_k = 1, w_k ≥ 0} that is optimal according to some given criterion. In this paper, we propose the use of proper scoring rules to define the optimality criterion: the stacking weights are chosen to maximize

S(Σ_{k=1}^K w_k p(·|y, M_k), p_t(·|y)),   (3)

subject to w_k ≥ 0 and Σ_{k=1}^K w_k = 1, where S(P, Q) denotes the expected score E_{ỹ∼Q} S(P, ỹ), p(ỹ|y, M_k) is the predictive density of new data ỹ in model M_k that has been trained on observed data y, and p_t(ỹ|y) refers to the true distribution.
An empirical approximation to (3) can be constructed by replacing the full predictive distribution p(ỹ|y, M_k) evaluated at a new data point ỹ with the corresponding LOO predictive distribution,

p̂_{k,-i}(y_i) = ∫ p(y_i|θ_k, M_k) p(θ_k|y_{-i}, M_k) dθ_k.

The corresponding stacking weights are the solution to the optimization problem

max_w (1/n) Σ_{i=1}^n S(Σ_{k=1}^K w_k p̂_{k,-i}, y_i), subject to w_k ≥ 0, Σ_{k=1}^K w_k = 1.   (4)

The stacked estimate of the predictive density is

p̂(ỹ|y) = Σ_{k=1}^K ŵ_k p(ỹ|y, M_k).   (5)

When using the logarithmic score (corresponding to Kullback-Leibler divergence), we call this stacking of predictive distributions:

max_w (1/n) Σ_{i=1}^n log Σ_{k=1}^K w_k p(y_i|y_{-i}, M_k).

The choice of scoring rule can depend on the underlying application. Stacking of means (1) corresponds to the energy score with β = 2. The reasons why we prefer stacking of predictive distributions (corresponding to the logarithmic score) to stacking of means are: (i) the energy score with β = 2 is not a strictly proper scoring rule and can give rise to identification problems, and (ii) without further smoothness assumptions, every proper local scoring rule is equivalent to the logarithmic score.
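As an illustration of the log-score objective, the following minimal sketch (our own, not the paper's R/Stan code) optimizes the stacking weights over the simplex via a softmax parameterization, given a matrix of LOO log predictive densities. The toy setup anticipates the Gaussian mixture example of Section 4: data from N(3.4, 1) and candidate models N(k, 1), for which the LOO predictive density has closed form because the models have no free parameters.

```python
import numpy as np
from scipy.optimize import minimize

def stacking_weights(lpd):
    """Stacking of predictive distributions: maximize the average log score
    of the weighted combination of LOO predictive densities.
    lpd: (n, K) matrix with entries log p(y_i | y_{-i}, M_k)."""
    n, K = lpd.shape

    def neg_log_score(z):
        w = np.exp(z - z.max())      # softmax keeps w on the simplex
        w /= w.sum()
        m = lpd.max(axis=1, keepdims=True)
        # log of the mixture density per data point, via log-sum-exp
        mix = m[:, 0] + np.log(np.exp(lpd - m) @ w)
        return -mix.mean()

    res = minimize(neg_log_score, np.zeros(K), method="L-BFGS-B")
    w = np.exp(res.x - res.x.max())
    return w / w.sum()

rng = np.random.default_rng(0)
y = rng.normal(3.4, 1.0, 200)
mus = np.arange(1, 9)                # candidate models N(k, 1), k = 1..8
lpd = -0.5 * (y[:, None] - mus[None, :]) ** 2 - 0.5 * np.log(2 * np.pi)
w_stack = stacking_weights(lpd)
```

The solution concentrates on the models nearest the true mean (here k = 3 and k = 4), consistent with the analysis in Section 4.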

Asymptotic behavior of stacking
The stacking estimate (3) finds the optimal predictive distribution within the convex hull C, that is, the distribution closest to the data-generating process with respect to the chosen scoring rule. This is different from Bayesian model averaging, which asymptotically will select a single model with probability 1: the one that is closest in KL divergence to the true data-generating process.
Solving for the stacking weights in (4) is an M-estimation problem. Under some mild conditions (Le and Clarke, 2017), for either the logarithmic scoring rule or the energy score (negative squared error) and a given set of weights w_1, ..., w_K, as the sample size n → ∞,

(1/n) Σ_{i=1}^n S(Σ_{k=1}^K w_k p̂_{k,-i}, y_i) − E_ỹ S(Σ_{k=1}^K w_k p(ỹ|y, M_k), ỹ) → 0.

Thus the leave-one-out score is a consistent estimator of the expected score, and in this sense stacking gives optimal combination weights asymptotically.
In the terminology of Vehtari and Ojanen (2012, Section 3.3), the proposed stacking of predictive distributions is the M*-optimal projection of the information in the actual belief model M* to ŵ, where explicit specification of M* is avoided by re-using the data as a proxy for the predictive distribution of the actual belief model, and the weights w_k are the free parameters.

Pareto smoothed importance sampling
One challenge in calculating the stacking weights proposed in (4) is that we need the leave-one-out (LOO) predictive density,

p(y_i|y_{-i}, M_k) = ∫ p(y_i|θ_k, M_k) p(θ_k|y_{-i}, M_k) dθ_k.

Exact LOO requires refitting each model n times. To avoid this onerous computation, we use the following approximation. For the k-th model, we fit to all the data, obtaining S simulation draws θ_k^s (s = 1, ..., S) from the full posterior p(θ_k|y, M_k), and calculate the importance ratio

r_{i,k}^s = 1 / p(y_i|θ_k^s, M_k) ∝ p(θ_k^s|y_{-i}, M_k) / p(θ_k^s|y, M_k).   (6)

The distribution of the ratios r_{i,k}^s can have a long right tail, which makes plain importance sampling unstable. This problem can be resolved using Pareto smoothed importance sampling (PSIS; Vehtari et al., 2017a). For each model k and data point y_i, we fit a generalized Pareto distribution to a set of the largest importance ratios r_{i,k}^s and replace those ratios by the expected values of the order statistics of the fitted generalized Pareto distribution. These values are the smoothed importance weights w_{i,k}^s, which replace r_{i,k}^s. For details of PSIS, see Vehtari et al. (2017a). PSIS-LOO (Vehtari et al., 2017b) then computes the LOO predictive density as

p(y_i|y_{-i}, M_k) ≈ Σ_{s=1}^S w_{i,k}^s p(y_i|θ_k^s, M_k) / Σ_{s=1}^S w_{i,k}^s.

The reliability of the PSIS approximation can be assessed using the estimated shape parameter k̂ of the generalized Pareto distribution. For the left-out data points where k̂ > 0.7, Vehtari et al. (2017b) suggest replacing the PSIS approximation by exact LOO or k-fold cross-validation.
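The tail-smoothing step can be sketched as follows. This is a simplified illustration under our own assumptions, not the reference implementation of Vehtari et al. (2017a); in particular, the fixed 20% tail fraction and the handling of the ordering are simplifications.

```python
import numpy as np
from scipy.stats import genpareto

def psis_smooth(ratios, tail_frac=0.2):
    """Sketch of Pareto smoothed importance sampling: fit a generalized
    Pareto distribution (GPD) to the largest importance ratios and replace
    them by expected order statistics of the fitted tail.
    Returns the smoothed ratios in sorted order (a full implementation
    would restore the original ordering) and the estimated shape k_hat."""
    r = np.sort(np.asarray(ratios, dtype=float))
    S = len(r)
    M = int(np.ceil(tail_frac * S))      # number of tail draws to smooth
    cutoff = r[S - M - 1]                # tail threshold
    tail = r[S - M:]
    # Fit the GPD to the exceedances over the cutoff (location fixed at 0).
    k_hat, _, sigma_hat = genpareto.fit(tail - cutoff, floc=0)
    # Expected order statistics approximated via the GPD quantile function.
    q = (np.arange(1, M + 1) - 0.5) / M
    smoothed = cutoff + genpareto.ppf(q, k_hat, loc=0, scale=sigma_hat)
    # Truncate at the raw maximum, as in PSIS.
    r[S - M:] = np.minimum(smoothed, r.max())
    return r, k_hat

rng = np.random.default_rng(0)
raw_ratios = rng.lognormal(0.0, 1.0, 2000)   # stand-in importance ratios
smoothed, k_hat = psis_smooth(raw_ratios)
```

Large values of k_hat (above roughly 0.7) would signal that the smoothed estimate should not be trusted, matching the diagnostic described above.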
One potential drawback of LOO is its large variance when the sample size is small. We see in simulations that when the ratio of the sample size to the effective number of parameters is small, the weighting can be unstable. How to adjust for this small-sample behavior is left for future research.

Pseudo-BMA
In this paper, we also consider an AIC-type weighting based on leave-one-out cross-validation. As mentioned in Section 2, these weights estimate the same kind of quantity as the approach of Li and Dunson (2016), which uses divergences from a reference model.
To maintain comparability with the given dataset and to make differences in the effective number of parameters easier to interpret, we define the expected log pointwise predictive density (elpd) for a new dataset ỹ as a measure of the predictive accuracy of a given model for the n data points taken one at a time (Gelman et al., 2014; Vehtari et al., 2017b):

elpd = Σ_{i=1}^n ∫ p_t(ỹ_i) log p(ỹ_i|y) dỹ_i,

where p_t(ỹ_i) denotes the distribution of the true data-generating process for ỹ_i.
Given observed data y and model k, we use LOO to estimate the elpd as

elpd^k_loo = Σ_{i=1}^n log p̂(y_i|y_{-i}, M_k),   (7)

and define the Pseudo-BMA weight for model k as

w_k = exp(elpd^k_loo) / Σ_{k'=1}^K exp(elpd^{k'}_loo).

However, this estimate does not take into account the uncertainty resulting from having a finite number of proxy samples from the future data distribution. Taking the uncertainty into account would regularize the weights, making them go further away from 0 and 1.
The estimate elpd^k_loo is defined as a sum of n independent components, so it is trivial to compute its standard error from the standard deviation of the n pointwise values (Vehtari and Lampinen, 2002). As in (7), define elpd^k_{loo,i} = log p̂(y_i|y_{-i}, M_k); then we can calculate

se(elpd^k_loo) = √( n V_{i=1}^n (elpd^k_{loo,i}) ),

where V denotes the sample variance of the n pointwise values. A simple modification of the weights is to use the log-normal approximation:

w_k = exp(elpd^k_loo − se(elpd^k_loo)/2) / Σ_{k'=1}^K exp(elpd^{k'}_loo − se(elpd^{k'}_loo)/2).
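A minimal sketch of these weights (our own illustration; the demo matrix of pointwise LOO log densities is synthetic):

```python
import numpy as np

def pseudo_bma_weights(lpd_pointwise, regularize=True):
    """Pseudo-BMA weights from pointwise LOO log predictive densities.
    lpd_pointwise: (n, K) matrix with entries log p(y_i | y_{-i}, M_k).
    With regularize=True, applies the log-normal approximation,
    exponentiating elpd_k - se(elpd_k)/2 instead of elpd_k."""
    n, K = lpd_pointwise.shape
    elpd = lpd_pointwise.sum(axis=0)
    if regularize:
        se = np.sqrt(n * lpd_pointwise.var(axis=0, ddof=1))
        elpd = elpd - 0.5 * se
    z = elpd - elpd.max()        # normalize on the log scale for stability
    w = np.exp(z)
    return w / w.sum()

# Synthetic pointwise LOO log densities for two hypothetical models;
# model 1 is better by about 0.4 nats per data point.
rng = np.random.default_rng(0)
lpd_demo = np.column_stack([rng.normal(-1.0, 0.3, 100),
                            rng.normal(-1.4, 0.3, 100)])
w_pbma = pseudo_bma_weights(lpd_demo)
```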
Finally, the Bayesian bootstrap (BB) can be used to quantify the uncertainty of the LOO estimate (Vehtari and Lampinen, 2002). The Bayesian bootstrap (Rubin, 1981) gives a simple non-parametric approximation to a distribution. Given samples z_1, ..., z_n of a random variable Z, it is assumed that the posterior probabilities of the observed z_i follow a Dirichlet(1, ..., 1) distribution and that values of Z not observed have zero posterior probability. Thus, each BB replication generates a set of posterior probabilities (α_1, ..., α_n) ~ Dirichlet(1, ..., 1) for the observed z_{1:n}. This yields one BB replication of any statistic φ(Z) of interest, for example the mean, φ̂(Z|α) = Σ_{i=1}^n α_i z_i. The distribution of φ̂(Z|α) over replications (i.e., generated by repeated sampling of α) provides an estimate of the distribution of φ(Z).
As the distribution of elpd^k_{loo,i} is often highly skewed, the BB is likely to work better than the Gaussian approximation. In our model weighting, we define z^k_i = elpd^k_{loo,i}, sample vectors (α_{1,b}, ..., α_{n,b}), b = 1, ..., B, from the Dirichlet(1, ..., 1) distribution, and compute the weighted means

z̄^k_b = Σ_{i=1}^n α_{i,b} z^k_i.

A Bayesian bootstrap sample of w_k with size B is then

w_{k,b} = exp(n z̄^k_b) / Σ_{k'=1}^K exp(n z̄^{k'}_b), b = 1, ..., B,

and the final adjusted weight of model k is

w_k = (1/B) Σ_{b=1}^B w_{k,b},

which we call the Pseudo-BMA+ weight.
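The Pseudo-BMA+ computation can be sketched as follows (our own minimal illustration, again using a synthetic matrix of pointwise LOO log densities; B and the seeds are arbitrary choices):

```python
import numpy as np

def pseudo_bma_plus(lpd_pointwise, B=1000, seed=0):
    """Pseudo-BMA+ weights: Bayesian bootstrap over the pointwise LOO log
    densities, AIC-type weights per replication, averaged over replications.
    lpd_pointwise: (n, K) matrix with entries log p(y_i | y_{-i}, M_k)."""
    rng = np.random.default_rng(seed)
    n, K = lpd_pointwise.shape
    # One Dirichlet(1, ..., 1) draw over the n data points per replication.
    alpha = rng.dirichlet(np.ones(n), size=B)      # (B, n)
    z = n * (alpha @ lpd_pointwise)                # (B, K): n * weighted mean
    z -= z.max(axis=1, keepdims=True)              # stabilize exponentiation
    w = np.exp(z)
    w /= w.sum(axis=1, keepdims=True)
    return w.mean(axis=0)

rng = np.random.default_rng(1)
lpd_demo = np.column_stack([rng.normal(-1.0, 0.5, 50),
                            rng.normal(-1.6, 0.5, 50)])
w_bbplus = pseudo_bma_plus(lpd_demo)
```

Averaging the per-replication weights pulls the final weights away from the extremes of 0 and 1, which is the regularizing effect described above.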

Gaussian mixture model
This simple example helps us understand how BMA and stacking behave differently. It also illustrates the importance of the choice of scoring rules when combining distributions. Suppose the observed data y = (y i , i = 1, . . . , n) come independently from a normal distribution N(3.4, 1), not known to the data analyst, and there are 8 candidate models, N(μ k , 1) with μ k = k for 1 ≤ k ≤ 8. This is an M-open problem in that none of the candidates is the true model, and we have set the parameters so that the models are somewhat separate but not completely distinct in their predictive distributions.
For BMA with a uniform prior Pr(M_k) = 1/8, k = 1, ..., 8, we can write the posterior model probabilities explicitly:

ŵ^BMA_k = p(M_k|y) = exp(−(1/2) Σ_{i=1}^n (y_i − k)²) / Σ_{k'=1}^8 exp(−(1/2) Σ_{i=1}^n (y_i − k')²),

from which we see that ŵ^BMA_3 → 1 and ŵ^BMA_k → 0 for k ≠ 3 in probability as the sample size n → ∞. This example is simple in that there is no parameter to estimate within each of the models, so that p(ỹ|y, M_k) = p(ỹ|M_k). Hence, in this case the weights from Pseudo-BMA and Pseudo-BMA+ are the same as the BMA weights.
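These closed-form BMA weights are easy to compute directly (a minimal sketch of the example above; the sample sizes and seed are arbitrary):

```python
import numpy as np

def bma_weights_gaussian(y, mus):
    """Closed-form BMA weights for candidate models N(mu_k, 1) with known
    variance and a uniform model prior: since the models have no free
    parameters, the marginal likelihood of model k is just its likelihood."""
    y = np.asarray(y, dtype=float)
    log_lik = np.array([-0.5 * np.sum((y - m) ** 2) for m in mus])
    z = log_lik - log_lik.max()          # stabilize the exponentials
    w = np.exp(z)
    return w / w.sum()

rng = np.random.default_rng(1)
mus = np.arange(1, 9)                    # candidate means 1, ..., 8
w_n10 = bma_weights_gaussian(rng.normal(3.4, 1.0, 10), mus)
w_n2000 = bma_weights_gaussian(rng.normal(3.4, 1.0, 2000), mus)
```

As n grows, the weight concentrates on the single closest model, here N(3, 1), illustrating the asymptotic model-selection behavior of BMA.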
For stacking of means, we need to solve

ŵ = arg min_w Σ_{i=1}^n (y_i − Σ_{k=1}^8 w_k k)², w_k ≥ 0, Σ_{k=1}^8 w_k = 1.

This is nonidentified, because the solution set contains any weight vector ŵ satisfying Σ_{k=1}^8 ŵ_k k = (1/n) Σ_{i=1}^n y_i. For point prediction, the stacked prediction is always Σ_{k=1}^8 ŵ_k k = (1/n) Σ_{i=1}^n y_i, but different solutions lead to different predictive distributions Σ_{k=1}^8 ŵ_k N(k, 1). To get one reasonable result, we transform the least squares optimization into the following normal model and assign a uniform prior to w on the simplex:

y_i ~ N(Σ_{k=1}^8 w_k k, 1).

Then we use the posterior means of w as model weights.
For stacking of predictive distributions, we need to solve

max_w (1/n) Σ_{i=1}^n log Σ_{k=1}^8 w_k N(y_i|k, 1), w_k ≥ 0, Σ_{k=1}^8 w_k = 1.

Figure 1: For the Gaussian mixture example, the predictive distribution p(ỹ|y) of BMA (green curve), stacking of means (blue), and stacking of predictive distributions (red). In each graph, the gray distribution represents the true model, N(3.4, 1). Stacking of means matches the first moment but can ignore the rest of the distribution. For this M-open problem, stacking of predictive distributions outperforms BMA as the sample size increases.
In fact, this example is a density estimation problem. Smyth and Wolpert (1998) first applied stacking to non-parametric density estimation, which they called stacked density estimation. It can be viewed as a special case of our stacking method.
We compare the posterior predictive distribution p̂(ỹ|y) = Σ_k ŵ_k p(ỹ|y, M_k) for these three methods of model averaging. Figure 1 shows the predictive distributions in one simulation as the sample size n varies from 3 to 200. Stacking of means (the middle row of graphs) gives an unappealing predictive distribution, even though its point estimate is reasonable. The broad and oddly spaced distribution here arises from the nonidentification of w, and it demonstrates the general point that stacking of means does not even try to match the shape of the predictive distribution. The top and bottom rows of graphs show how BMA picks the single model that is closest in KL divergence, while stacking picks a combination; the benefits of stacking become clear for large n.
In this trivial non-parametric case, stacking of predictive distributions is almost the same as fitting a mixture model, except for the absence of the prior. The true model N(3.4, 1) is actually a convolution of the single models rather than a mixture, hence no approach can recover the true model from the model list. From Figure 2 we can compare the mean squared error and the mean logarithmic score of these three combination methods; all average log scores and errors are calculated from 500 repeated simulations and 200 test data points generated from the true distribution.

Figure 2: (a) The left panel shows the logarithmic score (or, equivalently, the expected log predictive density) of the predictive distribution; stacking of predictive distributions performs best for moderate and large sample sizes. (b) The middle panel shows the mean squared error, treating the posterior mean of ỹ as a point estimate; stacking of predictive distributions gives almost the same optimal mean squared error as stacking of means, both of which perform better than BMA. (c) The right panel shows the expected log predictive density of stacking and BMA when more N(4, 1) models are added to the model list, with the sample size fixed at 15.

The left panel shows that stacking of predictive distributions always gives a larger score except for extremely small n. The middle panel shows the mean squared error obtained by treating the posterior mean of the predictive distribution as a point estimate, even though this is not our focus. In this case, it is not surprising that stacking of predictive distributions gives almost the same optimal mean squared error as stacking of means, both of which are better than BMA. Two distributions that are close in KL divergence are close in each moment, while the reverse does not necessarily hold. This illustrates the necessity of matching the distributions, rather than matching the moments.
Stacking depends only on the space spanned by all the candidate models, while BMA or Pseudo-BMA weighting may be misled by model expansion. If we add another N(4, 1) as a 9th model on the list above, stacking will not change at all in theory, even though the optimization problem becomes non-strictly-convex with infinitely many equivalent modes. For BMA, adding the duplicate is equivalent to putting double prior mass on the original 4th model, which doubles its final weight. The right panel of Figure 2 shows this phenomenon: we fix the sample size n at 15 and add more and more N(4, 1) models. As a result, BMA (or Pseudo-BMA weighting) puts ever larger weight on N(4, 1) and behaves worse, while stacking is essentially unchanged. This illustrates another benefit of stacking compared to BMA or Pseudo-BMA weights: if the performance of a combination method decays as the list of candidate models is expanded, this may indicate disastrous performance when there are many similar weak models on the candidate list. We are not saying BMA can never work in this case; indeed, other methods have been proposed to help BMA overcome such drawbacks. For example, George (2010) introduces dilution priors to compensate for model space redundancy in linear models, putting smaller weights on models that are close to each other, and Fokoue and Clarke (2011) introduce prequential model list selection to obtain an optimal model space. But we propose stacking as a more straightforward solution.

Linear subset regressions
The previous section demonstrates a simple example of combining several different nonparametric models. Now we turn to the parametric case. This example comes from Breiman (1996), who compares stacking to model selection. Suppose the true model is

Y = β_1 X_1 + ... + β_J X_J + ε, ε ~ N(0, 1).

All the covariates X_j are independently drawn from N(5, 1), and the number of predictors is J = 15. The coefficients β_j are generated from a fixed pattern (following Breiman, 1996) whose spread is governed by a constant h, with an overall scale γ determined by fixing the signal-to-noise ratio. The value h determines the number of nonzero coefficients in the true model: for h = 1, there are 3 "strong" coefficients; for h = 5, there are 15 "weak" coefficients. In the following simulation, we fix h = 5. We consider the following two cases:

1. M-open:
Each subset contains only a single variable; hence, the k-th model is a univariate linear regression on the k-th variable X_k, and we have K = J = 15 different models in total. One advantage of stacking and Pseudo-BMA weighting is that they are not sensitive to the prior, so even a flat prior will work, whereas BMA can be sensitive to the prior. For each single model M_k: Y ~ N(β_k X_k, σ²), we set the priors β_k ~ N(0, 10), σ ~ Gamma(0.1, 0.1).
2. M-closed: Let model k be the linear regression with the subset (X_1, ..., X_k), so the largest model contains all the variables and coincides with the data-generating model.

In both cases, we have seven methods for combining predictive densities: (1) stacking of predictive distributions, (2) stacking of means, (3) Pseudo-BMA, (4) Pseudo-BMA+, (5) best model selection by mean LOO value, (6) best model selection by marginal likelihood, and (7) BMA. We generate a test dataset (x̃_i, ỹ_i), i = 1, ..., 200, from the underlying true distribution to calculate the out-of-sample log scores for the combined distribution under each method, and we repeat the simulation 100 times to compute the expected predictive accuracy of each method.

Figure 3: Mean log predictive densities of 7 combination methods in the linear regression example: the k-th model is a univariate regression with the k-th variable (1 ≤ k ≤ 15). We evaluate the log predictive densities using 100 repeated experiments and 200 test data points.

Figure 3 shows the expected out-of-sample log predictive densities for the seven methods, for a set of experiments with the sample size n ranging from 5 to 200. Stacking of predictive distributions outperforms all the other methods, even for small n, and is asymptotically better than any other combination method. Pseudo-BMA+ weighting dominates naive Pseudo-BMA weighting. Finally, BMA performs similarly to Pseudo-BMA weighting, always better than any kind of model selection, but that advantage vanishes in the limit since BMA eventually picks a single model. In this M-open setting, model selection can never be optimal.
The results change when we move to the second case, in which the k-th model contains variables X_1, ..., X_k, so that we are comparing models of differing dimensionality. The problem is M-closed because the largest subset contains all the variables, and we have simulated data from this model. Figure 4 shows the mean log predictive densities of the seven combination methods in this case. For a large sample size n, almost all methods recover the true model (putting weight 1 on the full model), except BMA and model selection based on marginal likelihood. The poor performance of BMA comes from the parameter priors: recall that the optimality of BMA arises when averaging over the priors and not necessarily conditional on any particular chosen set of parameter values. There is no general rule to obtain a "correct" prior that accounts for model complexity in an arbitrary model space. Model selection by LOO can recover the true model, while selection by marginal likelihood cannot, due to the same prior problems. Once again, BMA eventually becomes the same as model selection by marginal likelihood, which is much worse than the other methods asymptotically.

Figure 4: Mean log predictive densities of 7 combination methods in the linear regression example: the k-th model is the regression with the first k variables (1 ≤ k ≤ 15). We evaluate the log predictive densities using 100 repeated experiments and 200 test data points.
In this example, stacking is unstable for extremely small n. In fact, our computations for stacking of predictive distributions and Pseudo-BMA depend on the PSIS approximation to log p(y_i|y_{-i}). If this approximation is crude, then the second-step optimization cannot be accurate. The parameter k̂ of the generalized Pareto distribution can be used to diagnose the accuracy of the PSIS approximation. When k̂ > 0.7 for a data point, we cannot trust the PSIS-LOO estimate, and so we re-run the full inference scheme on the dataset with that particular point left out (see Vehtari et al., 2017b). Figure 5 compares the mean elpd estimated by LOO with the mean elpd calculated using 200 independent test data points, for each model and each sample size in the simulation described above. The area of each dot in Figure 5 represents the relative complexity of the model, as measured by the effective number of parameters divided by the sample size; we evaluate the effective number of parameters using LOO (Vehtari et al., 2017b). The sample size n varies from 30 to 200 and the number of variables is fixed at 20. Clearly, the relationship is far from the line y = x for extremely small sample sizes, and the relative bias ratio (elpd_loo / elpd_test) depends on model complexity. Empirically, we have found the approximation to be poor when the sample size is less than 5 times the number of parameters. Further diagnostics for PSIS are described by Vehtari et al. (2017a).

Figure 5: Comparison of the mean elpd estimated by LOO and the mean elpd calculated from test data, for each model and each sample size in the simulation described above. The area of each dot represents the relative complexity of the model, as measured by the effective number of parameters divided by the sample size.
As a result, in the small-sample case LOO can have relatively large variance, which makes stacking of predictive distributions and Pseudo-BMA / Pseudo-BMA+ unstable, with performance improving quickly as n grows.

Comparison with mixture models
Stacking is inherently a two-step procedure. In contrast, when fitting a mixture model, one estimates the model weights and the within-model parameters in the same step. In a mixture model with model list M = (M_1, ..., M_K), each component occurs with probability w_k. Marginalizing out the discrete assignments yields the joint likelihood

p(y|w_{1:K}, θ_{1:K}) = ∏_{i=1}^n Σ_{k=1}^K w_k p(y_i|θ_k, M_k).

The mixture model seems to be the most straightforward continuous model expansion. Nevertheless, there are several reasons why we may prefer stacking to fitting a mixture model. First, Markov chain Monte Carlo (MCMC) methods for mixture models are difficult to implement and generally quite expensive. Second, if the sample size is small, or if several components in the mixture could do the same thing, the mixture model can face non-identification or instability problems unless a strong prior is added. Figure 6 shows a comparison of mixture models and other model averaging methods in a numerical experiment, in which the true model is Y ~ N(β_1 X_1 + β_2 X_2 + β_3 X_3, 1), each β_k is generated from N(0, 1), and there are 3 candidate models, each containing a single covariate X_k.

Figure 6: Log predictive densities of the combined distribution obtained by stacking of predictive distributions, BMA, Pseudo-BMA, Pseudo-BMA+, model selection by marginal likelihood, and mixture models. In each case, we evaluate the predictive density using 100 test data points and 100 repeated simulations. The correlation of the variables ranges from −0.3 to 0.9, and the sample size ranges from 3 to 50. Stacking of predictive distributions and Pseudo-BMA+ outperform mixture models in all cases.
In the simulation, we generate the design matrix with Var(X_i) = 1 and Cor(X_i, X_j) = ρ, where ρ determines how correlated the candidate models are and ranges from −0.3 to 0.9. Figure 6 shows that both the mixture model and single-model selection perform worse than any of the model averaging methods we suggest, even though the mixture model takes much longer to run (about 30 times longer than stacking or Pseudo-BMA+). When the sample size is small, the mixture model is too complex to fit. On the other hand, stacking of predictive distributions and Pseudo-BMA+ outperform all the other methods for moderate sample sizes. Clarke (2003) argues that the effect of (point estimation) stacking depends only on the space spanned by the model list, hence he suggests putting relatively "independent" models on the list. Figure 6 shows that high correlations do not hurt stacking and Pseudo-BMA+ in this example.

Variational inference with different initial values
In Bayesian inference, the posterior density of parameters θ = (θ_1, ..., θ_m) given observed data y = (y_1, ..., y_n) can be difficult to compute. Variational inference can be used to give a fast approximation to p(θ|y) (for a recent review, see Blei et al., 2017). Within a family of distributions Q, we try to find the member q ∈ Q that minimizes the Kullback-Leibler divergence to the true posterior distribution:

q* = arg min_{q∈Q} KL( q(θ) ‖ p(θ|y) ).   (10)

One widely used variational family is the mean-field family, in which the parameters are assumed to be mutually independent: Q = {q(θ) : q(θ_1, ..., θ_m) = ∏_{j=1}^m q_j(θ_j)}. Recent progress has been made in running variational inference algorithms in a black-box manner. For example, Kucukelbir et al. (2017) implement Automatic Differentiation Variational Inference (ADVI) in Stan (Stan Development Team, 2017). Assuming all parameters θ are continuous and the model likelihood is differentiable, ADVI transforms θ into the real coordinate space R^m through ζ = T(θ) and uses the normal approximation q(ζ|μ, σ²) = ∏_{j=1}^m N(ζ_j|μ_j, σ_j²). Plugging this into (10) leads to an optimization problem over (μ, σ²), which can be solved by stochastic gradient descent. Under some mild conditions, it eventually converges to a local optimum q*. However, q* may depend on the initialization, since the optimization problem is in general non-convex, particularly when the true posterior density p(θ|y) is multimodal.
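A property worth keeping in mind when averaging such approximations is that mean-field normal approximations tend to underestimate posterior variances. This can be seen in a tiny closed-form sketch (our own illustration, not from the paper): for a Gaussian target with precision matrix Λ, the mean-field normal minimizing KL(q‖p) keeps the true means but has variances 1/Λ_jj, which are too small whenever correlations are present.

```python
import numpy as np

# Mean-field normal approximation to a correlated 2-d Gaussian "posterior".
# For a Gaussian target with precision matrix Lambda, the mean-field normal
# minimizing KL(q || p) has the true means and variances 1 / Lambda_jj.
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])      # assumed true posterior covariance
Lam = np.linalg.inv(Sigma)          # precision matrix
mf_var = 1.0 / np.diag(Lam)         # mean-field marginal variances
```

Here the mean-field variances are 1 − 0.8² = 0.36, well below the true marginal variances of 1.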
Stacking of predictive distributions and Pseudo-BMA+ weighting can be used to average several sets of posterior draws coming from different approximation distributions. To do this, we repeat the variational inference K times. At run k, we start from a random initial point and use stochastic gradient descent to solve the optimization problem (10), ending up with an approximation distribution q*_k. Then we draw S samples θ_k^(1), . . . , θ_k^(S) from q*_k and calculate the importance ratio r_{i,k}^s defined in (6) as r_{i,k}^s = 1/p(y_i | θ_k^(s)). After this, the remaining steps follow as before. We obtain stacking or Pseudo-BMA+ weights w_k and average all the approximation distributions as Σ_{k=1}^K w_k q*_k.
Figure 7 gives a simple example in which the averaging strategy helps adjust for the optimization uncertainty induced by the choice of initial values. Suppose the data are two-dimensional, y = (y^(1), y^(2)), and the parameter is (μ_1, μ_2) ∈ R². The likelihood p(y|μ_1, μ_2) is given by y^(1) ∼ Cauchy(μ_1, 1), y^(2) ∼ Cauchy(μ_2, 1).
A N(0, 1) prior is assigned to μ_1 − μ_2. We generate two observations (y = −2). The first panel shows the true posterior distribution of μ = (μ_1, μ_2), which is bimodal. We run mean-field normal variational inference in Stan, with the two initial values (μ_1, μ_2) = (5, 5) and (−5, −5) separately. This produces two distinct approximation distributions, as shown in panels 2 and 3. We then draw 1000 samples each from these two distributions and use stacking or Pseudo-BMA+ to combine them. The lower two panels show the averaged posterior distributions. Though neither can recover the true distribution, the averaged version is closer to it.

Proximity and directional models of voting
Adams et al. (2004) use US Senate voting data from 1988 to 1992 to study voters' preference for the candidates who propose policies that are similar to their political beliefs. They introduce two similar variables that indicate the distance between voters and candidates. The proximity voting comparison represents the i-th voter's comparison between the candidates' ideological positions, where x_i represents the i-th voter's preferred ideological position, and x_D and x_R represent the ideological positions of the Democratic and Republican candidates, respectively. In contrast, the i-th voter's directional comparison is defined in terms of x_N, the neutral point of the ideology scale.
Finally, all these comparisons are aggregated at the party level, leading to two party-level variables, Democratic proximity advantage and Democratic directional advantage. The sample size is n = 94.
For both of these variables, there are two ways to measure the candidates' ideological positions x_D and x_R, leading to two different datasets. In the Mean candidate dataset, the positions are calculated by taking the average of all respondents' answers in the relevant state and year. In the Voter-specific dataset, they are calculated using respondents' own placements of the two candidates. In both datasets, there are 4 other party-level variables.
The two variables Democratic proximity advantage and Democratic directional advantage are highly correlated. Montgomery and Nyhan (2010) point out that Bayesian model averaging is an approach to help arbitrate between competing predictors in a linear regression model. They average over all 2^6 linear subset models, excluding those containing both Democratic proximity advantage and Democratic directional advantage (i.e., 48 models in total), with each subset model a linear regression on the corresponding subset of predictors.

Figure 8: Regression coefficients and standard errors in the voting example, from the full model (columns 1-2), the averaged subset regression model using BMA (columns 3-4), stacking of predictive distributions (columns 5-6), and Pseudo-BMA+ (columns 7-8). Democratic proximity advantage and Democratic directional advantage are two highly correlated variables. Mean candidate and Voter-specific are two datasets that provide different measurements of the candidates' ideological placements.
To account for the differing complexity of the models, they used the hyper-g prior (Liang et al., 2008). Let φ = 1/σ² be the inverse of the variance. The hyper-g prior with hyperparameter α places the density p(g | α) = ((α − 2)/2)(1 + g)^(−α/2), g > 0, on the scale g of the g-prior for the regression coefficients. The first two columns of Figure 8 show the linear regression coefficients as estimated using least squares. The remaining columns show the posterior means and standard errors of the regression coefficients under BMA, stacking of predictive distributions, and Pseudo-BMA+, respectively. Under all three averaging strategies, the coefficient of proximity advantage is no longer statistically significantly negative, and the coefficient of directional advantage is shrunk. As fit to these data, stacking puts near-zero weights on all subset models containing proximity advantage, whereas Pseudo-BMA+ weighting always gives some weight to each model. In this example, averaging subset models by stacking or Pseudo-BMA+ weighting gives a way to deal with competing variables, which should be more reliable than BMA according to our earlier argument.

Predicting well-switching behavior in Bangladesh
Many wells in Bangladesh and other South Asian countries are contaminated with natural arsenic. People whose wells have arsenic levels exceeding a certain threshold are encouraged to switch to nearby safe wells (for background details, see Gelman and Hill, 2006, Chapter 5.4). We analyze a dataset of 3020 respondents to find factors predictive of well switching. The outcome variable is
y_i = 1 if household i switched to a safe well, and
y_i = 0 if household i continued using its own well.
We consider the following input variables:
• dist: the distance (in meters) to the closest known safe well,
• arsenic: the arsenic level (in 100 micrograms per liter) of the respondent's well,
• assoc: whether a member of the household is active in any community association,
• educ: the education level of the head of the household.
We start with what we call Model 1, a simple logistic regression on all the variables above plus a constant term. Model 2 adds the interaction between distance and arsenic level. Furthermore, we can use splines to capture a nonlinear relationship between the logit switching probability and the distance or the arsenic level. Model 3 contains B-splines for the distance and the arsenic level with polynomial degree 2, where B_dis is the B-spline basis of distance, of the form (B_dis,1(dist), . . . , B_dis,q(dist)), and α_dis and α_ars are coefficient vectors. We fix the number of spline knots at 10. Models 4 and 5 are the analogous models with degree-3 and degree-5 B-splines, respectively.
Next, we can add a bivariate spline to capture nonlinear interactions, where B_dis,ars is the bivariate spline basis with degree 2×2, 3×3, and 5×5 in Models 6, 7, and 8, respectively. Figure 9 shows the inference results for all 8 models, summarized by the posterior mean, 50% confidence interval, and 95% confidence interval of the probability of switching from an unsafe well, as a function of the distance or the arsenic level. The other input variables, assoc and educ, are fixed at their means. It is not obvious from these results which model is best. Spline models give a more flexible shape, but they also introduce more variance into the posterior estimation.
Finally, we run stacking of predictive distributions and Pseudo-BMA+ to combine these 8 models. The calculated model weights are printed above each panel in Figure 9. For both combination methods, Model 5 (univariate splines with degree 5) accounts for the majority share. Model 8 is the most complicated model, but both stacking and Pseudo-BMA+ avoid overfitting by assigning it a negligible weight.

Figure 9: The posterior mean, 50% and 95% confidence intervals of the well-switching probability in Models 1-8. For each model, the switching probability is shown as a function of (a) the distance to the nearest safe well or (b) the arsenic level of the existing well. In each subplot, the other input variables are held constant. The model weights from stacking of predictive distributions and Pseudo-BMA+ are printed above each panel.

Figure 10 shows the posterior mean, 50% confidence interval, and 95% confidence interval of the switching probability in the stacking-combined model. Pseudo-BMA+ weighting gives a similar combination for this example. At first glance, the combination looks quite similar to Model 5, and it may not seem necessary for stacking to put an extra 0.09 weight on Model 1, since Model 1 is completely contained in Model 5 when α_dis = α_ars = 0. However, Model 5 is not perfect: for small distances it predicts that the posterior mean of the switching probability decreases as a function of the distance to the nearest safe well. In fact, without further control, boundary fluctuation is a well-known drawback of higher-order splines. This decreasing trend near the left boundary is flatter in the combined distribution, since the combination includes some weight on the plain logistic regression (in the stacking weights) or on lower-order splines (in the Pseudo-BMA+ weights). In this example the sample size n = 3020 is large, so we have reason to believe stacking of predictive distributions gives the optimal combination.
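Given weights from either method, simulating from the combined posterior predictive distribution amounts to sampling a model index with probability w_k and then drawing from that model's predictive draws. A small sketch (the weights and per-model draws below are invented for illustration, not the fitted values from this example):

```python
import numpy as np

def stacked_draws(model_draws, w, size, rng):
    # Draw from the stacked mixture sum_k w_k p_k: pick a model index
    # with probability w_k, then draw from that model's predictive draws.
    ks = rng.choice(len(model_draws), size=size, p=w)
    return np.array([rng.choice(model_draws[k]) for k in ks])

rng = np.random.default_rng(0)
# Invented stand-ins for two models' posterior predictive draws:
draws = [rng.normal(0.0, 1.0, 4000), rng.normal(3.0, 1.0, 4000)]
w = np.array([0.25, 0.75])
combined = stacked_draws(draws, w, 20000, rng)
# the mixture mean is close to 0.25 * 0 + 0.75 * 3 = 2.25
```

The combined draws can then be summarized exactly as for a single model (posterior mean, 50% and 95% intervals), which is how the curves in Figure 10 would be produced.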
Sparse structure and high dimensions
Yang and Dunson (2014) propose to combine multiple point forecasts through a convex aggregation with a sparsity-favoring prior on the weights. Their goal is to impose a sparse structure (certain models can receive zero weight), and they show their combination algorithm can achieve the minimax squared risk among all convex combinations. The stacking method can also adapt to sparsity through stronger regularization. When the dimension of the model space is high, we can use a hierarchical prior on w in estimation (4) to pull toward sparsity if that is desired.

Constraints and regularity
In point estimation stacking, the simplex constraint is the most widely used regularization, adopted to overcome potential problems with multicollinearity. Clarke (2003) suggests relaxing the constraint to make the combination more flexible.
When combining distributions, there is no need to worry about multicollinearity except in degenerate cases. But in order to guarantee a meaningful posterior predictive density, the simplex constraint is natural, and it is satisfied automatically in BMA and Pseudo-BMA weighting. As mentioned in the previous section, stronger priors can be added.
Another assumption is that the separate posterior distributions are combined linearly. There could be gains from going beyond convex linear combinations. For instance, in the subset regression example, when each individual model is a univariate regression, the true predictive distribution is a convolution of the individual models' distributions rather than a mixture of them. Both lead to the same additive model for the point estimate, so stacking of the means remains valid, but in the convolution case no mixture obtained by stacking of predictive distributions can recover the true model.
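The gap between a convolution and a mixture is easy to check numerically. In this toy illustration (our own, not from the paper), two equal-mean unit-variance components yield a convolution with variance 2, while every convex mixture of the two densities keeps variance 1:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
a = rng.normal(0.0, 1.0, n)     # "model 1" contribution
b = rng.normal(0.0, 1.0, n)     # "model 2" contribution

conv = a + b                    # additive truth: a convolution, N(0, 2)
pick = rng.random(n) < 0.5      # an arbitrary 50/50 convex mixture
mix = np.where(pick, a, b)      # mixture of the two densities: still N(0, 1)
```

No choice of mixing weights changes the conclusion: the mixture's variance stays 1 while the convolution's is 2, so the convolution lies outside the convex hull of the component distributions.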
Our explanation is that when the model list is large, the convex span should be large enough to approximate the true model. This is also why we prefer adding stronger priors to keep the estimation of the weights stable in high dimensions.

General recommendations
The methods discussed in this paper are all based on the idea of fitting models separately and then combining the estimated predictive distributions. This approach is limited in that it does not pool information between the different model fits: as such, it is only ideal when the K different models being fit have nothing in common. But in that case we would prefer to fit a larger super-model that includes the separate models as special cases, perhaps using an informative prior distribution to ensure stability in inferences.
That said, in practice it is common for different sorts of models to be set up without any easy way to combine them, and in such cases it is necessary from a Bayesian perspective to somehow aggregate their predictive distributions. The often-recommended approach of Bayesian model averaging can fail catastrophically in that the required Bayes factors can depend entirely on arbitrary specifications of noninformative prior distributions. Stacking is a more promising general method in that it is directly focused on performance of the combined predictive distribution. Based on our theory, simulations, and examples, we recommend stacking (of predictive distributions) for the task of combining separately-fit Bayesian posterior predictive distributions. As an alternative, Pseudo-BMA+ is computationally cheaper and can serve as an initial guess for stacking. The computations can be done in R and Stan, and the optimization required to compute the weights connects directly to the predictive task.

Invited Discussion
Bertrand Clarke *

Introduction and Summary
Yao, Vehtari, Simpson, and Gelman have proposed a useful and incisive extension to the usual model average predictor based on stacking models. The extension uses score functions as a way to determine weights on predictive densities, thereby giving a full stacking-based distribution for a future outcome. The authors are to be commended for their perceptivity and I am grateful to the Editor-in-Chief of Bayesian Analysis for the opportunity to contribute to the discussion.
The authors develop their ideas logically: Initially, they review the concept of statistical problem classes, namely M-closed, -complete, and -open. These classes are defined by the relative position of the unknown true model (assuming it exists) to the models on a model list that are available for use. Then, they recall the definitions of various model averaging techniques, including the original form of stacking due to Wolpert (1992). Typically, model averaging techniques are most useful for large M-closed problems (where model selection may not be effective) and for M-complete problems.
The authors then state the definition of a proper scoring rule and define a model averaging procedure with respect to one. Intuitively, a scoring rule S is a real valued function of two variables: One is a distribution (usually assumed to have a density) and the other is an outcome of a data generator (DG). When a DG is stochastic, i.e., its outcomes are drawn according to a probability distribution, it makes sense to regard its outcomes as corresponding to a random variable. The scoring rule is meant to encapsulate how far a P chosen to generate predictions is from a realized outcome y. The idea is that our P can be (and probably is) wrong whereas by definition y is 'right' because it came from the DG. Hence, loosely, S(P, y) is small, possibly negative, when P is poorly chosen and large when P is well-chosen, both relative to y.
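Propriety means that the expected score S(P, Q) is maximized when the forecast P equals the data generator Q. A quick Monte Carlo check with the log score, using a toy setup of our own (forecasts N(θ, 1), data generator Q = N(0, 1)):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(0.0, 1.0, 100_000)   # outcomes from the DG, Q = N(0, 1)

def mean_log_score(theta):
    # average of S(P, y) = log p(y) for the forecast P = N(theta, 1)
    return (-0.5 * np.log(2 * np.pi) - 0.5 * (y - theta) ** 2).mean()

thetas = np.linspace(-2.0, 2.0, 41)
scores = np.array([mean_log_score(t) for t in thetas])
best = thetas[scores.argmax()]      # maximized near theta = 0, i.e., P = Q
```

The grid search recovers the data generator's own mean as the best forecast, which is exactly the behavior the definition of a proper scoring rule demands.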
The definition of a scoring rule can be extended to include the case that Y = y has a distribution. This gives a real valued function that behaves somewhat like a distance between two distributions, say P and Q, and is of the form S(P, Q) = ∫ S(P, y) dQ(y).
The authors' central definition, which extends Wolpert (1992), is the stacking objective in their Expression (1), stated in terms of: i) S_K, the simplex in K dimensions with generic element w = (w_1, . . . , w_K); ii) the M_k's, the candidate models, with p_T denoting the density of the true model; and iii) the understanding that the integration in S is with respect to p_T. The first and second entries of S in (1) use the predictive densities for a new Y_{n+1} = y_{n+1}, given Y = y, under the K candidate models and p_T, respectively.
Since (1) is intractable as written, the authors replace the predictive densities with their leave-one-out counterparts, their Expression (2), in which the subscript −i indicates that the i-th data point, i = 1, . . . , n, has been left out. It is also assumed that model M_k is defined by a conditional density for Y_i and includes a prior p(θ_k), where θ_k is the parameter for model M_k. Using (2) in (1) and reverting to the initial definition of the score function gives the stacking weights in (3), assuming they exist and are unique. The resulting stacked predictive density, their Expression (4), takes its coefficients from (3) and can be used to obtain point and interval predictors for Y_{n+1}.
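Under the log score, the optimization in (3) is a mixture-likelihood problem with fixed components, so simple EM-style updates converge to the stacking weights. The sketch below is our own illustrative implementation (the authors instead solve the optimization directly), taking as input a matrix of leave-one-out log predictive densities:

```python
import numpy as np

def stacking_weights(lpd, n_iter=2000):
    # lpd[i, k] approximates log p(y_i | y_{-i}, M_k); the weights maximize
    # sum_i log sum_k w_k exp(lpd[i, k]) over the simplex S_K.
    p = np.exp(lpd - lpd.max(axis=1, keepdims=True))   # row-wise rescaling
    w = np.full(lpd.shape[1], 1.0 / lpd.shape[1])
    for _ in range(n_iter):
        z = w * p                          # responsibilities (up to row consts)
        z = z / z.sum(axis=1, keepdims=True)
        w = z.mean(axis=0)                 # EM update, stays on the simplex
    return w

# Two models, each clearly better on half the points: weights near (0.5, 0.5).
lpd = np.array([[0.0, -5.0], [-5.0, 0.0]] * 10)
w = stacking_weights(lpd)
```

The row-wise rescaling multiplies each row's likelihood by a constant, so it leaves the optimizing weights unchanged while avoiding underflow.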

Prequentialism, Problem Classes, and Comparisons
The central methodological contribution of the paper is a general technique, essentially Expression (4), for using a score function to find coefficients that form a stack of densities, which happen to be the predictive densities from K models. Thus, it is a method for producing predictors, and it can be compared to other methods that produce predictors. One natural way to do this is to invoke the Prequential Principle: Any criterion for assessing the agreement between the predictor and the DG should depend only on the predictions and outcomes. There are two key features to this: i) There should be no conflict/confluence of interest between the assessment and the generation of the values fed into the assessment, and ii) The values of either the DG or predictor that were not used are not relevant. At root, Prequentiality primarily requires that the comparison of predictions with outcomes be disjoint from how the predictions were generated.
The Prequential Principle is very general: It does not require that the outcomes of the DG even follow a distribution. So, the Prequential Principle accommodates any sort of streaming data, or data that can be regarded as having a time-ordering, even if the ordering is imposed by the analysis. Parallel to this generality, and given the ubiquity of data that is not plausibly stochastic, it makes sense to assess predictors without assuming a distributional form for the DG. In the context of the Prequential Principle, the assumption built into the paper (and these comments) is that stacking the posterior predictive densities is done using only the first n data points and the goal is to predict the (n + 1)-st. That is, n data points (x_i, y_i), i = 1, . . . , n, are available to form a predictor such as p̂, and the predictor is evaluated at x_{n+1} to predict Y_{n+1}(x_{n+1}), with its performance assessed in some way that does not involve S. Implicitly, it is assumed the prediction problem will be repeated many times and we are looking only at the n-th stage so as to examine the updating of the predictor. In this way the authors' framework may be seen as Prequential (although this is a point on which reasonable people might disagree). Aside from the formula (4), updating can include changing the models M_k, or even the model averaging technique itself. This is done chiefly by comparing the predictions with the realized values. Here, the x_i's are regarded as deterministic explanatory variables and the Y_i's are random. In much of the paper, the x_i's are suppressed in the notation.
In their Gaussian mixture model example (Section 4.1), the authors treat the problem as M-open because the true model is not on the model list. While this is reasonable for the sake of argument, it actually underscores the importance of model list selection, because choosing a better model list would make the problem M-closed. Nevertheless, in this example, the authors make a compelling point by comparing three different predictors: Bayes model averaging (BMA), stacking of means (under squared error), and stacking of predictive distributions using a log score as in (4). Figure 1 shows that BMA converges to the model on the model list closest to the true model, i.e., BMA has unavoidable (and undetectable) bias. By contrast, stacking of means and stacking of predictive distributions both do well in terms of their means (Figure 2, middle panel), and stacking of predictive densities outperforms both BMA and stacking of means in other senses (Figure 2, left and right panels) because, as shown in Figure 1, it converges to the correct predictive distribution. The distribution associated with the stacking of means converges to a broad lump that does not seem useful. This example shows that matching whole distributions is feasible and sensible. It also shows that BMA does not routinely perform well despite its asymptotic optimality; see Raftery and Zheng (2003).
In the example of Section 4.2, Figures 3 and 4 show that, again, stacking means and stacking predictive densities are the best among seven model averaging methods while BMA ties for fourth place or is in last place. Other comparisons have found BMA to have similarly disappointing finite sample behavior. (There is good evidence that a technique called Pseudo-BMA+ is competitive with the two versions of stacking.) In the example of Section 4.3 where the goal is to obtain a density, the authors show stacking predictive densities and Pseudo-BMA+ outperform four other techniques for obtaining a density; one is BMA. The remaining examples show further properties of stacking predictive densities (Section 4.4) or use the method to do predictive inference (Sections 4.5 and 4.6). The results on real data seem reasonable.
A remaining question is the relationship between using predictions from stacked predictive distributions and the Prequential Principle. Obviously, point predictors from stacked predictive distributions can be compared directly to outcomes, and hence the Prequential Principle is satisfied. However, one of the authors' arguments is that obtaining a full distribution, as can be done using their method, is better than simply using point predictors; this is justified by what appear to be non-Prequential assessments such as the mean log density; see Figures 2, 3, 4, and 6. First, the log score was used to form the stacks. So, is log really the right way to assess performance? Second, all too often, taking a mean requires a distribution to exist, so the evaluation of the predictor may depend on the true distribution, if only to define the mode of convergence. It is hard to tell whether this is the case with the present predictor; these are points on which the authors should comment. Moreover, the new method effectively gives a prediction distribution (4), which leaves us with two questions: i) How should we use the distributional information, including that from the smoothing and Bayesian bootstrap, assuming it is valid? and ii) Is the distributional information associated with the stacking of means or densities valid, i.e., an accurate representation of its variability?
The answer to i) might simply be the obvious: It's a distribution and therefore any operation we might wish to perform on it, e.g., extract interval or percentile predictors, is feasible and it can be assessed in the score function of our choice. Of course, in practice, we do not know the actual distribution of the future outcome; we have only an estimate of it that we hope is good. Perhaps the consistency statement in Section 3.2 is enough. The answer to ii) seems to require more thought on what exactly the Pareto smoothing and Bayesian bootstrap are producing. This is important because an extra quantity, the score function, has been introduced and the solution in (4) -and hence the distribution assigned to stacked means or predictive densities -can depend on it, possibly delicately. The effect of the score function and the validity of the distribution the authors have associated to stacking of means are points for which the authors might be able to provide some insight.

What About Score Functions in the M-Open Case?
One can plausibly argue that the authors' methodology really only applies to M-closed and -complete problems. In other words, the examples they use are simulated and so are M-closed or real data for which one might believe a stochastic model exists even if it is tremendously complex. Of course, even if such arguments are accepted, one can use the authors' techniques in M-open problems -it just might not work as well as methods that are designed for M-open problems.
One technique that was invented with M-open problems in mind is due to Shtarkov (1987). The analogous Bayesian problem and solution were given in Le and Clarke (2016). In both cases the log score was used; however, the authors' work suggests that this technique can be generalized to other score functions.
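For intuition, in the discrete, prior-free case, Shtarkov's log-loss solution is the normalized maximum likelihood distribution q(y) ∝ p(y | θ̂(y)). A tiny sketch for n Bernoulli trials (our own illustrative setup; log Z is the minimax regret):

```python
from math import comb, log

n = 10  # number of Bernoulli trials

def max_lik(k):
    # sup over theta of p(y | theta) for a sequence with k successes
    t = k / n
    return t**k * (1 - t)**(n - k)   # 0**0 == 1 handles k = 0 and k = n

# Shtarkov normalizer: sum of maximized likelihoods over all 2^n sequences
Z = sum(comb(n, k) * max_lik(k) for k in range(n + 1))
q = {k: max_lik(k) / Z for k in range(n + 1)}   # NML prob. of one sequence
regret = log(Z)   # the minimax regret attained by q
```

Every sequence is assigned its maximized likelihood, renormalized; the same regret log Z is incurred no matter which outcome Nature chooses, which is what makes the solution minimax.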
Recall the idea behind Shtarkov's original formulation: prediction can be treated as a game in which a Forecaster chooses a density q (for prediction) and Nature chooses an outcome y not constrained by any rule. The payoff to Nature (or loss to the Forecaster) is log q(y), i.e., log loss. Naturally, the Forecaster wants to minimize the loss. So, assume the Forecaster has 'Advisors' represented by densities p_θ; each Advisor corresponds to a θ. The Advisors announce their densities before Nature issues a y. If the Forecaster has a pre-game idea about the relative abilities of the Advisors to give good advice, this may be formulated into a prior p(θ). Now it makes sense for the Forecaster to minimize the maximum regret, i.e., to seek the smallest loss relative to that incurred by the best Advisor; this means finding the q that minimizes the worst-case regret in (5). The solution exists in closed form and can be computed; see Le and Clarke (2016). Following the authors, consider replacing the log loss in (5) by a general score function S. Now the Forecaster wants the q ∈ Q, say q_{opt,S}, that minimizes the corresponding S-regret (6), where Q is a collection of densities. Expression (6) may be converted to a form analogous to (3), possibly releasing the sum-to-one constraint that some have argued is not appropriate for M-open problems. It is not obvious that a closed form for (6) can be given; q_{opt,S} might only be available computationally. In either case, q_{opt,S} would depend on S, give an alternative solution to Shtarkov's problem, and might perform better for M-open data than score-based stacking. If the authors have any insight on these points, many readers would be glad to hear it.

Raftery, A. and Zheng, Y. (2003). "Performance of Bayes model averaging."

In the next several pages, I will first review the paper, connecting it to the related literature (Section 2), then comment on the M-complete case (Section 3) and an application that may be favorable to the proposed method (Section 4). Section 5 concludes this comment.

Overview
Suppose we have a list of parametric models M under consideration. Yao et al. (2017) address a general problem of model aggregation from an interesting perspective: how to average the multiple models in M such that the resulting combination has an optimal predictive distribution. This distinguishes its goal from two areas that have been well studied, namely, weighting models for an optimal point prediction and selecting a single model, possibly with uncertainty quantification. Yao et al. (2017) particularly focus on the M-open case, which allows the true model to fall outside of M.
One of the most popular approaches is to use the Bayesian model probabilities pr(M_k | y^(n)) as weights, with these weights forming the basis of Bayesian Model Averaging (BMA). Philosophically, in order to interpret pr(M_k | y^(n)) as a model probability, one must rely on the assumption that one of the models in the list M is exactly true, known as the M-closed case. This assumption is arguably always flawed, although one can still use pr(M_k | y^(n)) as a model weight from a pragmatic perspective, regardless of the question of interpretation. In the M-complete or M-open case, an alternative approach is to formulate the model selection problem in a decision theoretic framework, selecting the model in M that maximizes expected utility. Yao et al. (2017) adopt a stacking approach along the lines of this decision theoretic framework (Gutiérrez-Peña et al., 2009).
There are various scoring rules available when defining the utility function in a decision theoretic framework. The choice can, and probably should, depend on the specific application; a similar principle has been demonstrated by Claeskens and Hjort (2003), who note that model selection may focus on the parameter singled out for interest. The classical stacking method uses quadratic loss (or the energy score with β = 2 in the paper), targeting an optimal point prediction. Yao et al. (2017) consider a range of scoring rules, generalizing the stacked density estimation of Smyth and Wolpert (1998). The authors recommend the log scoring rule, which essentially finds the Kullback-Leibler projection of the true data generating density onto the convex hull of the individual predictive densities p(·|y, M_k), targeting an optimal predictive density.

* This work was partially supported by the Ralph E. Powe Junior Faculty Enhancement Award by ORAU.
† Noah Harding Assistant Professor, Department of Statistics, Rice University, Houston, TX, U.S.A., meng@rice.com
The decision theoretic framework used in the paper bypasses the need to philosophically interpret or calibrate model weights, but it has to evaluate the expected utility. Expected utility can be approximated either via cross-validation or using a nonparametric prior (Gutiérrez-Peña et al., 2009). The authors use leave-one-out cross-validation to construct an approximation of the expected utility, though one may generally consider k-fold cross-validation. While Bayesian model probabilities often have analytical forms available and thus are computationally appealing (Liang et al., 2008), the computational burden of cross-validation is unfavorably intensive. The authors overcome this difficulty by using Pareto smoothed importance sampling (Vehtari and Lampinen, 2002; Vehtari et al., 2012) to approximate the leave-one-out predictive densities; the method self-diagnoses the reliability of the approximation through the estimated shape parameter of the fitted Pareto distribution and leads to manageable computation. The authors thoughtfully design simple but effective simulations to illustrate and compare how stacking of distributions and selected existing methods behave, and they provide R and Stan code for routine implementation.
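The importance-sampling identity underlying this approximation is p(y_i | y_{−i}) = 1 / E_post[1/p(y_i | θ)], estimated by averaging the ratios r_i^s = 1/p(y_i | θ^(s)) over posterior draws. The sketch below (our own toy conjugate example, with plain importance sampling and no Pareto smoothing) checks it against the exact leave-one-out density:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(0.0, 1.0, 30)    # toy model: y_j ~ N(theta, 1), theta ~ N(0, 100)
n, prior_var = len(y), 100.0

# Full-data conjugate posterior and draws from it
v = 1.0 / (n + 1.0 / prior_var)
m = v * y.sum()
theta = rng.normal(m, np.sqrt(v), 20_000)

# IS-LOO for point i: p(y_i | y_{-i}) ~= 1 / mean_s[ 1 / p(y_i | theta_s) ]
i = 0
log_lik = -0.5 * np.log(2 * np.pi) - 0.5 * (y[i] - theta) ** 2
lpd_is = -np.log(np.exp(-log_lik).mean())

# Exact LOO predictive from conjugacy: y_i | y_{-i} ~ N(m_i, 1 + v_i)
v_i = 1.0 / (n - 1 + 1.0 / prior_var)
m_i = v_i * (y.sum() - y[i])
lpd_exact = -0.5 * np.log(2 * np.pi * (1 + v_i)) - 0.5 * (y[i] - m_i) ** 2 / (1 + v_i)
```

In well-behaved cases like this the raw ratios are stable; the Pareto smoothing step in the paper is what keeps the estimate usable when the ratios are heavy-tailed.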

M-complete and nonparametric references
The nonparametric Bayes literature provides a rich menu of possibilities for approximating the true data generating scheme, ranging from Dirichlet processes to Gaussian processes; refer to Hjort et al. (2010) for a review. There is a rich literature showing that Bayesian nonparametric models often have appealing frequentist asymptotic properties, such as appropriate notions of consistency (Schwartz, 1965) and optimal rates of convergence (van der Vaart and van Zanten, 2009; Bhattacharya et al., 2014; Castillo, 2014; Shen and Ghosal, 2015; Li and Ghosal, 2017; Ghosal and van der Vaart, 2017). When an optimal predictive density is the goal, one may ask why not pursue the direction of flexible modeling, utilizing the rapidly developing nonparametric Bayes literature?
While I look forward to open discussions about the question above, one possible argument is that although Bayesian nonparametric models are appealing from a prediction viewpoint, they also have disadvantages in terms of not only interpretability but also in involving large numbers of parameters, which increase automatically as the sample size increases. This may lead to daunting memory, storage and communication issues in modern applications involving large data sets. It is thus often preferable from a variety of viewpoints to approximate the performance of a very rich and provably flexible nonparametric model by taking a weighted average of much simpler parametric models.
Keeping interpretability in mind while being aware of flexible nonparametric models, we indeed arrive at the M-complete case, which "refers to the situation where the true model exists and is out of model list M. But we still wish to use the models in M because of tractability of computations or communication of results, compared with the actual belief model. Thus, one simply finds the member in M that maximizes the expected utility (with respect to the true model)", according to the definition from Bernardo and Smith (1994). As a result, we can use a nonparametric Bayes surrogate for the true model and calculate the expected utility based on this surrogate, a general approach that fits into the M-complete case. Li and Dunson (2018) use a nonparametric Bayesian reference model to define model weights based on Kullback-Leibler divergence; these weights can be used for goodness-of-fit assessments, for comparing imperfect models, and for model aggregation. It seems promising to migrate this idea to the two weighting approaches used in the paper, one based on optimization using proper scoring rules and the other, called Pseudo-BMA in Section 3.4, a form of exponential weighting (Rigollet and Tsybakov, 2012):
• Stacking using proper scoring rules. One may obtain the weights by optimizing an approximated version of (3) in which the empirical distribution of the data is replaced by draws ỹ_i from the nonparametric Bayes reference model p̂_t(ỹ_i); the other notations are defined in Yao et al. (2017).
• Pseudo-BMA. We replace the empirical observations used in Section 3.4 by the nonparametric surrogate. Specifically, the quantity elpd_k can be estimated by
$$\widehat{\mathrm{elpd}}_k^{\mathrm{NP}} = \sum_{i=1}^{n} \int \tilde{p}_t(\tilde{y}_i) \log p(\tilde{y}_i \mid y, M_k) \, d\tilde{y}_i$$
instead of the $\widehat{\mathrm{elpd}}_k^{\mathrm{loo}}$ used in the paper, and the final weights become
$$w_k = \frac{\exp(\widehat{\mathrm{elpd}}_k^{\mathrm{NP}})}{\sum_{j=1}^{K} \exp(\widehat{\mathrm{elpd}}_j^{\mathrm{NP}})}.$$
The use of nonparametric reference models eliminates the need for cross-validation, which gives rise to the main computational hurdle in stacking of distributions. In addition, this estimate based on a nonparametric reference potentially induces an inherent complexity penalty, a phenomenon observed in Li and Dunson (2018), so the lognormal approximation for weight regularization in Section 3.4 may not be necessary.
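The exponential-weighting step itself is simple to implement; the sketch below uses hypothetical elpd estimates (however obtained: LOO or a nonparametric reference) and a standard max-subtraction trick for numerical stability.

```python
import math

def pseudo_bma_weights(elpds):
    """Pseudo-BMA weighting: w_k proportional to exp(elpd_k).
    Subtracting the maximum before exponentiating keeps the
    computation numerically stable for large-magnitude elpds."""
    m = max(elpds)
    raw = [math.exp(e - m) for e in elpds]
    s = sum(raw)
    return [r / s for r in raw]

# three hypothetical models with elpd estimates on the log scale
weights = pseudo_bma_weights([-132.1, -130.4, -135.0])
```

The model with the largest elpd estimate receives the largest weight, and the weights sum to one.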
Although we here focus on Bayesian machinery, one can approximate the expected utility using any method that estimates $d\big(\sum_{k=1}^{K} w_k \, p(\cdot \mid y, M_k), \ p_t(\cdot \mid y)\big)$. For example, if we use the recommended Kullback-Leibler divergence, i.e., d(p, q) = KL(q, p), then there is substantial work focused on estimating the Kullback-Leibler divergence between two unknown densities based on samples from these densities (Leonenko et al., 2008; Pérez-Cruz, 2008; Bu et al., 2018). Here the setting is somewhat different as there is only one sample, but the local likelihood methods of Lee and Park (2006) and the Bayesian approach of Viele (2007) can potentially be used, among others.
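As a rough illustration of such sample-based divergence estimation, the sketch below implements a simple one-dimensional nearest-neighbour estimator in the spirit of the estimators studied in the cited papers (this is a minimal sketch, not the exact estimator of any one reference; the test distributions are assumptions for illustration).

```python
import math, random

def knn_kl_estimate(xs, ys):
    """1-nearest-neighbour estimate of KL(p || q) in one dimension,
    given xs drawn from p and ys drawn from q."""
    n, m = len(xs), len(ys)
    total = 0.0
    for i, x in enumerate(xs):
        rho = min(abs(x - xo) for j, xo in enumerate(xs) if j != i)  # within-sample NN
        nu = min(abs(x - y) for y in ys)                             # cross-sample NN
        total += math.log(nu / rho)
    return total / n + math.log(m / (n - 1))

rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(500)]
ys = [rng.gauss(1.0, 1.0) for _ in range(500)]
kl_hat = knn_kl_estimate(xs, ys)  # true KL(N(0,1) || N(1,1)) is 0.5
```

With a few hundred samples the estimate is noisy but of the right order, which is often sufficient for comparing candidate models.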
Furthermore, all of these decision-theoretic and BMA-style methods are focused on providing weights for model aggregation, and are not useful for goodness-of-fit assessments of (absolute) model adequacy. The nonparametric reference models in the M-complete case enable assessment of the quality of each individual model in M as well as of the entire model list. One of course needs to specify an absolute scale to define what is adequate, but rules of thumb such as the one provided by Li and Dunson (2018), based on the conventions for Bayes factors, may be obtainable.

Data integration
Section 5.3 makes a great point that an ideal case for stacking is one in which the K models in the model list are orthogonal. This ideal case is not fully demonstrated in the paper, but it insightfully points to a possible solution to a challenging problem: data integration.
Modern techniques enable researchers to acquire rich data from multiple platforms, so it is now possible to combine fundamentally different data types to make a single decision, hopefully one more informative than any decision based on an individual data resource. In response to this demand, there has been a recent surge of interest in data integration, expanding into a variety of emerging areas, for example, imaging genetics, omics data, and the analysis of covariate-adjusted complex objects (such as functions, images, trees, and networks). One concrete example that I have been working on comprises a cohort of patients with demographic, clinical and omics data; the omics data include single nucleotide polymorphisms (SNPs), expression, and micro-ribonucleic acids (miRNAs). In these cases, the dramatic heterogeneity across data structures, which is one of the root causes of failure for many attempts, especially those trying to map the various data structures to a common space such as a Euclidean space, seems to be a characteristic favorable to the stacking approach. The sample size is usually not large, so even the cross-validation approach without approximation may be computationally manageable.

Summary
To summarize, Yao et al. (2017) have tackled model averaging, one of the fundamental tasks in statistics. They have proposed improvements to existing stacking methods, extending them to the stacking of predictive densities. The method requires computing many leave-one-out posterior distributions to approximate the expected utility, and the authors propose using Pareto smoothed importance sampling to scale up the implementation.
I would like to thank Yao et al. for writing an interesting paper. I appreciated that the paper has several detailed and thoughtful demonstrations that compare the proposed methods to existing model weights and help readers understand how stacking and BMA behave differently. The integration with R and Stan makes the method immediately available to practitioners. I find the work useful and expect the proposed methods to positively impact model averaging and its application to a wide range of problems in practice. I hope the comments on M-complete and a possible application to data integration add some useful insights to a paper already rich in content.

Leonenko, N., Pronzato, L., and Savani, V. (2008). A class of Rényi information estimators for multidimensional densities. The Annals of Statistics, 36(5).

The approach of Yao et al. (2018), which stacks based on the log score, while often outperforming BMA, fails to address a crucial problem of the M-open-BMA setting. This is the problem of hypercompression, as identified by Grünwald and Van Ommen (2017) and shown also to occur with real-world data by De Heide (2016). We explore this issue in Section 2; first, we very briefly compare stacking to a related method called switching.

Stacking and Switching
Standard BMA can already be viewed in terms of minimizing a sum of log score prediction errors via the prequential interpretation of BMA. Based on this interpretation, Van Erven et al. (2012) designed the switch distribution as a method for combining Bayes predictive densities with asymptotics that coincide, up to a log log n factor, with those of the Akaike Information Criterion (AIC) and leave-one-out cross validation (LOO). It can vastly outperform standard BMA (see Figure 1 from their paper), yet is designed in a manner that stays closer to the Bayesian ideal than stacking. It has the additional benefit that if one happens to be so lucky as to unknowingly reside in the M-closed (correctly specified) case after all, the procedure becomes statistically consistent, asymptotically selecting the smallest model M_k that contains the data generating distribution P*. We suspect that in this M-closed case, stacking will behave like AIC, which, in the case of nested models, even asymptotically will select an overly large model with positive probability (for theoretical rate-of-convergence and consistency results for switching see Van der Pas and Grünwald (2018)). Moreover, by its very construction, switching, like stacking, should resolve another central problem of BMA identified by Yao et al. (2018, Section 2), namely its sensitivity to the prior chosen within the models M_k. On the other hand, in the M-open case, switching will asymptotically concentrate on the single, smallest M_k that contains the distribution p̃ closest to P* in KL-divergence; stacking will provide a weighted predictive distribution that may come significantly closer to P*, as indicated by Yao et al. (2018, Section 3.2).
To give a very rough idea of 'switching': in the case of just two models M = {M_1, M_2}, switching can be interpreted as BMA applied to a modified set of models {M^(j) : j ∈ ℕ}, where M^(j) represents a model that follows the Bayes predictive density of model M_1 until time j and then switches to the Bayes predictive density corresponding to model M_2; dynamic programming allows for efficient implementation even when the number of models K is larger than 2. It would be of interest to compare stacking to switching, and compile a list of the pros and cons of each.
2 Standard BMA, Stacking and SafeBMA

Grünwald and Van Ommen (2017) give a simple example of BMA misbehaving in an M-open regression context. We start with a set of K + 1 models, where each M_k is a standard linear regression model, i.e. a set of conditional densities p_{β,σ²} expressing that Y_i = Σ_{j≤k} β_j X_{ij} + ξ_i; here the X_{ij} are polynomial basis functions of Z_i (so X_{i0} = 1, X_{i1} = Z_i, X_{i2} = (3Z_i² − 1)/2, and so on, for the Legendre basis), and the ξ_i are i.i.d. N(0, σ²) noise variables. We equip each model with standard priors, for example, a N(0, σ²) prior on the β's conditional on σ², and an inverse Gamma prior on σ². We put a uniform or a decreasing prior on the models M_k themselves. The actual data (Z_i, Y_i) are i.i.d. ∼ P*. Here P* is defined as follows: at each i, a fair coin is tossed. If the coin lands heads, then Z_i is sampled uniformly from [−1, 1], and Y_i is sampled from N(0, 1). If it lands tails, then (Z_i, Y_i) is simply set to (0, 0). Thus M_1, the simplest model on the list, already contains the density in ⋃_{k=1..K} M_k that is closest to P*(Y | X) in KL divergence. This is the density p_{β,1/2} with β = 0 (all regression coefficients zero and σ² = 1/2), which is incorrect in that it assumes homoskedastic noise while in reality the noise is heteroskedastic; yet p_{0,1/2} does give the correct regression function E[Y | X] ≡ 0. M_1 is thus 'wrong but highly useful'. Still, while M_1 receives the highest prior mass, until a sample size of about 2K is reached, BMA puts nearly all of its weight on models M_k with k close to the maximum K, leading to rather dreadful predictions of E[Y | X]. Figure 1 (green) shows E[Y | X] where the expectation is under the Bayes predictive distribution arrived at by BMA at sample size 50, for K = 30. On the other hand, SafeBayesian model averaging, a simple modification of BMA that employs likelihoods raised to an empirically determined power η < 1, performs excellently in this experiment; for details we refer to Grünwald and Van Ommen (2017). We also note that other common choices for priors on (β, σ²) lead to the same results; also, we can take the X_{i0}, X_{i1}, ..., X_{iK} to be trigonometric basis functions or i.i.d. Gaussians rather than polynomials of Z_i, still getting essentially the same results. De Heide (2016) presents various real-world data sets in which a similar phenomenon occurs.
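For concreteness, the data-generating process P* just described can be simulated as follows (a sketch of the sampling step only; the Legendre-basis regression fits themselves are omitted).

```python
import random

def sample_gvo(n, seed=0):
    """Simulate the Grunwald-Van Ommen (2017) data-generating process P*:
    each round a fair coin decides between an 'informative' point
    (Z uniform on [-1, 1], Y ~ N(0, 1)) and the degenerate point (0, 0).
    The regression function E[Y | Z] is identically 0, yet the noise is
    heteroskedastic, so every homoskedastic linear model is misspecified."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        if rng.random() < 0.5:          # heads: informative point
            data.append((rng.uniform(-1.0, 1.0), rng.gauss(0.0, 1.0)))
        else:                           # tails: degenerate point
            data.append((0.0, 0.0))
    return data

pts = sample_gvo(50)  # the sample size used for Figure 1
```

About half of the 50 points land exactly at (0, 0), which is what drives the heteroskedasticity.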
Given these problematic results for BMA in an M-open scenario, it is natural to check how the stacking approach of Yao et al. (2018) (based on log score) fares on this example. We tried it (implementation details at the end of this section) and obtained the red line in Figure 1. While the behaviour is definitely better than that of BMA, we do see a milder variation of the same overfitting phenomenon. We still regard this as undesirable, especially because another method (SafeBMA) behaves substantially better. To be fair, we should add that Yao et al. (2018, Section 3.3) advise that for extremely small n, their current method can be unstable. The figure reports the result on a simulated data sequence for which, according to the diagnostics in their software, their method should be reasonably accurate (details at the end of this section). Since, moreover, results (not shown) based on the closely related LOO model selection with log score are very similar, we do think that there is an issue here: stacking in itself is not sufficient to get useful weighted combinations of Bayes predictive distributions in some small sample situations where such combinations do exist.
Hypercompression. The underlying problem is best explained in a simplified setting without random covariates: let Y_1, Y_2, ... i.i.d. ∼ P* and let each model M_k be a set of densities for the Y_i. Denote by p̃ the density in ⋃_{k=1..K} M_k that minimizes KL divergence to P*. Then, under misspecification, we can have for some k = 1..K that
$$\mathbb{E}_{P^*}\big[\log p(Y_1, \ldots, Y_n \mid M_k)\big] > \mathbb{E}_{P^*}\Big[\sum_{i=1}^{n} \log \tilde{p}(Y_i)\Big]. \qquad (1)$$
This can happen even for a k such that p̃ ∈ M_k. (1) is possible because p(y_1, ..., y_n | M_k) is a mixture of distributions in M_k, and may thus be closer to P* than any single element of M_k. This phenomenon, dubbed hypercompression and extensively studied and explained by Grünwald and Van Ommen (2017), has the following effect: if M_j for some j ≠ k contains p̃ and, at the given sample size, has its predictive distribution p(y_n | y^{n−1}, M_j) already indistinguishable from p̃, yet the posterior based on M_k has not concentrated on anything near p̃ (or M_k does not even contain p̃), then M_k might still be preferred in terms of log score and hence chosen by BMA. The crucial point for the present discussion is that with stacking based on the log score, the preferred method of Yao et al. (2018) (see Section 3.1), the same can happen: (1) implies that for a substantial fraction of outcomes y_i in y_1, ..., y_n, one will tend to have, with y_{−i} := (y_1, ..., y_{i−1}, y_{i+1}, ..., y_n), that
$$\log p(y_i \mid y_{-i}, M_k) > \log \tilde{p}(y_i), \qquad (2)$$
hence also giving an advantage to M_k compared to the KL-best p̃ and M_j.
But why would this be undesirable? It turns out that the predictive distribution p(· | y_{−i}, M_k) in (2) achieves being significantly closer to P* in terms of KL divergence than any of the elements inside M_k by being a mixture of elements of M_k which themselves are all very 'bad', i.e. very far from P* in terms of KL divergence (see in particular Figures 7 and 8 of Grünwald and Van Ommen (2017)). As a result, using a log score oriented averaging procedure, whether it be BMA or stacking, one can select an M_k whose predictive is good, at sample size i, in log score, but quite bad in terms of just about any other measure. For example, consider a linear model M_k as above. For such models, for fixed σ², as a function of β, the KL divergence $D(P^* \| p_{\beta,\sigma^2})$ is, up to additive and multiplicative constants, equal to the expected squared prediction error $\mathbb{E}_{P^*}[(Y - \beta^\top X)^2]$. Therefore, one commonly expects a predictive distribution p(y_i | x_i) that behaves well in terms of log score (close in KL divergence to P*) to be also good at predicting y_i as a function of the newly observed x_i in terms of squared prediction error. Yet this is true only if p is actually of the form p_{β,σ²} ∈ M_k; the Bayes predictive distribution, being a mixture, is simply not of this form and can be good at log score yet very bad at squared error prediction. Now it might of course be argued that none of this matters: stacking for the log score was designed to come up with a predictive that is good in terms of log score... and it does! Indeed, if one really deals with a practical prediction problem in which one's prediction quality will be directly measured by log score, then stacking with the log score should work great. But to our knowledge, the only such problems are data compression problems, in which log score represents codelength.
In most applications in which log score is used, it is rather used for its generic properties, and then the resulting predictive distributions may be used in other ways (they may be plotted to give insight into the data, or they may be used to make predictions against other loss functions, which may not have been specified in advance). For example, Yao et al. (2018, end of Section 3.1) cite the generic properties that log score is local and proper as a reason for adopting it. Our example indicates that in the M-open case, such use of log score for its generic properties only can give misleading results. The SafeBayesian method overcomes this problem by exponentiating the likelihood to the power η that minimizes a variation of log-score for predictive densities (the R-log loss, Eq. (23) in Grünwald and Van Ommen (2017)) in which the loss cannot be made smaller by mixing together bad densities.

Figure 1. The conditional expectations E[Y | X] in Figure 1 are based on a simulation in which the models are trained with 30 Legendre polynomial basis functions on 50 data points, as described in Section 2. The green curve represents E[Y | X] according to the predictive distribution resulting from BMA with a uniform prior on the models, where we used the function bms of the R-package BMS. The red curve is based on stacking of predictive distributions, where we used the implementation with Stan and R exactly as described in the appendix of Yao et al. (2018). The black line depicts the true regression function Y = 0. The blue curve is SBRidgeIlog, an implementation of I-log-SafeBayesian Ridge Regression (see Grünwald and Van Ommen (2017) for details) from the R-package SafeBayes (De Heide, 2016), based on the largest model M_K. The regression functions based on M_k for all k < K are even closer to Y = 0 (not shown). The regression function according to the SafeBMA predictive distribution is a mixture of all these Ridge-based regression functions and hence also close to 0.

Some Details Concerning the Simulations
As Yao et al. (2018) note, the implementation of their method can be unstable when the ratio of the relative sample size to the effective number of parameters is small. We encountered this unstable behaviour in a large proportion of the simulations when the sample size was relatively small: the Pareto-k diagnostic (indicating stability) was above 0.5, though mostly below 0.7, for some data points. In those cases the method did not give sensible outputs, irrespective of the true regression function (which we set to, among others, Y_i = 0.5X_i + ξ_i and Y_i = X_i² + ξ_i; we also experimented with a Fourier basis). Thus, we re-generated the whole sample of size n = 50 many times and only considered the runs in which the k-diagnostic was below 0.5 for all data points. In all those cases, we observed the overfitting behaviour depicted in Figure 1. This 'sampling towards stable behaviour' may of course induce bias. Nevertheless, the fact that we get very similar results for model selection, rather than stacking (mixing), based on LOO with log-score indicates that the stacking curve in Figure 1 is representative.

Contributed Discussion
A. Philip Dawid *

Scoring rules
The paper extends the method of stacking to allow probabilistic rather than point predictions. It does this by applying a scoring rule, which is a loss function S(y, Q)¹ for critiquing a quoted probabilistic forecast Q for a quantity Y in the light of Nature's realised outcome y. In retrospect this simple extension has been a long time coming. Although theory and applications of proper scoring rules have been around for at least 70 years, this fruitful and versatile concept has lain dormant until quite recently, and is still woefully unfamiliar to most statisticians; see Dawid (1986); Dawid and Sebastiani (1999). A scoring rule S(y, Q) is a special case of a loss function L(y, a), measuring the negative worth of taking an act a in some action space A when the variable Y turns out to have value y. In this special case, A is the set of probability distributions. Conversely, given any action space and loss function, we can construct an associated proper scoring rule S(y, Q) := L(y, a_Q), where a_Q is a Bayes act against Q, thus minimising L(Q, a) := E_{Y∼Q} L(Y, a). This extends stacking to general decision problems.
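The construction S(y, Q) := L(y, a_Q) can be illustrated for squared-error loss, whose Bayes act is the mean of Q. The sketch below (hypothetical setup, Monte Carlo approximation) checks numerically that quoting one's true distribution minimises the expected loss, which is exactly what makes the induced scoring rule proper.

```python
import random

def bayes_act_mean(sample_q):
    """For squared-error loss L(y, a) = (y - a)^2, the Bayes act against
    a forecast distribution Q is its mean, so the induced scoring rule
    is S(y, Q) = (y - E_Q[Y])^2."""
    return sum(sample_q) / len(sample_q)

def mc_expected_loss(sample_p, act):
    # Monte Carlo estimate of E_{Y ~ P} L(Y, act)
    return sum((y - act) ** 2 for y in sample_p) / len(sample_p)

rng = random.Random(1)
p_sample = [rng.gauss(2.0, 1.0) for _ in range(20000)]   # Nature's P = N(2, 1)
honest_score = mc_expected_loss(p_sample, bayes_act_mean(p_sample))  # quote Q = P
off_score = mc_expected_loss(p_sample, 3.0)              # quote a forecast with mean 3
```

The honest quote attains (approximately) the variance of P, while any off-centre quote is penalised by the squared bias.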

Prequential strategies
However, when we take probability forecasting seriously, there are more principled ways to proceed. Rather than assessing forecasts using a cross-validatory approach, which, though popular, has little foundational theory and does not easily extend beyond the context of independent identically distributed ("IID") observations, we can conduct prequential (predictive sequential) assessment. This considers the observations in sequence,² and at any time t constructs a probabilistic prediction P_{t+1} for the next observable Y_{t+1}, based on the currently known outcomes y^t = (y_1, ..., y_t). Any method of doing this is a prequential strategy. Many (though by no means all) strategies can be formed by applying some principle to a parametric statistical model M = {P_θ : θ ∈ Θ} for the sequence of observables (Y_1, Y_2, ...); the model need not incorporate independence, and (in contrast to all the approaches mentioned in § 2 of the paper) not only need not be considered as containing the "true generating process", but does not even require that such a process exist (the "M-absent" case). Example M-based strategies are the "plug-in" density forecast p_{t+1} = p(y_{t+1} | y^t; θ̂_t), with θ̂_t the maximum likelihood estimate based on the current data y^t, and the Bayesian density forecast p_{t+1} = ∫ p(y_{t+1} | y^t; θ) π_t(θ) dθ, with π_t(θ) the posterior density of θ given y^t.

* University of Cambridge, apd@statslab.cam.ac.uk
¹ Various notations occur in the literature. Mine here, which, compared with the paper, interchanges the arguments and takes the negative, is perhaps the most traditional.
² If there is no natural sequence, they can be ordered arbitrarily; the specific ordering typically makes little or no difference.
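The plug-in and Bayesian prequential strategies can be sketched for a simple N(θ, 1) model with a conjugate normal prior (all numbers hypothetical); each strategy is charged the negative log density it assigned to the next observation before seeing it.

```python
import math, random

def prequential_log_scores(ys, prior_var=100.0):
    """Total prequential log scores for a N(theta, 1) model on i.i.d. data.
    'plug' uses the MLE (running mean) as a plug-in forecast;
    'bayes' uses the conjugate N(0, prior_var) prior, whose predictive
    for the next point is N(post_mean, 1 + post_var)."""
    def neg_log_normal(y, mu, var):
        return 0.5 * math.log(2 * math.pi * var) + (y - mu) ** 2 / (2 * var)
    plug, bayes, s = 0.0, 0.0, 0.0
    for t, y in enumerate(ys):
        mle = s / t if t > 0 else 0.0            # MLE undefined at t=0; use 0
        post_var = 1.0 / (1.0 / prior_var + t)   # conjugate normal update
        post_mean = post_var * s
        plug += neg_log_normal(y, mle, 1.0)
        bayes += neg_log_normal(y, post_mean, 1.0 + post_var)
        s += y
    return plug, bayes

rng = random.Random(2)
data = [rng.gauss(1.5, 1.0) for _ in range(200)]
plug_score, bayes_score = prequential_log_scores(data)
```

Both strategies are scored on the same data, so their totals are directly comparable; after a short burn-in the two forecasts nearly coincide.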

M-closed case
Purely as a sanity check, it can be shown that the above M-based strategies are typically consistent for hypothetical data generated from some distribution in M (a fictitious M-closed scenario). Straightforward extensions base forecasts on a discrete collection M of parametric statistical models, and exhibit consistency for fictitious data generated from any distribution in any of these. Another possible M-based prequential strategy in this case (though computationally demanding, and restricted to IID models) might be based on forecasting Y_{t+1} by applying the paper's stacking methodology (using a suitable scoring rule) to the partial data y^t. It would be interesting to investigate its sanity under the fictitious M-closed assumption.

M-open case
Consider a model M = {P_θ}, and a hypothetical generating distribution Q, not necessarily in M. All these distributions can exhibit dependence between observations. For data y^T the "best-fitting" value of θ is θ*_T, minimising the cumulative discrepancy $\sum_{t=1}^{T} d(Q_t, P_{\theta,t})$, where P_{θ,t} [resp., Q_t] is the forecast distribution for Y_t given y^{t−1} under P_θ [resp., Q], and d is the discrepancy function associated with the proper scoring rule S.³ In the M-closed case that Q = P_{θ_0} ∈ M, for any choice of S we will have θ*_T = θ_0; but otherwise θ*_T will typically depend on S, which is why it is important to choose S appropriate to the context, and not, say, to assess an estimate based on log-score using a quadratic criterion. Moreover, in non-IID cases θ*_T is typically data-dependent. In any case θ*_T depends on the unknown (even fictional) Q. However, we could estimate θ*_T by the "best-performing" value θ**_T, minimising the empirical score $\sum_{t=1}^{T} S(y_t, P_{\theta,t})$. Under very broad conditions (typically not requiring e.g. ergodicity) this will be consistent even in the M-open case, in the sense that, no matter what Q may be, θ**_T − θ*_T → 0 as T → ∞, almost surely under Q (Skouras and Dawid, 2000). Again, this M-open sanity check (which subsumes the M-closed sanity check) extends to the case of a collection M of models; and again it would be interesting to check whether a prequential stacking strategy could preserve this property.

M-absent case
The most incisive test of a forecasting method is its performance in the M-absent case. The overall (negative) performance of any strategy on a full data-set y^T can be assessed by its total prequential score $\sum_{t=1}^{T} S(y_t, P_t)$, which judges forecasts relative to actual data, so allowing direct data-driven comparison of strategies. Consider now a model M = {P_θ}, and (for data y^T) the best-performing value θ**_T as defined in § 2.2. There is a wealth of theory (again, woefully ignored by most statisticians), developed in the discipline of "prediction with expert advice" (Vovk, 2001), which shows that, under suitable conditions, we can construct a prequential strategy P which for any data whatsoever will perform essentially as well as P_{θ**}; for example (Cesa-Bianchi and Lugosi, 2006; Rakhlin et al., 2015), such that, for any infinite data sequence y = (y_1, y_2, ...),
$$\frac{1}{T} \sum_{t=1}^{T} S(y_t, P_t) - \frac{1}{T} \sum_{t=1}^{T} S(y_t, P_{\theta^{**}_T, t}) \to 0.$$
By adding probabilistic assumptions about the origin of y we can then (if so desired) recover some of the consistency results in § 2.1 and § 2.2 above. And again extension to forecasts based on a collection M of models is straightforward. I wonder how well stacking would perform by this criterion?
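The flavour of these expert-advice guarantees can be illustrated with the exponential-weights strategy under log loss, where the aggregate's cumulative score provably exceeds the best expert's by at most log K on any data sequence whatsoever (the forecast sequence below is a hypothetical setup).

```python
import math, random

def mixture_log_loss(expert_probs):
    """'Prediction with expert advice' under log loss: the exponential-weights
    aggregating strategy coincides with the Bayesian mixture of the experts,
    and its cumulative log loss exceeds the best expert's by at most log K
    on any data sequence. expert_probs[t][k] is the probability expert k
    assigned to the outcome that actually occurred at time t."""
    K = len(expert_probs[0])
    w = [1.0 / K] * K                    # uniform prior over experts
    total = 0.0
    for probs in expert_probs:
        p_mix = sum(wk * pk for wk, pk in zip(w, probs))
        total += -math.log(p_mix)        # loss of the aggregate forecast
        w = [wk * pk for wk, pk in zip(w, probs)]   # posterior update
        z = sum(w)
        w = [wk / z for wk in w]
    best = min(-sum(math.log(p[k]) for p in expert_probs) for k in range(K))
    return total, best

rng = random.Random(3)
# two experts forecasting 300 outcomes; probabilities of the realised outcome
seq = [[rng.uniform(0.2, 0.9), rng.uniform(0.2, 0.9)] for _ in range(300)]
mix_loss, best_loss = mixture_log_loss(seq)
```

Note that the aggregate can even beat the best single expert (the mixture effect discussed elsewhere in this discussion volume); only the upper bound on its regret is guaranteed.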

Contributed Discussion
William Weimin Yoo *

In this paper, Yao et al. (2018) consider the problems of model selection and aggregation of different candidate models for inference. Inspired by the stacking-of-means method proposed in the frequentist literature, the authors generalize this idea by developing a procedure to stack Bayesian predictive distributions. Given a list of candidate models with their corresponding predictive distributions, the goal is to find a linear combination of these distributions that is as close as possible to the true data generating distribution under some score criterion. To find the linear combination (model) weights, they replace the full predictive distributions with their leave-one-out (LOO) versions in the objective function, and proceed to solve this convex optimization problem. The authors then propose a further approximation to the LOO computation by importance sampling, with the importance weights obtained by fitting a generalized Pareto distribution. To showcase the benefits of the newly proposed stacking method, the authors conducted extensive simulation studies and data analyses, comparing with other techniques in the literature such as BMA (Bayesian Model Averaging), BMA with LOO (pseudo-BMA), BMA with LOO and the Bayesian bootstrap, and others.
We can take a graphical modeling perspective on LOO. For example, replacing the marginal likelihoods p(y|M_k) with $\prod_{i=1}^{n} p(y_i \mid y_j : j \neq i, M_k)$ is akin to simplifying a complete (fully connected) graph linking observations to one where a Markov assumption holds, i.e., the node corresponding to y_i is conditionally independent of the rest given its neighbors {y_j : j ≠ i}. In the proposed stacking method, the full predictive distribution p(ỹ|y) is approximated by the LOO predictive p(y_i | y_j : j ≠ i), and further approximation is needed because the LOO is typically expensive to compute. However, if there is some structure in the data, such as clusters of data points/nodes, then one can take advantage of this by conditioning on a smaller cluster B of nodes around y_i and computing instead p(y_i | y_j : j ≠ i, j ∈ B). This would further speed up computations, as one can fit models using only local data points.
Another point I would like to make is that the superb performance of the stacking method warrants further theoretical investigation. Figure 2(c) in the simulations shows that the proposed method is robust to incorrect/bad models, in the sense that its performance stays unchanged even as more incorrect models are added to the list. It would be nice to also have some theoretical guarantee that the stacking method will concentrate on the correct model (M-closed) or the best models (M-complete) in the model list. In addition, Figure 9 shows that this method is able to "borrow strength" across different models, by using some aspects of one model to enhance the performance of a different model. Therefore aggregation by stacking adds value by bringing out the best of each individual model component, and it would be interesting to characterize through theory what this added value is. This then invites us to reflect on how the quality of each individual model component affects the final stacked distribution. For example, given posterior contraction rates for the different posterior models, what would be the aggregated rate for the stacked posterior? Also, how do prediction/credible sets constructed using the stacked posterior compare with those constructed from each model component: will they be bigger, smaller, or something in between?

* Mathematical Institute, Leiden University, The Netherlands, yooweimin0203@gmail.com
As most existing methods, including the newly proposed stacking method, use some form of linear combination, it would be interesting to find other ways of aggregating. As pointed out by the authors, linear combinations of predictive distributions will not be able to reproduce truths that are generated through some other means, e.g., convolutions. To apply stacking in the convolutional case, I think one way is to do everything in the Fourier domain: stack the log Fourier transforms (i.e., log characteristic functions) of the posterior predictive densities, exponentiate, and then apply the inverse Fourier transform to approximate a truth generated through convolution.
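As a sketch of this Fourier-domain idea, the snippet below uses the analytic characteristic functions of two normal densities (rather than numerical transforms of posterior draws): adding their log characteristic functions reproduces the characteristic function of the convolution exactly, which a linear mixture of the densities cannot do.

```python
import cmath

def normal_cf(mu, var):
    """Characteristic function of N(mu, var): exp(i*mu*t - var*t^2/2)."""
    return lambda t: cmath.exp(1j * mu * t - 0.5 * var * t * t)

def stack_log_cfs(cfs):
    """Aggregate densities by *adding* their log characteristic functions,
    which corresponds to convolving the densities (summing independent
    draws), not to taking a linear mixture of them."""
    def combined(t):
        return cmath.exp(sum(cmath.log(cf(t)) for cf in cfs))
    return combined

cf1, cf2 = normal_cf(1.0, 0.25), normal_cf(2.0, 1.0)
conv_cf = stack_log_cfs([cf1, cf2])
target_cf = normal_cf(3.0, 1.25)   # convolution of N(1, 0.25) and N(2, 1.0)
```

In the actual application one would work with empirical characteristic functions of the posterior predictive draws, but the algebraic identity is the same.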
I think another possible area of application for the stacking method is distributed computing. In this modern big-data era, data has grown to such a size that it is sometimes infeasible or impossible to analyze it using standard Markov chain Monte Carlo (MCMC) algorithms on a single machine. Hence the divide-and-conquer strategy, where the data are first divided into batches and a (sub)posterior distribution is computed for each batch. The final posterior for inference is then obtained by aggregating these (sub)posteriors. To this end, I think the present stacking method can be deployed after some modifications, with the potential to yield superior performance compared with existing weighted-average-type approaches.
I find the paper to be very interesting, and the stacking method proposed is a key contribution to the Bayesian model averaging/selection literature. It is shown to be superior to the gold standard, i.e., BMA, and its finite sample performance is tested comprehensively through a series of numerical experiments and real data analyses. I think this is a very promising research direction, and any future contributions are very welcome.

Contributed Discussion
Robert L. Winkler *, Victor Richmond R. Jose †, Kenneth C. Lichtendahl Jr. ‡, and Yael Grushka-Cockayne §

Yao et al. (2018) propose a stacking approach for aggregating predictive distributions and compare it with several alternatives, such as Bayesian model averaging and mean stacking. Because distributions are more informative than point estimates, and aggregating distributions can increase the information upon which decisions are made, any paper that encourages the use of distributions and their aggregation is important and most welcome. Therefore, we applaud the authors for bringing attention to these issues and for their careful approach to distribution stacking using various scoring rules.
To stack distributions, the authors propose the use of convex combinations of competing models' cdfs. The different cdf stackers they consider vary in the (non-negative) weights assigned to each model in the combination. This cdf stacking approach is framed by the logic of Bayesian model averaging: there is a prior distribution over which model is correct and the data are generated by this one true model. The result is a convex combination of the competing models, where the weights follow from the prior distribution over which model is correct.
The idea of a "true model" is in the spirit of hypothesis testing in the sense that it is black and white: one truth that we try to identify. In most real-world situations, however, the existence of a "true model" is highly doubtful at best. A better approach may be to think in terms of information aggregation from multiple inputs/forecasters/models. Under this approach, no base model is true, but all provide some useful information that is worth combining. The following example illustrates this point and motivates the idea of aggregating information with a quantile stacker.
Suppose there are k ≥ 2 forecasters. Forecaster i privately observes the sample $x_i = (x_{N_{i-1}+1}, \ldots, x_{N_i})$ of size $n_i$ for $i = 1, \ldots, k$, where $N_i = \sum_{j=1}^{i} n_j$. Each forecaster will report their quantile function Q_i for the uncertain quantity of interest $x_{N_k+1}$. The data are drawn from the normal-normal model: $\mu \sim N(\mu_0, m\lambda)$ and $(x_j \mid \mu) \sim_{\mathrm{iid}} N(\mu, \lambda)$ for $j = 1, \ldots, N_k + 1$, where λ denotes the precision.
Forecaster i's posterior-predictive distribution is $(x_{N_k+1} \mid x_i) \sim N(\mu_i, \lambda_i)$, where $\mu_i = (m\mu_0 + \sum_{j=1}^{n_i} x_{N_{i-1}+j})/(m + n_i)$ and $\lambda_i = (m + n_i)/(m + n_i + 1)\,\lambda$ (Bernardo and Smith, 2000, p. 439). The corresponding quantile function is $Q_i(u) = \mu_i + \lambda_i^{-1/2} \Phi^{-1}(u)$, where Φ is the standard normal cdf. Once the decision maker hears the forecasters' updated quantile functions, his updated beliefs yield an aggregate quantile function of the form
$$Q(u) = a_0 \mu_0 + \sum_{i=1}^{k} a_i \mu_i + \sum_{i=1}^{k} b_i Q_i(u). \qquad (1)$$
According to the example above, the result of aggregating the forecasters' information is mean/quantile stacking. The mean/quantile stacker in (1) is a linear combination of the prior-predictive mean, the forecasters' posterior-predictive means, and the forecasters' posterior-predictive quantiles. Interestingly, only the weights on the quantiles are necessarily positive. Choosing this linear combination to optimize a quantile scoring rule (Jose and Winkler, 2009; Gneiting, 2011; Grushka-Cockayne et al., 2017), such as the pinball loss function, may be a useful way to choose the weights.
In fact, a quantile stacker solves the example in Yao et al.'s Section 4.1 exactly and would be perfectly calibrated. The quantile stacker $0.6\,Q_3(u) + 0.4\,Q_4(u)$, where $Q_i(u) = \mu_i + \Phi^{-1}(u)$ and $\mu_i = i$, yields the aggregate quantile function $3.4 + \Phi^{-1}(u)$, which is the quantile function of the true data-generating process in Section 4.1. A quantile stacker may also work well in the regression example of Section 4.2.
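The exact recovery claimed above can be checked numerically; this sketch uses Python's stdlib NormalDist for the normal quantile (inverse cdf) function.

```python
from statistics import NormalDist

def quantile_stack(quantile_fns, weights):
    """Average quantile functions pointwise: Q(u) = sum_i w_i Q_i(u)."""
    def q(u):
        return sum(w * qf(u) for w, qf in zip(weights, quantile_fns))
    return q

# Models 3 and 4 from the Section 4.1 example: N(3, 1) and N(4, 1)
q3 = NormalDist(3.0, 1.0).inv_cdf
q4 = NormalDist(4.0, 1.0).inv_cdf
stacked = quantile_stack([q3, q4], [0.6, 0.4])
target = NormalDist(3.4, 1.0).inv_cdf   # quantile fn of the true N(3.4, 1)
```

Because both components share the same unit scale, 0.6(3 + z) + 0.4(4 + z) = 3.4 + z for every quantile level, so the stacked quantile function matches the truth exactly.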
When aggregating model forecasts of a continuous quantity of interest, each model may be well calibrated to the training data, yet the models may differ in sharpness. In this case, a result of Hora (2004) applies: a convex combination of well-calibrated models' cdfs cannot itself be well calibrated (unless the cdfs are all identical). Typically, the convex combination of cdfs leads to an underconfident aggregate cdf (Lichtendahl Jr et al., 2013). The same holds for forecasts of a binary event; see Ranjan and Gneiting (2010). Because the average quantile forecast is always sharper than the average probability forecast (Lichtendahl Jr et al., 2013), the average quantile may be better calibrated when the average cdf is underconfident.
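The sharpness contrast between the two averaging operations can be seen in a toy case. The following sketch (standard-library Python; the two equally weighted normal forecasts N(-1, 1) and N(1, 1) are an invented illustration, not from the cited papers) compares the quantile (horizontal) average, which for normals gives N(0, 1), with the probability (vertical) average, whose mixture variance is 2:

```python
import math
from statistics import NormalDist

f1, f2 = NormalDist(-1, 1), NormalDist(1, 1)

# Quantile (horizontal) average: average the quantile functions.
# For normals this yields N((mu1 + mu2)/2, 1): the inputs' sharpness is kept.
def q_avg(u):
    return 0.5 * f1.inv_cdf(u) + 0.5 * f2.inv_cdf(u)

# Probability (vertical) average: the 50/50 mixture cdf. Its variance is
# E[X^2] - (E[X])^2 = 0.5*(1 + 1) + 0.5*(1 + 1) - 0^2 = 2.
mix_var = 0.5 * ((-1) ** 2 + 1) + 0.5 * (1 ** 2 + 1) - 0.0 ** 2

# The quantile average coincides with the N(0, 1) quantile function,
# so its standard deviation (1) is smaller than the mixture's sqrt(2):
for u in (0.1, 0.5, 0.9):
    assert abs(q_avg(u) - NormalDist(0, 1).inv_cdf(u)) < 1e-12
assert math.sqrt(mix_var) > 1.0
```

This is the Lichtendahl et al. (2013) point in miniature: the vertical average spreads probability outward (underconfidence), while the horizontal average preserves the components' spread.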
In the case of aggregating binary-event forecasts, such as the example in Yao et al.'s Section 4.6, it might be optimal to transform the probabilities prior to combining them in a generalized linear model. Lichtendahl Jr et al. (2018) propose a Bayesian model in the spirit of the example given here. The model results in a generalized additive model for combining the base models' binary-event forecasts.
We are thankful that the authors highlight the importance of combining predictive distributions, and we hope this paper stimulates further work in this area.

Combination of Predictive Densities
The authors address model combination when the data-generating process is unavailable (M-complete) or unattainable (M-open). A starting point is that Bayesian model averaging (BMA) is widely applied in contexts that violate the M-closed assumptions, the assumptions under which it is the "right way" to combine predictive densities. The failure of BMA is highlighted by the authors in examples. Some of our own interests are in time series forecasting, with multiple examples of BMA degenerating quickly to the wrong model and failing to adapt to temporal shifts in the data. This is not a failure of Bayesian thinking, of course; the message remains follow-the-rules, but understand when failed assumptions no longer justify blind application of the rules. The authors' methodology relates to other methods responding to this. For example, optimal pooling, as referenced by the authors, and the DeCo approach (Aastveit et al., 2018) address the problem explicitly in the M-incomplete setting. While focused on time series examples, the conceptual basis of these works is of course much broader.
The authors' development of density combination methods based on prior uses of stacking in point estimation also parallels the historical development of Bayesian density combination methods following the literature on combination of point forecasts (e.g., Hall and Mitchell, 2007; Aastveit et al., 2014; Kapetanios et al., 2015; Aastveit et al., 2018; Del Negro et al., 2016, and references therein). Since then, much of the density combination literature has grown from somewhat algorithmic and empirical model-based perspectives. We regard the stacking approach as largely adopting this perspective. While the authors argue cogently for the construction, connecting it with what they term "pseudo"-Bayesian thinking and aspects of asymptotics, the import of the work rests largely, and strongly, on the examples and empirical evaluation. We are led to ask: is this new approach to "averaging Bayesian predictive distributions" really Bayesian?

Bayesian Predictive Synthesis
The recently developed approach of Bayesian predictive synthesis (BPS) is explicitly founded on subjective Bayesian principles and theory, and defines an over-arching framework within which many existing density (and other) combination "rules" can be recognized as special cases. Critically, this provides opportunity to understand the implicit Bayesian assumptions underlying special cases. BPS links to past literature on subjective Bayesian "agent/expert opinion analysis" (e.g., Genest and Schervish, 1985; West and Crosse, 1992; West, 1992) and provides a formal Bayesian framework that regards predictive densities from multiple models (or individuals, agencies, etc.) as data to be used in prior-posterior updating by a Bayesian observer (see also West and Harrison, 1997, Sect. 16.3.2). The approach allows for the integration of other sources of information and explicitly provides the ability to deal with M-incompleteness. A main theoretical component of BPS is a general theorem describing a subset of Bayesian analyses showing how densities can be "synthesized". Special cases include traditional BMA, most existing forecast pooling rules, and, in terms of theoretical construction, the stacking approach in the article.
* Booth School of Business, University of Chicago, kenichiro.mcalinn@chicagobooth.edu † Norges Bank, BI Norwegian Business School, knut-are.aastveit@norges-bank.no ‡ Department of Statistical Science, Duke University, mike.west@duke.edu
In  and , BPS is developed for time series forecasting, where the underlying Bayesian foundation defines a class of dynamic latent factor models. The sequences of predictive densities define time-varying priors for inherent latent factor processes linked to the time series of interest. BPS is able to learn and adapt to the biases, aspects of mis-calibration, and, critically, the inter-dependences among predictive densities. A further practical key point is that BPS can, and should, be defined with respect to specific predictive goals; this is a point of wider import presaged in the earlier Bayesian macroeconomics literature and illustrated in McAlinn and West (2018) and related work through separate forecast combination models for multiple different goals (multiple-step-ahead forecasting). Applications to macroeconomic forecasting in these papers demonstrate that a class of proposed BPS models can significantly improve over conventional methods (including BMA and other pooling/weighting schemes). Further, as BPS is a fully specified Bayesian model within which the information from each of the sources generating predictive densities is treated as (complicated) "covariates," posterior inference on the (time-varying or otherwise) parameters weighting and relating the sources provides direct access to inferences on their biases and inter-dependencies.
It is of interest to consider how the current stacking approach relates to BPS through an understanding of how the resulting rule for predictive density combination can be interpreted in BPS theory (see equation (1), and the discussion thereafter, in McAlinn and West 2018). As with other combination rules, an inherent latent factor interpretation is implied, and this may provide opportunity for further development. In related work with BPS based on mixture models, Johnson and West (2018) highlight opportunities both to improve resulting predictions and to generate insights into the practical impact of model inter-dependencies that are largely ignored by other approaches. This can be particularly important in dealing with larger numbers of predictive densities when the underlying models generating the densities are known or expected to have strong dependencies (a topic touched upon in Section 5.3 of the article).

Contributed Discussion
Minsuk Shin *

Overview

I congratulate the authors on an improvement in the predictive performance of Bayesian model averaging. This improvement is especially significant under M-open settings (Bernardo and Smith, 2009), in which we know that the true model is not one of the considered candidate models. As the authors pointed out, it is well known that the weights of standard Bayesian model averaging (BMA) asymptotically concentrate on the single model that is closest to the true model in Kullback-Leibler (KL) divergence. This is problematic under M-open settings, because a single (wrong) model would dominate the predictive inference, and its predictive performance might suffer as a result.
To address this issue, the authors bring the stacking idea to BMA. Instead of minimizing the mean squared error of a point estimate as in the original stacking procedures, they propose evaluating the model weights that maximize an empirical scoring rule based on leave-one-out (LOO) predictive densities. This results in a convex combination of models that is empirically close (in terms of the considered scoring rule or divergence function) to the true model that generates the observed data. The authors also circumvent the computational intensity of the LOO procedure by adopting Pareto smoothed importance sampling. This importance sampling procedure stabilizes the importance weights by replacing the largest ratios with order statistics of a fitted generalized Pareto distribution, so refitting each model n times can be avoided.
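Concretely, the stacking weights maximize $\sum_i \log \sum_k w_k\, p(y_i \mid y_{-i}, M_k)$ over the simplex. Since this objective is formally a mixture log-likelihood with fixed component densities, a simple EM-style multiplicative update converges to the optimum. The sketch below (Python; the LOO density matrix is a hypothetical toy, and in practice the densities would come from PSIS-LOO and the authors use a convex optimizer rather than EM) illustrates the computation:

```python
def stacking_weights(loo_dens, iters=200):
    """loo_dens[i][k] = p(y_i | y_{-i}, M_k), the LOO predictive density of
    point i under model k. Returns simplex weights maximizing the summed
    log score sum_i log sum_k w_k * loo_dens[i][k] via EM updates."""
    n, K = len(loo_dens), len(loo_dens[0])
    w = [1.0 / K] * K
    for _ in range(iters):
        # EM update for mixture weights with fixed component densities:
        # w_k <- (1/n) * sum_i  w_k p_ik / sum_l w_l p_il
        new = [0.0] * K
        for row in loo_dens:
            denom = sum(wk * pk for wk, pk in zip(w, row))
            for k in range(K):
                new[k] += w[k] * row[k] / denom
        w = [v / n for v in new]
    return w

# Toy example (hypothetical numbers): model 2 assigns uniformly higher
# LOO density to every point, so its weight should dominate.
dens = [[0.10, 0.40], [0.20, 0.50], [0.15, 0.45], [0.10, 0.30]]
w = stacking_weights(dens)
assert w[1] > 0.9 and abs(sum(w) - 1.0) < 1e-9
```

Each update preserves the simplex constraint and monotonically increases the log score, which makes the procedure a convenient stand-in when a dedicated convex solver is unavailable.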

Some Issues in Extensions to High-dimensional Model Selection
I think the proposed work is very interesting in situations where the number of models is fixed as n increases. It would also be interesting to investigate theoretical properties of the stacking procedure in high-dimensional model (or variable) selection regimes (Narisetty et al., 2014; Castillo et al., 2015). In the theory of high-dimensional variable selection, it is not uncommon to assume that the number of predictors, say p, increases at a certain rate with n. When p is fixed and only n increases, the asymptotic results in Section 3.2 should hold. However, when p increases faster than n, the uniform convergence (over models) of the estimator of the scoring rule may not hold, because the total number of models, $2^p$, increases at an exponential rate in p. It is well known that the Akaike Information Criterion (AIC) is asymptotically equivalent to LOO, and AIC is not consistent for model selection even in low-dimensional settings. So the stacking procedure based on LOO might not be optimal for selecting the best model in high-dimensional settings, and might suffer from high false discovery rates.
Also, regarding computational aspects of high-dimensional variable selection (for linear regression models), it would be interesting to consider a stochastic search algorithm such as shotgun stochastic search (SSS) (Hans et al., 2007), which was developed for standard Bayesian variable selection. By slightly modifying the SSS algorithm, one might be able to explore models having high predictive densities. However, the number of candidate models is enormous in high-dimensional settings, and approximating the predictive densities of all candidate models is computationally intensive even when the Pareto smoothed importance sampling technique is used. A possible option might be a two-step procedure that first collects a number of models using a standard Bayesian variable selection procedure (George and McCulloch, 1993; Raftery et al., 1997; Hans et al., 2007), and then evaluates the stacking weights on the pre-selected models. However, standard Bayesian variable selection procedures choose models with large posterior model probability, which is proportional to the marginal likelihood, so the pre-selected models might not be optimal for prediction.

Conclusion
Once again, I would like to congratulate the authors of this paper, and I think that it would be interesting to extend the idea to high-dimensional model selection problems.

Contributed Discussion
Tianjian Zhou *

I congratulate the authors for an insightful discussion of combining predictive distributions using stacking. Consider a set of predictive distributions built from multiple models, $p(\tilde y \mid y, M_k)$ for $k = 1, \ldots, K$. A combination of these predictive distributions has the form $\hat p(\tilde y \mid y) = \sum_{k=1}^{K} w_k\, p(\tilde y \mid y, M_k)$. The coherent Bayesian method of choosing the weights $w_k$ is Bayesian model averaging (BMA) with posterior model probabilities. However, this approach has several limitations. The authors therefore propose stacking of predictive distributions, which chooses the weights by minimizing the divergence of $\hat p(\tilde y \mid y)$ from the true distribution.

Model Space and Computational Complexity
Some considerations that arise for BMA also apply to stacking, for example, the choice of candidate models. For a regression analysis with p predictors, there are $2^p$ possible linear subset regression models. One could potentially consider most or all of them, as in Section 4.5. Sometimes it is also necessary to consider interaction and nonlinear terms, as in Section 4.6. Thus, the size of the model space can be enormous, making computation infeasible for reasonably large p. Similar strategies as for implementing BMA could potentially be used for stacking. In all examples, the candidate models belong to the same model family (e.g., regression models). It would be interesting to see whether performance can be further improved by combining models that belong to different model families, for example, linear regression models and regression tree models.
Alternatives

As discussed by the authors, it is sometimes preferable to fit a supermodel that includes the separate models as special cases, for example, a regression model that includes all predictors and possible interaction and nonlinear terms, using a spike-and-slab prior for the regression coefficients. I would like to point out another interesting alternative: Bayesian nonparametric (BNP) models. BNP models do not necessarily include the separate models as special cases, but they can be chosen to put positive prior probability mass everywhere, that is, within arbitrary neighborhoods of any model. In that sense, BNP models are "always right". Importantly, BNP priors make fewer parametric assumptions and are highly flexible. For example, for Section 4.1, a natural choice would be the widely used Dirichlet process mixture model. For Section 4.6, a natural choice could be a Bayesian additive regression tree (BART) model or a Gaussian process prior. Of course, BNP models have their own limitations, such as often difficult implementation, complex computation, and possibly poor performance with small sample sizes. A possible way out could be to include BNP models as candidate models in stacking. Little would change in the overall setup.
* Research Institute, NorthShore University HealthSystem, tjzhou95@gmail.com
Applications

I would like to suggest potential applications of the proposed method to biomarker discovery and the prediction of patient outcomes. A biomarker is a measurable biological signal whose presence is indicative of some outcome or disease. Examples of biomarkers include physical characteristics such as height, weight, and blood pressure, as well as genomic-level measurements such as gene expression. Outcomes of interest include, for example, the presence or progression of cancer and the risk of mortality or readmission. The goals are: (1) finding a subset of biomarkers that are highly predictive of the outcome of interest, and (2) predicting the outcome of interest for a future patient. Traditional variable selection approaches ignore model uncertainty and use a single set of selected biomarkers for prediction. Model combination approaches, on the other hand, report a set of possible models and average over these models for prediction. Therefore, model combination approaches might give more sensible biomarker selections and have better prediction accuracy. There are some existing works using BMA (e.g., Volinsky et al., 1997, and Yeung et al., 2005). Considering the better performance of stacking relative to BMA, stacking should be considered for the same applications and could improve existing analyses.
There are some challenges in using stacking that are specific to this application. First, the number of candidate biomarkers can be large, especially for genomic-level biomarkers. Biomarkers can also have non-additive and nonlinear effects on the outcomes. Second, missing data are common. Third, it is very common that the patient population is heterogeneous, and different subgroups of patients might call for different outcome-predictive biomarkers. I have discussed the first issue in the second paragraph. For the second issue, the candidate models should be able to handle missing data. To reduce the sensitivity of the analysis to missing data, we could average over multiple models with different strategies for dealing with the missing data and different assumptions about the missingness. Similarly, for the third issue, some candidate models should be included that account for the possibility of subgroup effects.

Contributed Discussion *
Lennart Hoogerheide † and Herman K. van Dijk ‡

Basic practice and probabilistic foundation

A basic practice in macroeconomic and financial forecasting¹ is to make use of a weighted combination of forecasts from several sources, say models, experts, and/or large micro-data sets. Let $y_t$ be the variable of interest, and assume that some form of predictive values $\tilde y_{1t}, \ldots, \tilde y_{nt}$ is available for $y_t$, with a set of weights $w_{1t}, \ldots, w_{nt}$, where n is also a decision variable. Then, basic practice in economics and finance is to make use of the weighted combination
$$\sum_{i=1}^{n} w_{it}\, \tilde y_{it}. \qquad (1)$$
A major purpose of academic and professional forecasting is to give this practice a probabilistic foundation in order to quantify the uncertainty of such predictive density features as means, volatilities, and tail behaviour. A leading example of a forecast density being produced and used in practice is the Bank of England's fan chart for GDP growth and inflation, which has been published each quarter since 1996. For a survey of the evolution of density forecasting in economics, see Aastveit et al. (2018); for a related formal Bayesian foundational motivation, see .

Proposal by Yao, Vehtari, Simpson and Gelman
In recent literature and practice in statistics as well as econometrics, it has been shown that Bayesian model averaging (BMA) has its limitations for forecast averaging; see the earlier reference for a summary of the literature in economics. The authors focus in their paper on the specific limitation of BMA when the true data-generating process is not in the model set, and they also indicate the sensitivity of BMA to weakly informative or non-informative priors. As a better approach in terms of forecast accuracy and robustness, the authors propose the use of stacking, which originates in point estimation, and extend it to combinations of predictive densities. A key step in the stacking procedure is an optimisation step that determines the weights of a mixture model in such a way that the averaging method is relatively robust to misspecified models, in particular in large samples.
Dynamic learning to average predictively

We fully agree that BMA has the earlier mentioned restrictions. However, we argue that a static approach to forecast averaging, as suggested by the authors, will in many cases remain sensitive to the presence of a bad forecast and extremely sensitive to a very bad forecast. We suggest extending the approach of the authors to a setting in which learning about features of predictive densities of possibly incomplete or misspecified models can take place. This extension will improve the process of averaging over good and bad forecasts. To back up our suggestion, we summarise how this has been developed in empirical econometrics in recent years by , Casarin et al. (2018), and Baştürk, Borowska, Grassi, Hoogerheide, and Van Dijk (2018). Moreover, we show that this approach can be extended to combining not only forecasts but also policies. The technical tools necessary for the implementation come from filtering methods in the nonlinear time series literature, and we show their connection with dynamic machine learning.
* This comment should not be reported as representing the views of Norges Bank. The views expressed are those of the authors and do not necessarily reflect those of Norges Bank. An extended version is available as a Tinbergen DP. † VU University Amsterdam and Tinbergen Institute, l.f.hoogerheide@vu.nl ‡ Erasmus University Rotterdam, Norges Bank and Tinbergen Institute, hkvandijk@ese.eur.nl ¹ In applied statistics and economics the usual terminology is forecasting instead of prediction; in more theoretical work the word prediction is common. In this comment we use both terminologies interchangeably.
The fundamental predictive density combination

Let the predictive probability distribution of the variable of interest $y_t$ in (1), given the set $\tilde y_t = (\tilde y_{1t}, \ldots, \tilde y_{nt})$, be specified as a large discrete mixture of conditional probabilities of $y_t$ given $\tilde y_t$ coming from n different models, with weights $w_t = (w_{1t}, \ldots, w_{nt})$ that are interpreted as probabilities and form a convex combination. One can then give (1) a stochastic interpretation using mixtures. Such a probability model, in terms of densities, is given as
$$p(y_t \mid \tilde y_t, w_t) = \sum_{i=1}^{n} w_{it}\, p(y_t \mid \tilde y_{it}). \qquad (2)$$
Let the predictive densities from the n models be denoted as $f(\tilde y_{it} \mid I_i)$, $i = 1, \ldots, n$, where $I_i$ is the information set of model i. Given the fundamental density combination model (2) and the predictive densities from the n models, one can specify, under standard regularity conditions on the existence of sums and integrals, the marginal predictive density of $y_t$ as a discrete/continuous mixture, where I is the joint information set of all models. The numerical evaluation of this equation is simple when all distributions have known simulation properties. An important research line in economics and finance has been to make this approach operational in more realistic environments by allowing for model incompleteness and dynamic learning, where the densities have no known simulation properties; see the earlier cited references.
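When the component densities can be simulated, the discrete/continuous mixture can be evaluated by Monte Carlo. The following sketch (Python standard library; the two normal "model" predictive densities, the weights, and the incompleteness noise scale are hypothetical illustration choices, not taken from the cited references) estimates the marginal predictive density by averaging over draws, and checks it against the closed form available for normal components:

```python
import math
import statistics

# Hypothetical setup: n = 2 models, each delivering a normal predictive
# density f(ytilde_i | I_i); the combination puts Gaussian noise of scale
# noise_sd around each model's draw (an "incompleteness" term).
models = [statistics.NormalDist(0.0, 1.0), statistics.NormalDist(2.0, 0.5)]
w = [0.7, 0.3]          # convex combination weights
noise_sd = 0.5

def marginal_density(y, draws=5000):
    """Monte Carlo estimate of the discrete/continuous mixture
    p(y) = sum_i w_i * E_{ytilde_i ~ f_i}[ N(y; ytilde_i, noise_sd) ]."""
    total = 0.0
    for wi, fi in zip(w, models):
        sims = fi.samples(draws, seed=1)
        total += wi * sum(statistics.NormalDist(s, noise_sd).pdf(y)
                          for s in sims) / draws
    return total

# For normal components the integral is available in closed form:
# p(y) = sum_i w_i * N(y; mu_i, sqrt(sigma_i^2 + noise_sd^2)).
def marginal_density_exact(y):
    return sum(wi * statistics.NormalDist(
                   fi.mean, math.hypot(fi.stdev, noise_sd)).pdf(y)
               for wi, fi in zip(w, models))

assert abs(marginal_density(1.0) - marginal_density_exact(1.0)) < 0.05
```

This is the "simple when all distributions have known simulation properties" case; the research line cited above concerns exactly the settings where such draws and closed forms are unavailable.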

Mixtures with model incompleteness and dynamic weight learning
A first step is to introduce possibly time-varying model incompleteness by specifying a Gaussian mixture combination model with time-varying volatility, which controls the potential size of the misspecification of all models in the mixture. When the uncertainty level tends to zero, the mixture-of-experts or smoothly mixing regressions model is recovered as a limiting case; see Geweke and Keane (2007) and Jacobs et al. (1991). The weights can be interpreted as a convex set of probabilistic weights on the different models, which are updated periodically using Bayesian learning procedures. One can write the model in the form of a nonlinear state space model, which allows one to make use of algorithms based on sequential Monte Carlo methods, such as particle filters, in order to approximate the combination weights and predictive densities.

Forecast combinations, policy combinations and machine learning
To extend the predictive density combination approach to a policy combination with time-varying learning weights is a very natural step in economics. We summarise in Figure 1 how to combine forecasts and policies using a two-layer mixture. That is, we start with a mixture of predictive densities from three data-driven time series models: a vector autoregressive model (VAR), a stochastic volatility model (SV), and a dynamic factor model (DFM). These are combined with a mixture of two data-driven portfolio strategies known as momentum strategies. For background on the model and residual momentum strategies, we refer to Baştürk, Borowska, Grassi, Hoogerheide, and Van Dijk (2018). It is noteworthy that this graphical representation is similar to the one used in machine learning. In our procedure the unobserved weights are integrated out using (particle) filtering. Our empirical results (see Baştürk, Borowska, Grassi, Hoogerheide, and Van Dijk, 2018) show that the choice of the model set in a mixture is important for effective policies. We emphasise that this approach is fully Bayesian and does not contain an optimisation step as used in the stacking approach. However, the optimisation can easily be made dynamic. For a similar technique used in optimal pooling of forecasts we refer to .

Contributed Discussion

A small subset of the individual log-scores strongly influenced the model selection result. For illustration, the left panel of Figure 1 depicts the histogram of log-score differences between two example models (for each held-out point in the leave-one-out procedure), from which the mean (or sum) is usually computed. It is clear that we cannot conclude that one model is superior to the other (i.e., that zero is a bad choice for the center of this distribution). In the context of stacking, we cannot give more weight to one of these two models with any degree of confidence.
To further assess the variability inherent in the LOOCV estimate of marginal predictive performance, we bootstrapped the mean differences to compute uncertainty intervals, and we concluded that one model was better than another only if this interval did not include zero; see the right panel of Figure 1 for this computation on our dataset. The first interval in this figure corresponds to the histogram in the left panel of Figure 1. The non-spatial model 5 (M5) performs poorly, but we cannot conclude that there is a best model. In the context of stacking, the five "equivalent" models would be combined with highly arbitrary weights to create a stacked model, which we find questionable. We wonder whether combining the bootstrapped uncertainty intervals with the stacking idea (in some way) could lead to a more robust approach to stacking.
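The bootstrap check described above is straightforward to implement. The sketch below (Python standard library; the pointwise log-score differences are synthetic stand-ins for the discussants' data, which are not reproduced here) computes a percentile bootstrap interval for the mean difference:

```python
import random
import statistics

random.seed(1)

# Hypothetical pointwise LOO log-score differences between two models
# (a stand-in for the discussants' data: centered near zero, noisy).
diffs = [random.gauss(0.02, 1.0) for _ in range(100)]

def bootstrap_ci(xs, level=0.95, B=4000):
    """Percentile bootstrap interval for the mean of xs."""
    means = sorted(
        statistics.fmean(random.choices(xs, k=len(xs))) for _ in range(B)
    )
    lo = means[int((1 - level) / 2 * B)]
    hi = means[int((1 + level) / 2 * B) - 1]
    return lo, hi

lo, hi = bootstrap_ci(diffs)
# If the interval covers zero, neither model can be declared better,
# and stacking weights between the two become essentially arbitrary.
```

The decision rule is then simply whether `lo < 0 < hi`; only intervals entirely away from zero would justify preferring one model.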
We question the authors' choice of methods against which to compare the stacking approach. Indeed, they point out that Bayesian model averaging (BMA) weights reflect only the fit to the data (i.e., within-sample performance) without maximizing prediction accuracy (i.e., out-of-sample performance). Thus, the comparison of BMA (or its modified versions, pseudo-BMA and pseudo-BMA+) against the stacking of distributions, which is conveniently constructed to improve prediction accuracy, does not seem fair. To highlight the gains and the pitfalls of stacking predictive distributions, it would be more reasonable to compare the predictive ability of the stacking approach against that of each one of the stacked models.
In Section 3.3, the authors advocate the use of Pareto smoothed importance sampling as a cheap alternative to exact LOOCV, which can be computationally expensive for large sample sizes. We agree with the authors' guidelines, and we re-emphasize here that this approach is potentially unstable or invalid when the importance ratios have a very heavy or "noisy" tail. Indeed, it is well known that the $i$th order statistic $X_{(i)}$ in a sample of size n from the GP distribution with shape parameter k has finite mean for $k < n - i + 1$ and finite variance for $k < (n - i + 1)/2$; see, e.g., Vännman (1976). In particular, this implies that the maximum $X_{(n)}$ has infinite mean for $k \ge 1$. As the GP shape parameter is usually estimated with high uncertainty, especially with heavy tails, a conservative decision rule is preferred in practice. Moreover, we want to stress that the estimation of the shape parameter k via maximum likelihood may be strongly influenced by the largest observations. Therefore, more robust approaches might be preferred. Possibilities include methods based on probability weighted moments, which were found to have good small-sample properties (Hosking and Wallis, 1987; Naveau et al., 2016), or a Bayesian approach with strong prior shrinkage towards light tails. Opitz et al. (2018) recently developed a penalized complexity (PC) prior for k, designed for this purpose.
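A probability-weighted-moments fit of the GP shape is short enough to sketch. The code below (Python standard library; written in the modern parameterization where a positive shape means a heavy tail, whereas Hosking and Wallis use the opposite sign convention, and the plotting position $(i - 0.35)/n$ follows their recommendation) estimates the shape and scale from a simulated GP sample:

```python
import random

def gpd_pwm(xs):
    """PWM-type estimate of the generalized Pareto shape (xi) and scale
    (sigma), using alpha_s = E[X (1 - F(X))^s] = sigma / ((s+1)(s+1-xi)):
    xi = 2 - a0/(a0 - 2 a1), sigma = 2 a0 a1 / (a0 - 2 a1)."""
    xs = sorted(xs)
    n = len(xs)
    a0 = sum(xs) / n  # sample alpha_0 = mean
    # sample alpha_1, with plotting positions p_i = (i - 0.35)/n (1-based i)
    a1 = sum(x * (1.0 - (i + 0.65) / n) for i, x in enumerate(xs)) / n
    xi = 2.0 - a0 / (a0 - 2.0 * a1)
    sigma = 2.0 * a0 * a1 / (a0 - 2.0 * a1)
    return xi, sigma

# Inverse-cdf sampler for GPD(xi, sigma): X = sigma*((1-U)^(-xi) - 1)/xi.
random.seed(42)
xi_true, sigma_true = 0.3, 1.0
sample = [sigma_true * ((1.0 - random.random()) ** (-xi_true) - 1.0) / xi_true
          for _ in range(20000)]
xi_hat, sigma_hat = gpd_pwm(sample)
assert abs(xi_hat - xi_true) < 0.15 and abs(sigma_hat - sigma_true) < 0.15
```

Note that PWM asymptotics require a shape below 0.5; for heavier tails (such as the $k \ge 1$ regime discussed above, where even the mean is infinite) no moment-based estimator is reliable, which is precisely why shrinkage priors such as the PC prior are attractive there.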

Contributed Discussion
Marco A. R. Ferreira *

I congratulate the authors on delivering a stimulating and thought-provoking article. In this comment, in the context of data observed over time, I suggest a way to investigate the causes of the disagreement in cases when one model has a posterior probability close to one but a stacking weight much smaller than one.
Here we focus on the case when data arrive over time. To simplify the discussion, let us assume that data are observed at discrete time points. Let $y_t$ be a vector that contains all the data observed at time t, $t = 1, \ldots, T$. Further, let $y_{1:t} = (y_1, \ldots, y_t)$. Then, instead of using the leave-one-out predictive density $p(y_i \mid y_{-i}, M_k)$, we may consider the one-step-ahead predictive density
$$p(y_t \mid y_{1:(t-1)}, M_k) = \int p(y_t \mid y_{1:(t-1)}, \theta_k, M_k)\, p(\theta_k \mid y_{1:(t-1)}, M_k)\, d\theta_k.$$
We note that after $y_t$ has been observed, comparing $p(y_t \mid y_{1:(t-1)}, M_1), \ldots, p(y_t \mid y_{1:(t-1)}, M_K)$ allows one to evaluate the relative ability of each model to predict, at time t−1, the vector of observations $y_t$. Hence, in the context of data observed over time, instead of $p(y_i \mid y_{-i}, M_k)$ it seems more natural to use the one-step-ahead predictive density $p(y_t \mid y_{1:(t-1)}, M_k)$. Thus, for data observed over time, the stacking of predictive distributions would choose the weights as
$$\max_{w \in \mathcal{S}_K} \sum_{t=t^*+1}^{T} \log \sum_{k=1}^{K} w_k\, p(y_t \mid y_{1:(t-1)}, M_k), \qquad (1)$$
where the summation over t starts at $t^*+1$ because the first $t^*$ observations are used to train the models, reducing dependence on the priors for the parameters. We note that the above equation is very similar to that of the optimal prediction pools of Geweke and Amisano (2011, 2012), except that they start the summation at t = 1.
It is also helpful to consider the formula for the posterior probability of each model. To keep the exposition simple, let us assume equal prior probabilities for the competing models. Further, we assume that the first $t^*$ observations are used for training. Then, the posterior probability of model $M_k$ is
$$p(M_k \mid y_{1:T}) = \frac{\prod_{t=t^*+1}^{T} p(y_t \mid y_{1:(t-1)}, M_k)}{\sum_{l=1}^{K} \prod_{t=t^*+1}^{T} p(y_t \mid y_{1:(t-1)}, M_l)}. \qquad (2)$$
Keeping (1) and (2) in mind, what can we infer when a model M has posterior probability close to one but its weight w in the stacking of predictive distributions is much smaller than one? The posterior probability being close to one means that M is probably, amongst the K models being considered, the model closest in the Kullback-Leibler sense to the true data-generating mechanism. But its weight w being much smaller than one means that there are important aspects of the true data-generating mechanism that have not been incorporated in M.
* Department of Statistics, Virginia Tech, Blacksburg, VA 24061, marf@vt.edu
We note that both (1) and (2) depend on the data only through the one-step-ahead predictive densities $p(y_t \mid y_{1:(t-1)}, M_k)$. Thus, for data observed over time, when there are disagreements between the posterior probabilities of models and the stacking weights, an examination of the one-step-ahead predictive densities $p(y_t \mid y_{1:(t-1)}, M_k)$, such as plotting them over time as in Vivar and Ferreira (2009), may help identify which aspects of the true data-generating mechanism are being neglected by model M.
For example, an examination of $p(y_t \mid y_{1:(t-1)}, M_k)$ may indicate that model M provides better probabilistic predictions 95% of the time, but that in the remaining 5% of the time the observations are outliers with respect to M while not being outliers with respect to a model $M^*$ that has fatter tails than M. In that situation, the outlying observations would prevent w from being close to one. Further examination of the outlying observations could suggest ways to improve model M to bring it closer to the true data-generating mechanism.
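This scenario can be illustrated numerically. In the sketch below (Python; the 95%/5% split and all density values are invented for illustration), the posterior probability computed as in (2) concentrates on M, while the stacking weight maximizing the objective in (1) stays well below one because of the outliers:

```python
import math

# Hypothetical one-step-ahead predictive densities p(y_t | y_{1:t-1}, M_k)
# for two models: M1 is slightly better on 95 "typical" points, while the
# fatter-tailed M2 is far better on 5 outliers.
dens = [(0.30, 0.20)] * 95 + [(0.0002, 0.20)] * 5

# Posterior model probabilities (equal priors): proportional to the
# product of one-step-ahead predictive densities, as in (2).
log_m1 = sum(math.log(p1) for p1, _ in dens)
log_m2 = sum(math.log(p2) for _, p2 in dens)
post1 = 1.0 / (1.0 + math.exp(log_m2 - log_m1))

# Stacking weight for M1: maximize sum_t log(w*p1 + (1-w)*p2) over [0, 1],
# as in (1), here by simple grid search.
def score(w):
    return sum(math.log(w * p1 + (1.0 - w) * p2) for p1, p2 in dens)

w1 = max((i / 1000 for i in range(1001)), key=score)

# Posterior probability concentrates on M1, yet the stacking weight
# remains noticeably below one because of the outlying observations.
assert post1 > 0.95 and w1 < 0.9
```

With these numbers the log-density products favor M1 overall, but each outlier contributes a very small mixture density unless M2 retains nonzero weight, pulling the stacking solution into the interior of the simplex.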
As another example, an examination of $p(y_t \mid y_{1:(t-1)}, M_k)$ may indicate that M and another model $M^*$ take turns at providing better probabilistic predictions. For example, say that for a certain environmental process, M provides better predictions during a certain period of time, then $M^*$ provides better predictions, then M again, and so on. In that case, the environmental process probably has different regimes, and thus, for example, a Markov switching model (Frühwirth-Schnatter, 2006) may be adequate for modeling such a process.
I would imagine that a sensibly estimated leave-one-out predictive density $p(y_i \mid y_{-i}, M_k)$ could also be used for diagnostics. I would appreciate it if the authors could comment on the advantages and difficulties associated with such use.
Finally, in the M-closed case, will the stacking weight of the true model converge to one as the sample size increases?

Contributed Discussion
Luis Pericchi * † This article gave me a déjà vu of 25 years ago in London, when the book by Bernardo and Smith (1994) was being finished. With the formalization of the complement of the M-closed perspective, the end of Bayes Factors and Bayesian Model Averaging (BMA) was predicted, or at least their confinement to a very small corner of statistical practice. However, re-sampling Bayes Factors, particularly the Geometric Intrinsic Bayes Factors, were being invented around the same years, and these re-sampling schemes completely changed the perspective. For some historical reasons, however, non-re-sampling Intrinsic Bayes Factors were much more developed over the years. Perhaps this will be one of the positive consequences of this paper and the previous ones in this line: to recover the thread of development, theoretical and practical, of the rich vein of re-sampling Bayes Factors.
Just two illustrations of the fundamentally different behavior of re-sampling Bayes Factors, more in tune with open perspectives, are in Bernardo and Smith (1994), p. 406 (Lindley's paradox revisited), and on p. 369 of the paper in which the Intrinsic Implicit Priors were first named. However, this paper seems to restrict its scope to non-re-sampling Bayes Factors, which is insufficient. On the other hand, the paper, by interestingly relinquishing marginal likelihoods, appears to get away from these issues, at least to some extent, by using the K-L loss. This should also be studied theoretically, but it should be noted that the loss function, even restricted to the K-L functional form, also changes whenever the training and validation sample sizes change, and the change is huge. This paper also seems restricted to leave-one-out (n − 1 training) cross-validation. Moreover, it reads as if the only goal of statistics were to make good predictions, but good explanations are also of paramount importance. Finally, I conjecture that a casual choice of estimators within models would lead to un-Bayesian, inefficient solutions, and the authors seem to agree with this conjecture in Section 5.3. Careful consideration of all the entertained models and admissible estimators for parameters should take place prior to the optimization procedures. In fact, this may solve the old conundrum of whether the same priors should or should not be used for estimation and selection.

Contributed Discussion
Christopher T. Franck * I congratulate the authors on a fascinating article which will positively impact statistical research and practice in the years to come. The authors' procedure for stacking Bayesian predictive distributions differs from Bayesian model averaging (BMA) in two important ways. First, the stacking procedure chooses weights based directly on the predictive distribution of new data while BMA chooses weights based on fit to observed data. Second, the stacking procedure is not sensitive to priors on parameters to the extent that BMA is. The leave-one-out approach in the stacking procedure bears some resemblance to intrinsic Bayes factors, which made me curious as to whether intrinsic Bayesian model averaging (iBMA) can make up some of the reported performance difference between stacking and BMA. In this note, I restrict attention to the authors' linear subset regressions example, adopt the authors' Bayesian model, and compare BMA with iBMA. Since iBMA does not improve prediction over BMA in this initial study, I ultimately suggest that the stacking procedure is superior to iBMA for prediction.
It appears that the stacking procedure's replacement of the full predictive distribution with the leave-one-out predictive distribution $\int p(y_i \mid \theta_k, M_k)\, p(\theta_k \mid y_{-i}, M_k)\, d\theta_k$ is the major mechanism that bestows invariance to priors. This approach makes priors resemble posterior distributions by conditioning on a subset of the data. Further, a simple re-expression of $p(\theta_k \mid y_{-i}, M_k)$ via Bayes' rule reveals that nuisance constants which accompany priors (discussed by the authors in the BMA segment of Section 2) cancel out when any portion of the data is used to train priors. This is the same tactic that partial Bayes factors (Berger and Pericchi, 1996) use to cancel the unspecified constants which accompany improper priors and contaminate the resulting Bayes factors. Briefly, a partial Bayes factor takes a training sample from the observed data, uses the likelihood of the training sample to update the prior, and forms a Bayes factor as the ratio of marginal likelihoods that adopt the trained prior alongside the remainder of the likelihood. An intrinsic Bayes factor is an average across some or all possible training samples. Where the original motivation for intrinsic Bayes factors was to enable model selection using improper priors, the approach is also used for model selection that is robust to vague proper priors. For improper priors, the goal is usually to choose a minimal training sample size that renders the prior proper. As the training sample size increases for fixed n, the prior exerts less influence on the posterior model probabilities, but the method becomes less able to discern competitive models (De Santis and Spezzaferri, 1997).
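The cancellation of prior influence can be sketched for a conjugate normal mean model, where marginal likelihoods are available in closed form (the prior scales 1 and $10^4$ below are illustrative, not the authors' settings): the full Bayes factor between two prior scales for the same likelihood is dominated by the arbitrary prior variance, while the partial Bayes factor, with each prior trained on a few observations, is nearly immune to it.

```python
import numpy as np

def log_marginal(y, mu0, v0, s2=1.0):
    # Prequential log marginal likelihood for y_i ~ N(theta, s2), theta ~ N(mu0, v0)
    mu, v, total = mu0, v0, 0.0
    for yt in y:
        pv = v + s2  # one-step-ahead predictive variance
        total += -0.5 * np.log(2 * np.pi * pv) - 0.5 * (yt - mu) ** 2 / pv
        v_new = 1.0 / (1.0 / v + 1.0 / s2)  # conjugate posterior update
        mu = v_new * (mu / v + yt / s2)
        v = v_new
    return total

rng = np.random.default_rng(0)
y = rng.normal(1.0, 1.0, size=30)

# Full Bayes factor between two prior scales for the SAME likelihood:
# dominated by the arbitrary choice of prior variance (1 vs 1e4).
lbf_full = log_marginal(y, 0.0, 1.0) - log_marginal(y, 0.0, 1e4)

# Partial Bayes factor: train each prior on the first 5 observations, then
# compare marginal likelihoods of the remaining data under the trained priors.
t = 5
lbf_partial = (log_marginal(y, 0.0, 1.0) - log_marginal(y[:t], 0.0, 1.0)) \
            - (log_marginal(y, 0.0, 1e4) - log_marginal(y[:t], 0.0, 1e4))

print(lbf_full, lbf_partial)
```

The difference `log_marginal(y, ...) - log_marginal(y[:t], ...)` is exactly the log marginal likelihood of the remaining data under the trained prior, so the partial log Bayes factor is much closer to zero than the full one.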
Using the authors' Bayesian model and data generating process and a similar out-of-sample testing procedure, I compared standard BMA with iBMA. In the iBMA case, I formed intrinsic Bayes factors which I then translated to posterior model probabilities for use in model averaging. I obtained 50 Monte Carlo replicates with 10 test points. I considered iBMA training sample sizes of 1, 5, and n − 1. The M-open case (not shown) favored the iBMA setting with n − 1 training samples, leaving only one data point for the likelihood. This setting barely changed the posterior model probabilities from their uniform prior, which works well only in this specific case where a near-uniform mixture of the 15 candidate models performs adequately. The M-closed results shown in Figure 1 confirm that (i) iBMA diminishes the influence of the prior as the size of the training sample increases (note the overlap in the n − 1 training lines), (ii) an excessively large training sample proportion erodes the predictive density, especially at larger sample sizes, and, more importantly, suggest that (iii) endowing BMA with prior-invariance machinery that resembles the stacking procedure's does not appear to offer any advantage in predictive density. Hence, I second the authors' conclusion that stacking is a superior approach for prediction. The present study suggests that the stacking procedure's prior invariance property is a convenient bonus but not the major reason for its impressive performance.

*Department of Statistics, Virginia Tech, chfranck@vt.edu

Figure 1: Averages of log predictive densities for 10 test points based on 50 Monte Carlo replicates. Black corresponds to the authors' prior, red corresponds to a more vague prior on coefficients. Increasing the training sample size diminishes the effect of the prior but also reduces log predictive density at higher sample sizes. None of the iBMA settings produce predictive densities substantially higher than the BMA approach (open circles).
models under comparison, not to the methods used in order to produce stacking. What is interesting to notice is that, despite the different weights computed according to the two schemes, the combined conditional predictive densities are rather similar for all values of the conditioning variable x. This feature has been shown to hold also as the sample size increases, and similar results (not reported here) have been obtained for different model specifications.
The main insight from this small simulation study is twofold: first, the approach proposed by the authors outperforms the alternative weighting schemes, both in fitting and in computational efficiency. Second, the scheme of Li and Dunson (2016) yields comparable results in terms of conditional density estimation. This might suggest that coupling Pseudo-BMA and Reference-Pseudo-BMA might be a successful strategy in those circumstances when leave-one-out or Pareto smoothed importance sampling leave-one-out cross-validation is suspected to produce unstable results, due to small sample size or large values of the Pareto $\hat{k}$ diagnostic.

Contributed Discussion
Merlise Clyde * Stacking (Wolpert, 1992; Breiman, 1996a,b) has seen renewed attention in recent years as an ensemble learning method for prediction, particularly in Kaggle competitions! In the M-open context, stacking of densities arises as a decision-theoretic solution using a log scoring rule (see also Walker et al. (2001) for an alternative computational approach to the same problem). I hope the authors' contributions to predictive density estimation will encourage more practitioners to look beyond point estimation and consider quantification of uncertainty. Examples where uncertainty quantification is being addressed quite successfully with ensemble learning of densities include computer models, such as those used in weather forecasting. Raftery et al. (2005) and related papers propose log-scoring rules among other utilities for ensemble learning, with an implementation in the R package ensembleBMA (Fraley et al., 2018). Other examples include forecasting with economic time series (see the discussion by McAlinn et al. of Yao et al.). A key distinguishing feature among density ensemble learning methods in the M-open context is whether the predictive densities or the parameters of the densities are independent of the data used to learn the weights (as with computer models), or whether the available data must be used to learn both the predictive densities and the optimal weights for the ensemble. The former case, as used in the examples in Section 4.1 of the paper, does not require any data splitting such as leave-one-out cross-validation to learn the predictive densities before solving the predictive optimization problem, and can help illustrate potential problems.

Faraway, So Close
Let's reconsider the example in Section 4.1 of the paper, but change the true data generating distribution to N(42, 1) while keeping the candidate densities for ensemble learning the same as in the paper: N(μ_k, 1) with μ_k = k for k = 1, . . . , 8. Traditional Bayesian Model Averaging (BMA) will degenerate to the model closest to the true model in Kullback-Leibler divergence, hence the N(8, 1). However, stacking of densities will fare no better and could be worse if the weights do not degenerate to zero for k < 8. Raftery et al. (2005) and subsequent papers explicitly accommodate potential bias in the computer models used in the means of each of the predictive densities in the ensembles; applying that approach, rather than using N(μ_k, 1), we could correct for bias using components that are N(a_k + b_k μ_k, 1), where a_k and b_k are learned from the data. In our simple example, it would appear that there would be no unique solution for the weights or the a_k and b_k. This is not a concern if one is only interested in predictive analytics without the desire to interpret the weights as measures of importance.
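This behavior is easy to reproduce numerically. The sketch below (assuming fixed candidate densities, so there are no parameters to learn) computes BMA-style weights as normalized products of densities and stacking weights via the mixture-weight EM update under the log score; both concentrate on N(8, 1), which remains far from N(42, 1).

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(42.0, 1.0, size=200)   # true data generator: N(42, 1)
mus = np.arange(1, 9, dtype=float)    # candidate densities N(k, 1), k = 1..8

# log p(y_i | M_k) for the fixed candidate densities
logp = -0.5 * (y[:, None] - mus) ** 2 - 0.5 * np.log(2 * np.pi)

# BMA-style weights: proportional to the product of densities;
# they degenerate onto the KL-closest candidate, N(8, 1).
le = logp.sum(axis=0)
bma_w = np.exp(le - le.max())
bma_w /= bma_w.sum()

# Stacking weights under the log score via the mixture-weight EM update
# w_k <- mean_i [ w_k p_ik / sum_j w_j p_ij ].
w = np.full(8, 1 / 8)
for _ in range(500):
    r = w * np.exp(logp - logp.max(axis=1, keepdims=True))
    w = (r / r.sum(axis=1, keepdims=True)).mean(axis=0)

print(bma_w.round(3))  # essentially all weight on N(8, 1)
print(w.round(3))      # stacking also concentrates on N(8, 1)
```

Because N(8, 1) dominates every other candidate pointwise for data near 42, both schemes put essentially all weight on it, yet the combined predictive density is still terrible for the actual data.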
Stacking can also fail if the component densities are too simple. Consider the situation where the true data generating model is $Y = \sum_{k=1}^{K} X_k \beta_k + \epsilon$ with $\epsilon \sim N(0, I_n)$, as in example 4.2. We will take the candidate densities to be Gaussian densities centered at $X_k \hat{\beta}_k$, where $\hat{\beta}_k$ is the ordinary least squares estimate of $\beta_k$ from the simple linear regression of $Y$ on $X_k$. In the case that the $X_j$ and $X_k$ are all mutually orthogonal vectors, $\hat{\beta}_k$ is an unbiased estimate of $\beta_k$; however, the stacked predictive mean, $\sum_k \hat{w}_k x^*_k \hat{\beta}_k$, using either squared error or log-scoring rules, will be asymptotically biased, as the individual weights will be less than 1. There is the potential that the bias will actually grow with $K$, as more predictors will dilute the weights. In this case, the bias corrections posed in Raftery et al. (2005) would have no effect. The more realistic cases where the $X_k$ are not orthogonal or the component densities are centered at Bayesian estimates may suffer from the same problems.

*Department of Statistical Science, Duke University, Durham, NC 27708-0251 USA, clyde@duke.edu url: http://stat.duke.edu/˜clyde
Arguments in  showed that BMA would fail when the true model was outside the class of models used in BMA, however, these examples illustrate that stacking can suffer from similar problems as BMA. While both BMA and stacking of densities are guaranteed to be "close" in some measure, they may still be far from the truth.

Choice of Weights Revisited
In the original papers on stacking (Wolpert, 1992; Breiman, 1996a,b), the constraints that the weights sum to one and be non-negative did not follow from any first principles, but appeared to work well in practice. For the above regression example using squared error loss, the non-negativity constraint on the weights is in fact not binding; the sum-to-one constraint is the source of the problem. By relaxing that constraint, one can recover the true predictive mean asymptotically; however, it is not obvious what the weights should sum to as part of the optimization problem. In terms of the dual Lagrangian formulation of the constrained optimization problem, the choice of constraint for the sum of the weights is related to the pre-specification of the penalty in lasso regression.
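The sum-to-one issue can be seen in a few lines. In this hypothetical setup (two independent predictors, both with unit coefficients), no convex combination of the two single-predictor fits can match the unconstrained sum of the fits, which recovers the full model:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1000
X = rng.normal(size=(n, 2))                 # two (nearly orthogonal) predictors
y = X @ np.array([1.0, 1.0]) + rng.normal(size=n)

# Component predictions: least squares fit of y on each single predictor
preds = np.column_stack([
    X[:, k] * (X[:, k] @ y) / (X[:, k] @ X[:, k]) for k in range(2)
])

# Best simplex-constrained combination (grid search over w in [0, 1]; K = 2)
grid = np.linspace(0.0, 1.0, 201)
mse_simplex = min(
    np.mean((y - (w * preds[:, 0] + (1 - w) * preds[:, 1])) ** 2) for w in grid
)

# Relaxing the sum-to-one constraint: weights (1, 1) recover the full model
mse_sum = np.mean((y - preds.sum(axis=1)) ** 2)
print(mse_simplex, mse_sum)
```

The constrained combination leaves roughly half a unit of explainable variance on the table here, exactly because the weights are forced to sum to one.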
For stacking with densities, the EM algorithm for finding the optimal weights under the log-scoring rule demonstrates that the weights always satisfy the constraints of non-negativity and summing to one, so relaxation is not a possible avenue to address the problems identified above.
The arguments above suggest that stacking of densities, like BMA, may be sensitive to the choice of models that are used in the ensemble.

Rejoinder
Yuling Yao * , Aki Vehtari † , Daniel Simpson ‡ , and Andrew Gelman § We thank the editorial team for organizing the discussion. We are pleased to find so many thoughtful discussants who agree with us on the advantage of having stacking in the toolbox for combining Bayesian predictive distributions. In this rejoinder we will provide further clarifications and discuss some limitations and extensions of Bayesian stacking.
1 When is leave-one-out cross validation appropriate?

Exchangeability
Stacking maximizes the weighted leave-one-out (LOO) scoring rule using all observations:
$$\max_{w \in \mathcal{S}_1^K} \sum_{i=1}^{N} \log \sum_{k=1}^{K} w_k\, \hat{p}_{k,-i}(y_i), \qquad (1)$$
where $\mathcal{S}_1^K$ denotes the simplex $\{w : \sum_{k=1}^{K} w_k = 1,\ w_k \in [0, 1]\}$ and $\hat{p}_{k,-i}(\cdot) = \int p(\cdot \mid x_i, y_{-i}, \theta_k, M_k)\, p(\theta_k \mid y_{-i}, x_{-i}, M_k)\, d\theta_k$ is the leave-$i$-out predictive density for model $M_k$. To emphasize the data generating models, we include here covariates $x$, which were suppressed in the main paper for simplicity.
The asymptotic optimality of stacking relies on the consistency of the leave-one-out scoring rule:
$$\frac{1}{N}\sum_{i=1}^{N} S(\hat{p}_{k,-i}, y_i) - \mathbb{E}_{(\tilde{x}, \tilde{y}) \mid (x, y)}\, S\big(p(\cdot \mid \tilde{x}, x, y, M_k), \tilde{y}\big) \to 0. \qquad (2)$$
The conditionally i.i.d. assumption of $y$ given $x$ is typically sufficient and is used in many proofs, but it is not necessary. The key assumption we need is exchangeability (Bernardo and Smith, 1994, Chapter 6).
As discussed by Dawid and Clarke, it is not clear what happens to LOO or LOO-stacking in the asymptotic limit if there is no true data-generating mechanism $p(\tilde{y} \mid \tilde{x})$. Assuming such a true or underlying distribution is equivalent to assuming stationarity of the data-generating process. From a Bayesian standpoint, it is appropriate to model a non-stationary process with a non-stationary model. Realistically, though, whatever model we use, stationary or not, will encounter unmodeled trends when applied to real data; hence any asymptotic results can only be considered provisional.

Data-generating mechanisms and sample reweighting
If we have not yet observed the new data point $(y_{N+1}, x_{N+1})$, we can use LOO to approximate the expectation over different possible values of $(y_{N+1}, x_{N+1})$. Instead of making a model $p(y, x)$, we re-use the observations as a pseudo-Monte Carlo sample from $p(y_{N+1}, x_{N+1})$, while not using them for inference about $\theta$.
For the predictive performance estimate, however, we need to know how the future $x_{N+1}$ will be generated. In standard LOO we implicitly assume the future $x_{N+1}$ comes from the empirical distribution $\frac{1}{N}\sum_{i=1}^{N} \delta_{x_i}$. If we assume that the future distribution is different from the past, then the covariate shift can be taken into account by weighting (e.g., Shimodaira, 2000; Sugiyama and Müller, 2005; Sugiyama et al., 2007). When we know exactly, or have estimated from extra information, $p_{N+1}(x_{N+1}) = p(x_{N+1} \mid x)$, we can use importance weighting to adjust the sample weights $r_i \propto p_{N+1}(x_i)/p(x_i)$ and replace the equally weighted terms in (1) accordingly. In particular, the convergence of importance sampling does not require independence of the covariates $x_1, \dots, x_N$.
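A sketch of this reweighting, with hypothetical pointwise LOO log scores and a known covariate shift from N(0, 1) to N(1, 1):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=500)   # observed covariates, x_i ~ p(x) = N(0, 1)
elpd_i = -0.5 * x**2                 # hypothetical pointwise LOO log scores

# Suppose future covariates follow p_{N+1}(x) = N(1, 1):
# importance weights r_i proportional to p_{N+1}(x_i) / p(x_i)
log_r = -0.5 * (x - 1.0) ** 2 + 0.5 * x**2
r = np.exp(log_r - log_r.max())
r /= r.sum()

plain = elpd_i.mean()          # standard LOO: equal weights 1/N
shifted = np.sum(r * elpd_i)   # covariate-shift-adjusted LOO estimate
print(plain, shifted)
```

Because these hypothetical scores degrade away from x = 0, shifting the covariate distribution toward x = 1 lowers the adjusted performance estimate relative to standard LOO.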

Fixed design
If $x$ is fixed or chosen by design, we can still build a conditional model $p(y \mid x, \theta)$ and analyze the posterior distribution $p(\theta \mid x, y)$. We can reinterpret $r_i$ as the weight for the $i$-th observation. Standard LOO assigns equal weight to each $x_i$. If we care more about the performance at some fixed $x_i$ than at others, we can use different weighting schemes to adjust.

Non-factorizable models
Our fast LOO approximation (PSIS-LOO) applies generally to factorizable models $p(y \mid \theta, x) = \prod_{i=1}^{N} p(y_i \mid \theta, x_i)$, for which the pointwise log-likelihood can be obtained easily by computing $\log p(y_i \mid \theta, x_i)$.
Non-factorizable models can sometimes be factorized by re-parametrization. For example, consider a multilevel model with $J$ groups. If we denote the group-level parameters and the global parameter by $\theta_j$ and $\psi$, then the joint density is
$$p(y \mid x, \theta, \psi) = \prod_{j=1}^{J}\left[\prod_{n=1}^{N_j} p(y_{jn} \mid x_{jn}, \theta_j)\right] p(\theta_j \mid \psi), \qquad (3)$$
where $y$ are partially exchangeable, i.e., the $y_{jn}$ are exchangeable within group $j$ and the $\theta_j$ are exchangeable. Rearranging the data and denoting the group label of $(x_i, y_i)$ by $z_i$, (3) can be reorganized as $\prod_{i=1}^{N} p(y_i \mid x_i, z_i, \theta, \psi)$, so the previous results follow. Furthermore, depending on whether the prediction task is to predict a new group or a new observation within a particular group $j$, we should consider leave-one-group-out or leave-one-point-out, corresponding to modeling the new covariate by $p(\tilde{x}, \tilde{z}) \propto \delta(\tilde{z} = j) \sum_{i: z_i = j} \delta(\tilde{x} = x_i)$ or $\sum_{z=1}^{J} \sum_{i: z_i = z} \delta(\tilde{z} = z, \tilde{x} = x_i)$. When there is no obvious re-parametrization making the model conditionally factorizable, the pointwise log-likelihood has the general form $\log p(y_i \mid y_{-i}, \theta)$. It is still possible to use PSIS-LOO in some special non-factorizable forms. Vehtari et al. (2018a) provide a marginalization strategy of PSIS-LOO to evaluate simultaneously autoregressive normal models. Bakka et al. express concerns about the reliability of PSIS. We refer to Vehtari et al. (2018b) for computation and diagnostic details.
Lastly, although we used LOO, other variations of cross-validation could be used in stacking. Roberts et al. (2017) review cross-validation strategies for data with temporal, spatial, hierarchical, and phylogenetic structure. Many of these can also be computed quickly with PSIS, as demonstrated for m-step-ahead cross-validation in time series (Buerkner et al., 2018).

Prequentialism
When observations $y_t$ come in sequence, there is in general no reason to use a conditionally independent or exchangeable model for them. Nevertheless, LOO and LOO-stacking can still be applicable if the concern is the whole structure of the observed time points. For example, we might be interested in analyzing whether more or fewer babies would be born on some special days of the year.
If the main purpose is to make predictions for the next not-yet-observed data point, we can utilize the prequential principle, $p(y_{1:N} \mid \theta) = \prod_{t=1}^{N} p(y_t \mid y_{1:t-1}, \theta)$, and replace the LOO density in (1) by the sequential predictive density leaving out all future data, $p(y_t \mid y_{<t}) = \int p(y_t \mid y_{1:t-1}, \theta)\, p(\theta \mid y_{1:t-1})\, d\theta$, in each model; stacking then follows. This is similar to the approach developed by Geweke and Amisano (2011, 2012). Ferreira and Dawid suggest similar ideas. The ergodicity of $y$ will then yield the limit in (4). When there is a particular horizon of interest for prediction, $p(y_t \mid y_{<t})$ above is generalized to the $m$-step-ahead predictive density $p(y_{t:(t+m-1)} \mid y_{<t}) = p(y_t, \dots, y_{t+m-1} \mid y_1, \dots, y_{t-1}) = \int p(y_{t:(t+m-1)} \mid y_{<t}, \theta)\, p(\theta \mid y_{<t})\, d\theta$.
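These one-step-ahead densities are available in closed form for conjugate models, which makes the prequential decomposition easy to sketch (a normal mean model with known variance; all settings here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(3.0, 1.0, size=50)

# Conjugate model y_t ~ N(theta, 1), theta ~ N(0, 10^2):
# p(y_t | y_{<t}) = N(mu_t, v_t + 1) with posterior (mu_t, v_t) from y_{<t}.
mu, v, s2 = 0.0, 100.0, 1.0
lpd = []
for yt in y:
    pv = v + s2  # one-step-ahead predictive variance
    lpd.append(-0.5 * np.log(2 * np.pi * pv) - 0.5 * (yt - mu) ** 2 / pv)
    v_new = 1.0 / (1.0 / v + 1.0 / s2)  # posterior update after seeing y_t
    mu = v_new * (mu / v + yt / s2)
    v = v_new

# sum(lpd) equals log p(y_{1:N}): the prequential decomposition of the marginal
print(sum(lpd))
```

The first term is evaluated under the vague prior, so it is much lower than the later terms; the sum over all terms is exactly the log marginal likelihood, decomposed prequentially.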
However, the prequential approach introduces a different prediction task. Unless some stationarity of the true data generating mechanism is assumed (e.g., the distribution of $y_t$ given $y_{<t}$ has the same form for all $t$), the average cumulative performance (the second term in (4)) is different from the one-step-ahead assessment in (2), which is evaluated only at the next unseen observation $t = N + 1$.

Dynamic approximation of posterior densities
The exact prequential evaluation requires refitting each model for each $t$, which can be approximated by PSIS as $p(y_t \mid y_{<t}) = \int p(y_t \mid \theta, y_{<t}) \frac{p(\theta \mid y_{<t})}{p(\theta \mid y)}\, p(\theta \mid y)\, d\theta$. We then start from the full-data inference $p(\theta \mid y)$ and dynamically update $p(\theta \mid y_{<t})$ using the PSIS approximation. When $p(\theta \mid y_{<t})$ reveals a large discrepancy from $p(\theta \mid y)$ for some small $t$, which can be diagnosed by the PSIS $\hat{k}$, we refit the model $p(\theta \mid y_{<t})$ and update the proposal. Buerkner et al. (2018) verify that such an approximation gives stable and accurate results with a minimal number of refits in an autoregressive model.
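The importance ratios in this approximation can be sketched for a conjugate normal model, where the exact answer is available for comparison. Note that the sketch uses plain importance ratios; PSIS would additionally Pareto-smooth the largest ratios before use.

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(3.0, 1.0, size=40)
mu0, v0, s2 = 0.0, 100.0, 1.0
n = len(y)

# Closed-form posterior p(theta | y) for y_i ~ N(theta, s2), theta ~ N(mu0, v0)
v_post = 1.0 / (1.0 / v0 + n / s2)
mu_post = v_post * (mu0 / v0 + y.sum() / s2)

# Importance-sampling estimate of p(y_n | y_{<n}) from full-posterior draws:
# the ratio p(theta | y_{<n}) / p(theta | y) is proportional to 1 / p(y_n | theta)
theta = rng.normal(mu_post, np.sqrt(v_post), size=100_000)
loglik = -0.5 * np.log(2 * np.pi * s2) - 0.5 * (y[-1] - theta) ** 2 / s2
ratio = np.exp(-loglik - (-loglik).max())
is_est = np.sum(ratio * np.exp(loglik)) / np.sum(ratio)

# Exact one-step-ahead density from the posterior given y_{<n}
v_m = 1.0 / (1.0 / v0 + (n - 1) / s2)
mu_m = v_m * (mu0 / v0 + y[:-1].sum() / s2)
pv = v_m + s2
exact = np.exp(-0.5 * np.log(2 * np.pi * pv) - 0.5 * (y[-1] - mu_m) ** 2 / pv)
print(is_est, exact)
```

Leaving out only the final observation keeps the importance ratios well behaved, so the plain estimate already matches the exact predictive density closely; leaving out many observations (small $t$) is exactly the situation where the $\hat{k}$ diagnostic flags the approximation and a refit is needed.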

Dynamic stacking weights
Hoogerheide and van Dijk and McAlinn, Aastveit, and West point out that static weighting is not desirable in time series, as a model good at making short-term predictions might not do well in the long run. Stacking can easily be made dynamic, allowing the explanatory power of models to change over time. A quick fix is to replace the model weights $w_k$ in (1) by time-varying weights $w_{t,k}$ in the $t$-th term. To incorporate historical information, we can add a regularization term $-\tau \sum_{t=2}^{N} \lVert w_{t,\cdot} - w_{t-1,\cdot} \rVert$ to the stacking objective function. The heterogeneity of stacking weights can also be generalized to other hierarchical data structures, and this can be seen as related to a generalization of the mixture formulation of Kamary et al. (2014).
Bayesian predictive synthesis (BPS; McAlinn and West, 2017) has been developed for dynamic Bayesian combination of time series forecasts. The prediction in BPS takes the form $\int \alpha(y \mid z) \prod_{k=1}^{K} h_k(z_k)\, dz$, where $z = z_{1:K}$ is the latent vector generated from the predictive densities $h_k(\cdot)$ of each model and $\alpha(y \mid z)$ is the distribution for $y$ given $z$ that is designed to calibrate the model-specific biases and correlations. We agree that BPS is more flexible in its combination form, as stacking is restricted to the linear weighting $\alpha(y \mid z) = \sum_k \alpha_k(z)\, \delta_{z_k}(y)$. On the other hand, stacking has the flexibility to be tailored to decision makers' specific utilities, and its convex optimization is more computationally feasible for complicated models. It will be interesting to make further comparisons in the future.

Reliance on the list of models and the restriction to linear combinations
We agree with Clyde and Zhou that the performance of stacking depends on the choice of model list, as stacking can do nothing better than the optimal linear combination from the model list. Stacking is not strongly sensitive to the misspecified models (see Section 4.1 of our paper), but it will be sensitive to how good an approximation is possible given the ensemble space.
We discuss the concern about the inflexibility of the linear additive form of density combination in Section 5.2, and construct the same orthogonal regression example as Clyde, in which stacking will not work to approximate the true model, which is a convolution of the individual densities. By optimizing the leave-one-out performance of the combined prediction, the stacking framework can be extended to more general combination forms, such as the posterior family used in the BPS literature. Furthermore, the simplex constraints become unnecessary once we go beyond the linear combination of densities. We are interested in testing such approaches. Yoo proposes another way to obtain convolutional combinations by stacking in the Fourier domain.

Model expansion as an alternative
One setting where stacking can be used, but full model expansion could be more difficult, is when several different sorts of models have been separately fit. The same idea is summarized by Pericchi as "careful consideration of all the entertained models and admissible estimators for parameters should be considered prior to the optimization procedures." We are less concerned about the situation described by Belitser and Nurushev, Shin, and Zhou, in which the number of models is so large that stacking can be both computationally expensive and theoretically inconsistent, because in that setting we would recommend moving to a continuous model space that encompasses the separate models in the list.
Stacking is not designed for model selection, but for model averaging to get good predictions. We do not recommend using it for model selection, although models with zero weight could be discarded from the average. For large p and small n, instead of stacking or other model averaging methods, we recommend using an encompassing model with all variables and prior information about the desired level of sparsity (Piironen and Vehtari, 2017b,c). For example, the regularized horseshoe prior can be considered a continuous extension of the spike-and-slab prior with discrete model averaging over models with different variable combinations (Piironen and Vehtari, 2017c). For high-dimensional variable selection we recommend a projection predictive approach (Piironen and Vehtari, 2016, 2017a), which has smaller variance in the selection process due to the use of the encompassing model as a reference model, and better predictive performance due to making the inference conditional on the selection process and the encompassing model.

Nonparametric approaches
Li and Iacopini and Tonellato suggest the use of nonparametric reference models to eliminate the need for cross-validation. If we are able to build a good nonparametric model, there is probably no need for model averaging. Although model averaging might be used as part of model reduction, instead of using the component models p(·|y, M_k) we would prefer to form the component models using a projection predictive approach, which projects the information from the reference model to the restricted models (Piironen and Vehtari, 2016, 2017a).
Zhou suggests Bayesian nonparametric (BNP) models as an alternative to model averaging. Indeed, the spline models used in the experiments in Section 4.6 of our paper can be considered BNP models. We can also compute fast LOO-CV for Gaussian processes and other Gaussian latent variable models.

Logarithmic scoring rules
Finally, we emphasize that the choice of scoring rule in stacking depends on the underlying application, and no single choice is likely to be optimal for every situation in advance. As Winkler, Jose, Lichtendahl, and Grushka-Cockayne and Grünwald and de Heide point out, there is no need to use the log score if the focus is some other utility. Our proposed stacking framework is applicable to any scoring rule. We are particularly interested in interval stacking that optimizes the interval score, which is likely to provide better interval estimation and posterior uncertainties.