Asymptotically Minimax Prediction in Infinite Sequence Models

We study asymptotically minimax predictive distributions in an infinite sequence model. First, we discuss the connection between the prediction in the infinite sequence model and the prediction in the function model. Second, we construct an asymptotically minimax predictive distribution when the parameter space is a known ellipsoid. We show that the Bayesian predictive distribution based on the Gaussian prior distribution is asymptotically minimax in the ellipsoid. Third, we construct an asymptotically minimax predictive distribution for any Sobolev ellipsoid. We show that the Bayesian predictive distribution based on the product of Stein's priors is asymptotically minimax for any Sobolev ellipsoid. Finally, we present an efficient sampling method from the proposed Bayesian predictive distribution.


Introduction
We consider prediction in an infinite sequence model. The current observation is a random sequence $X = (X_1, X_2, \ldots)$ given by
$$X_i = \theta_i + \varepsilon W_i, \quad i \in \mathbb{N},$$

K. Yano and F. Komaki
where $\theta = (\theta_1, \theta_2, \ldots)$ is an unknown sequence in $l_2 := \{\theta : \sum_{i=1}^{\infty} \theta_i^2 < \infty\}$ and $W = (W_1, W_2, \ldots)$ is a random sequence distributed according to $\otimes_{i=1}^{\infty} N(0, 1)$ on $(\mathbb{R}^{\infty}, \mathcal{R}^{\infty})$. Here $\mathcal{R}^{\infty}$ is the product σ-field of the Borel σ-field $\mathcal{R}$ on the Euclidean space $\mathbb{R}$. Based on the current observation $X$, we estimate the distribution of a future observation $Y = (Y_1, Y_2, \ldots)$ given by
$$Y_i = \theta_i + \tilde{\varepsilon} \tilde{W}_i, \quad i \in \mathbb{N},$$
where $\tilde{W} = (\tilde{W}_1, \tilde{W}_2, \ldots)$ is distributed according to $\otimes_{i=1}^{\infty} N(0, 1)$. We denote the true distribution of $X$ with $\theta$ by $P_\theta$ and the true distribution of $Y$ with $\theta$ by $Q_\theta$. For simplicity, we assume that $W$ and $\tilde{W}$ are independent.
Prediction in an infinite sequence model is shown to be equivalent to the following prediction in a function model. Consider that we observe a random function $X(\cdot)$ given by
$$X(\cdot) = F(\cdot) + \varepsilon W(\cdot) \in L_2[0,1],$$
where $L_2[0,1]$ is the $L_2$-space on $[0,1]$ with the Lebesgue measure, $F(\cdot) : [0,1] \to \mathbb{R}$ is an unknown absolutely continuous function whose derivative is in $L_2[0,1]$, $\varepsilon$ is a known constant, and $W(\cdot)$ follows the standard Wiener measure on $L_2[0,1]$. Based on the current observation $X(\cdot)$, we estimate the distribution of a random function $Y(\cdot)$ given by
$$Y(\cdot) = F(\cdot) + \tilde{\varepsilon} \tilde{W}(\cdot) \in L_2[0,1],$$
where $\tilde{\varepsilon}$ is a known constant, and $\tilde{W}(\cdot)$ follows the standard Wiener measure on $L_2[0,1]$. The details are provided in Section 2. Xu and Liang [19] established the connection between prediction of a function on equispaced grids and prediction in a high-dimensional sequence model, using the asymptotics in which the dimension of the parameter grows to infinity according to the growth of the grid size. Our study is motivated by Xu and Liang [19] and generalizes their result to settings in which the parameter $\theta$ is infinite-dimensional.

Using the above equivalence, we discuss the performance of a predictive distribution $\hat{Q}(\cdot; \cdot)$ of $Y$ based on $X$ in an infinite sequence model. Let $\mathcal{A}$ be the whole set of probability measures on $(\mathbb{R}^{\infty}, \mathcal{R}^{\infty})$ and let $\mathcal{D}$ be the decision space $\{\hat{Q} : \mathbb{R}^{\infty} \to \mathcal{A}\}$. We use the Kullback-Leibler loss as a loss function: for all $\hat{Q} \in \mathcal{A}$ and all $\theta \in l_2$, if $Q_\theta$ is absolutely continuous with respect to $\hat{Q}$, then
$$l(\theta, \hat{Q}) := \int \log \frac{dQ_\theta}{d\hat{Q}}(y) \, dQ_\theta(y),$$
and otherwise $l(\theta, \hat{Q}) = \infty$. The risk of a predictive distribution $\hat{Q}(\cdot; \cdot) \in \mathcal{D}$ in the case that the true distributions of $X$ and $Y$ are $P_\theta$ and $Q_\theta$, respectively, is denoted by
$$R(\theta, \hat{Q}) := \int l(\theta, \hat{Q}(\cdot; x)) \, dP_\theta(x).$$
We construct an asymptotically minimax predictive distribution $\hat{Q}^* \in \mathcal{D}$ that satisfies
$$\lim_{\varepsilon \to 0} \frac{\sup_{\theta \in \Theta(a,B)} R(\theta, \hat{Q}^*)}{\inf_{\hat{Q} \in \mathcal{D}} \sup_{\theta \in \Theta(a,B)} R(\theta, \hat{Q})} = 1,$$
where $\Theta(a, B) := \{\theta \in l_2 : \sum_{i=1}^{\infty} a_i^2 \theta_i^2 \le B\}$ with a known non-zero and nondecreasing divergent sequence $a = (a_1, a_2, \ldots)$ and with a known constant $B$. Note that for any $\varepsilon > 0$, the minimax risk is bounded above by $(1/(2\tilde{\varepsilon}^2))(B/a_1^2) < \infty$. Further, note that using the above equivalence between the infinite sequence model and the function model, the parameter restriction in the infinite sequence model that $\theta \in \Theta(a, B)$ corresponds to the restriction that the corresponding parameter in the function model is smooth; $B$ represents the volume of the parameter space, and the growth rate of $a$ represents the smoothness of the functions.
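The upper bound on the minimax risk noted above can be checked by a short worked derivation. It assumes, as an illustration that is not spelled out in the text, that the bound comes from the predictive distribution $Q_0 := \otimes_{i=1}^{\infty} N(0, \tilde{\varepsilon}^2)$, which ignores $X$; the symbol $Q_0$ is introduced here only for this sketch.

```latex
\begin{align*}
l(\theta, Q_0)
  &= \sum_{i=1}^{\infty}
     \mathrm{KL}\bigl( N(\theta_i, \tilde{\varepsilon}^2) \,\big\|\, N(0, \tilde{\varepsilon}^2) \bigr)
   = \frac{1}{2\tilde{\varepsilon}^2} \sum_{i=1}^{\infty} \theta_i^2, \\
\sum_{i=1}^{\infty} \theta_i^2
  &\le \frac{1}{a_1^2} \sum_{i=1}^{\infty} a_i^2 \theta_i^2
   \le \frac{B}{a_1^2}
   \qquad \text{for } \theta \in \Theta(a, B),\ 0 < a_1 \le a_2 \le \cdots, \\
\inf_{\hat{Q} \in \mathcal{D}} \sup_{\theta \in \Theta(a,B)} R(\theta, \hat{Q})
  &\le \sup_{\theta \in \Theta(a,B)} l(\theta, Q_0)
   \le \frac{1}{2\tilde{\varepsilon}^2} \cdot \frac{B}{a_1^2}.
\end{align*}
```

Since $Q_0$ does not depend on $X$, its risk equals its loss, which is why the last line needs no expectation over $P_\theta$.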
The constructed predictive distribution is the Bayesian predictive distribution based on a Gaussian prior distribution. For a prior distribution $\Pi$ of $\theta$, the Bayesian predictive distribution $Q_\Pi$ based on $\Pi$ is obtained by averaging $Q_\theta$ with respect to the posterior distribution based on $\Pi$. Our construction is a generalization of the result in Xu and Liang [19] to infinite-dimensional settings. The details are provided in Section 3.
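To make the averaging step concrete, the following is a minimal single-coordinate sketch. It assumes a prior $N(0, \tau^2)$ on one coordinate $\theta_i$ (the parameterization of the Gaussian prior by a variance `tau2` is an assumption of this sketch), and all function names are hypothetical. It compares a closed-form Kullback-Leibler risk, obtained from the standard KL formula between two univariate Gaussians, with a Monte Carlo estimate.

```python
import math
import random


def gaussian_predictive(x, tau2, eps, eps_tilde):
    """Mean and variance of the Bayesian predictive distribution of Y_i
    given X_i = x, under the (assumed) prior theta_i ~ N(0, tau2)."""
    w = tau2 / (tau2 + eps ** 2)          # posterior shrinkage weight
    mean = w * x                          # posterior mean of theta_i
    var = eps_tilde ** 2 + w * eps ** 2   # future noise + posterior variance
    return mean, var


def kl_gauss(m0, v0, m1, v1):
    """KL( N(m0, v0) || N(m1, v1) ) for univariate Gaussians."""
    return 0.5 * (math.log(v1 / v0) + (v0 + (m0 - m1) ** 2) / v1 - 1.0)


def kl_risk_exact(theta, tau2, eps, eps_tilde):
    """Closed-form KL risk: the KL loss averaged over X_i ~ N(theta, eps^2)."""
    w = tau2 / (tau2 + eps ** 2)
    var = eps_tilde ** 2 + w * eps ** 2
    mse = (1.0 - w) ** 2 * theta ** 2 + w ** 2 * eps ** 2  # E(theta - w X)^2
    return 0.5 * (math.log(var / eps_tilde ** 2)
                  + (eps_tilde ** 2 + mse) / var - 1.0)


def kl_risk_mc(theta, tau2, eps, eps_tilde, n, rng):
    """Monte Carlo estimate of the same risk, simulating X_i directly."""
    total = 0.0
    for _ in range(n):
        x = theta + eps * rng.gauss(0.0, 1.0)
        mean, var = gaussian_predictive(x, tau2, eps, eps_tilde)
        total += kl_gauss(theta, eps_tilde ** 2, mean, var)
    return total / n
```

Because both the model and the prior are products over coordinates, the risk of the full predictive distribution would be the sum of such coordinate-wise terms, provided the sum converges.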
Further, we discuss adaptivity to the sequence $a$ and $B$. In applications, since we do not know the true values of $a$ and $B$, it is desirable to construct a predictive distribution without using $a$ and $B$ that is asymptotically minimax in any ellipsoid in a given class. Such a predictive distribution is called an asymptotically minimax adaptive predictive distribution in the class. In the present paper, we focus on an asymptotically minimax adaptive predictive distribution in the simplified Sobolev class $\{\Theta_{\mathrm{Sobolev}}(\alpha, B) : \alpha > 0, B > 0\}$, where $\Theta_{\mathrm{Sobolev}}(\alpha, B) := \{\theta \in l_2 : \sum_{i \in \mathbb{N}} i^{2\alpha} \theta_i^2 \le B\}$. Our construction of the asymptotically minimax adaptive predictive distribution is based on Stein's prior and the division of the parameter into blocks. The proof of the adaptivity relies on a new oracle inequality related to the Bayesian predictive distribution based on Stein's prior; see Subsection 4.2. Stein's prior on $\mathbb{R}^n$ is an improper prior whose density is $\left(\sum_{i=1}^{n} \theta_i^2\right)^{(2-n)/2}$. It is known that the Bayesian predictive distribution based on that prior has a smaller Kullback-Leibler risk than that based on the uniform prior in the finite-dimensional Gaussian settings; see Komaki [9] and George, Liang and Xu [8]. The division of the parameter into blocks is widely used for the construction of the asymptotically minimax adaptive estimator; see Efromovich and Pinsker [7], Cai, Low and Zhao [4], and Cavalier and Tsybakov [5]. The details are provided in Section 4.
The remainder of the paper is organized as follows. In Section 5, we provide an efficient sampling method for the proposed asymptotically minimax adaptive distribution and provide numerical experiments with a fixed ε. In Section 6, we conclude the paper.

Equivalence between predictions in infinite sequence models and predictions in function models
In this section, we provide an equivalence between prediction in a function model and prediction in an infinite sequence model. The proof consists of two steps. First, we provide a connection between predictions in a function model and predictions in the submodel of an infinite sequence model. Second, we extend predictions in the submodel to predictions in the infinite sequence model. The detailed description of prediction in a function model is as follows. Let Explicitly, $\lambda_i$ is $1/\{\pi(i - 1/2)\}^2$ and $e_i(\cdot)$ is $\sqrt{2} \sin((i - 1/2)\pi \, \cdot)$ for $i \in \mathbb{N}$. The detailed description of prediction in the submodel of an infinite sequence model is as follows.
and we use Theorem 4.2.2 in Dudley [6]. Let $\mathcal{A}_D$ be the whole set of probability distributions on $(S_D, \mathcal{B}_D)$, where $\mathcal{B}_D$ is the relative σ-field of $\mathcal{R}^{\infty}$.
The following theorem states that the Kullback-Leibler loss in the function model is equivalent to that in the submodel of the infinite sequence model.

Theorem 2.1. For every Q ∈ A F and every F ∈ H F , there exist Q ∈ A D and θ ∈ l 2 such that Conversely, for every Q ∈ A D and every θ ∈ l 2 , there exist Q ∈ A F and F ∈ H F such that

Proof. We construct pairs of a measurable one-to-one map Φ : L 2 [0, 1] → S D and a measurable one-to-one map Ψ : Φ is well-defined as a map from L 2 [0, 1] to S D because for x(·) and y(·) in L 2 [0, 1] such that x(·) = y(·), we have x(·), λ

We show that Φ is one-to-one, onto, and measurable. Φ is one-to-one because if Φ(x(·)) = Φ(y(·)), then we have λ i e i (·) satisfies that Φ(x(·)) = x. Φ is measurable because Φ is continuous with respect to the norm || · || L2 of L 2 [0, 1] and ρ, and because R ∞ is equal to the Borel σ-field with respect to Further, the restriction of Φ to H F is a one-to-one and onto map from H F to l 2 .
Let Ψ : . Ψ is the inverse of Φ. Thus, Ψ is one-to-one, onto, and measurable.
Since the Kullback-Leibler divergence is unchanged under a measurable one-to-one mapping, the proof is completed.

Remark 2.2.
Mandelbaum [11] constructed the connection between estimation in an infinite sequence model and estimation in a function model. Our connection is its extension to prediction. In fact, the map Φ is used in Mandelbaum [11].
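The following theorem justifies focusing on prediction in $(\mathbb{R}^{\infty}, \mathcal{R}^{\infty})$ instead of prediction in $(S_D, \mathcal{B}_D)$.

Theorem 2.3. For every $\theta \in l_2$ and $\hat{Q} \in \mathcal{A}$, there exists $\tilde{Q} \in \mathcal{A}_D$ such that $l(\theta, \tilde{Q}) \le l(\theta, \hat{Q})$.
In particular, for any subset $\Theta$ of $l_2$,

Proof. Note that $Q_\theta(S_D) = 1$ by the Karhunen-Loève theorem. For $\hat{Q} \in \mathcal{A}$ such that $\hat{Q}(S_D) = 0$, $l(\theta, \hat{Q}) = \infty$, and then for any $\tilde{Q} \in \mathcal{A}_D$, $l(\theta, \tilde{Q}) \le l(\theta, \hat{Q})$. For where $\tilde{Q}$ is the restriction of $\hat{Q}$ to $S_D$.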
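The eigenpairs $\lambda_i = 1/\{\pi(i - 1/2)\}^2$ and $e_i(\cdot) = \sqrt{2}\sin((i - 1/2)\pi\,\cdot)$ given explicitly in this section can be checked numerically. The sketch below relies on a fact not restated in the text, namely that the covariance kernel of the standard Wiener process on $[0, 1]$ is $\min(s, t)$, and verifies orthonormality and the eigen-equation by a midpoint rule. The function names and the grid size are illustrative choices.

```python
import math


def eigenvalue(i):
    """lambda_i = 1 / {pi (i - 1/2)}^2 for i = 1, 2, ..."""
    return 1.0 / (math.pi * (i - 0.5)) ** 2


def eigenfunction(i, t):
    """e_i(t) = sqrt(2) sin((i - 1/2) pi t)."""
    return math.sqrt(2.0) * math.sin((i - 0.5) * math.pi * t)


def integrate(f, n=2000):
    """Composite midpoint rule on [0, 1] with n subintervals."""
    h = 1.0 / n
    return h * sum(f((k + 0.5) * h) for k in range(n))


def inner_product(i, j):
    """L2[0,1] inner product of e_i and e_j."""
    return integrate(lambda t: eigenfunction(i, t) * eigenfunction(j, t))


def covariance_applied(i, t):
    """Apply the integral operator with kernel min(s, t) to e_i at t."""
    return integrate(lambda s: min(s, t) * eigenfunction(i, s))


if __name__ == "__main__":
    # Orthonormality: diagonal entries near 1, off-diagonal entries near 0.
    print(inner_product(1, 1), inner_product(1, 3))
    # Eigen-equation: both numbers should agree up to quadrature error.
    print(covariance_applied(2, 0.3), eigenvalue(2) * eigenfunction(2, 0.3))
```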

Asymptotically minimax predictive distribution
In this section, we construct an asymptotically minimax predictive distribution for the setting in which the parameter space is an ellipsoid $\Theta(a, B) = \{\theta \in l_2 : \sum_{i=1}^{\infty} a_i^2 \theta_i^2 \le B\}$ with a known sequence $a = (a_1, a_2, \ldots)$ and with a known constant $B$. Further, we provide the asymptotically minimax predictive distributions for two well-known ellipsoids: a Sobolev ellipsoid and an exponential ellipsoid.

Principal theorem of Section 3
We construct an asymptotically minimax predictive distribution in Theorem 3.1.
Further, the Bayesian predictive distribution based on $G_{\tau = \tau^*(\varepsilon, \tilde{\varepsilon})}$ is asymptotically minimax:

The proof is provided in the next subsection.

Proof of the principal theorem of Section 3
The proof of Theorem 3.1 requires five lemmas. Because the parameter is infinite-dimensional, we need Lemmas 3.2 and 3.5 in addition to Theorem 4.2 in Xu and Liang [19].
The first lemma provides the explicit form of the Kullback-Leibler risk of the Bayesian predictive distribution Q Gτ . The proof is provided in Appendix A.

Lemma 3.2.
If $\theta \in l_2$ and $\tau \in l_2$, then $Q_{G_\tau}(\cdot | X)$ and $Q_\theta$ are mutually absolutely continuous given $X = x$ $P_\theta$-a.s. and the Kullback-Leibler risk $R(\theta, Q_{G_\tau})$ of the Bayesian predictive distribution $Q_{G_\tau}$ is given by

The second lemma provides the Bayesian predictive distribution that is minimax among a subclass of $\mathcal{D}$. The proof is provided in Appendix A.
Then, for any $\varepsilon > 0$ and

The third lemma provides the upper bound of the minimax risk.

Lemma 3.4.
Assume that $0 < a_1 \le a_2 \le \cdots \nearrow \infty$. Then, for any $\varepsilon > 0$ and any $\tilde{\varepsilon} > 0$,

We introduce the notations for providing the lower bound of the minimax risk. These notations are also used in Lemma 4.2. Fix an arbitrary positive integer

The fourth lemma shows that the minimax risk in the infinite sequence model is bounded below by the minimax risk in the finite-dimensional sequence model. The proof is provided in Appendix A.
The fifth lemma provides the asymptotic minimax risk in a high-dimensional sequence model. It is due to Xu and Liang [19].

Lemma 3.6 (Theorem 4.2 in Xu and Liang [19]). Let $\tau^*(\varepsilon, \tilde{\varepsilon})$ be defined by (8). Let $T(\varepsilon, \tilde{\varepsilon})$ be defined by (9).

Based on these lemmas, we present the proof of Theorem 3.1.
From Lemma 3.5 with d = 1/ε 2 and Lemma 3.6, This completes the proof.

Examples of asymptotically minimax predictive distributions
In this subsection, we provide the asymptotically minimax Kullback-Leibler risks and the asymptotically minimax predictive distributions in the case that Θ(a, B) is a Sobolev ellipsoid and in the case that it is an exponential ellipsoid.

The Sobolev ellipsoid
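The simplified Sobolev ellipsoid is $\Theta_{\mathrm{Sobolev}}(\alpha, B) = \{\theta \in l_2 : \sum_{i \in \mathbb{N}} i^{2\alpha} \theta_i^2 \le B\}$ with $\alpha > 0$ and $B > 0$. We set $\tilde{\varepsilon} = \gamma \varepsilon$ for $\gamma > 0$. This setting is a slight generalization of Section 5 of Xu and Liang [19], in which the asymptotic minimax Kullback-Leibler risk with $\gamma = 1$ is obtained.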
We expand T := T (ε,ε) and τ * (ε,ε). From the definition of T , we have 2λ(ε,ε) = 1 T 2αε2 (1 + o (1)). Thus, we have where we use the convergence of the Riemann sum and Thus, we obtain the asymptotically minimax risk where We compare the Kullback-Leibler risk of the asymptotically minimax predictive distribution with the Kullback-Leibler risk of the plug-in predictive distribution that is asymptotically minimax among all plug-in predictive distributions. The latter is obtained using Pinsker's asymptotically minimax theorem for estimation (see Pinsker [14]). We call the former and the latter risks the predictive and the estimative asymptotically minimax risks, respectively. The orders of ε −2 and B in the predictive asymptotic minimax risk are both the 1/(2α + 1)-th power. These orders are the same as in the estimative asymptotically minimax risk. However, the convergence constant P * and the convergence constant in the estimative asymptotically minimax risk are different. Note that the convergence constant in the estimative asymptotically minimax risk is the 2α(2α+1) multiplied by 1/(2γ 2 ). Figure 1 shows that the convergence constant P * becomes smaller than the convergence constant in the estimative asymptotically minimax risk as γ −1 increases. Xu and Liang [19] also pointed out this phenomenon when γ = 1.
Thus, we obtain the asymptotically minimax risk We compare the predictive asymptotically minimax risk with the estimative asymptotically minimax risk in the exponential ellipsoid. From (13), From Pinsker's asymptotically minimax theorem, Thus, for any γ > 0, In an exponential ellipsoid, the order of ε in the predictive asymptotically minimax risk is the same as that in the estimative asymptotically minimax risk. The convergence constant in the predictive asymptotically minimax risk is strictly smaller than that in the estimative asymptotically minimax risk.
Remark 3.7. There are differences between the asymptotically minimax risks in the Sobolev and the exponential ellipsoids. The constant B has the same order in the asymptotically minimax risk as that of ε −2 when the parameter space is the Sobolev ellipsoid. In contrast, the constant B disappears in the asymptotically minimax risk when the parameter space is the exponential ellipsoid.

Asymptotically minimax adaptive predictive distribution
In this section, we show that the blockwise Stein predictive distribution is asymptotically minimax adaptive on the family of Sobolev ellipsoids. Recall that the Sobolev ellipsoid is $\Theta_{\mathrm{Sobolev}}(\alpha, B) = \{\theta \in l_2 : \sum_{i \in \mathbb{N}} i^{2\alpha} \theta_i^2 \le B\}$.

Principal theorem of Section 4
For the principal theorem, we introduce a blockwise Stein predictive distribution and a weakly geometric blocks system.
A blockwise Stein predictive distribution for a set of blocks is constructed as follows. Let $d$ be any positive integer. We divide $\{1, \ldots, d\}$ into $J$ blocks: $\{1, \ldots, d\} = \cup_{j=1}^{J} B_j$. We denote the number of elements in each block $B_j$ by $b_j$. Corresponding to the division into the blocks $B(d)$, we divide $\theta^{(d)} = (\theta_1, \ldots, \theta_d)$ into $\theta_{B_1} = (\theta_1, \ldots, \theta_{b_1})$, $\ldots$, and $\theta_{B_J} = (\theta_{\sum_{j=1}^{J-1} b_j + 1}, \ldots, \theta_d)$. In the same manner, we divide $X^{(d)}$ into $X_{B_1}$, $\ldots$, and $X_{B_J}$. Let h where $\| \cdot \|$ is the square norm. We define the blockwise Stein predictive distribution with the set of blocks $B(d)$ as where h . In regard to estimation, Brown and Zhao [3] discussed the behavior of the Bayes estimator based on the blockwise Stein prior.
The weakly geometric blocks (WGB) system is introduced as follows.
The following is the principal theorem. Let d(ε) be 1/ε 2 . Let B * ε be the WGB system defined by (15) be the blockwise Stein predictive distribution with the WGB system B * ε defined by (14).
The proof is provided in Subsection 4.3. An inequality related to the Bayesian predictive distribution based on Stein's prior that we will use in the proof of Theorem 4.1 will be shown in Subsection 4.2. In Subsection 4.3, we introduce several lemmas and provide the proof of Theorem 4.1.
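The block construction can be illustrated with a short sketch. Definition (15) of the WGB system is not reproduced in this text, so the code below follows the weakly geometric blocks construction as commonly stated in Tsybakov [16], with $\rho_\varepsilon = 1/\log(1/\varepsilon)$, first block size $\lceil 1/\rho_\varepsilon \rceil$, and geometrically growing later blocks. This is an assumption, and the paper's version may differ in details. The function names are hypothetical.

```python
import math


def weakly_geometric_blocks(eps):
    """Block sizes b_1, ..., b_J covering {1, ..., d} with d = floor(1/eps^2).

    Assumed construction: b_1 = ceil(log(1/eps)),
    b_j = floor(b_1 * (1 + rho)^(j-1)) with rho = 1/log(1/eps),
    and the last block truncated so that the sizes sum to d.
    """
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie in (0, 1)")
    # Small guard against floating-point error when 1/eps^2 is an integer.
    d = math.floor(1.0 / eps ** 2 + 1e-9)
    log_inv = math.log(1.0 / eps)
    rho = 1.0 / log_inv
    first = math.ceil(log_inv)
    sizes, total, j = [], 0, 1
    while total < d:
        size = first if j == 1 else math.floor(first * (1.0 + rho) ** (j - 1))
        size = min(size, d - total)   # truncate the final block
        sizes.append(size)
        total += size
        j += 1
    return sizes


def split_into_blocks(vector, sizes):
    """Split a length-d vector into consecutive blocks of the given sizes."""
    if sum(sizes) != len(vector):
        raise ValueError("block sizes must sum to the vector length")
    blocks, start = [], 0
    for b in sizes:
        blocks.append(vector[start:start + b])
        start += b
    return blocks
```

For example, `split_into_blocks(x, weakly_geometric_blocks(0.05))` would divide the first $d(\varepsilon)$ coordinates of an observation into the blocks on which Stein's prior is placed separately.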

Oracle inequality of the Bayesian predictive distribution based on Stein's prior
Before considering the proof of Theorem 4.1, we show an oracle inequality related to Stein's prior for d > 2 that is useful outside of the proof of Theorem 4.1. Recall

Remark 4.3.
We call inequality (16) an oracle inequality of Stein's prior for the following reason. By the same calculation as in (21) in the proof of Lemma 3.3, the second term on the right-hand side of inequality (16) is the oracle Kullback-Leibler risk, that is, the minimum of the Kullback-Leibler risk in the case that the action space is and in the case that we are permitted to use the value of the true parameter $\theta^{(d)}$. Therefore, Lemma 4.2 tells us that the Kullback-Leibler risk of the d-dimensional Bayesian predictive distribution based on Stein's prior is bounded above by a constant independent of $d$ plus the oracle Kullback-Leibler risk.

Proof of Lemma 4.2. First,
is the Bayesian predictive distribution based on the uniform prior u (d) (θ (d) ) := 1, andθ h (d) are the Bayes estimators based on the uniform prior u (d) and based on Stein's prior h (d) , respectively. Here E v is the expectation of X (d) with respect to the d-dimensional Gaussian distribution with mean θ (d) and covariance matrix vI d . For the proof of the identity, see Brown, George and Xu [2]. Second, JS is the James-Stein estimator. For the first inequality, see Kubokawa [10]. For the second inequality, see e.g. Theorem 7.42 in Wasserman [17]. Thus, we have Here, we use  L2 (d, B) : . . = 0}. Note that another type of an asymptotically minimax adaptive predictive distribution in the family of L 2 -balls has been investigated by Xu and Zhou [20].
Thus, we have .
Since from Theorem 4.2 in Xu and Liang [19] we have the proof is complete.

Proof of the principal theorem of Section 4
In this subsection, we provide the proof of Theorem 4.1. The proof consists of the following two steps. First, in Lemma 4.7, we examine the properties of the blockwise Stein predictive distribution with a set of blocks. The proof of Lemma 4.7 requires Lemma 4.6. Second, we show that the blockwise Stein predictive distribution with the weakly geometric blocks system is asymptotically minimax adaptive on the family of Sobolev ellipsoids, using Lemma 4.7 and the property of the WGB system (Lemma 4.8).
For the proof, we introduce two subspaces of the decision space. For a given set of blocks Although the decision space G BW (B(d)) is included in the decision space G mon (B(d)), the following lemma states that if the growth rate of the numbers in each block in B(d) is controlled, then the infimum of the Kullback-Leibler risk among G mon (B(d)) is bounded by a constant plus a constant multiple of the infimum of the Kullback-Leibler risk among G BW (B(d)). The proof is provided in Appendix B.
The following lemma states the relationship between the Kullback-Leibler risk of the blockwise Stein predictive distribution and that of the predictive distribution in G mon (B(d)). The proof is provided in Appendix B.
be the blockwise Stein predictive distribution with the set of blocks B(d) defined by (14). Then, for any θ ∈ l 2 ,

The following lemma states that the WGB system satisfies the assumption in Lemmas 4.6 and 4.7. The proof is due to Tsybakov [16].

Lemma 4.8 (e.g., Lemma 3.12 in Tsybakov [16] j=1 be the WGB system defined by (15) j=1 . Then, there exist 0 < ε 0 < 1 and C 0 > 0 such that

Based on these lemmas, we provide the proof of Theorem 4.1.
Proof of Theorem 4.1. First, since the WGB system B * ε satisfies the assumption in Lemma 4.7, it follows from Lemma 4.7 that for 0 < ε < ε 0 , Second, we show that the asymptotically minimax predictive distribution Q G τ =τ * (ε,ε) in Theorem 3.1 is also characterized as follows: for a sufficiently small ε > 0, It suffices to show that the Bayesian predictive distribution Q G τ =τ * (ε,ε) is included in G mon (B * ε ) for a sufficiently small ε > 0. This is proved as follows. Recall that T (ε,ε) defined by (9) is the maximal index of which τ * i (ε,ε) defined by (8) is non-zero. From the expansion of T (ε,ε) given in (11), for a sufficiently small ε > 0, for for a sufficiently small ε > 0. Combining the first argument with the second argument yields This completes the proof.

Numerical experiments
In Subsection 5.1, we provide an exact sampling method for the blockwise Stein predictive distribution. In Subsection 5.2, we provide two numerical experiments concerning the performance of that predictive distribution.

Exact sampling from the blockwise Stein predictive distribution
We provide an exact sampling method from the posterior distribution based on Stein's prior $h^{(d)}(\theta^{(d)}) := \|\theta^{(d)}\|^{2-d}$ on $\mathbb{R}^d$. Owing to the block structure, sampling from the posterior distribution based on the blockwise Stein prior reduces to sampling from this posterior distribution block by block. We use the following mixture representation of Stein's prior: where $c(d)$ is a constant depending only on $d$. Thus, for the posterior distribution based on $h^{(d)}$, we have Here $\pi(\theta^{(d)} | t, x^{(d)})$ is the probability density function of the normal distribution. Under the transformation $t \to \kappa := \varepsilon^2/(\varepsilon^2 + t)$, the distribution $f(\kappa | x^{(d)})$ of $\kappa$ is a truncated Gamma distribution: Therefore, we obtain an exact sampling from the posterior distribution based on Stein's prior by sampling the normal distribution and the truncated Gamma distribution. For the sampling from the truncated Gamma distribution, we use the acceptance-rejection algorithm for truncated Gamma distributions based on the mixture of beta distributions; see Philippe [13].
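The two-stage sampler described above can be sketched as follows. The explicit forms used in the code are derived here from the standard scale-mixture-of-normals representation of $\|\theta\|^{2-d}$ and are assumptions of this sketch rather than quotations of the displayed equations: $\kappa \mid x$ has density proportional to $\kappa^{d/2-2} \exp\{-\kappa \|x\|^2/(2\varepsilon^2)\}$ on $(0, 1)$, and $\theta \mid \kappa, x \sim N((1-\kappa)x, \varepsilon^2(1-\kappa) I_d)$. For the truncated Gamma step, plain rejection from an untruncated Gamma draw is swapped in for the algorithm of Philippe [13]; it is exact but can be slow when $\|x\|^2/\varepsilon^2$ is small relative to $d$. All function names are hypothetical.

```python
import math
import random


def sample_truncated_gamma(shape, rate, rng, max_tries=100000):
    """Draw kappa in (0, 1) with density prop. to kappa^(shape-1) exp(-rate*kappa).

    Plain rejection: draw from Gamma(shape, rate) and keep draws below 1.
    """
    for _ in range(max_tries):
        kappa = rng.gammavariate(shape, 1.0 / rate)  # second argument is a scale
        if 0.0 < kappa < 1.0:
            return kappa
    raise RuntimeError("acceptance rate too low; use a dedicated sampler")


def sample_stein_posterior(x, eps, rng):
    """One draw of theta from the posterior under Stein's prior (needs d > 2)."""
    d = len(x)
    sq_norm = sum(xi * xi for xi in x)
    if d <= 2 or sq_norm == 0.0:
        raise ValueError("requires d > 2 and a non-zero observation")
    kappa = sample_truncated_gamma(d / 2.0 - 1.0, sq_norm / (2.0 * eps ** 2), rng)
    sd = eps * math.sqrt(1.0 - kappa)
    return [(1.0 - kappa) * xi + sd * rng.gauss(0.0, 1.0) for xi in x]


def sample_stein_predictive(x, eps, eps_tilde, rng):
    """One draw of the future observation y: theta from the posterior, then noise."""
    theta = sample_stein_posterior(x, eps, rng)
    return [ti + eps_tilde * rng.gauss(0.0, 1.0) for ti in theta]
```

For the blockwise version, one would apply `sample_stein_predictive` to each block of the observation separately and concatenate the results.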

Comparison with a fixed variance
Though we proved the asymptotic optimality of the blockwise Stein predictive distribution with the WGB system, it does not follow that the blockwise Stein predictive distribution behaves well at a fixed noise level ε.
In this subsection, we examine the behavior with a fixed ε of the blockwise Stein predictive distribution with the WGB system compared to the plugin predictive distribution with the Bayes estimator based on the blockwise Stein prior and the asymptotically minimax predictive distribution in the Sobolev ellipsoid Θ Sobolev (α, B) given in Theorem 3.1. In this subsection, we call the asymptotically minimax predictive distribution in the Sobolev ellipsoid Θ Sobolev (α, B) given in Theorem 3.1 the Pinsker-type predictive distribution with α and B.
For the comparison, we consider the 6 predictive settings with ε = 0.05:

In each setting, we obtain 1000 samples of $y^{(d(\varepsilon))}$ distributed according to the blockwise Stein predictive distribution with the WGB system up to the $d(\varepsilon)$-th order using the sampling method described in Subsection 5.1, and we construct the coordinate-wise 80%-predictive interval of $y^{(d(\varepsilon))}$ using these 1000 samples. In each setting, we use the Pinsker-type predictive distribution with α and B such that $\sum_{i} i^{2\alpha} \theta_i^2 \le B$: we use α = 2 and B = 3 in the first, second, and third settings. We use α = 0.75 and B = 3 in the fourth, fifth, and sixth settings.
In each setting, we obtain 5000 samples from the true distribution of y and calculate the means of the coordinate-wise mean squared errors normalized by $\tilde{\varepsilon}^2$, and then calculate the means and the standard deviations of the counts of the samples included in the predictive intervals. Tables 1 and 2 show that the Pinsker-type predictive distribution (abbreviated by Pinsker) has the smallest mean squared error and has the sharpest predictive interval. This is because the Pinsker-type predictive distribution uses α and B. The blockwise Stein predictive distribution (abbreviated by Bayes with WGBStein) and the plugin predictive distribution with the Bayes estimator based on the blockwise Stein prior (abbreviated by Plugin with WGBStein) have nearly the same performance in the mean squared error. The blockwise Stein predictive distribution has a wider predictive interval than the plugin predictive distribution. Its predictive interval has a smaller variance than that of the plugin predictive distribution in all settings. In the next paragraph, we consider the reason for this phenomenon by using the transformation of the infinite sequence model to the function model discussed in Section 2.
Using the function representation of the infinite sequence model discussed in Section 2, we examine the behavior of the predictive distributions in the second and fifth settings. Figure 2 shows the mean path and the predictive intervals of predictive distributions at $t \in \{i/1000\}_{i=1}^{1000}$ and the values of the true function at $t \in \{i/1000\}_{i=1}^{1000}$. Figure 2 (a), Figure 2 (b), Figure 2 (c), and Figure 2 (d) represent the Pinsker-type predictive distribution and the true function in the second setting, the blockwise Stein predictive distribution and the plugin predictive distribution in the second setting, the Pinsker-type predictive distribution and the true function in the fourth setting, and the blockwise Stein predictive distribution and the plugin predictive distribution in the fourth setting, respectively. The solid line represents the true function and the mean paths. The dashed line represents the pointwise 80% predictive intervals. The black, green, blue, and red lines correspond to the true function, the Pinsker-type predictive distribution, the blockwise Stein predictive distribution, and the plugin predictive distribution, respectively. The mean paths of the blockwise Stein predictive distribution and the plugin predictive distributions are more distant from the true function than that of the Pinsker-type predictive distribution, corresponding to the results in Table 1. The predictive intervals of the blockwise Stein predictive distribution are wider than those of the other predictive distributions, corresponding to the results in Table 2. Though the blockwise Stein predictive distribution has a mean path that is more distant from the true function than the Pinsker-type predictive distribution, it has a wider predictive interval and captures future observations.
In contrast, although the plugin predictive distribution has nearly the same mean path as the blockwise Stein predictive distribution does, it has a narrow predictive interval and does not capture future observations.
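The coordinate-wise 80% predictive intervals and the coverage counts used in this subsection can be computed from predictive samples along the following lines. How the interval endpoints were obtained is not specified in the text, so this sketch assumes equal-tailed intervals from the 10% and 90% empirical quantiles with linear interpolation. The function names are hypothetical.

```python
import math


def empirical_quantile(sorted_values, p):
    """Empirical quantile with linear interpolation, for p in [0, 1]."""
    n = len(sorted_values)
    pos = p * (n - 1)
    lower = int(math.floor(pos))
    upper = min(lower + 1, n - 1)
    frac = pos - lower
    return sorted_values[lower] * (1.0 - frac) + sorted_values[upper] * frac


def coordinatewise_intervals(samples, level=0.8):
    """Equal-tailed predictive interval for each coordinate.

    samples: list of predictive draws, each a list of length d.
    Returns a list of d (lower, upper) pairs.
    """
    d = len(samples[0])
    tail = (1.0 - level) / 2.0
    intervals = []
    for i in range(d):
        column = sorted(s[i] for s in samples)
        intervals.append((empirical_quantile(column, tail),
                          empirical_quantile(column, 1.0 - tail)))
    return intervals


def coverage_count(intervals, y):
    """Number of coordinates of a future observation y inside its interval."""
    return sum(1 for (lo, hi), yi in zip(intervals, y) if lo <= yi <= hi)
```

Averaging `coverage_count` over draws from the true distribution of $y$ gives the kind of mean count, with its standard deviation, that the comparison above reports.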

Discussions and Conclusions
In this paper, we have considered asymptotically minimax Bayesian predictive distributions in an infinite sequence model. First, we have provided the connection between prediction in a function model and prediction in an infinite sequence model. Second, we have constructed an asymptotically minimax Bayesian predictive distribution for the setting in which the parameter space is a known ellipsoid. Third, using the product of Stein's priors based on the division of the parameter into blocks, we have constructed an asymptotically minimax adaptive Bayesian predictive distribution in the family of Sobolev ellipsoids.
We established the fundamental results of prediction in the infinite-dimensional model using the asymptotics as ε → 0. The approach was motivated by Xu and Liang [19]. Since it is not always appropriate to use asymptotics in applications, the next step is to provide the result for a fixed ε.
We discussed the asymptotic minimaxity and the adaptivity for the ellipsoidal parameter space. There are many other types of parameter space in high-dimensional and nonparametric models; for example, Mukherjee and Johnstone [12] discussed the asymptotically minimax prediction in a high-dimensional Gaussian sequence model under sparsity. For future work, we should focus on the asymptotically minimax adaptive predictive distributions in other parameter spaces.

Appendix A: Proofs of Lemmas in Section 3
Proof of Lemma 3.2. The proof is similar to those of Lemmas 5.1 and 6.1 in Belitser and Ghosal [1]. We denote the expectation of $X$ and $Y$ with respect to $P_\theta$ and $Q_\theta$ by $E_{X,Y|\theta}$.