Adaptive Bayesian nonparametric regression using kernel mixture of polynomials with application to partial linear model

We propose a kernel mixture of polynomials prior for Bayesian nonparametric regression. The regression function is modeled by local averages of polynomials with kernel mixture weights. By estimating metric entropies of certain function classes, we show that the full posterior distribution contracts at the minimax-optimal rate, up to a logarithmic factor, adaptively to the smoothness level of the true function. We also provide a frequentist sieve maximum likelihood estimator with a near-optimal convergence rate. We further investigate the application of the kernel mixture of polynomials to the partial linear model and obtain both the near-optimal rate of contraction for the nonparametric component and the Bernstein-von Mises limit (i.e., asymptotic normality) of the parametric component. The proposed method is illustrated with numerical examples and shows superior performance in terms of computational efficiency, accuracy, and uncertainty quantification compared to local polynomial regression, DiceKriging, and the robust Gaussian stochastic process.

Frequentist methods for nonparametric regression typically compute a single estimated function from the given data (x_i, y_i)_{i=1}^n. In contrast, Bayesian nonparametric techniques first impose a carefully selected prior on the unknown function f and then derive the posterior distribution of f given the observed data (x_i, y_i)_{i=1}^n, providing natural uncertainty quantification through the full posterior instead of the point estimates given by frequentist approaches. One of the most popular Bayesian nonparametric regression methods is the Gaussian process (Rasmussen and Williams, 2006), due to its tractability. Nevertheless, the computational burden of the Gaussian process, which arises from the inversion of the covariance matrix in each likelihood evaluation, prevents its scalability to big data.
In this paper, we propose a kernel mixture of polynomials prior for nonparametric regression, which does not involve the cumbersome O(n^3) inversion of the large covariance matrix. Leaving the Bayesian framework for a moment, let us consider the frequentist Nadaraya-Watson estimator (Nadaraya, 1964; Watson, 1964) of the form f̂(x) = Σ_{i=1}^n K_h(x − x_i) y_i / Σ_{j=1}^n K_h(x − x_j), where K_h is the kernel function parametrized by the bandwidth parameter h ∈ (0, +∞) and is assumed to decrease as its argument moves away from 0, so that the Nadaraya-Watson estimator is a local averaging estimator (Györfi et al., 2006). As the Nadaraya-Watson estimator does not yield an optimal rate of convergence when the true regression function satisfies the α-Hölder condition for α ≥ 2 (Devroye et al., 2013), Fan and Gijbels (1996) consider the more general local polynomial regression to capture higher-order curvature information of the unknown regression function and attain the optimal rate of convergence.
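To make the local-averaging idea concrete, the following minimal Python sketch implements a Nadaraya-Watson estimator; the Gaussian kernel and the fixed bandwidth below are illustrative assumptions and are not the boxed kernel construction developed later in the paper.

```python
import numpy as np

def nadaraya_watson(x_train, y_train, x_query, h=0.1):
    """Local-averaging (Nadaraya-Watson) estimate at the query points.

    x_train, y_train: observed one-dimensional design points and responses.
    x_query: points at which the regression function is estimated.
    h: kernel bandwidth (illustrative fixed value; in practice it is tuned).
    """
    # Gaussian kernel weights K_h(x - x_i); any kernel decreasing in |x| works.
    diffs = (x_query[:, None] - x_train[None, :]) / h
    weights = np.exp(-0.5 * diffs ** 2)
    # Weighted local average of the responses.
    return (weights @ y_train) / weights.sum(axis=1)

# Toy usage on synthetic data.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(200)
print(nadaraya_watson(x, y, np.linspace(0, 1, 5)))
```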
Inspired by the Nadaraya-Watson estimator and the local polynomial regression, we seek to develop a kernel mixture of polynomials prior by assuming that f takes the form f(x) = Σ_k w_k(x) Σ_s ξ_ks (x − µ_k)^s, where k and s range over some index sets, w_k(x) are kernel mixture weights, and {ξ_ks (x − µ_k)^s}_{ks} are suitably selected polynomial functions that mimic the behavior of f locally around µ_k, provided x and µ_k are sufficiently close. Furthermore, in order for the µ_k's to detect both the local behavior and the global behavior of f, they need to be well-spread. To achieve this, we partition the design space X into K^p disjoint hypercubes X = ∪_k X_K(k), where X_K(k) ∩ X_K(k') = ∅ if k ≠ k', and let µ_k ∈ X_K(k). This leads to an identifiable sub-model (µ_k, ξ_ks)_{k,s} ↦ f for each fixed K. The partition restriction is preferred because an unidentifiable model may result in poor mixing of the Markov chain Monte Carlo sampler for the posterior inference (Xie and Carlin, 2006). Alternatively, repulsive priors (Affandi et al., 2013; Xu et al., 2016a; Xie and Xu, 2017) can be incorporated to obtain well-spread kernel centers µ_k, but the identifiability issue remains unsolved. We formulate the setup rigorously in section 2.2.
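For orientation, a schematic version of this construction (formalized in section 2.2) can be written as follows; the normalized form of the weights shown here is an assumption consistent with the Nadaraya-Watson construction above and with the normalizing denominator D(x) introduced in section 2.2:
\[
f(x) = \sum_{k} w_k(x) \sum_{s} \xi_{ks}\,(x-\mu_k)^{s},
\qquad
w_k(x) = \frac{\varphi\{(x-\mu_k)/h\}}{\sum_{l}\varphi\{(x-\mu_l)/h\}},
\]
where ϕ is a kernel supported on the unit box and h is the bandwidth.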
The proposed kernel mixture of polynomials not only avoids the inversion of a large covariance matrix, but also enjoys nice frequentist asymptotic properties. The major contribution of this paper is that the proposed method is, to the best of our knowledge, the first in the literature that simultaneously achieves the following theoretical goals: (i) it yields the near-optimal rate of contraction with respect to the L_2-topology with adaptation to the smoothness level of the true function; (ii) it provides a frequentist sieve maximum likelihood estimator with a near-optimal convergence rate; (iii) when used to model the nonparametric component in the partial linear model, it leads to optimal convergence results for both the nonparametric component and the parametric component.
For goal (i), it is worth mentioning that most of the literature concerning posterior convergence for Bayesian nonparametric regression only discusses the rate of contraction with respect to the weaker empirical L_2-norm (van der Vaart and van Zanten, 2008; De Jonge et al., 2010; van der Vaart and van Zanten, 2009; Bhattacharya et al., 2014), i.e., the convergence of the function at the given design points. There is little discussion of the rate of contraction with respect to the exact L_2-norm for general Bayesian nonparametric regression methods. van der Vaart and van Zanten (2011), Yang et al. (2017), and Yoo et al. (2016) address this issue only in the context of Gaussian process regression. We show that the rate of contraction is minimax-optimal (Stone, 1982; Györfi et al., 2006) up to a logarithmic factor and is adaptive to the smoothness level in the sense that the prior does not depend on the smoothness level of the true function. The main technical tools we use for establishing the rate of contraction are upper bounds for metric entropies of certain function classes. We also obtain a frequentist sieve maximum likelihood estimator with a near-optimal convergence rate, fulfilling goal (ii).
For goal (iii), we further study the application of the kernel mixture of polynomials to the partial linear model as a natural semiparametric extension. The partial linear model is of the form y_i = z_i^T β + η(x_i) + e_i, where the z_i's and x_i's are design points, β is the linear coefficient, η is an unknown function to which the kernel mixture of polynomials prior is assigned, and the e_i's are independent N(0, 1) noises, i = 1, ..., n. The literature on the partial linear model is rich, from both the frequentist perspective (Engle et al., 1986; Chen et al., 1988; Speckman, 1988; Hastie and Tibshirani, 1990; Fan and Li, 1999) and the Bayesian perspective (Tang et al., 2015; Lenk, 1999; Bickel et al., 2012; Yang et al., 2015). However, there is little discussion regarding the theoretical properties of the Bayesian partial linear model. To the best of our knowledge, only Bickel et al. (2012) and Yang et al. (2015) discuss the asymptotic behavior of the marginal posterior distribution of β with Gaussian process priors for η under the Bayesian partial linear model. We impose the kernel mixture of polynomials prior on η and obtain both the near-optimal rate of contraction for η and the Bernstein-von Mises limit (i.e., asymptotic normality) of the marginal posterior of β. The technical tools developed for the kernel mixture of polynomials play a fundamental role in the proofs of the convergence properties of β and η.
The layout of this paper is as follows. Section 2 introduces the necessary notation and presents the kernel mixture of polynomials prior for nonparametric regression. Section 3 elaborates on the convergence properties of the kernel mixture of polynomials regression. In particular, we provide upper bounds for metric entropies of certain function classes, derive an adaptive rate of contraction with respect to the L_2-topology, and obtain a sieve maximum likelihood estimator with a near-optimal convergence rate. Section 4 presents a semiparametric application of the kernel mixture of polynomials to the partial linear model. We explore the asymptotic behavior and obtain both a near-optimal rate of contraction for the nonparametric component and the Bernstein-von Mises limit of the parametric component. Section 5 illustrates the proposed methodology using numerical examples. Further discussions are included in section 6.

2 Preliminaries

2.1 Notation
For 1 ≤ r ≤ ∞, we use ‖·‖_r to denote both the ℓ_r-norm on any finite dimensional Euclidean space and the L_r-norm of a measurable function. We follow the convention that when r = 2 the subscript is omitted, i.e., ‖·‖_2 = ‖·‖. For any integer n, denote [n] = {1, ..., n}. For x ∈ R^p and ε > 0, denote B_r(x, ε) = {y ∈ R^p : ‖y − x‖_r < ε}. We use ⌊x⌋ to denote the largest integer no greater than x, and ⌈x⌉ to denote the smallest integer no less than x. The notations a ≲ b and a ≳ b denote inequalities up to a positive multiplicative constant. Given an integer vector s = [s_1, ..., s_p]^T ∈ N^p and a real vector x = [x_1, ..., x_p]^T ∈ R^p, we denote |s| = s_1 + ... + s_p, the monomial x^s = x_1^{s_1} ··· x_p^{s_p}, and the mixed partial derivative operator ∂^s = ∂^{|s|}/(∂x_1^{s_1} ··· ∂x_p^{s_p}). Given α, L > 0 and a compact X ⊂ R^p, we say that a function f : X → R satisfies the α-Hölder condition with constant L if |∂^s f(x_1) − ∂^s f(x_2)| ≤ L ‖x_1 − x_2‖^{α−⌊α⌋} for all x_1, x_2 ∈ X and all s with |s| = ⌊α⌋. The class of functions on X that satisfy the α-Hölder condition with envelope B, denoted by C^{α,B}(X), is referred to as the α-Hölder function class (with envelope B). Given a distribution P_x on X, we write ‖f‖_{L_r(P_x)} = (∫_X |f|^r dP_x)^{1/r}. The notation 1(A) denotes the indicator of the event A.
We slightly abuse notation and do not distinguish between a random variable and its realization. We refer to P as a statistical model if it consists of a class of densities on a sample space X with respect to some underlying σ-finite measure. Given a (frequentist) statistical model P and independent and identically distributed data D_n = (x_i)_{i=1}^n from some P ∈ P, the prior and the posterior distribution on P are always denoted by Π(·) and Π(· | D_n), respectively. Given a function f : X → R, we use P_n f = n^{-1} Σ_{i=1}^n f(x_i) to denote the empirical measure of f, and G_n f = n^{-1/2} Σ_{i=1}^n {f(x_i) − E f(x_i)} to denote the empirical process of f, given the independent and identically distributed data (x_i)_{i=1}^n. We use p_x(x) or p(x) to denote the density of x, P_x to denote the distribution of x, and E_x for the corresponding expected value. In particular, φ denotes the probability density function of the (univariate) standard normal distribution, and we use the shorthand notation φ_σ(y) = φ(y/σ)/σ to denote the density of N(0, σ²). The Hellinger distance between two densities p_1, p_2 is denoted by H(p_1, p_2), and the Kullback-Leibler divergence is denoted by D_KL(p_1‖p_2). For a metric space (F, d) and any ε > 0, the ε-covering number of (F, d), denoted by N(ε, F, d), is defined to be the minimum number of ε-balls of the form {g ∈ F : d(f, g) < ε} that are needed to cover F. The ε-bracketing number of (F, d), denoted by N_{[·]}(ε, F, d), is defined to be the minimum number of brackets of the form [l_i, u_i] that are needed to cover F with l_i, u_i ∈ F such that d(l_i, u_i) < ε. We refer to log N(ε, F, d) as the metric entropy, and log N_{[·]}(ε, F, d) as the bracketing (metric) entropy. The bracketing integral ∫_0^ε {log N_{[·]}(u, F, d)}^{1/2} du is used extensively and is denoted by J_{[·]}(ε, F, d).
Without loss of generality we assume that the design space X is the unit hypercube [0, 1]^p in R^p, and that p_x is a continuous density yielding a distribution on X. We partition the design space X into K^p disjoint hypercubes X = ∪_{k∈[K]^p} X_K(k), where X_K(k) is the cube with side length 1/K centered at µ_k = [(2k_1 − 1)/(2K), ..., (2k_p − 1)/(2K)]^T for each k = [k_1, ..., k_p]^T ∈ [K]^p.

2.2 Setup
We begin with the notion of boxed kernel functions. We say that a continuous function ϕ : R^p → [0, ∞) is a boxed kernel if ϕ(x) > 0 whenever ‖x‖_∞ < 1 and ϕ(x) = 0 whenever ‖x‖_∞ ≥ 1. For each K ∈ N_+ and each k ∈ [K]^p, denote the kernel mixture weight

w_k(x) = ϕ{(x − µ_k)/h} / Σ_{l∈[K]^p} ϕ{(x − µ_l)/h},   (1)

where µ_k ∈ X_K(k) and h > 0. Given K and (w_k(x) : k ∈ [K]^p), we further define the kernel mixture of polynomial system to be the set of functions of the form ψ_ks(x) = w_k(x)(x − µ_k)^s, k ∈ [K]^p, |s| = 0, 1, ..., m. We next describe the setup. The data D_n = (x_i, y_i)_{i=1}^n are assumed to be i.i.d. according to the joint density p_0(x, y) = φ_{σ_0}(y − f_0(x)) p_x(x) for some σ_0 ∈ [σ̲, σ̄] and some unknown function f_0 ∈ C^{α,B}(X). In nonparametric regression the estimation of f_0 is of interest, and therefore the marginal density p_x of the design points (x_i)_{i=1}^n is assumed known and fixed. We use pr_0 and E_0 to denote the probability and expected value under p_0, respectively.
Formally, the kernel mixture of polynomials nonparametric regression model is described by the following hierarchy: given f and σ, the responses are generated as y_i = f(x_i) + e_i with e_i ~ N(0, σ²) independently, where

f(x) = Σ_{k∈[K]^p} w_k(x) Σ_{|s|=0}^m ξ_ks (x − µ_k)^s.   (3)

Here k = [k_1, ..., k_p]^T and µ_k = [(2k_1 − 1)/(2K), ..., (2k_p − 1)/(2K)]^T is the center of X_K(k). The parameters σ, K, (µ_k : k ∈ [K]^p), and (ξ_ks : k ∈ [K]^p, |s| = 0, 1, ..., m) are to be assigned a hierarchical prior in section 2.3. The problem is nonparametric, and therefore f_0 is not necessarily in F_K for any K ∈ N_+, where F_K denotes the class of functions of the form (3) for a fixed K. The only assumption we make on f_0 is that f_0 ∈ C^{α,B}(X) for some α, B > 0. The function f described in (3) is referred to as the kernel mixture of polynomials. Clearly there exists some constant A such that ‖f‖_∞ ≤ A for all f ∈ ∪_{K=1}^∞ F_K. Define D(x) = Σ_{l∈[K]^p} ϕ{(x − µ_l)/h} to be the denominator of the weights in (1). We remark that the kernel weights in (1) are well-defined, i.e., the denominator D(x) > 0. To see why this is true, notice that ϕ(x) > 0 when ‖x‖_∞ < 1 by assumption, so it suffices to show that X ⊂ ∪_{k∈[K]^p} B_∞(µ_k, h). In fact, for any x ∈ X there exists some k ∈ [K]^p such that x ∈ X_K(k), and hence ‖x − µ_k‖_∞ ≤ 1/K < h since Kh > 1, i.e., x ∈ B_∞(µ_k, h). Hence D(x) > 0 and the w_k(x)'s are well-defined.
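To make the construction concrete, the following minimal Python sketch (an illustration only, not the posterior sampler used in the paper) evaluates a kernel mixture of polynomials in one dimension with the bump kernel used later in the numerical studies; the values of K, m, h, and the coefficients ξ_ks below are illustrative assumptions.

```python
import numpy as np

def bump_kernel(u):
    """Compactly supported bump kernel: exp{-(1 - u^2)^(-1)} for |u| < 1, else 0."""
    out = np.zeros_like(u, dtype=float)
    inside = np.abs(u) < 1
    out[inside] = np.exp(-1.0 / (1.0 - u[inside] ** 2))
    return out

def kmp_eval(x, mu, xi, h):
    """Evaluate f(x) = sum_k w_k(x) sum_s xi[k, s] (x - mu[k])^s for p = 1.

    mu: kernel centers, one per partition cell X_K(k).
    xi: (K, m + 1) array of local polynomial coefficients.
    h:  bandwidth with 1/K < h, so that every x is covered by some kernel.
    """
    diffs = x[:, None] - mu[None, :]              # (n, K) signed distances
    phi = bump_kernel(diffs / h)                  # unnormalized kernel values
    w = phi / phi.sum(axis=1, keepdims=True)      # kernel mixture weights
    # Local polynomial sum_s xi[k, s] * (x - mu[k])^s for every (x, k) pair.
    poly = sum(xi[None, :, s] * diffs ** s for s in range(xi.shape[1]))
    return (w * poly).sum(axis=1)

# Illustrative usage: K = 5 cells on [0, 1], cubic local polynomials (m = 3).
K, m = 5, 3
mu = (2 * np.arange(1, K + 1) - 1) / (2 * K)      # centers of X_K(k)
rng = np.random.default_rng(1)
xi = rng.normal(scale=0.5, size=(K, m + 1))       # illustrative coefficients
print(kmp_eval(np.linspace(0, 1, 7), mu, xi, h=1.5 / K))
```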
Remark 2. We require that K ranges over all positive integers and that Kh stays bounded as K → ∞, so that ∪_{K=1}^∞ F_K is rich enough to provide good approximation to an arbitrary f_0 ∈ C^{α,B}(X) (Györfi et al., 2006). The kernel bandwidth parameter h, which is typically endowed with a prior in the literature on Bayesian kernel methods (Ghosal et al., 2007; Shen et al., 2013), also plays a key role in establishing the convergence properties of the kernel mixture of polynomials. It turns out that requiring 1 < h̲ ≤ Kh ≤ h̄ < ∞ yields the near-optimal rate of contraction.
3 Convergence properties of kernel mixture of polynomials regression

In this section, we establish the convergence results of the kernel mixture of polynomials regression and obtain a frequentist sieve maximum likelihood estimator with the corresponding convergence rate. For nonparametric regression problems, when the true regression function f_0 is in C^{α,L}(X), p_x(x) = 1, and e_i ~ N(0, 1), i = 1, ..., n, it has been shown that the minimax rate of convergence of any estimator of f_0 with respect to the L_2-norm is n^{−α/(2α+p)} (Stone, 1982; Györfi et al., 2006). The optimal rate of contraction cannot be faster than the minimax rate of convergence. Theorem 1 below, which is one of the main results of this section, asserts that the rate of contraction with respect to the L_2(P_x)-topology is minimax-optimal up to a logarithmic factor. Furthermore, the rate of contraction is adaptive to the smoothness α of f_0 in the sense that the prior specification does not require knowledge of α. In this section the proofs of lemma 1, proposition 1, proposition 2, and theorem 2 are deferred to the supplementary material.
Theorem 1 (Rate of contraction). Suppose Π is the prior constructed in section 2.3. Then the posterior distribution contracts around f_0 with respect to ‖·‖_{L_2(P_x)} at the rate ε_n = n^{−α/(2α+p)}(log n)^t for some constant t > 0, in pr_0-probability.

The sketch of the proof goes as follows. First of all, lemma 1 below guarantees that in order to prove theorem 1, it suffices to show that

Π(H(p_{f,σ}, p_0) > Mε_n | D_n) → 0   (6)

in pr_0-probability for some constant M > 0. Then we prove (6) by verifying a set of sufficient conditions presented in Kruijer et al. (2010). For convenience we describe these conditions in our context. Let M be a statistical model, i.e., a class of density functions, equipped with a prior Π. Define the Kullback-Leibler ball by B_KL(p_0, ε) = {p_{f,σ} : D_KL(p_0‖p_{f,σ}) ≤ ε², E_0[log{p_0/p_{f,σ}}]² ≤ ε²}. Kruijer et al. (2010) proved that in order for (6) to hold, it suffices to find another sequence (ε̄_n)_{n=1}^∞ with ε̄_n ≤ ε_n, a sequence of sub-classes of densities (M_n)_{n=1}^∞ with M_n ⊂ M, and for each M_n a partition into pieces whose metric entropies and prior probabilities are suitably controlled, together with a lower bound on the prior concentration Π(B_KL(p_0, ε̄_n)). The sequence of sub-classes of densities (M_n)_{n=1}^∞ is referred to as a sieve in the literature (Shen and Wong, 1994).
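For orientation, the sufficient conditions we verify can be summarized by the following generic template, written in the form that is standard in the posterior contraction literature; the slicing-based conditions of Kruijer et al. (2010) used in the supplementary material refine this template, and the generic constants c_1, c_2, c_3 are placeholders:
\[
\Pi\bigl(B_{\mathrm{KL}}(p_0,\bar\epsilon_n)\bigr) \ge e^{-c_1 n\bar\epsilon_n^2},
\qquad
\log N\bigl(\epsilon_n,\mathcal M_n, H\bigr) \le c_2 n\epsilon_n^2,
\qquad
\Pi(\mathcal M_n^{c}) \le e^{-c_3 n\bar\epsilon_n^2}.
\]
Together, these three requirements (prior concentration, sieve entropy, and negligible prior mass outside the sieve) imply (6).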
The following lemma serves as the first step of proving theorem 1.
Lemma 1. The Hellinger distance H(p_{f,σ}, p_0) dominates ‖f − f_0‖_{L_2(P_x)} up to a multiplicative constant, and hence for all sufficiently small ε > 0 and for some constant C_1 > 0, the Hellinger contraction (6) implies contraction with respect to ‖·‖_{L_2(P_x)} at the rate C_1 ε_n.

We now provide a lower bound for the prior concentration Π(p ∈ B_KL(p_0, ε̄_n)).
Proposition 1 (Prior concentration). Suppose Π is the prior constructed in section 2.3. Then there exist some constants C_2, r_0 > 0 such that for all sufficiently small ε > 0, Π(p ∈ B_KL(p_0, ε)) ≥ exp{−C_2 ε^{−p/α} (log(1/ε))^{r_0}}.
We next turn to the estimation of the metric entropies of F K , which is one of the major technical contributions of this paper.
By the construction of the sieve, lemma 1, proposition 2, and the fact that ‖f‖_{L_r(P_x)} ≤ ‖f‖_∞ for any r ≥ 1, we obtain the required metric entropy bound for M_n for some constant C_1 > 0, where we have used the fact that t > pδ + 1 in the last inequality. On the other hand, for sufficiently large n and some constants b_1, B_1 > 0, we argue that M_n captures sufficiently large prior mass, i.e., Π(M_n^c) is exponentially small; here simple algebra and the fact that pδ + r_0 > 2γ are applied. Lastly, for the prior concentration, proposition 1 yields the required lower bound on Π(p ∈ B_KL(p_0, ε̄_n)) for some constant C_2 > 0, where we use the fact that 2γ > max(r_0, 1) − pγ/α in the last inequality. Hence we conclude that Π(H(p_{f,σ}, p_0) > Mε_n | D_n) → 0 in pr_0-probability for some constant M > 0. The proof is completed by applying lemma 1.
Besides the rate of contraction, which is a frequentist large-sample evaluation of the Bayesian posterior distribution, we also obtain a frequentist sieve maximum likelihood estimator whose convergence rate follows from the metric entropy bounds. This convergence rate is also minimax-optimal up to a logarithmic factor. Interestingly, this convergence rate is faster than the rate of contraction of the full posterior, although the construction of the sieve depends on the smoothness level α and the rate is therefore non-adaptive.
Theorem 2. Consider the sieve maximum likelihood estimator f̂_K(x) defined by maximizing the likelihood over the sieve. Then f̂_K converges to f_0 at a rate that is minimax-optimal up to a logarithmic factor.

4 Application to partial linear model
In this section we focus on a natural semiparametric application: we use the kernel mixture of polynomials to model the nonparametric component in the partial linear model. Specifically, we consider the partial linear model y_i = z_i^T β + η(x_i) + e_i introduced in section 1, with the kernel mixture of polynomials prior imposed on η.

4.1 Setup and prior specification
We begin with a detailed description of the model. Let X = [0, 1]^p ⊂ R^p be the design space of the nonparametric component, Z ⊂ R^q be the design space of the parametric component, and p_{(x,z)} : X × Z → (0, ∞) be a continuous density function supported on X × Z. The partial linear model is of the form y_i = z_i^T β + η(x_i) + e_i, i = 1, ..., n, where the (e_i)_{i=1}^n are independent N(0, 1) noises and β is the linear coefficient. In many applications, the estimation of the nonparametric component η is of great interest. For example, in Xu et al. (2016b), the parametric term z_i^T β models the baseline disease progression and the nonparametric term η(x_i) models the individual-specific treatment effect deviations over time. When the regression coefficient β is of more interest, the estimation of η can still be critical since it could affect the inference of β.
We combine the partial linear model with the kernel mixture of polynomials prior for the nonparametric component η through the following statistical model: the data D_n = (x_i, z_i, y_i)_{i=1}^n are i.i.d. according to the joint density p_{β,η}(x, z, y) = φ(y − z^T β − η(x)) p_{(x,z)}(x, z), and the true density is p_0 = p_{β_0,η_0} for some β_0 ∈ R^q and some function η_0 ∈ C^{α,L}(X). We make several additional assumptions: the design space Z ⊂ R^q for the linear component is compact with sup_{z∈Z} ‖z‖_1 ≤ B; the sampling distribution of z satisfies E z = 0 with E(zz^T) non-singular; and the design points x and z are independent, i.e., p_{(x,z)}(x, z) = p_x(x) p_z(z). For the prior specification, we assume η follows the kernel mixture of polynomials prior Π_η constructed in section 2.3 with σ = 1. For the parametric component β, we impose a standard Gaussian prior Π_β = N(0, I_q), independent of Π_η. The joint prior is denoted by Π = Π_η × Π_β.
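For reference, under this setup the joint posterior of (β, η) given D_n can be written, up to proportionality, as the following schematic display (with π_β denoting the N(0, I_q) density and Π_η the kernel mixture of polynomials prior):
\[
\Pi(\beta,\eta\mid \mathcal D_n)\;\propto\;
\exp\Bigl\{-\tfrac12\sum_{i=1}^{n}\bigl(y_i - z_i^{T}\beta-\eta(x_i)\bigr)^{2}\Bigr\}\,
\pi_{\beta}(\beta)\,\Pi_{\eta}(\mathrm d\eta).
\]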

4.2 Convergence results
Before exploring the asymptotic behavior of the posterior distribution of β, we need to establish the convergence result for η. The following theorem not only addresses the rate of contraction of the marginal posterior of η, but also serves as one of the building blocks for proving the Bernstein-von Mises limit of the marginal posterior of β. The proofs of theorem 3 and theorem 4 are deferred to the supplementary material.
Theorem 3 (Nonparametric rate). Under the setup and prior specification in section 4.1, the marginal posterior of η contracts around η_0 at the rate n^{−α/(2α+p)}, up to a logarithmic factor, in pr_0-probability.

Now we turn to the convergence results for the parametric component. The focus is the asymptotic normality of the marginal posterior distribution of the linear regression coefficient β, i.e., the Bernstein-von Mises limit (Doob, 1949). To achieve this, we need the notion of the least favorable model for semiparametric models (Bickel et al., 1998). For each fixed β ∈ R^q, the least favorable curve η*_β is defined as the minimizer of the Kullback-Leibler divergence: η*_β(x) = arg inf_{η∈F} D_KL(p_0‖p_{β,η}). It is easy to see that η*_β coincides with η_0(x) in our context. In fact, for each β, we have D_KL(p_0‖p_{β,η}) = E{z^T(β_0 − β) + η_0(x) − η(x)}²/2 = (β_0 − β)^T E(zz^T)(β_0 − β)/2 + E{η_0(x) − η(x)}²/2, since E z = 0 and x is independent of z, and the right-hand side is minimized over η at η = η_0. The least favorable submodel is defined to be {p_{β,η*_β} : β ∈ R^q}, which coincides with {p_{β,η_0} : β ∈ R^q} in our context.
Theorem 4. Under the setup and prior specification in section 4.1, if α > p/2, then the marginal posterior distribution of β, suitably centered and scaled by √n, converges in total variation to a normal limit in pr_0-probability.

The proof of theorem 4 is based on verifying a set of sufficient conditions in Yang et al. (2015), which are provided in section D of the supplementary material. We remark that the metric entropy results (proposition 2) obtained in section 3 and the preceding rate of contraction for η (theorem 3) also play a fundamental role in the verification process.
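To fix ideas, under the assumptions of section 4.1 (unit noise variance, E z = 0, and x independent of z), the Bernstein-von Mises phenomenon is typically summarized by total variation convergence of the rescaled marginal posterior of β to a fixed normal law; a schematic statement of this form, with β̂_n denoting an efficient estimator of β_0 (the precise centering and conditions are those of theorem 4), is
\[
\sup_{B}\,\Bigl|\,\Pi\bigl(\sqrt{n}\,(\beta-\hat\beta_n)\in B\mid \mathcal D_n\bigr)
- N\bigl(0,\{E(zz^{T})\}^{-1}\bigr)(B)\Bigr| \;\to\; 0
\quad \text{in } \mathrm{pr}_0\text{-probability}.
\]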

5 Numerical studies
We first illustrate the kernel mixture of polynomials on simulated data for nonparametric regression. Numerical evidence shows that all Markov chains converge within 1000 iterations. The kernel we use for the kernel mixture of polynomials is the bump kernel ϕ(x) = exp{−(1 − x²)^{−1}} 1(|x| < 1).
For comparison we consider three competitors for estimating f_0: the local polynomial regression (Fan and Gijbels, 1996), implemented in the locpol package (Cabrera, 2012); the DiceKriging method (Roustant et al., 2012); and the robust Gaussian stochastic process emulation (Gu et al., 2017), implemented in the RobustGaSP package (Gu et al., 2016). The point-wise posterior means and 95% credible/confidence intervals for f(x) using the four nonparametric regression methods are plotted in Figure 1. We also compute the mean-squared error of the posterior means of the four methods, where the ground truth f_0 is evaluated at 1000 equidistant design points. In terms of accuracy, measured by the mean-squared error, and the quality of uncertainty quantification, the kernel mixture of polynomials compares favorably with the three competing methods.

Partial linear model for the wage data
We further analyze the cross-sectional data on wages (Wooldridge, 2015), a benchmark dataset for the partial linear model. This dataset is also available in the np package (Hayfield et al., 2008). The data consist of 526 observations with 24 variables and are taken from the U.S. Current Population Survey for the year 1976. In particular, we are interested in modeling the hourly wage on the logarithm scale as the response with respect to 5 variables: years of education (educ), years of potential experience (exper), years with current employer, gender, and marital status. Choi and Woo (2015) and Hayfield et al. (2008) suggest a partial linear model in which the log hourly wage depends linearly on some of these covariates and nonparametrically on the remaining one. The hyperpriors are specified as follows: π_σ is a truncated inverse-Gamma density, π_β = N(0, 10²), and π_ξ = N(0, 10²) 1(|ξ| ≤ 100). The range of K is set to be {11, 12, ..., 20}.
We calculate the posterior means and the posterior 95% credible intervals for β. For comparison, we also provide the least-squares estimate of β and the estimate computed by the np package (Hayfield et al., 2008).
The results are summarized in Table 1, where β̂_np is the estimate of β computed using the np package.

6 Discussion
There are several potential extensions of the current work. Firstly, we develop the theoretical results under the assumption that the noises (e_i)_{i=1}^n are Gaussian. In cases where the noises are only assumed to be sub-Gaussian, the convergence properties deserve further investigation. Secondly, the design points are assumed to be random in the present paper. In cases where the design points are fixed, which is also common in many physical experiments (Tuo and Wu, 2015), the theoretical results for the kernel mixture of polynomials can be further extended using the techniques developed for observations that are not independent and identically distributed by Ghosal et al. (2007). Thirdly, the kernel mixture of polynomials prior is constructed over a uniformly bounded function space; dropping such a requirement may require significant work. In addition, when applying the kernel mixture of polynomials to the partial linear model, we only consider the case where E z = 0 and x is independent of z, so that the linear component and the nonparametric component are orthogonal. On the one hand, the idea of orthogonality has been explored in the literature on calibration of inexact computer models (Plumlee and Joseph, 2016; Plumlee, 2017), and therefore exploring the application of the kernel mixture of polynomials to calibration of orthogonal computer models is a promising extension. On the other hand, it is also interesting to investigate the convergence theory when the two components are not orthogonal to each other. Finally, we have developed theoretical support for the sieve maximum likelihood estimator with compactness restrictions on the parameter spaces. In particular, the loss function is of the least-squares form. From the computational perspective, an efficient optimization technique can be designed to obtain the frequentist estimator in light of the rich literature on solving nonlinear least-squares problems (Nocedal and Wright, 2006).

Supplementary material
The supplementary material contains additional numerical studies, additional technical results, the remaining proofs, and the cited theorems that are used in the proofs.

Supplementary Material
A Additional numerical studies: a synthetic example for partial linear model

We consider a synthetic example to evaluate the performance of the kernel mixture of polynomials for the partial linear model. We simulate n = 500 observations according to the partial linear model y_i = z_i^T β_0 + η_0(x_i) + e_i, where the true coefficient β_0 is reported in Table 2, the (e_i)_{i=1}^n are independent N(0, 1) noises that are independent of (x_i, z_i)_{i=1}^n, and η_0(x) = 2.5 exp(−x) sin(10πx). The nonparametric function η_0 is highly nonlinear and hence poses a natural challenge to estimation. The design points (z_i)_{i=1}^n for the linear component follow Unif([−1, 1]^8) independently, and the design points (x_i)_{i=1}^n for η are independently sampled from Unif(0, 1). The hyperparameters for the kernel mixture of polynomials prior are set as follows: h̲ = 1.2, h̄ = 2, B = 100, and m = 3. For the prior specification, we assume π_µ = Unif(−1, 1), π_β = N(0, 10²), π_ξ = N(0, 10²) 1(|ξ| ≤ 100), and Kh ~ Unif(h̲, h̄). The range of K is set to be {6, 7, ..., 15}.
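For readers who wish to reproduce the data-generating mechanism, the following Python sketch simulates one such dataset; the value of β_0 below is an illustrative placeholder (the coefficients actually used are those reported in Table 2), and the least-squares benchmark shown regresses y on z alone, which may differ from the benchmark computed in the paper.

```python
import numpy as np

rng = np.random.default_rng(2024)
n, q = 500, 8

# Design points: z ~ Unif([-1, 1]^8) for the linear part, x ~ Unif(0, 1) for eta.
z = rng.uniform(-1.0, 1.0, size=(n, q))
x = rng.uniform(0.0, 1.0, size=n)

def eta0(x):
    # Highly nonlinear nonparametric component of the synthetic example.
    return 2.5 * np.exp(-x) * np.sin(10 * np.pi * x)

# Placeholder for the true coefficient vector (actual values are in Table 2).
beta0 = np.linspace(-1.0, 1.0, q)

# Partial linear model: y = z^T beta0 + eta0(x) + e with e ~ N(0, 1).
y = z @ beta0 + eta0(x) + rng.standard_normal(n)

# Simple least-squares benchmark for beta (regressing y on z only).
beta_ls, *_ = np.linalg.lstsq(z, y, rcond=None)
print(beta_ls)
```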
For the parametric component, we compute the posterior means and the posterior 95% credible intervals for β. For comparison, we calculate the least-squares estimate of β. The comparison is provided in Table 2, where β̂_LS denotes the least-squares estimate of β.

B Additional technical results
To estimate the prior concentration Π(p_{f,σ} ∈ B_KL(p_0, ε)), we need to bound the Kullback-Leibler numbers D_KL(p_0‖p_{f,σ}) and E_0{log(p_0(x)/p_{f,σ}(x))}². The following lemma provides upper bounds for these two quantities in terms of ‖f − f_0‖_{L_2(P_x)}, and hence connects the Kullback-Leibler ball B_KL(p_0, ε) with an L_2(P_x)-neighborhood of f_0, where f and f_0 lie in some uniformly bounded function class with ‖f‖_∞, ‖f_0‖_∞ < A.
Lemma B.2 (Approximation lemma). Let f be of the form (4) with µ_k ∈ X_K(k) and f ∈ B_K. Then there exists some constant C_1 such that, for sufficiently small ε > 0 and whenever K ≥ ⌈ε^{−1/α}⌉, ‖f − f_0‖_{L_2(P_x)} ≤ C_1 ε.

Proof. Suppose f ∈ B_K. Define θ_k(x) = Σ_{|s|=0}^m ξ_ks ψ_ks(x), k ∈ [K]^p, where the ψ_ks(x)'s form the kernel mixture of polynomial system, and let θ̃_k(x) be the Taylor polynomial of f_0 at µ_k, whose coefficients are bounded in supremum norm uniformly over µ_k, s, and k. By Taylor's expansion, for all x ∈ X we have |f_0(x) − θ̃_k(x)| ≤ C̃_1 ‖x − µ_k‖^α for some constant C̃_1 > 0. Since we assume that f_0 satisfies the α-Hölder condition globally over X, the constant C̃_1 does not depend on µ_k. By the Cauchy-Schwarz inequality (a + b)² ≤ 2a² + 2b², we decompose ‖f − f_0‖²_{L_2(P_x)} into two terms, 2I_K + 2J_K. By Jensen's inequality, for any a > 0, we bound I_K in terms of the local Taylor approximation errors. Since ‖f_0 − θ̃_k‖_∞ ≤ A + B for all k, where the constant A is the uniform upper bound on {‖f‖_∞ : f ∈ ∪_{K=1}^∞ F_K}, we then apply the Taylor approximation to obtain I_K ≤ C̃_1²(a + h)^{2α} h^{2α} ε² when ε is sufficiently small. Similarly, by Jensen's inequality and the Cauchy-Schwarz inequality (a + b)² ≤ 2a² + 2b², we split J_K into two terms. The first term is upper bounded by ε² up to a constant. Now we analyze the second term. Since h + 1/(2K) ≍ 1/K, h + 1/(2K) ≤ 1 for sufficiently large K, and |s| ≥ α for the terms involved, it follows that the second term is also bounded by ε² up to a constant. We conclude that J_K ≲ ε², and hence 2I_K + 2J_K ≲ ε². To sum up, there exists a constant C_1 such that, for sufficiently small ε, ‖f − f_0‖_{L_2(P_x)} ≤ C_1 ε. The proof is thus completed.
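For clarity, the Taylor polynomial θ̃_k of f_0 at µ_k invoked in the preceding proof takes the standard form below, written under the assumption that f_0 ∈ C^{α,B}(X) so that the expansion is taken to degree ⌊α⌋ (here s! = s_1! ⋯ s_p!):
\[
\widetilde\theta_k(x) = \sum_{|s|\le \lfloor\alpha\rfloor}
\frac{\partial^{s} f_0(\mu_k)}{s!}\,(x-\mu_k)^{s},
\qquad
\bigl|f_0(x)-\widetilde\theta_k(x)\bigr| \le \widetilde C_1\,\|x-\mu_k\|^{\alpha}.
\]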
Proof of proposition 1. By lemma B.1, there exists some constant C_1 such that, for sufficiently small ε > 0, the Kullback-Leibler ball B_KL(p_0, ε) contains a set of densities p_{f,σ} in which ‖f − f_0‖_{L_2(P_x)} and |σ − σ_0| are both at most a constant multiple of ε. Lower bounding the prior mass of this set under the prior constructed in section 2.3 yields the stated bound for some constants C_1, C̃_2, C_2 > 0. The proof is thus completed.
We proceed to bound the difference between two functions in F_K. Since max_{|s|=0,1,...,m} |ξ^{(2)}_{ks}| ≤ B for all k, and ‖(x − µ_l)^s‖_∞ is upper bounded by a universal constant for l ∈ [K]^p and |s| = 0, 1, ..., m, it follows that the difference is controlled by the differences of the corresponding weights and coefficients. For any x ∈ X, there exists a unique k such that x ∈ X_K(k), and the number of non-vanishing weights at x is the same as the cardinality of the index set {l ∈ [K]^p : ‖x − µ_l‖_∞ < h}. We now argue that this index set is contained in a set of neighboring indices of k whose cardinality is upper bounded by (8h̄ + 2)^p. Suppose l is in the complement of this set of neighboring indices. Then the (reverse) triangle inequality yields ‖x − µ_l‖_∞ ≥ h, a contradiction. Therefore, we obtain the desired bound, where we have applied lemma B.3 in the last inequality.
We now construct the ε-net for F_K. Let E_ξ(ε) be an ε-net of [−B, B], E_{µ_k}(ε) be an ε-net of X_K(k) with respect to ‖·‖_∞, k ∈ [K]^p, and E_h(ε) be an ε-net of [h̲/K, h̄/K]. Then |E_ξ(ε)| ≤ 3B/ε, |E_{µ_k}(ε)| ≤ (2/(Kε))^p, and |E_h(ε)| ≤ (2h̄ − 2h̲)/(Kε) for sufficiently small ε. We claim that the resulting collection of functions forms an ε-net of F_K up to a constant multiple of ε. In fact, for all f(x) ∈ F_K of the form (4), there exist some h' ∈ E_h(ε/K²), ξ'_ks ∈ E_ξ(ε), and µ'_k ∈ E_{µ_k}(ε/K) for each k and each s, such that |h − h'| < ε/K², |ξ_ks − ξ'_ks| < ε, and ‖µ_k − µ'_k‖_∞ < ε/K, and the corresponding function f' satisfies ‖f − f'‖_∞ ≲ ε when ε is sufficiently small. Taking logarithms of the resulting bound on the covering number completes the proof of the second inequality.
Now we prove the first inequality. Suppose (f_j)_{j=1}^N forms an ε-net of F_K with respect to ‖·‖_∞ such that N = N(ε, F_K, ‖·‖_∞). Then define l_j(x) = max(f_j(x) − ε, −A) and u_j(x) = min(f_j(x) + ε, A), yielding the brackets ([l_j, u_j])_{j=1}^N such that F_K ⊂ ∪_{j=1}^N [l_j, u_j]. Furthermore, ‖l_j − u_j‖_{L_r(P_x)} ≤ ‖l_j − u_j‖_∞ ≤ 2ε. The proof is completed by the fact that log N = log N(ε, F_K, ‖·‖_∞).

Proof of theorem 2. Note that we take the closure of X_K(k) so that the sieve maximum likelihood estimator exists.
Also, when σ_0 is unknown, the computation of f̂_K is not affected, since under the assumption of Gaussian noise the maximum likelihood estimator is equivalent to the least-squares estimator. Furthermore, we remark that the metric entropy bound for F_K in proposition 2 also applies to G_K.
We first verify (13). Since M_n = ∪_{K=1}^{K_n} P(F_K, Θ_{J_n}), we obtain, for some constants b_1, B_1 > 0, a bound on Π(p_{β,η} ∈ M_n^c), where we have used the Chernoff bound, the fact that ‖β‖² ~ χ²(q), and E_Π exp(‖β‖²/4) = (√2)^q under the prior Π in the last inequality. Since K_n^p (log K_n^p)^{r_0} ≲ J_n² for sufficiently large n, it follows from simple algebra that Π(p_{β,η} ∈ M_n^c) ≤ exp{−b_1 n^{p/(p+2α)} (log n)^{pδ+r_0}} ≤ exp(−4nε_n²) for some constant b_1 > 0 when n is sufficiently large, where (11) is applied.
where K = ⌈ε^{−1/α}⌉ and B_K is defined in lemma B.2. Furthermore, by simple algebra, D_KL(p_0‖p_{β,η}) = ρ_2²(β^T z + η(x), β_0^T z + η_0(x))/2, and the second moment of the log-likelihood ratio can be upper bounded analogously.

The following classical theorem, originally due to Wong and Shen (1995), is used to study the convergence rate of the sieve maximum likelihood estimator.
Theorem D.2 (Wong and Shen (1995)). Let (y_i)_{i=1}^n be independent and identically distributed observations following a distribution pr_0 with density p_0, and let (P_n)_{n=1}^∞ be a sequence of classes of densities (referred to as the sieves). Suppose (ε_n)_{n=1}^∞ is a sequence decreasing to 0 such that ∫_0^{ε_n} {log N_{[·]}(ε, P_n, H)}^{1/2} dε ≲ √n ε_n². Let p̂_n = arg max_{p∈P_n} Σ_{i=1}^n log p(y_i) be the sieve maximum likelihood estimator on P_n, assumed to be well-defined, and define δ_n = inf_{q∈P_n} D_KL(p_0‖q) and τ_n = lim_{k→∞} E_0{log p_0(y)/q_k(y)}² for some sequence (q_k)_{k=1}^∞ ⊂ P_n such that D_KL(p_0‖q_k) → δ_n. If max(δ_n, τ_n) ≲ ε_n², then H(p̂_n, p_0) ≲ ε_n except on an event whose pr_0-probability tends to zero.

Theorem D.3 (Yang et al. (2015)). Let P = {p_{θ,η} : θ ∈ Θ ⊂ R^q, η ∈ F} be a class of density functions with respect to some underlying σ-finite measure over Y parametrized on Θ × F, where Θ is open and F is equipped with the metric d_H(η_1, η_2) = H(p_{θ_0,η_1}, p_{θ_0,η_2}). Let (y_i)_{i=1}^n be i.i.d. according to p_0 = p_{θ_0,η_0} for some θ_0 ∈ Θ and η_0 ∈ F. Assume that the least-favorable submodel {p_{θ,η*_θ} : θ ∈ Θ}, defined through the least-favorable curve η*_θ = arg inf_{η∈F} D_KL(p_0‖p_{θ,η}), exists for all θ ∈ Θ, and denote the semiparametric bias Δη_θ = η*_θ − η_0. Suppose the following conditions hold:

Condition D.1. Θ × F is endowed with a product prior Π_θ × Π_η, and Π_θ yields a density with respect to the Lebesgue measure on Θ that is positive at θ_0.

Condition D.2. There exists a sequence ε_n → 0 satisfying nε_n² → ∞ and a sequence of submodels (F_n)_{n=1}^∞ ⊂ F satisfying the entropy, prior mass, and bias requirements stated in Yang et al. (2015).
The following maximum inequality for empirical processes plays a fundamental role in the verification of condition III in theorem D.3.
Theorem D.4 (van der Vaart (2000), lemma 19.36). Let (y_i)_{i=1}^n be i.i.d. according to a distribution P_y over Y, and let F be a class of measurable functions f : Y → R. If ∫_Y f²(y) P_y(dy) < δ² and ‖f‖_∞ ≤ M for all f ∈ F, where δ and M do not depend on f, then E ‖G_n‖_F ≲ J_{[·]}(δ, F, ‖·‖_{L_2(P_y)}) {1 + J_{[·]}(δ, F, ‖·‖_{L_2(P_y)}) M/(δ²√n)}, where J_{[·]}(δ, F, ‖·‖_{L_2(P_y)}) = ∫_0^δ {log N_{[·]}(ε, F, ‖·‖_{L_2(P_y)})}^{1/2} dε is the bracketing integral.