Bayesian two-step estimation in differential equation models

Ordinary differential equations (ODEs) are used to model dynamic systems appearing in engineering, physics, the biomedical sciences and many other fields. These equations contain unknown parameters, say $\bm\theta$, of physical significance which have to be estimated from noisy data. Often the equations admit no closed-form analytic solution, so the usual nonlinear least squares technique cannot be used to estimate the unknown parameters. A two-step approach addresses this problem: the first step fits the data nonparametrically, and the second step estimates the parameter by minimizing the distance between the nonparametrically estimated derivative and the derivative suggested by the system of ODEs. The statistical aspects of this approach have been studied under the frequentist framework. We consider this two-step estimation under the Bayesian framework. The response variable is allowed to be multidimensional and its true mean function is not assumed to lie in the model. We induce a prior on the regression function using a random series based on B-spline basis functions. We establish a Bernstein-von Mises theorem for the posterior distribution of the parameter of interest. Interestingly, even though the posterior distribution of the regression function based on splines converges at a rate slower than $n^{-1/2}$, the parameter vector $\bm\theta$ is nevertheless estimated at the $n^{-1/2}$ rate.


Introduction
Suppose that we have a regression model $Y = f_\theta(t) + \varepsilon$, $\theta \in \Theta \subseteq \mathbb{R}^p$. The explicit form of $f_\theta(\cdot)$ may not be known, but the function is assumed to satisfy the system of ordinary differential equations (ODEs) given by
$$\dot f_\theta(t) = F(t, f_\theta(t), \theta), \quad t \in [0, 1], \eqno(1.1)$$
where $F$ is a known, appropriately smooth vector-valued function and $\theta$ is a parameter vector controlling the regression function. Equations of this type are encountered in various branches of science, such as genetics (Chen et al., 1999) and the viral dynamics of infectious diseases [Anderson and May (1992), Nowak and May (2000)]; there are numerous applications in the fields of pharmacokinetics and pharmacodynamics (PKPD) as well. In many instances no closed-form solution exists. One such example is the feedback system (Gabrielsson and Weiner, 2000, page 235) modeled by a pair of ODEs in which $R(t)$ and $M(t)$ stand for the loss of response and the modulator at time $t$ respectively. There $k_{in}$, $k_{out}$ and $k_{tol}$ are unknown parameters which have to be estimated from noisy observations, $\varepsilon_R(t)$ and $\varepsilon_M(t)$ being the respective noise terms at time $t$. Another popular example is the Lotka-Volterra system, also known as the predator-prey equations. The prey and predator populations change over time according to
$$\dot f_{1\theta}(t) = \theta_1 f_{1\theta}(t) - \theta_2 f_{1\theta}(t) f_{2\theta}(t), \qquad \dot f_{2\theta}(t) = -\theta_3 f_{2\theta}(t) + \theta_4 f_{1\theta}(t) f_{2\theta}(t),$$
where $\theta = (\theta_1, \theta_2, \theta_3, \theta_4)^T$, and $f_{1\theta}(t)$ and $f_{2\theta}(t)$ denote the prey and predator populations at time $t$ respectively. If the ODEs can be solved analytically, then the usual nonlinear least squares (NLS) method [Levenberg (1944), Marquardt (1963)] can be used to estimate the unknown parameters. In most practical situations such closed-form solutions are not available, as the previous two examples show. NLS was modified for this purpose by Bard (1974) and Domselaar and Hemker (1975). Hairer et al. (1993, page 134) and Mattheij and Molenaar (2002, page 53) used the four-stage Runge-Kutta algorithm to solve (1.1) numerically; NLS can then be applied in a second step to estimate the parameters.
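The numerical-solution-plus-NLS route can be sketched as follows for the Lotka-Volterra system, using the design points, initial conditions and true parameter value from the simulation section later in the paper. This is only an illustration: the optimizer's starting point, box bounds and noise seed are assumed choices, and `solve_ivp`'s RK45 stands in for the Runge-Kutta solvers cited above.

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import least_squares

def lotka_volterra(t, f, theta):
    # f = (prey, predator), theta = (theta1, theta2, theta3, theta4)
    t1, t2, t3, t4 = theta
    return [t1 * f[0] - t2 * f[0] * f[1],
            -t3 * f[1] + t4 * f[0] * f[1]]

def solution(theta, tgrid, f0=(1.0, 0.5)):
    # Runge-Kutta numerical solution of the ODE system on [0, 1]
    sol = solve_ivp(lotka_volterra, (0.0, 1.0), f0, t_eval=tgrid,
                    args=(theta,), rtol=1e-8, atol=1e-8)
    return sol.y.T                              # shape (n, 2)

rng = np.random.default_rng(0)
n = 50
tgrid = (2 * np.arange(1, n + 1) - 1) / (2 * n)  # design points in (0, 1)
truth = np.array([10.0, 10.0, 10.0, 10.0])
Y = solution(truth, tgrid) + 0.2 * rng.standard_normal((n, 2))

# NLS step: minimize the residuals between data and the numerical solution
fit = least_squares(lambda th: (solution(th, tgrid) - Y).ravel(),
                    x0=np.array([8.0, 8.0, 8.0, 8.0]), bounds=(0.0, 20.0))
```

Each evaluation of the residual requires a fresh numerical solve of the ODE system, which is exactly the computational burden that motivates the two-step approach discussed next.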
The statistical properties of the corresponding estimator were studied by Xue et al. (2010), who established its strong consistency, √n-consistency and asymptotic normality. Ramsay et al. (2007) proposed the generalized profiling procedure, in which the solution is approximated by a linear combination of basis functions. The coefficients of the basis functions are estimated by solving a penalized optimization problem using an initial choice of the parameters of interest. A data-dependent fitting criterion containing the estimated coefficients is constructed, and θ is then estimated by the maximizer of this criterion. Qi and Zhao (2010) explored the statistical properties of this estimator, including √n-consistency and asymptotic normality. Despite having desirable statistical properties, these approaches are computationally cumbersome, especially when the system of ODEs or θ is high-dimensional. Varah (1982) proposed a two-step procedure. In the first step each of the state variables is approximated by a cubic spline using the least squares technique. In the second step, the corresponding derivatives are estimated by differentiating the nonparametrically fitted curve, and the parameter estimate is obtained by minimizing the sum of squared differences between the derivatives of the fitted spline and the derivatives suggested by the ODEs at the design points. This method does not depend on the initial or boundary conditions of the state variables and is computationally quite efficient. An example given in Voit and Almeida (2004) showed the computational superiority of the two-step approach over the usual least squares technique. Brunel (2008) replaced the sum of squares in the second step by a weighted integral of the squared deviation and proved √n-consistency as well as asymptotic normality of the resulting estimator.
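The two-step procedure described above can be sketched as follows for the Lotka-Volterra system. This is a minimal illustration, not the authors' implementation: the smoothing level `s`, the weight `x(1 − x)` and the starting point `theta0` are assumed choices, and a discrete weighted sum on the design points stands in for Brunel's weighted integral.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline
from scipy.optimize import minimize

def two_step(x, Y, theta0, s=0.0):
    # Step 1: fit each state variable by a cubic (smoothing) spline and
    # differentiate the fitted curve to estimate the derivatives.
    splines = [UnivariateSpline(x, Y[:, j], k=3, s=s) for j in range(Y.shape[1])]
    fhat = np.column_stack([spl(x) for spl in splines])
    dfhat = np.column_stack([spl.derivative()(x) for spl in splines])
    w = x * (1.0 - x)                     # weight vanishing at the boundary

    # Step 2: minimize the weighted squared deviation between the spline
    # derivatives and the derivatives implied by the Lotka-Volterra ODEs.
    def criterion(theta):
        t1, t2, t3, t4 = theta
        F = np.column_stack([t1 * fhat[:, 0] - t2 * fhat[:, 0] * fhat[:, 1],
                             -t3 * fhat[:, 1] + t4 * fhat[:, 0] * fhat[:, 1]])
        return np.sum(w[:, None] * (dfhat - F) ** 2) / len(x)

    return minimize(criterion, theta0, method="Nelder-Mead",
                    options={"xatol": 1e-8, "fatol": 1e-12, "maxiter": 4000}).x
```

Note that no ODE solve is needed anywhere, which is the computational advantage reported by Voit and Almeida (2004); for noisy data one would pass a positive smoothing level `s`.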
The order of the B-spline basis was determined by the smoothness of F(·, ·, ·) with respect to its first two arguments. Gugushvili and Klaassen (2012) followed the same approach but used kernel smoothing instead of splines, and also established √n-consistency of the estimator. Wu et al. (2012) made a further modification, using penalized smoothing splines in the first step and numerical derivatives instead of the actual derivatives of the nonparametrically estimated functions. In another work, Brunel et al. (2013) used a nonparametric approximation of the true solution to (1.1) and then a set of orthogonality conditions to estimate the parameters; the √n-consistency and asymptotic normality of that estimator were also established.
Bayesian estimation in ODE models was considered in the works of Gelman et al. (1996), Rogers et al. (2007) and Girolami (2008). They first solved the ODEs numerically to approximate the expected response and thereby constructed the likelihood; a prior was assigned on θ and MCMC techniques were used to generate samples from the posterior. Computational cost can be an issue in this case as well. Campbell and Steele (2012) proposed the smooth functional tempering approach, a population MCMC technique that utilizes the generalized profiling approach (Ramsay et al., 2007) and the parallel tempering algorithm. Jaeger (2009) also used a Bayesian analog of generalized profiling, putting a prior on the coefficients of the basis functions. The theoretical aspects of these Bayesian estimation methods have not yet been explored in the literature.
In this paper we consider a Bayesian analog of the approach of Brunel (2008), fitting a nonparametric regression model using the B-spline basis. We assign priors on the coefficients of the basis functions, and a posterior is induced on θ through the posteriors of these coefficients. We study the asymptotic properties of the posterior distribution of θ and establish a Bernstein-von Mises theorem. We allow the ODE model to be misspecified; that is, the true regression function need not be a solution of the ODE. The response variable is also allowed to be multidimensional with possibly correlated errors. The normal distribution is used as the working model for the error distribution, but the true distribution of the errors may be different. Interestingly, the original model is parametric, but it is embedded in a nonparametric model, which is in turn approximated by high-dimensional parametric models. Note that the slow rate of nonparametric estimation does not affect the convergence rate of the parameter in the original parametric model.
The paper is organized as follows. Section 2 describes the notation and the model as well as the priors used for the analysis. The main results are given in Section 3, and are extended to a more general setup in Section 4. Section 5 presents a simulation study under different settings. Proofs of the theorems are given in Section 6, and Section 7 contains the proofs of the required lemmas.

Notations, model assumption and prior specification
We describe the notation used in this paper. Boldfaced letters denote vectors and matrices. For a matrix $A$, the symbols $A_{i,}$ and $A_{,j}$ stand for the $i$th row and $j$th column of $A$ respectively. The notation $((A_{i,j}))$ stands for a matrix whose $(i,j)$th element is $A_{i,j}$. We use $\mathrm{rows}_r^s(A)$ with $r < s$ to denote the sub-matrix of $A$ consisting of the $r$th to $s$th rows; $\mathrm{cols}_r^s(A)$ is defined similarly for columns. The notation $x_{r:s}$ stands for the sub-vector consisting of the $r$th to $s$th elements of a vector $x$. By $\mathrm{vec}(A)$ we mean the vector obtained by stacking the columns of the matrix $A$ one over another. For an $m \times n$ matrix $A$ and a $p \times q$ matrix $B$, $A \otimes B$ denotes the Kronecker product of $A$ and $B$; see Steeb (2006) for the definition. The identity matrix of order $p$ is denoted $I_p$. By $\mathrm{maxeig}(A)$ and $\mathrm{mineig}(A)$ we denote the maximum and minimum eigenvalues of $A$ respectively. For a vector $x \in \mathbb{R}^p$, we denote $\|x\| = \left(\sum_{i=1}^p x_i^2\right)^{1/2}$. We denote the $r$th order derivative of a function $f(\cdot)$ by $f^{(r)}(\cdot)$, that is, $f^{(r)}(t) = \frac{d^r}{dt^r} f(t)$. The boldfaced symbol $\boldsymbol f(\cdot)$ stands for a vector-valued function, and for a function $f$ and points $x_1, \ldots, x_p$ we write $\boldsymbol f(\boldsymbol x) = (f(x_1), \ldots, f(x_p))^T$. The notation $\langle\cdot,\cdot\rangle$ stands for an inner product. For numerical sequences $a_n$ and $b_n$, $a_n = o(b_n)$ means $a_n/b_n \to 0$ as $n \to \infty$, while $a_n = O(b_n)$ means that $a_n/b_n$ is bounded. We use $a_n \asymp b_n$ to mean $a_n = O(b_n)$ and $b_n = O(a_n)$, while $a_n \lesssim b_n$ stands for $a_n = O(b_n)$. The symbol $a_n \gg b_n$ means $b_n = o(a_n)$, and $a_n \ll b_n$ is defined similarly. The notation $o_P(1)$ indicates a sequence of random vectors converging in probability to zero, whereas $O_P(1)$ stands for a sequence of random vectors bounded in probability.

imsart-generic ver. 2011/11/15 file: EJS_1.tex date: November 5, 2014
The boldfaced symbols E(·) and Var(·) stand for the mean vector and dispersion matrix respectively of a random vector. For probability measures $P$ and $Q$ defined on $\mathbb{R}^p$, the total variation distance is $\|P - Q\|_{TV} = \sup_B |P(B) - Q(B)|$, the supremum being taken over measurable sets $B$. For an open set $E$, the symbol $C^m(E)$ stands for the collection of functions defined on $E$ with $m$ continuous partial derivatives with respect to their arguments. Now let us consider the formal description of the model. We have a system of $d$ ordinary differential equations given by
$$\dot f_\theta(t) = F(t, f_\theta(t), \theta), \quad t \in [0, 1], \eqno(2.1)$$
with $f_\theta(\cdot) = (f_{1\theta}(\cdot), \ldots, f_{d\theta}(\cdot))^T$ and $\theta \in \Theta$, where we assume that $\Theta$ is a compact subset of $\mathbb{R}^p$. Let us denote $F(\cdot,\cdot,\cdot) = (F_1(\cdot,\cdot,\cdot), \ldots, F_d(\cdot,\cdot,\cdot))^T$. We also assume that for a fixed $\theta$, $F \in C^{m-1}((0,1), \mathbb{R}^d)$ for some integer $m \ge 1$. Then, by successive differentiation of the right-hand side of (2.1), it follows that $f_\theta \in C^m((0,1))$. By the implied uniform continuity, the function and its derivatives uniquely extend to continuous functions on $[0,1]$. Consider an $n \times d$ matrix of observations $Y$ with $Y_{i,j}$ denoting the measurement taken on the $j$th response at the point $x_i$, $0 \le x_i \le 1$, $i = 1, \ldots, n$; $j = 1, \ldots, d$. Denoting by $\varepsilon = ((\varepsilon_{i,j}))$ the corresponding matrix of errors, the proposed model is given by
$$Y_{i,j} = f_{j\theta}(x_i) + \varepsilon_{i,j},$$
while the data are generated by the model
$$Y_{i,j} = f_{j0}(x_i) + \varepsilon_{i,j},$$
where $f_0(\cdot) = (f_{10}(\cdot), \ldots, f_{d0}(\cdot))^T$ denotes the true mean vector, which does not necessarily lie in $\{f_\theta : \theta \in \Theta\}$. We assume that $f_0 \in C^m([0,1])$ and $\varepsilon_{i,j} \stackrel{iid}{\sim} P_0$, a probability distribution with mean zero and finite variance $\sigma_0^2$, for $i = 1, \ldots, n$; $j = 1, \ldots, d$.
Since the expression of $f_\theta$ is usually not available, the proposed model is embedded in the nonparametric regression model
$$Y = X_n B_n + \varepsilon, \eqno(2.4)$$
where $X_n = ((N_j(x_i)))_{1 \le i \le n,\, 1 \le j \le k_n+m-1}$, $\{N_j(\cdot)\}_{j=1}^{k_n+m-1}$ being the B-spline basis functions of order $m$ with $k_n - 1$ interior knots, and $B_n = (\beta_1, \ldots, \beta_d)$ is the $(k_n+m-1) \times d$ matrix containing the coefficients of the basis functions. We consider $P_0$ to be unknown and use $N(0, \sigma^2)$ as the working distribution for the errors, where $\sigma$ may be treated as another unknown parameter. Let $t_1, \ldots, t_{k_n-1}$ denote the interior knots with $t_l = l/k_n$ for $l = 1, \ldots, k_n - 1$; hence the mesh width is $\xi_n = 1/k_n$. Denoting by $Q_n$ the empirical distribution function of $x_i$, $i = 1, \ldots, n$, we assume that $Q_n$ converges suitably to a fixed distribution function. Let the prior distribution on the coefficients be given by
$$\beta_j \sim N\!\left(0, \tfrac{n}{k_n}(X_n^T X_n)^{-1}\right), \quad j = 1, \ldots, d. \eqno(2.5)$$
A simple calculation yields the posterior distribution of $\beta_j$ as
$$\beta_j \mid Y \sim N\!\left(\left(1 + \tfrac{\sigma^2 k_n}{n}\right)^{-1}(X_n^T X_n)^{-1} X_n^T Y_{,j},\ \sigma^2 \left(1 + \tfrac{\sigma^2 k_n}{n}\right)^{-1}(X_n^T X_n)^{-1}\right), \eqno(2.6)$$
and the posterior distributions of $\beta_j$ and $\beta_{j'}$ are mutually independent for $j \ne j'$. Let $w(\cdot)$ be a continuous weight function with $w(0) = w(1) = 0$ which is positive on $(0,1)$. We define
$$R_f(\theta) = \int_0^1 \left\|f'(t) - F(t, f(t), \theta)\right\|^2 w(t)\, dt, \qquad \psi(f) = \arg\min_{\theta \in \Theta} R_f(\theta). \eqno(2.7)$$
It is easy to check that $\psi(f_\eta) = \eta$ for all $\eta \in \Theta$; thus the map $\psi$ extends the definition of the parameter $\theta$ beyond the model. Let us define $\theta_0 = \psi(f_0)$, which we assume lies in the interior of $\Theta$. From now on, we shall write $\theta$ for $\psi(f)$ and treat it as the parameter of interest. A posterior is induced on $\Theta$ through the map $\psi$ acting on $f(\cdot) = B_n^T N(\cdot)$ and the posterior of $B_n$ given by (2.6).
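The conjugacy behind the posterior of the coefficient vectors is elementary to compute. The sketch below, assuming a Gaussian prior of the form $N(0, (n/k_n)(X_n^T X_n)^{-1})$ and the working likelihood $N(X_n\beta_j, \sigma^2 I)$, builds the B-spline design matrix and returns the posterior mean and dispersion of one coefficient vector; the numerical settings in the test are illustrative.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_design(x, kn, m=4):
    # order-m B-spline basis with kn - 1 equispaced interior knots;
    # K = kn + m - 1 basis functions, clamped knot vector of length K + m
    K = kn + m - 1
    knots = np.concatenate([np.zeros(m), np.arange(1, kn) / kn, np.ones(m)])
    return BSpline(knots, np.eye(K), m - 1)(x)   # (len(x), K) design matrix

def beta_posterior(X, y, sigma2, kn, n):
    # closed-form Gaussian posterior of a coefficient vector beta_j
    XtX = X.T @ X
    shrink = 1.0 / (1.0 + sigma2 * kn / n)
    mean = shrink * np.linalg.solve(XtX, X.T @ y)
    cov = sigma2 * shrink * np.linalg.inv(XtX)
    return mean, cov
```

The shrinkage factor $(1 + \sigma^2 k_n/n)^{-1}$ tends to one, so the posterior mean is asymptotically the least squares estimate.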

Main results
Our objective is to study the asymptotic behavior of the posterior distribution of $\sqrt n(\theta - \theta_0)$. The next theorem gives the asymptotic representation of $\sqrt n(\theta - \theta_0)$ under the assumption that for all $\epsilon > 0$,
$$\inf\left\{R_{f_0}(\theta) : \|\theta - \theta_0\| \ge \epsilon\right\} > R_{f_0}(\theta_0). \eqno(3.1)$$
Since the posterior distributions of $\beta_j$ are mutually independent when the $\varepsilon_{,j}$ are mutually independent for $j = 1, \ldots, d$, we can take $d = 1$ in Theorem 1 for notational simplicity and write $f(\cdot)$, $f_0(\cdot)$, $F(\cdot,\cdot,\cdot)$ and $\beta$ for their multivariate counterparts $\boldsymbol f(\cdot)$, $\boldsymbol f_0(\cdot)$, $\boldsymbol F(\cdot,\cdot,\cdot)$ and $B_n$. The extension to the $d$-dimensional case is straightforward, as shown in Remark 4 following Theorem 1. We deal with the situation of correlated errors in Section 4.
Theorem 1. Let the matrix $J_{\theta_0}$ be nonsingular, let $m$ be an integer with $m \ge 5$, and let $n^{1/(2m)} \ll k_n \ll n^{1/8}$. If $D_{0,2,1}F(t, y, \theta)$ and $D_{0,0,2}F(t, y, \theta)$ are continuous in their arguments, then under assumption (3.1) the asymptotic representation (3.2) of $\sqrt n(\theta - \theta_0)$ holds.

Remark 1: Condition (3.1) implies that $\theta_0$ is the unique point of minimum of $R_{f_0}(\cdot)$ and that it is a well-separated point of minimum.
Remark 2: The posterior distribution of $\Gamma(f) - \Gamma(f_0)$ contracts to 0 at the rate $n^{-1/2}$, as indicated by Lemma 4. Hence, the posterior distribution of $\theta - \theta_0$ contracts to 0 at the rate $n^{-1/2}$ with high probability under the truth. We refer to Theorem 2 for a more refined version of this result.
Remark 3: We note that a minimum of fifth-order smoothness of the true mean function suffices to ensure the contraction rate $n^{-1/2}$; nothing more is gained by assuming a higher order of smoothness. For $m = 5$, the required condition becomes $n^{1/10} \ll k_n \ll n^{1/8}$. Also, the knots are chosen deterministically and there is no need to assign a prior on the number of terms of the random series used. Hence, the issue of Bayesian adaptation, that is, improving the convergence rate with higher smoothness without knowing the smoothness, does not arise in the present context.

Remark 4: When the response is a $d$-dimensional vector, (3.2) holds with the scalars replaced by the corresponding $d$-dimensional vectors. Letting $A(t)$ stand for the associated $p \times d$ matrix, it then suffices, in order to approximate the posterior distribution of $\theta$, to study the asymptotic posterior distribution of the linear combination of the $\beta_j$ given by (3.3). The next theorem describes the approximate posterior distribution of $\sqrt n(\theta - \theta_0)$.

Theorem 2. Define $G_{n,j}$ and $B_j$ for $j = 1, \ldots, d$ as above. If $B_j$ is nonsingular for all $j = 1, \ldots, d$, then under the conditions of Theorem 1 the approximation (3.4) holds.

Inspecting the proof, we can conclude that (3.4) is uniform over $\sigma^2$ belonging to a compact subset of $(0, \infty)$. Note also that the scale of the approximating normal distribution involves the working variance $\sigma^2$, assumed given, even though the convergence is studied under the true distribution $P_0$ with variance $\sigma_0^2$, not necessarily equal to the given $\sigma^2$. Thus the distribution matches the frequentist distribution of the estimator in Brunel (2008) only if $\sigma$ is correctly specified as $\sigma_0$. The next result shows that putting a prior on $\sigma$ rectifies this problem.
Theorem 3. We assign independent $N(0, n k_n^{-1} \sigma^2 (X_n^T X_n)^{-1})$ priors on $\beta_j$ for $j = 1, \ldots, d$, and an inverse gamma prior on $\sigma^2$ with shape and scale parameters $a$ and $b$ respectively. If the fourth-order moment of the true error distribution is finite, then under the conditions of Theorems 1 and 2 the analog of (3.4) holds with the working variance replaced by $\sigma_0^2$, the true value of $\sigma^2$.

Remark 5: In many applications, the regression function is modeled as $h_\theta(t) = g(f_\theta(t))$ instead of $f_\theta(t)$, where $g$ is a known invertible function and $h_\theta(t) \in \mathbb{R}^d$. It should be noted that $\dot h_\theta(t)$ is then a known function of $t$, $h_\theta$ and $\theta$, say $H(t, h_\theta(t), \theta)$. We can therefore carry out our analysis with $F$ and $f_\theta$ in (1.1) replaced by $H$ and $h_\theta$ respectively.
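The reduction in Remark 5 is just the chain rule. Writing $g'$ for the derivative of $g$ (applied coordinatewise when $d > 1$), one way to spell out the claimed known function $H$ is:

```latex
\dot h_\theta(t)
  = g'\bigl(f_\theta(t)\bigr)\,\dot f_\theta(t)
  = g'\bigl(g^{-1}(h_\theta(t))\bigr)\,
    F\bigl(t,\, g^{-1}(h_\theta(t)),\, \theta\bigr)
  =: H\bigl(t,\, h_\theta(t),\, \theta\bigr),
```

which is indeed a known function of $t$, $h_\theta$ and $\theta$, since $g$, $g^{-1}$ and $F$ are all known.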

Simulation Study
We consider the Lotka-Volterra equations to study the posterior distribution of $\theta$. Thus we have $p = 4$, $d = 2$ and the ODEs
$$\dot f_{1\theta}(t) = \theta_1 f_{1\theta}(t) - \theta_2 f_{1\theta}(t) f_{2\theta}(t), \qquad \dot f_{2\theta}(t) = -\theta_3 f_{2\theta}(t) + \theta_4 f_{1\theta}(t) f_{2\theta}(t),$$
with initial conditions $f_{1\theta}(0) = 1$, $f_{2\theta}(0) = 0.5$. This system is not analytically solvable. We consider both the situation where the true regression function belongs to the solution set and where it lies outside it. For a sample of size $n$, the $x_i$'s are chosen as $x_i = (2i - 1)/2n$ for $i = 1, \ldots, n$, and samples of sizes 50, 100 and 500 are considered. The weight function is chosen to be positive on $(0, 1)$ and to vanish at the boundary. We simulate 1000 replications for each case. In each replication, a sample of size 1000 is drawn directly from the posterior distribution of $\theta$ and a 95% equal-tailed credible interval is obtained; each replication took around one minute. We calculate the coverage and the average length of the corresponding credible interval over these 1000 replications. We also use 1000 replications to construct the 95% equal-tailed confidence interval based on the asymptotic normality of the estimator introduced by Varah (1982) and modified and studied by Brunel (2008), abbreviated "VB" in the tables. The estimated standard errors of the interval lengths and coverages are given inside parentheses in the tables. We consider two cases.

Case 1: The true regression function is in the model. Thus the true mean vector is given by $(f_{1\theta_0}(t), f_{2\theta_0}(t))^T$, where $\theta_0 = (\theta_{10}, \theta_{20}, \theta_{30}, \theta_{40})^T$. We take $\theta_{10} = \theta_{20} = \theta_{30} = \theta_{40} = 10$ as the true value of the parameter.
Case 2: The true mean vector is taken outside the solution set of the ODE and is given by (f 1θ0 (t) + 0.4 sin(4πt), f 2θ0 (t) + 0.4 sin(4πt)) T .
The true distribution of the errors is taken to be either $N(0, (0.2)^2)$ or a scaled $t$-distribution with 6 degrees of freedom, the scaling being chosen to make the standard deviation 0.2. We put an inverse gamma prior on $\sigma^2$ with shape and scale parameters 99 and 1 respectively, and independent Gaussian priors on $\beta_1$ and $\beta_2$ with mean vector 0 and dispersion matrix $n k_n^{-1} \sigma^2 (X_n^T X_n)^{-1}$. As for the choice of $k_n$, we take $k_n$ of the order $n^{1/9}$. The simulation results are summarized in Tables 1 and 2. Not surprisingly, the asymptotic-normality-based confidence intervals obtained from the VB method are shorter but too optimistic, failing to give adequate coverage for finite sample sizes. The posterior credible intervals, on the other hand, appear to be slightly conservative.
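A direct posterior draw of $\theta$ proceeds by sampling the spline coefficients from their Gaussian posterior and applying the map $\psi$. The sketch below assumes the posterior means and dispersions of the two coefficient vectors are already available (the names `post_means` and `post_covs`, the evaluation grid and the weight $t(1-t)$ are illustrative choices); since the Lotka-Volterra field is linear in $\theta$, the minimization defining $\psi$ reduces to a weighted least-squares solve.

```python
import numpy as np
from scipy.interpolate import BSpline

def draw_theta(post_means, post_covs, knots, m, rng, grid=None):
    # One posterior draw of theta: sample the coefficient vectors from
    # their Gaussian posteriors, form f and f' from the B-splines, and
    # minimize the weighted criterion defining psi(f).
    if grid is None:
        grid = np.linspace(0.01, 0.99, 200)
    f, df = [], []
    for mu, cov in zip(post_means, post_covs):
        beta = rng.multivariate_normal(mu, cov)
        spl = BSpline(knots, beta, m - 1)
        f.append(spl(grid))
        df.append(spl.derivative()(grid))
    w = np.sqrt(grid * (1.0 - grid))     # square root of the weight t(1 - t)
    ng = len(grid)
    # stacked design: df1 = th1*f1 - th2*f1*f2,  df2 = -th3*f2 + th4*f1*f2
    Z = np.zeros((2 * ng, 4))
    Z[:ng, 0], Z[:ng, 1] = f[0], -f[0] * f[1]
    Z[ng:, 2], Z[ng:, 3] = -f[1], f[0] * f[1]
    rhs = np.concatenate(df)
    ww = np.concatenate([w, w])
    theta, *_ = np.linalg.lstsq(Z * ww[:, None], rhs * ww, rcond=None)
    return theta
```

Repeating this draw 1000 times yields the posterior sample from which the equal-tailed credible intervals in the tables are computed; no ODE solve or MCMC chain is involved, which explains the modest per-replication running time.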

Proofs
Proof of Theorem 1. The structure of the proof follows that of Proposition 3.1 of Brunel (2008) and Proposition 3.3 of Gugushvili and Klaassen (2012), but differs substantially in detail, since we address posterior variation and also allow misspecification. Interchanging the order of differentiation and integration and using the definitions of $\theta$ and $\theta_0$, and replacing the difference between the values of a function at two different values of an argument by the integral of the corresponding partial derivative, we obtain a representation in terms of a matrix $M(f, \theta)$. Note that $M(f_0, \theta_0) = J_{\theta_0}$. We also define, for sufficiently large $n$, a set $E_n$. In view of Lemmas 2 and 4, on a set in the sample space with high true probability, the posterior distribution of $J_{\theta_0}^{-1}\sqrt n(T_{1n} + T_{2n} + T_{3n})$ assigns most of its mass inside a large compact set. Thus we can assert that inside the set $E_n$, the asymptotic behavior of the posterior distribution of $\sqrt n(\theta - \theta_0)$ is given by that of (6.4). We shall extract $\sqrt n J_{\theta_0}^{-1}(\Gamma(f) - \Gamma(f_0))$ from (6.4) and show that the remainder term goes to zero. First, integration by parts and the fact that $w(0) = w(1) = 0$ yield a decomposition whose first integral appears in (6.1), while the norm of the second integral can be bounded above by a constant multiple of $\sup_{t\in[0,1]} |f(t) - f_0(t)|$, using the continuity of $D_{0,1,1}F(t, y, \theta)$. Now we consider $T_{3n}$ in (6.4). The first integral on the right-hand side of (6.5) can be written as a sum $T_{31n} + T_{32n}$, say. Now $T_{31n}$ appears in (6.1). By the continuity of $D_{0,2,0}F(t, y, \theta)$, $T_{32n}$ can be bounded above by a constant multiple of $\sup_{t\in[0,1]} |f(t) - f_0(t)|^2$. We apply the Cauchy-Schwarz inequality and the continuity of $D_{0,1,1}F(t, y, \theta)$ to bound the second integral on the right-hand side of (6.5) by a constant multiple of $\sup\{|f(t) - f_0(t)|^2 : t \in [0, 1]\}$.
As far as the first term inside the bracket of (6.4) is concerned, its first integral appears in (6.1), and the norm of the second integral can be bounded by a constant multiple of $\sup\{|f(t) - f_0(t)|^2 : t \in [0, 1]\}$, utilizing the continuity of $D_{0,2,1}F(t, y, \theta)$ with respect to its arguments. Combining these bounds, the norm of the remainder term, premultiplied by $J_{\theta_0}^{-1}$, is bounded by a constant multiple of $\sup\{|f(t) - f_0(t)|^2 : t \in [0, 1]\}$. Now applying Lemma 2, we get the desired result.
Proof of Theorem 2. By Theorem 1 and (3.3), it suffices to show (6.6). Note that the posterior distribution of $G_{n,j}^T \beta_j$ is a normal distribution with mean vector $(1+\sigma^2 k_n/n)^{-1} G_{n,j}^T (X_n^T X_n)^{-1} X_n^T Y_{,j}$ and dispersion matrix $\sigma^2 (1+\sigma^2 k_n/n)^{-1} G_{n,j}^T (X_n^T X_n)^{-1} G_{n,j}$. We calculate the Kullback-Leibler divergence between two Gaussian distributions to prove the assertion; alternatively, we can follow the approach given in Theorem 1 and Corollary 1 of Bontemps (2011). The Kullback-Leibler divergence of $N_{k_n+m-1}(\mu_2, \Omega_2)$ from $N_{k_n+m-1}(\mu_1, \Omega_1)$ is given by
$$\frac12\left[\mathrm{tr}(\Omega_1^{-1}\Omega_2) - (k_n + m - 1) + (\mu_1 - \mu_2)^T \Omega_1^{-1} (\mu_1 - \mu_2) - \log\det(\Omega_1^{-1}\Omega_2)\right].$$
Taking $\mu_1 = (1+\sigma^2 k_n/n)^{-1} G_{n,j}^T (X_n^T X_n)^{-1} X_n^T Y_{,j}$, $\Omega_1 = \sigma^2 (1+\sigma^2 k_n/n)^{-1} G_{n,j}^T (X_n^T X_n)^{-1} G_{n,j}$ and $\mu_2 = G_{n,j}^T (X_n^T X_n)^{-1} X_n^T Y_{,j}$, $\Omega_2 = \sigma^2 G_{n,j}^T (X_n^T X_n)^{-1} G_{n,j}$, we get $\mathrm{tr}(\Omega_1^{-1}\Omega_2) = k_n + m - 1 + o(1)$. Also, $\log\det(\Omega_1^{-1}\Omega_2) \asymp k_n \log(1 + k_n/n) \asymp k_n^2/n = o(1)$. From the proof of Lemma 4, it follows that the contribution of the mean difference, $(\mu_1 - \mu_2)^T \Omega_1^{-1} (\mu_1 - \mu_2)$, is negligible as well. Hence, the total variation distance between the posterior distribution of $G_{n,j}^T \beta_j$ and a Gaussian distribution with mean $G_{n,j}^T (X_n^T X_n)^{-1} X_n^T Y_{,j}$ and dispersion matrix $\sigma^2 G_{n,j}^T (X_n^T X_n)^{-1} G_{n,j}$ converges in $P_0$-probability to zero for $j = 1, \ldots, d$. Since the posterior distributions of $\beta_j$ and $\beta_{j'}$ are mutually independent for $j \ne j'$; $j, j' = 1, \ldots, d$, we can assert that the posterior distribution of $\sqrt n \sum_{j=1}^d G_{n,j}^T \beta_j - \sqrt n J_{\theta_0}^{-1}\Gamma(f_0)$ can be approximated in total variation by $N(\mu_n, \sigma^2 \Sigma_n)$.
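The Gaussian Kullback-Leibler formula invoked above is easy to check numerically. The helper below is a generic sketch in the usual convention $KL(N(\mu_1, S_1) \,\|\, N(\mu_2, S_2))$; the test values are arbitrary.

```python
import numpy as np

def kl_mvn(mu1, S1, mu2, S2):
    # KL(N(mu1, S1) || N(mu2, S2)) for k-dimensional Gaussians:
    # 0.5 * [tr(S2^{-1} S1) - k + (mu2 - mu1)^T S2^{-1} (mu2 - mu1)
    #        + log(det S2 / det S1)]
    mu1, S1 = np.asarray(mu1, float), np.asarray(S1, float)
    mu2, S2 = np.asarray(mu2, float), np.asarray(S2, float)
    k = len(mu1)
    S2inv = np.linalg.inv(S2)
    d = mu2 - mu1
    return 0.5 * (np.trace(S2inv @ S1) - k + d @ S2inv @ d
                  + np.log(np.linalg.det(S2) / np.linalg.det(S1)))
```

In the proof, the two Gaussians differ only by the shrinkage factor $(1+\sigma^2 k_n/n)^{-1}$ in mean and scale, so each of the three terms of the formula is $o(1)$; Pinsker's inequality then transfers the conclusion to total variation distance.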
Proof of Theorem 3. The marginal posterior of $\sigma^2$ is also inverse gamma, with parameters $(dn + 2a)/2$ and $b + \sum_{j=1}^d Y_{,j}^T (I_n - P_{X_n}(1 + k_n/n)^{-1}) Y_{,j}/2$, where $P_{X_n} = X_n(X_n^T X_n)^{-1} X_n^T$. Straightforward calculations show that $|E(\sigma^2|Y) - \sigma_0^2| = O_{P_0}(n^{-1/2})$ and $\mathrm{Var}(\sigma^2|Y) = O_{P_0}(n^{-1})$. In particular, the marginal posterior distribution of $\sigma^2$ is consistent at the true value of the error variance. Let $N$ be an arbitrary neighborhood of $\sigma_0$; then $\Pi(N^c|Y) = o_{P_0}(1)$. We decompose the total variation distance of interest into three terms. The total variation distance between the two normal distributions appearing in the second term is bounded by a constant multiple of $|\sigma - \sigma_0|$, and hence that term can be made arbitrarily small by choosing $N$ appropriately. The first term converges in probability to zero by (3.4), and the third term converges in probability to zero by the posterior consistency. Hence, we get the desired result.
Proof of Theorem 4. According to the fitted model, $Y_{i,} \sim N_d((X_n)_{i,} B_n, \Sigma)$ for $i = 1, \ldots, n$. The logarithm of the posterior probability density function (p.d.f.) is, up to an additive constant, $-1/2$ times the expression (6.7), where $B_n = (\beta_1, \ldots, \beta_d)$. The quadratic term in $\beta_j$ in (6.7) for $j = 1, \ldots, d$ can be consolidated, the term linear in $\beta_j$, $j = 1, \ldots, d$, identified, and a completion-of-squares argument then gives a posterior density proportional to the p.d.f. of a matrix normal distribution. Fixing a $j \in \{1, \ldots, d\}$, we observe that the posterior mean of $\beta_j$ is a weighted sum of $(X_n^T X_n)^{-1} X_n^T Y_{,j'}$ for $j' = 1, \ldots, d$. The weight attached to $(X_n^T X_n)^{-1} X_n^T Y_{,j}$ is of the order 1, whereas for $j' \ne j$ the contribution of $(X_n^T X_n)^{-1} X_n^T Y_{,j'}$ is of the order $k_n/n$, which goes to zero as $n$ goes to infinity. Thus, the results of Lemmas 1 to 4 can be shown to hold under this setup. We note that the posterior distribution of $((\Sigma^{-1} + k_n I_d/n)^{1/2} \otimes I_{k_n+m-1})\,\mathrm{vec}(B_n)$ is a $(k_n + m - 1)d$-dimensional normal distribution with mean vector $\mathrm{vec}((X_n^T X_n)^{-1} X_n^T Y \Sigma^{-1} (\Sigma^{-1} + k_n I_d/n)^{-1/2})$ and dispersion matrix $I_d \otimes (X_n^T X_n)^{-1}$, since, by the properties of the Kronecker product, $(B^T \otimes A)\mathrm{vec}(D) = \mathrm{vec}(ADB)$ for matrices $A$, $B$ and $D$ of appropriate orders. Let us consider the mean vector of this posterior distribution, and write $C_n = ((c_{jk})) = (\Sigma + k_n \Sigma^2/n)^{-1/2}$. For $k = 1, \ldots, d$, we define $Z_k$ to be the sub-vector consisting of the $[(k-1)(k_n + m - 1) + 1]$th through $[k(k_n + m - 1)]$th elements. Also, the posterior distributions of $Z_k$ and $Z_{k'}$ are mutually independent for $k \ne k'$; $k, k' = 1, \ldots, d$.
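The Kronecker-product identity $(B^T \otimes A)\,\mathrm{vec}(D) = \mathrm{vec}(ADB)$ used in the step above is easy to verify numerically; the matrix shapes below are arbitrary, and `vec` stacks columns one over another as defined in Section 2.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
D = rng.standard_normal((4, 2))
B = rng.standard_normal((2, 5))

def vec(M):
    return M.ravel(order="F")            # stack columns one over another

# (B^T kron A) vec(D) should equal vec(A D B)
lhs = np.kron(B.T, A) @ vec(D)
rhs = vec(A @ D @ B)
```

Applied with $A = (X_n^T X_n)^{-1} X_n^T$, $D = Y$ and $B = \Sigma^{-1}(\Sigma^{-1} + k_n I_d/n)^{-1/2}$, this identity is exactly what converts the matrix-normal mean into the vectorized form displayed in the proof.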
Now we prove that the total variation distance between the posterior distribution of $Z_k$ and $N((X_n^T X_n)^{-1} X_n^T \sum_{j=1}^d Y_{,j}\sigma^{jk},\ (X_n^T X_n)^{-1})$ converges in $P_0$-probability to zero for $k = 1, \ldots, d$, where $\Sigma^{-1/2} = ((\sigma^{jk}))$. The total variation distance between two multivariate normal distributions with common dispersion matrix $(X_n^T X_n)^{-1}$ is controlled by the distance between their mean vectors, since the eigenvalues of $X_n(X_n^T X_n)^{-1} X_n^T$ are either zero or one. Since clearly $C_n$ converges to $\Sigma^{-1/2}$ at the rate $k_n/n$, the required bound holds for $j = 1, \ldots, d$. Hence, we conclude that the total variation distance between the distributions converges to zero in $P_0$-probability. Note that we can write $(G_{n,1}^T, \ldots, G_{n,d}^T)\,\mathrm{vec}(B_n)$ in terms of the $Z_k$. Since the posterior distributions of $Z_k$, $k = 1, \ldots, d$, are independent, we therefore obtain the asymptotic posterior distribution of $\sqrt n\,(G_{n,1}^T, \ldots, G_{n,d}^T)\,\mathrm{vec}(B_n)$ as $N(\mu_n^{**}, \Sigma_n^{**})$, with $\mu_n^{**}$ and $\Sigma_n^{**}$ as displayed. Following the steps of the proof of Lemma 4, it can be shown that the eigenvalues of the matrix $\Sigma_n^*$ mentioned in the statement of Theorem 4 are bounded away from zero and infinity. We can show that the Kullback-Leibler divergence of $N(\mu_n^{**}, \Sigma_n^{**})$ from $N(\mu_n^*, \sigma^2 \Sigma_n^*)$ converges in probability to zero by routine matrix manipulations. The resulting assertion is the analog of (6.6) in the proof of Theorem 2. Following steps similar to those of Theorem 2, we get (4.1), and we obtain (4.2) by following the proof of Theorem 3.

Appendix
We collect a few lemmas needed to prove the main results. We denote by $E_0(\cdot)$ and $\mathrm{Var}_0(\cdot)$ the expectation and variance operators respectively with respect to $P_0$-probability. The following lemma helps to estimate the bias of the Bayes estimator.
Lemma 1. For $m \ge 2$ and $k_n$ satisfying $n^{1/(2m)} \ll k_n \ll n$, the stated bound holds for $r = 0, 1$.

Proof. We note that $f^{(r)}(t) = (N^{(r)}(t))^T \beta$ for $r = 0, 1$, with $N^{(r)}(\cdot)$ standing for the $r$th order derivative of $N(\cdot)$. We use (2.6) together with the approximation result of Zhou and Wolfe (2000). Using the fact that $\sup_{t\in[0,1]} |(N^{(r)}(t))^T \beta^*| = O(1)$, the first term on the right-hand side of the resulting inequality is of the order $k_n^{r+1/2}/\sqrt n$. Using the Cauchy-Schwarz inequality, (7.2) and (7.3), we can bound the second term up to a constant multiple by $\sqrt n k_n^{-m}$. The third term is of the order $\sqrt n k_n^{-m-1/2}$ as a result of (7.3). By the assumed conditions on $m$ and $k_n$, the assertion holds.
The following lemma controls posterior variability.
Lemma 2. If $m \ge 5$ and $n^{1/(2m)} \ll k_n \ll n^{1/8}$, then the stated posterior concentration bound holds for $r = 0, 1$.

Proof. By Markov's inequality and the fact that $|a + b|^2 \le 2(|a|^2 + |b|^2)$ for two real numbers $a$ and $b$, we can bound the posterior probability in question by a sum of two terms as in (7.4), whose expectations we now bound. We can bound the expectation of the first term by (7.5), where $s_k = k/n$ for $k = 1, \ldots, n$. Applying the mean value theorem to the second term of that sum, we can bound the expression inside the $E_0$ expectation in the second term of (7.5) by a constant multiple of (7.6). By the spectral decomposition, we can write $X_n(X_n^T X_n)^{-1} X_n^T = P^T D P$, where $P$ is an orthogonal matrix and $D$ is a diagonal matrix with $k_n + m - 1$ ones and $n - k_n - m + 1$ zeros on the diagonal. Using the Cauchy-Schwarz inequality, Lemma 5.4 of Zhou and Wolfe (2000) and the fact that $\mathrm{Var}_0(P\varepsilon) = \mathrm{Var}_0(\varepsilon)$, we can conclude that the expectation of the first term of (7.6) is $O(k_n^{2r+2}/n)$. Again applying the Cauchy-Schwarz inequality, the second term of (7.6) is bounded by a quantity whose expectation is of the order $n(k_n^{2r+3}/n^3) = k_n^{2r+3}/n^2$, using Lemma 5.4 of Zhou and Wolfe (2000). Thus, the expectation of the bound given by (7.6) is of the order $k_n^{2r+2}/n$. Combining this with (7.5) and Lemma 1, we get (7.7). Considering $\sup_{t\in[0,1]} (N^{(r)}(t))^T (X_n^T X_n)^{-1/2} \varepsilon^*$ and using the Cauchy-Schwarz inequality and Lemma 5.4 of Zhou and Wolfe (2000), the second term inside the bracket in (7.4) is seen to be $O(k_n^{2r+2}/n)$. Combining this with (7.4) and (7.7) and using $2r + 2 \le 4$, we have the desired assertion.
Lemmas 1 and 2 can be used to prove the posterior consistency of θ, as shown in the next lemma.

Also note that
$$\mathrm{Var}(G_{n,j}^T \beta_j \mid Y) = \sigma^2 \left(1 + \frac{\sigma^2 k_n}{n}\right)^{-1} G_{n,j}^T (X_n^T X_n)^{-1} G_{n,j}.$$