Exact adaptive confidence intervals for linear regression coefficients

We propose an adaptive confidence interval procedure (CIP) for the coefficients in the normal linear regression model. This procedure has a frequentist coverage rate that is constant as a function of the model parameters, yet provides smaller intervals than the usual interval procedure, on average across regression coefficients. The proposed procedure is obtained by defining a class of CIPs that all have exact $1-\alpha$ frequentist coverage, and then selecting from this class the procedure that minimizes a prior expected interval width. Such a procedure may be described as "frequentist, assisted by Bayes," or FAB. We describe an adaptive approach for estimating the prior distribution from the data so that exact non-asymptotic $1-\alpha$ coverage is maintained. Additionally, in a "$p$ growing with $n$" asymptotic scenario, this adaptive FAB procedure is asymptotically Bayes-optimal among $1-\alpha$ frequentist CIPs.


Introduction
Linear regression analyses routinely include point estimates and confidence intervals for the regression coefficients β = (β_1, …, β_p) of the linear model y ∼ N_n(Xβ, σ²I). The most widely used confidence interval procedure (CIP) for an element β_j of β is perhaps the usual t-interval centered around the ordinary least-squares (OLS) estimate β̂_j. This interval is uniformly most accurate among CIPs derived by inverting unbiased tests, and so it is called the uniformly most accurate unbiased (UMAU) CIP.
In this article we consider alternatives to the UMAU procedure that have constant coverage, that is, interval procedures C_j(y) satisfying Pr(β_j ∈ C_j(y) | β, σ) = 1 − α, ∀ (β, σ) ∈ R^p × R^+. (1) This is the usual frequentist definition of 1 − α coverage: the random interval C_j(y) covers the true value β_j with probability 1 − α, no matter what β and σ are.
We introduce the term "constant coverage" to distinguish such intervals from intervals whose coverage is bounded below by 1 − α but varies with (β, σ²), and from so-called "frequentist intervals" whose coverage rate is constant in the parameters only asymptotically. For example, the usual score interval for a coefficient in a logistic regression model has an actual finite-sample coverage rate that depends on the values of the parameters.
The UMAU interval procedure of course has constant coverage, and it also has an expected width that is constant for all values of β. However, in many cases we have prior information that many of the elements of β may be close to a particular value, such as zero. In this case, we might prefer a CIP for β_j that has a smaller expected width for "likely" values of β_j in exchange for having wider intervals for values of β_j that are less likely. Specifically, if our prior information could be quantified in terms of a prior distribution with density π(β), then arguably we would be interested in an interval procedure that minimizes the prior expected width E[|C_j(y)|] = ∫∫ |C_j(y)| p(y | β, σ) dy π(β) dβ among all CIPs that satisfy the constant coverage property (1). Such a procedure would still be "frequentist" in that it would have 1 − α constant coverage, but it would also be Bayes-optimal among frequentist procedures. We refer to such a statistical procedure as "frequentist, assisted by Bayes," or FAB.
In practice, an appropriate prior distribution may not be known in advance. In this article we present a method for adaptively estimating a normal prior distribution for β from the data y, and then using this estimated prior distribution to construct an approximately Bayes-optimal CIP for each regression coefficient β j , j = 1, . . . , p. The CIP we propose satisfies the constant coverage condition (1) exactly and non-asymptotically, but it is also Bayes-optimal asymptotically as p and n increase to infinity. Our proposed adaptive CIP builds on the work of Pratt (1963), who obtained a Bayes-optimal frequentist confidence interval for the mean of a normal population with a known variance. In the next section we review Pratt's FAB interval, and discuss an extension developed in Yu and Hoff (2016) to accommodate an unknown variance. In Section 3 we further extend these ideas to the case of interval estimation for a linear regression coefficient, and show how we may adaptively estimate a normal prior distribution for the elements of β. The resulting adaptive FAB confidence interval we propose maintains exact, non-asymptotic constant coverage.
Additionally, since the accuracy of our adaptive estimate improves as n and p increase, our adaptive FAB procedure is Bayes-optimal under this type of asymptotic regime. Section 4 includes several numerical examples illustrating the use of the adaptive FAB procedure, including analyses of two datasets and a small simulation study. A discussion follows in Section 5.
Several other authors have studied alternatives to UMAU intervals for regression parameters.
O'Gorman (2001) proposed an alternative CIP for regression coefficients. Their procedure depends on a user-specified spline function, for which the constant coverage property must be checked numerically. In contrast, our proposed CIP is obtained by adaptively selecting from a class of constant-coverage intervals, based on easy-to-obtain estimates of a few parameters. For our procedure, constant coverage follows by construction and does not need to be checked numerically. Lee et al. (2016) developed a procedure that has exact conditional coverage, given a model selection event and knowledge of σ². However, for cases where σ² is unknown, their suggested modification uses a plug-in estimate of σ² and achieves exact coverage only asymptotically. Other authors (Bühlmann (2013), van de Geer et al. (2014), Zhang and Zhang (2014)) have considered confidence interval construction for sparse, high-dimensional regression, including the case that p > n. These approaches generally work by de-biasing sparse estimators of the regression coefficients. However, the coverage rates of these methods are asymptotic, and typically depend on conditions on the design matrix X and the degree of sparsity of β. For example, in Section 4 we show that the finite-sample coverage of one such procedure can be very good in a sparse setting, but extremely poor if β is not sparse.

Review of FAB intervals
Suppose θ̂ is normally distributed with unknown mean θ and known variance σ². Then for any choice of s ∈ [0, 1], Pr(θ̂ + σz_{α(1−s)} < θ < θ̂ + σz_{1−αs} | θ) = 1 − α, where z_p denotes the pth quantile of the standard normal distribution. As described in Yu and Hoff (2016), this implies that for any function s : R → [0, 1] the set-valued function C_s(θ̂) = {θ : θ̂ + σz_{α(1−s(θ))} < θ < θ̂ + σz_{1−αs(θ)}} is a 1 − α confidence procedure, satisfying Pr(θ ∈ C_s(θ̂) | θ) = 1 − α. We refer to such a function s(θ) as a spending function, as it corresponds to the regions of the parameter space upon which type I error is "spent." The usual procedure is C_{1/2}(θ̂), obtained from the constant spending function s(θ) = 1/2.
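To make the spending-function construction concrete, the following sketch finds the set of θ values for which α(1 − s(θ)) lower-tail mass and αs(θ) upper-tail mass leave θ inside the acceptance region, by scanning a grid. This is an illustration using only the Python standard library; the function name is ours, and the logistic spending function is a stand-in with the right shape, not Pratt's optimal spending function (3).

```python
import math
from statistics import NormalDist

norm = NormalDist()

def fab_z_interval(theta_hat, sigma, s, alpha=0.05, half_width=10.0, grid=20001):
    """Endpoints of {theta : alpha*(1 - s(theta)) < Phi((theta - theta_hat)/sigma)
    < 1 - alpha*s(theta)}; for a nondecreasing spending function s this set is
    an interval, found here by scanning a grid around theta_hat."""
    inside = []
    for i in range(grid):
        theta = theta_hat + sigma * (-half_width + 2.0 * half_width * i / (grid - 1))
        u = norm.cdf((theta - theta_hat) / sigma)
        if alpha * (1.0 - s(theta)) < u < 1.0 - alpha * s(theta):
            inside.append(theta)
    return inside[0], inside[-1]

# the constant spending function s = 1/2 recovers the usual interval theta_hat +/- 1.96 sigma
usual = fab_z_interval(1.0, 1.0, lambda t: 0.5)

# an illustrative nondecreasing spending function favoring a prior mean mu = 0
# (a logistic curve, NOT Pratt's optimal choice)
fab = fab_z_interval(1.0, 1.0, lambda t: 1.0 / (1.0 + math.exp(-t)))
```

With θ̂ = 1 and σ = 1, the logistic choice gives an interval of width about 3.5 that is shifted toward the prior mean, versus width 3.92 for the usual interval; the trade-off is wider intervals when θ̂ falls far from the prior mean.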
While C_{1/2} is the uniformly most accurate unbiased (UMAU) confidence interval procedure (CIP), the lack of a uniformly most powerful test of H_θ : E[θ̂] = θ versus K_θ : E[θ̂] ≠ θ means there are confidence procedures, corresponding to collections of biased level-α tests, that have smaller expected widths than the UMAU procedure over some regions of the parameter space. If prior information is available that θ is likely to be near some value µ, then we may be willing to incur wider intervals for θ-values far from µ in exchange for narrower intervals near µ. With this in mind, Pratt (1963) developed a Bayes-optimal 1 − α CIP that minimizes the "Bayes width," or expected interval width averaged over values of both θ̂ and θ, where the latter averaging is done with respect to a N(µ, τ²) prior distribution for θ. The resulting CIP has 1 − α frequentist coverage for each value of θ, but has smaller expected width for values of θ near the prior mean (and wider expected widths elsewhere).
We describe this interval as being "frequentist, assisted by Bayes," or FAB. As shown in Yu and Hoff (2016), the spending function corresponding to Pratt's FAB confidence interval is characterized as follows: If τ² > 0, then s is given by (3). If τ² = 0, then s(θ) = 1 if θ > µ and s(θ) = 0 if θ < µ. The value of s(µ) ∈ [0, 1] does not affect the width of the confidence interval, but can affect whether or not µ is included in the interval (as an endpoint). We suggest taking s(µ) to be 1/2 when τ² = 0, as it is in the case that τ² > 0. Now consider confidence interval construction for θ in the more typical case that σ² is unknown.
Items 1 and 2 of Proposition 1 were shown in Yu and Hoff (2016), and a proof of item 3 is in the appendix. This result says that every spending function corresponds to a 1 − α frequentist confidence procedure, and every nondecreasing spending function corresponds to a 1 − α confidence interval procedure. Yu and Hoff (2016) showed that the spending function (3) that corresponds to Pratt's z-interval is strictly increasing. If such a nondecreasing spending function s is used, then the lower and upper endpoints of the interval, θ_L and θ_U, are obtained by solving the equations F((θ_L − θ̂)/σ̂) = α(1 − s(θ_L)) and F((θ_U − θ̂)/σ̂) = 1 − αs(θ_U), where F is the CDF of the t_q distribution. These equations can be solved using a zero-finding algorithm, noting that θ_L < θ̂ + σ̂t_α and θ̂ + σ̂t_{1−α} < θ_U. Furthermore, since t_α < 0 < t_{1−α} when α < 1/2, this implies that θ_L < θ̂ < θ_U.
The spending function s(θ) should be chosen on the basis of any additional information we may have about the value of θ. While Pratt's FAB interval uses prior information for a scalar parameter, in multiparameter settings such information may come from the data itself. For example, estimates of some parameters might suggest plausible values for others. In this case, we may want to use a spending function s̃(θ) that minimizes a Bayes risk corresponding to a "prior" distribution that is adaptively estimated from the data. We refer to such a procedure as adaptive FAB. Fortunately, the results of Proposition 1 hold not just for fixed spending functions, but also for spending functions that are random but statistically independent of θ̂ and σ̂²: Corollary 1. If θ̂ ∼ N(θ, σ²) and qσ̂²/σ² ∼ χ²_q, and θ̂, σ̂² and s̃ are independent, then Pr(θ ∈ C_s̃(θ̂, σ̂) | θ, σ) = 1 − α.
This result follows by conditioning on s̃: Pr(θ ∈ C_s̃(θ̂, σ̂) | θ, σ) = E[ Pr(θ ∈ C_s̃(θ̂, σ̂) | s̃, θ, σ) | θ, σ ], and the inner conditional probability is 1 − α since s̃ and (θ̂, σ̂) are independent and C_s̃ has 1 − α coverage for each fixed s̃. Yu and Hoff (2016) made use of this fact to develop an adaptive FAB confidence interval procedure for the means of multiple normal populations. Their adaptive procedure for the mean θ of a given population is C_s̃(θ̂, σ̂), where s̃ is the spending function (3) with (µ, σ², τ²) replaced by estimates computed from the data of the other populations. This procedure provides exact 1 − α confidence intervals for each population, and is asymptotically optimal in the case of the normal hierarchical model.
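The conditioning argument can be checked by simulation. The sketch below uses the known-σ z-interval version for simplicity (the same argument applies to the t case): a spending value is drawn independently of θ̂ on each replicate, and the empirical coverage stays at 1 − α. All names and the particular randomization are our own.

```python
import random
from statistics import NormalDist

norm = NormalDist()
rng = random.Random(42)

def covers(theta, theta_hat, sigma, s_val, alpha=0.05):
    # theta is covered iff alpha*(1 - s) < Phi((theta - theta_hat)/sigma) < 1 - alpha*s
    u = norm.cdf((theta - theta_hat) / sigma)
    return alpha * (1.0 - s_val) < u < 1.0 - alpha * s_val

theta, sigma, alpha, n_sim = 2.0, 1.0, 0.05, 40000
hits = 0
for _ in range(n_sim):
    s_val = rng.random()                  # random spending value, independent of theta_hat
    theta_hat = rng.gauss(theta, sigma)
    hits += covers(theta, theta_hat, sigma, s_val, alpha)
rate = hits / n_sim                       # stays near 1 - alpha = 0.95
```

Because coverage is exactly 1 − α conditional on each realized spending value, the marginal coverage is also 1 − α, and the Monte Carlo rate should deviate from 0.95 only by simulation noise.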

FAB t-intervals for regression parameters
We now show how the results discussed in the previous section may be used to construct adaptive frequentist confidence intervals for linear regression parameters. The intervals we construct have exact 1 − α constant coverage and do not require asymptotic approximations or assumptions on the design matrix or the unknown parameters. Under some conditions, the intervals are also asymptotically Bayes-optimal. The intervals do require that the number n of observations is larger than the number p of regressors.

FAB regression intervals
Consider the problem of constructing confidence intervals for the elements of an unknown vector β ∈ R^p, based on data y ∈ R^n and X ∈ R^{n×p} from the normal linear regression model y ∼ N_n(Xβ, σ²I). As is well known, β̂ = (X'X)^{−1}X'y ∼ N_p(β, σ²(X'X)^{−1}) and (n − p)σ̂²/σ² ∼ χ²_{n−p}, with β̂ and σ̂² being independent. In particular, the usual t-interval for β_j is β̂_j ± w_jσ̂t_{1−α/2}, where w_j² = [(X'X)^{−1}]_{jj} and t_p here denotes the pth quantile of the t_{n−p} distribution. This interval has 1 − α coverage probability and an expected width that is constant as a function of the true value of β_j. Now suppose that prior information about β suggests that β ∼ N_p(0, τ²I) for some value of τ² (other prior distributions will be discussed in Sections 4 and 5). If τ² and σ² were known, the Bayes-optimal CIP for β_j would be obtained simply by replacing θ̂ and σ in (2) and (3) with β̂_j and w_jσ, yielding the oracle FAB interval (7). However, since τ² and σ² are unknown, we alter this interval as follows: • σ² is replaced by σ̂², which is independent of β̂_j and satisfies (n − p)σ̂²/σ² ∼ χ²_{n−p}; • z-quantiles are replaced by the quantiles of the t_{n−p} distribution; • the spending function s, which depends on (τ², σ²), is replaced by a spending function s̃ that depends on estimates (τ̃², σ̃²). These modifications yield the adaptive FAB interval given in (8). Such an interval satisfies the conditions of Corollary 1, thereby guaranteeing exact 1 − α frequentist coverage, regardless of whether or not the values of β are approximately normally distributed, or of whether the estimates τ̃², σ̃² are accurate. However, if the normal approximation and adaptive estimates are accurate, then we expect the resulting FAB interval (8) to be close to the "oracle" interval (7), which is Bayes-optimal and narrower on average than the UMAU procedure given by (6).
The approximate optimality of C_s̃ is considered more formally in the next subsection using an asymptotic argument. First, we discuss obtaining estimators (τ̃², σ̃²) that are independent of (β̂_j, σ̂²), so that the conditions of Corollary 1 are met. Let P_X = X(X'X)^{−1}X' and P_0 = I − P_X be the projection matrices onto the column space of X and its orthogonal complement, respectively. Recall that the OLS estimate is β̂_j = a'y, where a' is the jth row of the matrix (X'X)^{−1}X'. Let P_1 = aa'/(a'a) be the projection matrix associated with a, and let P_2 = P_X(I − P_1). We can decompose y as y = y_0 + y_1 + y_2, where y_k = P_ky. Since P_kP_l = 0 for k ≠ l, the vectors y_0, y_1 and y_2 are statistically independent. Now the OLS estimate satisfies β̂_j = a'P_1y = a'y_1, and σ̂² = y_0'y_0/(n − p), and so these two estimates are statistically independent of each other and of the vector y_2. Therefore, any estimates (τ̃², σ̃²) that are functions of y_2 will be independent of (β̂_j, σ̂²), and so can be used to construct a spending function s̃ that satisfies the conditions of Corollary 1.
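The orthogonality of the three projections is easy to verify numerically. A small numpy sketch with a simulated design (variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, j = 50, 5, 2
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

XtX_inv = np.linalg.inv(X.T @ X)
a = X @ XtX_inv[:, j]                # a'y equals the OLS estimate of beta_j
P_X = X @ XtX_inv @ X.T              # projection onto the column space of X
P0 = np.eye(n) - P_X                 # projection onto its orthogonal complement
P1 = np.outer(a, a) / (a @ a)        # projection onto span(a)
P2 = P_X - P1                        # equals P_X (I - P1), since P_X a = a

y0, y1, y2 = P0 @ y, P1 @ y, P2 @ y  # y = y0 + y1 + y2, mutually independent pieces
beta_hat_j = a @ y
sigma2_hat = (y0 @ y0) / (n - p)     # the usual unbiased variance estimate
```

Up to rounding, P_0P_1 = P_0P_2 = P_1P_2 = 0, the three pieces reconstruct y, and β̂_j depends on y only through y_1 while σ̂² depends on y only through y_0.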
To obtain such an estimate (τ̃², σ̃²), let G_2 be an orthonormal basis for the range of P_2 (for example, the matrix of eigenvectors of P_2 that correspond to non-zero eigenvalues), and let z_2 = G_2'y. Under the prior model β ∼ N(0, τ²I), the marginal distribution for z_2 is therefore z_2 ∼ N_{p−1}(0, X_2X_2'τ² + σ²I), (9) where X_2 = G_2'X. A variety of empirical Bayes estimates of (τ², σ²) may be obtained from this marginal distribution. For example, noting that E[z_2'A'Az_2] = tr(X_2'A'AX_2)τ² + tr(A'A)σ² for any matrix A, unbiased moment estimates may be obtained by finding (τ̃², σ̃²) that simultaneously solve the two equations z_2'A'Az_2 = tr(X_2'A'AX_2)τ² + tr(A'A)σ² obtained from two different values of A. Alternatively, (τ̃², σ̃²) may be taken to be the maximum likelihood estimate based on the marginal model (9). This estimate is discussed further in the next subsection.
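As a sketch of the moment-equation construction, the following numpy fragment builds G_2 and X_2 for one coefficient, forms the two equations for the illustrative (our own) choices A = I and A = X_2X_2', and verifies that solving the resulting 2 × 2 linear system with the exact expectations in place of the observed quadratic forms returns (τ², σ²). In practice the left-hand sides would be the observed values of z_2'A'Az_2.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, j = 60, 8, 0
tau2, sigma2 = 2.0, 1.0
X = rng.standard_normal((n, p))

XtX_inv = np.linalg.inv(X.T @ X)
a = X @ XtX_inv[:, j]
P2 = X @ XtX_inv @ X.T - np.outer(a, a) / (a @ a)   # projection of rank p - 1
vals, vecs = np.linalg.eigh(P2)
G2 = vecs[:, vals > 0.5]                            # orthonormal basis for range(P2)
X2 = G2.T @ X                                       # (p - 1) x p

# marginal covariance of z2 = G2'y under the prior beta ~ N(0, tau2 I)
Sigma = tau2 * X2 @ X2.T + sigma2 * np.eye(p - 1)

# two moment equations E[z2' A'A z2] = tr(X2' A'A X2) tau2 + tr(A'A) sigma2,
# one for A = I and one for A = X2 X2' (an assumed, illustrative pair)
A_list = [np.eye(p - 1), X2 @ X2.T]
M = np.array([[np.trace(X2.T @ A.T @ A @ X2), np.trace(A.T @ A)] for A in A_list])
b = np.array([np.trace(A.T @ A @ Sigma) for A in A_list])   # exact expectations
tau2_est, sigma2_est = np.linalg.solve(M, b)
```

Since tr(A'AΣ) = τ² tr(A'AX_2X_2') + σ² tr(A'A) exactly, the solve recovers (τ², σ²) up to rounding; with observed quadratic forms the same solve gives the unbiased moment estimates.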

Approximate optimality
As discussed above, if β_j ∼ N(0, τ²) and σ² and τ² were known, then the oracle FAB interval C_s given by (7) would be Bayes-optimal, in that it minimizes the prior expected interval width E[|C|] among procedures C that have 1 − α frequentist coverage. This prior expected width is an expectation over both the estimate β̂_j and the value of β_j, the latter with respect to the N(0, τ²) prior distribution.
The adaptive FAB interval C_s̃ given by (8) differs from the oracle FAB interval in three ways: the value of σ² has been replaced by σ̂²; the z-quantiles have been replaced by t-quantiles; and the spending function s, which depends on (τ², σ²), has been replaced by s̃, which depends on (τ̃², σ̃²). In this subsection we take (τ̃², σ̃²) to be the maximizers of the likelihood given by the marginal model (9). The resulting interval still has 1 − α frequentist coverage, but it is only an approximation to C_s, and so we must have E[|C_s̃|] ≥ E[|C_s|], since C_s is Bayes-optimal. However, if n − p is large then the t-quantiles will be close to the corresponding z-quantiles, and we expect that σ̂² ≈ σ². If p is also large then under the prior β ∼ N(0, τ²I) we expect that τ̃² ≈ τ² and σ̃² ≈ σ². As a result, we should have s̃(β_j) ≈ s(β_j), and so we expect that E[|C_s̃|] ≈ E[|C_s|]; that is, the FAB procedure will be approximately Bayes-optimal.
We investigate this more formally with an asymptotic comparison of the widths of the adaptive and oracle FAB procedures. We first obtain an asymptotic result for a single scalar parameter β_j, and then discuss the result in the context of the linear regression model. Consider a sequence of experiments indexed by n such that for each n we have statistics (β̂_j, σ̂², τ̃², σ̃²) that satisfy the coverage conditions C1 and C2 given above, and suppose that the asymptotic conditions A1, A2 and A3 hold as n → ∞. We consider a case where σ² grows with n because otherwise, if σ² were fixed, the widths of the oracle FAB, adaptive FAB and UMAU intervals would all converge to zero at the same rate.
Lemma 1. Under the conditions C1, C2, A1, A2 and A3, the width |C_s̃| of the FAB procedure (8) is asymptotically equivalent to the width of the oracle FAB interval, in probability and in prior expectation. A proof is in the appendix. The lemma says that under this asymptotic regime, the performance of the adaptive FAB interval is asymptotically equivalent to that of the oracle FAB interval: Both have 1 − α frequentist coverage for each n, and the prior expected width of the adaptive FAB procedure approaches that of the oracle FAB interval as n → ∞.
We now consider how this result applies to the linear regression model and the specific estimates (β̂_j, σ̂², τ̃², σ̃²) described in the previous subsection. Consider a sequence of experiments indexed by n such that the following conditions hold: B1. For each n, • X is full-rank; • y ∼ N_n(Xβ, σ²I) with σ² = nσ²_∞; • β ∼ N(0, τ²I).
B2. Both n and p increase, with p/n → 0.
B3. The empirical distribution of the eigenvalues of X'X/n is bounded uniformly in n, and converges in distribution to a non-degenerate limit as n → ∞.
If conditions B1, B2 and B3 are met, then the estimates (β̂_j, σ̂², τ̃², σ̃²) defined in Section 3.1 satisfy the conditions C1, C2, A1 and A2, and so the FAB interval for β_j of any variable j satisfying condition A3 will satisfy the conditions of Theorem 1, and hence be asymptotically optimal. To see that this holds, first note that the definition of the model in condition B1 implies that (β̂_j, σ̂², τ̃², σ̃²) satisfy the coverage conditions C1 and C2. Second, asymptotic condition A1 is met by the definition σ² = nσ²_∞ in B1 and the fact that n grows faster than p, as assumed by B2. The remaining necessary result is the following: Lemma 2. Suppose B1, B2 and B3 hold, and that (τ², σ²_∞) ∈ Θ, a compact subset of [0, ∞) × (0, ∞). Let (τ̃², σ̃²) be the maximizers over Θ of the likelihood given by the marginal model (9). Then (τ̃², σ̃²) converges in probability to (τ², σ²_∞). This result is proven in the appendix. Putting together Lemma 1 and Lemma 2 gives Theorem 1, which summarizes the asymptotic behavior of C_s̃ and makes precise the heuristic idea that if n and p are large, then the adaptive FAB interval should be nearly as good as the oracle FAB interval.

Numerical examples
In this section we illustrate the adaptive FAB procedure numerically, and show how it can be modified to accommodate different adaptation strategies. For example, in the next subsection we use an empirically estimated prior distribution that is not centered around zero, thereby providing improved performance if most of the effects are of a common sign. In the following subsection, we show how adaptation may be done separately for different groups of parameters, such as main effects and interactions. We also provide a simulation study that illustrates how a CIP that adapts to sparsity may have very poor coverage if the regression parameter is not actually sparse, whereas the adaptive FAB procedure maintains constant coverage for all parameter values.

Motif regression

Conlon et al. (2003) measured the binding intensity of a protein to each of n = 287 DNA segments, and related each intensity to scores measuring the abundance of the DNA segment in p = 195 genetic motifs. These data were also used as an example by Meinshausen et al. (2009), among others. Assuming a normal linear regression model for the centered and scaled data, the usual unbiased estimate of σ² is 0.77, and the usual standard errors for the OLS regression coefficients range from 0.12 to 0.85 with a mean of 0.30. On the other hand, empirical Bayes estimates of µ and τ² under the prior β ∼ N_p(µ1, τ²I) are around 0.004 and 0.001, respectively (τ̃ ≈ 0.036), suggesting that the true values of the elements of β are highly concentrated around zero.
We constructed 95% FAB confidence intervals for the effects of the p = 195 genetic motifs, using the adaptive FAB procedure described in Section 3.1 except under a N p (µ1, τ 2 I) distribution for β.
This allows for the possibility that the distribution of true effects is not centered around zero, which seems reasonable for this particular dataset, where abundance is expected to have either a positive or negligible effect on binding intensity. In the analysis that follows, for each coefficient j, the values of (µ̃, τ̃², σ̃²) are estimated from the j-specific vector y_2 defined in Section 3.1, thereby ensuring that β̂_j is independent of (µ̃, τ̃², σ̃²) and that constant coverage of the FAB confidence interval for each β_j is maintained.
The intervals are shown graphically in Figure 1, along with the UMAU intervals for comparison.

Motif regression simulation study
Zhang and Zhang (2014) developed a confidence interval procedure for sparse parameters in high-dimensional normal linear regression models. When applied to the motif dataset, this low-dimensional projection (LDP) procedure produces intervals that are narrower than the FAB intervals for all regression coefficients, with relative widths ranging from 0.27 to 0.71, and being about half as wide on average across coefficients. However, unlike the FAB and UMAU procedures, the LDP procedure does not guarantee its nominal coverage rate in finite samples. To compare the performance of the UMAU, FAB and LDP procedures, we constructed two related simulation studies based on the motif binding dataset described in the previous subsection.
In each study, we obtained estimates (β_0, σ²_0) from the real data y and X, and used these estimates to simulate new response vectors y^(k) ∼ N(Xβ_0, σ²_0I) independently for k = 1, …, 5000. For each response vector y^(k) we constructed UMAU, FAB and LDP confidence intervals for each of the p = 195 regression coefficients. These intervals were used to obtain Monte Carlo approximations to the finite-sample coverage rates of the LDP procedure, as well as approximations to the expected interval widths of the UMAU, FAB and LDP procedures.
In the first of these two simulation studies we simulated 5000 datasets from the model y^(k) ∼ N_n(Xβ_0, σ²_0I), where X is the original design matrix and β_0 is the lasso estimate from the original data, using an empirical Bayes estimate of the L_1-penalty parameter. This resulted in a sparse β_0-vector with 176 of the 195 coefficients being identically zero, so in the context of this simulation study the "truth" is highly sparse. The value of σ²_0 used to simulate the data was the usual unbiased estimate from the original data. We computed the UMAU, FAB and LDP confidence intervals for each of the 5000 simulated datasets. The widths of the FAB and LDP intervals were 85% and 43% of the UMAU interval widths respectively, on average across datasets and parameters. The empirical coverage rates of the nominal 95% LDP intervals ranged between 93.8 and 96.1 percent. There was some evidence that the coverage rates were not exactly 95%: Exact level-0.05 binomial tests rejected the hypothesis that the coverage rates were 95% for 71 of the 195 regression parameters (36%).
All of these 71 parameters had true values of 0, and the empirical coverage rates of 64 of these 71 parameters were larger than 95%, suggesting that LDP intervals slightly overcover β j when it is zero. However, in general the coverage rates of the LDP procedure were very close to the nominal rates, in this case where the truth is sparse.
The second simulation study was the same as the first except the value β 0 used to generate the simulated data was the OLS estimate from the original data, and so in this case the "true" regression model is not sparse. On average across the 5000 simulated datasets and 195 parameters, the widths of the FAB and LDP intervals were 88% and 54% of the UMAU interval widths respectively, similar to the results from the first study. These relative widths are shown in the left panel of Figure 2. However, the coverage rates for the LDP intervals were generally far from their nominal levels: Based on exact binomial tests, coverage rates for 183 of the 195 parameters were significantly different from 95% (at level 0.05). As shown in the right panel of Figure 2, the LDP intervals generally overcover parameter values near zero, and greatly undercover parameters larger in magnitude. For comparison, the empirical coverage rates of the FAB intervals are also shown.
These rates show no evidence of deviation from the nominal rates, as should be the case - the FAB intervals have exact 95% coverage for each component of β by construction.

Diabetes progression

Efron et al. (2004) considered parameter estimation for a model of diabetes progression from data on ten explanatory variables from each of n = 442 subjects. The expected progression of a subject was assumed to be a linear function of the linear, quadratic and two-way interaction effects of the ten variables, resulting in a linear model with p = 64 regressors total (the binary sex variable does not have a separate quadratic effect).
We generally expect that main effects will be larger than quadratic effects and two-way interactions. For this reason, it makes sense to obtain adaptive intervals separately for these three types of parameters, so that the spending function used to obtain the confidence interval for the effect of a given regressor is obtained adaptively from the estimated effects of the regressors in the same category. This can easily be done as follows: Write the design matrix X as X = [X_1, X_2, X_3], where X_1, X_2, X_3 are the design matrices corresponding to the main effects, quadratic effects and two-way interactions, respectively, and let β = (β_1, β_2, β_3) be the corresponding partition of β.
To obtain the FAB CIs for the main effects, we let G be an orthonormal basis for the null space of [X_2, X_3]'. Letting ỹ = G'y and X̃ = G'X_1, we have ỹ ∼ N_{n−p_2−p_3}(X̃β_1, σ²I). We can then apply the FAB CI procedure to (ỹ, X̃) to obtain intervals that adapt to the magnitude of β_1 (and not to the magnitudes of β_2 and β_3). Adaptive confidence intervals for β_2 and β_3 can be obtained analogously.
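A numpy sketch of this reduction with simulated blocks (names ours): G is built from the left singular vectors of the nuisance blocks, and by the Frisch-Waugh-Lovell argument the OLS fit of ỹ on X̃ reproduces the main-effect coefficients of the full regression.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p1, p2, p3 = 60, 4, 3, 5
X1 = rng.standard_normal((n, p1))   # main effects
X2 = rng.standard_normal((n, p2))   # quadratic effects
X3 = rng.standard_normal((n, p3))   # two-way interactions
y = rng.standard_normal(n)

W = np.hstack([X2, X3])                        # nuisance blocks to project away
U = np.linalg.svd(W, full_matrices=True)[0]
G = U[:, p2 + p3:]                             # orthonormal basis, G'W = 0
y_tilde = G.T @ y                              # ~ N_{n-p2-p3}(X_tilde beta1, sigma^2 I)
X_tilde = G.T @ X1
```

The FAB procedure of Section 3.1 can then be applied to (ỹ, X̃), so that adaptation draws only on the main-effect block.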
In the analysis that follows we use an adaptively estimated N(0, τ²) prior distribution for each coefficient. Recall that our FAB procedure generates an empirical Bayes estimate τ̃² of τ² = Var[β_j] for each coefficient j that is statistically independent of the OLS estimate β̂_j. For the main effects, the values of τ̃ ranged between 0.19 and 0.21, with a mean of 0.20, and were larger than the standard errors of the OLS coefficients except for those of four somewhat collinear predictors. In contrast, the values of τ̃ for the quadratic and interaction terms were all less than 0.03, and were all less than the corresponding standard errors.
We computed the adaptive FAB interval for each regression coefficient using these coefficient-specific estimates of τ². The FAB intervals are as narrow or narrower than all but three of the corresponding UMAU intervals, with the relative interval widths ranging from 0.84 to 1.0003, and being 0.86 on average. The FAB CIs for the main effects are essentially the same as the UMAU CIs, whereas the FAB CIs for the quadratic and interaction terms are all narrower than the corresponding UMAU intervals, by about 16% on average. This example illustrates some of the flexibility of the FAB procedure, in that the adaptation for a particular parameter may be based on the subset of the data deemed most relevant for that parameter.

Discussion
We have constructed a class of 1 − α confidence interval procedures (CIPs) for individual regression coefficients of the normal regression model y ∼ N n (Xβ, σ 2 I). Each member of this class corresponds to a spending function s : R → [0, 1]. Under the regression model, every member of the class has constant 1 − α coverage for all possible values of β, σ 2 and full-rank design matrices X. We have described a method of adaptively selecting the spending function so that the across-parameter average interval width is reduced, and the 1 − α coverage rate is maintained for each regression coefficient. The coverage guarantee is non-asymptotic, does not rely on β being sparse and does not rely on conditions on the design matrix. However, under some assumptions on the distribution of the elements of β and the design matrix, the adaptive technique we propose is asymptotically optimal as both n and p increase.
The spending function s(β) that we adaptively estimate from the data is based on a normal prior distribution for the elements of β. As such, we expect our procedure to provide the most improvement when the empirical distribution of β_1, …, β_p is approximately normal. If instead we suspect that β is sparse, it may seem preferable to base the adaptation on other families of prior distributions, such as Laplace or "spike and slab" distributions. Some numerical work not presented here suggests that FAB intervals obtained using the Laplace family of priors are in practice similar to those obtained with normal priors. FAB procedures based on spike-and-slab priors do seem more efficient, but they also present a problem: The spending function for a spike-and-slab prior is not generally nondecreasing, and so by Proposition 1 the corresponding confidence region may not be an interval. We suspect that non-interval confidence regions have limited appeal in practice, but even if they were of interest they present the numerical challenge of identifying multiple disconnected sets of parameter values to include in the region.
We conjecture that the LDP interval procedure proposed by Zhang and Zhang (2014) could be related to a FAB procedure based on a sparsity-inducing prior distribution, as both procedures should be asymptotically optimal under a sparse regime. The fact that a FAB procedure based on such a prior might produce non-interval confidence regions might partly explain why the LDP procedure fails when β is non-sparse: If a sparse 1 − α FAB procedure yields a non-interval region but only a sub-interval is numerically identified, then the coverage rate will be below 1 − α.
Adaptive FAB intervals for linear regression coefficients may be computed using the R-package

Proof of Lemma 1
For notational convenience, in this proof we write β_j and w_j as β and w, and write the α-quantile of the t_q distribution as t(α), suppressing the index that denotes the degrees of freedom. We begin the proof of Lemma 1 with another lemma: Lemma 3. The width |C_s̃| of C_s̃ satisfies |C_s̃| < |β̂| + wσ̂(|t(α/2)| + |t(1 − α/2)|).
Since ε is arbitrary, we conclude that the first term in (12) converges to zero in probability. Now we show that the expected width converges to the oracle width by integrating over β̂. This is done by first showing that |C_s̃| is uniformly integrable and then applying Vitali's theorem. By the previous lemma we know that |C_s̃| < |β̂_n| + wσ̂(|t(α/2)| + |t(1 − α/2)|).
We prove consistency of θ̂ in three steps: First, we show that Q_n(θ) − E[Q_n(θ)] converges uniformly to zero as n → ∞. Second, we show that this implies that, as a function of θ ∈ Θ, Q_n converges uniformly to a function Q_0. Third, we show that Q_0 is uniquely minimized at θ_0.
Consistency ofθ follows from these latter two results (see, for example, Theorem 2.1 of Newey and McFadden (1994)).
At the critical point (τ²_0, σ²_0), this simplifies to the expectation of the matrix in the integrand with respect to the probability measure whose density with respect to F_0 is proportional to (λτ² + σ²)^{−2}.
Again, if F 0 is not degenerate then the expectation of this matrix, and hence the Hessian of Q 0 , is strictly positive definite. The critical point is a local minimum, and since it is the only critical point of the continuous function Q 0 , it is the unique minimizer. This completes the proof of Lemma 4.
To see how this applies to the properties of the empirical Bayes estimates (τ̃², σ̃²) of (τ², σ²) based on the marginal model (9), let U be the (p − 1) × (p − 1) matrix of left singular vectors of X_2, and let nΛ² be the diagonal matrix of its squared singular values. Then U'z_2/√n ∼ N_{p−1}(0, Λ²τ² + Iσ²_∞), and so the properties of the MLE of (τ², σ²_∞) based on z_2 will be the same as those of (τ̃², σ̃²) in Lemma 4, provided Λ² satisfies the assumption of that lemma. To see that it does, recall that the assumption of Lemma 2 was that the empirical distribution of the eigenvalues of X'X/n is uniformly bounded and converges weakly to a non-degenerate distribution with finite support. For a given n, let γ_1 ≤ … ≤ γ_p be the eigenvalues of X'X/n. Since X_2X_2'/n is a compression of X'X/n, by the Cauchy interlacing theorem we have γ_1 ≤ λ_1 ≤ γ_2 ≤ … ≤ γ_{p−1} ≤ λ_{p−1} ≤ γ_p. Therefore, if the values of {γ_1, …, γ_p} are bounded uniformly in p and have an empirical distribution that converges to a non-degenerate limit, then the same properties hold for the values of {λ_1, …, λ_{p−1}}.
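The interlacing step can be checked numerically: compressing a symmetric p × p matrix to a (p − 1)-dimensional subspace produces eigenvalues that interlace the originals. A numpy sketch with an arbitrary random compression (not the specific G_2 construction above):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 80, 12
X = rng.standard_normal((n, p))
S = X.T @ X / n                          # p x p, eigenvalues gamma_1 <= ... <= gamma_p

# compress S to a (p - 1)-dimensional subspace, as X2 X2'/n is a compression of X'X/n
Q = np.linalg.qr(rng.standard_normal((p, p - 1)))[0]   # orthonormal columns
S_c = Q.T @ S @ Q                        # (p - 1) x (p - 1) compression

gamma = np.linalg.eigvalsh(S)            # ascending
lam = np.linalg.eigvalsh(S_c)            # ascending; interlaces gamma
```

By the Cauchy interlacing theorem, γ_k ≤ λ_k ≤ γ_{k+1} for k = 1, …, p − 1, so the compressed spectrum inherits both the uniform bound and (in the limit) the non-degeneracy of the original eigenvalue distribution.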