Exact prior-free probabilistic inference on the heritability coefficient in a linear mixed model

Linear mixed-effect models with two variance components are often used when variability comes from two sources. In genetics applications, variation in observed traits can be attributed to biological and environmental effects, and the heritability coefficient is a fundamental quantity that measures the proportion of total variability due to the biological effect. We propose a new inferential model approach which yields exact prior-free probabilistic inference on the heritability coefficient. In particular we construct exact confidence intervals and demonstrate numerically our method's efficiency compared to that of existing methods.


Introduction
Normal linear mixed effects models are useful in a variety of biological, physical, and social scientific applications with variability coming from multiple sources; see Khuri and Sahai (1985) and Searle et al. (1992). In this paper, we focus on the case of two variance components, and the general model can be written as where Y is a n-vector of response variables, X and Z are design matrices for the fixed and random effects, of dimension n × p and n × a, respectively, β is a p-vector of unknown parameters, α is a normal random a-vector with mean 0 and covariance matrix σ 2 α A, and ε is a normal random n-vector with mean 0 and covariance matrix σ 2 ε I n . Here, σ 2 = (σ 2 α , σ 2 ε ) is the pair of variance components. Assume A is known and X is of rank p < n. The unknown parameters in this model are the p fixed-effect coefficients β and the two variance components σ 2 , so the parameter space is (p + 2)-dimensional.
In biological applications, the quantities α and ε in (1) denote the genetic and environmental effects, respectively. Given that "a central question in biology is whether observed 2 Preliminaries

Overview of inferential models
The goal of the IM framework is to get valid prior-free probabilistic inference. This is facilitated by first associating the observable data and unknown parameters to a set of unobservable auxiliary variables. For example, the marginal distribution of Y from (1) is which can be written in association form: That is, observable data Y and unknown parameters (β, σ 2 ) are associated with unobservable auxiliary variables U , in this case, a n-vector of independent standard normal random variables. Given this association, the basic IM approach is to introduce a predictive random set for U and combine with the observed value of Y to get a plausibility function. Roughly, this plausibility function assigns, to assertions about the parameter of interest, values in [0, 1] measuring the plausibility that the assertion is true. This plausibility function can be used to design exact frequentist tests or confidence regions. Martin and Liu (2013) give a general description of this three-step IM construction and its properties; in Section 3.1 below we will flesh out these three steps, including choice of a predictive random set, for the variance components problem at hand. The construction of exact plausibility intervals for the heritability coefficient is presented in Section 3.3, along with a theoretical justification of the claimed exactness.
Notice, in the association (3), that the auxiliary variable U is, in general, of higher dimension than that of the parameter (β, σ 2 ). There are two reasons for trying to reduce the auxiliary variable dimension. First, it is much easier to specify good predictive random sets when the auxiliary variable is of low dimension. Second, inference is more efficient when only a relatively small number of auxiliary variables require prediction. This dimension reduction is via a conditioning operation, and Martin and Liu (2014a) give a general explanation of the gains as well as a novel technique for carrying out this reduction. We employ these techniques in Section 3.1 to get a dimension-reduced association for the heritability coefficient in the linear mixed model.
Another important challenge is when nuisance parameters are present. Our interest is in the heritability coefficient, a function of the full parameter (β, σ 2 ), so we need to marginalize over the nuisance parameters. The next section marginalizes over β using standard techniques; further marginalization will be carried out in Section 3.1.

Marginalizing out the fixed effect
Start with the linear mixed model (1). Following the setup in E et al. (2008), let K be a n × (n − p) matrix such that KK = I n − X(X X) −1 X and K K = I n−p . Next, let B = (X X) −1 X . Then y → (K y, By) is a one-to-one mapping. Moreover, the distribution of K Y depends on (σ 2 α , σ 2 ε ) only, and the distribution of BY depends on (β, σ 2 ), with β a location parameter. In particular, from (2), we get where G = K ZAZ K and C σ is a p×p covariance matrix of a known form that depends on σ 2 = (σ 2 α , σ 2 ε ); its precise form is not important. From this point, the general theory in Martin and Liu (2014b) allows us to marginalize over β by simply deleting the BY component. Therefore, a marginal association for (σ 2 α , σ 2 ε ) is This marginalization reduces the auxiliary variable dimension from n to n − p.
In the marginal association for (σ 2 α , σ 2 ε ) above, there are n − p auxiliary variables but only two parameters. Classical results on sufficient statistics in mixed effects model that will facilitate further dimension reduction. For the matrix G defined above, let λ 1 > · · · > λ L ≥ 0 denote the (distinct) ordered eigenvalues with multiplicities r 1 , . . . , r L , respectively. Let P = [P 1 , . . . , P L ] be a (n − p) × (n − p) orthogonal matrix such that P GP is diagonal with eigenvalues {λ : = 1, . . . , L}, in their multiplicities, on the diagonal. For P , a (n − p) × r matrix, define Olsen et al. (1976) showed that (S 1 , . . . , S L ) is a minimal sufficient statistic for (σ 2 α , σ 2 ε ). Moreover, the distribution of (S 1 , . . . , S L ) is determined by This reduces the auxiliary variable dimension from n − p to L. We take (4) as our "baseline association." Even in this reduced baseline association, there are L auxiliary variables but only two parameters, which means there is room for even further dimension reduction. The next section shows how to reduce to a scalar auxiliary variable when the parameter of interest is the scalar heritability coefficient.
3 Inferential model for heritability

Association
For the moment, it will be convenient to work with the variance ratio, ψ = σ 2 α /σ 2 ε . Since ψ = ρ/(1 − ρ) is a one-to-one function of ρ, the two parametrizations are equivalent. Rewrite the baseline association (4) as If we make the following transformations, then the association (5) becomes Since for every (X, U, ψ), there exists a σ 2 ε that solves the right-most equation, it follows from the general theory in Martin and Liu (2014b) that a marginal association for ψ is obtained by deleting the component above involving σ 2 ε . In particular, a marginal association for ψ is If we write then we get a marginal association for ρ of the form Marginalization reduces the auxiliary variable dimension by 1. Further dimension reduction will be considered next. Note that the new auxiliary variable U is a multivariate F-distributed random vector (e.g., Amos and Bulgren 1972).
Here we construct a local conditional IM for ρ as described in Martin and Liu (2014a). Select a fixed value ρ 0 ; more on this choice in Section 3.3. To reduce the dimension of the auxiliary variable U , in (6), from L−1 to 1, we construct two pairs of functions-one pair, (T, H 0 ), on the sample space, and one pair, (τ, η 0 ), on the auxiliary variable space. We insist that x → (T (x), H 0 (x)) and u → (τ (u), η 0 (u)) are both one-to-one, and H 0 = H ρ 0 and η 0 = η ρ 0 are allowed to depend on the selected ρ 0 . The goal is to find η 0 which is, in a certain sense, not sensitive to changes in the auxiliary variable.
Write the association (6) so that u is a function of x and ρ, i.e., We want to choose the function η 0 such that the partial derivative of η 0 (u(x, ρ)) with respect to ρ vanishes at ρ = ρ 0 . By the chain rule, we have ∂u(x, ρ) ∂ρ , so our goal is to find η 0 to solve the following partial differential equation: Toward solving the partial differential equation, first we get that the partial derivative of u (x, ρ) with respect to ρ satisfies This simplifies the relevant partial differential equation to the following: where diag(a) is a diagonal matrix constructed from a vector a. The method of characteristics (e.g., Polyanin et al. 2002) for solving partial differential equations identifies a logarithmic function of the form For example, since the matrix that projects to the orthogonal complement to the column space of g(ρ 0 ) has rank L − 2, we can take M 0 to be a matrix whose L − 2 rows form a basis for that space. For M 0 defined in this way, it is easy to check that η 0 in (8) is indeed a solution to the partial differential equation (7). We have completed specification of the η 0 function; it remains to specify τ and (T, H 0 ). The easiest to specify next is H 0 (x), the value of η(u(x, ρ 0 )), as a function of x: As we describe below, the goal is to condition on the observed value of H 0 (X). Next, we define T and τ to supplement H 0 and η 0 , respectively. In particular, take a (L − 1)-vector w(ρ) which is not orthogonal to g(ρ) at ρ = ρ 0 . It is easy to check that the entries in g(ρ) are strictly positive for all ρ. Therefore, we can take w(ρ) independent of ρ, e.g., w(ρ) ≡ 1 L−1 , is not orthogonal to g(ρ). Now set This is a log-linear transformation, and the linear part is non-singular, so this is a oneto-one mapping. Finally, we take T as Since (T (x), H 0 (x)) is log-linear, just like (τ (u), η 0 (u)), it is also one-to-one. We can now write the conditional association for ρ. For the given ρ 0 , the mapping x → (T (x), H 0 (x)) describes a split of our previous association (6) into two pieces: The first piece carries direct information about ρ. The second piece plays a conditioning role, correcting for the fact that some information was lost in reducing the (L − 1)dimensional X to a one-dimensional T (X). To complete the specification of the conditional association, write ϕ(ρ) where P V |h 0 ,ρ 0 is the conditional distribution of τ (U ), given that η 0 (U ) equals to the observed value h 0 of H 0 (X). To summarize, (10) completes the association step that describes the connection between observable data, unknown parameter of interest, and unobservable auxiliary variables. Of particular interest is that this association involves only a one-dimensional auxiliary variable compared to the association (5) obtained from the minimal sufficient statistics that involves an L-dimensional auxiliary variable. This dimension reduction will come in handy for the choice of predictive random set in the following section. The price we paid for this dimension reduction was the choice of a particular localization point ρ 0 . In Section 3.3 we employ a trick to side-step this issue when the goal is, as in this paper, to construct confidence intervals.

Predictive random sets
Having reduced the auxiliary variable to a scalar in (10), the choice of an efficient predictive random set is now relatively simple. Though there is an available theory of optimal predictive random sets (Martin and Liu 2013), here we opt for simplicity; in particular, we propose a default predictive random set that is theoretically sound and computational and intuitively simple. Consider the following version of what Martin and Liu (2013) call the "default" predictive random set: This S, with distribution P S|h 0 ,ρ 0 , is a random interval, centered at the mean µ 0 of the conditional distribution P V |h 0 ,ρ 0 . One key feature of this predictive random set is that it is nested, i.e., for any two distinct realizations of S, one is a subset of the other. The second key feature is a calibration of the predictive random set distribution with the underlying distribution of the auxiliary variable. Following Martin (2014), define the contour function γ S (v) = P S|h 0 ,ρ 0 (S v), which represents the probability that the predictive random set contains the value v of the auxiliary variable. We shall require that For the default predictive random set in (11), it is easy to check that where F h 0 ,ρ 0 is the distribution function of |V −µ 0 | for V ∼ P V |h 0 ,ρ 0 . From the construction above, it is clear that it is a continuous distribution. Then, |V −µ 0 | is a continuous random variable, so γ S (V ) is uniformly distributed on (0, 1). Therefore, (12) holds for the default predictive random set S. Results on optimal predictive random sets are available (Martin and Liu 2013), but here, again, our focus is on simplicity. See Section 5.

Plausibility intervals
Here we combine the association in Section 3.1 with the predictive random set described above to produce a plausibility function for inference about ρ. In general, a plausibility function is a data-dependent mapping that assigns, to each assertion about the parameter, a value in [0, 1], with the interpretation that small values suggest the assertion is not likely to be true, given the data; see Martin and Liu (2013). For simplicity, we focus only on the collection of singleton assertions, i.e., {ρ = r} for r ∈ [0, 1]. These are also the relevant assertions for constructing interval estimates based on the IM output. Let X = x be the observations in (6). The association step in Section 3.1 yields a data-dependent collection of sets indexed by the auxiliary variable. In particular, write R x (v) = {ρ : T (x) = ϕ(ρ) + v}, a set-valued function of v. These sets are combined with the predictive random set in Section 3.2 to get an enlarged x-dependent random set: Now, for a given assertion {ρ = r}, we compute the plausibility function, the probability that the random set R x (S) contains the asserted value r of ρ. A simple calculation shows that, in this case with singleton assertions, we have where F h 0 ,ρ 0 is defined in Section 3.2. The above display shows that the plausibility function can be expressed directly in terms of the distribution of the predictive random set, without needing to go through the construction of R x (S) as in (13). We pause here to answer a question that was left open from Section 3.1, namely, how to choose the localization point ρ 0 . Following Martin and Liu (2014a), we propose here to choose ρ 0 to match the value of ρ specified by the singleton assertion. That is, we propose to let the localization point depend on the assertion. All the elements in the plausibility function above with a 0 subscript, denoting dependence on ρ 0 , are changed in an obvious way to get a new plausibility function We treat this as a function of ρ to be used for inference. In particular, we can construct a 100(1 − α)% plausibility interval for ρ as follows: The plausibility function, and the corresponding plausibility region, are easy to compute, as we describe in Section 4.1. Moreover, the calibration (12) of the predictive random set leads to exact plausibility function-based confidence intervals, as we now show. We need some notation for the sampling distribution of X, given all the relevant parameters. Recall that the distribution of X actually depends on (σ 2 α , σ 2 ε ) or, equivalently, (ρ, σ 2 ε ). The error variance σ 2 ε is a nuisance parameter, but σ 2 ε still appears in the sampling model for X. We write this sampling distribution as P X|ρ,σ 2 ε . Theorem 1. Take the association (10) and the default predictive random set S in (11). Then for any ρ, any value h ρ of H ρ , and any σ 2 ε , the plausibility function satisfies P X|ρ,σ 2 ε pl X|hρ,ρ (ρ) ≤ α | H ρ (X) = h ρ = α, ∀ α ∈ (0, 1).
Proof. For given (σ 2 ε , ρ), if h ρ is the value of H ρ (X), then it follows from the conditional distribution construction that the plausibility function in (14), as a function of T (X) with X ∼ P X|ρ,σ 2 ε , is Unif(0, 1). Then the equality in (16) follows immediately. Averaging the left-hand side of (16) over h ρ , with respect to the distribution of H ρ (X), and using iterated expectation gives the following unconditional version of Theorem 1.
Since we have proper calibration of the plausibility function, both conditionally and unconditionally, coverage probability results for the plausibility interval (15) are also available. This justifies our choice to call Π α (x) a 100(1 − α)% plausibility interval, i.e., the frequentist coverage probability of Π α is exactly 1 − α.

Implementation
Evaluation of the plausibility function in (14) requires the distribution function F hρ,ρ of |V − µ ρ | corresponding to the conditional distribution P V |hρ,ρ of V = τ (U ), given η ρ (U ) = H ρ (X). This conditional distribution is not of a convenient form, so numerical methods are needed. For ρ fixed, since the transformation (9) from U to (τ (U ), η ρ (U )) is of a loglinear form, and the density function of U can be written in closed-form, we can evaluate the joint density for (τ (U ), η ρ (U )) and, hence, the conditional density of V = τ (U ). Numerical integration is used to evaluate the normalizing constant, the mean µ ρ , and the distribution function F hρ,ρ . R code is available at www.math.uic.edu/~rgmartin.

Simulation results
In this section, we consider a standard one-way random effects model, i.e., where α 1 , . . . , α a are independent with common distribution N(0, σ 2 α ), and the ε ij s are independent with common distribution N(0, σ 2 ε ); the α i s and ε ij s are also mutually independent. Our goal is to compare the proposed IM-based plausibility intervals for ρ with the confidence intervals based on several competing methods. Of course, the properties of the various intervals depend on the design, in this case, the within-group sample sizes n 1 , . . . , n a , and the values of (σ 2 α , σ 2 ε ). Our focus here is on cases with small sample sizes, namely, where the total sample size n = n 1 + · · · + n a is fixed at 15. The three design patterns (n 1 , . . . , n a ) considered are: (1, 1, 1, 1, 1, 10), (2,4,4,5), and (2, 3, 10). The nine (σ 2 α , σ 2 ε ) pairs considered are: (0.05, 10), (0.1, 10), (0.5, 10), (1, 10), (0.5, 2), (1, 1), (2, 0.5), (5, 0.2), and (10, 0.1). Without loss of generality, we set µ = 0.  For each design pattern and pair of (σ 2 α , σ 2 ε ), 1000 independent data sets were generated and 95% two-sided interval estimates for ρ were computed based on the exact method of Burch and Iyer (1997), the fiducial method of E et al. (2008), and the proposed IM method. Empirical coverage probabilities and average length of the confidence interval under each setting were compared to investigate the performance of each method. Besides these three methods, we also implemented Bayesian and profile likelihood approaches. The three aforementioned methods all gave intervals with better coverage and length properties than the Bayesian method, and the profile likelihood method was very unstable with small sample sizes, often having very high coverage with very wide intervals or very low coverage with very narrow intervals. So, these results are not reported.
A summary of the simulation results is displayed in Figure 1. Panel (a) displays the coverage probabilities, and Panel (b) displays the relative length difference, which is defined as the length of the particular interval minus the length of the IM interval, scaled by the length of the IM interval. As we expect from Corollary 2, the IM plausibility intervals have coverage at the nominal 95% level. We also see that the IM intervals tend to be shorter than the fiducial and Burch-Iyer confidence intervals. The fiducial intervals have coverage probability exceeding the nominal 95% level, but this comes at the expense of longer intervals on average. Overall, the proposed IM-based method performs quite well compared to these existing methods. We also replicated the simulation study in E et al. (2008), which involves larger sample sizes and a broader range of imbalance, and the relative comparisons between these three methods are the same as here.

Real-data analysis
Example 1. Equal numbers of subjects are tested under each standard and test preparations and a blank dose under a (2K + 1)-point symmetrical slope-ratio assay. The response, on logarithmic scale, is assumed to depend linearly on the dose level. A modified balanced incomplete block design with 2K + 1 (K < K) block size is introduced by Das and Kulkarni (1966). The ith dose levels for standard and test preparations are represented by s i and t i , i = 1, . . . , K. Under this design, the dose will be equally spaced and listed in ascending order. A balanced incomplete block design with K doses of the standard preparation inside K blocks is constructed and used as the basic design. Then a modified design is constructed by adding a blank dose and K doses of the test preparation into every block, under the rule that dose t i should accompany s i in every blocks. The model developed by Das and Kulkarni can be written as where y sjm , y tjm , y cjm represent observation response in mth block for jth dose of standard preparation, test preparation and blank dose; x sj and x tj represent jth dose level for standard and test preparation; x cj is zero by default, α m denotes mth block effect; and ε ijm denotes independent random errors with common distribution N(0, σ 2 ε ). We consider random block effects and assume that α m are independent with common distribution N(0, σ 2 α ). Independence of α m and ε ijm is also assumed. We analyze data coming from a nine-point slope-ratio assay on riboflavin content of yeast, with two replications in each dose; see Table 2 in E et al. (2008) for the design and data. For this design, we have L = 3 distinct eigenvalues, namely, 4.55, 1, and 0, with multiplicities 1, 1, and 10, respectively. A plot of the plausibility function for ρ is shown in Figure 2(a). The function exceeds 0.2 at ρ = 0, which implies that 90% and 95% plausibility intervals include zero. The left panel of Table 1 shows the 90% and 95% interval estimates for ρ based on the Burch-Iyer, fiducial, and IM methods. In this case, the IM intervals are provably exact and shorter.
Example 2. Harville and Fenech (1985) analyzed data on birth weights of lambs. These data consist of the weights information at the birth of 62 single-birth male lambs, and were collected from three selection lines and two control lines. Each lamb was the offspring of one of the 23 rams and each lamb had a distinct dam. Age of the dam was also recorded and separated into three categories, numbered 1 (1-2 years), 2 (2-3 years), and 3 (over 3 years). A linear mixed model for these data is where y ijkl represents the weight of the lth offspring of the kth sire in the jth population lines and of a dam in the ith age category; β i represents ith level age effect; π j represents the jth line effects; α jk denotes random sire effects and are assumed to be independently distributed as N(0, σ 2 α ); and random errors denoted by ε ijkl is supposed to be independently distributed as N(0, σ 2 ε ). Furthermore, the α jk s and ε ijkl s are assumed to be independent. In this case, L = 18, λ 1 = 5.09, λ L = 0 and r L = 37; all non-zero eigenvalues have multiplicity 1 except λ 8 = 2 with multiplicity 2. A plot of the plausibility function for ρ is shown in Figure 2(b). As in the previous example, the plausibility function is positive at ρ = 0, which means that plausibility intervals with any reasonable level will contain ρ = 0. We also used each of the three methods considered above to compute 90% and 95% interval estimates for ρ. The results are shown in the right panel of Table 1. In this case, IM gives a shorter interval compared to Burch-Iyer. The fiducial interval, however, is shorter than both exact intervals. We expect the IM interval to be most efficient, so we explore the relative performance a bit further by simulating 1000 independent data sets from the fitted model in this case, i.e., withσ 2 α = 0.767 andσ 2 ε = 2.763 as the true values. In these simulations, the fiducial and IM coverage probabilities were 0.944 and 0.954, respectively, both within an acceptable range of the nominal level, but the average lengths of the intervals are 0.488 and 0.456. That is, the IM intervals tend to be shorter than the fiducial intervals in problems similar to this example, as we would expect.

Concluding remarks
The IM method proposed here gives exact confidence intervals for the heritability coefficient ρ, as well as the variance ratio ψ, and numerical results suggest increased efficiency compared to existing methods. A question is if these same techniques can be employed for exact prior-free probabilistic inference on other quantities related to the variance components (σ 2 α , σ 2 ε ). It is well-known that, for the unbalanced design case, exact marginalization is challenging. In the IM context, this means that the association is not "regular" in the sense of Martin and Liu (2014b). Therefore, some more sophisticated tools are needed for exact inference on, say, the individual variance components σ 2 α and σ 2 ε . This application provides clear motivation for our ongoing investigations into more general IM-based marginalization strategies.
Another important question is if the techniques presented herein can be applied in more complex and high-dimensional mixed-effect models. In genome-wide association studies, for example, the dimensions of the problem are extremely large. We expect that, conceptually, the techniques described here will carry over to the more complex scenario. However, there will be computational challenges to overcome, as with all approaches (Kang et al. 2010;Zhou and Stephens 2014). This, along with the incorporation of optimal predictive random sets (e.g. Martin and Liu 2013) is a focus of ongoing research.