Heritability estimation in case-control studies

Abstract: In the field of genetics, the concept of heritability refers to the proportion of variation of a biological trait or disease that can be explained by genetic factors. Quantifying the heritability of a disease is a fundamental challenge in human genetics, especially when the causes are plural and not clearly identified. Although the literature on heritability estimation for binary traits is less rich than for quantitative traits, several methods have been proposed to estimate the heritability of complex diseases. However, to the best of our knowledge, the existing methods are not supported by theoretical grounds. Moreover, most of the methodologies do not take into account a major specificity of data coming from medical studies, namely the oversampling of patients compared to controls. We propose in this paper to investigate the theoretical properties of the Phenotype Correlation Genotype Correlation (PCGC) regression developed by Golan, Lander and Rosset (2014), which is one of the major techniques used in statistical genetics and which is very efficient in practice, despite the oversampling of patients. Our main result is a proof of the consistency of this estimator, under several assumptions that we state and discuss. We also provide a numerical study comparing two approximations leading to two heritability estimators.


Introduction
In the field of genetics, the concept of heritability refers to the proportion of variation of a biological trait or disease that can be explained by genetic factors. Quantifying the heritability is a major challenge for diseases that are suspected to have a strong genetic component but whose causes are often vague and multiple. Indeed, establishing a high value of heritability is a powerful argument in favor of further research into genetic causes, and it also opens the possibility of predicting a risk of illness based on the genetic background.
There exist several methods to estimate the heritability of quantitative traits, which we will describe hereafter, with interesting theoretical and practical properties. Regarding binary traits, such as the presence or absence of a disease, a few methodologies have been proposed, but as far as we know, none of them has been validated theoretically. Golan, Lander and Rosset (2014) developed a method, called phenotype correlation genotype correlation (PCGC) regression, that they compared to recent methodologies and which was shown to be very efficient in practice. The aim of this paper is to investigate the theoretical properties of the PCGC method.
Let us first recall the main existing methods for estimating the heritability of quantitative traits, which are strongly linked to the methods used for binary traits. Linear Mixed Models (LMMs) have been widely used for estimating the heritability of quantitative traits. For instance, Yang et al. (2010) proposed to estimate the heritability of human height by using a classical LMM defined by

Y = Xβ + Zu + e,    (1.1)

where Y = (Y_1, …, Y_m) is the vector of observations of a phenotype of interest, X is a m × p matrix of predictors (or fixed effects), β is a p × 1 vector containing the unknown linear effects of the predictors, and u and e correspond respectively to the genetic and the environmental random effects. We assume that u and e are Gaussian random effects with variances σ_u² Id_{R^N} and σ_e² Id_{R^m} respectively. Moreover, Z is a m × N matrix which contains the genetic information. They proposed to estimate the parameter

η = Nσ_u² / (Nσ_u² + σ_e²),    (1.2)

commonly considered as the mathematical definition of heritability since it determines how the variance is shared between u and e. Several methods were developed to estimate the parameter η; see Patterson and Thompson (1971), Searle, Casella and McCulloch (1992), Yang et al. (2011), Pirinen, Donnelly and Spencer (2013), Zhou and Stephens (2012).
From a theoretical point of view, Bonnet, Gassiat and Levy-Leduc (2015) showed the asymptotic normality of the maximum likelihood estimator of η as well as a central limit theorem leading to confidence intervals for η .
The previous model and the corresponding methods obviously do not apply when considering non-continuous traits. However, the quantitative and the binary cases can be related by assuming the existence of an underlying Gaussian variable linked to the binary phenotype. More precisely, it consists in assuming that the observations Y_1, …, Y_n are distributed according to the following Generalized Linear Mixed Model (GLMM):

Y_i ∼ B(p_i),    (1.3)

with p_i = g^{-1}(G_i), where g is a link function and G_i a Gaussian variable. A particular case, which is very often used to define the heritability of binary traits, is when g is a probit link function. This model was proposed by Falconer (1965), who assumed that the binary observations could be seen as indicator functions of a Gaussian variable exceeding a given threshold t:

Y_i = 1_{l_i > t},    (1.4)

with the l_i's defined by

l = Zu + e,    (1.5)

where l = (l_1, …, l_n), u ∼ N(0, σ_u² Id_{R^N}) and e ∼ N(0, σ_e² Id_{R^n}), as in the classical LMM defined in Equation (1.1). The heritability is then intuitively defined "at the liability scale", which means that it is the heritability of the unobserved continuous trait l, and it is given by the same expression (1.2) as for quantitative traits.
Observe that the threshold t is directly linked to the prevalence of the disease in the population, that is the proportion K of the population which is affected by the disease. Indeed, when the liability is standard Gaussian,

K = P(l_i > t) = 1 − Φ(t), that is, t = Φ^{-1}(1 − K),    (1.6)

where Φ denotes the standard Gaussian cumulative distribution function. The unobserved Gaussian variable l = (l_1, …, l_n) is called the liability in this model, which is usually called the "liability threshold model" (Falconer (1965), Lee et al. (2011), Tenesa and Haley (2013)) and has been shown to be a reasonable model for complex diseases, for instance by Purcell et al. (2009). Several methods were established to estimate heritability in Model (1.3): among them we can quote the MCMC method of Hadfield (2010) and the penalized quasi-likelihood approach of Breslow and Clayton (1993). The theoretical properties of these estimators have not been demonstrated, and their numerical performances can be found in the comparative study of de Villemereuil, Gimenez and Doligez (2013). Lee et al. (2011) proposed to use a maximum likelihood approach as if the binary traits were Gaussian, and then to apply a multiplicative factor to correct this approximation. Golan, Lander and Rosset (2014) showed that this heritability estimator was strongly biased in several realistic scenarios; in particular, it is very sensitive to the prevalence of the disease (the rarer the disease, the larger the bias). The estimator also underestimates the heritability when the true heritability is high.
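As a concrete illustration (not the paper's code), the relation (1.6) between the threshold t and the prevalence K can be inverted numerically; the sketch below assumes a standard Gaussian liability and uses bisection with `math.erf`, with hypothetical function names:

```python
import math

# Standard Gaussian CDF Phi, written with math.erf.
def gauss_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def threshold_from_prevalence(K, lo=-10.0, hi=10.0, tol=1e-12):
    """Solve 1 - Phi(t) = K for t by bisection (1 - Phi is decreasing)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if 1.0 - gauss_cdf(mid) > K:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

t = threshold_from_prevalence(0.01)  # a rare disease, K = 1%
```

For K = 0.01 this recovers the familiar threshold t = Φ^{-1}(0.99) ≈ 2.33.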
However, all the aforementioned methods raise two main concerns: first, they have no theoretical validation. Second, they do not take into account an essential feature of case-control studies: in a medical study, the number of patients is similar to the number of controls even though the studied disease may be rare, which means that the proportion of cases in the study does not reflect the proportion of cases in the population. This oversampling of the cases, which had been noticed for instance by Lee et al. (2011) but had never been properly addressed, is handled by the PCGC approach of Golan, Lander and Rosset (2014), who proposed a moment-based method to estimate the heritability. The basis of their methodology was to compute an approximation of the expectation E of W_iW_j for two individuals i and j, where W_i is a centered and normalized version of the binary data Y_i, conditionally on the fact that individuals i and j are in the study. This approach is further described in Section 3.1.
Since the PCGC method presented very good numerical results but was not supported by theoretical grounds, we propose in this paper to investigate the theoretical properties of this method. Our main result shows that the least squares estimator obtained with the first order approximation of E provides a consistent estimator of η. We also propose a simulation study to compare the numerical performances of the estimators obtained with first and second order approximations of E. We show in particular that the computational times associated with the second order estimator are substantially larger, with no obvious improvement from the statistical point of view.
The model we study and the main definitions are given in Section 2. Section 3 contains the first order approximation of the expectation E with the corresponding estimator of η and Section 4 presents our consistency result for this estimator. The second order approximation of E is given in Section 5 and the numerical comparison of the two estimators can be found in Section 6. In Section 7, we discuss the results and potential perspectives. Finally, the proofs are given in Section 8.

Liability model
Let us denote by K the prevalence of a disease in a population, that is the proportion of the population affected by the disease. Let Y_i be the random variable such that Y_i = 1 if individual i is affected (then, individual i is called a case) and Y_i = 0 if individual i is unaffected (then individual i is called a control). We assume that the Y_i's are linked to unobserved variables l_i as follows:

Y_i = 1_{l_i > t},    (2.1)

where t is a given threshold, related to the prevalence K by (1.6), and the l_i's are defined as

l = Zu + e,    (2.2)

where l = (l_1, …, l_m), and u and e are random effects such that u ∼ N(0, σ_u² Id_{R^N}) and e ∼ N(0, σ_e² Id_{R^m}). The vector u corresponds to the genetic effects and e to the environmental effects. Moreover, Z is a m × N random matrix which contains the genetic information, and which is such that the Z_{i,k} are normalized random variables in the following sense: they are defined from a matrix A = (A_{i,k})_{1≤i≤m, 1≤k≤N} by

Z_{i,k} = (A_{i,k} − Ā_k) / ŝ_k,    (2.3)

where Ā_k and ŝ_k denote respectively the empirical mean and standard deviation of the kth column of A. In practice, the matrix A contains, for all the individuals in the study, the genetic information described by the Single Nucleotide Polymorphisms (SNPs).
More precisely, at each SNP, the genotype can be either qq, qQ or QQ, q being the less frequent (or minor) allele.
Then for each k, A_{i,k} = 0 (resp. 1, resp. 2) if the genotype of the ith individual at locus k is qq (resp. Qq, resp. QQ). In this paper, we consider a more general case with mild assumptions on the distribution of the random variables A_{i,k}, which are described in Section 4. However, note that the assumption of independence between the columns of A is quite strong, since in particular it does not take into account linkage disequilibrium, that is precisely the correlation between genetic variants. To the best of our knowledge, the other theoretical works regarding the estimation of heritability (Jiang et al. (2016) and Bonnet, Gassiat and Levy-Leduc (2015)) also neglect these correlations, even in the Gaussian scenario, which shows the difficulty of getting rid of this assumption.
With the definition (2.3), the columns of Z are empirically centered and normalized, and one can observe that

(1/m) Σ_{i=1}^m Z_{i,k} = 0 and (1/m) Σ_{i=1}^m Z_{i,k}² = 1 for all k.    (2.4)

The heritability at the liability scale, which is the parameter we want to estimate, is defined as the ratio of variances:

η = Nσ_u² / (Nσ_u² + σ_e²).    (2.5)

The variance of l conditionally on Z can then be rewritten with respect to η and σ² = Nσ_u² + σ_e² as:

Var(l | Z) = σ² ( η ZZᵀ/N + (1 − η) Id_{R^m} ).

We will assume in the sequel, without loss of generality, that σ² = 1. Indeed, if σ² ≠ 1, we can consider the variable l̃_i = l_i/σ and then, instead of estimating t from the prevalence K with the relationship (1.6), we estimate directly t/σ. Note that in Model (2.2), we consider a particular case of linear mixed model where there are no fixed effects. In the PCGC method, Golan, Lander and Rosset (2014) propose a solution to handle covariates which we did not study here, but it would be interesting to investigate the theoretical properties of such an approach as well. This point is further discussed in Section 7.
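A toy sketch (not the authors' code) of the empirical centering and normalization in (2.3), with illustrative sizes and a hypothetical Binomial(2, p_k) genotype model:

```python
import random

random.seed(0)
m, N = 500, 20                       # individuals x SNPs (toy sizes)
# Column k: genotypes 0/1/2 drawn as Binomial(2, p_k), p_k in [0.1, 0.5].
p = [random.uniform(0.1, 0.5) for _ in range(N)]
A = [[sum(random.random() < p[k] for _ in range(2)) for k in range(N)]
     for _ in range(m)]

# Z_{i,k} = (A_{i,k} - empirical mean of column k) / empirical sd of column k.
Z = [[0.0] * N for _ in range(m)]
for k in range(N):
    col = [A[i][k] for i in range(m)]
    mean_k = sum(col) / m
    sd_k = (sum((x - mean_k) ** 2 for x in col) / m) ** 0.5
    for i in range(m):
        Z[i][k] = (A[i][k] - mean_k) / sd_k
```

Each column of Z then has empirical mean 0 and empirical variance 1, as stated in the text.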

Case control study
Since the prevalence P in the study can be very different from the prevalence K in the general population (the cases are substantially oversampled in a case-control study), it is essential to take into account that the observations we have access to depend on the probabilities for both cases and controls to be selected in the study. We recall that m corresponds to the total size of the population and we define n as the number of individuals in the study. Each individual of the population is selected or not for the study with a probability depending on their status (case or control).
More precisely, if p_control denotes the probability for a control to be selected in the study, we can define the corresponding variable U_i ∼ B(p_control), which is equal to 1 if individual i is part of the study. Similarly, we define the probability p_case for a case to be selected for the study and the corresponding variable V_i ∼ B(p_case). We will increase the population size until we obtain n individuals in the study. Then, for any individual i, we define the variable ε_i by

ε_i = Y_i V_i + (1 − Y_i) U_i,

which is equal to 1 if individual i belongs to the study and 0 otherwise. We assume that the variables U_1, …, U_m, V_1, …, V_m are independent and independent of Y_1, …, Y_m and Z.
Since we do not observe Y_i for the whole population but only for the individuals who belong to the study, we will work with the variables W_i defined by

W_i = ε_i (Y_i − P) / √(P(1 − P)),

which are centered versions of the Y_i's in the study and are non-zero only if individual i belongs to the study. The probabilities p_case and p_control are chosen such that the prevalence in the study is equal to P. Indeed, if we assume that

p_case = 1,    (2.6)

it implies that

p_control = K(1 − P) / (P(1 − K)).    (2.7)

The proof of (2.7) is given in Appendix A.1. Equation (2.6) means that all cases are accepted in the study; this is usually called a "full ascertainment" assumption (see for instance Golan, Lander and Rosset (2014)).
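Under full ascertainment (2.6), the selection probability for controls admits a simple closed form: solving P = K / (K + (1 − K) p_control) for p_control gives K(1 − P)/(P(1 − K)). A minimal sketch (hypothetical helper names) checking that this choice yields a study prevalence of exactly P:

```python
# Full ascertainment: p_case = 1, every case enters the study.
def p_control(K, P):
    # Derived from P = K / (K + (1 - K) * p_control).
    return K * (1.0 - P) / (P * (1.0 - K))

def study_prevalence(K, pc):
    # Expected proportion of cases among selected individuals when p_case = 1.
    return K / (K + (1.0 - K) * pc)

pc = p_control(K=0.01, P=0.5)  # rare disease, balanced case-control study
```

With K = 0.01 and P = 0.5, roughly one control in a hundred is kept, matching the intuition that controls must be heavily subsampled to balance the study.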

Description of the PCGC regression
Golan, Lander and Rosset (2014) considered a version of Model (2.2), where the liability is given by

l = g + e,

where g is a genetic random effect, which can be correlated across individuals, and e is the environmental random effect, which is assumed to be independent of the genetic effect. Both effects are assumed to be Gaussian: e has a variance equal to (1 − η) Id_{R^n} and g has a covariance matrix V_g defined for all 1 ≤ i, j ≤ n as V_g(i, j) = η G_{i,j}, where G_{i,j} denotes the genetic correlation between individuals i and j (with G_{i,i} = 1). The covariance matrix of (l_i, l_j) is then given by

( 1   η G_{i,j} ; η G_{i,j}   1 ).

The heritability estimator of the PCGC regression is a least squares estimator obtained by minimizing

Σ_{i≠j} ( W_i W_j − E[W_i W_j | ε_i = ε_j = 1] )².    (3.1)

Since the expression of E[W_i W_j | ε_i = ε_j = 1] has no explicit formula, as we shall see hereafter, Golan, Lander and Rosset (2014) proposed to take advantage of the fact that the correlations G_{i,j} are typically small for i ≠ j. The basis of the method is to write E[W_i W_j | ε_i = ε_j = 1] in terms of P(Y_i = Y_j = 1), P(Y_i = Y_j = 0) and P(Y_i ≠ Y_j), and to approximate these probabilities thanks to Taylor expansions in G_{i,j} around 0. The computations leading to (3.2) can be found in Appendix A.2. This approximation, plugged into the least squares criterion (3.1), leads to the heritability estimator given in (3.3) and (3.4), φ being the density of the standard Gaussian distribution. The proof of (3.3) and (3.4) can be found in the supplementary material of Golan, Lander and Rosset (2014).

Our method
In the model defined by (2.1) and (2.2), the variance matrix Σ^(N) of (l_i, l_j) conditionally on Z can be written as

( η G_N(i,i) + 1 − η   η G_N(i,j) ; η G_N(i,j)   η G_N(j,j) + 1 − η ),

where for all 1 ≤ i, j ≤ n,

G_N(i, j) = (1/N) Σ_{k=1}^N Z_{i,k} Z_{j,k}.    (3.5)

Note that in the model we consider, G_N(i, j) is a random variable, which is not the case of the quantity G_{i,j} in the model studied by Golan, Lander and Rosset (2014). A key element is to notice that Σ^(N) is close to the identity matrix; more precisely,

Σ^(N) = Id + η ( A_N(i)   B_N(i,j) ; B_N(i,j)   A_N(j) ),    (3.6)

where A_N(i) = G_N(i,i) − 1, A_N(j) = G_N(j,j) − 1 and B_N(i,j) = G_N(i,j) are small when N is large. The proof of (3.6) can be found in Appendix A.3. Then, following the idea of Golan, Lander and Rosset (2014), we propose to approximate E[W_i W_j | Z, ε_i = ε_j = 1]. The detailed computations are given in Section 8.2. A first order approximation of E[W_i W_j | Z, ε_i = ε_j = 1], plugged into (3.1), leads to the same estimator η̂^(1) as the one obtained with the PCGC regression. Indeed, we obtain

η̂^(1) = Σ_{i≠j} W_i W_j G_N(i, j) / ( c Σ_{i≠j} G_N(i, j)² ),    (3.7)

where c = φ(t)² P(1 − P) / (K²(1 − K)²). In Section 5, we consider the second order approximation, which is different from the one devised by Golan, Lander and Rosset (2014).
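A hedged sketch of the first order estimator: the least squares slope of W_i W_j on G_N(i, j), divided by the constant c = φ(t)² P(1 − P)/(K²(1 − K)²) above. The inputs and the function name are purely illustrative, not the authors' implementation:

```python
import math

def eta_first_order(W, G, t, K, P):
    """W: list of W_i values; G: matrix of G_N(i, j); t, K, P as in the text."""
    phi_t = math.exp(-t * t / 2.0) / math.sqrt(2.0 * math.pi)  # Gaussian density
    c = phi_t ** 2 * P * (1 - P) / (K ** 2 * (1 - K) ** 2)
    n = len(W)
    num = sum(W[i] * W[j] * G[i][j]
              for i in range(n) for j in range(n) if i != j)
    den = sum(G[i][j] ** 2
              for i in range(n) for j in range(n) if i != j)
    return num / (c * den)
```

For a single pair (n = 2) the formula reduces to W_1W_2 / (c G_N(1,2)), so halving the off-diagonal correlation doubles the estimate, which is a convenient sanity check.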
Remark 1. Note that in practice, we cannot directly access G_N(i, j) defined in (3.5), since the matrix Z should be centered and normalized using the whole population. This is obviously a limitation, but we show with a numerical study that replacing Z by Z̃, which has been centered and normalized in the study, has a small influence on the heritability estimates. The results are displayed in Section 6.3.
Remark 2. The main difficulty in studying the theoretical properties of η̂^(1) comes from the remainder term neglected in the approximation, which depends on A_N(i), A_N(j) and B_N(i, j) defined in (3.6), and thus varies for each pair of individuals i and j.
Assumption 1. There exist d > 0, C > 0 and a neighborhood V 0 of 0 such that for all λ in V 0 for all i = j and for all k, where the A i,k 's are defined in (2.3) and σ 2 k is the variance of A i,k .
Theorem 1. Consider Model (2.1)-(2.2), with A satisfying Assumptions 1 and 2, and let η̂^(1) be the estimator of η defined in Equation (3.7). Then, as n, N → ∞ such that n/N → a ∈ (0, +∞), η̂^(1) converges in probability to η. Note that we focus on the case where both the number n of individuals and the number N of genetic variants tend to infinity, which is the framework chosen for instance by Jiang et al. (2016) and Bonnet, Gassiat and Levy-Leduc (2015). In practice, these values are obviously finite, N being upper bounded for instance by the length of the human genome.
The proof of Theorem 1 relies on the following lemmas.
Lemma 1. When n and N tend to infinity and n/N tends to a, (1/n) Σ_{i≠j} G_N(i, j)² converges in probability to a.
We will then have to focus on the decomposition (4.1). Let E_N be the following event, with γ a positive number such that γ < 1/10. The choice of this threshold is crucial: it has to be large enough so that, on the one hand, the probability of not being in the event E_N is very small (Lemma 2). On the other hand, it must be small enough for Lemmas 3 and 4 to hold, which ensure respectively that, on E_N, the first term of (4.1) converges to 0 and the second term converges to η (up to a constant).
Let us denote by E_N^c the complement of the event E_N. We consider the decomposition

η̂^(1) = η̂^(1) 1_{E_N} + η̂^(1) 1_{E_N^c}.

Using the result of Lemma 2, η̂^(1) 1_{E_N^c} converges in probability to 0.

Lemma 3. When n and N tend to infinity and n/N tends to a ∈ (0, +∞), the first term of (4.1) converges in probability to 0 on E_N.

Lemma 4. When n and N tend to infinity and n/N tends to a ∈ (0, +∞), the second term of (4.1) converges in probability to η (up to a constant).

The results of Lemmas 3 and 4 complete the proof of Theorem 1. Section 8.1 contains a short version of the proofs, while the detailed proofs of Lemmas 1, 2, 3 and 4 are given in Section 8.3.

Second order approximation of E[W_i W_j | Z, ε_i = ε_j = 1]
The purpose of this section is to study the behaviour of the heritability estimator obtained from a second order approximation of E[W_i W_j | Z, ε_i = ε_j = 1]. Instead of computing the approximation up to order 1/√N, we compute it up to order 1/N and we obtain the expression whose proof is detailed in Section 8.4. The minimizer in η of the resulting least squares criterion is the root of a third order polynomial and could be found thanks to a closed-form formula. Since the expressions are quite complex, we propose here, for the sake of simplicity, to use a Newton-Raphson approach to obtain the corresponding heritability estimator η̂^(2) of the second order approximation. Note that the second order approximation, which depends not only on B_N(i, j) but also on A_N(i) and A_N(j), is different from the one found by Golan, Lander and Rosset (2014).
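The Newton-Raphson step can be sketched generically as follows; the criterion Q below is a stand-in convex quartic (so its derivative is a third order polynomial, as in the text), not the actual second order least squares criterion:

```python
def newton_min(dQ, d2Q, eta0=0.5, n_iter=50):
    """Minimize a smooth criterion Q by Newton-Raphson on its derivative:
    iterate eta <- eta - Q'(eta) / Q''(eta)."""
    eta = eta0
    for _ in range(n_iter):
        eta -= dQ(eta) / d2Q(eta)
    return eta

# Stand-in criterion: Q(eta) = (eta - 0.3)^2 + 0.1 * (eta - 0.3)^4,
# minimized at eta = 0.3; its derivative is a cubic polynomial.
dQ = lambda e: 2.0 * (e - 0.3) + 0.4 * (e - 0.3) ** 3
d2Q = lambda e: 2.0 + 1.2 * (e - 0.3) ** 2
eta_hat2 = newton_min(dQ, d2Q)
```

In the actual procedure, dQ and d2Q would be the first and second derivatives of the second order least squares criterion with respect to η.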

Numerical study
In this section, we propose to study the numerical performance of the estimators η̂^(1) and η̂^(2) devised respectively in Sections 3 and 5. Since Golan, Lander and Rosset (2014) already compared the estimator η̂^(1) to the one proposed by Lee et al. (2011) and stated several arguments in favor of their estimator, we will focus on comparing our two estimators in terms of statistical and computational efficiency.

Simulation process
In this simulation study, we generated data sets with n = 200 and N = 10000, which is smaller than the size of typical data sets (n ≈ 5000, N ≈ 500000 for instance). The reason is purely computational, since we have to generate data for m ≈ n/K individuals in the population in order to have n individuals in the study. However, we chose the values of n and N such that the classical scenario N ≫ n holds. The value of the prevalence in the population varies from 0.005 to 0.1. The observations were generated as follows.
• We set the parameters η, K, P = 1/2 and the size of the general population, chosen very large. Notice that the size of the population can vary from one sample to another: in practice, we generate new individuals in the population until we have n individuals in the study.
• We generate the SNP matrix A, whose columns are independent binomial variables with parameters 2 and p_j, p_j being uniformly generated between 0.1 and 0.5 (it represents the frequency of the minor allele). The matrix Z is then obtained by centering and normalizing A in the whole population. Notice that we use Z to generate the data, but since we would not have access to the whole matrix in practice, we use for the estimations the matrix denoted Z̃, which is centered and normalized in the study. This point is further discussed in Section 6.3.
• We generate the Gaussian random effects u and e with respective variances σ_u² = η/N and σ_e² = 1 − η.
• We generate the liabilities, from which we generate binary observations in order to have a prevalence equal to K in the general population.
• For each individual, we determine whether they stay in the study: the cases are automatically selected (full ascertainment assumption) but each control is selected with probability p_control, computed in Equation (2.7).
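The steps above can be sketched end-to-end as follows (toy sizes, hypothetical variable names, and one deliberate simplification flagged in the comments: the genotypes are normalized with their theoretical Binomial(2, p_k) moments rather than empirically over the whole population):

```python
import math
import random

random.seed(1)
n, N = 50, 200                       # study size and number of SNPs (toy values)
eta, K, P = 0.5, 0.05, 0.5
t = 1.6449                           # approx. Phi^{-1}(1 - K) for K = 0.05
pc = K * (1 - P) / (P * (1 - K))     # p_control from Equation (2.7)

p = [random.uniform(0.1, 0.5) for _ in range(N)]                 # allele freqs
u = [random.gauss(0.0, math.sqrt(eta / N)) for _ in range(N)]    # genetic effects

study = []
while len(study) < n:                # keep drawing individuals until n selected
    a = [sum(random.random() < p[k] for _ in range(2)) for k in range(N)]
    # Simplification: theoretical Binomial(2, p_k) moments instead of the
    # empirical population-wide centering/normalization used in the paper.
    z = [(a[k] - 2 * p[k]) / math.sqrt(2 * p[k] * (1 - p[k])) for k in range(N)]
    liab = sum(z[k] * u[k] for k in range(N)) \
        + random.gauss(0.0, math.sqrt(1 - eta))
    y = 1 if liab > t else 0         # liability threshold model (2.1)
    if y == 1 or random.random() < pc:   # full ascertainment for cases
        study.append((y, z))

cases = sum(y for y, _ in study)
```

With these parameters the expected study prevalence is P = 1/2, so roughly half of the selected individuals are cases.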

Results
Figure 1 displays the estimations of η obtained with both estimators η̂^(1) and η̂^(2). First, we can notice that both estimators seem empirically unbiased. Second, we observe no obvious improvement of the performance of η̂^(2) compared to η̂^(1) in terms of empirical variance. Table 1 and Figure 2 show the computational performance of both estimators. The computation of the estimator η̂^(2), obtained with the more refined approximation, is obviously slower, but for small values of n (namely, n = 100), the time required to compute an estimation of η remains quite small (86 seconds, against 40 seconds for the other estimator). However, when n is larger, the computational time increases substantially and the "slower" estimator needs up to 13500 seconds, that is almost 4 hours, to compute an estimation of η. In conclusion, both estimators are empirically unbiased, and since the computation of η̂^(2) is slower and does not improve the estimations of η, we are satisfied with the first order approximation and the corresponding estimator η̂^(1).

Normalization of Z in the study
We propose to study the impact of performing the normalization described in Equations (2.3) and (2.4) in the study rather than in the whole population. Since for synthetic data we have access to the complete matrix Z, we can compare our results to those obtained when considering the restriction of Z to the individuals of the study. The results are displayed in Figure 3: the differences between the two manners of centering the genetic information matrix are minor.

Discussion
In this paper, we propose theoretical grounds to support the heritability estimator in case-control studies developed by Golan, Lander and Rosset (2014). We prove its consistency in the framework where both the number n of individuals and the number N of SNPs tend to infinity, with the ratio n/N tending to a constant a. This consistency result was obtained under several assumptions, whose necessity it would be interesting to question. For instance, removing the Gaussianity assumption on the distribution of the random effects could make it possible to take into account possible sparsity, and remains a very challenging issue. The independence of the columns of the SNP matrix is also a very strong assumption, which neglects the linkage disequilibrium (LD) between alleles. Going beyond this assumption seems challenging from the theoretical point of view, even for quantitative traits (Jiang et al., 2016; Bonnet, Gassiat and Levy-Leduc, 2015). Indeed, independence is required to prove consistency, to determine the order of magnitude of different quantities, but also to be able to apply the large deviation results that are essential to prove Lemma 2. For quantitative traits, LD has been shown to result in an overestimated contribution of variants in strong LD (Speed et al., 2012). Despite these theoretical limitations, several efficient filtering procedures (Patterson, Price and Reich, 2006; Speed et al., 2012) were proposed to modify the kinship matrix G before estimating heritability.
Another natural question is how close the finite-sample behaviour is to the asymptotic results. One key ingredient could be a careful calibration of the threshold sequence defined in (4.2) of Theorem 1. This quantity is indeed constrained by a lower bound ensuring that Lemma 2 holds and an upper bound coming from Lemmas 3 and 4. Determining the optimal balance should lead to a lower bound on the convergence rate of the estimation procedure. The numerical results also confirm the good performance of the PCGC method on finite samples; in particular, the similar results obtained with first and second order approximations suggest that the remainder term is indeed negligible compared to the main term.
We would also like to extend our theoretical grounds to a more general model that includes fixed effects, and for instance investigate the properties of the PCGC regression in this scenario. Finally, it would also be interesting to complete this work with theoretical results which could allow the user to compute accurate confidence intervals, similarly to existing results for quantitative traits.

Summary of the proofs
Since the proof of Theorem 1 is quite long and requires heavy computations, we propose in this section a short version of the main arguments that we used to prove Lemmas 1, 2, 3 and 4.

Short proof of Lemma 1
Let us decompose the sum into two terms and show that each of them converges. We will prove these two results by computing the expectation and the variance of both terms, the orders of magnitude of which we will evaluate thanks to the properties of Z given in Proposition 1 of Section 8.

Short proof of Lemma 2
We will show that P(E_N^c) can be upper bounded by a sum of two terms of the form given below, with C and α positive constants and β_N going to infinity when N tends to infinity. These two terms come from the first upper bound. We will rewrite each term with the A_{i,k} instead of the Z_{i,k} so that we can use Assumption 1. We need for instance to upper bound the probability that the difference between the empirical mean and the theoretical mean exceeds a certain value. By the Chernoff inequality, for all λ ≥ 0, and using Assumption 1.2 to upper bound the right-hand term, we obtain a bound valid for all positive values of λ. The right-hand term of (8.1) is minimal at a specific choice of λ, which implies in particular the desired bound. Similarly, we upper bound all the other terms using on the one hand the Chernoff inequality and on the other hand one or several of Assumptions 1 and 2. The detailed computations are given in Section 8.3.3.
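For reference, the generic Chernoff step used here can be written as follows, in an illustrative form: a sub-Gaussian-type moment bound E[e^{λ(A_1 − μ)}] ≤ e^{dλ²} for λ ∈ V_0 is assumed, in the spirit of Assumption 1.

```latex
% Generic Chernoff step (illustrative): for i.i.d. A_1,\dots,A_m with mean \mu
% and any \lambda \ge 0,
\[
  \mathbb{P}\Big(\frac{1}{m}\sum_{i=1}^m A_i - \mu \ge \delta\Big)
  \le e^{-\lambda m \delta}\,
      \mathbb{E}\big[e^{\lambda (A_1 - \mu)}\big]^m
  \le \exp\big(m(d\lambda^2 - \lambda\delta)\big),
\]
% which is minimal at \lambda = \delta/(2d), giving a bound of order
% \exp(-m\delta^2/(4d)).
```

The exponential decay in m is what makes the probability of E_N^c negligible compared to the polynomial number of terms in the sums.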

Short proof of Lemma 3
According to the results of Section 8.3.2, we just need to prove that the remainder term converges to 0. We shall compute an explicit form of the remainder term R_N(i, j), and then we shall see that R_N(i, j)1_{E_N} may be upper bounded by a finite sum of terms of the form (8.19); the sum then converges to 0 since k_1 + k_2 + k_3 + 1 ≥ 3 and γ < 1/10.

Short proof of Lemma 4
Let us show that the sum in Equation (8.21) converges. For this purpose, we will distinguish three cases depending on the cardinality of the set {i_1, i_2, i_3, i_4} in the sum of Equation (8.21).
- If card({i_1, i_2, i_3, i_4}) = 2, since the sum in Equation (8.2) has only n(n − 1) terms, the upper bound of G_N(i_1, i_2)G_N(i_3, i_4) on the event E_N is enough to obtain the convergence to 0.
- If card({i_1, i_2, i_3, i_4}) = 3, the corresponding expectation has no term of order less than 1/√N, that is, no constant term. Then the other terms can be upper bounded on E_N by terms of the form C/N^k, with k large enough to compensate the n(n − 1)(n − 2) terms of the sum.
- If card({i_1, i_2, i_3, i_4}) = 4, each term can be handled using one of the following arguments: • The order of the term is high enough so that it can be upper bounded on E_N by terms of the form C/N^k, with k large enough to compensate the n(n − 1)(n − 2)(n − 3) terms of the sum.
• The term is equal to 0 (we will propose a detailed computation to verify this).
This completes the proof of Equation (8.2).

Taylor development of E[W_i W_j | Z, ε_i = ε_j = 1]
According to Equation (3.2), we only need to compute approximations of the joint probabilities of (Y_i, Y_j) conditionally on Z and ε_i = ε_j = 1, which involve the matrix Σ^(N), the covariance matrix of (l_i, l_j).
We will use the result of Equation (3.6), which is demonstrated in Appendix A.3.

Using a first order Taylor development around 0
A Taylor development of the exponential function leads to the first order term plus a remainder, whose order will be carefully studied in Section 8.3.4. Similarly, we can compute P(Y_i = Y_j = 0 | Z, ε_i = ε_j = 1). Replacing these terms in the expression of the numerator of E[W_iW_j | Z, ε_i = ε_j = 1] given in Equation (3.2) leads to an expression in which r_N is a linear combination of the remainders μ_N, μ̃_N and μ̄_N. Since there is no constant term in this numerator, we only need the development of order 0 of the denominator of E[W_iW_j | Z, ε_i = ε_j = 1] to obtain the first order approximation of E[W_iW_j | Z, ε_i = ε_j = 1].
We obtain that the denominator can be written as a constant plus a remainder r̃_N, where r̃_N is the sum of a term of order 1/√N and a linear combination of μ_N, μ̃_N and μ̄_N. Thus, we obtain the first order approximation stated above.

Properties of Z
In the following proofs, we will use several properties of the matrix Z, which are stated in Proposition 1.

Proposition 1. Uniformly in k,
. The proof of Proposition 1 is given in Appendix A.4.

Proof of Lemma 1
Let us prove that, when n and N tend to infinity and n/N tends to a, (1/n) Σ_{i≠j} G_N(i, j)² converges in probability to a, where →_P denotes convergence in probability.
Since Z_{i,k} and Z_{j,l} are independent for any i and j when k ≠ l, we will always consider separately the cases where k = l and the cases where k ≠ l. Indeed, let us show that (8.8) and (8.9) hold. The second term of (8.10) can be rewritten as:

A. Bonnet
This last equality comes from the definition of Z as a centered and normalized variable given in Equation (2.3), which implies that, for all k, Σ_{i=1}^m Z_{i,k} = 0 and Σ_{i=1}^m Z_{i,k}² = m. This proves (8.8).
In the first term, {i_1, i_2, j_1, j_2} can have cardinality 2, 3 or 4, and counting the number of combinations gives the expression obtained by using (3), (5) and (6) of Proposition 1. This completes the proof of (8.9).

Proof of Lemma 2
Let δ be a positive real number such that √δ/(2c) ∈ V_0 and δ < δ_min/4, where V_0 and δ_min are defined in Assumptions 1 and 2.1 respectively.
Let us show that (8.12) holds. By the Chernoff inequality, for all λ ≥ 0, and then by Assumption 1.2, for all positive values of λ in V_0, we obtain the bound (8.13). The right-hand term of (8.13) is minimal at a specific choice of λ, which implies in particular the first part of the bound. Similarly, for all negative values of λ in V_0, we obtain (8.14); the right-hand term of (8.14) is minimal at the symmetric choice of λ, which proves (8.12).
Since 4δ − δ_min < 0 by assumption on δ, we apply the Chernoff inequality again, which gives us that: This result, combined with (8.12), proves that Using the Chernoff inequality and Assumption 1.1, we can prove that Using the Chernoff inequality and Assumption 1.3, we obtain that and with Assumption 1.1 we have with n²δ_N² = a²N^{2+2γ} and n√N δ_N = aN^{3/2+γ}, where γ > 0, which implies that the main term in the exponential is −n²δ²δ_N²/(256 d N²). Similarly, we can show that This concludes the proof that, for all values of q, We use similar techniques to obtain an upper bound for Since we have already proved (8.15) and (8.16), we will conclude the proof by showing (8.17) and (8.18); (8.17) is obtained using Assumption 1.3 and the Chernoff inequality.
which proves (8.18) and achieves the proof of Lemma 2.

Proof of Lemma 3
According to the results of Section 8.3.2, we have the identity above; thus, we just need to prove the stated convergence. We shall see that R_N(i, j)1_{E_N} may be upper bounded by a finite sum of terms of the form (8.19), with k ∈ ⟦2, 22⟧ and k_1 + k_2 + k_3 = k. Thus, the bound follows, since k_1 + k_2 + k_3 + 1 ≥ 3 and γ < 1/10. This completes the proof of Lemma 3. Let us explain why Equation (8.19) holds. We need to evaluate |R_N(i, j)1_{E_N}|. Let us look at the remainders which compose R_N(i, j) and provide upper bounds when E_N holds. Similarly, we obtain bounds in which P_1, P_2, P_3 are polynomial functions; this expression comes from upper bounding the corresponding terms. Then, similarly to expression (8.20), and similarly to the computations made for α_N, β_N, γ_N, ν_N, all the remainder terms can be upper bounded by products of the same quantities, which proves (8.19).

Proof of Lemma 4
In this section, all the expectations that we consider are conditional on the presence of the observed individuals in the study, as in the example above. However, for the sake of simplicity, we will not always make this conditioning explicit. Let us show the stated convergence. For this purpose, we will distinguish three cases depending on the cardinality of the set {i_1, i_2, i_3, i_4} in the sum of Equation (8.21).
-If card({i_1, i_2, i_3, i_4}) = 2, the corresponding terms in (8.21) are equal to the expression above, where α is a positive constant and ρ_N(i, j) can be upper bounded by a finite product of G_N(i, j), G_N(i, i) − 1 and G_N(j, j) − 1, according to the proof of Lemma 3. This result is obtained by using a similar decomposition. Since all terms of ρ_N(i, j) are upper bounded by a finite sum of terms with k greater than 1, which all tend to 0, the conclusion is clear.
-If card({i_1, i_2, i_3, i_4}) = 3, the corresponding terms in (8.21) are equal to the expression above. Since the sum of Equation (8.22) has n(n − 1)(n − 2) terms, we have to refine the upper bound that we used in the case where the cardinality of {i_1, i_2, i_3, i_4} was equal to 2. Indeed, we will use the following proposition: the quantity under study has no term of order less than 1/√N, that is, no constant term.
Let us explain why Proposition 2 is enough to prove the desired bound.

Let us first recall that, according to Lemma 3, the decomposition above holds, where, if E_N holds, all these terms are upper bounded by a finite sum of terms of the form indicated, with k ≥ 2. Then, the product can be upper bounded by a finite sum of such terms with k ≥ 4.
Similarly, according to Proposition 2, each term of the sum is controlled; this completes the proof of (8.22).
-If card({i_1, i_2, i_3, i_4}) = 4, let us first observe that the identity above holds, which means that we shall only focus on the approximation of order 1/N. Let us recall that the remainder has each of its terms upper bounded by a finite sum of terms of the form indicated, with k ≥ 2. In particular, this implies the bound above. Thus, we need to prove (8.24). To do so, we shall first prove the following proposition: Proposition 3. The terms of order less than or equal to 1/√N in the expansion are null.
The remaining term, of order exactly 1/N, is handled by the following propositions. Proposition 5. For all terms T_N(i_1, i_2, i_3, i_4) of order 1/N in the expansion, the stated property holds. Propositions 3, 4 and 5 together prove (8.24). Let us now prove Propositions 2, 3, 5 and 4.
Since each case has probability 1 and each control probability K(1 − P)/(P(1 − K)) of being included in the study (these probabilities are given in Equations (2.6) and (2.7)), the first computation follows. A similar computation shows that the approximation of order 0 is null, which completes the proof of Proposition 2.
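These selection probabilities are exactly what makes the proportion of cases in the study equal to P despite a population prevalence K. A small simulation illustrates this (the values of K and P below are illustrative, not from the paper):

```python
import random

def simulate_case_fraction(K, P, n_pop=500_000, seed=1):
    """Draw a population with disease prevalence K; include each case with
    probability 1 and each control with probability K(1-P)/(P(1-K)).
    Return the fraction of cases in the resulting study."""
    rng = random.Random(seed)
    p_control = K * (1 - P) / (P * (1 - K))
    cases = controls = 0
    for _ in range(n_pop):
        if rng.random() < K:              # this individual is a case
            cases += 1                    # cases always enter the study
        elif rng.random() < p_control:    # controls are subsampled
            controls += 1
    return cases / (cases + controls)

K, P = 0.01, 0.5  # rare disease, balanced study (illustrative values)
frac = simulate_case_fraction(K, P)
print(round(frac, 2))  # close to P = 0.5
```

The identity behind the simulation is K / (K + (1 − K) · K(1−P)/(P(1−K))) = P, so the oversampling of cases is exactly compensated in expectation.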
Let us now prove Proposition 3. If card({i_1, i_2, i_3, i_4}) = 4, let us compute the approximation of order 1/√N. The product under study can take the following values:
• −(1 − P)/P if one individual is a control and the three others are cases.
• 1 if two individuals are controls and two are cases.
• −P/(1 − P) if one individual is a case and the three others are controls.
The variance-covariance matrix of (l_{i_1}, l_{i_2}, l_{i_3}, l_{i_4}) is given above. For the sake of clarity, let us denote A_1 the first diagonal term, and similarly define A_2, A_3 and A_4. Let us also denote C_{1,2} = √N G_N(i_1, i_2) and similarly C_{1,3}, ..., C_{3,4}. Then, let us rewrite Σ accordingly. The approximation of order 1/√n of its inverse matrix is given above. Similarly, we compute:
• P("1 control, 3 cases" | Z)
• P("2 controls, 2 cases" | Z)
• P("3 controls, 1 case" | Z)
Regrouping all the first terms in the expression, then the terms in η/√n, and finally all the terms in (η/√n)(C_{1,2} + ... + C_{3,4}), proves Proposition 3.
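The first-order inversion of Σ used here is an instance of the Neumann expansion: if Σ = A + εC with A invertible, then Σ⁻¹ = A⁻¹ − εA⁻¹CA⁻¹ + O(ε²). A generic numerical check of that identity (2×2 matrices chosen arbitrarily, not the paper's Σ; pure Python to stay dependency-free):

```python
def mat_inv2(M):
    """Inverse of a 2x2 matrix via the adjugate formula."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def mat_mul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

A = [[2.0, 0.3], [0.3, 1.5]]    # invertible "main" part
C = [[0.5, -0.2], [0.1, 0.4]]   # perturbation
eps = 1e-4

Sigma = [[A[i][j] + eps * C[i][j] for j in range(2)] for i in range(2)]
exact = mat_inv2(Sigma)

A_inv = mat_inv2(A)
# First-order Neumann expansion: A^{-1} - eps * A^{-1} C A^{-1}
correction = mat_mul(mat_mul(A_inv, C), A_inv)
approx = [[A_inv[i][j] - eps * correction[i][j] for j in range(2)]
          for i in range(2)]

err = max(abs(exact[i][j] - approx[i][j]) for i in range(2) for j in range(2))
print(err < eps**2 * 10)  # the residual is O(eps^2)
```

Halving ε divides the residual by roughly four, confirming the second-order nature of the error, which is the precision the proof needs at this step.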
Let us now prove Proposition 5. The main term of the second-order approximation of f(w, x, y, z) can be written as in (8.25). In order to prove Proposition 5, we will show the bounds (8.26)-(8.32); we will develop the proof of Equation (8.27). We recall that since Z_{i,k} and Z_{j,l} are independent for any i and j when k ≠ l, we will always consider separately the cases where k = l from the cases where k ≠ l. Let us first focus on the last term of (8.35).
This term is of order n²(n − 1)² + o(1). Now let us decompose the second term of (8.35) as above. Using the results given by Proposition 1, we obtain the stated limit. Similarly, we can prove the remaining convergences by using the properties of Proposition 1 or similar relationships coming from other properties of Z that we have not detailed here. Hence we have shown (8.27). The proofs of (8.26), (8.28), (8.29), (8.30), (8.31) and (8.32) are very similar.
The term in C_{1,2}C_{3,4} of P("2 cases, 2 controls" | Z) is given above. The term in C_{1,2}C_{3,4} of P("1 case, 3 controls" | Z) follows similarly, as does the corresponding term in the remaining configuration. It remains to compute the approximation of order 0 of the denominator, which is exactly the term displayed. This proves Proposition 4.

Second order approximation of f
The density function f can still be written as before, but with the explicit term of order 1/N in the expressions of (Σ^{(N)})^{-1} and |Σ^{(N)}|^{-1/2}, the last term being obtained by expanding the exponential function.
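The last step uses nothing more than the second-order expansion of the exponential, e^{x+h} = e^{x}(1 + h + h²/2) + O(h³). A generic numerical check (the values of x and h are arbitrary, unrelated to the paper's quantities):

```python
import math

def second_order_exp(x, h):
    """Second-order Taylor expansion of exp(x + h) around h = 0."""
    return math.exp(x) * (1.0 + h + h * h / 2.0)

x = -0.3
for h in (1e-2, 1e-3):
    err = abs(math.exp(x + h) - second_order_exp(x, h))
    print(err < abs(h) ** 3)  # the remainder is O(h^3)
```

Dividing h by 10 divides the error by roughly 1000, which is the cubic decay of the neglected remainder, precisely the accuracy needed to keep the 1/N term explicit.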
we have:

Multiplying by the corresponding factor, we finally compute P(Y_i ≠ Y_j | Z) in the same way. We replace the expressions of P(Y_i = Y_j = 1 | Z), P(Y_i = Y_j = 0 | Z) and P(Y_i ≠ Y_j | Z) in the expression of E[W_i W_j | Z, ε_i = ε_j = 1]. Since we have already computed the terms of order 1/√N for the numerator, only the terms of order 1/N remain. Eventually, we find that the numerator can be written as above. Similarly, we compute the expression of the denominator (at order 1/√N, since the main term of the numerator is of order 1/√N). We obtain the stated expression involving tφ(t).

A.1. Proof of Equation (2.7)
By definition, the probabilities p_case and p_control are linked to the selection indicators ε_i as follows: p_case = P(ε_i = 1 | Z, Y_i = 1) and p_control = P(ε_i = 1 | Z, Y_i = 0).
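Equation (2.7) itself can be recovered from the requirement that cases make up a fraction P of the study. Assuming full ascertainment of cases (p_case = 1), a sketch of the computation (the display is reconstructed from the paper's K and P, not quoted):

```latex
P \;=\; \frac{K \, p_{\mathrm{case}}}{K \, p_{\mathrm{case}} + (1-K)\, p_{\mathrm{control}}}
\quad \text{with } p_{\mathrm{case}} = 1
\;\Longrightarrow\;
p_{\mathrm{control}} \;=\; \frac{K(1-P)}{P(1-K)} .
```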

A.2. Proof of Equation (3.2)
This equation was proved in Golan, Lander and Rosset (2014); we recall the proof here for the sake of completeness. Conditional on the event {ε_i = ε_j = 1}, the variable W_i W_j can take the following values. Let us write the expectation of W_i W_j conditional on Z and on {ε_i = ε_j = 1}, under the full ascertainment assumption given by Equation (2.6).
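For the record, if W_i denotes the standardized phenotype (Y_i − P)/√(P(1−P)), as in Golan, Lander and Rosset (2014) (this definition is assumed here, the original display having been lost), the product takes three values:

```latex
W_i W_j \;=\;
\begin{cases}
\dfrac{1-P}{P} & \text{if } Y_i = Y_j = 1 \ \text{(two cases)},\\[1ex]
\dfrac{P}{1-P} & \text{if } Y_i = Y_j = 0 \ \text{(two controls)},\\[1ex]
-1 & \text{if } Y_i \neq Y_j .
\end{cases}
```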
Similarly, since we have seen in Equation (2.7) that a control has probability K(1 − P)/(P(1 − K)) of being selected in the study, and since ε_i and ε_j are assumed to be independent conditional on Z, Y_i and Y_j, the corresponding conditional probabilities follow. The probability that both individuals i and j are included in the study is then obtained. Combining all these computations and plugging them into expression (A.1), we obtain (3.2).

A.3. Proof of Equation (3.6)
Notice first that the identity above holds. Moreover, since the variables (Z_{i,k})_{1≤i≤n} are normalized according to Equation (2.3), taking the expectation and using the exchangeability of the variables (Z_{i,k})_{1≤i≤n}, we obtain that E[Z²_{i,k}] = 1. (A.2) Using (2) of Proposition 1 and Equation (A.2), we obtain that the quantity is bounded. Similarly, (3) and (1) of Proposition 1 yield the remaining bounds.