Nonparametric distribution estimation in the presence of familial correlation and censoring

We propose methods to estimate the distribution functions for multiple populations from mixture data, where each observation is only known to belong to a specific population with certain probabilities. The problem is motivated by kin-cohort studies that collect phenotype data in families for various diseases, such as Huntington's disease (HD) or breast cancer. Relatives in these studies are not genotyped, hence only their probabilities of carrying a known causal mutation (e.g., a BRCA1 or HD gene mutation) can be derived. In addition, phenotype observations from the same family may be correlated due to shared lifestyle or other genes associated with the disease, and the observations are subject to censoring. Our estimator does not assume any parametric form for the distributions and does not require modeling of the correlation structure. It estimates the distributions by first forming optimal base estimators and then combining them optimally, where optimality implies both estimation consistency and minimum estimation variance. Simulations and a real data analysis of an HD study illustrate the improved efficiency of the proposed methods. MSC 2010 subject classifications: Primary 62G08; secondary 62N01.


Introduction
This work is motivated by research goals arising from studies such as the Cooperative Huntington's Observational Research Trial (COHORT, Dorsey and the Huntington Study Group COHORT Investigators, 2012 [3]). Huntington's disease (HD) is a fatal neurodegenerative disease caused by expanded CAG repeats in the huntingtin gene (Huntington Study Group, 1993 [6]). Subjects with expanded CAG repeats in the huntingtin gene will be affected by HD, and the distribution of age-at-onset of HD in individuals with the mutation is characterized in Langbehn et al. (2004) [8] and Ma and Wang (2012) [10]. However, less discussed in the literature is the effect of CAG expansion status on patient survival. Here, we aim to estimate the age-at-death distribution function for HD gene-expanded individuals. In COHORT, phenotype information (age-at-death, or age at study baseline) on relatives from the same families was collected. One challenge is that genotype information on the relatives is not collected, and thus the gene expansion status of some relatives is unknown. Nevertheless, a relative's probability of carrying the huntingtin gene mutation can be obtained from external information (see, for example, Ma and Wang 2012 [10]).
Other challenges in COHORT, as well as in other family studies, include handling the correlation among relatives in the same family and right censoring. Using the probability of carrying a mutation associated with each relative, and assuming that the observations are independent given the mutation status, the distribution of any trait of interest for both the mutation carrier and non-carrier populations can be estimated efficiently in the absence of censoring (Ma and Wang, 2012 [10]). When the data are also subject to censoring, Ma and Wang (2014) [11] further developed effective methods to perform estimation and inference through a weighted least squares formulation. However, both methods assume that relatives in the same family are independent given their genotypes (i.e., no residual familial correlation), and hence it is unclear how to properly handle the potential correlation due to shared lifestyle or causal genes at other loci to improve estimation efficiency.
In this work, we address within-family correlation for censored mixture data, where the subjects' population group identifiers (i.e., mutation carrier or non-carrier) are only known up to probabilities. Since the familial correlation can result from a similar lifestyle or shared biological factors other than the gene under study, it may be challenging to choose a satisfactory parametric model for the distribution of the shared latent effects, and therefore we make no such attempt. We provide two different modeling approaches and, subsequently, two estimation procedures. In the first approach, we leave the distribution of the unobserved latent familial effects and their correlation structure completely unspecified. We first eliminate the need to handle the familial correlation by using only one member per family to form a base estimator, and then construct an optimal new estimator that takes advantage of multiple members per family by resampling and minimizing the variance of the combined estimator. When forming the base estimator, we use the approach of Ma and Wang (2014) [11], which is simple and practically as effective as the efficient estimator. This first proposed method can handle arbitrary distribution functions and arbitrary correlation structures without imposing parametric assumptions or modeling the correlation; in addition, it is easy to compute and flexible. In the second approach, we assume an exchangeable correlation structure among family members to improve estimation accuracy, but leave the distribution function of the phenotype unspecified to protect against misspecification. In this case, we proceed with a modified weighted least squares estimator that takes full advantage of the assumed correlation structure.
The rest of the paper is organized as follows. We present the methods, describe their implementation, and demonstrate their optimality properties in Section 2. Simulations are carried out in Section 3 to illustrate the performance of the estimators in both simple and complex settings. We then analyze the COHORT data that motivated this work in Section 4 and conclude with a discussion in Section 5. All technical derivations are in the Appendix.

Methodologies
We first define some notation. Suppose there are $N$ families in the study, and the $i$th family has $n_i$ members, $i = 1, \dots, N$. The random event time for the $j$th member of the $i$th family is $S_{ij}$. Further, the event is subject to random censoring at time $C_{ij}$. Let $Y_{ij} = \min(S_{ij}, C_{ij})$ and let the censoring indicator be $\Delta_{ij} = I(S_{ij} \le C_{ij})$. Furthermore, we assume there are $p$ different populations, and their event times have cumulative distribution functions $F_1(t), F_2(t), \dots, F_p(t)$, respectively. Write $F(t) = \{F_1(t), F_2(t), \dots, F_p(t)\}^{\rm T}$. We assume the $p$ event processes associated with the $p$ populations are independent of the censoring process. Assume that for all $i = 1, \dots, N$, $j = 1, \dots, n_i$, $S_{ij}$ is a random draw from one of the $p$ populations, but the exact population identifier is not known. We use $q_{ijk}$ to denote the probability of $S_{ij}$ belonging to the $k$th population, for $k = 1, \dots, p$, and let $q_{ij} = (q_{ij1}, q_{ij2}, \dots, q_{ijp})^{\rm T}$. Obviously, $\sum_{k=1}^p q_{ijk} = 1$. Using this notation, the observed data can be written as $O = \{(q_{ij}, Y_{ij}, \Delta_{ij}),\ i = 1, \dots, N,\ j = 1, \dots, n_i\}$. The above data structure typically arises from kin-cohort and quantitative trait locus (QTL) studies, where $q_{ij}$ only has finitely many, say $m$, $m < \infty$, possible values, denoted $u_1, u_2, \dots, u_m$. We write the frequencies of the occurrences of $u_1, u_2, \dots, u_m$ as $d_1, d_2, \dots, d_m$, respectively.

Independent case
We consider the special case where the observations $O = \{(q_{ij}, Y_{ij}, \Delta_{ij}),\ i = 1, \dots, N,\ j = 1, \dots, n_i\}$ are independent of each other. Obviously, this happens when each family has only one member, i.e., $n_i = 1$ for $i = 1, \dots, N$; it also happens when there is no within-family correlation. In this case, following Ma and Wang (2014) [11], we use the relation
$H_l(t) = u_l^{\rm T} F(t), \quad l = 1, \dots, m$,   (2.1)
where $H_l(t)$ is the distribution function of the event times among observations with $q_{ij} = u_l$, and estimate $F(t)$ through the weighted least squares solution
$\widehat F(t) = \arg\min_F \sum_{l=1}^m d_l \{\widehat H_l(t) - u_l^{\rm T} F(t)\}^2$.   (2.2)
Here $\widehat H_l(t)$ is the Kaplan-Meier (KM) estimate (Kaplan and Meier, 1958 [7]) of $H_l(t)$. Kaplan and Meier (1958) [7] established the consistency of the KM estimator, while Breslow and Crowley (1974) [2] showed that it converges weakly to a Gaussian process. Since $\widehat F(t)$ is a linear transformation of $\widehat H(t) = \{\widehat H_1(t), \dots, \widehat H_m(t)\}^{\rm T}$, it is also consistent and converges weakly to a Gaussian process as $N \to \infty$. These observations will be used in the derivations below when within-family correlation exists.
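As a concrete sketch of the independent-case estimator, the snippet below computes group-wise KM estimates and solves the weighted least squares problem. This is a minimal illustration rather than the authors' code; weighting each group by its size follows the common-variance simplification described later, and the function names are our own.

```python
import numpy as np

def km_cdf(y, delta, t):
    """Kaplan-Meier estimate of the CDF H(t) = 1 - S(t) at time t."""
    order = np.argsort(y)
    y, delta = np.asarray(y, float)[order], np.asarray(delta, int)[order]
    n, surv = len(y), 1.0
    for i in range(n):
        if y[i] > t:
            break
        if delta[i]:                      # an observed event at y[i]
            surv *= 1.0 - 1.0 / (n - i)   # multiply by (at risk - 1) / at risk
    return 1.0 - surv

def independent_mixture_estimate(groups, u, t):
    """Estimate F(t) = (F_1(t), ..., F_p(t)) via H_l(t) = u_l^T F(t).

    groups: list of (y, delta) pairs, one per distinct mixing vector u_l.
    u: m x p array whose l-th row is u_l.
    """
    H = np.array([km_cdf(y, d, t) for y, d in groups])
    d_l = np.array([len(y) for y, _ in groups], float)   # group sizes
    U = np.asarray(u, float)
    W = np.diag(d_l)
    # weighted least squares: minimize sum_l d_l (H_l - u_l^T F)^2
    return np.linalg.solve(U.T @ W @ U, U.T @ W @ H)
```

With "pure" groups (each $u_l$ a unit vector) and no censoring, the estimate reduces to the empirical CDFs of the groups, which provides a quick sanity check.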

Arbitrary correlation 1: Best Linear Resampled Estimator (BLRE)
In the general situation where members of the same family may be correlated, we propose a two-stage procedure that utilizes the results described in Section 2.1.
In the first stage, we randomly sample one member from each family, regardless of the family size, and then use (2.2) to obtain a crude estimate of $F(t)$.
We repeat this process multiple, say $R$, times to obtain multiple estimators of $F(t)$, denoted $\widehat F_1(t), \dots, \widehat F_R(t)$. In the second stage, we aim to combine the multiple estimators from the first stage in an optimal way. Since each $\widehat F_r(t)$, $r = 1, \dots, R$, is a consistent estimator of $F(t)$, it is natural to use a weighted average of these estimators to form an estimator that is not only consistent but also more efficient. In general, we write the combined estimator as
$\widehat F(t) = A \{\widehat F_1(t)^{\rm T}, \dots, \widehat F_R(t)^{\rm T}\}^{\rm T}$,   (2.3)
where $A$ is a $p \times pR$ weight matrix. The consistency requirement mandates $A J = I_p$, where $I_p$ is the size-$p$ identity matrix and $J = (I_p, \dots, I_p)^{\rm T}$ is a $pR \times p$ matrix formed by stacking $I_p$'s. In the Appendix, we further derive the optimal choice of $A$ in terms of minimizing the variance of $\widehat F(t)$. We summarize these results in Theorem 1 and give the proof in the Appendix.

Theorem 1. Let $\widehat F(t)$ be given in (2.3). Then, as long as $A J = I_p$, $\widehat F(t)$ is a consistent estimator of $F(t)$ and is asymptotically normally distributed. In addition, ${\rm var}\{\widehat F(t)\}$ is minimized when
$A_{\rm opt} = (J^{\rm T} U^{-1} J)^{-1} J^{\rm T} U^{-1}$,   (2.5)
where $U$ denotes the asymptotic variance-covariance matrix of the stacked base estimators $\{\widehat F_1(t)^{\rm T}, \dots, \widehat F_R(t)^{\rm T}\}^{\rm T}$. The resulting optimal variance of $\sqrt{N}\,\widehat F(t)$ is $(J^{\rm T} U^{-1} J)^{-1}$.

To take advantage of the result in Theorem 1, we still need to obtain $U$. Because of our construction of $\widehat F(t)$, $U$ is naturally an $R \times R$ block matrix with blocks of size $p \times p$. Although the diagonal blocks of $U$ can be approximated using the results in Section 2.1, the analysis of the off-diagonal blocks is intractable due to the unspecified correlation structure among family members and the potentially complex pattern resulting from the sampling procedure. Thus, we resort to a bootstrap procedure (Efron, 1981 [4]; Akritas, 1986 [1]) to assess $U$. Here, caution needs to be taken in performing the bootstrap: although our interest is in repeatedly drawing family members to form estimators, we need to bootstrap families, not members of families. Specifically, for $b = 1, \dots, B$, we randomly draw $N$ families with replacement and with equal probability, and denote the bootstrap sample $O^{*b}$. We then repeat the estimation procedure described above on $O^{*b}$ to obtain $\widehat F_L^{*b}(t)$. The sample variance of $\widehat F_L^{*1}(t), \widehat F_L^{*2}(t), \dots, \widehat F_L^{*B}(t)$ is then used to estimate $U$. The complete BLRE procedure is the following.
Step 1. Randomly sample one member from each family and use (2.2) to obtain a base estimator of $F(t)$.
Step 2. Repeat Step 1 $R$ times to obtain $\widehat F_1(t), \dots, \widehat F_R(t)$.
Step 3. Randomly sample $N$ families with replacement from the original families. Perform Steps 1 and 2 on the sampled data to obtain the corresponding bootstrap estimators.
Step 4. Repeat Step 3 $B$ times.
Step 5. Calculate the sample variance $\widehat U$ of the bootstrap estimators, form $A_{\rm opt}$ in (2.5) with $\widehat U$, and obtain the final estimator $\widehat F_L(t)$ from (2.3).
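The second stage of BLRE, combining the base estimators with the optimal weight matrix from Theorem 1, can be sketched as follows. This is an illustrative helper, not the authors' implementation; it assumes the $pR \times pR$ bootstrap covariance estimate $U$ is already in hand.

```python
import numpy as np

def blre_combine(F_list, U):
    """Optimally combine R base estimators of the p-vector F(t).

    F_list: list of R length-p arrays (the resampled base estimates).
    U: (pR x pR) covariance matrix of the stacked estimates, e.g. from
       bootstrapping whole families.
    Uses A_opt = (J^T U^{-1} J)^{-1} J^T U^{-1}, which satisfies A J = I_p.
    """
    R, p = len(F_list), len(F_list[0])
    F_stack = np.concatenate(F_list)               # length pR
    J = np.tile(np.eye(p), (R, 1))                 # pR x p, stacked identities
    Uinv = np.linalg.inv(U)
    A_opt = np.linalg.solve(J.T @ Uinv @ J, J.T @ Uinv)   # p x pR
    return A_opt @ F_stack
```

When $U$ is the identity, the combination reduces to the simple average of the base estimators; the constraint $AJ = I_p$ also guarantees that identical base estimates are returned unchanged for any positive definite $U$.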

Arbitrary correlation 2: Best Quadratic Inference Function Estimator (BQIF)
We now investigate the issue of familial correlation from a different perspective. The basic idea is that each time we sample one member from each family, we can construct an estimating equation following Ma and Wang (2014) [11], because the sampled observations are independent. Repeating this procedure multiple times yields multiple estimating equations, and it is natural to combine them optimally to obtain a final estimator. Specifically, when there is only one member per family, we rewrite the relation in (2.1) as $H_i(t) = q_i^{\rm T} F(t)$ and view $F(t)$ as the root that solves
$\sum_{i=1}^N q_i \{\widehat H_i(t) - q_i^{\rm T} F(t)\} = 0$,   (2.7)
where $\widehat H_i(t)$ is the same KM estimator as before, evaluated for the group to which the $i$th observation belongs.
We use the sampling scheme in the first stage of BLRE in Section 2.2 to sample $R$ data sets, and write the estimating equation (2.7) based on the $r$th sampled data set. Here, $q_i^r$ denotes the $q$ value of the member sampled from the $i$th family in the $r$th sample. Because the number of equations, $pR$, can be much larger than the number of parameters, $p$, we resort to the Quadratic Inference Function (QIF) method (Lindsay and Qu, 2003 [9]). Stacking the $R$ estimating functions from the $i$th family,
$g_i(t) = \{g_i^1(t)^{\rm T}, \dots, g_i^R(t)^{\rm T}\}^{\rm T}$ with $g_i^r(t) = q_i^r \{\widehat H_i^r(t) - q_i^{r{\rm T}} F(t)\}$,   (2.8)
we minimize the quadratic form
$Q\{F(t)\} = N^{-1} \{\sum_{i=1}^N g_i(t)\}^{\rm T} W \{\sum_{i=1}^N g_i(t)\}$,   (2.9)
where $W$ is a weight matrix. In a typical QIF construction, the $g_i(t)$'s are functions of the $i$th observation only and are hence independent of each other, and the root-$N$ consistency and asymptotic normality of the resulting estimator were established in Lindsay and Qu (2003) [9]. Here, however, it is important to recognize that the $g_i(t)$'s are not independent, since they contain the $\widehat H_i^r(t)$'s, which are estimated from all the observations in the $r$th sample, $r = 1, \dots, R$. Nevertheless, in Theorem 2 we show that the resulting estimator still enjoys the usual asymptotic normality property. The proof is in the Appendix.

Theorem 2. Let $\widehat F(t)$ be the minimizer of the quadratic form in (2.9). Then, as $N \to \infty$, $\widehat F(t)$ is a consistent estimator of $F(t)$ and is asymptotically normally distributed, and its asymptotic variance is minimized when $W = M^{-1}$, where $M$ is the asymptotic variance of $N^{-1/2} \sum_{i=1}^N g_i(t)$.   (2.10)

Theorem 2 prescribes the choice of the optimal weight matrix. To achieve efficiency, it is essential to estimate $M$. Because no correlation structure is modeled for members of the same family, we resort to the bootstrap procedure of Section 2.2 to approximate $M$: using the $b$th bootstrap sample $O^{*b}$, we follow the procedure described above to construct the corresponding estimating equations. The detailed BQIF algorithm is the following.
Step 1. Randomly sample one member from each family and form the estimating equation (2.7).
Step 2. Repeat Step 1 $R$ times and stack the resulting estimating functions to form $g_i(t)$, $i = 1, \dots, N$, as in (2.8).
Step 3. Randomly sample $N$ families with replacement from the original families. Perform Steps 1 and 2 on the sampled data to obtain the corresponding bootstrap estimating functions.
Step 4. Repeat Step 3 $B$ times.
Step 5. Calculate the sample variance $\widehat M$ of the bootstrap estimating functions, set $W = \widehat M^{-1}$, and obtain the estimator $\widehat F(t)$ by minimizing (2.9).
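Because the stacked moment conditions are linear in $F(t)$, minimizing the quadratic form in (2.9) has a closed form. The sketch below is illustrative only: the design matrix `Q` (stacking the sampled mixing-vector cross-products) and the bootstrap covariance `M` are assumed to be given.

```python
import numpy as np

def bqif_estimate(h, Q, M):
    """Minimize the quadratic inference function
        {h - Q F}^T M^{-1} {h - Q F}
    over F, for moment conditions that are linear in F.

    h: length-pR vector of KM-based moments (one block per resample).
    Q: pR x p design matrix.
    M: pR x pR covariance of the moment vector (bootstrap estimate).
    """
    W = np.linalg.inv(M)
    # normal equations of the weighted quadratic form
    return np.linalg.solve(Q.T @ W @ Q, Q.T @ W @ np.asarray(h, float))
```

Note that when `Q` consists of stacked identity blocks and `M` is the identity, the BQIF solution coincides with the simple average of the per-resample estimates, mirroring the BLRE/BQIF equivalence discussed in Section 2.4.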
In Step 2 of both algorithms (BLRE and BQIF), we need to repeat Step 1 $R$ times. As we repeat, more combinations of family members are formed and included in the analysis. In total, there are $\prod_{i=1}^N n_i$ ways to form different estimating equations in Step 1 of BQIF. In practice, however, we suggest initially setting $R$ to the largest family size and increasing it gradually until no significant improvement is seen. If increasing $R$ further costs substantial computation time, the method described in Section 2.5 is an alternative to consider.

Equivalence of BLRE and BQIF
To understand the advantages and disadvantages of BLRE and BQIF, introduced in Sections 2.2 and 2.3 respectively, we perform further analysis to compare their relative performance. Given that BLRE is a combination of the estimators from $R$ samples, while BQIF results from solving a combination of estimating equations from the same $R$ samples, it is perhaps not surprising that the two procedures are in fact equivalent. In the following, we formally establish that there is a one-to-one mapping between the estimators in the two classes and, in particular, that the optimal estimation variances of the two estimators are asymptotically identical.
Because the BLRE is uniquely determined by the choice of the weight matrix $A$, while the BQIF is uniquely determined by the weight matrix $W$, we only need to establish a one-to-one mapping between $A$ and $W$ to show our results. Define the $pR \times pR$ block diagonal matrix $D$ whose $r$th diagonal block is $(N^{-1} \sum_{i=1}^N q_i^r q_i^{r{\rm T}})^{-1}$. For any weight matrix $W$ defined in the BQIF estimator, consider
$A = (J^{\rm T} D^{-1} W D^{-1} J)^{-1} J^{\rm T} D^{-1} W D^{-1}$   (2.11)
as the weight matrix in BLRE; obviously, $A J = I_p$. We now investigate the BLRE and BQIF resulting from the corresponding $A$ and $W$. Denote the BLRE by $\widehat F_L(t)$ and recall the definition of $g_i(t)$ in (2.8). We can write
$\{\widehat F_1(t)^{\rm T}, \dots, \widehat F_R(t)^{\rm T}\}^{\rm T} = J F(t) + D N^{-1} \sum_{i=1}^N g_i(t)$,   (2.12)
which leads to
$\widehat F_L(t) - F(t) = A D N^{-1} \sum_{i=1}^N g_i(t)$.   (2.13)
On the other hand, the BQIF, denoted $\widehat F^{(2)}(t)$, is obtained by minimizing (2.9), so a standard Taylor expansion leads to
$\widehat F^{(2)}(t) - F(t) = (J^{\rm T} D^{-1} W D^{-1} J)^{-1} J^{\rm T} D^{-1} W N^{-1} \sum_{i=1}^N g_i(t) + o_p(N^{-1/2})$,   (2.14)
where the last step uses the relation (2.12). Further using the connection between $A$ and $W$ in (2.11), subtraction of (2.14) from (2.13) shows that $\widehat F_L(t)$ and $\widehat F^{(2)}(t)$ are asymptotically equivalent.
Having established the one-to-one mapping between BLRE and BQIF via (2.11), it is natural to expect that the optimal weight matrix choices in the two estimator classes, $A_{\rm opt}$ and $W_{\rm opt} = M^{-1}$, also satisfy (2.11). This is easily verified using the equality $U = D M D$, which follows from (2.12). Furthermore, we can explicitly verify that the two optimal asymptotic estimation variances are identical, i.e., $(J^{\rm T} U^{-1} J)^{-1} = (J^{\rm T} D^{-1} M^{-1} D^{-1} J)^{-1}$.

Structured correlation
The estimators proposed in Sections 2.2 and 2.3 rely on resampling because we impose no assumption on the familial correlation structure. Here, we consider a special case where the correlations between family members are identical. This assumption is natural when the data do not contain information that allows us to differentiate various levels of correlation among relatives, such as those due to a shared lifestyle when living together. We denote the common correlation by $\rho$.
Recall that in Section 2.1 we showed that, in the independent case, the estimator of Ma and Wang (2014) [11] is based on (2.2), where $\widehat H_l(t)$ is a KM estimator of $H_l(t)$. We can view this as $\widehat H_l(t) = u_l^{\rm T} F(t) + \epsilon_l$, where $\epsilon_l \sim N(0, V_l / d_l)$ for $l = 1, \dots, m$. Ma and Wang (2014) [11] advocates replacing the $V_l$'s with a common value $V$ and using weighted least squares with $m$ observations to recover $F(t)$.
However, when the familial correlation is $\rho$ instead of zero, we effectively have fewer than $d_l$ observations when estimating $H_l(t)$; the effective number of observations depends on $\rho$ and on the counts $d_{li}$, where $d_{li}$ is defined as the number of members in family $i$ that belong to group $l$, $i = 1, \dots, N$. Note that some $d_{li}$'s may be zero by definition. Further taking into account the correlations between the $m$ groups in a similar way, we propose to estimate $F(t)$ through
$\widehat F(t) = \arg\min_F \{\widehat H(t) - U^{\rm T} F(t)\}^{\rm T} W \{\widehat H(t) - U^{\rm T} F(t)\}$,   (2.15)
where $U = (u_1, \dots, u_m)$ and $\widehat H(t) = \{\widehat H_1(t), \dots, \widehat H_m(t)\}^{\rm T}$. Here the diagonal elements of $W^{-1}$ are the reciprocals of the (effective) numbers of observations in the $m$ groups, and the $(l, l')$ entry captures the between-group covariance induced by family members shared across groups. In the matrix $W$, the only unknown quantity is $\rho$. We estimate $\rho$ using the large sample properties of $\widehat H_l(t)$ under correlated data, as described in the following. We first sort the observed times in the $l$th group and obtain the ordered data $\{Y_{(i)}, \Delta_{(i)}, q_{(i)} = u_l\}$; note that they are from the same group and hence share the same $q$ value. Following Breslow and Crowley (1974) [2], for correlated observations the KM estimator admits an asymptotically linear representation in terms of quantities $a_i^l$ constructed from the ordered data. Because some of the $\{Y_{(i)}, \Delta_{(i)}, q_{(i)} = u_l\}$'s come from the same family, there may be an in-group correlation $\rho$ between $a_i^l$ and $a_j^l$ in the $l$th group. Furthermore, the calculation of ${\rm cov}\{\widehat H_l(t), \widehat H_{l'}(t)\}$ may also involve a between-group correlation between $a_i^l$ and $a_j^{l'}$ when the observations are from the same family. Since a common correlation $\rho$ is assumed among family members, the between-group correlation is also $\rho$. We match $\rho$ with the sample correlation computed from the in-group and between-group pairs to obtain $\widehat\rho$; the detailed procedure consists of the three steps described in the following.
i. Approximate the integrals in the $a_i^l$'s and re-organize the summation; for notational simplicity, we still write the results as $a_i^l$.
ii. Form the in-group and between-group pairs of $a_i^l$'s corresponding to members of the same family.
iii. Calculate $\widehat\rho$ as the sample correlation over these pairs.
Once $\widehat W$ is obtained, we form the estimator $\widehat F(t)$ from (2.15). A standard calculation shows that it has a Gaussian limiting distribution with variance $(U W U^{\rm T})^{-1}$. We call this estimator the correlated weighted least squares (CWLS) estimator.
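Given $\widehat W$ and the group-wise KM estimates, computing the CWLS estimator is a single generalized least squares step. The sketch below is a minimal illustration (the construction of $\widehat W$ itself, which requires $\widehat\rho$, is assumed already done, and the function name is ours):

```python
import numpy as np

def cwls_estimate(H_hat, u, W):
    """Correlated weighted least squares estimate of F(t):
    minimizes {H - U^T F}^T W {H - U^T F}, with U = (u_1, ..., u_m).
    The limiting variance of the estimator is proportional to (U W U^T)^{-1}.
    """
    U = np.asarray(u, float).T            # p x m, columns are the u_l's
    H = np.asarray(H_hat, float)
    return np.linalg.solve(U @ W @ U.T, U @ W @ H)
```

When the group-level estimates $\widehat H_l(t)$ are exactly consistent with some $F(t)$ through $H_l(t) = u_l^{\rm T} F(t)$, the GLS step recovers that $F(t)$ for any positive definite $W$.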

Simulation studies
We now demonstrate the finite sample performance of the BLRE, BQIF and CWLS methods via two simulation studies. The first simulation is a relatively simple one used to illustrate the theoretical properties derived in Section 2. In the second simulation, we generate a more complex data set, where we increase the number of event distributions, consider larger families, and vary the family sizes. For both simulations, we generate 1000 data sets.
In the first simulation, we set the sample size $N = 1000$, $p = 2$, $m = 5$ and $n_i = 4$ for $i = 1, \dots, N$. The two ($p = 2$) true functions $F_1(t)$ and $F_2(t)$ are the distribution functions of two truncated exponential densities with rates 3 and 5, respectively, supported on $[0, 10]$. To generate correlated survival times for members of the same family, we implement the following procedure. For the $i$th family, we generate a random vector $(S_{i1}^1, \dots, S_{i4}^1, S_{i1}^2, \dots, S_{i4}^2)$ from a Clayton copula with parameter 10, which provides the $j$th member of the $i$th family with potential survival times $S_{ij}^1$ and $S_{ij}^2$, corresponding to $F_1(t)$ and $F_2(t)$. We select $S_{ij} = S_{ij}^1$ or $S_{ij}^2$ with the probabilities in the $q_{ij}$ vector, where $q_{ij}$ takes five ($m = 5$) different values, $(0.9, 0.1)^{\rm T}$, $(0.6, 0.4)^{\rm T}$, $(0.4, 0.6)^{\rm T}$, $(0.2, 0.8)^{\rm T}$ and $(0.15, 0.85)^{\rm T}$, with probabilities 0.3, 0.3, 0.2, 0.1 and 0.1, respectively. Lastly, we generate the censoring time from a uniform distribution on $(0, 5.4)$, resulting in a censoring rate of approximately 50%, and create $Y_{ij} = \min(S_{ij}, C_{ij})$ and $\Delta_{ij} = I(S_{ij} \le C_{ij})$. We implement the CWLS method of Section 2.5, use Algorithms 1 and 2 to carry out the BLRE and BQIF methods with $R = 6$, and use $B = 500$ bootstrap repetitions to estimate the variance-covariance matrices $U$ (BLRE) and $M$ (BQIF). We summarize the results at $t = 2.5$ in Table 1. As a comparison, we also implemented the method of Ma and Wang (2014) [11], which we call the "MW" method. From Table 1, it is clear that Algorithms 1 and 2 produce very similar results, which concurs with our theoretical result on the asymptotic equivalence of the two methods in Section 2.4.
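The family-level data generation described above can be sketched as follows, using the standard gamma-frailty (Marshall-Olkin) construction to sample from a Clayton copula. This is an illustrative reconstruction of the design, not the authors' code; the function names and the truncated-exponential inversion are our own.

```python
import numpy as np

def clayton_family(rng, n_members, theta):
    """Draw n_members Uniform(0,1) margins with Clayton(theta) dependence
    via the gamma-frailty (Marshall-Olkin) construction."""
    v = rng.gamma(1.0 / theta, 1.0)          # shared frailty for the family
    e = rng.exponential(size=n_members)
    return (1.0 + e / v) ** (-1.0 / theta)

def simulate_family(rng, q, rate1=3.0, rate2=5.0, theta=10.0, cmax=5.4):
    """One family's observations for the first simulation design:
    correlated latent times for both populations (Exp truncated to [0,10]),
    population chosen per member with probabilities q[j], uniform censoring.
    """
    n = len(q)
    u1 = clayton_family(rng, n, theta)       # copula draws, population 1
    u2 = clayton_family(rng, n, theta)       # copula draws, population 2

    def inv_trunc_exp(u, rate):
        # inverse CDF of Exp(rate) truncated to [0, 10]
        p10 = 1.0 - np.exp(-rate * 10.0)
        return -np.log(1.0 - u * p10) / rate

    s1, s2 = inv_trunc_exp(u1, rate1), inv_trunc_exp(u2, rate2)
    carrier = rng.random(n) < np.array(q)[:, 0]   # population-1 indicator
    s = np.where(carrier, s1, s2)
    c = rng.uniform(0.0, cmax, size=n)
    return np.minimum(s, c), (s <= c).astype(int)
```

Repeating `simulate_family` over $N$ families, with the $q_{ij}$ vectors drawn according to the stated mixing probabilities, reproduces the structure of the first simulated data set.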
For all methods considered here, the means of the 1000 estimates are very close to the true function values, the sample standard deviations of the 1000 estimates are very close to the averages of the estimated standard deviations, and the coverage rates of the 95% confidence intervals are close to the nominal value. BLRE, BQIF and CWLS reduce the mean squared error (MSE) by 27% for $F_1(t)$ and 30% for $F_2(t)$, indicating a large improvement in estimation efficiency. The estimation results for the entire functions $F_1(t)$ and $F_2(t)$ are given in Figure 1, where the mean estimated curves almost overlap with the true curves. The lower and upper bounds of the 95% confidence bands of $F_1(t)$ and $F_2(t)$ can be separated under our proposed methods, while they overlap with each other under the Ma and Wang (2014) [11] method.

In the second simulation, we set $N = 800$, $m = 5$, $p = 3$ and generate 50, 250, 150, 100, 150, 50 and 50 families with sizes $n_i$ = 8, 10, 12, 14, 15, 16 and 18, respectively. We set $F_1(t)$, $F_2(t)$ and $F_3(t)$ as the distribution functions of truncated exponential densities with rates 5, 9 and 11 on $[0, 10]$. As in simulation 1, we generate correlated survival times for members of the same family from a multivariate distribution. Specifically, for the $i$th family, we first generate a vector $(S_{i1}^1, \dots, S_{in_i}^1, S_{i1}^2, \dots, S_{in_i}^2, S_{i1}^3, \dots, S_{in_i}^3)$ from a Clayton copula with parameter 20. The survival time $S_{ij}$ of the $j$th member of the $i$th family is then set to $S_{ij}^1$, $S_{ij}^2$ or $S_{ij}^3$, corresponding to $F_1(t)$, $F_2(t)$ and $F_3(t)$, with probabilities $q_{ij1}$, $q_{ij2}$ and $q_{ij3}$. Here $q_{ij} = (q_{ij1}, q_{ij2}, q_{ij3})^{\rm T}$ is set to $(1.00, 0.00, 0.00)^{\rm T}$, $(0.60, 0.40, 0.00)^{\rm T}$, $(0.00, 0.20, 0.80)^{\rm T}$, $(0.20, 0.00, 0.80)^{\rm T}$ and $(0.30, 0.70, 0.00)^{\rm T}$, with mixing percentages 15.92%, 15.92%, 26.37%, 20.90% and 20.90%. We generate the censoring time from a uniform distribution on $(0, 5.4)$, resulting in a censoring rate of around 37%. Finally, we let $Y_{ij} = \min(S_{ij}, C_{ij})$ and $\Delta_{ij} = I(S_{ij} \le C_{ij})$.
We implement Algorithms 1 (BLRE) and 2 (BQIF) with $R = 15$, the CWLS method, and the method of Ma and Wang (2014) [11]. We use $B = 600$ bootstrap repetitions to estimate $U$ and $M$. The simulation results at $t = 1.5$ are summarized in Table 2. From Table 2, we find that Algorithms 1 and 2 again produce similar results, validating our theoretical finding in Section 2.4 regardless of the distribution and correlation structure used to generate the data. The means of the 1000 estimates are fairly close to the true values, the averages of the 1000 estimated standard errors are close to the sample standard errors of the 1000 estimates, and the coverage rates of the 95% confidence intervals are close to the nominal level. Compared with the method of Ma and Wang (2014) [11], the CWLS method reduces the MSE by 52.3%, 45.2% and 43.9% for $F_1(t)$, $F_2(t)$ and $F_3(t)$, respectively, while BLRE and BQIF reduce the MSE by 32.0% for $F_1(t)$, 33.6% for $F_2(t)$ and 27.8% for $F_3(t)$, indicating quite large improvements. The estimation results for the entire functions $F_1(t)$, $F_2(t)$ and $F_3(t)$ are given in Figure 2; the mean curves of the 1000 simulations and the true distribution curves overlay. The 95% confidence bands of $F_1(t)$ (red) and $F_2(t)$ (blue) in the left panel have an overlapping area, while in the middle and right panels they are better separated. In addition, there is a gap between the 95% confidence bands of $F_1(t)$ (red) and $F_3(t)$ (green) in the middle and right panels, but not in the left panel.

Figure 2. Simulation study 2. Curves of $F_1(t)$ (red), $F_2(t)$ (blue) and $F_3(t)$ (green): true CDFs (solid grey) and means of the estimated CDFs (dashed). Left: assuming independence as in Ma and Wang (2014) [11]; Middle: CWLS; Right: BLRE.

Real data example
In the Cooperative Huntington's Observational Research Trial (COHORT, Dorsey and the Huntington Study Group COHORT Investigators, 2012 [3]), initial participants (probands) were sequenced for huntingtin gene expansion status: subjects with a CAG repeat length greater than 36 are HD mutation carriers and will eventually develop HD. The proband participants also provided their relatives' phenotype information, such as age-at-death if deceased. However, the relatives were not genotyped due to practical difficulties in collecting blood samples (especially from deceased individuals). It is of interest to estimate the distribution functions using the relatives' phenotypes only, while avoiding the potential ascertainment bias incurred when recruiting proband participants. Comparing survival functions of gene-expanded and non-expanded subjects is essential for understanding the disease risk associated with a causal mutation, for timing interventions in the disease progression course, and for genetic counseling.
The COHORT data include 771 families with varying numbers of first-degree relatives in each family, for a total of 3661 individuals. The barplot in Figure 3 characterizes the distribution of the family sizes. Using the available relationship (parent, child, sibling) between each family member and his/her proband, we calculated the probability of each family member carrying the huntingtin gene mutation. We obtained three ($m = 3$) different $q_{ij}$ values in total, $(1.0, 0.0)^{\rm T}$, $(0.5, 0.5)^{\rm T}$ and $(0.0, 1.0)^{\rm T}$, with frequencies 558, 1805 and 1298, respectively. Denote the distribution function of age-at-death in the mutation carrier population by $F_1(t)$ and in the non-carrier group by $F_2(t)$; our goal is to estimate $F(t) = \{F_1(t), F_2(t)\}^{\rm T}$. The COHORT data have approximately 29% censoring, and we assume the censoring time is independent of the event time. The estimated $F_1(t)$ and $F_2(t)$, with their 95% confidence bands, are provided in Figure 4. It is clear that huntingtin gene mutation carriers have much lower survival rates than non-carriers, especially in the age range 50 to 90. This indicates that the detrimental effect of Huntington's disease on survival is most severe in the mid-to-old age range. The difference in survival probability is present as early as age 40, which coincides with the mean HD onset age (Foroud et al. 1999 [5]). The result is also consistent with the clinical observation that the majority of gene carriers die between ages 45 and 70 (Foroud et al. 1999 [5]).
As a comparison, we also performed the analysis of Ma and Wang (2014) [11], where the within-family correlation is ignored. We present the estimated curves of $F_1(t)$ and $F_2(t)$ and their 95% pointwise confidence bands in the upper left panel of Figure 4. From these plots, we can see that for estimating the distribution function in the carrier population, the confidence band of MW appears wider than those of the other three methods. To quantify this observation, we calculated the integrated confidence interval width and obtained the values 0.0577, 0.0400, 0.0544 and 0.0545 for MW, CWLS, BLRE and BQIF, respectively. This corresponds to reductions of 30.6%, 5.6% and 5.5% for CWLS, BLRE and BQIF in comparison with MW, indicating the improved efficiency of the three proposed methods. Our methods also provide an assessment of the familial correlation, which is estimated to be smaller than 0.1 across all ages and shows a generally decreasing trend: the correlation is 0.0822 at age $t = 40$, decreases to 0.046 by age 65, and almost vanishes beyond age 70.

Discussion
In this paper, we propose methods to account for within-family correlation in mixture data from multiple populations, where the population label is only known up to a probability. Such data arise from kin-cohort studies, such as the COHORT study of HD, and from studies of other disorders, for instance, breast cancer (Wacholder et al. 1998 [12]) or Parkinson's disease (Wang et al. 2008 [13]). The proposed estimators are easy to implement and assume either unstructured or exchangeable correlation. The finite sample efficiency of the estimators under unstructured correlation relies on both the number of resamples and the bootstrap size. Although in theory large values of both are preferable, in practice one can gradually increase these values and stop when the improvement becomes sufficiently small. Compared with ignoring the correlation, the proposed estimators provide an efficiency gain, as observed in the simulation studies and real data analysis (Sections 3 and 4).
The relative performance of the proposed BLRE, BQIF and CWLS methods in terms of variance reduction depends on many factors, such as the sample size, family sizes, true underlying model and covariate values. The practical efficiency gain from accounting for the correlation via the optimal estimator can be somewhat limited in a real data example, especially when the sample size is large. However, other studies with smaller sample sizes and a lower expected number of diseased subjects in the carrier group could benefit more from our method through improved estimation efficiency.
Here, $p$ denotes the number of latent distributions, and it is typically small in practice. In our application, the goal is to estimate the survival function in HD gene mutation carriers, as compared to non-carriers. For HD, which is an autosomal dominant disease, $p = 2$. In general, the mode of inheritance is specified as autosomal dominant ($p = 2$), autosomal recessive ($p = 2$), or additive ($p = 3$), depending on the nature of the inherited disorder. Thus, $p$ is often rather small in these applications. However, when $p$ becomes large in other settings, the proposed model and methods are not suitable, because the data do not contain enough information to estimate many latent distributions without stronger parametric assumptions; parametric or semiparametric models and methods are then needed to enable estimation under a reasonable sample size.
Lastly, we point out that one way to relax the exchangeable correlation assumption is to break large extended families into nuclear families and assume a hierarchical model for the correlation structure for members within a nuclear family and across the extended family.

A.1. Proof of Theorem 1

It is easy to verify that $I - U^{-1/2} J (J^{\rm T} U^{-1} J)^{-1} J^{\rm T} U^{-1/2}$ is an idempotent matrix, hence it is positive semi-definite. Therefore, $A U A^{\rm T} - (J^{\rm T} U^{-1} J)^{-1}$ is also positive semi-definite.
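The step connecting the idempotent matrix to the variance difference can be made explicit. Using the constraint $A J = I_p$,

```latex
\begin{aligned}
A U A^{\mathrm{T}} - (J^{\mathrm{T}} U^{-1} J)^{-1}
&= A U^{1/2}\bigl\{I - U^{-1/2} J (J^{\mathrm{T}} U^{-1} J)^{-1}
   J^{\mathrm{T}} U^{-1/2}\bigr\} U^{1/2} A^{\mathrm{T}},
\end{aligned}
```

because $A U^{1/2}\, U^{-1/2} J (J^{\mathrm{T}} U^{-1} J)^{-1} J^{\mathrm{T}} U^{-1/2}\, U^{1/2} A^{\mathrm{T}} = A J (J^{\mathrm{T}} U^{-1} J)^{-1} (A J)^{\mathrm{T}} = (J^{\mathrm{T}} U^{-1} J)^{-1}$. The right-hand side has the form $B (I - P) B^{\mathrm{T}}$ with $P$ symmetric and idempotent, and is therefore positive semi-definite for every admissible $A$, with equality at $A = A_{\mathrm{opt}}$.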

A.2. Proof of Theorem 2
Taking the derivative of the quadratic form (2.9) with respect to $F(t)$ and omitting the higher order terms, we obtain the estimating equation (A.1). In the following, we first investigate $N^{-1/2} \sum_{i=1}^N g_i(t)$, re-organizing the summation by rewriting (2.8). Viewing the $q_i^r$'s as random quantities, $N^{-1/2} \sum_{i=1}^N g_i(t)$ is a normalized sum of independent and identically distributed mean zero random quantities, hence it converges to a mean zero normal distribution whose variance we denote $M$.
A standard Taylor expansion of (A.1) then yields the asymptotic expansion of $\widehat F(t)$. A derivation similar to that in the proof of Theorem 1 shows that the optimal choice of the weight matrix is $W = M^{-1}$, and the resulting variance is $(J^{\rm T} D^{-1} M^{-1} D^{-1} J)^{-1} = (J^{\rm T} U^{-1} J)^{-1}$.