A fast linkage method for population GWAS cohorts with related individuals

Linkage analysis, a class of methods for detecting co‐segregation of genomic segments and traits in families, was used to map disease‐causing genes for decades before genotyping arrays and dense SNP genotyping enabled genome‐wide association studies in population samples. Population samples often contain related individuals, but the segregation of alleles within families is rarely used because traditional linkage methods are computationally inefficient for larger datasets. Here, we describe Population Linkage, a novel application of Haseman–Elston regression as a method of moments estimator of variance components and their standard errors. We achieve additional computational efficiency by using modern methods for detection of IBD segments and variance component estimation, efficient preprocessing of input data, and minimizing redundant numerical calculations. We also refined variance component models to account for the biases in population‐scale methods for IBD segment detection. We ran Population Linkage on four blood lipid traits in over 70,000 individuals from the HUNT and SardiNIA studies, successfully detecting 25 known genetic signals. One notable linkage signal that appeared in both was for low‐density lipoprotein (LDL) cholesterol levels in the region near the gene APOE (LOD = 29.3, variance explained = 4.1%). This is the region where the missense variants rs7412 and rs429358, which together make up the ε2, ε3, and ε4 alleles each account for 2.4% and 0.8% of variation in circulating LDL cholesterol. Our results show the potential for linkage analysis and other large‐scale applications of method of moments variance components estimation.


| INTRODUCTION
Linkage analysis jointly models the inheritance of a trait and genetic material in a family. One group of methods developed to study linkage of quantitative traits uses variance-component models to relate identical-bydescent (IBD) sharing to phenotype similarity (Blangero et al., 2001). When used for linkage analysis, the variance components model usually also contains a component for genomic region-specific effects influencing the trait (Amos, 1994). Using this model, it is possible to construct a test for genetic linkage by estimating and testing this variance component for genomic region-specific effects.
A variety of algorithms exist for estimating and testing variance components for linkage analysis. These include: iterative methods for fitting variance components using maximum likelihood (Lange et al., 1983), restricted maximum likelihood (Van Arendonk et al., 1998), and generalized estimating equations (Amos, 1994). In addition to these iterative methods, Haseman and Elston developed a regression-based approach to estimate variance components for the specific application of inferring genetic linkage in sibling pairs (Haseman & Elston, 1972). Haseman-Elston regression fits variance components by a methods of moments estimator derived from the expectation of either the product, squared sum, or squared difference of pairs of observations (Sham & Purcell, 2001). Their approach has since been generalized to linkage analysis in other relationship types (Sham et al., 2002) and to estimating genetic variance components in unrelated individuals (G. B. Chen, 2014).
The most popular linkage methods currently available to researchers were developed when genotype data were relatively sparse and expensive to collect. With the current ability to assess genetic variation at thousands to millions of variants using genotyping arrays or short-read sequencing at much lower cost, genome-wide association scans (GWAS) have become much more widespread as a method for gene-mapping. GWAS continue to report novel variant-trait associations as genotyping technology continues to improve and more individuals are recruited. Nevertheless, linkage analysis can outperform GWAS in the presence of population structure or allelic heterogeneity (Minster et al., 2015). Linkage also continues to be a useful tool to associate traits with complex variation that is difficult to genotype-like structural variants, copy number variants (Kathiresan et al., 2007), variants in highly repetitive regions (Mousavi et al., 2019), or variants at loci exhibiting epistatic interaction (Hodge et al., 2016).
As the cost of genotyping has fallen dramatically in recent decades, the speed of linkage methods has lagged behind the relative computational efficiency of GWAS in keeping up with the size and structure of modern data sets. To illustrate, consider MERLIN, a widely used implementation of a variance-components linkage method . To run linkage analysis with MERLIN in an old-order Amish pedigree with 364 individuals and genotypes at 1991 microsatellite markers, the single large pedigree had to be split into nuclear families to complete the analysis in a tractable amount of time (Georgi et al., 2014). In addition, classical linkage methods like MERLIN are limited in their ability to model allele sharing between distant relatives when genotypes for intervening relatives are not available (Thompson, 2019).
Some linkage methods use information outside of pedigrees as well. Day-Williams et al. (2011) proposed a method to reconstruct pedigrees and perform linkage analysis using genotype data, although their approach requires pedigrees that can be uniquely reconstructed. A related approach is the KELVIN method for combined linkage and association analysis in pedigrees (Vieland et al., 2011), which was successfully used to study autism (Piven et al., 2013) and musical ability (Oikkonen et al., 2015).
Genomic segments shared IBD by distant relatives are both detectable and potentially informative for mapping disease-causing genes (Donnelly, 1983) but are frequently ignored by the classical linkage methods described above. To address this, Glazner and Thompson (2015) proposed a method for linkage analysis in sets of sparse relatives without pedigree information by building a graph of IBD segments shared by multiple individuals for each locus. They then used a Markov chain Monte Carlo (MCMC) approach similar to Lange and Sobel (1991) and Tong and Thompson (2008) to calculate a likelihood of the observed traits given this graph and evaluate evidence of linkage using a permutation approach. While solving the problem of linkage in sparse relatives without pedigrees, the proposed algorithm for IBD estimation, graph construction, and likelihood calculation scaled poorly to examples with more than a few dozen individuals (Thompson, 2019).
Motivated by the opportunity to perform linkage analysis on related individuals in large GWAS cohorts and by the limitations of existing methods, we developed Population Linkage, a fast method to perform variance-component linkage analysis on hundreds of traits with arbitrarily related individuals genotyped on arrays. IBD and kinship estimation only need to be performed once for a set of individuals and the same estimates used to fit variance components for many traits as shown in Lange and Sobel (1991). We do this with fast methods to estimate IBD (Manichaikul et al., 2010;Young et al., 2022) and fit variance components using Haseman-Elston regression (X. Zhou, 2017). The resulting estimates are then used to test for linkage and calculate LOD scores, a standard yardstick for the strength of genetic linkage signals. Our method uses only the estimated relatedness and IBD segregation between pairs of individuals. Despite the minimal information in individual SNPs for inferring recombination events in linkage analysis compared to more traditional, highly polymorphic microsatellite markers (Evans & Cardon, 2004), our use of IBD estimates effectively combines this information across multiple variants. Our method does not require prior knowledge of pedigree information or reconstruction of pedigrees. We intend for Population Linkage to complement GWAS by providing additional insights for the proportion of trait variance explained by a region and to capture the effects of ungenotyped variation, allelic heterogeneity, and epistatic effects that might otherwise be missed.

| MATERIALS AND METHODS
Population Linkage has four basic steps: (1) preparing the input data, including estimating kinship and IBD regions for the cohort; (2) running diagnostic tests to select the appropriate variance-components model; (3) running the linkage analysis using efficient and (4) processing the results. In this section, we will describe these steps in detail, along with various improvements in IBD estimation and Haseman-Elston regression that satisfy our goal of achieving scalability to large datasets. Figure 1 is a flow chart that outlines some of the major steps in this method.

| Notation
First, we introduce relevant notation. Assume we have genotype information on n individuals, and values for a quantitative trait, y. For each pair of individuals, i, j, their relatedness can be summarized by their kinship ϕ ij , which is the probability that two randomly sampled alleles (one from each individual) are IBD. The full kinship matrix for all n n ( − 1)/2 pairs of individuals is denoted as Φ. Two alternate summaries of the relationships between individuals include p ij , the total proportion of DNA that is shared in IBD segments, and c ijl , the total proportion of chromosome ends that are IBD in the first and last l megabases of each chromosome. These two additional summaries are important in ensuring calibration of our method as they enable us to cope with biases of population-based IBD estimates. We denote the full matrices of p ij and c ijl as P, and C l , respectively. The full set of IBD segments are contained in S, where each element indicates the IBD status (1 or 2) and start and end position of the segment for a pair of individuals. The IBD status (0, 1, or 2) for individuals i and j at marker m can also be indicated as d ijm , with the full matrix of all individuals' IBD sharing at marker m as D m . ZAJAC ET AL. | 233

| The data
The first step for Population Linkage is to prepare the input data. These consist of quantitative trait values from the cohort of interest and estimates of genetic relationships and allelic segregation (in the form of IBD estimates) between all pairs of individuals.

| Traits
We prepare raw trait values y* for analysis by regressing on relevant covariates like age, sex, medication usage, and principal components, and then inverse-normalizing the residuals. This approach has the benefit of reducing the effect of extreme observations and will be helpful for obtaining a cross product yy T that does not depend on the mean and variance of y*.

| IBD segments
To estimate the set of all autosomal IBD segments S in our cohort, we used a particularly fast strategy for detecting these segments, implemented in the software package KING with the -ibdseg option (Young et al., 2022). From Ŝ, we calculate the matrix D m at the unique endpoints of all IBD segments in Ŝ where there is a change in IBD status.

| Proportion of IBD
From the collection of estimated autosomal IBD segments Ŝ we calculate the genome-wide proportion P of genetic material estimated to be shared IBD between each pair of individuals. P is preferred over pedigreecalculated kinship in this context as a measure of F I G U R E 1 A flow chart of Population Linkage. A flow chart showing the basic steps for researchers to run Population Linkage on their own data. genome-wide genetic similarity since pedigree information can be absent, incomplete, or incorrect, and actual realized IBD sharing in some instances diverges from theoretical expectations by a large amount (Thomson & McWhirter, 2017;Wang et al., 2017).

| Kinship
A single statistic for genome-wide genetic similarity between pairs of individuals can fail to capture all the subtleties and nuances that exist in genetic relationships. One challenge presented by P is that it is biased downward for distant relatives since the above method for estimating S does not consider IBD segments less than 2.5 Mb or 64 markers in length. For this reason, we also use the genotype-based kinship estimator Φ implemented in the KING software package (Manichaikul et al., 2010). This estimator is computationally efficient even with tens of thousands of individuals. In addition, it does not depend on allele frequencies and is robust in cohorts with population structure.

| Proportion of IBD chromosome ends
There are additional challenges to estimating IBD segments at the ends of chromosomes. The end of each chromosome physically truncates the length of IBD segments, commercial genotyping arrays typically have lower marker density near chromosome ends, and the higher recombination rate in telomeres results in shorter IBD segments. These factors all lead to downward bias in D m toward the ends of chromosomes (Young et al., 2022), and importantly, the shared segments that can be identified tend to be concentrated in closer relatives who often have more similar trait values because of shared nongenetic but familial factors.
For these reasons, for all pairs of individuals we estimate C ijl , the proportion of chromosome ends (of length l) that contain an IBD segment between individuals i and j. Ĉ ijl is the sum of the highest observed IBD status in the first and last l Mb of each chromosome in Ŝ between individuals i and j, divided by twice the number of diploid chromosomes.

| Statistical model
We begin this section by describing our framework for variance-components estimation and linkage analysis using cross-product Haseman-Elston regression.
where K a is a matrix of genetic relationships that has been centered so all rows and columns sum to 0 and scaled by the mean of its diagonal terms, and I is an n-byn identity matrix. Since each K a is scaled, σ K 2 a can also be interpreted as the proportion of variance explained (PVE) and a is the proportion of trait variance attributed to environmental effects and individual variability.
We obtain point estimates for σ σ , …, by crossproduct Haseman-Elston regression using the following formulas from X. Zhou (2017): And the standard errors: Equation (5) is an asymptotic approximation for q V ( ) and corresponds to an n-fold speedup (complexity O k n ( ) 2 3 to O k n ( ) 2 2 ) for estimating the standard errors compared to using the expected information matrix (X. Zhou, 2017). Both the point estimates and standard errors are implemented in the GEMMA software package (X. Zhou, 2017;X. Zhou & Stephens, 2012).

| VC linkage model
The full variance-components model for linkage analysis at a particular marker m is The tilde over the matrices reflects that these are the centered and scaled versions of these matrices and not the original estimates. The fitting procedure for these models to obtain point estimates and standard errors of the variance components are the same as Equations (2)-(7). This fit must then be repeated at all marker locations m that will be tested for linkage.
is asymptotically normally distributed (X. Zhou, 2017) we compute its probability on the upper tail of the standard normal distribution to obtain a one-sided p-value for linkage. Logarithm of odds (LOD) is a traditional statistic for the strength of evidence for (or against) genetic linkage (Morton, 1955). For Population Linkage, we define f here denotes the probability density function (PDF) of the standard normal distribution. From this likelihood, we can derive a simple equation for LOD score We set LOD= 0 when ≤ σ 0 D2 m because negative estimates for variance components do not constitute evidence for linkage. We use the established threshold of LOD>3 for genome-wide significance which approximately corresponds to a one-sided p-value of 10 −4 (Annunen et al., 1999;Risch, 1991).

| Linkage analysis integration with GWAS
After completing the linkage analysis and calculating LOD scores across the genome, we report which loci show evidence of linkage with the trait of interest. Since the region over which LOD is greater than 3 can be large and span many loci tested with many small increases and decreases in LOD score, we report a specific site as a linkage peak if its LOD is greater than 3 and greater than the LOD scores of the two adjacent sites to the right and the two adjacent sites to the left. This peak marker then becomes the focus for follow-up analyses.
Once GWAS results have been generated for the same data used in the linkage analysis, we extract all GWAS variants within 5 Mb of each linkage peak and choose the one with the smallest p-value for comparison. We chose to focus our search for GWAS variants within 5 Mb of linkage peaks because the 2.5 Mb minimum length to detect IBD segments limited the resolution of linkage peaks.

| Computational approach
The previous sections describe how Population Linkage uses variance components estimated by Haseman-Elston regression to perform a genome-wide linkage analysis on population level data. Despite this improvement in scalability over full-likelihood linkage methods based on the Elston-Stewart (1971) or the Lander-Green (1987) algorithm, Population Linkage must still deal with several large n n × matrices as input and iterate over a dense marker map that can make the analysis timeconsuming and challenging.
Our first strategy for improving the runtime of Population Linkage is to limit the number of sites tested to the unique endpoints of estimated IBD segments in Ŝ. We only fit Haseman-Elston regression and test for linkage at start and end coordinates in the set of Ŝ. This can reduce the number of sites tested to only those with distinct patterns of IBD sharing. To further limit the number of tests, one can optionally decide to test for linkage at fixed, evenly spaced intervals across the genome.
Our next strategy for improving the runtime of Population Linkage was to manage how our software processes input data. Because the large size of Φ , P , Ĉ l , and D m , makes reading these inputs computationally burdensome, we try to minimize input and output by keeping quantities that will be re-used later in memory. We also calculate quantities that can be re-used across markers only once. For example, the first k − 1 terms in the q vector from Equation (2) and the first k − 1 rows and columns of the S matrix from Equation (3), and the intermediate calculation y can be computed once and reused across markers or analysis positions.
While high-performance computing systems can generally handle keeping the matrices Φ̃, P̃, or Cl, in RAM for repeated use, when there are a large number of IBD segments in Ŝ it may not be feasible to store all these in RAM to calculate D̃m and fit variance-components at every marker m. To remedy this issue, our process relies on sorting IBD segments and then gradually reading a small portion of IBD segment changes at a time, updating the matrix D m , calculating linkage statistics, and then proceeding to load another small series of IBD segments after discarding previous ones that are no longer relevant.
We also simplify the linkage model with 4 variance components by fitting variance components for Φ̃, P̃, and Cl jointly without D̃m and then reweighting and combining into a single composite matrix to run linkage analysis with D̃m, a 2-variance component model. This approach simplifies the linkage analysis because the matrices Φ̃, P̃, and Cl do not depend on m and their fit relative to one another only needs to be calculated once rather than repeated at all markers m. The rest of the analysis proceeds identically to the linkage analysis with the full model.

| Implementation
The method we used for kinship and pairwise IBD estimation was implemented in KING versions 2.1.2 and higher (W. M. Chen et al., 2017). The Haseman-Elston regression in Equations 1-7 was implemented in GEMMA 0.96 (X. Zhou, 2017), which we modified to directly read in the output from KING, incorporate the computational enhancements described in this paper, and calculate LOD scores for linkage. This modified version and source code are available at https://github. com/gjmzajac/GEMMA-population-linkage.

| Experimental data: SardiNIA
The SardiNIA project (Pilia et al., 2006) is an ongoing study of a population isolate in the Lanusei valley in the Italian island of Sardinia. This data set comprises information on 6602 individuals with genotype data at 18,754,911 autosomal variants that passed imputation quality filters (Nielsen et al., 2020). All samples were genotyped on commercial genotyping arrays and a subset of 3445 of these samples were also sequenced and used to impute the remaining genotypes in the rest of the samples (Chiang et al., 2018;Gagliano et al., 2019;Pistis et al., 2015;Sidore et al., 2015). We estimated IBD and kinship in the full set of variants with KING 2.2. The data set also included information on LDL cholesterol, HDL cholesterol, total cholesterol, and triglycerides together with covariates of sex, age, and age 2 . The SardiNIA study team also shared GWAS summary statistics generated by running EMMAX (Kang et al., 2010) with default settings on these same data.

| Experimental data: HUNT
To see how well our method could scale to larger cohorts we chose to test for linkage in the HUNT study (Krokstad et al., 2013). Because of the multigenerational time scale and high participation rate in this sparsely populated region of Norway, the HUNT cohort contains a very large number of family relationships that can be inferred using the genetic data and used to test for genetic linkage. Genetic samples for 69,716 individuals were collected from the HUNT2 (1995( -1997( ) and HUNT3 (2006 population-based health surveys of all adults in the Nord-Trøndelag region in Norway (Brumpton et al., 2021). All samples were genotyped on a version of the IlluminaHumanCore-24 Exome array with custom content (Illumina, 2017). Genotypes at 359,432 autosomal variants phased with SHAPEIT2 (Delaneau et al., 2013) and imputed dosages at 45,453,131 autosomal variants (Brumpton et al., 2021) were available for population linkage and GWAS analyses. We estimated IBD and kinship with the 359,432 genotyped autosomal variants using KING 2.1.3. As in SardiNIA, we focused our analyses on LDL cholesterol (estimated by the Friedewald formula), HDL cholesterol, total cholesterol, and triglycerides. We also used body mass index (BMI) as a control phenotype to help with model selection. As covariates, we used genetic principal components 1-4, genotyping batch, age at time of measurement, and sex.
To select a model for linkage analysis as described, we randomly sampled 25,000 HUNT participants and carried out a series of preliminary analyses with 1000 equally spaced tests on a set of simulated trait values with no true linkage signals. We simulated a correlated phenotype for our cohort with covariance based on the genotype relatedness matrix (GRM) estimated by GEMMA 0.96 on the complete set of phased genotype data. After generating the linkage results and determining significant loci for each trait, we compared these with GWAS results generated using the SAIGE package (W. Zhou et al., 2018). In addition to these GWAS in the HUNT data, we also obtained summary statistics from the Global Lipids Genetics Consortium (GLGC) metaanalysis of these four lipid traits in 1.6 million individuals of European ancestry across 75 million variants .

| SardiNIA
We identified 44,006 putative relative pairs of 3rd degree or closer and a total of 316,729,132 IBD segments in the SardiNIA data set. 97% of the individuals had at least one putative relative of 3rd degree or closer. All but seven individuals shared at least one segment IBD with another individual. Table 1 summarizes the number of relative pairs, total and average number of IBD segments, and total and average length of IBD segments for each relationship type. The vast majority of estimated IBD segments (98.9%) were shared among distant relatives of 3rd degree or greater. These IBD segments would have been excluded in a linkage analysis that split the SardiNIA cohort into smaller pedigrees (Liu et al., 2008).
We next fit single-variance component models using estimated kinship, genome-wide IBD proportion, and proportion of 1 Mb chromosome ends that were IBD for each of the four lipid traits: high-density lipoprotein cholesterol (HDL), low-density lipoprotein cholesterol (LDL), total cholesterol (TC), and triglycerides (TG) to calculate the proportion of variance explained (PVE), and genomic-control (GC) lambda values from linkage analyses of 1000 equally spaced sites. The full results of these tests are reported in Table 2. Using the genome-wide IBD proportion matrix resulted in low GC lambda values (HDL 1.2, LDL 1.0, TC 1.1, and TG 0.9) and visual inspection of the LOD plots (e.g., see Figure 2) did not reveal any T A B L E 1 Relatedness and IBD sharing statistics in SardiNIA. Note: A table comparing the impact of different choices of single variance components on the proportion of variance explained (PVE) and genomiccontrol lambda (GC lambda) in a linkage analysis of the phenotypes highdensity lipoprotein cholesterol (HDL), low-density lipoprotein cholesterol (LDL), total cholesterol (TG), and triglycerides (TG) levels. "IBD prop" refers to the proportion of IBD shared genome-wide and "Chr ends" refers to average IBD sharing in the first and last Mb of each chromosome.

Degree
problems. We therefore used the genome-wide IBD proportion along with marker-specific IBD status to run linkage analysis of the four lipid traits at larger numbers of equally spaced sites and determine significant linkage peaks for SardiNIA. After calculating LOD scores for linkage in these 4 lipid traits, we had two significant peaks, one for the trait LDL and one for TC, both on chromosome 19 near the gene APOE (LDL LOD 3.9, PVE 5.0%; TC LOD 3.5, PVE 4.7%). These linkage peaks are reported in Table 3. The SardiNIA study  previously associated the missense variants rs7412 and rs429358 in the gene APOE with variation in LDL and TC. rs429358 is wellknown as a variant that disrupts APOE function in lipid transport and metabolism, influencing risk for Alzheimer's disease, macular degeneration, and other traits (Jiang et al., 2008;Liutkeviciene et al., 2018). The original analysis showed that rs7412 and rs429358 were independent signals for both LDL and TC, with R 2 values of 2.4% and 0.8% for LDL and 1.7% and 0.5% for TC. Our linkage analysis estimated that IBD sharing in the APOE locus explained 5.0% of the variance in LDL and 4.7% for TC. This suggests that our linkage test captured the effects of both these variants and additional genetic variation in the region that influences LDL and TC levels that remains undetected through GWAS.
Our ability to observe this linkage signal for LDL and TC near APOE was not impacted by limiting the number of equally spaced sites tested to 1000, 5000, 10,000, or 20,000 equally spaced genetic markers, with our method yielding LOD > 3 in each analysis (Table 3). While the number of sites tested impacted the largest observed LOD score in the region, calculating LOD scores at only 1000 markers to improve runtime was sufficient to detect the strongest linkage signals present. This result is consistent with previous findings that only a few thousand loci needed to be tested in a genome-wide linkage scan of dense SNP genotypes (Matise et al., 2003).
Our linkage analysis successfully replicated the strongest GWAS signal for LDL in the gene APOE and showed elevated LOD scores near other GWAS peaks in HBB, PCSK9, and other genes (Figure 2). This result indicates that Population Linkage captures genetic effects in known lipid genes beyond APOE and that rerunning this analysis with a larger sample size might yield additional verifiable linkage peaks with LOD > 3.

| HUNT
We estimated 451,864 relative pairs of 3rd degree or closer and a total of 6,867,367,662 IBD segments in the HUNT data set. Ninety-three percent of the individuals had at least one relative of 3rd degree or closer. The IBD estimates resulted in a total of 279,100 unique IBD segment endpoints. Table 4 gives a summary of the number of pairs of parent-offspring, full sibling, 2nd degree, 3rd degree, and more distant relationships, with the average and total number and length of estimated IBD segments in these relationship types. The vast majority (99.6%) of estimated IBD segments were in more distant relatives of 3rd degree or greater.
A preliminary linkage analysis of the null simulated phenotype with genome-wide IBD proportion revealed severe inflation in the test statistics at the ends of chromosomes (Figure 3, top) where IBD estimates were significantly biased toward close relatives (Figure 3, bottom). A diagnostic plot revealed an almost perfect correlation between this increased kinship of relatives IBD at the ends of chromosomes with the estimated PVE (Pearson's correlation 0.99, Figure 4), driving up the LOD scores in those regions. Linkage analysis of LDL, BMI, and the simulated phenotype with different choices of variance components in a randomly sampled subset of 25,000 samples revealed that simultaneously using estimated kinship, genome-wide IBD proportion, and IBD in 0.5 Mb chromosome ends resulted in GC lambda values near 1 (1.06 LDL, 1.15 BMI, 0.91 Simulation, Tables 5-7). Reweighting and combining estimated kinship, IBD proportion, and IBD in 0.5 Mb chromosome ends into a single composite matrix resulted in z-scores for linkage that were almost perfectly correlated (r > 0.999 for LDL, BMI, and simulated trait) and LOD scores that were, on average, 0.0103, 0.0098, and 0.0124 lower for LDL, BMI, and the simulated trait, respectively. We concluded that using the combined matrix would adequately control for inflation at a negligible cost in statistical power while achieving similar runtime to the preliminary linkage analysis with only the genome-wide IBD proportion. F I G U R E 3 HUNT inflation example. Top: Logarithm of odds (LOD) plot from linkage analysis of the null simulated phenotype across 69,716 HUNT samples, using genome-wide pairwise IBD proportion and region-specific IBD sharing as variance components. Bottom: The average kinship values for individuals IBD across the genome. The phenotype below is low-density lipoprotein (LDL) (n = 67,429) so 2287 samples in the simulation above do not appear in the plot below but otherwise the plots would be identical since the values of kinship and estimated IBD status at each marker do not depend on the phenotype.
F I G U R E 4 Proportion of variance explained versus mean kinship of IBD pairs. The estimated proportion of variance explained by IBD sharing at 1000 sites from a linkage analysis of the HUNT null simulation phenotype plotted against the average kinship values for individuals IBD at each locus tested.
Once we had decided on this model with estimated kinship, genome-wide IBD proportion, and IBD sharing at the ends of chromosomes, we proceeded to run linkage analysis on the four lipid traits (HDL, LDL, TC, and TG) at the full sample size. The sample sizes of our traits ranged from 67,429 for LDL to 69,479 for TG. Running each trait on one CPU core used an average of 133 GB RAM over 11 h, 35 min for reweighting and combining matrices and 81 GB RAM over 11 days, 20 h for linkage analysis.
We observed a total of 25 significant linkage peaks with LOD > 3 across 19 distinct loci for the four traits. HDL had seven significant linkage peaks ( Figure 5), LDL 9 ( Figure 6), TC 7 (Figure 7), and TG 2 ( Figure 8). All these peaks and supporting GWAS evidence for them are reported in Table 8. The majority of our significant linkage peaks were in established lipid loci that were easily replicated by our GWAS of genotyped and imputed HUNT variants and the GLGC meta-analysis. Our strongest signals were between the trait LDL and the region of chromosome 19 near the gene APOE (LOD 29.3, PVE 4.0%) and HDL and the region of chromosome 16 near the gene CETP (LOD 30.2, PVE 4.3%) both well-known genes for lipid regulation (Freeman & Remaley, 2016) and were supported in the HUNT GWAS of genotyped variants, imputed variants, and the GLGC meta-analysis (Table 8).
In addition to confirming known lipid genes, it was of interest to know where Population Linkage could provide additional insights beyond GWAS. There were five peaks with LOD > 3 which were not replicated at genome-wide significance in the HUNT GWAS of 359,432 genotyped variants. All of them were replicated in either the HUNT GWAS of imputed variants or the GLGC meta-analysis (Table 8). Since these five peaks did not have a genome-wide significant variant among single-variant tests of 359,432 genotyped variants that were used in the linkage analysis but were confirmed in either the HUNT GWAS of imputed variants or the GLGC meta-analysis, these results suggest that Population Linkage is able to detect linkage signals in ungenotyped variants.

| DISCUSSION
In this paper, we introduced Population Linkage, which has enabled linkage analysis to be performed on larger sample sizes and in less time than ever before. We have demonstrated the feasibility of genome-wide linkage analysis on tens of thousands of individuals with hundreds of thousands of markers with our method, finding that this approach is able to replicate known associations in lipid traits and identify loci missed by GWAS of the same data. Sample relatedness is a nearly unavoidable situation in population or case-control cohorts of GWAS scale, prompting the development of an entire class of methods for linear-mixed-model GWAS to correct and control for the effect of sample relatedness (Kang et al., 2008;Kang et al., 2010 A table comparing the impact of different combinations of variance components on the proportion of variance explained (PVE) and genomic-control lambda (GC) in a linkage analysis of the phenotypes low-density lipoprotein (LDL), body mass index (BMI), and the null simulation. All results are from the random n = 25,000 subset. "IBD prop" refers to the proportion of IBD shared genome-wide and "Chr ends" refers to average IBD sharing in the first and last Mb of each chromosome.
2012; W. Zhou et al., 2018). By contrast, our method provides the opportunity to use this relatedness to map traits to genetic loci, often with greater variance explained in a region than the top associated SNP in GWAS. Additionally, because our method does not make any distributional assumptions about the phenotype and Haseman-Elston regression had been successfully extended to binary traits (Golan et al., 2014;Zeegers et al., 2003), we are hopeful that our method can be similarly adapted to perform linkage on binary traits with thousands of cases and controls. In comparison to GWAS, Population Linkage offers potential advantages including estimation of the proportion of variance explained by all genetic variation in a region (rather than individual, observed variants), robust analysis in the presence of untyped genetic variation or population structure, and the ability to leverage cryptic relatedness in a data set. Despite the lower power of Population Linkage compared to GWAS at most loci, there are situations where we expect Population Linkage analysis will offer advantages -including for trait loci with substantial allelic heterogeneity. To illustrate, we highlight the relationship between breast cancer and BRCA1. Variation in BRCA1 has been identified as a risk factor for breast cancer in many linkage studies (Easton et al., 1993;Hall et al., 1990;Miki et al., 1994) but as of July 2022 there are still no GWAS reporting significant associations between breast cancer and BRCA1 (Buniello et al., 2019). The capacity of Population Linkage to scale linkage analysis to larger data with more loosely related individuals offers additional potential to characterize these types of traits and loci with linkage analysis. Population Linkage scales better than methods that attempt to estimate IBD jointly (and more accurately!) across many individuals (see, e.g., Glazner & Thompson, 2015). While some power is lost by analyzing pairwise IBD rather than joint IBD shared across many individuals, pairwise IBD is a convenient choice for fitting in the variance-components model used here. One of the purported benefits of linkage analysis over GWAS is the ability to test for linkage in untyped genetic variation, as long as IBD segments in the region can be identified. Our results for HDL, LDL, and TC from the HUNT study support this claim since we observed five linkage signals with LOD > 3 where the GWAS of genotyped variants had no nearby genome-wide significant loci but GWAS with additional variants and/or samples did. In addition, our results for linkage in the APOE region in the SardiNIA study show how a test for linkage, even at a single marker, can capture the effects of multiple variants in the region that influence LDL levels. Since the inclusion of an individual SNP has little effect on the IBD estimates calculated in a region, this feature of linkage analysis would extend to capturing the effects of ungenotyped variants that influence the trait as well. Significant linkage findings can then give impetus to additional targeted sequencing, methylation analyses, and other functional characterization in purported linked regions to yield a more complete catalog of variant effects.
This work opens new possibilities for how linkage analysis might be further applied and developed in the future. These could include incorporating new algorithms for fitting genetic variance components even more quickly (Hou et al., 2019;Pazokitoroudi et al., 2020), linkage analysis with shorter IBD segments at higher resolution via whole-genome sequencing data, or recent innovations in finding shorter IBD segments in distant, apparently unrelated, individuals (Delaneau et al., 2019). Most importantly, a seamless integration of variance components linkage analysis and GWAS into a joint mixed effects model can increase the power and utility of both. Such an approach can not only combine the signal from both linkage and association for a more powerful test to find additional novel associations, but also opens the possibility of testing a variant for association conditioned on the genetic background of individuals in a region. This can help to fine-map causal variants in a region that shows evidence for association and further our understanding of how they impact biology and disease risk. F I G U R E 7 HUNT total cholesterol logarithm of odds (LOD) scores. LOD plot from linkage analysis of total cholesterol (TC) levels from HUNT.

ACKNOWLEDGMENTS
This work was supported by HG007022. The authors thank Xiang Zhou for support with using GEMMA, and Wei-Min Chen for support with using KING and for fixing bugs the authors had found while preparing this paper. The SardiNIA project is supported in part by the intramural program of the National Institute on Aging through contracts N01-AG-1-2109 and HHSN271201100005C to the Consiglio Nazionale delle Ricerche of Italy. The SardiNIA authors would like to thank the CRS4 HPC group in Italy for the computational infrastructure supporting this project, and all volunteers who generously participated in the study. The Trøndelag Health Study (HUNT) is a collaboration between HUNT Research Centre (Faculty of Medicine and Health Sciences, Norwegian University of Science and Technology NTNU), Trøndelag County Council, Central Norway Regional Health Authority, and the Norwegian Institute of Public Health. The genetic investigations of the HUNT Study are a collaboration between researchers from the K.G. Jebsen Center for Genetic Epidemiology, University of Michigan Medical School, and University of Michigan School of Public Health. The K.G. Jebsen Center for Genetic Epidemiology is financed by Stiftelsen Kristian Gerhard Jebsen; Faculty of Medicine and Health Sciences, NTNU, Norway.