Genome-Wide Linkage Analysis of Large Multiple Multigenerational Families Identifies Novel Genetic Loci for Coronary Artery Disease

Coronary artery disease (CAD) is the leading cause of death, and genetic factors contribute significantly to the risk of CAD. This study aims to identify new CAD genetic loci through a large-scale linkage analysis of 24 large, multigenerational families with 433 family members (GeneQuest II). All family members were genotyped with markers spaced approximately every 10 cM, and a model-free nonparametric linkage (NPL-all) analysis was carried out. Two highly significant CAD loci were identified on chromosomes 17q21.2 (NPL score of 6.20) and 7p22.2 (NPL score of 5.19). We also identified four loci with significant NPL scores between 4.09 and 4.99 on 2q33.3, 3q29, 5q13.2 and 9q22.33. Similar analyses in individual families confirmed the six significant CAD loci and identified seven new highly significant linkages on 9p24.2, 9q34.2, 12q13.13, 15q26.1, 17q22, 20p12.3, and 22q12.1, and two significant loci on 2q11.2 and 11q14.1. Two loci, on 3q29 and 9q22.33, were also replicated in our previous linkage analysis of 428 nuclear families. Moreover, two risk variants previously reported by GWAS, SNP rs46522 in UBE2Z and SNP rs6725887 in WDR12, were found within the 17q21.2 and 2q33.3 loci, respectively. These studies lay a foundation for future identification of causative variants and genes for CAD.

Multipoint NPL analysis was further performed. Multipoint NPL scores were plotted along the genetic map for each of the 22 autosomes (Figs 2 and 3). Multipoint NPL analysis identified four significant genetic loci for CAD on chromosomes 17q21.2, 7p22.2, 2q33.3 and 3q29. The top CAD locus on chromosome 17q21.2 identified by the two-point linkage analysis remained a highly significant linkage peak, with an NPL score of 5.38 by multipoint NPL analysis. The CAD locus on 17q21.2 covered a genetic interval from 56.9 cM to 83.1 cM (Fig. 3). The second best CAD locus identified by multipoint NPL analysis was on 7p22.2, with an NPL score of 4.74, and the linkage covered an interval between 1.4 cM and 11.0 cM (Fig. 2). Compared with the two-point NPL scores, the multipoint NPL scores of the six CAD loci decreased slightly, except for the 3q29 locus, whose NPL score improved from 4.00 to 4.49.
Moreover, both two-point and multipoint NPL analyses were carried out in individual families. Each of the 6 significant CAD loci was found to occur in at least one individual family (Table 3). NPL scores in single families were, in general, higher than those in the combined families (Table 3). In addition, this analysis identified 15 new linkages for CAD, including 7 highly significant linkages on chromosomes 12q13.13 ...

Caucasian, %                                  100
Pedigree structure:
  No. of pedigrees                            24
  Pedigree size, n (mean ± SD)                18.04 ± 10.55
  Pedigree size, n (min, median, max)         5, 15, 38
No. of relative pairs:
  Sibling/sibling, n                          398
  Sister/sister, n                            154
  Brother/brother, n                          105
  Brother/sister, n                           139
  Half sibling/half sibling, n                0

Potential CAD-related genes underlying six significant CAD loci. To explore candidate genes for CAD under the six significant genetic loci identified in the combined GeneQuest II families, we annotated all genes underlying each linkage. Genetic intervals of the six linkages were converted to physical locations according to the genetic maps generated by the HapMap 2 project (lifted over to hg19). RefSeq genes located under the six linkages were retrieved from the UCSC database (Track: RefSeq Genes; Assembly: GRCh37/hg19) and then evaluated for a potential relationship with cardiovascular diseases using the online program DisGeNET 42,43. Counts of RefSeq genes and of gene-disease pairs with a score of >0.001 are summarized in Table 5.

Table 3. Six Genetic Loci for CAD Confirmed by GWLS in Individual GeneQuest II Families.
a The genetic map position was based on Marshfield Medical Genetic marker set 11.
b Physical genomic position was retrieved from the UCSC database with human build GRCh37/hg19.
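
The conversion of genetic intervals (cM) to physical positions (Mb) described above can be illustrated with linear interpolation along a genetic map. The following is a minimal sketch only: the map points in `TOY_MAP` and the function names are hypothetical, the values are invented for illustration, and it does not reproduce the HapMap 2 map or the exact procedure used in this study.

```python
from bisect import bisect_right

# Hypothetical genetic map for one chromosome: (cM, Mb) pairs sorted by cM.
# These numbers are invented for illustration; they are NOT HapMap 2 values.
TOY_MAP = [(0.0, 0.0), (50.0, 30.0), (60.0, 36.0), (80.0, 55.0), (100.0, 70.0)]


def cm_to_mb(cm, genetic_map):
    """Linearly interpolate a physical position (Mb) from a genetic position (cM)."""
    cms = [c for c, _ in genetic_map]
    if not cms[0] <= cm <= cms[-1]:
        raise ValueError("position lies outside the genetic map")
    i = bisect_right(cms, cm)
    if i == len(cms):  # cm coincides with the last map point
        return genetic_map[-1][1]
    (c0, m0), (c1, m1) = genetic_map[i - 1], genetic_map[i]
    return m0 + (cm - c0) * (m1 - m0) / (c1 - c0)


def interval_to_physical(start_cm, end_cm, genetic_map):
    """Convert a linkage interval given in cM to a (start_Mb, end_Mb) pair."""
    return cm_to_mb(start_cm, genetic_map), cm_to_mb(end_cm, genetic_map)


if __name__ == "__main__":
    print(interval_to_physical(50.0, 80.0, TOY_MAP))
```

Because recombination rates vary along a chromosome, the same genetic length can correspond to very different physical lengths, which is why a dense reference map is needed rather than a single cM-to-Mb ratio.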

Discussion
Identification of new genetic loci for CAD is critical for addressing the important issue of "missing heritability" in the field of genetics and for fully elucidating the genetic basis of CAD. In this study, we report a unique genome-wide linkage scan for CAD in 24 large, multigenerational families from a well-characterized U.S. cohort (GeneQuest II). We carried out a model-free NPL-all scan and identified six susceptibility loci for CAD on chromosomes 2q33.3, 3q29, 5q13.2, 7p22.2, 9q22.33 and 17q21.2. It is interesting to note that the 3q29 and 9q22.33 loci were previously identified by us in a genome-wide linkage scan for CAD in 428 nuclear families in the GeneQuest population 40. Suggestive evidence of linkage to the 3q29 CAD locus (P = 2.0 × 10⁻⁴) was also found in a meta-analysis of four GWLS in Finnish, Mauritian, German, and Australian cohorts 44. Therefore, the present study provides strong validation of the 3q29 and 9q22.33 linkages for CAD using an independent, large family-based linkage scan, suggesting that these two loci can be prioritized for identifying the underlying causative genes for CAD. Candidate genes for CAD at the 3q29 and 9q22.33 loci are listed in Table 5. There are 31 unique RefSeq genes annotated within the CAD locus on 3q29. DisGeNET analysis identified 5 genes related to cardiovascular diseases (Table 5). The UTS2B gene encodes Urotensin IIB and was shown to play a role in the acceleration of atherosclerosis development. Increased human Urotensin II levels were observed in hypertension, diabetes, atherosclerosis and CAD 45. There are 20 unique genes within the 9q22.33 locus, and three genes (TGFBR1, NR4A3, and INVS) were linked to cardiovascular diseases (Table 5). TGFBR1 encodes transforming growth factor beta receptor 1, a receptor for TGFβ1, and an increase in active TGFβ1 levels was correlated with both the occurrence and severity of CAD 46.
The four other CAD loci on 2q33.3, 5q13.2, 7p22.2 and 17q21.2 are all novel. The chromosome 17q21.2 linkage is the most significant locus for CAD identified in this study. The 17q21.2 CAD locus was initially identified at marker D17S1299 at the position of 62.01 cM with a two-point NPL score of 6.20 and a multipoint NPL score of 5.38 (Table 2). This CAD locus spans a large genetic interval of 26.2 cM (corresponding to 34.40-57.50 Mb) (Fig. 3). Within the 17q21.2 CAD locus, we found SNP rs46522, a CAD-risk variant that was identified in 2013 by a large-scale GWAS for CAD 12 and is located about 8 Mb away from D17S1299. SNP rs46522, located in the UBE2Z-GIP-ATP5G1 gene cluster, exhibited a strong cis-eQTL (expression quantitative trait locus) effect on UBE2Z in whole blood samples and on ATP5G1 in left ventricle samples according to the GTEx database v6 47. In addition, we identified a set of 514 unique RefSeq genes within the 17q21.2 CAD locus; 77 of them were linked to cardiovascular diseases based on data from DisGeNET (Table 5). In particular, CCL3 and CCL4, which encode the small CC chemokines known as macrophage inflammatory protein 1α and 1β, respectively, are well recognized as key mediators of both diabetes and atherosclerotic cardiovascular disease 48. Elevated expression levels of both CCL3 and CCL4 were found in atherosclerotic lesions in ApoE−/− mice 49. Leukocyte-derived CCL3 can induce neutrophil chemotaxis toward the atherosclerotic plaque, causing accelerated lesion formation 50. CCL4 was also upregulated in atherosclerotic plaques in stroke patients 51. NR1D1 is also a candidate gene for CAD. It is located 600 kb from marker D17S1299, encodes a member of the nuclear receptor superfamily, and regulates genes involved in triglyceride metabolism, inflammation and the pathogenesis of atherosclerosis 52. NR1D1 can regulate apolipoprotein APOC3 via binding to the proximal promoter 53.
Future studies may focus on these strong candidate genes to identify causative genes that contribute to the risk of CAD in families.
The second most significant linkage for CAD, on 7p22.2, was identified with marker D7S3056 at a position of 7.44 cM (physical position: 4.49 Mb) with a two-point NPL score of 5.19 and a multipoint NPL score of 4.74 (Table 2). This is a novel locus for CAD. No GWAS variants were found to be located within the 7p22.2 locus. The closest GWAS SNP for CAD was rs2023938 in HDAC9, which is located at 7p21.1 13 (Table 5). SDK1 was found to be associated with hypertension in the Japanese population 54. GPER1 encodes a multi-pass membrane protein that is localized to the endoplasmic reticulum, and Gper1 knockout mice showed increased atherosclerosis progression and vascular inflammation 55.
... Fig. 2). GWAS found that SNP rs6725887 in WDR12, which is only 1.48 Mb from marker D2S1384, was associated with early-onset MI and ischemic stroke 7,12,57 at a genome-wide significance level. Moreover, DisGeNET analysis identified 21 genes related to cardiovascular disease (Table 5). PDE1A encodes a cyclic nucleotide phosphodiesterase, and differential expression of PDE1A was observed in human epicardial adipose tissues from male patients affected with CAD 58. TFPI encodes tissue factor pathway inhibitor, which regulates the tissue factor (TF)-dependent pathway of blood coagulation 59. An elevated plasma TFPI level was significantly associated with the presence and severity of CAD 60,61. TFPI expression can be regulated by ADTRP, a CAD susceptibility gene identified by our group 17.
The 5q13.2 locus was mapped at marker GATA138B05 at 78.80 cM (or 71.40 Mb) and spanned an interval of 5.1 cM (4.95 Mb) (Fig. 2). This is a novel locus for CAD. DisGeNET analysis identified 8 genes linked to cardiovascular diseases at the 5q13.2 locus (Table 5). PIK3R1 encodes Phosphoinositide-3-Kinase Regulatory Subunit 1 and was predicted to be a cardiovascular disease-related gene by a network topology analysis 62. PIK3R1 is a target of miR-221, and a recent small RNA sequencing analysis revealed that the miR-221-PIK3R1 pair was deregulated in late endothelial progenitor cells (late EPCs) of CAD patients 63. CCNB1 encodes a regulatory protein involved in mitosis, and a recent study showed that genetic variants in CCNB1 contributed to the risk of restenosis of intracoronary stents 64.
The results above demonstrate that linkage analysis with fewer but larger pedigrees can achieve performance comparable to that of an analysis with hundreds of small nuclear families. As shown in Table 2, the 3q29 and 9q22.33 CAD loci were identified both by GWLS with 24 large families (GeneQuest II) and by a similar analysis with 428 nuclear families in the GeneQuest population 40. Our results also suggest that GWLS has power comparable to that of GWAS. The 2q33.3 and 17q21.2 CAD loci, which were identified by the GWLS with 24 large families here (GeneQuest II) and are represented by D2S1384 and D17S1299, respectively, contain CAD-risk SNPs identified by GWAS (rs6725887 at 2q33.3 and rs46522 at 17q21.2) (Table 2, Figs 2 and 3). Therefore, we conclude that increasing the number of family members within individual families can markedly improve the power for identifying disease linkage and loci. These data also suggest that our GeneQuest II database is a promising resource for identifying novel risk genes for CAD. Future studies on fine mapping and targeted sequencing may uncover causative variants or genes for CAD at the CAD loci identified in this study.
We also carried out genome-wide linkage analysis in each GeneQuest II family and found that each of the six significant CAD loci identified in the combined family cohort (Table 2) was also identified in at least one individual family (Table 3). For example, the top two CAD loci on chromosomes 17q21.2 and 7p22.2 were observed in two families (the best NPL score = 16.81) and three individual families (the best NPL score = 12.42), respectively. Moreover, individual family-based analyses also identified 15 new linkages in 5 families that were not captured by joint linkage analyses of the 24 GeneQuest II families, including 7 highly significant linkages, 2 significant linkages, and 6 suggestive linkages (Table 4). None of the 15 new genetic loci have been previously reported for CAD. Of interest, the two top-ranked CAD loci, on 12q13.13, represented by D12S297 (multipoint NPL score = 6.76), and on 17q22, represented by D17S1290 (multipoint NPL score = 6.51), were linked to the CAD-associated traits of body mass index (BMI) 65 and metabolic factors 66.

CCL3, CCL4, CCL3L3, CCL4L1, CCL4L2, CCL3L1, TADA2A, PLXDC1, TCAP, PNMT, PGAP3, ERBB2, IKZF3, CSF3, MED24, THRA, NR1D1, CCR7, KRT12, KRT20, GAST, HAP1, JUP, FKBP10, CNP, KCNH4, HCRT, STAT5B, STAT5A, STAT3, ATP6V0A1, MLX, RAMP2, WNK4, BECN1, AOC3, BRCA1, SOST, PYY, G6PC3, HDAC5, GRN, ITGA2B, FZD2, ADAM11, GJC1, CCDC103, GFAP, HEXIM1, MAP3K14, CRHR1, MAPT, WNT3, GOSR2, MYL4, ITGB3, MRPL10, PNPO, MIR10A, UBE2Z, GIP, IGF2BP1, B4GALNT2, ZNF652, NGFR, ITGA3, PDK2, SGCA, COL1A1, XYLT2, CACNA1G, LUC7L3, NME1, MMD, ...

Although this study identified a list of significant CAD loci, it has several limitations. First, the density of microsatellite markers in this study was low (10 cM per marker). Future fine mapping studies may be carried out with additional markers surrounding the microsatellite polymorphisms used for linkage analysis or with SNP microarrays that offer a much higher marker density.
Single SNPs may not be as informative as microsatellite markers for linkage analysis because they are bi-allelic, but haplotypes constructed from multiple SNPs may be treated as multi-allelic markers 67. Fine mapping will help determine whether the linkage loci overlap in different families, narrow the linked regions (if shared) and eventually reduce the number of candidate genes for some loci. Moreover, fine mapping with SNP arrays may allow us to compare the SNP linkage data with the top hits from previous GWAS and identify new SNPs associated with CAD. Similarly, ongoing whole genome sequencing may be another powerful approach to capture SNPs or causal variants associated with CAD in the 24 GeneQuest II families. Second, we highlighted 3-77 genes at each CAD locus based on evidence from the existing literature in order to illustrate the relevance of each CAD locus to the etiological process of CAD. However, the causal genes responsible for each linkage may have been overlooked in this study (Table 5). Third, the 24 GeneQuest II families were of European descent, and it is likely that some significant CAD loci may not generalize to other ethnic populations.
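The point about marker informativeness can be made concrete with expected heterozygosity, H = 1 − Σp², a common measure of how often a marker distinguishes the two alleles a person carries. The sketch below is illustrative only: the allele frequencies are hypothetical, and the haplotype calculation assumes the SNPs are in linkage equilibrium, which real, closely spaced SNPs often are not.

```python
from itertools import product


def expected_heterozygosity(freqs):
    """Return H = 1 - sum(p_i^2) for a list of allele (or haplotype) frequencies."""
    if abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("frequencies must sum to 1")
    return 1.0 - sum(p * p for p in freqs)


def haplotype_frequencies(snp_minor_freqs):
    """Haplotype frequencies for bi-allelic SNPs, ASSUMING linkage equilibrium."""
    per_snp = [(1.0 - q, q) for q in snp_minor_freqs]
    haplotypes = []
    for combo in product(*per_snp):
        f = 1.0
        for p in combo:
            f *= p
        haplotypes.append(f)
    return haplotypes


if __name__ == "__main__":
    # A single bi-allelic SNP cannot exceed H = 0.5.
    print(expected_heterozygosity([0.5, 0.5]))
    # A hypothetical microsatellite with 8 equally frequent alleles.
    print(expected_heterozygosity([1 / 8] * 8))
    # A haplotype of three hypothetical SNPs, each with allele frequency 0.5.
    print(expected_heterozygosity(haplotype_frequencies([0.5, 0.5, 0.5])))
```

Under these idealized assumptions a three-SNP haplotype reaches the same heterozygosity as the eight-allele microsatellite; linkage disequilibrium between neighbouring SNPs would reduce that figure in practice.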

Methods
Study participants. Twenty-four large, extended, and multigenerational CAD families were recruited at the Center for Cardiovascular Genetics of the Cleveland Clinic. The study was referred to as GeneQuest II to distinguish it from the original GeneQuest study, which recruited more than 428 nuclear families, mostly for sib-pair analysis. The GeneQuest II study started in 2001 and is completely independent of the earlier GeneQuest study carried out between 1995 and 2000. This study was reviewed and approved by the Cleveland Clinic Institutional Review Board (IRB) on Human Subject Research, and conformed to the guidelines set forth by the Declaration of Helsinki. Written informed consent was obtained from all participants.
Clinical phenotypic evaluation of study participants was carefully carried out by a panel of cardiologists. The presence or absence of CAD was assessed according to coronary angiography with >70% stenosis, a history of revascularization procedures such as percutaneous coronary angioplasty (PCA) or coronary artery bypass grafting (CABG), and a previous diagnosis of myocardial infarction (MI) as described 35,68,69. Families or patients with hypercholesterolemia, insulin-dependent diabetes, childhood hypertension, or congenital heart disease were excluded from this study. Each family had at least four definitively diagnosed CAD patients, and the average pedigree size was 18. Clinical and demographic features of the 24 GeneQuest II CAD families with 433 family members are summarized in Table 1. All recruited family members were Caucasians. The distinguishing features of the GeneQuest II cohort are large families with three or more generations, 100% white participants and a well-balanced male versus female ratio (209/224). A total of 398 sibling pairs were generated in this cohort, including 154 sister/sister pairs, 105 brother/brother pairs and 139 brother/sister pairs. In contrast to the sib-pair analysis of 428 nuclear families in our previous study 40, genome-wide linkage analysis was carried out using all family members instead of sibling pairs only, given the large pedigrees collected in GeneQuest II (Fig. 1).
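As a small illustration of how sibling-pair counts such as those above arise, the sketch below counts the pairs contributed by a single sibship and checks that the three reported categories add up to the reported total. The function name and the example sibship are hypothetical; only the totals (154, 105, 139 and 398) come from the text.

```python
from math import comb


def sib_pair_counts(n_sisters, n_brothers):
    """Count the sibling pairs contributed by one sibship, by sex combination."""
    return {
        "sister/sister": comb(n_sisters, 2),
        "brother/brother": comb(n_brothers, 2),
        "brother/sister": n_sisters * n_brothers,
    }


# Pair counts reported for the whole GeneQuest II cohort.
REPORTED = {"sister/sister": 154, "brother/brother": 105, "brother/sister": 139}
REPORTED_TOTAL = sum(REPORTED.values())

if __name__ == "__main__":
    # Hypothetical sibship of 3 sisters and 2 brothers: 3 + 1 + 6 = 10 pairs.
    print(sib_pair_counts(3, 2))
    # The three reported categories sum to the reported total of 398 pairs.
    print(REPORTED_TOTAL)
```

The number of pairs grows quadratically with sibship size, which is one reason a few large sibships can contribute as many pairs as many small ones.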
Extraction of human genomic DNA and genotyping. Whole blood samples were drawn from each study participant. Genomic DNA was isolated using the Gentra Puregene Blood Kit (QIAGEN, Valencia, CA, USA). All DNA samples were quantified using a NanoDrop 2000 (Thermo Scientific, Wilmington, DE, USA) and inspected for quality by agarose gel electrophoresis.
Genome-wide genotyping was performed by the Mammalian Genotyping Service of the National Heart, Lung, and Blood Institute, directed by Dr. James L. Weber at the Center for Medical Genetics at Marshfield Clinic (http://research.marshfieldclinic.org/genetics/GeneticResearch/screeningsets.asp), using Screening Set 11. The screening set consists of 410 microsatellite markers spanning the whole human genome at an average spacing of 10 cM.
Linkage analysis. Prior to linkage analysis, raw genotyping data were cleaned as described in our previous studies 35,40. In brief, genotypes with non-consensus calls were re-genotyped or deleted. Microsatellite markers on sex chromosomes were excluded. Parents without genotype data were added to the pedigrees, with their genotypes treated as missing values, to complete the family structures (Fig. 1) for linkage analysis. Mendelian inconsistencies were detected using MARKERINFO, a program built into the S.A.G.E. (Statistical Analysis for Genetic Epidemiology) software 70. Genotypes with Mendelian errors were excluded from further genome-wide linkage analysis by Genehunter version 2.1_r2 beta 71. Relationships between family members (i.e., sibling pairs, parent-offspring trios) within each family were verified by the RELTEST program included in the S.A.G.E. software package 70. The RELTEST program did not detect any inconsistent family relationships. Allele frequencies for all microsatellite markers were estimated with the FREQ module in S.A.G.E. from pooled samples from all of our existing family studies. The program Mega2 72 was used to generate the input format required for Genehunter version 2.1_r2 beta 71. Affected and unaffected individuals were coded as "2" and "1", respectively, whereas individuals with an uncertain phenotype were coded as "0". The principle of the Genehunter linkage analysis is to examine any excess of identity-by-descent 73 allele-sharing among all affected subjects within a family.
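The Mendelian-consistency screening and phenotype coding described above can be illustrated for a single parent-offspring trio at one marker. This is a minimal sketch under stated assumptions, not the algorithm used by MARKERINFO: genotypes are represented as hypothetical pairs of allele labels, a missing genotype is represented by `None`, and the names `AFFECTION_CODE` and `mendelian_consistent` are invented for the example.

```python
# Affection-status coding as described in the text: 2 = affected,
# 1 = unaffected, 0 = uncertain phenotype.
AFFECTION_CODE = {"affected": 2, "unaffected": 1, "uncertain": 0}


def can_transmit(parent, allele):
    """A parent can transmit an allele if it carries it or is untyped (None)."""
    return parent is None or allele in parent


def mendelian_consistent(child, father, mother):
    """Check one trio at one autosomal marker.

    Each genotype is a tuple of two allele labels, or None when missing.
    The child is consistent if one allele can come from the father and the
    other from the mother.
    """
    if child is None:
        return True
    a, b = child
    return (can_transmit(father, a) and can_transmit(mother, b)) or (
        can_transmit(father, b) and can_transmit(mother, a)
    )


if __name__ == "__main__":
    father, mother = (3, 4), (5, 6)
    print(mendelian_consistent((3, 5), father, mother))  # consistent
    print(mendelian_consistent((3, 7), father, mother))  # allele 7 unexplained
    print(mendelian_consistent((3, 3), father, None))    # mother untyped
```

A genotype flagged by such a check would be set to missing before linkage analysis, which corresponds to the exclusion step described in the text.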
We used the NPL-all statistic within Genehunter version 2.1_r2 beta for linkage analysis, which examines all individuals in the 24 GeneQuest II families simultaneously and provides a more powerful test (www.broad.mit.edu/ftp/distribution/software/genehunter/). Without specifying a disease transmission model, non-parametric linkage (NPL) analysis was carried out to jointly analyze the genotype data of all 24 GeneQuest II families. The linkage between CAD and a genetic marker was evaluated by calculating the NPL score Z, which is the weighted sum of standardized identity-by-descent allele-sharing scores across multiple families. Under the null hypothesis of no linkage, Z has mean 0 and variance 1 when appropriate weighting factors are chosen. The statistical significance of Z can be inferred by comparing the observed Z against its null distribution. Two types of NPL scores were calculated for each marker: 1) a two-point NPL score, which examined whether a single marker was linked to CAD; 2) a multipoint NPL score, which investigated whether a group of markers was linked to CAD. The advantage of the multipoint approach is its capability of incorporating the information of adjacent markers into linkage analysis (making markers more informative). The NPL-all linkage analysis was also carried out individually in each of the 24 GeneQuest II families. The larger an NPL score is, the stronger the linkage it indicates. As suggested by Lander and Kruglyak 74, linkage peaks were defined in three categories: (1) Highly significant linkage: NPL of 4.99 (or P value of 3 × 10⁻⁷); (2) Significant linkage: NPL of 4.08 (or P value of 2.2 × 10⁻⁵); (3) Suggestive linkage: NPL of 3.18 (or P value of 7.4 × 10⁻⁴).
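The relationship between the combined NPL score, its P value and the three linkage categories can be sketched as follows. This is an illustration under stated assumptions rather than Genehunter's implementation: per-family scores are combined with equal weights of 1/√m (so that the combined score has unit variance under the null), P values use a one-sided standard normal approximation, and each threshold is read as "at least" the stated NPL value. The function names are hypothetical.

```python
import math


def combined_npl(family_scores, weights=None):
    """Weighted sum of standardized per-family scores.

    With weights whose squares sum to 1, the combined score has mean 0 and
    variance 1 under the null. Equal weights 1/sqrt(m) are ASSUMED by default.
    """
    m = len(family_scores)
    if weights is None:
        weights = [1.0 / math.sqrt(m)] * m
    if abs(sum(w * w for w in weights) - 1.0) > 1e-9:
        raise ValueError("squared weights must sum to 1")
    return sum(w * z for w, z in zip(weights, family_scores))


def one_sided_p(z):
    """One-sided P value under a standard normal approximation."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def classify(z):
    """Map an NPL score to the three categories used in this study."""
    if z >= 4.99:
        return "highly significant"
    if z >= 4.08:
        return "significant"
    if z >= 3.18:
        return "suggestive"
    return "not significant"


if __name__ == "__main__":
    for score in (6.20, 5.19, 4.49, 3.18):
        print(score, classify(score), f"{one_sided_p(score):.2e}")
```

Under the normal approximation, the scores 4.99, 4.08 and 3.18 correspond closely to the P values of 3 × 10⁻⁷, 2.2 × 10⁻⁵ and 7.4 × 10⁻⁴ quoted in the text.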