Genome-wide analysis of runs of homozygosity identifies new susceptibility regions of lung cancer in Han Chinese

Runs of homozygosity (ROHs) are an important but poorly studied class of genomic variation and may be involved in individual susceptibility to disease. To better understand ROHs and their relationship with lung cancer, we performed a genome-wide ROH analysis of a subset of a previous genome-wide case-control study (1,473 cases and 1,962 controls) in a Han Chinese population. Using existing genome-wide single nucleotide polymorphism (SNP) data, we classified ROHs by length into two classes, intermediate and long, and evaluated their association with lung cancer risk. The overall level of intermediate ROHs was significantly associated with a decreased risk of lung cancer (odds ratio = 0.63; 95% confidence interval: 0.51-0.77; P = 4.78×10−6), whereas long ROHs appeared to be a risk factor for lung cancer. We also identified one ROH region at 14q23.1 that was consistently associated with lung cancer risk in the study. These results indicate that ROHs may be a new class of variation associated with lung cancer risk, and that genetic variants at 14q23.1 may be involved in the development of lung cancer.

A run of homozygosity (ROH) is defined as a continuous, uninterrupted stretch of genomic sequence without heterozygosity in the diploid state. In general, very long ROHs are the by-product of recent inbreeding or chromosomal abnormality, whereas the origin of relatively shorter ROHs is still disputed. It has also been hypothesized that ROHs may result from linkage disequilibrium (LD) [12,13]. However, increased LD in the vicinity of a given variant is neither necessary nor sufficient for a series of variants to be included in an ROH [14]. Emerging evidence suggests that ROHs may represent a novel, independent characteristic of the genome [15]. With the development of single nucleotide polymorphism (SNP) genotyping technology, ROH studies have become more feasible. Several studies have reported that ROHs are widely but not randomly distributed in outbred human genomes [12,14,16] and have been implicated in multiple complex diseases [12,17-24]. In the current study, we attempted to clarify the association of ROHs with the development of lung cancer by using a case-control study of 1,473 cases and 1,962 cancer-free controls of Han Chinese ancestry.

Subjects
Demographic and clinical information is summarized in Table 1 and has been described elsewhere [9]. The study included 1,473 cases and 1,962 controls. Cases were histopathologically or cytologically confirmed as lung cancer and were recruited from local hospitals. All cancer-free control subjects were selected from individuals receiving routine physical examinations at local hospitals or participating in our community-based screening for non-communicable diseases. All subjects were unrelated ethnic Han Chinese. At recruitment, informed consent was obtained from each subject, and the study protocol was approved by the institutional review boards of the authors' affiliated institutions.

Genome-wide scan and ROH calling
The genome-wide scan was conducted using Affymetrix Genome-Wide Human SNP Array 6.0 chips. A systematic quality control procedure was used to filter out unqualified samples and SNPs based on predefined criteria [9]. Briefly, SNPs were excluded if: (i) they were not mapped to autosomal chromosomes; (ii) they had a call rate < 95%; or (iii) they had a minor allele frequency (MAF) < 0.05. Samples with low call rates (< 95%), ambiguous gender or familial relationships (PI_HAT > 0.25) were removed. Finally, 591,370 SNPs from 1,473 cases and 1,962 controls were used for ROH detection. ROHs were determined using the "--homozyg" command implemented in PLINK v1.07 [25], with the minimum ROH length set at 0.50 Mb, the minimum number of SNPs per ROH at 50 and the gap threshold between two ROHs at 0.10 Mb. The minimum ROH length was chosen to exclude short copy number variations (CNVs) and false ROHs formed by chance [16]. An ROH was broken into two if a gap of more than 100 kb was found between adjacent homozygous SNPs. To give a full view of ROH burden, we repeated the detection with the minimum number of SNPs per ROH set at 50, 60, 70, 80, 90 and 100.
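The calling thresholds above (minimum length 0.50 Mb, at least 50 SNPs, runs split at gaps over 100 kb) can be illustrated with a minimal Python sketch. This is not PLINK's algorithm, which uses a sliding window and can tolerate some heterozygous or missing calls; the sketch assumes strictly homozygous runs on a single chromosome, and the names `Roh` and `call_rohs` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Roh:
    start_bp: int
    end_bp: int
    n_snps: int

    @property
    def length_kb(self) -> float:
        return (self.end_bp - self.start_bp) / 1000.0


def call_rohs(positions, is_homozygous, min_kb=500.0, min_snps=50, max_gap_kb=100.0):
    """Simplified ROH caller for one chromosome (assumption: no heterozygous
    or missing calls are tolerated inside a run, unlike PLINK).

    positions     -- SNP base-pair positions in ascending order
    is_homozygous -- booleans, True where the genotype is homozygous
    """
    rohs = []
    run_start = None  # index of the first SNP of the current run
    prev = None       # index of the most recent homozygous SNP in the run

    def close_run(first, last):
        # Keep the run only if it passes both the SNP-count and length filters.
        roh = Roh(positions[first], positions[last], last - first + 1)
        if roh.n_snps >= min_snps and roh.length_kb >= min_kb:
            rohs.append(roh)

    for i, hom in enumerate(is_homozygous):
        if not hom:
            # A heterozygous call ends the current run.
            if run_start is not None:
                close_run(run_start, prev)
            run_start = prev = None
            continue
        if run_start is None:
            run_start = i
        elif (positions[i] - positions[prev]) / 1000.0 > max_gap_kb:
            # Gap between adjacent homozygous SNPs is too large: split here.
            close_run(run_start, prev)
            run_start = i
        prev = i

    if run_start is not None:
        close_run(run_start, prev)
    return rohs
```

Rerunning the function with `min_snps` set to 60, 70, 80, 90 and 100 would mirror the sensitivity analysis described above.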

Statistical analysis
For each individual, F_ROH, defined as the proportion of the autosomal genome in ROHs above a specified length threshold (the total length of all ROHs in the autosomes divided by the total SNP-mappable autosomal distance) [16,21], was used as a predictor of case-control status in ROH description and burden analysis.
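As a worked illustration of this definition, the following Python sketch computes F_ROH (the proportion of the autosomal genome in ROHs) for one individual. The function name `f_roh` and the autosomal length used in the example are placeholders, not values from the study.

```python
def f_roh(roh_lengths_kb, autosome_kb, min_kb=500.0):
    """Proportion of the SNP-mappable autosome covered by ROHs of at least min_kb.

    roh_lengths_kb -- lengths (kb) of one individual's ROHs
    autosome_kb    -- total SNP-mappable autosomal distance (kb); placeholder input
    """
    if autosome_kb <= 0:
        raise ValueError("autosome_kb must be positive")
    total_kb = sum(length for length in roh_lengths_kb if length >= min_kb)
    return total_kb / autosome_kb


# Example with made-up numbers: the 300 kb run falls below the threshold
# and is ignored, so 1,800 kb of ROH are counted.
example = f_roh([600.0, 1200.0, 300.0], autosome_kb=2_700_000.0)
```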
Mclust from the mclust package (v.4) in R was used to run unsupervised Gaussian fitting of the ROH length distribution. Following Pemberton et al. [26], we divided ROHs into two groups by length (class A and B as one group, and class C as another). For each group, overlapping pools between individuals were defined separately in cases and controls using a custom program written in R based on the algorithm of the "--homozyg-group" command implemented in PLINK v1.07; the program calculated the number of overlapping ROHs at each SNP and treated regions with a peak number of overlapping ROHs as pools. Because ROH structure might differ between cases and controls, pools identified in either cases or controls were included in the subsequent analysis. In the analysis, the status of each pool was coded as 0 for no ROH, 1 for a class A or class B ROH, and 2 for a class C ROH (dummy variables). Pools with a frequency greater than 20% in either the case or the control group were considered "hotspots" and were analyzed further. The relationship of F_ROH with lung cancer risk was analyzed using logistic regression to assess the burden of ROHs on lung cancer. The association of each hotspot with lung cancer risk was also evaluated using logistic regression with adjustment for age, gender, pack-years of smoking and the top principal component. Population structure was evaluated by principal component analysis (PCA) of SNPs using the software package EIGENSTRAT 3.0 [21,27]. R v2.15.1 was used for general statistical analysis [28].
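The pool and hotspot definitions can be sketched in Python as follows. This is one possible reading of "regions with a peak number of overlapping ROHs" (a plateau of per-SNP counts that is higher than both neighbouring counts), not the authors' R program or PLINK's implementation; all function names are hypothetical, and ROHs are assumed to be given as inclusive SNP index ranges.

```python
def coverage_counts(n_snps, rohs_by_individual):
    """Count, at each SNP index, how many individuals carry an ROH covering it.

    rohs_by_individual -- one list per person of (first_snp, last_snp) index
                          pairs, inclusive
    """
    counts = [0] * n_snps
    for rohs in rohs_by_individual:
        covered = set()  # a person is counted at most once per SNP
        for first, last in rohs:
            covered.update(range(first, last + 1))
        for i in covered:
            counts[i] += 1
    return counts


def peak_pools(counts):
    """Return (first_snp, last_snp, count) for every plateau of equal counts
    that is strictly higher than the counts on both sides."""
    pools = []
    i, n = 0, len(counts)
    while i < n:
        j = i
        while j + 1 < n and counts[j + 1] == counts[i]:
            j += 1
        left = counts[i - 1] if i > 0 else -1
        right = counts[j + 1] if j + 1 < n else -1
        if counts[i] > 0 and counts[i] > left and counts[i] > right:
            pools.append((i, j, counts[i]))
        i = j + 1
    return pools


def hotspots(pools, n_individuals, min_freq=0.20):
    """Keep pools carried by more than min_freq of the group."""
    return [p for p in pools if p[2] / n_individuals > min_freq]
```

Running these functions on cases and controls separately and taking the union of the resulting hotspots would follow the "either group" rule described above.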

RESULTS
The F_ROH of ROHs defined with a minimum of 50 or more SNPs approximated a normal distribution, except for some individuals with extremely large F_ROH (Fig. 1A to 1F). To evaluate ROHs extensively across the entire genome, we used 50 SNPs as the default value to define ROHs in subsequent analyses. The F_ROH distribution suggested multiple components (Fig. 1A); therefore, we classified ROHs into three classes (class A: 500 kb-689.346 kb, class B: 689.346 kb-1548.887 kb, and class C: over 1548.887 kb) using the clustering method based on a Gaussian mixture model (see Methods).
We evaluated the ROH burden on lung cancer using F_ROH. As shown in Table 2, the overall level of intermediate ROHs (F_ROH) was significantly associated with a decreased risk of lung cancer (OR = 0.63, 95% CI: 0.51-0.77, P = 4.78×10−6). In contrast, the level of long ROHs (F_ROH) was significantly associated with an increased risk of lung cancer (OR = 1.13, 95% CI: 1.01-1.26, P = 0.030). When individuals were divided into four ROH levels according to the quartiles of F_ROH in controls, logistic regression analysis showed that higher levels of intermediate ROHs were consistently associated with a decreased risk of lung cancer compared with low levels (trend OR = 0.85, 95% CI: 0.79-0.91, P = 3.33×10−5), whereas no such association was observed for higher levels of long ROHs (trend OR = 1.06, 95% CI: 0.99-1.14, P = 0.08).
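The quartile grouping used for the trend analysis can be sketched in Python as below. Cut points are taken from the control distribution only and then applied to every individual. The interpolation method and the handling of values that fall exactly on a cut point are assumptions, since the text does not specify them, and the function name `quartile_levels` is hypothetical.

```python
from bisect import bisect_right
from statistics import quantiles


def quartile_levels(control_values, values):
    """Assign each value a level 0-3 using quartile cut points of the controls.

    Assumptions: 'inclusive' quantile interpolation, and a value equal to a
    cut point is placed in the higher level.
    """
    cuts = quantiles(control_values, n=4, method="inclusive")  # three cut points
    return [bisect_right(cuts, v) for v in values]


# Toy example: the control cut points are 3.0, 5.0 and 7.0.
levels = quartile_levels([1, 2, 3, 4, 5, 6, 7, 8, 9], [0.5, 3, 4.9, 6, 10])
```

The resulting level (0 to 3) could then enter a logistic regression as an ordinal predictor to estimate a per-level trend odds ratio.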
To identify specific regions associated with lung cancer risk, we performed a genetic association analysis of ROH hotspots. In total, 3,288 and 3,657 ROH hotspots detected in the case and control groups, respectively, were separately evaluated for an association with lung cancer risk. Several intermediate ROH pools at 14q23.1 were significantly associated with lung cancer risk (PFDR < 0.05, OR < 1) after multiple-testing correction (FDR) (Table 3), and, interestingly, long ROH pools at the same location showed the opposite effect on lung cancer (P < 0.05, OR > 1).

DISCUSSION
SNPs are considered the major source of genetic diversity in humans and have been extensively implicated in multiple diseases and traits. Although GWAS has successfully established links between SNPs and phenotypes, the identified loci explain only a small fraction of disease risk or trait variance. In addition to SNPs, other types of genetic variants may also contribute to the individual risk of disease, either as causal variants or as proxies for causal variants. In the current study, we performed a genome-wide survey of ROHs, a poorly understood type of genomic variant, using data from genome-wide SNP arrays; we evaluated the association of overall ROH levels with lung cancer and conducted a genome-wide association analysis of ROHs with lung cancer risk. We found a significantly decreased overall ROH level among lung cancer cases and identified an ROH region at 14q23.1 that was consistently associated with lung cancer risk. This study investigates the role of ROHs in a specific disease and provides a proof-of-principle approach that can be used in further ROH studies of other diseases using existing GWAS data.
In this study, the F_ROH of ROHs approximated a normal distribution, which is consistent with the results reported in previous studies [21,26]. Based on the right-skewed distribution of F_ROH for overall ROHs, the boundaries of the components were defined at 689.346 kb and 1548.887 kb, similar to the cutoffs (0.5 Mb and 1.5 Mb) reported by McQuillan et al. [16]. Moreover, we found that the F_ROH distributions of class A and class B ROHs were similar to each other but significantly different from that of class C. Class C ROHs, representing long ROHs, are very rare in Han Chinese and may be caused by recent inbreeding [16,26,29]. According to simulation analyses in previous studies [26,29], a large sample size (e.g., 12,000-65,000) is needed for adequate statistical power to detect associations of long ROHs with phenotypes. In the current study, we focused primarily on class A and B ROHs, which have a relatively high frequency. These ROHs are probably the consequence of ancestral inbreeding and positive selection and act as a stable genomic structure across populations [16,26]. Intriguingly, our results were consistent with the view that long ROHs (class C), accompanied by deleterious recessive variants passed down through recent inbreeding, contribute to the development of cancer. Furthermore, we found a protective effect of intermediate ROHs.
Using candidate gene/region approaches, ROHs have been reported to be associated with several diseases, including schizophrenia, late-onset Alzheimer's disease and cancer [22,31-33]. In the current study, we identified several pools associated with lung cancer risk in the region chr14:59,835,003-60,105,136 at 14q23.1, which includes the genes DAAM1, GPR135, C14orf149, JKAMP and RTN1. Copy number gain in this region has been associated with shorter lung cancer survival [34]. The increase in ROHs in this region may indicate the presence of a number of deleterious heterozygous mutations, suggesting that the region merits further study.
DAAM1 is an important participant in the planar cell polarity (PCP) pathway, one of the key sub-pathways of Wnt signaling. DAAM1 binds to both DVL and Rho and is thought to function as a scaffolding protein that mediates Wnt-induced DVL-Rho complex formation [35]. DAAM1 is also a formin homology (FH) protein that is expressed in complementary patterns in the lungs [36], and DAAM1 protein is enriched at the apical surface of airways [37]. However, no further data are available on the relationship between DAAM1 and the lungs. Recently, two studies found that DAAM1 is involved in regulating heart and kidney morphogenesis [38,39]. In our study, the protective region at 14q23.1 included the region upstream of DAAM1 and part of the DAAM1 gene, within which there were some eQTL SNPs that may regulate the expression of DAAM1.
JKAMP, also called JNK1-associated membrane protein (JAMP), has been reported to associate with JNK1. Its association with JNK1 outcompetes the association of JNK1 with MKP5, resulting in increased and prolonged JNK1 activity following stress [40]. JNK-deficient mice exhibited delayed epithelial development in the lungs [41]. Furthermore, one study indicated that JNK was activated in a subset of NSCLC biopsy samples and promoted oncogenesis in the bronchial epithelium [42].
RTN1 products are regarded as neuroendocrine-specific proteins; the gene was identified by screening an expression library of the small-cell lung cancer (SCLC) cell line NCI-H82 with antibodies to the previously identified proteins [43]. The gene can produce three different transcripts with identical 3-prime ends but unique amino termini. The B transcript was found only in the NCI-H82 cell line, while the A and C transcripts were found in 18 different SCLC lines but not in any of the 11 non-endocrine NSCLCs [44].
In summary, our results indicate that overall intermediate ROH levels were low in lung cancer cases and may be implicated in susceptibility to lung cancer. We also identified one intermediate ROH region at 14q23.1 that was associated with the risk of lung cancer in a Han Chinese population by using genome-wide ROH association analysis. It is important to further elucidate the genetic determinants of lung cancer. However, the exact mechanisms through which this region contributes to the development of lung cancer are still unclear and need further investigation. Nevertheless, this study represents a pioneering effort to explore the role of intermediate ROHs in a specific disease on a genome-wide scale.