Global Autozygosity Is Associated with Cancer Risk, Mutational Signature and Prognosis

Simple Summary Global autozygosity in the form of runs of homozygosity is associated with various diseases. Heterozygosity ratio, an alternative measure of global autozygosity, is used to assess cancer risk in this study. Our analysis shows strong and consistent associations between heterozygosity ratios and various cancer types. Further analysis reveals the heterozygosity ratio’s potential connections to mutational signatures and cancer prognosis. Abstract Global autozygosity quantifies the genome-wide levels of homozygous and heterozygous variants. It is the signature of non-random reproduction, though it can also be driven by other factors, and has been used to assess risk in various diseases. However, the association between global autozygosity and cancer risk has not been studied. From 4057 cancer subjects and 1668 healthy controls, we found strong associations between global autozygosity and risk in ten different cancer types. For example, the heterozygosity ratio was found to be significantly associated with breast invasive carcinoma in Blacks and with male skin cutaneous melanoma in Caucasians. We also discovered eleven associations between global autozygosity and mutational signatures which can explain a portion of the etiology. Furthermore, four significant associations for heterozygosity ratio were revealed in disease-specific survival analyses. This study demonstrates that global autozygosity is effective for cancer risk assessment.


Introduction
The human genome is comprised of approximately three billion base pairs. Single Nucleotide Polymorphisms (SNPs) can affect various disease risks as shown by numerous genome-wide association studies (GWAS). According to the GWAS catalog (May 2020), 4424 unique SNPs have been found to influence cancer risk with p < 10 −5 significance. While an SNP describes the allelic information at a single genomic position, global heterozygosity and homozygosity describe the genome-wide zygosity level. Heterozygosity describes the possession of two different alleles of an SNP, and homozygosity describes the possession of the same allele at a genomic position. Global homozygosity is often

HR NonRef vs. HR Minor vs. ROH
The HR was traditionally computed as the ratio between the number of heterozygous SNPs divided by the number of non-reference homozygous SNPs, which we define as HR NonRef . Since the human reference genome was constructed from a small subset of individuals, it is limited by a small sample size and poor global race representation. These limitations result in many cases where the reference allele is not the major allele of the population. By comparing The Cancer Genome Atlas (TCGA) SNP data and reference allele in the GRCh38 genome, we estimated that around 8% of the variants in the human reference genome may not represent the major allele of the population. This inconsistency between reference allele and major allele has a potential effect on HR computation, which has not been previously considered. To investigate this potential difference, we define HR NonRef and HR Minor , where HR NonRef uses reference alleles based on the traditional HR definition and HR Minor uses major alleles defined by the study cohort instead. HR NonRef and HR Minor were computed for all TCGA SNP and International Genome Sample Resources (IGSR) SNP datasets. High correlations were observed between HR NonRef and HR Minor for all Caucasians, Blacks, and Asians (Figure 1), although the correlation dropped slightly after imputation, most likely due to the additional noise introduced. The only difference between HR NonRef and HR Minor is the scaling, as HR Minor appears to be 2 to 3 times the value of HR NonRef . Due to the similarity between HR NonRef and HR Minor , we chose to use HR NonRef for all subsequent analyses. Furthermore, the scatter plot ( Figure S1) between HR NonRef and HR Minor resembles the scatter plot of principal component 1 vs. principal component 2 from a principal component analysis of ancestry informative markers. This result implies that the differences in HR measures between the two definitions is strongly associated with race. A previous study [7] has shown that HR is more robust than ROH due to immunity to SNP density. We performed verification of this result by comparing the results before and after imputation. Figure 2 shows that imputation produced ROH outliers which severely hampered the overall correlation of ROH. We further evaluated the race and sex differences between HR and ROH. Differences by race for HRNonRef and ROH are shown in Figure 3A, with the highest average HR for Blacks, followed by Caucasians and Asians. These results are consistent with previous publications [7,12]. Sex differences for HR were not previously studied. For HRNonRef, females have a substantially higher HR than males for all races ( Figure 3A). For ROH, females have higher ROH than males in Caucasians ( Figure 3A) and Asians ( Figure 3B), males had higher ROH than females in Blacks ( Figure  3C). However, the sex difference of HR and ROH is primarily contributed by the sex chromosomes, and after removing Chromosomes X and Y from the computation of HR and ROH, the difference is A previous study [7] has shown that HR is more robust than ROH due to immunity to SNP density. We performed verification of this result by comparing the results before and after imputation. Figure 2 shows that imputation produced ROH outliers which severely hampered the overall correlation of ROH. We further evaluated the race and sex differences between HR and ROH. Differences by race for HR NonRef and ROH are shown in Figure 3A, with the highest average HR for Blacks, followed by Caucasians and Asians. These results are consistent with previous publications [7,12]. Sex differences for HR were not previously studied. For HR NonRef , females have a substantially higher HR than males for all races ( Figure 3A). For ROH, females have higher ROH than males in Caucasians ( Figure 3A) and Asians ( Figure 3B), males had higher ROH than females in Blacks ( Figure 3C). However, the sex difference of HR and ROH is primarily contributed by the sex chromosomes, and after removing Chromosomes X and Y from the computation of HR and ROH, the difference is substantially reduced ( Figure S2). The violin plots also demonstrate that HR NonRef has less variation than ROH. The cancer subjects' median HR NonRef is visibly higher than normal subjects' median HR NonRef , which results in significant associations with cancer risk. Furthermore, we found no significant association between HR NonRef /ROH and age, which is in concordance with the conventional genetic concept that that germline variants should not be affected by age. Because of the strong differences of HR NonRef and ROH based on sex and race, and varying prevalence of cancer, all subsequent analyses were stratified by sex and race.
Cancers 2020, 12, x 4 of 13 substantially reduced ( Figure S2). The violin plots also demonstrate that HRNonRef has less variation than ROH. The cancer subjects' median HRNonRef is visibly higher than normal subjects' median HRNonRef, which results in significant associations with cancer risk. Furthermore, we found no significant association between HRNonRef/ROH and age, which is in concordance with the conventional genetic concept that that germline variants should not be affected by age. Because of the strong differences of HRNonRef and ROH based on sex and race, and varying prevalence of cancer, all subsequent analyses were stratified by sex and race. The correlations for ROH are weaker than HRNonRef before and after imputation due to the ROH outliers resulted from imputation. This shows that HR is less prone to the effect of SNP density than ROH. The correlations for ROH are weaker than HR NonRef before and after imputation due to the ROH outliers resulted from imputation. This shows that HR is less prone to the effect of SNP density than ROH.

Global Autozygosity and Cancer Risk
We evaluated the association of HRNonRef and ROH with cancer risk using logistic regression with cancer cases from TCGA and healthy controls from IGSR. To avoid any selection bias, HRNonRef and ROH were computed from SNPs present in both TCGA and IGSR. Races (Caucasian, Black and Asian) and sex were tested separately. Tests were limited to case sample size greater than 100. Ten cancer types met the criteria and were tested. For Caucasians, logistic regression analyses showed that HRNonRef is significantly positively associated with cancer risk in all ten cancer types ( Table 1). The strongest cancer association was for male skin cutaneous melanoma ( = 5.28 × 10 ), followed by female ovarian cancer ( = 3.34 × 10 ). For Blacks, only breast invasive carcinoma met the case sample size greater than 100 criteria. HRNonRef was found to be positively significantly associated with breast invasive carcinoma ( = 4.89 × 10 ) , a result more extreme than the Caucasian's counterpart. For Asians, only liver hepatocellular carcinoma met the sample size requirement, and HRNonRef was positively significantly associated ( = 0.001). The global positive associations between HRNonRef and cancer risk suggest that individuals with a more heterozygous genome are at higher risk for multiple cancer types. Receiver operating characteristic (ROC) curves show that statistically significant cancer, race, sex groups had area under curve (AUC) between 0.54 and 0.88, with Black females in breast cancer being the most predictive group ( Figure S3).

Global Autozygosity and Cancer Risk
We evaluated the association of HR NonRef and ROH with cancer risk using logistic regression with cancer cases from TCGA and healthy controls from IGSR. To avoid any selection bias, HR NonRef and ROH were computed from SNPs present in both TCGA and IGSR. Races (Caucasian, Black and Asian) and sex were tested separately. Tests were limited to case sample size greater than 100. Ten cancer types met the criteria and were tested. For Caucasians, logistic regression analyses showed that HR NonRef is significantly positively associated with cancer risk in all ten cancer types ( Table 1). The strongest cancer association was for male skin cutaneous melanoma (p = 5.28 × 10 −12 ), followed by female ovarian cancer p = 3.34 × 10 −11 . For Blacks, only breast invasive carcinoma met the case sample size greater than 100 criteria. HR NonRef was found to be positively significantly associated with breast invasive carcinoma (p = 4.89 × 10 −28 ), a result more extreme than the Caucasian's counterpart. For Asians, only liver hepatocellular carcinoma met the sample size requirement, and HR NonRef was positively significantly associated (p = 0.001). The global positive associations between HR NonRef and cancer risk suggest that individuals with a more heterozygous genome are at higher risk for multiple cancer types. Receiver operating characteristic (ROC) curves show that statistically significant cancer, race, sex groups had area under curve (AUC) between 0.54 and 0.88, with Black females in breast cancer being the most predictive group ( Figure S3). For Caucasians, ROH was found to be significantly associated with nine out of ten cancer types (Table 2). However, the directions of association were mixed, with four negative and five positive associations. Furthermore, there were several differences based on sex. In head and neck squamous cell carcinoma, ROH was not significantly associated with cancer risk for females but was nominally significant for males (p = 0.02). In lung adenocarcinoma, ROH was significantly associated with cancer risk for females (p = 0.007), but was not significant for males. Similarly, in skin cutaneous melanoma, ROH was nominally significantly associated with cancer risk for females (p = 0.01), but not for males. For Blacks, in breast invasive carcinoma, ROH was borderline associated with breast invasive carcinoma (p = 0.07). For Asians, in liver hepatocellular carcinoma, ROH was significantly associated with cancer risk for males (p = 0.03). The associations between HR NonRef and cancer risk are stronger and more consistent than those of ROH. The inconsistency of association direction and the inconsistency between sexes observed in ROH results may also represent the instability of ROH measurement from incomplete genotyping data. In addition to the cancer risk analysis by race, sex and cancer type, we also conducted meta analysis across all possible cancer types to study the overall effect. The meta-analyses were conducted based on the results from cancer risk analysis with the same inclusion criteria. A random effect model was used because the heterogeneity test was significant which indicated heterogeneity across the cancer types. A meta-analysis on HR NonRef showed significant association of HR NonRef with cancers (p = 3.04 × 10 −19 ) ( Figure 4A). For ROH, the meta random effect model produced was not significant (p = 0.1) ( Figure 4B). The meta analysis further demonstrated the robustness of HR NonRef over ROH. In addition to the cancer risk analysis by race, sex and cancer type, we also conducted meta analysis across all possible cancer types to study the overall effect. The meta-analyses were conducted based on the results from cancer risk analysis with the same inclusion criteria. A random effect model was used because the heterogeneity test was significant which indicated heterogeneity across the cancer types. A meta-analysis on HRNonRef showed significant association of HRNonRef with cancers ( = 3.04 × 10 ) ( Figure 4A). For ROH, the meta random effect model produced was not significant ( = 0.1) ( Figure 4B). The meta analysis further demonstrated the robustness of HRNonRef over ROH.  (Table 1). (B) Meta-analysis of cancer risk results of ROH (Table 2). Heterogeneity p < 0.05 indicates significant heterogeneity across cancer datasets, thus a random effect model was used for the meta-analysis. HRNonRef behaved consistently across multiple cancer types, which resulted in a significant metaanalysis p-value. ROH, on the other hand, was not significant.  (Table 1). (B) Meta-analysis of cancer risk results of ROH (Table 2). Heterogeneity p < 0.05 indicates significant heterogeneity across cancer datasets, thus a random effect model was used for the meta-analysis. HR NonRef behaved consistently across multiple cancer types, which resulted in a significant meta-analysis p-value. ROH, on the other hand, was not significant.

Mutational Signatures and Survival Analysis
Somatic mutation is one of the most important aspects of cancer. Somatic mutations occur as the consequence of a mutational process that is triggered by either endogenous errors in DNA replication and repair or exogenous mutagens. Taking into consideration DNA's complementarity, six distinct substitutions can be formed between the four nucleotides. When adding up the 5'-neighbor and the 3'-neighbor, we can derive a three-nucleotide motif from the focal substitution, and thus expand the six-substitution inventory to a 96-motif catalog, known as mutational signatures [15]. The profile of various mutational motifs in a cancer patient can be modeled as a combination of distinct mutational signatures. A mutational signature is conceived as the footprint of a mutational process in the nuclear genome, represented in the form of relative frequencies of the motifs of a mutational catalog [16][17][18][19][20].
Both mutational signature and global autozygosity represent genome-wide patterns, with mutational pattern at the somatic level and HR and ROH at the underlying germline level. As we have shown that HR NonRef is highly associated with cancer risk, we hypothesize that HR NonRef may be related to mutational signatures. Using TCGA somatic mutation data, we fit each patient into the established COSMIC reference mutational signatures, as described in the Methods section. Linear regression models were used to describe the association between mutational signatures and HR NonRef and ROH. False discovery Rate (FDR) < 0.05 was used as the significant threshold. Datasets with a sample size greater than 100 were included in the analyses. Eleven significant associations were identified, seven for HR NonRef and four for ROH (Table 3). All 11 significant results were from those subjects of Caucasian descent. Five of the seven significant HR NonRef associations were from the ovarian cancer dataset, and consisted of SBS9 (FDR = 0.001), SBS18 (FDR = 0.001), SBS5 (FDR = 0.007), SBS7c (FDR = 0.007), and SBS22 (FDR = 0.03). SBS9 is a mutational signature resulting from mutations during replication by polymerase eta. SBS18's etiology is proposed to be damaged by reactive oxygen species; SBS5's etiology is currently unknown; SBS7c is related to ultraviolet light damage and is possibly the consequence of translesion DNA synthesis by enzymes with a propensity to insert T, and SBS22 is related to aristolochic acid exposure. The other two significant associations with HR NonRef were SBS44 (related to DNA mismatch repair, FDR = 0.02) in female skin cutaneous melanoma and SBS36 (related to defective base excision repair, FDR = 0.045) in prostate adenocarcinoma. The most significant association was found between ROH and SBS44 FDR = 4.62 × 10 −8 in females with head and neck squamous cell carcinoma. The other three significant associations with ROH were SBS36 (FDR = 3.02 × 10 −5 ) in prostate adenocarcinoma, SBS42 (related to haloalkanes exposure, FDR = 0.0002) in male lung squamous cell carcinoma, and SBS7b (related to ultraviolet light exposure, FDR = 0.02) in males in male lung squamous cell carcinoma. Next, we performed survival analyses to examine whether HR NonRef and ROH have prognostic value. Disease-specific survival analyses using Cox proportional hazard models identified no significant results for ROH under any scenarios. For HR NonRef , four race and gender-specific significant results were found ( Figure 5): Asian males and liver hepatocellular carcinoma (p = 5.96 × 10 −5 ), Caucasian males and lung adenocarcinoma (p = 0.03), Caucasian females and lung adenocarcinoma (p = 0.01), and Caucasian males and skin cutaneous melanoma (p = 0.02). All survival results show that lower HR NonRef is associated with better prognosis.
Cancers 2020, 12, x 9 of 13 significant results for ROH under any scenarios. For HRNonRef, four race and gender-specific significant results were found ( Figure 5): Asian males and liver hepatocellular carcinoma ( = 5.96 × 10 ) , Caucasian males and lung adenocarcinoma ( = 0.03 ), Caucasian females and lung adenocarcinoma ( = 0.01), and Caucasian males and skin cutaneous melanoma ( = 0.02). All survival results show that lower HRNonRef is associated with better prognosis.

Discussion
A single SNP can have a severe impact, as shown by Mendelian diseases. Multiple SNPs together can help explain a portion of a disease's variation in the population but never fully account for the heritability. This is the famous missing heritability problem [21]. One proposed solution for this problem is that a person's susceptibility to disease may be polygenic and dependent on many low effect variants [22]. Global autozygosity measurement expands on the polygenic idea, by measuring the genome globally. Global autozygosity as a risk factor for diseases such as schizophrenia and Alzheimer's have been established. However, its connection to cancers has not been examined previously.
Cancer risk analysis showed that both HRNonRef and ROH are closely associated with cancer risk. However, given the same sample size, HRNonRef demonstrated stronger associations than ROH. Eight of the 14 significant HRNonRef associations were at ≤10 −8 , the GWAS significance level. However, we stress that since we did not carry out 1 million independent tests, GWAS level significance is not required for the multiple testing correction. For ROH, the most significant association is at p = 0.0009. Furthermore, the associations between HRNonRef and cancer risk are more consistent than ROH. While all 14 significant HRNonRef and cancer risk associations are positive, eight of the 14 significant associations for ROH are positive, and six are negative. These results further illustrate the robustness of HR as a characterization of the genome variability.

Discussion
A single SNP can have a severe impact, as shown by Mendelian diseases. Multiple SNPs together can help explain a portion of a disease's variation in the population but never fully account for the heritability. This is the famous missing heritability problem [21]. One proposed solution for this problem is that a person's susceptibility to disease may be polygenic and dependent on many low effect variants [22]. Global autozygosity measurement expands on the polygenic idea, by measuring the genome globally. Global autozygosity as a risk factor for diseases such as schizophrenia and Alzheimer's have been established. However, its connection to cancers has not been examined previously.
Cancer risk analysis showed that both HR NonRef and ROH are closely associated with cancer risk. However, given the same sample size, HR NonRef demonstrated stronger associations than ROH. Eight of the 14 significant HR NonRef associations were at ≤10 −8 , the GWAS significance level. However, we stress that since we did not carry out 1 million independent tests, GWAS level significance is not required for the multiple testing correction. For ROH, the most significant association is at p = 0.0009. Furthermore, the associations between HR NonRef and cancer risk are more consistent than ROH. While all 14 significant HR NonRef and cancer risk associations are positive, eight of the 14 significant associations for ROH are positive, and six are negative. These results further illustrate the robustness of HR as a characterization of the genome variability.
The literature [23,24] has shown that a single SNP can increase cancer risk. Our analysis results for global autozygosity also suggest that genome-wide characteristics can also affect cancer risk. However, the etiology behind global autozygosity and cancer risk is not well understood, and it would be even harder to study the etiology for global autozygosity compared to single SNPs due to the lack of precise targets. Nonetheless, we performed additional analyses to assess the associations between global autozygosity and mutational signatures. Mutational signatures are constructed from somatic mutations, which can represent the mutagenesis history. A previous study has shown the link between germline variants and somatic mutation [25]. For example, germline variants in RBFOX1 increased the incidence of SF3B1 somatic mutation eight-fold via functional alterations in RNA splicing, and 19p13.3 variants were associated with a four-fold increased likelihood of somatic mutations in PTEN. Thus, it is not unreasonable to hypothesize that there are connections between global autozygosity and mutational signatures. Our analyses found eleven significant associations between global autozygosity and various mutational signatures after correcting for false discovery. These results suggest that global autozygosity is related to some mutational processes. It might affect the risk of DNA mismatch in the repair process after exposure to carcinogens such as ultraviolet light and haloalkanes. Disease-specific survival analysis also identified four significant associations for HR NonRef , which also suggest potential prognosis associations of global autozygosity.
One of the limitations of HR NonRef is that it requires the measurement of the entire genome. Compared to biomarkers of a few SNPs and genes, HR NonRef computation is more expensive and time consuming. However, in previous work [7], we showed that HR NonRef computed from a random subset of SNPs can be a robust representation of the true HR NonRef . Furthermore, the price of whole genome genotyping has dropped below USD 100 per subject, well within the range of acceptable cost. Further cost reduction can be achieved by estimating HR NonRef from the subset of SNPs. Although, the criteria of the subset of SNPs to best estimate HR NonRef requires additional study.

Genotyping Data Acquisition and Imputation
Germline SNP data were obtained from 4833 subjects with 12 cancer types from the Affymetrix Genome-Wide Human SNP Array 6.0 in The Cancer Genome Atlas (TCGA), which contains 934,968 SNPs. All SNP data used in this analysis were derived from blood samples. Additional SNP imputation was performed using a Hapmap Phase 3.0 reference through the Michigan Imputation Server [26]. Imputed SNPs with R 2 > 0.8 were retained for further analysis. After imputation, each cancer type contained 10-16 million SNPs. The total SNP number was 164,497,868. SNP data with imputation of 1668 subjects from The International Genome Sample Resources (IGSR), formally known as the 1000 Genome Project, were also downloaded.

Somatic Mutation Data Acquisition and Mutational Signature Computation
Somatic mutation data of 10,179 patients with 33 cancer types were downloaded from the Genomic Data Commons, the gateway of TCGA. The cancer type abbreviations, full name, and detailed sample size are available in Table S1. The probability matrix for 49 established COSMIC reference mutational signatures (v3) was downloaded from Synapse Documentation (https://www.synapse.org/#!Synapse: syn11738319) (Table S2). We formalized a catalog of 96 three-nucleotide motifs that surround the mutational focus (one upstream nucleotide, one mutation site, and one downstream nucleotide site), and derived frequency tables of this motif catalog for each patient. We leveraged a computational function from the R package MutationalPatterns [27] to fit the patient mutational motif frequency tables to the reference mutational signatures while requiring the coefficients, i.e., signature-to-patient contribution strengths, to be non-negative values. The estimated coefficients formed a 96-by-10,179 matrix of non-negative values, representing the distribution of 96 mutational motifs across the 10,179 patients.

HR and ROH Computation
Two types of HR were computed, which we denote as HR NonRef and HR Minor . HR NonRef is computed as N het /N HomNonRef , where N het is the number of heterozygous SNPs, and N HomNonRef is the number of homozygous non-reference SNPs. These definitions are consistent with previous studies [7,11,12] of HR. To study the potential effect when the reference allele does not equal the major allele in a cohort, we also defined HR Minor as N het /N HomMinor , where N HomMinor is the number of homozygous minor alleles based on the patient cohort. ROHs were computed using PLINK [5]. The median ROH length was used for subsequent analyses.

Statistical Analyses
All statistical analyses were conducted using 64 bit R 4.0.2. Both Spearman's and Pearson's correlation coefficients were used to compare HR NonRef and HR Minor . Linear regression (R glm function with family = Gaussian parameter) was used to evaluate the association between age and HR NonRef . Logistic regression (R glm function with family = binomial parameter) was used to evaluate the association between HR/ROH and cancer risk. The cancer cases from TCGA were matched with non-cancer controls from IGSR by race and sex. Moreover, to avoid any potential bias, the common set of SNPs between TCGA and IGSR was used to compute HR and ROH. The unit for HR NonRef and ROH was per standard deviation. The R function coxph was used to evaluate the survival predictability of HR NonRef and ROH. Both HR NonRef continuous and dichotomized models were conducted. For the dichotomized model, the R package maxstat was used to find the optimal dichotomization threshold. This threshold was used for dichotomizing HR NonRef into high and low groups for the Kaplan-Meier curve presentation. The associations between mutational signatures and HR NonRef /ROH were found by linear regression (R glm function with family = Gaussian parameter). The R function p.adjust with FDR parameter was used for adjusting for multiple test correction. ROC curves were drawn with the ggplot2 package. The auc_roc function from mltools R package was used to compute AUC. Meta-analyses across all possible cancer types and groups were conducted for HR NonRef and ROH. The meta-analyses were conducted based on the results from cancer risk analysis (Tables 2 and 3). The R function rma from the metafor package was used for the meta analysis. The random effect model was used during the meta analysis because the heterogeneity test across cancer types was significant (p < 0.05).

Conclusions
Our analyses of global autozygosity show that HR NonRef is a more robust measurement than ROH. More importantly, our study demonstrates the connections between global autozygosity and cancer risk. We identified strong associations between HR NonRef and cancer risk. Even though the majority of the subjects were Cacuasian, strong associations for minority groups, such as breast invasive carcinoma risk in Black women and liver hepatocellular carcinoma in Asian men, were identified. Further evidence was identified by exploring the associations between global autozygosity, mutational signatures, and cancer prognosis. These results show that global autozygosity can be used for reliable cancer risk assessment.
Supplementary Materials: The following are available online at http://www.mdpi.com/2072-6694/12/12/3646/s1, Figure S1: Scatter plot between HR NonRef and HR Minor , Figure S2: Comparison of HRNonRef and ROH between sex across all three races tested, Figure S3: ROC curves based on the cancer risk analysis in Table 1, Table S1: Cancer names and sample size, Table S2: List of mutational signatures.