Ancestry-specific predisposing germline variants in cancer

Distinct prevalence of inherited genetic predisposition may partially explain the difference of cancer risks across ancestries. Ancestry-specific analyses of germline genomes are required to inform cancer genetic risk and prognosis of diverse populations. We conducted analyses using germline and somatic sequencing data generated by The Cancer Genome Atlas. Collapsing pathogenic and likely pathogenic variants to cancer predisposition genes (CPG), we analyzed the association between CPGs and cancer types within ancestral groups. We also identified the predisposition-associated two-hit events and gene expression effects in tumors. Genetic ancestry analysis classified the cohort of 9899 cancer cases into individuals of primarily European (N = 8184, 82.7%), African (N = 966, 9.8%), East Asian (N = 649, 6.6%), South Asian (N = 48, 0.5%), Native/Latin American (N = 41, 0.4%), and admixed (N = 11, 0.1%) ancestries. In the African ancestry, we discovered a potentially novel association of BRCA2 in lung squamous cell carcinoma (OR = 41.4 [95% CI, 6.1–275.6]; FDR = 0.002) previously identified in Europeans, along with a known association of BRCA2 in ovarian serous cystadenocarcinoma (OR = 8.5 [95% CI, 1.5–47.4]; FDR = 0.045). In the East Asian ancestry, we discovered one previously known association of BRIP1 in stomach adenocarcinoma (OR = 12.8 [95% CI, 1.8–90.8]; FDR = 0.038). Rare variant burden analysis further identified 7 suggestive associations in African ancestry individuals previously described in European ancestry, including SDHB in pheochromocytoma and paraganglioma, ATM in prostate adenocarcinoma, VHL in kidney renal clear cell carcinoma, FH in kidney renal papillary cell carcinoma, and PTEN in uterine corpus endometrial carcinoma. Most predisposing variants were found exclusively in one ancestry in the TCGA and gnomAD datasets. Loss of heterozygosity was identified for 7 out of the 15 African ancestry carriers of predisposing variants. 
Further, tumors from the SDHB or BRCA2 carriers showed simultaneous allelic-specific expression and low gene expression of their respective affected genes, and FH splice-site variant carriers showed mis-splicing of FH. While several CPGs are shared across patients, many pathogenic variants are found to be ancestry-specific and trigger somatic effects. Studies using larger cohorts of diverse ancestries are required to pinpoint ancestry-specific genetic predisposition and inform genetic screening strategies.


Background
Cancer risk differs across ancestries. According to the National Cancer Institute's Surveillance, Epidemiology, and End Results (SEER) program, the cancer incidence per 100,000 in the USA between 2011 and 2015 was 449 in the race/ethnicity population self-identified as Whites, 453 in Blacks, 298 in Asian/Pacific Islanders, 315 in American Indian/Alaskan Natives, and 336 in Hispanics [1,2]. While some of these differences may be attributed to non-genetic factors such as access to health care or diet, much can likely be explained by differences in the genomic architecture of these ancestries and differing frequencies of inherited genetic predisposition. Previous studies revealed different carrier rates of pathogenic variants across ancestries, albeit often in a limited panel of genes or selected cancer types [3][4][5].
While multiple large-scale genome-wide association studies have investigated the common risk variants contributing to cancer [6][7][8][9][10], fewer studies have interrogated rare pathogenic variants in non-European ancestries [5][11][12][13][14][15]. A 2019 systematic review of cancer sequencing studies found a total of only 764 reported non-European (minority) cases in 27 published studies with reported race/ethnicity [9]. Consequently, germline genetic testing in non-White patients often results in higher rates of variants of unknown significance (VUSs) [16]. Ongoing efforts are bridging the knowledge gap of cancer genetic predisposition in under-studied populations [17][18][19]. Meanwhile, systematic cross-ancestry investigations of predisposing variants across cancer types are urgently needed to inform genetic testing for each ancestral group.
Herein, we analyzed germline variant data of 9899 cancer cases across 33 cancer types from the Cancer Genome Atlas Project (TCGA) [20] to identify ancestry-specific cancer-gene associations where the genes show an excess of pathogenic/likely pathogenic germline variants in the TCGA samples. In samples of African ancestry, we identified two associations, BRCA2 in lung squamous cell carcinoma (LUSC) and ovarian serous cystadenocarcinoma (OV). In analyses of individuals with East Asian ancestry, we identified an association for BRIP1 in stomach adenocarcinoma (STAD). Using a rare-variant association analysis, we identified seven additional suggestive cancer gene associations. Evidence of a somatic second hit event (i.e., loss of heterozygosity [LOH] or a biallelic mutation) was found in two thirds of the tumors with germline predisposing variants. Many carriers of ancestry-specific predisposition variants showed altered expression of the affected genes, including allelic-specific expression (ASE), mis-splicing, and reduced tumor suppressor gene expression, further supporting these genetic variants' contribution to cancer predisposition.

Study cohort and genetic ancestry assignment
We used the clinical data provided by TCGA PanCanAtlas and restricted analyses to cases with pass-QC blood/normal sequencing data. In addition to excluding cases with PanCanAtlas blacklisted germline BAM files, we eliminated cases with less than 60% genotype concordance between sequencing variant calls and SNP-genotype data, leaving 10,389 cases [20]. We then intersected these with the cases included in the PanCanAtlas Ancestry Informative Markers (AIM) genetic ancestry assignment, resulting in the final set of 9899 samples. The detailed descriptions of ancestry assignment procedures are available in the marker publication [21].
Briefly, consensus genetic ancestry for each TCGA case was determined as the majority of ancestry assignments that were independently determined by five methods across four institutions. These methods include those based on SNP-array genotypes used by Broad Institute, University of California San Francisco (UCSF), and Washington University (WashU), as well as those based on whole-exome sequencing data used by University of Trento and ExAC/Broad Institute. The five methods conducted variations of principal component analyses (PCA) on TCGA normal samples to infer genetic ancestry. We further provide the PCA plots showing the alignment of the major PCs in the UCSF and WashU analyses with the AIM-group consensus genetic ancestry in Additional file 1: Fig. S1.
For each sample, the percentage of global ancestry of African, European, East Asian, Native/Latin American, and South Asian (k = 5) was further estimated using ADMIXTURE [22] version 1.23 based on the common SNP markers (1000 Genomes allele frequency (AF) > 1%) in the Broad Institute analysis. Samples in which the proportion of the secondary ancestry was greater than 20% were considered admixed (Additional file 2: Table S1). Sensitivity analyses revealed increased power when admixed samples were included in this cohort. Thus, cases with admixed ancestry assignments were grouped with their nearest neighbors (e.g., afr_admix to afr) for downstream analyses.
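The admixture rule above can be sketched as follows. This is a minimal illustration, not the working group's code: the function name, the input format, and the reading of "nearest neighbor" as the ancestry with the largest estimated proportion are all assumptions.

```python
# Hypothetical helper; only the 20% threshold and the "afr_admix" -> "afr"
# style of grouping come from the text.
ADMIX_THRESHOLD = 0.20  # secondary-ancestry proportion above which a sample is admixed


def classify_ancestry(proportions):
    """Return (label, analysis_group) from a dict of ancestry -> proportion."""
    ranked = sorted(proportions.items(), key=lambda kv: kv[1], reverse=True)
    primary = ranked[0][0]
    secondary_fraction = ranked[1][1]
    is_admixed = secondary_fraction > ADMIX_THRESHOLD
    label = f"{primary}_admix" if is_admixed else primary
    # Assumption: admixed samples are analyzed with their largest component.
    return label, primary
```

For example, a sample estimated at 70% African and 25% European would be labeled afr_admix and analyzed with the afr group.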

Pathogenic and likely pathogenic germline variant calls
We downloaded the overall and predisposing germline variant calls previously reported by the PanCanAtlas Germline Analyses Working Group (https://gdc.cancer.gov/about-data/publications/PanCanAtlas-Germline-AWG) [20]. The detailed description of variant calling and classification procedures is available in the TCGA PanCanAtlas germline publication [20].
Briefly, germline SNVs were identified using the union of variant calls between Varscan [23] and GATK [24]. Germline indels were identified using Varscan, GATK, and Pindel [25], and we only retained variants called by at least two out of the three callers or high-confidence Pindel-unique calls (at least 30× coverage and 20% variant allele fraction [VAF]). We used the GRCh37-lite reference. We further required the variants to have an allelic depth (AD) ≥ 5 for the alternative allele. We then used bam-readcount to quantify the number of reference and alternative alleles in both normal and tumor samples. We required the variants to have at least 5 counts of the alternative allele and an alternative allele frequency of at least 20%. Of these, we included those rare variants with ≤ 0.05% allele frequency in 1000 Genomes and ExAC (release r0.3.1). We subsequently retained only cancer-relevant pathogenic variants, based on whether they were found in the curated cancer variant databases or a curated list of 152 cancer predisposing genes. Finally, we manually reviewed all variants using the Integrative Genomics Viewer (IGV) and filtered out variants with poor supporting sequence reads.
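The read-count and population-frequency thresholds above can be expressed as a simple filter. The sketch below is illustrative only: the `VariantCall` record and its field names are hypothetical, and the caller-consensus, database, and manual-review steps are not modeled.

```python
from dataclasses import dataclass


@dataclass
class VariantCall:
    """Hypothetical record; the field names are not taken from the pipeline."""
    alt_count: int        # reads supporting the alternative allele
    ref_count: int        # reads supporting the reference allele
    population_af: float  # highest AF across 1000 Genomes and ExAC, as a fraction


MIN_ALT_READS = 5           # at least 5 alternative-allele reads
MIN_VAF = 0.20              # alternative allele frequency of at least 20%
MAX_POPULATION_AF = 0.0005  # at most 0.05% in the reference populations


def passes_filters(v):
    """Apply the three numeric thresholds described in the text."""
    depth = v.alt_count + v.ref_count
    if depth == 0:
        return False
    vaf = v.alt_count / depth
    return (
        v.alt_count >= MIN_ALT_READS
        and vaf >= MIN_VAF
        and v.population_af <= MAX_POPULATION_AF
    )
```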
The variants defined by the above pipeline were then classified using an automatic pipeline termed CharGer [26] (https://github.com/ding-lab/CharGer) that adopts the American College of Medical Genetics and Genomics/Association of Molecular Pathology (ACMG/AMP) variant classification guidelines, which are designed for assessment of germline variants in Mendelian disorders [27]. For the CharGer classification pipeline, we defined 12 pathogenic evidence levels and 4 benign evidence levels using a number of datasets, including ExAC and ClinVar. Pathogenic evidence adds points, whereas benign evidence subtracts points, and the total determines the classification: a pathogenic classification requires the variant to be described as pathogenic by the reviewed clinical significance in ClinVar (not including variants showing "conflicting interpretations of pathogenicity") or other cancer predisposition gene databases, whereas a likely pathogenic classification requires a CharGer score > 8. To acquire enough CharGer points to be classified as likely pathogenic, the variants typically need to be predicted to result in truncation in cancer predisposition genes where the loss of function (LOF) is a known disease mechanism and harbor variants with a dominant (evidence level PVS1, + 8 points) or a recessive (evidence level PSC1, + 4 points) mode of inheritance. Additionally, evidence level PS1 (+ 7 points) is scored if the variant results in the same peptide sequence change as an established pathogenic variant. All other modules will each add ≤ 2 points.
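The additive scoring described above can be illustrated with a short sketch. It is not CharGer's implementation: the function names and the evidence dictionaries are hypothetical, and only the PVS1, PSC1, and PS1 point values and the "> 8" threshold are taken from the text.

```python
# Point values stated in the text; every other module contributes at most 2 points.
STATED_POINTS = {"PVS1": 8, "PSC1": 4, "PS1": 7}
LIKELY_PATHOGENIC_THRESHOLD = 8  # likely pathogenic requires a score strictly above this


def evidence_score(pathogenic_points, benign_points):
    """Sum pathogenic points and subtract benign points (dicts of module -> points)."""
    return sum(pathogenic_points.values()) - sum(benign_points.values())


def is_likely_pathogenic(score):
    return score > LIKELY_PATHOGENIC_THRESHOLD


# PVS1 alone reaches 8 points, which does not exceed the threshold;
# one additional supporting module (hypothetical, 1 point) does.
pvs1_only = evidence_score({"PVS1": STATED_POINTS["PVS1"]}, {})
pvs1_plus_support = evidence_score(
    {"PVS1": STATED_POINTS["PVS1"], "supporting_module": 1}, {}
)
```

Under these stated values, a truncating variant scored only by PVS1 would need at least one further piece of pathogenic evidence to cross the likely pathogenic threshold.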

Principal component analysis (PCA)
Birdseed genotype files were downloaded from Genomic Data Commons (GDC) in the legacy (hg19) archive onto the Institute for Systems Biology-Cancer Genome Cloud (ISB-CGC), converted to individual VCF files, and then merged into a combined VCF containing 11,459 samples and 522,606 variants. We conducted PCA as implemented by PLINK (v1.9) [28]. Specifically, we retained 298,004 variants with AF > 0.15 for population structure analysis. The resulting eigenvalues and eigenvectors were then recorded. PC1 and PC2 accounted for 51.6% and 29.2% of the variation across the first 20 PCs, and none of the trailing PCs accounted for more than 3.2%. Thus, we subsequently controlled for PC1 and PC2 in ancestry-specific cancer predisposing gene analysis (Additional file 1: Fig. S1).

Multivariate regression to identify the enrichment of pathogenic variants
For each cancer type within each ancestry, we conducted multivariate logistic regression analyses considering the case status of the cancer type as the dependent variable (using all other cancer cohorts as controls) and the carrier status of each predisposing gene as an independent variable. The model corrected for age at the initial pathologic diagnosis, gender, and the first two principal components (which accounted for 80.8% of the variation across the first 20 PCs). All ancestry cohorts were called using the same variant calling pipeline, thus avoiding the potential danger of comparing this population against other cohorts such as ExAC. We collapsed predisposing (pathogenic and likely pathogenic) germline variants to the gene level. Only ancestry-cancer combinations with at least 20 cases and predisposing genes with at least two individuals with predisposing variants within the cohort were tested. In total, we tested 33 cancers in the European ancestry, 15 cancers in the African ancestry, and 8 cancers in the East Asian ancestry that met this criterion. No cohorts of the Native/Latin American and South Asian ancestries had sufficient sample sizes in TCGA for testing. Among these tested cancers, we tested a total of 114 cancer-gene combinations for multivariate regression analysis, of which 101 were within the European ancestry, 9 were in the African ancestry, and 4 were in the East Asian ancestry. P values were calculated using the Wald test and adjusted to FDR using the standard Benjamini-Hochberg procedure.
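The Benjamini-Hochberg adjustment applied to the Wald-test P values can be sketched in a few lines. This is a generic textbook implementation written for illustration, not the code used in the study; the function name is an assumption.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest P value so that each adjusted
    # value never exceeds the adjusted value of the next-larger P value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Each raw P value is scaled by the number of tests divided by its rank, so a test ranked first among 114 combinations would need a raw P value below roughly 0.05/114 to reach FDR < 0.05 on its own.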

Burden testing of pathogenic variants
We conducted burden testing of the cohort within each ancestry as defined by the TCGA AIM working group. Specifically, we adopted the Total Frequency Test (TFT) [29] by collapsing predisposing (pathogenic and likely pathogenic) germline variants to the gene level. For each cancer type with at least 20 cases of the tested ancestry with at least one predisposing variant carrier, we tested the burden of predisposing variants for each gene against all other cancer cohorts as controls. Among the cancers that met the sample size criteria described above, we tested a total of 120 cancer-gene combinations using rare variant burden testing, of which 104 were within European ancestry, 11 were in African ancestry, and 5 were in East Asian ancestry. The resulting P values were adjusted to FDR using the standard Benjamini-Hochberg procedure.
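The gene-level collapsing step that precedes the burden test can be sketched as a 2×2 carrier table. The sketch below only builds the counts; it does not implement the TFT statistic itself, and the sample record layout (`cancer`, `genes`) is a hypothetical format chosen for illustration.

```python
def carrier_table(samples, gene, cancer_type):
    """Count carriers and non-carriers of `gene` in one cancer type versus all others.

    `samples` is a list of dicts with keys "cancer" (cancer type code) and
    "genes" (set of genes in which the sample carries at least one
    predisposing variant). Returns
    (case_carriers, case_noncarriers, control_carriers, control_noncarriers).
    """
    case_carriers = case_noncarriers = 0
    control_carriers = control_noncarriers = 0
    for sample in samples:
        is_case = sample["cancer"] == cancer_type
        is_carrier = gene in sample["genes"]
        if is_case and is_carrier:
            case_carriers += 1
        elif is_case:
            case_noncarriers += 1
        elif is_carrier:
            control_carriers += 1
        else:
            control_noncarriers += 1
    return case_carriers, case_noncarriers, control_carriers, control_noncarriers
```

A sample with several predisposing variants in the same gene is counted once, which is the effect of collapsing variants to the gene level.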

gnomAD analysis
We analyzed the gene-level and variant-level frequency of the identified genetic predisposition using the non-cancer subset of the genome aggregation database (gnomAD-non-cancer) cohort (118,479 WES and 15,708 WGS samples) [30,31] (http://gnomad.broadinstitute.org). For the gene-level analysis, we retained rare variants with ancestry-specific minor allele frequency < 0.5%. We further retained pathogenic and likely pathogenic variants per ACMG/AMP criteria as ascertained by InterVar [32] and annotated using ANNOVAR [33]. Allele frequencies were summarized at gene-level within each sub-population in gnomAD using total allele counts and maximum allele numbers within each group.
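The gene-level summary described above (total allele count over maximum allele number) can be written as a short function. The sketch assumes that the input is a list of (allele count, allele number) pairs for the retained variants of one gene in one sub-population; the function name and input format are illustrative.

```python
def gene_level_frequency(variants):
    """Gene-level allele frequency for one gene within one sub-population.

    `variants` is a list of (allele_count, allele_number) pairs, one per
    retained variant. The summary divides the total allele count by the
    largest allele number observed among those variants.
    """
    if not variants:
        return 0.0
    total_allele_count = sum(ac for ac, _ in variants)
    max_allele_number = max(an for _, an in variants)
    if max_allele_number == 0:
        return 0.0
    return total_allele_count / max_allele_number
```

For two hypothetical variants with AC/AN of 3/20,000 and 2/18,000, the gene-level frequency is 5/20,000 = 0.025%.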

Expression analysis
TCGA level-3 normalized RNA expression data were downloaded from Firehose (2016/1/28 analysis archive). The tumor expression percentile of individual genes in each cancer cohort was calculated using the empirical cumulative distribution function (ecdf), as implemented in R. We annotated germline carriers of predisposition variants with extreme mRNA tumor expression (> 80th or < 20th percentile) of the affected gene. For samples within the same ancestry and same cancer cohort, we then used the two-sample Kolmogorov-Smirnov test to compare the expression percentile distribution between variants of oncogenes and tumor suppressors. The resulting P values were adjusted to false discovery rate (FDR) using the standard Benjamini-Hochberg procedure.
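The percentile annotation above can be sketched without R. The analysis itself used R's ecdf; the Python below is an equivalent illustration in which the empirical CDF is the fraction of cohort values less than or equal to the query value, and the function names are assumptions.

```python
from bisect import bisect_right


def expression_percentile(cohort_values, value):
    """Empirical CDF: fraction of cohort expression values <= value."""
    ordered = sorted(cohort_values)
    return bisect_right(ordered, value) / len(ordered)


def extreme_label(percentile, low=0.20, high=0.80):
    """Flag carriers whose affected gene lies below the 20th or above the 80th percentile."""
    if percentile < low:
        return "low"
    if percentile > high:
        return "high"
    return None
```

A carrier whose tumor expression falls at the 6th percentile of its cancer cohort would be annotated as "low", and one at the 50th percentile would not be annotated.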
For the ancestry-specific variants, we recorded the RNA VAF of the mutant allele in the RNA-Seq bam files. For splice site variants, we assessed the mis-splicing of the transcript and variants using IGV.

Power and downsampling analysis
Post hoc power analyses were performed using the R package SKAT [34] and the power_logistic function to calculate the number of samples for rare variant association with causal percentage = 80%, minor allele frequency < 0.1%, and using odds ratio (OR) > 1 through OR < 10. Each calculation was performed using 100 simulations over a target 5 kb region.
Additionally, we performed a downsampling analysis for each tumor type by random sampling of subsets of samples with incremental sizes from zero to the total number of samples in that tumor type. We identified the number of significantly mutated genes as described above within each subset and plotted a smoothed function (loess method) against the subset size. Each calculation was performed at ten iterations (Additional file 1: Fig. S2).
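The downsampling loop described above can be sketched as follows. The association test is passed in as a callback (`count_significant_genes`, a hypothetical placeholder for the gene discovery procedure), the step size and seed are illustrative parameters, and the loess smoothing used for plotting is not included.

```python
import random


def downsample_curve(samples, count_significant_genes, step, iterations=10, seed=0):
    """Mean number of significant genes at increasing subset sizes.

    `count_significant_genes` takes a list of samples and returns the
    number of significantly associated genes found in that subset.
    """
    rng = random.Random(seed)
    total = len(samples)
    sizes = list(range(0, total + 1, step))
    if sizes[-1] != total:
        sizes.append(total)  # always evaluate the full cohort
    curve = []
    for size in sizes:
        counts = [
            count_significant_genes(rng.sample(samples, size))
            for _ in range(iterations)
        ]
        curve.append((size, sum(counts) / iterations))
    return curve
```

Each subset is drawn without replacement, and the ten iterations per size match the number of iterations stated in the text.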

Ancestry-specific cancer predisposing genes
Acknowledging the limited power to assess ancestry-specific associations as shown by the post hoc power analyses (Additional file 1: Fig. S2), we sought to identify cancer predisposing genes within each ancestry. We considered cancer predisposing genes as those statistically enriched for pooled pathogenic and likely pathogenic variants (referred to here as predisposing variants) as previously classified [20]. For each ancestry-cancer type pair, we conducted multivariate regression analyses correcting for onset age, gender, and the first two principal components.
Along with 36 cancer-gene associations (FDR < 0.05, Wald test) found in the European ancestry, we identified two specific cancer-gene associations in the African ancestry, BRCA2 in LUSC and BRCA2 in OV, and one in the East Asian ancestry, BRIP1 in STAD (Additional file 2: Table S2a). While the association of BRCA2 and LUSC is first described in African-American ancestry here, BRCA2 was previously found to be associated with non-small cell lung cancer (including LUAD and LUSC) and ovarian cancer (OV) in the European ancestry [35][36][37]. The association of BRIP1 predisposition to STAD in the East Asian ancestry was also previously reported for the European ancestry [38]. These findings (including novel associations) in a large heterogeneous cancer population build on older studies that evaluated individual cancer predisposition genes and cancer risk across ancestries. The top associated predisposing genes and their carrier frequency vary widely across ancestries (Fig. 1a). For genes with a significant association in the African ancestry, we observed a higher carrier frequency compared to other ancestries. For example, in LUSC, BRCA2 predisposing variants were found in 2 of the 29 African ancestry samples (6.9%), whereas we only found 1 BRCA2 carrier out of the 455 European-ancestry samples (0.44%).
We next investigated whether the cross-ancestry differences in predisposing gene frequencies were also observed in other cohorts. Specifically, we examined the gene-level rates of individuals carrying pathogenic and likely pathogenic variants in the gnomAD non-cancer cohort [30,31] (118,479 WES and 15,708 WGS samples, the "Methods" section, Additional file 2: Table S3). BRCA2 showed a higher frequency in the African ancestry (0.072%) than in all other defined ancestries, including non-Finnish European (0.048%) and East Asian (0.047%). BRIP1 also showed a higher frequency in the East Asian ancestry (0.068%) than in all other ancestries (≤ 0.045%) except for the non-Finnish European ancestry (0.099%).
To generate hypotheses for future targeted studies, we investigated additional ancestry-implicated genes using total frequency testing (TFT) of predisposing variants, fully acknowledging potential confounders using this method (Additional file 2: Table S2b). We identified 7 suggestive (FDR < 0.05 in the TFT analysis) ancestry-specific cancer-gene associations in the African ancestry, 6 of which have been previously described, including SDHB in PCPG [39], ATM in PRAD [40,41], FH in KIRP [42], VHL in KIRC [43], PTEN in UCEC [44], and BRCA2 in OV [12]. We also rediscovered the BRCA2 association in LUSC described above. In the East Asian ancestry, we identified 3 borderline-suggestive associations (FDR = 0.32): RECQL in STAD, BRIP1 in STAD, and POLE in LIHC. In STAD, RECQL and BRIP1 each affected 2 of the 90 East Asian ancestry cases, but none of the 294 European-ancestry cases. In LIHC, two protein-truncating variants were seen in POLE among 162 East Asian ancestry cases compared to none in 179 European-ancestry cases. These suggestive associations remain to be established and are only used to identify potential predisposing variants with supporting somatic evidence.
Fig. 1 Cancer predisposing genes identified in each ancestry across 9899 TCGA cases across cancer types in the African, East Asian, and European ancestries. a Ancestry-specific cancer-gene pairs from TCGA dataset containing cancer predisposing variants as identified by multivariate logistic regression analyses. Each number represents carrier frequencies of predisposing genes within that cancer cohort. Genes with significant associations (Wald test FDR < 0.05) are highlighted with blue boxes. b Significant cancer-predisposing gene associations (FDR < 0.05) identified in the African and East Asian ancestries

Ancestry-specific predisposing variants
We next examined ancestry-specific predisposition at the variant level (Fig. 2, Additional file 2: Table S4) for the 3 significant associations from the multivariate logistic regression analyses and the 7 suggestive associations from the TFT analysis. The cancer-gene pairs included 15 predisposing variants within the African ancestry and another 6 within the East Asian ancestry. None of the above variants discovered in the African ancestry were observed in any other ancestry within that cancer type (Fig. 2). Across the pan-cancer TCGA cohort, all of the BRCA2 frameshift variants found in LUSC and OV were unique to the African ancestry. For other associated genes in the African ancestry, including ATM (PRAD), FH (KIRP), and VHL (KIRC), the predisposing variants differ between the African and European ancestries (Fig. 2b). The African ancestry-specific predisposing variants include the splice site variants ATM c.2921+1G>A and FH c.556-2A>T, the protein-truncating variants ATM p.T2333fs and FH p.S187*, and the missense variant ATM p.R3008C. VHL p.C162F is the only recurrent variant, found in two KIRC cases.
In the East Asian ancestry, we assessed predisposing variants in BRIP1 (STAD), POLE (LIHC), and RECQL (STAD) (Fig. 2a and c). These include two BRIP1 variants, p.I525fs and p.E1222fs, and two protein-truncating variants each in POLE and RECQL. None of the six predisposing variants was shared with any other ancestry in the TCGA cohort (Fig. 2c).
We also investigated the presence of the six predisposing variants in the East Asian ancestry from the gnomAD non-cancer dataset. Only POLE p.Y1078fs (AC/AN = 1/17,692, AF = 0.0056%) and BRIP1 p.E1222fs (AC/AN = 11/19,232, AF = 0.057%) were present, and both were found exclusively in the East Asian ancestry of the gnomAD-non-cancer dataset. All other East Asian-ancestry variants were not detected in this dataset. Of note, none of the six variants were previously reported in ClinVar [45].

Germline-somatic two-hit events
We next examined the two-hit hypothesis, whereby a somatic second hit of the same gene is found in carriers of the germline predisposing variants [46,47]. First, we investigated the extent of loss of heterozygosity (LOH) of the predisposing variants using our previously developed statistical test [38] (the "Methods" section) that compares the variant allele fractions in tumor vs. normal samples. Among the variants observed in the African ancestry, we observed significant LOH (FDR < 0.05) for both truncating variants in SDHB p.R116fs and p.R46* in PCPG (Fig. 3a). Three additional variants exhibited significant LOH, including BRCA2 p.R3128* (LUSC), BRCA2 p.K1202fs (OV), and FH p.S187* (KIRP). We also observed suggestive LOH (FDR < 0.15 or tumor VAF > 0.6) for ATM c.2921+1G>A (PRAD) and BRCA2 p.Y1710fs (OV) (Fig. 3b). Among the six predisposing variants in the East Asian ancestry, only POLE p.E2137* (LIHC) showed significant LOH (Fig. 3a).
As an alternative mechanism of a somatic second hit, we identified three biallelic mutations where the rare germline predisposing variant was coupled with a second somatic mutation of the same gene, all found in African ancestry carriers (labeled in Fig. 2b, Additional file 2: Table S4b). In a PRAD carrier of ATM, the germline p.L2332fs variant was coupled with a somatic p.E2164K mutation; in the KIRC carrier of VHL, the germline p.C162F variant was coupled with a somatic p.E186* mutation. In a KIRP carrier of FH, whose FH gene expression is low (Fig. 4a), the germline p.S187* variant was coupled with a somatic splice-site mutation c.1390+6T>A. Analysis of RNA from the KIRP tumor revealed that the somatic FH c.1390+6T>A causes mis-splicing of 27.6% of the transcripts in tumor RNA, as indicated by the number of reads spanning the consensus splice site (n = 68) and the new cryptic splice site (n = 26) (case 2 in Fig. 4b). None of the six carriers of the predisposing variants in East Asian ancestry harbored a biallelic somatic mutation. Overall, the assessment of LOH and biallelic mutation supports the variants' contribution to oncogenesis through the two-hit model.

Expression changes in predisposing genes
To examine the transcriptional effects of the predisposing variants, we investigated the gene expression in tumor samples of the predisposing variant carriers (Fig. 4a). We observed 154 overall and 27 non-European ancestry-specific predisposing variants co-occurring with an extreme expression (> 80% or < 20% in the same cancer cohort) of the respective gene, although the current sample sizes preclude us from discovering significantly associated genes compared to non-carriers within each ancestry-cancer cohort (Additional file 2: Table S5a).
All of the expression-associated variants were germline heterozygous variants at the DNA level. Their variant allele fraction in the tumor RNA-Seq data (RNA VAF) thus indicates the degree of allelic-specific expression (ASE). The African carriers of SDHB truncating variants p.R116fs (the corresponding gene's expression ranks at the bottom 0.5 percentile among all PCPG cases [0.5%], RNA VAF = 0.25) and p.R46* (9% in PCPG, RNA VAF = 0.80) showed low SDHB expression. The African carriers of BRCA2 p.Y1710fs (6% in OV, RNA VAF = 0) and p.3082fs (15% in LUSC, RNA VAF = 0) also exhibited low BRCA2 expression (Fig. 4c). In the OV case, the germline BRCA2 p.Y1710fs is coupled with a somatic LOH event, resulting in nearly complete loss of BRCA2 expression.
For other ancestries, the tumor from one predisposing variant carrier of the Native/Latin American ancestry, NF1 p.Y489C, showed low NF1 mRNA expression (2% in BRCA, RNA VAF = 0). Overall, RNA VAF of the majority of protein-truncating variants not accompanied by LOH varied between 0 and 0.25 (Additional file 2: Table S5a), suggesting degradation of the mutant allele.
Many predisposing truncating variants of tumor suppressors are assumed to lead to loss of gene expression through mechanisms such as nonsense-mediated decay (NMD). Using the NMD Classifier [48], we found that all frameshift variants identified in the African and East Asian ancestries were located in the NMD-competent region (Additional file 1: Fig. S3). These results support the view that a fraction of predisposing variants likely result in reduced gene products of tumor suppressors in ancestral groups.
Conversely, for the rare tumors with germline variants in oncogenes, the two predisposing RET variants are coupled with elevated RET expression in their African

Power consideration for predisposing gene discovery
Given the currently limited sample sizes in most of the minority cohorts, we sought to identify the numbers of samples required to discover novel cancer predisposing genes. We performed post hoc power analyses to detect a rare-variant association in an aggregation test using SKAT [34]. We assumed that a high proportion (80%) of variants are causal when focusing on prioritized predisposing variants in accordance with ACMG/AMP guidelines (Additional file 2: Table S6a, see the "Methods" section) [26,27,32]. The detection of rare variants (AF < 0.01) with moderate effect sizes (odds ratio [OR] > 5) with at least 80% power requires sample sizes exceeding 1000 samples (n = 1014) per cancer type (Additional file 1: Fig. S2A).
Fig. 4 Expression changes associated with the predisposing variants. a mRNA gene expression of the affected genes in the carriers of ancestry-specific variants as quantiles in their respective cancer cohort. Each dot denotes the gene expression level of a predisposing variant carrier colored by ancestry. Non-European variants corresponding to the bottom 25% expression in affected tumor suppressor genes and top 25% expression in affected oncogenes are further labeled. b Tumor RNA expression highlighting (red box) mis-spliced exon 5 with germline or somatic splice site variants in two cases with FH splice site variants as visualized using the Integrative Genomics Viewer (IGV). c Tumor RNA expression for the BRCA2 gene. The first two rows correspond to samples with a germline predisposing variant coupled with or without somatic LOH event, respectively. The third row corresponds to an unrelated sample without any BRCA2 alteration. All three coverage plots are group-scaled to show lower expression in the two samples harboring BRCA2 alterations
The sample size requirement suggests limited power for ancestry-specific analyses using TCGA, one of the largest cancer sequencing cohorts to date. For the largest ancestry subgroup in the study, European-ancestry BRCA cases (n = 811), there is 67% power to detect genes with smaller effect sizes (OR < 3). For all other ancestries, their respective largest cohorts afford inadequate power to detect genes with large effect sizes (OR = 9), including the African ancestry BRCA cohort (n = 180, power = 36%), the East Asian-ancestry LIHC cohort (n = 162, power = 24.5%), and the Native/Latin American-ancestry THCA cohort (n = 11, power < 1%). As a reference, most known cancer predisposing genes, including ATM, PTEN, STK11, CHEK2, BRIP1, and PALB2, have an estimated OR < 10. BRCA1/BRCA2 are exceptions with an OR > 10 for BRCA, but also show more moderate OR for other cancer types [49]. Despite limited power, this TCGA study includes threefold more non-European cases (n = 1715) compared to the combined number of samples across 27 published non-TCGA sequencing studies that report race/ethnicity information from cancer cohorts (n = 764 non-Europeans, 10 cancer types) [9]. Moreover, the majority of these studies focused on somatic alterations, and only a handful reported ancestry-specific germline predisposition (Additional file 2: Table S7).
Standard power analyses have the caveat of assuming various unknown parameters that may be inaccurate. We thus performed a downsampling analysis using two cancer types with at least five significantly associated germline genes in the European ancestry: pheochromocytoma and paraganglioma (PCPG) and sarcoma (SARC) [4] (Additional file 1: Fig. S2B, Additional file 2: Table S6b). We found that the sample size requirements differ for each gene and cancer cohort, likely due to varying penetrance. For example, six predisposing genes are discovered in each of the PCPG (n = 146) and SARC (n = 217) cohorts of the European ancestry at their full cohort size. Upon halving the cohort size, we found VHL, SDHB, RET, and NF1 to be still associated in 73 PCPG cases, whereas only TP53 remained significantly associated in 108 SARC cases. Even assuming similar penetrance of the predisposing genes across ancestries, this analysis implies that the discovery power is still far from saturation for most ancestry-specific cohorts (N < 100). The different predisposition landscapes across cancer types should also be accounted for in future study designs.

Discussion
We report one of the most extensive multi-ancestry investigations of rare cancer predisposing genes to date, encompassing 9899 cancer cases across 33 cancer types. In the African ancestry, our results validated six known predisposing genes and nominated BRCA2 as a potential predisposing gene for LUSC (Fig. 1), previously shown only for Europeans. In the East Asian ancestry, we found predisposing variants affecting BRIP1 in STAD that warrant further investigation. Although the number of germline predisposing variants is small, they were associated with LOH (Fig. 3), biallelic mutations (Fig. 2), and gene expression effects in the tumor samples (Fig. 4), supporting their potential contribution to cancer predisposition in carriers.
In this TCGA cohort, we found multiple significant predisposing genes for the European ancestry and seven for the African ancestry, yet lack cancer cohorts with sufficient testing samples for many other ancestries, including Native/Latin American and South Asian that each constitute a considerable fraction of the US population. Even when tested, this study likely contains false negatives in multiple smaller cancer cohorts, especially those of non-Europeans. To achieve 80% power, the post hoc power calculation showed that the detection of rare variants (AF < 0.01) with moderate effect sizes (OR > 5) requires at least 1014 samples (Additional file 1: Fig. S2), a cohort size larger than any of the TCGA non-European cohorts.
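To make the relationship between effect size, carrier frequency, and required cohort size concrete, the following is a rough sketch based on a one-sided normal approximation for comparing two carrier proportions. It is not the post hoc calculation reported in Additional file 1: Fig. S2 and will not reproduce the 1014-sample figure; the control carrier frequency, number of controls, and significance level are assumptions chosen for illustration, and the normal approximation is crude for very rare variants.

```python
from statistics import NormalDist


def carrier_freq_in_cases(p0, odds_ratio):
    """Carrier frequency in cases implied by control frequency p0 and an odds ratio."""
    odds = odds_ratio * p0 / (1 - p0)
    return odds / (1 + odds)


def approx_power(n_cases, n_controls, p0, odds_ratio, alpha=0.05):
    """Approximate power of a one-sided two-proportion z-test (unpooled variance)."""
    p1 = carrier_freq_in_cases(p0, odds_ratio)
    se = (p1 * (1 - p1) / n_cases + p0 * (1 - p0) / n_controls) ** 0.5
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    return NormalDist().cdf((p1 - p0) / se - z_alpha)


def min_cases_for_power(target, n_controls, p0, odds_ratio,
                        alpha=0.05, max_cases=100_000):
    """Smallest number of cases reaching the target power, or None if not reached."""
    for n in range(1, max_cases + 1):
        if approx_power(n, n_controls, p0, odds_ratio, alpha) >= target:
            return n
    return None


# Assumed inputs (hypothetical): carrier frequency 0.5% in controls, 8000 controls.
for odds_ratio in (3, 5, 9):
    n = min_cases_for_power(0.80, 8000, 0.005, odds_ratio)
    print(f"OR={odds_ratio}: about {n} cases for 80% power")
```

The qualitative pattern is the point of the sketch: the required number of cases rises steeply as the odds ratio decreases, so cohorts of tens to low hundreds of cases can only detect very large effects.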
It is necessary to use caution when interpreting the ancestry-specific predisposing gene associations identified herein or in previous studies of smaller sample sizes, where a handful of carriers may give rise to the association in a limited cancer cohort. Further, the suggestive associations nominated by the TFT analyses will need to be established by analyses of larger cohorts adjusted for potential confounders. Two of the associations we identified in the African ancestry were also complemented by familial studies [39,42], providing further validation. To design future cancer genomics studies, one must note that the power considerations differ for discovering somatic driver genes and germline predisposing genes. Current detection powers have potentially reached saturation in detecting somatically mutated genes for sample sizes in multiple cancer types of TCGA [4], although racial disparities of the sequencing data could potentially limit the generalizability of findings [50][51][52]. We further highlighted that the imbalanced dataset limits power for germline gene discovery in populations underrepresented in research studies.
We observed selected predisposing genes shared across ancestries (e.g., BRCA2 in BRCA/OV and SDHB in PCPG for both the African and European ancestries). Predisposing variants, on the other hand, are highly ancestry-specific (Fig. 2). Many of the predisposing variants found in the African or East Asian ancestry were not identified in the much larger European-ancestry population of TCGA (n = 8184) or even the gnomAD non-cancer cohort (n = 134,187), nor had they been submitted to ClinVar by clinical laboratories assessing patients for cancer predisposition. Rare variant classification and interpretation remain a challenge given the low frequency of observation precluding statistical associations. The identification of ancestry-specific predisposing variants further highlights this challenge in minority groups, where current germline sequencing often results in higher rates of variants of unknown significance (VUSs) [16].
Personalized medicine provides tailored disease diagnosis and treatment plans based on an individual's unique genetic profile. The knowledge of different cancer predisposing genes and prevalence across ancestries suggests that we need to provide ancestry-specific interpretations of genetic data. In particular, many of the current guidelines for when genetic testing is recommended rely on the underlying likelihood of identifying a germline variant. Thus, accurate estimates of germline prevalence may alter recommendations for different patient populations. At the current sample sizes for minority cohorts, our study is still limited in power to discover and establish ancestry-specificity of predisposing genes (Additional file 1: Fig. S2). However, we were able to discover many ancestry-specific variants not currently submitted to ClinVar. Further, many of the diverse populations within the USA, not to mention worldwide, still lack representation in existing sequencing cohorts. Ongoing sequencing projects will begin to address this disparity within US populations (e.g., CSER [17], eMERGE III [18], Million Veteran Program [19], and the All of Us Research Program) and multiple countries in East Asia and Europe [53]. Yet, many populations, such as those of the diverse African ancestry [54], remain underserved, although projects like H3Africa are designed to address this problem. Additional efforts will be required to deliver the promise of genome-based precision medicine for all.
TCGA provides a powerful multi-omic sequencing dataset comprising more than ten thousand adult cancer cases [55,56]. The dataset is used not only for characterizing somatic mutations and molecular subtypes but also for studies of rare genetic predisposition and germline-somatic interactions [20,38,57,58,59]. However, in such applications, one needs to note that TCGA is neither a prospective cohort nor designed as a case-control study. Using the matched-ancestry cases of other cancer types as "controls" (the "Methods" section) is not ideal, yet they are the only available samples in the same study. The associations herein, therefore, may show biased effect sizes that require validation in carefully designed epidemiological studies. To enhance the confidence of the reported variants, we focused on identifying their somatic impacts, including LOH, ASE, and extreme gene expression levels that can be uniquely revealed in the multi-omic dataset.
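The case versus other-cancer "control" design can be sketched as a simple 2x2 comparison. This is an unadjusted illustration only, not the study's regression-based analysis: it computes an odds ratio with a Woolf (logit) 95% confidence interval and a Haldane-Anscombe 0.5 continuity correction, and the records, ancestry labels, and counts are hypothetical toy data.

```python
import math


def odds_ratio_ci(carrier_cases, n_cases, carrier_controls, n_controls, z=1.96):
    """Odds ratio with Woolf confidence interval; 0.5 is added to every cell."""
    a = carrier_cases + 0.5
    b = n_cases - carrier_cases + 0.5
    c = carrier_controls + 0.5
    d = n_controls - carrier_controls + 0.5
    odds_ratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    low = math.exp(math.log(odds_ratio) - z * se)
    high = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, low, high


def build_counts(records, ancestry, cancer):
    """Cases: the target cancer; controls: all other cancers of the same ancestry."""
    same = [r for r in records if r[0] == ancestry]
    cases = [r for r in same if r[1] == cancer]
    controls = [r for r in same if r[1] != cancer]
    return (sum(r[2] for r in cases), len(cases),
            sum(r[2] for r in controls), len(controls))


# Hypothetical toy records: (ancestry, cancer_type, carries_predisposing_variant)
records = (
    [("AFR", "LUSC", True)] * 2 + [("AFR", "LUSC", False)] * 28
    + [("AFR", "BRCA", True)] * 1 + [("AFR", "BRCA", False)] * 149
    + [("EUR", "LUSC", False)] * 300  # ignored: different ancestry
)

counts = build_counts(records, "AFR", "LUSC")
odds_ratio, low, high = odds_ratio_ci(*counts)
print(f"counts={counts} OR={odds_ratio:.1f} 95% CI=({low:.1f}, {high:.1f})")
```

With only a few carriers, the interval is very wide, which is one way to see why effect sizes estimated from small ancestry-specific cohorts need independent validation.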
To aid interpretation of low-frequency ancestry-specific variants, evidence of a somatic second hit event (i.e., loss of heterozygosity [LOH] or a biallelic mutation) in tumor samples can support functionality. Our analysis of the two-hit model identified second somatic events in two thirds (10/15) of the African ancestry-specific predisposing variants and in one out of six of the East Asian ancestry-specific predisposing variants (Additional file 2: Table S4b). Additionally, some carriers of ancestry-specific predisposing variants showed simultaneous extreme expression of the affected genes (Fig. 3). Such evidence derived from analysis of the somatic genome or transcriptome can be further used to characterize rare germline variants [60], especially when DNA-level analysis still suffers from limited sample sizes.
Our observation of somatic second hits (Figs. 2 and 3) and transcriptional effects (Fig. 4) coupled with germline variants also adds to the current literature on germline-somatic interactions in cancer [61]. While the majority of cancer genomic studies focus exclusively on the germline or somatic genome, pathogenic germline variants are associated with different somatic mutational signatures, allele-specific imbalance, or somatic drivers [20,38,58,62,63]. The availability of germline DNA analysis and tumor genomic and transcriptomic analyses from the same individual provides critical data for the analyses described here, which are not possible in the many studies that analyze germline DNA samples alone. Collectively, these findings are providing roadmaps of how germline variants may trigger and collaborate with specific somatic mutations, eventually leading to cancer development. In this process, genomes across different ancestral populations provide different contexts for developing somatic mutations and genomic instability, even when the individual carries the same germline predisposition variant. We showcased examples of predisposition-associated LOH and gene expression changes in diverse individuals. As sample sizes of sequencing cohorts expand, analyzing germline-somatic interactions across ancestries will be pivotal to reveal potential ancestry-specific effects.

Conclusions
In summary, we identify ancestry-specific predisposing genes and variants contributing to multiple cancer types. The results provide insights into rare genetic predisposition and its somatic impacts in cases of African and East Asian ancestries. While the identified cancer predisposition genes are known, most predisposing variants are found to be exclusive to a single ancestry, supporting the "clan-genomics" hypothesis [64]. Continued studies using larger ancestry cohorts will be required to enable adequately powered discovery of predisposing genes and improve genetic screening for diverse populations [65].
Additional file 1: Figure S1. Principal component analyses (PCA) of germline TCGA samples to infer genetic ancestry as performed by PanCanAtlas Ancestry Informative Markers (AIM) working group. Figure S2. Power analysis for ancestry-specific sample sizes to discover predisposing genes. Figure S3. Nonsense-mediated decay prediction for predisposing frameshift variants in African and East Asian ancestries.
Additional file 2: Table S1. The demographic information of TCGA PanCanAtlas cohort with separate admixture populations. Table S2a. Ancestry-specific cancer-gene associations discovered from multivariate regression analyses. Table S2b. Ancestry-specific cancer-gene associations discovered from rare variant burden testing (Total Frequency Test-TFT). Table S3. Frequency of predisposing variants in TCGA PanCanAtlas and gnomAD-non-cancer subset across all ancestries. Table S4a. Ancestry-Specific Predisposing Variants as identified from Supp. Table.2. Table S4b. Summary of somatic second hit mutations in carriers of germline predisposing variants. Table S5a. Statistical analysis of gene expression in tumor samples of the variant carriers vs. non-carriers within each ancestry-cancer combination. Table S5b. Tumor RNAseq variant allele fractions and the somatic second hit events in germline predisposing variants with extreme expression within that cancer type. Table S6a. Post hoc power analyses to detect rare-variant associations in an aggregation test using SKAT. Table S6b. Down-sampling analysis for PCPG and SARC (cancers with at least 5 significantly associated germline genes in the European ancestry). Table S7. Prior studies that report ancestry-specific germline predisposition.