Challenges and disparities in the application of personalized genomic medicine to populations with African ancestry

To characterize the extent and impact of ancestry-related biases in precision genomic medicine, we use 642 whole-genome sequences from the Consortium on Asthma among African-ancestry Populations in the Americas (CAAPA) project to evaluate typical filters and databases. We find significant correlations between estimated African ancestry proportions and the number of variants per individual in all variant classification sets but one. The source of these correlations is highlighted in more detail by looking at the interaction between filtering criteria and the ClinVar and Human Gene Mutation databases. ClinVar's correlation, representing African ancestry-related bias, has changed over time amidst monthly updates, with the most extreme switch happening between March and April of 2014 (r=0.733 to r=−0.683). We identify 68 SNPs as the major drivers of this change in correlation. As long as ancestry-related bias when using these clinical databases is minimally recognized, the genetics community will face challenges with implementation, interpretation and cost-effectiveness when treating minority populations.

Admixture plot of 642 CAAPA samples. These are the admixture estimation results, which also included non-admixed populations from phase 1 of the 1000 Genomes Project and the Native Americans from Bigham et al. 2010, as mentioned in the main-text section titled "Estimation of Ancestry Proportions." The proportion of African ancestry (red) was used as a key correlate of the variation we found for different categories.
Supplementary Table 1. Correlation of ancestry with number of PAVs per individual identified separately from each of the two databases with and without filtering. Full filtering implies the allele frequency filter and either a deleterious filter or Stop/Splice site. Correlation values are shown for PAVs found in only ClinVar, only HGMD, and either one. For each of these categories, correlation values are presented before filtering, after filtering out variants with a MAF > 0.05 in any of a number of populations, after restricting to variants called deleterious by at least two in silico predictors or located at stop or splice sites, and after all of these filters. Regardless of database origin, each time a filter is added, the positive correlation is reduced. With both filters added, PAVs from ClinVar show a significant negative correlation, while PAVs from HGMD or the union of ClinVar and HGMD show no correlation.

Table 2. Genes significantly correlated with African Ancestry. Genes whose pathogenic annotated variants (PAVs) were significantly correlated with African Ancestry are listed. No genes had a statistically significant positive correlation with African Ancestry. In other words, all of these correlations were negative, so individuals with greater African ancestry had fewer pathogenic variants in these genes. Significance was calculated after correcting for multiple testing (Bonferroni correction): two asterisks (**) signify family-wise significance at the 0.05 level before removing genes with fewer than a minimum number of total pathogenic variants summed across all individuals, and a single asterisk (*) signifies similar significance after removal of such genes (representing increased power via removing weak-signal genes and reducing the number of statistical tests).

regularly evaluate how to best use ClinVar 1, particularly for minority patients.

Asthma Focus In The CAAPA Dataset
The CAAPA cohort consists of samples collected for investigation into the genetics of asthma. To verify our assumption that the ascertainment of individuals with asthma should not affect our results or enrich for pathogenic, deleterious, and/or truly causal variants, we used our per gene analysis framework to demonstrate that genes implicated in asthma have no meaningful effect on our results and conclusions. After calculating ancestry-based bias in each gene, we looked at the subset of genes with the most bias, including genes with meaningfully positive and meaningfully negative correlations between African ancestry and pathogenic variant counts per gene. Using subsets of the most biased genes (even before multiple testing correction), we found no evidence of enrichment for any disease networks or pathways, as annotated by the Gene Ontology Consortium database (GO), or for curated Mendelian, recessive, dominant, and X-linked genes. Furthermore, we found no evidence in highly biased genes for enrichment of GWAS catalogue genes, which should contain any genes that were the most significant hits in any GWAS, including those looking at associations with asthma.
Finally, the most significantly biased genes after multiple testing (~10) have not been implicated in asthma.

Ancestry specific genomic data and databases
One might ask how the increasing numbers of whole African-ancestry genomes being deposited into public resources (through NHLBI, NHGRI, etc.) may affect the biases we report here, and whether such deposits might cause these biases to disappear.
While such increased sequencing of whole African-ancestry genomes is surely a step in the right direction, one serious obstacle to the disappearance of the biases we report is that most current and upcoming African-ancestry genome sequencing is not being done on cohorts with the robust phenotype data that comparable studies of predominantly European-ancestry individuals use to populate databases such as ClinVar (i.e., to annotate variants as pathogenic). Instead, these African-centric studies are more focused on the complex disease genetics that underlie medical illnesses in foundational areas such as cardiology, pulmonology, and psychiatry. In addition, even if these phenotype data did exist, we believe it would take significant time for the amount of African data in the databases to "catch up" to the European data that currently dominate them. Finally, if the databases were to become disproportionately populated with data specific to African populations, our results suggest that other ancestry-related biases might develop for non-African-ancestry populations. Therefore, the genetics community must be aware of the importance of accounting for population specificities, particularly when using databases to prioritize variants in the context of precision genomic medicine.

Per gene analysis
To explore the correlation between PAVs in ClinVar 1 and African ancestry further, we conduct a similar correlation analysis on a per gene basis. By counting the total number of PAVs in each gene for each person, we run the weighted correlation analysis described above on each of the 24,043 human genes annotated in UCSC's hg19 RefGene list. After multiple testing correction, only 3 genes have significant correlations (Supplementary Table 2). Since many of the genes have very small total numbers of PAVs, even across all individuals, we rerun the correlation analysis after excluding all genes with fewer than 5 total PAVs across all individuals. This leaves a total of 645 genes, and by cutting away the multitude of underpowered genes with low counts, we identify 10 genes with significant correlations after multiple testing correction (Supplementary Table 2). For both analyses, follow-up is qualitatively the same, and so we describe in the main text the approaches and results for the larger full gene analysis.
The fact that our filtering of low count genes does little to quantitatively change our follow-up analysis, even after the removal of over 97% of genes, provides support that raising the minimum number of PAVs per gene further would do little to increase our power or improve our analysis.
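The per gene procedure can be illustrated with a minimal Python sketch. It is not the code used in this study: the function names, the per-individual weights, and the Fisher z normal approximation for the p-value are all assumptions made for illustration, since the weighting scheme is described elsewhere in the text. With equal weights the statistic reduces to the ordinary Pearson correlation.

```python
import math

def weighted_pearson(x, y, w):
    """Weighted Pearson correlation of x and y with non-negative weights w."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    vy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y))
    if vx == 0 or vy == 0:
        return float("nan")  # correlation is undefined for a constant vector
    return cov / math.sqrt(vx * vy)

def fisher_z_pvalue(r, n):
    """Two-sided p-value for r via the Fisher z approximation (an assumption,
    not necessarily the test used in the paper)."""
    if n <= 3:
        return float("nan")
    if abs(r) >= 1.0:
        return 0.0
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))

def per_gene_scan(ancestry, pav_counts, weights, min_total_pavs=0, alpha=0.05):
    """Correlate per-person PAV counts with African ancestry for each gene.

    pav_counts maps gene name -> list of PAV counts, one entry per person.
    Genes with fewer than min_total_pavs PAVs in total are dropped before
    testing, which also shrinks the Bonferroni denominator.
    """
    kept = {g: c for g, c in pav_counts.items() if sum(c) >= min_total_pavs}
    if not kept:
        return {}
    threshold = alpha / len(kept)  # Bonferroni-corrected per-gene threshold
    results = {}
    for gene, counts in kept.items():
        r = weighted_pearson(ancestry, counts, weights)
        if math.isnan(r):
            continue
        p = fisher_z_pvalue(r, len(counts))
        results[gene] = (r, p, p < threshold)
    return results
```

Calling `per_gene_scan` once with `min_total_pavs=0` and once with `min_total_pavs=5` mirrors the two analyses described above. The second call tests fewer genes, so each remaining gene faces a less stringent corrected threshold.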

Gene enrichment analyses
After calculating a correlation value per gene, we make a list of 74 genes that have a significant positive correlation before multiple testing correction and another list of 198 genes that have a significant negative correlation before multiple testing correction. Using additional gene lists compiled from OMIM3, ClinVar1, HGMD4, and the GWAS catalogue5, we find no significant enrichment for Mendelian, dominant, recessive, X-linked or GWAS catalogue genes amongst positive and negative correlation genes (Pearson's Chi-squared test and Wilcoxon rank sum test). These category lists overlap and contain 2050 Mendelian genes, 670 dominant genes, 1050 recessive genes, 491 X-linked genes, and 5045 GWAS catalogue genes. GWAS catalogue genes are defined as genes that contain at least one variant that was a top genome-wide hit in a GWAS of a complex trait.5 We also test whether Mendelian, dominant, recessive, X-linked or GWAS catalogue genes have different correlation values than genes outside of these categories, but results are non-significant in each of these cases. As we found no evidence of an enrichment of highly biased genes in any of the annotated Mendelian, recessive, or dominant gene categories, as would be expected if a model based on dominance was particularly relevant to our results, we feel that the additive approach we have taken is best. Additionally, since our goal in assessing the variants present in each individual is to build up population-level evidence, it is important to consider each allele independently when assessing the population-wide evidence that a variant is causal. Using the GOrilla program6, we tested our significant positive and negative correlation gene lists for enrichment of GO terms, but results were unremarkable for all tests, especially at genome-wide significance levels.
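As an illustration of the category enrichment tests, the following Python sketch applies Pearson's Chi-squared test to the 2x2 table of biased versus non-biased genes against membership in a gene category. It is a sketch under stated assumptions and not the analysis code used here: the function name and arguments are hypothetical, no continuity correction is applied, and the p-value uses the closed form for one degree of freedom.

```python
import math

def chi2_enrichment(biased_genes, category_genes, background_genes):
    """Pearson's Chi-squared test (1 d.f., no continuity correction) for
    enrichment of a gene category among biased genes.

    Returns (chi2, p). Both are NaN if any table margin is zero.
    """
    background = set(background_genes)
    biased = set(biased_genes) & background
    category = set(category_genes) & background

    a = len(biased & category)        # biased and in the category
    b = len(biased - category)        # biased, not in the category
    c = len(category - biased)        # in the category, not biased
    d = len(background) - a - b - c   # neither
    n = a + b + c + d

    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return float("nan"), float("nan")

    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of the chi-squared distribution with 1 d.f.
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p
```

In this setting `biased_genes` would be, for example, the 198 negative correlation genes, `category_genes` one of the Mendelian, dominant, recessive, X-linked or GWAS catalogue lists, and `background_genes` all tested genes. Because several categories are tested, the resulting p-values would still need multiple testing correction.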