REVA as A Well-curated Database for Human Expression-modulating Variants

More than 90% of disease- and trait-associated human variants are noncoding. By systematically screening multiple large-scale studies, we compiled REVA, a manually curated database for over 11.8 million experimentally tested noncoding variants with expression-modulating potentials. We provided 2424 functional annotations that could be used to pinpoint the plausible regulatory mechanism of these variants. We further benchmarked multiple state-of-the-art computational tools and found that their limited sensitivity remains a serious challenge for effective large-scale analysis. REVA provides high-quality experimentally tested expression-modulating variants with extensive functional annotations, which will be useful for users in the noncoding variant community. REVA is freely available at http://reva.gao-lab.org.


Introduction
Noncoding regions occupy the majority of the human genome [1]. It has been demonstrated that noncoding variants can affect the regulation of genes [2], and more than 90% of disease-and trait-associated variants are noncoding variants [3]. Noncoding variants that could affect gene expression can be considered as expression-modulating variants [4]. Several experimental assays have been developed to characterize expression-modulating variants. Genome editing technologies such as transcription activator-like effector nucleases (TALENs), zinc finger nucleases (ZFNs), and clustered regularly interspaced short palindromic repeats with Cas9 nuclease (CRISPR/Cas9) provide high-quality validated data but are generally low throughput [5][6][7]. Recently developed massively parallel reporter assays (MPRAs) can identify transcriptional regulatory elements in an efficient way, allowing systematic screening of tens of thousands of genetic variants for pinpointing the causal variants of complex traits [4,8,9]. All expression-modulating variants stored in MaveDB [10] are validated by the MPRA experiments. MPRAs have generated over 10 million human expression-modulating variants [11]; however, only around 30 thousand of them have been collected by MaveDB without any functional annotation, which hinders the further utilization of these data.
Although experimental assays for characterizing noncoding expression-modulating variants have generated a huge amount of data, it is still inadequate for covering all noncoding variants identified in human genomes. Therefore, multiple computational tools have been developed for identifying expression-modulating variants ( Table 1). Transcription factors (TFs) could regulate genes through binding to sequence motifs [12], and noncoding variants could affect gene regulation by changing motifs [13]. FunSeq2 integrated a module for detecting motif-breaking and -gain events through the change of position weight matrix (PWM) and other functional annotations to prioritize cancer driver mutations [14]. Methods based on machine learning have been used wildly in biological researches [15]. CADD [16] used support vector machine (SVM) to classify variants into functional and nonfunctional variants, and GWAVA [17] used random forest to predict disease-related variants. Both CADD and GWAVA were based on supervised learning methods, while Eigen [18] implemented unsupervised learning methods to classifyvariants. All these tools highly depend on existing annotations at corresponding loci. In 2015, Alipanahi et al. [19] developed DeepBind based on convolutional neural networks (CNNs) to predict the binding affinity between TFs and DNA or RNA binding proteins and RNA. DeepSEA [20] applied similar methods to predict the effect of noncoding variants on binding affinity and then classified variants through logistic regression into functional or nonfunctional groups. All tools mentioned above identified expression-modulating variants through indirect inference, because they were not trained on expression-modulating variants or expression-related data. EnsembleExpr [21] used MPRA data to train an ensemble-based model for characterizing expression-modulating variants directly.
ExPecto [22] ab initio predicted the variants' effects on gene expression from 40-kb promoter-proximal sequences and then pinpointed expression-modulating variants. However, there is no comprehensive evaluation of these computational tools based on high-quality expressionmodulating variants; therefore, it is difficult for users to choose appropriate tools for their tasks.
Here, we present a repository for expression-modulating variants (REVA). The current release of REVA consists of over 11.8 million experimentally validated expressionmodulating variants in the human genome, curated with extensive functional annotations. We further benchmark seven popular computational tools in identifying expression-modulating variants [14,[16][17][18][20][21][22] based on high-quality data in REVA. All data and benchmarking results are publicly available at http://reva.gao-lab.org.

Data collection and integration
To ensure unified and high-quality data, all records in REVA were collected and curated using a standard procedure (Figure 1). We used a list of keywords, "MRPA", "STARR-seq", "CRE-seq" with "mutation", "variant", and "variation", to retrieve publications from PubMed (https:// pubmed.ncbi.nlm.nih.gov/) and then manually checked the abstracts and full texts of the matching publications to obtain literatures that experimentally validated the effects of noncoding expression-modulating variants.
For filtered literatures, we extracted related information of the variants from the main texts as well as supplementary materials of publications and converted them to the same format ( Table 2). Variants that failed to be mapped to both   GRCh37 and GRCh38 were removed. Variants only mapped to the coding region were also removed. For missing information, we used "." as a placeholder. In addition, the detailed protocols and raw data of the experiments were also extracted. For variants with the same chromosome, genome location, reference allele, alternative allele, and experimental cell line from different publications were subjected to a meta-analysis to integrate data. The harmonic mean P value (HMP) method [25] was used in the meta-analysis, and the cutoff for the meta P value was set to 0.001 to generate the meta-label. The variants involved in the metaanalysis but without a raw P value were also removed.
The label of variants was given based on the cutoff for the adjusted P value or meta P value, and then variants were classified into positive variants and negative variants based on label or meta-label. If the variant's label was 1, the variant was a positive variant and considered to have effects on gene expression; otherwise, it was a negative variant without effect on gene expression.

Database construction
All manually curated variant data, as well as metainformation, were stored in MongoDB (https://www.mongodb.com/) at two levels: accession and variant data ( Figure  S1). Each accession entry consisted of an accession number, created time, last updated time for the accession, information about the assay used in the publication (method type, original reference genome version, link to raw data, and summary of the assay), and the reference. Variant data included all related information about the variant, and each variant data entry was linked to one accession. For the data involved in the meta-analysis, the variant data contained the results of the meta-analysis and were linked to all related variants and accessions.

Variant annotation
In efforts to pinpoint plausible regulatory mechanisms for these variants, we used 2403 trained CNNs to annotate the functional effects of sequence variations [28] based on 1249 TF binding profiles, 766 histone modification profiles, 280 DNA accessibility profiles, and 108 DNA methylation profiles from the recent Encyclopedia of DNA Elements (ENCODE) data.
NVIDIA Tesla P100 Graphics Processing Units with the implementation on the deep learning framework TensorFlow (https://www.tensorflow.org/) and Python (https://www.python.org/) were used for training models. We adopted stochastic gradient descent (SGD) as the optimizer, and the initial learning rate was 0.01.
The final output layer of the CNN model was a fully connected layer with a sigmoid function used to scale the output between 0 to 1. The input layer was a one-dimensional convolution layer with the thresholded rectified linear unit (ReLU) as the activation function. Next, the maxpooling layer was performed to reduce the complexity of the data. Then, the dropout layer was used to mitigate the overfitting problem. The next two layers were a fully connected layer with thresholded ReLU as the activation function and a dropout layer.
For TF binding, histone modification, and DNA accessibility models, the positive data for training CNNs were the 200-bp sequences centered on the peak in ENCODE profiles. Then we removed positive sequences from the human reference genome and split the rest into 200-bp bins. Random sampled 200-bp bins with the same number of positive data were used as negative data. For DNA methylation, the 200-bp sequences centered on the target base with the methylation rate more than 0.5 or less than 0.5 in wholegenome bisulfite sequencing (WGBS) data were considered as positive data and negative data, respectively. One-hot encoding was conducted to transform each sequence to a 200 × 4 binary matrix for model training.
A five-fold cross-validation strategy was used to train models. During each iteration of model training, 15% of the input data were randomly selected as the independent testing dataset to evaluate model performance. The remaining data were split with 70% to train models and 15% as the validation dataset to optimize parameters. Model performance was evaluated with the area under the receiver operating characteristic curve (AUROC) and area under the precision-recall curve (AUPRC) to test the sensitivity and specificity, and models with the best performance were selected for variant annotation. An average AUROC and an average AUPRC of 2403 models were reported.
To character the binding affinity changes of the variant, we used 2403 trained CNNs to predict on 200-bp sequences centered with the reference allele and alternative allele, respectively. For each chromatin profile, the log 2 fold change (as the method shown in DeepSEA) [20] was calculated as the variant effect on chromatin profile. Specifically, where P r was the prediction of sequence with reference allele, and P a was the prediction of sequence with alternative allele. Furthermore, we incorporated 13 DNA physicochemical properties and 8 evolutionary features into the annotation pipeline. The 13 physicochemical properties were calculated as described by Li et al. [29], and 8 conservation scores were downloaded from UCSC Genome Browser (http://genome.ucsc.edu/).
All variants in the benchmarking dataset were variants with expression-modulating potential. One of the biological mechanisms by which disease-related or phenotype-related variants function is having effects on gene expression regulation [30]. Pinpointing disease-related or phenotyperelated variants is more useful for biomedical researches. Therefore, we further selected the GWAS, ClinVar, and HGMD subsets of the benchmarking dataset to test these tools' power.

Extensive functional annotation of expression-modulating variants
We used 2403 trained CNNs to annotate the functional effects of expression-modulating variants [28]. Most of the trained CNNs were accurate, with an average AUROC of 0.908 and an average AUPRC of 0.904. Among the 5,789,688 variants annotated, both positive and negative variants were found to lead to significant changes in binding affinity for 22 and 12 TFs on average, respectively, which also suggested that expression-modulating variants may affect gene expression regulation through changing the binding affinity of TFs. Moreover, 8.72% positive and 3.56% negative variants were located at evolutionary conserved loci (phastCons100way score > 0.6).

Benchmarking of state-of-the-art computational tools
To evaluate the power of state-of-the-art computational tools in calling expression-modulating variants, we further benchmarked multiple state-of-the-art computational tools based on the curated data in REVA. With the benchmarking dataset containing 5,809,991 loci (37,816 positive and 5,772,175 negative), we found that 1289 could not be predicted by DeepSEA (since their evolutionary features were not available), and 560,577 were not included in the precomputed score set of Eigen, FunSeq2, and GWAVA, so we further excluded these 561,866 cases from follow-up analysis. Meanwhile, as EnsembleExpr could not finish the whole benchmarking dataset in a reasonable amount of time, we assessed its performance based on the average metrics over 5 randomly sampled sub-datasets with 368 positive and 56,026 negative cases on average.
Overall, the best-performing tool was DeepSEA, with Fisher's exact test with Benjamini and Hochberg correction [42] was used in the analysis of the chromosome distribution of variants. The cutoff for the adjusted P value was set to 0.05. The density distribution plot was constructed with the karyoploteR package [43] in R. No variants were located on the Y chromosome. EnsembleExpr on variants that were also included in HGMD. "All" represents the F1 score shown in (A). "All HGMD" represents the F1 score on all variants that were also included in HGMD. "DM?", "DP", "FP", and "DFP" refer to the classes of related variants documented in HGMD. AUROC, area under the receiver operating characteristic curve; N.A., not available.
the highest AUROC and F1 score (Figure 3A and B; Table S6). All tools performed well in terms of specificity but poorly in terms of sensitivity. EnsembleExpr had the highest sensitivity but the lowest specificity, whereas ExPecto showed the best specificity and worst sensitivity (Table S6). There were 52,672 variants in the benchmarking dataset that overlapped with the GWAS catalog (v1.0.2), and 658 of them were positive variants. All tools performed better on variants overlapping with the GWAS catalog, and DeepSEA still had the best performance ( Figure 3C). Meanwhile, ExPecto and FunSeq2 showed better performance on variants at evolutionarily conserved loci, while DeepSEA displayed moderate performance ( Figure 3D).
The coverage and quality of training data may contribute significantly to the performance of machine learning-based models [44]. To test whether variants from different cell lines would affect the performance of these tools, we further evaluated these tools separately on seven cell lines (Table S7). On GM18507, GWAVA-Unmatch performed best; on HEK293T and NA12878&NA19239, Eigen-PC had the highest F1 score; DeepSEA had the best performance on HepG2, K562, and K562_GATA1; and CADD performed best on SH-SY5H ( Figure 3E), which suggested that the diversity of the original training data contributes to the performance differences of these tools. Of note, thus far, only ExPecto outputted cell type-specific scores for various tissues.
To provide a further explanation of the potential mechanisms of disease-related variants, we evaluated the benchmarking dataset on disease-related variants. There were 1400 variants in the benchmarking dataset that overlapped with HGMD (2019.3 professional), and 69 of them were positive variants (Table S8). Moreover, 8 of 69 variants were verified to regulate gene expression by independent experiments; 40 of 69 variants were associated with diseases such as colorectal cancer, nervous system diseases, and autoimmune diseases. To test computational tools' power on disease-related variants, we compared their performance on these variants. All tools performed better on variants overlapping with HGMD, and DeepSEA still had the best overall performance ( Figure 3F), same on variants with class "DM?" and "DFP". Eigen-PC showed the best performance on variants with class "DP". Interestingly, ExPecto performed best on variants with class "FP" but worst on variants with other classes. We also evaluated variants overlapping with ClinVar (2019.10.08), and DeepSEA had the best overall performance, and Eigen showed the best performance on "Drug response" related variants ( Figure S3; Table S9).

Web interface
REVA (http://reva.gao-lab.org) provides an interactive web interface for users to explore all data entries and analysis results (Figure 4, Figure S4). Users can start a quick search by chromosome position, rs ID, gene name, ensemble gene ID, or disease name. "Advanced search" provides a customized search and batch search for users. The query result is presented as a table, which includes basic information, expression information (such as the label, effect size, and adjusted P value), and the related genomic region. Users can directly click the link of position and rs ID to access UCSC Genome Browser and dbSNP (https://www.ncbi. nlm.nih.gov/snp/) for more information. Users can also click the "details" link for more information. The detail page contains eight modules: "Basic Information", "Cell Line and Expression", "Three-dimensional Interacting Gene", "Chromatin State", "Disease and Phenotype", "Meta Sources" (only available for variants involved in meta-analysis), "Accession", and "Annotation". In the "Annotation" module, chromatin profile features are rendered as a heatmap by cell line and a boxplot by category, and DNA physicochemical properties and evolutionary features are presented as responsive tables. Users can download the annotation for further analysis. Moreover, we also provide benchmarking results of state-of-the-art computational tools. Users can download all variants in REVA and the benchmarking dataset through the "Download" page.

Explore plausible regulatory mechanisms of expressionmodulating variants
Autoimmune diseases are caused by the abnormal immune response to attack and damage functional tissues due to complex interactions between environmental and genetic factors [45]. GWAS and fine-mapping studies have identified thousands of noncoding variants associated with autoimmune diseases [46]. Since the mechanisms of autoimmune disease are complicated, pinpointing causal variants and exploring their possible functional mechanisms remain a challenge [47].
Ankylosing spondylitis is a kind of chronic autoimmune disease, but the pathogenesis remains unclear [48]. On the advanced search page of REVA ( Figure S5), we filtered the label to positive and searched with "ankylosing spondylitis". The search result contained 8 entries, and among them, the variant rs4456788 (near the ICOSLG locus) had the largest effect size tested in HepG2 cell line and was considered to repress expression. It was also tested in K562 cell line and resulted in the same conclusion. Through the annotation module of the detail page, we found that in HepG2 cell line, the alternative allele of rs4456788 could decrease the binding affinity of TFs MAZ and FUS. MAZ has been proven to have bidirectional transcriptional regulation [49], and FUS has a transcriptional activation function [50]. It could be the possible regulatory mechanism of rs445678, and this might be helpful for further researches on the mechanism of pathogenesis of ankylosing spondylitis.

Discussion
REVA is a database specifically designed for storing experimentally validated expression-modulating data. It currently consists of 11,862,367 entries covering 5,948,789 experimentally tested noncoding loci across 18 cell cultures. Both experimentally validated expression-modulating variants and meta-information about assays were curated. Comparing with the existing database, REVA is the largest database designed for curating experimentally validated expression-modulating noncoding variants specially. Besides, we provide 2424 functional annotations, including TF binding, epigenetic modifications, DNA accessibility, DNA physicochemical properties, and evolutionary features.
Most of the variants in REVA were located in intergenic and intronic regions and were unevenly distributed on chromosomes. Several factors may contribute to the uneven distribution. First, it has been well demonstrated that the functional elements are unevenly distributed across chromosomes [51,52]. Consistently, we found that the numbers of both positive and negative variants were highly correlated with the gene numbers across all chromosomes (Pearson's r = 0.80, P = 2.6 × 10 −6 for positive variants; Pearson's r = 0.82, P = 7.1 × 10 −7 for negative variants).
Moreover, technical challenges counted too. In particular, the Y chromosome had long been taken as a "genetic wasteland" [53] and excluded from genomic analyses for quite some time due to its genetic and structure complexities [54]. Although this idea has been shifted with more researches on chromosome Y, the underrepresentation of chromosome Y on commonly used arrays still exists [55]. We also noticed that certain experimental designs may lead to reporting bias [8,33,[38][39][40][41]. However, after removing data generated from studies designed for assessing particular regions [38] or elements [8,33,[39][40][41], we found that the uneven distribution remains. Furthermore, we provide a high-quality benchmarking dataset for evaluating state-of-the-art computational tools designed for identifying expression-modulating variants as well as benchmarking results of multiple published computational tools as a reference for users to select the best tools for their particular tasks. Overall, all seven tools have high specificity but low sensitivity. DeepSEA has the best performance on the whole benchmarking dataset in terms of AUROC and F1 score, and all tools have better performance on disease-related or phenotype-related variants, suggesting that the diversity of the original training data of these tools contributes to different performance across different benchmark subsets. We noticed that not all tools involved in the benchmark were designed for identifying expression-modulating variants originally, and a "negative" expression-modulating noncoding variant might also be associated with disease via nontranscription mechanisms like epigenetic marks [56] or chromatin structuration [57].
It should be noted that not all variants collected in our database were tested by identical experimental protocols. Non-saturation mutagenesis-based studies examine several elements at a time, and each fragment usually contains one variant, with the effect size calculated by counting reads directly [8] or employing a linear model [32]. Meanwhile, saturation mutagenesis-based studies focus on a few elements; each fragment contains two or more variants, and the effect size is calculated through linear regression [41]. Protocol details for each variant were documented during curation to help users interpret records effectively ( Figure S6).
We believe that this database will be useful for not only computational but also bench biologists in genomics, bioinformatics, and genetics communities, and we will keep the resources updated with new data and annotations that emerge in the coming years.