FAVOR: functional annotation of variants online resource and annotator for variation across the human genome

Abstract Large biobank-scale whole genome sequencing (WGS) studies are rapidly identifying a multitude of coding and non-coding variants. They provide an unprecedented resource for illuminating the genetic basis of human diseases. Variant functional annotations play a critical role in WGS analysis, result interpretation, and prioritization of disease- or trait-associated causal variants. Existing functional annotation databases have limited scope to perform online queries and functionally annotate the genotype data of large biobank-scale WGS studies. We develop the Functional Annotation of Variants Online Resources (FAVOR) to meet these pressing needs. FAVOR provides a comprehensive multi-faceted variant functional annotation online portal that summarizes and visualizes findings of all possible nine billion single nucleotide variants (SNVs) across the genome. It allows for rapid variant-, gene- and region-level queries of variant functional annotations. FAVOR integrates variant functional information from multiple sources to describe the functional characteristics of variants and facilitates prioritizing plausible causal variants influencing human phenotypes. Furthermore, we provide a scalable annotation tool, FAVORannotator, to functionally annotate large-scale WGS studies and efficiently store the genotype and their variant functional annotation data in a single file using the annotated Genomic Data Structure (aGDS) format, making downstream analysis more convenient. FAVOR and FAVORannotator are available at https://favor.genohub.org.


INTRODUCTION
A rapidly increasing number of large biobank-scale Whole Genome/Exome Sequencing (WGS/WES) studies are being conducted. They provide rich opportunities for understanding the genetic bases of complex human diseases and traits. Examples of large WGS/WES studies include the Trans-Omics Precision Medicine Program (TOPMed) of the National Heart, Lung and Blood Institute (NHLBI) (1), the Genome Sequencing Program (GSP) of the National Human Genome Research Institute (NHGRI), UK Biobank (2) and All of Us (3). These large WGS/WES studies have identified hundreds of millions of coding and non-coding genetic variants across the human genome from hundreds of thousands of individuals and provided opportunities to evaluate their associations with diseases and traits.
There is a pressing need to develop a comprehensive whole genome variant functional annotation database and browser for online queries to facilitate analysis and interpretation of GWAS and WGS/WES studies, as well as software that functionally annotates any GWAS and WGS/WES study for downstream statistical genetic analysis. Although there are several well-established variant functional annotation databases, such as CADD (5,33), VEP (34), Annovar (35), WGSA (36), SnpEff (37), and the recently developed functional databases VarSome (38) and VannoPortal (39), they have several limitations. First, these resources have limited online query capabilities, and do not provide user-friendly variant functional annotation browsers that summarize and visualize multi-faceted functional annotations of a single variant and/or multiple variants in a gene or a region. For example, WGSA does not provide an online browser for querying variant functional annotations. VEP provides a browser with only a few annotations. CADD allows for querying a single variant or variants in a region but displays the annotation results in a large table that is difficult to navigate. The recently developed tool VannoPortal has several attractive features, including a responsive and interactive web interface with rich functional annotations, but it currently supports only single variant queries. Most of these resources do not allow for gene- and region-level variant annotations and have limited capacities in summarizing and visualizing query results.
Third, there is a lack of scalable and easy-to-use tools that satisfy the need of functionally annotating large-scale WGS/WES studies. Existing functional annotation databases and tools are not scalable for functionally annotating a massive number of variants in large-scale WGS/WES studies. Moreover, few of the currently available functional annotation tools can provide organized output in a format that is both storage-efficient and ready to be used in downstream statistical genetic analyses, such as fine-mapping (4,11), heritability (6) and rare variant association tests (12,13). There is a pressing community need to develop a convenient and comprehensive functional annotation tool that annotates any WGS study dataset at scale and generates a functionally annotated genotype file in an organized and compressed format that can be readily integrated into the downstream analysis.
We developed Functional Annotation of Variants Online Resources (FAVOR), a comprehensive whole genome variant annotation database and a variant browser that provides hundreds of functional annotation scores from a variety of biological functional dimensions for all possible 9 billion Single Nucleotide Variants (SNVs) and observed short insertions/deletions (indels). FAVOR provides a fast, convenient, and user-friendly web interface that features online single variant, gene- and region-level variant queries. Search results are well-organized and conveniently visualized according to their major functional categories. FAVOR overcomes the limitations of existing tools by providing functional annotation information that can be easily viewed through multiple functional category-based blocks and tables directly on its web interface. In addition, FAVOR automatically generates dynamic summaries of search results by identifying important functional scores of the queried variant. These unique features give users immediate and intuitive insight into the search results while maintaining access to the comprehensive display of multi-faceted functional scores. We provide a comparison of FAVOR with the existing annotation databases (Supplementary Table S1).
D1302 Nucleic Acids Research, 2023, Vol. 51, Database issue
We have also developed FAVORannotator, a tool that functionally annotates the genotype data of any WGS/WES study at scale using the FAVOR database (GRCh38 build) and stores the genotype data and their aligned functional annotation data in an annotated Genomic Data Structure (aGDS) file. The proposed aGDS data format extends the Genomic Data Structure (GDS) format (43) by storing the genotype data and the corresponding functional annotation data in a single file, making downstream integrative analysis of variants with their functional annotations more efficient and convenient. The GDS format is highly storage-efficient, with a compression rate of a thousand times compared with the VCF format. FAVORannotator is scalable and computationally efficient for functionally annotating large biobank-scale WGS/WES studies. For example, it completes the functional annotation of 1 billion variants of 184 878 multi-ethnic WGS samples in 38 CPU hours and stores those data in an aGDS file of size 488 GB. FAVORannotator automatically exports annotation results into the aGDS format and achieves high storage efficiency (using the CCDG Freeze 2 and TOPMed Freeze 8 datasets as examples; see Supplementary Table S2).

FAVOR DATABASE
The FAVOR relational functional annotation database provides comprehensive multi-faceted variant functional annotations of all possible 9 billion SNVs in the whole genome by integrating data from multiple different sources, including CADD v1.5 (5,33) and GENCODE v31 (44) (Supplementary Table S3). The FAVOR database can be downloaded from the FAVOR website.

FAVOR ONLINE PORTAL
The online FAVOR portal facilitates fast and convenient online functional annotation queries using an R Shiny app (Figure 1). It allows users to search for a single variant (either in position format or rsID), multiple variants in a gene or genomic region (either in position format or gene name), or batches of up to tens of thousands of variants. The variant functional annotation results are displayed as tabular overviews in a summary tab (Figure 2) and a full tables tab (Figure 3), and are visualized using histograms (Figure 4).
The FAVOR web interface is highly responsive. Single Variant Search (by variant position or rsID) renders results on the webpage immediately, Gene-based and Region-based Variant Search take just a few seconds to display results, and Batch annotation directly generates the annotation results for up to 10 000 variants while allowing for a range of input file formats. This fast response speed is the product of the backend database indices and table design. The indices employ a diverse set of data structures, each tailored toward specific functionalities. The table design relies upon an original primary key (a combined string that consists of the variant chromosome, position, and reference and alternative alleles, e.g. 19-44908822-C-T) that efficiently relates the tables with regard to both computation and storage. This implementation enables the fast query of 160 annotations for all 9 billion SNVs at the variant, gene and region levels. Single Variant Search organizes functional annotation results in blocks defined by annotation types (Figure 3 and Supplementary Table S3), and Gene-based and Region-based Variant Search results are displayed in large tables (Supplementary Table S4). All query results displayed on the web interface can be downloaded using the "Download query results" button at the bottom.
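The combined primary key described above can be illustrated with a minimal sketch. The helper names (`VariantKey`, `make_key`, `parse_key`) are hypothetical and are not part of FAVOR; only the key layout, chromosome-position-reference-alternative joined by hyphens, is taken from the text.

```python
from typing import NamedTuple


class VariantKey(NamedTuple):
    """Typed view of a combined variant key (hypothetical helper type)."""
    chrom: str
    pos: int
    ref: str
    alt: str


def make_key(chrom: str, pos: int, ref: str, alt: str) -> str:
    """Join the four fields into one 'chrom-pos-ref-alt' string."""
    return f"{chrom}-{pos}-{ref}-{alt}"


def parse_key(key: str) -> VariantKey:
    """Split a 'chrom-pos-ref-alt' string back into typed fields."""
    chrom, pos, ref, alt = key.split("-")
    return VariantKey(chrom, int(pos), ref, alt)


key = make_key("19", 44908822, "C", "T")
print(key)  # 19-44908822-C-T
print(parse_key(key))
```

Because the key is a single string, it can serve both as a lookup value for one variant and as the join column that relates annotation tables to each other.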
Compared with the other existing variant functional annotation online portals, FAVOR provides more comprehensive query options (Supplementary Table S5), including Single Variant, Gene-based and Region-based Searches, and Batch annotation. For the variant-level query, FAVOR's query speed is similar to that of CADD and much faster than that of the other functional annotation online portals. FAVOR provides gene/region-level variant functional annotations, which are lacking in other portals. FAVOR is a little slower in batch annotation, as it provides many more functional annotations than the other portals that allow for batch annotation (Supplementary Table S5).

Single Variant Search
For Single Variant Search, users can input a variant position (in hg38 build) or an rsID. The retrieved functional annotation results are displayed in three tabs: Summary, Full Tables and Figures. [...] the PHRED scale (in the top 10% of the genome). Selecting and presenting the most informative functional annotations of a queried variant in the Summary tab avoids overwhelming users with a large amount of information.
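The correspondence between a PHRED-scaled score and a genome-wide top fraction can be sketched as follows. This assumes the rank-based PHRED scaling used by scores such as CADD, namely -10 log10(rank/total); the function names are hypothetical and the snippet is not FAVOR code.

```python
import math


def phred_scaled(rank: int, total: int) -> float:
    """PHRED-scale a rank, where rank 1 is the highest-scoring variant."""
    if not 1 <= rank <= total:
        raise ValueError("rank must be between 1 and total")
    return -10.0 * math.log10(rank / total)


def top_fraction(phred: float) -> float:
    """Fraction of variants scoring at least as high as a PHRED value."""
    return 10.0 ** (-phred / 10.0)


# A variant ranked at the 10% mark of all variants
print(phred_scaled(rank=100, total=1000))
# Fractions corresponding to PHRED-scaled scores of 10 and 20
print(top_fraction(10.0), top_fraction(20.0))
```

Under this assumption, a PHRED-scaled score of 10 marks the top 10% of variants and a score of 20 marks the top 1%.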
The Full Tables tab displays all functional annotation scores, organized into 17 blocks of annotation groups (Figure 4). These blocks are Basic, ClinVar, Variant Category, Overall Allele Frequencies (AFs), Ancestry-Specific AF, Gender AF, Integrative Score, Protein Function, Conservation, Epigenetics, Transcription Factors, Chromatin States, Local Nucleotide Diversity, Mutation Density, Mappability and Proximity Table. Different groups of functional annotations depict the variants from multiple functional perspectives. For example, ClinVar reports the relationships between genetic variants and phenotypes (41). FAVOR provides critical information from ClinVar, including Clinical Significance, Disease Name, Review Status, Disease Database ID, and Gene Reported related to the variants. Variant Category annotations provide the consequences of the genetic variants in the context of the gene, categorical regulatory information, and the location of the variant relative to the closest gene (Supplementary Table S3).
Furthermore, FAVOR displays category-specific individual functional annotations that represent multiple biological functionalities of each variant in a given functional category (Supplementary Table S3). For example, protein function scores measure the damaging impact of a variant on protein function. Conservation scores summarize the conservation of the variant sites (both within and between species). Epigenetics scores summarize the signals of the open chromatin markers, closed chromatin markers, and transcription markers. FAVOR also provides individual annotation scores of local nucleotide diversity, mutation density and mappability (e.g. using the unconverted genome Umap and the bisulfite-converted genome Bismap) (Supplementary Table S3). Results can be visualized using histograms in the Figures tab (Figure 4).

Region/Gene-based Search
For Region/Gene-based Search, users input either a gene name (official symbol), or region (starting and ending positions using the hg38 build). FAVOR will instantaneously output the functional annotation summary results of the variants in the gene or the region, as well as variant-specific annotations in a range of annotation categories. The fast display of the retrieved results of the Region/Gene-based Search is enabled through indexing and efficient multi-table database management.
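How an index makes a region query fast can be sketched with a small in-memory table. This is only an illustration using SQLite from the Python standard library; the table name, column names, index name and score values are hypothetical, and the actual FAVOR schema may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE variant (
        variant_id TEXT PRIMARY KEY,  -- combined key, e.g. '19-44908822-C-T'
        chrom      TEXT NOT NULL,
        pos        INTEGER NOT NULL,
        score      REAL               -- placeholder annotation column
    )
    """
)
# Composite index so that a region lookup is a range scan, not a full scan
conn.execute("CREATE INDEX idx_variant_region ON variant (chrom, pos)")

# Toy rows; the scores are made-up values for illustration only
rows = [
    ("19-44908684-T-C", "19", 44908684, 0.5),
    ("19-44908822-C-T", "19", 44908822, 0.9),
    ("19-45000000-G-A", "19", 45000000, 0.1),
    ("1-10253-CTA-C", "1", 10253, 0.2),
]
conn.executemany("INSERT INTO variant VALUES (?, ?, ?, ?)", rows)


def region_query(conn, chrom, start, end):
    """Return variant keys on one chromosome within [start, end]."""
    cur = conn.execute(
        "SELECT variant_id FROM variant "
        "WHERE chrom = ? AND pos BETWEEN ? AND ? ORDER BY pos",
        (chrom, start, end),
    )
    return [row[0] for row in cur]


print(region_query(conn, "19", 44905000, 44910000))
```

A gene-based query can be reduced to the same pattern by first resolving the gene symbol to its start and end coordinates.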
The Region/Gene-based Search summary tab provides the summary statistics of the variants in a region or a gene using several key summary tables and histograms, including Allele Frequency Distribution, GENCODE Category, ClinVar Clinical Significance, Functional Consequences and High Integrative Functional Scores (Figure 5). It also has a convenient search feature that allows users to filter the variants in the region/gene based on specified features and keywords. For example, typing 'pathogenic' in the search box above the displayed table shows only the pathogenic variants of the region/gene.

Batch annotation
Batch annotation provides functional annotations of a list of variants submitted by users in a file. It supports multiple file formats as input, including CSV, TSV, VCF, XLS and RDS. Multiple formats and IDs of variants are also supported. For example, each row of a text file can specify a variant's chromosome, position, reference allele and alternative allele (e.g. 1-10253-CTA-C), a variant's chromosome and position (e.g. 1-10253), or an rsID (e.g. rs868413313). Users can upload the variant list in any of the above file formats on the FAVOR batch annotation page. Batch annotation files are currently limited to 10,000 variants in the interest of online wait time. It takes a few minutes to annotate 1000 variants. The annotation results, containing 160 annotations of the variants in the submitted variant list, are available for download. FAVORannotator, discussed below, can be used to handle functional annotations of a larger number of variants, e.g. hundreds of millions of variants in a WGS/WES study.
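The three row formats accepted by batch annotation can be told apart with a few regular expressions. The following is a minimal sketch rather than the portal's actual parser: the patterns, the function names and the restriction to chromosomes 1-22, X and Y are assumptions, while the example identifiers and the 10,000-variant limit are taken from the text.

```python
import re

# Assumed patterns for the three row formats described above
FULL_VARIANT = re.compile(r"^(\d{1,2}|X|Y)-(\d+)-([ACGT]+)-([ACGT]+)$")
CHROM_POS = re.compile(r"^(\d{1,2}|X|Y)-(\d+)$")
RSID = re.compile(r"^rs\d+$")

MAX_BATCH_SIZE = 10_000  # limit stated for online batch annotation


def classify(entry: str) -> str:
    """Label one input row as 'variant', 'position', 'rsid' or 'unknown'."""
    entry = entry.strip()
    if FULL_VARIANT.match(entry):
        return "variant"
    if CHROM_POS.match(entry):
        return "position"
    if RSID.match(entry):
        return "rsid"
    return "unknown"


def validate_batch(entries):
    """Reject oversized batches and return (entry, label) pairs."""
    if len(entries) > MAX_BATCH_SIZE:
        raise ValueError(f"batch exceeds {MAX_BATCH_SIZE} variants")
    return [(e, classify(e)) for e in entries]


for entry, label in validate_batch(["1-10253-CTA-C", "1-10253", "rs868413313"]):
    print(entry, label)
```

Classifying each row before querying lets a mixed list of positions and rsIDs be routed to the appropriate lookup.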

ANNOTATED GENOMIC DATA STRUCTURE (AGDS)
Variant Call Format (VCF) (51) has been frequently used for storing variant call data of sequencing studies. However, VCF is text-based and thus inefficient with regard to storage, particularly for large-scale WGS data of hundreds of thousands to millions of subjects that have hundreds [...] and simultaneous retrieval of genotype and matched functional annotation data defined by flexible filtering criteria. Second, it is convenient to integrate an aGDS file into functionally informed downstream analysis pipelines, such as STAARpipeline for rare variant association analysis. Third, it is also highly storage-efficient for genotype data and their functional annotation data. An aGDS file containing TOPMed Freeze 8 WGS data, including both the genotypes and the functional annotations of 140,306 samples, takes only 478 GB, which is three orders of magnitude smaller than the corresponding VCF files (Supplementary Table S2).
The GDS format is designed to host large genotype data and achieves highly efficient random access to compressed data through independently compressed data blocks. It stores genotypes in a 2-bit array with ploidy, sample, and variant dimensions. An index vector associated with the genotypes is used to indicate the number of bits (43). An aGDS file uses SeqArray to build functional annotation data into a GDS file. Variable-length annotation vectors are organized in an array. Both the building and the retrieval of functional annotations support efficient random access (43). The lossless compression algorithms supported by aGDS are the Lempel-Ziv-Markov chain algorithm (LZMA) and zlib. LZMA offers a higher compression ratio but requires more memory and run time (43). Functional annotation data are recorded alongside genotype data in a highly compressed format that significantly reduces storage consumption. Fast random access to the compressed functional annotations of selected variant sets can be performed efficiently, making aGDS attractive for hosting functionally annotated large-scale WGS/WES data for convenient downstream analysis.
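The storage saving of a 2-bit genotype array can be illustrated with a short sketch that packs four 2-bit codes into each byte. The bit order and the meaning of each code are assumptions made for illustration; this is not SeqArray's actual on-disk layout.

```python
def pack_2bit(codes):
    """Pack values in {0, 1, 2, 3} four per byte, lowest bits first."""
    out = bytearray((len(codes) + 3) // 4)
    for i, code in enumerate(codes):
        if not 0 <= code <= 3:
            raise ValueError("each code must fit in 2 bits")
        out[i // 4] |= code << (2 * (i % 4))
    return bytes(out)


def unpack_2bit(packed, n):
    """Recover the first n 2-bit codes from a packed byte string."""
    return [(packed[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(n)]


# Assumed coding: 0 = reference allele, 1 = alternative allele, 3 = missing
alleles = [0, 1, 0, 0, 1, 1, 3, 0]  # 4 diploid samples at one variant
packed = pack_2bit(alleles)
print(len(alleles), "codes stored in", len(packed), "bytes")
print(unpack_2bit(packed, len(alleles)) == alleles)
```

Storing one allele code in 2 bits instead of one text character plus a separator is a fourfold or greater reduction before any block compression such as LZMA or zlib is applied.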
Several existing WGS association analysis tools support the aGDS format, e.g. STAAR (12) and STAARpipeline. Several other tools support the GDS format, e.g. GENESIS (56), SeqArray (43), SeqVarTools and SNPRelate (57). As aGDS files are fully compatible with the tools supporting GDS files, the analytic tools that support the GDS format can be extended to support the aGDS format.

FAVORANNOTATOR
FAVORannotator is an open-source tool that uses the FAVOR database to functionally annotate and efficiently store genotype and variant functional annotation data of a WGS/WES study in an aGDS file, making downstream association analysis convenient (Figure 7). FAVORannotator only requires genotype data or a variant list as input and automatically annotates the genotype data or the variant list, generating an aGDS file as an output. An aGDS file with both genotypes and their functional annotations facilitates rare variant association analysis using individual-level data, e.g. using STAAR (12), while an aGDS file with only a variant list and their functional annotations facilitates rare variant meta-analysis using WGS summary statistics.
The time and memory required to annotate a large number of variants using FAVORannotator are modest, even for large-scale WGS/WES datasets such as TOPMed, GSP and UK Biobank. For example, FAVORannotator produces an annotated genotype file in the aGDS format for n = 184 878 whole genome samples with 1 billion variants of the TOPMed Freeze 10a WGS data in 38 CPU hours, and for n = 60 545 whole genome samples with 450 million variants of the GSP-CCDG Freeze 2 WGS data within 30 CPU hours. FAVORannotator has also been implemented as a workflow on cloud-based platforms, including DNAnexus (UK Biobank), AnVIL (NHGRI) and BioData Catalyst (NHLBI) (Figure 8) (52). FAVORannotator's efficiency keeps cloud computing costs low. For example, it costs ∼$25 to annotate the TOPMed Freeze 10a WGS data by chromosome in parallel, e.g. in 3 CPU hours for chromosome 1.
Users can add customized functional annotations to an aGDS file by adding new columns to the FAVOR database using either the CSV or SQL format and then running FAVORannotator.
Both speed and storage efficiency of annotation results are crucial for downstream analysis. As existing functional annotation databases and tools, such as Annovar (35) and VEP (34), store variant annotation results in text tables (TSV, CSV), they are much less efficient in query speed and storage than FAVORannotator, which uses the aGDS format (Supplementary Table S6). Several variant functional annotation tools, such as SnpEff (37), Vcfanno (55) and VarNote (54), use the VCF format. As VCF repeats the same annotation variable names in the INFO column of every record, it is much less storage-efficient than aGDS (Supplementary Table S6). FAVORannotator, SnpEff (37), Vcfanno (55) and VarNote (54) store annotations alongside genotype data, and are convenient for downstream analysis. Supplementary Table S6 shows that the aGDS-based FAVORannotator is much more storage-efficient than existing tools that export annotation results as text tables or annotated VCF files, such as Annovar, CADD, VarNote and Vcfanno, and is hence more efficient for downstream analysis.
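The overhead of repeating annotation names in every record can be made concrete with a toy byte count. This sketch compares uncompressed text only; the annotation names and values are hypothetical, and the numbers are not a benchmark of the VCF or aGDS formats.

```python
names = ["score_a", "score_b", "score_c"]  # hypothetical annotation names
# Toy values: 1000 records with one value per annotation name
records = [
    [round(0.001 * (i + j), 3) for j in range(len(names))]
    for i in range(1000)
]


def info_style_bytes(names, records):
    """Bytes used when every record repeats 'name=value;...' pairs."""
    lines = (
        ";".join(f"{n}={v}" for n, v in zip(names, rec)) for rec in records
    )
    return sum(len(line.encode()) + 1 for line in lines)  # +1 per newline


def columnar_bytes(names, records):
    """Bytes used when names are stored once and values are stored by column."""
    header = sum(len(n.encode()) + 1 for n in names)
    values = sum(len(str(v).encode()) + 1 for rec in records for v in rec)
    return header + values


print(info_style_bytes(names, records), columnar_bytes(names, records))
```

The gap between the two counts grows linearly with the number of records and with the total length of the annotation names, which is why keeping the names out of each record matters when there are many annotations per variant.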

DISCUSSION
FAVOR offers a comprehensive solution for the application of whole genome variant functional annotations, including an open-access and downloadable database, a user-friendly browser, and a tool, FAVORannotator, to annotate large-scale WGS/WES data. The FAVOR database is a large relational data structure of multi-faceted functional annotations of all possible 9 billion SNVs and 80 million observed indels in the human genome. It is built using a storage-efficient PostgreSQL database with indexed and relational tables, which provide fast query speeds. The FAVOR web interface provides fast variant-, gene- and region-level online multi-faceted functional annotations, as well as batch annotation. It emphasizes responsiveness while providing dynamic display and visualization features, and uses combined approaches, including visualizations, block organization by category, and convenient search and sorting functions, to provide a fast and convenient summary of the major functional impact of variants.
The FAVORannotator software enables researchers to use the FAVOR database to functionally annotate large WGS/WES studies efficiently and at scale, and to build a highly compressed and well-organized aGDS file. An aGDS file includes both genotype data and their annotations and can be easily integrated into downstream analysis pipelines. Together, FAVOR and FAVORannotator provide a valuable tool to facilitate downstream analysis and interpretation of WES/WGS studies and array-based GWAS.
Although several compression methods are available for storing WGS data, such as gzip (vcf.gz), bgzip or BCF (53), they are subject to two major limitations. First, they are not efficient for storing large-scale WGS data. Second, they are difficult to read while compressed. For instance, although the BCF format is more storage-efficient than the VCF format, its compression rate is 100 times, whereas the GDS format has a compression rate of 1000 times. Furthermore, neither the VCF nor the BCF format stores variant annotations efficiently or supports efficient retrieval of annotations. The aGDS format resolves both limitations. FAVORannotator is currently developed as a standalone annotation tool optimized for fast query performance using the FAVOR database. Users who would like to do functional annotation directly from commonly used public functional annotation databases can use general-purpose functional annotation tools and aligners, such as BCFTools (53), VarNote (54) and Vcfanno (55). These tools produce annotated VCF files, which are often quite large for biobank-size WGS studies. FAVORannotator can then be used to convert the annotated VCF files generated by these annotation aligners to more storage-efficient aGDS files. It is of future research interest to extend FAVORannotator to be a general-purpose aligner that can perform efficient functional annotation directly using public functional annotation databases. It is also of future interest to port the FAVOR database for use by general-purpose functional annotation tools, such as BCFTools, VarNote and Vcfanno.
In summary, FAVOR and FAVORannotator provide an intuitive infrastructure for facilitating downstream analysis and result interpretation of large-scale WES/WGS studies. FAVOR currently provides non-tissue-specific epigenetic functional annotations for non-coding variants. It is of future interest to integrate tissue- and cell-type-specific epigenetic functional annotations in FAVOR. As functional annotations continue to grow in depth and breadth, we will continue to improve and expand FAVOR by integrating more state-of-the-art annotations and supporting more analytical scenarios.
The FAVOR essential database (containing 20 essential functional annotation scores) for all possible SNVs (8 812 917 339) and observed indels (79 997 898) [...]
[...] He is also on the Scientific Advisory Board of Veritas Genetics. X. Lin is a consultant of AbbVie Pharmaceuticals and Verily Life Sciences. X.Z. is currently an employee of AbbVie Pharmaceuticals. Z.W. co-founded and serves as a scientific advisor for Rgenta Inc.