aRgus: Multilevel visualization of non-synonymous single nucleotide variants & advanced pathogenicity score modeling for genetic vulnerability assessment

The widespread use of high-throughput sequencing techniques is leading to a rapidly increasing number of disease-associated variants of unknown significance and candidate genes. Integration of knowledge concerning their genetic, protein as well as functional and conservational aspects is necessary for an exhaustive assessment of their relevance and for prioritization of further clinical and functional studies investigating their role in human disease. To collect the necessary information, a multitude of different databases has to be accessed, and data extraction from the original sources is commonly not user-friendly and requires advanced bioinformatics skills. This decreases data accessibility for a relevant number of potential users such as clinicians, geneticists, and clinical researchers. Here, we present aRgus (https://argus.urz.uni-heidelberg.de/), a standalone webtool for simple extraction and intuitive visualization of multi-layered gene, protein, variant, and variant effect prediction data. aRgus provides interactive exploitation of these data within seconds for any known gene of the human genome. In contrast to existing online platforms for compilation of variant data, aRgus complements visualization of chromosomal exon-intron structure and protein domain annotation with ClinVar and gnomAD variant distributions as well as position-specific variant effect prediction score modeling. aRgus thereby enables timely assessment of protein regions vulnerable to variation with single amino acid resolution and provides numerous applications in variant and protein domain interpretation as well as in the design of in vitro experiments.


Introduction
In recent years, high-throughput sequencing methods have led to a tremendous increase in the extent of genetic and variant data related to human disease [1,2]. Upon identification of disease-associated genetic variants of unknown significance or of variants in novel candidate genes, an investigator may need to integrate multilayered information concerning exon-intron structure, protein domain annotation, mutational constraint, as well as known variants present in patients and healthy individuals, including their allele frequency. Additionally, the potential biological impact of variants on protein structure and function can be predicted using in silico pathogenicity scores that assign a numerical value to each amino acid substitution. This is particularly helpful for estimating damaging variant effects when no functional in vitro data are available. The American College of Medical Genetics and Genomics (ACMG) has published guidelines for structured assessment and classification of genetic variants and discusses the use of several in silico tools also utilized in the aRgus workflow (see Table 2 from Richards et al., 2015). Regarding their use, the guideline states that "multiple lines of computational evidence either support a deleterious effect on the gene or gene product (PP3) or suggest no impact on gene or gene product (BP4)". However, commonly used in silico scores partially evaluate the very same or similar aspects, such as the evolutionary conservation of a base or amino acid, the position of the amino acid within the protein sequence, and the biochemical and biophysical consequences of an amino acid substitution, and they are often trained on overlapping datasets. They may therefore produce uniform rather than divergent results, potentially weakening the validity of this criterion. The global accuracy of most scores evaluating SNPs can be estimated at around 65-80% [8], with a tendency to overestimate pathogenicity [9]. Therefore, computational evidence must be used with exceptional caution in variant interpretation, especially when multiple in silico scores produce conflicting interpretations for a given variant [3].
Although large quantities of genetic data are publicly accessible, they are mostly shared in abstract, tabular form and stored in a multitude of different databases that have to be accessed individually. Extraction, formatting, and analysis of such data often require extensive bioinformatic capabilities. User-friendly platforms have previously been developed to facilitate access to genetic data but lack detailed integration and visualization capabilities for different pathogenicity scoring models [4-7].
Here, we introduce aRgus (https://argus.urz.uni-heidelberg.de/) as a standalone webtool for user-friendly and intuitive compilation and visualization of complex data on genetic variants and in silico pathogenicity scores from the extensive databases Ensembl, Simple ClinVar, the Universal Protein Resource (UniProt), the Genome Aggregation Database (gnomAD), and dbNSFP [4,5,7,10,11]. The Ensembl database contains comprehensive genomic information including chromosomal gene and transcript localization [4]. Simple ClinVar is an interactive webtool using a custom algorithm to retrieve simplified summary statistics on variant and phenotype information from ClinVar, the largest archive of genetic variants associated with human disease [5,12]. UniProt represents the largest database for protein sequence and domain annotation data [7]. The gnomAD database contains variant data from nearly 150,000 healthy individuals identified in exome and genome sequencing studies [10]. The dbNSFP database represents a rich resource containing values of numerous in silico pathogenicity scores precalculated for all biologically possible non-synonymous single-nucleotide variants (nsSNVs) and related information, such as their gnomAD allele frequencies, that can be used for variant annotation [11]. dbNSFP is implemented in several annotation tools such as ANNOVAR, VarSome, the UCSC Genome Browser, and the Ensembl Variant Effect Predictor. It also offers its own application, which, however, can only be used for single queries or short lists of SNVs [6,13-15].
In contrast, aRgus provides a synopsis of both variant and pathogenicity score data using an intuitive graphical user interface. aRgus displays exon-intron structure and protein domain annotation together with ClinVar and gnomAD variant distributions, a vivid visualization of pathogenicity score values and their statistical comparison in different variant groups, as well as an interactive table comprising ClinVar- and dbNSFP-derived variants. The use of aRgus enables identification of protein regions susceptible to missense variation up to single amino acid (AA) resolution and represents a powerful tool for enhanced inference-based variant interpretation.

Visualization of tabular pathogenicity score data
Theoretically, a gene transcript can mutate at any base position into three alternate bases, leading to nsSNVs on the gene level as well as amino acid substitutions or truncations on the protein level, depending on the position within the base triplet. The damaging effect on protein function can be predicted in silico by an individual value of different pathogenicity scores assigned to each amino acid substitution (Fig. S1). Thus, all biologically possible nsSNVs can be simulated, resulting in several data points per amino acid position. In order to visualize these data intuitively and vividly, a dual approach was taken. First, the geom_smooth() function of the R package ggplot2 is used with default parameter settings to generate a polynomial regression of smoothed conditional means, displayed as an approximation curve with a 95% confidence interval. In the case of ≥ 1000 data points, a generalized additive model (GAM) with a shrinkage version of penalized cubic regression splines is fitted with the R package mgcv. For < 1000 data points, local polynomial regression fitting (loess) is conducted. Second, the arithmetic mean of the multiple pathogenicity score values at each amino acid position is calculated and visualized as a heat-strip, color-coded by the predicted degree of effect on the protein (Fig. S1).
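The enumeration and averaging steps described above can be illustrated with a minimal Python sketch. It is not the aRgus implementation (which is written in R and retrieves precalculated values from dbNSFP); the function names `possible_nssnvs`, `positionwise_mean`, and `smoother_for` are hypothetical, and the score values in the usage example are invented for illustration only.

```python
from itertools import product
from statistics import mean

# Standard genetic code in the conventional TCAG ordering ("*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def possible_nssnvs(cds):
    """Return (aa_position, ref_aa, alt_aa) for every single-base change
    in a coding sequence that alters the encoded amino acid (or creates a stop)."""
    variants = []
    usable = len(cds) - len(cds) % 3  # ignore an incomplete trailing codon
    for start in range(0, usable, 3):
        codon = cds[start:start + 3].upper()
        ref_aa = CODON_TABLE[codon]
        for offset in range(3):
            for alt_base in BASES:
                if alt_base == codon[offset]:
                    continue
                mutated = codon[:offset] + alt_base + codon[offset + 1:]
                alt_aa = CODON_TABLE[mutated]
                if alt_aa != ref_aa:  # keep non-synonymous changes only
                    variants.append((start // 3 + 1, ref_aa, alt_aa))
    return variants


def positionwise_mean(scored):
    """Collapse (aa_position, score) pairs into one mean score per position,
    i.e. the values that a heat-strip would color-code."""
    by_position = {}
    for position, score in scored:
        by_position.setdefault(position, []).append(score)
    return {pos: mean(vals) for pos, vals in sorted(by_position.items())}


def smoother_for(n_points):
    """Mirror the default rule stated in the text: loess below 1000 points, GAM otherwise."""
    return "gam" if n_points >= 1000 else "loess"


if __name__ == "__main__":
    print(len(possible_nssnvs("ATGCTG")))  # Met + Leu codon
    print(positionwise_mean([(1, 0.2), (1, 0.4), (2, 0.9)]))  # invented scores
    print(smoother_for(412))
```

Because methionine is encoded by a single codon, every substitution in ATG is non-synonymous, whereas four of the nine substitutions in the leucine codon CTG are silent; this is why the number of data points differs between amino acid positions.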

Unspliced transcript plot
The unspliced transcript plot (UTP) displays the gene's scaled exon-intron structure from left to right, starting with the first exon, for improved readability regardless of the genomic localization on the forward or reverse strand. By default, pathogenic and likely pathogenic (P/LP) ClinVar variants are shown as lollipops, which allows convenient visualization of intronic variants. To display the variant description, ClinVar and simulated dbNSFP variants can be manually selected in the respective tables and are then visualized in the UTP and protein plot simultaneously. Fig. 2A shows the UTP for the gene ASS1, encoding the enzyme argininosuccinate synthase (ASS), with selected P/LP ClinVar variants (red) and variants from the In silico scores table (gray), which contains the dbNSFP-derived variants.
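The strand-independent orientation described above can be sketched as a simple coordinate transformation. The following Python example is only an illustration under stated assumptions: the function name `to_display_coords`, the 1-based inclusive coordinate convention, and the `+1`/`-1` strand encoding are hypothetical choices and are not taken from the aRgus source code.

```python
def to_display_coords(exons, strand):
    """Convert genomic exon intervals into offsets from the transcript start.

    exons  : list of (start, end) genomic coordinates, start <= end (assumed convention)
    strand : +1 for the forward strand, -1 for the reverse strand (assumed encoding)

    The result is ordered 5' to 3', so the first exon is always drawn leftmost.
    """
    if not exons:
        return []
    gene_start = min(start for start, _ in exons)
    gene_end = max(end for _, end in exons)
    if strand >= 0:
        # Forward strand: shift so that the gene starts at offset 0.
        shifted = [(start - gene_start, end - gene_start) for start, end in exons]
    else:
        # Reverse strand: mirror around the gene end so transcription runs left to right.
        shifted = [(gene_end - end, gene_end - start) for start, end in exons]
    return sorted(shifted)


if __name__ == "__main__":
    exons = [(100, 150), (300, 400)]  # invented coordinates
    print(to_display_coords(exons, +1))
    print(to_display_coords(exons, -1))
```

Mirroring the intervals rather than merely reversing the list preserves exon and intron lengths, so the plot remains to scale on either strand.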

Protein plot
The primary structure of the resulting protein is visualized by the protein plot, which shows a linearized representation together with annotated domains retrieved from UniProt. As in the UTP, variants can be manually selected from the provided tables. Thereby, the distribution of known and novel variants and their relation to protein domains/regions can easily be assessed. This versatile visualization provides useful insights into the pathophysiological relevance of potentially functionally relevant domains, particularly for genes scarcely associated with pathogenic variants. Fig. 2B shows the respective amino acid changes and protein domains of ASS.

ClinVar and gnomAD mutational constraint plots
Distributions of ClinVar and gnomAD variants with respect to their protein position and allele frequency are visualized by density and bar plots, respectively, facilitating assessment of a protein's mutational constraint. This includes sections of mutational hotspots, recurrent pathogenic and benign variants as well as the position-specific degree of tolerance towards missense variation. For more precise localization, ClinVar variants are additionally shown as vertical lines underneath the density curves (Fig. 2C). gnomAD variants are displayed in two separate logarithmic bar plots depending on their origin from the exomes (green) or genomes (blue) dataset (Fig. 2D). For ASS, ClinVar density curves reveal an accumulation of pathogenic variants in the region of AA 260-280, whereas gnomAD variants from both exomes and genomes show low population allele frequencies or are completely absent from the dataset (Fig. 2E).
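To make the two plot types above more concrete, the following Python sketch computes a kernel density estimate over variant positions and a log-scaled allele frequency. It is a minimal illustration, not the aRgus implementation: the Gaussian kernel, the fixed bandwidth, the frequency floor, and the function names `gaussian_kde` and `log10_allele_frequency` are all assumptions, and the example positions are invented.

```python
import math


def gaussian_kde(positions, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate of variant positions on a grid.

    positions : amino acid positions of observed variants
    grid      : positions at which the density is evaluated
    bandwidth : kernel standard deviation in residues (assumed, user-chosen)
    """
    if not positions:
        raise ValueError("at least one variant position is required")
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    norm = 1.0 / (len(positions) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in positions)
        for x in grid
    ]


def log10_allele_frequency(allele_frequency, floor=1e-6):
    """Log-transform an allele frequency for a logarithmic bar plot.

    The floor (an assumed value) avoids log10(0) for variants without observed alleles.
    """
    return math.log10(max(allele_frequency, floor))


if __name__ == "__main__":
    pathogenic_positions = [262, 265, 270, 271, 279, 118]  # invented example data
    grid = list(range(1, 413))
    density = gaussian_kde(pathogenic_positions, grid, bandwidth=10.0)
    peak = grid[density.index(max(density))]
    print("density peak near residue", peak)
    print(log10_allele_frequency(0.001))
```

A density peak in the pathogenic curve that coincides with low or absent population allele frequencies is the pattern the text describes for a mutationally constrained region.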

In silico pathogenicity score model
Pre-calculated pathogenicity score values of all biologically possible nsSNVs are retrieved from the dbNSFP database. To improve data accessibility, the resulting multiple data points per protein position are simplified and visualized using a polynomial regression model combined with a heat-strip scaled to the linear protein representation. Depending on the user's research question, the desired pathogenicity scoring model can immediately be selected from a list of up to 43 different scores. Plots for three different scores can be displayed simultaneously. This enables assessment of the predicted, position-specific impact of amino acid substitutions within the context of known protein domains and facilitates detection of regions of increased or decreased susceptibility to missense variation. Thereby, the functional impact of novel variants can be estimated, and previously uncharacterized regions with predicted damaging variant effects can be identified to formulate future research hypotheses.
In our practical example, regions with low (AA 200-250) and high (AA 270-300) values of the pathogenicity score REVEL correspond to local minima and maxima of the curve. The heat-strip representation displays mean score values, allowing a more fine-grained resolution (Fig. 2E).

Statistical comparisons
Pathogenicity score values within the four variant groups ClinVar_pathogenic, ClinVar_benign, gnomAD, and InSilico are shown as violin plots with integrated quartiles (for definitions see Methods Section 2.3). Additionally, score value distributions are statistically compared to assess the capability of the specific score to discriminate between variants of the different categories and hence its possible suitability for variant classification. For example, ASS1 variants annotated as P/LP yield significantly higher CADD and REVEL score values than variants in the other three groups (Fig. 2F). These statistical comparisons should be interpreted with the greatest possible caution. Due to the heterogeneity of the underlying data and the non-transparent application of the ACMG criteria used for the evaluation of variants, a cautious and conscientious review of the information generated by aRgus is necessary.
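The idea of asking how well a score separates two variant groups can be sketched in Python as follows. The statistical test used by aRgus is defined in the Methods and is not restated here, so the rank-based Mann-Whitney U statistic below is a swapped-in, assumed example rather than the authors' procedure; the function names and the score values are hypothetical.

```python
from statistics import quantiles


def mann_whitney_u(group_a, group_b):
    """U statistic: number of (a, b) pairs with a > b, counting ties as 0.5."""
    u = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


def separation(group_a, group_b):
    """Probability that a random value from group_a exceeds one from group_b.

    0.5 means no discrimination; 1.0 means group_a always scores higher.
    """
    if not group_a or not group_b:
        raise ValueError("both groups must be non-empty")
    return mann_whitney_u(group_a, group_b) / (len(group_a) * len(group_b))


def quartiles(values):
    """Quartile cut points of the kind drawn inside a violin plot (needs >= 2 values)."""
    return quantiles(values, n=4)


if __name__ == "__main__":
    # Invented score values for two hypothetical variant groups.
    pathogenic = [0.81, 0.92, 0.77, 0.88, 0.69]
    benign = [0.12, 0.30, 0.25, 0.41, 0.70]
    print("separation:", separation(pathogenic, benign))
    print("pathogenic quartiles:", quartiles(pathogenic))
```

A separation close to 0.5 would suggest that the chosen score carries little information for distinguishing the two groups in the gene of interest, which is one reason the comparisons call for careful interpretation.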

Interactive table
At the bottom of the user interface, an interactive table that remains sticky during scrolling is available (Fig. 1). It comprises two tabs listing all ClinVar variants and all simulated nsSNVs with their corresponding pathogenicity score values. For interactivity, variants selected in the table are displayed in the UTP and protein plot. Both tables can be filtered, e.g., by variant type. Individual cells with score values in the in silico table are color-coded according to the predicted variant effect using score-specific cut-offs.
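The cut-off based color-coding and the filtering described above can be sketched as follows. This Python example is an illustration only: the cut-off values, color codes, column names, and function names are placeholders chosen for the example and are not the values or identifiers used by aRgus. The direction flag reflects that for some scores lower values indicate a more damaging effect.

```python
# Hypothetical cut-offs: (threshold, True if HIGHER values mean more damaging).
# These numbers are illustrative placeholders, not the cut-offs used by aRgus.
CUTOFFS = {
    "REVEL": (0.5, True),
    "CADD_phred": (20.0, True),
    "SIFT": (0.05, False),
}

# Hypothetical cell colors per predicted effect.
COLORS = {"damaging": "#d73027", "tolerated": "#4575b4", "missing": "#cccccc"}


def classify(score_name, value):
    """Label a score value as damaging or tolerated using its score-specific cut-off."""
    if value is None:
        return "missing"
    threshold, higher_is_damaging = CUTOFFS[score_name]
    damaging = value >= threshold if higher_is_damaging else value <= threshold
    return "damaging" if damaging else "tolerated"


def cell_color(score_name, value):
    """Return the color used to render one table cell."""
    return COLORS[classify(score_name, value)]


def filter_rows(rows, **criteria):
    """Keep the table rows whose fields match all given criteria."""
    return [row for row in rows if all(row.get(k) == v for k, v in criteria.items())]


if __name__ == "__main__":
    # Invented rows with hypothetical column names.
    table = [
        {"variant": "p.Gly100Ser", "type": "missense", "REVEL": 0.82},
        {"variant": "p.Arg200Ter", "type": "stop_gained", "REVEL": None},
    ]
    for row in filter_rows(table, type="missense"):
        print(row["variant"], classify("REVEL", row["REVEL"]), cell_color("REVEL", row["REVEL"]))
```

Keeping the threshold and its direction together per score avoids mislabeling scores such as SIFT, for which low values are the damaging ones.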

Discussion
The availability of databases with clinical and genetic information has never been greater than it is today. Scientific and medical advances, particularly in terms of sequencing and storage capabilities, will lead to an exponential growth of information in the coming decades. However, database queries often require bioinformatic tools, which ultimately limits the yield and usability of such databases. To enable clinicians, scientists, and other users without prior bioinformatic knowledge to explore rich yet complex datasets, user-friendly tools with an intuitive interface and the possibility to easily export data for further processing are needed. Web server applications allow users to make such queries regardless of device and operating system. aRgus is therefore designed as a lightweight, multidimensional R/Shiny application to enable fast database queries.
aRgus requires minimal user input in the form of the gene name according to the HUGO Gene Nomenclature Committee (HGNC) standard. From this input, aRgus retrieves information of variable complexity on the localization and distribution of pathogenic variants at the chromosomal and protein levels, which can be used to explore biological and biochemical properties, such as mutational hotspots of pathogenic and benign variation within proteins. Visual linkages of pathogenic variation can be generated by annotating functionally important regions and domains from the UniProt database. aRgus provides simple means of displaying complex distributional information using a complexity-reduced density representation that is quick and easy for the human eye to comprehend.
The aRgus user is offered a wide range of possibilities to select the information relevant to the respective research question. By allowing simultaneous display of variants stored in gnomAD, the issue of survivorship bias, a form of selection bias, can be overcome. Survivorship bias occurs in all clinical genetic databases and potentially leads to oversight of variants that did not pass biological selection when only pathogenic variants from clinical databases such as ClinVar are assessed. This often results in misconceptions in the interpretation of mutational hotspots. The gnomAD database v2.1 contains over 125,000 exomes and 15,000 genomes from different populations. A comparison of benign variants derived from gnomAD and pathogenic variants listed in ClinVar and other genetic databases thus enables an improved assessment of putative pathogenic hotspots on the gene and protein level.
Beyond pure visualization of information on known pathogenic variants, a polynomial regression model and heatmap visualization offer an additional way of exploiting the data, which can be particularly advantageous for proteins that have so far been described only to a limited extent. These models make otherwise inaccessible tabular pathogenicity score data accessible and simplify the comprehension of predicted variant effects up to single amino acid resolution. By annotating all biologically possible missense variants using 36 different pathogenicity scores, statements can be made about protein regions in which amino acid exchanges have a high predicted impact, even without existing in vitro studies. Alternatively, the resulting information can be used to plan functional in vitro studies, e.g., to investigate the functional relevance of regions in scarcely described proteins or in proteins with only limited data on pathogenic variants. Consequently, the aRgus platform compiles and visualizes multiple layers of information.
In multiple previous studies investigating rare monogenic disorders, the aRgus workflow demonstrated its unique value for genetic assessment. The combination of protein domain and variant distribution data as well as pathogenicity score modeling was applied in two different subtypes of acute liver failure in infancy caused by variants in the genes TRMU and LARS1, which encode tRNA-processing enzymes. Pathogenic variant accumulation could be identified in specific protein regions and functional domains whose role in pathogenesis was further strengthened by pathogenicity score modeling [21,22]. In another use case, the evaluation of disease-causing variants of mevalonic aciduria (MVA), it was demonstrated that variants associated with the more severe subtype MVA are almost exclusively located within regions of low predicted tolerance towards missense variation, as indicated by aRgus pathogenicity score modeling. Furthermore, several pathogenicity scores could significantly discriminate between MVA subtypes and healthy individuals [23]. Additionally, the aRgus workflow could identify protein regions within functional domains that appear to be especially important in the pathogenesis of a SYNCRIP-associated neurodevelopmental disorder and of a gluconeogenesis defect caused by phosphoenolpyruvate carboxykinase deficiency [24,25]. For the latter two genes, this analysis was pivotal, as only few disease-related variants had previously been reported, and knowledge derived from the aRgus model might support future variant interpretation.

Limitations
aRgus is subject to some limitations. The quality of the visualizations and analyses produced by aRgus depends heavily on the quality of the available data. According to our use cases, ClinVar data do not represent the entirety of all previously reported pathogenic variants. This is largely because genetic laboratories are not obliged to enter newly discovered disease-causing variants in centralized repositories. Extensive literature reviews are therefore necessary to obtain a comprehensive picture of the mutational distribution. This could be significantly improved by the addition of further databases, such as the commercial HGMD or the openly accessible LOVD [26,27]. The potentially inconsistent application of the ACMG criteria in the interpretation of variants in each database also represents a possible fundamental source of error for aRgus. However, improved automated variant interpretation and strict adherence to the ACMG criteria should ensure a high quality of the underlying data in the long run. To enable users to visualize variants identified through their own literature research or genetic studies, variants can be selected from the dbNSFP-derived table of pathogenicity score values and are automatically highlighted in all plots.

Conclusion
By combining accessible and interactive visualizations of genetic and variant data with pathogenicity analysis in a synoptic, standalone tool, aRgus stands out from existing applications for genetic data exploitation regarding output versatility and flexibility. With each update of the databases connected to aRgus, the diversity and analysis capabilities of its visualizations and datasets will also improve. Thus, aRgus will provide useful and previously mostly inaccessible information to a broad user base with limited bioinformatics skills, such as practicing clinicians, basic scientists, and geneticists, and thereby help to answer scientific questions.

CRediT authorship contribution statement
JS, TD, and HB devised the project and main conceptual ideas and designed the study. JS, HB, JH, AJ, SU, and DH designed and delivered the technical realization and implementation of aRgus. All authors contributed to the further development of aRgus through their intellectual input and the execution of targeted analyses. All authors provided critical feedback and helped shape the research, analysis, and manuscript.