Aequatus: an open-source homology browser

Abstract

Background
Phylogenetic information inferred from the study of homologous genes helps us to understand the evolution of genes and gene families, including the identification of ancestral gene duplication events as well as regions under positive or purifying selection within lineages. Gene family and orthogroup characterization enables the identification of syntenic blocks, which can then be visualized with various tools. Unfortunately, currently available tools display only an overview of syntenic regions as a whole, limited to the gene level, and none provide further details about structural changes within genes, such as the conservation of ancestral exon boundaries amongst multiple genomes.

Findings
We present Aequatus, an open-source web-based tool that provides an in-depth view of gene structure across gene families, with various options to render and filter visualizations. It relies on precalculated alignment and gene feature information typically held in, but not limited to, the Ensembl Compara and Core databases. We also offer Aequatus.js, a reusable JavaScript module that fulfills the visualization aspects of Aequatus, available within the Galaxy web platform as a visualization plug-in, which can be used to visualize gene trees generated by the GeneSeqToFamily workflow.

genomic neighbourhood and that of orthologous genes, thus enabling users to visualise local synteny (in a visualisation that is reminiscent of Genomicus). The key innovation, though, lies in a comparative display of gene structure, with an annotated gene tree on the left and a display of matching exons on the right. There are a few additional display options (tables, "Sankey plots") but these are not nearly as important. Finally, there is an option to dynamically load information on the domain architecture of any gene.
The widget is well thought out and innovative, and so in general I think it will be quite useful to users. For resource developers, it is a real plus that Aequatus is available both with a full-fledged back-end integrated with Ensembl Compara, which is a leading resource, and as a JavaScript library only.
Inevitably I have a few points, but these should be easy to address in a revised manuscript.
Major points
1. From a user standpoint, where is or will Aequatus be available? For instance, will http://aequatus.earlham.ac.uk/animal_compara/ be kept up to date with the latest Ensembl release? In any case, information about the release should be provided.
A. Yes, we will keep Aequatus available at the given web address. We will try our best to keep Aequatus updated with the latest release of Ensembl. We are also working on a new Aequatus version which will be able to retrieve data directly from Ensembl servers via its API. Moreover, we are going to add the Ensembl release version to the Aequatus webpage to keep the user informed.
2. From a developer standpoint, what is the rough scaling behaviour of the tool in terms of number of genomes, number of genes per family, etc.? What are the bottlenecks? The point is that potential developers will want to know whether the widget can scale to their resource.
A. Aequatus renders the visualisation of a gene family reasonably fast, but since this is generated on the fly, it may become slower as the size and complexity of the gene family (number of paralogues and number of genomes) increase.
Minor points
3. It's not so clear what the colours mean in the gene structure view. I assume these are arbitrary colours that are nevertheless consistent across species, but I still wonder whether there is information in the fact that some exons have different intensities of the same hue, while others don't.
A. As pointed out by the referee, the colours were arbitrarily selected to distinguish syntenic genes and matching exons in the syntenic and gene tree views, respectively. We have now added a description in the Figure 2 legend.
4. Also unclear to me is the meaning of the tiny arrows between exons, as well as the very fine red, black and white lines. Consider adding a legend.
A. The arrows denote the strand on which the gene is located. We have also updated the Figure 3 legend in the manuscript, adding "In gene tree view, gray blocks at the start and end of each gene represent UTRs (untranslated regions), black bars within exons indicate insertions, red lines represent deletions specific to a given gene compared with the guide, and tiny arrows denote the coding strand of the gene."
Discretionary/Suggestions
5. It would be nice to have the possibility to save the current view as SVG, to facilitate inclusion in publications.
A. We appreciate the reviewer's suggestion about exporting the current view as SVG; this could be very useful. We look forward to integrating this into our next release.

Summary
This study presents Aequatus, a web-based tool for the visualization of syntenic relationships as well as phylogenetic relationships between genes from pre-calculated alignments and gene features. Aequatus allows for the visualization of gene structure similarities in more detail than other tools, as it can visualize synteny at the exon level.
General Comments
This software will be a useful tool for the visualization of detailed syntenic relationships at the sub-gene level and at the localized gene level, though not at the large, chromosomal scale. The software meets the aims of the study to provide a tool to visualize both phylogenetic and syntenic information at the sub-gene level.
Code for this software is available on github and a demo web server is also available at links provided in the manuscript. The manuscript makes use of data from the Ensembl Compara and Core databases.
The language of this manuscript is of good quality.
Major Comments/Suggestions
1. The demo server appears to have a very limited number of species available for browsing. Is there a plan for this to be expanded so that this can be a browser for the entire Compara database?
A. We do plan to expand the set of species to cover the Compara databases. The demo server was intended to showcase the functionality of Aequatus, but we are working on mirroring Ensembl Compara locally and then providing an Aequatus instance running on it, serving all available species. We are also working on a new Aequatus version which will be able to retrieve data directly from Ensembl servers via its API, so that all species are always available at the latest release.
2. The authors list various other tools for synteny visualization and briefly mention how Aequatus differs from these tools. I think it is necessary to expand this discussion/comparison, for example, are these other tools webservers or tools to be run command line? Is there a major speed difference or differences in the size of input data that can be handled? A figure illustrating the output visualizations of these different tools, highlighting how Aequatus differs would be informative.
A. We have added a comparison table along with figures in the Supplementary material to compare the features of various phylogenetic visualisation tools with Aequatus. Regarding the comparison of performance: the performance of each tool depends on the technology used for development, the number of genomes and gene families visualised, and the complexity of the data. Software performance will also be affected by the configuration of the host server and, in some cases, of the local computer when the visualisation is generated locally. Thus, a performance comparison with the other tools would not allow us to conclude which tool performs better.
3. The authors mention in the introduction that Aequatus "allows the identification of exon/intron boundary changes and mutations, informing the user about underlying genetic changes, but can also highlight mis-annotations, pseudogenes [17], or polyploidisation in animal and plant genomes." It is not explicitly illustrated in the manuscript how one can obtain this information using Aequatus. I think that an example (including figures) of how one can arrive at each of these biological insights mentioned in the introduction, using Aequatus, would be very useful. In addition, illustrating how Aequatus facilitates these interpretations better than other tools would be a good point to illustrate explicitly.
A. We have moved the cited sentence to the Discussion and clarified it, also adding an example figure where Aequatus visualises a potential misannotation in the pig genome.
4. This paper appears to be a companion paper to "GeneSeqToFamily: a Galaxy workflow to find gene families based on the Ensembl Compara GeneTrees pipeline" (Anil S Thanki et al., https://doi.org/10.1093/gigascience/giy005). Citing this paper and explaining more how it fits into the Aequatus pipeline is necessary in my opinion.
A. We have now added more information about how the Aequatus visualisation tool is working together with the GeneSeqToFamily workflow in Galaxy.
Minor Comments/Suggestions
5. The manuscript abstract says that the software is available for download under an MIT license, whereas on the GitHub page it is distributed under the GNU General Public License. This should be made consistent.
A. We thank the reviewer for pointing this out. This was a mistake on our side: Aequatus.js is available under the MIT license, whereas the Aequatus software as a whole is under the GPL v3 license. We have corrected this in the manuscript.

Introduction
Sequence conservation across populations or species can be investigated at multiple levels, from single nucleotides to discrete sequences (e.g. transcription factor binding sites, exons, introns), genes, genomic blocks, and chromosomes. Analyses at each of these levels inform different evolutionary processes and time scales. While the vast majority of analyses focus on gene evolution, synteny (the conservation of genomic blocks between multiple species) can be used to trace chromosome evolutionary history [1] and infer evolutionary relationships between genes across or within species [2]. Synteny resolution and analysis typically involve carrying out multiple sequence alignments (MSAs) and phylogenetic reconstruction, comprising multiple steps that can be computationally intensive even for relatively small numbers of data points [3].
Many methods are available for the identification of genome-wide orthology (MSOAR [4], OrthoMCL [5], OMA [6], HomoloGene [7], PhyOP [8], TreeFam [9], TreeBeST [10]). However, most of them do not incorporate taxonomic information (typically in the form of a species tree) while finding gene families, nor provide any information regarding transcript and protein structural changes across orthogroup members. The Ensembl GeneTrees pipeline [11], a computational workflow developed by the EMBL-EBI Ensembl Compara team, produces familial relationships based on clustering, MSA, and phylogenetic tree inference.
Phylogenetic reconstruction is the most traditional method to represent and view comparative datasets across a given evolutionary distance, but specific tools such as Ensembl Browser [13], Genomicus [14], SyMAP [15], and MizBee [16] also exist to provide finer-grained information. These tools are able to provide an overview of syntenic regions as a whole, with only Genomicus reaching down to the gene order and orientation level.
Conversely, phylogenetic trees retain ancestral information but do not represent the underlying information regarding structural changes within genes, such as the conservation of ancestral exon boundaries between multiple genomes or variants within genes that can be correlated to phenotypic changes. In order to build these gene-level visualisations, basic genomic feature information is required.
Therefore, we have developed Aequatus to bridge the gap between phylogenetic information and gene feature information. Here we show that Aequatus allows the identification of exon/intron boundary changes and mutations, informing the user about underlying genetic changes.

Materials and Methods
Aequatus is built using open-source technologies and is divided into a typical client-server architecture: a web interface and a server backend (see Figure 1). The server side is connected to the Ensembl Compara and Core databases using Java Data Access Objects and to the SMART server via a REST API, and the client side is implemented using popular technologies such as JavaScript, jQuery, d3.js, and jQuery DataTables.

Results
The landing page of Aequatus (see Figure 2) contains a header with a search box (2A) and a dropdown list of species (2B), followed by a selectable Chromosomal view underneath (2C).
Aequatus has a draggable control panel (2G) on the left-hand side, which contains buttons to show/hide the chromosome selector on top, modify gene views and labels, and access the search box and the export options, as well as a link to the help pages.

Aequatus user interface
Aequatus provides various ways to visualise gene trees and the inferred orthology/paralogy from them.

Main Gene Trees View
The gene tree view (see Figure 3) comprises a phylogenetic tree on the left, built from GeneTree information stored in an Ensembl Compara database [11]. Aequatus relates the genes through different events (e.g. duplication, speciation, and gene split) for the gene family and homologous genes against each respective node, which are coloured based on the potential evolutionary event. Homologous genes are visualised by aligning them against a given guide gene. The selected guide gene is depicted as a larger black circular leaf node in the tree, with a red label on the right, while the other genes have a smaller circular leaf node and a grey label.
On the right, Aequatus depicts the internal gene structure, using a shared colour scheme for coding regions, to represent similarity across homologues. In gene tree view, gray blocks at the start and end of each gene represent UTRs (untranslated regions), black bars within exons indicate insertions, red lines represent deletions specific to a given gene compared with the guide, and tiny arrows denote the coding strand of the gene.
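To illustrate how insertions and deletions relative to a guide gene could be derived from CIGAR alignments, here is a minimal Python sketch. It assumes a Compara-style CIGAR in which an optional count precedes each operation letter (a missing count meaning 1), with `M` for aligned residues and `D` for a gap. The function names are hypothetical and are not taken from the Aequatus code base.

```python
import re

# Assumed CIGAR grammar: optional count followed by one uppercase letter.
CIGAR_TOKEN = re.compile(r"(\d*)([A-Z])")


def parse_cigar(cigar):
    """Return a list of (operation, length) tuples, e.g. [('M', 3), ('D', 2)]."""
    ops = []
    pos = 0
    for match in CIGAR_TOKEN.finditer(cigar):
        if match.start() != pos:
            raise ValueError(f"unexpected character at position {pos}")
        count, op = match.groups()
        ops.append((op, int(count) if count else 1))
        pos = match.end()
    if pos != len(cigar):
        raise ValueError(f"unexpected character at position {pos}")
    return ops


def expand(cigar):
    """Expand a CIGAR string to one letter per alignment column."""
    return "".join(op * length for op, length in parse_cigar(cigar))


def compare_to_guide(guide_cigar, gene_cigar):
    """Label each alignment column of a gene relative to the guide gene."""
    guide, gene = expand(guide_cigar), expand(gene_cigar)
    if len(guide) != len(gene):
        raise ValueError("CIGAR strings describe alignments of different length")
    labels = []
    for g, h in zip(guide, gene):
        if g == "M" and h == "M":
            labels.append("match")
        elif g == "M":
            labels.append("deletion")   # guide has a residue, gene has a gap
        elif h == "M":
            labels.append("insertion")  # gene has a residue, guide has a gap
        else:
            labels.append("gap")        # both sequences are gapped here
    return labels
```

Runs of consecutive `insertion` or `deletion` labels could then be mapped to the black bars and red lines described above; how Aequatus does this internally is not specified here.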

Popups
Aequatus provides a contextual menu system via interactive popup menus, which are displayed when a user clicks on a gene (see Figure 4). Each popup shows: the gene name and its position; a link to find protein domain information using SMART; links to export the protein sequence or the CIGAR alignment; an option to set the current gene as the guide in order to see insertions and deletions in homologous genes relative to the selected guide gene; a link out to the Ensembl page for the gene; an option to view the pairwise alignment.

Homologous Genes
The underlying information describing homologous genes contained within the Compara database schema can be visualised using either a tabular view or a Sankey plot.

Sankey view
The Sankey view (see Figure 7) visualises homology as an interactive diagram, where the homologues of a selected gene are distinguished by homology type, i.e. paralogs, 1-to-1 orthologs, or 1-to-many orthologs. The nodes for homologous genes are coloured by species, which helps to find genes from the same species in the case of 1-to-many and many-to-many orthologs.
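As a rough illustration of the data behind such a diagram, the following Python sketch groups homologues by homology type and emits a nodes/links structure of the kind commonly consumed by Sankey layout libraries. The record fields (`gene`, `species`, `type`) and the function name are assumptions made for this example, not the actual Aequatus data model.

```python
def build_sankey(query_gene, homologues):
    """Build nodes and links: query gene -> homology type -> homologous gene."""
    nodes = [{"name": query_gene, "kind": "query"}]
    links = []
    type_link = {}  # homology type -> link from the query to that type node
    for hom in homologues:
        htype = hom["type"]
        if htype not in type_link:
            nodes.append({"name": htype, "kind": "type"})
            link = {"source": 0, "target": len(nodes) - 1, "value": 0}
            links.append(link)
            type_link[htype] = link
        # The width of the query -> type link counts the genes of that type.
        type_link[htype]["value"] += 1
        # The species is kept on the gene node so a renderer can colour by it.
        nodes.append({"name": hom["gene"], "kind": "gene",
                      "species": hom["species"]})
        links.append({"source": type_link[htype]["target"],
                      "target": len(nodes) - 1, "value": 1})
    return {"nodes": nodes, "links": links}
```

Keeping the species on each gene node leaves the colouring decision to the renderer, which matches the description of nodes being coloured by species.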

Gene Order
Genes that share a common ancestor and are part of a consecutive block of genes are likely to have a transcriptional and/or functional relationship [21]. Hence, inferred homologues which are present in all species and in the same order are more likely to be real homologues. In the Gene Order view, neighbouring genes are displayed for the selected gene and its homologues (see Figure 9).
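The idea of a gene's neighbourhood can be sketched in a few lines of Python. The function below is a hypothetical helper, not part of Aequatus: it assumes each gene on a chromosome is given as an `(identifier, start)` pair and returns the selected gene together with a fixed number of flanking genes on each side.

```python
def neighbourhood(genes, gene_id, flank=2):
    """Return gene_id with up to `flank` neighbours on each side, in order.

    `genes` is a list of (identifier, start_position) pairs that all lie
    on the same chromosome.
    """
    ordered = sorted(genes, key=lambda gene: gene[1])
    ids = [identifier for identifier, _ in ordered]
    index = ids.index(gene_id)  # raises ValueError if the gene is absent
    return ids[max(0, index - flank): index + flank + 1]
```

Calling this once for the selected gene and once for each homologue gives the rows that a gene order display would place side by side for comparison.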

Search
Aequatus has keyword-based search functionality, whereby the user can provide search terms and a list of all the relevant genes is returned. Aequatus can query for matching gene symbols, Ensembl stable IDs (unique identifiers in the Ensembl project for each genomic annotation), common names for genes and proteins, or any keyword in the description.
Search results then allow the user to visualise the corresponding gene tree view, or homologous genes in the tabular or Sankey views.
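A keyword search of this kind can be approximated as a case-insensitive substring match over a few gene fields. The sketch below is only illustrative: the field names (`symbol`, `stable_id`, `description`) are assumptions, and the real Aequatus search runs against the Ensembl databases rather than an in-memory list.

```python
SEARCH_FIELDS = ("symbol", "stable_id", "description")  # assumed field names


def search_genes(genes, term):
    """Return the genes whose searchable fields contain `term`."""
    needle = term.lower()
    return [
        gene for gene in genes
        if any(needle in (gene.get(field) or "").lower()
               for field in SEARCH_FIELDS)
    ]
```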

Export
Users can export data at different points in the visualisation. In the gene tree view the underlying genomic data for the gene families can be exported in various forms, such as a list of gene IDs, the sequence alignments, or the gene trees in Newick [22] or JavaScript Object Notation (JSON) [23] format, for use in downstream tools. The tabular view can be exported in CSV, XLS, and PDF format.
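To show how the two tree export formats relate, here is a minimal Python sketch that parses a simple Newick string into nested dictionaries, which can then be serialised as JSON. It assumes unquoted labels, no comments, and no whitespace inside the tree. The key names (`name`, `length`, `children`) are chosen for illustration and do not necessarily match the JSON that Aequatus exports.

```python
import json


def parse_newick(text):
    """Parse a simple Newick string into nested dicts."""
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("Newick string must end with ';'")
    pos = 0

    def parse_node():
        nonlocal pos
        node = {}
        if text[pos] == "(":
            pos += 1  # consume '('
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1  # consume ','
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ')'
            node["children"] = children
        # Optional label.
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        if pos > start:
            node["name"] = text[start:pos]
        # Optional branch length introduced by ':'.
        if text[pos] == ":":
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            node["length"] = float(text[start:pos])
        return node

    tree = parse_node()
    if text[pos] != ";":
        raise ValueError(f"unexpected character at position {pos}")
    return tree


def newick_to_json(text):
    """Serialise the parsed tree as a JSON string."""
    return json.dumps(parse_newick(text))
```

For example, `newick_to_json("(A:0.1,B:0.2)root;")` yields a JSON object with a `children` array holding the two leaves and a `name` of `root`.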

Discussion
The ultimate goal of Aequatus is to provide a unique and informative way to render and explore complex relationships between genes from various species at a level of detail that has so far been unrealised in a single platform. Aequatus allows the identification of exon/intron boundary changes and mutations, informing the user about underlying genetic changes, but can also highlight misannotations, pseudogenes [24], or polyploidisation (see Figure 10). We are currently testing Aequatus with a range of non-model organisms, such as koala, polyploid crops, and spiny mouse. As Aequatus can visualise relationships using simple CIGAR strings, any tool that outputs this format can use Aequatus to view them. We produce input for Aequatus using the GeneSeqToFamily pipeline, a freely available Galaxy workflow [25] for finding and visualising gene families for genomes which are not available from Ensembl databases.
suggest a potential gene split event or just a mis-annotation.
In order to make Aequatus more accessible and reusable, the gene tree visualisation module from the standalone Aequatus browser is available as Aequatus.js [26], an open source JavaScript library. In this way, it preserves the interactive functionality of the Aequatus browser tool but can be integrated with other third-party web applications. We have demonstrated this by integrating the Aequatus.js library into Galaxy [27], where gene families generated by running the GeneSeqToFamily workflow can be visualised using the Aequatus plugin within Galaxy.

Future Directions
The main extension to the functionalities of Aequatus is the incorporation of Ensembl REST API functionality [28], where Aequatus will be able to retrieve information directly from Ensembl Compara and Core databases held at the EMBL-EBI, without any need for local database configuration. Whilst this will mean that users will need a reliable internet connection, it will reduce the need for local storage space for the Core databases, improving the portability of Aequatus.
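As a sketch of what such retrieval might look like, the Python snippet below builds a request URL for the public Ensembl REST `genetree/id` endpoint and fetches the result as JSON. This is an illustration only, not the planned Aequatus implementation; the helper names are hypothetical, and any extra query parameters passed through `**params` should be checked against the Ensembl REST documentation.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

ENSEMBL_REST = "https://rest.ensembl.org"


def genetree_url(genetree_id, **params):
    """Build the URL for a gene tree lookup by stable identifier."""
    query = {"content-type": "application/json", **params}
    return f"{ENSEMBL_REST}/genetree/id/{quote(genetree_id)}?{urlencode(query)}"


def fetch_genetree(genetree_id, **params):
    """Download and decode a gene tree (requires network access)."""
    with urlopen(genetree_url(genetree_id, **params)) as response:
        return json.load(response)
```

Separating URL construction from the network call keeps the request logic easy to test without contacting the server.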