Hymenoptera Genome Database: new genomes and annotation datasets for improved GO enrichment and orthologue analyses

Abstract We report an update of the Hymenoptera Genome Database (HGD; http://HymenopteraGenome.org), a genomic database of hymenopteran insect species. The number of species represented in HGD has nearly tripled, with fifty-eight hymenopteran species, including twenty bees, twenty-three ants, eleven wasps and four sawflies. With a reorganized website, HGD continues to provide the HymenopteraMine genomic data mining warehouse and JBrowse/Apollo genome browsers integrated with BLAST. We have computed Gene Ontology (GO) annotations for all species, greatly enhancing the GO annotation data gathered from UniProt with more than a ten-fold increase in the number of GO-annotated genes. We have also generated orthology datasets that encompass all HGD species and provide orthologue clusters for fourteen taxonomic groups. The new GO annotation and orthology data are available for searching in HymenopteraMine, and as bulk file downloads.


INTRODUCTION AND OVERVIEW
The Hymenoptera Genome Database (HGD; http://HymenopteraGenome.org) (1) is a genome informatics resource for insects of the order Hymenoptera (bees, ants, wasps and sawflies). Widely accessible and efficient sequencing technologies have made it possible for researchers of hymenopteran insects to use genome sequencing to address a wide variety of questions. For example, the hymenopteran species exhibit a range of eusociality levels, from solitary to advanced eusocial lifestyles, and are used to investigate topics such as evolution of eusociality, molecular regulation of division of labor and epigenetics of behavior (2)(3)(4)(5)(6)(7)(8). Hymenopteran genome sequencing projects are also used to develop models for evolution and adaptation to fungal and plant symbioses (9)(10)(11)(12)(13), evolution of social parasitism (14), parasitoid biology (15)(16)(17), impact of endosymbionts (13,18,19), adaptation of invasive species (20), ecological speciation (21), transitions to asexual reproduction (21), phenotypic plasticity (8,14,22), selfish B chromosome drive (23) and the evolution of miniaturization (16). In addition to developing biological models, genome sequencing is used to address topics related to agriculture, such as response to pesticides (24) and roles as biological control agents (15,16). Furthermore, the Hymenoptera are the largest group of pollinators (25). The goal of HGD is to make the hymenopteran genome sequences and associated data easily accessible for further investigation.
As reported previously (1), HGD provides JBrowse (26) genome browsers with Apollo (27) annotation tools, integrated with a BLAST server (28,29), for visual inspection of genes in their genomic context. The primary method of searching HGD is with HymenopteraMine, a data mining warehouse for querying and exporting disparate sources of gene annotation data. HymenopteraMine, based on the InterMine data mining warehouse (30), integrates data from external sources, including RefSeq (31), UniProt (32), InterPro (33), OrthoDB (34), KEGG (35), PubMed (36) and BioGRID (37). Furthermore, by including the dipteran outgroup, Drosophila melanogaster, in HymenopteraMine, hymenopteran genes can be connected to D. melanogaster data in Reactome (38) and IntAct (39) via orthologous relationships. First reported in 2016 (1), HymenopteraMine provides several search tools, including a simple keyword search, the QueryBuilder for constructing custom queries, pre-constructed template query menus, the List Tool to upload lists of identifiers and the Regions Search tool to query for genome features based on a list of genomic coordinates. Report pages and query outputs are provided as tables that can be further modified by clicking icons in column headings and by using menus for managing columns, filters and relationships. Tables can be exported in several formats, including tab-delimited. Detailed methods for using the search tools have been previously published (40), and are available by clicking the 'LEARNING' tab in the navigation bar on the HGD home page. Here, we report a more than doubling of the number of hymenopteran species in HGD, as well as the generation of two new datasets, HGD-Ortho and HGD GO Annotation, which are available for searching in HymenopteraMine and available for bulk download.

NEW AND UPDATED GENOMES
The current HGD release has a total of 58 hymenopteran genomes. Since the previous HGD update report (1), we have incorporated genomes of 38 additional species and have updated genomes and/or gene sets of 13 species (Table 1). The acquisition of new genomes expands the insect groups previously hosted in HGD, for example, increasing from 10 to 23 ant species and 9 to 20 bee species. Previously, Nasonia vitripennis was both the only wasp and the only parasitoid in HGD. Now HGD hosts eleven wasp species, nine of which are members of the Parasitoida infraorder, and two of which are social non-parasitoid species. HGD also now hosts genomes of four sawfly species, a group previously not represented at all in HGD. All of the genomes are supported with JBrowse/Apollo genome browsers, BLAST and HymenopteraMine.

REVAMPED WEBSITE
To better organize the growing number of genomes in HGD, we have overhauled the website. HGD now combines all species into one unified website, rather than separating species into the old divisions for 'BeeBase', 'NasoniaBase' and 'Ant Genomes Portal'. The older webpages are available in the 'Archive' tab on the navigation bar. The 'Downloads' tab in the HGD main navigation bar provides access to files for all species, organized by data type. There are also new pages for Learning (with documentation and examples), Release Notes, Community Data, and Contributing Data.

NEW GENE ONTOLOGY ANNOTATION DATA
For most of the HGD species, the number of genes with UniProt-GOA annotations is not sufficient for Gene Ontology (GO) enrichment analysis. The three species with the highest numbers of UniProt-GOA annotated genes are Atta cephalotes (7760 genes), Apis mellifera (4331 genes) and Nasonia vitripennis (3160 genes). Forty HGD species have fewer than 100 UniProt-GOA annotated genes. To perform GO enrichment analysis of these species with few annotations, researchers must annotate the genes themselves, or identify orthologues in a well-annotated species and perform GO enrichment based on a background gene list from that species. HymenopteraMine has always provided tools for easy GO enrichment analysis for the few UniProt-GOA annotated species. To make these tools available for all species we have enhanced the GO annotation data obtained from UniProt-GOA by generating GO annotation data for all species.
GO annotations were generated from three combined sources: (i) UniProt-GOA (56), (ii) transfer of GO terms from InterPro matches (33) and (iii) transfer of GO terms based on homology and InterPro domain content. GO annotations for each species, when available, were parsed from the goa_uniprot_all.gaf file (UniProt-GOA; UniProt Knowledgebase Release 2020_04, downloaded from ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/) and protein ids were converted to RefSeq gene ids using the UniProt idmapping_selected.tab.gz file (ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/).
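The parsing and identifier conversion step can be sketched as follows. This is a minimal illustration under stated assumptions, not the HGD pipeline code: the function names, the toy accessions (ACC1 to ACC3) and the toy gene ids are hypothetical, the GAF column positions follow the published GAF 2.x layout, and gene ids are assumed to sit in the third column of the idmapping file, separated by semicolons. The sketch also applies the rule, stated in the pipeline description, that annotations are not used when a protein maps to more than one gene.

```python
import io


def parse_gaf(handle):
    """Yield (accession, go_id, aspect) from a GAF 2.x stream.

    GAF is tab-delimited; lines starting with '!' are header lines.
    Column 2 holds the DB object ID, column 5 the GO ID, column 9 the aspect.
    """
    for line in handle:
        if line.startswith("!") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        yield cols[1], cols[4], cols[8]


def load_idmapping(handle):
    """Map accession -> list of gene ids (assumed: third column, ';'-separated)."""
    mapping = {}
    for line in handle:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 3:
            continue
        mapping[cols[0]] = [g.strip() for g in cols[2].split(";") if g.strip()]
    return mapping


def annotations_by_gene(gaf_handle, idmap_handle):
    """Collect GO terms per gene, skipping proteins that map to 0 or >1 genes."""
    acc_to_genes = load_idmapping(idmap_handle)
    by_gene = {}
    for acc, go_id, aspect in parse_gaf(gaf_handle):
        genes = acc_to_genes.get(acc, [])
        if len(genes) != 1:
            continue  # unmapped or ambiguous protein: annotation not used
        by_gene.setdefault(genes[0], set()).add((go_id, aspect))
    return by_gene


def gaf_row(acc, go_id, aspect):
    """Build one 17-column toy GAF line (all other fields are placeholders)."""
    cols = ["UniProtKB", acc, "sym", "", go_id, "GO_REF:0000002", "IEA", "",
            aspect, "", "", "protein", "taxon:7460", "20200101", "InterPro",
            "", ""]
    return "\t".join(cols) + "\n"


# Toy inputs standing in for goa_uniprot_all.gaf and idmapping_selected.tab
TOY_GAF = (
    "!gaf-version: 2.1\n"
    + gaf_row("ACC1", "GO:0003677", "F")
    + gaf_row("ACC1", "GO:0005634", "C")
    + gaf_row("ACC2", "GO:0003677", "F")   # ACC2 maps to two genes
    + gaf_row("ACC3", "GO:0005634", "C")   # ACC3 maps to no gene
)
TOY_IDMAP = (
    "ACC1\tACC1_TOY\t1001\n"
    "ACC2\tACC2_TOY\t1002; 1003\n"
    "ACC3\tACC3_TOY\t\n"
)

result = annotations_by_gene(io.StringIO(TOY_GAF), io.StringIO(TOY_IDMAP))
print(result)
```

For the real files, the same functions could be pointed at handles opened with `gzip.open(path, "rt")`, after filtering to the taxon of interest.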
UniProt-GOA annotations were not used if the annotated protein mapped to more than one gene. We supplemented GO annotations using computational methods. First, we used InterProScan (57) to identify protein domains from the SMART, SUPERFAMILY, Panther and Pfam databases (58-61) using the -goterms option to look up GO terms for matching domains. Second, we used the FASTA (62) sequence comparison program to perform protein searches between each species and four annotated reference species. D. melanogaster and human were used as reference species because they have highly curated GO annotation datasets. D. melanogaster protein sequences and GO annotations were obtained from FlyBase (release FB2020_02) (63). Human proteins were obtained from UniProt, and human GO annotations from UniProt-GOA. A. mellifera and A. cephalotes were also used as reference species because they had the highest number of UniProt-GOA annotations among the hymenopteran species in HGD. GO terms were transferred to the query proteins from (i) reciprocal best-hit reference proteins and (ii) best-hit reference proteins that were not reciprocal, but had identical protein domain content identified by InterProScan. Molecular Function and Cellular Component terms, but not Biological Process terms, were transferred from the human reference protein dataset. Inferred annotations were added using inter-ontology links (64) in the go.obo file downloaded from the Gene Ontology Consortium (release date 2020-08-11, doi:10.5281/zenodo.2529950; http://geneontology.org/docs/download-ontology/) (65,66). Finally, all parents of GO terms were added to annotations using the go.obo file.
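The transfer rule and the final parent-propagation step described above can be illustrated with a short sketch. It is not the HGD implementation. It assumes that a best hit is the highest-scoring subject per query, that 'identical protein domain content' means equal, non-empty sets of domain accessions, and that parent relationships have already been read from go.obo into a dictionary. All protein names, scores and the term ids T:leaf, T:mid and T:root are toy placeholders.

```python
def best_hits(scores):
    """scores: {(query, subject): score} -> {query: highest-scoring subject}."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def transfer_terms(fwd, rev, domains_q, domains_r, ref_go):
    """Transfer reference GO terms to query proteins.

    A query receives the terms of its best-hit reference protein if
    (i) the hit is reciprocal, or (ii) it is not reciprocal but both
    proteins have the same non-empty set of domains (assumed criterion).
    """
    fwd_best = best_hits(fwd)
    rev_best = best_hits(rev)
    transferred = {}
    for query, ref in fwd_best.items():
        reciprocal = rev_best.get(ref) == query
        q_dom = domains_q.get(query, set())
        same_domains = bool(q_dom) and q_dom == domains_r.get(ref, set())
        if reciprocal or same_domains:
            transferred[query] = set(ref_go.get(ref, ()))
    return transferred


def add_ancestors(terms, parents):
    """Return terms plus every ancestor reachable through the parents map."""
    closed = set(terms)
    stack = list(terms)
    while stack:
        term = stack.pop()
        for parent in parents.get(term, ()):
            if parent not in closed:
                closed.add(parent)
                stack.append(parent)
    return closed


# Toy search results: query species vs. reference (fwd) and the reverse (rev)
fwd = {("q1", "r1"): 300.0, ("q1", "r2"): 80.0, ("q2", "r1"): 150.0,
       ("q3", "r2"): 200.0, ("q4", "r2"): 90.0}
rev = {("r1", "q1"): 310.0, ("r1", "q2"): 140.0, ("r2", "q3"): 190.0}
domains_q = {"q1": {"PF00010"}, "q2": {"PF00010"},
             "q3": {"PF00010", "PF00989"}, "q4": {"PF00989"}}
domains_r = {"r1": {"PF00010"}, "r2": {"PF00010", "PF00989"}}
ref_go = {"r1": {"T:leaf"}, "r2": {"T:mid"}}
parents = {"T:leaf": ["T:mid"], "T:mid": ["T:root"]}

transferred = transfer_terms(fwd, rev, domains_q, domains_r, ref_go)
annotated = {q: add_ancestors(t, parents) for q, t in transferred.items()}
print(annotated)
```

In this toy case q1 and q3 qualify as reciprocal best hits, q2 qualifies through matching domain content, and q4 receives no terms because its best hit is neither reciprocal nor identical in domain content.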
The number of genes in HGD with GO annotations has increased substantially (Figure 1) due to adding GO annotations generated in our pipeline to the UniProt data. The total number of genes in HGD annotated with GO increased from 47 789 when UniProt was the sole source of GO annotation data to 553 866 after adding GO annotations that we computed based on sequence comparison and protein domain content, and the mean number of annotated genes per species increased from 824 to 9549. The annotations are available in HymenopteraMine, allowing for GO enrichment analysis, as described previously (40). Supplementary File 1 provides an example showing GO enrichment analysis using the HymenopteraMine List Tool with a list of Bombus vosnesenskii gene identifiers provided in Supplementary File 2. Another example with detailed instructions for GO enrichment using the List Tool is available by selecting 'TUTORIAL EXAMPLES' under the 'LEARNING' tab in the HGD navigation bar. In addition to HymenopteraMine, the GO annotations are available as downloadable files in Gene Annotation File (GAF) format (http://geneontology.org/docs/go-annotation-file-gaf-format-2.1/) and in a format that can be used with GeneMerge, a command-line GO enrichment software package (67).
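For readers working from the bulk GAF downloads, the kind of over-representation test that underlies GO enrichment can be sketched as below. This is a generic hypergeometric test with Bonferroni correction, offered as an illustration only; it is not a description of the statistics computed by HymenopteraMine or GeneMerge, and the gene and term names are toy placeholders.

```python
from math import comb


def hypergeom_sf(k, K, n, N):
    """P(X >= k) when drawing n genes from N, of which K carry the term."""
    upper = min(K, n)
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


def go_enrichment(study_genes, gene2go):
    """Return (term, k, K, p, p_bonferroni) tuples sorted by p-value.

    gene2go is the background: every annotated gene and its GO terms.
    Only terms present in the study list are tested.
    """
    background = set(gene2go)
    study = set(study_genes) & background
    N, n = len(background), len(study)

    # Invert the annotation map: term -> background genes carrying it
    term_genes = {}
    for gene, terms in gene2go.items():
        for term in terms:
            term_genes.setdefault(term, set()).add(gene)

    tested = []
    for term, genes in term_genes.items():
        k = len(genes & study)
        if k == 0:
            continue
        tested.append((term, k, len(genes), hypergeom_sf(k, len(genes), n, N)))

    m = len(tested)  # number of tests, used for the Bonferroni correction
    corrected = [(t, k, K, p, min(1.0, p * m)) for t, k, K, p in tested]
    return sorted(corrected, key=lambda row: row[3])


# Toy background of 10 genes; the study list is fully annotated with T:x
gene2go = {
    "g1": {"T:x", "T:y"}, "g2": {"T:x"}, "g3": {"T:x"},
    "g4": {"T:y"}, "g5": {"T:y"}, "g6": {"T:y"}, "g7": {"T:y"},
    "g8": {"T:y"}, "g9": {"T:z"}, "g10": {"T:z"},
}
results = go_enrichment(["g1", "g2", "g3"], gene2go)
for row in results:
    print(row)
```

The choice of background matters: using all GO-annotated genes of the species, as here, is only meaningful when most genes are annotated, which is the situation the expanded annotation set is intended to provide.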

NEW ORTHOLOGUE DATA
HymenopteraMine has always included orthologue data from OrthoDB (34). However, OrthoDB contains only 36 of the 58 hymenopteran species currently in HGD. To provide orthologues for all species, we have generated new orthologue data using Orthologer (34), the same software developed and used to compute orthologues by OrthoDB.
Our new orthologue dataset, called HGD-Ortho, was computed for 14 taxonomic groups based on the NCBI Taxonomy database (36), ranging from the level of genus to superorder (Table 2). When querying the data, users can select a taxonomic group representing the last common ancestor, thereby controlling evolutionary distance, which can affect the level of sequence divergence and number of paralogs within a cluster. Species lists for each taxonomic group are provided in Supplementary File 3. HGD-Ortho data are available for searching in HymenopteraMine, and as bulk downloadable files. HymenopteraMine still maintains the OrthoDB orthologue data so that researchers interested in the supported species can follow their HymenopteraMine work with other resources available at the OrthoDB website using orthologue cluster identifiers common to both resources.

Figure 2 (caption excerpt). Rather than entering a single 'Gene DB Identifier', the box next to 'constrain to be IN' is checked and the gene list saved previously is selected in the pulldown menu. The output includes coding sequences, protein sequences, identifiers and sequence lengths, which can then be used to select the longest coding sequence of genes with multiple transcripts for downstream molecular evolution analyses. The protein sequences are in the rightmost column and are not visible in this figure. The 'Export' button in the top right corner can be used to export the sequences.
Nucleic Acids Research, 2022, Vol. 50, Database issue D1037

To demonstrate the use of HGD-Ortho, we describe how to use HymenopteraMine to gather protein and coding sequences that can be used to investigate sequence evolution of the cycle gene in Nasonia vitripennis in comparison to other parasitoid species. This example involves saving a list of identifiers for use in template queries. While you can save a list temporarily without logging in to a MyMine account, saving a list while logged in stores the list in your account for future sessions. Account registration is freely available by clicking 'Log In' near the upper right corner of the HymenopteraMine home page. Use the 'Gene ID → Homologues' template query, found under the 'Homology' tab in the template category bar in the middle of the HymenopteraMine home page. Enter the RefSeq gene id (100118796) and select 'Parasitoida' as the 'Last Common Ancestor' to retrieve the orthologue cluster id (Figure 2A), which is found in the column labeled 'Homologues Cluster ID'. The next step is to use the cluster identifier (HG-DOG11214at1955251) in the 'Orthologue Cluster ID → Genes' template query to retrieve all pairwise gene relationships in the cluster, and save a list of the genes (Figure 2B). Finally, use the gene list in a 'Gene ID → Protein and Coding Sequences' template query, under the 'Genes' template category, to retrieve protein and coding sequences, which you can export to perform molecular evolutionary analyses (Figure 2C). Sequence lengths are provided in the query output so that you can easily select the longest protein and coding sequence of multi-transcript genes. To retrieve sequences for a non-parasitoid outgroup, you can repeat the 'Gene ID → Homologues' search, selecting 'Hymenoptera' rather than 'Parasitoida' as the 'Last Common Ancestor'. In the output, note the gene id for the species you would like to use as an outgroup, and use that gene id in the 'Gene ID → Protein and Coding Sequences' template query. An additional example highlighting the new HGD-Ortho dataset is provided in Supplementary File 1, which shows how HymenopteraMine is used to identify D. melanogaster homologues and their Reactome pathways for a list of genes in Bombus vosnesenskii, a species that currently has little annotation information available from external resources. We also demonstrate how to programmatically use this same list of genes in the following section on the HymenopteraMine Application Programming Interface (API).
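Once the sequence table has been exported in tab-delimited format, selecting the longest coding sequence per gene can be scripted. The sketch below assumes hypothetical column names (gene_id, transcript_id, cds_length, cds) and uses toy identifiers and sequences; the headers of an actual HymenopteraMine export will differ and should be substituted accordingly.

```python
import csv
import io


def longest_cds_per_gene(rows):
    """Keep, for each gene, the row with the greatest CDS length.

    Ties are resolved in favor of the first row encountered.
    """
    best = {}
    for row in rows:
        gene = row["gene_id"]
        if gene not in best or int(row["cds_length"]) > int(best[gene]["cds_length"]):
            best[gene] = row
    return best


def to_fasta(rows, seq_field):
    """Format rows as FASTA records named 'gene_id|transcript_id'."""
    return "".join(
        f">{row['gene_id']}|{row['transcript_id']}\n{row[seq_field]}\n"
        for row in rows
    )


# Toy stand-in for an exported table (column names are assumptions)
TOY_EXPORT = (
    "gene_id\ttranscript_id\tcds_length\tcds\n"
    "gene1\ttx1a\t9\tATGAAATAA\n"
    "gene1\ttx1b\t12\tATGAAACCCTAA\n"
    "gene2\ttx2a\t6\tATGTAA\n"
)

rows = list(csv.DictReader(io.StringIO(TOY_EXPORT), delimiter="\t"))
selected = longest_cds_per_gene(rows)
fasta = to_fasta(selected.values(), "cds")
print(fasta)
```

The resulting FASTA text can be written to a file and passed to an alignment program of choice before molecular evolutionary analysis.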

APPLICATION PROGRAMMING INTERFACE
Although the HymenopteraMine API is not new, it has not been previously reported. HymenopteraMine leverages the web service API provided with the InterMine platform, enabling users to automate workflows and access data without using the webapp. Client library support is provided in Python, Perl, Java, JavaScript, Ruby and R (68,69

CITING HGD AND DATA SOURCES
You should cite this article for the use of any HGD tools, including BLAST, JBrowse/Apollo and HymenopteraMine, as well as HGD code modifications available on GitHub (https://github.com/elsiklab/). You should also cite the original genome publication and HymenopteraMine data sources for the data you used. A list of genome publications may be found by clicking 'Genome Publications' in the HGD navigation bar, and PubMed links are provided for all datasets on the HymenopteraMine Data Source page, accessible in the HymenopteraMine navigation bar.

CONCLUDING REMARKS
By gathering genomic data for hymenopteran species into a single resource, HGD facilitates data reuse, meta-analysis, and cross-species comparison. We report almost triple the number of species in HGD since the previous update. To better support species that are poorly represented in external genome annotation data sources, we have generated new GO annotation and orthologue datasets for all species in HGD. For most of the species, the new HGD GO Annotation dataset makes HymenopteraMine the only publicly available web-based tool for GO enrichment analysis.
The new HGD-Ortho dataset is the only web-based orthologue resource for twelve of the HGD species, and it benefits all of the included species by increasing the number of hymenopteran taxa available for comparison. We will continue to add species to HGD as genomes become available in the RefSeq division of NCBI. We encourage researchers to contact us if they have suggestions or data to contribute.

DATA AVAILABILITY
HGD tools and data are freely available at http://HymenopteraGenome.org. Although HymenopteraMine does not require login, registering for a MyMine account allows users to save lists for future sessions and to create an API key for programmatic access. Registration is freely available and simply requires entering an email and creating a password. HymenopteraMine code is available at https://github.com/elsiklab/.

SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.