MEGGASENSE – The Metagenome/Genome Annotated Sequence Natural Language Search Engine: A Platform for the Construction of Sequence Data Warehouses

Summary The MEGGASENSE platform constructs relational databases of DNA or protein se quences. The default functional analysis uses 14 106 hidden Markov model (HMM) pro files based on sequences in the KEGG database. The Solr search engine allows sophisticat-ed queries and a BLAST search function is also incorporated. These standard capabilities were used to generate the SCATT database from the predicted proteome of Streptomyces cattleya . The implementation of a specialised metagenome database (AMYLOMICS) for bioprospecting of carbohydrate-modifying enzymes is described. In addition to standard assembly of reads, a novel ‘functional’ assembly was developed, in which screening of reads with the HMM profiles occurs before the assembly. The AMYLOMICS database in-corporates additional HMM profiles for carbohydrate-modifying enzymes and it is illustrated how the combination of HMM and BLAST analyses helps identify interesting genes. A variety of different proteome and metagenome databases have been generated by MEGGASENSE.


Introduction
Falling costs of next generation sequencing have made de novo genome and metagenome sequencing widely available. After assembly of the sequencing reads, a genome is represented by a large number of contigs with the presence of many gaps; the gaps arise from DNA regions which are difficult to sequence (e.g. recalcitrant to PCR) and from assembly problems (e.g. the presence of repeated sequences). The same problems occur in a metagenome, but are exacerbated by the presence of different species often in greatly different proportions (i.e. some rarer species may only have low level coverage). Bioinformatics offers many tools to analyse the sequences, and the identification of protein-coding regions and assignment of function are the major aim in most projects. For metagenomes, a phylogenetic analysis of the present species is usually the second important aim.
There are effective statistically based methods to identify probable protein-coding regions (1). There are also many tools to try to assign function to such proteins. The BLAST algorithm (2) will detect similar sequences in databases. A general BLAST database such as GenBank (3) consists mainly of uncurated entries, which will often contain misleading data for functional assignment. The SEED database (4) contains collections of protein sequences grouped by function and has been used for BLAST searches to find hits corresponding to in silico translation of the metagenomic sequences. Once hits are found, they can be used to assign function using comparison to FIGfams protein families (5). In order to present functional information about the whole genome or metagenome effectively, it is necessary to have a suitable data structure. The KEGG database (6) has developed the BRITE classification, which is a hierarchical scheme to incorporate different functions (often enzyme activities). There is a considerable quantity of curated information about each class and links to pathway diagrams. The KEGG database has a collection of KEGG orthologues associated with each functional class. BLAST searches against the KEGG orthologues are a useful way of assigning function to new sequences.
A general weakness of BLAST analyses is that all regions of the compared proteins are given equal weighting. In contrast, a hidden Markov model (HMM) profile constructed from a family of proteins assigns higher weighting to important conserved residues (7). The Pfam database (8) contains HMM profiles of protein families and can be used to assign function to sequences. There are also many specialised analyses, which can be used for particular projects, e.g. carbohydrate-metabolising enzymes (CAZy database (9)). Phylogenetic analysis of metagenome sequences is fairly effective. The Phymm system (10) uses coding sequences for this purpose and has been incorporated in the Glimmer-MG system (1) to identify protein-coding regions in metagenomic sequences.
The biggest problem of sequence annotation is to present the results in a form that is useful for biological scientists. Analysis of a genome or a metagenome generates an overwhelming amount of information and it is difficult to extract the relevant data. An elegant solution is to enter the data into a suitable database with appropriate search, analysis and extraction functions. The MG-RAST database (11) is designed for metagenome sequences and new data are analysed using a standard pipeline. Experience with different genome and metagenome analyses showed that the required analyses and tools differed between different projects and it was often necessary to carry out specialised analyses for particular projects. This makes use of a standard database difficult. However, most projects required common components. It was, therefore, decided to develop the MEGGASENSE platform, which would allow the generation of different databases for different projects. The databases can be constructed from DNA sequencing reads, assembled DNA sequences or protein sequences. A core functional analysis common to all databases uses HMM profiles constructed from KEGG orthologues (KO). Any other required analyses can be incorporated (e.g. BLAST-based). Metagenome samples are also subjected to a phylogenetic analysis. The whole of the annotations (functional and phylogenetic) can be searched using a powerful search engine (the default is Solr) and it is easy to extract individual sequences and sets of sequences from the databases. The utility of MEGGASENSE is illustrated by some example databases.

Materials and Methods
The databases are implemented in ZODB (12). The search engine is the enterprise search platform Lucene/ Solr (13) served by a Tomcat web server (14) to index and serve the annotated sequence libraries. The web logic was implemented by an HTML (15) and a JavaServer Pages (JSP) (16) combination. The databases also include BLAST+ v. 2.2.25 (17) and HMMER v. 3.0 (18) search functions. Metagenomic DNA databases also incorporated the Krona viewer (19) for phylogenetic data.
The KEGG database v. 58 (6) was downloaded and the annotation associated with each KEGG orthologue (KO) was retrieved with custom database loading programs. An HMM profile was built from the sequences of each of the 14 106 KOs using HMMER v. 3.0 (18) after multiple alignment of the sequences using ClustalW (20).
DNA assembly used the Newbler package, which is designed specifically for the 454 GS series of pyrosequencing platforms sold by Life Sciences, a Roche Diagnostics company (24). In the conservative annotation approach, the DNA reads were assembled and the resulting contigs were translated in all six reading frames and screened with the KEGG-based HMM profiles to generate the default functional annotation. The functional annota-tion approach first screened the reads using a customised version of the Glimmer-MG v. 0.2 pipeline (1) to identify potential protein-coding regions. This pipeline includes Scimm, PhymmBL and related tools (http://www.cbcb. umd.edu/software/scimm/ (25)), which also produce the information about the phylogeny of the potential coding regions in the metagenomic samples. The protein-coding regions were screened with the KEGG-based HMM profiles and the sequences corresponding to each KO were grouped together and assembled using the Newbler assembler. The results of the conservative and functional annotations were merged with duplicates recognised using BLAST (17). Additional specialised analyses (e.g. the CAZy analysis for AMYLOMICS) result in additional annotation detail, which was associated with each entry.
The MEGGASENSE platform was initially developed as part of a bioprospecting project for carbohydrate-modifying enzymes using metagenomes from hot springs in Iceland. The metagenomic libraries used to construct the AMYLOMICS database (http://mrcina.bioinfo.pbf.hr/Amylomics/) were derived from samples collected from three locations at geothermal sites in Iceland and subjected to carbohydrate enrichment using α-and β-glucan substrates. The metagenome MET4 was collected at Haerulangur (Vonarskarð geothermal region, Iceland) from the effluent (dark layer beneath white mat, 72 °C, pH=6) and enriched on cotton whisk as carbon source.
The HMM profiles based on KEGG and the custom programs for MEGGASENSE are available on github (https://github.com/astarsky2016/meggasense).

Simple proteome databases
The MEGGASENSE platform is designed to generate sequence databases from different types of input data: raw DNA sequencing reads, finished DNA sequences or proteome sequences. Database construction involves several steps (Fig. 1). In the first step, the sequences are collected and annotated. The default functional analysis for databases generated by MEGGASENSE is derived from the KEGG database. We constructed HMM profiles from 14 106 KOs. Each database consists of a data warehouse containing various data sources: relational databases, ob-ject databases (implemented in ZODB) and files. The core functions of MEGGASENSE can be illustrated by the construction of simple proteome databases.
The SCATT database (http://bioserv7.bioinfo.pbf.hr/ scattDB/registration/login.jsp) was constructed from the 6949 protein sequences in the predicted proteome of Streptomyces cattleya (26). The 14 106 HMM profiles were used to scan the proteome to associate the proteins with the KOs. The protein sequences and the annotations associated with each KO were loaded into the new database using a custom database loading program (Fig. 1). The graphical user interface includes search functions using the Solr search engine and the ability to generate tables using the KEGG BRITE classification. This makes it easy to investigate which metabolic pathways are present in the strain and to identify any missing steps in pathways. The database also implements BLAST and HMMER searches (Fig. 1). A similar process was used to construct the ZoophyteBase database (http://bioserv7.bioinfo.pbf. hr/Zoophyte/registration/login.jsp) from the proteome of the coral Acropora digitifera (27). The search functions and the hierarchical KEGG BRITE classification allow the recognition of particular functions as well as collections of functions belonging to particular pathways. This enabled the generation of some novel hypotheses about functions which may be involved in coral symbiosis.

Metagenome databases with additional specialised functions
The starting point for a metagenome database is usually raw DNA sequencing reads. This means that in the data and annotation phase of construction ( Fig. 1) it is necessary to carry out assembly and gene identification. Phylogenetic analyses are also carried out at this stage using the Phymm analysis in the Glimmer-MG pipeline (1) and from the detected 16S rRNA sequences. The graphical user interfaces of metagenome databases generated by MEGGASENSE incorporate the Krona tool for viewing the phylogenetic results. For data mining projects, it is often useful to add further specialised analyses.
The AMYLOMICS database used the 454 sequencing technology and the reads were assembled using the Newbler assembler (24). After in silico translation of the DNA sequences, gene function was assigned using the KEGG KO HMM profiles as described earlier. Assembly of metagenome sequences is considerably more difficult than of genome sequences, because some of the species are present in low numbers so that there are many regions with low read coverage. A novel 'functional' assembly strategy was also implemented to improve the assembly of such sequences. This used a pre-screening of the reads with the HMM profiles followed by an attempt to assemble genes. The reads were extracted from the files produced by the sequencer and trimmed to remove low--quality sequences. The Glimmer-MG pipeline (1) was used to detect protein-coding regions. After in silico translation, the reads were screened with the HMM profiles derived from the KEGG database and reads corresponding to each KO were collected together. It was necessary to implement an algorithm to resolve cases where more than one profile gave a hit with a single read. In some cas-  (13) as well as other analyses and reporting es, this was due to artefacts with low scores, whereas in other cases, it was because the read overlapped two genes. The reads corresponding to each KO were then assembled into contigs using the Newbler assembler. The results of the traditional and functional analyses were merged: BLAST was used to compare the two sets of results and eliminate duplicates. Table 1 shows a comparison of the traditional and functional approaches to assign genes to a KEGG BRITE functional category for a metagenome sample in the AMYLOMICS database. The 'functional' assembly strategy added about 35 % more predicted genes represented by contigs with two or more reads. Reads that could not be assembled were retained in the database as singleton reads. The MG-RAST pipeline is a popular tool for analysing metagenome data (11). It can also be used to assign genes to KEGG BRITE categories. The sequences of the metagenome were submitted to the MG-RAST server. Table 1 shows that MG-RAST assigned far fewer genes to KEGG BRITE categories than the AMY-LOMICS database.
The analyses described above are incorporated as a default in metagenome databases generated by MEG-GASENSE. In addition to the default annotation based on the KEGG database, the AMYLOMICS database used specialised HMM profiles for better identification of carbohydrate-modifying enzymes. These were obtained from the dbCAN database (21) and are based on sequences in the CAZy database (http://www.cazy.org (9)), which includes sequences of carbohydrate-modifying enzymes arranged in different families. The bioprospecting approach aimed at identifying gene sequences encoding enzymes with novel properties. The AMYLOMICS database contains 279 and 568 putative carbohydrate-modifying enzymes for samples enriched on α-and β-glucans, respectively, but it is likely that some hits with low scores are falsely identified. This was examined by using the BLAST function of the AMYLOMICS database, followed by sorting the results according to the degree of sequence identity with the best BLAST hit for each sequence (Fig.  2). Most of the BLAST hits were annotated as carbohydrate-modifying enzymes. However, for sequences with a very low degree of sequence identity (25-30 %), the BLAST hits were usually not annotated as carbohydrate--degrading enzymes (Fig. 2a). Examination of the alignment of these sequences with the HMM profiles to see which residues were responsible for the functional assignment decided whether the sequence would be rejected as a probable artefact or retained as a potential novel enzyme. This reduced the number of potential enzymes to 250 and 519, respectively ( Table 2). Some sequences were nearly identical to the known sequences and considered unlikely to yield novel enzymes (Fig. 2c). The majority of the sequences (Fig. 2b) showed a lower degree of sequence identity and are being analysed in more detail using biochemical knowledge about the enzyme families. The Solr search engine is integrated into the database to allow intelligent text search based on the annotations associated with the KOs and the CAZy database. This also finds sequences whose annotations are not the typed search terms, but may be related. This is illustrated by the results of a search with the term 'amylase' (Fig. 3). Apart from the expected genes, which had been annotated as various classes of α-amylase, genes that were annotated as solute carrier family 3 were also found. When the BLAST function was invoked for such a gene, the conserved domain search showed that the genes might encode transport proteins belonging to a family that also in-cluded carbohydrate transport proteins. This ability to carry out searches, browse the results and download sets of selected sequences is useful in broadening the scope of searches and suggesting ideas for detecting novel enzymes.

Discussion
The MEGGASENSE platform allows the rapid generation of specialised databases for genomes and metagenomes. The use of a data warehouse (Fig. 1) allows the incorporation of several relational databases and different  (2) to screen the protein sequences of potential carbohydrate-utilising enzymes. The 568 hits identified using the specific HMM profiles from metagenome libraries enriched on β-glucans were used for BLAST analysis. The best hits in each case were filtered according to the degree of sequence identity: a) five low identity hits, b) five medium identity hits, c) five high identity hits Fig. 3. Example of using the search system in the AMYLOMICS database (http://mrcina.bioinfo.pbf.hr/Amylomics/) objects (implemented in ZODB) as well as raw data. This is achieved using common loading programs for different databases. It is also easy to customise the graphical user interface.
The coupling of the core functional analysis to the structure of the KEGG database allows good annotation and the ability to view the hits on KEGG pathway charts, which makes it easier to assess whether complete pathways are likely to be present. BLAST analyses with KEGG orthologues will yield good results if the query organism is closely related to the KO member sequences. For more distantly related organisms, the use of HMM profiles as in the MEGGASENSE core analysis will yield better results. This was important for the ZoophyteBase for the proteome of the coral Acropora digitifera (27). In this case, the identification of functions together with the assignment of KEGG BRITE categories and the use of the Solr search engine allowed the development of novel hypotheses concerning functions such as interaction with symbiotic organisms.
Metagenomic data require assembly of the reads and functional annotations. In addition to the standard approach of assembly followed by annotation of deduced protein sequences, MEGGASENSE offers an additional 'functional' strategy in which in silico translated reads are scanned with the HMM profiles prior to the attempts to assemble the reads giving hits to a particular profile. In the case of metagenomes derived from hot springs in the AMYLOMICS database, this resulted in 35 % more predicted genes in contigs with two or more reads. The performance of this functional strategy will depend on the metagenome and the used assembler. However, a better performance of an assembler program on such pre-sorted smaller datasets is likely to be a general property. The MG-RAST server (11) identified far fewer KEGG BRITE genes than the MEGGASENSE-based AMYLOMICS database (Table 1). However, this does not accurately reflect the ability of the MG-RAST system to identify gene function as the KEGG BRITE assignment uses BLAST with stringent parameters. MG-RAST assigns genes to cluster of orthologous group (COG) classes (28). COG is in many ways similar to the use of KO (KEGG orthologue) in KEGG (6). However, the KEGG BRITE classification in which there is a hierarchical organisation of gene function and the fact that KEGG is a highly curated database makes KEGG convenient for functional studies. It would be possible to incorporate COG analyses in a MEGGASE-NSE database, e.g. by generating HMM profiles from COG sequences. MG-RAST is primarily designed to compare metagenomes and provide an overview for large datasets. Although it can be useful for data mining, this will usually involve downloading sequences for analysis in further programs.
The databases reported here are based on the 454 pyrosequencing technology, which gives relatively long reads of good quality. The Illumina technology (http:// www.illumina.com (29)) yields much larger numbers of reads, which are shorter. When the functional assembly approach was attempted for a metagenome with short Illumina reads (approx. 100 b), the functional approach did not give any advantages as the sequences were too short to give good hits with the HMM profiles (data not shown). However, such data can be incorporated in MEG-GASENSE-generated database using standard assembly programs. The Illumina technology can also produce longer reads (>200 b), which would probably be amenable to functional assembly. Single molecule DNA sequencing methods (single molecule real-time sequencing (SMRT), http://www.pacb.com (30) or nanopore sequencing, https://nanoporetech.com/ (31)) generate long reads with high error rates. It is likely that functional assembly would provide little advantage for long reads, especially as they are likely to have errors resulting in frame shifts in open reading frames.
The modular nature of the platform makes it easy to incorporate any analyses which are useful for the considered problem. In the case of the AMYLOMICS databases, extra specialised HMM profiles for carbohydrate-modifying enzymes were incorporated. The standard integration of BLAST searches in the database allowed the users to compare sequences with the NCBI nr database (3). This helped rejection of sequences that were probably falsely identified as carbohydrate-modifying enzymes. It also allowed identification of hits to particular enzyme classes that are distant in sequence from the known members, which are candidates for enzymes with novel properties. The REDPET database (http://redpet.bioinfo.pbf.hr/RED-PET; Gacesa et al, in preparation), which was also generated by MEGGASENSE, is designed for bioprospecting hydrocarbon-degrading enzymes in metagenome sequences from the Adriatic Sea. It uses a similar approach to the AMYLOMICS database incorporating appropriate specialised HMM profiles. However, it would also be possible for MEGGASENSE to incorporate more sophisticated analyses tailored to a particular bioprospecting task.
The ever increasing amount of sequence data makes it attractive to use a database-generating platform such as MEGGASENSE to allow more effective exploitation of the data. The ability to incorporate any appropriate form of analysis makes it superior to the use of standard database formats, which are not adapted to the challenges of new projects.

Conclusions
Analyses of metagenomes and genomes are greatly facilitated by incorporating the data in relational databases. Many bioprospecting projects require specialised analyses, which are not provided by standard databases. The MEGGASENSE platform allows the construction of bespoke databases with a core of common search and analysis tools. The use of HMM profiles for functional analysis offers advantages especially for metagenome projects.