Plasmobase: a comparative database of predicted domain architectures for Plasmodium genomes

Background With the availability of complete genome sequences of both human and non-human Plasmodium parasites, it is now possible to use comparative genomics to look for orthology across Plasmodium species and for species specific genes. This comparative analyses could provide important clues for the development of new strategies to prevent and treat malaria in humans, however, the number of functionally annotated proteins is still low for all Plasmodium species. In the context of genomes that are hard to annotate because of sequence divergence, such as Plasmodium, domain co-occurrence becomes particularly important to trust predictions. In particular, domain architecture prediction can be used to improve the performance of existing annotation methods since homologous proteins might share their architectural context. Results Plasmobase is a unique database designed for the comparative study of Plasmodium genomes. Domain architecture reconstruction in Plasmobase relies on DAMA, the state-of-the-art method in architecture prediction, while domain annotation is realised with CLADE, a novel annotation tool based on a multi-source strategy. Plasmobase significantly increases the Pfam domain coverage of all Plasmodium genomes, it proposes new domain architectures as well as new domain families that have never been reported before for these genomes. It proposes a visualization of domain architectures and allows for an easy comparison among architectures within Plasmodium species and with other species, described in UniProt. Conclusions Plasmobase is a valuable new resource for domain annotation in Plasmodium genomes. Its graphical presentation of protein sequences, based on domain architectures, will hopefully be of interest for comparative genomic studies. It should help to discover species-specific genes, possibly underlying important phenotypic differences between parasites, and orthologous gene families for deciphering the biology of these complex and important Apicomplexan organisms. In conclusion, Plasmobase is a flexible and rich site where any biologist can find something of his/her own interest. Availability Plasmobase is accessible at http://genome.lcqb.upmc.fr/plasmobase/.


Background
A large amount of genomic and post-genomic data is now available for the Plasmodium genus. The fully sequenced genomes of 11 species are accessible from PlasmoDB [1], but despite the availability of these complete genomes a major limitation still remains on their functional annotation. The number of proteins with unknown functions is still high for all Plasmodium species. This could be due (i) to the AT-richness of these genomes, (ii) to the specificity of a number of parasitic functional mechanisms evolved to evade host immune recognition, (iii) to the strong divergence of Plasmodium protein sequences that make homology detection a difficult task. These three reasons might have contributed to the development of new gene functions and changes in the parasite's genome, occurring through gene acquisition and deletion [2].
Proteins are composed of one or more "domains", that is structural motifs that can evolve, function, and exist independently of the rest of the protein chain. Domains might be found in different combinations, and the arrangement of these domains in a protein forms the so called "protein architecture". By focalizing on domain recognition, genome annotation can be highly improved. Several approaches and databases have been developed to detect and identify functional domains in a large number of proteins, including Pfam [3], SMART [4], and PROSITE [5]. Based on these resources, a number of conflicting domain predictions (potential domains) can be resolved and, for each protein to be annotated, a domain architecture can be proposed. There are different methods for identifying a domain architecture and the most successful ones explore domain co-occurrence for controlling the false discovery rate (FDR) associated with the predictions [6][7][8]. Here, DAMA [8] (domain annotation by a multi-objective approach), an approach that treats protein domain architecture prediction as a multi-objective optimization problem, is used. DAMA combines a number of criteria to handle multi-(possibly pairwise-) domain co-occurrence and domain overlapping, and it outperforms existing methods. It detects domain architectures with a larger number of domain co-occurrences.
DAMA can improve domain recognition methods such as HMMer [9,10] used by Pfam, but it is limited by the number of potential domains given as input. Hence, CLADE (closer sequences for annotations directed by evolution) [11], the new generation of annotation tools based on a "multi-source strategy", is used to increase the set of potential domains and consequently to improve DAMA performance. CLADE uses several hundred probabilistic profiles to represent each Pfam domain instead of one profile, based on global consensus, as for mono-source strategies. These probabilistic models are originated from different species, spanning the whole phylogenetic tree, and describe alternative evolutionary pathways for a domain. Tested on the Plasmodium falciparum genome, CLADE outperforms the widely used tools based on a mono-source annotation strategy, HMMer and HHblits [12]. The new domain annotation obtained by CLADE for P. falciparum 3D7 has been released with [11].
Plasmobase, presented here, is a novel database reporting known and new protein domains identified by DAMA and CLADE on the 11 fully sequenced genomes in PlasmoDB (see Table 1). Plasmobase contains a large number of newly discovered domains in each Plasmodium genome, leading to an enrichment of 18-30% of the total number of domain families when compared to Pfam predictions (with an FDR <1%). In addition, Plasmobase is a unique platform for the comparative study of Plasmodium genomes. It proposes a visualization of domain architectures and it allows for an easy comparison among architectures within Plasmodium species and all other species in UniProt. A friendly interface allows users to interact with the platform to access new annotations and possibly detect annotation errors.

Plasmodium species
The 11 fully sequenced genomes present in PlasmoDB [1] and considered here are: three human parasites (P. falciparum 3D7, P. falciparum IT, and Plasmodium vivax), two macaques parasites (Plasmodium knowlesi and Plasmodium cynomolgi), one chimpanzee parasite (Plasmodium reichenowi), and five rodent parasites (Plasmodium chabaudi, Plasmodium berghei, and Plasmodium yoelii 17X, Plasmodium yoelii yoelii 17XNL and Plasmodium yoelii YM). All genomes were extracted from http://PlasmoDB.org, the official repository of the Plasmodium proteins used as a reference database by malaria researchers. For each species, the genome size and the number of proteins are shown in Table 1.

The UniProt database
In order to display the proportion of proteins with similar architecture to a given query protein, all proteins (18,523,877) with known Pfam domain architectures are extracted from UniProtKB [13] and organized according

GO terms
The gene ontology initiative [14], besides its value as a database of annotations, maintains and develops a controlled vocabulary of gene and gene product attributes. The gene ontology terms (GO terms) [15] is a machinereadable vocabulary that provides a standard output for functional predictions, avoiding the ambiguity of natural language. GO terms describe three aspects of gene product function: molecular function, biological process, and cellular location. pfam2go [16], a mapping that associates a specific GO term with a Pfam domain, is used to provide GO terms for Plasmobase domain predictions. In consequence, all proteins containing this domain share the same GO term.

Tools
CLADE [11] and DAMA [8] were applied to all Plasmodium organisms, CLADE to identify potential domains and DAMA to reconstruct protein domain architectures. Both tools were run with default parameters, corresponding to a FDR smaller than 1%. The two tools are briefly described below. For the FDR estimation and other details refer to the original articles.

CLADE: closer sequences for annotations directed by evolution
CLADE is a computational approach that highly increases the sensitivity of domain prediction. It is a multi-source approach where several hundred probabilistic profiles are used to represent each domain, instead of one as for mono-source strategies, employed by methods like HMMer [9,10] and HHblits [12]. CLADE predicts domains based on two classes of probabilistic profiles. The first is the profile library available in the Pfam database (version 27). There are 14,831 profiles, one for each domain. These profiles capture the consensus of homologous sequences, and the idea behind them is that homologous proteins should share common physico-chemical and structural features that could be described by consensus on the entire set of homologs.
The second class of profiles is constituted by hundreds of probabilistic models, associated to each Pfam domain. They are constructed starting from homologous sequences spreading a large panel of taxonomic origins, to guarantee that the phylogenetic tree is well represented. For each selected homologous sequence, a specific profile, named "clade-centred model" (CCM), is constructed. To construct a CCM, the selected homologous sequence is used as a query to search for close homologs within the non-redundant protein database (NR) with PSI-BLAST [17]. The CCM is the resulting PSI-BLAST profile. Note that PSI-BLAST constructs the CCM profile from NR sequences detected with an E-value threshold set to 1e−3 by default. CLADE uses this PSI-BLAST default E-value (1e−3) and sets the number of PSI-BLAST iterations to 5. On average 161 CCMs were constructed to represent each Pfam domain. These models span regions of protein sequence space that are not well represented by Pfam consensus models, and they highlight motifs, structural characteristics or physicochemical properties that are shared by similar homologous sequences. Hence, if the original set of sequences for a domain is made of very divergent homologs, cladecentred models, are expected to describe properties that could be missed by the original Pfam model, representing the global consensus.
The CLADE library, made of Pfam consensus models and clade-centred models, is composed of 2,404,066 profiles. This library is used to identify potential domains in query proteins. Potential domains are then filtered by using support vector machines (SVM) combining various optimisation criteria, ultimately converted into a score, to specifically deal with false positives and discriminate domains. The SVM selects most probable predictions among domain hits displaying a small E-value, a sufficiently long domain hit, the phylogenetic proximity between the taxon of the sequence to be annotated and the reference species generating the probabilistic profile, and the agreement among models leading to the prediction. This filtering step is fundamental in domain prediction. Once domains are filtered, CLADE calls DAMA to find the most probable architecture for a given query sequence. CLADE can be downloaded at http://www. lcqb.upmc.fr/CLADE.

DAMA: domain annotation by a multi-objective approach
Since homologous proteins might share their architectural context, domain architecture prediction are used to improve the performance of CLADE annotation. The problem can be complex when a query sequence matches several probabilistic models, producing a set of conflicting predictions with overlapping domain boundaries. To address this problem, DAMA combines a number of criteria including multi-(possibly pairwise-) domain cooccurrence and domain overlapping. Domain co-occurrence is expected to enhance the level of confidence in a prediction [18] mainly because (i) the majority of proteins are multi-domain and (ii) fewer combinations than the statistically expected ones are observed. Some overlapping must also be admitted to increase the number of correct domain predictions, as demonstrated in [19]. DAMA encodes domain co-occurrence and hit overlapping criteria into objective functions, and treats protein domain architecture prediction as a multi-objective optimization problem. First, DAMA generates a list of possible architectures, and then maximizes a set of objective functions to select the best architecture. Five functions, designed according to several objectives, are applied in order of importance. The first objective function ensures that higher confidence domains are in the final architecture since domain scores greatly augment the trust on protein annotation. The second and the third function explore the tendency of some domain families to occur preferentially with a few other favourite families. The second function maximizes the number of multi-domain cooccurrences, while the third one maximizes the number of pairwise domain combinations when two architectures present the same number of multi-domain co-occurrences. The fourth function privileges architectures with distinct domains since domain duplication is less likely to change the protein function. Finally, the fifth function selects the architecture whose domains have the highest scores. DAMA can be downloaded at http://www.lcqb. upmc.fr/DAMA. DAMA was used with default parameters, including the parameter '--review" that adds new domains into an architecture if: (1) they present a significant E-value (<1e−10) and (2) they do not overlap an existing domain in the architecture. By setting this parameter, one can increase the number of predicted domains and takes into account the identification of new domain architectures (see "Discussion").

Data availability
Plasmobase website provides access to downloadable xls files containing the full list of annotations for the 11 Plasmodium species. Each file contains, for each domain hit, the PlasmoDB accession number of the sequence where the hit is identified, its starting and ending position, the Pfam domain name, the Pfam accession number, the E-value, CLADE SVM probability, the model identifier in the CLADE model library (either a clade-centred model or a Pfam consensus model), the start and the end position of the hit either in the domain sequence used to construct the clade-centred model or in the Pfam consensus model, the clade name (if any) and the organism name of the sequence originating the clade-centred model used for the domain identification. Note that if a protein sequence is annotated by several domains, the file contains several rows, one for each domain annotation.

Results
Plasmobase is a platform for the exploration of protein architectures and their comparison across species. Its features and its multiple ways to analyse a predicted architecture or to explore potential architectures, are presented together with some global statistics on new domain annotations for the 11 Plasmodium genomes.

The Plasmobase platform
The Plasmodium database Plasmobase is a friendly interface allowing users to search by domains (with a Pfam accession number or a Pfam description), proteins (with a protein accession number or a protein annotation) or domain architectures (with a list of Pfam accession numbers) on all Plasmodium species or on a specific one. As a result, it provides the list of corresponding proteins with their accession number, Plasmodium species name, PlasmoDB annotation, a list of domains forming the predicted architecture (where new domains are highlighted), and the accessibility ("Look up" link) to a graphical interface providing comparative information on the predicted architecture. Each protein in the list is linked to its Plas-moDB description, and each domain forming the protein architecture is linked to its Pfam description. Figure 1 illustrates the graphical interface obtained when querying for the P. falciparum protein PF3D7_1369500, identified by PlasmoDB as a "conserved Plasmodium protein, with no known function" and no predicted domain. It is accessible through the "Look up" link. On the top, the graphical interface shows the protein details: PlasmoDB accession number, protein length, and PlasmoDB text annotation. Then, it displays the predicted architecture constituted by three new domains (MIF4G, MIF4G_like_2 and MIF4G_like-see the CLADE architecture box), and the GO terms associated to the domains (accessible by clicking "GO Terms show"). For each predicted domain, there is an interactive legend (appearing when the cursor passes over the domain icon) with the description of the domain (Pfam domain, Pfam accession number, species generating the CLADE model identifying the domain, protein length, domain coverage, E-value, CLADE SVM probability, Pfam clan). The "Pfam-27 architecture" box shows the annotation proposed by the Pfam database version 27, where domains are predicted by HMMer and the architecture is obtained by employing a simple strategy that considers highest confidence domains without overlapping. For PF3D7_1369500 no domain was predicted by HMMer, and an empty box is shown. This display is accessible by clicking on the "show" link. (In Fig. 1, the button "hide" allows hiding the display). Orthologs and paralogs according to PlasmoDB can also be displayed by clicking on the "show" link. Note that all PF3D7_1369500 orthologs are identified by PlasmoDB as "conserved Plasmodium protein, unknown function" or "hypothetical protein".
The list of all overlapping domains identified by CLADE, but not selected by DAMA to belong to the proposed architecture, is shown in the "All CLADE domains" box. For each domain in the list, those that are known to co-occur with it, are accessible by passing the mouse over the domain name. Plasmobase allows the user to evaluate the interest of an overlapping domain in view of a putative functional annotation for the protein. These domains can be explored in combination to other identified domains. In fact, the user can select any combination of CLADE domains (the ones belonging to the proposed CLADE architecture but also those that have been selected and filtered) and explore for their co-occurrence either in Plasmobase (Fig. 2a) or in UniProt (Fig. 2b) through the two dedicated button in the "Explore architectures" box (see below).
An immediate access to the list of architectures of orthologous and paralogous genes in Plasmodium species is possible by clicking on the "show" link at the bottom of the page (Fig. 1). Each protein accession number is provided together with the species name, the length of the sequence and the PlasmoDB annotation. The display of the associated Plasmobase architectures can be obtained by clicking the "Compare orthologous/paralogous architectures" button. Figure 2a (left) shows all the Plasmodium species having protein sequences with a similar architecture to PF3D7_1369500 in Plasmobase. The proportion of proteins including the same co-occurring domains is shown by a pie chart. There are 10 orthologous proteins, in 10 Plasmodium species. The only exception is P. cynomolgi where MIF4G_like_2 was not identified. Architectures can be compared in more details by ticking some or all organisms and pressing the button "Compare architecture in Plasmobase". In this example, P. chabaudi and P. vivax are selected, and the architectures are displayed in Fig. 2a (right), where CLADE domains are shown with the same graphical interface of PF3D7_1369500. Information on the species, functional annotation and GO classification is provided. Figure 2b shows proteins with similar architecture in UniProt, the information is organized by taxon groups: Eukaryota, Bacteria, Archaea and "Viruses and others". For PF3D7_1369500, 173 eukaryotic proteins with similar architectures and spread over several clades are found. The main clades (collecting most proteins) are shown in Fig. 2a (left) and the remaining ones are grouped in the checkbox "others" (see "Methods"). Like for Plasmobase, one can explore some or all clades and compare domain architectures for UniProt species within clades (button "Compare architecture in Eukaryota"). The architecture for Fungi proteins are shown in Fig. 2b (right). There are 41 proteins with the same architecture in Fungi and 29 in Viridiplantae. By clicking "Show", all protein architectures are displayed.

Many brand-new domains and many enriching ones
Those domains in Plasmobase that do not overlap Plas-moDB hits or that disagree with some PlasmoDB annotation are called "new". Among new domains, there are some that are detected for the first time in the organism, and they are referred to as "brand-new" domains. The results are summarized in Table 2, where co-occurrent domains are highlighted (Cooc column) in regard to the total number of domains. For each species, the total number of new domains is shown in Table 2 (column "New domains"), this is the sum of domains that occur in proteins with no annotation in PlasmoDB (Table 2column "First time"), and domains that enrich existing PlasmoDB architectures (Table 2-column "Enriching domains"). Note that more than half of the new domains enrich existing architectures and, interestingly, the rest provides an annotation of sequences that were never annotated before. The number of "brand-new" domains is reported in Table 2 (column "Brand new domains"). In average, each species gains more than 1500 new domains compared to existing PlasmoDB annotation, and two thirds of these domain predictions are supported by cooccurrence (see columns "Cooc" in Table 2abc), increasing their confidence. Table 3 compares the number of domains, found in each species, reported in PlasmoDB and Plasmobase. Plasmobase identifies a large number of Pfam domains in each Plasmodium species when compared to Plas-moDB predictions. These latter are based on HMMer (hmmscan) [9,10] using the Pfam model library. The Fig. 2 Proteins with similar architectures explored in Plasmobase/UniProt. a All Plasmodium species contain a protein sequence sharing the same CLADE architecture as PF3D7 1369500. A selection of these species allows to explore these protein sequences in Plasmobase, and verify information for domain architecture identification in other species. The whole list of domains is reported (in this specific example, there is only one architecture per species). b Plasmobase allows to explore the UniProt database for architectures that are similar to the one identified by CLADE for PF3D7 1369500. There are similar architectures in Metazoa, Fungi, Viridiplantae and other clades. A selection of Fungi and Viridiplantae allows the user to compare the architectures among these clades. Fungi contains 41 sequences with the given architecture and the full list is accessible set of new predicted domains is expanded in all organisms, from 18% in P. knowlesi H to up to 30% in P. falciparum 3D7, and the number of proteins with no domain annotation largely decreased in all organisms. Compare the percentages of proteins with unknown function in Plasmobase and in PlasmoDB reported in Table 3.

Comparison with EuPathDomain
The EuPathDomain [20] is a protein domain database dedicated to ten eukaryotic human pathogens, including three Plasmodium species: P. falciparum 3D7, P. vivax Sal-1 and P. yoelii yoelii 17XNL. It uses hmmscan with permissive E-values to generate a set of potential domains, CODD [6] is then used to predict domain architectures based on domain co-occurrence. Plasmobase improves over EuPathDomain in several manners. When compared with EuPathDomain, Plasmobase: 1. considers 11 Plasmodium species while EuPathDomain considers just 3 organisms,  All information contained in the database can be downloaded in xls files, one for each species, permitting the user to process the information in alternative ways.

Discussion
Protein annotation plays a major role in the comprehension of the biology of Plasmodium species. Plasmodium clade contains particularly AT rich genomes (Table 1) and this specificity of Plasmodium species makes Plasmobase contribution even more important. Indeed, Plasmobase provides a huge amount of new information concerning protein domain annotation across Plasmodium species. The new protein architectures suggested in Plasmobase can play a crucial role in the functional annotation of Plasmodium proteins, and potentially, on the identification of new functions, possibly rising from new domain combinations.
The reconstruction of the most likely domain architecture for a protein sequence constitutes one of the main steps of all predictive annotation strategies. Indeed, an accurate identification of the domain architecture of a multi-domain protein provides important information for function prediction, comparative genomics and molecular evolution. Here, the latest generation tools available (the new generation annotation method CLADE, employing a multi-source annotation strategy, and the state-of-the-art architecture reconstruction approach DAMA) are used to reconstruct the domain architecture of a large number of sequences presenting no domain annotation in Pfam_27 in order to annotate all Plasmodium complete genomes. The success of CLADE methodology was demonstrated on the P. falciparum genome [11]. Here, CLADE analysis is extended to 10 more species and a web interface simplifies the comparison between annotations by helping researchers to dig more profoundly into the evolution of the species and in the functional characteristics of their proteins. In Plasmobase, this can be done with a direct enquiry on functional keywords, associated to domains or proteins. Clearly, final confirmations are expected to be experimental.
On false discovery rate in Plasmobase Plasmobase provides multiple pieces of information to help users to evaluate a domain prediction: E-value, SVM probability, co-occurrences, and the possibility to compare predictions in orthologous genes. Indeed, one of the purposes of Plasmobase is to highlight several evidences to believe in the proposed domain annotation. Yet, as for any domain prediction tool, there exists the possibility that a domain is falsely predicted by CLADE. In this respect, several statistical tests to estimate CLADE false discovery rate (FDR) have been reported in [11], where it has been shown that, for the same FDR value, CLADE detects much more domains than its competitors, HMMer and HHblits. Run with default parameters (used for constructing Plasmobase), CLADE presents a FDR of 1e−3, that is, 1 in 1000 predictions are expected to be a false domain.
Plasmobase contribution to new protein annotations In Plasmobase, the user can explore and compare a large number of new domain annotations helping to the identification of a protein function. The protein PF3D7_1369500 illustrated in Fig. 1, for instance, contains three domains MIF4G/MIF4G-like obtained with E-value 1e−21, 1e−19 and 1e−8 and known to co-exist within protein sequences of metazoan, fungi and viridiplantae. In Plasmobase, this information, coming from UniProt, is accessible. The 96 UniProt metazoan sequences can be listed and their UniProt description can be looked up by clicking the protein accession number. By so doing, the user discovers that for these sequences, the architecture suggests the involvement of the protein in RNA metabolism and its role of RNA binding protein, but that its annotation goes from "uncharacterized protein" to "nuclear cap binding protein sub unit 1 (NCBP1)" inferred by homology, by experimental evidence at transcript level (in Mustela pitorius furo et Xenopus laevis) and by experimental evidence at protein level (in Mus