eFG: an electronic resource for Fusarium graminearum

Fusarium graminearum is a plant pathogen that causes crop diseases and has led to huge economic losses worldwide in past decades. Recently, the accumulation of different types of molecular data has provided insights into the pathogenic mechanism of F. graminearum and might help to develop efficient strategies to combat this destructive fungus. Unfortunately, most available molecular data related to F. graminearum are scattered across various sources, each of which provides only limited information on the complex biological systems of the fungus. In this work, we present a comprehensive database, namely eFG (Electronic resource for Fusarium graminearum), to help the community understand this destructive pathogen. In particular, a large amount of functional genomics data generated by our group is deposited in eFG, including protein subcellular localizations, protein–protein interactions and orthologous genes in other model organisms. This knowledge can help both to disclose the molecular underpinnings of F. graminearum pathogenesis and to develop efficient strategies to combat this pathogen. To the best of our knowledge, eFG is the most comprehensive functional genomics database for F. graminearum to date. The eFG database is freely accessible at http://csb.shu.edu.cn/efg/ through a user-friendly and interactive interface, and all data can be downloaded freely. Database URL: http://csb.shu.edu.cn/efg/


Introduction
The filamentous ascomycete Fusarium graminearum (teleomorph Gibberella zeae) is the major causal agent of Fusarium head blight (FHB) and Fusarium ear rot (1). These diseases affect wheat, barley, maize and other crops, lead to yield loss and food quality problems, and are becoming serious problems in many countries. In general, FHB damages crops within a few weeks and results in huge economic losses (2).
Most importantly, this pathogen produces mycotoxins, e.g. deoxynivalenol and zearalenone, which contaminate food products and therefore increase health risks (3,4). However, it is difficult to fight this destructive fungus because its pathogenic mechanism is only partly understood (5,6).
Recently, the accumulation of different kinds of molecular data has provided invaluable information on the biology of F. graminearum, which can help to develop effective strategies to fight this fungus. For example, the complete genome of F. graminearum provides insights into the genome regions that may be enriched for infection-related genes (7). A comprehensive genome database, FGDB (Fusarium graminearum Genome Database), provides information on a manually revised gene set (8). In addition, some 'omics' data provide valuable information on the biological systems inside the fungus. For example, our recently predicted protein-protein interactions deposited in the FPPI database (9) give a global interactome map of F. graminearum proteins; gene expression data from the PLEXdb database (http://www.plexdb.org/) (10) describe transcriptional activity under distinct conditions; and pathway information available in the KEGG database (11) characterizes the context in which genes function.
Unfortunately, most of the valuable information described above is scattered: some is deposited in public databases while some is described only in the literature, and each single source provides only limited information on the complex biological systems of F. graminearum. Therefore, a ready-to-use, comprehensive molecular database for F. graminearum is needed. To fill this gap, we built such a unified database, namely eFG (Electronic resource for Fusarium graminearum), which contains both genome and systematic functional information for F. graminearum. Compared with existing databases for the Fusarium genus, e.g. CiF (http://www.fusariumdb.org/), FungiDB (http://fungidb.org/fungidb/), CFGP (http://cfgp.riceblast.snu.ac.kr/) and EnsemblFungi (http://fungi.ensembl.org/), the eFG database is more comprehensive and provides some novel and specific information for F. graminearum. In addition to genome information collected from public databases, eFG incorporates functional annotations such as pathway annotations, enzyme families and transcription factors. In particular, eFG contains a protein interactome map, protein subcellular localization annotations, pathogenic genes and orthologs of F. graminearum genes in other species (including fungi, bacteria and mammals), all of which were predicted by our group in previous work (9,12,13). These derived functional genomics data can help us to understand the possible functions of F. graminearum proteins. For example, the subcellular localization data give a spatial landscape of the proteins encoded by the whole genome within a cell, while the orthology information can help to annotate unknown genes by transferring annotations between orthologs. As a case study, by integrating data deposited in the eFG database, we show that the pathogenic genes of F. graminearum have molecular characteristics that differ from the whole-genome background, e.g. a higher degree in the interactome map and enrichment in the MAPK signaling pathway and in cysteine and methionine metabolism. We believe that the comprehensive database eFG can shed light on the molecular mechanisms underlying pathogenesis of F. graminearum and help the community to develop efficient strategies to combat this pathogen. The database can be accessed through common browsers, including Internet Explorer (version 9/10), Firefox (version 15/16), Google Chrome and Safari (version 6), and all the data can be freely downloaded for academic purposes.

Database overview
The eFG database integrates different kinds of data, including genome information (gene and protein sequence, promoter sequence), proteome information (protein domain architecture, protein subcellular localization, protein-protein interaction) and functional annotations (pathogenic gene, transcription factor, catalytic activity of enzyme, pathway, gene ontology term and orthologs), into a unified database (Figure 1). All the data deposited in eFG can be freely downloaded for academic use. Furthermore, eFG provides access to gene expression data measured under different conditions and deposited in the GEO (Gene Expression Omnibus) (14) and PLEXdb (15) databases for further analysis.
In addition, a user-friendly interactive interface was constructed for querying genes of interest. By submitting a gene symbol (e.g. FGSG_00296), one can retrieve annotations of interest, homologs in other databases and orthologs in other species, among others, for this gene by selecting the corresponding drop-down options (Figure 2). Furthermore, genes can be retrieved by enzymatic function identifier [e.g. a query with 'EC:1.3.5.1' returns genes with the catalytic function 'succinate dehydrogenase (ubiquinone)'], protein domain (e.g. a query with 'IPR001926' retrieves the genes that contain the domain 'Pyridoxal phosphate-dependent enzyme, beta subunit'), KEGG pathway (e.g. a query with 'fgr00260' lists all genes in the 'glycine, serine and threonine metabolism' pathway) and annotation key word (e.g. 'kinase' and 'transferase' return the genes annotated with the respective key word). Logical combination of key words with 'AND' is also supported (e.g. the key words 'kinase and serine' list the genes annotated with both 'kinase' and 'serine'). One can retrieve all available information for a single gene, including sequence information, localization information, domain information, pathogenic information, TF (transcription factor) information, enzyme catalytic information, pathway information, protein-protein interactions, ortholog information and best-hit homologs in other databases. Specifically, one can query an unknown sequence with BLAST (Basic Local Alignment Search Tool) (16). With the batch input of a set of genes, one can obtain comprehensive information on the gene set and investigate the functional relationships among these genes, e.g. whether their products interact or whether they belong to the same pathway (Figure 2).
For instance, the possible interactions between these gene products are first retrieved from the interactome map and then shown in a graph visualized with Cytoscape Web (17), a web implementation of Cytoscape (18). This enables the user to explore the network interactively, e.g. panning and zooming without changing the original layout, and dragging or clicking the nodes. Subsequently, the pathways and GO (gene ontology) terms associated with the queried proteins are listed together with P-values calculated with a hypergeometric test, which indicate the pathways and terms in which the queried proteins are enriched. In addition, if the gene(s) of interest is (are) not known, one can query the eFG database by simply submitting gene sequence(s); BLAST is then run in the background to retrieve the most similar genes or proteins in the F. graminearum genome (Figure 2).
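The enrichment P-values mentioned above can be illustrated with a minimal sketch. It assumes a one-sided (over-representation) hypergeometric test; the function name and the numbers in the usage example are illustrative and are not taken from the eFG implementation.

```python
from math import comb


def hypergeom_enrichment_p(population: int, annotated: int,
                           sample: int, hits: int) -> float:
    """Return P(X >= hits) for a hypergeometric draw.

    population: number of genes in the background (e.g. the whole genome)
    annotated:  number of background genes annotated with the pathway/GO term
    sample:     number of genes in the query set
    hits:       number of query genes annotated with the pathway/GO term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    # Sum the probability mass of observing `hits` or more annotated genes.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Illustrative (made-up) numbers: 100 query genes, 6 of which fall in a
# pathway that has 50 members in a background of 13719 genes.
p_value = hypergeom_enrichment_p(population=13719, annotated=50,
                                 sample=100, hits=6)
print(f"enrichment P-value: {p_value:.3g}")
```

A small P-value indicates that the query set contains more members of the pathway than expected by chance; in practice a multiple-testing correction would usually be applied across all tested pathways and terms.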
Beyond the features above, the eFG database provides cross-references to other databases. For example, one can follow a link to the KEGG database by clicking a pathway retrieved for the queried genes. Similarly, for the orthologs of an F. graminearum gene, one can follow links to the source databases in which the orthologs were identified; these databases provide more detailed information about the orthologs, from which the function of the F. graminearum gene can be inferred more easily.

Database content
F. graminearum genome. The full genome of F. graminearum was finished in 2006 (19), and was later manually revised and deposited in the FGDB database (8). The assembled FG3 genome (version 3.1), which contains the potential protein sequences, and the function annotations for the corresponding genes were downloaded from FGDB. These data were imported into the eFG database, yielding 13 719 genes, each with the 1000 base pair sequence upstream of its transcription start site; the possible function annotations for these genes are organized in FunCat format (20). Moreover, protein domains were identified with InterProScan (21) for all potential proteins and were deposited into eFG.
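As an illustration of what the upstream 1000 base pair sequence means for genes on either strand, the following is a minimal sketch. It assumes 0-based coordinates and a chromosome held in memory as a string; the function names are hypothetical and this is not the pipeline used to build eFG.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def upstream_region(chromosome: str, tss: int, strand: str,
                    length: int = 1000) -> str:
    """Return up to `length` bases immediately upstream of a gene.

    tss is the 0-based index of the first transcribed base (assumption).
    The result is reported 5'->3' on the gene's own strand and is
    truncated at the chromosome ends.
    """
    if strand == "+":
        start = max(0, tss - length)
        return chromosome[start:tss]
    if strand == "-":
        # On the minus strand, "upstream" lies at higher coordinates.
        end = min(len(chromosome), tss + 1 + length)
        return reverse_complement(chromosome[tss + 1:end])
    raise ValueError("strand must be '+' or '-'")


toy_chromosome = "ACGTTGCAAT"
print(upstream_region(toy_chromosome, tss=4, strand="+", length=3))
print(upstream_region(toy_chromosome, tss=4, strand="-", length=3))
```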
F. graminearum Enzyme Proteins. Enzymes catalyze the biochemical reactions of the cell. The Enzyme Commission number (EC number) is a numerical classification scheme for enzymes based on the reactions they catalyze, and is used here to identify the catalytic activities of F. graminearum enzyme proteins. We collected 1206 enzyme proteins with known catalytic activities from the KEGG database and imported them into the eFG database. As shown in Figure 3B, the two largest groups of enzyme proteins in F. graminearum are transferases and hydrolases.
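The grouping of enzymes into classes such as transferases and hydrolases follows from the first digit of the EC number. A minimal sketch of this mapping is given below; the helper names and the example identifiers in the tally are illustrative only and are not eFG data.

```python
from collections import Counter

# Top-level EC classes. Class 7 (translocases) was added to the EC scheme
# in 2018 and is included only for completeness.
EC_CLASSES = {
    1: "oxidoreductase",
    2: "transferase",
    3: "hydrolase",
    4: "lyase",
    5: "isomerase",
    6: "ligase",
    7: "translocase",
}


def ec_class(ec_number: str) -> str:
    """Map an EC number such as 'EC:1.3.5.1' or '2.7.11.1' to its class."""
    digits = ec_number.strip()
    if digits.upper().startswith("EC"):
        digits = digits[2:].lstrip(": ")
    return EC_CLASSES[int(digits.split(".")[0])]


def tally_classes(ec_numbers):
    """Count how many EC numbers fall into each top-level class."""
    return Counter(ec_class(ec) for ec in ec_numbers)


print(ec_class("EC:1.3.5.1"))
print(tally_classes(["2.7.11.1", "EC:2.7.1.1", "3.1.1.1"]))
```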

Subcellular Localizations. Protein subcellular localization information describes the spatial arrangement of proteins within cells, thereby providing important functional information on proteins. However, determining the subcellular localization of proteins experimentally is laborious and time-consuming. In our previous work, a computational approach based on a Support Vector Machine (SVM) and protein primary structure (12) was proposed to predict the subcellular locations of F. graminearum proteins. In addition, for the F. graminearum proteins with significant sequence similarity to proteins in a non-redundant fungal dataset with subcellular localization annotations collected from the UniProtKB database, sequence alignment was used to transfer the annotations of homologous proteins to uncharacterized F. graminearum proteins, so that the F. graminearum proteins are annotated more comprehensively. In the eFG database, the predicted subcellular localizations of 12 786 proteins are clustered into 22 groups (Table 1).
F. graminearum Orthologs and Homologs. The orthologs of F. graminearum genes in other well-studied organisms can help to annotate uncharacterized F. graminearum genes. Using an existing tool, InParanoid (25), we identified the orthologs of F. graminearum genes in 24 organisms (Table 2); the species most closely related to F. graminearum share the largest numbers of orthologs with it. This orthology information can help to understand the possible functions of F. graminearum genes.
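The idea of annotating uncharacterized genes through their orthologs can be sketched as follows. This is only an illustration under simple assumptions (each gene maps to a list of ortholog identifiers, and annotations are plain strings); all identifiers and annotations below are hypothetical placeholders, not entries from eFG or InParanoid.

```python
def transfer_annotations(fg_annotations, orthologs, reference_annotations):
    """Propose annotations for unannotated genes from their orthologs.

    fg_annotations:        gene id -> existing annotation ('' if unknown)
    orthologs:             gene id -> list of ortholog ids in other species
    reference_annotations: ortholog id -> annotation in the source database
    Returns gene id -> sorted list of distinct inherited annotations.
    """
    transferred = {}
    for gene, partner_ids in orthologs.items():
        if fg_annotations.get(gene):
            continue  # already characterized, nothing to transfer
        inherited = sorted({
            reference_annotations[p]
            for p in partner_ids
            if p in reference_annotations
        })
        if inherited:
            transferred[gene] = inherited
    return transferred


# Hypothetical placeholder identifiers and annotations.
fg = {"FG_GENE_A": "", "FG_GENE_B": "protein kinase", "FG_GENE_C": ""}
orth = {"FG_GENE_A": ["REF_1", "REF_2"], "FG_GENE_B": ["REF_3"],
        "FG_GENE_C": ["REF_9"]}
ref = {"REF_1": "chitin synthase", "REF_2": "chitin synthase",
       "REF_3": "serine/threonine kinase"}
print(transfer_annotations(fg, orth, ref))
```

Genes whose orthologs carry no annotation (FG_GENE_C here) remain unannotated, and a transferred annotation should be treated as a hypothesis rather than an established function.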
In total, 216 263 interactions involving 6741 unique proteins were predicted, of which 1716 interactions were predicted by both methods (Figure 4D). Furthermore, we constructed a core PPI dataset that contains the high-confidence interactions predicted by either interologs or DDIs, together with the interactions predicted by both methods even when they are not of high confidence. In total, the core set contains 34 675 interactions between 4047 proteins. All these protein interactions can be found in the eFG database and are freely downloadable from the Web site.
Pathogenic Genes. In the eFG database, we also collected pathogenic genes of F. graminearum from the literature. Moreover, the pathogenic genes predicted in our previous work (13) were imported into the eFG database. In brief, genes that interact with known pathogenic genes are more likely to be pathogenic genes themselves. With the core PPI dataset and the known pathogenic genes from the PHI-base database (http://www.phi-base.org/) (34) as seed genes, pathogenic modules were identified based on the genes differentially expressed before and after the invasion of F. graminearum, and the genes in these modules were regarded as putative pathogenic genes. Currently, a total of 100 pathogenic genes are deposited in the eFG database.
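The selection rule for the core PPI set and the idea that neighbors of known pathogenic genes are candidate pathogenic genes can be sketched as below. This is a simplified illustration: the confidence scores, the threshold and all function names are assumptions, and the sketch omits the differential-expression step used for module identification in the original method (13).

```python
def pair_key(a: str, b: str):
    """Order-independent key for an undirected interaction."""
    return tuple(sorted((a, b)))


def core_interactions(interolog_scores, ddi_scores, threshold):
    """Keep pairs that are high-confidence in either method or found by both.

    Both arguments map pair_key(a, b) -> confidence score (hypothetical).
    """
    core = set()
    for pair in set(interolog_scores) | set(ddi_scores):
        by_both = pair in interolog_scores and pair in ddi_scores
        best = max(interolog_scores.get(pair, 0.0), ddi_scores.get(pair, 0.0))
        if by_both or best >= threshold:
            core.add(pair)
    return core


def neighbors_of_seeds(interactions, seeds):
    """Return proteins that interact with at least one seed gene product."""
    candidates = set()
    for a, b in interactions:
        if a in seeds and b not in seeds:
            candidates.add(b)
        if b in seeds and a not in seeds:
            candidates.add(a)
    return candidates


# Toy data with placeholder protein names.
interolog = {pair_key("A", "B"): 0.9, pair_key("B", "C"): 0.2}
ddi = {pair_key("C", "B"): 0.3, pair_key("C", "D"): 0.1}
core = core_interactions(interolog, ddi, threshold=0.8)
print(sorted(core))
print(sorted(neighbors_of_seeds(core, seeds={"A"})))
```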

Case study: characteristics of pathogenic genes
Understanding the molecular underpinnings of F. graminearum pathogenesis is important for developing efficient strategies to combat this fungus. Therefore, using the information extracted from the eFG database, we investigated whether there are specific molecular patterns associated with the pathogenic genes of F. graminearum. By submitting the 100 pathogenic genes to the eFG database as a multi-gene query, we found that these genes are significantly enriched in two pathways: the MAPK signaling pathway (P-value 1.91 × 10⁻⁵) and cysteine and methionine metabolism (P-value 1.64 × 10⁻³). The former is consistent with previous findings that the MAPK pathway is involved in the pathogenesis of phytopathogenic fungi (35). The enrichment of cysteine and methionine metabolism indicates that the known pathogenic genes of F. graminearum may participate in the synthesis of sulfur-containing amino acids.
The enzyme catalytic activity analysis indicates that 19 pathogenic genes encode enzymes, of which 11 are transferases, suggesting that transferases may be particularly important for F. graminearum to infect its host. The remaining eight comprise one oxidoreductase, one isomerase, two hydrolases, two lyases and two ligases. From the function annotations obtained from eFG for the pathogenic genes, we found that 29 pathogenic genes are kinases, 14 are synthases, 7 are cyclin-dependent kinases and 6 are involved in the MAPK pathway.
In addition, we investigated the subcellular localizations of the pathogenic genes, which occur in 18 of the 22 subcellular locations (Figure 5A). We found that the distribution of subcellular localizations of the pathogenic genes is significantly different (P-value 2.63 × 10⁻⁶) from that of the whole-genome genes. The subcellular locations in which pathogenic genes occur most frequently are the cytoplasm, the nucleus and the cell membrane.
In summary, the above analysis suggests that specific molecular patterns are associated with the pathogenic genes of F. graminearum, and these patterns can help to predict new potential pathogenic genes in the future.