MicroPhenoDB Associates Metagenomic Data with Pathogenic Microbes, Microbial Core Genes, and Human Disease Phenotypes

Microbes play important roles in human health and disease. The interaction between microbes and hosts is a reciprocal relationship, which remains largely under-explored. Current computational resources lack manually and consistently curated data to connect metagenomic data to pathogenic microbes, microbial core genes, and disease phenotypes. We developed the MicroPhenoDB database by manually curating and consistently integrating microbe-disease association data. MicroPhenoDB provides 5677 non-redundant associations between 1781 microbes and 542 human disease phenotypes across more than 22 human body sites. MicroPhenoDB also provides 696,934 relationships between 27,277 unique clade-specific core genes and 685 microbes. Disease phenotypes are classified and described using the Experimental Factor Ontology (EFO). A refined score model was developed to prioritize the associations based on evidential metrics. The sequence search option in MicroPhenoDB enables rapid identification of existing pathogenic microbes in samples without running the usual metagenomic data processing and assembly. MicroPhenoDB offers data browsing, searching, and visualization through user-friendly web interfaces and web service application programming interfaces. MicroPhenoDB is the first database platform to detail the relationships between pathogenic microbes, core genes, and disease phenotypes. It will accelerate metagenomic data analysis and assist studies in decoding microbes related to human diseases. MicroPhenoDB is available through http://www.liwzlab.cn/microphenodb and http://lilab2.sysu.edu.cn/microphenodb.


Introduction
The human body hosts a large number of microbes, mainly bacteria, followed by archaea, fungi, viruses, and protozoa. Microbes inhabit various organs of the human body, mainly the gastrointestinal tract but also the respiratory tract, oral cavity, stomach, and skin, and play important roles in human health and disease [1][2][3]. Microbial gene products have rich biochemical and metabolic activities in the host [4][5][6]. Microorganisms usually form a healthy symbiotic relationship with the host. However, when the microbial content becomes abnormal or exogenous microbes infect the host, the balance of the host microecology can be broken, which in turn may cause various diseases [7,8]. Tripartite network analysis in patients with irritable bowel syndrome demonstrated that the gut microbe Clostridia is significantly associated with brain functional connectivity and gastrointestinal sensorimotor function [9]. Strati et al. reported that Rett syndrome is substantially associated with a dysbiosis of both bacterial and fungal components of the gut microbiota [10]. The microbial communities on psoriatic skin differ from those on healthy skin and have a potential role in Th17 polarization to exacerbate cutaneous inflammation [11]. The ongoing pandemic of coronavirus disease 2019 (COVID-19) had affected more than 220 countries, areas, or territories worldwide by November 2020. Lung injury has been reported in most patients with confirmed severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection [12].
The interaction between microbes and hosts is a reciprocal relationship and remains largely under-explored [13]. Accurate relationship information between microbes and diseases can greatly assist studies in human health [14]. With the wide application of next-generation sequencing (NGS) technology, microbiological analysis methods and standards, such as metagenomic approaches, are being rapidly developed [15]. As a result, a large amount of experimental data has been published [16]. Thus, accurate database platforms are greatly needed to utilize these experimental data, determine the composition of pathogenic microbes in hosts, clarify microbe-disease relationships, and provide standardized high-quality annotation for clinical uses [17].
Due to the functional and clinical significance of microbes, several public databases have been established to collect microbe-disease association data, such as the Human Microbe-Disease Association Database (HMDAD) [18], Disbiome [19], the Virulence Factor Database (VFDB) [20], and the Comprehensive Antibiotic Resistance Database (CARD) [21]. HMDAD and Disbiome collate text-mining-based microbe-disease association data from peer-reviewed publications and describe the strength of the associations based on the credibility of the data sources. VFDB provides up-to-date knowledge of the virulence factors (VFs) of various bacterial pathogens; CARD contains high-quality reference data on the molecular basis of antimicrobial resistance, with an emphasis on the genes, proteins, and mutations involved. Data in VFDB and CARD help to explain the relationship between pathogenic microbial genes and the health status of hosts. In addition, to help physicians and healthcare providers diagnose infectious diseases quickly and accurately, a guideline for utilization of the microbiology laboratory for diagnosis of infectious diseases was developed and is regularly updated by the Infectious Diseases Society of America (IDSA) and the American Society for Microbiology (ASM) [22]. The curation and analysis of microbe-disease association data are essential for expediting translational research and application. However, these computational resources lack manually and consistently curated data to connect metagenomic data to pathogenic microbes, microbial core genes, and disease phenotypes.
To bridge this gap, we developed the MicroPhenoDB database (http://www.liwzlab.cn/microphenodb) by manually curating and consistently integrating microbe-disease association data. We collected and curated the microbe-disease associations from the IDSA guideline [22], the National Cancer Institute (NCI) Thesaurus OBO Edition (NCIT) [23], and the HMDAD [18] and Disbiome [19] databases, and also connected microbial core genes derived from the MetaPhlAn2 dataset [24] to pathogenic microbes and human diseases. A refined score model was adopted to prioritize the microbe-disease associations based on evidential metrics [18,25]. In addition, a sequence search web application was implemented to allow users to query sequencing data to identify pathogenic microbes in metagenomic samples, as well as to retrieve disease-related information on virulence factors and antibiotic resistance. MicroPhenoDB allows users to browse, search, access, and analyze data through user-friendly web interfaces, visualizations, and web service application programming interfaces (APIs).

Data collection and processing
Data collection and manual annotation
To ensure data quality, we integrated the association data with annotations from HMDAD and Disbiome and manually collated and curated microbe-disease association data from the IDSA guideline and NCIT (Figure 1). The IDSA guideline provides criteria for clinical identification of infectious microbes, while NCIT is a reference terminology that provides comprehensive information on infectious microbes. To enrich the annotation of disease-microbe associations, we manually traced the relevant literature in HMDAD and Disbiome; we also annotated the microbes at species-level resolution, including taxonomies and official names. Association data between infectious microbes and diseases in IDSA were extracted. Relevant information about disease phenotypes and microbes in the microorganism notes from NCIT was extracted as well. The collected and integrated association data include information about microbe symbols, disease symbols, the increased or decreased impacts of the microbes, PubMed identifiers, and validation methods.

Controlled vocabulary and ontology to describe microbes and diseases
In MicroPhenoDB, several standard terminology and controlled vocabulary resources were adopted to consistently annotate microbes and diseases (Figure 1). Different tools and reference databases might give different taxonomies for microbes. To avoid this discrepancy, the official names of microbes were taken from NCIT [23], and the taxonomy identifiers were adopted from the National Center for Biotechnology Information (NCBI) [26] and UniProt [27]. The relationships between core genes and microbes were annotated using the MetaPhlAn2 tool [28], the microbial gene functions were annotated using the InterProScan tool [29], and the virulence factors and the drug resistance information of microbes were retrieved from VFDB [20] and CARD [21], respectively. The disease phenotypes were annotated with official names, experimental factor terms, definitions, classifications, and cross-references using the Experimental Factor Ontology (EFO) [30]. EFO provides a systematic description of many experimental variables across the European Bioinformatics Institute (EMBL-EBI) databases and the National Human Genome Research Institute (NHGRI) genome-wide association study (GWAS) catalog [31]; it also combines parts of several popular ontologies, such as Orphanet Rare Disease Ontology [32], Human Phenotype Ontology [33], and Monarch Disease Ontology [34]. The versions or releases of databases and tools used in the MicroPhenoDB construction are detailed in Table S1.

Association score model
One of the main problems in exploiting extensive collections of aggregated microbiome data is how to prioritize the associations. According to the previous studies by Ma et al. [18] and Pinero et al. [25], we refined the association score model to prioritize the microbe-disease associations using additional evidential metrics, including the number of sources that report the association, the type of curation of each source, and the number of supporting publications in the manual curation.
For every disease i and every microbe j, the raw score of their relationship, Raw_score_ij, was defined by Equation (1). In Equation (1), W_IDSA is the weight of the association source from the IDSA guideline, W_NCIT is the weight of the association source from NCIT, and W_Literature is the weight of the association source from literature publications. N is the number of all diseases in MicroPhenoDB, and n_j is the number of diseases associated with microbe j. log(N/n_j) is computed to increase Raw_score_ij for microbes that are specifically associated with few diseases and to decrease Raw_score_ij for microbes that are broadly associated with many diverse diseases. In Equations (2)-(4), MicroPhenoDB assigns different weights to different evidential sources according to their reliabilities (Table 1) [25]. If the association is curated from literature, W_Literature is initially assigned as 0.25, otherwise as 0. If the association is curated from NCIT [23], W_NCIT is initially assigned as 0.5, which is double that of W_Literature, otherwise as 0. If the association is curated from IDSA [22], W_IDSA is initially assigned as 1.0, which is double that of W_NCIT, otherwise as 0. The three weights also depend on the direction of the abundance change of a microbe in a disease and the number of supporting publications. D_ij (D_ij ∈ {1, −1}) represents the direction of the abundance change of microbe j in disease i. If microbe j is increased in disease i, D_ij equals 1; if microbe j is decreased in disease i, D_ij equals −1. n_p is the number of publications in which an association between a disease and a microbe has been reported. From the distribution of the numbers of evidence, we found that n_p was less than 16 and mostly ranged from 1 to 2 (Figure S1).
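The weighting scheme described above can be sketched in Python as follows. This is only an illustration: the base weights (1.0, 0.5, 0.25) and the log(N/n_j) term come from the text, but the way they are combined here is an assumption, because Equations (1)-(4) are not reproduced in this text. In particular, multiplying each weight by D_ij, scaling the literature weight by n_p, using the natural logarithm, and multiplying the summed weights by the log term are all hypothetical choices, as are the function and parameter names.

```python
import math

# Base weights stated in the text for each evidence source.
BASE_WEIGHTS = {"IDSA": 1.0, "NCIT": 0.5, "Literature": 0.25}


def raw_score(sources, direction, n_publications, n_diseases_total, n_diseases_microbe):
    """Illustrative raw score for one microbe-disease pair.

    ASSUMPTIONS (not taken from the paper's equations):
      * each reporting source contributes base_weight * direction;
      * the literature weight is additionally multiplied by n_publications;
      * the summed weights are multiplied by ln(N / n_j).
    """
    if direction not in (1, -1):
        raise ValueError("direction must be 1 (increased) or -1 (decreased)")
    total = 0.0
    for source in sources:
        weight = BASE_WEIGHTS[source] * direction
        if source == "Literature":
            weight *= n_publications
        total += weight
    # Microbes linked to few diseases (small n_j) receive a larger multiplier.
    return total * math.log(n_diseases_total / n_diseases_microbe)


# Example: a microbe reported by IDSA and by two publications as increased
# in a disease, and associated with 3 of the 542 disease phenotypes.
example = raw_score(["IDSA", "Literature"], 1, 2, 542, 3)
```

Under these assumptions, a microbe associated with every disease in the database would receive a raw score of zero regardless of its sources, which reflects the stated purpose of the log(N/n_j) term.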
Finally, the sigmoid function was used to normalize Raw_score_ij to limit the range of the final association score Score_ij from −1 to 1. In Equation (5), 'e' represents the natural constant e. Score_ij can be used to judge the confidence of the relationship between a microbe and a disease phenotype. Please see the score distribution in Figure 2. A Score_ij greater than 0 indicates that the occurrence of the disease correlates with an increase in microbial abundance, and a Score_ij less than 0 indicates that the occurrence of the disease correlates with a decrease in microbial abundance. The greater the absolute value of Score_ij, the higher the number of previous reports of the respective microbe-disease association; the closer the score is to zero, the lower the number of previous reports of the respective microbe-disease association. By investigating the Score_ij distribution, we found that most associations had a Score_ij between −0.3 and 0.3, and the two peaks with Score_ij greater than 0.3 corresponded to high-confidence associations from NCIT and IDSA (Figure 2). This suggested that the score points of −0.3 and 0.3 would be reliable thresholds to assess the confidence level of an association.
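As a minimal sketch of the normalization and thresholding step, the Python snippet below maps a raw score onto the interval from −1 to 1 and applies the ±0.3 thresholds from the text. The exact form of Equation (5) is not reproduced here, so the rescaled sigmoid 2/(1 + e^(−x)) − 1 (equal to tanh(x/2)) is an assumption chosen only because it is sigmoid-based and has the stated range; the function names are hypothetical.

```python
import math

CONFIDENCE_THRESHOLD = 0.3  # the -0.3 / 0.3 thresholds come from the text


def normalize(raw):
    """Map a raw score onto the range -1 to 1.

    ASSUMPTION: uses 2 / (1 + e^-raw) - 1, written as tanh(raw / 2) for
    numerical stability. The paper's Equation (5) may differ.
    """
    return math.tanh(raw / 2.0)


def interpret(score):
    """Return (direction of abundance change, passes confidence threshold)."""
    if score > 0:
        direction = "increased"
    elif score < 0:
        direction = "decreased"
    else:
        direction = "none"
    return direction, abs(score) > CONFIDENCE_THRESHOLD


# Example: a positive raw score of 2.0 maps to a positive final score.
direction, confident = interpret(normalize(2.0))
```

The sign of the raw score is preserved by the mapping, so a positive final score still indicates increased abundance and a negative one decreased abundance.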

Implementation
The web applications in MicroPhenoDB were implemented in Java using the model-view-controller pattern and the Spring Boot framework and were deployed on an Apache Tomcat web server. The association data of microbes and disease phenotypes were stored in a MySQL database. Data access, search, and visualization were implemented using Ajax. The frontend interface was built with the Vue.js framework. The sequence search tool was implemented using the EMBL-EBI tool framework [35].
In total, 27,277 unique clade-specific core genes of 685 bacteria and viruses were retrieved from the MetaPhlAn2 dataset and were annotated with gene functions using InterProScan (Table 2). In addition, 4204 virulence factor genes and 2522 drug resistance genes were included from VFDB [20] and CARD [21], respectively. Small percentages of bacteria were annotated with virulence factor information (4.3%, 65/1497) and antimicrobial resistance information (4.4%, 66/1497) (Table 2).

Web interface
The MicroPhenoDB website (http://www.liwzlab.cn/microphenodb) provides user-friendly web interfaces that enable users to search, browse, prioritize, and analyze the microbe-disease association data in the database (Figure 4). The website offers multiple search applications for microbes, diseases, and associations to acquire prioritized association data with body site and microbe type filters. The prioritized microbe-disease associations can be downloaded as a CSV file for further analysis. The hierarchical structures of microbes and diseases are displayed on the 'Browse' web page. Information regarding the increasing or decreasing tendency of microbial abundance in a disease, the virulence factors and antibiotic resistance of the microbes, and their core genes is also available on the 'Browse' web page. In addition, MicroPhenoDB provides web service APIs for programmatic access to the association data, with output in JSON format. All the association data and the API documentation are available on the website. Users are also encouraged to submit their data on newly published microbe-disease associations. Once checked by our professional curators and approved by the submission review committee, a submitted record will be included in an updated release.
Table 3 The top ten body sites of disease-associated microbes in MicroPhenoDB
Figure 4 The MicroPhenoDB web interface

Applications of association data
MicroPhenoDB sequence search to explore metagenomics data
In MicroPhenoDB, microbes were connected with diseases through 5677 non-redundant associations and linked to unique clade-specific core genes via 696,934 relationships (Figure 5). Core genes could serve as a hub to connect metagenomic sequencing data to microbes and their associated diseases (Figure 5). A sequence search application was implemented on the MicroPhenoDB website (http://www.liwzlab.cn/microphenodb/#/tool) to allow users to query their metagenomic sequencing data against the MicroPhenoDB sequence datasets through the sequence alignment tools BLAST [36] and Bowtie2 [37] (Figure 5). The application can directly identify the composition of pathogenic microorganisms in metagenomic samples and can suggest potential disease phenotypes without running the usual metagenomic sequencing data processing and assembly, which are both time- and resource-consuming. Functional annotation of microbial core genes by the application includes gene ontology and pathway information. Searching against the sequence datasets of microbial pathogenic factors and drug resistance genes allows the identification of homologous genes and proteins related to virulence factors and antibiotic resistance (Figure 5).
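The hub role of core genes described above can be illustrated with a small Python sketch. It assumes that sequence alignment has already produced a list of matched core gene identifiers, and it joins them to microbes and then to diseases. The identifiers, the two lookup tables, and the `min_genes` cutoff are hypothetical placeholders rather than MicroPhenoDB records or parameters.

```python
from collections import Counter

# HYPOTHETICAL toy mappings; real links would come from the database downloads.
GENE_TO_MICROBE = {"geneA": "microbe_1", "geneB": "microbe_1", "geneC": "microbe_2"}
MICROBE_TO_DISEASES = {
    "microbe_1": ["disease_X"],
    "microbe_2": ["disease_X", "disease_Y"],
}


def summarize_hits(hit_gene_ids, min_genes=1):
    """Count distinct core genes hit per microbe and attach linked diseases.

    Gene identifiers that are not in GENE_TO_MICROBE are ignored, and
    repeated hits to the same gene are counted once.
    """
    genes_per_microbe = Counter(
        GENE_TO_MICROBE[gene] for gene in set(hit_gene_ids) if gene in GENE_TO_MICROBE
    )
    return {
        microbe: {
            "core_genes_hit": count,
            "diseases": MICROBE_TO_DISEASES.get(microbe, []),
        }
        for microbe, count in genes_per_microbe.items()
        if count >= min_genes
    }


# Example: three alignments, two of them to the same gene, plus one unknown gene.
report = summarize_hits(["geneA", "geneA", "geneB", "unmapped_gene"])
```

Requiring more than one distinct core gene per microbe (a larger `min_genes`) is one simple way to reduce spurious identifications, although the paper does not state which cutoff, if any, the application uses.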
To assess the usability of the sequence search, we used the application to analyze an existing metagenomic dataset downloaded from the Genome Sequence Archive (accession: PRJCA000880) [38]. The dataset contained metagenomic data of lung biopsy tissues from 20 patients with pulmonary infection [39]. Our search identified pathogenic microbes in 95% (19 of 20) of patients, higher than the 75% identification rate (15 of 20) found through the original metagenomic NGS (mNGS) analysis [39]. In addition, our search identified 37 pathogenic microbes in patients, while the mNGS method identified only 29 (Table S2). Of the 37 microbes, 23 were identical to those identified by mNGS analysis. It was hard to estimate the false positive rate among the other 14 microbes, but we found that they may cause infections in patients with underlying diseases such as immunodeficiency. Therefore, this comparison suggested that the MicroPhenoDB sequence search application could screen metagenomic data for effective identification of pathogenic microbes. Due to the large size of metagenomic data and the need for a broadband network, we provide a software package of the search application for users to download and run locally. We also encourage users to upload microbial abundance information to the online application for further analysis and visualization.

Distinguish clinical phenotypes of SARS-CoV-2 infection from different viral respiratory infections
The single-stranded RNA coronavirus SARS-CoV-2 can infect humans and cause COVID-19 disease [40]. Its structure is similar to those of viruses causing severe acute respiratory syndrome (SARS) and Middle East respiratory syndrome (MERS) [41]. At present, the diagnosis of SARS-CoV-2 infection is mainly based on clinical phenotypes, chest computed tomography (CT), and nucleic acid testing. Compared with CT and nucleic acid testing, clinical phenotype monitoring has significant advantages, such as a short turnaround time, low cost, and convenience [42]. To distinguish clinical phenotypes of SARS-CoV-2 infection from different viral respiratory infections, we searched MicroPhenoDB and obtained association data that contained 63 disease phenotypes and 14 respiratory tract infection viruses, such as human rhinovirus, parainfluenza virus, respiratory syncytial virus, metapneumovirus, and coronaviruses. The data were then imported into the Cytoscape software [43] for network analysis. The output network (Figure 6) indicated that SARS-CoV-2 shares the clinical phenotype of pneumonia with the majority of other respiratory infection viruses, as well as the clinical phenotypes of dry-cough, headache, fever, myalgia, vomiting, diarrhea, and respiratory disease syndrome (underlined in green) with several influenza viruses and other coronaviruses. Importantly, the network also showed that dyspnea, fatigue, lymphopenia, anorexia, and septic shock (underlined in blue) were common clinical phenotypes of SARS-CoV-2 infection distinguished from other viral respiratory infections [12,44,45]. Bear in mind that these phenotypes of SARS-CoV-2 infection might be frequent complications of other diseases and treatments.
For example, dyspnea is a frequent complication of chronic respiratory diseases [46], lung cancer [47], and hepatopulmonary syndrome [48]; septic shock is a complication of pneumococcal pneumonia, chronic corticosteroid treatment, and current tobacco smoking [49]; fatigue is a complication of multiple types of cancer [50,51] and Parkinson's disease [52]; lymphopenia is a complication of human immunodeficiency virus infection [53]. However, our results suggest that these common clinical phenotypes could distinguish SARS-CoV-2 infection from infections by SARS-CoV, MERS-CoV, and other respiratory viruses.
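The shared-versus-distinguishing comparison read off the network can also be expressed as simple set operations. The Python sketch below is illustrative only: the SARS-CoV-2 phenotypes are those named in the text, whereas the two comparison entries are hypothetical placeholders rather than records from MicroPhenoDB, and the function name is an assumption.

```python
def shared_and_distinguishing(phenotypes_by_virus, target):
    """Split the target's phenotypes into those shared with any other
    virus in the mapping and those associated only with the target."""
    target_set = set(phenotypes_by_virus[target])
    others = set()
    for virus, phenotypes in phenotypes_by_virus.items():
        if virus != target:
            others.update(phenotypes)
    return target_set & others, target_set - others


# SARS-CoV-2 phenotypes are taken from the text; the two comparison
# viruses are HYPOTHETICAL placeholders, not database records.
EXAMPLE = {
    "SARS-CoV-2": {
        "pneumonia", "dry cough", "headache", "fever", "myalgia",
        "vomiting", "diarrhea", "dyspnea", "fatigue", "lymphopenia",
        "anorexia", "septic shock",
    },
    "placeholder_virus_A": {"pneumonia", "fever", "headache", "myalgia", "dry cough"},
    "placeholder_virus_B": {"pneumonia", "vomiting", "diarrhea"},
}

shared, distinguishing = shared_and_distinguishing(EXAMPLE, "SARS-CoV-2")
print(sorted(distinguishing))
```

A phenotype counts as distinguishing here only relative to the viruses included in the comparison, which mirrors the caveat in the text that the same phenotypes can arise from other diseases and treatments.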

Association network in different body sites
The microbe-disease association data can be downloaded and used for further analysis. To generate a network to explore the reliable connections between microbial changes and diseases in multiple body sites, we obtained the association data for body sites such as the vagina, urinary tract, and genitals using the association score thresholds mentioned above (>0.3 and <−0.3). The resulting association data were imported into the Cytoscape software [43] for network analysis. The output network (Figure 7) indicated that the decreasing abundance of Lactobacillus (underlined in red) was related to vaginal inflammation and bacterial vaginosis in the vagina, while the increasing abundance of Chlamydia (underlined in green) was associated with lymphogranuloma venereum in the genitals. Moreover, the network showed that the increasing abundance of Mycoplasma genitalium (underlined in blue) was associated with multiple diseases involving the genitals, such as pelvic inflammatory disease, nongonococcal urethritis, and nonchlamydial nongonococcal urethritis. Furthermore, the network showed that a microbe abnormality could be associated with diseases involving different body sites. For example, the increasing abundance of Neisseria gonorrhoeae (underlined in purple) was associated with two diseases, one in the genitals and one in the urinary tract. To assess microbial pathogenicity, users are advised to filter the data by association score and to follow the supporting publications for further investigation. Users can follow our step-by-step guidelines on the website (http://www.liwzlab.cn/microphenodb/#/guideline) to perform similar association analyses and generate Cytoscape networks.
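The filtering step described above might be scripted roughly as follows. This Python sketch reads association rows from CSV text, keeps those at the chosen body sites whose score is above 0.3 or below −0.3, and writes tab-separated edges in the style of a simple interaction file that Cytoscape can import. The column names (`microbe`, `disease`, `body_site`, `score`), the interaction labels, and the sample scores are assumptions for illustration; they are not the actual layout or values of the MicroPhenoDB download.

```python
import csv
import io

THRESHOLD = 0.3  # from the text: keep scores > 0.3 or < -0.3


def confident_edges(csv_text, body_sites, threshold=THRESHOLD):
    """Return (microbe, interaction, disease) tuples for confident associations."""
    edges = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        score = float(row["score"])
        if row["body_site"] in body_sites and abs(score) > threshold:
            interaction = "increased_in" if score > 0 else "decreased_in"
            edges.append((row["microbe"], interaction, row["disease"]))
    return edges


def to_edge_lines(edges):
    """Format edges as tab-separated 'source, interaction, target' lines."""
    return "\n".join("\t".join(edge) for edge in edges)


# Microbe and disease names appear in the text; the scores are MADE-UP
# illustrative values and the column names are hypothetical.
SAMPLE = """microbe,disease,body_site,score
Lactobacillus,bacterial vaginosis,vagina,-0.45
Mycoplasma genitalium,pelvic inflammatory disease,genitals,0.60
Lactobacillus,vaginal inflammation,vagina,-0.10
"""

edges = confident_edges(SAMPLE, {"vagina", "genitals", "urinary tract"})
print(to_edge_lines(edges))
```

Encoding the direction of the abundance change as the interaction type keeps the information that Figure 7 conveys with solid and dashed lines, so it can be mapped to edge styles after import.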

Concluding remarks
Microbes play important roles in human health and disease. The curation and analysis of microbe-disease association data are essential for expediting translational research and application. In this study, we developed the MicroPhenoDB database by manually curating and consistently integrating microbe-disease association data. As far as we are aware, MicroPhenoDB is the first database platform to detail the relationships between pathogenic microbes, core genes, and disease phenotypes. In terms of data coverage, scoring models, and web applications, MicroPhenoDB outperformed data resources that contain similar association data (Table 4). For example, the numbers of associations, microbes, disease phenotypes, and supporting evidence in MicroPhenoDB were approximately 11.1-, 6.1-, 13.9-, and 18.9-fold those in HMDAD, respectively. Compared with both HMDAD and Disbiome, MicroPhenoDB refined the confidence scoring model using additional evidential metrics with different weights; it standardized the association annotations by manual curation and included pathogenic data on virulence factors, microbial core genes, and antibiotic resistance genes. Moreover, MicroPhenoDB implemented web applications and APIs for identifying pathogenic microbes in metagenomic data.
In MicroPhenoDB, many associations with high confidence scores came from our manual curation of the up-to-date clinical guidelines supported by IDSA and ASM. MicroPhenoDB assigned higher weight values to the associations derived from the guidelines and lower weight values to the associations from other literature data and databases. The original model for scoring the confidence of disease-microbe associations in HMDAD was based on a single type of literature evidence. Our MicroPhenoDB score model rated different supporting evidence according to the credibility of the related sources and provided a score to evaluate each disease-microbe association.
By integrating unique, clade-specific microbial core genes and using the data from MetaPhlAn2, the MicroPhenoDB sequence search application enables rapid identification of existing pathogenic microorganisms in metagenomic samples without running the usual sequencing data processing and assembly. However, the resulting associations from the sequence search do not guarantee microbial pathogenicity but provide clues for further investigation. The annotated core genes are also limited in size and cannot represent all microbial species. To consistently analyze the important functions of microbes, other data or tools are also recommended, such as UniRef clusters [54], MetaCyc [55], HUMAnN2 [56], and pan-genomic data.
Figure 7 The Cytoscape network illustrates the associations between clinical phenotypes and microbes at different body sites. The diamonds represent clinical phenotypes resulting from a microbial abnormality at different body sites. The red circles represent the microbes. A larger circle or diamond indicates more connections to a clinical phenotype or a virus. The solid connection lines represent the associations between diseases and microbes with an increase in microbial abundance, and the dashed connection lines represent the associations between diseases and microbes with a decrease in microbial abundance. Underlines indicate the microbes discussed in the main text.
To serve the research community, we will update the database every six months and constantly improve it with more features and functionalities. As a novel and unique resource, MicroPhenoDB connects pathogenic microbes, microbial core genes, and disease phenotypes; therefore, it can be used in metagenomic data analyses and assist studies in decoding microbes associated with human diseases.

Data availability
To access the association data, the online applications, and the software package, please visit http://www.liwzlab.cn/microphenodb/#/download.

Competing interests
The authors have declared no competing interests.