Procedure and datasets to compute links between genes and phenotypes defined by MeSH keywords [version 1; referees: 2 approved with reservations]

Algorithms mining relationships between genes and phenotypes can be classified into several overlapping categories based on how a phenotype is defined: by training genes known to be related to the phenotype, by keywords, or as disease phenotypes handled by dedicated algorithms. In this work an algorithm for linking phenotypes to Gene Ontology (GO) annotations is outlined. It does not require training genes and is based on the algorithmic principles of the Genes to Diseases (G2D) gene prioritization tool. In the outlined algorithm phenotypes are defined by terms of Medical Subject Headings (MeSH). GO annotations are linked to phenotypes through intermediate MeSH D terms of drugs and chemicals. This inference uses the mathematical framework of fuzzy binary relationships based on fuzzy set theory. The strength of relationships between the terms is defined through the frequency of co-occurrences of pairs of terms in PubMed articles and the frequency of association between GO annotations and MeSH D terms in the NCBI Gene gene2go and gene2pubmed datasets. Three plain tab-delimited datasets required by the algorithm are contributed to support computations. These datasets can be imported into a relational MySQL database. MySQL statements to create tables are provided, and a MySQL procedure implementing the computations performed by the outlined algorithm is listed. The plain tab-delimited format of the contributed tables makes it easy to use these datasets in other applications.


Introduction
Understanding molecular mechanisms underlying both normal cellular processes and disease-causing gene perturbations has numerous applications in clinical diagnostics, personal genomics and engineering [1][2][3][4][5]. Most genomic studies address two major questions: (i) What genomic and molecular markers are associated with an observed phenotype? (ii) What molecular mechanisms lead to that phenotype in the studied organism? Answering these questions and uncovering gene-phenotype relationships mostly relies on experimental research that has already generated very large amounts of high-throughput data stored in public databases [6][7][8][9][10]. New knowledge about genes and their functions is acquired all the time through the constant gathering of genomic data. To date there are more than 1500 databases hosting various types of genomic and molecular biology data 11, accompanied by an increasing number of research publications analyzing newly generated data 12. For this reason integrative algorithms that analyze high-throughput data by mining genomic databases and literature are the focus of intensive research, resulting in many publicly available bioinformatics tools for biologists and clinical researchers 6,[13][14][15][16][17][18][19].
Biologists analyze lists of genes to dissect individual or collective gene involvement in the biological function that is being investigated, for example:
• functions of differentially expressed genes identified in a microarray or RNA-Seq experiment;
• relationships between a biological process of interest and target genes regulated by a transcription factor identified by ChIP-Seq experiment;
• causative relationships between functions of genes found in a chromosomal deletion or duplication identified in a patient and a clinical phenotype of the patient;
• identifying candidate genes from gene lists in literature and databases.
Finding meaningful relationships between genes in a large list and a phenotype by manually reviewing the literature and genomic databases is very laborious and time-consuming. Efforts to automate this process have mostly been directed towards the prioritization of human disease genes 20,21 and less towards model organisms and general phenotypes 10. Gene prioritization tools that can be used to infer relationships between genes and phenotypes differ from each other with respect to the computational algorithms and data sources used in prioritization [21][22][23]. In computations, the definition of a phenotype determines the rules by which the algorithm mines available data resources to retrieve gene-phenotype links.

Phenotype definitions
The definition of a phenotype widely accepted in biology is "the observable trait or the collection of traits of an organism resulting from the interaction of the genetic makeup of the organism and the environment", which means different things in different contexts 24,25. In medicine the phenotype often refers to a disease or abnormality 26. In cellular contexts measurable cellular phenotypes are represented by features of cells such as the morphology (shape, size), the behavior (motility, growth), the developmental stage, the expression of specific genes and the rate of biochemical reactions 8,27.
Specific vocabularies of phenotype terminology are implemented as ontologies containing concepts, the relationships between the concepts and the definitions of both 28,29. Specialized phenotype vocabularies are available for model organisms 30,31, life sciences [32][33][34][35][36] and human diseases 37,38. A phenotype can also be defined as a subset of genes known to be functionally related to the phenotype of interest, a definition usually used in gene prioritization algorithms 23,39. However, if the phenotype of interest has not been well studied and does not have genes linked to it, then it is difficult or even impossible to use this approach.
Terms of the Medical Subject Headings (MeSH) vocabulary can serve as appropriate phenotypic descriptions 38. MeSH terms are curated and are assigned to the articles in PubMed to adequately reflect the content of each article, since they are meaningfully associated with the biological processes that they denote. Phenotypes in the Mammalian Phenotype Ontology (MPO) used in the Mouse Genome Informatics (MGI) database 40 are also mapped to MeSH terms.

Approaches to infer gene phenotype links
Gene prioritization tools have to establish links between genes and phenotypes by use of some algorithm. Several overlapping categories of tools can be distinguished based on how a phenotype is defined: by training genes known to be related to the phenotype, by keywords, and tools designed to work with disease phenotypes. Table 1 lists maintained prioritization tools from these categories that are frequently cited in Google Scholar. Algorithms defining phenotypes by training genes evaluate the similarity between training genes and candidate genes. Supervised machine learning (most often kernel methods) is used in this category of tools 39,41. Algorithms describing phenotypes by keywords usually use frequencies of gene-associated documents that have keyword matches. The majority of algorithms are designed to prioritize genes with respect to disease phenotypes defined by the keywords, by the training genes, or by both.

Short summary of most representative tools
Endeavour. In Endeavour the phenotype is defined by the training genes. It builds a phenotype model using different sources of genomic information derived from the training genes. Endeavour data sources consist of gene annotations, gene sequences, expressed sequence tags over multiple conditions, protein-protein interaction data and known transcription factor binding sites. The program works with the genes of human, mouse, rat, fly and worm organisms.
It builds the model of the phenotype using information about the training genes in each of the genomic sources. It ranks the candidate genes according to how well they compare with the built model. Individual rankings in Endeavour are combined by order statistics 42. Explanations about the genes are provided in the table of ranked genes.
ToppGene. Candidate gene prioritization is one of the functions provided by the ToppGene tool. The user submits a set of training genes and a set of test genes. ToppGene first finds the significantly enriched annotations for the training genes in multiple data sources: GO annotations, literature, Interaction, Pathway, human and mouse phenotype data, TF binding sites, Cytobands, Co-expression Atlas, Drugs, microRNA and more. The candidate genes are ranked by the similarity of their functional annotations to the enriched annotations of the training genes. The similarity is computed as a fuzzy-based measure 54 or a Pearson correlation coefficient. The user can examine the genes and the enriched terms of the training set.

GeneWanderer. In GeneWanderer the candidate genes are retrieved from a genomic region given its genomic coordinates. The phenotype is defined either by a disease keyword or by a list of training genes known to be related to the phenotype. If the phenotype is defined by a keyword then the known genes associated with it are retrieved. The tool measures the distance between the candidate genes and the training genes in the protein-protein interaction network. The tool is specific to human diseases.
PolySearch. PolySearch allows queries in the form of: Given X find all Y. X and Y can be diseases, tissues, cell compartments, gene/protein names, SNPs, mutations, drugs and metabolites. If the phenotype is defined by keywords, then PolySearch retrieves the documents matching all keywords in PubMed, OMIM, DrugBank, Swiss-Prot, Human Mutation Database, Genetic Association Database and Human Protein Database. A ranked list of the requested biomedical entities that are associated with the text of the query is returned. The score of an entity is proportional to the number of document matches in the databases. The user can browse the results and examine the matching publications and sentences.
PosMed. Positional PubMed is a semantic engine that ranks biomedical entities by the statistical significance of their associations with the provided keywords. The strength of the associations between biomedical entities and keywords is based on the number of documents they share. The document categories comprise PubMed (PubMed titles, abstracts and MeSH terms), REACTOME (pathway information from REACTOME), protein-protein interaction (protein-protein interactions in human and mouse from IntAct and in Arabidopsis from AtPID), Gene Ontology, Human Disease Ontology, Mammalian Phenotype Ontology, and microarray-based co-expression data for Arabidopsis. Given the keyword defining the phenotype and the type of biomedical entity to score (gene, metabolite or drug), PosMed returns a list of the scored entities linked to the phenotype, sorted according to the strength of the connection between them. PosMed supports human, mouse, rat, Arabidopsis and rice organisms. The user can browse through all documents of the established links.

G2D. In G2D the disease phenotype is defined by the OMIM identifier, which is mapped to the associated MeSH terms of the diseases. The candidate genes are selected from the provided genomic region possibly containing a marker associated with the disease phenotype. G2D establishes a chain of evidence connecting the disease phenotype to the genes by forming links between the terms in MeSH and GO annotations. This inference is illustrated in Figure 1 through the example of exploring candidate genes associated with the cleft lip phenotype. The association between cleft lip and the rs987525 variant from region 8q24.21 has been replicated independently in several different populations 55 but no associated gene was found. G2D suggests the MYC gene as a candidate. The link between this gene and the cleft lip disease phenotype is inferred through the relationship between the terms "Craniofacial abnormalities" and "Homeodomain proteins" and the relationship between the latter term and the GO annotation "Sequence specific DNA binding transcription factor activity" of the MYC gene. The MYC gene is regulated by the CTCF transcription factor 56, which has a binding site at the genomic location of rs987525, leading to a possible hypothesis that this SNP marker might be linked to cleft lip through a regulatory interaction with the MYC gene 57. Another connection, between the BMP4 gene and the cleft lip OMIM phenotype, is inferred through the relationship between the "Cleft lip" term and the term "Bone Morphogenetic Protein 4", which is related to the GO annotation "BMP signalling pathway". The BMP4 gene harbors the rs1957860 marker variant, which is known to be associated with cleft lip 58.

Computing relationships between genes and phenotypes
Gene prioritization algorithms produce lists of the best candidates, which are most strongly related to the phenotype of interest according to the criteria set by the algorithm. Rankings are based on evidence scores of relationships computed by the prioritization algorithm for each candidate gene. For the generation of meaningful hypotheses it is important to know what factors led to the obtained rankings and to the links established between genes and phenotypes. In methods relying on phenotype definitions by training genes a detailed examination of such evidence is difficult. In multipurpose systems such as PolySearch and PosMed the provided evidence lacks specificity. The most comprehensive in this respect is G2D, in which the OMIM phenotype is translated into an adequate MeSH term of the disease. In this study an attempt is made to develop means to support computations linking genes and phenotypes defined by MeSH terms, extending beyond diseases and building upon the algorithmic principles of G2D 43,59.

Methods
It was shown in applications of the Arrowsmith algorithm that biomedical knowledge can be discovered through finding hidden links between concepts in scientific literature. Concepts co-occurring at high frequency in two disparate sets of literature articles indicated meaningful links 60,61. One such link suggested that fish oil can reduce Raynaud's syndrome symptoms, which was later confirmed experimentally 62. The inference leading to this result was "fish oil reduces blood viscosity, platelet aggregations and vascular reactivity which are increased in Raynaud's syndrome" 63. In a similar way, algorithms based on linking concepts or entities in collections of data relate genes to phenotypes by using concept co-occurrences in literature and controlled vocabularies 43,64.
Links between phenotypes and gene GO annotations can be computed through intermediate links with chemicals as shown in G2D 59. It is hypothesized that a phenotype defined by a MeSH term can be meaningfully related to a subset of MeSH D terms denoting molecular entities of drugs and chemicals. Similarly, gene functions encoded by GO annotations can be meaningfully related to molecular entities denoted by MeSH D terms through related chemical processes affecting gene functions. Strengths of relationships can be derived from information in annotations of PubMed articles and the NCBI datasets gene2go and gene2pubmed 65. In Figure 2, the $m(g, d_j)$ and $m(d_j, go_i)$ relationships represent strengths of the connections between terms. The strengths of the connections between the phenotype $g$ and the GO annotation $go_i$ passing through the chemicals $d_j$ are combined into weights. These computed weights express the strength of association between the functional annotation $go_i$ and the phenotype of interest. The table in the bottom panel of Figure 2 illustrates one possible way to order annotated genes by the magnitude of the weights of their association to the phenotype of interest. Principles underlying the algorithm to compute strengths of relationships m(phenotype, chemical) and m(chemical, GO annotation) between phenotypes and functional gene annotations can be founded on fuzzy set theory (FST) 43,66.
Using the mathematical framework of FST, the relationships are defined as fuzzy binary relationships (FBRs) and can take a variety of forms 67. A thorough explanation can be found in 68 on pages 69-84.

Definitions of relationships between MeSH terms and GO annotations
Let us denote phenotype MeSH terms as $g_j$, $j \in (1 \ldots NG)$, in which $j$ refers to a particular MeSH term. Similarly, let us denote MeSH D terms by $d_k$, $k \in (1 \ldots ND)$. A subset of PubMed articles annotated by a specific $g_j$ term is denoted by $G_j$. Similarly, a subset of articles annotated by a particular term $d_k$ is denoted by $D_k$. The fuzzy binary relation $R_{GD}$ between these terms is defined as:
$$R_{GD} = \{((g_j, d_k),\ m(g_j, d_k))\} \tag{1}$$
with membership function
$$m(g_j, d_k) = \frac{|G_j \cap D_k|}{|D_k|} \tag{2}$$
The relation $R_{GD}$ is defined on all pairs of selected annotations in the universe of all articles annotated by those MeSH terms. The membership function in Equation 2 models a degree of inclusion of a narrower concept $d_k$ (chemical) into a broader concept $g_j$ (phenotype), $d_k \subseteq g_j$. The FBR of inclusion defines, in a quantitative way, a semantic relationship between the meanings of the broader and narrower concepts 69,70.
The inclusion relationship between GO annotations and MeSH D terms is defined using a universe of genes instead of articles. Let us denote GO annotations by $go_i$, $i \in (1 \ldots NGO)$, in which $i$ refers to a particular annotation. $NGO$ is the total number of GO annotations of genes in gene2go. Let us denote by $GO_i$ a subset of genes in gene2go annotated by a particular $go_i$. Let us denote by $GD_k$ a subset of genes in gene2pubmed associated with articles annotated by the MeSH D term $d_k$, $k \in (1 \ldots NGD)$, where $NGD$ is the total number of MeSH D terms associated with genes through articles. The fuzzy binary relation $R_{DGO}$ between these terms is defined as:
$$R_{DGO} = \{((d_k, go_i),\ m(d_k, go_i))\} \tag{3}$$
with membership function
$$m(d_k, go_i) = \frac{|GD_k \cap GO_i|}{|GO_i|} \tag{4}$$
The degree of connection between a GO annotation and a MeSH D chemical in Equation 4 is determined by the number of genes sharing these two annotations divided by the number of genes annotated by that GO annotation.
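The two membership functions described above can be illustrated with a minimal Python sketch. The function name `inclusion` and all article and gene identifiers below are hypothetical toy values chosen for illustration; they are not taken from the contributed datasets.

```python
def inclusion(broader: set, narrower: set) -> float:
    """Degree of inclusion of `narrower` in `broader`:
    |broader ∩ narrower| / |narrower| (0.0 if `narrower` is empty)."""
    if not narrower:
        return 0.0
    return len(broader & narrower) / len(narrower)


# Hypothetical PMIDs annotated by a phenotype term g_j and by a MeSH D term d_k.
G_j = {101, 102, 103, 104}
D_k = {103, 104, 105, 106, 107, 108, 109, 110}
m_g_d = inclusion(G_j, D_k)      # Equation 2: |G_j ∩ D_k| / |D_k|

# Hypothetical gene identifiers: genes linked to d_k through articles,
# and genes annotated by the GO annotation go_i.
GD_k = {1, 2, 3, 4, 5}
GO_i = {4, 5, 6, 7}
m_d_go = inclusion(GD_k, GO_i)   # Equation 4: |GD_k ∩ GO_i| / |GO_i|

print(m_g_d, m_d_go)
```

In this toy case two of the eight articles annotated by the chemical are also annotated by the phenotype, and two of the four genes annotated by the GO term are also linked to the chemical.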
A relationship between a GO annotation and a phenotype-defining MeSH term is computed by applying the maximum composition operation $R_{GD} \circ R_{DGO}$ to the fuzzy binary relations defined by Equation 1 and Equation 3; the resulting FBR is given by Equation 5.
A MySQL database and SQL procedures were created in order to experiment with and to support the outlined inference 70. The created datasets are limited to the annotated genes of human, mouse and fly organisms. The MeSH terms (mtree 2012) defining phenotypes are provided for the categories of Anatomy (A), Diseases (C), Drugs and chemicals (D) and Biological processes and phenomena (G). Information in the created datasets is as of September 2013.
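The maximum composition of the two relations can be sketched in Python as follows. This is only an illustration under stated assumptions: the rule for combining the two memberships along a path through a chemical is assumed here to be the product (max-product composition), with `min` (max-min composition) shown as an alternative. The function name `compose` and all terms and values are hypothetical.

```python
def compose(r_gd: dict, r_dgo: dict, combine=lambda a, b: a * b) -> dict:
    """Max composition of two fuzzy relations stored as {(x, y): membership}.

    For every (phenotype, GO) pair, keep the largest combined membership
    over all intermediate chemicals. `combine` is an assumption: product by
    default, pass `min` for max-min composition.
    """
    result = {}
    for (g, d), m1 in r_gd.items():
        for (d2, go), m2 in r_dgo.items():
            if d != d2:
                continue
            w = combine(m1, m2)
            if w > result.get((g, go), 0.0):
                result[(g, go)] = w
    return result


# Hypothetical toy relations (not values from the contributed datasets).
r_gd = {("phenotype", "chem_a"): 0.5, ("phenotype", "chem_b"): 0.2}
r_dgo = {("chem_a", "go_1"): 0.4, ("chem_b", "go_1"): 0.9, ("chem_b", "go_2"): 0.5}

max_product = compose(r_gd, r_dgo)
max_min = compose(r_gd, r_dgo, combine=min)
print(max_product)
print(max_min)
```

The two variants can rank GO annotations differently, so the choice of combination rule matters when the weights are used for prioritization.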

Datasets and procedure to compute links between genes and phenotypes
In this section three contributed datasets are described. These datasets and the presented MySQL procedure support the computation of links between phenotypes and GO annotations outlined in the previous section. Datasets were created by using NCBI E-utilities 71 and custom scripts. The data sources (as of September 2013) used to create these datasets are described in Table 2. MeSH terms of category B are present but are not used to define phenotypes in the computation. The datasets are submitted as tab-delimited tables that can be imported into a MySQL database or used as plain data. In this work the presented data management framework is based on MySQL.

Dataset mesh_terms
• mid is a unique identifier of a row;
• mterm is a MeSH term in which spaces are replaced by underscores (for example Cell_Fusion);
• dterm is a MeSH D term in which spaces are replaced by underscores (for example BMP4_Protein);
• dscore is a float number representing the strength of connection between mterm and dterm computed as in Equation 2;
• nm is the number of PubMed articles annotated by mterm;
• nd is the number of PubMed articles annotated by dterm, corresponding to $|D_{dterm}|$ in Equation 2;
• inters is the number of PubMed articles annotated by both mterm and dterm, corresponding to $|G_{mterm} \cap D_{dterm}|$ in Equation 2;
• unio is the number of PubMed articles annotated by either mterm or dterm or both;
• pmids is a comma separated list of PMID identifiers of the PubMed articles counted in inters;
• dtid is a numerical key identifying dterm in another table, dterm_go.

Dataset dterm_go
• dterm is a MeSH D term in which spaces are replaced by underscores (for example BMP4_Protein);
• goterm is an identifier of a GO annotation in the Gene Ontology (for example GO:0000001 is the identifier for the annotation "mitochondrion inheritance");
• gscore is a float number representing the strength of connection between goterm and dterm computed according to Equation 3 and Equation 4;
• gogenes is the number of genes (of mouse, human and fly) annotated by goterm as recorded in the gene2go dataset in the NCBI ftp repository, corresponding to $|GO_{goterm}|$ in Equation 4;
• genenum is the number of genes (of mouse, human and fly) sharing the goterm and dterm annotations, corresponding to $|GD_{dterm} \cap GO_{goterm}|$ in Equation 4;
• genetot is the number of genes (of mouse, human and fly) associated with articles annotated by dterm as recorded in the gene2pubmed dataset in the NCBI ftp repository;
• genes is a comma separated list of Entrez Gene identifiers of the genes counted in genenum;
• id is a unique identifier of the row;
• dtid is a numerical key identifying dterm in table mesh_terms.

Dataset go_terms
• gokey is a unique row identifier;
• goterm is an identifier of a GO annotation in the Gene Ontology (for example GO:0000001 is the identifier for the annotation "mitochondrion inheritance");
• description is a description of the GO annotation in the Gene Ontology in which spaces are replaced by underscores (for example mitochondrion_inheritance is the description of the GO:0000001 identifier);
• category is an indicator of the category of the GO annotation and can take the value "Process", "Function" or "Component".
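How the three tables relate through the dtid and goterm keys can be sketched with an in-memory SQLite database used as a stand-in for MySQL. This is a sketch under assumptions: the tables carry only a subset of the described columns, the column order and the dtid and key values are hypothetical, the "Hypothetical_Chemical" rows and their scores are invented toy data, and the weight is assumed to be the maximum product of dscore and gscore. The first link reuses the values quoted in the text for GO:0045362.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mesh_terms (mid INTEGER, mterm TEXT, dterm TEXT, dscore REAL, dtid INTEGER);
CREATE TABLE dterm_go   (id INTEGER, dterm TEXT, goterm TEXT, gscore REAL, dtid INTEGER);
CREATE TABLE go_terms   (gokey INTEGER, goterm TEXT, description TEXT, category TEXT);
""")

conn.executemany("INSERT INTO mesh_terms VALUES (?,?,?,?,?)", [
    (1, "Intellectual_Disability", "Interleukin-1_Receptor_Accessory_Protein", 0.0714286, 1),
    (2, "Intellectual_Disability", "Hypothetical_Chemical", 0.02, 2),   # toy row
    (3, "Cell_Fusion", "Hypothetical_Chemical", 0.3, 2),                # toy row
])
conn.executemany("INSERT INTO dterm_go VALUES (?,?,?,?,?)", [
    (1, "Interleukin-1_Receptor_Accessory_Protein", "GO:0045362", 1.0, 1),
    (2, "Hypothetical_Chemical", "GO:0000001", 0.5, 2),                 # toy row
])
conn.executemany("INSERT INTO go_terms VALUES (?,?,?,?)", [
    (1, "GO:0045362", "positive_regulation_of_interleukin-1_biosynthetic_process", "Process"),
    (2, "GO:0000001", "mitochondrion_inheritance", "Process"),
])

# Weight of each GO annotation for one phenotype: the strongest path through
# any chemical (assumed here to be the product of the two scores).
query = """
SELECT g.goterm, t.description, MAX(m.dscore * g.gscore) AS ms
FROM mesh_terms AS m
JOIN dterm_go AS g ON g.dtid = m.dtid
JOIN go_terms AS t ON t.goterm = g.goterm
WHERE m.mterm = ?
GROUP BY g.goterm, t.description
ORDER BY ms DESC
"""
rows = conn.execute(query, ("Intellectual_Disability",)).fetchall()
for goterm, description, ms in rows:
    print(goterm, description, ms)
```

The same join logic carries over to MySQL once the tab-delimited files are imported into tables with matching column names.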

Procedure to compute links between phenotype and GO annotations
Figure 2 and Equation 1, Equation 3 and Equation 5 outline a possible way of establishing links between phenotypes defined by MeSH terms and GO annotations that pass through chemicals. These links can be computed by MySQL statements given that the MySQL database tables were created as shown in Table 3, Table 4 and Table 5. The suggested procedure is presented in Table 6. For example, a command line query for the phenotype "Intellectual Disability" (user X and database Xdb) can be executed in the following way:
$ mysql -u X -p Xdb -e "call mesh_to_go('Intellectual_Disability',0.01,0.1);" > id_out
The first section of the output file id_out has rows of GO annotations. These GO annotations are weighted by the strength of their relationship to the "Intellectual Disability" phenotype. These weighted GO annotations can be used to rank genes as in Figure 2. The second section of the output details the relationships between the phenotype and chemicals and between the chemicals and GO annotations, for example considering the information on GO:0045362:
ms 0.071428596
goterm GO:0045362
description positive_regulation_of_interleukin-1_biosynthetic_process
gscore 1.0000
dterm Interleukin-1_Receptor_Accessory_Protein
dscore 0.0714286
The connection between the "Intellectual Disability" phenotype and the MeSH D term "Interleukin-1 Receptor Accessory Protein" is quantified by dscore, which equals 0.0714286. The strength of the connection between this chemical and the "positive regulation of interleukin-1 biosynthetic process" GO term equals 1. These two values determine the weight ms of this GO term in connection to the "Intellectual Disability" (ID) phenotype. This computational procedure with respect to the ID phenotype was previously explored 73.
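The GO:0045362 record quoted above can be checked with a short arithmetic sketch in Python. The combination rule is treated as an assumption here: both the product and the minimum of dscore and gscore are computed and compared with the reported ms. The dictionary layout is illustrative only.

```python
import math

# Values copied from the GO:0045362 record in the text.
record = {
    "ms": 0.071428596,
    "goterm": "GO:0045362",
    "gscore": 1.0000,
    "dscore": 0.0714286,
}

by_product = record["dscore"] * record["gscore"]   # assumed max-product rule
by_minimum = min(record["dscore"], record["gscore"])  # assumed max-min rule

# Allow for the rounding visible in the printed values.
product_matches = math.isclose(by_product, record["ms"], abs_tol=1e-7)
minimum_matches = math.isclose(by_minimum, record["ms"], abs_tol=1e-7)
print(product_matches, minimum_matches)
```

Because gscore equals 1 in this record, both rules reproduce the reported weight up to rounding, so this single example does not distinguish between them.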

Use case
Exploratory analysis of cancer related genes in sequencing studies
In cancer, genes accumulate a large number of mutations 74, and next generation sequencing screening may produce a vast number of genetic variants and genes. If a gene harboring a variant was not previously reported, then the outlined computation can be used to explore connections of that gene to a specific cancer based on currently available knowledge.
Among the highly mutated genes identified by whole genome and exome sequencing of breast tumors are PIK3CA, TP53, GATA3, CDH1, RB1, MLL3, MAP3K1 and CDKN1, which were previously observed in clinical breast cancer tumors 75. Genes not previously observed in those tumors were TBX3, RUNX1, LDLRAP1, STNM2, MYH9, AGTR2, STMN2, SF3B1. Both sets of genes were explored in relation to "Breast Neoplasms" by ranking the genes of the whole human genome. The genes were ranked by the magnitudes of the weights of their annotations with respect to their relationship to "Breast Neoplasms", as depicted in Figure 2, employing the outlined computational principle. Top genes from those sets appearing within the top 5% of the ranked human genome, or close to this interval, are listed in Table 7.
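The ranking step described above can be sketched in Python. This is a sketch under assumptions: each gene is scored by the largest weight among its GO annotations, which is only one possible way to aggregate annotation weights, and the helper names `rank_genes` and `top_percent` as well as all gene names, GO labels and weights are hypothetical toy values.

```python
def rank_genes(gene_annotations: dict, go_weights: dict) -> list:
    """Order genes by the largest phenotype weight among their GO annotations.

    The max aggregation is an assumption made for this sketch.
    """
    scores = {
        gene: max((go_weights.get(go, 0.0) for go in gos), default=0.0)
        for gene, gos in gene_annotations.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def top_percent(ranked: list, percent: int = 5) -> list:
    """Return the gene names in the top `percent` of the ranking (at least one)."""
    n = max(1, (len(ranked) * percent + 99) // 100)  # integer ceiling
    return [gene for gene, _ in ranked[:n]]


# Hypothetical weights of GO annotations with respect to one phenotype.
go_weights = {"go_a": 0.9, "go_b": 0.4, "go_c": 0.1}

# Hypothetical genome of 40 genes, all annotated with go_c.
gene_annotations = {f"gene_{i:02d}": ["go_c"] for i in range(40)}
gene_annotations["gene_07"].append("go_a")
gene_annotations["gene_21"].append("go_b")

ranked = rank_genes(gene_annotations, go_weights)
top = top_percent(ranked, percent=5)
print(top)
```

With 40 toy genes the top 5% contains two genes, namely the two that carry the most strongly weighted annotations.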
Cancer genes are well characterized and widely studied in literature. The genes not previously reported as carrying clinically important mutations in studies of breast cancer in 75 had stronger links with the cancer phenotype in question through their GO annotations. The genes from that study, RB1, GATA3, TP53 and CDH1, appeared in high ranking positions. The current exploration identifies "Tamoxifen" as being strongly related to the breast cancer phenotype. Such a link is logical because this chemical is used to treat hormone-sensitive tumors. The applied computational procedure, through relationships between phenotypes and chemicals, helps to explore the contexts in which biological processes of interest take place. Unexpected links may be discovered that may help to formulate novel biological hypotheses.

Discussion
As of now, biologists still find it challenging to interpret large lists of poorly characterized genes with algorithms, which are somewhat limited in terms of how they define phenotypes. These lists may originate from a variety of sources, including microarray experiments, ChIP-Seq experiments identifying transcription factors' target genes, and scientific literature. The algorithm described here is useful in formulating biological hypotheses in situations in which little is known about the phenotype and the genes in question. The algorithm begins by linking lists of gene GO annotations to phenotypes (non-disease and disease) described by meaningful keywords through the MeSH and PubMed databases. The algorithm then deduces which of the links between the genes and phenotypes are strongest and presents the results in an organized manner. This is different from most existing algorithms in terms of the methods used to define phenotypes of interest and infer their relationships with genes. To better understand how the outlined algorithm is unique, the existing algorithms are grouped into three categories that overlap to some extent and are examined in turn.

How algorithms define phenotypes and infer gene-phenotype relationships
The first category of existing algorithms only uses known phenotype-related genes, or training genes, while the second focuses solely on human disease phenotypes. The third category uses general keywords from literature to define phenotypes. All must deduce how genes and phenotypes are related by mining selected information sources, retrieving and integrating data from them, but there are some differences between them. Each will be discussed in turn.
Algorithms based on the use of training genes known to be related to the phenotype of interest, as in the Endeavour 39 and ToppGene 45 tools, make prioritizations on the basis of pattern classification 76 .
These algorithms extract phenotype-defining information about the training genes from various data sources and then build a model of the phenotype from this information 42. In other words, the model represents gene features that are most characteristic of that specific phenotype. The candidate genes are ranked by how similar their features are to the features of the model. For example, Endeavour relies on genomic data fusion from multiple information sources 41. This tool may be very useful if the properties of the training genes clearly define the phenotype properties of interest in the organisms being investigated. Knowing these properties, one can characterize candidate genes by comparing them to the training genes. Genes that have similar characteristics to the training genes may also play an important role in previously unknown phenotype expressions. This principle of discovery is known as "guilt by association" 77. Although very useful in detecting similarities between candidates and training genes, the integration of data from multiple sources has limitations. The main limitation is in existing schemes of combining the information from different sources to rank the candidates 41. First, the prioritization algorithms using training genes generally differ with respect to the data sources they use 23. Different information sources of training genes lead to different models and similarity metrics. Second, some data sources do not have complete information on some genes, so if the phenotype in question has not been sufficiently studied and there are no genes known to be associated with it, then the training genes approach is not effective. Third, the training genes might represent a heterogeneous group, biasing the phenotype definition in some way. For example, the data fusion scheme relies on the independence of information sources about gene properties, but in practice they are not entirely independent. Protein-protein interaction databases, gene interaction databases and the Gene Ontology refer to scientific publications as supporting evidence for the information they store, and might even be derived from the literature. While it should do so, the scheme does not always account for these possible interactions and overlaps between sources.
Many tools specific to human diseases use phenotype definitions from databases 47,59 .Because human disease phenotypes have been extensively studied and are well represented by OMIM 78 , and because they contain structured information suited to uncovering meaningful links between human diseases and genes, it is relatively easy to associate genes with said phenotypes.However, phenotypes other than diseases and phenotypes in Mammalian Phenotype Ontology 37 are not yet represented by well-structured and information-rich resources 33,34 .
The third category of algorithms, which uses general keywords from literature to define phenotypes, is exemplified by tools such as PosMed, PolySearch, GeneProspector and CANDID 48,[51][52][53]. These tools rely on finding matching documents in MEDLINE or locally created databases, and then associating genes with the matching documents. General purpose discovery-oriented systems such as iHOP 79, Anni2.0 80, Arrowsmith 60,61 and PosMed 52 use conceptual networks. Users can browse through the network and create textual profiles describing genes, proteins, or other biomedical concepts. However, once again, there will be genes and processes that are not well represented in literature, so that little information about them can be retrieved.
Thus, while the obvious advantage offered by specialized gene information databases is that specific information can be extracted very quickly, complementing the literature with more information sources for gene prioritization is advantageous in allowing potential use of algorithms that offer novel interpretations of existing information. G2D 43,59 is the only existing method which provides underlying ideas for the algorithmic approach outlined here, which prioritizes genes with respect to human disease phenotypes. However, the scope of the application of the developed algorithm to link genes with phenotypes 70 outlined here is distinct from G2D by contributing the following:
• The outlined algorithm establishes meaningful links between genes and phenotypes, and enables prioritization, beyond human disease phenotypes by using concepts of MeSH vocabulary from the categories A, D, and G.
• The proposed algorithm can be applied beyond human organisms as the annotated genes of the entire genome for human (Homo sapiens), mouse (Mus musculus) and fly (Drosophila melanogaster) are used. In contrast, G2D focuses on human genes.
• The data in the gene2go and gene2pubmed NCBI databases 65 are used to link GO annotations to MeSH terms of Drugs and Chemicals describing the molecular entities.In contrast, G2D works with the RefSeq database for this purpose 43 .
• The outlined algorithm, similarly to G2D, utilizes fuzzy binary relationships between concepts, based on mathematical operations of fuzzy set theory 67,68, to infer gene-phenotype links. G2D uses a similarity relationship 43 while this algorithm uses an inclusion relationship 70.
The outlined algorithm is an attempt to remedy some of the challenges presented by information shortages and by the way the existing algorithms described above define phenotypes and determine relationships. It has important advantages over G2D and over the other gene prioritization algorithms reviewed extensively in the Introduction section.
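The difference between a similarity relationship and an inclusion relationship can be illustrated with article counts for two terms. The sketch below uses textbook forms (a Jaccard-style ratio for similarity and an intersection-over-one-set ratio for inclusion); these forms, the function names and the toy counts are assumptions made for illustration and are not necessarily the exact definitions used by G2D or by this algorithm.

```python
def similarity(n_a, n_b, n_both):
    """Symmetric overlap: shared articles divided by articles carrying either term."""
    n_either = n_a + n_b - n_both
    return n_both / n_either if n_either else 0.0


def inclusion(n_a, n_both):
    """Asymmetric overlap: fraction of the articles for term A that also carry term B."""
    return n_both / n_a if n_a else 0.0


# Toy counts (hypothetical): a rarely annotated phenotype term and a
# frequently annotated chemical term that share 150 articles.
n_phenotype, n_chemical, n_shared = 200, 5000, 150

print("similarity:", similarity(n_phenotype, n_chemical, n_shared))
print("inclusion of phenotype in chemical:", inclusion(n_phenotype, n_shared))
print("inclusion of chemical in phenotype:", inclusion(n_chemical, n_shared))
```

With these toy counts the symmetric measure is small because the chemical term is far more frequent, while the inclusion of the phenotype term in the chemical term is large. This asymmetry is the property that distinguishes the two kinds of relationship.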

Conclusions
The approach to linking genes and phenotypes outlined in this work is one of several possible approaches. The contributed datasets open the possibility of experimentation and of developing other applications. These datasets, although in need of updating, comprise co-occurrences of selected categories of MeSH terms in PubMed and co-occurrences of MeSH D terms with GO annotations created from NCBI Gene datasets. The availability of such offline data saves time for a researcher who wants to explore and apply text and data mining algorithms to analyze relationships between concepts.
Existing tools provide limited explanations of the reasons for a phenotype-gene association. Using the outlined approach, the evidence supporting the strongest links obtained can be easily examined. As a result of inference, the MeSH D terms that are most strongly related both to the candidate genes, through their GO annotations, and to the phenotype are identified. This is useful because it reveals the physical background domains related to the candidate genes, gleaned from associated articles without reading their full text. The availability of this background information opens up the possibility of identifying and examining unique aspects of the functions of the studied genes.
However, a single information source cannot account for all aspects of gene relations to phenotypes, even if the MeSH vocabulary contains information about the processes, phenomena and phenotypes studied in the literature. And while functional gene annotations are also associated with scientific publications, there will be genes and processes that are not well represented in the literature, as stated earlier. In this situation, inferring links between genes and phenotypes might be more effective using other information sources, as genes can also be characterized by their interactions with other molecular entities, by their sequences and by the information about the protein domains of their products. These gene properties can be retrieved computationally from other specialized databases.

Open Peer Review
Current Referee Status:
The article addresses one important issue in the field but misses the following important aspects that should be discussed:
1) Most of the available annotation databases, such as GO and MeSH, are biased toward specific terms. Thus, the accuracy of the method could be biased toward specific concepts and terms. The author should analyze the distribution of the m scores and associate the statistical significance calculated with respect to a background distribution.
2) For the definition of phenotypes the Human Phenotype Ontology (HPO) database is the reference. The author should cite this database in their paper and use it as a benchmark set for the predictions.
3) A comparison with other methods should be provided.

Minor:
The description of similar methods previously developed is too long.
I have read this submission.I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
Competing Interests: No competing interests were disclosed.
The process of ranking individual genes by their relationships as depicted in Figure 2 and in the Use Case (page 9) was very confusing and it was not apparent how the process of associating functions with phenotype could be mapped to genes of interest to the researcher. Greater care needs to be taken to describe this process (which is a somewhat complicated one) since this seems to be the main application of the method described.
The last five sentences in the Abstract provide detail that is not necessary here but should be included in the main text instead.
It would be very helpful to have a comparison of the results given by the method described and another method (likely one of those described in the paper) in terms of functions identified for the use case and ranking of genes given. This would not be a performance evaluation (since it is difficult to tell what the 'right' answer would be in this case) but would provide a nice comparison with previous methods.
Minor concerns:
The word "cancer" is misspelled in the title for the Use Case section.
In the use case it is unclear where the list of "Genes not previously observed in these tumors" comes from.I would assume that it was genes that were associated with cancer in the referenced work, but hadn't been previously associated?This needs to be made clear.
I have read this submission.I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
Competing Interests: No competing interests were disclosed.

Figure 2 outlines the idea of the algorithm in which a phenotype and GO annotations are linked through chemicals. Let us denote MeSH D terms pertaining to chemicals by $d_j$, $j = 1, \ldots, N$. A relationship m(phenotype, chemical) between the phenotype defined by MeSH term $g$ and the chemical defined by MeSH D term $d_j$ is denoted by $m(g, d_j)$. Let us denote a relationship m(chemical, GO annotation) between the MeSH D term $d_j$ of chemical and GO annotation $go_i$, $i = 1, \ldots, M$, by $m(d_j, go_i)$. Values of the $m(g, d_j)$

Figure 1. Connections computed by G2D in prioritization of genes with respect to the cleft lip phenotype.

Figure 2. Computation of relationships and weighting of GO annotations.
Pranckeviciene E: Dataset 1 in "Procedure and datasets to compute links between genes and phenotypes defined by MeSH keywords". F1000Research. 2014. Data Source
83. Pranckeviciene E: Dataset 2 in "Procedure and datasets to compute links between genes and phenotypes defined by MeSH keywords". F1000Research. 2014. Data Source
84. Pranckeviciene E: Dataset 3 in "Procedure and datasets to compute links between genes and phenotypes defined by MeSH keywords". F1000Research.
of Informatics, Department of Pathology, University of Bologna, Birmingham, AL, Italy
Department of Computational Biology, Pacific Northwest National Laboratory, Richland, WA, 99352, USA
Major concerns:

Dataset 1. Table mesh_terms
http://dx.doi.org/10.5256/f1000research.6140.d43167
Table mesh_terms stores relationship information for pairs of MeSH terms. Each row corresponds to one pair of MeSH terms: a term of category A, C, D, G defining a phenotype (column mterm) and a term of category D defining a chemical (column dterm). Attributes of this relationship consist of the number of articles in PubMed annotated by each of these MeSH terms separately (columns nm and nd contain the number of articles annotated by mterm and dterm respectively), the number of articles annotated by both terms (column inters), the number of articles annotated by either term (column unio), the computed strength of the relationship between the terms in the pair (column dscore) and a comma-separated list of PMID identifiers of the articles annotated by both terms. Column mid is an identifier of the row and column dtid is a key linking into the dterm_go table. This table has 9,725,157 rows and 10 columns that are separated by tabs. Size of a plain table is 1.97GB. Compressed table takes 801MB. Information in this table is as of September 2013.

Dataset 2. Table dterm_go
http://dx.doi.org/10.5256/f1000research.6140.d43168
annotations dterm and goterm (column genenum), list of comma-separated Entrez identifiers of genes that make genenum (genes sharing both dterm and goterm annotations). Column id is a unique row identifier. Column dtid is a key linking to the table mesh_terms. This table has 14,225,540 rows and 9 columns that are separated by tabs. Size of a plain table is 1.31GB. Compressed table takes 379MB. Information in this table is as of September 2013.

Dataset 3. Table go_terms
http://dx.doi.org/10.5256/f1000research.6140.d43176
Table go_terms stores descriptions of the gene ontology annotations that are in table dterm_go. Each row contains a gene ontology annotation as a 10-character identifier in GO (column goterm), a textual description of this term (column description) and its category (column category), which can have one of the possible values: 'Process', 'Function' or 'Component'. Column gokey is a unique row identifier. This table has 20,266 rows and 4 columns that are separated by tabs. Size of a plain table is 1.23MB. Compressed table takes 270KB.

Dataset mesh_terms
The mesh_terms table stores associations between MeSH terms defining phenotype and MeSH D terms defining chemicals. Statements to create this table in MySQL database are presented in Table 3. Each row stores a pair of MeSH term (category A, C, D and G used to define phenotypes) and a MeSH D term defining

Table 2. Data sources (content as of September 2013) and tables for computing gene phenotype relationships.
Data sources and their purpose:
To have the full list of the MeSH terms with the corresponding category identifier. MeSH term categories A, B, C, D and G were used to retrieve the corresponding PubMed identifiers of the articles having co-occurring MeSH terms. Article annotations were used to create a table of PMID counts for pairs of co-occurring MeSH annotations in articles.
gene2go: To collect annotated genes of the human, mouse and fly together with their GO annotations.
gene2pubmed: To retrieve MeSH D terms in the articles associated with the genes and to link GO annotations assigned to these genes with the retrieved MeSH D terms.
homologene: To create table of homologous genes of the three organisms and their GO annotations.
Tables and their content:
mesh_terms: 9,725,157 rows store data pertaining to pairs of MeSH term of category A, B, C, D, G and MeSH D terms.
dterm_go: 14,225,540 rows store data pertaining to pairs of MeSH D terms and GO annotations.
go_terms: 20,266 rows store GO annotations of genes of human, mouse and fly organism.

Table 3. MySQL statements to create mesh_terms table.
chemical and their relationship as defined by Equation 1 and Equation 2 with supporting information. Data in this table are based on annotated PubMed content as of September 2013. Meaning of columns in mesh_terms table is as follows:
LOAD DATA LOCAL INFILE 'mesh_terms' INTO TABLE mesh_terms COLUMNS TERMINATED BY '\t' IGNORE 1 LINES;
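Because the mesh_terms file is plain tab-delimited text with one header line, it can also be read outside MySQL. The following is a minimal sketch that streams rows with Python's standard csv module and checks that the union count equals nm + nd - inters, which follows from how the columns are described. The header names are assumed to match the column names given in the text, the name pmids for the PMID list column is hypothetical, and the two sample rows are invented toy values, not records from the dataset.

```python
import csv
import io

# Toy stand-in for the real file (assumed header names; "pmids" is hypothetical).
SAMPLE = (
    "mid\tmterm\tdterm\tnm\tnd\tinters\tunio\tdscore\tpmids\tdtid\n"
    "1\tPhenotype_X\tChemical_A\t200\t5000\t150\t5050\t0.5\t111,222\t10\n"
    "2\tPhenotype_X\tChemical_B\t200\t300\t20\t480\t0.1\t333\t11\n"
)

COUNT_COLUMNS = ("nm", "nd", "inters", "unio")


def read_pairs(handle):
    """Yield one dict per row, converting count and score columns to numbers."""
    for row in csv.DictReader(handle, delimiter="\t"):
        for name in COUNT_COLUMNS:
            row[name] = int(row[name])
        row["dscore"] = float(row["dscore"])
        yield row


def union_is_consistent(row):
    """Articles with either term = articles with each term minus the shared ones."""
    return row["unio"] == row["nm"] + row["nd"] - row["inters"]


rows = list(read_pairs(io.StringIO(SAMPLE)))
for row in rows:
    print(row["mterm"], row["dterm"], row["dscore"], union_is_consistent(row))
```

For the real file, which is close to 2GB uncompressed, the same generator can be given an open file handle so that rows are processed one at a time instead of being collected into a list.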

Table 5. MySQL statements to create go_terms table.
Three input parameters, queryterm, dfrac and gofrac, can be provided to the procedure mesh_to_go.sql. This MySQL procedure has a computation part and an output part. The parameter queryterm provides a MeSH term that defines the phenotype. In the current implementation, 16914 MeSH terms from the 2012 MeSH edition 72 can be queried. These terms form pairs with 5908 MeSH D terms of chemicals. Textual fields of all MeSH terms and GO annotations have underscores instead of spaces between words, and the underscores should be used in formulating queries. Parameters dfrac and gofrac set thresholds on the corresponding dscore and gscore values. They can be used to filter terms in the computation based on the strengths of the relationships between the phenotype and chemical and between the chemical and GO annotation (disallowing weaker relationships). The value of dfrac can vary in the range [0.0000041, 1], which is the range of dscore values. The value of gofrac can vary in the interval [0.0001, 1]. In the computation presented in Table 6, the creation of the t2 and t3 tables corresponds to performing the maximum composition operation defined by Equation 5. The table t2 contains all relationships between the phenotype in queryterm and GO annotations passing through all chemicals that have a connection to the queryterm phenotype. A smaller set of GO annotations, each with only its maximum weight in relationship to the phenotype, is selected into table t3. Statements in the output part create a plain sectioned text file. Weighted GO annotations are listed in the first section. The second section, identified by "list_of_all_links_go_dterms", lists all connections in the table t2.
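The two-step computation described above (all links through chemicals, then the maximum per GO annotation) can be sketched in a few lines. This is an illustration under stated assumptions, not the MySQL procedure itself: the link weight is assumed to be the product of dscore and gscore (a max-product composition; max-min is another common choice, and Equation 5 is the authoritative definition), thresholds are assumed to keep values greater than or equal to dfrac and gofrac, and all term names and scores are invented placeholders.

```python
def link_phenotype_to_go(pheno_chem, chem_go, dfrac=0.0, gofrac=0.0):
    """Compose phenotype-chemical and chemical-GO relationships.

    pheno_chem: {dterm: dscore} for the queried phenotype.
    chem_go:    {dterm: {goterm: gscore}}.
    Returns (all_links, best): all_links plays the role of table t2,
    best plays the role of table t3.
    """
    all_links = []
    for dterm, dscore in pheno_chem.items():
        if dscore < dfrac:  # drop weak phenotype-chemical relationships
            continue
        for goterm, gscore in chem_go.get(dterm, {}).items():
            if gscore < gofrac:  # drop weak chemical-GO relationships
                continue
            # Assumed weight: product of the two strengths.
            all_links.append((goterm, dterm, dscore * gscore))

    best = {}
    for goterm, dterm, weight in all_links:
        if goterm not in best or weight > best[goterm][1]:
            best[goterm] = (dterm, weight)  # keep the strongest path only
    return all_links, best


# Placeholder data (hypothetical terms and scores).
PHENO_CHEM = {"Chemical_A": 0.5, "Chemical_B": 0.2, "Chemical_C": 0.01}
CHEM_GO = {
    "Chemical_A": {"GO:9999991": 0.4, "GO:9999992": 0.1},
    "Chemical_B": {"GO:9999991": 0.9},
    "Chemical_C": {"GO:9999993": 1.0},
}

all_links, best = link_phenotype_to_go(PHENO_CHEM, CHEM_GO, dfrac=0.05)
for goterm, (dterm, weight) in sorted(best.items(), key=lambda kv: -kv[1][1]):
    print(goterm, dterm, round(weight, 4))
```

In this toy run the dfrac threshold removes Chemical_C, so the GO annotation reachable only through it does not appear in the result. The annotation reachable through two chemicals keeps the stronger of its two paths, and the intermediate chemical stored alongside the weight is what allows the supporting evidence for each link to be inspected.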

Table 6. MySQL procedure to compute links between phenotype and GO annotations.
SELECT a.ms, a.goterm, b.description, a.gscore, a.dterm, a.dscore
FROM t2 AS a, go_terms AS b
WHERE a.goterm = b.goterm
ORDER BY ms DESC;
end //
delimiter ;