Consensus and conflict cards for metabolic pathway databases

Background The metabolic network of H. sapiens and many other organisms is described in multiple pathway databases. The level of agreement between these descriptions, however, has proven to be low. We can use these different descriptions to our advantage by identifying conflicting information and combining their knowledge into a single, more accurate, and more complete description. This task is, however, far from trivial. Results We introduce the concept of Consensus and Conflict Cards (C2Cards) to provide concise overviews of what the databases do or do not agree on. Each card is centered at a single gene, EC number or reaction. These three complementary perspectives make it possible to distinguish disagreements on the underlying biology of a metabolic process from differences that can be explained by different decisions on how and in what detail to represent knowledge. As a proof-of-concept, we implemented C2Cards Human as a web application (http://www.molgenis.org/c2cards), covering five human pathway databases. Conclusions C2Cards can contribute to ongoing reconciliation efforts by simplifying the identification of consensus and conflicts between pathway databases and lowering the threshold for experts to contribute. Several case studies illustrate the potential of the C2Cards in identifying disagreements on the underlying biology of a metabolic process. The overviews may also point out controversial biological knowledge that should be the subject of further research. Finally, the examples provided emphasize the importance of manual curation and the need for broad community involvement.


Introduction
Metabolic pathway databases have proven very valuable for a wide range of applications, varying from the analysis of high-throughput data to in silico phenotype prediction. Over the past decade, the number of pathway databases has grown markedly, providing extensive descriptions of the metabolic network for an increasing number of organisms (Karp and Caspi, 2011; Oberhardt et al, 2009). The metabolic networks of several key organisms, for example, S. cerevisiae and H. sapiens, are even described in multiple databases. A comparison of two yeast networks showed, however, that the two agreed on only 36% of their reactions (Herrgård et al, 2008). Similarly, five pathway databases describing the human metabolic network agreed on only 3% of the 6968 reactions they jointly contain (Stobbe et al, 2011). Given that these databases aim to represent the metabolic capabilities of the same organism, the level of agreement is much lower than one might expect and hope for. There are several explanations for the observed lack of consensus. These include the different ways in which the networks have been built, their manner of curation, and different interpretations of the literature (Mo and Palsson, 2009). The comparison of Stobbe et al (2011) also revealed large differences in the breadth and depth of coverage of the five human metabolic networks.
The advantage of having several descriptions of the metabolic network for the same organism is that they offer different views on the same biological system and thus can reveal controversial biological knowledge. In addition, the databases each have a particular focus and their curators have specific fields of expertise. Therefore, each database may provide complementary pieces of the puzzle of the complete metabolic network. These observations have motivated ongoing efforts to consolidate the different networks for the same organism and to build consensus metabolic networks using a largely manual approach (Herrgård et al, 2008; Thiele and Palsson, 2010a; Thiele et al, 2011).
Combining all the knowledge on the metabolic network contained in the various pathway databases and identifying conflicting information is, however, far from trivial. Retrieving all required information from multiple databases is in itself already a cumbersome task. Firstly, identifying instances where pathway databases do not agree on the underlying biology of a metabolic process is complicated by the different decisions made by each of the databases on how to represent knowledge (Stobbe et al, 2011; Wittig and De Beuckelaer, 2001). For example, a particular difference may be simply explained by the different levels of granularity with which metabolic processes are described by each database, instead of a fundamentally different biological insight. Secondly, it remains a challenge to determine whether databases refer to the same gene or the same metabolite. Thirdly, the definition of a pathway also differs per database, which makes it nearly impossible to compare the networks on a smaller scale, i.e., per pathway. Fourthly, the larger the number of pathway databases considered, the more difficult it is to identify the consensus and the conflicts. Recently, algorithms have been proposed to semi-automatically merge two descriptions of the metabolic network of the same organism (Chindelevitch et al, 2012; Radrich et al, 2010). These approaches mainly address the challenge of matching metabolites, partly via interactions with the user. The core of their resulting merged description consists of reactions that can be found in both networks. Integrating more than two descriptions will, however, significantly reduce the size of the core and limit its utility (Stobbe et al, 2011). The merged description also contains reactions that could not be (exactly) matched and are therefore unique to one of the descriptions. Such an approach will, however, neither resolve the conflicting information between databases nor filter out erroneous information.
Furthermore, the semi-automatic approaches do not explicitly address all issues mentioned above. For example, conflicts due to differences in granularity are not taken into account. While semi-automatic approaches generate a useful scaffold for a consensus network, the resulting description still requires extensive manual curation. Altogether, the issues described above make the construction of a single, more accurate, and more complete network based on the pathway databases available a laborious and largely manual process (Thiele and Palsson, 2010a). Moreover, it is an ongoing process, as new knowledge continues to become available both in the scientific literature and in pathway databases.
To more easily visualize the opinion of multiple pathway databases, we introduce the concept of Consensus and Conflict Cards (C2Cards). C2Cards combine the knowledge from multiple pathway databases for a specific target organism. A C2Card can be centered at a single gene, Enzyme Commission (EC) number or reaction of interest and gives a concise overview of what the databases do or do not agree on with respect to the entity the C2Card is centered at. These three perspectives offer complementary views on the knowledge contained in the pathway databases. Importantly, the perspectives provide ways to identify differences that may be explained by a different decision on how and in how much detail to represent knowledge. C2Cards can be used to assist reconciliation efforts and make users of pathway databases more aware of the exact differences that currently exist between databases.
As a proof-of-concept, we implemented C2Cards Human (www.c2cards.nl), which combines the knowledge of the following five frequently used human pathway databases: the Biochemically, Genetically and Genomically structured (BiGG) knowledgebase (Schellenberger et al, 2010) (H. sapiens Recon 1; Duarte et al, 2007), the Edinburgh Human Metabolic Network (EHMN) (Hao et al, 2010), HumanCyc (Romero et al, 2004), and the metabolic subsets of the Kyoto Encyclopedia of Genes and Genomes database (KEGG) (Kanehisa et al, 2012) and Reactome (Croft et al, 2011). Below, we first give an overview of the various features of the C2Cards, the combined strength of the three perspectives, and how C2Cards can aid in the curation of gene and metabolite identifiers. Next, we describe several case studies illustrating the potential of the C2Cards in identifying conflicts between pathway databases. Finally, we discuss the next steps to be taken in curating metabolic networks.

Results
Each C2Card provides an overview of the knowledge of multiple pathway databases from the perspective of a specific gene, EC number or reaction of interest. A C2Card answers the basic question of which databases contain the entity of interest. Importantly, each card provides a concise overview of what the databases do and do not agree on with respect to the entity of interest. The core component of a C2Card is a table in which each row contains the following basic elements: a reaction and the EC number(s), gene(s) and pathway linked to it in one of the pathway databases (Figure 1). Any of these elements may be missing, except for the entity on which the C2Card is centered. By focusing on these basic elements, the overviews remain compact. For additional information provided by the pathway databases, e.g., pathway visualization and literature references, a direct link is provided to the original entry of the reaction in the pathway database. The second core component of a C2Card is that each card explicitly indicates the similarity of the reactions displayed on it. Similarity is indicated either between all pairs of reactions (gene and EC number perspective; Figure 1) or with respect to the reaction of interest (reaction perspective; Figure 1). Here, reaction similarity is defined as the percentage of metabolites found in both reactions (see Materials and Methods). The strengths of each of the three perspectives are discussed in more detail below.
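The similarity measure described above can be sketched in a few lines of Python. This is a minimal sketch and not the C2Cards implementation: it assumes that the percentage of metabolites found in both reactions is computed as the number of shared metabolites divided by the number of distinct metabolites in the two reactions together, and that metabolites have already been matched to common names. The function name `reaction_similarity` is hypothetical; the exact definition is the one given in Materials and Methods.

```python
def reaction_similarity(reaction_a, reaction_b):
    """Return the percentage of metabolites shared by two reactions.

    Assumption: similarity = shared metabolites / all distinct metabolites.
    Each reaction is given as a collection of already-matched metabolite names.
    """
    a, b = set(reaction_a), set(reaction_b)
    union = a | b
    if not union:
        return 0.0
    return 100.0 * len(a & b) / len(union)


# The two CTP synthase reactions discussed in case study I, with
# 'orthophosphate' and 'phosphate' assumed to be matched to one name.
glutamine_dependent = {"L-glutamine", "ATP", "UTP", "H2O",
                       "L-glutamate", "ADP", "CTP", "phosphate"}
ammonium_dependent = {"ammonium", "ATP", "UTP",
                      "ADP", "CTP", "phosphate", "H+"}

score = reaction_similarity(glutamine_dependent, ammonium_dependent)
print(f"{score:.0f}% of the metabolites are shared")
```

Under this assumed definition a pair of reactions that differ only in, say, ammonia versus ammonium would score high but below 100%, which is the kind of near match the colored cells of a C2Card are meant to expose.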

Three complementary perspectives
C2Cards offer three complementary perspectives (gene, EC number, reaction) on the knowledge contained in the pathway databases. Each perspective can answer various types of questions, accommodating the different interests one may have. Importantly, the three perspectives can be used to identify and complement information missing in one (or more) of the pathway databases using the knowledge from the other pathway databases.
Figure 1 - Examples of two C2Cards. C2Card centered at the CTPS gene (top) and the C2Card retrieved by clicking on the reaction of Reactome in the C2Card centered at the CTPS gene (bottom). Each C2Card consists of a table in which each row contains the following basic elements: a reaction and the EC number(s), gene(s) and pathway linked to it in one of the pathway databases. One can switch perspective by clicking on any of the elements in the table. For additional information provided by the pathway databases, e.g., pathway visualizations and literature references, a direct link is provided to the original entry of the reaction in the pathway database. The second core ingredient of a C2Card is that each card explicitly shows the similarity of the reactions displayed on it. The percentage of overlap between reactions is indicated and relevant cells are colored according to the degree of overlap. Information on the IDs assigned to the metabolites and genes by a pathway database is shown by clicking on the i icon. For EC numbers the reaction and name linked to it by NC-IUBMB are shown.

Gene perspective
The 'gene perspective' shows for each of the pathway databases, which metabolic functions the product of a gene has, as indicated by the reaction(s) and EC number(s) linked to it. This perspective may also answer the question whether other genes, either encoding isozymes or components of the same complex, are linked to the same reaction.

EC number perspective
The 'EC number perspective' shows on which elements linked to the EC number the pathway databases (dis)agree for a specific type of conversion. It may also reveal possible alternative substrates, which is one of the sources of conflict between metabolic pathway databases (Stobbe et al, 2011). The C2Card centered at the EC number 1.1.1.35 (3-hydroxyacyl-CoA dehydrogenase) provides an example of this scenario (Supplementary File S1). The EC number perspective can also be used to answer the question of which genes encode an enzyme with the specified enzymatic function, according to each database.

Reaction perspective
The 'reaction perspective' provides a compact overview of which gene(s) and EC number(s) are linked to a reaction of interest in each pathway database. This perspective can assist in resolving a commonly occurring gap in reconstructions of the metabolic network, namely cases in which the gene product catalyzing a known metabolic reaction is missing (Orth and Palsson, 2010). The reaction perspective (and also the EC number perspective) can be used to find possible candidates for a missing gene in a particular database or reveal that the gene is missing in all pathway databases.
By clicking on any of the entities shown in a C2Card one can easily switch perspective. Furthermore, each C2Card is opened in a new window to enable a simultaneous view of the C2Cards of a linked triple of a reaction, EC number, and gene from different viewpoints. Using all three perspectives is essential to get a complete picture of what the databases do or do not agree on. The EC number perspective can, for example, neither fully replace the gene perspective nor the reaction perspective, as illustrated by the example in Table 1. An EC number does not uniquely identify a reaction or an enzyme. As the example shows, the pathway databases linked different EC numbers to the same reaction. Furthermore, in this case the databases either do not agree on the substrate specificity of the gene product, or curators assigned the EC number based on the reaction instead of the functionality of the gene product (Table 2). The information of NC-IUBMB is available in a C2Card for each EC number that is part of the overview (see Figure 1). Finally, in the C2Cards application one can also cast a wider net when querying for an EC number by allowing a mismatch on the fourth number of an EC number. In contrast to the first three numbers, the last number does not indicate a specific subclass of enzymes and only serves to distinguish enzymes with different substrate specificities.
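The wider EC number query mentioned above can be illustrated as follows. This is a minimal sketch under the assumption that a relaxed query simply compares the first three numbers and ignores the fourth; the function name `ec_matches` and its parameter are hypothetical and do not describe the interface of the C2Cards application.

```python
def ec_matches(query, candidate, allow_fourth_mismatch=False):
    """Compare two fully specified EC numbers such as '6.2.1.4'.

    With allow_fourth_mismatch=True only the first three numbers
    (class, subclass, sub-subclass) have to be identical.
    """
    q, c = query.split("."), candidate.split(".")
    if len(q) != 4 or len(c) != 4:
        raise ValueError("expected four-part EC numbers such as 6.2.1.4")
    if allow_fourth_mismatch:
        return q[:3] == c[:3]
    return q == c


# Strict query: only the exact EC number is returned.
strict = ec_matches("6.2.1.4", "6.2.1.5")
# Relaxed query: enzymes of the same sub-subclass are returned as well.
relaxed = ec_matches("6.2.1.4", "6.2.1.5", allow_fourth_mismatch=True)
print(strict, relaxed)
```

Comparing the split parts rather than string prefixes avoids treating, for example, 1.1.1.3 and 1.1.1.35 as related beyond their shared first three numbers.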

Dealing with conceptual differences
Combining different perspectives also offers a way to side-step differences that do not reflect a true disagreement on the underlying biology. For example, the detail with which a metabolite or a conversion is described varies between and within databases. One database may describe the specific form of a metabolite, e.g., α-D-glucose or β-D-glucose, while in another database the more general form is used, D-glucose in this case. A possible motivation for database curators to choose the general version is that in an experiment the distinction between two isomers may be difficult to make. This type of difference is unlikely to affect the gene or EC number that is assigned to the reaction and can, therefore, be revealed using the gene or EC number perspective.
Another example of a difference in level of detail is a biochemical conversion that is described in a single reaction using generic metabolites, like 'a long chain alcohol', versus multiple reactions with more specific examples of metabolites, i.e., 'hexadecanol' and 'octadecanol' instead of 'a long chain alcohol'. The gene or EC number perspective can be used to uncover such a difference. The number of steps used to describe a biochemical process may also differ and will prevent a perfect match on reaction level as well. The latter type of difference is not necessarily explained by different decisions made on how to represent a biochemical process, but could also be due to a disagreement on the underlying biology. This commonly occurring difference in level of granularity can be revealed via the gene or EC number perspective as well (Table 3).

Gene and metabolite identity
Next to exploring the genes, EC numbers, and reactions contained in the pathway databases, as described above, C2Cards can also be of direct use in curating the identifiers (IDs) assigned to the genes and metabolites by the pathway databases. Identifiers are essential for the unambiguous identification of genes and metabolites across multiple resources and enable linking experimental data to the metabolic network. For each gene and metabolite a C2Card provides the identifiers assigned to them by the pathway databases (see Figure 1, and Materials and Methods). Obsolete or transferred identifiers are explicitly indicated. For genes the HUGO Gene Nomenclature Committee (HGNC) symbol is provided and for metabolites their name and synonyms. If available in a pathway database, two structural IDs (InChI and SMILES) and the chemical formula are also shown for a metabolite. The information on the identifiers helps to reveal cases where the assignment of identifiers to a metabolite or gene can be improved. Firstly, it can uncover metabolites that completely lack an ID in one or more pathway databases. Secondly, ID information can also help to identify cases where pathway databases assigned IDs from different gene and metabolite databases to the same entity. This can be used to propose additional identifiers for that particular gene or metabolite, which may also facilitate matching between databases. Thirdly, it can reveal genes and metabolites to which a pathway database assigned multiple identifiers from the same genome or metabolite database, respectively. In summary, C2Cards can assist the considerable amount of manual curation required to correctly link each component of the metabolic network to external databases.
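The three identifier checks listed above lend themselves to a small audit routine. The sketch below is an illustration only: the data layout, the function name `audit_identifiers`, and all database names and identifiers in the example are placeholders invented for this purpose, not actual C2Cards content.

```python
def audit_identifiers(ids_per_database):
    """Audit the identifiers assigned to one gene or metabolite.

    ids_per_database maps a pathway database name to a dict that maps an
    identifier source (e.g. a metabolite database) to a list of IDs.
    """
    report = {"no_id": [], "multiple_ids_same_source": [],
              "suggested_sources": {}}
    # All identifier sources used by at least one pathway database.
    all_sources = set()
    for sources in ids_per_database.values():
        all_sources.update(s for s, ids in sources.items() if ids)

    for db, sources in sorted(ids_per_database.items()):
        present = {s for s, ids in sources.items() if ids}
        if not present:
            report["no_id"].append(db)             # check 1: no ID at all
        for source in sorted(present):
            if len(set(sources[source])) > 1:      # check 3: several IDs, one source
                report["multiple_ids_same_source"].append((db, source))
        missing = sorted(all_sources - present)
        if missing:                                # check 2: IDs other databases can supply
            report["suggested_sources"][db] = missing
    return report


# Placeholder example for a single metabolite.
example = {
    "DatabaseA": {"SourceX": ["X:001"]},
    "DatabaseB": {"SourceY": ["Y:900", "Y:901"]},
    "DatabaseC": {},
}
report = audit_identifiers(example)
print(report)
```

The output of such a routine is only a list of candidates; deciding which identifier is correct remains a manual curation step.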
The ability to correctly match metabolites when comparing reactions is influenced by the different decisions the curators of the pathway databases have taken. For example, in Recon 1 and HumanCyc the protonation state of a metabolite is determined at a pH level of 7.2 and 7.3, respectively. The other three databases always use the neutral form of a metabolite. As illustrated in the C2Card centered at the CTPS gene (Figure 1), this leads to a reaction mismatch between EHMN and KEGG, which use ammonia (NH3), and Recon 1, which uses ammonium. The gene and EC number perspectives offer a possible way to uncover such differences. The C2Cards application provides an additional means to uncover reactions that are similar, but not an exact match, by allowing the user to specify that one or more mismatches are allowed when querying for a reaction. An example of the results of a query in which one mismatch was allowed is given in Table 4. Note that the genes and EC number do match, which suggests that the two reactions can be considered equivalent. Moreover, in this example the reactions only seem to differ in the level of detail with which the metabolite ornithine was described. Allowing mismatches also makes it possible to retrieve reactions for which the identity of one or more metabolites could not be established, because of missing identifiers or because matching on name was hindered by the use of different synonyms.
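A reaction query that tolerates mismatches could look like the sketch below. It rests on an assumption that is not taken from the paper: a substituted metabolite (for example 'ornithine' versus 'L-ornithine') counts as a single mismatch, so the number of mismatches is the larger of the two one-sided set differences. The function names and the example reactions are illustrative and are not the entries shown in Table 4.

```python
def count_mismatches(query, candidate):
    """Number of unmatched metabolites, counting a substitution once."""
    q, c = set(query), set(candidate)
    return max(len(q - c), len(c - q))


def find_similar(query, reactions, max_mismatches=0):
    """Return names of reactions within max_mismatches of the query."""
    return [name for name, metabolites in reactions.items()
            if count_mismatches(query, metabolites) <= max_mismatches]


# Illustrative reactions; names are hypothetical labels.
reactions = {
    "db1_reaction": {"ornithine", "carbamoyl phosphate",
                     "citrulline", "phosphate"},
    "db2_reaction": {"L-ornithine", "carbamoyl phosphate",
                     "citrulline", "phosphate"},
    "db3_reaction": {"ATP", "UTP", "ammonium", "ADP", "CTP", "phosphate"},
}
query = reactions["db1_reaction"]

exact = find_similar(query, reactions)                    # exact matches only
relaxed = find_similar(query, reactions, max_mismatches=1)
print(exact, relaxed)
```

Hits retrieved only in the relaxed query are candidates for equivalence; agreement of the linked genes and EC numbers, as in Table 4, is what makes the case.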

C2Cards interfaces
C2Cards can be accessed using common JavaScript-enabled browsers on all major platforms including Windows, Linux, and Apple. A C2Card centered at a gene or EC number of interest can be retrieved in a single step. For the reaction perspective two routes are offered, either of which requires three steps. A reaction can be found by entering one or more metabolites or by selecting the pathway it is part of in one of the pathway databases. More detail on how to retrieve a C2Card is described on the C2Cards website (www.c2cards.nl). Once retrieved, a C2Card can also be downloaded for off-line use. In addition, for each database the C2Cards for all its genes, EC numbers, and reactions, respectively, can be downloaded in tab-delimited format in a single ZIP file.
Next to the web interface, programming interfaces to R, SOAP (Simple Object Access Protocol), and REST (Representational State Transfer) are provided to enable programmatic querying of the collection of C2Cards. One possible application would be to perform computational analyses on each of the pathway databases. A typical example is an enrichment test to prioritize pathways most likely to be affected in a given high-throughput experiment. The differences between pathway databases can be quite large both with respect to content and conceptual differences (Stobbe et al, 2011). For example, the number of pathways in the five selected human pathway databases ranges from 69 in EHMN to 257 in HumanCyc (see Materials and Methods). Consequently, it is to be expected that the choice of a particular pathway database affects the outcome of pathway enrichment analyses (Elbers et al, 2009). It would, therefore, be advisable to apply analyses to multiple pathway databases to verify the robustness of the results. Specifically, to accommodate pathway enrichment analyses, we provide two additional tables, accessible via the programmatic interfaces only. In these tables the metabolites and genes of each pathway database are linked to the corresponding pathways. The results of our reaction comparison could be used to zoom into the outcomes of an enrichment analysis to see if the differences found can perhaps be attributed to the different pathway definitions used by the databases.
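To make the robustness argument concrete, the sketch below runs the same enrichment test against two pathway definitions. It is only an illustration: the paper does not prescribe a particular test, so a one-sided hypergeometric test is assumed here as a common choice, and the gene names, database names, and function names are all placeholders. In practice the gene-to-pathway mapping would come from the two additional tables mentioned above.

```python
from math import comb


def hypergeom_pvalue(hits, pathway_size, selected, universe):
    """P(X >= hits) when `selected` genes are drawn without replacement
    from `universe` genes, `pathway_size` of which belong to the pathway."""
    upper = min(pathway_size, selected)
    total = comb(universe, selected)
    tail = sum(comb(pathway_size, k) * comb(universe - pathway_size, selected - k)
               for k in range(hits, upper + 1))
    return tail / total


def enrich(selected_genes, pathways, universe):
    """Return {pathway name: p-value} for one pathway database."""
    background = set(universe)
    chosen = set(selected_genes) & background
    results = {}
    for name, genes in pathways.items():
        members = set(genes) & background
        hits = len(chosen & members)
        results[name] = hypergeom_pvalue(hits, len(members),
                                         len(chosen), len(background))
    return results


# Placeholder data: the 'same' pathway defined narrowly and broadly.
universe = [f"g{i}" for i in range(10)]
database_a = {"pathway_1": {"g0", "g1", "g2"}}
database_b = {"pathway_1": {"g0", "g1", "g2", "g3", "g4", "g6"}}
selected = {"g0", "g1", "g5"}

p_a = enrich(selected, database_a, universe)["pathway_1"]
p_b = enrich(selected, database_b, universe)["pathway_1"]
print(f"database A: p = {p_a:.3f}, database B: p = {p_b:.3f}")
```

Even in this toy example the same two hits yield clearly different p-values, purely because the two databases draw the pathway boundary differently.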
Another additional feature offered is the possibility to look up the fate of a metabolite, contained in any of the five databases, by retrieving the list of reactions in which the metabolite of interest participates. Furthermore, databases in which the metabolite is a 'dead-end', i.e., it is either only produced or only consumed, are explicitly indicated. The list of reactions provided allows the user to find candidate reactions to resolve these dead-ends in the network of a particular database using information from other databases. All reactions in this list are linked to their corresponding C2Card.
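The dead-end definition given above translates directly into code. The following is a minimal sketch under stated assumptions: a reaction is represented as a tuple of substrates, products, and a reversibility flag, a reversible reaction is taken to both produce and consume all its metabolites, and the function name `dead_end_metabolites` is hypothetical.

```python
def dead_end_metabolites(reactions):
    """Return metabolites that are only produced or only consumed.

    reactions is an iterable of (substrates, products, reversible) tuples.
    """
    produced, consumed = set(), set()
    for substrates, products, reversible in reactions:
        consumed.update(substrates)
        produced.update(products)
        if reversible:
            # A reversible reaction can run in both directions.
            consumed.update(products)
            produced.update(substrates)
    # Symmetric difference: in exactly one of the two sets.
    return sorted(produced ^ consumed)


# Toy network of one database: A -> B -> C
network = [
    ({"A"}, {"B"}, False),
    ({"B"}, {"C"}, False),
]
dead_ends = dead_end_metabolites(network)

# Adding a reversible reaction C <==> D from another database
# resolves the dead-end at C.
extended = network + [({"C"}, {"D"}, True)]
dead_ends_extended = dead_end_metabolites(extended)
print(dead_ends, dead_ends_extended)
```

The second call mirrors the use case in the text: a reaction found in another database serves as a candidate to resolve a dead-end in the network of interest.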

C2Cards case studies
For each of the three perspectives we provide below a concrete example, derived from C2Cards Human, of consensus and conflicts between the five human pathway databases. The examples have all been chosen from primary metabolic processes, highlighting that conflicts still occur even in well-studied parts of the metabolic network. The case studies also illustrate why manual curation remains crucial to resolve contradicting information and to determine in which cases further biochemical experiments are required to verify what is correct and what is not.

Case study I: Gene perspective
The C2Card focused on the CTPS gene (Figure 1) shows that the gene is found in all five databases and is linked to the same EC number by each database. However, Reactome and Recon 1 link the gene to two different reactions, i.e., the glutamine dependent reaction 'l-glutamine + ATP + UTP + H2O → l-glutamate + ADP + CTP + orthophosphate' and the ammonium dependent reaction 'ammonium + ATP + UTP → ADP + CTP + phosphate + H+', respectively. The C2Card focused on the reaction of Reactome (Figure 1) shows that Recon 1 does contain this reaction, but links it only to the CTPS2 gene and not to CTPS. The same observation can be made when starting from the EC number perspective, as both genes are linked to the same EC number (not shown).
The products of both the CTPS and CTPS2 gene contain a glutamine amidotransferase domain and have high sequence similarity. This, and the fact that all databases assigned the same EC numbers to both genes, suggests that they have similar catalytic activity. Both gene products can indeed catalyze the glutamine dependent reaction, as demonstrated by overexpression of both human genes in yeast (Han et al, 2005). For L. lactis it is known that both ammonium derived from the hydrolysis of glutamine by the CTP synthase enzymes themselves and ammonium from other external sources of amine donors can be utilized for CTP synthesis (Willemoës, 2004). The human counterparts of these enzymes may follow the same reaction mechanism as found for L. lactis. This is supported by the fact that at room temperature glutamine is unstable and will dissociate into an ammonium ion and oxo-proline. We, therefore, conclude that CTPS and CTPS2 should probably be linked to both reactions. This means that Recon 1 could be improved by adding CTPS to each reaction. In Reactome and HumanCyc the ammonium dependent reaction then needs to be added.
Table 5 legend: The reaction in grey is found in all databases, the reaction in red only in EHMN and KEGG. '|==|' indicates no direction provided by the database. Genes are represented by HGNC symbols, retrieved via Entrez Gene IDs. Genes, the products of which form a complex, are placed between parentheses and connected by the Boolean operator 'and' (see Materials and Methods). If the gene products are isozymes 'or' is used.

Case study II: EC number perspective
The EC number 6.2.1.4 (succinate-CoA ligase (GDP-forming)) is found in all five databases. They all agree on one reaction and two genes linked to it (Table 5, reaction indicated in grey). The reaction is considered to be part of the tricarboxylic acid (TCA) cycle, a mitochondrial pathway, by all databases except HumanCyc. Both EHMN and KEGG also include a very similar reaction (Table 5, reaction indicated in red), which only differs with respect to its co-substrates, i.e., IDP/ITP instead of GDP/GTP. Although IDP is a substrate for this enzyme in vitro, it is extremely unlikely to play a role in vivo. The concentrations of IDP and ITP are very low as compared to other nucleotides, and they are considered byproducts of purine nucleotide metabolism (Bierau et al, 2007). The reaction should therefore not be included in the description of the human metabolic network.

Case study III: Reaction perspective
All five databases contain the reaction 'deoxyuridine + phosphate <==> 2-deoxy-d-ribose 1-phosphate + uracil' and assigned it to similarly named pathways (Table 6). However, there is no consensus regarding the genes linked to this reaction. For UPP2 there is clear experimental evidence that its gene product can catalyze the reaction (Johansson, 2003). To the best of our knowledge the activity of the enzyme encoded by UPP1 was only evaluated for two substrates, uridine and thymidine (Watanabe and Uchida, 1995). For TYMP evidence exists that its product can indeed catalyze this reaction in placenta (Kubilus et al, 1978;Yoshimura et al, 1990) and in platelets (Desgranges et al, 1981), but in liver, for example, such activity was not observed (Yoshimura et al, 1990). For PNP there is not enough evidence clearly confirming or refuting that its product can catalyze this specific reaction. In conclusion, additional experiments are required to determine whether the products of UPP1 and PNP can catalyze this reaction. This also illustrates that even though the majority of the databases links PNP to the reaction, this is not necessarily corroborated by conclusive evidence. For the TYMP gene there is only evidence for two highly specific tissues, which leaves it open for discussion whether its product should be included as a catalyst of this particular reaction. We can conclude that EHMN, HumanCyc and KEGG should at least link the UPP2 gene to this reaction. This would resolve the 'missing gene' issue in HumanCyc. Note also that the majority of the databases does not link UPP2 to this reaction, although clear evidence for its role is available.

Discussion
We proposed the concept of Consensus and Conflict Cards to provide concise overviews of the knowledge contained in metabolic pathway databases for an organism of interest. In a single step one can find, for example, a gene of interest and see if the databases agree on the role of its product in the metabolic network. The C2Cards will increase the awareness of the differences that exist between the various pathway databases. Other initiatives also provide a web-based interface to browse and search multiple pathway databases (Cerami et al, 2010; Kamburov et al, 2011). However, they are focused on the union of various (pathway) databases instead of explicitly pointing out the differences between pathway databases. Furthermore, they do not provide a clear and compact overview of the content of each of the five selected databases as a C2Card does. Also, the C2Cards application enables users to find reactions that are similar to the reaction of interest, but that are not exactly the same. The three perspectives offered by the C2Cards application provide complementary views on the knowledge contained in the pathway databases. This makes it possible to distinguish differences that reflect a disagreement on the underlying biology (case studies I-III) from differences that may be explained by, for example, different decisions taken on how to represent knowledge (Table 4).

Table 6 - Reaction of interest: deoxyuridine + phosphate <==> 2-deoxy-d-ribose 1-phosphate + uracil
Ultimately, manual curation is required to reconcile differences and to integrate the networks. While a C2Card can highlight differences between databases, it cannot distinguish between errors in one (or more) of the databases and cases where databases do not agree due to lack of consensus in the scientific literature. Moreover, for any given organism metabolic pathway databases are still being refined, expanded, and corrected. This makes it challenging to distinguish complementary information from cases in which the database curators purposely excluded, for example, a reaction or gene. Even the parts the pathway databases agree on may need to be reviewed as the databases share information sources and may copy data from each other, thereby possibly propagating incorrect information. Manual curation is also needed to unambiguously assign identifiers to genes and metabolites.
In summary, C2Cards offer an elegant solution to bring cases that deserve further inspection to the attention of pathway database curators. The overviews may also point out controversial biological knowledge that should be the subject of further research.

Conclusions
A biologically accurate and complete description of the metabolic network for human and other organisms is of utmost importance to, e.g., increase our knowledge about pathways perturbed by a disease, find new drug targets, and interpret the deluge of high-throughput data. A crucial step towards a more complete description is to combine the knowledge captured by each of the available pathway databases for a specific organism. Much time and effort has already been put into pathway databases and we should profit from this to the fullest extent. However, it requires the commitment and the support of a broad community to construct an initial consensus network and to extend it with new knowledge from domain experts, the scientific literature, and as captured by the various pathway databases. C2Cards can contribute to such an endeavor in several ways. As illustrated by the three case studies, the C2Cards are a perfect starting point for manual curation of the human metabolic network in future reconstruction jamborees (Thiele and Palsson, 2010a). The set of five pathway databases currently contained in C2Cards Human can also be further expanded with additional pathway databases. Importantly, C2Cards can be set up for other organisms as well (see www.c2cards.nl for a description).
As a guide for integrating pathway databases, we provide overviews of which genes, EC numbers, and reactions can be found in which database. The entries in these overviews are linked to the corresponding C2Card. One could start by curating the reactions contained in all or the majority of the databases. In fact, for more than half of the reactions found in all five human metabolic pathway databases, there is no agreement on the EC numbers and genes linked to a reaction (Stobbe et al, 2011) and additional curation is needed. C2Cards can also be of use if a consensus network for a given organism has already been established. We envision that the C2Cards application could serve as a central platform in which the consensus network can be further refined and extended with knowledge available in pathway databases not used for its construction. We are planning to include the recently completed consensus human metabolic network Recon 2 (Thiele et al, submitted) in C2Cards Human. Recon 2 combines the content of three reconstructions, H. sapiens Recon 1, EHMN, and the liver-specific network HepatoNet1 (Gille et al, 2010). By including Recon 2 as a point of reference, we can compare this state-of-the-art consensus network with other pathway databases. The overview of all reactions in C2Cards Human, for example, could be a source of candidates for expanding Recon 2. Bringing the differences between the consensus network and other descriptions to the attention of experts would enable further refinement of Recon 2. As a first step towards such a platform, users can already add comments to a C2Card, preferably substantiated by references to the literature. They can subscribe to C2Cards of their interest and receive an e-mail when new comments are added. Based on these contributions a team of curators could then decide to incorporate the necessary changes in the consensus network, if enough evidence supports them.
Notably, as illustrated by case study III, it may lead to the conclusion that further biochemical characterization experiments are required. Since pathway databases are continuously being refined and new information is being added, we could also include the possibility to automatically alert the curators by mailing them updated or additional C2Cards.
It is important to actively involve domain experts in this continuous curation process, even though they may only indirectly benefit from contributing to such an effort. To make the barrier to contribute as low as possible, the web interface of the C2Cards was designed to be easy to use and suitable for users with different backgrounds. The application can be accessed via smartphones and tablets as well, allowing C2Cards to be viewed and discussed nearly anywhere. Furthermore, a C2Card can be downloaded for off-line use. The curation of a C2Card is done at the level of a single reaction or the metabolic functions of a single gene product. This may further lower the threshold for experts to contribute and also allows (very) detailed knowledge of just a single step in the metabolic network to be added. One way to stimulate expert contributions would be to make the contribution traceable and citable in the form of 'nanopublications' (Groth et al, 2010). A nanopublication consists of three parts: a statement, e.g., protein X (subject) catalyzes (predicate) reaction Y (object); the conditions under which the statement holds, e.g., a specific compartment; and the provenance of the statement, e.g., author and literature. Besides providing an incentive for experts to share their knowledge, this is also a way to ensure that contributions of curators are substantiated by references to the literature.
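The three-part structure of a nanopublication described above can be illustrated with a small data structure. This is a minimal sketch only: the class names, field names, and the `is_substantiated` check are hypothetical and are not taken from Groth et al (2010) or from the C2Cards code.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Statement:
    """Subject-predicate-object triple, e.g. protein X catalyzes reaction Y."""
    subject: str
    predicate: str
    obj: str


@dataclass
class Nanopublication:
    """Hypothetical container for the three parts named in the text."""
    statement: Statement
    # Conditions under which the statement holds, e.g. a compartment.
    conditions: dict = field(default_factory=dict)
    # Provenance of the statement, e.g. author and literature references.
    provenance: dict = field(default_factory=dict)

    def is_substantiated(self) -> bool:
        """True if an author and at least one literature reference are given."""
        return bool(self.provenance.get("author")) and bool(
            self.provenance.get("literature")
        )


# Placeholder values, for illustration only.
example = Nanopublication(
    statement=Statement("protein X", "catalyzes", "reaction Y"),
    conditions={"compartment": "cytosol"},
    provenance={"author": "A. Curator", "literature": ["placeholder reference"]},
)
```

A check such as `is_substantiated` is one conceivable way to require that a contribution carries a literature reference before curators consider it.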
We also plan to include in C2Cards Human the human metabolic pathways of WikiPathways (Pico et al, 2008), an open platform in which anyone can contribute a pathway. By incorporating the knowledge from this database we indirectly have a second way in which experts can contribute their knowledge. Ultimately, to reconstruct a biochemical network that closely resembles the metabolism of a target organism, extensive literature research and additional biochemical experiments will be needed to resolve all conflicts revealed and to fill in the gaps that remain. The continuous support, time and effort of a large and diverse community are therefore essential. C2Cards can contribute to this endeavor by simplifying the identification of consensus and conflicts between pathway databases and lowering the threshold for experts to contribute.

Materials
C2Cards Human was built upon the same dataset we used previously (Stobbe et al, 2011) for a comparison of five pathway databases, i.e., EHMN, H. sapiens Recon 1, HumanCyc, and the human metabolic subsets of KEGG and Reactome (Table 7). For each reaction we retrieved the EC number(s) and gene(s) linked to it, and the pathway(s) the reaction is part of (Table 8). To compare the reactions, we retrieved for each metabolite, besides its primary name and available synonyms, the chemical formula and the available metabolite identifiers. Although not used for comparing metabolites, we also retrieved the InChI and SMILES of metabolites, when provided by the pathway database, as additional information. For the genes we retrieved the Entrez Gene and Ensembl Gene ID, if available. For display and comparison purposes we mapped the Entrez Gene and Ensembl Gene IDs to their corresponding HGNC symbol as provided by the Entrez Gene and Ensembl database, respectively. For 396 genes in HumanCyc neither the Entrez Gene ID nor the Ensembl Gene ID was available. For 106 of these genes the UniProt ID was used to retrieve the Entrez Gene ID and/or Ensembl Gene ID. All out-of-date identifiers and EC numbers were transferred to the current ID/EC number (Supplementary Table S1). If that was not possible, the ID or EC number was flagged as being obsolete. All data are made available under the original license terms of the primary databases.

Data retrieval and storage
We used dedicated in-house scripts to retrieve the data needed for C2Cards Human from the five pathway databases and stored these data in a local MySQL database. The database was designed for easy comparison of the genes, EC numbers, and reactions. The results of all comparisons were stored in the database as well. A second database, optimized for the queries needed for generating the C2Cards Human (Supplementary Figure S1), was derived from this database, including precomputed results of all the comparisons to avoid heavy computations in the web application.

Matching
In C2Cards Human, genes, EC numbers, metabolites, and reactions were matched as follows:

Genes
Two genes were considered to match if they agreed based on the Entrez Gene ID and/or Ensembl Gene ID. In addition, both types of gene identifiers were mapped to the corresponding HGNC symbols. This provides a basis for matching genes that are not linked to the same genome database, i.e., Entrez Gene or Ensembl, via their HGNC symbol.
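The gene matching rule can be sketched as follows. This is an illustrative sketch and not the C2Cards implementation: the `Gene` class, its field names, and the identifiers in the example are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Gene:
    """Hypothetical gene record; any of the identifiers may be missing."""
    entrez_id: Optional[str] = None
    ensembl_id: Optional[str] = None
    hgnc_symbols: frozenset = frozenset()


def genes_match(a: Gene, b: Gene) -> bool:
    """Match on Entrez Gene ID and/or Ensembl Gene ID, else via HGNC symbol."""
    if a.entrez_id and a.entrez_id == b.entrez_id:
        return True
    if a.ensembl_id and a.ensembl_id == b.ensembl_id:
        return True
    # Fallback for genes linked to different genome databases:
    # compare the HGNC symbols to which their identifiers were mapped.
    return bool(a.hgnc_symbols & b.hgnc_symbols)


# Made-up identifiers: one gene known by Entrez ID, the other by Ensembl ID.
gene_a = Gene(entrez_id="0001", hgnc_symbols=frozenset({"GENE1"}))
gene_b = Gene(ensembl_id="ENSG_EXAMPLE", hgnc_symbols=frozenset({"GENE1"}))
gene_c = Gene(entrez_id="0002", hgnc_symbols=frozenset({"GENE2"}))
```

In this example `gene_a` and `gene_b` share no identifier type but still match through their common HGNC symbol, whereas `gene_a` and `gene_c` do not match.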

EC numbers
Matching of EC numbers is straightforward except for 71 incomplete EC numbers the five databases have in total. Up to three numbers of the four that make up a complete EC number may be missing. This is indicated by '-', for example, EC 1.-.-.-. Incomplete EC numbers have an ambiguous meaning (Green and Karp, 2005). They may indicate that further specification of the enzyme activity is not possible, but also that a complete EC number for the specific enzyme activity is not yet included by NC-IUBMB. To reduce the number of spurious matches, incomplete EC numbers were matched literally, i.e., the '-' was not treated as a wildcard.
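Literal matching of incomplete EC numbers can be sketched as below. The function names and the handling of an optional "EC" prefix are assumptions made for illustration; only the rule that '-' is not a wildcard comes from the text.

```python
def normalize_ec(ec: str) -> str:
    """Strip whitespace and an optional leading 'EC' or 'EC:' prefix."""
    ec = ec.strip()
    if ec.upper().startswith("EC"):
        ec = ec[2:].lstrip(": ")
    return ec


def is_incomplete(ec: str) -> bool:
    """An EC number is incomplete if any of its four positions is '-'."""
    return "-" in normalize_ec(ec).split(".")


def ec_numbers_match(ec_a: str, ec_b: str) -> bool:
    """Literal comparison: '-' is NOT treated as a wildcard."""
    return normalize_ec(ec_a) == normalize_ec(ec_b)
```

Under this rule, EC 1.-.-.- matches only another EC 1.-.-.- and never a complete number such as EC 1.1.1.1, which avoids spurious matches at the cost of possibly missing some true ones.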
Metabolites
Metabolites were matched based on the KEGG Compound ID, when available. If the KEGG Compound ID was not provided, the metabolites had to match on any of four other identifiers (KEGG Glycan, ChEBI, PubChem Compound or CAS ID) or on name. In the latter case we also required the chemical formula to match. A difference in the number of H atoms when comparing chemical formulae was ignored.
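The metabolite matching cascade can be sketched as follows. Several details are assumptions, not statements from the source: that the KEGG Compound ID decides the outcome only when both metabolites have one, that names are compared case-insensitively across primary names and synonyms, and that formulae are simple (no parentheses or charges). All class, function, and key names are hypothetical.

```python
import re
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional

OTHER_IDS = ("kegg_glycan", "chebi", "pubchem_compound", "cas")


def formula_without_h(formula: str) -> Counter:
    """Element counts of a simple formula, with hydrogen dropped."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    counts.pop("H", None)  # differences in the number of H atoms are ignored
    return counts


@dataclass
class Metabolite:
    names: set                      # primary name plus synonyms
    formula: Optional[str] = None
    ids: dict = field(default_factory=dict)


def metabolites_match(a: Metabolite, b: Metabolite) -> bool:
    kegg_a, kegg_b = a.ids.get("kegg_compound"), b.ids.get("kegg_compound")
    if kegg_a and kegg_b:           # assumption: decisive when both are present
        return kegg_a == kegg_b
    for key in OTHER_IDS:
        if a.ids.get(key) and a.ids.get(key) == b.ids.get(key):
            return True
    shared_name = {n.lower() for n in a.names} & {n.lower() for n in b.names}
    if shared_name and a.formula and b.formula:
        # A name match additionally requires matching formulae (ignoring H).
        return formula_without_h(a.formula) == formula_without_h(b.formula)
    return False
```

For instance, two entries named "pyruvate" with formulae C3H4O3 and C3H3O3 would match under this sketch, because the formulae differ only in the number of H atoms.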
Reactions
For reactions we determined the percentage of metabolites they agreed upon, respecting the two sides of a reaction, but ignoring the direction of a reaction. We did not require e-, H+, and H2O to match, as reactions are not always balanced for these metabolites. Furthermore, we did not take into account the compartmentalization of reactions. The similarity of two reactions was measured by the percentage of overlap.

Which IDs can best be used for comparing genes and metabolites depends on the organism and the specific pathway databases included in the C2Cards database. Only a few changes to the code and the original C2Cards database scheme are required to use other IDs for matching. A more detailed description of the changes to make is available on our website (www.c2cards.nl).
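The reaction comparison described above can be sketched as follows. The exact definition of the overlap percentage is not reproduced here, so the formula used in this sketch (matched metabolites divided by the union of metabolites per side, taking the better of the two orientations) is an assumption for illustration only. The sketch also assumes that metabolites have already been mapped to shared keys by the metabolite matching step; all names are hypothetical.

```python
# Metabolites that are not required to match, as stated in the text.
IGNORED = frozenset({"e-", "H+", "H2O"})


def _strip(side):
    """Drop the ignored metabolites from one side of a reaction."""
    return frozenset(side) - IGNORED


def _overlap(left_a, right_a, left_b, right_b):
    """Assumed overlap definition: matched / union, computed per side."""
    matched = len(left_a & left_b) + len(right_a & right_b)
    total = len(left_a | left_b) + len(right_a | right_b)
    return 100.0 * matched / total if total else 0.0


def reaction_overlap(reaction_a, reaction_b):
    """Percentage overlap of two reactions given as (left, right) sets.

    The two sides are respected, but the direction is ignored by also
    comparing against the second reaction with its sides swapped.
    Compartments are not part of the representation and so play no role.
    """
    left_a, right_a = map(_strip, reaction_a)
    left_b, right_b = map(_strip, reaction_b)
    return max(
        _overlap(left_a, right_a, left_b, right_b),
        _overlap(left_a, right_a, right_b, left_b),
    )


# The same reaction written in opposite directions, one with an explicit H+.
forward = ({"glucose", "ATP"}, {"glucose 6-phosphate", "ADP", "H+"})
reverse = ({"glucose 6-phosphate", "ADP"}, {"glucose", "ATP"})
```

With this assumed definition, `forward` and `reverse` overlap completely, since the direction and the unbalanced H+ are both disregarded.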
Construction web application
C2Cards Human was built using the Molecular Genetics Information Systems (MOLGENIS) toolkit (Swertz et al, 2010). This software enables bioinformaticians to model a complete web application with rich data structures and user interfaces using a simple and short XML file. From this model, the toolkit automatically generates software in the Java language that provides a basic web user interface (using Freemarker templates, http://www.freemarker.org), and programming interfaces in Java, R, SOAP and REST to the underlying MySQL database. Building on this generated software, we used the MOLGENIS 'plug-in' framework to program, in Java and JavaScript, extra features that are specific for C2Cards Human, such as the various search options. The result is installed on a standard Tomcat web server, but can also run 'standalone' using the MOLGENIS embedded web server. A local installation of C2Cards Human is also available upon request. All code and the database scheme are open source and can be used as a basis for building a C2Cards application for other organisms. A manual on how to do this is available on our website (www.c2cards.nl). The code for the C2Cards application is available at http://www.molgenis.org/svn/c2cards/trunk/. A copy of the core MOLGENIS project is also required, which is available at http://www.molgenis.org/svn/molgenis/branches/molgenis_c2cards.

Representation
Each row in a C2Card contains a reaction, the EC number(s), gene(s), and the pathway linked to the reaction, and the name of the source database. If a reaction was assigned to multiple pathways, a separate row is used for each pathway. The metabolites of a reaction are represented by their primary name as indicated by the pathway database. Although not taken into account when matching reactions, the direction of a reaction and the compartment(s) as indicated by the source database are shown in a C2Card. If the direction was not provided, this is indicated with '|==|'. Multiple EC numbers are separated by commas. Following the convention used in Recon 1, genes of which the products are isozymes are connected by the Boolean operator 'or'. If the gene products form a complex, 'and' is used. EHMN and KEGG, however, do not have a syntactic mechanism for describing isozymes or complexes. Therefore, if multiple genes were linked to a reaction by EHMN or KEGG, they are separated by commas. Genes are represented by the HGNC symbol retrieved from Entrez Gene. The Entrez Gene ID was, however, not available for every gene, and the HGNC symbol could not always be retrieved when the Entrez Gene ID was available. In these cases we used, when available, the Ensembl Gene ID to retrieve the HGNC symbol. For 358 genes the HGNC symbol was not available via either gene identifier type. In this case the gene is represented by its Entrez Gene or Ensembl Gene ID, depending on which of these two was available.
For 274 genes in HumanCyc these two gene identifiers were also not available and for these cases the internal gene identifier of HumanCyc is used for representation. If multiple HGNC symbols were linked to a gene they are separated by two underscores. Note also that HumanCyc and Reactome may link multiple Entrez Gene IDs to a single gene, which in most cases will also result in multiple HGNC symbols. Similarly, KEGG and Reactome contain genes linked to multiple Ensembl Gene IDs.

a One could not be corrected and was therefore removed.
b As the CID-SID.gz file from PubChem was used to convert the PubChem Substance IDs to PubChem Compound IDs, these are naturally up-to-date.
An 'x' indicates that the particular identifier is not available for this database.

Supplementary File S1 - Example of a C2Card
A C2Card centered at an EC number may reveal possible alternative substrates, which is one of the sources of conflict between metabolic pathway databases (Stobbe et al, 2011