Comparative Analysis of Biochemical Network Reconstructions

A reconstruction of a genome-scale metabolic network is the result of assembling various information sources about all biochemical reactions expected in the metabolic network of interest. Despite the efforts of leading bio-model databases to make comparison of biochemical networks by element names obsolete, researchers' interest in using string similarity metrics to compare metabolite names has been growing. This article discusses multiple challenges in the comparison of reconstructions and gives an insight into the current approach to metabolic model comparison. The discussion of the challenges and of attempts to solve them is followed by the author's proposed algorithm for model comparison, which can be particularly useful for reconstructions. Special attention is given to the use of metabolite names and chemical formulas. The proposed algorithm has been implemented in the software tool ModeRator. The article concludes with use cases of the comparison algorithm and the software tool.


Introduction
The molecular processes in cells form a huge network, which makes detailed mathematical modeling and simulation extremely difficult (Schulz et al., 2006). Genome-scale reconstructions of metabolic networks and stoichiometric models may contain thousands of metabolites and reactions (Thiele et al., 2013), and therefore the functions of such networks are hard for the human mind to comprehend (Palsson, 2006).
During the last decade, over 50 genome-scale reconstructions have been built for various organisms. Despite the growing number of reconstructions and models in databases such as JWS (Snoep and Olivier, 2003; Van Gend et al., 2007) or Biomodels (Le Novère et al., 2006), computational analysis has rarely been applied to comparisons between multiple organisms. The main reason for this is the existence of differences between reconstructions that are inherited from the respective reconstruction processes of the organisms to be compared (Oberhardt et al., 2011).
The increasing knowledge base of living organisms leads to even more complex biochemical models, and scientists often decide to model only a part of the genome, not the whole metabolism. The process of iterative model building promises to accelerate biological discovery, product development, and process design (Palsson, 2006; Ideker et al., 2001). Consequently, the need for analysis, comparison, and merging of biomodels is growing. The demand for a method to relate different models (Gay et al., 2010) and to compare or couple them as parts of larger models has been noted by Radulescu et al. (2008).
The lack of strict standardization in the identification of metabolites and reactions makes the reuse of models problematic. A single metabolite can be notated in multiple ways, and the use of synonyms worsens the problem. The differences in the annotation of reconstructions lead to the current situation where a number of biochemical network models of the same organism exist, but there is no way to inspect (in a reasonable time) how much they overlap, what parts they have in common, or whether one model is a subset of the other.
The currently available software solutions for automated comparison of reconstructions and models rely on element identifiers and can recognize only the identity of identically annotated elements. Therefore all potentially equal elements with noncomparable or differently typed identifiers have to be inspected pairwise manually by a competent biologist. In the case of genome-scale reconstructions, the number of metabolite and reaction pairs to be checked can reach several million. Biologists therefore need computational help to reduce the manual work. However, there is a lack of automated solutions that could handle the comparison of genome-scale reconstructions with poor or differently styled annotation.

The challenge of reconstruction comparison
The reconstruction of a genome-scale metabolic network is the result of assembling various information sources about all the biochemical reactions expected in the metabolic network of interest (Palsson, 2006).
Many efforts in biology are inspired by the observation that different species have many common properties and molecular mechanisms (Bruggeman and Westerhoff, 2007). For instance, glycolysis takes place in all known organisms. The similarity of organisms and of modules of biochemical networks justifies the comparison of reconstructions between different organisms, and not just between different reconstructions or models of one organism.
Since 1997, close to a hundred genome-scale reconstructions for various organisms, including human, have been built. The human reconstruction Recon2 (Thiele et al., 2013), containing 7440 reactions and 5063 metabolites, predicted changes of metabolite biomarkers for 49 inborn errors of metabolism with 77% accuracy compared to experimental data.
The growing number of available reconstructions can be used as integrated knowledge for building new models or reconstructions of the process or organism of interest. To utilize the knowledge stored in model databases, the models have to be compared to find their level of agreement and to make use of the highly reliable parts of existing models.
The main reason for not applying computational analysis to comparisons between multiple organisms is the differences between reconstructions that are inherited from the respective reconstruction processes of the organisms to be compared (Oberhardt et al., 2011).
The overall purpose of reconstruction comparison is to find which reactions both reconstructions have in common. The information about common reactions can later be used by a biologist to draw conclusions about common pathways.
The comparison of biochemical network reconstructions would be simple if all reconstructions were created according to a standard. That is not the case, because different scientific groups in different countries with different traditions have developed reconstructions over the last 20 years. Several groups of challenges therefore arise: different amount and quality of annotations; differences in metabolite description; differences in reaction notation; and compartmentalization.

Current approach of metabolic model comparison
The possible problems with reconstruction comparison originate from the very beginning of the creation of a reconstruction, as the reconstruction process is based on the analysis and combination of available information about the biochemical reactions forming the network. Automatically generated draft reconstructions may have comprehensive annotation; however, the addition of information from various sources sooner or later spoils the initial consistency.
Of the software tools surveyed, currently only Tools-4-Metatool (Xavier et al., 2011), Compare Subsystems (Oberhardt et al., 2011), SemanticSBML (Krause et al., 2010), COBRA (Becker et al., 2007), The FAME (Boele et al., 2012), MetRxn (Kumar et al., 2012), BudHat (Waltemath et al., 2013), PINT (Wang et al., 2010) and MEMOSys (Pabinger et al., 2011) provide functionality that is related to the comparison of models. Software tools mostly rely on internal or external identifiers, like KEGG ID and ChEBI ID, and do not tolerate even small differences in metabolite names, such as brackets, quotes, apostrophes, spaces, upper/lower case letters and some other symbols, which may be caused by the modeler's style of defining metabolites. Therefore many pairs of identical metabolites may not be recognized, leading to wrong conclusions about the similarity of models.
If the identifiers of reconstruction elements (compartments, metabolites and reactions) can be directly used to correctly identify elements across different reconstructions, then the whole comparison problem can be reduced to the comparison of two metabolism graphs. However, in real-world applications the internal identifiers cannot be used to identify elements across reconstructions.
No software tool that could handle flexible comparison of genome-scale reconstructions with poor or differently styled annotation has been found during the survey.

The proposed algorithm
The overall purpose of reconstruction comparison is to find which reactions both reconstructions have in common. The information about common reactions can later be used by a biologist to draw conclusions about common pathways, which are formed by series of reactions.
Usually the elements of a metabolic network are metabolites and enzymes: metabolites react with each other with the help of an enzyme, producing other metabolites. The elements of a reconstruction are data lists describing metabolites, reactions and compartments.
The logical order of steps needed to compare two biochemical reconstructions is: 1. compare and map compartments; 2. compare and map metabolites within compartments; 3. compare reactions.
Reactions can be compared only after the involved metabolites have been compared and mapped. Since metabolites may reside in different compartments, it is important to map compartments as well. Recognition and comparison of metabolites has also attracted attention from other researchers, including Qi and Ozsoyoglu (2013), Qi et al. (2014) and Thavappiragasam et al. (2014). In this paper, the author focuses on cases where external identifiers of entities, like KEGG ID and ChEBI ID, cannot be used in reconstruction comparison.
Depending on the source of the reconstruction and the file format, a different set of additional information is available.
The following entities of reconstructions are compared:
- Metabolites are compared by the names provided in the reconstruction file. Information about compartments and chemical formulas is used to strengthen or weaken the automatic decision about equality.
- Reactions are compared by their equations (reversibility, metabolites and their stoichiometry). Information about E.C. and GPR numbers is used to strengthen or weaken the automatic decision about equality.

Comparison of metabolites
The pairwise comparison of metabolites means that each metabolite from one reconstruction is compared with each metabolite from the other reconstruction. The number of comparison operations needed equals m × n, where m and n are the numbers of metabolites in the compared reconstructions, so two reconstructions each containing a thousand metabolites require one million comparison operations. The algorithm to compare two individual metabolites is summarized in Figure 1 and is applied to each pair of metabolites. When the algorithm ends with the "Discard the pair" action, the particular metabolite pair is not passed further to the mapping algorithm.
Fig. 1: Algorithm of processing a pair of metabolites
To calculate the similarity of metabolite names, the author has chosen the Gestalt Pattern Matching algorithm by Ratcliff and Metzener (1988). This algorithm is available in the Python standard module difflib. To calculate the edit distance, the Levenshtein (1966) algorithm from the pyLevenshtein library has been used.
The similarity ratio and the edit distance independently characterize the similarity of any two metabolite names. The two metrics have different ranges: the similarity ratio is a percentage in floating point format, whereas the edit distance is an integer starting from 0.
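The two metrics can be illustrated with a short sketch. The similarity ratio uses `difflib.SequenceMatcher` from the Python standard library, as named in the text. The edit distance is written out here as a plain dynamic-programming routine so that the example stays self-contained; it is a stand-in for the pyLevenshtein call, and the function names `similarity_ratio` and `edit_distance` are chosen for illustration only.

```python
from difflib import SequenceMatcher


def similarity_ratio(name_a: str, name_b: str) -> float:
    """Gestalt pattern matching ratio in the range 0.0 to 1.0."""
    return SequenceMatcher(None, name_a, name_b).ratio()


def edit_distance(name_a: str, name_b: str) -> int:
    """Levenshtein distance computed row by row (stand-in for pyLevenshtein)."""
    previous = list(range(len(name_b) + 1))
    for i, char_a in enumerate(name_a, start=1):
        current = [i]
        for j, char_b in enumerate(name_b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


if __name__ == "__main__":
    for pair in [("ATP", "ADP"), ("L-Aspartate", "aspartate")]:
        print(pair, similarity_ratio(*pair), edit_distance(*pair))
```

Note that `ratio()` returns a fraction between 0 and 1; multiplying by 100 gives the percentage form used for the thresholds in the text.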
In the metabolite comparison algorithm these two metrics are combined into one. Since the similarity ratio (R) characterizes the similarity of the given names, the expression (1 − R) gives their dissimilarity. The edit distance already characterizes the dissimilarity of the given names; dividing it by the length of the shortest name therefore still yields a number that characterizes dissimilarity, but on a scale similar to that of the ratio.
Dividing by the length of the shortest name does not guarantee a result between 0 and 1. However, it always gives a value greater than or equal to that obtained by dividing by the length of the longest name, which is essential for short metabolite names.
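As a rough sketch of this combination, the function below assumes that the difference score is the weighted sum A·(1 − R) + B·E/L suggested by the symbol definitions (R: similarity ratio, E: edit distance, L: length of the shortest name, A and B: weighting coefficients). The function name, the argument names and the default weights of 1.0 are illustrative assumptions, not values taken from ModeRator.

```python
def difference_score(ratio: float, distance: int, shortest_length: int,
                     a: float = 1.0, b: float = 1.0) -> float:
    """Weighted sum of two dissimilarity terms (assumed form of the score).

    ratio           -- similarity ratio R in the range 0.0 to 1.0
    distance        -- edit distance E (non-negative integer)
    shortest_length -- length L of the shorter of the two names
    a, b            -- weights of the ratio and edit distance terms
    """
    if shortest_length <= 0:
        raise ValueError("both names must be non-empty")
    ratio_term = 1.0 - ratio                 # dissimilarity from the ratio
    edit_term = distance / shortest_length   # may exceed 1 for short names
    return a * ratio_term + b * edit_term


if __name__ == "__main__":
    # Identical names: R = 1 and E = 0, so the score is zero.
    print(difference_score(1.0, 0, 5))
    # A short name with several edits: the edit term dominates.
    print(difference_score(0.5, 2, 4))
```

Because the edit term is divided by the shorter length, a single edit weighs far more for a three-letter abbreviation than for a long systematic name, which matches the behaviour described in the text.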
The combined Difference score of dissimilarity is presented in Equation (1):

$$D = A\,(1 - R) + B\,\frac{E}{L} \tag{1}$$
where:
D : difference score
R : similarity ratio
E : edit distance
L : the length of the shortest name
A : coefficient to affect the impact of the similarity ratio
B : coefficient to affect the impact of the edit distance

The difference score is a sum of two dissimilarity metrics (1). This new summed metric is calculated for each metabolite pair. The essence of the equation is that for long metabolite names the summed dissimilarity consists mainly of the ratio component, whereas for short metabolite names the main contributor is the edit distance component. Examples of various names and the corresponding difference scores are given in Table 1. For two identical names the calculated difference score is 0 (zero).
Table 1: Edit distance by Levenshtein (1966), similarity ratio by Ratcliff and Metzener (1988) and Difference score by the author.
Phonetic-like preprocessing. Certain symbols in metabolite names can have a different impact on the biological meaning (see Table 2). For instance, special characters do not play a significant role in the meaning of a particular metabolite name, while numbers can change the biological meaning completely. A procedure is proposed to obfuscate characters with a small impact on the meaning and to increase the impact of numbers in metabolite names. As seen in Table 2, the phonetic-like preprocessing decreases the difference score for the first pairs (Aspartate), but increases it for the second pairs (Trihydroxypropane). In both cases the result improves the chances of matching equal metabolites and of avoiding matches between unequal ones.

Comparison of metabolites formulas
Chemical formulas can be used to verify whether a particular pair of metabolites truly contains the same metabolite. If both formulas are available, the basic solution is to compare the formulas "as they are". If one or both formulas are not available, the decision cannot be made. The formula comparison algorithm calculates how many atoms differ between the two formulas. The formula similarity metric is given in Equation (2),
where:
F : formula similarity index
minH : smallest number of hydrogen atoms
maxH : greatest number of hydrogen atoms
O : number of other differing atoms

The formula similarity index is used as a multiplier for the Difference score. The essence of Equation (2) is the following: for two equal formulas the equation produces the value 0.5 and therefore reduces the previously calculated Difference score by half; for formulas where only the count of hydrogen atoms differs, the produced value is between 0.5 and 1.0 and therefore the Difference score is slightly decreased (enhanced); if other atoms differ between the formulas, their count is added to the formula similarity index and therefore the Difference score is increased (degraded).
Examples of the comparison of different formulas are given in Table 3. The conjunction of the Difference score and the formula similarity is given in Equation (3). The free constant (C) is important and should not be set to zero: if both metabolite names are identical and the Difference score is therefore already zero, then without C differing chemical formulas would have no effect in decreasing the similarity of the metabolites. For example, if C is 1, then for equal names and equal formulas the final score will be 0.5. The non-zero value of the final score leaves room for additional multipliers that can be added after further research.
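The behaviour described for the formula similarity index and the final score can be sketched as follows. The exact forms used here are assumptions chosen only to reproduce the properties stated in the text: F = 1 − minH / (2·maxH) + O gives 0.5 for equal formulas and a value between 0.5 and 1.0 when only hydrogen counts differ, and the final score (D + C)·F gives 0.5 for equal names and equal formulas when C is 1. Counting O as the sum of per-element count differences, treating two hydrogen-free formulas as equal in hydrogen, and the simple formula parser (no parentheses or charges) are likewise illustrative choices.

```python
import re
from collections import Counter

# Element symbol followed by an optional count, e.g. "C6", "H12", "Cl".
_ELEMENT = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(formula: str) -> Counter:
    """Count atoms in a flat formula such as 'C6H12O6' (no parentheses)."""
    counts = Counter()
    for symbol, number in _ELEMENT.findall(formula):
        counts[symbol] += int(number) if number else 1
    return counts


def formula_similarity(formula_a: str, formula_b: str) -> float:
    """Assumed form of F: hydrogen term in [0.5, 1.0] plus other differences."""
    atoms_a, atoms_b = parse_formula(formula_a), parse_formula(formula_b)
    min_h = min(atoms_a["H"], atoms_b["H"])
    max_h = max(atoms_a["H"], atoms_b["H"])
    # Assumption: two formulas without hydrogen count as equal in hydrogen.
    hydrogen_term = 1.0 - min_h / (2.0 * max_h) if max_h else 0.5
    other_elements = (set(atoms_a) | set(atoms_b)) - {"H"}
    other_diff = sum(abs(atoms_a[e] - atoms_b[e]) for e in other_elements)
    return hydrogen_term + other_diff


def final_score(difference: float, formula_a: str, formula_b: str,
                c: float = 1.0) -> float:
    """Assumed conjunction of Difference score and formula similarity."""
    return (difference + c) * formula_similarity(formula_a, formula_b)


if __name__ == "__main__":
    print(formula_similarity("C6H12O6", "C6H12O6"))   # equal formulas
    print(formula_similarity("C3H4O3", "C3H3O3"))     # hydrogen differs
    print(final_score(0.0, "C6H12O6", "C6H12O6"))     # equal names, C = 1
```

With C set to zero, a pair with identical names would always score zero regardless of the formulas, which is the situation the free constant is meant to prevent.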

Mapping of metabolites
The mapping of metabolites is a procedure that explicitly defines which metabolite from one reconstruction corresponds to which metabolite in the other reconstruction. Metabolite mapping between two networks can only take place after the individual comparison of metabolites.
The problem of metabolite mapping can be classified as bipartite graph matching. A matching in a graph is a subset of its edges, no two of which share an endpoint. Polynomial time algorithms are known for many algorithmic problems on matchings, including maximum matching (finding a matching that uses as many edges as possible), maximum weight matching, and stable marriage. The Difference score of each metabolite pair is used as the criterion in mapping: only pairs with the lowest difference get mapped.
The metabolite mapping algorithm solves the stable marriage problem (Gale and Shapley, 1962). The difference from the Gale-Shapley algorithm is that it is not always required or possible to produce a stable marriage between all pairs of metabolites of the two reconstructions. The task is to pair only equal metabolites, not to make sure that no one is left unpaired. In cases of uncertainty it is also necessary to keep a number of multi-engaged metabolite pairs, because it is the biologist who makes the final approval of which metabolites from one reconstruction correspond to which metabolites in the other reconstruction. The matching algorithm provides suggestions in cases where it is not possible to create a match automatically. Such cases appear quite often in real-world reconstructions. Also, reconstructions do not necessarily have an equal number of metabolites, and it is not always necessary to create a stable marriage for all metabolites even in equally sized reconstructions, because the two reconstructions may cover different parts of the genome, which overlap only to a certain degree.
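The idea of mapping only the lowest-difference pairs while keeping ambiguous candidates for manual approval can be sketched as below. This is a simplified mutual-best-match procedure written for illustration; it is neither the Gale-Shapley algorithm nor the ModeRator implementation, and the function name `map_metabolites` and the input layout (a dictionary from metabolite pairs to difference scores) are assumptions.

```python
from collections import defaultdict


def _lowest(candidates: dict) -> set:
    """Return all partners that share the lowest difference score."""
    best = min(candidates.values())
    return {partner for partner, score in candidates.items() if score == best}


def map_metabolites(scores: dict):
    """Split pairs into automatic mappings and suggestions for a biologist.

    scores -- {(metabolite_a, metabolite_b): difference score} for the pairs
              that survived the pairwise comparison.
    Returns (mapped, suggestions): `mapped` holds unambiguous one-to-one
    pairs, `suggestions` holds candidates that tie or compete for a partner.
    """
    by_a, by_b = defaultdict(dict), defaultdict(dict)
    for (met_a, met_b), score in scores.items():
        by_a[met_a][met_b] = score
        by_b[met_b][met_a] = score

    mapped, suggestions = {}, {}
    for met_a, candidates in by_a.items():
        # Keep only partners that also rank met_a among their best choices.
        mutual = sorted(b for b in _lowest(candidates)
                        if met_a in _lowest(by_b[b]))
        if len(mutual) == 1 and len(_lowest(by_b[mutual[0]])) == 1:
            mapped[met_a] = mutual[0]
        elif mutual:
            suggestions[met_a] = mutual   # multi-engaged: needs approval
    return mapped, suggestions


if __name__ == "__main__":
    example = {
        ("atp", "ATP"): 0.0, ("atp", "ADP"): 0.6,
        ("adp", "ADP"): 0.0, ("adp", "ATP"): 0.6,
        ("glc", "D-glucose"): 0.4, ("glc", "L-glucose"): 0.4,
    }
    print(map_metabolites(example))
```

Metabolites that appear in neither result are simply left unpaired, which reflects the point that the task is to pair equal metabolites rather than to pair everything.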

Comparison of reactions.
The reaction comparison algorithm not only tells whether two reactions are equal or not. It also calculates the difference: how many reactants differ on each side of the reaction.
The filtering (ignoring) of common metabolites like water and hydrogen can give an overall improvement in the comparison of reactions. However, in cases when a researcher does not know which metabolites should be ignored, a comparison that tolerates small differences is desirable.
It should be stressed that two reactions can be equal despite some missing metabolites if the reactions in the reconstruction are not balanced. The tolerant approach to missing metabolites should be taken only in cases when the reaction balance cannot be verified.
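A minimal sketch of such a comparison is given below, assuming each reaction is represented as a pair (substrates, products) of already mapped metabolite identifiers. Stoichiometry and reversibility are left out for brevity, and the names `reaction_difference`, `reactions_match`, `ignored` and `tolerance` are illustrative rather than taken from ModeRator.

```python
def _side(metabolites, ignored):
    """Return one side of a reaction as a set without ignored metabolites."""
    return set(metabolites) - set(ignored)


def reaction_difference(reaction_a, reaction_b, ignored=()):
    """Count differing reactants separately for substrates and products."""
    substrates_a, products_a = reaction_a
    substrates_b, products_b = reaction_b
    substrate_diff = _side(substrates_a, ignored) ^ _side(substrates_b, ignored)
    product_diff = _side(products_a, ignored) ^ _side(products_b, ignored)
    return len(substrate_diff), len(product_diff)


def reactions_match(reaction_a, reaction_b, ignored=(), tolerance=0):
    """Treat reactions as equal if each side differs by at most `tolerance`."""
    substrate_diff, product_diff = reaction_difference(
        reaction_a, reaction_b, ignored)
    return substrate_diff <= tolerance and product_diff <= tolerance


if __name__ == "__main__":
    # The same reaction written with and without the released proton.
    with_proton = ({"glc", "atp"}, {"g6p", "adp", "h"})
    without_proton = ({"glc", "atp"}, {"g6p", "adp"})
    print(reaction_difference(with_proton, without_proton))
    print(reactions_match(with_proton, without_proton, ignored={"h"}))
    print(reactions_match(with_proton, without_proton, tolerance=1))
```

Ignoring a known metabolite and tolerating one unknown difference lead to the same verdict here, but the second option also accepts pairs that differ in a biologically meaningful reactant, which is why it is better reserved for unbalanced reconstructions.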

Impact of metabolite similarity thresholds on the comparison of reactions.
Figure 2 shows how the number of mapped metabolites affects the number of found reactions. In this example two reconstructions of C. acetobutylicum, by Salimi and Mandal (2010) and McAnulty et al. (2012), were compared. The curves in the plot are:
- Mapped metabolites: the number of approved and mapped metabolites;
- Equal reactions: the number of found equal reactions;
- Tolerated (OR): the number of found equal reactions where one missing reactant from substrates or products is tolerated. The similarity threshold is 51% (the percentage of matching reactants);
- Tolerated (AND): the number of found equal reactions where one missing reactant from substrates and products is tolerated. The similarity threshold is 51% (the percentage of matching reactants);
- Reactions with mapped metabolites: the number of reactions containing at least one mapped metabolite;
- MPNVP reactions: the maximal possible number of common reactions (the number of reactions in the smallest reconstruction);
- MPNVP metabolites: the maximal possible number of common metabolites (the number of metabolites in the smallest reconstruction, taking compartment coverage into account);
- Manually appr. metabolites: the number of metabolites approved manually (by a biologist) after the automatic comparison and matching.
Interestingly, the number of reactions where at least one metabolite is reconciled is close to the maximal possible number of common reactions from the very beginning. However, the number of reactions that can be pinpointed as equal barely reaches 15% of the theoretically possible common reactions. Figure 2 clearly shows a number of things that have to be taken into account: automatic mapping of metabolite pairs with tolerated formulas can lead to false positive mapping of some metabolites; even knowing formulas and compartments for all metabolites does not guarantee correct metabolite matching; automatic mapping of metabolites (without manual approval) can lead to false positive results in the comparison of reactions.

ModeRator - a software tool for comparison
The software tool ModeRator has been made according to the object-oriented paradigm.
Reconstructions that are loaded into ModeRator become objects that have methods that enable their comparison with other reconstructions. The code of ModeRator is organized in many classes, but the core of the inner data model consists of just four classes: st model, metabolite, reaction and reactant.
Handling of different file formats. To compare reconstructions in different file formats, ModeRator converts them to its inner data model. Constructor classes for two importers have been implemented: for COBRA reconstructions in MS Excel spreadsheets and for SBML models. The constructors deal with the specifics of the particular file format. The use of libSBML (Bornstein et al., 2008) provides a convenient way of converting SBML models to the ModeRator inner data format. SBML files prior to Level 3 do not support the storing of chemical formulas. ModeRator can process three different nonstandard patterns of storing chemical formulas in SBML files: directly in the notes field, in a paragraph in the notes field, and in the metabolite's name field after the actual name. A special algorithm in ModeRator scans the name and notes fields, splits them by various delimiters, and tries to parse the split parts as chemical formulas. If the algorithm succeeds, it assumes that the formula has been found.
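The scan-split-parse idea can be sketched with a short heuristic. The delimiter set, the requirement that a candidate contains at least one digit, and the function name `find_formula` are assumptions made for this illustration; the actual ModeRator rules may differ, and the heuristic below will miss digit-free formulas and can be fooled by abbreviations that merely look like formulas.

```python
import re
from typing import Optional

# A flat chemical formula: element symbols with optional counts, nothing else.
_FORMULA = re.compile(r"(?:[A-Z][a-z]?\d*)+")
# Characters assumed to separate a formula from surrounding text or markup.
_DELIMITERS = re.compile(r"[\s:;,_|<>/()\[\]]+")


def find_formula(text: str) -> Optional[str]:
    """Return the first token of `text` that parses as a chemical formula."""
    for token in _DELIMITERS.split(text):
        looks_like_formula = (
            len(token) > 1
            and any(char.isdigit() for char in token)  # skip plain words
            and _FORMULA.fullmatch(token) is not None
        )
        if looks_like_formula:
            return token
    return None


if __name__ == "__main__":
    print(find_formula("FORMULA: C6H12O6"))            # directly in notes
    print(find_formula("<p>FORMULA: C6H12O6</p>"))     # paragraph in notes
    print(find_formula("D-Glucose_C6H12O6"))           # after the name
```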
In COBRA models, reactions are stored as strings in spreadsheet cells: metabolites are listed in one sheet, reactions in another, and a set of columns holds additional data. To read COBRA-compatible MS Excel files, ModeRator makes use of the Python xlrd library, so the rest of reading COBRA models involves only string processing. The importer of COBRA models deals with inconsistencies in reaction string formatting.
A peculiarity of COBRA models is that there is no list of compartments. The compartment identifier (usually a name) is indicated in a column beside other information about metabolites. Therefore the list of compartments is created dynamically while importing the list of metabolites. There can be situations where the compartment is not indicated in a dedicated column but in brackets as a part of the metabolite abbreviation, for instance, ADP[c] or H2O [m]. A workaround for such cases has been implemented in ModeRator: if there are fewer than 2 compartments, ModeRator will try to guess them from the metabolite names.
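Guessing compartments from abbreviations such as ADP[c] or H2O [m] can be sketched with a regular expression. The pattern, the function names `split_compartment` and `guess_compartments`, and the decision to accept any bracketed suffix as a compartment are assumptions for this example, not a description of the ModeRator code.

```python
import re

# Metabolite abbreviation followed by a bracketed compartment, e.g. "H2O [m]".
_SUFFIX = re.compile(r"^(?P<name>.*?)\s*\[(?P<compartment>[^\[\]]+)\]$")


def split_compartment(abbreviation: str):
    """Split 'ADP[c]' into ('ADP', 'c'); return (name, None) if no suffix."""
    cleaned = abbreviation.strip()
    match = _SUFFIX.match(cleaned)
    if match is None:
        return cleaned, None
    return match.group("name"), match.group("compartment")


def guess_compartments(abbreviations):
    """Build a sorted compartment list from bracketed suffixes."""
    found = {split_compartment(item)[1] for item in abbreviations}
    found.discard(None)
    return sorted(found)


if __name__ == "__main__":
    metabolites = ["ADP[c]", "H2O [m]", "atp[c]", "pyr"]
    print([split_compartment(item) for item in metabolites])
    print(guess_compartments(metabolites))
```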
The Graphical User Interface. The functions of ModeRator are arranged in consecutive tabs. For instance, in the first tab the user can import two biochemical reconstructions. Other tabs are dedicated to the comparison of metabolites or reactions.
Filtering of metabolites. The presence of chemically unbalanced reactions makes the identification of equal reactions across multiple reconstructions harder. An option to equalize balanced and unbalanced reactions is to filter (ignore) specific metabolites, like water and hydrogen, from all reactions.
The metabolite filtering feature is implemented in ModeRator. To filter a metabolite, the user has to find it in the list and enable filtering of the particular metabolite by placing a tick. A quick search function is also available.
Since ModeRator can also be used to generate graph drawings of the metabolism, in some cases it may be useful to filter other metabolites, like CO2, to produce a more transparent picture with fewer arrows.
Comparison of metabolites. The tab for metabolite comparison and mapping is shown in Figure 3. The metabolite name similarity and edit distance thresholds are set with graphical sliders. Phonetic-like name preprocessing can be enabled or disabled.
The GUI allows the user to set various thresholds, like metabolite name similarity, allowed edit distance, the tolerance of formulas and filtering by compartments. The author's proposed Difference score is used to weight matched metabolite pairs. Depending on the size of the reconstructions, the comparison settings and the user's computer, the comparison process can take from a few seconds to several minutes.
The user with biological knowledge makes the final decision, by ticking the first column, on whether automatically matched metabolite pairs truly are equal metabolites. Usually manual curation and help from colleagues is needed. For that reason the user can export the results of automatic metabolite matching to a CSV file. After manual curation of the automatically matched metabolites, the user has to apply the metabolite mapping by pressing the Apply button.
Comparison of reactions. There are two methods for the comparison of reactions.
Comparison by metabolite IDs compares reactions based on metabolite mapping or internal identifiers. Comparison by metabolite formulas is applicable in cases when it is not possible to match metabolites by their names, but metabolite formulas are available in both reconstructions.
Fig. 4: Acceptable and not acceptable differences between reactions.
ModeRator can match reactions where some metabolites are not mentioned in the equations (see Figure 4). The missing metabolites tolerance settings allow the user to set the maximum number of allowed missing metabolites for each side and the overall tolerance limit. The Limit restricts the tolerance to reactions of a certain length, where the length is the number of involved reactants. By default the overall Limit is set to "2", thus excluding transport reactions and other short reactions from the influence of the tolerance settings.
Software dependencies and availability. The software has been tested on a number of free operating systems including Ubuntu 12.10, Fedora 21 and Debian 7. ModeRator can be downloaded from the Biosystems Group homepage http://biosystems.lv/moderator2/. The website also provides sample files and documentation. ModeRator is written in Python.
The recommended method for new users willing to avoid manual installation of all dependencies is to use ModeRator in a virtual environment. The download page provides an OVA package containing xUbuntu with the latest ModeRator pre-installed. OVA files can be opened with virtualization software, like Virtualbox and VMware, on all major operating systems.

Use cases of ModeRator
The settings of ModeRator have to be adapted for particular cases depending on the type and quality of available information about reconstruction elements. Therefore the ModeRator settings for particular use cases are as different as the reconstructions are. Four pairs of biochemical network reconstructions were compared in three use cases. Different sets of information were available in the reconstructions, hence different comparison settings were used. The meaning of the settings is as follows:
- Names: metabolites were compared by names;
- Identifiers: metabolites were compared by internal identifiers;
- Formulas: metabolite formulas were available and were used for filtering after comparison by names;
- Compartments: information about compartments was available and was used for filtering after comparison by names;
- Variable charge: a different number of hydrogen atoms was tolerated when comparing formulas;
- Phonetic: phonetic-like preprocessing of metabolite names was used;
- By mapped mets.: the comparison of reactions was based on the mapping of metabolites;
- By formulas: the comparison of reactions was based on the chemical formulas of reactants;
- Tolerated react.: reactions with a certain number of missing reactants were considered similar;
- Ignored mets.: specific metabolites, like water and hydrogen, were ignored during the comparison of reactions.

Curated comparison of metabolites
Metabolites in two reconstructions of E. coli containing 1314 and 1704 metabolites were compared by their identifiers. Manual curation was necessary for 31 pairs with non-equal names. In total, 30 metabolite pairs were approved during manual curation. One pair was left without a decision because the available information was not enough for the biologist to confirm or deny the identity of the metabolites.
Metabolites in two reconstructions of S. cerevisiae containing 1063 and 681 metabolites were compared by metabolite names. The threshold for the similarity ratio was 68% and the threshold for the edit distance was 15 edits. As the phonetic-like preprocessing and the difference score were not used, the comparison in this use case was essentially based on the similarity ratio. 723903 metabolite pairs were processed by computer. 400 metabolites were matched automatically. 376 (out of 447) metabolites were mapped after manual curation (Mednis and Aurich, 2012).
In this use case ModeRator version 2.5.5 was used.

Comparison of reactions after mapping of metabolites
In this use case metabolite pairs were weighted using the difference score, and the edit distance threshold was set to 100 allowed edits. Phonetic-like preprocessing of metabolite names was enabled in some experiments. In this use case, the biologist approved 46 C. acetobutylicum metabolite pairs with name similarity under 50%, including 24 metabolite pairs with similarity under 30%, of which 3 pairs had similarity under 15%. This leads to the conclusion that it is very difficult to set a reasonable threshold for name similarity, because the same metabolite may have quite different names. However, such cases are a small part (<5%) of all metabolite pairs that were automatically approved.
The similarity settings should be balanced against the costs of false-positive cases. If the comparison results are of high importance, the threshold settings should be set at a level where all pairs, even those with low similarity, are analyzed by a biologist, who spends more time but gains better confidence in the results. Even at low threshold settings of ModeRator the number of pairs to be compared is heavily reduced and the data is well prepared for analysis by a biologist.

Comparison of reactions skipping metabolite mapping
Two genome-scale reconstructions of Zymomonas mobilis by Lee et al., (2010) having 615 metabolites and 600 reactions, and Widiastuti et al., (2011) having 773 metabolites and 747 reactions were compared.
The peculiarity of this pair of reconstructions is the lack of some metabolite names as well as the use of several synonyms describing the same metabolite. That makes the use of metabolite names problematic. On the other hand, both reconstructions have formulas. This use case demonstrates the opportunity to compare reactions while skipping metabolite comparison, which can be done to get a fast overview of similarity. 448200 reaction pairs were processed by computer. Depending on the comparison settings, 93 to 277 reactions were matched automatically.
The use case demonstrates the flexibility of ModeRator software enabling direct reaction comparison skipping the metabolite comparison step.This kind of approach is reasonable only in case of corresponding peculiarities of data when there is limited information about metabolites while reactions are described in good quality.
Different confidence levels can be reached taking into account enzyme numbers, which, in combination with other data may give strong confidence about identity of reactions.Still, even there the same enzyme can catalyze several similar reactions.Variations of comparison parameters like reversibility of reactions and ignoring of water and hydrogen can change the comparison results significantly.
ModeRator version 2.2 with command-line interface (Mednis et al., 2012) was used in this use case.

Application of model comparison for determination of consensus level of models
Automated generation of an intersection of two models, combined with its structural analysis (Rubina and Stalidzans, 2010; Rubina and Stalidzans, 2012), can give a fast indication of the agreement level between metabolic models of a particular organism in their overlapping part. The creation of an intersection is one of the functionalities of ModeRator. Some of the structural parameters can be used to measure the agreement level between the models by analysis of their intersection. Intersection analysis of the model pairs compared in subsections 5.1 and 5.2 illustrates two different cases: a high agreement intersection model in the case of E. coli and a low agreement intersection model in the case of S. cerevisiae. The reason for the high agreement of the E. coli models is the fact that they are built by the same group of researchers and both models reflect the development of the E. coli models of that particular group (Rubina et al., 2013). The applicability of some structural parameters for the determination of the agreement level has been analyzed using the software BINESA (Rubina and Stalidzans, 2013). A low agreement level of a model pair, resulting in a fragmented, poor quality intersection model, can be indicated by low values of average degree, average incoming degree, average outgoing degree and average number of neighbors. A low agreement of the model pair can also be recognized by the distribution of the incoming and outgoing degrees of the metabolites: a high percentage of low inter-connectivity metabolites and a low percentage of hubs (more than ten links).

Conclusions
- Additional identifying information about reconstruction elements can be used to strengthen or weaken an automatic decision about the equality of two elements. However, the sets of additional information rarely overlap.
- The approach of fuzzy string comparison works well with long metabolite names. Lowering the similarity threshold increases the risk of false positives.
- Certain symbols in metabolite names can have a different impact on the biological meaning. For instance, special characters do not play a significant role in the meaning of a particular metabolite name, while numbers can change the biological meaning completely.
- In some cases it is still possible that the proposed algorithm returns multiple mapping links for the same metabolite, because no single pair of metabolites has the lowest difference score. Such cases require manual curation and approval.
- Tolerance for variable formula charge improves the chances of finding truly equal metabolites.
- Automatic mapping of metabolite pairs with tolerated chemical formulas can lead to false positive mapping of some metabolites.
- Even knowing the formulas and compartments of all metabolites does not guarantee correct metabolite matching.
- Automatic mapping of metabolites (without manual approval) can lead to false positive results in the comparison of reactions.
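Several of the points above (fuzzy comparison with a threshold, the different weight of special characters and numbers, and ties that require manual curation) can be shown in a minimal sketch. It is an illustration under stated assumptions, not the ModeRator algorithm: the function names, the default threshold of 0.8 and the rule that rejects candidates whose numbers differ are all hypothetical.

```python
import re


def levenshtein(a, b):
    """Classic edit distance: insertions, deletions and substitutions cost 1."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def simplify(name):
    """Lower-case the name and drop special characters, keeping letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def digits(name):
    """Numbers in a name, in order; they are treated as biologically significant."""
    return re.findall(r"\d+", name)


def similarity(a, b):
    a, b = simplify(a), simplify(b)
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def best_matches(name, candidates, threshold=0.8):
    """Return every candidate tied for the best score; more than one needs curation."""
    scored = [(similarity(name, c), c) for c in candidates if digits(c) == digits(name)]
    scored = [(s, c) for s, c in scored if s >= threshold]
    if not scored:
        return []
    top = max(s for s, _ in scored)
    return [c for s, c in scored if s == top]


unique = best_matches("D-Glucose 6-phosphate", ["D-glucose-6-phosphate", "D-glucose-1-phosphate"])
ambiguous = best_matches("alanine", ["L-alanine", "D-alanine"])
```

Here the two glucose phosphates differ by a single character and would pass a pure similarity threshold, so the number check is what separates them, while the query "alanine" ties between both stereoisomers and is left for manual approval.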
Despite the efforts of leading databases, like MetaCyc and KEGG, to make comparison of biochemical networks (Altman et al., 2013) by element names obsolete, the interest of other researchers in using string similarity metrics for the comparison of metabolite names (Qi and Ozsoyoglu, 2013; Qi et al., 2014; Thavappiragasam et al., 2014) has been growing. The following future developments can be proposed.
One of the directions of further research in the comparison of biochemical reconstructions is better recognition of common reactions. Computer-aided matching of metabolites is a good start, but apparently not enough to reliably find common reactions. This is indicated by the low number of identified common reactions. A reason for this could be that truly common reactions may contain identified common metabolites along with unidentified ones. The current version of ModeRator can report such possibly common reactions; however, lowering the similarity threshold increases the number of false positives.
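The distinction between common and possibly common reactions can be sketched as follows. This is a hypothetical decision rule written for illustration, not the rule used by ModeRator: reactions are reduced to sets of participating metabolite identifiers, `mapping` is an assumed dictionary from identifiers of the first model to identifiers of the second, and the labels are invented names.

```python
def classify_pair(reaction_a, reaction_b, mapping):
    """Label a reaction pair using a metabolite mapping from model A to model B."""
    mapped_targets = set(mapping.values())
    # Metabolites of A that have a mapped counterpart, translated to B's identifiers.
    translated = {mapping[m] for m in reaction_a if m in mapping}
    unknown_a = {m for m in reaction_a if m not in mapping}
    # Metabolites of B that are the target of some mapping, and those that are not.
    known_b = {m for m in reaction_b if m in mapped_targets}
    unknown_b = set(reaction_b) - known_b
    if not translated or translated != known_b:
        return "different"
    if not unknown_a and not unknown_b:
        return "common"
    # Identified metabolites agree; the rest could be unmatched names of the same compounds.
    if len(unknown_a) == len(unknown_b):
        return "possibly common"
    return "different"


mapping = {"glc": "glucose", "atp": "ATP", "adp": "ADP"}
hexokinase_a = {"glc", "atp", "g6p", "adp"}
hexokinase_b = {"glucose", "ATP", "G6P", "ADP"}
partial = classify_pair(hexokinase_a, hexokinase_b, mapping)
full = classify_pair(hexokinase_a, hexokinase_b, {**mapping, "g6p": "G6P"})
```

In this example one unmapped metabolite on each side keeps the pair at "possibly common" until the missing mapping is added, which mirrors the need for manual curation discussed above.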
Another direction of further development is to solve the problem of compartmentalization at different scales. Most anatomical compartments are separated from each other by phospholipid membranes. In a simpler reconstruction, for example, the fluids of the body are divided into two compartments: fluids in cells and fluids outside cells. In a more detailed reconstruction, cells themselves may have internal compartments, such as the nucleus, mitochondria, Golgi apparatus or cytosol. This problem of reconstructions whose compartments differ in granularity is to be addressed in future releases of the software tool ModeRator.

Fig. 2: Impact of mapped metabolites on the comparison of reactions

Table 1: Example of similarity ratio and distance variations for different strings. Edit distance algorithm by

Table 2: Two examples showing raw and phonetically processed metabolite names.

Table 3: Examples of similarity value for different chemical formulas.
Table 4 lists all used reconstructions and comparison settings.

Table 4: Summary of comparison settings depending on the use case

Two reconstructions of S. cerevisiae. In this use case the same reconstructions of S. cerevisiae (see Use case 5.1) were compared. Unlike the previous use case, in this use case six comparison experiments with different settings were performed. 723903 metabolite pairs were processed by computer. Depending on the comparison settings, 248 to 473 metabolites were matched automatically. 1146272 reaction pairs were processed by computer. Depending on the comparison settings, 68 to 218 reactions were matched automatically.

Two reconstructions of Clostridium acetobutylicum. Two reconstructions of C. acetobutylicum by Salimi and Mandal (2010) and McAnulty et al. (2012), containing 1134 and 707 metabolites and 1105 and 794 reactions, were compared. 801738 metabolite pairs were processed by computer. Depending on the comparison settings, 85 to 487 metabolites were matched automatically. 450 metabolites (out of 564, including metabolites with identical names) were mapped after manual curation by Dr. biol. Armands Vgants. 877370 reaction pairs were processed by computer. Depending on the comparison settings, 109 to 449 reactions were matched automatically.