Computational Identification of Metabolic Pathways of Plasmodium falciparum using the k-Shortest Path Algorithm

Plasmodium falciparum, a malaria pathogen, has shown substantial resistance to treatment coupled with poor response to some vaccines thereby requiring urgent, holistic, and broad approach to prevent this endemic disease. Understanding the biology of the malaria parasite has been identified as a vital approach to overcome the threat of malaria. This study is aimed at identifying essential proteins unique to malaria parasites using a reconstructed iPfa genome-scale metabolic model (GEM) of the 3D7 strain of Plasmodium falciparum by filling gaps in the model with nineteen (19) metabolites and twenty-three (23) reactions obtained from the MetaCyc database. Twenty (20) currency metabolites were removed from the network because they have been identified to produce shortcuts that are biologically infeasible. The resulting modified iPfa GEM was a model using the k-shortest path algorithm to identify possible alternative metabolic pathways in glycolysis and pentose phosphate pathways of Plasmodium falciparum. Heuristic function was introduced for the optimal performance of the algorithm. To validate the prediction, the essentiality of the reactions in the reconstructed network was evaluated using betweenness centrality measure, which was applied to every reaction within the pathways considered in this study. Thirty-two (32) essential reactions were predicted among which our method validated fourteen (14) enzymes already predicted in the literature. The enzymatic proteins that catalyze these essential reactions were checked for homology with the host genome, and two (2) showed insignificant similarity, making them possible drug targets. In conclusion, the application of the intelligent search technique to the metabolic network of P. falciparum predicts potential biologically relevant alternative pathways using graph theory-based approach.


Introduction
Malaria remains one of the leading global health challenges with about 216 million cases and more than 445,000 deaths recorded in 2016. According to the World Health Organization (WHO), 88% of these deaths occurred in Africa [1] and Plasmodium falciparum accounted for the majority of the cases. P. falciparum has developed resistance to all antimalarial medications including the most potent one-artemisinin [2][3][4]. Some genetic changes and metabolic alterations confer survival advantage that enables the parasite to evade drug effects thereby ensuring its survival. However, there is still a poor understanding of the processes utilized by P. falciparum to evade drug effects, which also hamper vaccine development [5]. The incomplete knowledge of the metabolic pathways of P. falciparum has also been identified as a major impediment towards the development of an effective treatment [6]; hence, the need for intensified research that is aimed at understanding better the parasite biology. Computational approach, which predicts metabolic networks and possible alternative pathways pertinent to the parasite's survival, can identify candidate drug targets to eliminate the disease globally. Some previous studies have been done in this regard. Figure 1 explains the various paths (a), pathways (b), and complete network constructions (c) in metabolic pathways.
The identification of these undiscovered pathways within the metabolism of the malaria parasite entails enumerating not only the shortest path within the metabolic pathway but also the list of other feasible paths within a source compound and a target compound. This could be regarded as an optimization problem. The metabolic system or metabolism of a specific cell or an organism is the entire system of metabolic reactions of the cell or organism. A metabolic pathway is an associated subsystem of the metabolic system either as a representation of particular processes or characterized by functional boundaries, e.g., the system between a glucose (initial substance) and a pyruvate (final substance). The k-shortest path technique enables the optimal and suboptimal metabolic pathways to be identified within the metabolic network [7].
Croes et al. [8] presented a reaction to the representation of the metabolic network in order to identify relevant pathways in a biological network. However, the path finding algorithm applied was unable to handle a large dataset. Faust et al. [9] combined a random walk-based reduction of the graph with the shortest path-based algorithm and applied on a yeast metabolic network. This approach is computation-ally intensive due to several runs by the shortest path algorithm used. Oyelade et al. [10] applied a colour coding algorithm to search for minimum pathways in the P. falciparum interaction network. They discovered "identified" and "unknown" genes and signal transduction pathway involved in metabolic activities of P. falciparum. However, this approach is only capable of obtaining the shortest signal pathway. Also, in another work of Oyelade et al. [11], the qualitative Petri net model for the glycolysis pathway in P. falciparum was built and analysed for its structural and quantitative properties using the Petri net theory which only give insights into the complex net behavior of the pathway. Essential reactions in the metabolic network of P. falciparum have been predicted using several computational techniques such as in silico knockout screening [12], load and choke point [13], centrality measures [14], flux balance analysis [15], and machine learning approach [16] etc.
The goal of this study is to computationally identify branching metabolic pathways of P. falciparum using the enhanced k-shortest path technique with improved prediction precision. This technique was used in our previous work [17] although limited to finding only alternative paths in selected metabolic pathways. In this work, we obtained the genome-scale metabolic model of the 3D7 strain of P. falciparum from a previous study [18]. The dataset contains 325 genes and 670 metabolic reactions. The underlying  architecture of the metabolic network was modelled using a reaction graph where reactions are designated as the nodes and two reactions are neighbors, if the product of a reaction is the substrate of the other. Compound graph representation is not suitable for path finding algorithms because a reaction can have more than one compound as both input and output, e.g., A + B ≥ C + D, thereby leading to a bunch of redundant edges. The existence of two different types of nodes in a bipartite graph makes it complex to traverse by the path finding algorithm. The choice of a reaction graph was informed by the aim of this study, which is to identify more alternative metabolic pathways, and the pathways are known to be chains of reactions and also due to the applicability of path finding algorithms which requires a single type of node to compute the shortest paths within the network. The T * algorithm, the k -shortest path technique developed by Kadivar [19], was adapted to extract the shortest paths from a defined source node to a target node. Only two metabolic pathways were considered for the alternative path analysis in this work due to the overlapping nature of the pathways that made it difficult to identify source and target reactions relating to a specific pathway.
We set the value of k of the path finding algorithm to equal five (5) because as k increases, the biological relevance of the path reduces. The identification of essential enzymes in a particular network allows a possible drug target to be identified [16,[20][21][22]. Therefore, in drug development, essential enzymes are generally recognized as perfect drug target candidates for potential new drugs and vaccines to treat and prevent diseases since their deletion from a network can compromise its integrity [21,22]. Hence, the essentiality of reactions was carried out using the betweenness centrality measure to determine essential reactions within a reconstructed network and subsequently predicted 32 essential enzymes.

Materials and Methods
2.1. Reconstructing the Metabolic Network. Metabolic reaction data were obtained from Chiappino-Pepe et al.'s study [18]; also 19 metabolites and 23 reactions were obtained from the MetaCyc database [23] to fill gaps in the iPfa GEM (see appendix A for the list of reactions obtained from the Meta-Cyc database). In a metabolic network, there are several metabolites that are commonly involved in reactions that cause shortcuts without biological meaning when computing paths in a simple graph [24]. These metabolites are referred to as pool metabolites or currency metabolites such as proton and water. Most of them often produce biological misinterpretations due to the artifactual links between nodes.
Kim et al. [25] identified twenty-five (25) currency metabolites out of which we eliminated twenty (20) from the network before reconstructing the graph. The remaining five (5) currency metabolites identified by [16] were left in the network since their presence does not have a significant effect that could lead to shortcuts without biological meaning, and a uniform weight is assigned to all edges in the network.
2.2. The Algorithm. The T * algorithm was selected among the other k-shortest paths because of its superior computational performance (as shown in Table 1). This algorithm requires topological sorting of the graph nodes as its input, which is only possible for a directed acyclic graph. Figure 2 shows the algorithm flow chart of the metabolic network construction.
Given that P i = P 1 i , P 2 i , ⋯, P k i as the set of the k i -shortest paths from s to node i and L P 1 i ≤ L P 2 i ≤ L P k i , where L P denotes the length of path P. Let i, j ∈ A, because of topological ordering and since order (j) is larger than the other nodes in all si paths; P l i i, j is a loopless path for each l ∈ 1, 2, ⋯, k i . Therefore if P s = ∅, then P i = j∈A− i P ϵ P|j| P j, i (stated in Algorithm 1).
The T * algorithm was modified in two significant ways: Firstly, we eliminated the first stage (topological ordering) due to the nature of the dataset where several reactions within the dataset could be on the same level in the topological order thereby varying the results from each run of the algorithm. For instance, if reactions A, B, and C have the same topological order, then it implies that at different runs of the algorithm, reaction A could come first or reaction B or reaction C, which varies the result of the algorithm at different runs. The metabolic network of P. falciparum contains over six hundred compounds and over a thousand reactions. The inconsistency inherent in the ordering of the reactions will have severe biological implication for the eventual shortest paths generated by the algorithm. For instance, if the order of a vital reaction within the pathway is less than that of the source compound for a particular pathway, then the reaction will be omitted from the shortest paths because the algorithm starts traversing from the source node to the target node in the topological order. Secondly, the topological ordering is not applicable to networks with a cycle since metabolic networks are known to contain loops; hence, we introduced heuristic function to enhance the prediction precision of the alternative paths. Figure 3 shows the modified T * algorithm for metabolic network construction. The reactions in the annotated pathway were retrieved and used to guide the search of the algorithm to list biological feasible paths. Therefore, the modified T * algorithm 2 is given. Table 1: Time complexity of T * , K * , Yen, Feng, EA, and LVEA algorithms. As shown in the Table 1, the T * algorithm has better computational performance in terms of running time when compared to others [19].

Algorithm
Time complexity The concept of reward shaping was also introduced to the k-shortest path algorithm to improve the quality of biologically relevant alternative path prediction. Reward shaping is the addition of an extra reward signal that encodes some heuristic knowledge of the system designer or domain expert, thus encouraging the learning agent to explore parts of the state space believed to contain good solutions [26][27][28]. In practice, the choice of reward function is intuitively selected [29] and this has been successfully applied to speed up reinforcement learning techniques in complex domains [30,31]. The reward R k of each feasible path between a source reaction and target reaction is computed using the number of reactions within the path that can be found in the set of annotated reactions R A . An annotated pathway was obtained from the KEGG database. The justification for this model is based on the assumption that the annotated pathway for a metabolic process represents the shortest path; hence, the next shortest alternative path is most likely to have branched from the annotated path thereby containing most of the reactions in annotation. The reactions in the feasible path were compared to the annotated path to determine the number of intersecting reactions. However, the total cost of a path is a function of its length and the number of annotated reaction contained in the path.
where M k represents the initial reward for path k, c r i represents the coefficient for reaction (r i ) which has a constant value of 1. x i will assume the value of 1 if r i is present in the set of annotated reaction for that pathway and 0 if otherwise.
In addition to the objective, obtaining biologically relevant paths is also to obtain the shortest path; hence, a penalty function is introduced which is meant to reduce the reward of paths based on their length; this is represented in where t is a "balancing factor" which is an arbitrary value that is suitable to provide a penalty value that balances the   , j ∈ A─ (i)) 7: Make min-priority queue MQ(i) by P 1 j ∈ P j for j ∈ A─(i) according to L P 1 j + c ji values. 8: while MQ i ≠ ∅ and |P i | < k do 9: Move P at root of MQ(i) to P[i] and replace the moved item by its neighbour in the source list from which it came and update the nodes values traversing up in MQ(i) along the path of the moved item. 10: end while Algorithm 1 positiveness of annotated reactions present in the path and the disadvantages of the path length.
The reward for a path k is given as the difference between its initial reward and penalty score as shown in In this study, we chose t to be 0.5 for a fair penalty score and the adjusted cost is the difference between the number of reaction in the path and the reward. We then applied the modified T * algorithm to the glycolysis and pentose phos-phate pathway of P. falciparum to obtain the k-shortest paths for k = 5.

Results
We then applied the modified T * algorithm to the glycolysis and pentose phosphate pathway of P. falciparum to obtain the k-shortest paths for k = 5. Table 2 shows the number of reactions present in a predicted path and its computed cost. The five alternative paths identified by our method for each of the target metabolic pathways   were further verified manually to identify artifacts and biologically plausible paths. Two alternative paths were identified to be a feasible path for metabolic activities in the pentose phosphate pathway. One alternative path was identified in the glycolysis pathway, which could serve as an alternative path for metabolic activities in the pathway. The reactions represented in gold in Figures 4 and 5 indicate reactions identified by this study to have created alternative paths in the respective metabolic pathways. The reaction represented in blue in Figure 5 represents a nonannotated reaction identified by this study to create an alternative path glucose 6-P to glycolysis. For the glycolysis pathway, we predicted a pathway, which could provide an alternative path for the generation of pyruvate. This metabolic reaction involves alpha-D-glucose-1-phosphate uridylyltransferase. The predicted alternative paths for the glycolysis pathway are presented in Figure 4.
The predicted alternative paths for the pentose phosphate pathway are presented in Figure 5. Two (2) additional meta-bolic reactions were predicted to have created an alternative path in the pathway; they are D-glyceraldehyde-3-phosphate glycolaldehyde transferase and sedoheptulose 7-phosphate 1phosphotransferase reactions, respectively.

Essential Reaction Prediction.
To verify the predicted pathways, a test of the essentiality of these genes was done on the predicted pathways. Betweenness centrality measure was applied to the reconstructed metabolic network to each reaction in the network to determine the essentiality of a particular metabolic pathway that is critical for the production of a target compound identified. The betweenness centrality method is stated as follows: where b ik p j is the proportion of the shortest path linking p i to p k that contain p j , g ik p j is the number of the shortest path that contain point p j as an intermediary in the shortest path from p i to p k , and g ik is the number of the shortest path from p i to p k .
where d ij is the pair dependency which represents the degree to which a point, p i , must depend upon another, p j .
where C B p j is the betweenness point centrality.
The genes that coded the enzymes responsible for the essential reactions discovered were obtained and blasted against the human genome to evaluate the existence of homology in the host. We used a centrality score of 0.25 as the level of significance because the centrality score of some of our predicted reactions that validated the gold standard set of essential enzymes and other predicted ones in literature has a centrality score of approximately 0.3. Table 3 shows the reactions with a centrality score from 0.25 and above as the predicted essential reactions. A total of 32 reactions were predicted as essential, out of which 13 have been validated in the literature.    [18] particular pathway while Figure 7(b) represents the pathways that the betweenness centrality measure has been applied. The biggest nodes in Figure 7(b) denote the predicted essential reactions of the pathway.

Discussion
Several P. falciparum metabolic network reconstructions have been done previously, but this study is aimed at predicting alternative metabolic pathways for energy generation in the malaria parasite via the pentose phosphate and glycolytic reactions using the k-shortest path algorithm. The essentiality of these reactions to the survival of these parasites was also evaluated in the predicted models (Table 3) and genes with no human homologs identified (Table 4). Sequence alignment was performed for all the 32 enzymes to identify the existence of homology; two of the enzymes show insignificant similarity with the host. Energy production in apicomplexans involves both the aerobic and anaerobic pathways, and this varies from species to species [32]. P. falciparum is a fast growing organism in the human erythrocyte. Therefore, in order to meet its energy need for growth and cell division, it relies on anaerobic oxidation of glucose to generate ATP. This is achieved through a series of enzyme-catalyzed breakdown of glucose to pyruvate, a metabolic process otherwise called the glycolysis or Embden-Meyerhof-Parnas pathway [33]. This process does not require oxygen to generate ATP, but results in the accumulation of pyruvate. While the main purpose of glycolysis is to generate ATP, one of its intermediates, glucose-6-phosphate, is the precursor for the pentose phosphate pathway [34]. This pathway serves an important purpose of generating NADPH and pentose  Similarly, different stages of malaria parasites have been reported to have different energy requirements while P. falciparum-resistant strains have been shown to alter their genetic and metabolic pathways, in order to utilize nutrient requirements for fitness and survival in a drug environment. A recent study identified metabolic changes in mutant parasites, which shows differential transport and utilization of nutrients, as shown in computational prediction that incorporates metabolomics data from sensitive and resistant isolates [35]. Malaria parasite also lacks capability to store carbohydrates [36,37]; it constantly requires the production of this through glucose and can utilize other carbohydrate precursors scavenged from a host such as UDP-glucose; hence, the identified alternative metabolic pathways that P. Falciparum utilizes for survival are crucial for curbing drug resistance and can be targeted as vaccine candidates to reduce malaria transmission. Variable regulation of host UTP-glucose-1-phosphate uridylyltransferase (UDP-glucose phosphorylase), which catalyzes energygenerating reactions such as galactose metabolism and glycogen synthesis, has been reported [34] thereby making it a good target. This was identified as an essential pathway in glycolysis in the present study. Other predictions from this study corroborate the findings from other studies; these are highlighted in Table 3. Amino acid deprivation can alter the growth of P. falciparum in vitro; however, it may not completely kill the parasite [38]. This regulation of nutrient requirement to cope with deficit is one of the many survival mechanisms already identified in the parasite. However, other essential amino acid pathways identified can be targeted together to achieve a detrimental effect on the parasite's survival. Our study identified gamma-L-glutamyl-L-cysteine:glycine ligase as an essential reaction for glycine, serine, and threonine metabolism. This finding has been previously reported in other studies [18,39].
Similarly, our study predicted a metabolite-sedoheptulose-1,7-diphosphate-from the breakdown of xylulose, in the pentose phosphate pathway, which may be utilized by the parasite to scavenge for energy. Only sedoheptulose-7-phosphate:D-glyceraldehyde-3-phosphate glycolaldehyde transferase was predicted as an essential reaction within the pentose phosphate pathway. P. falciparum essentially synthesize nucleotides de novo. Hence, it cannot salvage pyrimidines from an extracellular environment. This biosynthetic pathway has been identified as a good target for malaria control [45]. For the purine and pyrimidine pathways, five and four reactions were predicted as essential and have been reported in other studies (see details in Table 3). The implications of these hypothesis are that in addition to other predicted pathways, if biologically validated through disruption of genes, encoding the enzymes for these pathways in P. falciparum will further elucidate survival and alternate energy generation pathways in the parasite. The hypothesized potential pharmacological enzyme targets essential for the parasite can be targeted to control malaria infection globally.

Conclusions
In this work, we have been able to predict alternative metabolic paths in the glycolysis and pentose phosphate pathways of P. falciparum. We predicted two (2) essential proteins in the glycerophospholipid, purine, pyrimidine, and glycolysis metabolic pathways (Table 4) without homology with a host. With the use of heuristic function to enhance the k-shortest path algorithm, we have been able to identify potential biologically relevant paths using a computational graph-based technique which hitherto is less utilized due to its very high false positive result. Biologically targeting these candidate proteins in Plasmodium is recommended to improve understanding of the predicted alternative pathways and the precision of the graph theory-based method.

Data Availability
The datasets used in this study are available in http://lcsbdatabases.epfl.ch/pathways/Gems.

Conflicts of Interest
The authors disclose no potential conflicts of interest.