Inference on chains of disease progression based on disease networks

Motivation Disease progression originates from the concept that an individual disease may go through different changes as it evolves, and such changes can cause new diseases. It is important to find a progression between diseases since knowing the prior-posterior relationship beforehand can prevent further complications or evolutions to other diseases. Furthermore, the series of progressions can be represented in the form of a chain, which enables us to readily infer successive influences from one disease to another after many passages through other diseases. Methods In this paper, we propose a systematic approach for finding a disease progression chain from a source disease to a target one via exploring a disease network. The network is constructed based on various sets of biomedical data. To find the most influential progression chains, the k-shortest path search algorithm is employed. The most representative algorithms such as A*, Dijkstra, and Yen’s are incorporated into the proposed method. Results A disease network consisting of 3,302 diseases was constructed based on four sources of biomedical data: disease-protein relations, biological pathways, clinical history, and biomedical literature information. The last three sets of data contain prior-posterior information, and they endow directionality on the edges of the network. The results were interesting and informative: for example, when colitis and respiratory insufficiency were set as a source disease and a target one, respectively, five progression chains were found within several seconds (when k = 5). Each chain was provided with a progression score, which indicates the strength of plausibility relative to others. Similarly, the proposed method can be expanded to any pair of source-target diseases in the network. This can be utilized as a preliminary tool for inferring complications or progressions between diseases.


Introduction
An individual disease may go through different changes as it evolves, and such changes can cause new diseases [1]. The concept of progression between diseases came across as a form of complications or sequelae from a disease. For example, insulin resistance can cause diabetes mellitus type 2 [2], and it can further lead to chronic kidney disease [3,4]. It is important to find disease progressions since knowing the prior-posterior relationship beforehand can prevent further complications or progressions to other diseases. Progression between diseases has long been studied through cohort verification between two diseases [2,[5][6][7]. Although the results provided valuable information concerning disease progressions, the time and cost to obtain the results were rather expensive. In recent years, researches have been conducted to find causality between diseases by utilizing diversified biomedical data. In [8], the authors proposed a model for causality using gene/protein, clinical, and metabolic pathway information to construct a disease causality network. In [9], a text mining approach was employed on biomedical literature to construct a causal disease network. Note that causality of disease in these researches stands for potential progression/evolution of various diseases, not the underlying causes of diseases.
Extending from the concept of disease progression, there may be a continuous path or series of diseases that reflect a prior-posterior relation between two diseases. For instance, it is difficult to find a prior-posterior relation between colitis and respiratory insufficiency if we simply look at them directly. However, colitis can progress acute kidney injury, which can lead to polyuria, then to hyponatremia, and finally to respiratory insufficiency. In this paper, we define such a chain of relations among diseases as a disease progression chain. As in the case of colitis and respiratory insufficiency, a disease progression chain can extract prior-posterior relations between two diseases that are seemingly unrelated on the surface.
There were some previous studies concerning diseases in perspectives of chain [1,[10][11][12][13][14]. They defined a causal chain that focused on pathological viewpoints in which diseases are caused by the series of risk factors such as abnormal states, symptoms, or lifestyles. In this research, instead of taking a pathological viewpoint, we attempt to find a disease progression chain in the view of pan-disease, which relates to various diseases with prior-posterior relations. From the clinical viewpoint, disease progression chains can be beneficial for developing continuity of treatment by retrospectively tracing back through the series of diseases. Furthermore, disease progression chains yield a wider angle of disease-related genes (proteins) to consider with the inclusion of diseases composing the chain. The chains may uncover other disease-related genes associated with a specific disease through series of relationships that can be helpful for drug discovery or repositioning. In [15], the authors attempted to detect new target disease of drug considering similar side-effect between diseases, and Chiang and Butte [16] suggested the use of same drug for diseases with similar therapies in terms of disease relationships.
One effective and efficient approach to finding a disease progression chain is to utilize network modeling. Using a network analysis of data has the advantage of scrutinizing relations between data from a more comprehensive and systematic point of view. In constructing a disease network, nodes represent disease, and edges represent genetic, biological, pathological, epidemiological, or other relations between diseases [17][18][19][20][21]. Many researches that utilizing disease network analysis have been carried out in biomedical field to such as establishing genotypes and phenotypes of diseases [22,23], identifying disease-related genes [24] and drug target genes [25], and repurposing drugs [26].
In addition, if we extend associative relations to prior-posterior relations between diseases, the network is a directed network that can be regarded as a disease progression network. From the constructed network, a simple approach to find a disease progression chain is to manually track the diseases by considering all the possible cases. This approach, however, is very unrealistic owing to a very high number of cases to consider. One possible solution to this issue is to use a shortest path search algorithm, a well-established approach in network (graph) theory. The shortest path search algorithm is a method of searching the path with the lowest cost (or the highest profit) from the source node to the target node. Given a directed network (graph) with nodes and weighted edges, it finds the best single path from various possible paths between two nodes, the shortest path if the edge-weights are defined as distances or the strongest path if edge-weights represent similarities. Applied to disease progression networks, it finds the most probable paths that the largest number of genes shared by diseases, the most frequently co-occurring diseases from clinical information, and the most confirmative relationship of diseases from related researches. Each of networks in the order will be presented in the sub-sections of construction of disease progression networks. The most representative algorithms [27] are the A � algorithm [28], Dijkstra algorithm [29], and Floyd-Warshall algorithm [30,31]. These algorithms, however, concentrate on finding the shortest path, whereas there could be other "short" paths that contain meaningful information. For this research, we employ the idea of the k-shortest path search algorithm, which was first suggested by Yen [32]. The original Yen's algorithm first finds the shortest path and searches for k-1 consecutive shortest paths by eliminating the edges of the shortest path. Many researches have been carried out to improve the complexity of Yen's algorithm [33][34][35]. The k-shortest path search algorithm has been used in various researches regardless of the field. In [36], the authors applied the k-shortest path search algorithm for public transport travel optimization. From the kshortest paths, they selected the optimal path based on the preferences of users. In [37], safe paths in vehicle navigation were recommended based on a risk model that considered crime incidents in an urban road network. To calculate numerous safe paths, the k-shortest path search algorithm was applied. Other applications of the k-shortest path search algorithm include adjusting traffic flows from overloaded links to underutilized links in a telecommunication network [38] and detecting objects in individual frames of a video [39,40]. Likewise, there have been numerous researches employing the shortest path search algorithm in the bioinformatics area. In [41][42][43], the shortest paths in a protein-protein interaction (PPI) network were calculated to find genes that were related to diseases. In [44], the authors defined the mechanism of Parkinson's disease by identifying genes, miRNA, and potential drug targets using the shortest path search algorithm to determine a microarray gene expression dataset. In [45], regulatory pathways were inferred from a gene network with the shortest path search algorithm, and the same objects exhibiting slight variations in bioimages were analyzed with shortest path search algorithm [46].
In this paper, we propose a systematic approach to find the disease progression chain between two diseases by using a disease progression network constructed from various biomedical data. To find the chains between two diseases, we devise a k-shortest path search algorithm that combines the A � algorithm, Dijkstra algorithm, and Yen's algorithm. Instead of identifying the single shortest path, it is desirable to find the k-shortest paths; there may be many different paths in the disease progression network between two diseases. Such consecutive paths may also contain meaningful information on disease progression chain. The rest of the paper is organized as follows. In the proposed method section, we explain the step-by-step process of constructing the disease progression network and finding disease progression chains. In the experiments section, we present experiments and results of applying the proposed method to various biomedical data. In the conclusions section, we conclude the paper with insights and future works of study.

Proposed method
The proposed method consists of two steps. First, disease progression networks are constructed based on various biomedical data that are related to diseases. The information includes disease-protein relations, biological pathways, clinical history, and biomedical literature. Four networks, each constructed from different information, are integrated into a single network. From the integrated network, we employ the k-shortest path search algorithm to find disease progression chains that have the most influence on prior-posterior relations between two diseases. Fig 1 shows a schematic description of the proposed method.

Construction of disease progression networks
Association disease network. An association disease network (ADN) is the most fundamental disease network that is constructed based on disease-protein relations. The relation is represented by a bit vector where each bit indicates the existence of relations of a protein to the disease. To quantify the degree of relation between diseases, the cosine similarity between two vectors is used. For diseases d i and d j in an ADN, the weight w A ij is calculated by where 0 � w A ij � 1 and has a higher value for higher numbers of shared proteins between d i and d j . Fig 2 shows an example of the computing similarity for an ADN.

Pathway-based disease progression network
To construct a pathway-based disease progression network (pDPN) based on biological pathway information, we employ the method in [8] where prior-posterior relation between two diseases is derived by analyzing pathways associated with the diseases. First, common genes that are included in the pathways of two diseases are extracted and defined as a sharing block. Additionally, a flow function that quantifies the degree of which one disease affects the other disease is defined. The flow function considers the direction of molecular genes that are not included in the sharing block. That is, Flow(d i |d j ) is equal to the number of molecular reactions directed toward disease d j from disease d i , and Flow(d j |d i ) is equal to the opposite. To determine the prior-posterior relation, we compare the two flow values and set the weight matrix with . The resulting weight matrix is an asymmetric matrix that defines prior-posterior relation in the direction of high to low flow values. In the same manner, we calculate the weights for all pairs of diseases with known pathway information and construct a pDPN. Fig 3 shows an example of constructing a pDPN.

Clinical history-based disease progression network
A clinical history-based disease progression network (cDPN) is constructed based on the concept of relative risk. Relative risk is an index that describes the association between risk factors and incidents. The relative risk (RR) is given as where RR(A,B)>1 implies that A influences B with a prior-posterior relation. In the same way, RR(B,A) can also be calculated. For a cDPN, to calculate the associated probabilities for two diseases d i and d j , we use the number of patients who carry only one or both of d i and d j . Thus, the progression network is constructed with RR obtained from clinical information.
To determine the prior-posterior relation between d i and d j , the ratio of relative risk (RRR) that compares the two RR values is calculated. With RR and RRR, the weight w C ij between d i and d j in the cDPN is calculated with the following:

Text-based disease progression network
For a text-based disease progression network (tDPN), we extract the prior-posterior relation between two diseases from text data, and quantify the information [9]. To construct a disease progression network, we consider the following two aspects. First, terms that represent a prior-posterior relation between diseases are defined, and the degree of strength is assigned based on interpretations. Second, clauses that appear in multiple documents should have more influence on prior-posterior relation than clauses that appear multiple times in a single document.
The degree of strength in which terms that represent a prior-posterior relation is defined by α t . The value α t has higher weight if the term t has a stronger implication on prior-posterior relation, and the value has a lower weight if the term simply implies association. For a frequency-based approach, the strength is calculated by considering the number of documents that expresses prior-posterior relation using the term t ðdf ij t Þ and the number of clauses appearing across the documents (cf ij t ). The relation strength between diseases d i and d j in the text data with a document-clause frequency (DCF) is given by To define the prior-posterior relation between d i and d j , the strength of term α t and DCF ij t are combined, and the cases d i !d j and d j !d i are compared as in (6) to determine the weight value and its direction.

Chain of disease progression
Search algorithm for progression path. To find a disease progression chain, the k-shortest path search algorithm is utilized. Simple shortest path search algorithms only search for a single path. In the problem of finding a disease progression chain, one disease may lead to the other through many different paths in the disease progression network. Thus, it is desirable to find the k-shortest paths instead of the single shortest path. In this paper, the k-shortest paths for two diseases in a DPN is found by using a modified version of Yen's algorithm [32]. Given a graph, Yen's algorithm first finds the shortest path between two nodes and searches for consecutive shortest paths by enlarging the values of edges that are part of the shortest path. The method of searching the paths is based on the Dijkstra algorithm [29], a greedy search algorithm for finding the shortest path. In this study, we modify Yen's algorithm by using a combination of the Dijkstra algorithm and A � algorithm [28] in search of the shortest paths. This combination can reduce the computational time compared to the Dijkstra algorithm, and guarantees optimality [42].
To briefly review, suppose there is a graph G(D,E) where D = {d 1 ,d 2 ,. . .,d n } is the set of nodes and E denotes the set of weighted edges, possibly with directions. The Dijkstra algorithm starts by setting the initial starting node, say d S , and stores the vertex with the distance at each iteration. If a stored vertex is reached from another path, the distance is updated with the smaller value. Through the iterations, the shortest path from d S to every element in {d j :d j 2D, d j 6 ¼d S } is obtained. The principle of the A � algorithm is similar to that of the Dijkstra algorithm except that it uses an evaluation function: where g(d v ) is the minimum distance possible from d S to d v , and h(d v ) is the estimate of the cost of an optimal path from d v to the target node d T . From the starting node d S , the A � algorithm searches for the smallest f at each iteration until d T is reached. When h = 0 for all d v 2V, then the A � algorithm is equivalent to the Dijkstra algorithm. The combination of the Dijkstra algorithm and A � algorithm is given as follows. First, the Dijkstra algorithm is applied to the DPN in the reverse manner. That is, the search starts from the target d T , traces the edges back to their original vertices, and stores each optimal distance δ (d v ,d T ). Then, we set h(d v ) to be the optimal distance from d v to d T obtained from the previous step. Finally, the A � algorithm is applied to search paths from d S to d T to find the optimal path. To find the k-shortest paths P 1 ,P 1 ,. . .,P k , the following procedure is applied: 1. Apply the A � search algorithm to obtain P 1 , the shortest path from d S to d T . Set i = 2. 2. For each edge in P i−1 , we cut it and apply the A � search algorithm to obtain the set of potential paths Z i ¼ fz i 1 ; z i 2 ; . . . ; z i q g, where j = 1,. . .q, and q is the number of edges in P i−1 . We discard z i j if z i j 2 fP x g iÀ 1 x¼1 .
3. The shortest path from the set of potential paths becomes P i . Set i = i+1. (2) and (3) until i = k+1 or no potential path can be found.

Repeat
The pseudocode for finding the k-shortest paths is given in Fig 6. Disease chains and progression scores. The edges in a DPN have weight values between 0 and 1. If the weight between two diseases is high, then it implies a strong casual relation between the two. The shortest path search algorithm has priority in searching edges with low weight values. To reflect this property, the weight values are tranformed into distance values before applying the shortest path search algorithm. The distance between diseases d i and d j is defined as where w INT ij is the edge weight between diseases d i and d j in the integrated network. In addition, in contrast to the other three DPNs, edges in an ADN represent association instead of prior- posterior relations between diseases. In the process of finding chains, the bidirected edges in the integrated network are penalized.
By applying the disease progression chain finding algorithm to the integrated network, it is possible to find chains between two diseases of interest. Then, the influence of the chains is measured by the progression score (PS), which quantifies the relative importance of each chain. For the list of diseases in the progression chain between disease A and B, DPC d = {d A , d 1 ,. . .,d n ,d B }, the importance of each connection in the chain can be obtained by the weighted values in the original integrated network. For the set of weight values in a disease progression chain, DPC w = {w A1 ,. . .,w nB }, the PS between disease A and B is calculated by where l is the length of the disease progression chain, and δ is a scale parameter. Since the weight values lie between 0 and 1, the PS decreases with a greater number of connections between two diseases of interest. In other words, PS depends on the length of the progression chain, where a shorter length implies a larger influence with a more direct relation.

Data
To implement the proposed method, data on diseases, disease-protein relations, biological pathways, clinical history, and biomedical literature were used. The list of diseases was collected from Medical Subject Headings (MeSH) [47], where MeSH 2018 contains the names of 4,798 diseases in the diseases category. For disease-protein relations, data was collected from PharmDB [48], which provides information on diseases, drugs, and proteins. The data in PharmDB are extracted from various databases including the Comparative Toxicogenomics Database (CTD) [49], Genetic Association Database (GAD) [50], Online Mendelian Inheritance in Man (OMIM) [51], and Pharmacogenomics Knowledge Base (PharmGKB) [52]. To construct pDPN, biological pathway information was used from KEGG [53]. By matching diseases in KEGG and MeSH, 153 diseases were extracted with 129 having pathway information. For cDPN, we used the HuDiNe database [54], which contains 13 million clinical history records of prevalence and comorbidity information for diseases. Last, abstracts of biomedical literature listed in PubMed [55] were utilized for tDPN construction. Table 1 summarizes the data used for the experiments.

Results for construction of disease progression networks
Four different disease progression networks (ADN, pDPN, cDPN, and tDPN) were constructed by employing the proposed method on the collected data. In the network construction process, disease terms used in KEGG and HuDiNe are different from those in MeSH, where KEGG has its own terms and HuDiNe has ICD9. Therefore, each disease term was mapped to MeSH based on disease ontology in order to standardize and merge into a single term.
Four different DPNs are integrated into a single network in which the algorithm for finding the disease progression chain is applied. The integrated network has 3,302 diseases with 613,270 prior-posterior relations. Table 2 lists the results of network construction with the properties for four different DPNs and the integrated network. To outline some characteristics, ADN represents the association between diseases and has the form of an undirected network. In this study, this form is considered a bidirected network that has directions in both ways between two diseases. In addition, from the perspective of a disease progression network, ADN has less significance of disease progression compared to other networks that represent prior-posterior relations instead of association. The abundance of disease-protein relations, however, leads to a relatively dense network. Thus, to construct an ADN with relevant information, the k-Nearest Neighbors (k-NN) method is applied. In the experiment, when k = 40, the density of ADN was reduced from 17.95% (1,334,312 relations) to 1.98% (147,290 relations). In addition, as we can see the Table 2, ADN and cDPN have more diseases and relations compared with pDPN and tDPN. This is due to the difference of inherent characteristics in data sources for constructing each of the networks. For ADN, disease-protein relations have already been established by numerous researches, therefore resulting in high number of diseases and relations. Likewise, the size of data for cDPN is huge with large number of clinical history records. On the other hand, the small size of pDPN is originated from low number of diseases associated with pathways in KEGG. Furthermore, in Fig 7, the overlap of diseases and relations among four networks is scarce. This observation comes from complicated process of linking the biological mechanism, the phenotypes, and the literature knowledge base of diseases.
For the dataset of ADN, pDPN, cDPN, tDPN, and integrated network, refer to S1 Dataset.  The red and blue nodes represent sources and targets in the disease progression chain, respectively. The backgrounds colored with red, blue, green, yellow, and orange represent cardiovascular, digestive system, metabolic, urogenital, and musculoskeletal diseases, respectively. Fig 9 shows the disease progression chain spectrum, in which it is possible to determine the process of a particular disease affected by various other diseases. The diseases with blue nodes indicate the destinations of the chains. The number inside the node is the step of the chain, and the value under each disease is the PS.

Results for disease progression chain
For the case of colitis, we see various progression chains ranging from the direct connection of malnutrition to paralysis with four bypassing diseases. Inference on chains of disease progression based on disease networks

Exploration of disease progression chains in DPN
To examine the overall result of disease progression chains in a disease network, the search algorithm for disease progression chain was applied with k = 1 for simplicity. Of 10,899,902 possible pairs, 10,123,970 (92.88%) disease progression chains were extracted. Fig 10 shows the distribution of the number of diseases within the disease progression chain. We see that the chain length varies from one (direct connection) to a maximum of 12 diseases. Furthermore, the disease progression chain has a shorter length for a higher network density, and longer length for a lower network density. This implies that more information in constructing the integrated network leads to closer (direct) relations between diseases in the progression chain.
One example of a disease progression chain with a long length is hypopharyngeal neoplasms and thyroid hormone resistance syndrome. The disease progression chain is given as follows:  In the disease progression chain, connections from hypopharyngeal neoplasms to trachoma and macular degeneration to thyroiditis are based on clinical history, from trachoma to cataract is based on the biomedical literature, from cataract to macular degeneration is based on biological pathways, and from thyroiditis to thyroid hormone resistance syndrome is based on disease-protein relations.

Implication of k-chains
The results of k>1 are explained with an example of the relations between colitis and respiratory insufficiency. In general, it is difficult to find a direct association between colitis and respiratory insufficiency. By applying the proposed method to corresponding diseases with k = 5, we found the following results for the disease progression chain: From the resulting chains, there is a common path from polyuria to respiratory insufficiency, which passes through hyponatremia. The difference between the chains comes from various paths from colitis to polyuria. For the common path, a close relationship between polyuria and hyponatremia was shown in numerous clinical reports (PMID: 23837469, 10468901), where both diseases were affected by vasopressin. From hyponatremia to respiratory insufficiency, the former reduces cerebral blood flow and arterial oxygen content, which leads to hypoxia and respiratory insufficiency (PMID: 14605269). In addition, it was also reported that hyponatremia can cause a sudden respiratory insufficiency (PMID: 3713746).
In the case of the first chain, it was reported several times (PMID: 23445618, 25056300) that acute kidney injury can be caused by colitis. From acute kidney injury to polyuria, there have been research studies (PMID: 877851, 20525977) that examined the mechanism of a former disease leading to the latter. In a similar approach, it is possible to verify the prior-posterior relations for all five chains. The overall result is shown in Table 3.

Validation of disease progression chains
To evaluate the confidence of the proposed method, resulting disease progression chains were validated by comparing them with clinical histories. The prior-posterior relations of diseases with the ratio of relative risk (RRR) from clinical history data contain directions for a significant number of diseases and are based on the information of patients. Although it is the best compare with trajectories referred to patients, it takes tremendous time and effort. In terms of practicality, information on clinical history can serve as the standard for validation. The validation process was carried out as follows: (a) In the integration step, use ADN, pDPN, and tDPN, excluding clinical history information. (b) Apply a search algorithm for progression path to each pair of diseases. (c) Calculate RRR for the selected prior-posterior relations of diseases. As explained in clinical history-based disease progression network section, if RRR>1, then it is plausible to evaluate the corresponding prior-posterior relation as a correct relation.
Of 3,302 diseases, 100 were randomly selected, and three chains were found for each disease pair possible. As a result, 24,030 chains were found with 84,122 prior-posterior relations within the chains. The confidence of the proposed method was evaluated based on the ratio of priorposterior relations with RRR>1. Fig 11(A) shows the distribution of RRR of the prior-posterior relations found. The number of prior-posterior relations with RRR>1 is 70,575, which Inference on chains of disease progression based on disease networks corresponds to 83.90% of the total. In addition, the confidence of a disease progression chain can be evaluated based on the average RRR for all prior-posterior relations within the chain. Fig 11(B) shows the distribution of the average RRR for the disease progression chains. The number of chains with average RRR>1 is 20,662, which corresponds to 86.80% of the total. The validation results show that the proposed method guarantees high confidence in the results.  With the disease progression chains, we can track numerous cases that describe the process of a disease developing to another from colitis to respiratory insufficiency. For more diverse results and details of the k-chains, refer to Table A in the S1 Appendix. https://doi.org/10.1371/journal.pone.0218871.t003

Conclusions
In this paper, we proposed a method of finding a series of disease progressions, the disease progression chain, from an integrated disease progression network constructed with various biomedical data. The disease progression network was constructed by integrating four different sources of disease-protein relation, biological pathway, clinical history, and biomedical literature. To find disease progression chains, a k-shortest path search algorithm that combines the A � algorithm, Dijkstra algorithm, and Yen's algorithm was proposed. Through the proposed method, various disease progression chains between two diseases were found and were verified qualitatively from biomedical literature and quantitatively with comparisons between clinical histories and other sources of information.
The novelty of this research is that the concept of the disease progression chain, proposed in this paper, can be beneficial for tracking the prognosis of various diseases that can follow from an occurrence of a disease. In addition, the prior-posterior relations between two diseases from different categories can also be found despite their seemingly low association. On the other hand, there are some limitations in the present study. For the disease progression networks, each of sources has its domain trait which may be deserved to be preserved. However, the problem is the respective networks are sparse and disconnected. This means we can hardly find a relevant path from a single network, thus the network integration was employed. In addition, it would be better to give the networks different weights according to the relative importance. However, we currently have only a little knowledge on which network is better than the other. Therefore, we treated the networks (except the association network) with same weights in the network integration process. But the performance of the proposed method will be improved if we reweight the networks according to significance of each of source domains. In principle, it will be worth finding the path on disease progression from an individual network not from integrated one when abundant knowledge become more available than now. Moreover, the validation of disease progression chains in the current study was compared with clinical history data, but it would be more thorough if the results are compared to trajectories of diseases referred to patients.
In these respects, this research can further be developed by enriching the disease progression network with more abundant information, integrating networks with different weights according to significance, improving the proposed k-shortest path search algorithm, and refining verification methods through comparisons with cohort studies. These will be important Inference on chains of disease progression based on disease networks aspects of our future work. With improved verification methods, it can benefit the role of therapy and its temporal assessment in the progression of diseases. We hope that the proposed method is perceived as a preliminary information that may help practitioners have some hints that may be far better than beginning from nothing at all.