STRUCTURAL KNOWLEDGE ANALYSIS AND MODELING OF MULTIMORBIDITY USING GRAPH THEORY BASED TECHNIQUES

Multimorbidity is one of the major problems in the modern medical system. The more conditions a patient has, the greater the psychological pressure. We propose a formal definition of the general case of Multimorbidity Disease Network detection. Using a pairwise association method, we constructed an undirected weighted co-occurrence graph for comorbidity, based on the socio-psychological profiles present in a real data set. On the obtained network, we used centrality analysis of the network nodes to conduct a mesoscopic analysis, and a community detection algorithm to determine potential components of the network. The main results show, first, that the algorithms used can be helpful for extracting models of multimorbidity and, second, that the aging process not only affects the number of diseases but can also influence the multimorbidity burden and its complexity pattern.


INTRODUCTION
Comorbidity is defined as "any distinct additional clinical entity that has existed or that may occur during the clinical course of a patient who has the index disease under study" [1]. It is a major health problem in modern medicine: the more conditions a patient has (i.e. multimorbidity), the greater the burden on the patient and on the healthcare system. Moreover, healthcare systems are still designed around a single-disease paradigm rather than around multimorbidity. The transition from disease-centered care to patient-centered care is ongoing, but generally appears complicated [2].
Recently, initiatives have been launched to use the increasing amount of electronic health care data to gain insight into this phenomenon with big data tools [3]. As a general methodology, researchers in data science collect suitable data, clean and transform them into a suitable format, then apply models and algorithms to extract knowledge from the primitive information gained from raw unstructured data (see Figure 1). With this work, we try to contribute to exploring the use of analytical tools from data science for multimorbidity modeling. The increasing prevalence of multiple diseases poses problems for healthcare providers, care systems, and policies. It also affects patients' well-being and quality of life [4,5].
Understanding multimorbidity in light of psycho-social profiles should provide a more accurate picture of the psycho-medical needs of patients, and thus help to better manage cases with multimorbidity. From this perspective, we investigate patterns of multimorbidity across the psycho-social profiles of patients over the life span, according to Erikson's theory [6] of psycho-social development. This theory divides human life into 8 stages: infancy (0-1.5 years), toddler (1.5-3 years), early childhood (3-5 years), middle childhood (5-12 years), adolescence (13-18 years), young adulthood (18-40 years), middle adulthood (40-65 years), and older adulthood (> 65 years). Each stage reflects relatively stable psychological growth, in which individuals have general psychological properties and patterns. The boundaries between these stages reflect transition periods, shifts often characterized by a crisis regarded as a challenge in the maturation process.
There are several ways to express association between diseases in multimorbidity research. The relative risk (RR), the odds ratio (OR) [7,8], and the multimorbidity coefficient (MC) are well-known ones. The risk ratio quantifies the association between two disorders as the ratio of the risk of occurrence of a disease among an exposed group of people to that among the unexposed [9]. The odds ratio is defined as the odds of being diagnosed with disorder $D_2$ if one has disease $D_1$, divided by the odds of being diagnosed with $D_2$ if one does not have $D_1$. The popularity of the odds ratio is due to its ease of calculation and to the fact that it gives a good estimate of the relative risk [10].
Odds and risk ratios estimate the overall strength of association between disorders but fail to separate clusters from coincidental comorbidity. Multimorbidity coefficients can express association among any number of diseases by dividing the actual rate of multimorbidity by expected numbers of cases. The MC favors pairs of low prevalence. To reduce this tendency a pseudo-count can be added to the numerator and denominator of the MC [11].
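The three measures above can be illustrated on a 2x2 contingency table of two diseases. The following is only a minimal sketch under stated assumptions, not the implementation used in this work: the function names, the cell names (n11 = both diseases, n10 = first disease only, n01 = second disease only, n00 = neither) and the exact way the pseudo-count enters the MC (the same constant added to the observed and expected counts) are hypothetical choices made for illustration.

```python
def relative_risk(n11, n10, n01, n00):
    """Risk of d2 among patients with d1, divided by the risk of d2 among patients without d1."""
    risk_exposed = n11 / (n11 + n10)
    risk_unexposed = n01 / (n01 + n00)
    return risk_exposed / risk_unexposed


def odds_ratio(n11, n10, n01, n00):
    """Odds of d2 given d1, divided by the odds of d2 given the absence of d1."""
    return (n11 * n00) / (n10 * n01)


def multimorbidity_coefficient(n11, n10, n01, n00, pseudo_count=0.0):
    """Observed co-occurrence count divided by the count expected under independence.

    The pseudo-count handling is an assumption of this sketch: the same
    constant is added to numerator and denominator.
    """
    n = n11 + n10 + n01 + n00
    observed = n11
    expected = (n11 + n10) * (n11 + n01) / n
    return (observed + pseudo_count) / (expected + pseudo_count)


if __name__ == "__main__":
    table = (30, 70, 90, 810)  # hypothetical counts for 1000 patients
    print(relative_risk(*table))
    print(odds_ratio(*table))
    print(multimorbidity_coefficient(*table))
    # A rare pair: one co-occurrence where 0.01 would be expected by chance.
    rare = (1, 9, 9, 9981)
    print(multimorbidity_coefficient(*rare))
    print(multimorbidity_coefficient(*rare, pseudo_count=1.0))
```

On the rare pair, the raw MC is very large although it rests on a single co-occurrence, and the pseudo-count pulls it back toward 1, which is the tendency the text describes.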
Our objectives are twofold: first, to present the multimorbidity pattern detection problem in formal terminology; second, to understand how multimorbidity develops across one's life stages. This work is conducted in two steps: first, constructing a network model of co-occurring diseases from the studied dataset; second, analyzing network patterns using centrality analysis and detecting communities in the network.
In the remainder of this paper, Section 2 presents related work. Section 3 puts the multimorbidity problem in a formal framework. In Sections 4 and 5, we present the community detection algorithms and the centralities used. We present results in Section 6 and a general conclusion in Section 7.

RELATED WORKS
costs and health care utility [12]. More technically, the methods and models differ either in the data, which may be cross-sectional or have a temporal dimension, or in whether the goal of the model is to explain, to explore or to predict.
Earlier medical research relied on regression models applied to single diseases, which ignore the big picture of multimorbidity complexity. Recently, combinations of traditional data analysis and machine learning have been proposed as multimorbidity research methods.
In [13], the authors applied classification/regression trees and random forests to data on elderly adults to model and identify how specific combinations of chronic conditions, functional limitations, and geriatric syndromes affect costs and inpatient utilization. The authors of [14] applied nonhierarchical cluster analysis based on k-means in a cross-sectional study using electronic health records of patients aged between 45 and 64 years, in order to identify and separate certain population groups from others. The authors of [15] added fuzziness to the k-means algorithm to estimate clusters of patients as well as a membership matrix indicating the degree of membership of a patient in a given cluster. In [16], a multilevel analysis was conducted of the influence of individual- and area-level factors on patterns of physical-mental multimorbidity and health-care use in the general population. This method allows detecting the isolated and combined influence of the variables of each level on the outcome variable.
In another approach, network science has proved a fertile domain for drawing insights about comorbidity disease networks. Hernández et al. [17] analyzed comorbidity patterns using network analysis and association rules to study disease associations in 6,101 Irish adults aged over 50 years. They applied the Louvain algorithm to detect clusters of diseases in the disease network. The standardized lift and confidence scores of the association rules were considered as probabilistic measures of how conditionally related the diseases are. In [18], logistic regression models adjusted for age and sex, with the odds ratio as the strength of association, were used to estimate the comorbidity network. The authors analyzed the network with metrics such as the clustering coefficient, PageRank and degree centrality. The authors of [19] used the Salton cosine index as a measure of comorbidity strength to build the comorbidity network, and weighted degree, closeness and betweenness centrality for a microscopic analysis of it.
Other approaches in the literature focus on probabilistic formulations and longitudinal data. In the important work of [20], Lappenschaar et al. summarized and classified some of the terminology used in definitions of multimorbidity concepts, and proposed a probabilistic framework to model these concepts using causal Bayesian networks [21]. In [22], the authors proposed Bayesian network structure learning methods for modeling the interactions between risk factors explaining co-occurrences of malignant tumours in oncology. This model was extended with a temporal dimension in [23]. The authors of [24] proposed a latent-based approach to model multimorbidity-related events in temporal electronic health records and introduced the notion of clusters of hidden states, allowing for an exploration of the multiple dynamics that underlie events in the data.
In the aforementioned medical studies, the authors focused on the representability of concepts related to the multimorbidity problem, and were more concerned with the fidelity of the proposed model to real-world characteristics and with the interpretability of the outcomes of the proposed methods. They chose mathematical techniques as tools to draw conclusions from data, then interpreted the outcomes from a medical point of view. The chosen model itself is a case study of a given analytical tool. In this work, our methodology is quite the opposite: we study the tools from a computational point of view and assess them on a case study of a real medical dataset.
In theoretical computer science, there are many ways to learn a network's structure, and during the last 20 years many new structure learning algorithms based on different principles have been proposed. The main question here is whether such learning algorithms remain significant and feasible on real-world multimorbidity data.
For example, while structure learning algorithms that favour sparsity of the network are more suitable in some domains, one cannot be sure whether this holds for the multimorbidity domain.
Compared to other domains, medical data have a large range in the frequency of occurrence of particular events. From a medical perspective, any significant association that helps to understand the network of multimorbid diseases would be of great value. Besides, medical data are often characterized by extreme prevalences. For example, 96% of the nearly 9800 diseases in the ICD-10 coding have a prevalence of less than 1%. Further, co-occurrences of infrequent diseases are likely to be even rarer. Moreover, how these approaches perform on large real-world datasets with a very large number of variables is an important point in assessing their feasibility. Therefore, it is necessary to know which learning algorithms are most suitable for the ultimate purpose of building computerized decision support systems. The feasibility of these models on very large and heterogeneous real-world data is questionable.
Our contributions, besides the different methodological concerns, are the following. We provide a general machine learning framework that describes the learning of structural knowledge from multimorbid data. This framework can express many of the methods proposed in the medical literature in a general way. We then give an implementation of a pairwise approach on our data and discuss some of its computational properties. Next, we treat the special case of the comorbidity network, and discuss some techniques for microscopic and mesoscopic analysis of the obtained comorbidity network. We choose centrality and community detection as analysis techniques. For community detection, we compare four well-known algorithms representing different approaches in the community detection literature. Finally, we apply some centralities and try, briefly and carefully, to interpret them in light of our data of interest.

Problem setting.
To understand co-occurrence between multimorbid diseases (let us consider k multimorbid diseases), we computed the association strength of all combinations of k diseases in the data, per life stage and for both genders, over a given time period (2016).
In the following, we denote by |S| the number of elements of a given set S. Let $D = \{d_1, d_2, \dots, d_{|D|}\}$ be a finite set containing the |D| diseases present in the medical dataset. Let $X = \{x_1, x_2, \dots, x_{|X|}\}$ be the set of patient records, where each $x_i$, with components $x_{j,i} \in D$, is the tuple of observed diagnoses for patient $i$. Let $X_v \subset X$ be the subset of patients with a certain profile $v$ (a profile can be any of the patient's characteristics).
$X_v = \{x_1, x_2, \dots, x_{|X_v|}\}$ is a finite subset of $X$ indexed by $v = (a, b)$, where, in this work, $a$ represents gender, $a \in \{\text{female}, \text{male}\}$, and $b$ represents life stage, $b \in \{$infancy, toddler, early childhood, middle childhood, adolescence, young adulthood, middle adulthood, older adulthood$\}$.
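The stratification of patients into profile subsets indexed by gender and life stage can be sketched as follows. This is an illustrative sketch only: the record layout (gender, age in years, list of diagnosis codes) and the names `life_stage` and `split_by_profile` are hypothetical, and the handling of stage boundaries is an assumption (each stage is treated as closed on its upper age, so an age on a shared boundary goes to the earlier stage, and ages between 12 and 13 go to adolescence).

```python
from bisect import bisect_left
from collections import defaultdict

STAGES = [
    "infancy", "toddler", "early childhood", "middle childhood",
    "adolescence", "young adulthood", "middle adulthood", "older adulthood",
]
# Upper age (in years) of the first seven stages; older adulthood is open-ended.
UPPER_BOUNDS = [1.5, 3, 5, 12, 18, 40, 65]


def life_stage(age_years):
    """Map an age in years to a life stage (boundary handling is an assumption)."""
    return STAGES[bisect_left(UPPER_BOUNDS, age_years)]


def split_by_profile(patients):
    """Group diagnosis sets by profile v = (gender, life stage).

    `patients` is an iterable of (gender, age_years, diagnoses) triples.
    """
    groups = defaultdict(list)
    for gender, age_years, diagnoses in patients:
        groups[(gender, life_stage(age_years))].append(set(diagnoses))
    return dict(groups)


if __name__ == "__main__":
    patients = [
        ("female", 70, ["I10", "I50.9"]),
        ("male", 30, ["N17.9"]),
        ("female", 81, ["I10"]),
    ]
    for profile, records in split_by_profile(patients).items():
        print(profile, len(records))
```

Each value of the returned dictionary plays the role of one profile subset of patients, on which the association analysis can then be run separately.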
We assume that the data X are independent and identically distributed (i.i.d.), i.e. the samples in the dataset are generated by the same generative law, which has no memory of previously generated samples. In other words, we base our analysis in this paper on the premise that every patient sample consists of distinct cases, each of which is caused by the same underlying Multimorbidity Mechanism.
Let R be a binary relation over the Cartesian product D × D. Two diseases $d_1$ and $d_2$ are related by R if and only if they satisfy a predefined condition. This condition depends on the context of the study. It can be, for example, the fact of being correlated, causally related, or conditionally related. The relation R is usually assessed by a metric that measures its strength. Statistically, R is estimated as a function of the observations X. We consider the hypothesis space $H = \{R_\theta : R_\theta$ is a model explaining the observed co-occurrence of diseases$\}$. The parameters θ depend on the chosen model: they could be logistic regression coefficients, the probability tables of a Bayesian network, the degree of a polynomial regression, or the parameters of hidden Markov models [24], latent class models [25], principal component analysis [16], or a threshold on significant associations [18,26]. Technically, the statistical models R are grouped as a family of equations, and H is framed according to the assumptions underlying the problem of interest. We use this binary relation over the disease set to define a Comorbidity Disease Network (CDN). More generally, we use a k-ary relation over $D \times D \times \dots \times D = D^k$ to define a Multimorbidity Disease Network (k-morbidity disease network).
We define in this work the binary relation $R(d_1, d_2)$ as follows: two diseases $d_1$ and $d_2$ are related if they co-occur more often than would be expected by chance, i.e. $P(d_1, d_2) > P(d_1)\,P(d_2)$. This definition coincides with the definition of cluster comorbidity by Van Den Akker et al. [27]: if $d_1$ has occurred, then $d_2$ will be more likely to occur than would be expected just by chance.
If $P(d_1, d_2) = P(d_1)\,P(d_2)$, we consider that the two conditions are randomly co-occurring.
The case $P(d_1, d_2) < P(d_1)\,P(d_2)$ can be interpreted as $d_1$ and $d_2$ being in protective comorbidity (for example, myopia may be protective against diabetic retinopathy [28]).
To measure the strength of this relation (association), the multimorbidity coefficient (MC) is calculated. The MC measures pairwise associations [29], i.e. how strongly disorders are linked. It is defined as the observed rate of comorbidity (multimorbidity) divided by the rate expected under the null hypothesis of no association between the separate disorders.
The algorithm (Algorithm 1) searches all $\binom{n}{k} = \frac{n!}{(n-k)!\,k!}$ combinations of $k$ distinct diseases ($n > k > 1$) among the $n$ total diseases, and computes their MC score by mining their co-occurrence. If the MC is significantly higher than 1, we consider these diseases to be in multimorbidity. If the MC score is less than 1, we consider them to be in protective multimorbidity. The larger the MC, the stronger the association is considered to be. We are more interested in positive comorbidity. For each k-morbid combination ($k$ is fixed in this case), the algorithm has to verify whether the null hypothesis $H_0$: "$W_{expected} \geq W_{observed}$" is rejected at risk $\alpha$. In the final step, $E \leftarrow \bigcup_k \text{as.one.hyperedge}(E^{k}_{M_k})$, each clique $E^{k}_{M_k}$ from graph $G_k$ is transformed into one hyperedge of the hypergraph $G$.
The count_occ(S, I) procedure counts the number of occurrences of a disease $d \in S$ in the diagnosis records indexed by $i \in I$. One way to implement this procedure is as a search algorithm: a sequential search counts the number of occurrences by iterating over the diagnosis records. If the records are sorted by a function f(I), resulting in a totally ordered set under an order relation ≤ (such as the lexicographic order of disease codes, or some ranking score for diseases), a binary search algorithm can take advantage of the order relation to reduce the search space logarithmically.
In the remainder of this paper, we focus the analysis on the special case of the Comorbidity Disease Network.
Each algorithm in this approach works on the multimorbidity network to detect subsets forming typical groups. Such an algorithm therefore assumes that the diseases present in the dataset can be split into homogeneous groups.
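The pairwise case (k = 2) of the network construction described above can be sketched as follows. This is a hedged illustration rather than the algorithm used in this work: the function names are hypothetical, the records are assumed to be sets of disease codes (so membership is tested by hashing instead of the sequential or binary search discussed above), and the one-sided exact binomial test is an assumed stand-in, since the test statistic behind the null hypothesis is not specified here.

```python
from itertools import combinations
from math import comb


def binomial_upper_tail(n, p, k_obs):
    """P(X >= k_obs) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_obs, n + 1))


def comorbidity_network(records, alpha=0.01):
    """Return {(d1, d2): MC} for pairs co-occurring significantly more than expected.

    `records` is a list of sets of disease codes, one set per patient.
    """
    n = len(records)
    diseases = sorted(set().union(*records))
    count = {d: sum(d in r for r in records) for d in diseases}
    edges = {}
    for d1, d2 in combinations(diseases, 2):
        observed = sum(d1 in r and d2 in r for r in records)
        if observed == 0:
            continue  # no co-occurrence, hence no positive association
        # Expected co-occurrence probability under independence of d1 and d2.
        p_expected = (count[d1] / n) * (count[d2] / n)
        mc = observed / (n * p_expected)
        # Assumed test: reject "expected >= observed" when the upper tail is small.
        p_value = binomial_upper_tail(n, p_expected, observed)
        if mc > 1 and p_value < alpha:
            edges[(d1, d2)] = mc
    return edges


if __name__ == "__main__":
    # Hypothetical toy data: 100 patients, diseases A and B strongly linked.
    records = (
        [{"A", "B"}] * 20 + [{"A"}] * 5 + [{"B"}] * 5 + [{"C"}] * 10 + [set()] * 60
    )
    print(comorbidity_network(records))
```

The returned dictionary is the weighted edge list of an undirected graph whose weights are MC scores. Enumerating all pairs is quadratic in the number of diseases, and the cost grows combinatorially for larger k.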

COMMUNITY DETECTION
Community detection is a network analysis method often confused with graph partitioning and graph clustering. Graph partitioning and graph clustering consist of dividing the vertices of a graph into a given number of mutually exclusive subsets of a given size such that the number of edges linking the subsets is minimal. Community detection, in contrast, assigns every node to a class (community); overlapping is allowed, and neither the number of classes nor their sizes are predefined.
Many algorithms and approaches have been developed to detect communities in networks. Each has its pros and cons, and performance also depends on the topological properties of each network. Thus, a comparative analysis is often required to select the most suitable method.
Community detection is often formulated as a combinatorial optimization problem, in which an optimization method tries to optimize a criterion (cut, conductance, modularity, etc.). Finding the exact solution is NP-hard, hence the focus in the literature on greedy algorithms, approximation algorithms and heuristics.

Label Propagation Algorithm.
This algorithm was introduced by Raghavan et al. [32]. It starts by initializing a distinct community label for each node in the network. The nodes are then listed in a random order. Following this random sequence, each node takes the label held by the majority of its neighbors. The procedure stops once each node has the same label as the majority of its neighbors. The computational complexity is O(E).
The advantage of this algorithm is that it is quite fast, because it does not require prior information about the network. A limitation is that different community structures are reachable from the same initial condition. The algorithm uses the network structure to guide its progress and does not optimize any chosen measure of community quality. A further problem is that subgraphs that are bipartite or nearly bipartite in structure lead to oscillations of labels.
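The label propagation steps described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reference implementation: the graph is an unweighted adjacency dictionary, the function name and parameters are hypothetical, and ties are handled by one particular variant (a node keeps its current label when that label is already among the most frequent ones, otherwise one of the most frequent labels is drawn at random). Nodes are updated one at a time within a sweep, which helps to avoid the oscillations mentioned above.

```python
import random
from collections import Counter


def label_propagation(adjacency, seed=0, max_sweeps=100):
    """Asynchronous label propagation on an undirected, unweighted graph.

    `adjacency` maps each node to a list of its neighbours.
    Returns a dict mapping each node to its community label.
    """
    rng = random.Random(seed)
    labels = {node: node for node in adjacency}  # one distinct label per node
    nodes = list(adjacency)

    def most_frequent_labels(node):
        counts = Counter(labels[nb] for nb in adjacency[node])
        if not counts:  # isolated node keeps its own label
            return {labels[node]}
        top = max(counts.values())
        return {label for label, c in counts.items() if c == top}

    for _ in range(max_sweeps):
        rng.shuffle(nodes)  # visit the nodes in a random order
        for node in nodes:
            best = most_frequent_labels(node)
            if labels[node] not in best:
                labels[node] = rng.choice(sorted(best))
        # Stop once every node carries a label held by a majority of its neighbours.
        if all(labels[node] in most_frequent_labels(node) for node in nodes):
            break
    return labels


if __name__ == "__main__":
    # Two triangles joined by a single edge (hypothetical toy graph).
    graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
    print(label_propagation(graph, seed=1))
```

Because of the random visiting order and random tie-breaking, different seeds may yield different partitions of the same graph, which illustrates the limitation noted in the text.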

Louvain Algorithm.
The Louvain algorithm is one of the popular algorithms following a multilevel hierarchical strategy. It was introduced by Blondel et al. [33]. It is a greedy algorithm that attempts to optimize the modularity score of a partition of a network. The optimization is executed in two steps.
First, the method looks for small communities by optimizing modularity locally. Second, it aggregates the nodes belonging to the same community and builds a new network whose nodes are the aggregated communities. These steps are repeated until a maximum of modularity is attained, which produces a hierarchy of communities. The computational complexity of the multilevel algorithm is O(N log N).

Walktrap Algorithm.
Proposed by Pons and Latapy [34], Walktrap is a hierarchical clustering algorithm based on a predefined random-walk-based similarity measure, which is a distance between nodes and between sets of nodes (communities). The basic idea is that short random walks tend to stay in the same community. Starting from a totally non-clustered partition, the distances between all adjacent nodes are computed. Then two adjacent communities are chosen and merged, and the distances between communities are updated. If this step is repeated N − 1 times, the computational complexity of the algorithm is O(EN^2), and O(N^2 log N) for sparse networks.
Comparing community quality.
One of the most popular measures of community strength is modularity. The basic idea is to compare the fraction of edges within a cluster to the fraction expected in a random graph with an identical degree distribution. Its value can be either positive or negative. A positive value of modularity indicates the presence of community structure. The modularity Q is used to compare the quality of detected communities, but also as an objective function to optimize. It is defined as follows:
$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(C_i, C_j) = \sum_{u} \left( e_{uu} - a_u^2 \right)$$
where m is the number of edges, $A_{ij}$ is the adjacency matrix, $k_i$ is the degree of node i, $C_i$ is the community (group) of node i, δ is the Kronecker delta such that δ(x, y) = 1 if x = y and 0 otherwise, $e_{uu}$ is the observed fraction of edges within the group/community u, $a_u^2$ is the expected fraction for the same group, and $e_{ij}$ is the fraction of edges that connect groups i and j. Q = 0 implies that the grouping is what would be expected from a random attribution of edges. A value of Q close to 1 implies that the groups are very well shaped as communities inside the graph.
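The modularity of a given partition can be computed directly from the description above, as the sum over communities of the observed within-community edge fraction minus the fraction expected from the degrees. The sketch below is illustrative only: the function name, the edge-dictionary input format and the use of edge weights in place of edge counts are assumptions, and self-loops are not handled.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity of a partition of an undirected weighted graph.

    `edges` maps each undirected edge (u, v), listed once, to its weight.
    `community` maps each node to its community identifier.
    """
    m = sum(edges.values())  # total edge weight
    internal = defaultdict(float)  # edge weight inside each community
    degree_sum = defaultdict(float)  # summed (weighted) degree of each community
    for (u, v), w in edges.items():
        degree_sum[community[u]] += w
        degree_sum[community[v]] += w
        if community[u] == community[v]:
            internal[community[u]] += w
    # Observed within-community fraction minus the squared degree fraction.
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum
    )


if __name__ == "__main__":
    # Two triangles joined by one edge (hypothetical toy graph, unit weights).
    edges = {(0, 1): 1, (0, 2): 1, (1, 2): 1, (2, 3): 1, (3, 4): 1, (3, 5): 1, (4, 5): 1}
    split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
    merged = {node: "all" for node in range(6)}
    print(modularity(edges, split))
    print(modularity(edges, merged))
```

Placing all nodes in a single community gives Q = 0, while separating the two triangles gives a positive value, in line with the interpretation of Q given in the text.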

Centrality measures.
Centrality indices characterize an important vertex in a graph by defining a real-valued function $C : V \to \mathbb{R}$ on the vertices of the graph, where the score C(v) provides a ranking that identifies a scale of node importance. In general, importance is defined in relation to the information flow across the network, or to the cohesiveness of the network. In this work, we calculated the following centralities: edge betweenness, closeness, degree, and eigenvector.

Betweenness.
Betweenness centrality quantifies how many times a node acts as a bridge on the shortest path between two other nodes. It was originally introduced by Linton Freeman [35] as a measure of the control a person has over the communication flow between other people in a social network. Formally, if $\sigma_{st}$ is the total number of shortest paths that run from node s to node t, and $\sigma_{st}(v)$ is the number of those paths that pass through v, the betweenness can be defined as $C_B(v) = \sum_{s \neq v \neq t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}}$.

Closeness.
Closeness is based on the lengths of the shortest paths between a node and all other nodes in the graph, i.e. the closer a node is to all other nodes, the more central it is. If $d(v_1, v_2)$ is the distance between vertices $v_1$ and $v_2$, then the closeness of a vertex x can be defined formally [36] as $C(x) = \frac{1}{\sum_{y} d(y, x)}$, with the sum taken over all vertices y in V. Computing the betweenness and closeness of vertices involves computing the shortest paths between all pairs of vertices, which requires O(V^3) time using the Floyd-Warshall algorithm [37], or O(VE) on unweighted graphs with Brandes' algorithm [38].
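Both shortest-path centralities can be computed with breadth-first search on an unweighted graph. The sketch below follows the accumulation scheme of Brandes' algorithm for node betweenness and the reciprocal-of-summed-distances form of closeness; it is an illustration under assumptions (unweighted undirected graph given as an adjacency dictionary, hypothetical function names, unreachable nodes simply ignored in the closeness sum), not the implementation used in this work.

```python
from collections import deque


def betweenness(adjacency):
    """Node betweenness on an undirected, unweighted graph (Brandes-style accumulation)."""
    score = dict.fromkeys(adjacency, 0.0)
    for s in adjacency:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        predecessors = {v: [] for v in adjacency}
        sigma = dict.fromkeys(adjacency, 0)
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    predecessors[w].append(v)
        # Back-propagate pair dependencies from the farthest nodes to s.
        delta = dict.fromkeys(adjacency, 0.0)
        for w in reversed(order):
            for v in predecessors[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair (s, t) was counted twice in an undirected graph.
    return {v: b / 2 for v, b in score.items()}


def closeness(adjacency, x):
    """Reciprocal of the summed shortest-path distances from x to reachable nodes."""
    dist = {x: 0}
    queue = deque([x])
    while queue:
        v = queue.popleft()
        for w in adjacency[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    total = sum(dist.values())
    return 1.0 / total if total > 0 else 0.0


if __name__ == "__main__":
    path = {0: [1], 1: [0, 2], 2: [1]}  # hypothetical three-node path graph
    print(betweenness(path))
    print({v: closeness(path, v) for v in path})
```

On the three-node path, only the middle node lies on a shortest path between two other nodes, so it is the only node with non-zero betweenness and it also has the highest closeness.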

Eigen vector.
Its basic idea is that connections to influential nodes lend a node more influence than connections to less influential nodes. If we denote the centrality of vertex i by $x_i$, then we can model this influence effect by making $x_i$ proportional to the sum of the centralities of i's network neighbors. If $A = (a_{i,j})$ is the adjacency matrix, then $x_v = \frac{1}{\lambda} \sum_j a_{v,j} x_j$, or equivalently, in matrix notation, $Ax = \lambda x$, with $x = (x_1, x_2, \dots)$ the vector of centralities. Since the adjacency matrix is non-negative, by the Perron-Frobenius theorem there is a unique largest real positive eigenvalue λ. In this way, eigenvector centrality assigns each vertex a centrality that depends on both the quality and the number of its connections [39].
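The eigenvector equation above can be solved numerically by power iteration. The following is a small sketch under stated assumptions: the graph is connected, undirected and unweighted, the function name and tolerance parameters are hypothetical, and the iteration is applied to A + I rather than A (a shift that leaves the eigenvectors unchanged but prevents oscillation on bipartite graphs). The result is normalized to unit Euclidean length.

```python
from math import sqrt


def eigenvector_centrality(adjacency, max_iterations=1000, tolerance=1e-10):
    """Power iteration for the leading eigenvector of the adjacency matrix.

    `adjacency` maps each node to a list of its neighbours (connected graph assumed).
    """
    x = dict.fromkeys(adjacency, 1.0)
    for _ in range(max_iterations):
        # One multiplication by (A + I): own score plus the neighbours' scores.
        new = {v: x[v] + sum(x[u] for u in adjacency[v]) for v in adjacency}
        norm = sqrt(sum(value * value for value in new.values()))
        new = {v: value / norm for v, value in new.items()}
        if max(abs(new[v] - x[v]) for v in adjacency) < tolerance:
            return new
        x = new
    return x


if __name__ == "__main__":
    # Star graph: a hub connected to three leaves (hypothetical toy graph).
    star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
    print(eigenvector_centrality(star))
```

In the star graph the hub receives the highest score, and each leaf inherits a smaller but equal score from its single connection to the hub.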

Degree.
Degree is the simplest centrality measure: it is the number of edges a node has. The degree can be interpreted as the ability of a node to catch whatever is flowing through the network (such as a virus, or some propagating information). Calculating degree centrality for all the nodes takes Θ(V^2) time with a dense adjacency matrix.

RESULTS AND DISCUSSION
We applied Algorithm 1 to our data in the special case of the Comorbidity Disease Network (i.e. $M_k = M_2$). Figure 4 presents an example of 9 diseases and their co-occurrence strength (MC) for older males (> 65 years). The absence of an edge between two diseases means, according to our available data, that either the association is not significant or the association is what would be expected just by chance. The presence of an edge is considered a significant, non-random comorbidity association between the two diseases it links. For example, rheumatic disorders of both mitral and tricuspid valves co-occur with other secondary pulmonary diseases 15.07 times more often than would be expected just by chance. If we zoom out to the whole set of detected comorbid diseases, we can see visually the increasing quantity and complexity of diseases across psycho-social profiles. Figure 5 shows the resulting zoomed-out Comorbidity Disease Network for males across life stages. Each edge in the graph represents a significant association between two diseases (p-value < 0.01).
profiles can explain these three phases. Besides, the average degree of disease nodes increases from 1.5 in childhood to 6 in older adulthood, which reveals an increasing number of potential pathways and more complicated possible disease scenarios as age increases. The diameter increases from 2 to 9 (i.e. the longest shortest path between two diseases in the CDN consists of 9 significant associations). This confirms, unfortunately, the increasing number of possible pathways. Fortunately, since a diameter of 9 is very low compared to the highest possible diameter, it also reveals that diseases act like "islands" of diseases, which can be manageable if these islands are detected and understood from data.
In Table 2 we present some descriptive statistics and the diseases with high centrality measures in our data, according to life stage. The hardest and perhaps most misleading part of handling a centrality measure is its interpretation, since the importance of a node in a graph (which is partially reflected by centrality and other statistics) rests on assumptions about what is important to consider in a specific domain of interest. Since our work is essentially technical, we will be careful in our interpretations and leave it to physicians to consider the centralities in light of their specialty. However, in this work we adopted the following basic interpretations to justify the chosen centralities. Diseases with high degree centrality are diseases that have a high probability of appearing in multiple comorbidities. Closeness centrality reflects the degree of contagion of a disease over its comorbid diseases. Diseases with high edge betweenness are diseases that should be targeted in therapeutic interventions, since they act like bridges connecting other diseases, which increases the multimorbidity burden of patients. Finally, diseases and conditions with high eigenvector centrality are conditions related to influential diseases, which may indicate diseases related by direct causation. Notice that nodes in the graph may have heterogeneous centrality profiles, and nodes with the highest score in most centralities tend to play the role of "troublemakers" in the CDN. Examples of diseases with the highest scores in multiple centralities at the same time are primary hypertension (I10), heart failure (I50.9), and acute kidney failure (N17.9). It is known in the medical literature that these diseases generate other diseases because of the damaging impact of the instability of vital organs at the core of the circulatory and genitourinary systems. Further, we applied the community detection approach to detect potential components of the MDN.
Since the algorithms in the literature have different assumptions, pros and cons, we conducted a comparative analysis of four well-known algorithms. Figure 9 presents the number of communities detected by each algorithm.
We observe from Figure 9 that females have, on average, more communities than males, and that modularity values were good, around 0.78-0.90. This can be explained by the modular topology of the CDN, i.e. groups of diseases are easily distinguishable (and act like islands), which makes the performances of the algorithms comparable. We think that the increasing number of clusters reveals increasing types of multimorbidity and additional layers of multimorbidity burden as age increases. On the other hand, label propagation performs best when the labels are initialized at the community's center of gravity: initializing this algorithm at nodes with high closeness centrality can yield the best results, because label propagation is proportional to the node's capacity to propagate information flow across the network. Edge betweenness targets edges that connect communities and could be of medical importance. Determining the boundaries of communities can be helpful in the treatment of edge diseases and, thus, in the prevention of potential complications.
For example, hypertension can act as an edge disease between cardiovascular and cerebral diseases, so targeting hypertension in a patient with cardiovascular diseases can help the patient avoid cerebral complications and the additional burden they bring. In terms of interpretability, we found some communities that we think are easily interpretable (see  [40]. Hypertensive kidney and heart diseases [41] have hypertensive perturbation in common, and long-term use of insulin can be one reason that relates kidney disease and diabetes [42]. See Figure 11. Hypertensive chronic kidney disease, I10 = Essential (primary) hypertension.
As a main conclusion, the aging process not only increases the average burden of additional conditions, but can also affect complexity alongside quantity. This is important theoretically, since the aging process can be considered an example of a hidden Multimorbidity Mechanism that generates multimorbid datasets, which is a very important assumption to make before any analysis of this type of data.

CONCLUSION
We based our work on the conviction that psycho-social characteristics have to be taken into account when managing patients with multimorbidity, in order to further valorize human capital. We tried to investigate how diseases co-occur across life stages, taken from a psycho-social point of view. We classified these profiles based on Erikson's theory of life stages.
First, we gave a formal definition of the detection of a Multimorbidity Disease Network. We focused on pairwise associations, which output a Comorbidity Disease Network (CDN). Second, we presented how some centrality measures of the CDN evolve over life stages, and we compared four algorithms that detect components/communities in the CDN. The main results indicate that the aging process increases multimorbidity not only in terms of disease count, but also in terms of complexity, and we showed qualitative and quantitative visualizations of this phenomenon. This work needs to be verified for generalizability on other empirical data, and the centralities best suited to the multimorbidity context need further investigation.

CONFLICT OF INTERESTS
The author(s) declare that there is no conflict of interests.