The impact of cellular networks on disease comorbidity

The impact of disease-causing defects is often not limited to the products of a mutated gene but, thanks to interactions between the molecular components, may also affect other cellular functions, resulting in potential comorbidity effects. By combining information on cellular interactions, disease--gene associations, and population-level disease patterns extracted from Medicare data, we find statistically significant correlations between the underlying structure of cellular networks and disease comorbidity patterns in the human population. Our results indicate that such a combination of population-level data and cellular network information could help build novel hypotheses about disease mechanisms.

A problem that results from this artifact is that, in later stages of mapping the OMIM diseases into the ICD-9-CM coding scheme used in the Medicare database for calculating comorbidity (described in detail in Section S2), the gene NRAMP1, along with several other genes, may not be considered to be associated with tuberculosis, although it clearly should be. We corrected this problem by including all genes that correspond to the six-digit OMIM ID that was originally mapped to each disease; in the example above, both diseases 1043 and 1533 now correctly contain all genes mapped to the six-digit OMIM code 607948 in the morbidmap. We found that 4 OMIM codes were mapped to three ICD-9-CM codes, while 43 OMIM codes were mapped to two ICD-9-CM codes.
b. Protein-protein interaction and gene expression.
The protein-protein interaction data were taken from Rual et al. 3 and Stelzl et al. 4 . To calculate the average human gene coexpression, we used Affymetrix (www.affymetrix.com) microarray data that list the expression levels of select genes in 36 human tissues 5 (see also Section S3).

c. The Medicare database
We obtained raw Medicare claims files directly from the Centers for Medicare & Medicaid Services (CMS, www.cms.hss.gov) in the form of Medicare Provider Analysis and Review (MEDPAR) files. These files are made available subject to a Data Use Agreement. At present, access to such files is via the CMS-designated Research Data Assistance Center (ResDAC, www.resdac.umn.edu) program. The data we used contain the complete hospitalization records of 13,039,018 Medicare patients, for a total of 32,341,348 visits over 4 years, from 1990 to 1993. In the Medicare database, each line corresponds to a hospitalization event of a patient and records up to ten diagnoses. The diagnoses are presented using the ICD-9-CM scheme (www.icd9data.com), where a disease is assigned a numeric code such as 174 for breast cancer (Fig. S1).
The Medicare records are comprehensive, and they are frequently used for epidemiological and demographic research 6,7 . The present sample was abstracted from a comprehensive set of hospital visits of all elderly patients (aged 65 and up) in the Medicare program, which comprises 96% of the entire elderly American population. Our sample of 13,039,018 hospitalized patients shows a mean age of 76.3 ± 7.4; 41.8% were female, and 89.9% were Caucasian. Most patients were diagnosed with multiple diseases during the observation period, a co-occurrence that is in some cases accidental but is also often causal, i.e., one disease increases the likelihood of the development of other diseases 8,9 .

S2. Mapping between the Medicare database and the OMIM disease-gene association
The mapping from OMIM diseases to ICD-9-CM codes we used in this study is given in the Supporting Information. A schematic diagram of the procedure for counting incidences and co-occurrences of disease pairs, using the OMIM and the Medicare database in conjunction with this mapping, is shown in Fig. S1.
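As a rough illustration of the counting step, the following Python sketch tallies per-patient incidences and pairwise co-occurrences from hospitalization records. The record layout (a patient ID plus a list of dotted ICD-9-CM strings), the function names, and the prefix rule that folds subcodes such as 174.1 into 174 are illustrative assumptions, not the authors' actual pipeline.

```python
from collections import Counter
from itertools import combinations


def matches(diagnosis, disease_code):
    """True if a recorded diagnosis equals disease_code or is one of its subcodes.

    Assumption: codes are dotted strings such as "174" or "170.9".
    """
    if diagnosis == disease_code:
        return True
    if "." in disease_code:
        # e.g. disease 250.0 also covers the more specific 250.01
        return diagnosis.startswith(disease_code)
    # e.g. disease 174 covers 174.1 but not an unrelated code like "1740"
    return diagnosis.startswith(disease_code + ".")


def count_incidence_and_cooccurrence(visits, disease_codes):
    """visits: iterable of (patient_id, [diagnosis codes]), one per hospitalization.

    Returns (incidence, cooccurrence, number_of_patients), all counted per
    patient over the whole observation period (an assumption of this sketch).
    """
    diseases_by_patient = {}
    for patient_id, diagnoses in visits:
        found = diseases_by_patient.setdefault(patient_id, set())
        for code in disease_codes:
            if any(matches(d, code) for d in diagnoses):
                found.add(code)

    incidence = Counter()
    cooccurrence = Counter()
    for found in diseases_by_patient.values():
        incidence.update(found)
        # Each unordered disease pair is counted once per patient.
        cooccurrence.update(combinations(sorted(found), 2))
    return incidence, cooccurrence, len(diseases_by_patient)


# Toy records (hypothetical data, for illustration only).
visits = [
    ("p1", ["174.1", "170.9"]),
    ("p1", ["174"]),
    ("p2", ["174.9"]),
    ("p3", ["170.9", "401.1"]),
]
incidence, cooccurrence, n_patients = count_incidence_and_cooccurrence(
    visits, ["174", "170.9"]
)
```

Counting per patient rather than per visit means that a patient hospitalized twice for the same disease contributes only once to its incidence.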
We accept a single ICD-9-CM code, such as 170.9, as our unit of disease, whose associated genes are the union of those of its corresponding OMIM diseases, as shown in Fig. S1. To find the incidence of ICD-9-CM 170.9 and its comorbidity with other diseases, we parsed the Medicare database and counted the number of patients whose records show the ICD-9-CM code 170.9 (see Fig. S1 and footnote 1). After performing this procedure for all codes appearing in the mapping, we find that 90.0% of the patients were diagnosed with at least one of the diseases we considered (Fig. 1A). Finally, to mitigate the effect of biases due to extremely rare diseases and disease pairs, we considered only the diseases whose incidences were 10 or larger, and the pairs of those diseases whose randomly expected co-occurrence $C_{ij}^*$ was 1.0 or larger, resulting in a total of 83,924 disease pairs for the study. In Section S4, we compare the various thresholds for $C_{ij}^*$.
Footnote 1: We also consider the hierarchical nature of ICD-9-CM codes. In Fig. S1, for instance, when we meet a subcode of breast cancer (174) such as 174.1, which denotes a more detailed, specific diagnosis, we count it also as an incidence of 174.

S3. Definitions
Genes for which no entry exists in the data set were set to have no correlation (i.e., 0.0) with other genes. When multiple expression values exist for a gene a in tissue t, its expression level in that tissue was set to the average of those values.
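The two conventions above (zero correlation for genes missing from the data set, and averaging of repeated measurements within a tissue) can be sketched as follows. This sketch assumes that coexpression is measured as a Pearson correlation across tissues; the data layout, the function names, and the choice to return 0.0 for a constant profile are hypothetical.

```python
from math import sqrt


def tissue_profile(measurements, tissues):
    """Average repeated expression values of one gene within each tissue."""
    return [sum(measurements[t]) / len(measurements[t]) for t in tissues]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a constant profile carries no correlation
    return sxy / sqrt(sxx * syy)


def coexpression(gene_a, gene_b, expression, tissues):
    """Coexpression of two genes; 0.0 if either gene has no entry."""
    if gene_a not in expression or gene_b not in expression:
        return 0.0
    return pearson(
        tissue_profile(expression[gene_a], tissues),
        tissue_profile(expression[gene_b], tissues),
    )


# Toy data (hypothetical): gene -> tissue -> list of measured levels.
tissues = ["liver", "heart", "brain"]
expression = {
    "G1": {"liver": [1.0, 3.0], "heart": [4.0], "brain": [6.0]},
    "G2": {"liver": [1.0], "heart": [2.0], "brain": [3.0]},
}
rho_12 = coexpression("G1", "G2", expression, tissues)
rho_13 = coexpression("G1", "G3", expression, tissues)  # G3 has no entry
```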

S4. The $\phi$-correlation and the robustness of comorbidity measures
In addition to the relative risk $RR_{ij}$, we used the $\phi$-correlation as a comorbidity measure, which we discuss here in more detail. Writing $N$ for the total number of patients, $I_i$ for the incidence of disease $i$, and $C_{ij}$ for the number of patients diagnosed with both $i$ and $j$, so that $C_{ij}^* = I_i I_j / N$ and $RR_{ij} = C_{ij} / C_{ij}^*$, it is defined as
$$\phi_{ij} = \frac{N C_{ij} - I_i I_j}{\sqrt{I_i I_j (N - I_i)(N - I_j)}},$$
which can be shown to be equivalent to the Pearson correlation between variables that take dichotomous values (for instance, 0 or 1), so that $-1 \le \phi_{ij} \le 1$ 10 . Note that $RR_{ij}$ and $\phi_{ij}$ are not independent of each other: we can rewrite $\phi_{ij}$ as
$$\phi_{ij} = (RR_{ij} - 1) \sqrt{\frac{I_i I_j}{(N - I_i)(N - I_j)}},$$
which clearly shows that the conditions $\phi_{ij} > 0$ and $RR_{ij} > 1$ are equivalent: both indicate that two diseases occur together more often than expected by chance alone. Nevertheless, each variable has its own advantages and disadvantages relative to the other, which prompted us to use both to show the robustness of our findings. The primary advantage of using $RR_{ij}$ is that its meaning is very clear and intuitive. However, $RR_{ij}$ can be biased in the sense that it can assume an abnormally large value when the random expectation $C_{ij}^*$ is small, i.e., when $i$ and $j$ are very rare.
Our method to overcome this issue is to introduce a threshold (TH) for $C_{ij}^*$ and to consider only disease pairs whose expected co-occurrence equals or exceeds it. In contrast, the advantage of using $\phi_{ij}$ as the comorbidity measure comes from the fact that it is bounded in the range [-1, 1]. However, even when two diseases are highly comorbid ($RR_{ij} \gg 1$), $\phi_{ij}$ can have an apparently small numerical value, because $\phi_{ij} \approx (RR_{ij} - 1) \sqrt{I_i I_j / N^2}$, and in our database usually $I_i, I_j \ll N$. Furthermore, two diseases being "maximally" comorbid given the incidences (meaning $C_{ij} = I_i$, one disease always occurring with the other, assuming $I_i \le I_j$) does not imply $\phi_{ij} = 1$; rather, we have
$$\phi_{\max} = \sqrt{\frac{I_i (N - I_j)}{I_j (N - I_i)}},$$
which is always smaller than 1 unless $I_i = I_j$. To compensate for this, we could use the normalized variable $\phi_{ij} / \phi_{\max}$, so that when $\phi_{ij}$ is at its maximum, $\phi_{ij} / \phi_{\max} = 1$.
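The relations between these measures are easy to check numerically. The sketch below assumes the standard definitions of relative risk (observed over expected co-occurrence) and of the $\phi$-correlation for two binary variables; the function and argument names, and the example counts, are illustrative only.

```python
from math import isclose, sqrt


def expected_cooccurrence(inc_i, inc_j, n):
    """Random expectation C* = I_i * I_j / N."""
    return inc_i * inc_j / n


def relative_risk(c_ij, inc_i, inc_j, n):
    """RR = C_ij / C*."""
    return c_ij / expected_cooccurrence(inc_i, inc_j, n)


def phi(c_ij, inc_i, inc_j, n):
    """phi-correlation of two binary disease indicators."""
    return (n * c_ij - inc_i * inc_j) / sqrt(
        inc_i * inc_j * (n - inc_i) * (n - inc_j)
    )


def phi_max(inc_i, inc_j, n):
    """Largest phi attainable for the given incidences (C_ij = min incidence)."""
    low, high = sorted((inc_i, inc_j))
    return sqrt(low * (n - high) / (high * (n - low)))


# Hypothetical counts: two rare diseases in a large population.
n, inc_i, inc_j, c_ij = 1_000_000, 1_000, 2_000, 100

c_star = expected_cooccurrence(inc_i, inc_j, n)  # 2.0
rr = relative_risk(c_ij, inc_i, inc_j, n)        # 50.0
phi_ij = phi(c_ij, inc_i, inc_j, n)              # small despite RR >> 1

# phi rewritten in terms of RR, as in the text.
phi_from_rr = (rr - 1) * sqrt(inc_i * inc_j / ((n - inc_i) * (n - inc_j)))
assert isclose(phi_ij, phi_from_rr)

# Maximal comorbidity (C_ij = I_i) gives phi_max, which is below 1 here.
assert isclose(phi(inc_i, inc_i, inc_j, n), phi_max(inc_i, inc_j, n))

normalized = phi_ij / phi_max(inc_i, inc_j, n)
```

With these counts the pair passes a threshold of TH=1 on the expected co-occurrence, since the expectation is 2.0.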
In Fig. S2 we study the effect of the aforementioned threshold TH imposed on $C_{ij}^*$ on the average values of $RR$, $\phi$, and $\phi/\phi_{\max}$ for disease pairs that satisfy the various criteria presented in Fig. 1C. We compare the cases of TH equal to 0 (no threshold, i.e., all disease pairs are included), 1, 3, 4, 5, and 10. Fig. S2 demonstrates that, as pointed out earlier, introducing thresholds significantly changes the magnitude of $RR$ by removing pairs with exceptionally large values of $RR$ that arise from values of $C_{ij}^*$ that are far too small. Most importantly, $RR$ becomes stable and robust for any threshold larger than 0 (i.e., TH ≥ 1).
Still, the qualitative trend we observe in the zero-threshold (TH=0) curve remains unchanged in the thresholded curves, except for the case of $RR$ against one of the genetic variables. We also observe a similar trend in $\phi$ and $\phi/\phi_{\max}$, which again demonstrates the robustness of our conclusions. Because of this strong stabilizing effect and the robustness resulting from the thresholds on $C_{ij}^*$, we chose to show the curves of TH=1 in the paper. Had we chosen any other nonzero threshold, the conclusions would have been qualitatively unchanged.

S5. Determining the P-values and Errors
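The P-values for the relative risk $RR_{ij}$ and $\phi_{ij}$ can be obtained using standard numerical analysis tools such as Mathematica by approximating the binomial distribution generated from $N$ and $p = I_i I_j / N^2$ as a Poisson distribution (since $N = 1.3 \times 10^7 \gg 1$), where $N$ is the number of patients and $I_i$ is the incidence of disease $i$. Since $C_{ij}^* = N \times p = I_i I_j / N$, the P-value of comorbidity between diseases $i$ and $j$ is given by the upper tail of the Poisson distribution with mean $C_{ij}^*$, evaluated at the observed co-occurrence $C_{ij}$:
$$P_{ij} = \sum_{k \ge C_{ij}} \frac{(C_{ij}^*)^k \, e^{-C_{ij}^*}}{k!}.$$
The P-values for Pearson correlations (appearing in Fig. 2A) between comorbidity and genetic variables were calculated via a Monte Carlo method, in which we randomized the ordering of the genetic variables many times (for Fig. 2A, we performed one million iterations for each case) and directly counted how often we observed a correlation larger than the actual value.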
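Both kinds of P-value can be sketched with the Python standard library. This sketch assumes that the comorbidity P-value is the upper-tail probability of observing at least the actual number of co-occurrences under a Poisson distribution whose mean is the random expectation, and that the Monte Carlo test shuffles the genetic variable against a fixed comorbidity variable. All function names, the default iteration count, and the toy inputs are hypothetical.

```python
import random
from math import exp, lgamma, log, sqrt


def poisson_upper_tail(c_obs, mean, max_terms=10_000):
    """P(X >= c_obs) for X ~ Poisson(mean), summing terms computed in log space."""
    if c_obs <= 0:
        return 1.0
    total = 0.0
    for k in range(c_obs, c_obs + max_terms):
        term = exp(-mean + k * log(mean) - lgamma(k + 1))
        total += term
        # Past the mode the terms shrink monotonically; stop once negligible.
        if k > mean and term <= 1e-17 * total:
            break
    return total


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def permutation_p_value(comorbidity, genetic, n_iter=10_000, seed=0):
    """Fraction of shuffles whose correlation exceeds the observed one."""
    rng = random.Random(seed)
    observed = pearson(comorbidity, genetic)
    shuffled = list(genetic)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(shuffled)
        if pearson(comorbidity, shuffled) > observed:
            hits += 1
    return hits / n_iter


# Hypothetical pair: 100 observed co-occurrences against an expectation of 2.0.
p_comorbidity = poisson_upper_tail(100, 2.0)

# Hypothetical, perfectly correlated toy variables for the shuffle test.
comorbidity = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
genetic = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0]
p_correlation = permutation_p_value(comorbidity, genetic, n_iter=2_000)
```

The resolution of the shuffle-based estimate is limited by the number of iterations: with one million iterations, as used for Fig. 2A, the smallest nonzero P-value that can be resolved is $10^{-6}$.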

S7. List of genetically linked pairs
In Supporting

S8. Discussions of three disease pairs
Here we discuss in more detail the three example disease pairs introduced in the main text. Full genetic associations of each disease can be found in the Supplementary Tables 1 and 2.

a. Breast cancer and cancer of bone and cartilage
Worldwide, breast cancer is the most common type of cancer and the fourth most

b. Alzheimer's disease and myocardial infarction
Among the various conditions leading to dementia, Alzheimer's disease is the leading cause, accounting for more than 60% of all dementia cases 14 . Myocardial infarction (heart attack) is the leading cause of death for men and women in the U.S. (http://www.americanheart.org/downloadable/heart/113535864858055-1026_HS_Stats06book.pdf). Population-based studies have been carried out to evaluate the comorbidity of myocardial infarction and dementia, with inconsistent results 15-17 . In one study, the comorbidity was shown to be gender-dependent: women with a history of myocardial infarction were 5 times more prone to dementia than those without such a history, an effect absent in men 15 . In another, unrecognized myocardial infarction was associated with an increased risk of dementia in men 17 . Also, low cardiac output, cerebral hypoperfusion, and microembolization are believed to be responsible for the development of cognitive impairment after a myocardial infarction 16 . Genetic effects are also known to some degree: the sequence variation of ACE and the resulting variation in the plasma level of the angiotensin I converting enzyme are known to be involved in both diseases. Blood pressure is partly regulated by angiotensin II, formed from angiotensin I, and the risk of Alzheimer's disease may be related to blood pressure regulation 18-21 .
Apolipoprotein E (APOE) is a ligand for the low density lipoprotein (LDL) receptor, the very low density lipoprotein (VLDL) receptor, etc., and its polymorphism has an impact on plasma cholesterol levels. Therefore, the APOE polymorphism is suspected to modulate the risk of coronary heart disease, and its association with myocardial infarction and also with Alzheimer's disease has been identified by population-based studies 22,23 .

c. Carpal Tunnel Syndrome and Autonomic Nervous System.
Carpal Tunnel Syndrome (CTS) occurs when the median nerve, which runs from the forearm into the hand, becomes pressed or squeezed at the wrist.

Fig. S2. Average $RR$, $\phi$, and $\phi/\phi_{\max}$ for disease pairs that satisfy genetic constraints and whose expected co-occurrence $C_{ij}^*$ equals or exceeds a given threshold TH. The cases of $RR$ and $\phi$ for TH=1 are also shown in Fig. 1C. Imposing a threshold greatly reduces the magnitude of the comorbidity by removing pairs that exhibit unusually high values due to very small values of the expected co-occurrence $C_{ij}^*$. Note that the curves with TH ≥ 1 are stable over various thresholds.