Investigation of multiple sclerosis-related pathways through the integration of genomic and proteomic data

1Department of Molecular Biology and Genetics, Faculty of Science and Letters, Istanbul Technical University, Istanbul, Turkey 2Department of Biostatistics and Medical Informatics, Faculty of Medicine, Acıbadem University, Istanbul, Turkey 3Department of Neurology, Cerrahpaşa School of Medicine, Istanbul University-Cerrahpaşa, Istanbul, Turkey 4Department of Molecular Biology and Genetics, Faculty of Science and Letters, Acıbadem University, Istanbul, Turkey


INTRODUCTION
Multiple sclerosis (MS) is known to be an immune-mediated, neurodegenerative central nervous system (NS) disorder with complex inheritance and pathophysiological mechanisms. Although approximately common 250 genetic variants with low to modest risk effects have been associated with MS mainly by genome-wide association studies (GWAS) using rather large sample groups (International Multiple Sclerosis Genetics Consortium, 2007;Sawcer et al., 2011;Patsopoulos et al., 2017;International Multiple Sclerosis Genetics Consortium et al., 2019), known variants not only fail to explain predicted MS heritability but also cannot be efficiently translated into disease mechanisms. To date, numerous studies have also revealed potential diagnostic and prognostic biomarkers and disease-related cellular pathways that emphasize the different pathological components of the disease; however, the exact underlying mechanisms in disease development and progression are largely unknown. In order to better translate the growing number of findings into disease pathophysiology, algorithms for pathway analyses of multiple high-throughput omics data seem essential. In this context, the integration of multiple omics data is essential to better describe the complex nature of MS.
In our previous study (Avsar et al., 2015), we have conducted a cerebrospinal fluid (CSF) proteome analysis using 2D-gel electrophoresis and mass spectrophotometry and identified 151 differentially expressed proteins between an MS cohort of 179 patients with different clinical MS phenotypes and 42 non-MS controls. Later, affected Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways were identified using the functional enrichment tool Pathway and Network-Oriented GWAS Analysis (PANOGA) (Bakir-Gungor, Egemen & Sezerman, 2014), revealing MS-related pathways including aldosterone-regulated sodium reabsorption pathway, renin-angiotensin system, notch signaling pathway, and vitamin digestion and absorption pathway. Here, we further explored disease-related pathways, applying single nucleotide polymorphism (SNP) genotyping on 11 MS patients, who were included in the previous proteomic study, and 60 independent, healthy individuals. Pathway enrichment analyses of the genomic and proteomic data were conducted. The two datasets were merged, highlighting the most prominent pathways that may be affected in the studied patient group.

Study groups
All patients were diagnosed at Istanbul University-Cerrahpaşa, Cerrahpaşa School of Medicine, Department of Neurology, according to the McDonald criteria (Polman et al., 2011). Eleven unrelated MS patients were randomly selected among the study group of our previous work comprising of 179 MS patients. The patient group had a heterogeneous disease presentation at the time of their CSF sample collection. Sixty independent age-and gender-matched intrinsic healthy controls were included in the analyses. All individuals in the study group are of Turkish origin. All procedures of the study were in accordance with the Helsinki Declaration of 1964 and its later amendments. The Ethics Committee of Istanbul Technical University approved the study (İTÜ-SM.İNAREK-MBG-1), and each individual of the study group gave a written informed consent form prior to the sample collection.

Genome-wide associations
DNA isolation from blood samples was performed (Roche DNA Isolation Kit for Mammalian Blood), and genotyping for 300.000 SNP markers was applied for each individual on the Illumina HumanCytoSNP-12 array. Illumina GenomeStudio software was used for quality control. All individuals showed sample call rates of more than 98% (98.65-99.63%) and were therefore included in the study. Golden Helix SNP & Variation Suite software was used for identity-by-descent detection, suggesting no relatedness between the individuals. SNP filtering was performed: SNPs on the Y chromosome, with call rates lower than 95%, minor allele frequency lower than 0.01, and in strong linkage disequilibrium (r 2 > 0.5) were excluded from the study. A total of 129.547 SNPs was included in the analysis. Frequency differences of SNPs between cases and controls were assessed, and genotypic association P-values were obtained.

Pathway enrichment analysis
Genotypic associations and proteomic data were subjected to pathway enrichment analyses using the functional enrichment tool PANOGA, which reveals functionally relevant pathways by identifying genes within the pathways, incorporating protein-protein interaction (PPI) information, and extracting significant pathways. SNPs with genotypic association P-values lower than 0.05 and differentially-expressed proteins with P-values lower than 0.05 from our previous study were used for PANOGA procedure as follows: For SNP associations, because a SNP might affect more than one gene, each SNP was initially assigned to the gene on which the SNP has the most important functional effect, using the SPOT webserver (Saccone et al., 2010). Functional information of SNPs was obtained utilizing SPOT, F-SNP (Lee & Shatkay, 2007), SNPnexus (Dayem Ullah, Lemoine & Chelala, 2012), and SNPinfo (Xu & Taylor, 2009). Genes and proteins were then mapped onto a PPI network, for which Goh et al.'s human PPI network was used (Goh et al., 2007). Next, the jActive Modules algorithm (Ideker et al., 2002 was applied to identify active subnetworks containing a large number of the disease-affected genes and proteins in the PPI network. Each genotypic association and differential expression P-value was taken into account, and active subnetworks that overlap at most 50% with each other were extracted. In order to evaluate the biological importance of the subnetworks, the number of genes and proteins found in a specific KEGG pathway was compared to the total gene and protein number involved in the corresponding pathway. For this functional enrichment procedure, a two-sided hypergeometric test was used, and the Bonferroni method was applied for multiple testing corrections of the P-values. Significantly associated KEGG pathways to our patient group consisted of significantly enriched (P < 0.05) pathways for at least one of the active subnetworks. For the pathways that were enriched in multiple subnetworks, the one with the lowest P-value was reported. PANOGA was run 10 times for both genotypic associations and differentially-expressed proteins, and the lowest P-values over the 10 iterations were reported. The resulting enriched pathways for both datasets were then merged. If a given pathway was identified in both analyses, the corresponding P-values were merged using Fisher's combined probability test. If the pathway was identified in only one of the analyses, the corresponding P-value was used as the final P-value.

Clustering of enriched pathways
Using a method previously described by Chen et al. (2014) we clustered the enriched pathways to identify similar groups and establish representative pathways. The clustering approach can be summarized as follows: initially overlap index matrix (OI) that consists of the overlap indices between all of the pairs of enriched pathways was calculated. The overlap index (OI i,j ) of a pair of pathways P i and P j (denoting the ith and jth pathways, respectively) was defined as: where G i is the set of all genes in the ith pathway. In the overlap matrix, each row (o i ) represents the gene overlap profile of a pathway (i.e., o i is the gene overlap profile of the ith pathway). To identify similarity between each pair of pathways Pearson correlation coefficients (R) between each pair of overlap profiles (e.g., o i and o j ) were calculated. These correlation coefficients were then converted to pairwise distances (PD) as: Using PD as the distance metric, hierarchical clustering (average-linkage) of the enriched pathways was performed. Examining the hierarchical clustering dendrogram, the dendrogram was partitioned into clusters at the manually selected PD cut-off value of 0.55. The pathway with the lowest P-value in each cluster was selected as the representative pathway for that cluster.

SNP associations
All 71 individuals and a total of 129.547 SNPs passed the quality check and therefore were included in the final analyses. The Manhattan plot summarizing the genome-wide associations is given in Fig. 1. The most significantly associated SNP was rs7873, located on 3 UTR of the IGF2 gene, whose significance level was the only value close to a classical stringent P-value cut-off for a GWAS (P = 4.39 × 10 −07 ). When considering SNPs with a significance level of lower than a looser P-value cut off (P < 10 −05 ) given the small sample size, 4 more SNPs have high associations with MS in the studied patient group, one of which is located on a gene: rs17187282 (P = 1.17 × 10 −06 ), rs11688088 (P = 2.8 × 10 −06 ), rs654188 (P = 5.24 × 10 −06 ), and rs7092208 (P = 9.66 × 10 −06 , intronic variant on the MGMT gene).
For enrichment analysis of the SNP associations, the P-value threshold was set lower than 0.05 ( Fig. 1, blue line) in order to prevent the elimination of potentially MS-associated SNPs with falsely high P-values due to the low sample size and limitations of the genome-wide association methodology. Later, elimination of the false-positive SNP associations resulted from preferring not to set a conventional stringent P-value was aimed to be achieved during the identification of MS-related pathways using SNP subnetworks, therefore ruling functionally irrelevant SNPs out. A total of 6.594 SNPs had P-values under the threshold of 0.05 (Table S1), which were included in the subsequent analyses.

MS-related pathways
The combination of enriched pathways for the resulting SNPs and differentially expressed proteins resulted in 151 enriched pathways in total (Table S2). In order to identify pathways with similar content and function, this combined list of enriched pathways was clustered (Fig. 2). The dendrogram was then manually partitioned into biologically relevant clusters, and representative pathways were established. In total, 33 clusters, therefore 33 representative pathways were obtained (Table S3). Nine of the representative pathways emerged from both genomic and proteomic analyses, among which the complement and coagulation cascade (hsa04610) was the most significantly associated pathway in the studied group (P = 6.96 × 10 −30 ). Figure 3 shows the complement and coagulation cascades with genes identified through the genomic dataset and differentially expressed proteins identified through the proteomic dataset as a representative pathway for the relationships between the genomic and proteomic findings. Regulation of actin cytoskeleton (hsa04810) and focal adhesions (hsa04510) were the two systems with the second and third lowest P-values even though there were no significant protein level changes related to these pathways (P = 9.64 × 10 −27 and P = 3.29 × 10 −26 , respectively). Other 8 shared pathways with high significance levels are as follows: Adherens junctions (hsa04520, P = 5.38 × 10 −17 ), colorectal cancer (hsa05210, P = 1.70 × 10 −15 ), pathogenic Escherichia coli infection (hsa05130, P = 7.02 × 10 −9 ), prion diseases (hsa05020, P = 2.48 × 10 −5 ), endometrial cancer (hsa05213, P = 6.00 × 10 −9 ), non-small cell lung cancer (hsa05223, P = 1.40 × 10 −8 ), bladder cancer (hsa05219, P = 4.83 × 10 −6 ), and non-homologous end-joining (hsa03450, P = 4.27 × 10 −5 ). Table 1 shows the detected SNP associations and differentially expressed proteins in the above-mentioned pathways.  The most significantly MS-associated pathways obtained from the SNP associations and differentially expressed proteins did not overlap when the analyses were performed   Adherens junction, hsa04520 6.64 × 10 −25 ErbB signaling pathway, hsa04012 1.28 × 10 −23 separately (Table S2). The merged data analysis resulted in a combined list of pathways from both datasets, revealing a different set of pathways with the lowest P-values and emphasizing the most relevant pathways that could not have been prioritized by a straightforward, non-combinatorial omics approach. The top five pathways that emerged from the proteomic-only, genomic-only, and merged dataset analyses are given in Table 2.

DISCUSSION
The complement and coagulation cascade (CCC) pathway emerged as the most significantly associated pathway to MS in our patient group, with many SNP associations on different genes and differentially expressed CSF proteins as shown in Fig. 3. The complement and the coagulation systems are two closely linked cascades, both having roles in immunity. In a study by Magliozzi et al. (2019), CSF proteomic profiles of MS patients with low or high cortical lesion load were compared, revealing that the identified differentially expressed proteins were mainly involved in the complement and coagulation cascade. Examination of white matter lesions of MS patients has revealed dysregulation of coagulation-associated proteins in chronic active plaques involving Serpin A5 and tissue factor (Han et al., 2008). Also, significantly upregulated Serpin E1, another CCC component, was reported in the post-mortem cortex of progressive MS patients (Yates et al., 2017). In our study, even though the genes and differentially expressed proteins involved in CCC do not overlap, the pathway was found 10 times for each dataset in PANOGA, resulting in a high association with MS. Our results confirm the previous findings suggesting the involvement of CCC alterations in MS pathophysiology and highlight other pathway components that may also be responsible for these alterations.
Disruption of blood-brain barrier (BBB) integrity is one of the hallmarks in MS pathophysiology, during which massive leukocyte infiltration occurs across the damaged BBB into the CNS (Alvarez, Cayrol & Prat, 2011) Tight and adherens junctions between the endothelial cells of BBB have significant importance for normal immune surveillance in the CNS. Tight junctions consist of claudins, occludin, junctional adhesion molecules, and Zonula Occludens (Hawkins & Davis, 2005), whose dysfunctions leading to BBB abnormalities in MS have been previously reported (Kirk et al., 2003;Padden et al., 2007;Plumb et al., 2002). Cadherins form adherens junctions via homophilic interactions between endothelial cells and bind cytoskeletal components through cytoplasmic catenin proteins (Dejana, Orsenigo & Lampugnani, 2008). Although adherens junctions are required for the overall junctional organization to maintain the BBB integrity, their roles in normal physiology and diseased states have not been well-established. In a previous study, the level of an adherens junction protein, β-catenin, was similar in progressive MS and non-MS brain sections (Padden et al., 2007). In our patient group, adherens junctions, which were not detected in our previous proteomic study, emerged as the second most significantly altered cellular components with SNP associations on different genes and differentially expressed proteins indirectly related to the pathway. Our findings indicate that adherens junctions may have more influence on BBB function than it has been thought and are needed to be explored in both normal physiology and MS pathophysiology in more detail.
Focal adhesions and actin cytoskeleton regulation are two interrelated pathways that emerged with low final P-values from the analyses, even though no significant protein expression changes directly or indirectly related to these systems were detected. Organized clusters of focal adhesion complexes connect extracellular matrix through transmembrane integrin proteins to the intracellular actin cytoskeleton through other focal adhesion proteins in the complexes and mainly have roles in cell motility (Wozniak et al., 2004). During leukocyte infiltration across the BBB, α4-integrin proteins on leukocytes form firm connections with the endothelial surface, initiating the process (Berlin et al., 1995;Vajkoczy, Laschinger & Engelhardt, 2001). In this context, Natalizumab (Antegren, Elan Pharmaceuticals and Biogen), a monoclonal antibody against α4-integrin, has been used in MS treatment, which suppresses inflammatory activity by inhibiting leukocyte migration to the inflammation areas (Polman et al., 2006). We detected many SNP associations on genes encoding for many focal adhesion proteins, including the α4-integrin-coding ITGA4 gene. Focal adhesion molecules and their combinations to create different complexes are diverse, and regulation differences of these interactions for specific cell behaviors are yet to be elucidated.
In the studied MS patients, increased CSF nucleolin expression and SNP associations on WASL and ROCK2 genes were detected, all of which are components of the Pathogenic Escherichia coli infection pathway. Neural Wiskott-Aldrich syndrome protein encoded by WASL and Rho-associated protein kinase 2 encoded by ROCK2 are both regulators of the actin cytoskeleton (Miki, Miura & Takenawa, 1996;Gallo et al., 2012), and the involvement of the pathway may indicate induction of inflammatory cell mobility. Prion diseases are fatal conditions presenting neuronal degeneration, and the emergence of the pathway from our analyses emphasizes some shared molecular alterations with MS pathophysiology, as both neurodegeneration and neuroinflammation also occur in prion diseases (Perry, Cunningham & Boche, 2002;Eikelenboom et al., 2002). We have also detected a number of cancer pathways involving both SNP associations and differential CSF expressions. Cancer risk among MS patients has been investigated in a number of studies with diverse results, revealing unchanged (Nielsen et al., 2006), reduced (Bahmanyar et al., 2009), and increased (Grytten et al., 2020) cancer rates. Altered cancer pathways in our study group may be attributed to changes in immunological mechanisms involving the same components as in mechanisms against anti-tumor surveillance. Other pathways that emerged only from the SNP associations included a number of immunological and neurological mechanisms. T cell receptor signaling pathway (hsa04660, P = 1.86 × 10 −19 ), chemokine signaling pathway (hsa04062, P = 4.38 × 10 −19 ), and Fc-gamma R-mediated phagocytosis (hsa04666, P = 8.78 × 10 −16 ) were the immune system-related pathways. Nervous system-related pathways included axon guidance (hsa04360, P = 6.36 × 10 −20 ), retrograde endocannabinoid signaling (hsa04723, P = 6.74 × 10 −18 ), neurotrophin signaling pathway (hsa04722, P = 8.03× 10 −18 ), glutamatergic synapse (hsa04724, P = 1.74 × 10 −15 ), cholinergic synapse (hsa04725, P = 7.79 × 10 −15 ), and dopaminergic synapse (hsa04728, P = 5.83 × 10 −12 ). Also, the Rap1 signaling pathway (hsa04015, P = 2.35 × 10 −20 ) was detected, further supporting the involvement of adherens junction, focal adhesion, and regulation of actin cytoskeleton pathways.
Working on small sample sizes for genome-wide associations leads to enormous numbers of false-positive SNP associations. Using large sample sizes, on the other hand, may result in false-negative results due to the low effect sizes of disease-associated SNPs. One limitation of our study is the considerably low sample size in the assessment of SNP associations. In this regard, we aimed to take advantage of having both DNA and CSF samples of the same affected individuals. Therefore, we aimed to include both false-positive SNP associations and prevent missing out SNPs with falsely high P-values in the first place by setting a non-stringent P-value cut-off. We then aimed to exclude false-positive associations during the pathway enrichment analyses in which interacting SNP subnetworks were used to identify the pathways, eliminating functionally irrelevant SNPs. Integrative analyses of the genomic and proteomic datasets determined the shared pathways between the two datasets, which seem to represent the pathophysiological mechanisms. Another limitation of the study is that we have used a genotyping chip with a relatively low SNP number (300K) since it was one of the most powerful chips when the genotyping had been carried out. Lastly, we had not obtained mRNA samples of the study participants at the time of their sample collection, which would lead to a more comprehensive analysis of the altered pathways in the studied patient group.

CONCLUSIONS
Here, we provide an integrated pathway analysis of whole-genome SNP and CSF proteome datasets obtained from the same MS patients, highlighting the most prominent molecular pathways that may be altered in the studied patient group. We confirmed the involvement of some of the previously known players in MS pathophysiology and suggested possible roles of some others. We conclude that this approach provides a better placing of separate genomic and proteomic datasets into a biological context and is useful for elucidating disease mechanisms.