Supervised machine learning models and protein-protein interaction network analysis of gene expression pro�les induced by omega-3 polyunsaturated fatty acids

Background Omega-3 polyunsaturated fatty acids (PUFAs), such as eicosapentaenoic (EPA) and docosahexaenoic (DHA) acids have bene�cial effects on human health but their effect on gene expression in elderly individuals (age ≥ 65) is largely unknown. To examine this, the gene expression pro�les were analyzed in the healthy subjects (n = 96) at baseline and after 26 weeks of supplementation with EPA+DHA to determine up-regulated and down-regulated differentially expressed genes (DEGs) triggered by PUFAs. The protein-protein interaction networks were constructed by mapping these DEGs to a human interactome and linking them to the speci�c pathways


Background
The bene cial effects of long-chain n-3 polyunsaturated fatty acids (PUFAs), such as eicosapentaenoic acid (EPA) and docosahexaenoic (DHA) are well known to promote human health [1].The usage of PUFAs as food supplements demonstrated the ability to target in ammation and other underlying pathogenic factors found in depression and cancer [2][3][4].On the other hand, the PUFAs impact on gene expression pro les in health and disease is not completely understood.Some researchers implemented the whole genome transcriptomics analysis to examine the genetic outcomes of a high EPA+DHA intake, resulting in a decreased gene expression involved in in ammatory and atherogenic-related pathways [5].Moreover, PUFAs could modulate the cellular functions through speci c changes in gene expression via DNA methylation mechanisms or binding and subsequent activation of peroxisome proliferator-activated receptors [6].However, results from such studies are inconsistent, providing the basis for the development of alternative techniques, such as machine learning (ML) algorithms.These approaches can be used for the analysis of high-throughput deep sequencing data due to their computational e ciency and robustness in nding common patterns obtained from a small sample size [7].In particular, a multivariate ML analysis, known as Gaussian process classi cation, con rmed that baseline fatty acids predicted response to treatment in the ω-3 PUFA group with high degrees of sensitivity, speci city, and accuracy [8].
In another study, researchers found that the ML algorithm based on the support-vector machines classi er yielded good agreement with the certi ed/reference values for the prediction of EPA and DHA in the reference standard using vibrational spectroscopic data [9].Moreover, some authors have also explored the interactions between serum free fatty acids, including PUFAs, and fecal microbiota in obese patients through the regression tree modeling, showing a non-obese-linked pro le with serum EPA > 0.235 μg/mL [10].Therefore, in the present study we aimed to identify differentially expressed genes (DEGs) associated with the particular metabolic or disease pathways in the healthy subjects and to assess the ability of different ML algorithms to predict and cluster the EPA/DHA-or high-oleic acid sun ower oil (HOSF)-speci c gene expression pro les.

Results And Discussion
To obtain DEGs expressed in healthy individuals at baseline and after 26 weeks of treatment with 1.8 g of EPA+DHA mixture and HOSF as a negative control, publicly available microarray data set (GSE12375) was retrieved from the GEO repository.The unnormalized gene expression data (Supplementary Figure 1  [A, B]) was tested for quality assessment purposes calculating normalized unscaled standard error (NUSE).This quality control analysis, designed only for comparing arrays within one dataset, has revealed no signs of low-quality arrays with signi cantly elevated NUSE values (Supplementary Figure 2 [A, B]).Subsequently, data preprocessing was performed to eliminate the effect of background noise and to normalize and summarize the gene expression values per each probe of the database (Supplementary material 3 [A, B]) according to the standard protocol published elsewhere [15].
To evaluate the overall structure of the RNA-seq dataset, we performed the unsupervised dimensionality reduction of gene expression data (23,941 genes) at baseline and after 26 weeks of treatment with EPA+DHA mixture using PCA.As a result, no clustering was observed between EPA+DHA and HOSF samples (Figure 2 [A, B]) at these particular time points, indicating the possible limitation of the PCA methodology, which might be due to the effect size of the biological signal as well as on the fraction of samples containing this signal [16].
Next, the total number of DEGs was evaluated (Table 1 and Figure 3 [A]), showing a signi cant increase in the number of these genes after 26 weeks of treatment (1,805) with EPA+DHA mixture compared to the baseline (779).Similar patterns were observed in the study of Bouwens and coauthors where the EPA/DHA and HOSF consumptions using the same conditions and time points resulted in gene expression changes of 1,040 genes induced by PUFAs and 298 genes induced by the sun ower oil.Of these genes, 140 were overlapping between the groups, and 900 were belonged to the unique genes in the EPA/DHA group [5].We identi ed 847 and 312 up-regulated unique genes and 753 and 262 unique downregulated DEGs at baseline and after 26 weeks of PUFA treatment (Figure 3 [B, C]), where the number of up-regulated genes is slightly dominated over the down-regulated ones except for the overlapping (common) genes (97.5 ± 0.71 vs. 107.5 ± 0.71), showing the inverse dynamics (Figure 3 [C]).
To further investigate the functional impact of up-regulated and down-regulated DEGs upon the treatment with EPA+DHA mixture vs. control at different time points, these genes were mapped onto the DAVID database to identify the speci c metabolic or disease pathways.This database is known to be one of the most popular tools in the eld of high-throughput functional annotation bioinformatics and microarray analysis [17].Subsequently, the most signi cantly enriched KEGG pathways were linked to the neurotrophin and mitogen-activated protein kinase (MAPK) signaling, complement/coagulation cascades, and axone guidance for the up-regulated DEGs and cancer or acute myeloid leukemia pathways, long-term depression, fructose/mannose and arachidonic acid metabolism for the downregulated genes (Table 2).Further, all the genes related to these pathways were subjected to the PPI network analysis to observe their involvement in the interconnections of different pathways (Figure 4 [A-D]).
Previously, some PUFAs, such as arachidonic acid, were discovered to activate MAPK in rat liver epithelial WB cells by a protein kinase C-dependent mechanism [18].Moreover, it has been already demonstrated in many in vitro, in vivo and in clinical studies that PUFAs can downregulate cancer-related cellular proteins and modify signaling to inhibit tumor growth and metastatic rate and to extend patients' survival duration [19,20].Furthermore, PUFAs can promote the axonal outgrowth by translational regulation of Tau and collapsin response mediator protein 2 expression and probably for axonal guidance as well by the modulation of lipid rafts, which are essential for this particular pathway [21][22][23].EPA and DHA are also well-known in alleviating symptoms of different mental illnesses starting from anxiety and depression to bipolar disorder and schizophrenia [24].Recently, the PUFA effect on preoperative bleeding has been estimated in the randomized placebo-controlled clinical trial to be associated with a lower risk of bleeding at higher sh oil levels [25].This information might support our ndings that PUFAs could potentially trigger the upregulation of complement/coagulation cascade genes, reconsidering current recommendations not to use sh oil before cardiac surgery [26].The negative impact of some PUFA dietary components on sugar metabolism has been also examined in the study, where the authors evaluated them for the prevention and treatment of type 2 diabetes [26].The evidence might suggest that consumption of sh oil supplements at high doses could lead to a further worsening of glucose metabolism [27].Another metabolic change caused by EPA and DHA is most likely related to the observed effects of marine n-3 PUFAs on eicosanoid pro les, resulting in a decreased arachidonic acid metabolism via the inhibition of phospholipase A2 and cyclooxygenase [28].
To construct the PPI network and maximize the genome coverage, the human interactome was implemented as a nonredundant and undirected binary data set comprising 16,018 unique HGNC-curated protein IDs and 299, 018 interactions compiled from the literature [15].As an outcome, the PPI networks belonging to cancer/leukemia pathways and MAPK/neurotrophin signaling were identi ed as the subnetworks with a high number of nodes and edges, where the less expressed proteins were located on the periphery of the network.Interestingly, the gene expression of neurotrophins and their target receptors were determined to be differentially up-regulated by n-3 PUFAs via an age-dependent mechanism in the C57BL/6 mice cerebral cortex [29].On the contrary, DHA has previously been shown to inhibit the induction and progression of acute myeloid leukemia cell lines (KG1a and HL-60) via the DHA-induced apoptosis, increasing the expression of the pro-apoptotic Bax protein [30,31].
Before applying supervised ML algorithms, the inferential statistics, such as one-way ANOVA, was performed on DEGs comprising the analyzed subnetworks, to show that most of the gene expression data were statistically signi cantly different (p-value < 0.05) except for the plakoglobin (JUP), lymphoid enhancer-binding factor 1 (LEF1) genes, and MYC proto-oncogene from cancer/leukemia pathways (Table 3).The logistic regression, naïve Bayes, and DNN models were implemented to solve the binary (EPA/DHA vs. HOSF) classi cation problem together with the receiver operating characteristics (ROC) analysis.The ROC analysis is considered to be valuable in evaluating predictive models, including gene expression pipelines and molecular docking experiments, since it captures the trade-off between sensitivity and speci city over a continuous range [32][33][34].The ROC curves were plotted for the ML models (Supplementary Figure 4 and 5 [A-C]) to assess the true and false positive rates as the total number of correct positive results predicted among all the positive samples and the total number of incorrect positive predictions among all negative samples in the dataset.Besides, the area under the ROC curve (AUC) was calculated as a numerical value that can be used to compare the logistic regression and naïve Bayes models getting maximal AUC values for the former model (Tables 4 and 5).Ideally, the best prediction model produces a curve on the top-left corner (0, 1) indicating perfect classi cation (100% sensitivity and speci city), whereas our ROC curves occurred in the top half of the graph, giving only a better-than-average model prediction.In our case, the ROC/AUC analysis cannot be used for screening and diagnostic purposes as is it lacks a minimum sample size (≈ 300) required to estimate the ML performance more precisely [35].
The DNN classi er with three hidden layers provided the worst performance on all the datasets with the lowest accuracy and harmonic mean of the precision and recall (F-score) values (Table 6) due to the small sample size (n = 96).Indeed, logistic regression models for binary classi cation can provide better performance than DNN because they are less prone to over tting and not so di cult to train [36,37].
Overall, the models exhibited their best performance values for the MAPK signaling pathway except for the Naïve Bayes model classi er, where the highest performance values were detected for cancer/leukemia pathway.Moreover, the "re ned" model performances for this pathway were signi cantly improved after the exclusion of the JUP, LEF1, and MYC genes (p-value ≥ 0.05), which has contributed by decreasing the accuracy and F-score values, especially in the performance of DNN model.
Finally, the decision surface analysis was utilized using the top two feature importance determinants for DEGs that contributed towards classifying the EPA/DHA and HOSF samples (Supplementary Figures 6  and 7 [A-C]).From Figures 5 and 6, it is clear that the logistic regression and naïve Bayes methods were able to separate the majority of the EPA/DHA samples from the HOSF ones, learning the underlying patterns quite well based on the most important DEG features.Even though the naïve Bayes classi er is considered to be one of the most popular classi ers for class prediction or pattern recognition from microarray gene expression data, it is highly sensitive to any outliers with the classical estimates of the location and scale parameters [38].In fact, a discrepancy was observed between the Naïve Bayes model performance and data clustering for the cancer/leukemia pathway before the model "re nement", where the good model performance produced no data clustering (Figure 6 [C]).This phenomenon might be explained by the interference of high DEG number (n = 12), small relative importance scores for the top feature contributors, and the presence of statistically insigni cant genes.In addition, the outlier analysis for the logistic regression and naïve Bayes models con rmed by the previous results on the model performances, where the former model outperformed the latter one by a decreasing number of outliers observed in cancer/leukemia pathway clustering (Table 7).Indeed, the outlier detection and removal procedure previously published elsewhere was able to improve the ML classi cation accuracy from 63% to 76% by reducing the variance of the training data and matching the accuracy of clinical judgment of medical experts [39].Given that there is a limited number of medical facilities providing highly specialized and complex health care, this approach may be useful to improve the diagnostic process for patients without access to these facilities.

Conclusions
The present study analyzed the gene expression pro les and the PPI interaction pathways that may be triggered by the PUFA supplements, such as EPA and DHA in elderly healthy individuals using comprehensive machine learning methodology.To achieve this goal, the GSE12375 Affymetrix data were extracted from the GEO database to identify the up-regulated and down-regulated DEGs and to construct metabolism or disease-speci c PPI networks, including cancer, acute myeloid leukemia, and long-term depression.In the next step, the ML techniques were able to cluster the EPA/DHA and HOSF groups, providing the best performance by using the logistic regression modeling.However, more research is needed to con rm these observations by improving the ML accuracy or performing clinical trials to pinpoint the PUFA effects on gene expression in healthy and sick individuals.

Methods
The transcriptional pro le of GSE12375 was obtained from the Gene Expression Omnibus database [5], which is based on the Affymetrix NuGO array (human) using the NuGO-Hs1a520180 annotation data.In particular, fasting venous blood samples were collected at baseline (n = 48) and after 26 weeks (n = 48) of supplementation with either 1.8 g EPA/DHA mixture or HOSF. 4 ml blood was collected for the isolation of peripheral blood mononuclear cells (PBMC), using BD Vacutainer cell preparation tubes with sodium citrate (BD, Breda, The Netherlands).Immediately after blood collection, PBMCs were isolated according to the manufacturer's manual.Total RNA was isolated from all PBMC samples using the Qiagen RNeasy Micro kit (Qiagen, Venlo, The Netherlands), labeled using a one-cycle cDNA labeling kit (MessageAmpTM II-Biotin Enhanced Kit, Ambion, Inc.), and hybridized for custom-designed NuGO GeneChip arrays.Prior to DEG analysis, the probe cell intensity data (CEL les) were converted into the gene expression values, and the background correction was performed by the robust multi-array average (RMA) algorithm within the Bioconductor environment.The principal component analysis (PCA) was utilized for a clustering of the RMA-normalized gene expression data together with the variance proportions for the most contributing principal components [11].The LIMMA (Linear Models for Microarray Data) algorithm was implemented to identify relevant DEGs at baseline and after 26 weeks of supplementation with either 1.8 g EPA/DHA mixture or HOSF with p-value < 0.05.These DEGs were subsequently submitted to the DAVID web server to identify and construct the enriched KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways [12].
To demonstrate the potential protein-protein interactions (PPI), the DEGs were mapped on the compiled dataset of a human interactome for the PPI network construction using the Cytoscape v2.8 software platform.The human interactome was obtained from the laboratory of Cell Tra cking and Signal Transduction (University of Verona, Verona, Italy).The one-way analysis of variance (ANOVA) was used to compare two means from the EPA/DHA and HOSF groups using the F-distribution and to cluster the samples based on the gene expression data.Finally, the construction of ML models, including logistic regression, naïve Bayes, and deep neural networks (DNN) were implemented for the analyzed DEGs associated with the speci c pathways according to the work ow (Figure 1) using the TensorFlow v1.12 and Keras v2.24 algorithms within the Python environment [13].The DNN model was trained for 100 iterative cycles (epochs) over all the samples according to the protocol published elsewhere [14].All the gures were prepared using the Meta-Chart online graphing tool, GraphPad Prism v7.0 software for Windows, R Graphics package, and matplotlib Python library.The R and bash scripts together with the Bioconductor and Cytoscape output les are provided in the Supplementary material.
Tables Table 1: Gene expression characteristics determined for healthy elderly individuals (n = 96) at baseline and after 26 weeks of treatment with 1.8 g of EPA+DHA mixture and HOSF as a negative control.

Figures Figure 1 Supervised
Figures

Figure 2 The
Figure 2

Figure 5 Decision
Figure 5

Table 2 :
Enriched KEGG pathways (p-value < 0.05) of the total, up-regulated and downregulated DEGs in samples of healthy elderly individuals (n = 96) at baseline and after 26 weeks of treatment with 1.8 g of EPA+DHA mixture and HOSF as a negative control.

Table 3 :
One-way ANOVA statistics for enriched KEGG pathways, using DEGs from samples of healthy elderly individuals (n = 96) at baseline and after 26 weeks of treatment with 1.8 g of EPA+DHA mixture and HOSF as a negative control.

Table 4 :
Logistic regression model performance for enriched KEGG pathways, using DEGs from samples of healthy elderly individuals (n = 96) at baseline and after 26 weeks of treatment with 1.8 g of EPA+DHA mixture and HOSF as a negative control.

Table 5 :
Naïve Bayes model performance for enriched KEGG pathways, using DEGs from samples of healthy elderly individuals (n = 96) at baseline and after 26 weeks of treatment with 1.8 g of EPA+DHA mixture and HOSF as a negative control.

Table 6 :
DNN model performance for enriched KEGG pathways, using DEGs from samples of healthy elderly individuals (n = 96) at baseline and after 26 weeks of treatment with 1.8 g of EPA+DHA mixture and HOSF as a negative control.

Table 7 :
Outliers for the logistic regression and naïve Bayes models of enriched KEGG pathways, using DEGs from samples of healthy elderly individuals (n = 96) at baseline and after 26 weeks of treatment with 1.8 g of EPA+DHA mixture and HOSF as a negative control.