Development and validation of a skin fibroblast biomarker profile for schizophrenic patients

Gene expression profiles of non-neural tissues through microarray technology could be used in schizophrenia studies, adding more information to the results from similar studies on postmortem brain tissue. The ultimate goal of such studies is to develop accessible biomarkers. Supervised machine learning methodologies were used, in order to examine if the gene expression from skin fibroblast cells could be exploited for the classification of schizophrenic subjects. A dataset of skin fibroblasts gene expression of schizophrenia patients was obtained from Gene Expression Omnibus database. After applying statistical criteria, we concluded to genes that present a differential expression between the schizophrenic patients and the healthy controls. Based on those genes, functional profiling was performed with the BioInfoMiner web tool. After the statistical analysis, 63 genes were identified as differentially expressed. The functional profiling revealed interesting terms and pathways, such as mitogen activated protein kinase and cyclic adenosine monophosphate signaling pathways, as well as immune-related mechanisms. A subset of 16 differentially expressed genes from fibroblast gene expression profiling that occurred after Support Vector Machines Recursive Feature Elimination could efficiently separate schizophrenic from healthy controls subjects. These findings suggest that through the analysis of fibroblast based gene


Introduction
Schizophrenia (SZ) is a chronic psychiatric disorder with mean lifetime prevalence of almost 1% [1].The definition of the disease has changed over the last century, but its causal pathophysiology remains obscure [2].SZ is diagnosed on the basis of specific criteria presented in the fifth edition of Diagnostic and Statistical Manual of Mental Disorders (DSM-V) [3] or the tenth revision of International Statistical Classification of Diseases and Related Health Problem [4].Among others, these criteria include the appearance of the positive, negative and cognitive symptoms of SZ.Positive symptoms include psychotic symptoms such as delusions and hallucinations with the characteristic lack of contact with the reality.The negative symptoms include alogia, social withdrawal, avolition and anhedonia and the cognitive symptoms include cognitive abnormalities such as working memory impairments [1].
The etiology of the disease seems to be heterogeneous affected both by environmental and genetic factors.Generally in all psychiatric disorders there is a lack of strict boundaries between the normal and the diseased condition and there is an overlap of symptoms among different disorders [5].It is also a fact that SZ begins long before the typical diagnosis of the disease takes place.For this reason it is important to explore approaches for the identification of subjects being at risk and subjects being in the early stages of the disease [1].These facts result to the need for identification of reliable and objective diagnostic means such as biomarkers for the discrimination among the psychiatric disorders and among the healthy and the diseased state [5].

Machine learning methodologies towards possible biomarkers
It is very important to discover biomarkers in the field of psychiatry in order to early diagnose the psychiatric disorders.It will certainly prove beneficial to detect preliminary psychotic symptoms even in prodromal phases on the maximum possible objective basis.High dimensional microarray data can be utilized for the development of biomarkers, since they pinpoint different molecular contributions to the disease pathophysiology [6].Gene expression datasets usually include small number of samples and thousands of genes.In such high throughput studies, it remains a big challenge to select those biomarker genes that are crucial for distinguishing the sample classes [7].Supervised learning algorithms utilize samples with known class for the training, in order to develop a model for classification of new examples that were not taken into consideration during the classification task [8].

Human skin fibroblasts as a cell model
Human skin fibroblast cells have been used as an experimental model for studying the pathophysiology of psychiatric disorders such as SZ [9].Since there are indications that SZ has a strong genetic background, the genetic contribution of the disease is probably preserved in cell cultures, such as fibroblast cell cultures.Fibroblasts share some common molecular characteristics with human brain microvascular endothelial cells [10].Additionally, they have the advantage of being easily obtained from the subjects under study.Moreover, the medication status of the patients does not greatly affect the skin fibroblast cells [11].
No validated biological test for the discrimination of schizophrenic patients from healthy controls has been adopted in clinical practice so far.Gene expression differences between schizophrenic patients and healthy controls can be used towards the identification of biomarkers that can distinguish the patient and the healthy control classes.The model of human skin fibroblast cells could greatly contribute to the identification of brain disease biomarkers.Blood-based studies have been already used for the development of biological classifiers in bipolar disorder and SZ [6,12].In this study, we examine the possibility of developing a classifier based on skin fibroblast gene expression of schizophrenic patients.The fibroblast cell model has some disadvantages like the fact that the age of the donors may affect the cell viability and the fact that they do not include physiological, in vivo signals [9].Further confirmatory examination was performed in this study so that the diseased tissue will be also examined.More specifically, in this study it was examined if the genes identified to result in separating schizophrenic and healthy subjects based on the fibroblast gene expression profiling, could be also used for the discrimination of SZ and the healthy control postmortem brain samples.

Dataset
The dataset that was used in this study contains data collected by Cattane et al. [11], accessible at National Center For Biotechnology Information (NCBI) [13] Gene Expression Omnibus (GEO) database [14], with the accession number GSE62333.The dataset includes the gene expression of skin fibroblasts from 20 patients affected by SZ (mean age ± SD, 44.60 ± 12.67; 50% males) and 20 healthy control subjects (mean age ± SD, 48.40 ± 12.20; 45% males).The study involved patients who satisfied the DSM-IV criteria for SZ.The platform that was used for the microarray experiment was Affymetrix Human Gene 1.1 ST Array, which interrogates more than 28,000 well-annotated genes through UniGene or via Reference Sequence.This dataset was utilized for developing a classifier to distinguish the two classes.

Data preprocessing and analysis
In this study the raw data files were imported into R studio and the normalized values were obtained after applying the robust multi-array average (RMA) algorithm, which includes the background correction on the Affymetrix perfect match data, quantile normalization and summarization using the median polish method [15].In order to deal with the problem of multiple probes for the same gene, for each gene the average value of its probes was used.
A gene was considered present, if its mean expression value was above 5 log 2 intensity in 60% of the replicates in at least one condition.In order to identify the differentially expressed (DE) genes, limma Bioconductor package was used and linear model fitting and the empirical Bayes method were applied [16].DE genes (SZ samples versus healthy control samples) were defined based on the following criteria: (i) ±0.3 or greater fold change (FC) in the mean expression level (log 2 scale), (ii) p-value ≤ 0.05 after correction of multiple testing with the False Discovery Rate (FDR) adjustment method of Benjamini and Hochberg [17].

Functional analysis
The genes that were identified to be DE in the study were imported into BioInfoMiner tool [18].After applying statistical enrichment tests, the imported genes were associated to the Reactome Pathways Database [19], as well as to three different ontology databases: Gene Ontology (GO) [20], the Human Phenotype Ontology [21] and the MGI Mammalian Ontology [22].

Classification algorithms and parameter optimization
In this study the tested classification algorithms were Support Vector Machines (SVM), Extremely Randomized Trees (Extra Trees), Random Forest (RF), AdaBoost and k-nearest neighbors (kNN).The SVM algorithm tries to find a hyperplane that separates the two classes with the maximum margin between the nearest training data points of the different classes [23].The kNN algorithm is based on the idea of giving a class label to a new sample based on the class label of the predefined amount of closest samples [24].RF and Extra Trees are two averaging algorithms on the basis of randomized decision trees [25,26].In these methods an ensemble of classifiers is developed, and randomness is introduced in the sampling of the features and the samples used to build the classifiers.The prediction of the ensemble occurs from the average prediction of the included classifiers.In RF each tree results to a classification and the forest finally makes a selection of the classification model with the most votes [27].Extra Trees and RF use a random subset of features, but in Extra Trees, thresholds are drawn at random (in comparison to RF, where the most discriminative threshold is chosen) for each feature and the best of these thresholds is chosen as the classification rule [26,28].AdaBoost is another ensemble classification algorithm [29].The principle behind this algorithm is to construct a classifier based on a combination of many weak classifiers [30].The best parameter set of each algorithm was selected after Cross Validation (CV) grid search, which selects the parameters that result to the highest score of the CV [31].All of the machine learning methods were implemented in scikit-learn of Python [32].

Feature selection
Each of the tested classifiers included genes that resulted from a Support Vector Machines Recursive Feature Elimination (SVM-RFE) with cross-validated selection of the best number of features [33].This feature selection method is based on an iterative process of removing features with lowest calculated weights for each estimator until the subset of features with the best performance is achieved.The genes that occurred after the feature selection were used for training classifiers based on their gene expression values in the skin fibroblast dataset and based on their gene expression values in an independent cohort, in order to evaluate the discrimination efficiency of those genes and to further evaluate them as possible biomarkers.

Model Evaluation
In order to evaluate the performance of each classification model, we utilized the method of 4fold CV and the evaluation measure of receiver operating characteristic (ROC) area under the curve (AUC) score.In 4-fold CV the dataset is split in 4 random, mutually exclusive subgroups Di (D1, D2,…, D4) called folds.4-fold CV is performed since the study includes few samples, and it is better to split them into 4 subgroups.The classification model is then trained using three out of the four subgroups and the remaining subgroup is used for testing.
ROC curves are typically used in binary classification for evaluating the output of a classification algorithm after using CV.The Y axis of the curve depicts the true positive rate, and the X axis shows the false positive rate.A larger AUC, which means a larger area between the X axis and the ROC curve corresponds to a better classification performance.For each fold in CV a ROC curve is constructed and finally a mean AUC from all the curves is calculated.The feature selection and the model evaluation were also performed using scikit-learn machine learning in Python [32].CV was performed in the skin fibroblast dataset in the three following ways: in the first 4-fold CV each fold included the 63 DE genes; in the second 4-fold CV each fold included feature selection with the method of SVM-RFE.The third 4-fold CV included in each fold the 16 genes from the feature selection.

Data collection and analysis of the independent dataset
The second dataset of this study was used as an independent group of samples.The dataset includes postmortem brain samples derived from superior temporal cortex (Broadmann Area 22) of 23 schizophrenic (mean age ± SD, 72.2 ± 16.9) and 19 healthy control subjects (mean age ± SD, 67.7 ± 22.2) [34].Patients were diagnosed with SZ according to DSM-III criteria.The SZ samples included 13 males and 10 female samples and the heathy control group included 11 male and 8 female samples.The accession number in GEO for the second dataset is GSE21935 and its platform for the microarray hybridization is Affymetrix Human Genome U133 Plus 2.0 Array.The normalized expression values were generated using RMA as described above.The subset of 16 genes that occurred from the SVM-RFE feature selection of the skin fibroblast dataset, were used in order to test if their corresponding expression values in the postmortem brain dataset can discriminate SZ samples from healthy control samples.
Additionally, the DE genes of this dataset were also identified using linear model fitting and the Empirical Bayes method from the BioConductor limma package were used to test the differences in mRNA levels.Linear model fitting and the empirical Bayes method were applied, and DE genes were derived with the following criteria: ± 0.3 or greater FC in the mean expression level (in log 2 scale) and (ii) unadjusted p-value ≤ 0.05.No significant genes were obtained after correction of multiple testing with the FDR adjustment method of Benjamini and Hochberg.The DE genes from this dataset were also imported into BioInfoMiner for functional analysis.The DE genes and their enriched terms were used for comparison with those resulted from the skin fibroblast dataset.

DE genes and functional analysis
The list of the 63 DE genes of the skin fibroblast cells according to the criteria set (

Classifier optimization and performance, AUC scores and feature selection
The best parameter set of each algorithm after CV grid search is presented in Supplementary Table 3.The mean AUC of the ROC curve shows that the schizophrenic samples can be distinguished from the healthy control samples based on the gene expression of the 16 genes that resulted from SVM-RFE (Supplementary Table 4).SVM and AdaBoost outperform Extra Trees Classifiers, Random Forest and kNN, with an AUC of 0.99 (Supplementary Table 4), although these performances may be likely inflated, due to the fact that the feature selection was not applied in each training subgroup of the CV.This feature selection was applied in order to derive the smallest subset of features, which was further utilized on an independent dataset, so an estimation of the performance of the classification models based on these 16 genes would be included.Another CV was also performed, in order to estimate the performance of the classification algorithms.For this scope, the feature selection method of SVM-RFE was applied for each fold and consequently for each training subgroup of the 4-fold CV.The AUC scores of each classification model resulting from the 4-fold CV with and without feature selection (the trained model included all the 63 DE genes) are presented in Table 2.

Independent postmortem brain dataset
For the postmortem brain dataset that was used for validation of the 16 gene model, maximal classification performance was achieved by the RF classifier, with mean AUC score of 0.82 after applying 4-fold CV.The mean AUC scores of all tested classification methods are presented in Supplementary Table 4.
In the postmortem brain dataset 125 genes were DE according to the statistical criteria presented in Materials and Methods section.Additionally, after comparison of the DE genes in the two datasets, only two genes were identified to be common: CNTN3 and TPBG.Several term clusters from the functional analysis of the DE genes analysis of the skin fibroblast dataset were also identified from the respective analysis of the postmortem brain dataset (Table 3).

Discussion
The top ranked biological processes resulting from the BioInfoMiner tool include response to cyclic adenosine monophosphate (cAMP) (FOSB, FOS, JUN and JUND) (Supplementary Table 2).The pathway analysis detected among the statistical significant REACTOME pathways, mitogen activated protein kinase (MAPK) related pathways (Supplementary Figure 1).Both MAPK and cAMP signaling pathways involve responses to extracellular stimuli, modification of the functionality of receptors and finally they affect the cell survival and neuroplasticity.Abnormalities in those pathways have been observed in frontal cortical areas of schizophrenic patients [35].Mitogen activated protein kinases are expressed in the central nervous system and the extracellular signal-regulated kinases are related to long-lasting neuronal plasticity [36].There are also previous studies on second messenger signal transduction system of cAMP that exhibit abnormal function in postmortem brain, in cerebrospinal fluid and blood platelets of SZ samples.Additionally neuroleptic treatment in animal models resulted in the opposite pattern of cAMP metabolism in schizophrenic patients, indicating that cAMP signal transduction may be a target of this medication [37].Finally, the REACTOME pathway analysis resulted to terms classified to innate immune system (CALM3, JUN, FOS, PSMB4, SPRED2).A study including RNA sequencing, resulted to SZ genes that are enriched in immune related pathways [38].Other epidemiological, genomic and transcriptomic studies on postmortem brain and peripheral samples also indicate abnormalities of the innate and the adaptive immune system of the disease.Still, the exact mechanism by which the immune system aberrations confer to the disease phenotype remains elusive.Several missing links exist concerning the chain of events that link the observed immune disturbances with the disease manifestation [39].
According to the human phenotype ontology, the resulted phenotypic abnormalities related to the nervous system are of great interest in SZ (Supplementary Figure 2).More specifically, the resulted term of abnormality of the autonomic nervous system is in accordance to abnormal autonomic nervous system activity reported in patients with SZ [40].
Concerning the statistical significant terms of the mammalian phenotype it is worth mentioning that there are genes related to abnormal brain morphology and more specifically to abnormal hippocampus neuron morphology and decreased brain size (Supplementary Figure 3).Even after many years of brain scanning, the brain structural abnormalities in SZ are not completely understood.There are many indications, that here is a decrease of the brain and intracranial size of schizophrenic patients [41].Reductions of grey and white matter as well as of whole-brain in MRIscans of schizophrenic patients in comparison to healthy controls have been reported [1].ENIGMA SZ Working Group is a consortium that performs an effort to collect and analyze neuroimaging data of severe mental illnesses [42].In the context of ENIGMA a meta-analysis on brain MRI scans from 2028 patients affected by SZ and 2540 healthy controls showed that SZ patients have smaller hippocampus, amygdala, thalamus and intracranial volumes in comparison to healthy controls [43].Hippocampus is involved in SZ through neuropsychological defects.In summary, the hippocampal pathophysiology in SZ is involved at a morphological, molecular and functional level, resulting to alterations in the structures and functions of hippocampal neuron circuits and subsequently leading to disturbances in glutamatergic neurons [44].
Decreased fibroblast proliferation resulted from the MGI mammalian phenotype (Supplementary Figure 3) is also in accordance with previous studies concerning skin biopsies that resulted to decreased fibroblast growth and abnormal morphology of fibroblast cultures of SZ samples as compared to fibroblast cell cultures of healthy controls [45,46].
Many other findings resulting from the functional analysis have not been related to SZ, so it would be interesting to further investigate them.Still, a more detailed examination of these findings is beyond the scope of this study, which actually focuses on the use of skin fibroblast cells as a possible model for developing classification models in SZ.For this reason the molecular abnormalities resulting from the functional analysis and were related to previous findings in the SZ research, can be considered a validation of the skin fibroblast cell model in the study of SZ.
So far, biomarker development in the psychiatric field is in its infancy as most studies lack validation through independent cohorts.The need for a test involving peripheral tissues (with blood being the most studied tissue type) that could help in the prediction and diagnosis of SZ is an important issue in the studies of the disease [47].Similar studies that exploit the gene expression profiling of non-neural samples through machine learning methods for the development of diagnostic classification methods have been already performed in psychiatric research.For example, an artificial neuron networks classification method performed on gene expression signature of whole peripheral blood could afford to develop classification models in SZ with good performance [6].Peripheral blood gene expression through the use of microarray technology has been used for the identification of biomarker genes that can differentiate bipolar disorder upon the mood state of the patients [48].
The biomarker panel from human skin fibroblasts presented in this study is validated through an independent postmortem brain schizophrenic dataset.Ultimately, this 16-gene model of possible biomarkers could be further developed with the goal of finally developing a diagnostic tool for SZ with clinical use.Among the genes resulting from the feature selection, genes that have been previously implicated in SZ are RPLP2 and SLC44A1.In a study that examined protein expression alterations in postmortem brain samples of SZ subjects, altered expression of ribosomal proteins including the gene RPLP2 has been reported [49].In a translational convergent functional genomics study for the identification and prioritization of genes involved in SZ, the gene SLC44A1 is included in the resulted genes [50].
Finally, only two genes were found to be commonly DE both in skin fibroblast cell samples and in postmortem brain samples of schizophrenic subjects.Some of the resulted terms of the functional analysis in the two examined datasets (GSE62333, GSE21935) are clustered into the common superclasses of Table 3, such as immune response related terms.These common clusters of terms resulting from DE genes in the two datasets may imply that the skin fibroblast cell model could capture molecular abnormalities that take place in the brain tissue of SZ samples, thus reflecting underlying similarities in the exercised molecular mechanisms.
The mean AUC scores after the 4-fold CV shows that SZ samples can be distinguished from the healthy control samples based on the gene expression of the 16 genes that resulted from SVM-RFE (Supplementary Table 4).SVM and AdaBoost outperform Extra Trees, Random Forest and kNN, with an AUC of 0.99 (Supplementary Table 4), although these performances are likely inflated, due to the fact that the feature selection was applied on the whole training dataset and the CV does not correctly mimic the application of the classifier to a completely independent test set, since these predictors 'have already seen' the left out samples.Nevertheless, SVM-RFE feature selection was applied in order to derive the smallest subset of features, which will have the ability to classify schizophrenic subjects with an unknown label.Through the application of SVM-RFE, the minimum subset of genes that achieve the highest AUC according to an embedded CV in the feature selection process is selected.This study aimed to find informative subsets of genes that will be further utilized in other independent datasets.The 16 genes from the feature selection were also used as a basis for training and testing classification models on an independent dataset of another tissue (which in this case is the diseased postmortem brain tissue).In other words, in this study it was studied if a gene signature obtained from skin fibroblast cells can be used for separating healthy control samples from SZ samples derived from postmortem brain tissue.For this reason, an estimation of the performance of the classification models both in the skin fibroblast dataset and in the independent postmortem brain dataset of these 16 genes was performed.The CV in the postmortem brain dataset resulted to a satisfactory mean AUC score of 0.82 with the use of a RF classifier.The feature selection is also important, since through the dimensionality reduction, covariance is reduced and irrelevant attributes are excluded.Another CV was also performed, in order to estimate the performance of the classification algorithms, without taking into consideration the 16 genes from the SVM-RFE.For this scope, the feature selection method of SVM-RFE was applied for each fold.The AUC scores of each classification model resulting from the 4-fold CV with and without feature selection (the trained model included all the 63 DE genes) are presented in Table 2.This CV was applied for the estimation of the performance of the classification methods used in this study, including the SVM-RFE feature selection method in each fold, resulting to worse, though satisfactory mean AUC scores of the classification models (Table 2).
There are also some other classification approaches in SZ research, such as machine learning algorithms for pattern classification with the use of neuroimaging as a clinical diagnostic or prognostic tool [51], as well as neurocognitive test batteries [52].These classification approaches utilize physiological and cognitive data.It would be interesting to combine data from all these different levels and to develop composite classification models.Still, the development of classification models based on gene expression data can be proved very important in classification tasks, if we consider that there are already clinically applied assays, deriving from machine learning methods based on gene expression data.In 2002, a 70-gene classifier or the Mammaprint® assay has been developed based on the expression of 70 genes in tumors biopsied from women with breast cancer and has been applied in clinical practice [53].
There are some limitations concerning this study.Concerning the medication effects in the expression of skin fibroblast cells, it is assumed that because of the fact that the skin fibroblast cells have been subcultured for at least five passages, any effects from in vivo exposure to factors such as hormones, and drugs are minimized [54].Additionally, limited sample size can be considered another limitation of this study, but concerning the dataset of SZ skin fibroblast cells, we were limited by the content of the available datasets.The approach of this study to use independent datasets and different tissues bears several pitfalls and this might be reflected by the fact that only two genes were DE in both samples.However the postmortem brain dataset is useful as a validation dataset in a study based on skin fibroblast cells, due to the fact that postmortem brain is a model that includes the diseased tissue in SZ.Furthermore, lacking a validation of the classifier itself can be considered a limitation of this study.However, through the methodological approach used in this study, the aim was to identify a subset of genes with a biological interest in SZ.The validation performed in the independent dataset from postmortem brain cells, mostly concerns the validation of the selection of the best subset of genes as a potential diagnostic signature in SZ.Additionally, a classifier trained with expression data of skin fibroblast cells, would be better validated as a classifier, with the use of an independent dataset containing gene expression data from skin fibroblasts.Continuing this work, with additional skin fibroblast cells from schizophrenic patients and healthy controls for model development as well as for the validation on completely independent test cohorts, may help to develop and define the true utility of the skin fibroblast based classification models.

Conclusions
In this study it was examined if gene expression of skin fibroblast cells could be exploited through supervised machine learning for the development of classification models that can discriminate SZ from healthy control samples.A subset of genes that could discriminate the schizophrenic and the healthy control samples based on the gene expression of skin fibroblast cells could also sufficiently discriminate SZ and healthy control subjects based on the expression values of those genes in postmortem brain samples.These are indications that skin fibroblast cells could be used for the identification of potential biomarkers in SZ.Concerning the classification models that have been developed based on the skin fibroblasts gene expression, achieving AUC scores over 0.9, other independent skin fibroblast samples of schizophrenic patients should be used, in order to further blindfold validate the generalization ability of the developed classification models.
see Materials and Methods) is presented in Supplementary Table 1.The DE genes were imported into the BioInfoMiner tool.The top GO terms are presented in Supplementary Table 2 and the top Reactome Pathways, Human Phenotype Ontology terms, and MGI Mammalian Ontology terms are presented in Supplementary Figures 1, 2 and 3 respectively.

Table 1 .
Genes that occurred after the SVM-RFE feature selection on the skin fibroblast samples and could discriminate the schizophrenic and healthy control subjects.

Table 2 .
Mean performance estimation values of different classification algorithms after applying 4-fold CV, either with no feature selection or with feature selection of SVM-RFE included in each fold.

Table 3 .
Common clusters of enriched terms between the skin fibroblast dataset and the postmortem brain dataset derived from the functional analysis.