Differential diagnosis of systemic lupus erythematosus and Sj ¨ ogren ’ s syndrome using machine learning and multi-omics data

Systemic lupus erythematosus and primary Sjogren ’ s syndrome are complex systemic autoimmune diseases that are often misdiagnosed. In this article, we demonstrate the potential of machine learning to perform differential diagnosis of these similar pathologies using gene expression and methylation data from 651 individuals. Furthermore, we analyzed the impact of the heterogeneity of these diseases on the performance of the predictive models, discovering that patients assigned to a specific molecular cluster are misclassified more often and affect to the overall performance of the predictive models. In addition, we found that the samples characterized by a high interferon activity are the ones predicted with more accuracy, followed by the samples with high inflammatory activity. Finally, we identified a group of biomarkers that improve the predictions compared to using the whole data and we validated them with external studies from other tissues and technological platforms.


Introduction
Systemic autoimmune diseases (SADs) are heterogeneous and complex pathologies whose main hallmark is an immune response to selfantigens, causing tissue injury and failure in different organs.Although these diseases cause different symptoms, they share risk factors [1], clinical manifestations [2], and molecular alterations (e.g., gene expression alterations [3]).
In order to establish a molecular classification of SADs patients, whole blood transcriptome and methylome data have been previously used from patients from seven SADs, including systemic lupus erythematosus (SLE) and primary Sjogren's syndrome (pSjS) [4].In a previous work, we used these data to demonstrate the existence of four subgroups of SADs patients based on clustering analysis of molecular profiles [4].The defined clusters were named 'inflammatory', 'lymphoid', 'interferon' and 'undefined' and each one was characterized by different serological, cellular, genetic and clinical features.Overexpression and hypomethylation of genes and CpGs from neutrophil and monocyte-driven modules were characteristic of the inflammatory cluster.T and NK cell functions defined the lymphoid cluster, whereas interferon, viral, and dendritic cell functions defined the interferon cluster.The undefined cluster did not present distinct functional modules compared to healthy controls.B lymphocyte functions were seen in both the lymphoid and interferon clusters, and cell cycle and transcriptional upregulation were connected to the interferon cluster.The undifferentiated patterns of the undefined cluster were explained by the low disease activity found in these samples, concluding that these patients have healthy-like molecular patterns.
SLE is a complex disorder with an autoimmune background, a multifactorial etiology, impairment of several organic systems, a wide spectrum of clinical manifestations, variable prognosis, and an evolving clinical course marked by the occurrence of episodes of active disease and remission [5].It is interesting to note that the prevalence of SLE is rising across the globe, with rates varying from 40 to >160 per 100,000 people [6].On the other hand, pSjS is another autoimmune chronic inflammatory clinical condition that primarily affects the lacrimal and salivary glands and results in a decrease in the salivary and lacrimal flows and, as a result, to symptoms of dry mouth and dry eyes [7].
The heterogeneity of SADs patients makes diagnosis and treatment difficult.In this context, it is sometimes challenging to differentiate between pathologies with overlapping clinical features, like SLE and pSjS.Some symptoms shared by these two diseases are hemolytic anemia, leukopenia, lymphopenia, thrombocytopenia, photosensitivity or fatigue, among others [8][9][10][11][12].Nevertheless, both pathologies are characterized by the upregulation of the interferon (IFN) signature [13,14], a set of genes regulated by the IFN cytokine.
For these reasons, it is common for pSjS patients to be misdiagnosed, underdiagnosed, or diagnosed at late stages of the disease [15], even after 6-10 years from presenting the first symptoms [16].Consequently, proper therapeutic strategies are delayed, contributing to evitable tissue damage.
Previous efforts have been made to distinguish SLE and pSjS patients from the clinical point of view [2,15], using methylation data [17], metabolomics data [18], or salivary protein biomarkers [19].However, to the best of our knowledge, gene expression data has not been used to perform a differential diagnosis between SLE, pSjS and healthy controls.Given that measuring gene expression has become an affordable approach, we studied its potential as a diagnostic tool for these patients.Furthermore, although methylation data has been used previously to perform pairwise predictions (SLE vs. healthy, pSjS vs. healthy, and SLE vs. pSjS) [17], a multiclass classifier to distinguish the samples from these 3 groups simultaneously was not proposed.Nevertheless, we consider that a multiclass predictor may be helpful in the clinical context.Importantly, some specific dysregulation in these diseases may be observed at the gene expression and DNA methylation level, but not with other data types such as metabolomics.For instance, the IFN signature, which plays a crucial role in the development and progression of SLE and pSjS [13,14], may be directly assessed with gene expression data and indirectly with methylation data, since promoters of IFN-regulated genes (e.g., IFI44L) are differentially methylated [20].Therefore, these are very valuable data modalities to study and predict the clinical diagnosis in this context.
In this study, we applied machine learning (ML) methodologies to classify SLE and pSjS patients, as well as healthy controls.Operatively, we designed an analysis pipeline to train a eXtreme Gradient Boosting (XGBoost) multiclass predictor for each type of data.Given the heterogeneity of these diseases, we also studied whether the performance of the XGBoost models varies according to the molecular clusters previously defined for these patients [4].Furthermore, we obtained subsets of genes and cytosine-phosphate-guanine (CpGs) that improve the performance of the predictors and we characterized the molecular implications of these features into the studied diseases.In addition, we evaluated the predictivity of these features as biomarkers for the labels in independent datasets.Finally, we constructed specific models and feature selections for the group of patients with healthy-like molecular patterns.

Data
The data used in this study was produced for a previous work [4], which generated multi-omics data from whole blood samples obtained from patients with seven different SADs and healthy controls.In detail, we used matched expression and methylation data from 213 SLE patients, 181 pSjS patients and 257 healthy controls.Table 1 shows the distribution of the SLE and pSjS samples according to their assigned molecular cluster.
Illumina HiSeq2500 and Illumina Beadchip 450k technologies were used to measure the transcriptome and methylome, respectively, in each sample.For the transcriptome experiment, messenger ribonucleic acid (mRNA) was extracted and sequenced.These sequenced reads were aligned to the reference human genome and the number of reads aligned to each gene (raw counts) were extracted.These raw counts are a measure of genes expression since they are proportional to the amount of mRNA in the cells.For the methylome analysis, the intensity of the methylated and unmethylated probes in the Beadchip arrays was measured and, after appropriate data processing, M-values for each analyzed CpG was obtained.Further information about the cohorts, experimental procedures and data preprocessing is available from the previous article [4].

Data processing
In order to discard uninformative features, we filtered out those genes with less than 5 counts in at least half of the samples of each group.Furthermore, we discarded those genes and CpG sites with a coefficient of variation lower than 0.4 and 0.1 respectively, retaining information for 8413 genes and 253,903 CpGs.We adjusted the expression and methylation data for the technical variables pool and sample plate, respectively, using the ComBat method [21] implemented in the sva R package.We normalized the adjusted expression values with the trimmed mean of M-values (TMM) normalization [22] implemented in the NOISeq R package [23].We used the Uniform Manifold Approximation and Projection (UMAP) method [24] for dimension reduction to explore the gene expression data.Finally, given the number of features, we performed differential expression and methylation analyses to reduce the computational costs of the ML models.For that aim, we used the limma package [25], to compare SLE vs. Healthy, pSjS vs. Healthy and SLE vs. pSjS groups.After ranking each feature by the obtained P-value, we selected the top 2000 genes and 3000 CpGs from each comparison, obtaining a final list of 3.681 genes and 7.596 CpGs after removing duplicated features.

Model fitting and evaluation
In the first place, we used the Python library lazypredict to test the performance of 29 ML models on both the expression and the methylation data.XGBoost was the model with the highest Matthews Correlation Coefficient (MCC) for both types of data (Supplementary Table 1) and was selected for further analyses.
XGBoost is a ML algorithm that use gradient boosting and ensemble learning to combine multiple decision trees for classification and regression tasks [26].This approach minimizes a loss function to add new trees to the model.XGBoost has become a broadly used ML method due to its good performance and scalability.
For each data type, we split the data randomly into a training set and a test set following a 80/20 partition.We standardized the training sets with the StandardScaler function of the scikit-learn Python library [27], which calculates Z-scores subtracting the mean and dividing by the standard deviation.We used the parameters learnt from the training sets (means and standard deviations) to scale the test sets.We used the training set to optimize the hyperparameters for the XGBoost classifier through a 10-fold cross-validation (CV) approach.Specifically, we optimized the learning rate, maximum tree depth, gamma, alpha, lambda, minimum child weight, subsample ratio of the training instance and subsample ratio of columns hyperparameters.We used the softmax loss function for the multiclass classification task.Given the high number of possible combinations of hyperparameter values, we performed a grid search with 100 random combinations using scikit-learn's Ran-domizedSearchCV function and we selected the hyperparameter combinations with the best mean MCC on the internal validation sets.
Once the best fitting hyperparameters were selected, we trained a XGBoost model with those hyperparameters and the training set.We used that model to predict the labels of the test set and we evaluated the performance for this prediction.This workflow is shown in Fig. 1.The whole process was repeated 10 times to avoid biases due to the training and test set splitting of the first step, following the FDA-SEQC guidelines for reproducibility.We used the Python library scikit-learn [27] to apply all the described ML methodologies.
To evaluate the performance of the models, we calculated the accuracy (defined as the fraction of correct predictions), the mean precision for the three classes (1) weighted by the number of samples of each class, and the weighted F1 score (2).
Being TP the number of true positives and FP the number of false positives.
Where recall is calculated following (3).
Being TN the true negatives.
We also calculated the MCC [28,29], which is a balanced measure of accuracy and precision that can be used to evaluate multiclass models.MCC maximum value is 1, which indicates a perfect prediction.The MCC for K classes is calculated following (4) [30].
Where t k is the amount of k labelled samples in the data; p k is the number of times that class k was predicted; c is the number of samples correctly predicted and s is the total number of samples.

Variable selection and functional analysis
For feature selection, we excluded the data from the patients belonging to the undefined cluster, which have a healthy-like molecular pattern.For each of the 10 rounds of the model fitting and evaluation (Fig. 1), we ranked the features of the model (i.e., genes and CpGs) based on their Gini importances [31] in the XGBoost models fitted with the training sets.Then, we obtained the mean position of each feature for the 10 folds and we sorted the features accordingly.We repeated the model fitting and evaluation with increasing subsets of top features (from 10 to 1000).We calculated the mean MCC for each subset along the 10 rounds and we selected the subset with the maximum mean MCC (i.e., the top 90 genes for expression and the top 900 CpGs for methylation).
We annotated the genes and CpGs with biomaRt [32,33] and Illu-minaHumanMethylation450kanno.ilmn12.hg19R packages respectively.We used the GeneCodis tool [34,35] to perform the functional analysis of the selected features, using as input the ENSEMBL and CpG probe identifiers and selecting Reactome as annotation.We considered as significant those pathways with a False Discovery Rate (FDR) < 0.05.

External data processing and analysis
With the aim of testing the selected biomarkers utility for the classification of healthy and diseased samples, we used public gene expression and DNA methylation datasets available in the Gene Expression Omnibus (GEO) [36].For the gene expression dataset (GEO ID: GSE108497), we downloaded the processed data from the ADEx database [37].Data for 62 of the 90 genes were available for this dataset.For the DNA methylation dataset (GEO ID: GSE166373), we used the GEOquery R package [38] to download the raw idat files.We processed these raw data with the minfi package [39] applying the normal-exponential out-of-band (Noob) normalization [40].Data for the 900 selected CpGs were available for this dataset.
We followed the same analytical pipeline described previously, with the exception that we did not perform a batch effect correction for these data.We assessed the performance of the models with the accuracy and MCC metrics.Being this a binary classification, the MCC formula (4) reduces to the simpler form ( 5 The code to process and analyze the validation data is available at https://github.com/GENyO-BioInformatics/SLE_pSjS_classifier.

Diagnosis capacity depends on the patients' molecular profile
In the first place, we represented the UMAP plots to explore how samples are grouped according to their gene expression profiles (Fig. 2).As can be observed, SLE and pSjS samples overlap completely, and samples from both diseases overlap partially with the healthy controls (Fig. 1A).These results illustrate the challenge of classifying these diseases.On the other hand, if samples are colored by their molecular cluster assigned in our previous work [4], it can be observed that the inflammatory and interferon clusters are almost perfectly separated from the healthy samples, while the lymphoid and, especially, the undefined clusters overlap with the healthy samples (Fig. 2B).
Next, we used lazypredict library to calculate the mean MCCs obtained by different ML models after 10 rounds of training/test splitting (Supplementary Table 1).The model with the highest mean MCC in expression and methylation data was XGBoost, which was selected to perform the classifications.
We trained and evaluated XGBoost models to classify the three   classes of our cohort (SLE, pSjS and healthy) considering all the samples ('overall' group).Accuracy, weighted precision and weighted F1 scores for our predictors are reported in Table 2.However, it has been reported that these metrics may be inflated [41].For that reason, we also calculated the MCCs and we interpreted the results based on these scores.
Our first multiclass predictor model, which used the complete datasets, achieved a test set MCC of 0.5791 ± 0.0409 and 0.5546 ± 0.0484 for expression and methylation data, respectively (Table 2).
In previous work, we demonstrated that SADs patients may be classified in four molecular clusters with different characteristics [4].For this reason, we wondered whether the performance of our models varies between the different groups of samples.To answer this question, we repeated the analyses for each individual cluster of samples (i.e., lymphoid, inflammatory, interferon and undefined clusters).
Using the expression data, these models showed a differential performance among clusters (Table 2, Fig. 3), with remarkable low performance for the undefined cluster (MCC = 0.3840 ± 0.1075).This cluster of patients was previously associated with low disease activity and showed a healthy-like molecular pattern [4], so they were expected to be poorly predicted.At the other extreme, the interferon cluster showed the best prediction results (MCC = 0.7717 ± 0.0523), followed by the inflammatory cluster results were close (MCC = 0.6802 ± 0.0900).These results are also in accordance with the characterization of these two molecular clusters, since they were associated with the most extreme manifestations (e.g., nephritis and thrombosis for the interferon cluster, and fibrosis complications for the inflammatory cluster) [4].On the other hand, the moderate performance for the lymphoid cluster (MCC = 0.4705 ± 0.1466) is also coherent with the mild symptoms associated with this cluster (e.g., gastrointestinal manifestations) [4].
We followed the same methodology to construct predictor models based on methylation data, observing the same trend regarding the differential prediction capacity in the different molecular clusters (Table 2, Fig. 3).Furthermore, we decided to repeat the overall analysis excluding the samples belonging to the undefined cluster, given that they represent healthy-like molecular profiles and they were probably affecting the overall model accuracy.We defined this new group of samples as the 'pathological' group, which includes the samples from the inflammatory, lymphoid and interferon clusters.The pathological group comprises the molecular heterogeneity found for active SLE and pSjS patients and our predictor showed a middle performance between the individual pathological clusters (MCC = 0.6538 ± 0.0551 and 0.6522 ± 0.0401 for expression and methylation data respectively).

Variable selection improves the predictors performance
Following a feature selection strategy (see Methods) we identified the most informative subsets of genes and CpGs, which could be used as biomarkers to differentiate SLE, pSjS and healthy controls.In this way, we proposed a set of 90 genes for gene expression (Supplementary Table 2) and 900 CpGs for DNA methylation (Supplementary Table 3) that may be used to distinguish SLE and pSjS, both from healthy controls and between them.For gene expression, the mean MCC for the pathological cluster increased from 0.6538 to 0.7310 ± 0.0312 using the subset of 90 genes.For DNA methylation, the MCC increased from 0.6522 to 0.6717 ± 0.0572.
We performed an enrichment analysis to get insight into the biological pathways in which the selected biomarkers are involved.We discovered that the most enriched pathways are related to relevant processes in SADs, such as interferon signaling and interleukins signaling (Supplementary Table 4).We obtained similar results with the selected CpGs methylation sites (Supplementary Table 5, Fig. 4).
In addition, we constructed an integrated model concatenating the selected genes and CpGs in a new dataset containing gene expression and DNA methylation data.Using these data, we obtained a mean MCC of 0.7159 ± 0.0387, what is lower than using only the gene expression, but higher than using only the DNA methylation.
Furthermore, we tested the utility of selecting these features for predictive tasks on external public datasets available in GEO [27].For gene expression, we used a dataset with 325 SLE patients and 187 healthy controls (GEO ID: GSE108497), which used the Illumina HumanHT-12 V4.0 expression beadchip technology to measure gene expression in whole blood samples.Following the same workflow that we used to prepare diagnostic models for our data, we obtained predictive models with a mean accuracy of 0.8496 ± 0.0365 and a mean MCC of 0.6712 ± 0.0836.For DNA methylation, we used a dataset with 64 pSjS patients and 67 healthy controls (GEO ID: GSE166373), which used the Illumina HumanMethylation450 BeadChip and the Illumina Infinium MethylationEPIC platforms to measure the DNA methylation levels in labial salivary gland samples.For this dataset, the predictive models achieved a mean accuracy of 0.7926 ± 0.0894 and a mean MCC of 0.5958 ± 0.1764.These results suggest that the selected biomarkers may be useful for the diagnosis of SLE and pSjS patients, even for data from different technologies and tissues than the ones used in our cohort.

The undefined cluster has a specific gene expression signature
As previously commented, the undefined cluster contains a group of patients with molecular patterns very similar to healthy controls.For that reason, it is very challenging to classify these samples with expression and methylation data.Nevertheless, we tried to improve the results with specific models and feature selection for this subset of samples.For that aim, we performed the feature selection process described in Section 3.2 for this group, selecting 50 genes for gene expression (Supplementary Table 6) and 640 CpGs for DNA methylation (Supplementary Table 7).The mean MCCs were 0.5746 ± 0.0811 and 0.4914 ± 0.1079 for expression and methylation data respectively.Although the performance without feature selection was similar for expression and methylation (Table 2, Fig. 3), the results are better with expression after feature selection.Interestingly, only 14 of the 50 genes (28%) overlap with the 90 selected genes for the pathological group, while the majority of CpGs (624 of 640, 97.5%) overlap with the selected CpGs for the pathological group.These results may indicate that the undefined cluster has specific gene expression alterations that may help to diagnose these difficult samples, while the alterations at DNA methylation level are similar to the other groups and are not as useful to perform this classification.

Conclusions
SLE and pSjS are two SADs with some overlapping symptoms.High throughput technologies such as RNA-Seq and microarrays may be valuable tools for the differential diagnosis of these pathologies.In this work, we demonstrated that ML methodologies can predict the disease status of each patient from expression and methylation data, although the prediction capacity depends on the molecular background of the patients.Furthermore, we selected subsets of features that improve the predictions and have important roles in the pathological mechanisms of SADs.We validated these features with external datasets, demonstrating their capability to diagnose each disease from both expression and methylation data.Finally, we obtained specific sets of biomarkers that may be useful to classify the samples from the undefined cluster, which have a healthy-like molecular pattern.Although some previous works used different types of data to perform differential diagnosis of SLE and pSjS, as far as we know the present study appears to be the first work describing multiclass classifiers to perform this task from expression and methylation data.In our opinion, the results of this work demonstrate the potential use of transcriptome and methylome data to perform differential diagnoses of SADs using ML approaches.

Declaration of competing intrest
None Declared.

Fig. 1 .
Fig. 1.Data analysis workflow.The initial data was split into training and test sets.Training sets were used to tune the XGBoost model with 100 rounds of 10-fold CV.A XGBoost model with these optimized hyperparameters was trained with the training set.This model was used to predict the labels of the test set and to evaluate the performance with the results of the prediction.This outer data splitting was repeated 10 times.

Fig. 2 .
Fig. 2. UMAP plots of the gene expression data with samples colored by their clinical diagnosis (A) and by their previously assigned molecular cluster (B).

Fig. 3 .
Fig. 3. MCC scores for each molecular cluster with XGBoost models based on expression and methylation data.For each cluster and data type there are 10 MCCs corresponding to the 10 rounds of training-test splitting.

Fig. 4 .
Fig.4.Network with the top 10 most significant pathways for the enrichment analysis of the methylation biomarkers.Reactome pathways (blue circles) are connected to the corresponding genes (orange circles).The circles size is proportional to the statistical significance and the color intensity to the number of biomarkers.

Table 1
Distribution of the patients samples in the four molecular clusters.

Table 2
Accuracy, precision, F1 and MCC scores for the test sets of each group of samples.