AUTOMATED INTERPRETABLE COMPUTATIONAL BIOLOGY IN THE CLINIC: A FRAMEWORK TO PREDICT DISEASE SEVERITY AND STRATIFY PATIENTS FROM CLINICAL DATA

We outline an automated computational and machine learning framework that predicts disease severity and stratifies patients, and we apply it to available clinical data. Our algorithm automatically generates insights and predicts disease severity with minimal operator intervention. The framework can be used to stratify patients, predict disease severity and propose novel biomarkers for disease. Insights from machine learning algorithms coupled with clinical data may help guide therapy, personalize treatment and help clinicians understand how disease changes over time. Computational techniques like these can be used in translational medicine in close collaboration with clinicians and healthcare providers. Our models are also interpretable, allowing clinicians with minimal machine learning experience to engage in model building. This work is a step towards automated machine learning in the clinic.


INTRODUCTION
The advent of big data and clinical records databases opens up possibilities for clinical data science. Machine learning techniques coupled with clinical data are thought to be critical in delivering the next generation of healthcare [1].
Here we present an automated computational framework to derive insights from clinical data. The computational framework presented here can be used to stratify patients, predict disease severity and propose novel biomarkers for disease. Our approach automatically performs model inference, cross-validation, model selection and insight generation with minimal operator intervention. Our models are also interpretable, allowing domain experts like clinicians (with minimal machine learning experience) to engage in model building. Insights from machine learning algorithms coupled with clinical data may help guide therapy, personalize treatment and help clinicians understand the change in disease over time. Our approach is a step towards automated machine learning and computational biology in the clinic.

METHODS
We have developed an automated machine learning framework that performs predictions with minimal operator intervention. First, we perform feature scaling to ensure that all input features are on the same scale. We then apply a suite of machine learning techniques: neural networks, random forests, regularized generalized linear models (logistic regression with LASSO, the least absolute shrinkage and selection operator), support vector machines, linear regression and principal component analysis. Crucially, we perform inference, cross-validation, model selection and insight generation with minimal operator intervention.
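The paper does not include code; as a minimal sketch of the feature-scaling step described above, the min-max scaler below maps each feature column onto a common range (the function name and range arguments are our own, for illustration only).

```python
import numpy as np

def min_max_scale(X, lo=0.0, hi=1.0):
    """Rescale each column of X linearly onto the range [lo, hi]."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_rng = X.max(axis=0) - col_min
    col_rng[col_rng == 0] = 1.0  # avoid division by zero for constant columns
    return lo + (X - col_min) / col_rng * (hi - lo)
```

Any monotone scaler (e.g. standardization to zero mean and unit variance) could be substituted; the point is that all downstream models see features on the same scale.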

DATA
We used the Wisconsin breast cancer dataset from the UCI machine learning repository, available for download from [2], [3,4]. The dataset consists of 699 patients, divided into healthy patients and patients with breast cancer; disease status is reported as benign or malignant. The attributes measured were clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli and mitoses. All predictors are numeric (there are no categorical predictors) and were scaled to lie within a range of 0 to 10. We replaced missing values with 0; future work will look at schemes to impute these values. Finally, we split the data into training, cross-validation and test sets.
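The preprocessing described above (zero-filling missing values, then splitting into training, cross-validation and test sets) can be sketched as follows; the function name, split fractions and random seed are illustrative assumptions, not the paper's actual settings.

```python
import numpy as np

def preprocess_and_split(X, y, seed=0, frac=(0.6, 0.2, 0.2)):
    """Replace missing values (NaN) with 0, then shuffle and split the
    data into training, cross-validation and test sets."""
    X = np.asarray(X, dtype=float).copy()
    X[np.isnan(X)] = 0.0  # crude imputation, as described in the text
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(frac[0] * len(X))
    n_cv = int(frac[1] * len(X))
    tr, cv, te = np.split(idx, [n_train, n_train + n_cv])
    return (X[tr], y[tr]), (X[cv], y[cv]), (X[te], y[te])
```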

STRATIFYING PATIENTS
We used principal component analysis (PCA) to gain insights into the clinical data. The PCA suggests that the data can be separated into a few clusters (Figure 1 and Figure 2). Single epithelial cell size and uniformity of cell shape seem to separate the data into distinct clusters (Figure 2). The attribute mitoses seems to account for many outliers (Figure 2). We note that the first principal component explains about 65% of the variance in the data (Figure 3).
Finally, the PCA identifies the most extreme points in the data (outliers). Five patients, with codes 1123061, 1198128, 1147748, 1165926 and 760001, are predicted to be outliers. For example, the patients coded 1123061 and 760001 have a very low value (< 3) for uniformity of cell shape. The patient coded 760001 has a very low value of mitoses (1 on a scale of 1 to 10). All patients predicted to be outliers also have low values of the attribute bare nuclei. This kind of analysis can be used to stratify patients.
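The PCA projection and outlier flagging above can be sketched with plain linear algebra; the helper names and the "distance from the origin in PC space" outlier score below are our own simplification of whatever criterion the authors used.

```python
import numpy as np

def pca_scores(X, k=2):
    """Project centred data onto its first k principal components."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:k].T               # PC coordinates of each patient
    explained = S**2 / np.sum(S**2)      # fraction of variance per component
    return scores, explained[:k]

def flag_outliers(scores, n=5):
    """Return indices of the n points farthest from the origin in PC space."""
    dist = np.linalg.norm(scores, axis=1)
    return np.argsort(dist)[-n:][::-1]
```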

PREDICTING DISEASE SEVERITY
We predict disease severity, or the probability of having the disease, using a suite of machine learning algorithms. We first looked at artificial neural networks (Figure 4), which are composed of an input layer of features, hidden layers and an output layer that predicts disease severity (on a scale of 0 to 1). We varied the number of hidden layers from 1 to 100. A neural network with 10 hidden layers gave the best performance (mean squared error equal to 0.01), as shown in Figure 5 and Figure 6.
We also used random forests, which are collections of trees. Each tree can be interpreted as a set of rules that suggest how to combine the attributes to predict disease severity, and a forest is an ensemble of such trees. We varied the leaf size from 5 to 100 and the number of trees grown from 1 to 50 (Figure 7). The best random forest model achieved a cross-validation mean squared error of 0.04. Insights from interpretable machine learning algorithms like random forests can inform decisions in the clinic. The top predictors in the random forest are shown in Figure 10. Uniformity of cell size (2nd feature) and bare nuclei (6th feature) are important predictors; mitoses (9th feature) is the least important predictor. We note, however, that mitoses separates two different clusters in the PCA plot (Figure 2) and may be useful as a biomarker.
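To make the "tree as a set of rules" interpretation concrete, the sketch below hand-codes one rule of the kind a tree in the forest encodes and takes a majority vote over several rules; the thresholds are illustrative, not cut-points learned from the Wisconsin data.

```python
def tree_rule(patient):
    """One interpretable rule of the kind a decision tree encodes.
    Thresholds are illustrative placeholders, not fitted values."""
    if patient["single_epithelial_cell_size"] <= 2.5:
        if patient["uniformity_of_cell_shape"] < 1.5:
            return "benign"
        return "malignant"
    return "malignant"

def forest_vote(patient, rules):
    """A forest aggregates many such rules; here, by majority vote."""
    votes = [rule(patient) for rule in rules]
    return max(set(votes), key=votes.count)
```

A real random forest grows each tree on a bootstrap sample with random feature subsets, but every individual prediction still decomposes into readable IF/THEN paths like the one above.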
We note that even though artificial neural networks have the best performance (cross-validation mean squared error of 0.01, versus 0.04 for random forests), the most interpretable models are the random forests.
We also used a logistic regression model with LASSO (L1 regularization). We performed 10-fold cross-validation to determine the regularization parameter (Figure 11) and found that all predictor coefficients remain non-zero after cross-validation. Hence the logistic regression model suggests that all the predictors are important.
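A minimal sketch of L1-regularized logistic regression, fitted by proximal gradient descent (the soft-thresholding step is what drives LASSO coefficients exactly to zero); the optimizer, learning rate and penalty strength below are our own illustrative choices, not the paper's.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lasso_logistic(X, y, lam=0.05, lr=0.2, steps=3000):
    """Logistic regression with an L1 penalty, fitted by proximal
    gradient descent. Coefficients the data cannot support are
    shrunk exactly to zero by the soft-threshold step."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        p = sigmoid(X @ w + b)
        w -= lr * (X.T @ (p - y) / n)           # gradient step on the loss
        b -= lr * np.mean(p - y)
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)  # soft threshold
    return w, b
```

In the paper's setting, cross-validation over `lam` left every coefficient non-zero, which is why all nine predictors were deemed important.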
Finally, we also looked at linear regression models for correlations of attributes with each other (within patients). We did not observe any meaningful relationships.
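A pairwise check of this kind reduces to a least-squares fit and a Pearson correlation per attribute pair; the helper below is a sketch under that assumption (the function name is ours).

```python
import numpy as np

def pairwise_attribute_fit(a, b):
    """Least-squares fit b ~ slope*a + intercept, plus the Pearson r
    between two attribute columns."""
    slope, intercept = np.polyfit(a, b, 1)
    r = np.corrcoef(a, b)[0, 1]
    return slope, intercept, r
```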

BIOMARKERS
The predictors uniformity of cell shape and single epithelial cell size separate the data into a few different clusters in the PCA plot (Figure 2). Mitoses separates the data into a third cluster in the PCA plot (Figure 2). Bare nuclei is an attribute that accounts for some outliers in the PCA (see Stratifying Patients above).
Our random forest algorithm suggests that the top predictors are uniformity of cell size and bare nuclei (Figure 10). Taken together, these results suggest that uniformity of cell size and bare nuclei may be important biomarkers for disease.

DISCUSSION AND CONCLUSION
Big data technologies coupled with massive clinical records databases open up possibilities for data science in the clinic. Machine learning techniques coupled with clinical big data are thought to be critical in delivering the next generation of healthcare [1].
Here we present an automated machine learning framework that generates insights from clinical data with minimal operator intervention. The computational framework presented here can be used to stratify patients, predict disease severity and propose novel biomarkers for disease. These capabilities can be used to guide therapy and intervention in the clinic.
We use a suite of machine learning algorithms to predict disease severity and stratify patients. We found that PCA combined with random forests can suggest biomarkers and ways to stratify patients. Our analysis suggests that uniformity of cell size and bare nuclei may be important biomarkers for disease.
Even though artificial neural networks predict disease severity better than random forests, the most interpretable models are the random forests. This is critical in communicating insights to clinicians and healthcare professionals who may not be machine learning experts. We show a representative rule from a tree in a random forest (Figure 9), which takes the form (1).
Insights from interpretable machine learning algorithms like random forests can be very informative to clinicians. Our framework automatically performs model inference, cross-validation, model selection and insight generation with minimal operator intervention. Our models are also interpretable, allowing domain experts like clinicians (with minimal machine learning experience) to engage in model building. Coupling automated and interpretable machine learning techniques with clinical data may help guide therapy, personalize treatment and help clinicians understand the change in disease over time.
Our approach can be combined with multi-scale models [5][6][7][8][9]. Hybrid modelling approaches can be combined with the machine learning techniques presented in the current work to gain mechanistic insights into disease, as has been done previously for infectious diseases [10][11][12].
In summary, we present an automated and interpretable machine learning framework for generating insights, and demonstrate how it can be applied to clinical data. Computational techniques like these can be used in translational medicine in close collaboration with clinicians and healthcare providers. Our approach is a step towards automated machine learning and computational biology in the clinic.

Figure 1. Principal component analysis of the data. The analysis shows a few clusters for the first two principal components.

Figure 2. Principal component analysis of the data, showing clusters for the first two principal components.

Figure 3. Percentage of variation explained by each principal component in the PCA.

Figure 4. Architecture of the neural network used to predict disease severity. The network shown has an input layer, 30 hidden layers and an output layer.

Figure 5. Neural network performance on the training, validation and test datasets with 30 hidden layers.

Figure 6. Neural network performance on the training, validation and test datasets with 10 hidden layers.

Figure 7. Performance of the random forest algorithm (out-of-bag prediction error). The leaf sizes are varied from 5 to 100 and up to 50 trees are grown. Representative trees used for predicting disease severity are shown in Figure 8 (regression) and Figure 9 (classification). Random forests are very interpretable; for example, the tree shown in Figure 9 represents a rule of the form:

IF [(single epithelial cell size ≤ 2.5) AND (uniformity of cell shape < 1.5)] THEN healthy (1)

Figure 8. A representative tree from the random forest used in predicting disease severity (regression).

Figure 9. A representative tree from the random forest used in predicting disease severity (classification).

Figure 10. Top predictors in the random forest algorithm.

Figure 11. The effect of varying the regularization parameter (lambda) in a logistic regression model with LASSO (L1 regularization). The cross-validation error is used to find the optimal value of lambda.