A machine learning research template for binary classification problems and Shapley values integration

This paper documents published code that can help researchers tackle binary classification problems and interpret the results of a number of Machine Learning models. The original paper was published in Expert Systems with Applications; this paper documents the code and workflow, with special attention paid to Shapley values as a means of interpreting Machine Learning predictions. The Machine Learning models used are Naive Bayes, Logistic Regression, Random Forest, adaBoost, Classification Tree, LightGBM and XGBoost.


Introduction
The COVID-19 pandemic has affected every corner of the world, and research efforts have increased exponentially since its start. We provide a basic framework for researchers who wish to apply Machine Learning models to predict COVID-19 mortality at the patient level. This will allow researchers to save time in writing their own models; it is by no means an exhaustive resource, but it provides a starting point for further investigations. Predicting the probability of mortality from patient characteristics may also help hospitals better treat patients. Shapley values originate in cooperative game theory, see [2], and have recently been applied to Machine Learning models, see [3] and [4].

Contribution to the scientific community
The code was written in RMarkdown [5] and can be used as a template for Machine Learning applications to other problems. RMarkdown is a file format for making dynamic documents with R: a document is created using plain text and LaTeX formatting, with chunks of R code embedded throughout. This saves researchers time and facilitates reproducible results by directly replicating the whole research paper. In the Code Ocean capsule we provide the full RMarkdown script along with a standalone R script so that others can reproduce exactly the same results as in the original paper, see [6].
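To illustrate the pattern described above, the following is a minimal sketch of an RMarkdown file: a YAML header, plain text with LaTeX, and an embedded R chunk. The file contents, data frame name `patients` and variable `outcome` are illustrative placeholders, not taken from the Code Ocean capsule.

````
---
title: "A minimal example"
output: pdf_document
---

Plain text with LaTeX, e.g. $\phi_i(v)$, mixed with embedded R code:

```{r}
# this chunk is executed when the document is knit
fit <- glm(outcome ~ ., data = patients, family = binomial)
summary(fit)
```
````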
We apply a series of Machine Learning models, notably Naive Bayes, Logistic Regression, Random Forest, adaBoost, Classification Tree, LightGBM and XGBoost. We then compute a number of relevant statistics based on a confusion matrix for each model. The results are visualized using Receiver Operating Characteristic (ROC) and Precision-Recall (PR) curves. Individual predictions and case studies can also be analyzed from the XGBoost model, which may help practitioners make more informed decisions based on the Machine Learning models' predictions. We make use of Shapley values, a model-agnostic method for interpreting Machine Learning models. This allows us to analyze the models' predictions at a local and global level, and to see how Shapley contributions change with different variable values. Finally, we apply what-if analysis to show how changes in one variable (holding all other variables constant) impact a prediction. This is important since we can see how the probability of mortality changes as variable values increase or decrease, e.g. how does keeping a patient in hospital a day longer impact that patient's probability of survival? In this paper we highlight the main results of the aforementioned analysis, this time applied to a new dataset, to show its reproducibility.
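The mechanics of the what-if analysis described above can be sketched with a toy model. The paper's code is in R; the Python sketch below uses a hypothetical logistic model whose feature names and coefficients are made up for illustration only, to show how one variable is varied while all others are held fixed.

```python
import math

# Hypothetical logistic model used only for illustration: the feature
# names and coefficients are made up, not taken from the paper's models.
coef = {"age": 0.04, "days_in_hospital": -0.08, "comorbidity": 0.9}
intercept = -2.0

def predict_proba(patient):
    """Probability of mortality under the toy logistic model."""
    z = intercept + sum(coef[k] * patient[k] for k in coef)
    return 1.0 / (1.0 + math.exp(-z))

patient = {"age": 65, "days_in_hospital": 5, "comorbidity": 1}

# What-if analysis: vary one variable, hold all others constant.
for days in range(1, 8):
    scenario = dict(patient, days_in_hospital=days)
    print(days, round(predict_proba(scenario), 3))
```

With a negative coefficient on the varied variable, the printed probabilities decrease as the hypothetical length of stay grows, which is exactly the kind of patient-level question the what-if analysis is meant to answer.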

Shapley values
Shapley values [2] from coalitional game theory allow us to interpret the predictions of Machine Learning models by treating each variable as a player in a game with the prediction being the payout, see [7] and [8]. A cooperative game can be described as follows. A characteristic function game is given by a pair (N, v), where N is the set of players and v : 2^N → R is a characteristic function which maps every coalition of players to a payoff.
Consider the following example, which has three variables x1, x2 and x3, and where the output is a prediction from a Machine Learning model. Each variable contributes differently to the prediction (some variables contribute more, others less). The characteristic function v(·) tells us that if the variable x1 were the only variable in the model it would make a contribution of 80; the variables x2 and x3 on their own make contributions of 56 and 70 respectively. The value of the whole coalition is 90 with all three variables x1, x2 and x3 contributing to the prediction, and there are also different contribution scores for each possible coalition of variables. The Shapley value of player i is defined as φ_i(v) = (1/|N|!) Σ_π [v(P_π(i) ∪ {i}) − v(P_π(i))], where the sum runs over all orderings π of the players and P_π(i) is the set of players preceding i in π. Consider the orderings of the players. For the ordering (x1, x3, x2): x1 contributes 80 on its own; the coalition of x1 and x3 is 85, therefore x3 contributes 5; the coalition of x1, x2 and x3 is 90, therefore x2 must contribute 5 also. For the ordering (x2, x1, x3): x2 contributes 56 on its own; with the coalition of x1 and x2 being 80, x1 must contribute 24; the coalition of all variables is 90, therefore x3 must contribute 10. This is done for all permutations of the variables and the average of these marginal contributions is taken, giving φ_i, the average marginal contribution of each variable over all possible orderings.

[9], [10], [11], [12] and [13] used clinical features to predict the survival probabilities of patients with COVID-19. [10] released their patient characteristic dataset, and thus in this paper we replicate the results of our first study [1] using this new data. The code for the models and figures can be found on Code Ocean [6].
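The permutation computation in the three-variable example above can be reproduced numerically. Note that the text does not state the payoff of the coalition of variables 2 and 3; the value 72 below is an assumed placeholder so that every ordering can be evaluated (it affects the individual Shapley values but not the mechanics, nor the efficiency property that the values sum to the grand coalition's payoff of 90).

```python
from itertools import permutations
from math import factorial

# Coalition payoffs from the worked example in the text. The payoff of
# coalition {2, 3} is NOT given in the text; 72 is assumed here purely
# so that all orderings can be evaluated.
v = {
    frozenset(): 0,
    frozenset({1}): 80,
    frozenset({2}): 56,
    frozenset({3}): 70,
    frozenset({1, 2}): 80,
    frozenset({1, 3}): 85,
    frozenset({2, 3}): 72,  # assumed, not from the text
    frozenset({1, 2, 3}): 90,
}

def shapley_values(players, v):
    """Average each player's marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            totals[p] += v[coalition | {p}] - v[coalition]
            coalition = coalition | {p}
    n_orderings = factorial(len(players))
    return {p: t / n_orderings for p, t in totals.items()}

phi = shapley_values([1, 2, 3], v)
print(phi)
# Efficiency property: the Shapley values sum to v(N) = 90.
print(sum(phi.values()))
```

The inner loop reproduces exactly the hand calculation in the text: for the ordering (1, 3, 2) the marginal contributions are 80, 5 and 5, and for (2, 1, 3) they are 56, 24 and 10.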

Reproducible results: Application to new data
The data contains COVID-19 patient characteristics along with the mortality outcome for patients admitted to hospital with COVID-19. The replication of the code on a new dataset is left to the Appendix, with the purpose of illustrating the outcome of the code. The data contains only 9 features, some of which are likely to be related to each other; additionally, two of the features are binary and can each be used only a single time in each tree, reducing the complexity and non-linearity of the tree.³ Therefore, some of the more complex tree-based algorithms are unable to distinguish between the noise and the correct functional form in the data.

The code, plots and interpretation can be applied to any binary classification problem. [14] applied Machine Learning to the classification of bankrupt versus non-bankrupt firms with interpretation at the case-study level. [15] applied Shapley values to credit risk management, using a number of financial ratios to show the contribution each variable made to the model's prediction. [16] applied Shapley values to the detection of acute myocardial infarction (AMI). [17] applied Shapley values to the detection of relevant Alzheimer's disease characteristics. [18] used Shapley values as a feature selection method using data on mental health. [19] applied Shapley values to the input images of a Convolutional Neural Network to find the pixels which contribute most to the classification of the input images. [20] studied the early prediction of a team's performance in software engineering; they use a Deep Neural Network to classify the teams and Shapley values for feature importance. [21] discusses some of the problems with using Shapley values as feature importance measures.

³ For example, a decision tree can first make a split on the variable Age at ≤ 40 and > 40, and may again use Age at the node below with an additional split at ≤ 30 and > 30, using this variable twice. A binary variable can only be used once in a tree model since it can either be 1 or 0, and thus follows < 0.5 or ≥ 0.5 and cannot be used again further down the tree.

Limitations and improvements
Whereas the XGBoost and LightGBM models both contain cross-validation code (commented out in the scripts), the other models were optimized away from the main script. Users should therefore be mindful that cross-validation should also be integrated for the adaBoost and Random Forest models. Additionally, whereas the main focus of the paper was the use of Shapley values for model-agnostic interpretability, other model-agnostic methods can also be implemented, such as partial dependence plots, individual conditional expectations, accumulated local effects, permutation feature importance, local interpretable model-agnostic explanations (LIME) and others.
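For users integrating cross-validation into the remaining models, the underlying k-fold splitting idea is language-agnostic. The paper's scripts are in R; the following is a minimal Python sketch of the splitting step only (the fitting and scoring calls are left as comments, since they depend on the model being tuned).

```python
import random

def kfold_indices(n, k, seed=42):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = kfold_indices(100, 5)
for held_out in folds:
    train = [j for f in folds if f is not held_out for j in f]
    # fit the model on `train`, evaluate it on `held_out`,
    # then average the k held-out scores to compare hyperparameters
```

Each observation appears in exactly one held-out fold, so every data point is used for both training and validation across the k iterations.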

Conclusion
We believe that researchers may benefit from the models and code used to generate the results and figures in our paper. The code can be applied to binary classification problems; we have shown two example use cases applied to COVID-19 mortality prediction, one in [1] and the other here. The reproducibility of our paper may help researchers apply the same models and code to their own research problems. Adding Machine Learning models and improving and building on top of the current code would be a natural next step.