Evaluation of Machine-Learning Algorithms for Predicting Opioid Overdose Risk Among Medicare Beneficiaries With Opioid Prescriptions

Key Points Question Can machine-learning approaches predict opioid overdose risk among fee-for-service Medicare beneficiaries? Findings In this prognostic study of the administrative claims data of 560 057 Medicare beneficiaries, the deep neural network and gradient boosting machine models outperformed other methods for identifying risk, although positive predictive values were low given the low prevalence of overdose episodes. Meaning Machine-learning algorithms using administrative data appear to be a valuable and feasible tool for more accurate identification of opioid overdose risk.


Introduction
Traditional statistical model specifications require specific structural assumptions (e.g., linearity) and pre-specified factors entered into the model, which may limit opportunities for discovering new knowledge from the data, especially when the model assumptions are not satisfied by real data. Machine learning is a branch of artificial intelligence in which computer algorithms learn adaptively from the data without imposing strict assumptions in the statistical models. The major advantages of machine learning over traditional statistical approaches include the ability to capture nonlinear and nonmonotonic associations and to discover complex interactions that might have gone unnoticed in traditional regression models. 1 Moreover, tree-based machine-learning methods, such as classification trees, use surrogate splits to handle missing values, whereas traditional statistical approaches require imputation of missing values, which often relies on additional assumptions about the underlying missingness mechanism.
In this study, our primary goal was prediction, and the secondary goal was risk stratification (i.e., to identify subgroups of patients at similar risk of the outcome). First, we randomly and equally divided beneficiaries into training, testing, and validation samples, stratified by characteristics and the overdose distribution. For each study design scenario (eFigures 2A and 2B), we developed and tested prediction algorithms for opioid overdose using 5 machine-learning approaches: multivariate logistic regression, least absolute shrinkage and selection operator (LASSO), random forests (RF), gradient boosting machine (GBM), and deep neural network (DNN). For each approach, we trained the algorithm on the training sample, refined it using the testing sample, and then applied the final algorithm in the validation sample to evaluate prediction performance.
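The three-way stratified split described above can be sketched with scikit-learn; the data here are simulated stand-ins, not the study's claims data:

```python
# Sketch of an equal three-way split stratified on a rare outcome,
# using scikit-learn; variable names (X, y) are illustrative.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))                 # stand-in predictor matrix
y = (rng.random(600) < 0.05).astype(int)      # rare outcome, ~5% prevalence

# First carve off one third for validation, stratified on the outcome,
# then split the remainder evenly into training and testing samples.
X_rest, X_val, y_rest, y_val = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)
```

Stratifying on the outcome keeps the overdose prevalence comparable across the three samples, which matters for a rare outcome.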
Our model reporting complies with the Standards for Reporting Diagnostic Accuracy (STARD) and the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) reporting guidelines. 2,3 We calculated the C-statistic (or area under the receiver operating characteristic [ROC] curve) from the validation sample to assess discrimination (i.e., the extent to which patients predicted to be at high risk exhibit higher overdose rates compared with those predicted to be at low risk). We examined differences in C-statistics across approaches using the DeLong test. 4 For each probability cutoff point, overdose was predicted for visits with calculated probabilities above the cutoff point, whereas non-overdose was predicted for visits with probabilities below the cutoff point. Based on their true and predicted overdose status, the patients' 3-month visits can be assigned to one of four groups (i.e., true positive [TP], false positive [FP], true negative [TN], false negative [FN]) shown in the classification matrix (eFigure 3). Given that overdose events are rare outcomes and C-statistics do not incorporate information about the prevalence of the outcome, we reported other, more appropriate metrics, including sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), positive likelihood ratio (PLR), negative likelihood ratio (NLR), number needed to evaluate (NNE) to identify one overdose, and the estimated rate of alerts, to support pre-implementation evaluation of our prediction algorithms (eFigure 3). 5 The optimal algorithm for a screening test depends on the pre-test probability of the outcome, the values of TPs and TNs, and the costs of FPs and FNs. Because these factors vary from setting to setting (and some involve subjective choices), no single cutoff point is suitable for every purpose.
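The screening metrics above follow directly from the TP/FP/TN/FN counts in the classification matrix; a small illustration with made-up counts (note that a PPV of 10% implies an NNE of 10):

```python
# Illustrative computation of the listed screening metrics from
# TP/FP/TN/FN counts; the counts below are made up for demonstration.
def screening_metrics(tp, fp, tn, fn):
    sens = tp / (tp + fn)          # sensitivity (recall)
    spec = tn / (tn + fp)          # specificity
    ppv = tp / (tp + fp)           # positive predictive value
    npv = tn / (tn + fn)           # negative predictive value
    plr = sens / (1 - spec)        # positive likelihood ratio
    nlr = (1 - sens) / spec        # negative likelihood ratio
    nne = 1 / ppv                  # number needed to evaluate
    return dict(sens=sens, spec=spec, ppv=ppv, npv=npv,
                plr=plr, nlr=nlr, nne=nne)

m = screening_metrics(tp=90, fp=810, tn=8990, fn=110)
# With these counts, PPV = 90/900 = 0.10, so NNE = 10.
```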
To compare performance across methods, we presented and assessed these prediction metrics (e.g., NNE) at the optimized threshold of the predicted probability that balances sensitivity and specificity, as identified by the Youden index, 6 as well as at multiple levels of sensitivity and specificity (e.g., 90%-100%) to allow risk-benefit evaluations of interventions triggered by a positive test using different thresholds defining high risk.
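A minimal sketch of locating the Youden-index cutoff on an ROC curve, using scikit-learn on simulated scores (names and distributions are illustrative, not from the study):

```python
# The Youden index is sensitivity + specificity - 1 = TPR - FPR;
# the optimized cutoff maximizes it over all ROC thresholds.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(1)
y_true = (rng.random(2000) < 0.05).astype(int)     # rare outcome
# Simulated scores that run somewhat higher for true positives.
scores = rng.random(2000) + 0.5 * y_true

fpr, tpr, thresholds = roc_curve(y_true, scores)
youden = tpr - fpr                     # Youden index at each cutoff
best_cutoff = thresholds[np.argmax(youden)]
```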
Second, based on the score (i.e., 100 × the predicted probability of overdose, to simplify interpretation and application) in the validation sample, we classified beneficiaries into three risk groups: low risk (scores below the optimized threshold identified by the Youden index), high risk (the top fifth percentile of scores), and medium risk (scores between the optimized threshold and the top fifth percentile). Calibration (the extent to which the predicted overdose risk agrees with the observed risk) was evaluated using calibration charts of observed overdose rates across the three risk groups.
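The three-group stratification can be sketched as follows; the score distribution and the Youden cutoff value here are illustrative stand-ins:

```python
# Classify scores (100 x predicted probability) into low-, medium-,
# and high-risk groups as described above.
import numpy as np

def risk_group(scores, youden_cutoff):
    high_cut = np.percentile(scores, 95)    # top fifth percentile of scores
    return np.where(scores >= high_cut, "high",
           np.where(scores < youden_cutoff, "low", "medium"))

scores = 100 * np.linspace(0, 0.2, 1000)    # stand-in score distribution
groups = risk_group(scores, youden_cutoff=1.0)
```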

Multivariate logistic regression (MLR) 7
MLR was performed with the log odds of opioid overdose as the dependent variable. The independent variables consisted of all 268 predictor candidates and 2-way interactions between any two predictors. We examined correlation coefficients among the predictors to screen for multicollinearity; absolute values of r >0.75 were considered to indicate strong correlation. We entered variables into the model in a stepwise fashion, with an α value of 0.10 to enter and 0.15 to remove.
Stepwise selections were based on asymptotic covariance estimates. The goodness of fit of the final model was assessed with the Hosmer-Lemeshow χ2 test. The MLR fitted to the training visits was used to calculate a probability of overdose for each 3-month visit in the validation sample. For MLR, validation visits were assigned to one of the two predictive categories (i.e., overdose vs. non-overdose) using a probability threshold of 0.01, identified from the ROC curve using the Youden index. Furthermore, we used multiple imputation (MI), under an assumption of missing completely at random, for missing predictor data. 8 The MI methods simultaneously predicted missing values of variables using complete observations by modeling the joint distribution of all other predictor candidates. For variables with multiple categories (e.g., provider specialty), we used multiple imputation by chained equations (MICE) to impute the missing data. MLR was conducted using the software package Salford SPM v8.2.
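As a rough analogue of the MI step, scikit-learn's IterativeImputer (an experimental API modeled on MICE) imputes each incomplete variable from the others; this is a sketch with simulated data, not the software used in the study:

```python
# MICE-style imputation sketch: each variable with missing data is
# modeled on the remaining variables in round-robin fashion.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 4))
X[rng.random(X.shape) < 0.1] = np.nan      # punch ~10% missing values

X_imputed = IterativeImputer(random_state=0).fit_transform(X)
```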

Regularized logistic regression: Least absolute shrinkage and selection operator (LASSO)-type regularized logistic regression 9-11
In order to simultaneously fit the regression model and select important predictors, we used several regularized (penalized) logistic regressions, including LASSO, ridge regression, and elastic net regression, which can automatically identify the factors that most strongly predicted overdose. Regularized regression uses the elastic family of penalty functions for model estimation, which performs variable selection and produces a parsimonious model. Briefly, after forming a prediction model with logistic regression using all candidate variables, beta coefficients are penalized and lowered to deal with model overfitting. The magnitude of penalization is then varied to create models with different prediction errors in a cross-validation process, so that the final model achieves optimal penalization based on the lowest prediction error. In regularized regression, variable selection is performed automatically by shrinking the regression coefficients of some variables to zero. Because regularized regression, like traditional statistical methods (e.g., logistic regression), cannot handle missing values and deletes rows with missing data, we used multiple imputation (MI), under an assumption of missing completely at random, for missing predictor data. 8 The MI methods simultaneously predicted missing values of variables using complete observations by modeling the joint distribution of all other predictor candidates. For variables with multiple categories (e.g., provider specialty), we used multiple imputation by chained equations (MICE) to impute the missing data. Regularized regression also cannot automatically discover nonlinear relationships or multiple interactions.
However, regularized regression is expected to be more effective in the following situations: (1) there are many more columns (predictors) than rows (observations), (2) the available predictors may be extremely highly correlated with each other, or (3) the goal is to find the most compact model yielding acceptable performance. We used the GPS/LASSO function in Salford SPM v8.2 to perform regularized regression with different elasticity values between 0 and 2. We used elasticity values of 1 (LASSO regression, which introduces variables as quickly as possible and produces reasonably sparse solutions), 2 (ridge regression, which introduces variables sparingly and produces the least sparse solutions, favoring proportional shrinkage), and 1.1 (a hybrid ridge-LASSO regression). We also controlled the amount of coefficient adjustment using a learning rate of 0.001 (the learning rate lies between 0 and 1, and a small value forces smaller updates at each step). All other parameters were set to default values. The best model, chosen through cross-validation, had an elasticity value of 1.1. For regularized regression, validation visits were assigned to one of the two predictive categories (i.e., overdose vs. non-overdose) using a probability threshold of 0.01, identified from the ROC curve using the Youden index.
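A hedged sketch of LASSO-type regularized logistic regression using scikit-learn rather than Salford's GPS/LASSO function; a pure-LASSO (L1) penalty is shown, and note that scikit-learn's elastic-net mixing parameter (l1_ratio, between 0 and 1) differs from Salford's 1-to-2 elasticity scale:

```python
# LASSO-penalized logistic regression: the L1 penalty shrinks many
# coefficients exactly to zero, performing automatic variable selection.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 50))             # many candidate predictors
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500) > 1.5).astype(int)

lasso = LogisticRegression(penalty="l1", solver="saga",
                           C=0.1,          # smaller C = stronger penalty
                           max_iter=5000)
lasso.fit(X, y)
n_selected = int(np.sum(lasso.coef_ != 0))  # predictors kept in the model
```

In practice the penalty strength (here `C`) would be tuned by cross-validation, as the study did for the elasticity value.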

Random Forests (RF) 12,13
A classification tree identifies mutually exclusive subgroups whose members share similar important predictors of the outcome of interest through a series of binary recursive partitionings. 14 RF, consisting of numerous independent classification trees built from random sampling, improves prediction performance over a single tree. 12,15 In order to have diverse trees in the random forest, each tree randomly selects a subset of predictor variables. This allows variables of low predictive importance to enter the ensemble and potentially reveal interaction effects with other variables, which could improve the overall predictive accuracy of the RF. Implementation of the tree-growing procedure in RF may be controlled by multiple tuning parameters and criteria, including the number of trees in the forest, a priori p-value thresholds for implementing a binary split, the minimum sample size of observations in a node, and the number of candidate predictors to be selected at random. Practical guidance suggests that initial values for each random forest be fine-tuned iteratively by varying the value of each parameter and empirically selecting the combination of parameters that yields the lowest prediction error. 16 The tree-growing process continues until a terminal node is reached, which occurs when no allowable splits exist or stopping criteria are met. Our proposed algorithm steps and rationale were adapted from the implementation of Chirkov et al.'s RF framework. 16 We used the Random Forests function in the software package Salford SPM for this study. For RF, validation visits were assigned to one of the two predictive categories (i.e., overdose vs. non-overdose) using a probability threshold of 0.93, identified from the ROC curve using the Youden index.
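A minimal sketch of the RF step using scikit-learn's RandomForestClassifier in place of Salford SPM; the tuning values and data below are illustrative, not the study's:

```python
# Random forest sketch: an ensemble of trees grown in parallel, each
# considering a random subset of predictors at every split.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 20))
y = (X[:, 0] > 1).astype(int)              # stand-in binary outcome

rf = RandomForestClassifier(
    n_estimators=200,        # number of trees in the forest
    max_features="sqrt",     # candidate predictors sampled per split
    min_samples_leaf=5,      # minimum sample size in a terminal node
    random_state=0)
rf.fit(X, y)
proba = rf.predict_proba(X)[:, 1]          # predicted outcome probabilities
```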

Gradient Boosting Machine (GBM; Stochastic gradient boosting or TreeNet in Salford SPM) 17,18
Similar to RF, GBM creates a tree ensemble, except that a GBM model consists of a series of trees grown sequentially, whereas a RF consists of a collection of trees grown in parallel. Briefly, the algorithm begins with a very small tree fitted to the data as the initial model. The residuals (error values) from the first tree are then fed into a second small tree, which attempts to further reduce the error by predicting the residuals from the first tree. Next, the algorithm computes residuals from this new 2-tree model and grows a 3rd tree to predict the revised residuals. This process is repeated through a series of successive trees. The final predicted value is formed by adding the weighted contribution of each tree. The algorithm typically generates thousands of small decision trees built in a sequential error-correcting process to converge to an accurate model. Usually, the individual trees are fairly small (typically 3 levels deep with 6 terminal nodes), but the full GBM additive series may consist of hundreds of these small trees. Mathematically, a GBM model can be described as:

GBM predicted target (i.e., opioid overdose) = F0 + β1*T1(X) + β2*T2(X) + … + β100*T100(X)
where F0 is the starting value for the tree series, X is the vector of predictor values, T1(X), T2(X), T3(X), … are trees fitted successively to the pseudo-residuals remaining at each point in the series, and β1, β2, … are coefficients of the tree node predicted values, computed by the GBM algorithm.
Specifically for our study, Salford's TreeNet function first supplied an initial value specific to the chosen loss function (i.e., binary logistic) for each record in the training sample. TreeNet handles missing values automatically. We used cross-entropy (i.e., negative average log likelihood) as the tuning criterion to determine the optimal number of trees for logistic models. Second, TreeNet randomly sampled 25% of the records in the learn sample and computed the generalized residuals for the sampled records, then used those records to fit a classification tree with a maximum of 8 terminal nodes to the generalized residuals. Third, TreeNet used this classification tree to update the TreeNet model based on the loss function and shrank the updated tree by the learning rate (or shrinkage rate) of 0.1 for overfitting protection. TreeNet repeated these steps 200 times (i.e., number of trees to build = 200). Finally, we tested and validated the algorithms in the testing and validation samples. For TreeNet, validation visits were assigned to one of the two predictive categories (i.e., overdose vs. non-overdose) using a probability threshold of 0.39, identified from the ROC curve using the Youden index.
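The TreeNet configuration above maps roughly onto scikit-learn's GradientBoostingClassifier; this sketch mirrors the stated settings (200 trees, learning rate 0.1, 25% record subsampling, small trees) on simulated data:

```python
# Stochastic gradient boosting sketch with TreeNet-like settings.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(400, 10))
y = (X[:, 0] + X[:, 1] > 1).astype(int)    # stand-in binary outcome

gbm = GradientBoostingClassifier(
    n_estimators=200,     # number of sequential trees
    learning_rate=0.1,    # shrinkage applied to each tree's update
    subsample=0.25,       # random 25% of records per tree
    max_leaf_nodes=8,     # small trees, as in TreeNet
    random_state=0)
gbm.fit(X, y)
proba = gbm.predict_proba(X)[:, 1]
pred = (proba > 0.39).astype(int)   # Youden-index cutoff reported above
```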

Deep feed-forward neural network (DNN) or Artificial neural network (ANN) 19,20
Deep learning refers to artificial neural network (ANN) models with multiple hidden layers (typically ≥3) that can quantify complex nonlinear relationships between inputs and outputs. Recently, deep learning has shown impressive performance in image and audio processing, natural language processing, and biomedical fields. 19,20 Prior to conducting deep learning, we imputed missing data using median imputation for continuous variables and mode imputation for categorical variables. We also generated a series of dummy variables for categorical variables and normalized continuous variables to a range between 0 and 1. We used a deep feed-forward neural network (DNN) to develop algorithms in our training datasets using Python 3.6 (keras package). We examined different numbers of hidden layers and nodes; in the end, a DNN with 2 hidden layers of 120 and 40 nodes performed best. In this study, we generally referred to our approach as a DNN, although the best-performing model had only 2 hidden layers. In each hidden layer during algorithm training, we chose ReLU as the activation function to better prevent vanishing gradients and yield faster convergence. We applied a sigmoid function to the output layer to generate a score representing the probability of overdose. We used the binary cross-entropy loss function with balanced class weights to adjust for the rare outcome. Finally, we fine-tuned the L2 regularization weight and minimized the loss function to optimize the weights and biases of the DNN. For DNN, validation visits were assigned to one of the two predictive categories (i.e., overdose vs. non-overdose) using a probability threshold of 0.465, identified from the ROC curve using the Youden index.
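The original model was built in Keras; as a rough, hedged stand-in, the sketch below uses scikit-learn's MLPClassifier with the same 120- and 40-node ReLU hidden layers and an L2 weight (MLPClassifier has no per-class weighting, so the balanced class-weight step is omitted):

```python
# Feed-forward network sketch: 2 ReLU hidden layers (120, 40 nodes),
# inputs scaled to [0, 1], logistic output for a probability score.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 12))
y = (X[:, 0] - X[:, 1] > 0.5).astype(int)   # stand-in binary outcome

X01 = MinMaxScaler().fit_transform(X)       # normalize predictors to [0, 1]
dnn = MLPClassifier(hidden_layer_sizes=(120, 40), activation="relu",
                    alpha=1e-4,             # L2 regularization weight
                    max_iter=500, random_state=0)
dnn.fit(X01, y)
proba = dnn.predict_proba(X01)[:, 1]        # probability-of-outcome score
```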

Overall misclassification rate
The proportion of incorrectly predicted observations (i.e., actual overdoses predicted as non-overdoses and vice versa) divided by the total observations.

F1 score
The harmonic mean of precision (or PPV) and recall (or sensitivity). F1 takes both false positives and false negatives into account, and it is usually more useful than the overall misclassification rate under an uneven class distribution (e.g., non-overdose individuals comprised the majority of the cohort). 23 An F1 closer to 1 is desirable.
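A worked example of the F1 score as the harmonic mean of precision and recall, with made-up confusion counts:

```python
# F1 = 2 * precision * recall / (precision + recall).
def f1_from_counts(tp, fp, fn):
    precision = tp / (tp + fp)   # PPV
    recall = tp / (tp + fn)      # sensitivity
    return 2 * precision * recall / (precision + recall)

# With precision = 0.5 and recall = 0.5, F1 = 0.5.
f1 = f1_from_counts(tp=50, fp=50, fn=50)
```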

C-statistic
The area under the receiver operating characteristic (ROC) curve, which is a plot of sensitivity vs. false positive rate (or 1-specificity) for all potential cutoff probability thresholds of an algorithm. Comparisons of C-statistics based on imbalanced data or rare outcomes can be misleading because C-statistics do not incorporate information about the prevalence or pre-test probability of the outcome. 5

Precision-recall curves
A precision-recall curve plots precision (or PPV; y-axis) vs. recall (sensitivity; x-axis). A curve closer to the upper right corner (corresponding to 100% precision and 100% recall) indicates better performance.

Number needed to evaluate (NNE)
The NNE is the number of patients necessary to evaluate or screen to detect one outcome (i.e., overdose), similar to the number needed to treat in clinical trials. A PPV of 10% is equivalent to an NNE of 10.

Estimated rate of alerts
Provides the estimated number of alerts per number of patients screened or evaluated over a period of time (for example, per 100 patients per day). Too many alerts may lead to alert fatigue; too few may lead to unfamiliarity with the clinical response.

Abbreviations: ED: emergency department; FFS: fee-for-service; MME: morphine milligram equivalent; No.: number
a Rather than p values or coefficients, the RF reports the importance of predictor variables included in a model using both permutation-based and Gini split methods. Importance is a measure of each variable's cumulative contribution toward reducing squared error, or heterogeneity within the subset, after the data set is sequentially split based on that variable. Thus, it reflects a variable's impact on prediction. Absolute importance is then scaled to give relative importance, with a maximum importance of 100. For example, the top 10 important predictors identified from RF included cumulative days of early refills for opioids, cumulative 30-day duration of opioid use, No. opioid prescribing pharmacies, No. methadone fills, No. oxymorphone fills, prescriber-level average monthly opioid prescriptions, No. long-acting fentanyl fills, No. short-acting hydromorphone fills, No. gabapentinoid fills, and provider-level average monthly patients prescribed opioids.
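Relative variable importance scaled to a maximum of 100 can be sketched with scikit-learn's impurity-based importances (Salford SPM additionally reports permutation-based importance); the data are simulated:

```python
# Extract impurity-based importances from a fitted random forest and
# rescale so the most important variable scores 100.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(6)
X = rng.normal(size=(400, 8))
y = (X[:, 0] + 0.5 * X[:, 3] > 1).astype(int)   # signal in columns 0 and 3

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
relative = 100 * rf.feature_importances_ / rf.feature_importances_.max()
ranked = np.argsort(relative)[::-1]             # variables by importance
```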