Machine Learning Analysis of Handgun Transactions to Predict Firearm Suicide Risk

Key Points

Question Can handgun purchasing records, coupled with machine learning techniques, be used to forecast firearm suicide risk?

Findings In this prognostic study of nearly 2 million individuals with handgun transaction records, among transactions classified in the riskiest 5%, close to 40% were associated with a purchaser who died by firearm suicide within 1 year. Among the small number of transactions with a random forest score of 0.95 and above, more than two-thirds were affiliated with a purchaser who died by firearm suicide within 1 year (24 of 35).

Meaning This study suggests that passively collected administrative data on handgun transactions may be used to inform targeted interventions based on risk stratification.


Predictor Variables
We generated 41 predictor variables from the DROS transaction data. A number of these features related to the handgun itself. These included handgun category (revolver vs. semiautomatic pistol vs. unknown) and caliber, categorized as small (e.g. .22, .25, .32), medium (e.g. .38, 9 mm), large (e.g. .40, .44, .45), or "other," which included frame only, interchangeable, missing, or any caliber not captured in our small/medium/large schema. We included an indicator for an "inexpensive" handgun, proxied by the manufacturer, selecting the bottom quartile of median prices found in the Blue Book of Gun Values. 2 Transaction characteristics included an indicator for whether the handgun was purchased at a gun show (yes, no, or unknown), the transaction type (e.g. sale, voluntary registration, pawn redemption, or law enforcement acquisition), and the transaction status (approved, administrative denial, or cancelled). We also calculated the number of purchases an individual had made in the prior 1, 2, 5, and 10 years. To calculate prior transactions, we used AFS transaction data dating back to 1985. These transaction records include personal identifiers but lack most of the details available in DROS from 1996 forward, and thus were used only for calculating prior purchasing.
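The prior-purchase windows described above can be sketched as follows. This is an illustrative Python sketch only (the authors' pipeline is not shown in the text); the function name, the field layout, and the 365-day year approximation are all assumptions made for the example.

```python
from datetime import date, timedelta

def prior_purchase_counts(purchase_dates, current, windows_years=(1, 2, 5, 10)):
    """Count a purchaser's earlier transactions falling within each
    look-back window, strictly before the current transaction date.
    Uses a 365-day year as an approximation."""
    counts = {}
    for years in windows_years:
        cutoff = current - timedelta(days=365 * years)
        counts[years] = sum(1 for d in purchase_dates if cutoff <= d < current)
    return counts

# hypothetical purchaser history
history = [date(2010, 5, 1), date(2014, 3, 15), date(2015, 1, 10)]
print(prior_purchase_counts(history, date(2015, 6, 1)))
```

In practice this would be computed per person key over the AFS records dating back to 1985, as the text describes.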
We included individual purchaser demographic variables available in DROS. These included sex, race/ethnicity, state of birth (California vs. elsewhere), and age at the time of the transaction.
We geocoded the purchaser and dealer addresses recorded in DROS to identify the associated census tracts and obtain community characteristics. All addresses (for the purchaser and dealer) were geocoded using a multistage process to find the highest-resolution match available using SAS® software, the Census Bureau Geocoder, and the Google Geocoding API. [3][4][5] (These approaches use different geodetic datums, but the magnitude of difference is not relevant for our spatial unit of analysis.) Three sets of addresses were considered: the purchaser's address as listed on the transaction data, the dealer's address, and the address contained in the DROS Person Table (a table in DROS that includes the most recent address for each person key).
If a transaction was missing the purchaser address (3.7%), we attempted to complete the information using an address from one of the following sources (in order): another transaction by the purchaser on the same day (0.067%); another transaction from a different day, closest in time to the purchase in question (2.1%); or the person table address (1.53%). Occasionally, the geocoding programs returned different coordinates for repeated instances of the same address. In these cases, if the matched latitude or longitude coordinates differed by more than 0.001 degree, the finest-resolution match was used, or the first one in chronological order if tied. We manually fixed a very small percentage of addresses (less than 0.001%) that remained unmatched by all three geocoders or were assigned an out-of-state match. For these transactions, an internet search of the recorded address revealed a compelling alternative, such as a small typographical error in the zip code field.
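The coordinate-discrepancy rule above (tolerate differences up to 0.001 degree; otherwise prefer the finest-resolution match, breaking ties chronologically) can be sketched as follows. The function and field names (`lat`, `lon`, `resolution`, `order`) are hypothetical stand-ins, not the authors' code.

```python
def resolve_geocodes(matches):
    """Pick one coordinate pair for an address geocoded several times.
    `matches` is a list of dicts with 'lat', 'lon', 'resolution'
    (lower = finer match) and 'order' (chronological position).
    If all pairs agree to within 0.001 degree, keep the first;
    otherwise take the finest-resolution match, earliest on ties."""
    lats = [m["lat"] for m in matches]
    lons = [m["lon"] for m in matches]
    if max(lats) - min(lats) <= 0.001 and max(lons) - min(lons) <= 0.001:
        return min(matches, key=lambda m: m["order"])
    return min(matches, key=lambda m: (m["resolution"], m["order"]))
```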
There were 1,322 purchaser addresses that could not be geocoded. These included 105 known military addresses, such as ships or bases, that do not correspond to ACS census information; out-of-state addresses (less than 0.001%); and transactions with incomplete purchaser address information. All were coded as Unknown for location-based variables.
There were 2,590 transactions with no geocoding match for the dealer's address, in all but one case because no dealer identification number was associated with the transaction; in the remaining instance, the dealer address itself was missing. In the person table, a unique person identifier may have more than one distinct address, listed separately as home, mailing, business, or physical addresses. When this occurred, the home address was used.
Census-tract-level features were obtained from Geolytics' Neighborhood Change Database, which produces US Census data normalized to 2010 tract boundaries. 6 We included a number of census tract variables from the U.S. Census and American Community Survey: population density, the proportion of the population under 18, the ratio of males to females aged 15-34 years, the proportions non-Latino white and non-Latino Black, the proportion of the population with past-year income below the poverty level, the proportion receiving public assistance within the last year, the unemployment rate, and the proportion of households with children that were female-headed. Based on the geocoded address, we also identified addresses that were known military bases and created a "military address" indicator variable.
Finally, we included several county-level community characteristics associated with the purchaser's address and the dealer's location. Using the Death Statistical Masterfile, we generated a county-level ratio of firearm suicides to total suicides (a common proxy for firearm prevalence). We also included the average suicide rate in the county over the twelve months prior to the month of the transaction. We included an estimate of local firearm purchasing trends as a moving monthly average of the number of handguns purchased (in DROS) in the county over the previous 12 months. We calculated the distance between the purchaser's address and the location of the dealer. Lastly, for both the purchaser and dealer addresses, we included rural-urban status based on the USDA's Rural-Urban Continuum Codes (RUCC). We converted all continuous variables into categorical variables based on octiles, with missing values coded as "Unknown."
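The octile scheme for continuous variables can be illustrated with pandas' quantile binning. This is a sketch under assumed tooling (the function name and labels are hypothetical; the original processing was not necessarily done in Python):

```python
import numpy as np
import pandas as pd

def octile_bins(series, label="Unknown"):
    """Convert a continuous feature into 8 quantile-based categories
    (octiles), coding missing values as `label`."""
    binned = pd.qcut(series, q=8, labels=[f"Q{i}" for i in range(1, 9)])
    return binned.cat.add_categories([label]).fillna(label)

# synthetic stand-in for a continuous county-level variable
rng = np.random.default_rng(0)
x = pd.Series(rng.normal(size=1000))
x.iloc[::50] = np.nan  # inject some missing values
print(octile_bins(x).value_counts())
```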

Random forest
We implemented random forest classification on the transaction-level data to predict firearm suicide within 1 year. 7 Random forest has been shown to be among the strongest performing classifiers 8 and to frequently perform well on imbalanced data. 9 It has been successfully applied in a range of contexts, including criminal offending, 10,11 disease forecasting, 12 and predicting suicidal ideation. 13 In brief, random forest works by aggregating many hundreds of classification trees, each of which represents a recursive partitioning of the training data. Each classification tree is built from a bootstrapped sample of the training data. The tree creates binary splits based on a sample of predictor variables, drawn randomly at each partition, selecting the best (i.e. purest or most homogeneous) split. The tree is grown, without pruning, until its terminal nodes are pure or contain a single observation. Each tree then predicts the outcome for the training-set observations that were not used to build it, which yields the out-of-bag (OOB) error rate. Finally, the classification trees are aggregated to create the random forest algorithm, and each observation receives a predicted probability or score (e.g. 0.20, 0.50, 0.95) based on the proportion of trees that assign it to the positive class (class 1).

Depending on the task, these scores are converted to a class label determined by the "decision threshold." The default threshold is majority rule: observations with scores above 0.50 are assigned to class 1. 14 This threshold can be adjusted to meet desired objectives and improve classifier performance. 15 For example, the threshold can be moved to maximize performance metrics of interest, such as composite measures of sensitivity and specificity like Youden's Index. 16 Adjusting the threshold also offers a means to account for asymmetric costs: if false negatives are much more costly than false positives, a lower threshold can be selected to achieve the desired cost ratio.
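Converting scores to labels at a chosen threshold is a one-liner; the sketch below (hypothetical function name, illustrative scores) shows how lowering the threshold flags more observations as positive, trading false positives for fewer false negatives.

```python
import numpy as np

def classify(scores, threshold=0.5):
    """Convert random-forest scores (the share of trees voting class 1)
    into class labels: scores strictly above the threshold become 1."""
    return (np.asarray(scores) > threshold).astype(int)

scores = np.array([0.10, 0.35, 0.55, 0.80])
print(classify(scores))        # default majority rule
print(classify(scores, 0.30))  # lower threshold flags one more transaction
```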

Dealing with imbalanced data
The major challenge in predicting firearm suicide is the rarity of the event, resulting in extreme class imbalance. "Under-sampling" is a well-established technique for dealing with imbalanced data, shown to improve an algorithm's ability to isolate the event signal. 17,18 This approach balances the data by decreasing the size of the majority class. We implemented undersampling within the random forest algorithm by drawing a bootstrapped sample of the same size from each class (or stratum) to create a balanced dataset on which to grow each tree. This stratified sampling alters the marginal distribution such that the observations used to grow a tree comprise half firearm suicides and half non-suicides. Though the algorithm is built with data approximating balanced classes, the data used to test the algorithm's performance remain unbalanced. For each algorithm, we allocated a random sample of 70% of the data as the training set and used the remaining 30% as the test set.
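The per-tree stratified sampling described here corresponds, in R's randomForest package, to the `strata` and `sampsize` arguments. As a simplified illustration only, the Python sketch below undersamples the majority class once before fitting a scikit-learn forest; the data, names, and one-shot sampling (rather than per-tree sampling) are all assumptions of the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def undersample(X, y, seed=0):
    """Balance classes by randomly down-sampling the majority class (0)
    to the minority-class (1) size -- a simplified, one-shot stand-in
    for the per-tree stratified sampling described in the text."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    keep = rng.choice(majority, size=minority.size, replace=False)
    idx = np.concatenate([minority, keep])
    return X[idx], y[idx]

# synthetic imbalanced data standing in for the transaction records
rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 5))
y = (rng.random(5000) < 0.01).astype(int)  # roughly 1% positive class

Xb, yb = undersample(X, y)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(Xb, yb)
```

Note that, as in the text, only the training data are balanced; any held-out test set keeps its original imbalance.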

Algorithm tuning
We tuned 2 random forest parameters, the number of predictor variables to randomly select at each binary split (mtry) and the number of trees in the forest (ntree), by maximizing the area under the receiver operating characteristic curve (AUC) within the caret package in R. 19,20
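The authors tuned these parameters via caret in R; a rough scikit-learn analogue (where `max_features` plays the role of mtry and `n_estimators` that of ntree) might look like the following sketch on synthetic data, with the grid searched by AUC:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# synthetic stand-in for the training data
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 8))
y = (X[:, 0] + rng.normal(scale=0.5, size=400) > 1).astype(int)

# grid over the two tuned parameters, scored by AUC
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_features": [2, 3, 4], "n_estimators": [100, 300]},
    scoring="roc_auc",
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```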

Algorithm evaluation
We evaluated the algorithm with test-set AUC, sensitivity (true positive rate), specificity (true negative rate), and metrics that combine sensitivity and specificity that are commonly used for imbalanced classification problems 21 : the F-score, 22 defined here as (2 × sensitivity × specificity) / (sensitivity + specificity), and Youden's Index (sensitivity + specificity − 1). 23 We also present the positive predictive value (PPV) and negative predictive value (NPV), i.e. the proportions of positive/negative results that are true positives/negatives. We present all of these metrics for a range of random forest vote thresholds, including the default (0.5), the threshold that maximizes the F-score, and the threshold that maximizes Youden's Index. Finally, we also present raw scores and the proportion of suicides among the highest-predicted-risk transactions, defined as those with a random forest score of 0.90 and above.
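The threshold-dependent metrics above follow directly from the confusion matrix at each threshold. A minimal sketch (hypothetical function name and illustrative counts; note the F-score here uses the text's sensitivity/specificity definition, not the usual precision/recall form):

```python
def threshold_metrics(tp, fp, tn, fn):
    """Compute the evaluation metrics described in the text from the
    confusion-matrix counts at a given decision threshold."""
    sens = tp / (tp + fn)  # sensitivity (true positive rate)
    spec = tn / (tn + fp)  # specificity (true negative rate)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "f_score": 2 * sens * spec / (sens + spec),  # as defined in the text
        "youden": sens + spec - 1,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

# illustrative counts only, not the study's results
print(threshold_metrics(tp=24, fp=11, tn=989, fn=26))
```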

Variable importance
To shed some light on the "black box" algorithm, we also present variable importance measures, which provide information on predictors' contributions to classification accuracy. Random forest generates two default measures of variable importance: mean decrease in accuracy (i.e. how much accuracy the algorithm loses by excluding each variable) and mean decrease in Gini impurity (i.e. how each variable contributes to the homogeneity of node splits over all trees in the forest). Given the well-established biases of Gini importance measures when predictors vary in their scale or the number of categories, 24 we focus on mean decrease in accuracy. This approach, also known as permutation importance, permutes each variable and calculates the difference in accuracy before and after permuting (averaged over all trees and normalized by the standard deviation of the difference) to give an importance measure.
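Permutation importance can be illustrated with scikit-learn's `permutation_importance` on synthetic data. This is an analogous, not identical, computation to random forest's built-in out-of-bag mean decrease in accuracy: here the permutation is scored on a supplied dataset rather than per-tree OOB samples.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# synthetic data in which only feature 0 carries signal
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] > 0).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
result = permutation_importance(
    clf, X, y, n_repeats=10, random_state=0, scoring="accuracy"
)
print(result.importances_mean)  # feature 0 should dominate
```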
In addition to the overall permutation importance, we also present class specific importance measures, which can be useful in differentiating positive and negative discriminating features in the context of imbalanced data. 25 These are calculated based on the mean change in error rates computed only for observations belonging to a specific class. We focus on the minority class (firearm suicide); given the extreme class imbalance, the overall importance and majority class importance measures are essentially the same.
Finally, to further explain the random forest predictions, we present Shapley Additive Explanations (SHAP). 26 SHAP uses the game theory concept developed by Shapley 28 to calculate the relative contribution of each feature in the prediction model to the different prediction "coalitions" (i.e. 0 or 1). We present the top fifteen SHAP features, along with select violin plots for important minority-class-specific features identified using the random forest mean decrease in accuracy, to help identify the values of each feature that are relevant in "pushing" the prediction towards 0 or 1. Positive SHAP values increase the probability of classification to the positive class (1); negative values increase the probability of classification to the negative class (0).