A practical tool for public health surveillance: Semi-automated coding of short injury narratives from large administrative databases using Naïve Bayes algorithms

Public health surveillance programs in the U.S. are undergoing landmark changes with the availability of electronic health records and advancements in information technology. Injury narratives gathered from hospital records, workers compensation claims or national surveys can be very useful for identifying antecedents to injury or emerging risks. However, classifying narratives manually can become prohibitive for large datasets. The purpose of this study was to develop a human-machine system that could be relatively easily tailored to routinely and accurately classify injury narratives from large administrative databases such as workers compensation. We used a semi-automated approach based on two Naïve Bayesian algorithms to classify 15,000 workers compensation narratives into two-digit Bureau of Labor Statistics (BLS) event (leading to injury) codes. Narratives were filtered out for manual review if the algorithms disagreed or made weak predictions. This approach resulted in an overall accuracy of 87%, with consistently high positive predictive values across all two-digit BLS event categories including the very small categories (e.g., exposure to noise, needle sticks). The Naïve Bayes algorithms were able to identify and accurately machine code most narratives, leaving only 32% (4853) for manual review. This strategy substantially reduces the need for resources compared with manual review alone.


Introduction
Electronic health records containing real-time medical data create the potential for vast changes and improvements in public health research and surveillance. One important goal of injury surveillance is to determine important antecedents to injury and rank them according to magnitude, risk, severity or burden (Souza et al., 2011). To do this requires accurately classifying a comprehensive and representative sample of cases. Narratives in large administrative databases such as hospital records, workers compensation (WC) claims, or national surveys, can provide useful information about potential causes, prevention and recovery, alone or as a supplement to existing coded data (Sorock et al., 1996, 1997; Stutts et al., 2001; Williamson et al., 2001; Lombardi et al., 2005; Verma et al., 2008; Lombardi et al., 2009; McKenzie et al., 2010; Vallmuur, 2015; Taylor et al., 2014). However, classifying narratives into groups manually can become prohibitive for large datasets.
Advances in computer technology provide one potential solution. Automated methods of text processing developed within the field of medical informatics are currently used to perform highly sophisticated and accurate automated identification of specific events for applications such as medical monitoring or syndromic surveillance and coding of chief complaints (Wagner et al., 2004;Chapman et al., 2005;Brown et al., 2010;Gerbier et al., 2011). However, tailoring software for one particular coding strategy can consume tremendous resources (including the integration of specialized linguistics and semantics).
The purpose of this study was to develop a human-machine system that could be easily tailored to routinely and accurately classify injury narratives from large administrative databases such as workers compensation. Specifically, we sought to classify a representative set of cases at similar levels of accuracy across all potential categories, whether they occurred often or were fairly infrequent but carried potentially high, emerging, or severe risk. Recent research has shown that a simple multinomial Naïve Bayes model can outperform state-of-the-art methods of text classification for short snippets of text and when there are few training cases (Wang and Manning, 2012). Therefore, given the lack of detail and specificity in short administrative narratives such as those collected on the first report of injury (snippets of text typically two to 15 words long), and our interest in identifying a large number of small categories (each with very few training cases), the Naïve Bayes model seemed especially appropriate for the purposes of this study.

Methods
Thirty thousand records were randomly extracted from claims filed with a large WC insurance provider between January 1 and December 31, 2007. Four coders, trained on the Bureau of Labor Statistics (BLS) Occupational Injury and Illness Classification system (OIICS) 2012 version, classified records into two-digit event codes using the accident (what happened, 120 character maximum) and injury (type, e.g., strain, fracture, 20 character maximum) narratives as they appeared on the first report of injury. These manual codes served as our gold standard.
The dataset was divided into a training set of 15,000 cases, for model development, and a prediction dataset of 15,000 cases for evaluation. Each record included a unique identifier, a narrative describing how the injury occurred, and a two-digit BLS OIICS event code. The distribution of the two-digit OIICS event codes did not differ between datasets (χ², P = 0.87).

Model development
Following the approach of Lehto et al. (2009), the Naïve Bayes model can be used to assign a probability to each event code based on the words present in a particular narrative. The event code with the largest estimated probability is then chosen as the prediction for the words present. The assigned probability follows from Eq. (1) below:

P(C_i|n) = P(C_i) × ∏_j [ P(n_j|C_i) / P(n_j) ]    (1)

where P(C_i|n) = the predicted probability of event code i given the set of j words in the particular narrative, P(n_j|C_i) = the probability of observing word n_j in a training narrative corresponding to event code i, P(n_j) = the marginal probability of observing word n_j in any training narrative, and P(C_i) = the marginal probability of event code i in the training dataset.
To estimate P(C_i|n), we followed the approach of Lehto et al. (2009) and only considered the words that were actually present in a narrative. Specifically, we employed the multinomial Naïve Bayes model, as opposed to the Bernoulli Naïve Bayes model. The multinomial model is not only simpler because it allows the missing words to be ignored, but also, in our experience and that of others (McCallum and Nigam, 1998), it gives consistently better results for text classification. Another simplification follows from the fact that each estimate of P(C_i|n) in Eq. (1) above contains the constant multiplier P(n_j), which can therefore be factored out of (dropped from) the calculations. Consequently, the model can be fit by estimating only two parameters, P(C_i) and P(n_j|C_i), from the training data. In this study, the first parameter is simply determined as

P(C_i) = N_i / N    (2)

where N_i is the number of times category C_i occurred in the training set and N is the total number of training cases. The second parameter, P(n_j|C_i), was estimated as defined above but also by adding a smoothing parameter to the observed relative frequency of word n_j in each category. The way this was done corresponds to assuming a uniform prior probability density function for the frequency of word n_j in each category, given a prior sample size of α × N, where N is the total number of training cases.
From a practical perspective, a value of α greater than 0 avoids inappropriately setting P(n_j|C_i) to zero when a word does not occur in the category of interest in the training dataset (often due to a noisy or odd misspelling of a word). That is,

P(n_j|C_i) = (N_ij + α × N_j) / (N_i + α × N)    (3)

where N_i was the number of times category C_i occurred in the training set, N_j was the number of times word n_j occurred in the training set, and N_ij was the number of times in the training set that word n_j occurred in category C_i. The smoothing constant α was set to a value of 0.01, corresponding to strong confidence that differences in the relative frequency of words between categories were in fact predictive of the category.
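The two training estimates can be sketched in Python. This is a minimal illustration, not the authors' Textminer code; the helper names are hypothetical, and the exact smoothed form used (a prior of effective size α × N whose mean is the marginal word frequency N_j / N) is our reading of the description above.

```python
# Sketch of the two training estimates: category prior P(C_i) and the
# smoothed word-given-category probability P(n_j | C_i).
from collections import Counter

def train_counts(narratives, codes):
    """Count N, N_i, N_j, and N_ij from pre-coded training narratives
    (each narrative is a list of words; codes holds the gold category)."""
    N = len(narratives)
    N_i = Counter(codes)          # times category C_i occurred
    N_j = Counter()               # times word n_j occurred anywhere
    N_ij = Counter()              # times word n_j occurred in category C_i
    for words, code in zip(narratives, codes):
        for w in words:
            N_j[w] += 1
            N_ij[(w, code)] += 1
    return N, N_i, N_j, N_ij

def p_category(N_i, N, code):
    """Category prior: P(C_i) = N_i / N."""
    return N_i[code] / N

def p_word_given_category(N, N_i, N_j, N_ij, word, code, alpha=0.01):
    """Smoothed P(n_j | C_i); alpha > 0 keeps unseen word/category
    combinations from collapsing the product to zero."""
    return (N_ij[(word, code)] + alpha * N_j[word]) / (N_i[code] + alpha * N)
```

With α = 0.01, a word seen only in one category still receives a small nonzero probability in every other category, exactly the practical effect described above.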
To improve precision and reduce potential rounding errors, instead of multiplying through P(n_j|C_i) for each word in the narrative, we calculate P(C_i|n) using the log transform of Eq. (1). This gives the result

S_i = log P(C_i) + Σ_j log P(n_j|C_i)    (4)

The normalized estimates of each P(C_i|n) can then be obtained as

P(C_i|n) = exp(S_i) / Σ_k exp(S_k)    (5)

Note: words never found in the training dataset are ignored (i.e., have no effect) when calculating P(C_i|n). Normalizing Eq. (1) using Eq. (5) results in an estimate of the true value of P(C_i|n), which will be optimal if the words are conditionally independent. Our earlier studies have found this technique results in a well-calibrated estimate of P(C_i|n) strongly related to prediction accuracy (Lehto et al., 2009; Choe et al., 2013). This estimate contains no free weighting parameters that are adjusted to "fit" the model to the data. As such, this technique differs from methods such as neural networks, weighted Bayesian models, or logistic regression, which have a large number of free parameters that are adjusted to minimize some objective function.
In summary, the algorithm is trained by simply calculating weights for each category (Eq. (2)) and the predictive relation of words to each category (Eq. (3)) using the training narratives. To generate the word counts used in Eqs. (2) and (3), first parse each training narrative to obtain a list of all the words and word sequences, and then count how often each word occurs in the training dataset for each category.
These probabilities (Eqs. (2) and (3)) are then used to assign classifications to the 15,000 prediction narratives. The probability of each category, given the set of words in a new prediction narrative P(C i |n), is calculated by adding up the log transformed weights (Eq. (4)) for each single word, or word sequence in the narrative, then normalizing (Eq. (5)). The category with the highest value of P(C i |n) is chosen as the prediction.
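The prediction step just described (sum the log weights for the words present, then normalize across categories and take the maximum) can be sketched as follows. This is an illustrative implementation with hypothetical names and toy probabilities, assuming the category priors and smoothed word probabilities have already been estimated from training data.

```python
# Sketch of the prediction step: log-sum the word weights per category,
# then normalize across categories and pick the argmax.
import math

def predict(narrative_words, p_cat, p_word_given_cat, vocab):
    """Return (best_category, normalized strength P(C_i|n)) for one narrative.
    p_cat: {category: P(C_i)}; p_word_given_cat: {category: {word: P(n_j|C_i)}};
    vocab: words seen in training (unseen words are ignored)."""
    scores = {}
    for c, pc in p_cat.items():
        s = math.log(pc)
        for w in narrative_words:
            if w in vocab:                      # words never seen in training have no effect
                s += math.log(p_word_given_cat[c][w])
        scores[c] = s
    m = max(scores.values())                    # subtract the max for numerical stability
    exp_s = {c: math.exp(s - m) for c, s in scores.items()}
    z = sum(exp_s.values())
    posterior = {c: v / z for c, v in exp_s.items()}
    best = max(posterior, key=posterior.get)
    return best, posterior[best]
```

The normalized value returned is the prediction strength used later as a filter: a value near 1.0 indicates high confidence, while a value near 1/k (for k candidate categories) indicates the model is effectively guessing.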
Sample calculations of P(n_j|C_i), P(C_i) and the category prediction strengths P(C_i|n) for the following two examples are provided in Appendix A, Tables A1 and A2. The first example, "Glove got caught in conveyor," was evaluated using the single-word model (naive_sw).

Table 1. The accuracy of two independent Naïve Bayes algorithms for classifying the events leading to injury of N = 15,000 workers compensation narratives (results for categories n ≥ 100).

Table 1 notes. Results are presented by BLS OIICS 2-digit event code. (a) Two-digit categories with <100 cases. (b) Gold standard codes were assigned to each narrative by expert manual coders. (c) n_pred = number predicted into category. (d) % pred = percent of cases in whole dataset predicted into category. (e) The distribution of two-digit classifications will be skewed towards categories with high sensitivity, biasing the final distribution of the coded datasets. (f) Sen = sensitivity (true positives): the percentage of narratives that had been coded by the experts into each category that were also assigned correctly by the algorithm. (g) PPV = positive predictive value: the percentage of narratives correctly coded into a specific category out of all narratives placed into that category by the algorithm.

Table 2a
The accuracy of the human-machine classification system: implementation of a strategic filter (a) based on agreement between two Naïve Bayes algorithms (results for categories n ≥ 100).

Table 2a notes. Results are presented by BLS OIICS 2-digit event code for the subset of narratives where the naive_sw and naive_seq algorithms independently assigned the same classification. Naive_sw = Naïve Bayes single-word algorithm; naive_seq = Naïve Bayes sequence-word algorithm. (a) A filter is a technique to decide which narratives the computer should classify vs. which should be left for a human to read and classify. (b) Two-digit categories with <100 cases; detailed results for these categories are shown below. (c) Distribution and accuracy of codes for just those narratives where the two models agreed (68% of the complete dataset). (d) n_pred = number predicted into category. (e) % pred = percent of cases in whole dataset predicted into category. (f) The distribution of two-digit classifications will be skewed towards categories with high sensitivity, biasing the final distribution of the coded datasets. (g) Sen = sensitivity (true positives): the percentage of narratives that had been coded by the experts into each category that were also assigned correctly by the algorithm. (h) PPV = positive predictive value: the percentage of narratives correctly coded into a specific category out of all narratives placed into that category by the algorithm. (i) Human-machine system: the computer assigns codes to narratives where the algorithms agreed on the classification (68% of the dataset), and the remainder are manually coded (32% of the dataset).

Table 2b
The accuracy of the human-machine classification system: implementation of a strategic filter (a) based on agreement between the two Naïve Bayes algorithms (results for small categories only, n < 100 cases in each category).

Table 2b notes. Results are presented by BLS OIICS 2-digit event code against the gold standard (b) for the subset of narratives where the naive_sw and naive_seq algorithms independently assigned the same classification. Naive_sw = Naïve Bayes single-word algorithm; naive_seq = Naïve Bayes sequence-word algorithm. (a) A filter is a technique to decide which narratives the computer should classify vs. which should be left for a human to read and classify. (b) Gold standard codes were assigned to each narrative by expert manual coders. (c) Distribution and accuracy of codes for just those narratives where the two models agreed (68% of the complete dataset). (d) n_pred = number predicted into category. (e) Sen = sensitivity (true positives): the percentage of narratives that had been coded by the experts into each category that were also assigned correctly by the algorithm. (f) PPV = positive predictive value: the percentage of narratives correctly coded into a specific category out of all narratives placed into that category by the algorithm. (g) Human-machine system consisted of human coding 32% of the dataset and machine coding 68% of the dataset. (h) nec = not elsewhere classifiable.

The first example maximizes the probability of the caught-in-or-compressed-by category, where P(C_i|n) = 0.999, under the single-word model. The sequence-word model (naive_seq) predicts the same category, given the parsed sequences "glove-got," "got-caught," "caught-in," "in-conveyor," "glove-got-caught," and "got-caught-in," with a probability of P(C_i|n) = 0.999. "Caught" and "caught-in" are both very strong predictors in the models (i.e., these terms are much more commonly found in the caught-in-or-compressed-by category). In contrast, for the narrative "Box fell against his ankle," the single-word model predicts the fall-on-same-level category incorrectly, with a relatively low prediction strength of P(C_i|n) = 0.576. The sequence-word model uses the predictors "box-fell," "fell-against," "against-his," "his-ankle" and "fell-against-his" to correctly predict the struck-by-object category with a much higher prediction strength, P(C_i|n) = 1.0.
The single-word model was incorrect because the word "fell" alone occurred much more often in the fall-on-same-level category than the struck-by-object category and incorrectly skewed the prediction strength for the fall-on-same-level category above the struck-by-object category. When the word "fell" is considered only in sequence with other words, as in "fell-against" and "box-fell," the prediction strength becomes much lower for the fall-on-same-level category and higher for the struck-by-object category.
The Textminer program developed by one of the authors (ML) was used to generate word counts and estimate their predictive weights for all 15,000 training narratives. This was done independently for single words and word sequences, resulting in two different sets of predictions. (Note: others have also implemented these steps in SAS (Bertke et al., 2012). Software for performing such analysis is available from many other sources; for example, the WEKA platform includes numerous versions of Naïve Bayes and other classifiers.) The sequence-word model uses the exact same procedure described above for the single-word model but considers each two, three or four words in sequence as an individual "keyword," as demonstrated in the examples above. Some (minimal) data cleaning was performed during this process to reduce noise in the calculations. This included removing from all narratives, prior to word counting, a short list of frequently occurring "stop words" thought not to be meaningful (e.g., the words A, AN, AND, ETC, HE, HER, HIM, HIS, I, LEFT, LT, MY, OF, RT, RIGHT, SHE, THE, R, L). We chose not to edit misspelled words or assign morphs to words with the same meaning, although this may help when only a small number of training narratives is available (e.g., the words "CARS VEHS AUTO VEHICLE VEH VECHICLE AUTOMOBILE VEHICLES VECHILE VECHILES VEHILCE VEHICAL" could all be normalized to "automobile"). Assigning morphs can be very time consuming, and our objective was to test how well a model with minimal human editing of the words would work.
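The parsing step described above can be sketched as follows. This is a hypothetical illustration, not the Textminer implementation: it removes the stop words listed in the text and emits the single words used by naive_sw along with the two-to-four-word sequences used by naive_seq.

```python
# Sketch of narrative parsing: strip stop words, then emit single words
# and hyphen-joined word sequences of length 2-4.
STOP_WORDS = {"A", "AN", "AND", "ETC", "HE", "HER", "HIM", "HIS", "I",
              "LEFT", "LT", "MY", "OF", "RT", "RIGHT", "SHE", "THE", "R", "L"}

def parse(narrative, max_len=4):
    """Return (single words, word sequences) for one raw narrative."""
    words = [w for w in narrative.upper().split() if w not in STOP_WORDS]
    seqs = ["-".join(words[i:i + n])            # e.g. "GOT-CAUGHT", "GOT-CAUGHT-IN"
            for n in range(2, max_len + 1)
            for i in range(len(words) - n + 1)]
    return words, seqs
```

Note that the published worked example keeps pronouns such as "his" in its sequences ("against-his," "his-ankle"), so the exact stop-word handling in the sequence model may differ from this sketch.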

Evaluation of the human-machine coded approach
Our primary objective was to classify 15,000 new WC narratives as efficiently and accurately as possible. Our focus was not to find one particular subcategory or type of claim (i.e., data mining) but rather classify all narratives into categories for understanding the magnitude and ranking of root causes of injuries.
We know from prior results that relying on Bayesian algorithms alone will not result in consistently high positive predictive values (PPV) across all categories (Lehto et al., 2009; Marucci-Wellman et al., 2011). Also, Bayesian models can assign predictions for short narratives (text snippets such as those in WC claims narratives) with a degree of confidence (prediction strength) that is strongly related to the actual probability of being correct and which can be utilized as a filter for the human-machine approach. A filter is simply a technique to decide which narratives the computer should classify vs. which should be left for a human to read and classify. An assortment of filters is possible, making this method versatile for many different applications; the selection of filter(s) should consider both accuracy and resource constraints for the individual classification task at hand. For the current study, we were interested in achieving high PPV across all categories, including the small ones. Therefore, we used the following filters: (1) if the predictions from the single-word and sequence-word models agreed (naive_sw = naive_seq), this indicated high confidence in the prediction, and the remaining disagree cases were manually reviewed; and (2) if the predictive strength of the single-word model was very high, this also indicated high confidence in the prediction, and the remaining cases below a probability threshold value were manually reviewed. All evaluation statistics were determined using SAS version 9.3 (SAS Institute, Cary, NC).
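The two filters amount to a simple routing rule per narrative. A minimal sketch (the function name and signature are hypothetical):

```python
# Route one narrative to machine coding or manual review based on the
# two filters: model agreement (filter 1) or prediction strength (filter 2).
def route(pred_sw, pred_seq, strength_sw, threshold=0.90, use_filter=1):
    """Return ('machine', code) or ('manual', None)."""
    if use_filter == 1 and pred_sw == pred_seq:
        return "machine", pred_sw       # both models agree: accept machine code
    if use_filter == 2 and strength_sw >= threshold:
        return "machine", pred_sw       # strong single-word prediction: accept
    return "manual", None               # low confidence: send to a human coder
```

Filters can also be chained, as in the combined approach reported later: accept agreements first, then apply a strength threshold to the remainder.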
We evaluated both sensitivity and PPV for all 15,000 prediction narratives by comparing the machine classifications with gold standard codes. Sensitivity (true positives) is the percentage of narratives coded by the experts into each category (gold standard codes) that were also coded by the program; PPV is the percentage of narratives correctly coded into a specific category out of all narratives placed into that category.
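These two metrics can be computed per category directly from the definitions above. A small sketch with hypothetical function names:

```python
# Per-category sensitivity and PPV against gold standard codes.
def sensitivity_ppv(gold, predicted, category):
    """gold, predicted: parallel lists of category codes for each narrative."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == category and p == category)
    n_gold = sum(1 for g in gold if g == category)       # expert-coded into category
    n_pred = sum(1 for p in predicted if p == category)  # machine-coded into category
    sens = tp / n_gold if n_gold else float("nan")
    ppv = tp / n_pred if n_pred else float("nan")
    return sens, ppv
```

Sensitivity divides true positives by the gold-standard count; PPV divides the same true positives by the predicted count, so a category can have perfect PPV yet low sensitivity (or vice versa).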
Using a chi-square goodness of fit test, we evaluated whether the proportionate distributions by two-digit event category of the simulated resulting datasets were the same as the proportionate distribution in the gold standard dataset. We also evaluated the capability of each approach to identify small categories.
Since we also wanted to evaluate whether we could expect any improvement in accuracy compared to all-manual coding, we present inter-rater reliability results for four manual coders who independently assigned classifications to a separate set of 4000 workers compensation narratives. Reliability metrics include the range in percent agreement between each pair of the four coders and the Kappa statistic (Viera and Garrett, 2005) for each two-digit category.
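The reliability metrics can be sketched as follows: pairwise percent agreement across the coders, and Fleiss' kappa (the multi-rater statistic named in the Table 3 footnotes) computed from per-narrative code counts. This is an illustrative implementation, not the SAS code used in the study.

```python
# Pairwise percent agreement and Fleiss' kappa for multiple coders who
# each classified the same set of narratives.
from collections import Counter
from itertools import combinations

def pairwise_agreement(ratings):
    """ratings: one list of codes per coder. Returns % agreement per coder pair."""
    out = {}
    for (i, a), (j, b) in combinations(enumerate(ratings), 2):
        out[(i, j)] = sum(x == y for x, y in zip(a, b)) / len(a)
    return out

def fleiss_kappa(ratings):
    """Fleiss' kappa when every coder rates every narrative."""
    n_items, n_raters = len(ratings[0]), len(ratings)
    cats = sorted({c for r in ratings for c in r})
    P, totals = [], Counter()
    for i in range(n_items):
        counts = Counter(r[i] for r in ratings)
        totals.update(counts)
        # per-item observed agreement among rater pairs
        P.append((sum(v * v for v in counts.values()) - n_raters)
                 / (n_raters * (n_raters - 1)))
    P_bar = sum(P) / n_items
    P_e = sum((totals[c] / (n_items * n_raters)) ** 2 for c in cats)
    return (P_bar - P_e) / (1 - P_e)
```

Kappa corrects raw agreement for the agreement expected by chance, which is why categories with skewed code distributions can show high percent agreement but modest kappa.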

Results
The individual performance of the two independent models, naive_sw and naive_seq, is shown in Table 1. Similar to prior reports (Lehto et al., 2009), while the overall sensitivity was fairly good (0.67 naive_sw, 0.65 naive_seq), both algorithms independently predicted some categories much better than others, skewing the final distribution of the coded data (χ², P < 0.0001), and most of the cases in the smaller categories were not found. The sequence-word model showed improved performance where word order was important for differentiating causality. For example, Table 1 shows a higher sensitivity for the struck-against category and higher PPV for the struck-by category using the sequence-word model. Still, many categories had low performance.
Addition of strategic filters for human-machine coded narratives

Filter 1: agreement in predictions between the single-word and sequence-word models

Using this filter, 68% of the narratives were coded by machine with an overall sensitivity and PPV of 81% and fairly high PPV across categories with more than 100 cases (Table 2a). However, using only these narratives (n = 10,147) would result in a biased distribution of the dataset, with proportionally more overexertion-from-outside-sources injuries (34% vs. 28%) and proportionally fewer struck-against injuries (1% vs. 3%) (χ², P < 0.0001) than represented in the gold standard codes (n = 15,000). Additionally, most cases in the very small unique categories would never be found (Table 2b).
However, if the remaining 32% of records, where the naive_sw model assigned a different classification than the naive_seq model, are identified, pulled out and manually coded, the final human-machine dataset becomes distributed very similarly to the original gold standard coded dataset (i.e., is representative), with consistently high sensitivity and PPV across all categories (Table 2a). Also, since many of the narratives included in the smaller categories ended up being pulled out for manual review, the performance of the small categories improved markedly (Table 2b). Noteworthy is the high sensitivity for these small categories, indicating that many were found because of the filter.

Filter 2: prediction strength
Filtering on prediction strength alone is another option. For example, filtering on the naive_sw algorithm such that narratives with a prediction strength lower than 0.90 are pulled out for manual review (40% of the narratives) would result in approximately 88% sensitivity of the final machine-human coded data. Fig. 1 demonstrates how overall PPV is affected by setting different threshold levels. The results across all categories, however, were not as good as with the filter based on agreement between the two algorithms.

Tailoring the approach
Different "filtering" strategies can be followed depending on human resource and accuracy trade-offs. Using a combination of filters 1 and 2 we can: (1) filter more narratives for manual review from the dataset where the two models agree, in order to improve results (if we need higher sensitivity or PPV across categories), or (2) filter from the disagree dataset to strategically add additional machine classifications and further reduce human coding requirements (if we cannot afford to manually classify 32%; Fig. 1).
Since our objective was to apply the most efficient, yet highly representative and accurate approach across all categories, we created an additional filter on the agree dataset (where naive_sw = naive_seq, n = 10,147) using a naive_sw prediction strength of 0.90. The overall sensitivity for the combined filters improved to 0.94 (Fig. 1), and the sensitivity of the struck-against category (which previously had the lowest performance in Table 2) improved to 0.87 (PPV 0.95, Appendix A, Table A3). An additional 20% of narratives were pulled out for human review (Fig. 1).

Fig. 1. Accuracy expected when implementing a human-machine approach for coding 15,000 workers compensation narratives: alternative methods based on various filters. Filter 1 overall results: expect 87% accuracy when allowing the computer to classify narratives where the algorithms agree on the classification (68% of the narratives) and manually coding the remainder. Filter 2 was applied to the predictive strengths generated with the naive_sw algorithm only and with the naive_seq algorithm only; Filter 1 followed by Filter 2 applied Filter 2 to the subset of narratives where the naive_sw and naive_seq algorithms independently assigned a different classification. Footnotes to figure: A filter is a technique to decide which narratives the computer should classify vs. which should be left for a human to read and classify. Filter 1: if the classification predicted by the single-word and sequence-word models agree (naive_sw = naive_seq), this indicates high confidence in the classification; we therefore accept the computer classifications where the algorithms agreed, then manually review the remaining disagree cases. Filter 2: if the predictive strength of an algorithm is very high, this indicates high confidence in the prediction; we therefore accept the computer classification when the prediction strength P(C_i|n) is above a probability threshold value, then manually review the remaining cases. Note: the numbers in parentheses on the plots are the predictive strengths (P(C_i|n)) calculated by the algorithm and used to determine whether to accept or reject the computer code. naive_sw = Naïve Bayes single-word algorithm; naive_seq = Naïve Bayes sequence-word algorithm.

Inter-rater reliability
Using a separate set of narratives to test inter-rater reliability (n = 4,000), we found 77-90% agreement overall between two coders, with a Kappa of 0.78 (Table 3). Some categories were assigned consistently among all coders, such as the roadway-vehicle category (93-96% range in inter-rater agreement between each set of two coders, 0.94 Kappa), yet some categories had very low agreement, such as aircraft (0-75% agreement, 0.17 Kappa, Table 3), water-vehicle-incidents (0-88% agreement, 0.25 Kappa) and non-roadway-incidents-involving-motorized-land-vehicle (52-84% agreement, 0.62 Kappa).

Table 3 notes. Categories are bolded where n > 100 cases in the prediction dataset. (a) Categories shown where at least one coder assigned a narrative into the category. (b) Only one narrative was ever classified into this category by any coder. (c) Five or fewer narratives were ever classified into this category by any coder. (d) Two-coder agreement, e.g., 6 total comparisons: coder 1 compared to 2, 3, 4; coder 2 compared to 3, 4; coder 3 compared to 4. (e) Fleiss Kappa between 0 and 1; >0.6 considered good agreement, >0.8 considered very good agreement.

Discussion
In this study, we demonstrate that a strategic human-machine approach based on Naïve Bayesian algorithms was able to comprehensively and accurately classify the events leading to injury. Accurately categorizing the cause, source and location of injuries from hospital or trauma center records, or WC claims, is an essential part of the analytic process of epidemiology and surveillance and provides critical information for preventing future events such as amputations (Friedman et al., 2013), motor vehicle crashes (Pollack et al., 2013) or childhood injuries (Chan et al., 2001; McGeehan et al., 2006). For large administrative data sources, however, the cost of personnel necessary to manually classify each narrative limits their utility. This study demonstrates the feasibility of a human-machine classification method to reduce such costs.
We demonstrate the importance of utilizing the complete sample for surveillance vs. a subset of cases. Utilizing only the cases where the two models agreed would result in a dataset that contained many accurate codes; however, that dataset would not, in fact, be representative of the population of cases we began with and would include very few of the cases in the small unique categories. In fact, we believe any approach solely using machine codes will have low sensitivity in finding the small categories because there can only be relatively very few training cases for these categories. A semantic model to find cases from the small unique categories may be possible but would take a very large amount of effort to develop. We further note that if only very few of a certain category are used in the final analyses, such estimates will be unreliable with large standard errors and, effectively, be unusable for surveillance.
A particularly striking benefit of the Naïve Bayes algorithms for short narratives is the ability to filter out cases for manual review where there is low confidence in the algorithm-assigned code. We believe that reserving human resources to focus on a smaller pool will improve the efficiency of the machine-human method and allow personnel to focus on the rarer narratives and those that the algorithm classifies poorly. Overall sensitivity and PPV improved substantially using the integrated human-machine approaches.

Table A1. Sample calculations for "Glove got caught in conveyor": calculation of P(C_i|n) using the naive_sw and naive_seq algorithms independently.
In our example, applying a filter using agreement between the Naïve single-word and sequence-word models, resulted in an overall sensitivity of 87% with consistently high PPV across all two-digit BLS event categories. The combined human-machine approach was able to identify and code many narratives (68%) leaving only 32% (4800) for manual review, and was able to find many of the narratives (68%) in the very small categories. Further manual review of cases using the additional filter allowed for even higher accuracy (PPVs ranged from 0.88 to 1.0 across all categories).
The design and implementation of this human-machine coding system required the following resources: (1) training coders over a four-week period, (2) manually classifying 30,000 narratives for training and evaluation, (3) pre-processing narratives by removing some non-meaningful words using global re-assignment methods, (4) computing word probabilities for training the algorithm, (5) applying the algorithm to prediction narratives, and (6) identifying the optimal filtering potential of the data. This total initial investment included approximately 4 months' effort and provided an algorithm capable of classifying the bulk of future WC narratives. The Naïve algorithms easily classified 48-68% of the WC narratives (freeing up human resources) and resulted in a higher accuracy of the ensuing combined human-machine assigned dataset, likely beyond what either a human or computer alone could accomplish.
Relying on human coders alone to manually classify injury narratives for surveillance has other disadvantages beyond consuming large amounts of resources: human coders require periodic re-training to code accurately and systematically (CDC, 2013), experienced coders cannot be replaced without investing in training of new coders, human coders can become bored with or just inattentive to repetitive and mundane narratives which may lead to coding inconsistencies over time. Conversely, an algorithm can code systematically and consistently for a limitless amount of repetitive, mundane narratives without experiencing fatigue.
Other computer algorithms have been developed resulting in similar or higher accuracy (e.g., natural language processing or optimal classifiers, such as those based on logistic regression), yet these require either advanced modeling or tailored programming to contain all the necessary words, phrases, and ontologies for a specific classification or type of narrative. Naïve Bayes algorithms are less complex and require fewer computations compared with Fuzzy Bayes (see prior analyses: Lehto et al., 2009; Marucci-Wellman et al., 2011) or other more sophisticated models (e.g., support vector machines (SVM), logistic regression). The training of a probability-based Naïve Bayes algorithm relies simply on precoded narratives and results in intuitive evidence supporting each prediction.
Optimization using methods such as regularized logistic regression or SVM will almost always moderately improve overall results (prior to filtering) over the simple multinomial Naïve Bayes model, but studies have shown a tendency for the Naïve Bayes model to outperform optimized models when there are few training cases (Ng and Jordan, 2002), such as for small categories, or for shorter snippets of text (Wang and Manning, 2012). Interestingly, the performance of multinomial Naïve Bayes can also be improved by simple data transforms that retain the large advantage Naïve Bayes has over optimization in terms of computation requirements (Rennie et al., 2003).

Table A2
Sample calculations for "Box fell against his ankle": calculation of P(Ci|n) using the naive sw and naive seq algorithms independently.
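One such simple transform from Rennie et al. (2003) dampens raw term frequencies with a logarithm before training, limiting the influence of words repeated within a narrative. The sketch below is illustrative only; the function name is hypothetical and the study does not state that this particular transform was applied.

```python
import math
from collections import Counter

def transformed_counts(tokens):
    """Dampen raw term frequencies with log(1 + tf), one of the simple
    transforms Rennie et al. (2003) report improving multinomial NB."""
    raw = Counter(tokens)
    return {w: math.log(1 + tf) for w, tf in raw.items()}

weights = transformed_counts("fell fell fell from ladder".split())
# 'fell' contributes log(4) ≈ 1.39 instead of a raw count of 3.
```

Because the transform is a one-pass reweighting of counts, it preserves the low computational cost that distinguishes Naïve Bayes from optimized classifiers.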
Previously, we demonstrated similar potential for high accuracy using agreement between Fuzzy and Naïve Bayesian models (Marucci-Wellman et al., 2011). However, Fuzzy Bayes is computationally cumbersome and better suited to long narratives (Taylor et al., 2014). The two Naïve Bayesian models shown here required much less model development time and produced similar results. The present strategy is therefore more feasible for resource-constrained public health organizations working with short narratives from large administrative databases across many different applications.

Limitations
There are several limitations to this approach. To date, we have evaluated accuracy using only one classification scheme and fairly short WC narratives. More research will be required to understand how the algorithm performs with more-detailed classifications, with other classification protocols, or with different types of narratives. Additional variables (e.g., gender, age) may sometimes be required for accurate coding; it is important to provide the algorithm with the same variables human coders use. We note, however, that the Naïve algorithm accommodates these other variables as predictors (or additional keywords) provided they are not redundant with words already in the narrative (e.g., including the nature of injury, such as "fall" or "burn", improves the machine classification).
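Folding coded variables into the model can be as simple as appending them to the narrative as extra keyword tokens before training, so the classifier sees the same evidence a human coder would. This is a hypothetical sketch: the function name `augment` and the token format are assumptions, not the study's implementation.

```python
# Hypothetical sketch: fold coded variables (e.g., gender, nature of injury)
# into the tokenized narrative as extra keywords for a Naive Bayes model.
def augment(narrative, **fields):
    """Tokenize a narrative and append 'field_value' tokens for each
    supplementary coded variable (sorted for deterministic order)."""
    extra = [f"{k}_{str(v).lower()}" for k, v in sorted(fields.items())]
    return narrative.lower().split() + extra

tokens = augment("Box fell against his ankle", gender="M", nature="contusion")
# tokens ends with ['gender_m', 'nature_contusion']
```

Because each appended token behaves like any other word, no change to the Naïve Bayes training or prediction code is needed.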

Conclusion
These results suggest that public health organizations which routinely classify short narratives in large administrative databases for surveillance could save substantial resources by strategically deploying computer coding for the majority of cases. Noteworthy is the ability of the Naïve algorithm to flag narratives from the small and novel categories for manual assignment, which improves results substantially.
More sophisticated classifiers than Naïve Bayes exist. However, Naïve Bayes algorithms perform well on short, noisy snippets of text and offer a simple, practical, easily implemented approach. This study shows that filtering with two Naïve Bayes models to selectively guide manual review successfully generated an unbiased estimate of the frequency of injury causation/events. In future studies, we will examine whether SVM or other models can also be used to selectively filter out the small categories for review. Simply put, this approach is effective, but further improvement in overall performance may be attainable with other classifiers.

Table A3
Accuracy of the human-machine classification system: implementation of two strategic filters: (1) where the two Naïve models agree on the classification and (2) where the classification strength is above 0.90 for the naive sw algorithm.