Machine learning for syndromic surveillance using veterinary necropsy reports

The use of natural language data for animal population surveillance represents a valuable opportunity to gather information about potential disease outbreaks, emerging zoonotic diseases, or bioterrorism threats. In this study, we evaluate machine learning methods for conducting syndromic surveillance using free-text veterinary necropsy reports. We train a system to detect if a necropsy report from the Wisconsin Veterinary Diagnostic Laboratory contains evidence of gastrointestinal, respiratory, or urinary pathology. We evaluate the performance of several machine learning algorithms including deep learning with a long short-term memory network. Although no single algorithm was superior, random forest using feature vectors of TF-IDF statistics ranked among the top-performing models with F1 scores of 0.923 (gastrointestinal), 0.960 (respiratory), and 0.888 (urinary). This model was applied to over 33,000 necropsy reports and was used to describe temporal and spatial features of diseases within a 14-year period, exposing epidemiological trends and detecting a potential focus of gastrointestinal disease from a single submitting producer in the fall of 2016.


Introduction
More than 60% of emerging infectious diseases can be transmitted from animals, making animal populations an important surveillance tool for detecting emerging disease [1]. Because animals share the same environment as humans and often spend more time outdoors, they are also important for monitoring environmental health hazards, human health hazards, and bioterrorism threats [2].
While there is a growing emphasis on monitoring data captured early in the course of medical evaluation or treatment, such as clinical notes or lab request forms (often called pre-diagnosis data), existing animal disease surveillance systems frequently depend on definitive diagnoses achieved through lab testing [3,4]. Such systems exhibit a time delay in detecting novel or unexpected diseases emerging in a population and may exhibit poor sensitivity to multifactorial diseases that cannot be characterized by a single agent [5]. Surveillance relying on pre-diagnosis data targets broad categories of diseases and is often called "syndromic surveillance" [6]. By facilitating the rapid detection of potential public and animal health threats, syndromic surveillance can enable the implementation of targeted investigations, diagnostic testing, or prophylactic treatments early in the course of a potential outbreak. Necropsies are post-mortem evaluations performed by veterinarians in the field and at diagnostic facilities to determine the cause of an animal's illness or death, and are often critical in the investigation of disease outbreaks in a herd [7]. Necropsy reports represent a unique opportunity for syndromic surveillance because of their emphasis on an animal's cause of death, and because the text is often dominated by specific morphologic terms describing grossly observable and microscopic tissue changes. The reports also commonly include the animal's signalment, clinical signs, geographic origin, and herd-level factors [8].
PLOS ONE | https://doi.org/10.1371/journal.pone.0228105 | February 5, 2020
As is common for pre-diagnosis data, necropsy reports are often written in a free-text format. Analysis of free text is generally challenging, and natural language processing (NLP) methods have become increasingly important in mining clinical text [9]. Text mining can be used to classify passages into categories, such as disease groups, which may be monitored for changes over time. This framework has been used to conduct syndromic surveillance using chief complaints in human records [10,11]. In animals, text mining has been used to conduct syndromic surveillance from online news reports [12], web searches [13,14], and laboratory test requests [4].
There is a growing body of literature examining information-extraction tasks involving pathology reports [15-21]. A rule-based approach is common, in which prediction rules are manually built, commonly using pre-defined named entities recognized using NLP software. Such algorithms often suffer from the knowledge acquisition bottleneck associated with maintaining extensive lists of named entities and the rules governing their interpretation, resulting in a loss of portability and flexibility [4,22,23]. A rule-based text mining system for syndromic surveillance has been recently described in the context of veterinary necropsy reports [8].
Machine learning does not require the manual development of decision rules as it automatically infers a model from an annotated corpus. While supervised machine learning requires human input to produce document labels, this approach is generally less intensive than designing and maintaining a set of rules [23]. Machine learning has been successfully used to extract a multitude of discrete phenotypes from heterogeneous health data including free text [24,25]. Current literature represents a variety of learning algorithms useful for medical text analysis [26] and multiple approaches to encoding document features including n-gram ("bag of words") representations [25], graphs-of-words [24,27,28], and sequential encodings with deep learning [29-33]. Current reports indicate that recurrent neural network (RNN) models such as long short-term memory (LSTM) networks [34] can be highly successful for veterinary text classification tasks when a large amount of training data is available [35]. They are also reported as effective models for syndromic surveillance using free-text chief complaints in human medicine [10].
We aim to demonstrate that supervised machine learning methods can effectively perform syndromic classification of free-text veterinary necropsy reports, forming the basis for an automated approach to syndromic surveillance within an animal population. We focus on evaluating distinct machine learning algorithms and show that some are effective for this task. We also demonstrate that a preliminary predictive signal can be extracted from gross necropsy findings alone, which approximately represents the first available information in a necropsy examination.

Data
Necropsy reports were obtained from the Wisconsin Veterinary Diagnostic Laboratory (WVDL) at the University of Wisconsin-Madison. Necropsy submissions at this facility represent most species of veterinary importance with a strong emphasis on farm animals, particularly bovine. All electronic necropsy reports on record between July 6, 2004 and August 6, 2018 were acquired as raw data for a total of 33,567 reports. Each necropsy report included five sections: (1) gross necropsy findings, (2) histological findings, (3) morphologic findings, (4) final diagnosis, and (5) pathologist comments. The reports also included additional information such as the animal receipt date, location, species, breed, and sex.

Construction of a document dataset for labeling
Using the R Programming Language [36], a subset of 1,000 reports was randomly sampled from the dataset. For each pathology report, a primary document was prepared by combining the morphologic findings and final diagnosis sections or, if both of those were empty, by combining all sections (15% of cases). Since the most concise morphologic terminology is present in these sections, this abstraction submitted only the most structured language to the learning model.

Defining syndromes
Because a necropsy examination is organized according to organ systems in the animal, we selected examples of topographical, organ-system-based syndromic categories: (1) gastrointestinal (GI) disease, (2) respiratory disease, and (3) urinary disease. These categories were intentionally general and inclusive of both overt and non-specific illnesses relating to each respective system. For example, documents describing evidence of diarrheal disease or nonspecific hepatic disease should both be flagged as positive by a GI-disease classifier. To illustrate the language in WVDL pathology reports, Table 1 presents examples of text criteria judged by two veterinary pathologists to represent positive classifications in each syndromic category.

Obtaining expert labels
Two veterinarians board-certified by the American College of Veterinary Pathologists reviewed the 1,000 documents and classified each as having evidence of GI disease and/or respiratory disease and/or urinary disease based on clinical experience. Diagnoses were excluded that did not specify an organ system, such as "salmonellosis", "bacteremia", or "septicemia". A small percentage of randomly selected documents (3%) were blank and classified as negative in all three syndromic categories. The inter-rater reliability between the two experts was measured using percent agreement and Cohen's kappa. One expert was selected to represent ground-truth syndrome labels.
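The two agreement measures named above can be sketched as follows. This is a minimal illustration for a single binary syndrome label, not the study's code; the function names and the toy label lists are assumptions made for the example.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Fraction of documents on which the two raters gave the same label."""
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    p_observed = percent_agreement(labels_a, labels_b)
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Chance agreement: both raters independently choose the same class.
    classes = set(freq_a) | set(freq_b)
    p_expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in classes)
    return (p_observed - p_expected) / (1 - p_expected)


# Toy labels for one syndrome (1 = evidence present, 0 = absent); not study data.
rater_1 = [1, 1, 0, 0, 1, 0, 1, 0, 1, 0]
rater_2 = [1, 1, 0, 0, 1, 0, 1, 1, 1, 0]

agreement = percent_agreement(rater_1, rater_2)
kappa = cohens_kappa(rater_1, rater_2)
print(agreement, kappa)
```

Kappa is lower than raw agreement whenever some agreement would be expected by chance, which is why both statistics are informative together.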

Defining the machine learning task
The machine learning model should evaluate a necropsy report and determine if there is evidence of GI, respiratory, or urinary pathology. Any, all, or none of these syndromes could be present. This was accomplished by developing a separate binary classifier for each syndrome.
A document was fully processed after being independently evaluated by each of the classifiers, an approach generally useful for multi-label classification in medical record prediction tasks [37]. This allows for learned models to be customized to each syndrome and would allow the pipeline to be augmented with additional classifiers later without affecting the pre-existing steps.

Statistical analysis of performance
The performance of a binary classifier can be evaluated by its accuracy:

$$\text{Accuracy} = \frac{\text{True positives} + \text{True negatives}}{\text{True positives} + \text{True negatives} + \text{False positives} + \text{False negatives}}$$

However, accuracy is not ideal for studying classification performance in cases where positive instances of a syndrome are significantly over- or underrepresented in the training data. To make our analysis robust to class skew, we also utilized the following metrics for each binary classifier:

$$\text{Recall} = \frac{\text{True positives}}{\text{True positives} + \text{False negatives}}$$

$$\text{Precision} = \frac{\text{True positives}}{\text{True positives} + \text{False positives}}$$

These metrics were combined using a harmonic mean into a single performance metric called the F1 score:

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

In this study, performance statistics were reported using 10-fold cross-validation, and 95% confidence intervals were computed using bootstrapping as described in Gao et al. [29,38] and summarized in Table 2. All references to statistical significance are made relative to a significance level of 5%.
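The F1 score and the bootstrapped confidence interval can be sketched together. This is a minimal illustration under stated assumptions: predictions are taken to be already pooled across cross-validation folds, the interval is a simple percentile interval (the text does not specify how the interval is read off the 2,000 bootstrap scores), and the function names and toy labels are hypothetical.

```python
import random


def f1_score(y_true, y_pred):
    """F1 as the harmonic mean of precision and recall for binary labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def bootstrap_f1_interval(y_true, y_pred, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for F1 (assumed interval type)."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        # Resample document indices with replacement, same size as the pool.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1_score([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[round((alpha / 2) * n_resamples)]
    upper = scores[round((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy pooled predictions (not study data): 34 TP, 6 FN, 5 FP, 55 TN.
y_true = [1] * 40 + [0] * 60
y_pred = [1] * 34 + [0] * 6 + [1] * 5 + [0] * 55

point_estimate = f1_score(y_true, y_pred)
low, high = bootstrap_f1_interval(y_true, y_pred)
print(point_estimate, low, high)
```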

Learning with bag of words representations
Document text was tokenized into words and cast into a document term matrix (DTM) (Figs 1 and 2). In this process, each document was separated into a collection of words, reflecting a bag-of-words approach that does not preserve the original order of document terms. The DTM is a large, sparse matrix in which each row represents a document and each column represents a unique word in the document corpus. Columns corresponding to the common pathology terms "mild", "moderate", "acute", "multifocal", "small", "diffuse", and "necrosis" were removed because they could be used in reference to any body system and are therefore not relevant for syndromic prediction. Stop words were also removed from consideration.
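The tokenization and document term matrix construction described above can be sketched in plain Python. The stop-word list below is a short illustrative stand-in (the study's list is not given in the text), the example documents are invented, and raw counts are stored here rather than TF-IDF weights.

```python
import re
from collections import Counter

# Illustrative stop-word list; an assumption, not the list used in the study.
STOP_WORDS = {"the", "of", "and", "from", "with", "in", "a", "to"}
# Generic pathology terms that the text reports removing from the DTM.
GENERIC_TERMS = {"mild", "moderate", "acute", "multifocal", "small", "diffuse", "necrosis"}


def tokenize(text):
    """Lower-case, drop numbers and punctuation, then drop excluded words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS and w not in GENERIC_TERMS]


def document_term_matrix(documents):
    """Return the sorted vocabulary and one row of term counts per document."""
    token_lists = [tokenize(doc) for doc in documents]
    vocabulary = sorted({w for tokens in token_lists for w in tokens})
    rows = []
    for tokens in token_lists:
        counts = Counter(tokens)
        rows.append([counts[w] for w in vocabulary])
    return vocabulary, rows


# Invented example documents, not taken from WVDL reports.
docs = [
    "Lung: Moderate, acute bronchopneumonia.",
    "Kidney: Mild nephritis; lung congestion.",
]
vocabulary, dtm = document_term_matrix(docs)
print(vocabulary)
print(dtm)
```

Word order is discarded at the tokenization step, so two documents with the same words in a different order produce identical rows.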
Table 2. Determining confidence intervals in cross-validation experiments.
Step 1: Cross-Validation
Pool test set predictions across cross-validation folds.
Step 2: Bootstrapping
Repeat 2000 times:
• Sample with replacement from the pooled predictions to create a bootstrapped set of predicted labels equal in size to the set of pooled predictions.
• Calculate the F1 score of the classifier using this bootstrapped set.

Fig 1. Tokenization. A document example was tokenized into words. Numbers and punctuation were removed. Stop words, which are common words in English (like "from", "and", and "of"), were removed. All characters were changed to lower case. After tokenization, the document was represented as a non-ordered collection of words. https://doi.org/10.1371/journal.pone.0228105.g001

Each entry in the DTM encodes the term frequency-inverse document frequency (TF-IDF) measure for the corresponding document and word. Term frequency (TF) measures how frequently the word appears in the document. That is, if $n_{ij}$ represents the number of times term $t_i$ appears in document $d_j$, then the frequency of term $t_i$ in document $d_j$ is

$$\mathrm{tf}_{ij} = \frac{n_{ij}}{\sum_k n_{kj}}$$

The following expression gives the inverse document frequency (IDF) of term $t_i$, where $D$ denotes the document corpus:

$$\mathrm{idf}_i = \log \frac{|D|}{|\{d \in D : t_i \in d\}|}$$

The IDF of a term provides a weight inversely correlated to its frequency across all text. Finally, the TF-IDF score for term $t_i$ in document $d_j$ is the product

$$\text{tf-idf}_{ij} = \mathrm{tf}_{ij} \cdot \mathrm{idf}_i$$

This approach encodes each document as a feature vector of TF-IDF statistics. Using this representation, we evaluated the performance of several machine learning methods on the syndromic classification task defined above. Models were learned using scikit-learn [39] in Python version 3.7. On each cross-validation fold, the hyperparameter space was explored using a grid search, and hyperparameters were selected to maximize mean F1 scores computed by internal 10-fold cross-validation. To assess feature importance weights in tree-based models, the normalized mean decrease in Gini impurity was summarized using scikit-learn.
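The TF-IDF encoding can be sketched as below. This assumes the common textbook definitions (term frequency as the term count divided by the document's total token count, and inverse document frequency as the logarithm of the corpus size over the number of documents containing the term); the function name and toy documents are hypothetical. scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2 normalization by default, so its values differ from this sketch.

```python
import math
from collections import Counter


def tf_idf(tokenized_docs):
    """Return one {term: tf-idf weight} dictionary per tokenized document."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        total = sum(counts.values())
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Invented, already-tokenized documents (not study data).
toy_docs = [
    ["lung", "bronchopneumonia"],
    ["kidney", "nephritis", "lung", "congestion"],
]
weights = tf_idf(toy_docs)
print(weights[0])
```

A term that appears in every document, such as "lung" in this toy corpus, receives a weight of zero, which is the sense in which IDF down-weights ubiquitous words.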
Classification and regression tree (CART). An optimized CART algorithm was evaluated using the standard decision tree model in scikit-learn. The maximal depth of the tree was controlled by specifying the minimum number of documents min samples required to split an internal node. Values of this hyperparameter in the set {2, 5, 10, 50} were considered.
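To make the splitting criterion concrete, the sketch below computes Gini impurity for binary labels, the impurity decrease produced by a candidate split, and the minimum-sample check that limits tree growth. It is an illustration of the general CART idea rather than scikit-learn's implementation, and all function names are hypothetical.

```python
def gini_impurity(labels):
    """Gini impurity of a node holding binary labels (0 or 1)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p_pos = sum(labels) / n
    return 1.0 - p_pos ** 2 - (1.0 - p_pos) ** 2


def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted_children = (
        (len(left) / n) * gini_impurity(left)
        + (len(right) / n) * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_children


def may_split(labels, min_samples):
    """A node is eligible for splitting only if it holds enough documents."""
    return len(labels) >= min_samples


# Toy node: two positive and two negative documents, split perfectly.
parent_labels = [1, 1, 0, 0]
decrease = impurity_decrease(parent_labels, [1, 1], [0, 0])
print(gini_impurity(parent_labels), decrease)
```

Larger values of the minimum-sample hyperparameter stop splitting earlier and therefore produce shallower trees.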
Bagging trees. Bagging (bootstrap aggregation) represents a statistical ensembling technique in which each tree is trained on documents sampled randomly with replacement [41]. This was done using 1,000 trees learned via the CART method on each cross-validation fold. All trees had min samples globally fixed to the value selected by internal cross-validation when using CART to learn single trees, so that no hyperparameter searching was employed for this algorithm.
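The bootstrap-aggregation idea can be sketched as follows. This is a simplified illustration: the base learner is a hypothetical one-feature threshold rule standing in for a CART tree, predictions are combined by majority vote (one common aggregation choice, assumed here), and the training pairs are invented.

```python
import random
from collections import Counter


def bootstrap_sample(items, rng):
    """Draw len(items) elements from items with replacement."""
    return [items[rng.randrange(len(items))] for _ in items]


def fit_stump(sample):
    """Hypothetical base learner: threshold at the midpoint of the class means."""
    pos = [x for x, y in sample if y == 1]
    neg = [x for x, y in sample if y == 0]
    if not pos or not neg:
        constant = 1 if pos else 0  # sample contains a single class
        return lambda x: constant
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x > threshold else 0


def bagged_predict(train, x, fit, n_estimators, seed=0):
    """Train each model on its own bootstrap sample, then take a majority vote."""
    rng = random.Random(seed)
    votes = [fit(bootstrap_sample(train, rng))(x) for _ in range(n_estimators)]
    return Counter(votes).most_common(1)[0][0]


# Invented (feature value, label) pairs; positives have larger feature values.
train = [(0.0, 0), (0.1, 0), (0.2, 0), (0.8, 1), (0.9, 1), (1.0, 1)]
print(bagged_predict(train, 0.95, fit_stump, n_estimators=25))
print(bagged_predict(train, 0.05, fit_stump, n_estimators=25))
```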
Random forest. A random forest is another tree-based ensemble learner in which bootstrapped sampling is applied and a random subset of features is considered to produce the split at each node of every decision tree [42]. Each model used 1,000 trees. The depth of each tree was controlled using min samples as in the CART model, and the maximum number of features considered for each node split was specified as a hyperparameter m. Given a feature space of size p, grid search considered a set of candidate values for m.
Gradient tree boosting. We also considered gradient tree boosting, in which shallow decision trees are iteratively combined into a stronger ensemble learner [43]. Each model used 1,000 boosting stages. A grid search explored maximum tree depths in $\{2, 3, \ldots, 10\}$ and learning rates in $\{10^{-5}, 10^{-4}, \ldots, 10^{-1}, 1\}$.

Deep learning with sequence representations
Document text was encoded using a 50,000-word vocabulary. Accordingly, each document was represented by a sequence of integers uniquely determined by the sequence of words in the text (Fig 3). These sequences were padded to a maximum length of 50 words. Keras [44] and TensorFlow [45] in Python were used for text pre-processing and model implementation.
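The sequence encoding can be sketched without any deep learning library. The sketch assumes that index 0 is reserved for padding, that more frequent words receive smaller indices, and that padding and truncation are applied at the end of the sequence; Keras utilities make their own default choices on these points, so this is an illustration of the idea rather than the exact preprocessing used. Function names and toy tokens are hypothetical.

```python
from collections import Counter


def build_word_index(token_lists, max_words=50_000):
    """Map the most frequent words to integers 1..max_words (0 is padding)."""
    counts = Counter(w for tokens in token_lists for w in tokens)
    # Sort by descending frequency, breaking ties alphabetically.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))[:max_words]
    return {word: i + 1 for i, word in enumerate(ranked)}


def encode(tokens, word_index):
    """Replace each known word by its integer; unknown words are dropped."""
    return [word_index[w] for w in tokens if w in word_index]


def pad(sequence, max_len=50):
    """Truncate to max_len, then right-pad with zeros up to max_len."""
    clipped = sequence[:max_len]
    return clipped + [0] * (max_len - len(clipped))


# Invented tokenized documents (not study data).
toy_tokens = [["lung", "congestion"], ["lung", "edema", "lung"]]
word_index = build_word_index(toy_tokens)
encoded = pad(encode(toy_tokens[1], word_index), max_len=5)
print(word_index, encoded)
```

Unlike the bag-of-words representation, this encoding preserves word order, which is what allows a recurrent model to use sentence structure.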
A recurrent neural network model was considered for the syndromic classification task (Fig 4). When propagating a document forward through the network, each vocabulary word was first projected into a 200-dimensional GloVe embedding space in which the Euclidean distance is smaller between pairs of more similar words [46]. After the initial embedding layer, there was a 1-dimensional convolutional layer consisting of 64 3x1 filters employing ReLU activations, and subsequently a 1-dimensional max pooling operation with a window size of 4 and valid padding. Next there was a single long short-term memory (LSTM) layer with 128 hidden units, followed by a densely-connected, single output unit with a sigmoid activation function. To prevent overfitting, dropout [47] was used between the embedding and convolutional layers, and L2 regularization was employed at the convolutional and output layers. The model was trained using Adam optimization with its default parameters [48], binary cross-entropy loss, and a mini-batch size of 32 over 10 epochs. The matrix of embedding parameters was initialized using GloVe embeddings but subjected to gradient descent updates throughout training.
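As a rough check on the size of this architecture, the arithmetic below traces the sequence length through the convolution and pooling layers and counts trainable parameters per layer. It assumes a stride-1 convolution without padding (the text specifies valid padding only for the pooling step), non-overlapping pooling windows, and the usual four-gate LSTM parameterization with a bias per gate; these are assumptions, and the names are illustrative.

```python
def conv1d_length(steps, kernel_size):
    """Output length of a stride-1 convolution with no padding (assumed)."""
    return steps - kernel_size + 1


def pool1d_length(steps, window):
    """Output length of non-overlapping max pooling with no padding."""
    return (steps - window) // window + 1


SEQ_LEN, VOCAB, EMBED_DIM = 50, 50_000, 200
FILTERS, KERNEL, POOL, LSTM_UNITS = 64, 3, 4, 128

after_conv = conv1d_length(SEQ_LEN, KERNEL)    # time steps entering pooling
after_pool = pool1d_length(after_conv, POOL)   # time steps entering the LSTM

embedding_params = VOCAB * EMBED_DIM
conv_params = KERNEL * EMBED_DIM * FILTERS + FILTERS
# Four gates, each with input weights, recurrent weights, and a bias vector.
lstm_params = 4 * (LSTM_UNITS * (FILTERS + LSTM_UNITS) + LSTM_UNITS)
dense_params = LSTM_UNITS + 1

print(after_conv, after_pool)
print(embedding_params, conv_params, lstm_params, dense_params)
```

Under these assumptions the embedding matrix accounts for nearly all trainable parameters, which is relevant when only 1,000 labeled documents are available and the embeddings are updated during training.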

Error analysis
For selected learning methods, we conducted an error analysis to provide human interpretation of model predictions. At the end of cross-validation, predictions on each test fold were concatenated to yield a set of predictions for the entire labeled dataset. This was provided to a human reviewer as a spreadsheet, who attempted to identify and quantify major classes of errors via manual inspection of the input document text.

Classifying documents beyond the labeled corpus
To demonstrate applications of the syndromic classifiers, we trained the highest-performing model (measured by F1 score) on the entire labeled corpus. A 10-fold cross-validated grid search was employed as in the initial model validation experiments to ensure that hyperparameters were optimal. This model was used to predict GI syndrome classifications on the entire document corpus, which were then used to generate a time-series of GI disease cases in R. Cases involved in a sharp rise in prevalence were examined as a possible disease outbreak. Other examples of analysis specific to the GI syndrome were explored.
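The time-series step can be sketched as a monthly tally of predicted-positive reports. The study generated its time series in R; this Python sketch is only an illustration, and the function name, the use of the receipt date as the time index, and the toy records are assumptions.

```python
from collections import Counter
from datetime import date


def monthly_positive_counts(receipt_dates, predictions):
    """Count predicted-positive reports per (year, month), in calendar order."""
    counts = Counter()
    for received, predicted in zip(receipt_dates, predictions):
        if predicted == 1:
            counts[(received.year, received.month)] += 1
    return dict(sorted(counts.items()))


# Invented receipt dates and GI predictions (not study data).
dates = [
    date(2016, 9, 3), date(2016, 9, 17), date(2016, 10, 2),
    date(2016, 10, 9), date(2016, 10, 21), date(2016, 11, 5),
]
gi_predictions = [1, 0, 1, 1, 1, 0]

series = monthly_positive_counts(dates, gi_predictions)
print(series)
```

Months with no predicted-positive reports are absent from the result; a complete series would fill those months with zero before plotting or further analysis.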

Learning from gross necropsy findings
The first section of a WVDL necropsy report (gross necropsy findings) is often a valid approximation of the document's initial draft status. Given that all sections of the report relate to the same patient and a single necropsy exam, we hypothesized that the syndrome label assigned to the primary document represents a valid label for gross findings. We applied all methods reported in the section "Learning with bag-of-words representations" except that we tested models on TF-IDF representations of only the gross necropsy findings section. Analysis was restricted to the subset of documents for which this section was non-empty. Both primary documents and gross necropsy findings were evaluated as training input. For each syndrome, the performance of these learners was compared to a baseline syndromic classifier whose output is indiscriminately positive. F1 scores and 95% confidence intervals were computed using the same bootstrapping procedure.

Results
Two experts achieved percentage agreement of 97.2% and a Cohen's kappa of 0.944 for their labeling of 1,000 documents from the dataset. After defining one expert's labels as ground truth, a proportion of 51.1% (511/1000) represented the GI syndrome, 45.8% (458/1000) represented respiratory disease, and 10.8% (108/1000) represented urinary disease. Table 3 presents accuracy and F1 scores for machine learning approaches applied to each of the syndromic classification tasks, using unigram TF-IDF vectors as input features. The dimension of the feature space was 2,594. Although no single model was best, random forest was consistently among the top-performing models with F1 scores of 0.923 (GI), 0.960 (respiratory), and 0.888 (urinary). Logistic regression and support vector machine models exhibited lower performance. Precision-recall curves for the random forest model are presented in Fig 5. Optimal hyperparameters are described in Table A in S1 Text. The inclusion of bigram tokens did not significantly improve performance and may cause a marginal performance degradation for these models (Table B in S1 Text).

Learning with bags-of-words representations
The mean decrease in Gini impurity provides a static illustration of feature importance for the random forest model, helping to explain which features have the greatest impact on its classification decisions (Table 4).

Error analysis
Manual error inspection was performed for the random forest model. False negative predictions outnumber false positives for two of the three syndromic prediction tasks (Fig 8).
Table 3 notes: Three syndrome classification tasks (gastrointestinal, respiratory, and urinary) were tested. Accuracy and F1 scores were assessed by 10-fold cross-validation, with 95% confidence intervals in parentheses calculated by bootstrapping. The two best results in each column are bolded. Results outside the confidence intervals of the best results are italicized.
False negatives were most frequently associated with an uncommon term, a species-specific anatomical descriptor, or terms derived from causative organisms (Table C in S1 Text). In total, these accounted for 48% (40/84) of false negative predictions. Uncommon terms included references to specific tissues, cell types, or disease processes appearing so infrequently that it seemed unreasonable for a machine learner to recognize their significance without additional domain knowledge (27% of false negatives). Species-specific anatomical descriptors were tracked separately and mostly included descriptors of avian and ruminant anatomy (18% of false negatives). Terms derived from causative organisms were associated with 14% of false negatives. These percentages do not add to 48% because a small number of documents contained terms in more than one category. Table D in S1 Text presents examples of features encountered in this error analysis. Among the false positives, 81% (43/53) were associated with text that mentioned a biological entity without suggesting any corresponding pathology. In such cases, the pathology report may state that a specific organ or tissue has no lesions or may describe findings associated with normal postmortem processes. Furthermore, 87% (46/53) of all false positive predictions were associated with documents for which the original report's morphologic findings and final diagnosis sections were both empty (and therefore the remaining sections were used for training). This is noteworthy because only 15% of original reports fall into this atypical group.

Classifying documents beyond the labeled corpus
A random forest was used to render syndromic predictions on the entire document corpus with hyperparameters m = 0.1p (for feature space of size p) and min samples = 2 selected by 10-fold cross-validated grid search. Distributions of predicted monthly GI-disease counts for necropsy cases at the Wisconsin Veterinary Diagnostic Laboratory (WVDL) are illustrated in Fig 9. The random forest predictions were used to generate a time-series of GI disease counts among species labeled as "small animal exotic" (Fig 10). An apparent increase in GI disease was observed in the fall of 2016. Further examination of cases contributing to this phenomenon revealed that many specimens came from a single producer, and their necropsy reports included evidence of non-specific hepatic pathology.

Learning from gross necropsy findings
We measured the predictive signal represented by the first section of necropsy reports (gross necropsy findings), which are often written earlier than other sections. There were 622 labeled documents with non-empty gross necropsy findings, and the syndrome prevalences within this subset were 0.532 (GI), 0.461 (respiratory), and 0.140 (urinary).
When models were tested on gross findings, F1 scores of 0.738 (GI), 0.698 (respiratory), and 0.423 (urinary) were achieved by a support vector machine, random forest, and classification tree respectively (Table 5). For GI and respiratory disease, the most performant models were trained on gross findings. While several learners achieved F1 scores exceeding the baseline classifier, no models outperformed it with statistical significance on the GI or respiratory disease tasks. For urinary disease, classification tree and bagging trees models trained on primary documents outperformed the baseline classifier with statistical significance.
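The baseline used for comparison can be worked out directly from the prevalences. A classifier that labels every document positive has recall 1 and precision equal to the prevalence, so its F1 score depends only on prevalence. The sketch below applies this to the prevalences reported for the 622-document subset, assuming the baseline is scored on that same subset; the function name is hypothetical.

```python
def all_positive_baseline_f1(prevalence):
    """F1 of a classifier that predicts positive for every document."""
    precision = prevalence  # every document is flagged, so precision = prevalence
    recall = 1.0            # no positive document is missed
    return 2 * precision * recall / (precision + recall)


# Syndrome prevalences within the subset with non-empty gross findings.
prevalences = {"GI": 0.532, "respiratory": 0.461, "urinary": 0.140}

baseline_f1 = {name: all_positive_baseline_f1(p) for name, p in prevalences.items()}
for name, score in baseline_f1.items():
    print(f"{name}: {score:.3f}")
```

Under this assumption the baseline F1 scores are roughly 0.69 (GI), 0.63 (respiratory), and 0.25 (urinary), which shows why an F1 of 0.738 or 0.698 on the two common syndromes is only a modest gain over indiscriminate flagging, while the rarer urinary syndrome leaves more room to beat the baseline.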

Algorithm performance
This study demonstrates that it is feasible to use machine learning algorithms to classify veterinary necropsy reports according to their mention of GI, respiratory, or urinary disease. The F1 scores are high, and the models achieved substantial recall at high levels of precision. The best-performing algorithms are at least comparable to models performing other information-extraction tasks on free-text pathology reports, where micro F1 scores are often reported in the 0.45-0.92 range [24,29,35]. No machine learning algorithm outperformed all others by a statistically significant margin, although the random forest algorithm had consistently high performance across all three syndromic prediction tasks.
Error analysis of the random forest model suggests that higher performance may be possible if we strengthen its ability to infer correct syndromic labels from important but rare biomedical terms. This issue might be addressed using medical ontologies such as the Unified Medical Language System® (UMLS®) from the U.S. National Library of Medicine (NLM) to link conceptually related medical terms. In the future, machine learning systems for syndromic surveillance may use such frameworks to make intelligent predictions from features that appear infrequently or which may be absent from training documents, as has been previously suggested in the context of rule-based syndromic classifiers [49].
Performance across the syndromic categories was variable, with GI and respiratory diseases being easier syndromes to detect as measured by average F1 scores across models. The urinary models had lower F1 scores due to poor recall, as depicted in the precision-recall curves and reflected in the finding that random forest false negatives outweigh false positives. During manual inspection of random forest predictions, it was also found that more prevalent features of urinary disease are associated more strongly with true positive predictions (e.g., terms such as "nephritis", "nephrosis", and "tubules"). These results suggest that errors in urinary syndrome prediction may arise due to the low prevalence of urinary disease in labeled documents. Non-random sampling methods such as the synthetic minority over-sampling technique (SMOTE) could help create a class-balanced dataset more appropriate for training a machine learning system for this task [50].
False positives in the random forest model were associated with documents that included specific biological terms without conferring a pathologic diagnosis, such as in a statement of negation (e.g., "Kidney: No significant lesions found."). False positives were also associated with documents for which the original report did not contain a morphologic or final diagnosis, which most often resulted in a longer description of gross findings being used as input for the learning system. This suggests that statements of negation and longer texts (which are more likely to contain such statements) elevate the risk of false positives in this system. By taking whole documents as input, recurrent neural network models like the LSTM network can learn to distinguish variations in sentence structure including statements of negation that become problematic when text is tokenized into unigram TF-IDF statistics. In this study, LSTM performance did not exceed the best-performing TF-IDF feature-vector models for GI and respiratory disease and was markedly lower for urinary disease. Therefore, despite the theoretical advantages of recurrent neural networks and the recent evidence that they are effective for chief complaint classification in human medicine [10], in this domain we were unable to conclude that they are superior to models using unigram TF-IDF feature representations. Like most deep learning algorithms, LSTM networks often require very large datasets to train effectively. It is possible that deep learning could outperform TF-IDF feature-vector approaches with more training input, but with our relatively small dataset of 1,000 necropsy reports it was not possible to test this hypothesis. In future studies, active learning algorithms may help guide the document-labeling process to ensure that limited training data is optimally informative.

Syndromic surveillance
After the performance of a machine learning classifier has been validated, it can be applied to a larger collection of historical data and its syndromic predictions can be leveraged to draw epidemiological conclusions. Using predictions from the random forest algorithm, we can track syndromes over time and localize cases involved in a suspected outbreak. Our example in Fig 10 illustrated increased GI disease in animals originating from locations close to the diagnostic lab. This approach could also help uncover baseline trends in case numbers at this laboratory. Analysis of historical trends is valuable for resource planning at a diagnostic laboratory and may help test hypotheses about important diseases in the case population.
Studies conducting further analysis of syndromic time series could generate real-time syndromic surveillance applications. For example, future studies may consider a hierarchical surveillance pipeline in which syndromic predictions are processed using statistical event-detection algorithms to detect emerging anomalies in real time.
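One very simple form that such an event-detection step could take is a moving z-score on monthly syndrome counts. The sketch below is purely illustrative: the study does not specify a detection algorithm, and the window length, threshold, function name, and toy counts are all assumptions.

```python
from statistics import mean, stdev


def flag_anomalies(counts, window=12, threshold=3.0):
    """Return indices whose count exceeds the trailing mean by > threshold SDs."""
    flagged = []
    for i in range(window, len(counts)):
        baseline = counts[i - window:i]  # the preceding `window` periods
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma > 0 and (counts[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Invented monthly counts of predicted GI cases with one sharp rise.
monthly_counts = [4, 5, 6, 5, 4, 5, 6, 5, 4, 5, 6, 5, 5, 20, 5]
alerts = flag_anomalies(monthly_counts)
print(alerts)
```

A production system would also need to account for seasonality, reporting delays, and the way a past outbreak inflates the trailing baseline, none of which this sketch addresses.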
In a standard necropsy workflow, it is common for a pathologist to begin by describing gross findings and then to state morphologic and final diagnoses after histological evaluation of tissues and ancillary lab testing. In this study, primary documents were prepared by taking only morphologic findings and final diagnosis sections from necropsy reports, in cases where at least one of these sections was non-empty. This means that training material may have included findings influenced by laboratory tests. However, in real-time use cases, a syndromic surveillance application would ideally make preliminary predictions on initial drafts of reports before test results are available and would update its predictions as reports are completed.
Our findings suggest that gross necropsy text on its own presents a weak predictive signal for syndromic surveillance. There are several reasons for this weakness. First, gross findings can be subtle or non-specific in some disease conditions. Second, 38% of labeled documents did not have information populated in the gross necropsy findings section. Third, we assumed that the label assigned to the primary document logically transfers to the gross findings. This assumption does not hold true in cases where the first section contains a statement of nondiagnostic significance such as "see below". Finally, error analysis showed that a high proportion of false positives were associated with longer documents, and gross findings often represent a longer narrative-style text. Future development of necropsy syndromic surveillance applications should consider the informativeness of initial drafts or gross findings when supported by the medical database system. Institutional policies that may address this need include guidelines for uniform usage of each necropsy report section or version control systems that would enable direct analysis of initial drafts.
Syndromic surveillance of animal populations can provide epidemiological insights that are important to animal and public health. While structured medical data (such as records that include coded diagnoses) would simplify the design of syndromic surveillance systems, there is still an abundance of free-text data in veterinary medicine. The methods presented in this paper provide a framework for extracting syndromic information from free-text necropsy reports. Machine learning approaches may also help to automate veterinary syndromic surveillance using other types of medical text, including physical exam findings and discharge documents. Further work may examine the utility of machine learning for veterinary syndromic surveillance in these domains.