Machine learning based natural language processing of radiology reports in orthopaedic trauma

Objectives: To compare different machine learning (ML) natural language processing (NLP) methods for classifying radiology reports in orthopaedic trauma for the presence of injuries. Assessing NLP performance is a prerequisite for downstream tasks and is therefore of importance from a clinical perspective (avoiding missed injuries, quality checks, insight into diagnostic yield) as well as from a research perspective (identification of patient cohorts, annotation of radiographs). Methods: Datasets of Dutch radiology reports of injured extremities (n = 2469, 33% fractures) and chest radiographs (n = 799, 20% pneumothorax) were collected in two different hospitals and labeled by radiologists and trauma surgeons for the presence or absence of injuries. NLP classification was applied and optimized by testing different preprocessing steps and different classifiers (rule-based, ML, and Bidirectional Encoder Representations from Transformers (BERT)). Performance was assessed by F1-score, AUC, sensitivity, specificity and accuracy. Results: The deep learning based BERT model outperformed all other classification methods assessed. The model achieved an F1-score of (95 ± 2)% and an accuracy of (96 ± 1)% on a dataset of simple reports (n = 2469), and an F1-score of (83 ± 7)% with an accuracy of (93 ± 2)% on a dataset of complex reports (n = 799). Conclusion: BERT NLP outperforms traditional ML and rule-based classifiers when applied to Dutch radiology reports in orthopaedic trauma.


Introduction
Injury is responsible for more than 5 million deaths per year globally [1]. Around 80,000 patients with minor and major injuries are admitted annually in the Netherlands, a country with a population of 17 million people [2]. Imaging has a crucial role in triaging patients and in clinical decision making. The spectrum of imaging in orthopaedic trauma is broad, ranging from simple radiographs of extremities after minor trauma to comprehensive radiological evaluations for multi-trauma patients. At the simpler end of the spectrum, the complexity lies not in the imaging procedure itself, but in organizing effective triage of patients for advanced care. The positivity rate of this triage can be assessed by classification of radiology reports and contributes to insight into effective resource utilization [3, 4].
Multi-trauma patients are evaluated, among other injuries, for a pneumothorax with a chest radiograph at the emergency room. Quick analysis is required to identify life threatening pneumothorax, where prompt drainage with chest tube placement is crucial. Large annotated datasets are needed to explore future automated analysis of these chest radiographs with artificial intelligence.
Manual classification of radiology reports by experts is the gold standard for positivity rate assessment and automated label extraction. However, it is very labor intensive. An automated method based on machine learning (ML) and natural language processing (NLP) is needed to make these tasks feasible.
The majority of radiology reports are in free text. NLP is the processing of free text into structured information and encompasses several steps [5]. NLP techniques have been applied to radiology reports for several purposes [6], such as the identification of actionable findings or follow-up recommendations [7][8][9], the identification of patient cohorts [10][11][12][13], the prediction of outcomes [14][15][16], and the annotation of radiologic examinations in fully automated workflows for deep learning (DL) [17]. The NLP methods applied in radiology can be divided into three groups: rule-based methods, ML and DL. Some authors focus on one method [18, 19], others compare several methods within [20, 21] or between [22, 23] groups. A systematic review on natural language processing in healthcare with data until December 2019 included 77 papers, of which only 5 were in the field of radiology; 60 (78%) dealt with English data, while one to five papers each dealt with data in eight other languages, including one in Dutch [24]. A systematic review dedicated to deep learning NLP in radiology included only 10 publications, from 2018 and 2019, of which only two were in the orthopaedic trauma domain [6]. This indicates the scarcity of publications in the field of radiology and on non-English NLP in general. English NLP methods often need adaptation before application to other languages [25]. For the Dutch language, some papers address language-specific challenges [26][27][28][29], but to the best of our knowledge no applications of NLP to Dutch radiology reports in orthopaedic trauma have been described.
Before NLP can be applied in downstream tasks, a domain specific and language specific evaluation of available methods is a prerequisite. Therefore, the purpose of this study is to compare and optimize different NLP methods and assess these methods' performance for the classification of simple and complex Dutch radiology reports in orthopaedic trauma for the purpose of positivity rate assessment and automated label generation.

Materials and methods
In a multidisciplinary AI research collaboration of data scientists, researchers, trauma surgeons, and radiologists of two institutions, we performed a retrospective study for machine learning based natural language processing of radiology reports in orthopaedic trauma. Two distinct datasets of radiology reports for radiographs of trauma patients were collected and annotated for binary classification. According to the Dutch law on Medical Research in Humans, no informed consent was needed because of the retrospective chart review. Three NLP methods with increasing complexity were used to analyze the data, and the results were compared. Comparing these methods across very different datasets allows us to explore their respective strengths and limitations.

Fractures dataset
All radiographs in a general teaching hospital (Treant, Emmen, the Netherlands) requested between January 2018 and September 2019 by general practitioners during evening, night and weekend shifts for patients with minor injuries to extremities were included. The data of this fractures dataset ( n = 2469) were downloaded as comma-separated values (CSV) file from the picture archiving and communication system (PACS). The distribution of body regions is shoulder / upper arm 11%, lower arm / wrist 22%, hand 23%, pelvis / hip / upper leg 3%, lower leg / ankle 23%, foot 18%. Fig. 1 illustrates the data collection process and demonstrates the positivity rate, defined by the fraction of radiographs with a fracture or other type of pathology needing referral such as a luxation. All 2469 radiology reports were annotated manually by a single radiologist (AO) in Excel. All annotations were checked for errors by one out of two other radiologists (CK, HS) who both checked half of the data, resulting in 96.8% (Fracture-2018) and 97.6% (Fracture-2019) agreement. Errors were resolved by a consensus discussion. For the purposes of training and testing the ML models, the Fracture-2018 data ( n = 1377) was used as a training and (cross) validation set, whereas the Fracture 2019 data ( n = 1092) was kept as an independent test set. A 5-fold cross validation was used to find the optimal model performance on the 2018 data which was then tested on the full 2019 data for temporal validation [ 30 , 31 ].

Pneumothorax dataset
Patients who underwent a trauma screening in a Level 1 academic trauma center (UMCG, Groningen, the Netherlands) between January 2009 and December 2017 were eligible for inclusion. Comprehensive trauma radiology reports (including a section on the chest radiograph) were retrieved in a two-step approach. The first batch was extracted manually from the EHR. The second batch was automatically downloaded from the PACS in CSV file format to improve the efficiency of the extraction process. The process is clarified in Fig. 2a. A total of 799 radiology reports were labeled. Labeling (presence or absence of pneumothorax and laterality, based on the information in the radiology report) was performed in Excel by a panel of three trauma surgeons (EF, FIJ, VS). Each of them labeled a third of the dataset, and difficult cases were discussed until consensus was reached. Fig. 2b and c illustrate the data distribution over the categories. For the ML models, a 5-fold cross-validation was performed on this dataset. Table 1 provides an overview of the different NLP strategies and the corresponding classifiers. All strategies are discussed in detail below.

Strategy 1: Rule-based classification
The following steps were taken to incorporate domain knowledge into machine-understandable rules: two radiologists independently wrote 11 and 12 rules, respectively, that, in their opinion, were suitable to classify the reports into two categories. A third radiologist combined these rules, removed duplicates, added logical statements, and ensured consistent language usage, resulting in 9 rules (Appendix Table A1). The anatomical terms incorporated in the rules were provided by the radiologists and based on their domain knowledge. Rule-based classification was implemented using regular expressions in Python and applied to the raw report text of the fractures dataset without any preprocessing. There was no iterative process to improve the rules or extend the set of anatomical terms.
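The structure of such a regular-expression classifier can be sketched as follows. The actual nine rules are listed in Appendix Table A1; the patterns and the precedence of negation over positive findings shown here are purely illustrative, not the study's rules.

```python
import re

# Illustrative rule sets; the study's actual rules are in Appendix Table A1.
NEGATIVE_RULES = [
    # e.g. "geen (aanwijzingen voor) fractuur" = "no (signs of) fracture"
    re.compile(r"\bgeen\s+(aanwijzing(en)?\s+voor\s+)?fractuur", re.IGNORECASE),
]
POSITIVE_RULES = [
    re.compile(r"\bfractuur\b", re.IGNORECASE),   # "fracture"
    re.compile(r"\bluxatie\b", re.IGNORECASE),    # "luxation"
]

def classify(report_text):
    """Return 'negative', 'positive', or 'undecided' for a raw report text."""
    if any(rule.search(report_text) for rule in NEGATIVE_RULES):
        return "negative"          # here, negation rules take precedence
    if any(rule.search(report_text) for rule in POSITIVE_RULES):
        return "positive"
    return "undecided"             # no rule was triggered

print(classify("Geen aanwijzingen voor fractuur."))   # negative
print(classify("Fractuur van de distale radius."))    # positive
print(classify("Normale stand van de gewrichten."))   # undecided
```

The "undecided" outcome corresponds to the reports in the Results section for which no rule fired.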

Strategy 2: Traditional ML
Feature extraction for ML. Data was preprocessed per report in order to create features for ML. Each report was first sentence tokenized, i.e. separated into sentences, using the NLTK tokenizer in Python. Sentences were then cleaned by removing punctuation, special characters and numbers, and converted to lowercase. All stop words, such as de (the), en (and), etc., were excluded. The cleaned words of each report were then lemmatized using the Frog package in LaMachine [32]. Lemmatization allows for a consistent dataset, and its role is highlighted when creating the n-gram features. Finally, all one-, two- and three-word sequences (n-grams) of lemmas were generated and stored. The input feature space for the ML classifiers was the number of occurrences of each feature in each report. Fig. 2b shows the distribution of data over the "pneumothorax" and "no pneumothorax" groups, and Fig. 2c shows the distribution of report length per class; maximum and median report lengths were 529 and 145 words, respectively, with an interquartile range of 108 words. Fig. 3 illustrates the feature extraction process: the raw report text is sentence tokenized, followed by word tokenization and lemmatization, after which the uni-, bi- and trigram features are generated from the lemmas. In this entire feature extraction process, only lemmatization is considered language-dependent.
We also performed experiments using learned word embeddings. One method for generating embeddings is Word2Vec [33] , implemented in the gensim [34] module in Python. Using this method, a model was created comprising the word embedding for each lemma in the data set. Subsequently, report vectors were generated as an average of the embeddings of all words in that report.
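The averaging step can be sketched as follows. The 4-dimensional embedding table below is a hypothetical stand-in; in the study the per-lemma vectors come from a gensim Word2Vec model trained on the report corpus.

```python
import numpy as np

# Hypothetical per-lemma embeddings; in the study these come from a
# gensim Word2Vec model fitted on the lemmatized reports.
embeddings = {
    "geen":     np.array([0.1, -0.2, 0.3,  0.0]),
    "fractuur": np.array([0.5,  0.4, -0.1, 0.2]),
    "pols":     np.array([0.0,  0.1, 0.2, -0.3]),
}

def report_vector(lemmas, embeddings):
    """Average the embeddings of all in-vocabulary lemmas of one report."""
    vectors = [embeddings[w] for w in lemmas if w in embeddings]
    if not vectors:  # no known lemma: fall back to a zero vector
        return np.zeros(next(iter(embeddings.values())).shape)
    return np.mean(vectors, axis=0)

vec = report_vector(["geen", "fractuur", "pols"], embeddings)
print(vec.shape)  # (4,)
```

The resulting fixed-length vector per report can then be fed to any of the ML classifiers described below.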
ML classifiers. We implemented three different traditional ML classifiers in building our system: a naïve Bayes classifier, a random forest and an artificial neural network (ANN). Classifiers were implemented using scikit-learn [35] in Python. Naïve Bayes: Naïve Bayes is a simple classifier that uses Bayes' theorem to calculate the probability of each class based on the features of the given sample. It assumes that all features are independent. This classifier was implemented from scratch.
ANN: An ANN consists of layers of neurons, with each neuron representing a simple weighted sum of its inputs followed by a non-linear activation function to calculate its final output. Random forest: This classifier creates subsets of the given input set and subsequently a decision tree for each subset. The classifier makes the final decision based on an average of the previously created decision trees. The optimal number of estimators in the random forest was selected using trial and error.
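The three classifiers can be sketched on a toy count-feature matrix as follows. Note that this uses the scikit-learn `MultinomialNB` as a stand-in, whereas the study's naïve Bayes was written from scratch; the data, layer size and estimator count are illustrative only.

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

# Toy data: rows = reports, columns = n-gram occurrence counts,
# labels = presence (1) or absence (0) of a finding. Illustrative only.
X = [[2, 0, 1], [0, 3, 0], [1, 0, 2], [0, 2, 1]]
y = [1, 0, 1, 0]

classifiers = {
    "naive_bayes":   MultinomialNB(),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "ann":           MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000,
                                   random_state=0),
}

for name, clf in classifiers.items():
    clf.fit(X, y)                       # train on the count features
    print(name, clf.predict([[1, 0, 1]]))
```

In the study, each classifier was trained on the n-gram counts (or averaged embeddings) described above, and hyperparameters such as the number of random forest estimators were tuned by trial and error.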
Details of the network structure and parameter values are provided in the Supplementary Materials ( Section 2 ).

Feature selection for ML. The performance of a classifier can be optimized by narrowing down the feature space to an optimal number of features. This was implemented in two ways: by selecting the 500 most frequent n-grams, and by feature selection via a tree-based classifier. The latter used feature importances based on Gini importance [36].
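The tree-based selection can be sketched with scikit-learn's `SelectFromModel`, which keeps features whose Gini importance exceeds a threshold (the mean importance by default). The random data and sizes below are illustrative, not the study's 20,899-feature space.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Illustrative data: 100 "reports" with 40 "n-gram" count features;
# the labels depend on only the first two features.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(100, 40))
y = (X[:, 0] + X[:, 1] > 2).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Keep features whose Gini importance is above the mean importance.
selector = SelectFromModel(forest, prefit=True)
X_selected = selector.transform(X)
print(X.shape, "->", X_selected.shape)   # fewer columns remain
```

In the study, this procedure reduced the feature space from 20,899 to 357 features, although, as reported below, it did not improve classification performance over the 500 most frequent unigrams.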

Strategy 3: DL based NLP: BERT
For our DL based NLP method, we employed a language representation model called Bidirectional Encoder Representations from Transformers (BERT) [37]. BERT has a very deep model architecture based on self-attention layers [38], allowing the model to learn relationships between individual words as well as sentences as a whole. BERT is the current state-of-the-art in NLP for a variety of tasks and datasets, ranging from text classification to question answering and translation. The model has been pre-trained on a large corpus of text: the version we used, BERT Multilingual, was pre-trained on the Wikipedia dumps of the 104 languages with the largest Wikipedia corpora. The model has 110 million trainable parameters and uses a multilingual learned WordPiece vocabulary of 190,000 word pieces.
We largely followed the fine-tuning recommendations of Devlin et al. [37]. For both datasets, full details of the fine-tuning process and the parameters used are provided in Section 3 of the Supplementary Materials.
BERT fine-tuning was performed on an NVIDIA Titan V GPU, using CUDA 10.0 and TensorFlow version 1.14. Due to the high complexity of the BERT model compared to traditional ML methods, using a GPU is necessary to keep training time reasonable.

Evaluation method and metrics
All classifiers were evaluated using the accuracy, F1-score, sensitivity, specificity and area under the receiver operator characteristic curve (AUC).
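These metrics can be computed with scikit-learn as sketched below. Sensitivity is the recall of the positive class, and specificity is derived from the confusion matrix; the labels and scores shown are illustrative, not the study's results.

```python
from sklearn.metrics import (accuracy_score, f1_score, recall_score,
                             roc_auc_score, confusion_matrix)

# Illustrative ground truth and model scores (probabilities of the
# positive class), thresholded at 0.5 for the binary predictions.
y_true  = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3]
y_pred  = [1 if s >= 0.5 else 0 for s in y_score]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("accuracy   ", accuracy_score(y_true, y_pred))   # 0.75
print("F1-score   ", f1_score(y_true, y_pred))         # 0.75
print("sensitivity", recall_score(y_true, y_pred))     # tp / (tp + fn)
print("specificity", tn / (tn + fp))                   # tn / (tn + fp)
print("AUC        ", roc_auc_score(y_true, y_score))   # uses the raw scores
```

Note that the AUC is computed from the continuous scores rather than the thresholded predictions, which is why it can differ from the threshold-dependent metrics.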
For the fracture dataset, five-fold cross-validation was used to measure performance on the training set. In addition, temporal validation was performed by evaluating model accuracy (without retraining the model) on the Fracture-2019 set.
Because of the smaller size of the pneumothorax dataset, no temporal validation was done. Performance was measured using five-fold cross-validation.

Results

Fractures dataset
Table 2 shows the results of the different experiments per classifier for the rule-based, ML and DL methods on the Fracture-2018 dataset.

Rule-based results
The top two rows of Table 2 illustrate that the classification rules used in the rule-based classifier can classify reports with high efficacy. However, the scope of the rules proved to be too limited, and a fairly large number of reports (45% in the Fracture-2018 dataset, 40% for Fracture 2019) were left 'undecided', meaning that none of the classification rules was triggered explicitly. Upon taking the undecided reports into account in the performance metrics, the rule-based performance drops considerably, making it the worst performing method of the ones we investigated.

ML classifier results
All the classifiers were trained and tested for the feature space of 500 most frequent n-grams. There were only minor differences between the performances of the classifiers, with random forest performing slightly better than the others. Hence, we chose to perform further experiments on the more complex pneumothorax dataset using random forest alone.

Feature selection results
The selection of the 500 most frequent unigrams led to the best performance for all three classifiers. Feature selection using random forest and Gini importance produced a feature space consisting of 357 (out of 20,899) features, but surprisingly did not improve performance over using the top 500 unigram features alone. The top 50 n-gram features and their respective feature importances are shown in Fig. 4a.

BERT results
Table 2 shows that the best performing classification method is the BERT model, fine-tuned on our specific dataset. The model achieves an accuracy of (96 ± 1)% and an F1-score of (95 ± 2)%. It only lags somewhat in sensitivity compared to the ML based classifiers, but is superior in all other metrics.

Temporal validation results
The classifiers trained on the Fracture-2018 data were tested on the 2019 data for validation of our system. Here, the setting with all uni-, bi-and trigrams was chosen. Table 3 shows the results on this dataset, which are similar to those obtained on the 2018 set, validating the performance of our system. Furthermore, the validation results also confirm the superior performance of the DL based method compared to the traditional ML approaches.

Pneumothorax dataset
The top 50 features selected by feature selection are shown in Fig. 4 b. Table 4 summarizes the results obtained on the pneumothorax dataset. Clearly, performance on this more complex dataset with much longer reports (compare Figs. 1 c and 2 b) is lower when compared to the fracture dataset. Best performance was again achieved using the BERT based model, with an overall accuracy of (93 ± 2)% and F1 score of (83 ± 7)% . The main reasons for the lower performance are likely: 1) larger report length, 2) smaller number of reports to train on and 3) increased report complexity. As these reports were made as part of a general trauma screening procedure, they often contain references to multiple images of either different or the same anatomic regions.

Discussion
This study presents a comparison and optimization of ML based NLP for classifying radiology reports in orthopaedic trauma.
Our results demonstrate that NLP is feasible for different types of non-English radiology reports from different anatomic regions. Our experiments demonstrate the importance of a clean and well-constructed dataset and show that feature engineering is the primary method of improvement for traditional ML methods. In comparison with rule-based classification and traditional ML classification algorithms, our experiments demonstrate the high performance of the transfer learning technique BERT for both simple and complex Dutch radiology reports. Fig. 4a shows the top features for the fractures dataset; some of the most important include geen fractuur (no fracture), patiënt (patient), distaal (distal) and seh (the Dutch acronym for emergency room). Fig. 4b shows the most important features for the pneumothorax dataset. Here, direct features such as pneumothorax rechts (pneumothorax right) have high importance, but indirect ones such as subcutaan emfyseem (subcutaneous emphysema), thoraxdrain (chest tube drain) and diep sulcus (deep sulcus) are also ranked highly. We therefore propose BERT as the outcome of our objective, the optimization of the NLP methodology. The methodology can be applied to classify radiology reports for downstream tasks such as positivity rate assessment or the generation of weak labels for training computer vision algorithms, which, in turn, has great potential to enhance the performance of clinicians in patient care [39].
For the fracture dataset the BERT algorithm performed at human level, because its accuracy of 0.96 is nearly equal to the inter-annotator agreement (0.97) we observed in our data preparation step. The accuracy remained high in an independent dataset from another time period. Temporal and external validation is required before clinical application [40, 41]. The algorithm is reliable and can be applied to assess the positivity rate in unseen comparable datasets. This helps to monitor effective patient care by automated tracking of referral patterns. The results for the chest radiograph reports appear to be good enough for the future purpose of automated labeling of chest radiographs to train an image-based neural network using weak labels, because the F1 score of the NLP labels is in the same range as in the study of Annarumma [42]. However, the 93% accuracy will impact the training results of a pneumothorax recognition algorithm to an unknown extent, because imperfect labeling will result in an imperfect image algorithm. A combination of NLP labeling and manual labeling can be used, referred to as gold-standard labels (manual) and silver-standard labels (NLP); the latter are then used for training and the former for validating and testing an algorithm [43].

Comparison with literature
To the best of our knowledge, there are no studies applying BERT to the analysis of radiology reports in orthopaedic trauma. In a systematic review, Sorin et al. report ten NLP studies using DL, all achieving F1 scores higher than 0.9, including two studies concerning fractures [6]. Wang reports an F1 score of 0.97 using a convolutional neural network (CNN) for the detection of proximal hip fractures [44]. This performance is comparable to our study; however, our results apply to all fracture types. Lee also reports a high F1 score of 0.97 using a recurrent neural network (RNN) for the identification of fractures [45]. The RNN is also a DL method, and a predecessor of BERT [38]. Kolanu et al. report a sensitivity of 69.6% and a specificity of 95% in an external validation of a rule-based NLP model to detect fractures [47], a lower performance compared to our best performing algorithm (BERT). However, because of methodological differences between the studies, comparability is limited.
Bressem et al. [48] report an AUC of 0.99 for the detection of pneumothorax by a BERT algorithm applied to German-language chest radiograph reports (n = 7200) from an intensive care unit, compared to an AUC of 0.96 in our study with a smaller dataset (n = 799). Another difference is that they pre-trained their BERT model on 3.8 million German radiology reports, compared with the general multilingual model we applied to Dutch reports.
With 7 PubMed records in 2019 and 52 in 2020, the scientific literature on BERT in healthcare is limited but expanding. The fields where BERT is applied are diverse [49][50][51][52], including some publications in the radiology domain. Zhang et al. demonstrated superior performance of BERT compared with other methods applied to Chinese radiology reports in a study extracting information about breast cancer [53]. This study supports our results, even though it is in a different language and has a different purpose. BERT can also be applied to sentence- or word-level classification tasks in radiology reports [20], as opposed to the document-level classification in our study.
Grundmeier et al. report a high F1 score (0.95) for three traditional ML NLP algorithms for the detection of fractures in radiology reports [54]. Although our results were comparable in terms of performance, a strength of our study is the detection of different fracture types in a diverse population, compared to the detection of only long bone fractures in a selective pediatric population. Foufi et al. report a higher match between manually labeled and automatically processed fracture radiology reports using rule-based classification (96.8%) [55]. Comparison with our study is difficult because the composition and origin of their dataset are not entirely clear. In a study using rule-based NLP, Wang reports a higher F1 score (0.96) compared to our rule-based results [56]. This difference can be explained by the extensive site-specific rules they used, compared to our comprehensive rules that covered all body regions. This difference indicates a drawback of rule-based methods: the requirement of extensive feature engineering to improve results.

Biases and limitations
Our datasets were relatively small, which is a limitation given the relation between the performance of ML and DL models and the size of the training set. This relation was tested in the study of Carrodeguas et al. in the range of 100-750 samples, with a performance plateau for DL around 500 samples (F1 score around 0.7). However, the authors also nuance this by noting that with a different model architecture a larger dataset might be beneficial [7], as we demonstrated in our study.
Another limitation is that both datasets were not independently annotated by multiple raters, so no inter-rater reliability could be calculated. Alternatively, for the fracture dataset the annotations of the first radiologist were reviewed for consistency and errors by one of two other radiologists. A recent systematic review identified limited annotator availability as a common bottleneck for NLP studies [57].
Both our datasets comprised a narrow problem statement. Application of this methodology for other types of radiology reports should be done for further validation of the method. Our chest radiograph reports were embedded in comprehensive trauma radiology reports. This might have led to bias, because it is likely that patients with a pneumothorax also have a higher chance of having other injuries, that are described in the same extensive radiology report.

Conclusion
BERT NLP outperforms traditional ML and rule-based classifiers when applied to Dutch radiology reports. Traditional ML classification performance is determined mostly by the features used rather than by the classifier type. Automated positivity rate assessment and label generation from radiology reports are feasible, even for small datasets.

Funding
The Titan V GPU used for this research was donated by the NVIDIA Corporation. Part of this work has been realised within the DAME-project, which is funded by the INTERREG V A-Deutschland-Nederland program with resources from the European Regional Development Fund and co-funded by the Ministerie van Economische Zaken en Klimaat (EZK), the Provincie Groningen and the Niedersächsisches Ministerium für Bundes-und Europaangelegenheiten und Regionale Entwicklung.

Declaration of Competing Interest
The authors of this manuscript declare no relationships with any companies whose products or services may be related to the subject matter of the article.

Acknowledgment
The authors acknowledge H.P. Stallmann, radiologist, for participating in the data preparation.

Supplementary materials
Supplementary material associated with this article can be found, in the online version, at doi: 10.1016/j.cmpb.2021.106304 .