Identification of Malignancies from Free-Text Histopathology Reports Using a Multi-Model Supervised Machine Learning Approach

We explored various Machine Learning (ML) models to evaluate how each model performs in the task of classifying histopathology reports. We trained, optimized, and performed classification with Stochastic Gradient Descent (SGD), Support Vector Machine (SVM), Random Forest (RF), K-Nearest Neighbor (KNN), Adaptive Boosting (AB), Decision Trees (DT), Gaussian Naïve Bayes (GNB), Logistic Regression (LR), and a Dummy classifier. We started with 60,083 histopathology reports, which reduced to 60,069 after pre-processing. The F1-scores for SVM, SGD, KNN, RF, DT, LR, AB, and GNB were 97%, 96%, 96%, 96%, 92%, 96%, 84%, and 88%, respectively, while the misclassification rates were 3.31%, 5.25%, 4.39%, 1.75%, 3.5%, 4.26%, 23.9%, and 19.94%, respectively. The approximate run times were 2 h, 20 min, 40 min, 8 h, 40 min, 10 min, 50 min, and 4 min, respectively. RF had the longest run time but the lowest misclassification rate on the labeled data. Our study demonstrated the possibility of applying ML techniques to the processing of free-text pathology reports for cancer registries for cancer incidence reporting in a Sub-Saharan Africa setting. This is an important consideration for resource-constrained environments seeking to leverage ML techniques to reduce workloads and improve the timeliness of reporting of cancer statistics.


Introduction
The South African National Cancer Registry (NCR) is responsible for the registration of all malignancies, including histopathologically diagnosed malignancies, and annual reporting of cancer statistics for South Africa (SA) [1,2]. The NCR receives over 100,000 cancer pathology reports annually from pathology laboratories in SA [1,2]. All cancer pathology reports are coded according to the International Classification of Diseases for Oncology 3rd edition (ICD-O-3), reports are de-duplicated to identify index cancer cases, and the cancer statistics are calculated and reported annually [1,2]. The NCR database, since its inception in 1986, has over 1.2 million index cancer cases recorded [2].
The NCR receives pathology reports from both private and public laboratories throughout SA [1]. These reports are electronic and in free-text format. Trained data coders perform medical data abstraction and code the malignant reports using the ICD-O-3 topography and morphology classification for downstream analysis [3]. The medical data abstraction process is labor-intensive.

Data Source
All histopathology reports collated in the National Health Laboratory Service's (NHLS) Corporate Data Warehouse (CDW) [32] for the year 2016 in the Western Cape province were made available for this study. The NHLS is the central pathology laboratory service for the public healthcare sector in SA.

Pre-Processing
We assigned a unique row identifier (ID) to each record and then subset the dataset by retaining the three columns that contained the result text in free-text format, the SNOMED-CT morphology codes, and the row ID. We plotted a word cloud of the result text to determine the word representation before data cleaning. We also generated a character count, word count, and unique word count before data cleaning. For each row of the result text, new lines, tabs, and extra spaces were replaced with a single space. Then, we retained only the characters "a" to "z" (in lower or upper case), the digits 0 to 9, hyphens, apostrophes, and spaces. Next, we converted all the resulting text into lower case. We expanded contracted words (for example, "don't" to "do not") and removed the words "no" and "not" from the stopwords list [15], because these are negation words in a sentence. We added the words "tel", "telephone", and "fax" to the stopwords list. Then, we removed all the words in the stopwords list from the resulting text.
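The cleaning steps above can be sketched as follows; the stopword and contraction lists here are tiny illustrative subsets, not the full lists used in the study:

```python
import re

# Illustrative subsets only; "no"/"not" are deliberately kept as negations.
STOPWORDS = {"the", "a", "of", "is", "tel", "telephone", "fax"}
CONTRACTIONS = {"don't": "do not", "isn't": "is not"}

def clean_report(text: str) -> str:
    # Replace newlines, tabs, and runs of whitespace with a single space.
    text = re.sub(r"\s+", " ", text)
    # Keep only letters, digits, hyphens, apostrophes, and spaces.
    text = re.sub(r"[^A-Za-z0-9\-' ]", "", text)
    text = text.lower()
    # Expand contractions before stopword removal.
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Drop stopwords, keeping negation words such as "no" and "not".
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)
```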
A second word cloud was constructed after data cleaning to visualize the effect of the pre-processing. Character counts, word counts, and unique word counts were generated again after data cleaning. We compared the metadata data frame before and after pre-processing and excluded rows where the content was completely lost due to pre-processing. These rows, which initially contained only tabs and newlines, were left with no result text after pre-processing (had null values).

Feature Engineering
This is the process of transforming input data into features that machine learning models can easily interpret, to improve model performance [33,34]. This can be done either by reviewing the input features or by letting the machine learning model select the most appropriate features [33,34]. We used the Term Frequency-Inverse Document Frequency (TF-IDF) vectorizer with an n-gram range of 1-3 to convert the raw text to a matrix of TF-IDF features. TF-IDF is a mathematical representation of the weight of a term or terms in a document [35]. It captures how important a term or terms is/are with respect to the whole corpus or document [35]. The TF-IDF is given by the equation below:

TFIDF(t, d) = TF(t, d) × log(N / df(t))

where d is the document, t is the term, df is the document frequency, TF is the term frequency, and N is the total number of documents.
Through feature engineering, we were able to drop words that the TF-IDF vectorizer gave more weight to when paired with other words that do not contribute to the actual classification goal. These included words such as comment, diagnosis, final diagnosis, immunohistochemistry, microscopic examination, etc.; these words were subheadings in the reports. We sampled some of the most important words, bigrams, and trigrams for this classification and attached a list in the Supplementary Materials.
Then, we fitted the features to the encoded value labels. Since TF-IDF generates many features, it is impractical to use all of them to perform classification. Therefore, we used a dimensionality reduction technique called Truncated Singular Value Decomposition (SVD) [36] through a topic modeling technique called Latent Semantic Analysis [31].
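A minimal LSA sketch: TF-IDF features reduced with truncated SVD. The documents and the component count are illustrative; the study's actual settings may differ:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy reports standing in for the histopathology result text.
reports = [
    "invasive ductal carcinoma of the breast",
    "benign fibroadenoma no malignancy seen",
    "squamous cell carcinoma moderately differentiated",
    "no evidence of malignancy in the sections examined",
]

X = TfidfVectorizer().fit_transform(reports)

# Truncated SVD projects the sparse TF-IDF matrix onto a small number
# of latent components (2 here, purely for illustration).
svd = TruncatedSVD(n_components=2, random_state=3)
X_lsa = svd.fit_transform(X)  # dense, low-dimensional document vectors
```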

Classification
We sampled records with SNOMED-CT morphology codes to create classes of malignancy status for the training data. The cancer morphology codes are five-digit codes ranging from 8000/0 to 9992/9. The first four digits indicate the specific histologic term [37], while the digit after the slash represents the behavior code. The behavior codes are as follows: 0 is benign, 1 is uncertain whether malignant or benign, 2 is carcinoma in situ, 3 is malignant primary site, 6 is malignant metastatic site, and 9 is malignant, uncertain whether primary or metastatic site [37].
Using regular expressions [30], we constructed classes for "Malignant", "Non-malignant", and "No diagnosis". We performed the MMSML classification in scikit-learn in Python [31]. We randomly sampled 5000 rows each from the "Malignant" and "Non-malignant" classes and 1000 rows from "No diagnosis" to create the training data. Then, we used the label encoder to encode the labels as 0, 1, and 2 for "Non-malignant", "Malignant", and "No diagnosis", respectively. We applied a multiclass classification approach. We split the training data into "X" and "y", where "X" is the result text that contains the features for the classification and "y" is the encoded labels.
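Under the behavior-code scheme above, the class construction and label encoding can be sketched as follows. The grouping of behavior codes into classes (in particular, how in situ is counted) and the helper name `malignancy_class` are illustrative assumptions, not the study's exact rules:

```python
import re

# Illustrative sketch: derive a malignancy class from a morphology
# code's behavior digit (the digit after the slash). Treating in situ
# (2), primary (3), metastatic (6), and uncertain-malignant (9) as
# "Malignant" is an assumption made for this example.
def malignancy_class(code: str) -> str:
    m = re.fullmatch(r"\d{4}/(\d)", code)
    if m is None:
        return "No diagnosis"
    return "Malignant" if m.group(1) in {"2", "3", "6", "9"} else "Non-malignant"

# Encoding used in the text: 0, 1, 2 for Non-malignant, Malignant,
# and No diagnosis, respectively.
ENCODING = {"Non-malignant": 0, "Malignant": 1, "No diagnosis": 2}
```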
We evaluated the model performance by running non-optimized classification algorithms. We used the stratified train-test split method in scikit-learn [31]. The test size was 0.3, the random state was 3, stratification was enabled, and shuffle was set to true. Then, we optimized the models by performing hyperparameter tuning with GridSearchCV. The scikit-learn library allowed us to stack these models together, making it possible to compare the performance of each algorithm [31]. The algorithms used in this model are briefly explained below.
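A minimal sketch of such a split, using synthetic data in place of the TF-IDF features (the dataset is invented for illustration; the split settings mirror those above):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic 3-class data standing in for the report features.
X, y = make_classification(n_samples=100, n_classes=3, n_informative=4,
                           random_state=3)

# 70/30 stratified, shuffled split with random_state=3, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=3, stratify=y, shuffle=True)
```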

Gaussian Naïve Bayes (GNB)
GNB is a classifier based on Bayes' theorem [35]. It relies on conditional probability to predict the outcome of an occurrence [35]. For example, if n documents fit into k categories C = {c_1, c_2, . . . , c_k}, then the predicted output is a class c ∈ C. The model function is given below:

P(c | d) = P(d | c) P(c) / P(d)

where d is a document and c indicates a class; the predicted class is the c ∈ C that maximizes P(c | d).

Adaptive Boosting (AB)
AB was introduced by Freund and Schapire; the algorithm works by reweighing the examples in the training set to improve the classification accuracy [38]. It converts any algorithm with an accuracy higher than random guessing into a higher-performing classifier [38]. A boosted classifier is given by the function below:

F_T(x) = Σ_{t=1..T} f_t(x)

where f_t is a weak learner that takes an object x as input and returns the class it belongs to.

Logistic Regression (LR)
LR uses a logistic function to predict a given outcome [39]. LR is also referred to as the maximum entropy model in the multiclass text classification domain [39]. To perform multiclass text classification, LR must be regularized. This is possible by adding a regularization term w^T w / 2, and regularized logistic regression is given by the function below:

min_w (w^T w) / 2 + C Σ_{i=1..n} log(1 + exp(−y_i w^T x_i))

where C > 0 is a parameter set by the user. The function estimates the weight w by minimizing the negative log-likelihood.

Stochastic Gradient Descent (SGD)

SGD is mostly used in large-scale machine learning problems, since computational complexity becomes a limiting factor on very large datasets [40]. SGD addresses this complexity through faster convergence [41]: the algorithm learns by randomly drawing examples from the ground truth, without necessarily taking the previous iterations into consideration [40]. Every iteration of SGD updates the weights based on the gradient from the randomly picked example [40]:

w_{t+1} = w_t − γ_t ∇_w Q(z_t, w_t)

where z_t is the randomly picked example and w_t, t = 1, . . . , n, is a stochastic process that depends on the randomly picked examples [40]. SGD is derived from Batch Gradient Descent (BGD) [41]. BGD is meant for small datasets, while SGD works well on large datasets. We used a constant learning rate.
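A minimal sketch of a linear classifier trained with SGD and a constant learning rate, on synthetic data standing in for the report features (the dataset, loss, and step size eta0 are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic 3-class data standing in for the TF-IDF/SVD features.
X, y = make_classification(n_samples=200, n_classes=3, n_informative=5,
                           random_state=3)

# learning_rate="constant" with step size eta0, as described above.
clf = SGDClassifier(learning_rate="constant", eta0=0.01, random_state=3)
clf.fit(X, y)          # each update uses the gradient of one example
preds = clf.predict(X)
```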

K-Nearest Neighbor (KNN)
This is a non-parametric algorithm that considers the closest neighbors to the point of prediction [35]. For example, given a document x, the algorithm finds the k nearest neighbors of x among the training documents. Since there may be considerable overlap among the neighbors, the algorithm assigns a score to each of the k neighbors and retains those with the highest scores. We used weight-adjusted KNN, which uses the TF-IDF weight vectors for the classification [35], where the KNN weighted cosine measure is:

cos(x, y) = Σ_{t∈T} x_t y_t / ( √(Σ_{t∈T} x_t²) · √(Σ_{t∈T} y_t²) )

where T is the set of words and x_t and y_t are the term frequencies. For a training document d ∈ D, let N_d = {n_1, n_2, . . . , n_k} be the set of k nearest neighbors of d. The similarity sum of the neighbors of d that belong to class c is:

S_c(d) = Σ_{n ∈ N_d, class(n) = c} cos(d, n)

and the similarity total is:

S(d) = Σ_c S_c(d)

with the contribution of d defined in terms of S_c over the classes c.
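A minimal sketch of KNN over TF-IDF vectors with a cosine metric; the documents, labels, and neighbor count are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

# Invented toy corpus: 1 = malignant, 0 = non-malignant.
docs = ["invasive carcinoma", "benign lesion",
        "metastatic carcinoma", "benign cyst"]
labels = [1, 0, 1, 0]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# Cosine metric over TF-IDF vectors, in the spirit of the weighted
# cosine measure above.
knn = KNeighborsClassifier(n_neighbors=1, metric="cosine")
knn.fit(X, labels)
pred = knn.predict(vec.transform(["metastatic carcinoma"]))
```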

Support Vector Machine (SVM)
SVM was originally developed as a binary classifier, but with recent advancements, SVM algorithms have been extended to multiclass classification models [35]. SVM uses either linear or non-linear kernels to perform classification [35]. We used multiclass SVM by applying one-versus-the-rest while generating classification features from TF-IDF [35]. To obtain proper classification, we used a string kernel [35]. The string kernel uses a feature map Φ(·) to map strings into the feature space. We used the spectrum kernel, which counts the number of times each feature (substring of length k over an alphabet of size l) appears in a string, defining a feature map from x to R^(l^k):

Φ_j(x) = number of times feature j appears in x

The feature map Φ(x_i) is generated for each sequence x_i, and the kernel is defined as:

K(x_i, x_j) = ⟨Φ(x_i), Φ(x_j)⟩

Decision Trees (DT)

This classifier performs classification by creating a tree based on the attributes of the data points [35,42]. Classification is performed by selecting the attribute with the largest information gain as the parent node, then using cross-entropy to evaluate the performance of the classification [42]. For example, consider an attribute A with k distinct values, which divides the training set E into the subsets {E_1, E_2, . . . , E_k}.

Random Forest (RF)
This is an ensemble learning method for text classification; it works by generating random decision trees [35]. It is fast to train, though relatively slow at making predictions [35]. For an ensemble of tree classifiers h_1, . . . , h_K with indicator function I(·), convergence is measured through the margin function:

mg(X, Y) = av_k I(h_k(X) = Y) − max_{j≠Y} av_k I(h_k(X) = j)

The predictions in RF are then assigned by majority voting across the trees.
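A small random-forest sketch on synthetic data (the dataset and settings are invented for the example); the final prediction comes from voting across the trees:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic 3-class data standing in for the report features.
X, y = make_classification(n_samples=120, n_classes=3, n_informative=4,
                           random_state=3)

rf = RandomForestClassifier(n_estimators=50, random_state=3)
rf.fit(X, y)           # each tree is grown on a bootstrap sample
preds = rf.predict(X)  # majority vote across the 50 trees
```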

Model Optimization
Hyperparameter tuning of the models was done, and the classification was run through GridSearchCV to improve the model performance [31]. GridSearchCV implements a fit and score method [31]. We performed 5-fold cross-validation in GridSearchCV, and the best model was selected based on the score each fold returned. This allowed us to optimize all the algorithms in the model except for the dummy classifier. Table 1 shows the optimized parameters that we used to perform classification.
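A minimal GridSearchCV sketch with 5-fold cross-validation; the estimator, parameter grid, and scoring choice are illustrative, not the study's exact settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class data standing in for the report features.
X, y = make_classification(n_samples=150, n_classes=3, n_informative=4,
                           random_state=3)

grid = GridSearchCV(
    RandomForestClassifier(random_state=3),
    param_grid={"n_estimators": [10, 50], "max_depth": [None, 5]},
    cv=5,                # 5-fold cross-validation, as in the study
    scoring="f1_macro",  # illustrative scoring choice
)
grid.fit(X, y)
best_model = grid.best_estimator_  # refit on the full data by default
```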

Evaluation
We evaluated our models by calculating the accuracy, precision, recall, F 1 -score, misclassification rate (error rate), micro-average, and macro-average. We achieved this by plotting a Confusion Matrix (CM), Receiver Operating Characteristics (ROC), and Area Under Curve (AUC) for the various algorithms [35]. A CM is a table used to measure the performance of a classification model by counting predicted values against actual values [35]. ROC and AUC measure the performance of the classification at various threshold settings [35]. ROC is the probability curve, while AUC is a measure of the separation of the classes [35]. Plotting confusion matrices, ROC, and AUC is made possible by calculating elements such as the True Positive (TP), False Positive (FP), True Negative (TN), False Negative (FN), True Positive Rate (TPR), and False Negative Rate (FNR) [35].
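The metrics above can be computed directly in scikit-learn; the toy labels below are invented for illustration:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Invented toy labels for a three-class problem
# (0 = non-malignant, 1 = malignant, 2 = no diagnosis).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix(y_true, y_pred)  # counts of actual vs predicted
acc = accuracy_score(y_true, y_pred)
error_rate = 1 - acc                   # misclassification rate
f1_macro = f1_score(y_true, y_pred, average="macro")
```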
The example reports in Table 2 show examples of the reports before and after cleaning and their respective SNOMED-CT classes.

Pre-Processing
A total of 60,083 histology reports were registered by the NHLS for the Western Cape Province in 2016. The mean character count before pre-processing was 1032.12, with a standard deviation of 832.17. The character count of the reports ranged from 1 to 16,961 before pre-processing and from 0 to 12,419 after pre-processing. The word count ranged from 1 to 5613 before pre-processing and from 0 to 1607 after pre-processing. Table 3 shows the summary statistics of the text results before and after pre-processing. We plotted the distribution characteristics of the character count, word count, and unique word count before and after pre-processing of the histopathology reports (Figure 1). The character count, word count, and unique word count remained fairly unchanged from before to after pre-processing, as portrayed by the shapes of the distribution curves (Figure 1). We plotted two word clouds (Figure 2). Word clouds were used to visualize the keywords in the histopathology reports [43]. The two word clouds highlighted the effects of pre-processing: the text changed to lower case and the new-line tags were eliminated, while the content of the histopathology reports generally remained unchanged.
The average word-count percentage change from before to after pre-processing was 14.41% (Figure 3). Expanding shortened words and removing stopwords decreased the word count of some histopathology reports, while other reports remained unchanged.

Classification
We randomly sampled 11,000 reports (from the total of 60,068 that survived pre-processing) where the SNOMED-CT codes indicated the "Malignant", "Non-malignant", and "No diagnosis" classifications for our training set. Then, we split the training dataset into two parts, with 70% for the training set and 30% for the test set. We performed classification without optimization (Table 4). We plotted the CM, ROC, and AUC to show the classification rates between the various classes and the average classification, as shown in Figures 4-7.

Discussion
This study demonstrates the possibilities of integrating ML models to process cancer reports. Data labels make it possible to assign the SNOMED-CT codes to the unlabeled data. This is very important, as in our data, 8.17% of the pathology reports were not assigned any SNOMED-CT codes, while 10% of the assigned SNOMED-CT codes were misclassified. Considering the increasing cancer burden in low- and middle-income countries (LMICs) worldwide due to changes in lifestyle and environmental factors [5,44], more cancer reports are being collected in cancer registries. This requires faster and more efficient means of data processing to meet the demand for the timely and accurate reporting of cancer statistics.
By integrating ML models in data processing, it is possible to achieve timely data processing for the increased reporting load. For example, the NCR collects more than 100,000 raw reports per annum for reporting purposes [1]. The number of raw reports is expected to grow with the increase in population size and the rise in cancer cases in LMICs.
Using ML techniques such as TF-IDF to generate classification features per classes assigned, it is possible to classify records without creating a reference/word dictionary. This also makes it possible to classify records with variability, since the majority of pathology records do not follow definite standard reporting guidelines, and variation exists amongst pathologists. Despite TF-IDF generating many classification features, a dimensional reduction of features using SVD makes it possible to reduce the number of classification features, thereby reducing the classification time while increasing the accuracy. This allows the appropriate features to be assigned to the appropriate classes for the training data promptly. This is evident, as we first tested the models by splitting the training dataset into two and using 30% for testing each classification algorithm.
RF performed best, with the lowest misclassification rate, followed by SVM, DT, LR, and then KNN; this is also mirrored in the F 1 -scores of the five algorithms, except for DT. The F 1 -score for DT during training and optimization was 92%, but there was improvement during the actual classification (Table 5, Figures 4 and 6). The classification rates of RF, SVM, LR, KNN, and SGD for each class were at 97% and above for the five algorithms, while the micro- and macro-averaged classifications were at 98% and above, as shown in the ROC curves (Figures 6 and 7). Even though the AB and GNB algorithms take a short time to train, optimize, and perform classification with, they have high misclassification rates and are not appropriate for the classification of histopathology reports. The run time of RF is a limitation, but it had the lowest misclassification rate on the labeled data, which demonstrates its classification strength. The model is also known to train quickly but take longer to optimize [35]. LR can still be applied to text classification tasks for histopathology reports, since it had a misclassification error below 5% and an F 1 -score of 96%. LR also takes a short time to train, optimize, and perform classification.
When we explored the misclassified reports in the model, all the algorithms uniformly misclassified 59 reports: 2 reports were predicted as malignant but were non-malignant, while 57 reports were predicted as non-malignant but were malignant. Our study did not incorporate Deep Learning models, which are gaining popularity in the Natural Language Processing domain [35]. It would be ideal to try such models on the histopathology reports and measure their performance. We were also not able to incorporate the Multinomial Naïve Bayes (MNB) classifier, which is an improved GNB with enhanced performance compared to GNB [35]. The SVD dimensionality reduction applied in this study generates classification features with negative values, which made the MNB algorithm raise a value error.
Performing classification with ML models saves more time compared to human coding [6] and is more accurate compared to rule-based approaches [13] in cases where datasets are big and have no standard structure. This helps to cope with constantly increasing heterogeneous data when such models are incorporated in workflow pipelines. This is a major strength, as there are no or little adjustments made to the model compared to rule-based approaches [45].

Conclusions
Our study demonstrated the possibility of applying ML techniques to the processing of free-text pathology reports for cancer registries for cancer incidence reporting in a Sub-Saharan African setting. This is an important consideration for resource-constrained environments seeking to leverage ML techniques to reduce workloads and increase productivity. We can apply ML models to improve data-processing efficiency and reduce report misclassification.
RF, though it takes a long time to train and optimize, has the lowest misclassification rate and would therefore be recommended for performing the classification of histopathology reports. DT had the third-lowest misclassification rate, after RF and SVM, which makes the RF, SVM, and DT classifiers appropriate for text classification of histopathology reports.