Study on the TB and non-TB diagnosis using two-step deep learning-based binary classifier

A deep learning-based binary classifier was proposed to diagnose tuberculosis (TB) and non-TB disease from chest X-ray radiographs. The proposed classifier comprises a two-step binary decision tree, with each step trained as a deep learning model using a convolutional neural network (CNN) in the PyTorch framework. In the first step, chest X-ray images were classified as normal or abnormal. In the second step, the abnormal images were classified as TB or non-TB disease. The accuracies of the first and second steps were 98% and 80%, respectively. Moreover, re-training improved the stability of the prediction accuracy for images from different data groups.


Introduction
Tuberculosis (TB) is classified as the fifth leading cause of death worldwide, with about 10 million new cases and 1.5 million deaths per year [1]. Because TB is one of the world's biggest health threats yet is rather easy to cure, the World Health Organization recommends systematic and broad use of screening to eradicate the disease. Posteroanterior chest radiography, in spite of its low specificity and difficult interpretation, is one of the preferred TB screening methods. Unfortunately, since TB is primarily a disease of poor countries, clinical officers trained to interpret these chest X-rays are often lacking [2,3]. Several skin tests based on immune response are available for determining whether an individual has been exposed to TB. However, skin tests are not an indicator of active disease and are also affected by vaccinations. A definitive test for diagnosing TB is the identification of the bacteria in a clinical sputum or pus sample, which is the current gold standard [4,5]. However, it may take several months to identify this slow-growing organism in the laboratory. Another technique is sputum smear microscopy, in which bacteria in sputum samples are observed under a microscope.
Recent advances in automated diagnostic tests are based on DNA analysis [4]. Such a test is fast and accurate, but it is expensive, which makes it unsuitable for population screening. As a result, posteroanterior (PA) chest X-rays (CXRs) remain an inexpensive and mandatory part of every evaluation for TB [6,7]. However, CXRs taken in real examinations can also show non-TB disease, and these non-TB images need to be classified into a category separate from the TB and normal images.
Many of the computer-aided diagnosis (CAD) papers dealing with abnormalities in chest radiographs do so without focusing on any other specific (non-TB) disease. Several CAD systems specializing in TB detection have been reported [8]-[14]. Further, TB screening is not limited to normal/abnormal classification but also aims to locate the abnormality in the CXR [15]. In [16], Hogeweg et al. use a combination of pixel classifiers and active shape models for clavicle segmentation. Note that the clavicle region is a notoriously difficult region for TB detection because the clavicles can obscure manifestations of TB in the apex of the lung. Freedman et al. showed in a recent study that automatic suppression of ribs and clavicles in CXRs could significantly increase a radiologist's performance in nodule detection [17]. Arzhaeva et al. use dissimilarity-based classification to cope with CXRs for which the abnormality is known but the precise location of the disease is unknown [18]. They report classification rates comparable to those achieved with region classification on CXRs with known disease locations. More information on existing TB screening systems can be found in our recent survey [19].
In real examinations of lung disease, TB and Non-TB images are mixed; thus, classification into normal, TB and Non-TB categories is necessary for lung disease detection. If an algorithm trained only on TB/Normal data is used, the accuracy can be lower in examinations that include Non-TB images [20,21]. In addition to TB, several studies on the detection of various kinds of Non-TB disease have been conducted [22]-[24]. Abnormal image detection including Non-TB disease was also investigated using a multi-CNN model [25]. The ChestX-ray14 dataset is from the National Institutes of Health (NIH) Clinical Center [23]. Owing to its size, ChestX-ray14, which consists of 112,120 frontal CXR images from 30,805 unique patients, attracted considerable attention in the deep learning community. Triggered by the work of Wang et al. [23], who used convolutional neural networks (CNNs) from the computer vision domain, several research groups have begun to address the application of CNNs to CXR classification. Yao et al. [26] presented a combination of a CNN and a recurrent neural network to exploit label dependencies. As a CNN backbone, they used a DenseNet [27] model that was adapted and trained entirely on X-ray data. Li et al. [28] presented a framework for pathology classification and localization using CNNs. More recently, Rajpurkar et al. [29] proposed transfer learning with fine-tuning, using a DenseNet-121 [27], which raised the AUC results on ChestX-ray14 for multi-label classification even higher.
In this study, we suggest a two-step process for the detection of normal, TB and non-TB cases, using a binary classifier for each step. The accuracy of the algorithm was investigated for each step, and the results show that this process can be a good candidate for clinical use.

Workflow
A two-step process was suggested in order to categorize examined chest X-ray (CXR) images into three clinical states (Normal, TB and Non-TB) through a binary classification at each step, as shown in figure 1. Each CXR is checked for imaging quality by a radiographer when it is taken, before it is passed to a CAD algorithm for prediction. In our suggested workflow, there are two automated X-ray imaging systems (AXIRs): AXIR1 for the Abnormal/Normal classification and AXIR2 for the TB/Non-TB classification. When a good-quality CXR is processed by AXIR1 and predicted as normal, no further action is taken for that patient. However, an image classified as abnormal is passed to the next step (AXIR2) to determine whether it is TB or Non-TB. Finally, a medical doctor reviews the classified images and decides on further investigation or clinical assessment.
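The cascade described above can be sketched in a few lines. This is only an illustrative sketch: the two classifier callables, their probability outputs, and the 0.5 decision threshold are assumptions made for the example and are not taken from the study.

```python
from typing import Callable

# Hypothetical interface: each classifier maps an image to the probability
# of its positive class (abnormal for AXIR1, TB for AXIR2).
Classifier = Callable[[object], float]


def two_step_diagnosis(
    image: object,
    axir1: Classifier,
    axir2: Classifier,
    threshold: float = 0.5,  # assumed decision threshold, not from the paper
) -> str:
    """Return 'Normal', 'TB' or 'Non-TB' using two binary decisions."""
    # Step 1: images predicted as normal leave the pipeline here.
    if axir1(image) < threshold:
        return "Normal"
    # Step 2: only images flagged as abnormal reach the TB/Non-TB classifier.
    if axir2(image) >= threshold:
        return "TB"
    return "Non-TB"
```

In such a design the second classifier never sees images that the first step labels as normal, so each model can be trained on a dataset matched to its own binary task.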
In each step, the prediction algorithm is trained with a data group optimized for its intended purpose. AXIR1 is trained with a combination of normal and abnormal data. In this case, the abnormal data include TB and Non-TB data. For the Non-TB data, the ChestX-ray14 dataset (NIH) was used for the training. AXIR2 is trained with the TB and Non-TB cases taken from the ChestX-ray14 dataset. Each classifier in the two-step system is designed for binary classification.

Patient data and augmentation
In an effort to provide sufficient training data for the research community and to allow benchmark tests, the U.S. National Library of Medicine has made two datasets of posteroanterior (PA) chest radiographs available: the MC set and the Shenzhen set. Both datasets contain normal and abnormal chest X-rays with manifestations of TB and include associated radiologist readings. The categorization of TB is based on the final pathological diagnosis. The Shenzhen set was used in this study as the TB dataset. The open-source Shenzhen data comprise 326 TB and 226 normal cases (https://lhncbc.nlm.nih.gov/publication/pub9931). The average image size for the Shenzhen data is 3000 × 3000.
In addition, the NIH (National Institutes of Health, U.S.) Clinical Center recently released over 100,000 anonymized chest X-ray images and their corresponding data to the scientific community. The NIH database is available online at https://nihcc.app.box.com/v/ChestXray-NIHCC/folder/36938765345. The size of each image is 1024 × 1024. This dataset is categorized into 14 lung-disease sub-patterns, which are based on pathological diagnosis. However, this dataset does not provide a TB or Non-TB classification with pathological confirmation.
In addition to the above open datasets, we used the East Asian Hospital dataset (obtained in cooperation with Radisen) for TB and Non-TB disease. This dataset is based on pathological diagnosis. For the non-TB data, all images were categorized into NIH-14-based sub-patterns. The images were taken with a Vicomed system and viewed with a Radisen Detector (17 × 17). The average image size is 2484 × 3012. The image acquisition conditions were as follows: voltage = 105 kVp, current = 125 mA, charge = 10 mAs, time = 80 ms, source-to-image distance = 130-150 cm. To ensure correct data collection, two radiologists reviewed the dataset, and only data on which both radiologists agreed were used in this study. From these datasets (open source and East Asian Hospital), the training and test images were randomly selected, excluding children's images.
For AXIR1, among the 1,170 patients in the total dataset (table 1), 170 patients (14.5%) were randomly selected for testing. Among these 170 patients, 85 were abnormal and 85 were normal. The remaining 1,000 patients were split in a 50% : 50% ratio into abnormal (500 patients) and normal (500 patients) cases for training. Among the 585 abnormal cases, 442 were from the NIH data and 143 were from the East Asian hospital data. The normal images also came from the NIH and East Asian data. For AXIR2, among the 984 patients in the total dataset (table 2), 164 patients (16.7%) were randomly selected for testing. Among these 164 patients, 82 were TB cases and 82 were non-TB (other) cases. The remaining 820 patients were split in a 50% : 50% ratio into TB (410 patients) and non-TB (410 patients) cases for training. Among the 492 TB patients, 372 were from the East Asian data, and the remaining 120 were from the Shenzhen data. All 492 non-TB lung disease images were taken from the East Asian hospital data.
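The class-balanced random hold-out described above can be illustrated with a short sketch using the AXIR1 counts. The patient identifiers, the function name, and the fixed random seed are hypothetical and serve only to make the example reproducible; the actual selection procedure used in the study is not specified beyond being random.

```python
import random


def balanced_split(pos_ids, neg_ids, n_test_per_class, seed=0):
    """Randomly hold out the same number of patients from each class."""
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    pos, neg = list(pos_ids), list(neg_ids)
    rng.shuffle(pos)
    rng.shuffle(neg)
    test = pos[:n_test_per_class] + neg[:n_test_per_class]
    train = pos[n_test_per_class:] + neg[n_test_per_class:]
    return train, test


# AXIR1 counts from the text: 585 abnormal and 585 normal patients,
# with 85 of each class held out for testing.
abnormal = [f"abn_{i}" for i in range(585)]  # hypothetical patient IDs
normal = [f"nor_{i}" for i in range(585)]    # hypothetical patient IDs
train, test = balanced_split(abnormal, normal, n_test_per_class=85)
print(len(train), len(test))  # 1000 170
```

Splitting by patient rather than by image keeps all images of one patient on the same side of the split.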
We applied several data augmentation algorithms to improve the training and classification accuracy of the CNN model and obtained high validation accuracy. A total of 1,500 images, including augmented data, were used to train the model. The settings applied for the image augmentation are shown in table 3.
The rotation range denotes the range within which the images were randomly rotated during training, i.e., 10 degrees. Width shift denotes the horizontal translation of the images by 0.2 percent, and height shift denotes the vertical translation of the images by 0.2 percent. Finally, the images were flipped horizontally.

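Table 3. Settings applied for the image augmentation.
Rotation range     10 (degrees)
Width shift        0.2
Height shift       0.2
Horizontal flip    True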
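As a rough illustration of how such settings translate into random transformations, the sketch below draws one set of augmentation parameters per image. It makes several assumptions: the function and key names are hypothetical, the shift values are treated as fractions of the image width and height (a value of 0.002 would be used instead if 0.2 percent is meant literally), and the flip is applied with probability 0.5.

```python
import random


def sample_augmentation(width, height, rng,
                        rotation_range=10.0,   # degrees, from the text
                        width_shift=0.2,       # treated as a fraction (assumption)
                        height_shift=0.2,      # treated as a fraction (assumption)
                        horizontal_flip=True):
    """Draw one random set of augmentation parameters for an image."""
    return {
        # Rotation angle drawn uniformly from [-rotation_range, +rotation_range].
        "angle_deg": rng.uniform(-rotation_range, rotation_range),
        # Translations in pixels, bounded by the shift fraction of each dimension.
        "dx_px": rng.uniform(-width_shift, width_shift) * width,
        "dy_px": rng.uniform(-height_shift, height_shift) * height,
        # Flip with probability 0.5 when flipping is enabled (assumption).
        "flip": horizontal_flip and rng.random() < 0.5,
    }


rng = random.Random(0)
params = sample_augmentation(1024, 1024, rng)
```

The sampled parameters would then be passed to whichever image library performs the actual rotation, translation, and flip.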
We performed a stability test for AXIR1 using data from another group (the Shenzhen data). First, 100 TB and 100 normal images were used for the test. The same test was then repeated after re-training the model with 226 TB and 100 normal images.

Annotation and classification
We further categorized the abnormal patterns of the NIH data as TB or non-TB. TB patterns were then subcategorized into seven active TB pattern classes (consolidation, cavitation, pleural effusion, miliary, interstitial, tree-in-bud, and lymphadenopathy) and an additional inactive TB pattern class. These TB pattern classes are based on those used by the U.S. Centers for Disease Control and Prevention (CDC) [30]. We also subcategorized the Shenzhen TB data into the subclasses of the ChestX-ray14 dataset (NIH data), which are infiltration, consolidation, pleural effusion, pneumonia, fibrosis, atelectasis, nodule, pneumothorax, pleural thickening, mass, hernia, cardiomegaly, edema, and emphysema. These additional categorizations are based on the radiological readings and serve the detailed analysis of the sub-patterns of FN and FP predictions. The resulting annotated data were reviewed by two radiologists independently. Most of the TB data are related to infiltration, consolidation, pleural effusion, fibrosis, and nodule. These patterns are strongly related to TB patterns.
We also used data from East Asian countries (TB and Non-TB disease) from our cooperating hospitals to diversify the training dataset. The Non-TB disease dataset from the East Asian Hospital is categorized by NIH-14 disease pattern according to the final confirmed diagnosis. For the TB data (East Asian Hospital), we annotated more detailed sub-patterns according to radiological sub-TB patterns. However, these sub-TB patterns are categorized with radiological readings only.

Deep learning algorithm and training/testing technique
Herein, we used a 2D CNN algorithm in the PyTorch framework for training and testing our two-step CAD process. Figure 2 shows a schematic of the overall architecture of the proposed CNN model for lung disease detection, which is based on ResNet18 [31,32] pre-trained on the ImageNet dataset. ResNet is one of the most popular CNN architectures; it provides easier gradient flow for more efficient training and was the winner of the 2015 ImageNet competition. The core idea of ResNet is the so-called identity shortcut connection, which skips one or more layers. This gives the gradient a direct path to the very early layers of the network, making the gradient updates for those layers much easier. ResNet18, which is similar in performance to the ResNet50 structure, was employed to reduce the training and verification time [33].
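The identity shortcut can be made concrete with the usual residual formulation. The symbols here are our notation for illustration and do not appear in the original text: x and y are the input and output of a residual block, F is the mapping learned by the skipped layers with weights W_i, L is the training loss, and I is the identity.

```latex
\begin{align}
y &= \mathcal{F}(x, \{W_i\}) + x, \\
\frac{\partial \mathcal{L}}{\partial x}
  &= \frac{\partial \mathcal{L}}{\partial y}
     \left( \frac{\partial \mathcal{F}}{\partial x} + I \right).
\end{align}
```

The identity term in the second line passes the gradient to earlier layers unchanged, even when the contribution through the learned mapping is small, which is the direct path referred to above.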
We evaluated the prediction results in terms of accuracy, sensitivity, specificity, precision, and area under the curve (AUC).
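Assuming the standard definitions in terms of the numbers of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), and taking the positive class to be abnormal for AXIR1 and TB for AXIR2 (an assumption), these metrics can be written as:

```latex
\begin{align}
\text{Accuracy}    &= \frac{TP + TN}{TP + TN + FP + FN}, \\
\text{Sensitivity} &= \frac{TP}{TP + FN}, \\
\text{Specificity} &= \frac{TN}{TN + FP}, \\
\text{Precision}   &= \frac{TP}{TP + FP}.
\end{align}
```

The AUC is conventionally the area under the receiver operating characteristic curve, i.e., sensitivity plotted against (1 − specificity) as the decision threshold is varied.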

Accuracy for AXIR1 and AXIR2
We performed the prediction, and the results are shown in tables 4 and 5. The accuracies of AXIR1 and AXIR2 are 0.98 and 0.80, respectively.
We also investigated how AXIR2 classifies normal images, using 100 normal test images. AXIR2 classified these normal images as TB and others in a ratio of 1 : 3. In addition, we analyzed the detailed distribution of false negative (FN) and false positive (FP) classification errors in AXIR2. The decision tree made 9 FP and 23 FN predictions, as shown in table 6. Atelectasis and pleural thickening (67% of all FPs) and infiltration, consolidation and fibrosis (NIH classification) (83% of all FNs) are the categories most likely to be involved in such errors. The latter three sub-patterns can be categorized as the consolidation sub-pattern in the CDC sub-TB classification.
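As a consistency check, the reported error counts can be turned into a confusion matrix. The sketch below assumes that TB is the positive class and that the 9 FP and 23 FN predictions refer to the AXIR2 test set of 82 TB and 82 non-TB patients; the function name is hypothetical, and the derived sensitivity, specificity, and precision are illustrative and should be checked against table 5.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }


# Assumptions: TB is the positive class; counts refer to the 164-patient test set.
n_tb, n_non_tb = 82, 82
fp, fn = 9, 23
tp = n_tb - fn       # TB cases correctly predicted as TB
tn = n_non_tb - fp   # non-TB cases correctly predicted as non-TB
metrics = binary_metrics(tp, fp, tn, fn)
print({name: round(value, 2) for name, value in metrics.items()})
```

Under these assumptions the accuracy is 132/164, or about 0.80, which agrees with the value reported for AXIR2.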

Stability test for other group data
We evaluated the stability of the studied model on the Shenzhen dataset (100 TB and 100 Normal images) by applying the same trained algorithm. In this case, the accuracy and sensitivity were much lower than those for the NIH-14 and East Asian test data. Thus, we re-trained the model using Shenzhen data (236 TB cases and 100 Normal cases, which are different from the test set). For the Shenzhen test dataset, the accuracy of AXIR1 was as low as 0.50 before re-training and improved to 0.82 after re-training (table 7). For the NIH test dataset, the accuracy after re-training differed only slightly (by 2%) from that before re-training, as shown in table 8.
The effect of the number of added Shenzhen TB images on the accuracy of the re-trained model is shown in figure 3. The accuracy increased with the number of images used for re-training and saturated after 200 or more images were added to the re-training dataset. For this re-training, the augmentation method was not applied.

2020 JINST 15 P10011

Table 7. Performance of classification for re-trained algorithm for Shenzhen test data.
                      Before re-training    After re-training
Accuracy for AXIR1    0.50                  0.82

Table 8. Performance of classification for re-trained algorithm for NIH-14 test data.
                      Before re-training    After re-training
Accuracy for AXIR1    0.98                  0.96

Discussions
The accuracy of AXIR1 is 0.98. This result suggests that our trained algorithm can be used clinically for screening abnormal and normal images. The sensitivity is higher than the specificity by 2%, but this difference is not significant (table 4). Table 6 shows the detailed distribution among the sub-pattern classes for the 9 FP and 23 FN predictions of AXIR2. Atelectasis and pleural thickening account for a large portion of the FPs (67%), and infiltration, consolidation and fibrosis account for a large portion of the FNs (83%). In atelectasis and pleural thickening, the affected region can resemble pleural effusion in the lower lobe on both sides of the lung. Thus, the algorithm's predictions can show a larger portion of FPs specifically due to atelectasis and pleural thickening. Clinically, pleural effusion is strongly associated with TB, but atelectasis and pleural thickening are not. Also, infiltration, consolidation and fibrosis, as defined by the NIH, are strongly associated with TB; however, some of these radiological sub-patterns can be predicted as non-TB. When these sub-patterns were annotated as TB, they frequently occurred in the upper apex region. In most FP cases, the findings were located in the middle or lower lobe area. Thus, applying position-based feature filtering in the algorithm might improve the accuracy. For this sub-pattern analysis, we categorized the data using radiological readings only. For the TB/Non-TB prediction itself, however, all training data labels are based on pathological diagnosis.
The accuracy depends not only on the optimization of the model but also on the properties of the training dataset. We performed a stability test of the accuracy across data groups. First, we trained our algorithm using the NIH and East Asian datasets, the same data groups used for the test. Then we tested the algorithm on another data group (the Shenzhen dataset). The results show that the accuracy becomes significantly lower for a different test dataset, as shown in table 7. The accuracy improved again when we re-trained the model by adding the Shenzhen dataset (see table 8). Also, about 200 images can be enough for re-training to yield the improvement, as shown in figure 3. These results imply that maintenance is important: the CAD algorithm needs to be re-trained with an appropriate dataset (the same data group or a similar annotation pattern) before it is used for clinical purposes in a local hospital.
There are many data augmentation strategies for image data. We added more training data to our deep learning model using easy-to-implement methods such as horizontal flips, rotations, and shifts. Image processing techniques such as stochastic region sharpening, elastic transforms, random patch erasing, and many more can also be considered to augment the data and improve the performance of the resulting model. Therefore, further studies on advanced augmentation techniques (enlargement, contrast adjustment, and baseline correction) are needed to build better models and to create a system that yields a reliable statistical model without requiring a large amount of training data.
According to our pre-study, varying the image size from 512 to 2048 does not affect the performance significantly. However, in some cases, mixing image sizes could affect the performance. Therefore, the different image sizes were normalized to 1024 × 1024, the smallest image size among the obtained images, to avoid an image-size effect on the performance.
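The size normalization can be illustrated with a minimal resizing sketch. The interpolation method used in the study is not stated, so nearest-neighbour sampling is assumed here purely for simplicity, and the function name is hypothetical; a practical pipeline would more likely rely on an image library.

```python
def resize_nearest(image, out_h, out_w):
    """Resize a 2D list of pixel values with nearest-neighbour sampling."""
    in_h, in_w = len(image), len(image[0])
    return [
        # Map each output pixel back to its source pixel by integer scaling.
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


# Toy example: a 2 x 2 image enlarged to 4 x 4. In the study, the target
# size would be 1024 x 1024 for every input image.
small = [[1, 2], [3, 4]]
resized = resize_nearest(small, 4, 4)
```

The same function also reduces larger images, since the index mapping works in both directions.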

Conclusion
In this study, we proposed a two-step process with binary classifiers to classify Normal, TB and Non-TB lung disease. A 2D CNN algorithm (ResNet18 model) in the PyTorch framework was used for each step with optimized training data. In the first step (AXIR1), Abnormal/Normal data are used for training in order to screen normal images from the patient data. In the second step (AXIR2), TB and Non-TB data are used for training in order to classify abnormal images into TB and Non-TB lung diseases. To analyze how the accuracy varies with sub-pattern, we annotated our dataset with TB/Non-TB items and NIH-14-based disease items separately. Accuracies of 0.98 and 0.80 were achieved for AXIR1 and AXIR2, respectively, and stability can be achieved by re-training with data from the same data group as the test data. The lower accuracy of AXIR2 compared with AXIR1 results from the large portion of FPs associated with atelectasis and pleural thickening. The large portion of FNs associated with infiltration, consolidation and fibrosis also contributes to the lower accuracy of AXIR2.