Application and utility of boosting machine learning models based on laboratory tests in the differential diagnosis of non-COVID-19 pneumonia and COVID-19

Background Non-coronavirus disease 2019 (COVID-19) pneumonia and COVID-19 have similar clinical features but different courses and, consequently, require different treatment protocols. Therefore, they must be differentially diagnosed. This study uses artificial intelligence (AI) to classify the two forms of pneumonia using mainly laboratory test data. Methods Various AI models are applied, including boosting models known for deftly solving classification problems. In addition, important features affecting the classification performance are identified using the feature importance technique and the SHapley Additive exPlanations method. Despite the data imbalance, the developed model exhibits robust performance. Results eXtreme gradient boosting, category boosting, and light gradient boosted machine yield an area under the receiver operating characteristic curve of 0.99 or more, an accuracy of 0.96–0.97, and an F1-score of 0.96–0.97. In addition, D-dimer, eosinophil, glucose, aspartate aminotransferase, and basophil, which are rather nonspecific laboratory test results, are demonstrated to be important features in differentiating the two disease groups. Conclusions The boosting model, known to excel at classification using categorical data, also excels at developing classification models using linear numerical data, such as laboratory tests. Finally, the proposed model can be applied in various fields to solve classification problems.


Introduction
In 2019, the World Health Organization declared a pandemic caused by SARS-CoV-2. In January 2022, the number of people infected with the coronavirus disease 2019 (COVID-19) peaked worldwide. Since then, the incidence rate has declined but the disease continues to circulate [1]. As long as COVID-19 persists, the medical field will continue to incur the burden of the disease, and patients with non-COVID-19 diseases will also be affected [2,3].
Reverse-transcription polymerase chain reaction (RT-PCR) is the standard diagnostic method for COVID-19 [4,5]. However, as RT-PCR requires skilled personnel and equipment, it cannot be performed as frequently as a general laboratory blood test. Moreover, the test takes 48 h from the collection of samples to the reporting of results [6,7], and it has high specificity but low sensitivity [8]. These limitations are problematic because early detection of COVID-19 is important for improving the patient's clinical outcome [9,10].
Negative pressure isolation treatment is recommended for COVID-19 patients [11]. However, patients with respiratory symptoms and fever may not receive the appropriate treatment for their actual diagnosis until COVID-19 is confirmed. Inpatients admitted for conditions other than COVID-19 may develop fever and respiratory symptoms and subsequently be confirmed as COVID-19 cases [12]. When this happens, patients do not receive clear isolation guidelines and measures to prevent the spread of COVID-19, which can cause an outbreak in the hospital, including among the medical staff [12]. Therefore, differentiation between non-COVID-19 pneumonia and COVID-19 is important because the two belong to the same family of respiratory diseases, yet have different prevention protocols owing to their different sources.
Several research studies related to COVID-19 have been conducted since the declaration of the pandemic. Among them, some studies used artificial intelligence (AI), and most utilized image data, such as chest X-rays and chest computed tomography (CT) [13][14][15]. Some studies used COVID-19 laboratory test results to train AI [16,17]. However, none of the AI models could accurately distinguish between non-COVID-19 pneumonia and COVID-19 based on laboratory data. In pneumonia management, although laboratory results are used to evaluate a patient's clinical course, they hold little value as a confirmatory test. To enhance classification accuracy, some studies employed boosting models such as eXtreme Gradient Boosting (XGBoost), which are known to efficiently solve classification problems such as differential diagnosis [18][19][20].
This study developed an AI model that can differentiate between non-COVID-19 pneumonia and COVID-19 using laboratory results. In addition, we applied various AI models, including the boosting-series models, to the classification problem of differential diagnosis and determined which was most beneficial. During the COVID-19 pandemic, RT-PCR assays were performed for patients with suspected respiratory diseases. Although extant laboratory results are accessible, no COVID-19-specific findings have yet been identified, so they cannot be used for diagnosis. Hence, we rediscovered the value of laboratory findings by developing an AI model trained with laboratory data.
The contributions of this study are: 1) we developed a medical AI model that utilizes laboratory test data to classify pneumonia in COVID-19 and non-COVID-19 patients; 2) we identified nonspecific clinical markers of pneumonia in both cases (e.g., aspartate aminotransferase (AST) and D-dimer); 3) we developed a model that can train on hospital data despite the problem of missing values; 4) we identified the best boosting algorithms for solving complex classification problems; and 5) we proposed a reference tool to help the medical staff in their decision-making, such as determining the need for a negative pressure isolation room.

Patients and data collection
This study enrolled non-COVID-19 pneumonia and COVID-19 patients aged 19 years or older who required hospitalization in a tertiary hospital from September 2020 to March 2022. The 'non-COVID-19 pneumonia' group is defined as the patients who, by chest X-ray, chest CT, or sputum culture test, were diagnosed with hospitalization-requiring pneumonia but received a negative result in the COVID-19 RT-PCR test. The 'COVID-19' group is defined as the patients who had both hospitalization-requiring pneumonia and a positive RT-PCR test on a nasopharyngeal swab. We collected only the first set of laboratory data on the day of admission. In addition to laboratory data, sex, age, and vital signs were collected. Data collected from both groups are presented in Supplementary Table 1.

Data preprocessing
We collected 71 parameters from 1,065 patients, of which 12,583 (16.64%) values were missing. The missing values were replaced with the median value of each parameter. In total, 75,615 (71 × 1,065) data points were acquired. As a data-preprocessing step during the training of all models, the columns of the dataset were standardized and scaled using scikit-learn, a machine learning library for the Python programming language.
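The preprocessing described above can be sketched with scikit-learn. The matrix below is a synthetic stand-in for the 1,065-patient × 71-parameter dataset (the actual laboratory values and column names are not reproduced here); only the missingness rate mirrors the reported 16.64%.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the 1,065 x 71 laboratory data matrix
rng = np.random.default_rng(0)
X = rng.normal(loc=100.0, scale=15.0, size=(1065, 71))
X[rng.random(X.shape) < 0.166] = np.nan  # inject ~16.6% missing values, as in the study

preprocess = make_pipeline(
    SimpleImputer(strategy="median"),  # replace missing values with column medians
    StandardScaler(),                  # standardize each column to zero mean / unit variance
)
X_clean = preprocess.fit_transform(X)
print(X_clean.shape)            # (1065, 71)
print(np.isnan(X_clean).sum())  # 0
```

Wrapping both steps in a pipeline ensures the same medians and scaling factors fitted on training data are reused on unseen data.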

Model selection
Among the decision-tree models, XGBoost, light gradient boosting machine (LGBM), category boosting (CatBoost), and random forest (RF) were selected. Decision-tree models are largely divided into bagging and boosting models, and they have exhibited good performance in solving classification problems, for example in Kaggle competitions. Additionally, natural gradient boosting (NGBoost), a state-of-the-art boosting model, was used.
As the data in this study were mostly continuous and numerical, such as the complete blood count, liver function tests, and renal function tests, we chose the K-nearest neighbor (KNN) and support vector machine (SVM) models, which solve classification problems using relative distances between feature vectors. Gaussian Naive Bayes (GNB), a classical model for classification problems, was also selected. In addition, to evaluate the usefulness of a deep learning (DL) model on numerical structured data, the multilayer perceptron (MLP), a neural network, was chosen among various DL models. The MLP adds one or more fully connected hidden layers between the input and output layers and applies an activation function to the result of each hidden layer.
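The non-boosting candidates above are all available in scikit-learn and could be registered as in the sketch below; the hyperparameters shown are illustrative defaults, not the tuned values from the study, and the boosting libraries (xgboost, lightgbm, catboost, ngboost) would be added to the same dictionary if installed.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier

# Candidate classifiers from the families discussed above (illustrative settings)
models = {
    "RF": RandomForestClassifier(n_estimators=300, random_state=42),
    "KNN": KNeighborsClassifier(n_neighbors=5),      # distance-based classification
    "SVM": SVC(probability=True, random_state=42),   # probability=True enables AUROC
    "GNB": GaussianNB(),
    "MLP": MLPClassifier(hidden_layer_sizes=(64, 32),  # two fully connected hidden layers
                         max_iter=500, random_state=42),
}
for name, model in models.items():
    print(name, type(model).__name__)
```

Keeping every candidate behind a common dictionary makes it straightforward to loop one evaluation routine over all of them.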

Model development
All collected data were split into a training set (80%) and a test set (20%). The split was stratified on COVID-19, the answer (dependent variable) class, so that the ratio of the answer value (COVID-19) was the same in the training and testing sets. K-fold cross-validation (n_splits=5) [21][22][23][24][25] was performed to avoid data loss during model development and to improve the models' prediction performance. The testing dataset was not used for model training or validation. For training and validation, the training dataset (80% of the data) was subdivided into five equal parts (five-fold stratified cross-validation); in other words, each fold used 64% of the full dataset for training and 16% for validation. The performance of all models was expressed as the area under the receiver operating characteristic curve (AUROC), accuracy, precision, recall, and F1-score. Additionally, F1-score optimization was required because of the data imbalance between non-COVID-19 pneumonia and COVID-19. In such cases, the F1-score is an important performance index [26], and it was optimized through cut-off adjustment.
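The split-then-cross-validate scheme above can be sketched as follows. The data are synthetic (generated to match the study's ~72/28 class ratio), and scikit-learn's GradientBoostingClassifier is used purely as a generic stand-in for the boosting models evaluated in the paper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# Synthetic imbalanced cohort: ~72% non-COVID-19 pneumonia vs ~28% COVID-19
X, y = make_classification(n_samples=1065, n_features=71, weights=[0.72],
                           random_state=42)

# 80/20 split, stratified on the COVID-19 label so class ratios match
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Five-fold stratified CV inside the training set
# (each fold: 64% of all data for training, 16% for validation)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
aurocs = []
for fold_tr, fold_va in cv.split(X_tr, y_tr):
    clf = GradientBoostingClassifier(random_state=42)
    clf.fit(X_tr[fold_tr], y_tr[fold_tr])
    proba = clf.predict_proba(X_tr[fold_va])[:, 1]
    aurocs.append(roc_auc_score(y_tr[fold_va], proba))
print(f"mean CV AUROC: {np.mean(aurocs):.3f}")
```

The held-out `X_te`/`y_te` pair is touched only once, for the final performance report, exactly as the text requires.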
In the development of each model, various hyperparameter tunings were applied. In the tree-series models XGBoost, LGBM, and NGBoost, the column-sample and sub-sample options were set to 0.7–0.9 to optimize learning when creating each tree. As our dataset had many columns, performance improved when columns were randomly sampled; similarly, rows were randomly sampled via the sub-sample option, which improved performance by diversifying the splits when creating each tree. In addition, tree-based models allow the granularity of learning to be set, so this factor was also considered. The level of learning was controlled by max_depth in XGBoost and num_leaves in LGBM. In LGBM, num_leaves was set to 26, lower than the default value of 31, to prevent overfitting. In XGBoost, the optimal performance was obtained with the default max_depth of 6.
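The settings named above map onto the parameter dictionaries the respective libraries accept, roughly as sketched below. Only the values stated in the text are included; the 0.8 sampling ratios are a representative point from the reported 0.7–0.9 range, not the exact tuned values.

```python
# Hyperparameters described in the text, as xgboost / lightgbm keyword arguments
xgb_params = {
    "colsample_bytree": 0.8,  # column subsampling per tree, tuned in 0.7-0.9
    "subsample": 0.8,         # row subsampling per tree, tuned in 0.7-0.9
    "max_depth": 6,           # default depth; gave the optimal XGBoost result
}
lgbm_params = {
    "colsample_bytree": 0.8,
    "subsample": 0.8,
    "num_leaves": 26,         # below the default of 31, to curb overfitting
}
# Usage (if the libraries are installed):
#   xgboost.XGBClassifier(**xgb_params)
#   lightgbm.LGBMClassifier(**lgbm_params)
print(xgb_params["max_depth"], lgbm_params["num_leaves"])
```

Note that XGBoost limits tree complexity by depth while LGBM, which grows trees leaf-wise, limits it by leaf count, which is why the two models expose different knobs for the same purpose.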
As we enrolled 769 non-COVID-19 pneumonia cases and only 296 COVID-19 cases, we had to account for data imbalance. However, the imbalance was minor; hence, the only cut-off adjustments implemented were 0.5 for XGBoost, 0.55 for LGBM, 0.45 for CatBoost, 0.4 for RF, 0.3 for KNN, and 0.5 for the other models.
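Choosing such a cut-off amounts to sweeping candidate thresholds over predicted probabilities and keeping the one that maximizes the F1-score. A minimal sketch, with synthetic data and scikit-learn's GradientBoostingClassifier standing in for the models above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score

# Synthetic imbalanced data (~72/28, echoing the study's class ratio)
X, y = make_classification(n_samples=1065, n_features=20, weights=[0.72],
                           random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, stratify=y,
                                          random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_va)[:, 1]  # predicted probability of the positive class

# Sweep candidate cut-offs and keep the one maximizing F1 on held-out data
cutoffs = np.arange(0.30, 0.61, 0.05)
scores = [f1_score(y_va, proba >= c) for c in cutoffs]
best = cutoffs[int(np.argmax(scores))]
print(f"best cut-off: {best:.2f}, F1: {max(scores):.3f}")
```

In practice the sweep should be run on validation folds, not the final test set, so the reported test metrics remain unbiased.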
XGBoost, LGBM, and CatBoost can be trained even when missing values are present; therefore, these models were additionally trained without any imputation, and their performance improved when missing values were left as blanks. For all other models, the median value of the training set replaced missing values in both the training and testing sets to avoid data leakage.

Feature impact investigation by "feature importance" (FI) technique and SHapley Additive exPlanations (SHAP) method
FI refers to a technique that assigns a score to input features based on their usefulness in predicting a target variable [27]. The FI technique also visually measures the contribution of a parameter to the prediction through a ranking. FI cannot be implemented in DL models such as the MLP. Therefore, we also used the SHAP method to analyze the feature impact of each laboratory result in the classification of non-COVID-19 pneumonia and COVID-19. The SHAP method is a novel technique that visualizes the feature impact that affects the classification model [28]. Unlike the FI technique, SHAP methods provide positive or negative correlation information when each parameter predicts the target variable.
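For tree-based models, an FI ranking of the kind described above can be read directly from the fitted model. The sketch below uses synthetic data and hypothetical feature names drawn from the parameters discussed later; SHAP values would be computed analogously (e.g., via a TreeExplainer) if the shap package is installed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature names for illustration; data are synthetic
feature_names = ["D-dimer", "procalcitonin", "ferritin", "PR", "eosinophil"]
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based feature importances, ranked highest first
ranking = sorted(zip(feature_names, model.feature_importances_),
                 key=lambda t: t[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Impurity-based importances sum to 1 and yield only a magnitude ranking, which is exactly the limitation SHAP addresses by additionally supplying the direction of each feature's effect.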

Ethics approval and consent to participate
This study was approved by the Institutional Review Board of the Ewha Womans University Mokdong Hospital (approval number: EUMC 2022-01-031-001). The patient records were reviewed and published in accordance with the Declaration of Helsinki. The requirement for informed consent was waived owing to the retrospective nature of the study.

Results
This study was conducted between September 2020 and March 2022. We recruited 1,065 patients diagnosed with pneumonia and categorized them into a non-COVID-19 pneumonia group (769 patients) and a COVID-19 group (296 patients). In total, 75,615 (71 × 1,065) data points were used for model development, of which 12,583 (16.64%) were missing. The data collected and the comparison between the two groups are summarized in Supplementary Table 1.
For XGBoost, CatBoost, and LGBM, we redeveloped the models without replacing the missing values with median values. In addition, cut-off adjustment was performed on each model to optimize the F1-score (Table 1).

Evaluation of feature contribution to the prediction model by the FI technique
We investigated the feature contributions of XGBoost, CatBoost, LGBM, and RF, which displayed excellent performance in the classification of non-COVID-19 pneumonia and COVID-19. D-dimer had the highest FI with XGBoost, followed by procalcitonin, ferritin, pulse rate (PR), N-terminal pro-B-type natriuretic peptide, and eosinophil (Fig. 2A). D-dimer also scored the highest FI with CatBoost, followed by PR, respiratory rate (RR), procalcitonin, body temperature (BT), eosinophil, and total bilirubin (Fig. 2B). PR scored the highest FI with LGBM, followed by total bilirubin, D-dimer, BT, eosinophil, glucose, RR, and activated partial thromboplastin time (Fig. 2C). D-dimer also had the highest FI with RF, followed by procalcitonin, RR, PR, ferritin, high-sensitivity troponin T, BT, and lactate dehydrogenase (Fig. 2D).

Feature impact investigation by SHAP method
We applied the SHAP method for the following reasons. In an MLP, the feature impact cannot be obtained using FI; however, through the SHAP method, information on the positive (red) or negative (blue) correlation of each feature with the target variable (COVID-19 pneumonia) can also be obtained. In XGBoost, D-dimer (red) scored the highest impact, followed by RR (blue), PR (blue), ferritin (blue), eosinophil (blue), fibrinogen degradation product (FDP, blue), and BT (red) (Fig. 3A). In CatBoost, D-dimer (red) scored the highest, followed by RR (blue), PR (blue), eosinophil (blue), ferritin (blue), FDP (blue), and total bilirubin (blue) (Fig. 3B). In LGBM, D-dimer (red) had the highest feature impact, followed by RR (blue), PR, FDP (blue), procalcitonin (blue), ferritin (blue), and eosinophil (blue) (Fig. 3C). Unlike with the FI technique, the feature impact of the MLP, a DL model, can be obtained using the SHAP method. Consequently, D-dimer (red) scored the highest, followed by basophil (blue), FDP (blue), white blood cell (blue), sex (female, red), and blood urea nitrogen (red) (Fig. 3D). Characteristically, D-dimer ranked highest among the four models and exhibited a positive correlation with COVID-19 pneumonia.

Discussion
The AI model developed with various boosting algorithms, particularly XGBoost, CatBoost, and LGBM, exhibited accurate classification performance in the differential diagnosis of non-COVID-19 pneumonia and COVID-19 (Table 1 and Fig. 1). Boosting models are generally known for their excellent classification performance using categorical data [18][19][20]. In this study, we demonstrated the advantages of the boosting model in utilizing numerical data, such as laboratory tests, which constitute the majority of hospital data. Both non-COVID-19 pneumonia and COVID-19 are respiratory diseases, and thus far, no method other than RT-PCR, which has a long diagnostic waiting time, has been developed to differentiate them. In addition, even if the patient is afflicted by COVID-19, appropriate treatment cannot be administered until the RT-PCR result is known. Occasionally, COVID-19 infects non-COVID-19 patients who are already hospitalized [12]. As the model developed in this study uses only the initial data of patients, it is advantageous for the early classification of non-COVID-19 pneumonia and COVID-19, which have significantly different treatment protocols, such as the use of negative pressure isolation rooms. Therefore, this model can be considered for application in various medical fields.

The minimization of missing values can aid the development of an AI model, but in the case of hospital data, missing values are inevitable. They are generally replaced with median or mean values when developing AI models, and in this study, we preprocessed them as median values. However, XGBoost, CatBoost, and LGBM can learn despite missing values; therefore, these boosters were also trained with missing values left as blanks, without additional preprocessing. The three models performed efficiently even when the missing values were preprocessed as median values, but they exhibited optimal performance with blanks, which is meaningful.

Preprocessing missing values as a median value can introduce bias; however, if the missing values are left blank, the model learns from the training set how best to route samples with missing entries and chooses the option with the best score. This generally increases the chance of scoring well on the test set. Models other than XGBoost, LGBM, and CatBoost cannot be trained when missing values are present, so we filled the missing values with the median. We used only the training set, not the entire dataset, to obtain the median, and filled in the missing values in both the training and test sets with this value, because preprocessing missing values using the entire dataset (training and testing sets) can lead to data leakage. Our approach to missing values deserves consideration in various medical AI studies in the future.
These results revealed the usefulness of the boosting-series models in developing prediction models from hospital data, such as laboratory tests, which frequently contain missing values. We also implemented a prediction model using NGBoost, a state-of-the-art boosting model. NGBoost is a general probabilistic prediction algorithm that generalizes gradient boosting to probabilistic regression by treating the parameters of conditional distributions as targets for multiparameter boosting; its results also demonstrate the need for natural gradients to correct the training dynamics of multiparameter boosting approaches [29]. NGBoost, the latest machine learning model, performed the best, followed by XGBoost, CatBoost, and LGBM. The performance indices, including the F1-score, of the three boosting models (XGBoost, LGBM, and CatBoost) were excellent. This is because our dataset was mainly composed of structured, linear data. As some columns affect the answer value (COVID-19) nonlinearly, the tree-based boosting models, which are optimal for partitioning data, particularly data that share specific characteristics, performed superbly. The DL model (MLP) performing more poorly than the boosting models is attributed to the small size and linearity of the data, which are not optimal conditions for DL.
These results also substantiate the usefulness of boosting series models in medical data research. Among the AI models, an MLP, a type of DL model that uses neural networks, was also used. The raw data consisted mainly of linear numerical data with categorical features. Tree-based boosting models are mainly beneficial for learning using categorical data. As the MLP performs both linear and nonlinear learning, we expected a better performance [30,31]. However, its performance was inferior to that of the boosting models but significantly better than that of the SVM, KNN, and GNB. Therefore, future research on the development of AI for medicine must test various models.
In this study, we identified important features in the classification of non-COVID-19 pneumonia and COVID-19 using the FI technique and SHAP method. In particular, the comprehensive analysis of the FI and SHAP results suggested D-dimer as the most important feature in the differential diagnosis of non-COVID-19 pneumonia and COVID-19. D-dimer is a small protein fragment present in the blood after a clot is broken down by fibrinolysis, and it is a prerequisite for the diagnosis of thrombosis [32]. However, in addition to thrombosis, elevated D-dimer levels have been associated with liver cirrhosis, high rheumatoid factor, inflammation, malignancy, trauma, pregnancy, recent surgery, and advanced age [33]. Various recent studies have revealed the association between COVID-19 and D-dimer [34][35][36]. The developed model performed efficiently in this aspect, and the discovery of D-dimer as the primary feature contributed to its performance. In the FI and SHAP results of XGBoost, CatBoost, LGBM, and RF, eosinophil (negative correlation) and glucose (positive correlation) also contributed significantly to the performance of the classification model.
Studies have suggested that serum glucose levels and eosinophils are related to the severity of COVID-19 [37][38][39][40]; however, the literature does not contain studies highly specialized in COVID-19 diagnosis. The importance of the initial eosinophil and serum glucose levels in COVID-19 diagnosis revealed herein warrants additional studies. In the FI and SHAP results of the five models (XGBoost, CatBoost, LGBM, RF, and MLP), AST (positive correlation) was common among the top 20 features. Reports have suggested that AST can be elevated in COVID-19 patients and is consequently associated with increased mortality [41,42].
AST is found in very high concentrations in the liver, kidney, pancreas, lungs, and cardiac and skeletal muscles [43]. Therefore, although the assay for detecting this enzyme is a nonspecific laboratory test, it is still an interesting and important feature discovered by the differential diagnosis of non-COVID-19 pneumonia and COVID-19 in this study. Four models (XGBoost, CatBoost, LGBM, and MLP) using the SHAP method revealed basophil (negative correlation) as a common denominator in these diagnoses. Although a previous study identified several associations between basophil and COVID-19, this white blood cell is associated with viral infections other than COVID-19 [44]. The non-COVID-19 pneumonia patients in this study included both bacterial and viral pneumonia. Hence, the features that affect the performance of this model have a high significance despite being nonspecific laboratory test results.
Since the declaration of the COVID-19 pandemic, methods for diagnosing COVID-19 have advanced considerably. Already, nucleic acid amplification tests show fairly high accuracy in diagnosing COVID-19 [45]. The AI model developed herein attempted to classify diseases that have similar clinical features but require early classification. In addition, an accessible laboratory test was used for the development of the model, which validates and adds value to this study. This study also presents evidence for the effectiveness and future potential of AI research based on laboratory tests.
In conclusion, the developed classification model exhibited robust performance in differentially diagnosing non-COVID-19 pneumonia and COVID-19 and identifying important features through the FI and SHAP methods. By rapidly and accurately classifying the two forms of pneumonia with this model, hospitals can prevent the spread of COVID-19 and provide rapid treatment for non-COVID pneumonia, which does not require negative pressure isolation rooms. In addition, the usefulness of boosting models in medical research using AI was demonstrated.

Funding
None.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.