Multicenter analysis and a rapid screening model to predict early novel coronavirus pneumonia using a random forest algorithm

Abstract
Early determination of coronavirus disease 2019 (COVID-19) pneumonia from numerous suspected cases is critical for the early isolation and treatment of patients. The purpose of the study was to develop and validate a rapid screening model to predict early COVID-19 pneumonia from suspected cases using a random forest algorithm in China. A total of 914 patients with initially suspected COVID-19 pneumonia in multiple centers were prospectively included. The computer-assisted embedding method was used to screen the variables. The random forest algorithm was adopted to build a rapid screening model based on the training set. The screening model was evaluated by the confusion matrix and receiver operating characteristic (ROC) analysis in the validation set. The rapid screening model was set up based on 4 epidemiological features, 3 clinical manifestations, decreased white blood cell count and lymphocytes, and imaging changes on chest X-ray or computed tomography. The area under the ROC curve was 0.956, and the model had a sensitivity of 83.82% and a specificity of 89.57%. The confusion matrix revealed that the prospective screening model had an accuracy of 87.4% for predicting early COVID-19 pneumonia. Here, we developed and validated a rapid screening model that could predict early COVID-19 pneumonia with high sensitivity and specificity. The use of this model to screen for COVID-19 pneumonia has epidemiological and clinical significance.


Introduction
The coronavirus disease 2019 (COVID-19) pneumonia outbreak has presented critical challenges for the public health, research, and medical communities. Since December 2019, the COVID-19 pneumonia outbreak has rapidly swept across China and beyond through human-to-human transmission. [1][2][3] Unlike the improved situation within China, the number of patients with COVID-19 pneumonia in other countries increased swiftly, [4] which means that the battle with COVID-19 is far from over.
Medical reports [5][6][7] recently showed that the initial symptoms of COVID-19 pneumonia are nonspecific and usually present as fever, cough, headache, vomiting, or diarrhea; additionally, some cases present without any symptoms. Laboratory results can include normal or reduced leukocyte levels, and chest computed tomography (CT) findings can appear as patchy/punctate ground-glass opacities in single or multiple lung lobes. [8] Currently, real-time reverse transcription-polymerase chain reaction (RT-PCR) assessments are regarded as the gold standard for COVID-19 diagnosis. [9,10] Nevertheless, the COVID-19 specific IgM antibody is usually generated 3-5 days after onset, and some COVID-19 pneumonia cases initially test negative by RT-PCR. [11,12] There were numerous suspected COVID-19 pneumonia cases in clinics during the COVID-19 pandemic season. In addition, a shortage of RT-PCR test kits and specimen sampling uncertainty could adversely impact early COVID-19 pneumonia diagnosis, lead to delayed patient isolation and treatment, and challenge COVID-19 pneumonia epidemiology and prognosis. Therefore, a rapid diagnostic model to screen high-risk or suspected COVID-19 pneumonia patients is urgently needed.
In this study, we aimed to develop and validate a rapid, computer-assisted screening model based on epidemiological data, clinical characteristics, laboratory results, and imaging examinations to detect early COVID-19 pneumonia among numerous suspected subjects. The random forest algorithm was adopted to build the rapid screening model. Model performance was assessed with the receiver operating characteristic (ROC) curve and the area under the curve (AUC). The predictive power of the model was validated by the confusion matrix in the validation set. To our knowledge, this is the first prospective screening model to predict early COVID-19 pneumonia, and it may be suitable for widespread application in other countries.

Study subjects
A total of 914 participants suspected of COVID-19 infection were enrolled from multiple medical institutions in Hangzhou, Shaoxing, Jinxing, Wenzhou, Ningbo, and Taizhou from January 17, 2020 to February 29, 2020. The sample size was calculated with G*Power 3.1.9.2. Suspected and confirmed patients were diagnosed based on the 7th edition of the Chinese recommendations for the diagnosis and treatment of pneumonia caused by COVID-19. The purpose of the study was to develop and validate a rapid screening model to predict early COVID-19 pneumonia from suspected cases using a random forest algorithm in China.
Suspected COVID-19 pneumonia cases.
A suspected COVID-19 pneumonia case met any 1 of the epidemiological criteria and any 2 of the clinical presentation criteria. If there were no epidemiological factors, the suspected patient had to meet all 3 of the clinical presentation criteria.
Epidemiological factors: (1) a history of residence or travel in the outbreak area (Wuhan) or its nearby areas, communities with confirmed cases or other areas with persistent local transmission within 14 days; (2) a history of being in contact with confirmed COVID-19 infected patients (positive nucleic acid detection) within 14 days; (3) a history of being in contact with patients with fever or respiratory symptoms who had a history of residence or travel to the outbreak area (Wuhan) or its neighboring areas, communities with confirmed cases or other areas with persistent local transmission within 14 days; and (4) association with a cluster outbreak, defined as the confirmed COVID-19 pneumonia case in a place of work or a family or community, which was not in the outbreak area, within 14 days, along with other patients suffering from a fever or respiratory symptoms.
Clinical presentations: (1) fever and/or respiratory symptoms; (2) typical chest imaging features of COVID-19 pneumonia, such as ground-glass opacities, pulmonary consolidation, and infiltrating shadows; and (3) normal or decreased white blood cells (WBC), or decreased lymphocytes in an early stage of the disease.
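The suspected-case rule above can be sketched as a small function. This is only an illustration of the stated criteria; the function name and the flag-list representation are assumptions and are not taken from the study.

```python
def is_suspected_case(epi_flags, clinical_flags):
    """Apply the suspected-case rule described in the text.

    epi_flags: 4 booleans, one per epidemiological factor (1)-(4).
    clinical_flags: 3 booleans, one per clinical presentation (1)-(3).
    Both representations are hypothetical, chosen for illustration.
    """
    n_epi = sum(bool(f) for f in epi_flags)
    n_clin = sum(bool(f) for f in clinical_flags)
    if n_epi >= 1:
        # Any 1 epidemiological factor plus any 2 clinical presentations.
        return n_clin >= 2
    # No epidemiological factor: all 3 clinical presentations are required.
    return n_clin >= 3


# Contact with a confirmed case, fever, and typical imaging.
case_a = is_suspected_case([False, True, False, False], [True, True, False])
# No epidemiological factor and only 2 clinical presentations.
case_b = is_suspected_case([False] * 4, [True, True, False])
# No epidemiological factor but all 3 clinical presentations.
case_c = is_suspected_case([False] * 4, [True, True, True])
```

Under this reading, `case_a` and `case_c` qualify as suspected cases and `case_b` does not.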

Confirmed COVID-19 pneumonia cases.
Confirmed COVID-19 pneumonia cases were defined as suspected COVID-19 pneumonia cases that met any 1 of the following criteria: (1) positive COVID-19 nucleic acid detected by RT-PCR in sputum, blood, throat swab, or stool samples or (2) gene sequencing results of the samples that were homologous with the known COVID-19 virus.
A total of 914 patients were prospectively recruited in the study. A waiver of informed consent was approved because the patients would not be exposed to any risks, and the patients' information was anonymized prior to analysis in this observational study. This study was reviewed and approved by the Ethics Committees of all the medical institutions.

Epidemiological factors, clinical characteristics, and laboratory and imaging findings
Epidemiological factors and clinical manifestations were collected for each case. Age, sex, region of residence, body temperature, dry cough, fatigue, dyspnea, sputum, conjunctival congestion, nasal congestion, diarrhea or abdominal ache, dizziness or headache, nausea or vomiting, sore throat, muscle soreness, and comorbidities were recorded for all patients. Routine blood examination and C-reactive protein (CRP) measurement were performed by standard laboratory methods. The X-ray examination followed the common chest protocol: the patients stood with their backs to the X-ray device and their chests pressed firmly against the plate, with their hands resting on the ilium, shoulders drooped, upper arms turned inward (pulling apart the shoulder blades), head tilted back slightly, and lower jaw resting on the upper edge of the plate. CT scanning followed the common chest protocol: the patients were placed in a supine position with their arms raised. The patients were instructed to hold their breath during the data acquisition, which included the whole lung volume. The overall scan time was 2 seconds, and the slice thickness for reconstruction was 1.25 mm. Throat swabs, sputum, stool, or blood samples were collected to test for COVID-19 nucleic acid using standardized RT-PCR test kits following the standard protocol. If an initial RT-PCR test was negative, 2 repeated tests were performed after 24 hours, and COVID-19 specific IgM and IgG antibodies were confirmed negative after 7 days.

Bao et al. Medicine (2021) 100:24

Establishment of the rapid screening model
We included age, sex, comorbidities, epidemiological data, clinical symptoms, body temperature, WBC count, lymphocyte count, neutrophil count, and chest imaging findings to establish a novel diagnostic model for COVID-19 pneumonia based on the epidemiology in China. The epidemiological features and symptoms were treated as binary variables and were scored as "1" if "yes" and "0" if "no." The thoracic radiologic findings were classified as "normal," "unilateral local patchy shadowing," "bilateral multiple ground-glass opacity," "bilateral diffuse ground-glass shadow with pulmonary consolidation," and "other imaging alterations such as pulmonary nodule or pleural effusion," and were scored as "0," "0.5," "1," "2," and "3," respectively. Based on the RT-PCR outcomes, which were considered the gold standard, the patients were classified into a COVID-19 pneumonia group of 361 individuals and a non-COVID-19 pneumonia group of 553 individuals. The patients were then randomly divided into 2 sets: 80% for model development and 20% for model validation. The computer-assisted embedding method was used to screen the variables. The random forest algorithm was applied to build and validate the predictive model.
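The variable coding and the 80%/20% split described above might look like the following minimal sketch. The scores come from the text; the helper names, the fixed seed, and the rounding of the training-set size are assumptions, since the study does not state how the random split was implemented.

```python
import random

# Imaging categories and scores as listed in the text.
IMAGING_SCORES = {
    "normal": 0.0,
    "unilateral local patchy shadowing": 0.5,
    "bilateral multiple ground-glass opacity": 1.0,
    "bilateral diffuse ground-glass shadow with pulmonary consolidation": 2.0,
    "other imaging alterations": 3.0,
}


def encode_binary(answer):
    """Score a yes/no epidemiological feature or symptom as 1 or 0."""
    return 1 if answer == "yes" else 0


def split_indices(n_patients, train_fraction=0.8, seed=0):
    """Shuffle patient indices and split them into development and validation sets.

    The seed and the truncation of the training size are illustrative choices.
    """
    indices = list(range(n_patients))
    random.Random(seed).shuffle(indices)
    n_train = int(n_patients * train_fraction)
    return indices[:n_train], indices[n_train:]


train_idx, valid_idx = split_indices(914)
```

With 914 patients, this particular rounding yields 731 patients for model development and 183 for validation.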

Statistical analysis
All statistical analyses were performed using SPSS software (version 19.0, SPSS Inc., IBM, Chicago, IL, USA), G*Power (version 3.1.9.2), and Python 3.6 scientific libraries (scikit-learn package). Continuous variables are expressed as the mean ± standard deviation and were compared using Student's t test, and categorical variables are expressed as numbers and percentages and were compared using the chi-squared test. For multiple comparisons, a one-way analysis of variance was applied.
G*Power was used for an a priori sample size calculation, with an effect size of 0.25, an α error probability of 0.05, and a power of 0.98. The computer-assisted embedding method was used for variable selection, with logistic regression as the selection algorithm; the variable selection threshold was manually set to 0.85 based on the optimal detection principle. Collinearity diagnosis was performed on the selected variables, which were used for all subsequent analyses. In the training set, the random forest algorithm, an ensemble supervised machine learning algorithm, was used to build a classifier (a predictive model based on a panel of variables), and the importance of each variable was calculated. To evaluate the accuracy of the screening model for determining COVID-19 pneumonia, the predictive model was evaluated by the confusion matrix, ROC, and AUC in the validation set. The cutoff value was defined as the value that jointly maximized sensitivity and specificity. All statistical tests were two-tailed, and a probability level of P < 0.05 was considered statistically significant.
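The embedded selection step described above (fit a logistic regression, then keep variables whose absolute coefficient reaches 0.85) can be sketched with scikit-learn as follows. The data are synthetic and the variable names are hypothetical; the study does not report the regularization settings or any feature scaling, so the defaults used here are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_patients = 2000

# Hypothetical binary predictors; only the first two truly affect the outcome.
names = ["cluster_outbreak", "reduced_wbc", "sore_throat", "nasal_congestion"]
X = rng.integers(0, 2, size=(n_patients, len(names))).astype(float)
true_coef = np.array([3.0, 3.0, 0.0, 0.0])
logit = X @ true_coef - 3.0
y = (rng.random(n_patients) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

# Fit the logistic regression and read off one coefficient per variable.
model = LogisticRegression(max_iter=1000).fit(X, y)
coefficients = model.coef_[0]

# Keep variables whose absolute coefficient meets the manually set threshold.
THRESHOLD = 0.85
selected = [name for name, c in zip(names, coefficients) if abs(c) >= THRESHOLD]
```

In this synthetic setting the two informative variables pass the threshold and the two noise variables do not. scikit-learn's `SelectFromModel` wraps the same idea, but the explicit comparison keeps the threshold logic visible.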

Results

Clinical characteristics
A total of 935 patients were recruited in this study; 21 were excluded because of missing data, leaving 914 participants eligible for evaluation. Among them, COVID-19 pneumonia was ruled out in 553 patients on the basis of at least 2 negative RT-PCR results, and the remaining 361 patients were diagnosed with COVID-19 pneumonia on the basis of a positive RT-PCR result.
The patients' characteristics are shown in Table 1. Among the 361 COVID-19 pneumonia patients, the mean age was 47.16 ± 14.47 years, and 204 patients were male (56.51%). The COVID-19 pneumonia patients were significantly older than those without COVID-19 pneumonia (P < 0.001). Of the confirmed COVID-19 pneumonia patients, 33.24% had a travel or residence history in the outbreak area (Wuhan) within 14 days, 25.76% had contact with patients with fever or respiratory symptoms, and 36.84% were associated with cluster outbreaks within their families or workplaces.

Candidate biomarkers associated with early COVID-19 pneumonia
The computer-assisted embedding method was used for variable selection, with logistic regression as the selection algorithm in this study. The following 31 variables were considered for variable selection: age; sex; comorbidities; travel or residence in or near the outbreak area (Wuhan) in Hubei Province, other areas with persistent local transmission, or a community with confirmed COVID-19 pneumonia cases within 14 days; contact with patients with fever or respiratory symptoms from the outbreak area (Wuhan), areas near the outbreak area (Wuhan) in Hubei Province, other areas with persistent local transmission, or a community with confirmed cases within 14 days; association with a cluster COVID-19 pneumonia outbreak; exposure to wild animals; contact with patients with influenza A or influenza B, which were tested by standard kits; the presence of fever, dry cough, sputum, fatigue, dyspnea, conjunctival congestion, nasal congestion, dizziness or headache, nausea or vomiting, sore throat, and muscle soreness; laboratory tests including WBC, lymphocyte, and neutrophil cell counts and CRP levels; and radiological examination findings including chest X-ray or CT scanning.
Table 1 footnotes: * One valvular heart disease, 2 atrial fibrillation, 1 HIV infection, 2 ankylosing spondylitis, 1 an anxiety disorder, 1 gout, 2 cerebral infarction, and 1 depression. † Three chronic nephritis, 2 cerebral infarction, 3 depression, 2 schizophrenia, 1 rheumatoid arthritis, 1 gout, 1 hypothyroidism, and 1 trauma. ‡ Fever is defined as a body temperature >37.5°C.

The logistic regression coefficients are shown in Supplementary Table 1, http://links.lww.com/MD2/A224, and the variable selection threshold was manually set to an absolute value of 0.85 based on the optimal detection principle. The top 10 variables ranked by regression coefficient are shown in Fig. 1. These variables were further analyzed for the probability of having COVID-19 pneumonia, including travel or residence history within 14 days in the outbreak area (Wuhan); contact with patients with fever or respiratory symptoms within 14 days who had a travel or residence history in the outbreak area (Wuhan); contact with patients from other areas with persistent local transmission or community with confirmed cases; association with a cluster outbreak; the presence of fatigue, dyspnea, and muscle soreness; reduced WBC and lymphocytes; and imaging findings on chest radiography. In addition, all the tolerances of the top 10 variables were >0.1, and all the variance inflation factors were <10. Therefore, no collinearity was found among the variables (Table 2).
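The collinearity criteria quoted above (tolerance >0.1 and variance inflation factor <10) follow from regressing each variable on all the others: tolerance is 1 − R² and the variance inflation factor is its reciprocal. The sketch below computes both with NumPy on synthetic data; the study itself presumably obtained these values from SPSS, so this is an illustration of the definitions rather than the authors' procedure.

```python
import numpy as np


def tolerance_and_vif(X):
    """Return a (tolerance, VIF) pair for each column of X.

    Each column is regressed on the remaining columns plus an intercept;
    tolerance = 1 - R^2 and VIF = 1 / tolerance.
    """
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    results = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])
        beta, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ beta
        ss_res = float(residual @ residual)
        ss_tot = float(((target - target.mean()) ** 2).sum())
        tolerance = ss_res / ss_tot  # equals 1 - R^2
        results.append((tolerance, 1.0 / tolerance))
    return results


rng = np.random.default_rng(0)
independent = rng.normal(size=(500, 3))
stats_ok = tolerance_and_vif(independent)

# Append a near-copy of the first column to provoke collinearity.
near_copy = independent[:, 0] + 0.05 * rng.normal(size=500)
stats_collinear = tolerance_and_vif(np.column_stack([independent, near_copy]))
```

Independent columns give tolerances near 1, while the duplicated column pushes the variance inflation factor far above 10, which is the pattern the thresholds are meant to flag.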

Development and validation of a model to predict the probability of early COVID-19 pneumonia
To identify crucial predictors for the probability of early COVID-19 pneumonia, a random forest model was trained on 80% of the patients using the top 10 variables. We established the importance ranking with the random forest algorithm. Variables with high importance ranks contribute most to tree building and prediction. As shown in Fig. 2, imaging findings on chest radiography, reduced WBC and lymphocytes, and association with a cluster outbreak were the most important predictors in the random forest model with n_estimators = 40 and criterion = entropy.
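A random forest with the reported settings (n_estimators = 40, criterion = entropy) and its importance ranking can be sketched with scikit-learn as below. The data are synthetic, the three variable names are hypothetical stand-ins for the study's 10 predictors, and `random_state` is an assumption added only for reproducibility.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_patients = 1000

# Hypothetical predictors; the label follows the first one with 10% noise.
names = ["imaging_score", "reduced_wbc", "fatigue"]
X = rng.integers(0, 2, size=(n_patients, len(names))).astype(float)
flipped = rng.random(n_patients) < 0.10
y = np.where(flipped, 1 - X[:, 0], X[:, 0]).astype(int)

# Train on the first 80% of rows, mirroring the development/validation split.
n_train = int(0.8 * n_patients)
forest = RandomForestClassifier(
    n_estimators=40, criterion="entropy", random_state=0
)
forest.fit(X[:n_train], y[:n_train])

# Impurity-based importances are normalized to sum to 1 across variables.
importances = forest.feature_importances_
ranking = sorted(zip(names, importances), key=lambda item: item[1], reverse=True)
```

Here `ranking` lists the variables from most to least important, which is the kind of ordering a figure such as Fig. 2 would display.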
Model performance was assessed on the 20% validation set with ROC and AUC. As shown in Table 3, the AUC was highest with n_estimators = 40, which was therefore selected as the best value. The AUC was 0.956, with a sensitivity of 83.82% and a specificity of 89.57% for predicting the probability of COVID-19 pneumonia (Fig. 3).
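To make the ROC quantities concrete, the sketch below computes the AUC and picks a cutoff on a tiny hand-made example. Choosing the cutoff by maximizing the sum of sensitivity and specificity is one common reading of the criterion stated in the statistical analysis and is an assumption here; the labels and scores are invented for illustration.

```python
def roc_points(labels, scores):
    """Return (threshold, sensitivity, specificity) for each distinct score."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
        tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
        points.append((threshold, tp / positives, tn / negatives))
    return points


def auc(labels, scores):
    """AUC as the chance that a positive case outscores a negative one."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def best_cutoff(labels, scores):
    """Pick the threshold with the largest sensitivity + specificity."""
    return max(roc_points(labels, scores), key=lambda point: point[1] + point[2])


# Invented predicted probabilities for 4 positive and 5 negative cases.
labels = [1, 1, 1, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05]
area = auc(labels, scores)
cutoff, sensitivity, specificity = best_cutoff(labels, scores)
```

For this toy input, 19 of the 20 positive-negative pairs are ordered correctly, so the AUC is 0.95, and the selected cutoff of 0.4 gives a sensitivity of 1.0 and a specificity of 0.8.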
The predictive power of the model for COVID-19 pneumonia was subsequently validated by the confusion matrix in the validation set (Table 4). The overall percentage of correct predictions was 87.4%. The class-specific percentages of correct predictions were 83.8% and 89.6% in COVID-19 pneumonia and non-COVID-19 pneumonia patients, respectively.
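The relationship between a confusion matrix and the three reported percentages can be checked with a few lines of arithmetic. The cell counts below are hypothetical: they are not taken from Table 4 but are one set of counts that is consistent with the reported sensitivity, specificity, and a validation set of about 20% of 914 patients. Table 4 of the original article remains the authoritative source.

```python
def confusion_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, and accuracy from confusion-matrix counts."""
    total = tp + fn + tn + fp
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / total,
    }


# Hypothetical counts, chosen only to be consistent with the reported
# percentages; they are an assumption, not data from the study.
metrics = confusion_metrics(tp=57, fn=11, tn=103, fp=12)
percentages = {name: round(100 * value, 2) for name, value in metrics.items()}
```

With these assumed counts, sensitivity is 57/68 = 83.82%, specificity is 103/115 = 89.57%, and accuracy is 160/183 = 87.43%, which agrees with the 87.4% overall figure.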

Discussion
In this study, we analyzed and compared the epidemiological factors, clinical manifestations, and laboratory and imaging findings between COVID-19 pneumonia patients and suspected patients in whom COVID-19 pneumonia was excluded. We applied the computer-assisted embedding method and random forest algorithm to build and validate a predictive model for early COVID-19 pneumonia. The model included 4 epidemiological features: travel or residence in the outbreak area (Wuhan); contact with patients with fever or respiratory symptoms from the outbreak area (Wuhan) within 14 days; contact with patients from other areas with persistent local transmission or communities with confirmed cases; and association with a cluster outbreak. The model also included 3 clinical manifestations (fatigue, dyspnea, and muscle soreness), decreased WBC and lymphocyte counts, and imaging changes on chest X-ray or CT scanning. The diagnostic performance of this established

Prior studies have shown that most COVID-19 pneumonia patients were characterized by nonspecific clinical presentations, such as fever, cough, fatigue, myalgia, diarrhea, nausea, headache, and sore throat in the early stage of the disease. [13] A proportion of patients gradually progressed to dyspnea, especially those with low immune function. [12] During treatment, clinicians found that dyspnea was more likely to occur in populations with low immune function than in those with normal immune function. [6] The onset of complications, such as arrhythmia, acute respiratory distress syndrome, and shock, was often a sign of poor prognosis. [14][15][16] Among the laboratory results, the most common abnormalities were leukopenia and lymphocytopenia. A previous study showed that hypoalbuminemia, elevated CRP and lactate dehydrogenase, and decreased CD8 count could also be found in some patients. [13] The most frequent imaging findings were patchy/punctate ground-glass opacities in a single lobe or multiple lobes of the lungs. [17] In addition, abnormalities on chest CT scanning could reflect disease progression and severity. [18] COVID-19 infection could also present with normal pulmonary imaging, particularly in the early stage, suggesting a need to combine epidemiology with clinical characteristics, laboratory tests, and imaging findings in the screening and diagnosis of COVID-19 pneumonia. [19]

RT-PCR can confirm the diagnosis of COVID-19 pneumonia. [9] RT-PCR has high specificity and sensitivity and has been widely used in determining different coronavirus infections; however, it has some disadvantages, such as being time-consuming, test kit shortages, and specimen sampling issues. Furthermore, RT-PCR might show false-negative results when used with unstable kits or non-standardized sampling, and repeated tests are required for a number of patients with initial negative RT-PCR results. [20] Moreover, COVID-19 specific IgM antibodies are usually generated 3-5 days after onset. Some COVID-19 pneumonia patients in the outbreak area were diagnosed based only on clinical and imaging findings because earlier tests for viral RNA were negative. All of these presentations and disadvantages have made early COVID-19 pneumonia diagnosis challenging, especially during the pandemic, and have prevented timely isolation and early treatment. A rapid screening diagnostic model is urgently needed to distinguish highly suspicious patients in a large-scale population, to support epidemiologists, and to help clinicians make early treatment decisions that ultimately reduce patient mortality.
We applied the random forest algorithm, a classification method that consists of multiple decision trees, to establish a concise and accurate diagnostic model. This method has clear advantages, including a low risk of overfitting, greater robustness to noise, faster training speed, and higher prediction accuracy. [21] Moreover, compared with other prediction models, this model could more effectively identify interactions and nonlinear relationships between variables. [22] There were 10 candidate variables selected by the computer-assisted embedding method in this study. We set the random forest parameters to "n_estimators = 40 and criterion = entropy" and calculated the importance of each variable according to the original random forest algorithm. The rapid screening model was subsequently established based on 4 epidemiological features, 3 clinical characteristics, 2 laboratory test results, and radiographic imaging findings. The area under the ROC curve was 0.956 with a high sensitivity of 83.82% and a high specificity of 89.57%, and the confusion matrix showed an accuracy of 87.4% for the screening model in predicting early COVID-19 pneumonia.
There were some limitations in this study. First, the patients enrolled were limited to China, which may restrict the generalizability of the model, in particular because epidemiological factors may differ between regions. Global studies are needed to assess the application of the model. Second, our research was confined to early and rapid screening and did not provide adequate information on disease progression and prognosis. Follow-up studies are necessary. Third, we screened only some critical predictors to fit the model, and the overall fit of the model could be affected by the limited sample size and number of predictors.
In summary, the use of this screening model to rapidly distinguish COVID-19 pneumonia patients from a large number of suspected patients has great significance for both epidemiology and clinical therapeutics while COVID-19 infection continues to spread widely. Unlike virus isolation, RT-PCR, or specific IgM antibody assays, this early screening model is economical, uncomplicated, and fast, and it may save medical resources and lives by reducing COVID-19 pneumonia mortality regionally or globally.

Table 4. The confusion matrix in predicting early COVID-19 pneumonia in the validation set.