Benign‐malignant classification of pulmonary nodules by low‐dose spiral computerized tomography and clinical data with machine learning in opportunistic screening

Abstract
Background: Many people are found to have pulmonary nodules during physical examinations. Discriminating benign from malignant nodules by using data mining technology is of great practical significance.
Methods: The subjects' demographic data, baseline examination results, and annual follow-up low-dose spiral computerized tomography (LDCT) results were recorded. The findings from annual physical examinations of nodule-positive subjects, including those with highly suspicious nodules and those with clinically tentative benign nodules, were analyzed. An extreme gradient boosting (XGBoost) model was constructed, and grid search with cross-validation (Grid Search CV) was used to select the hyperparameters. Data from external institutions were used as an external validation set to evaluate the generalization performance of the model.
Results: A total of 135,503 physical examinees were enrolled. Baseline testing found that 27,636 (20.40%) participants had clinically tentative benign nodules and 611 (0.45%) participants had highly suspicious nodules. Among participants with a negative baseline scan, the proportion with highly suspicious nodules during follow-up was about 0.12%–0.46%, which was lower than the baseline level except at follow-up of >5 years. Among the 27,636 participants with clinically tentative benign nodules, only in the first year of LDCT re-examination was the proportion of highly suspicious nodules (1.40%) significantly greater than that of baseline screening (0.45%) (p < 0.001); the proportion did not differ between baseline screening and the other follow-up years (p > 0.05). Furthermore, 322 cases with benign nodules and 196 patients with malignant nodules confirmed by surgery and pathology were compared. A model and the top 15 most important clinical variables were determined by the XGBoost algorithm. The area under the curve (AUC) of the model was 0.76 [95% CI: 0.67–0.84], and the accuracy was 0.75. The sensitivity and specificity of the model at a threshold of 0.5 were 0.78 and 0.73, respectively. In the validation of the model using external data, the AUC was 0.87 and the accuracy was 0.80. The sensitivity and specificity were 0.83 and 0.77, respectively.
Conclusions: It is important that pulmonary nodules be identified more accurately at the first LDCT examination. A model with 15 variables that are routinely measured in the clinic could help to distinguish benign from malignant nodules. It could help the radiological team issue a more accurate report, and it may guide the clinical team regarding LDCT follow-up.


| INTRODUCTION
Lung cancer continues to be the cancer with the highest incidence and mortality rate worldwide, and has remained so for almost 40 years. [1][2][3] Early detection of lung cancer mainly depends on the screening of low-dose spiral computerized tomography (LDCT) in the lung. 4 The proportion of pulmonary nodules found in LDCT screening has been shown to be as high as 30%, most of which are subsequently confirmed as benign. Despite the benign status, many patients still receive invasive examination and treatment such as surgery and puncture. 5 A study by Gopal and co-workers 6 found that nine cases of early lung cancer were detected per 1000 people screened by LDCT, but among these, 235 false-positive nodules were also detected. In some studies, it was believed that after learning that the pulmonary nodules were positive, rates of LDCT re-examination increased. These excessive rates of LDCT testing are responsible for an increased economic burden on the healthcare system, 7,8 and furthermore expose patients to high doses of radiation for medically unnecessary reasons. 9,10 It has also been shown that LDCT re-examination causes undue stress and emotional effects on patients. 11 Recent medical reports found that lung cancer rates were increasing among youth and nonsmokers in China. 12,13 Given that rates are increasing and the practical and social significance of early lung cancer detection, it is obviously unwise to reduce the scope of LDCT screening and focus only on high-risk groups. 14 The more feasible method is to continuously expand the scope of screening 15 while simultaneously improving the accuracy and predictability of screening methods. On the one hand, instrumentation performance can be enhanced to obtain clearer images, and technical personnel can be more rigorously trained in image analysis. 
[16][17][18][19][20] On the other hand, with modern computational power, big data and artificial intelligence can be utilized to analyze clinical data more effectively. This latter approach shows the most potential to improve overall lung cancer screening quality and efficiency, and remains an important new direction of development. 21,22 The screening of lung cancer by physical examination continues to be opportunistic in its application. 23 In China, opportunistic screening plays a vital role in the early detection of lung cancer. 24 At present, there are many patients with pulmonary nodules found in routine physical examination, most of whom undergo regular review by LDCT every 3-12 months. 25,26 From the perspective of avoiding missed diagnosis, this strategy seems scientific and reasonable. However, for most people with benign nodules, the clinical and biological significance of this high-frequency re-examination has always been a concern among physicians and scientists. The present study used data mining technology to analyze clinical variables captured in physical examination of patients with positive pulmonary nodules, to determine whether computational methods can assist in the discrimination between benign and malignant nodules. On the one hand, it can help the radiological team issue a more accurate report; on the other, it can better guide the clinical team regarding LDCT follow-up, thereby avoiding the unnecessary and undue risks and costs associated with this follow-up.

KEYWORDS
cancer screening, health examination, low-dose computed tomography, lung cancer, opportunistic screening, pulmonary nodules

dysfunction or cardiovascular disease; patients who have previously been diagnosed with lung cancer; patients who have been treated for space-occupying lesions of the lung, have tumors in other organs, or are suspected of cancer metastasis. Participants included in the validation model were from Henan Provincial People's Medical Health Examination Center and Sichuan Provincial People's Hospital with similar inclusion and exclusion criteria as mentioned above (Figure 1). The study protocol was approved (S2021-427-01) by the Ethics Committee of Chinese People's Liberation Army General Hospital and complied with the principles of the Declaration of Helsinki and its contemporary amendments. The subjects were informed that their physical examination data may be used scientifically in a deidentified manner, and informed consent was obtained.

| Data collection
The questionnaire was utilized to record in detail the subject's age (calculated according to the date of birth on the ID card), smoking and alcohol use, personal and family medical history, and environmental exposure. The personal medical history included any prior diagnosis of chronic obstructive pulmonary disease (COPD), tuberculosis and/or asthma, and whether the individual has a history of environmental or high-risk occupational exposure (such as exposure to asbestos and radioactive toxic gas); the family history included prior diagnoses of lung cancer and/or other malignant tumors in immediate relatives.
Smoking status was defined as follows: participants who smoked ≥10 cigarettes a day for more than 1 year, and those who had quit smoking within the past 5 years, were regarded as smokers; those who had quit smoking >5 years earlier were regarded as nonsmokers. Smoking amount = number of cigarettes per day × years of smoking. Alcohol use was classified as no use or a small amount of alcohol per day (≤25 g/day for males, ≤15 g/day for females) or habitual use (≥25 g/day for males, ≥15 g/day for females). Environmental exposures included living with a habitual smoker (i.e., continuous exposure to secondhand smoke) and high-risk occupational exposure (e.g., asbestos, coal mine, radioactive toxic gas).
Participants wore uniform, loose clothing and were examined in a fasting state. Body weight, height, and body fat percentage were measured by using a bioelectrical impedance analyzer (InBody 720 analyzer, InBody Co. Ltd), and body mass index (BMI = body weight [kg]/(body height [m])²) and skeletal muscle mass index (SMI = muscle mass/body weight × 100%) were calculated. After 10 min of rest, systolic and diastolic blood pressures were measured. Deep tissue ultrasound was used to determine the presence of fatty liver and thyroid nodules.
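The derived indices defined above (smoking amount, BMI, and SMI) can be sketched in a few lines of Python. This is only an illustration of the stated formulas; the function names and the example values are hypothetical and do not come from the study data.

```python
def smoking_amount(cigarettes_per_day, years_of_smoking):
    """Smoking amount = number of cigarettes per day x years of smoking."""
    return cigarettes_per_day * years_of_smoking


def body_mass_index(weight_kg, height_m):
    """BMI = body weight [kg] / (body height [m]) squared."""
    return weight_kg / height_m ** 2


def skeletal_muscle_mass_index(muscle_mass_kg, weight_kg):
    """SMI = muscle mass / body weight x 100 (expressed as a percentage)."""
    return muscle_mass_kg / weight_kg * 100.0


# Hypothetical participant, for illustration only.
print(smoking_amount(20, 15))
print(round(body_mass_index(70.0, 1.75), 2))
print(round(skeletal_muscle_mass_index(28.0, 70.0), 1))
```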

| Determination of Helicobacter pylori infection
A ¹³C-urea breath test (Helikit, AltaChem Pharma Ltd) was used to detect Helicobacter pylori (Hp) infection of the gastric mucosa. None of the patients had taken antibiotics during the past month. Participants fasted for 4 h before the test. The obtained samples were analyzed by gas chromatography-isotope ratio mass spectrometry. 30 The results were determined as positive or negative based on a device algorithm. The participants with a delta over baseline (the ¹³CO₂/¹²CO₂ ratio) ≥4 were considered positive (i.e., confirmed gastric Hp infection).

| LDCT scanning
A tube voltage of 100-140 kVp was used according to the subject's weight, with a tube current of <60 mA. The total radiation exposure dose was ≤1 mSv. The scanning range extended from the apex of the lung to the costophrenic angle so as to include the entire lung. Scanning sampling time was ≤10 s, the respiratory phase was defined as the end of deep inhalation, a CT scanner with ≥16 detector rows was used, and no contrast agent was required. After scanning, raw data were used for thin-slice reconstruction with a reconstructed slice thickness of 0.625-1.25 mm; a soft-tissue or lung algorithm was recommended for this reconstruction. For the detection of pulmonary nodules, maximum intensity projection was used for three-dimensional reconstruction of the thin-slice images. 26

| Analysis of LDCT scans
Nodule-positive scans were defined as those showing focal, quasi-circular, solid, or sub-solid (partially solid or ground-glass) lung opacities with 5 mm ≤ nodule diameter ≤ 30 mm. These could be isolated or multiple (≤10), without atelectasis, hilar lymphadenopathy, or pleural effusion, as previously described. 31 Highly suspicious nodules were designated according to the following criteria: (1) a pure ground-glass nodule with a diameter of ≥10 mm; or a nodule with a baseline diameter of ≤15 mm whose diameter increased by ≥2 mm; or a nodule with a baseline diameter of >15 mm whose diameter increased by more than 15% as compared with the baseline 32 ; (2) an increase in the density of a pure ground-glass nodule or the appearance of solid components within it; or solid components exceeding 50% of a sub-solid ground-glass nodule with uneven density 33 ; (3) thickening of tracheal and bronchial walls, lumen stenosis, or intraluminal nodules; (4) a vascular pattern consistent with that of malignant pulmonary nodules; and (5) the presence of lobulation, spiculation, and/or pleural indentation. All other nodules were designated as clinically tentative benign nodules. Patients whose lung cancer was later confirmed by pathology were included in the early lung cancer group. Patients whose nodules were confirmed as benign on follow-up, or whose nodules disappeared, were included in the benign nodule group, although no pathological results were available for them. CT results were independently analyzed by two doctors.
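The size-based rules in this section (the 5-30 mm inclusion range and criterion 1 for highly suspicious nodules) can be expressed as a short Python sketch. It covers only the diameter rules; criteria 2 to 5 depend on the radiologists' reading of the images and are not modelled here. The function and parameter names are hypothetical.

```python
def is_positive_nodule_size(diameter_mm):
    """Size rule for a nodule-positive scan: 5 mm <= diameter <= 30 mm."""
    return 5.0 <= diameter_mm <= 30.0


def meets_size_criterion(baseline_mm, current_mm, pure_ground_glass=False):
    """Criterion (1) only: size of a pure ground-glass nodule, or growth since baseline.

    - pure ground-glass nodule with a diameter of >= 10 mm, or
    - baseline diameter <= 15 mm and growth of >= 2 mm, or
    - baseline diameter > 15 mm and growth of more than 15% of baseline.
    """
    if pure_ground_glass and current_mm >= 10.0:
        return True
    growth_mm = current_mm - baseline_mm
    if baseline_mm <= 15.0:
        return growth_mm >= 2.0
    return growth_mm > 0.15 * baseline_mm


# Hypothetical measurements, for illustration only.
print(meets_size_criterion(8.0, 10.5))    # small nodule that grew by 2.5 mm
print(meets_size_criterion(20.0, 22.0))   # large nodule that grew by 10%
```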

| Clinical follow-up
Within the same subject, each LDCT re-examination was regarded as one follow-up. The first year of follow-up was defined as any re-examination within 365 days of the initial LDCT scan; similarly, follow-up of >1825 days was defined as >5 years. Patients with suspicious malignant nodules found at baseline examination were referred to the clinic for intervention. Subjects who underwent thoracic surgery in our hospital were tracked through ID number or through family members via telephone to ascertain the results of any later operation. If the clinically tentative benign nodules found at baseline remained benign at an LDCT re-examination performed after an interval of >5 years, or the nodules disappeared in the subsequent follow-up, then all pulmonary nodules found at baseline in those patients were defined as benign nodules.
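The follow-up intervals defined above can be assigned with a small helper. The text fixes only the first-year boundary (365 days) and the >5-year boundary (1825 days); treating years 2 to 5 as consecutive 365-day blocks is an assumption of this sketch, and the function name is hypothetical.

```python
def follow_up_year(days_since_baseline):
    """Map the interval between baseline and re-examination to a follow-up year label.

    Days 1-365 -> "1", 366-730 -> "2", ..., 1461-1825 -> "5", >1825 -> ">5".
    """
    if days_since_baseline <= 0:
        raise ValueError("a follow-up scan must come after the baseline scan")
    if days_since_baseline > 1825:
        return ">5"
    # Integer ceiling of days / 365.
    return str((days_since_baseline + 364) // 365)


for days in (200, 365, 366, 1825, 1826):
    print(days, follow_up_year(days))
```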

| XGBoost machine learning model
XGBoost (Extreme Gradient Boosting) is an ensemble learning algorithm based on gradient-boosted decision trees. The objective function is approximated by a second-order Taylor expansion, which makes the optimization faster and more accurate, and a regularization term is introduced to control the complexity of the model and prevent overfitting. Missing values are handled automatically, which improves the efficiency of the algorithm. When a decision tree is split, features are selected according to the gain they provide. The more often a feature is selected for splitting and the greater its average gain, the more important the variable is considered to be.
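For reference, the second-order approximation and the split gain mentioned above are usually written as follows. This is the standard formulation from the XGBoost literature rather than anything specific to this study; the symbols ($g_i$, $h_i$, $\gamma$, $\lambda$, $T$, $w_j$) are the conventional ones and are introduced here only for illustration.

```latex
% Objective at boosting round t, with loss l expanded to second order around
% the previous prediction \hat{y}_i^{(t-1)} (constant terms dropped):
\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2 \right] + \Omega(f_t),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2

% g_i and h_i are the first and second derivatives of the loss with respect to
% the previous prediction; T is the number of leaves and w_j the leaf weights.

% Gain of splitting a node into left (L) and right (R) children, where
% G = \sum g_i and H = \sum h_i over the samples in each child:
\text{Gain} = \tfrac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda}
            - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma
```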

| Statistical analysis
Questionnaire data were encoded, quantified, and entered into the computer, and statistical analysis was carried out using Stata 11.0 software. The Kolmogorov-Smirnov method was used to test for normality, categorical data were expressed as frequencies and percentages, and the independent-samples t-test and chi-squared test were used to determine differences between groups, with p < 0.05 designated as statistically significant. All data were divided into a training set and a test set in a ratio of 8:2; the XGBoost machine learning model was constructed with Python 3.8. Grid search with cross-validation was used to select the hyperparameters. Data from external institutions were used as an external validation set to evaluate the generalization performance of the model. To evaluate the prediction performance of the model, receiver operating characteristic (ROC) curves were used to calculate the area under the curve (AUC), and the accuracy, sensitivity, and specificity were determined based on a threshold of 0.5.
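The evaluation steps described above (an 8:2 split, then accuracy, sensitivity, and specificity at a 0.5 threshold, plus the AUC) can be illustrated with a dependency-free Python sketch. It is not the authors' code: the helper names are hypothetical, the labels and scores are toy values, and the figure of 518 cases assumes that the 322 benign and 196 malignant cases together formed the modelling data set.

```python
import random


def split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and split them into training and test sets (8:2 by default)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def threshold_metrics(y_true, y_score, threshold=0.5):
    """Accuracy, sensitivity, and specificity when scores >= threshold are called malignant (1)."""
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, y_score):
        predicted = 1 if score >= threshold else 0
        if label == 1 and predicted == 1:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


def roc_auc(y_true, y_score):
    """AUC as the probability that a malignant case scores higher than a benign one (ties count 0.5)."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


train_idx, test_idx = split_indices(518)      # assumed modelling set: 322 benign + 196 malignant
toy_labels = [1, 1, 1, 0, 0, 0]               # toy values, not study data
toy_scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2]
print(len(train_idx), len(test_idx))
print(threshold_metrics(toy_labels, toy_scores))
print(round(roc_auc(toy_labels, toy_scores), 3))
```

The rank-based AUC used here is equivalent to the area under the empirical ROC curve, which keeps the sketch independent of any plotting or machine learning library.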

| Clinical features
A total of 135,503 people underwent physical examination and lung LDCT screening during the study period. The average subject age was 47.96 ± 9.93 years; there were 89,705 males (66.20%) and 45,798 females (33.80%). The baseline LDCT examination of the 135,503 subjects showed that 107,256 persons (79.15%) were negative for nodules. Clinically tentative benign nodules were found in 27,636 cases, accounting for 20.40%. Highly suspicious nodules were found in 611 cases, accounting for 0.45%.
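As a quick arithmetic check, the baseline counts reported above sum to the total cohort and reproduce the stated percentages. The sketch below uses only numbers given in the text; the variable names are illustrative.

```python
TOTAL_SCREENED = 135_503

baseline_counts = {
    "negative": 107_256,
    "clinically_tentative_benign": 27_636,
    "highly_suspicious": 611,
}

# The three baseline categories should account for every screened subject.
counts_sum = sum(baseline_counts.values())

# Percentage of the screened cohort in each category, rounded to two decimals.
baseline_percent = {
    name: round(100.0 * count / TOTAL_SCREENED, 2)
    for name, count in baseline_counts.items()
}

print(counts_sum == TOTAL_SCREENED)
print(baseline_percent)
```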

| LDCT follow-up screening
A total of 27,053 LDCT follow-ups were completed in 135,503 participants (Table 1). The results showed that the proportion of highly suspicious nodules in participants with a negative baseline scan was about 0.12%-0.46%, which was lower than the baseline level except at follow-up of >5 years. During follow-up, 16.5%-27.5% of the participants with clinically tentative benign nodules at baseline turned negative, 71.53%-81.95% remained benign, and the proportion of highly suspicious nodules was higher at follow-up than at baseline screening (0.70% vs. 0.45%) (χ² = 8.09, p = 0.004). However, only in the first year of LDCT re-examination was the proportion of highly suspicious nodules (1.40%) significantly greater than that of baseline screening (0.45%) (p < 0.001), and the proportion of highly suspicious nodules did not differ between the baseline screening and the other follow-up years (p > 0.05). This suggests that for a participant diagnosed with clinically tentative benign nodules at opportunistic screening, it is important to repeat LDCT within 1 year, whereas follow-up beyond 1 year has no greater value than random opportunistic screening.
In total, the proportion of highly suspicious nodules in subjects who completed follow-up was higher than that of baseline screening (0.57% vs. 0.45%) (χ² = 6.74, p = 0.009); it was only in the first year of LDCT re-examination that the proportion of highly suspicious nodules was significantly higher than at baseline (1.24% vs. 0.45%) (χ² = 6.74, p = 0.009). Furthermore, the proportion of highly suspicious nodules did not differ between the baseline screening and the other follow-up years. The above results suggest that if we could accurately find those nodules and distinguish benign from malignant nodules, we could avoid many unnecessary LDCT examinations.
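The comparisons above rest on chi-squared tests of two proportions. The study ran these in Stata; the Python sketch below only illustrates the underlying arithmetic, assuming a Pearson statistic without continuity correction and 1 degree of freedom. The 2 × 2 counts in the example are hypothetical, because the per-year counts from Table 1 are not reproduced in this text.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def p_value_df1(chi2):
    """Upper-tail p-value of a chi-squared statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


# Converting the reported statistics back to p-values.
print(round(p_value_df1(8.09), 3))
print(round(p_value_df1(6.74), 3))

# Hypothetical counts: suspicious vs. not suspicious in two groups.
print(round(chi2_2x2(10, 90, 20, 80), 4))
```

Under these assumptions the reported statistics of 8.09 and 6.74 convert to p-values of about 0.004 and 0.009, in line with the values quoted in the text.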

| Comparison of baseline physical examination results between benign nodule group and lung cancer patients
Among the 611 patients with suspected malignant nodules found in the baseline examination, 196 cases were confirmed as lung cancer by surgery and pathology, predominantly non-small cell lung cancer. There were 174 cases of adenocarcinoma, 10 cases of squamous cell carcinoma, and 12 cases of alveolar carcinoma. Among 27,636 cases of clinically tentative benign nodules found at baseline, 375 cases remained benign or disappeared completely after >5 years of follow-up. The 53 cases with incomplete baseline data were excluded, and 322 cases were included in the benign nodule group (Figure 1). Tables 2 and 3 show comparisons of baseline physical examination results between the lung cancer and benign nodule groups.

| Data mining and machine learning
Since the physical examination results may be clinically related to pulmonary nodule status (i.e., benign vs. malignant), a computational approach was implemented in which all of the above variables were used as inputs and analyzed with data mining technology. A prediction model was established, and the top 15 most important clinical variables determined by the XGBoost algorithm were eosinophil count, age, erythrocyte count, hematocrit, smoking, uric acid (UA), gender, mean corpuscular hemoglobin, total protein, systolic blood pressure, blood creatinine, neutrophils, diastolic blood pressure, mean corpuscular volume, and CA19-9 (Figure 2). The AUC of the model was 0.76 [95% CI: 0.67-0.84], and the accuracy was 0.75; at the threshold of 0.5, the sensitivity and specificity were 0.78 and 0.73, respectively (Figure 3).

| Validation of model using external data
Data used in the validation model were from patients evaluated at Henan Provincial People's Medical Health Examination Center and Sichuan Provincial People's Hospital. The inclusion and exclusion criteria were the same as above. Among the 5146 patients screened in this validation data set, 24 cases were identified as early lung cancer, and the pathological results were adenocarcinoma. Benign nodules were found in 39 cases. The comparison of the 15 clinically relevant variables identified using the XGBoost algorithm in this validation cohort is shown in Table 4. In this validation cohort, the model AUC was 0.87, and the accuracy was 0.80. The sensitivity and specificity were 0.83 and 0.77, respectively. The ROC is shown in Figure 4.

| DISCUSSION
With respect to lung cancer screening, the value of opportunistic screening remains controversial. 31,34 Despite this controversy, LDCT screening has emerged as the most reliable and rigorous method of early lung disease detection among healthy people. 19 A conventional chest CT typically delivers more than a hundred times the radiation dose of a routine frontal and lateral chest X-ray (0.02-0.2 mSv) and is often quoted as 8-10 mSv. However, the total radiation exposure dose of LDCT in our study was ≤1 mSv, which is about 5-50 times that of a chest X-ray. 35 In the present study, baseline LDCT screening identified 107,256 persons (79.15%) who were negative for nodules, 27,636 persons (20.40%) with clinically tentative benign nodules, and 611 patients (0.45%) with highly suspicious nodules. It is notable that most nodules identified by LDCT screening at the baseline examination were determined to be clinically tentative benign nodules, which is consistent with previous reports. 4 The proportion of highly suspicious nodules is markedly higher than the reported incidence of lung cancer in China (28.49/100,000). 36,37 This may be related to the fact that most of the subjects screened were middle-aged and elderly people, with an average age of 47.96 ± 9.93 years. It is also important to note that many of these 611 highly suspicious nodules were determined by pathology not to be lung malignancies.

In the population that was nodule-negative at baseline, the proportion of highly suspicious nodules during follow-up was lower than the baseline level (p < 0.05), reaching 0.46% only at >5 years of follow-up. This suggests that if no pulmonary nodules are found at opportunistic screening, the probability of highly suspicious nodules within the next 5 years is low, and lower than that found by random opportunistic screening. In the population with clinically tentative benign nodules found at baseline screening, the proportion of highly suspicious nodules increased only within the first year of LDCT follow-up, and did not differ between the other follow-up years and the baseline screening.
This suggests that in patients who have had pulmonary nodules for 2-5 years, or even more than 5 years, the results of LDCT re-examination are similar to those of the population receiving a first LDCT screening. These findings are clinically significant given the psychological impact of finding nodules, 10 the economic cost of re-examination 36 and the radiation burden, 38 because they suggest that multiple LDCT re-examinations after 1 year do not significantly increase the value of screening even if nodules are found at the initial screening. More importantly, the findings suggest that if malignant nodules can be identified more accurately at the first LDCT examination, the proportion of highly suspicious nodules found by LDCT re-examination in the first year of follow-up may decline.
Based on this finding, we carried out further analysis on the population with pulmonary nodules by comparing patients with surgically confirmed early lung cancer with those whose nodules disappeared or remained benign after >5 years of follow-up. In the baseline physical examination of these subjects, there were clear differences in age, gender distribution, smoking, environmental exposure, and many blood-borne variables (Tables 2 and 3). However, it is very difficult to determine whether these variables were affected by age and gender distribution, or whether they were truly related to the status of the nodules. Therefore, we used the XGBoost machine learning algorithm to establish a prediction model of benign and malignant nodules using these variables.
The clinical variables included in this study were from the comprehensive physical examination of each subject. Our initial analysis identified statistically significant differences in many variables between patients with early lung cancer and those with benign nodules. However, further analysis using XGBoost showed that only 15 common clinical variables had significant auxiliary predictive value. Moreover, these 15 variables are routinely measured in the clinic and easy to obtain at most institutions. Therefore, our results also provide a good basis for the wide application of this prediction method. Although the exact relationship between these variables and early lung cancer remains unknown at present, our preliminary findings provide an alternative means of evaluating and stratifying lung cancer risk in patients with pulmonary nodules. This method could be useful not only for imaging experts, but may also provide clinicians with an additional tool for deciding on appropriate care for patients with nodules detected by LDCT. The evaluation model can easily be installed on doctors' computers, and entering the 15 variables gives a rapid evaluation. This method may help radiologists improve the accuracy of the diagnosis of pulmonary nodules. Compared with the standard of care, this method can help clinicians give patients clear suggestions, especially when the pulmonary nodules are ambiguous. Generally, the sensitivity and specificity of internal validation are higher than those of external validation. However, the opposite was true in our results. This may be due to differences between the validation populations. The subjects in this study came from a population that had undergone comprehensive physical examination as part of opportunistic screening, and it was very difficult to recruit comparable subjects from other health examination institutions, especially patients with benign nodules. In addition, in the control group, a relatively high proportion of the nodules classified as benign were nodules that had disappeared during follow-up. These factors may explain the different results of the internal test and the external validation.
In recent years, with the popularization of data mining technology and the development of deep learning methods such as convolutional neural networks, the prediction of lung cancer risk from clinical data has become more common. Many studies focus on the prediction of treatment response and prognosis in patients with confirmed lung cancer. [39][40][41][42][43] Some studies have predicted the risk of lung cancer through well-known clinical indicators such as age, smoking history, past tumor history, asbestos exposure, COPD, weight, physical activity, and fasting blood glucose level. 44 Based on the deep learning of image data, [45][46][47][48][49] or with the help of a tracer, [50][51][52][53][54][55][56] it has become easier to determine whether pulmonary nodules are benign or malignant. Furthermore, some researchers have developed computer-aided targeting systems using these technologies. 57 Some studies have also used clinical indicators to determine the risk of benign and malignant pulmonary nodules in patients. 58,59 However, a major limitation of these previous studies is that they have relied on a few indicators that have only been loosely established as having an association with lung cancer risk. Since the body is a highly complex and interconnected system, a clear cause-effect relationship between these commonly used indicators and lung cancer cannot be established, and changes in some of these indicators are easy to ignore. The novelty and importance of the present study is that we have implemented machine learning to generate a risk model for lung cancer using data that are routinely captured in most physical examinations. This model provides clinicians with a valuable tool to assist in the diagnosis of benign and malignant pulmonary nodules.

TABLE 4 Comparison of 15 indicators of external verification data (columns: Category; Early lung cancer group, n = 24; Benign nodule group, n = 39; Statistics).

| Conclusion
For subjects with negative findings or clinically tentative benign nodules at the initial LDCT screening, multiple LDCT re-examinations within 5 years of follow-up did not appear to add clinical value beyond the initial LDCT screening. Further analyses using data mining and machine learning technology found that, for the population with pulmonary nodules identified by LDCT, it was feasible to use 15 variables routinely captured in physical examination to establish a risk model for predicting whether the pulmonary nodules were benign or malignant. The model was concise and effective and was validated using an external data set, which suggests that it is generalizable and could be widely implemented. The limitations of this study include the small sample size of the early lung cancer and benign nodule groups. Additionally, the benign nodules detected with LDCT in this study were those that did not change during follow-up of at least 5 years, but there was no pathological evidence to confirm their benign status. Moreover, the subjects in this study came from a population that had undergone comprehensive physical examination as part of opportunistic screening. As such, this population may not fully represent ordinary, otherwise healthy individuals. Finally, the sample size of the cohort used for external validation was small. Future studies should validate this model in a much larger patient population to further test the generalizability and feasibility of its use as a clinical tool for predicting lung cancer risk.