Developing a multi-institutional nomogram for assessing lung cancer risk in patients with 5–30 mm pulmonary nodules: a retrospective analysis

Background. Diagnosing solitary pulmonary nodules as benign or malignant on the basis of personal experience has several limitations. Therefore, this study aims to establish a nomogram for the diagnosis of benign and malignant solitary pulmonary nodules using clinical information and computed tomography (CT) results.
Methods. We retrospectively collected clinical and CT characteristics of 1,160 patients with pulmonary nodules at Guang'an People's Hospital and the Affiliated Hospital of North Sichuan Medical College between 2019 and 2021. Among these patients, data from 773 patients with pulmonary nodules were used as the training set. We used the least absolute shrinkage and selection operator (LASSO) to select clinical and imaging features and performed a multivariate logistic regression to identify features with independent predictive ability to develop the nomogram model. The area under the receiver operating characteristic curve (AUC), C-index, decision curve analysis, and calibration plot were used to evaluate the performance of the nomogram model in terms of predictive ability, discrimination, calibration, and clinical utility. Finally, data from 387 patients with pulmonary nodules were used for validation.
Results. In the training set, the predictors for the nomogram were gender, density of the nodule, nodule diameter, lobulation, calcification, vacuole, vascular convergence, bronchiole, and pleural traction, selected through LASSO and logistic regression analysis. The resulting model had a C-index of 0.842 (95% CI [0.812–0.872]) and an AUC of 0.842 (95% CI [0.812–0.872]). In the validation set, the C-index was 0.856 (95% CI [0.811–0.901]), and the AUC was 0.844 (95% CI [0.797–0.891]). Results from the calibration curve and clinical decision curve analyses indicate that the nomogram is well calibrated and offers clinical benefit in both the training and validation sets.
Conclusion. The nomogram established in this study showed good efficacy in predicting whether solitary pulmonary nodules are benign or malignant. Such a nomogram may help to guide the diagnosis, follow-up, and treatment of patients.


INTRODUCTION
With the increasing popularity of CT screening and people's increased attention to physical examinations, the detection rate of pulmonary nodules has increased significantly (Herder et al., 2005; Horeweg et al., 2014). In the United States, incidental pulmonary nodules observed on chest CT account for 31% of cases (Gould et al., 2015). Lung cancer is the leading cause of cancer-related death worldwide and is one of the most common cancers (Sung et al., 2021). The five-year survival rate for lung cancer decreases with increasing stage, with a five-year survival rate as high as 92% for stage IA1 (Goldstraw et al., 2016). Therefore, the early diagnosis of malignant pulmonary nodules is particularly important for follow-up treatment and improving patient survival.
The nomogram is built on a multifactorial regression analysis that integrates multiple predictive factors. It transforms the complex regression equation into a visual graph by representing the relationships between the variables as calibrated line segments (She et al., 2017). Due to its readability and convenience, the nomogram has gained more attention and applications in clinical research and medical practice.
International guidelines recommend the use of risk prediction models to assess the risk of lung cancer (Gould et al., 2013), but these models may not be applicable to all populations (Bai et al., 2016). Although some prediction models have been proposed in China in recent years (Li et al., 2011; Liu et al., 2022; Zhou et al., 2022), there is still no unified and clear prediction model that can be used clinically. Therefore, the purpose of this study is to establish and validate an appropriate nomogram for predicting the probability of malignant pulmonary nodules by combining clinical and CT imaging characteristics, with the aim of providing evidence-based support for the clinical management of pulmonary nodules.

Patients
The study retrospectively analyzed patients with pulmonary nodules at the Affiliated Hospital of North Sichuan Medical College and Guang'an People's Hospital between January 2019 and November 2021. Inclusion criteria required pulmonary nodules with a diameter of 5 mm to 30 mm, clear pathological diagnosis (benign nodules required surgical specimens, while malignant nodules required surgical or small biopsy specimens), and CT scan before pathological diagnosis. Exclusion criteria included completely calcified nodules, incomplete clinical information, and previous history of lung cancer. The study was approved by the Medical Ethics Committee of Affiliated Hospital of North Sichuan Medical College with file number 2022ER234-1. Informed consent was not obtained because this was a retrospective analysis where patient identity and privacy were protected. The process of patient selection is shown in Fig. 1.

Statistical analyses
Demographic and radiographic disease characteristics were presented as counts (%), and statistical analysis was performed using R software version 4.1.3. The least absolute shrinkage and selection operator (LASSO) method is a type of regression that incorporates constraints to improve model accuracy. These constraints can prevent overfitting by shrinking the coefficients of less important features toward zero (Nordhausen & Klaus, 2014). LASSO regression is particularly useful in data sets with many features, some of which may not be relevant to the prediction task (Friedman, Hastie & Tibshirani, 2010; Efron et al., 2008). It is used to identify the most important features in a data set and reduce the complexity of the model (Nordhausen & Klaus, 2014). Variables selected by LASSO were included in a multivariate logistic regression to identify independent predictors and develop the final model.
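For reference, the LASSO-penalized logistic regression described above can be written as the following optimization problem. This is the standard textbook form, given here as a sketch; the notation (n patients, p candidate features, outcome y_i = 1 for malignant, tuning parameter λ) is an assumption of this note and is not taken from the study.

```latex
\[
(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0,\,\beta}
\left\{
 -\frac{1}{n}\sum_{i=1}^{n}
 \Bigl[\, y_i\,(\beta_0 + x_i^{\top}\beta)
 - \log\bigl(1 + e^{\beta_0 + x_i^{\top}\beta}\bigr) \Bigr]
 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
\right\}
\]
```

The first term is the average negative log-likelihood of the logistic model, and the second is the L1 penalty. Larger values of λ push more coefficients to exactly zero, which is what makes the method usable for feature selection.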
The calibration of the nomogram was evaluated using calibration curves, which assess the agreement between predicted probabilities and actual observed probabilities. The discriminative ability of the nomogram was assessed by Harrell's C-index and receiver operating characteristic (ROC) curves. Bootstrapping with 1,000 bootstrap resamples was used to validate the nomogram (Kramer & Zimmerman, 2007). Decision curve analysis (DCA) was performed to assess the clinical significance of the nomogram. DCA is a statistical method used to evaluate the clinical utility of diagnostic and prognostic strategies, and it can indicate the net benefit of using the nomogram compared to alternative strategies (Vickers & Elkin, 2006).
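For a binary outcome, the C-index is the proportion of malignant/benign pairs in which the malignant case receives the higher predicted probability, which is the same quantity as the AUC. The sketch below computes it directly from that definition and adds a percentile bootstrap confidence interval. It is illustrative only: the function names and toy data are hypothetical, the study itself used R, and the percentile interval is an assumed choice because the text does not state how the reported intervals were obtained.

```python
import random


def auc(scores, labels):
    """Concordance (C-index) for a binary outcome: the share of
    malignant/benign pairs in which the malignant case scores higher.
    Tied scores count as half-concordant."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case of each class")
    concordant = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(pos) * len(neg))


def bootstrap_auc_ci(scores, labels, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the AUC."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:  # resample lacks one class; draw again
            continue
        stats.append(auc([scores[i] for i in idx], ys))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy data (hypothetical, not from the study).
toy_scores = [0.91, 0.80, 0.75, 0.62, 0.55, 0.40, 0.35, 0.20]
toy_labels = [1, 1, 0, 1, 1, 0, 1, 0]
toy_auc = auc(toy_scores, toy_labels)
toy_ci = bootstrap_auc_ci(toy_scores, toy_labels, n_boot=200)
```

The pairwise loop is quadratic in the number of patients, which is acceptable at the scale of this study but would normally be replaced by a rank-based formula for large data sets.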

Characteristics of patients
A total of 1,160 patients were included in this study, with 830 patients from the Affiliated Hospital of North Sichuan Medical College and 330 patients from Guang'an People's Hospital. Among the patients, the training set consisted of 773 individuals (372 males and 401 females), while the validation set comprised 387 individuals (172 males and 215 females). In the training set, 511 patients (66.11%) were diagnosed with malignant tumors, including 216 males (42.27%) and 295 females (57.73%). In the validation set, 288 patients (74.42%) were diagnosed with malignancies, including 119 males (41.32%) and 169 females (58.68%). A comprehensive overview of patient demographics and CT characteristics for both the training and validation sets is provided in Table 1.

Feature selection
The study investigated potential factors associated with malignancy in solitary pulmonary nodules. Demographic data, including age, gender, smoking history, annual smoking volume, dust exposure history, family history of malignancy, and family history of lung cancer, and CT characteristics, including density of the nodule, nodule diameter, spiculation, edge, lobulation, shape, calcification, cavity, vacuole, vascular convergence, bronchiole, and pleural traction, were evaluated and included in the LASSO regression analysis. Thirteen features with non-zero coefficients were found after screening (Figs. 2A and 2B). These features included age, gender, family history of lung cancer, density of the nodule, nodule diameter, spiculation, lobulation, calcification, cavity, vacuole, vascular convergence, bronchiole, and pleural traction.
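To illustrate why some features end up with coefficients of exactly zero, the sketch below applies the soft-thresholding operator that underlies LASSO shrinkage. It is a simplification: the closed form used here is exact only for squared-error loss with an orthonormal design, whereas the study fitted a penalized logistic model, and the coefficient values are hypothetical.

```python
def soft_threshold(z, lam):
    """LASSO shrinkage operator: move z toward zero by lam, and set it
    to exactly zero when |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_path(unpenalized, lambdas):
    """For each lambda, shrink every coefficient and list the survivors.
    Exact only for an orthonormal design with squared-error loss
    (an assumption of this sketch, not the study's logistic model)."""
    path = []
    for lam in lambdas:
        shrunk = {k: soft_threshold(b, lam) for k, b in unpenalized.items()}
        kept = sorted(k for k, b in shrunk.items() if b != 0.0)
        path.append((lam, shrunk, kept))
    return path


# Hypothetical unpenalized coefficients (not estimates from the study).
toy_coefs = {"diameter": 0.90, "lobulation": 0.55, "edge": 0.04, "shape": -0.02}
toy_path = lasso_path(toy_coefs, lambdas=[0.0, 0.05, 0.60])
for lam, shrunk, kept in toy_path:
    print(f"lambda={lam:.2f} keeps {kept}")
```

As lambda grows, weak features drop out first and the strongest ones last, which is the pattern a coefficient path plot displays.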

Development of an individualized prediction model
In this study, logistic regression was used to identify predictors that could be used to differentiate between malignant and benign solitary pulmonary nodules. Thirteen features were initially analyzed, and the results showed that nine of these features could be used as independent predictors of malignancy: gender, density of the nodule, nodule diameter, lobulation, calcification, vacuole, vascular convergence, bronchiole, and pleural traction. Further analysis using multivariate logistic regression confirmed the importance of these factors, as summarized in Table 2. Finally, a nomogram was developed based on these nine predictors (Fig. 3).
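A nomogram of this kind can be read as a rescaling of the fitted logistic regression. The sketch below shows one common convention, in which the predictor with the widest contribution range spans 0 to 100 points and the total points map back to a probability through the logistic function. The intercept, coefficients, and value ranges are hypothetical placeholders, not the study's estimates, and only three predictors are shown for brevity.

```python
import math

# Hypothetical model for illustration only: the intercept, coefficients
# and value ranges are NOT the study's estimates.
INTERCEPT = -2.0
PREDICTORS = {
    # name: (coefficient on the log-odds scale, min value, max value)
    "diameter_mm": (0.10, 5.0, 30.0),
    "lobulation": (0.56, 0.0, 1.0),
    "calcification": (-1.20, 0.0, 1.0),
}


def contribution_range(beta, lo, hi):
    """Smallest and largest log-odds contribution of one predictor."""
    a, b = beta * lo, beta * hi
    return min(a, b), max(a, b)


# The predictor with the widest contribution range spans 0 to 100 points.
RANGES = {name: contribution_range(*spec) for name, spec in PREDICTORS.items()}
MAX_SPAN = max(high - low for low, high in RANGES.values())


def points(name, value):
    """Points for one predictor value on the 0-100 point scale."""
    beta = PREDICTORS[name][0]
    low = RANGES[name][0]
    return 100.0 * (beta * value - low) / MAX_SPAN


def probability_from_total(total_points):
    """Map total points back to the predicted probability of malignancy."""
    baseline = INTERCEPT + sum(low for low, _ in RANGES.values())
    lp = baseline + total_points * MAX_SPAN / 100.0
    return 1.0 / (1.0 + math.exp(-lp))


# Hypothetical patient: 20 mm nodule, lobulated, no calcification.
patient = {"diameter_mm": 20.0, "lobulation": 1.0, "calcification": 0.0}
total = sum(points(name, value) for name, value in patient.items())
risk = probability_from_total(total)
```

Because the point scale is a linear rescaling of the log-odds, summing points and reading off the probability gives the same answer as evaluating the regression equation directly; the nomogram only makes that calculation possible without a computer.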

Nomogram model validation and clinical use
The calibration curves of the developed nomogram were evaluated in both the training (Fig. 4A) and validation (Fig. 4B) sets. The results showed good agreement between the predicted and actual probabilities, indicating a well-fitting model. In the training set, the C-index was 0.842 (95% CI [0.812-0.872]), and the AUC value (Fig. 4C) was 0.842 (95% CI [0.812-0.872]). The cut-off value of 0.504 yielded a maximum Youden index with a sensitivity of 0.611 and specificity of 0.918. In the validation set, the C-index was 0.856 (95% CI [0.811-0.901]), and the AUC value (Fig. 4D) was 0.844 (95% CI [0.797-0.891]). The highest Youden index in the validation set was observed at a cut-off value of 0.761, with a sensitivity of 0.828 and a specificity of 0.715.
The non-significant difference in AUC (P = 0.952) between the training and validation sets indicated that the discrimination of the nomogram was stable in internal validation. In addition, decision curve analysis (DCA) was performed on both the training (Fig. 4E) and validation (Fig. 4F) sets, indicating that using the nomogram to predict malignancy risk in lung nodules and intervening when the threshold probability is within a reasonable range may provide more benefit than intervening in all or none of the patients.
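The two quantities used above, the Youden-optimal cut-off and the net benefit plotted in a decision curve, can be sketched as follows. The Youden index is sensitivity + specificity - 1, and net benefit at threshold probability pt is TP/n - (FP/n) x pt/(1 - pt), following Vickers & Elkin (2006). Function names and toy data are hypothetical and do not reproduce the study's results.

```python
def confusion(probs, labels, cutoff):
    """Counts when predicted probability >= cutoff is called malignant."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= cutoff and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= cutoff and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < cutoff and y == 1)
    tn = sum(1 for p, y in zip(probs, labels) if p < cutoff and y == 0)
    return tp, fp, fn, tn


def youden_best_cutoff(probs, labels):
    """Return (cutoff, J) maximizing J = sensitivity + specificity - 1.
    Ties are resolved in favor of the lowest cutoff."""
    best = None
    for cutoff in sorted(set(probs)):
        tp, fp, fn, tn = confusion(probs, labels, cutoff)
        j = tp / (tp + fn) + tn / (tn + fp) - 1.0
        if best is None or j > best[1]:
            best = (cutoff, j)
    return best


def net_benefit(probs, labels, threshold):
    """Net benefit of intervening when predicted risk >= threshold."""
    tp, fp, _, _ = confusion(probs, labels, threshold)
    n = len(labels)
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


def net_benefit_treat_all(labels, threshold):
    """Net benefit of intervening in every patient (treat-none is 0)."""
    prevalence = sum(labels) / len(labels)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Toy data (hypothetical, not from the study).
toy_probs = [0.95, 0.85, 0.70, 0.65, 0.45, 0.30, 0.25, 0.10]
toy_labels = [1, 1, 1, 0, 1, 0, 0, 0]
best_cutoff, best_j = youden_best_cutoff(toy_probs, toy_labels)
nb_model = net_benefit(toy_probs, toy_labels, threshold=0.5)
nb_all = net_benefit_treat_all(toy_labels, threshold=0.5)
```

A decision curve repeats the net benefit calculation over a range of thresholds and compares the model's curve with the treat-all and treat-none curves; the model is useful at a given threshold when its net benefit exceeds both.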

DISCUSSION
In this retrospective study, we developed and validated a nomogram based on clinical and imaging features to differentiate early-stage lung cancer characterized by pulmonary nodules using logistic regression analysis. Gender, density of the nodule, nodule diameter, lobulation, calcification, vacuole, vascular convergence, bronchiole, and pleural traction were identified as the most valuable predictors for identifying malignant lung nodules. The nomogram showed good discrimination and robustness in both the training and validation sets. In the training set, the AUC of the nomogram was 0.842, meaning that in approximately 84% of malignant-benign nodule pairs the model assigned the higher predicted probability to the malignant nodule. Robustness was confirmed in the validation set with an AUC of 0.844. The calibration curve showed good agreement between the predicted and actual probabilities in both the training and validation sets, suggesting that the model has a high degree of calibration. Decision curve analysis showed that, when the threshold probability is within a reasonable range, using our nomogram to guide decisions would be more beneficial than either intervening in all patients or intervening in none.
Physicians rely primarily on visual assessment of two-dimensional CT images to evaluate pulmonary nodules, which inevitably leads to individual errors (He et al., 2021). Consequently, differences in personal experience among physicians may lead to controversies in clinical practice. Therefore, it is crucial to establish quantitative relationships between clinical features, radiographic findings, and pathologies to support accurate clinical decision making. Our nomogram quantifies clinical and imaging features and can guide clinicians in evaluating and making treatment decisions for pulmonary nodules.
Numerous studies have demonstrated that lobulation is a reliable factor for differentiating benign from malignant pulmonary nodules (Qi et al., 2020; Chu et al., 2020; Liu et al., 2020). In our study, we also found that lobulation (OR = 1.756, 95% CI [1.013-3.056]) was an independent risk factor for malignancy. Furthermore, our research indicated that gender (OR = 2.489, 95% CI [1.688-3.700]) is another predictor of malignancy in pulmonary nodules, with females having a higher likelihood of developing malignant nodules. This is consistent with recent studies suggesting that lung adenocarcinoma, a more common type of lung cancer than squamous cell carcinoma (Austin et al., 2013), typically affects women more often than men due to differences in disease susceptibility and risk factors (Bai et al., 2016; Kinoshita et al., 2017). Our study also found that solid nodules (OR = 0.414, 95% CI [0.220-0.761]) were a protective factor, whereas mixed ground-glass nodules (OR = 3.269) were a risk factor. This indicates that mixed ground-glass nodules are more likely to be malignant than solid nodules. Research by Henschke et al. (2002) supports this finding, showing that mixed ground-glass nodules had a malignancy rate of 63%, compared to 18% for pure ground-glass nodules and 7% for solid nodules. In addition, the vascular convergence sign, an indicator of malignant tumor growth and metastasis (Nishida et al., 2006; Raghu et al., 2019), was also identified as an independent risk factor for malignancy in our study, consistent with previous research (Huang et al., 2017; Wang et al., 2018).
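The odds ratios quoted above relate to the logistic regression coefficients through the natural logarithm, so an OR above 1 corresponds to a positive coefficient (risk factor) and an OR below 1 to a negative coefficient (protective factor). The worked values below are computed from the ORs reported in the text; the symbol β_j for the coefficient of predictor j is notation assumed for this note.

```latex
\[
\beta_j = \ln(\mathrm{OR}_j), \qquad \mathrm{OR}_j = e^{\beta_j}
\]
\[
\begin{aligned}
\text{lobulation:}\quad & \ln(1.756) \approx 0.563 \\
\text{gender:}\quad & \ln(2.489) \approx 0.912 \\
\text{solid nodule:}\quad & \ln(0.414) \approx -0.882
\end{aligned}
\]
```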
As nodule diameter increases, so does the probability of malignancy (MacMahon et al., 2017). Nodules with a maximum diameter of less than 5 mm rarely have a malignancy rate above 0.4%, while nodules between 5 and 10 mm have a 1.3% probability of being cancerous. Nodules larger than 10 mm have a much higher malignancy rate of 15.2% (Horeweg et al., 2014). In our study, we found that as the diameter of nodules increased between 10 mm and 20 mm, the risk of malignancy increased by approximately 0.6 times for each 1 mm increase in diameter. When nodules exceeded 20 mm in diameter, the risk of malignancy increased by approximately 1.9 times per 1 mm increase. Calcification was identified as a protective factor in our model. This is because calcification is typically associated with the healing of old lesions and represents stable, benign lesions (Gorospe et al., 2020). Features such as diffuse, layered, and central calcifications are highly suggestive of benign lesions, such as inflammatory pseudotumors, whereas only a small percentage of malignant nodules exhibit eccentric or punctate calcifications (Zhou et al., 2021). Our study also included several characteristic signs of malignant pulmonary nodules (Xia et al., 2021; Zhao et al., 2021; Hou et al., 2021).
Our study has several advantages. First, we developed a nomogram to build prediction models, which facilitates the quantification of risks and provides more intuitive results. Second, all nodules included in our study had clear pathologic diagnoses, ensuring that our results were based on accurate data. Third, by screening data from two research centers, our study was able to obtain a larger volume of data and broader coverage, resulting in a prediction model that is more universally applicable. Fourth, our model was developed based on CT and clinical features, which makes the scoring factors for pulmonary nodules simple and easy to obtain. In comparison, other models that incorporate radiomics, ctDNA, and serum tumor markers may be less accessible to clinicians. As such, our model may be easier to disseminate, particularly to outpatient physicians who need a quick and accurate assessment of benign and malignant pulmonary nodules to make informed decisions about diagnosis and treatment.
This study has several limitations. First, our study was retrospective, and further verification of its performance will be required in future prospective studies. Second, although the internal validation of the model showed excellent stability and calibration, the external validation of the model using separate data still needs to be performed in the future. Third, in order to make the model more concise, accessible, and user-friendly, we did not include radiomics, ctDNA, and serum tumor markers in our study. However, the inclusion of these factors may help to improve the accuracy of the model in the future.

CONCLUSIONS
A nomogram based on clinical and CT characteristics was developed and validated in our study to quantify the probability that a pulmonary nodule represents early lung cancer. This nomogram has potential value in the clinical management of pulmonary nodules.

Figure 2 Demographic and clinical feature selection using the LASSO binary logistic regression model. (A) The model coefficient trend lines of the 19 features. Each line on the graph represents a variable, with the vertical axis showing the estimated coefficient of the variable and the horizontal axis showing the tuning parameter log(lambda) sequence. Different lambda values were used to identify different candidate variables, and the specific coefficient for each measured variable was determined using coef(cvfit, s = lambda). The variables with non-zero coefficients were selected. (B) Depiction of the process of selecting optimal parameters by LASSO regression. The abscissa represents the logarithm of the parameter λ, while the ordinate represents the model errors. The numbers at the top of the figures indicate the number of candidate variables for the corresponding lambda value in LASSO regression. A tuning parameter of λ = 0.0087 was considered optimal, and thirteen features with non-zero coefficients were ultimately selected to construct the model. Abbreviations: LASSO, least absolute shrinkage and selection operator. Full-size DOI: 10.7717/peerj.16539/fig-2

Figure 3 Nomogram to predict the probability of pulmonary nodules being malignant in patients. Each variable is represented by a marked line segment on a scale representing its possible range of values, with the length of the line reflecting its contribution to the outcome event. Each variable was assigned a score on the point scale axis, and a total score was calculated by summing these individual scores. By projecting the score to the lower end of the total point scale, we could estimate the probability that the patients' pulmonary nodules were malignant. Full-size DOI: 10.7717/peerj.16539/fig-3