A nomogram model to predict death rate among non-small cell lung cancer (NSCLC) patients with surgery in the Surveillance, Epidemiology, and End Results (SEER) database

This study aimed to establish a novel nomogram prognostic model to predict the probability of death for non-small cell lung cancer (NSCLC) patients who received surgery. We collected data from the Surveillance, Epidemiology, and End Results (SEER) database of the National Cancer Institute in the United States. A nomogram prognostic model was constructed to predict mortality of NSCLC patients who received surgery. A total of 44,880 NSCLC patients who received surgery from 2004 to 2014 were included in this study. Gender, ethnicity, tumor anatomic site, histologic subtype, tumor differentiation, clinical stage, tumor size, tumor extent, lymph node stage, number of examined lymph nodes, number of positive lymph nodes, and type of surgery showed significant associations with the lung cancer-related death rate (P < 0.001). Patients who received chemotherapy and radiotherapy had a significantly higher lung cancer-related death rate but significantly lower non-cancer-related mortality (P < 0.001). A nomogram model was established based on multivariate models of the training data set. In the validation cohort, the unadjusted C-index was 0.73 (95% CI, 0.72–0.74), 0.71 (95% CI, 0.66–0.75) and 0.69 (95% CI, 0.68–0.70) for lung cancer-related death, other cancer-related death and non-cancer-related death, respectively. A prognostic nomogram model was constructed to give information about the risk of death for NSCLC patients who received surgery.


Background
Lung cancer ranks first in both morbidity and mortality in China and worldwide [1,2]. Non-small cell lung cancer (NSCLC) accounts for about 75 to 80% of lung cancer cases, so its treatment is an urgent health issue worldwide.
Radical surgery is indicated for early-stage and some locally advanced NSCLC patients [3]. Survival after surgery varies greatly, and previously reported prognostic factors include age, tumor size, number of metastatic lymph nodes, and clinical stage [4][5][6]. However, the roles of other factors, such as ethnicity, surgical method, primary tumor location, anatomic site, and histological subtype, remain controversial. Therefore, studies with larger samples and more rigorous statistical methods are still needed.
Because some early-stage NSCLC patients who receive radical surgery may have relatively long-term survival, they may die from causes other than lung cancer. However, previous studies have mainly investigated prognostic factors for lung cancer-related death, and studies that consider non-cancer-related death are inadequate.
To better evaluate the prognosis of resected NSCLC patients, and thereby help inform treatment strategies for these patients, we estimated the probabilities of lung cancer-related, other cancer-related, and non-cancer-related death among patients in a population-based Surveillance, Epidemiology, and End Results (SEER) cohort using an innovative and validated nomogram model.

Data source
We collected data from the SEER database of the National Cancer Institute in the United States [7]. The data were obtained using SEER*Stat. Data items and codes were documented by the North American Association of Central Cancer Registries (NAACCR) [8]. Primary cancer histology and site were coded according to the 3rd edition of the International Classification of Diseases for Oncology (ICD-O-3).

Cohort selection
Patients with lung tumors (site codes C34.0-C34.9) diagnosed from 2004 to 2014 were included in this study. The following histologic codes were designated as NSCLC: 8010, 8012, 8013, 8014, 8015, 8020, 8021, 8022, 8031, 8032, 8046, 8050-8052, 8070-8078, 8140-8147, 8250-8255, 8260, 8310, 8323, 8430, 8480, 8481, 8482, 8490, 8560, and 8570-8575. Patients who did not receive radical surgery or who were aged 18 years or younger were excluded. In accordance with the requirements for using the SEER database [9], we obtained the data agreement. Figure 1 displays the flow chart of the patient selection procedure. The SEER database conducted the follow-up for all patients and recorded follow-up time, survival status and survival time, so we could investigate follow-up time and overall survival (OS) for these patients. Records with missing data that could not be used to assess survival status were excluded before statistical analysis.
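The selection rules above can be sketched in a few lines of Python. This is only an illustration of the stated criteria, not the authors' extraction procedure (which used SEER*Stat); the record layout and the field names `site`, `histology`, `age`, `radical_surgery` and `diagnosis_year` are hypothetical.

```python
# Histology codes designated as NSCLC in the text; ranges are inclusive.
NSCLC_SINGLE = [8010, 8012, 8013, 8014, 8015, 8020, 8021, 8022, 8031, 8032,
                8046, 8260, 8310, 8323, 8430, 8480, 8481, 8482, 8490, 8560]
NSCLC_RANGES = [(8050, 8052), (8070, 8078), (8140, 8147),
                (8250, 8255), (8570, 8575)]


def build_nsclc_codes():
    """Expand the single codes and inclusive ranges into one set."""
    codes = set(NSCLC_SINGLE)
    for low, high in NSCLC_RANGES:
        codes.update(range(low, high + 1))
    return codes


NSCLC_CODES = build_nsclc_codes()
LUNG_SITES = {f"C34.{i}" for i in range(10)}  # C34.0-C34.9


def is_eligible(record):
    """Apply the inclusion and exclusion criteria to one record.

    `record` is a dict with hypothetical keys (assumed for illustration):
    site, histology, age, radical_surgery, diagnosis_year.
    """
    return (
        record["site"] in LUNG_SITES
        and record["histology"] in NSCLC_CODES
        and 2004 <= record["diagnosis_year"] <= 2014
        and record["radical_surgery"]
        and record["age"] > 18  # patients aged 18 or younger are excluded
    )


example = {"site": "C34.1", "histology": 8140, "age": 67,
           "radical_surgery": True, "diagnosis_year": 2009}
print(is_eligible(example))
```

Expanding the ranges into a set once keeps the per-record check to a simple membership test.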

Statistical analysis
Demographic and clinical variables used in the analysis included age, gender, ethnicity, primary tumor location, anatomic site, histological subtype, tumor extent, differentiation, clinical stage, tumor size, lymph node involvement, examined lymph nodes (ELNs), positive lymph nodes (PLNs), chemotherapy and radiotherapy. Categorical variables were grouped for clinical reasons, and the decisions regarding grouping were made before data analysis. Means, medians and ranges were reported for continuous variables, as appropriate. Frequencies and proportions were reported for categorical variables.
The primary endpoint of this study was cause-specific survival. According to the cause of death (COD) code, we classified deaths into three groups: lung cancer-related, other cancer-related and non-cancer-related. The cumulative incidence function (CIF) was used to describe death rates and was compared across groups using Gray's test [10]. Fine and Gray competing risks proportional hazards regression was performed to predict five- and ten-year probabilities of the three causes of death [11]. For nomogram construction, two thirds of the patients were randomly assigned to the training data set (n = 31,415) and one third to the validation data set (n = 13,465). We used restricted cubic splines with three knots at the 10, 50, and 90% empirical quantiles to model continuous variables [12]. A model selection technique based on the Bayesian information criterion was employed to avoid overfitting when establishing the competing risk models (eTable S1) [13].
The performance of the nomogram, including its discrimination and calibration, was tested using the validation data set. Discrimination is the ability of a model to separate subjects with different outcomes and is measured by Harrell's C-index [14,15]. Calibration, which compares predicted with actual survival, was evaluated with a calibration plot. We used the validation set to compare the probability of death predicted by the final reduced model with the observed 5- and 10-year cumulative incidence of death. If the model is well calibrated, the predictions fall on the 45-degree diagonal line. In addition, bootstrapping with 1000 resamples was used for internal validation of the developed model.
R software (version 3.3.3; http://www.r-project.org) was used for all statistical analyses. We used the R packages cmprsk, rms and mstate for modeling and developing the nomogram. All reported significance levels were two-sided, with statistical significance set at 0.05.
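To make the cumulative incidence function concrete, the following is a minimal sketch of the standard nonparametric CIF estimator under competing risks. It is an illustration in plain Python, not the authors' analysis (which used R); the function name and the coding of `causes` (0 for censored, positive integers for causes of death) are assumptions made for this example.

```python
def cumulative_incidence(times, causes, cause_of_interest):
    """Nonparametric CIF for one cause in the presence of competing risks.

    times  : follow-up time for each patient
    causes : 0 = censored, 1..K = cause of death (assumed coding)
    Returns a list of (time, CIF) pairs at each distinct death time.
    """
    data = sorted(zip(times, causes))
    n_at_risk = len(data)
    event_free = 1.0  # probability of being alive just before the current time
    cif = 0.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths_all = 0
        deaths_cause = 0
        j = i
        # Collect everyone whose follow-up ends at time t.
        while j < len(data) and data[j][0] == t:
            if data[j][1] != 0:
                deaths_all += 1
                if data[j][1] == cause_of_interest:
                    deaths_cause += 1
            j += 1
        if deaths_all > 0:
            # Increment: P(alive before t) * cause-specific hazard at t.
            cif += event_free * deaths_cause / n_at_risk
            event_free *= 1 - deaths_all / n_at_risk
            curve.append((t, cif))
        n_at_risk -= j - i  # deaths and censored patients leave the risk set
        i = j
    return curve


# Toy data: 1 = lung cancer death, 2 = other death, 0 = censored.
toy_times = [1, 2, 3, 4]
toy_causes = [1, 2, 0, 1]
print(cumulative_incidence(toy_times, toy_causes, cause_of_interest=1))
```

Unlike one minus a Kaplan-Meier estimate that treats competing deaths as censored, the cause-specific curves from this estimator add up to the overall probability of death.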
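As an illustration of discrimination, here is a minimal sketch of the ordinary Harrell C-index for right-censored data. It is a simplified assumption for exposition: with competing risks an adapted concordance measure may be used, and the source does not specify which variant the authors computed. The function name and argument layout are hypothetical.

```python
def harrell_c(times, events, risk_scores):
    """Ordinary Harrell C-index for right-censored data.

    times       : observed follow-up times
    events      : 1 if the event of interest was observed, 0 if censored
    risk_scores : model output; a higher score means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i had the event
            # strictly before patient j's follow-up ended.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # ties in predicted risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy check: risk scores perfectly ordered against survival time.
print(harrell_c([1, 2, 3, 4], [1, 1, 0, 1], [4, 3, 2, 1]))
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering.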

Patient characteristics
A total of 44,880 NSCLC patients who received surgery from 2004 to 2014 were included in this study. Most patients were diagnosed at stage I (62%), were Caucasian (83.5%) and received lobectomy (82.9%). The median age at diagnosis was 67 years. The median follow-up time was 31 months (IQR 12 to 61 months), and for patients who were still alive, the median follow-up time was 42 months (IQR 17 to 74 months). At last follow-up, the death rate was 41.9%: 12,958 patients (28.9%) had died from lung cancer, 510 (1.1%) from other cancers, and 5357 (11.9%) from non-cancer causes. The most frequent other-cancer deaths resulted from miscellaneous malignant cancer (54.5%), brain and other nervous system cancers (6.9%) and pancreatic cancer (3.5%). The most frequent non-cancer deaths resulted from diseases of the heart (28.3%), chronic obstructive pulmonary disease and associated conditions (19.8%) and cerebrovascular diseases (5.8%) (Table 1).
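The reported proportions follow directly from the counts above, as this short arithmetic check shows. The counts are taken from the text; the variable names are only illustrative.

```python
total_patients = 44880
deaths = {"lung cancer": 12958, "other cancer": 510, "non-cancer": 5357}

# Share of the whole cohort dying from each cause, in percent.
shares = {cause: round(100 * n / total_patients, 1) for cause, n in deaths.items()}
overall = round(100 * sum(deaths.values()) / total_patients, 1)

print(shares)
print(sum(deaths.values()), overall)
```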

Survival
Lung cancer-related, other cancer-related and non-cancer-related death probabilities are shown in eFigures S1, S2, S3 and S4. Age at diagnosis, gender, ethnicity, anatomic site, histologic subtype, differentiation status, clinical stage, tumor size, tumor extent, number of examined lymph nodes, and surgery type showed significant relationships with overall survival (P < 0.001) (eTable S2). Five- and 10-year lung cancer-related death probability increased with age, stage, tumor size, tumor extent, lymph node stage, and number of positive lymph nodes (P < 0.001). Male patients had a higher lung cancer-related death rate than female patients (P < 0.001). Ethnicity, histologic subtype, anatomic site, number of examined lymph nodes, differentiation status, and surgery type showed significant relationships with lung cancer-related death probability (P < 0.001). Among NSCLC patients with surgery, those who received chemotherapy and radiotherapy had significantly higher lung cancer-related mortality but significantly lower non-cancer-related death rates (P < 0.001) (Table 2).

Nomogram prognostic model
A nomogram model was established based on multivariate models of the training data set. The 5- or 10-year death rate can be calculated with this nomogram prognostic model (Fig. 2). Schoenfeld-type residuals of a proportional subdistribution hazard model for lung cancer-related deaths are shown in eFigure S5. In the validation cohort, the unadjusted C-index was 0.73 (95% CI, 0.72-0.74), 0.71 (95% CI, 0.66-0.75) and 0.69 (95% CI, 0.68-0.70) for lung cancer-related death, other cancer-related death and non-cancer-related death, respectively. This indicates that the models discriminate reasonably well. Figure 3 illustrates the calibration of the CIF plots. Good agreement between predicted and actual outcomes was observed, as the points lie close to the 45-degree line.

Discussion
[...] were associated with long-term survival for lung cancer patients with surgery. However, the results were heterogeneous because most studies evaluating the prognosis of NSCLC had relatively short follow-up and limited sample sizes. Therefore, larger samples and more rigorous, validated statistical methods were required. The population-based SEER database allows this issue to be assessed in a larger sample with long follow-up, which can help reduce bias. In this study, we collected a large population of 44,880 resected NSCLC patients from the SEER database. Moreover, to minimize bias, we used a novel and validated prognostic model. The nomogram has been considered a trustworthy method for generating more accurate predictions of prognosis [16][17][18]. The performance of a nomogram comprises discrimination and calibration, both of which should be assessed using a validation data set. In our study, the unadjusted C-index in the validation cohort was 0.73 (95% CI, 0.72-0.74), 0.71 (95% CI, 0.66-0.75) and 0.69 (95% CI, 0.68-0.70) for lung cancer-related death, other cancer-related death and non-cancer-related death, respectively. This indicates that the models discriminate reasonably well.
Our study also showed good agreement between predicted and actual outcomes, as the points lie close to the 45-degree line.
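The idea behind comparing points with the 45-degree line can be sketched as follows. This is a deliberately simplified illustration, not the authors' procedure: it assumes every patient's status at the time horizon is known (no censoring), whereas the study compared predictions with the observed cumulative incidence. The function name and the grouping into equal-sized risk groups are assumptions.

```python
def calibration_table(predicted, observed, n_groups=4):
    """Group patients by predicted risk and compare with observed frequency.

    predicted : predicted probability of death by the time horizon
    observed  : 1 if the patient died by the horizon, else 0
                (simplifying assumption: no censoring before the horizon)
    Returns (mean predicted, observed proportion) for each risk group.
    """
    pairs = sorted(zip(predicted, observed))
    n = len(pairs)
    rows = []
    for g in range(n_groups):
        chunk = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        if not chunk:
            continue  # more groups than patients
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        obs_rate = sum(o for _, o in chunk) / len(chunk)
        rows.append((mean_pred, obs_rate))
    return rows


# Toy data: a well-calibrated model gives rows with mean_pred close to obs_rate.
rows = calibration_table([0.1, 0.1, 0.9, 0.9], [0, 0, 1, 1], n_groups=2)
for mean_pred, obs_rate in rows:
    print(f"predicted {mean_pred:.2f}  observed {obs_rate:.2f}")
```

Plotting each row as a point, with mean predicted risk on one axis and observed proportion on the other, gives a calibration plot; points near the diagonal indicate good calibration.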
Our study showed that 5- and 10-year lung cancer-related death probability increased with age, stage, tumor size, tumor extent, lymph node involvement, and number of positive lymph nodes, which is consistent with previous studies [3][4][5][6]. In our study, male patients had a higher lung cancer-related death rate than female patients.
Several studies have demonstrated that epidermal growth factor receptor (EGFR) tyrosine kinase inhibitors (TKIs) can noticeably improve the survival of advanced NSCLC patients with EGFR mutations [19][20][21][22]. EGFR mutation is the most common gene mutation in Asian female lung adenocarcinoma patients, so the prognosis of female lung cancer patients might be better. Our study showed that patients who received radiotherapy had a significantly higher lung cancer-related death rate. Radiotherapy was generally given to patients with more advanced stage or mediastinal lymph node metastasis, and these patients may have had a poor prognosis to begin with. However, the appropriate timing and indications for radiotherapy still need further investigation. Previous studies have mainly investigated lung cancer-related survival in NSCLC patients, and studies addressing other causes of death are limited. In the SEER database, survival status, survival months and cause-specific death classification were available, and deaths from other cancers and from non-cancer causes were also recorded. Therefore, we could calculate lung cancer-related, other cancer-related and non-cancer-related death probabilities using these data. We divided cause of death into lung cancer-related, other cancer-related and non-cancer-related. In our study, the most frequent non-cancer deaths resulted from diseases of the heart, chronic obstructive pulmonary disease and associated conditions, and cerebrovascular diseases. Therefore, cardiac and respiratory complications during treatment require closer monitoring.
There were also some limitations in this study. First, some variables are not recorded in the SEER database, such as time to disease progression and specific chemotherapy regimens. Second, we did not use the 7th or 8th AJCC staging system in this study. We selected patients in the SEER database from 2004 to 2014, and the 6th AJCC staging system was applied to all patients during the decade. The 7th AJCC staging system had not been widely used before 2010, and the 8th AJCC staging system was applied after 2017. Stage information from 2004 to 2010 could not be accessed when using the 7th or 8th AJCC staging system. Given the very large sample size, re-classification of patients was not feasible. However, the classification of stage I to stage III patients did not differ significantly between staging systems, so this is unlikely to have had a significant impact on the study results.

Conclusions
A novel prognostic nomogram model was constructed using a large population-based database to predict mortality for NSCLC patients who received surgery. This validated prognostic model may help provide information about the risk of death for these patients.
Additional file 1: eTable S1. Proportional Subdistribution Hazards Models of Death Rate. eTable S2. Prognostic factors for overall survival by multivariable Cox regression. eFigure S1. Lung cancer related, other cancer related and non-cancer related death rates by