Development and validation of a LASSO prediction model for cisplatin-induced nephrotoxicity: a case-control study in China

Background Early identification of individuals at high risk of cisplatin-induced nephrotoxicity (CIN) is crucial for avoiding CIN and improving prognosis. In this study, we developed and validated a CIN prediction model based on general clinical data, laboratory indicators, and genetic features of lung cancer patients before chemotherapy. Methods We retrospectively included 696 lung cancer patients receiving platinum chemotherapy regimens from June 2019 to June 2021 as the training set and constructed a predictive model using least absolute shrinkage and selection operator (LASSO) regression, cross-validation, and Akaike's information criterion (AIC) to select important variables. We prospectively enrolled 283 independent lung cancer patients from July 2021 to December 2022 as the test set to evaluate the model's performance. Results The prediction model showed good discrimination and calibration, with AUCs of 0.9217 and 0.8288, sensitivity of 79.89% and 45.07%, and specificity of 94.48% and 94.81% in the training and test sets, respectively. Clinical decision curve analysis suggested that the model has value for clinical use when the risk threshold ranges between 0.1 and 0.9. The precision-recall (PR) curve showed that, in the recall interval from 0.5 to 0.75, precision declined gradually with increasing recall while remaining relatively high (up to 0.9). Conclusions Predictive models based on laboratory and demographic variables can serve as a beneficial complementary tool for identifying populations at high risk of CIN.


Introduction and background
Cisplatin and its analogues are widely used in chemotherapy regimens for cancer treatment, with approximately 10-20% of cancer patients receiving such treatment. However, the side effects of cisplatin can lead to dose reduction or the selection of alternative therapies, ultimately affecting prognosis. The lack of effective measures to alleviate side effects such as gastrointestinal problems, hematologic toxicity, neurotoxicity, and ototoxicity can decrease quality of life and increase medical costs [1]. Cisplatin-induced nephrotoxicity (CIN) is a common side effect affecting 20-45% of patients and is the main limitation on the use of cisplatin [2-4]. Chemotherapy itself can cause renal tubular injury, interstitial nephritis, and thrombotic microvascular disease [5]. As cisplatin uptake and excretion are mainly mediated by proximal tubule transporters, its accumulation in renal proximal tubule cells can lead to cell injury [2]. To date, reported risk factors for CIN include advanced age, smoking, type of cancer, comorbidities, baseline blood biochemical levels before chemotherapy (such as creatinine, albumin, and cystatin), exposure to nephrotoxic drugs (such as iodinated contrast agents, long-term use of non-steroidal anti-inflammatory drugs (NSAIDs), and gemcitabine), electrolyte disorders (low serum magnesium levels), alcohol intake, high-dose cisplatin (≥ 50 mg/m2 per dose), frequency of administration, cumulative dose, and insufficient hydration during administration [6,7]. By investigating related pathological mechanisms, such as reactive oxygen species and mitochondrial dysfunction, cell death pathways, inflammatory responses, autophagy, and other related signaling pathways, researchers have identified differences in the genetic characteristics of key genes in CIN [2,8-10]. However, the clinical features, laboratory and genetic results, and weights of risk factors vary across studies, and sensitive and specific biomarkers for predicting CIN are lacking for both genetic and non-genetic factors [11]. These differences may be attributed to genetic variability among research subjects, differences in disease types and protocols, and inconsistencies in laboratory results, research design, and the standardization of data analysis [1,12].
Predictive models have been widely used in diagnosis, treatment, and prognostic evaluation; they integrate multiple non-specific factors and comprehensively assess their weights [13]. Such models may help identify individuals at risk of nephrotoxicity, guide optimal drug and dose selection, and inform prevention strategies. Given the genetic heterogeneity of tumors, it is necessary to construct a prediction model that combines more comprehensive clinical information with specific target gene information for specific tumor types. Candidate-gene studies and genome-wide association studies (GWAS) have identified several genetic risk factors for CIN [7,11]. Okawa et al. [5] developed a prediction model for CIN in elderly prostate cancer patients using a random forest algorithm that incorporated clinical and genomic characteristics extracted from saliva samples. Genomic markers associated with nephrotoxicity are believed to be located in the regions between NAT1, NAT2, CNTN6, and CNTN4. Lung cancer remains the leading cause of cancer-related deaths worldwide, accounting for 30% of all cancer deaths in China [14,15]. In terms of incidence, lung cancer is the most common cancer in China, with a mortality rate of 50% in Chinese males in 2020 [14]. Commonly recognized genetic variants associated with lung cancer and CIN include single nucleotide polymorphisms (SNPs) in genes such as ERCC1, ERCC2, and SLC22A2 [12]. In our study on mitochondrial pathway disorders, we observed a reduced risk of nephrotoxicity in carriers of the T allele of rs920829 in the TRAP1 gene compared to carriers of the C allele (OR 0.684, 95% CI 0.524-0.894, p = 0.003). Consequently, we plan to include SNP features of the ERCC1, ERCC2, SLC22A2, and TRAP1 genes in future research.
The objective of this study was to use LASSO regression to identify suitable clinical and genetic features and to construct and validate a CIN risk prediction model for lung cancer patients.

Study subjects
A retrospective training set was constructed to develop a predictive model for patients with a confirmed lung cancer diagnosis who received a platinum chemotherapy regimen. The training set included 696 patients who were hospitalized at Sichuan Provincial People's Hospital between June 2019 and June 2021, of whom 189 had CIN. A test set of 283 patients with lung cancer receiving a platinum chemotherapy regimen was prospectively and consecutively enrolled from July 2021 to December 2022 in the same hospital. All patients underwent the same preliminary clinical evaluation and treatment observation. The research process is shown in Fig. 1.
Inclusion criteria were as follows: unrelated Han Chinese; receiving carboplatin-based chemotherapy; signed written informed consent; available demographic characteristics, physical examination, and laboratory examinations; pathologically and histologically confirmed lung cancer; normal liver and kidney function before chemotherapy; and no obvious abnormalities in the preliminary clinical evaluation. Exclusion criteria were age < 18 years and liver or kidney dysfunction prior to initial chemotherapy [16]. This study conformed to the provisions of the Declaration of Helsinki (as revised in 2013) and was authorized by the Ethics Committee of Sichuan Provincial People's Hospital, University of Electronic Science and Technology of China Hospital (Registration Number: AF-02/01.0). The chemotherapy regimens are listed in Table 1.

Definitions
Throughout each treatment cycle, toxicology information pertinent to the evaluation of cisplatin therapy (defined using the Common Terminology Criteria for Adverse Events version 5.0) was documented at least twice weekly [17]. Nephrotoxicity was graded as follows: grade 1, creatinine increased by more than 0.3 mg/dL or to 1.5-2.0 times the baseline level; grade 2, 2-3 times the baseline level; grade 3, more than 3 times the baseline level, an absolute level above 4.0 mg/dL, or requiring hospitalization; and grade 4, life-threatening consequences or requiring dialysis [17]. After 2 and 14 cycles, oncologic outcome reporting criteria were used to classify patient responses to treatment into 4 categories: complete response (CR), partial response (PR), stable disease (SD), and progressive disease (PD) [18].
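The grading rule above can be written as a small function. The following is a minimal sketch that only encodes the thresholds quoted in this section; the function name, its arguments, and the handling of boundary values (a ratio of exactly 2.0 is assigned to grade 1 and exactly 3.0 to grade 2) are assumptions made for illustration and are not taken from the study protocol.

```python
def creatinine_toxicity_grade(baseline_mg_dl, current_mg_dl,
                              hospitalized=False,
                              dialysis_or_life_threatening=False):
    """Return a nephrotoxicity grade (0-4) from serum creatinine values.

    Thresholds follow the definitions quoted in the text; boundary handling
    is an illustrative assumption.
    """
    if baseline_mg_dl <= 0:
        raise ValueError("baseline creatinine must be positive")
    ratio = current_mg_dl / baseline_mg_dl
    rise = current_mg_dl - baseline_mg_dl

    if dialysis_or_life_threatening:
        return 4
    if ratio > 3.0 or current_mg_dl > 4.0 or hospitalized:
        return 3
    if ratio > 2.0:
        return 2
    if rise > 0.3 or ratio >= 1.5:
        return 1
    return 0


# Hypothetical values, for illustration only.
for baseline, current in [(1.0, 1.2), (1.0, 1.4), (1.0, 2.5), (1.0, 3.5)]:
    print(baseline, current, creatinine_toxicity_grade(baseline, current))
```

Checking the most severe grade first keeps the branches mutually exclusive, so a patient who meets several criteria is always assigned the highest applicable grade.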

Data collection, preprocessing, and feature variable screening
The definitive diagnosis of CIN and the basic medical history of subjects were exported from the Hospital Information System (HIS) by data collectors, and all relevant laboratory indicators were exported from the laboratory information system (LIS), including complete blood count (SYSMEX XN-10, Sysmex, Japan), coagulation tests (SYSMEX CS-5100, Sysmex, Japan), and biochemical examination (Cobas c702, Roche, Germany) (Table 1). Candidate SNP loci were typed using 48-Plex SNPscan® high-throughput SNP typing technology (18). Thirty samples were randomly selected for double-blind experiments to ensure the repeatability and stability of the genotyping results, and all the genotype calling success rates were greater than 99.0% [19]. For single variables measured multiple times, we retrieved patients' admission records from the HIS for those who underwent cisplatin chemotherapy regimens,

Identification of candidate predictors and construction of prediction models
The prediction model was constructed using multivariate logistic regression based on demographic variables and laboratory panel data [20]. STATA software v15.0 was used to model candidate variables, with the goodness of fit evaluated using Akaike's Information Criterion (AIC) [13,21]. The selection criteria were AIC minimization and candidate variable minimization without affecting predictive efficacy [21].
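As an illustration of AIC-based selection, the sketch below uses the standard definition AIC = 2k − 2 ln L, where k is the number of estimated parameters and L is the maximized likelihood. The candidate models, their log-likelihoods, and their parameter counts are hypothetical and are not the values reported in Table 3.

```python
import math


def bernoulli_log_likelihood(y_true, y_prob):
    """Log-likelihood of binary outcomes under predicted probabilities."""
    return sum(math.log(p) if y == 1 else math.log(1.0 - p)
               for y, p in zip(y_true, y_prob))


def aic(log_likelihood, n_params):
    """Akaike's information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical candidates: name -> (maximized log-likelihood, parameter count).
candidates = {
    "model_a": (-250.0, 12),
    "model_b": (-252.0, 7),
    "model_c": (-270.0, 4),
}

scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

In this toy comparison the 7-parameter model is preferred: its fit is only slightly worse than that of the 12-parameter model, so the parameter penalty decides the ranking.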

Adjustment for model confounders and evaluation of predictive efficacy using training and test set data
Through 10-fold cross-validation, the model with the highest accuracy was selected. Collinearity and interaction analyses were also performed on the candidate predictors. Sensitivity, specificity, positive predictive value, negative predictive value, receiver operating characteristic (ROC) curves, and the C-index were used to assess model discrimination, while calibration curve plots were used to assess consistency [20].
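For a binary outcome, the C-index equals the area under the ROC curve and can be read as the proportion of case-control pairs in which the case receives the higher predicted risk. The sketch below computes it by direct pair counting and converts it to Somers' Dxy with the relation Dxy = 2(C − 0.5) given in the legend of Fig. 6. The labels and scores are hypothetical; only the two AUC values in the last lines come from the text.

```python
def c_index(y_true, y_score):
    """Concordance index for a binary outcome (ties count as one half)."""
    cases = [s for y, s in zip(y_true, y_score) if y == 1]
    controls = [s for y, s in zip(y_true, y_score) if y == 0]
    if not cases or not controls:
        raise ValueError("both outcome classes are required")
    concordant = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                concordant += 1.0
            elif case_score == control_score:
                concordant += 0.5
    return concordant / (len(cases) * len(controls))


def somers_dxy(c):
    """Somers' rank correlation from the C-index: Dxy = 2(C - 0.5)."""
    return 2.0 * (c - 0.5)


# Hypothetical labels and predicted risks, for illustration only.
labels = [1, 1, 0, 1, 0, 0]
risks = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
toy_c = c_index(labels, risks)
print(toy_c, somers_dxy(toy_c))

# Dxy implied by the AUCs reported for the training and test sets.
print(somers_dxy(0.9217), somers_dxy(0.8288))
```

Pair counting is quadratic in the number of subjects, which is acceptable for cohorts of this size; rank-based formulas give the same value more efficiently.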

Statistical analysis
The clinical and laboratory data were analyzed using SPSS software (version 23.0). Quantitative data with normal distribution were analyzed using t-tests or ANOVA, while non-normal quantitative data were analyzed using Mann-Whitney or Kruskal-Wallis nonparametric tests. Count data were analyzed using the chi-square test or logistic regression [16]. Potential predictors were screened using LASSO regression in R version 3.6.1 software. Multi-factor analysis was performed using STATA version 14 software with the logistic regression stepwise selection method, and the model was constructed based on the minimum AIC and the minimum number of predictors. The precision-recall (PR) curve was plotted using the "ggplot2" package in R version 3.6.1 software. A nomogram was used to visualize the prediction model, and decision curves were used to analyze its clinical application value. The incidence of CIN in the Chinese population was approximately 20% [22]. The two-sided significance level was set at 5%, with a test power of 80%. Taking into account a 10% loss to follow-up, the sample size for each group was estimated at approximately 100 cases [23].

Basic information about the study population and clinical characteristics
In total, 979 patients were included in this study, with 696 patients (189 CIN vs. 507 controls) in the training set and 283 patients (71 CIN vs. 212 controls) in the test set.
There was no significant difference in the frequency of CIN between the two sets.Table 1 presents the clinical characteristics of the study subjects, while Table 2 displays the distributions of allele and genotype frequencies of all SNPs.

Model predictor screening
LASSO regression was used to screen variables in the training set, revealing that the optimal subset of nonzero-coefficient variables for inclusion in the model was 36 at the 1sd value of the 10-fold cross-validation error (λ = 0.02185674) and 11 at the minimum value of the 10-fold cross-validation error (λ = 0.006521281), as depicted in Figs. 2 and 3. Model 2 was considered the best model, with the characteristics of the incorporated variables shown in Table 3.

Adjustment for model confounders and evaluation of predictive efficacy
In the adjustment for model confounders, interaction and collinearity were evaluated among the variables included in model 2 using the "corr test" command of STATA software. There was no interaction or collinearity between the predictors (data available on request). Table 4 presents the variables and characteristics that were ultimately included in model 2. The predictive performance of the model is displayed in Table 5 and Fig. 4, while the nomogram based on this prediction model is presented in Fig. 5. The agreement between the predicted and observed actual risk of CIN is compared in Fig. 6, and the clinical decision curve for the CIN prediction model is shown in Fig. 7. The model is deemed clinically valuable when the risk threshold ranges between 0.1 and 0.9.
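Decision curve analysis compares the net benefit of intervening according to the model with that of intervening in all patients or in none. A minimal sketch follows, using the standard definition: net benefit = TP/n − (FP/n) × pt/(1 − pt), where pt is the risk threshold. The outcome labels and predicted risks are hypothetical and are not study data.

```python
def net_benefit_model(y_true, y_prob, threshold):
    """Net benefit of treating patients whose predicted risk >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    weight = threshold / (1.0 - threshold)  # harm of one false positive
    return tp / n - (fp / n) * weight


def net_benefit_treat_all(y_true, threshold):
    """Net benefit of treating every patient regardless of predicted risk."""
    prevalence = sum(y_true) / len(y_true)
    weight = threshold / (1.0 - threshold)
    return prevalence - (1.0 - prevalence) * weight


# Hypothetical cohort of 10 patients (3 events), for illustration only.
outcomes = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
risks = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1, 0.05]

for pt in (0.1, 0.3, 0.5, 0.7):
    print(pt,
          round(net_benefit_model(outcomes, risks, pt), 3),
          round(net_benefit_treat_all(outcomes, pt), 3))
```

The treat-none strategy has a net benefit of zero at every threshold, so a model is considered useful over the range of thresholds where its curve lies above both zero and the treat-all curve.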
Given the class imbalance, we used the precision-recall (PR) curve to assess the model's predictive performance, as shown in Fig. 8. In the recall interval from 0.5 to 0.75, precision gradually declines with increasing recall while remaining relatively high (up to 0.9). Within this range, the model maintains high accuracy in identifying positive samples and minimizing errors. In the recall interval from 0.75 to 0.90, precision drops more rapidly, from 0.9 to 0.60. To improve recall further and identify more positive samples, the model sacrifices more precision, resulting in more false positives. In the recall interval from 0.90 to 1.0, as recall approaches completeness, precision sharply decreases to about 0.10. In the pursuit of complete recall, the model's accuracy significantly diminishes, introducing a large number of false positive predictions.
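The trade-off described above can be reproduced on any set of predicted risks by sweeping the cut-off from high to low and recording precision and recall at each step. The sketch below shows one way to do this; the labels and scores are hypothetical, and the function is an illustration rather than the "ggplot2" workflow used in the study.

```python
def precision_recall_points(y_true, y_score):
    """Return (cut-off, precision, recall) triples, highest cut-off first."""
    total_positive = sum(y_true)
    if total_positive == 0:
        raise ValueError("at least one positive sample is required")
    ranked = sorted(zip(y_score, y_true), key=lambda pair: -pair[0])
    points = []
    tp = fp = 0
    i = 0
    while i < len(ranked):
        cutoff = ranked[i][0]
        # Samples sharing the same score enter the positive call set together.
        while i < len(ranked) and ranked[i][0] == cutoff:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((cutoff, tp / (tp + fp), tp / total_positive))
    return points


# Hypothetical labels and predicted risks, for illustration only.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
pr_points = precision_recall_points(labels, scores)
for cutoff, precision, recall in pr_points:
    print(cutoff, round(precision, 3), round(recall, 3))
```

Lowering the cut-off can only keep or raise recall, whereas precision falls whenever the newly included samples are mostly controls, which is the pattern the paragraph describes at high recall.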

Independent validation
The proposed model's performance was evaluated using the test set data, and its fit was consistent with that of the training set data, as determined by the Hosmer-Lemeshow test (p = 0.4636). The overall predictive performance of the model is illustrated in Table 5, Fig. 4, and Fig. 6.
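The sensitivity and specificity reported for the two sets can be related to confusion-matrix counts. In the sketch below, the counts are an assumption: they were back-calculated from the reported percentages and group sizes (189 CIN and 507 controls in the training set; 71 CIN and 212 controls in the test set) and are not taken from Table 5. The positive and negative predictive values that the code prints are therefore derived quantities, not reported results.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, PPV, and NPV from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Back-calculated counts (assumption, see the note above).
training = diagnostic_metrics(tp=151, fn=38, tn=479, fp=28)
test = diagnostic_metrics(tp=32, fn=39, tn=201, fp=11)

for name, metrics in (("training", training), ("test", test)):
    print(name, {key: round(100 * value, 2) for key, value in metrics.items()})
```

Under these counts, 39 of the 71 CIN cases in the test set fall below the model's cut-off, which is what a test-set sensitivity of 45.07% implies.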

Discussion
This study used machine learning algorithms to construct a CIN prediction model based on clinical, laboratory, and genetic variables. The construction process strictly followed the statement on clinical prediction models: developing the prediction model, validating the prediction model, and evaluating its predictive effectiveness [24]. The model demonstrated good sensitivity and specificity, indicating that combining laboratory and clinical variables can effectively identify populations at high risk of CIN. While the model cannot be used as an independent diagnostic method, it can serve as a supplementary tool because its predictive factors are common, objective, and easily obtainable.
The candidate predictor set included 69 feature variables, 8 of which were genetic. If the genetic variables were treated as dummy variables, the total number of variables would increase to nearly 80. We employed LASSO regression with a 1sd penalty coefficient to consolidate laboratory variables. This method effectively reduced the number of predictors and eliminated unimportant variables. LASSO is a shrinkage estimation method based on model reduction. By constructing a penalty function, the regression coefficients of variables are shrunk accordingly, and the regression coefficients of unimportant variables eventually shrink to zero. Compared with classical screening methods, LASSO can effectively avoid the influence of factors such as different orders of magnitude, different units, and possible collinearity between variables [25]. To screen candidate variables, we opted for LASSO regression over classic single-factor regression, using a penalty coefficient lambda (λ) of 1 standard deviation as the screening parameter to exclude relatively unimportant variables [7,26,27]. The LASSO algorithm was executed using the "glmnet" R package, while the logistic regression model was constructed using the "glm" function in R [20]. Subsequently, we employed multifactor stepwise logistic regression to identify a concise and effective set of variables, which were then fitted into the formula based on their respective weights. This standardized approach to variable selection and weight conversion helps mitigate differences in the same indicator arising from different laboratory methods [13,28].
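The way an L1 penalty drives coefficients to zero can be illustrated with the soft-thresholding operator. The sketch below covers only the special case of least-squares LASSO with orthonormal predictors, where each coefficient is the unpenalized estimate shrunk toward zero by λ and set to zero once its magnitude falls below λ. It is not the penalized logistic fit performed with "glmnet" in the study, and the variable names and starting coefficients are hypothetical.

```python
import math


def soft_threshold(value, lam):
    """Shrink value toward zero by lam; return 0.0 if |value| <= lam."""
    magnitude = abs(value) - lam
    if magnitude <= 0:
        return 0.0
    return math.copysign(magnitude, value)


# Hypothetical unpenalized coefficients on standardized predictors.
unpenalized = {"cys_c": 0.90, "ldh": 0.35, "dbil": -0.20, "noise": 0.05}


def lasso_path(coefficients, lambdas):
    """Coefficients at each penalty level (orthonormal least-squares case)."""
    return {lam: {name: soft_threshold(beta, lam)
                  for name, beta in coefficients.items()}
            for lam in lambdas}


path = lasso_path(unpenalized, [0.0, 0.1, 0.3, 1.0])
for lam, coefs in path.items():
    n_selected = sum(1 for beta in coefs.values() if beta != 0.0)
    print(lam, n_selected, coefs)
```

As λ increases, the weakest predictors drop out first, so in this setting a larger penalty never selects more variables than a smaller one.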
In the training set, the genetic variable rs3212986 of ERCC1 exhibited statistically significant differences in allele frequency and genotype characteristics between the CIN group and the control group. The proportion of A-allele carriers was higher in the CIN group (31.21%) than in the control group (24.92%). The proportions of AA, CA, and CC genotypes were 11.64%, 39.15%, and 49.20% in the CIN group, and 12.03%, 25.64%, and 62.32% in the control group, respectively. These findings suggest that carriers of the A allele of rs3212986 are more likely to develop CIN, which is consistent with previous studies [29]. Similarly, the allele frequency and genotype characteristics of rs920829 of TRPA1 were also statistically different between the CIN group and the control group. The proportion of T-allele carriers was lower in the CIN group (22.75%) than in the control group (28.69%). The proportions of TT, CT, and CC genotypes were 8.46%, 28.57%, and 62.96% in the CIN group, and 16.96%, 23.47%, and 59.57% in the control group, respectively. These results suggest that T-allele carriers of rs920829 are less likely to develop CIN. However, during the optimization of variables through multiple-factor logistic regression, neither rs3212986 nor rs920829 was incorporated. It is possible that these variables lack independent predictive power or that their independent predictive value is not significant enough [30].
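The allele figures quoted above appear to be allele frequencies, which follow from genotype proportions as the homozygote proportion plus half the heterozygote proportion. A minimal check on the CIN-group genotype percentages is sketched below; the function name is hypothetical, and because the inputs are rounded percentages, differences in the second decimal place are expected.

```python
def allele_frequency(homozygous_pct, heterozygous_pct):
    """Allele frequency (%) from genotype percentages."""
    return homozygous_pct + heterozygous_pct / 2.0


# CIN group, rs3212986: AA = 11.64%, CA = 39.15% (A allele reported as 31.21%).
a_allele_cin = allele_frequency(11.64, 39.15)

# CIN group, rs920829: TT = 8.46%, CT = 28.57% (T allele reported as 22.75%).
t_allele_cin = allele_frequency(8.46, 28.57)

print(a_allele_cin, t_allele_cin)
```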
Cystatin-C (Cys-C) was identified as the independent risk factor with the highest odds ratio (OR) in the prediction model, surpassing other factors in predictive performance. The association between increased Cys-C and a high risk of CIN may be explained as follows. First, Cys-C is produced by all nucleated cells in the body. Cys-C in the blood is filtered by the glomerulus, is degraded after reabsorption in the renal tubules, and is not secreted by the renal tubules. This process makes it a more effective indicator of early glomerular filtration function than creatinine, urea nitrogen, and other indicators [31,32]. Second, Cys-C is a member of the cysteine protease inhibitor family, and an imbalance between cathepsins and protease inhibitors may lead to tumor invasion and metastasis, which can also promote an elevation of Cys-C [33,34]. Other factors in the model, such as dbil and LDH, are not traditional renal function indicators and are not related to the cisplatin metabolism pathway, but they may reflect changes in physiological or pathological pathways (such as secretion and excretion, inflammatory response, oxidative stress damage, and electrolyte imbalance) during the occurrence and development of CIN [27]. Therefore, using appropriately weighted models for joint evaluation can aid in the earlier identification of CIN risk.
The model showed high sensitivity and negative predictive value (NPV), which can help to recognize a high risk of CIN and draw clinical attention to the selection of the chemotherapy regimen and the appropriateness of the drug dosage. The results also showed satisfactory discrimination and a prediction curve close to the actual curve, which indicates that the model can provide prediction results that are highly consistent with the actual ones and can identify cases at high risk of CIN. The model had a C-index of 0.922 in the discrimination test on the training set, and the consistency test gave S:p = 0.790, Emax = 0.044, and Eavg = 0.007, suggesting that both the discrimination and the consistency of the model were good. To avoid overfitting of the model due to random and systematic errors, a validation model was constructed from another prospective independent data set. The fit of the model in the test set data was consistent with its fit in the training set data. Further clinical decision curve analysis revealed that the model had good clinical value when the high-risk threshold was between 0.1 and 0.9. Meanwhile, the precision-recall curve showed that, in the recall interval from 0.5 to 0.75, precision gradually declines with increasing recall while remaining relatively high (up to 0.9).
Fig. 7 Clinical decision curves for the established CIN prediction model. The thin blue line is the net benefit of therapeutic intervention for all patients; the thin green line is the net benefit of therapeutic intervention for patients selected on the basis of the statistical model; the thick black line is the net benefit of therapeutic intervention for no patients. The threshold probability on the X-axis and the net benefit on the Y-axis are displayed as ratios. Pr, threshold probability
The prediction model developed in this study has certain limitations. Firstly, it is a single-center study, and although the test set data was prospectively included, the training set data was obtained retrospectively from the electronic medical record system. Consequently, there were unavoidable factors such as missing data, resulting in a final training set of 696 patients, which may limit the model's generalizability and necessitates further multicenter research and external validation. Secondly, the study did not incorporate the latest CIN-related biomarkers, such as malondialdehyde (MDA), NADPH oxidases (NOX), or heme oxygenase 1 (HO-1), which could potentially impact the results [2]. Future research should focus on conducting validation studies across multiple centers to continuously refine and enhance the model and provide guidance for clinical practice.

Conclusion
Predictive models based on laboratory and demographic variables can serve as a beneficial complementary tool for identifying high-risk populations with CIN.

Fig. 1
Fig. 1 Flow diagram of the study population

Fig. 4
Fig. 4 (a) ROC curve of the prediction model built from the training set data. The area under the curve is 0.9217, indicating good discrimination. ROC, receiver operating characteristic. (b) ROC curve established by applying the CIN prediction model in the validation set. The area under the ROC curve is 0.8288, indicating good discrimination

Fig. 5
Fig. 5 CIN prediction model presented as a nomogram

Fig. 6
Fig. 6 (a) Comparison of the agreement between the predicted risk of the CIN prediction model and the observed actual risk of CIN in the training set. The gray straight line at 45° through the origin represents the ideal line; the gray dashed line represents the actual observed value, and the black straight line represents the predicted value according to the logistic model, S:p = 0.790. CIN, cisplatin-induced nephrotoxicity; Dxy, Somers' rank correlation between p and y: Dxy = 2(C-0.5); C, ROC area; ROC, receiver operating characteristic; R2, Nagelkerke-Cox-Snell-Magee R-squared index; D, discrimination index D; U, unreliability index; Q, the quality index; Brier, Brier score (average squared difference in p and y); Emax, maximum absolute difference in predicted and loess-calibrated probabilities; E90, the 0.9 quantile absolute difference in predicted and loess-calibrated probabilities; Eavg, the average quantile absolute difference in predicted and loess-calibrated probabilities; S:Z, the Spiegelhalter Z-test for calibration accuracy; S:P, the two-tailed p value of the Spiegelhalter Z-test

Fig. 8
Fig. 8 Precision-recall (PR) curve of the prediction model. The vertical axis represents precision, the horizontal axis represents recall, and the curve represents the corresponding precision and recall values at different cut-off points

Table 2
The distributions of allele and genotype frequencies of all SNPs

Table 3
Multiple models using multivariate logistic regression for comparison. AIC, Akaike's information criterion; BIC, Bayesian information criterion

Table 3
Candidate predictors were modeled in various ways, and the screening p values, AIC, and BIC are presented in Table 3. Comparison of the models using the "lrtest" command of STATA software revealed that although model 4 and model 8 incorporated fewer variables, their predictive efficacy was reduced (both p < 0.05). The inclusion of rs3212986 as a dummy variable in the predictive factors did not improve the predictive efficiency, as the AIC and the number of predictive factors of the model increased.

Table 4
Variables and characteristics eventually included in the model

Table 5
Performance of the prediction model in the training and test sets