Development of a predictive model for 1-year postoperative recovery in patients with lumbar disk herniation based on deep learning and machine learning

Background: The aim of this study is to develop a predictive model, using deep learning and machine learning techniques, that informs clinical decision-making by predicting the 1-year postoperative recovery of patients with lumbar disk herniation.
Methods: The clinical data of 470 inpatients who underwent tubular microdiscectomy (TMD) between January 2018 and January 2021 were retrospectively analyzed. The dataset was randomly divided into a training set (n = 329) and a test set (n = 141), and 10-fold cross-validation was used during model development. Eight deep learning and machine learning algorithms, namely Random Forests, Extreme Gradient Boosting, Support Vector Machines, Extra Trees, K-Nearest Neighbors, Logistic Regression, Light Gradient Boosting Machine, and the multilayer perceptron (MLP, an artificial neural network), were employed to develop predictive models for the recovery of patients with lumbar disk herniation 1 year after surgery. The cure rate of the lumbar JOA score 1 year after TMD was used as the outcome indicator. The primary evaluation metric was the area under the receiver operating characteristic curve (AUC), with additional measures including decision curve analysis (DCA), accuracy, sensitivity, specificity, and others.
Results: The heat map of the correlation matrix revealed low inter-feature correlation. The predictive models were constructed using 15 variables retained after feature engineering. Among the eight algorithms, the MLP algorithm demonstrated the best performance.
Conclusion: Our findings demonstrate that the MLP algorithm provides superior predictive performance for the recovery of patients with lumbar disk herniation 1 year after surgery.


Introduction
Lumbar disk herniation (LDH) is a common disease and the most frequent cause of back and leg pain, resulting in great suffering such as reduced ability to work and learn, reduced quality of life, and even disability (1). Surgery, especially tubular microdiscectomy (TMD), has become the conventional treatment for LDH in recent years (2). TMD is a minimally invasive method that removes the herniated disk through a posterior approach using microsurgical instruments. However, several factors can affect postoperative recovery (3). A clinical predictive model (CPM) is a statistical model based on multiple disease-related factors that predicts the risk of certain future outcomes in patients with certain characteristics (4,5). Building statistical models requires a large amount of clinical data, and machine learning (ML) algorithms can accurately process the raw data, analyze the connections between important data, and make accurate decisions (6). With the widespread use of machine learning, deep learning, an important branch of machine learning, offers advantages in automatic feature learning and function approximation (7-9). Given the complexity and size of clinical data, combining deep learning and machine learning can improve the accuracy of data processing and of the resulting clinical models and predictions (10,11). The goal of this study is to develop a predictive model based on deep learning and machine learning for the recovery of patients with lumbar disk herniation 1 year after surgery.

Methods
All data for this study were obtained from the Department of Neurosurgery, Fujian Medical University Union Hospital. The study recorded the medical variables of patients who were hospitalized and underwent TMD between January 2018 and January 2021. The data included patients' basic information, medical history, physical examination, preoperative test results, and preoperative scores. The data were analyzed retrospectively, and deep learning and machine learning algorithms were used to establish a predictive model for the 1-year postoperative recovery of patients with lumbar disk herniation.

Inclusion criteria
(1) Age 12-85 years; (2) herniation at L3/4, L4/5, or L5/S1, including combined herniations involving two or three segments; (3) typical sciatica with or without low back pain and other symptoms; (4) failure of standardized conservative treatment for more than 3 months with a serious impact on daily life, or severe pain, cauda equina dysfunction, muscle strength loss, muscle atrophy, or other symptoms; (5) a straight leg raising test on the affected side of less than or equal to 70°; (6) lumbar disk herniation confirmed by CT and MRI, with the location of the herniation matching the corresponding neurological symptoms; and (7) treatment with standardized unilateral paraspinal tubular microdiscectomy (TMD) and a consistent physical therapy regimen (12,13).
For more information about this study and the standardized surgical procedures at our institution, please refer to our previously published study (14).

Exclusion criteria
(1) Those with missing imaging data or unable to complete follow-up as required; (2) those with segmental lumbar instability suggested by frontal and lateral lumbar X-rays and by hyperextension and hyperflexion views; (3) those with other serious physical, psychological, or mental diseases; (4) those with rheumatic immune diseases that may cause similar symptoms; and (5) those participating in other clinical trials.

Data collection
To construct and validate the prognostic model, we retrospectively collected clinical data from patients with LDH who met the inclusion and exclusion criteria. The potential predictors included 42 variables related to patients' medical history, examination, and preoperative test results, with the cure rate of the lumbar Japanese Orthopedic Association (JOA) score 1 year after TMD as the outcome measure.
The following variables were included as factors in the analysis: age, gender, height, weight, body mass index (BMI), high-risk occupation (occupations that require prolonged sitting or high-intensity physical activity), family history (first-degree relatives affected by LDH), history of lumbar trauma, duration of disease, duration of preoperative conservative treatment, duration of preoperative pain medication, low back pain, underlying diseases (hypertension, diabetes), history of smoking, history of alcohol abuse, angle of preoperative physical examination (as measured by the straight leg raise test), sensory impairment, muscle strength classification of the affected limb, Barthel scale, serum creatine kinase (CK), lumbar degeneration, associated lumbar disk herniation, American Society of Anesthesiologists (ASA) grading, Oswestry Disability Index (ODI) score, preoperative low back pain and leg pain numerical rating scale (NRS) scores, the number of surgical segments as determined by the JOA, surgical time, and intraoperative bleeding. These are shown in Table 1. Further details on these factors are provided in Supplementary material 1.

Outcome indicators
The lumbar JOA score at 1 year after TMD surgery was assessed using the same method as before the operation. The cure rate was calculated as follows: cure rate = [(postoperative score − preoperative score) / (29 − preoperative score)] × 100%, where 29 is the full score of the lumbar JOA scale. This rate reflects the improvement of lumbar spine function before and after treatment and is used to evaluate the clinical efficacy of the intervention. A cure rate of 100% indicates complete recovery, while a cure rate greater than 60% is considered significantly effective. Improvement rates within the range of 25-60% are categorized as effective, while those below 25% are classified as ineffective. To process the data, patients with an improvement rate of lumbar JOA score > 60% (significant efficacy or cure) 1 year after TMD were recorded as 1, while patients with an improvement rate of lumbar JOA score ≤ 60% (effective but not significant, or ineffective) were recorded as 0.
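The outcome coding described above can be sketched in a few lines of Python. This is only an illustration: the full score of 29 and the formula are the conventional lumbar JOA recovery-rate definition and are assumed here, and the function names `joa_cure_rate`, `efficacy_category`, and `binary_label` are hypothetical.

```python
JOA_MAX = 29  # assumed full score of the lumbar JOA scale


def joa_cure_rate(pre: float, post: float, full: float = JOA_MAX) -> float:
    """Return the JOA cure (improvement) rate in percent."""
    if pre >= full:
        raise ValueError("preoperative score must be below the full score")
    return 100.0 * (post - pre) / (full - pre)


def efficacy_category(rate: float) -> str:
    """Map a cure rate (percent) to the four efficacy categories in the text."""
    if rate >= 100:
        return "cured"
    if rate > 60:
        return "significantly effective"
    if rate >= 25:
        return "effective"
    return "ineffective"


def binary_label(rate: float) -> int:
    """Outcome used for modeling: 1 if cure rate > 60%, otherwise 0."""
    return 1 if rate > 60 else 0


# Hypothetical patients given as (preoperative, 1-year postoperative) JOA scores.
patients = [(9, 29), (10, 25), (9, 21), (15, 17)]
rates = [joa_cure_rate(pre, post) for pre, post in patients]
labels = [binary_label(r) for r in rates]
```

Note that a rate of exactly 60% falls in the "effective" band and is therefore coded as 0, matching the ≤ 60% rule stated above.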

Feature engineering
Feature engineering is the process of transforming raw data into features that are more suitable for modeling. The resulting features capture relevant patterns and thereby improve the predictive accuracy of machine learning and deep learning models on unseen data (15).
In this study, feature engineering consisted of data preprocessing followed by feature selection. Missing values were filled by mean imputation (16,17), and the data were standardized using Z-score normalization so that all features had a mean of 0 and a standard deviation of 1. Before the features were supplied to the eight predictive algorithms, feature selection was carried out using the Mann-Whitney U test, retaining only those features with p values less than 0.05. To reduce redundancy, a Spearman correlation matrix heatmap was used to identify groups of highly correlated features (|ρ| > 0.9); from each group, one feature was retained to maintain descriptive power and the others were eliminated. The final selection used LASSO regression with 10-fold cross-validation to identify the features with non-zero coefficients.
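The preprocessing and selection steps above could look roughly like the following sketch. It runs on synthetic data, the helper names (`preprocess`, `mann_whitney_filter`, `lasso_select`) are hypothetical, and it is not the study's actual code. The Spearman redundancy filter is omitted here for brevity.

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.linear_model import LassoCV


def preprocess(X):
    """Mean-impute missing values column-wise, then z-score each column."""
    X = np.asarray(X, dtype=float).copy()
    col_mean = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_mean[cols]
    return (X - X.mean(axis=0)) / X.std(axis=0)


def mann_whitney_filter(X, y, alpha=0.05):
    """Keep column indices whose distributions differ between the two outcome groups."""
    keep = []
    for j in range(X.shape[1]):
        _, p = mannwhitneyu(X[y == 1, j], X[y == 0, j], alternative="two-sided")
        if p < alpha:
            keep.append(j)
    return keep


def lasso_select(X, y, cv=10, seed=0):
    """Return column indices with non-zero LASSO coefficients (alpha chosen by CV)."""
    model = LassoCV(cv=cv, random_state=seed).fit(X, y)
    return [j for j, c in enumerate(model.coef_) if c != 0.0]


# Synthetic stand-in for the clinical table: features 0 and 1 carry signal.
rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, 6))
X[:, 0] += 1.5 * y
X[:, 1] -= 1.5 * y
X[rng.integers(0, n, 10), 2] = np.nan  # inject some missing values

Xz = preprocess(X)
kept = mann_whitney_filter(Xz, y)
selected = [kept[j] for j in lasso_select(Xz[:, kept], y)]
```

As in the text, LASSO is fitted here on the 0/1 outcome with a squared-error loss. That is a pragmatic screening step rather than a calibrated classifier.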

Spearman ρ correlation matrix heat map
We conducted a correlation analysis of the data using a Spearman ρ correlation matrix heat map (18). The Spearman correlation is suitable for data that do not conform to a normal distribution, as well as for data that contain categorical variables. It measures the correlation between any two variables, with a value of +1 indicating a total positive correlation, −1 indicating a total negative correlation, and 0 indicating no correlation. The results can be represented visually as a heat map, in which color indicates the magnitude of the correlation, making the results easier and more intuitive to interpret.
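As an illustration of how such a matrix can drive redundancy removal, the sketch below computes Spearman's ρ as the Pearson correlation of column ranks and greedily keeps a feature only if it is not highly correlated with one already kept. The threshold of 0.9 follows the text, while the function names and the toy data are assumptions.

```python
import numpy as np
from scipy.stats import rankdata


def spearman_matrix(X):
    """Spearman rho matrix: Pearson correlation of the column-wise ranks."""
    ranks = rankdata(X, axis=0)
    return np.corrcoef(ranks, rowvar=False)


def drop_highly_correlated(X, names, threshold=0.9):
    """Greedily keep features whose |rho| with every kept feature is <= threshold."""
    rho = spearman_matrix(X)
    kept = []
    for j in range(rho.shape[0]):
        if all(abs(rho[j, k]) <= threshold for k in kept):
            kept.append(j)
    return [names[j] for j in kept], rho


# Toy data: b is a monotone transform of a (rho = 1), c is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = np.exp(a)
c = rng.normal(size=100)
X = np.column_stack([a, b, c])

kept_names, rho = drop_highly_correlated(X, ["a", "b", "c"])
```

The matrix `rho` is what a plotting library would render as the heat map. Which member of a correlated group is retained depends on column order in this greedy variant.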

Machine learning and deep learning
We employed a systematic framework based on machine learning and deep learning to construct prognostic models. To this end, the data were randomly divided in a ratio of 70:30, with 70% (n = 329) of the samples designated as the training set for developing the predictive model and 30% (n = 141) designated as the test set for evaluating the accuracy of the model (19). Once the training set was defined, an optimal model was developed using eight different algorithms: Random Forests, Extreme Gradient Boosting, Support Vector Machines, Extra Trees, K-Nearest Neighbors, Logistic Regression, Light Gradient Boosting Machine, and MLP (Artificial Neural Networks), from scikit-learn (version 0.18) in Python.
To optimize the accuracy of the predictive models, a grid search was conducted on the hyperparameters of each of the eight algorithms. A 10-fold cross-validation was employed, whereby the training data set was divided into 10 equally sized folds; in each iteration, the model was built on nine folds (90% of the training data) and the remaining fold was used to evaluate the model's accuracy. The process was repeated 10 times, so that each fold served once as the validation fold (20,21). The area under the receiver operating characteristic (ROC) curve, also known as the area under the curve (AUC), was used as the primary accuracy metric during the grid search (22). The AUC is a performance measure that evaluates the strengths and weaknesses of the learner and is widely used in clinical settings to assess the performance of ML algorithms on test datasets (23). In addition to the AUC, Accuracy, Sensitivity, Specificity, PPV, NPV, Precision, Recall, and F1 values were also reported to provide a comprehensive picture of each algorithm's performance (22).
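A 70:30 split of 470 samples can be reproduced with scikit-learn as in the sketch below. The data are synthetic, stratification by outcome and the random seeds are assumptions, and only the six algorithms that ship with scikit-learn are instantiated, since Extreme Gradient Boosting and Light Gradient Boosting Machine come from separate packages.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic stand-in: 470 patients, 15 standardized features, binary outcome.
rng = np.random.default_rng(42)
X = rng.normal(size=(470, 15))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=470) > 0).astype(int)

# 70:30 split; stratifying on the outcome is an assumption, not stated in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

models = {
    "RandomForest": RandomForestClassifier(random_state=0),
    "ExtraTrees": ExtraTreesClassifier(random_state=0),
    "SVM": SVC(probability=True, random_state=0),
    "KNN": KNeighborsClassifier(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "MLP": MLPClassifier(max_iter=1000, random_state=0),
}

# Fit each model with default hyperparameters and score it on the held-out set.
test_auc = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_auc[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
```

With 470 samples and `test_size=0.30`, the split yields 329 training and 141 test samples, matching the counts reported above.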
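The tuning and evaluation loop described above might be sketched as follows for the MLP. The hyperparameter grid, the `lbfgs` solver, the 0.5 classification cutoff, and the synthetic data are all illustrative assumptions. The study's actual grids are not given in the text.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(470, 15))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=470) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, random_state=0, stratify=y
)

# Hypothetical grid; each candidate is scored by mean AUC over 10 folds.
param_grid = {"hidden_layer_sizes": [(8,), (16,)], "alpha": [1e-3, 1e-1]}
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
search = GridSearchCV(
    MLPClassifier(solver="lbfgs", max_iter=500, random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=cv,
)
search.fit(X_tr, y_tr)  # refits the best candidate on the full training set

# Evaluate the refitted model on the held-out test set.
prob = search.predict_proba(X_te)[:, 1]
pred = (prob >= 0.5).astype(int)
tn, fp, fn, tp = confusion_matrix(y_te, pred, labels=[0, 1]).ravel()

sensitivity = tp / (tp + fn)   # identical to recall
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)           # identical to precision
npv = tn / (tn + fn)
metrics = {
    "auc": roc_auc_score(y_te, prob),
    "accuracy": (tp + tn) / (tp + tn + fp + fn),
    "sensitivity": sensitivity,
    "specificity": specificity,
    "ppv": ppv,
    "npv": npv,
    "f1": 2 * ppv * sensitivity / (ppv + sensitivity),
}
```

Because PPV equals precision and sensitivity equals recall, the nine reported quantities reduce to seven distinct values.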
The modeling and prediction process for deep learning is similar to traditional machine learning, with the main difference being that deep learning is end-to-end and can automatically extract high-level features, greatly reducing the reliance on feature engineering in traditional machine learning (7).

Statistical analysis
Continuous variables were presented as mean ± standard deviation, while categorical variables were presented as frequencies and percentages. Group comparisons for categorical variables were conducted using the chi-square test or Fisher's exact test, whereas

General
A total of 470 patients meeting the inclusion and exclusion criteria were enrolled in this study. All patients underwent TMD surgery between January 2018 and January 2021 and were followed up for 1 year. To develop the predictive models, 42 variables were collected, including gender, age, BMI, medical history, and preoperative indicators.

Machine learning and deep learning
After data preprocessing and division of the dataset into training and test sets, this study employed eight algorithms to develop the predictive model. Finally, the 15 variables remaining after feature engineering (Figure 2C) were used as inputs to the DL and ML algorithms, including high-risk occupation, preop_ODI, calcification, and 12 other variables. Each algorithm was also subjected to a hyperparameter grid search based on 10-fold cross-validation, and after the optimal hyperparameters were found, the models were used to generate predictions.
As shown in Figure 2 and Table 2, MLP exhibits the highest AUC values (Train AUC = 0.872; Test AUC = 0.840) and also demonstrates superior performance across other metrics such as accuracy (Figure 3A,B). Additionally, Figure 3C illustrates the superior clinical decision-making capability of MLP (represented by the blue curve) at thresholds greater than 40% in the decision curve analysis (DCA), where it demonstrates a higher net benefit than the other machine learning algorithms. The probability calibration curve also supports our decision-making process (Figure 3D). Performance comparisons of each model are detailed in Table 2.
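For readers unfamiliar with decision curve analysis, the net benefit plotted in such a curve is conventionally computed at each threshold probability as TP/n − (FP/n) × pt/(1 − pt). The sketch below implements that standard definition on made-up predictions. The function names and data are hypothetical and do not reproduce Figure 3C.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of acting on patients whose predicted probability >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy: classify every patient as positive."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Made-up outcomes and predicted probabilities for eight patients.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_prob = [0.9, 0.8, 0.45, 0.3, 0.6, 0.1, 0.7, 0.2]

# Evaluate the model and the treat-all reference over a range of thresholds.
thresholds = [0.2, 0.4, 0.5, 0.6]
curve = [(t, net_benefit(y_true, y_prob, t)) for t in thresholds]
nb_model = net_benefit(y_true, y_prob, 0.5)    # 3/8 - 1/8 * 1 = 0.25
nb_all = net_benefit_treat_all(y_true, 0.5)    # 0.5 - 0.5 * 1 = 0.0
```

A model is preferred at a given threshold when its net benefit exceeds both the treat-all reference and zero, the net benefit of treating no one.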

Discussion
In the field of surgical treatment for disk herniation, numerous studies have investigated the efficacy of different surgical approaches. Specifically, research has focused on the differences in treatment outcomes between TMD and other approaches, such as open microdiscectomy (OMD). Studies have demonstrated that TMD and OMD yield comparable treatment outcomes, but TMD has a significant advantage in reducing intraoperative bleeding (24). Additionally, research has shown that TMD and conventional microdiscectomy (CMD) produce similar outcomes 1 year after surgery, with TMD not having any advantage in preventing reoperation or dural tears (25). However, limited discussion has been dedicated to patient recovery 1 year after TMD. This study addresses that gap by using machine learning and deep learning techniques to develop predictive models for patient recovery 1 year after TMD.
Even a limited amount of single-center data can be used for deep learning predictive analysis and may be useful for clinical decision making (26). In that study, a comparison of logistic regression models with deep learning models showed the superiority of deep learning performance. Our prediction results demonstrate the advantages of MLP models, especially in terms of AUC values. LR, RF, and other algorithms nevertheless achieved similar results, which may be related to the small amount of data from a single clinical center.
Logistic regression without regularization may be criticized for underfitting, but L2-regularized logistic regression effectively mitigates the risk of overfitting by incorporating a regularization factor or penalty factor, denoted as λ, which multiplies the sum of the squares of all parameters.This reduces the impact of insignificant parameters on the predictive outcome.
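Written out, the penalized objective referred to above takes the following general form. The notation (θ for the coefficients, n patients, p features, and an unpenalized intercept folded into θᵀx) is assumed here for illustration and is not taken from the study.

```latex
J(\theta) \;=\; -\frac{1}{n}\sum_{i=1}^{n}\Bigl[\,y_i \log \hat{p}_i + (1-y_i)\log\bigl(1-\hat{p}_i\bigr)\Bigr] \;+\; \lambda \sum_{j=1}^{p} \theta_j^{2},
\qquad
\hat{p}_i = \frac{1}{1+e^{-\theta^{\top} x_i}}
```

Larger values of λ shrink all coefficients toward zero, so weakly informative features contribute less to the predicted probability. Setting λ = 0 recovers ordinary logistic regression.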

Figure 1
Figure 1 presents the Spearman ρ correlation matrix heatmap, which is utilized to construct the model's independent variables. A large number of highly correlated features were eliminated.

FIGURE 2 The LASSO and MSE in feature engineering and the 15 variables used as input to the eight algorithms. (A) The least absolute shrinkage and selection operator (LASSO); (B) the 10-fold cross-validated mean squared error (MSE); (C) feature weights: variable-score histogram derived from the LASSO-selected features.

TABLE 1
Descriptive statistics of different influencing factors in a study population grouped by whether the improvement in lumbar JOA score was >60% 1 year after TMD.

TABLE 1 (Continued)
Conservative treatment time; WLPT, Waist leg pain time; SLETA, Straight leg elevation test angle of affected limb; DOS, Disturbance of sensation; MS, Muscle strength; Number, Number of salient segments; SSN, Surgical segment number; Segment, Number of operative segments; Collapsa, Collapse of intervertebral space; LS, Lumbar spondylolisthesis; Calcification, Calcification of ligaments hyperplasia of bone; SD, Sagittal diameter; Position, Sagittal disk herniation horizontal position; Location, Transected herniated disk location; Grade, Grading of transected disk herniation; Numbness after, Numbness in the year after surgery; Reduction of lumbago, Reduction of lumbago NRS 1 year after surgery ≥ 2; Reduction of leg, Reduction of leg pain NRS 1 year after surgery > 2; JOA improvement, JOA improvement rate 1 year after surgery ≥ 25; ODI difference, ODI difference 1 year after surgery > 20; Proximal lumbar process, Proximal lumbar process within 1 year after surgery; Recurrence, Recurrence occurred within 1 year after surgery.