Introduction

Lumbar spinal stenosis (LSS) is one of the most commonly diagnosed spinal pathologies in the United States, with high prevalence among older age groups [1, 2]. LSS is defined as the narrowing of the spinal canal in the lower part of the back. The clinical symptoms of LSS can include low back, lower extremity, and gluteal pain, numbness, a substantial loss of mobility, and severe limitations in one’s ability to perform daily activities [3]. Among degenerative spine diseases, LSS is one of the most common reasons for spinal surgery in patients above the age of 65 years [4]; hence, accurate and reliable diagnosis and subsequent treatment have important economic and social consequences [5]. Anatomically, LSS is characterized by a narrowing of the central canal, lateral/subarticular recesses, or intervertebral foramen, which in turn causes compression of the associated neural structures [6].

While LSS is defined as a clinical syndrome and diagnosed as such, imaging is frequently used as a confirmatory tool prior to determining treatment. Magnetic resonance imaging (MRI) is the mainstay modality for detailed anatomical assessment of the spine and is used to confirm symptom-based clinical diagnoses and to determine the optimal treatment course. Owing to its superior soft tissue contrast, MRI has largely supplanted CT myelogram studies for the evaluation of LSS. A number of studies have attempted to identify core quantitative radiological criteria for the diagnosis of LSS [7,8,9]. However, lumbar spine MRI interpretation is time-intensive and depends on the expertise and experience of the individual radiologist or surgeon; objective and standardized methods to support diagnosis and decision-making are therefore desired.

Machine learning (ML) models, including deep convolutional neural networks (CNN), have already been successfully applied for evaluation of LSS and other degenerative changes with high accuracy in various approaches [10, 11]. However, most CNN algorithms rely on one-component models for binary classification (present/absent) of LSS. One recent work applied a two-component CNN to detect stenosis targeting the central canal, lateral recesses, and neural foramina with subsequent grading of the stenosis [12].

The purpose of this study was to develop a three-stage convolutional neural network (CNN) approach to segment anatomical structures, classify the presence of lumbar spinal stenosis (LSS), and assess its severity on axial and sagittal spine MR images. The classification covers the detection of all three stenosis types: central canal, foraminal, and lateral recess. The performance of the model was compared with that of a panel of radiologist subspecialists to test its reliability and accuracy.

Methods

Data set and annotation

External institutional review board approval was obtained to retrospectively review anonymized imaging data. The initial data set consisted of 1635 MRI studies of adult subjects referred for lumbar spine MRI for low back pain. The data set comprised 45.7% males and 54.3% females, with ages ranging from 18 to 85 years. Each MRI study corresponded to one patient (1635 MRI studies = 1635 patients). Patients with implants or instrumentation, severe scoliosis, or poor image quality were excluded. Each MRI study was acquired using a standard lumbar spine protocol, including T2-weighted axial and sagittal pulse sequences with balanced labels. First, the T2-weighted axial sequence was extracted from each lumbar study. All slices from the lumbar disc levels (L1/L2, L2/L3, L3/L4, L4/L5, L5/S1) were selected from the T2-weighted axial sequence and labeled per slice. Typically, 10 to 15 axial slices were obtained from each study, with an average of approximately 13.3 in this dataset (21,702 images in total).

For axial images, muscle tissue, the discs, spinal canal, thecal sac, neural foramina, nerves, nerve roots, lateral recess, facet joints, spinous process, articular process, ligamentum flavum, disc bulging or herniation, arteries, veins, and kidneys were labeled. On sagittal plane images, the discs, vertebral body, spinal canal, spinal cord with nerve roots, and spinous processes were labeled for segmentation. The segmentation labelling was performed by administrators.

In addition to segmentation of the key anatomical structures, the studies were also graded for LSS by musculoskeletal-trained radiologist subspecialists on a scale of 0 (absent), 1 (mild), 2 (moderate), and 3 (severe) to establish the reference standard [13]. The reference standard was determined by majority vote and, in case of disagreement, adjudicated by a further radiologist. During training, the segmentation model first detected the facet joints and spinal canal; the images were then resized and augmented (horizontal flips and rotations). The dataset of 1635 studies was randomly divided into a training set of 1390 studies for training the CNN weights and a validation set of 245 studies for hyperparameter tuning. The split preserved the same class distribution in each subset and the fixed split ratio, and it was carried out by study rather than by individual image, so that images from the same MRI study could not end up in different subsets (for instance, in both training and testing).
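The study-level splitting rule can be illustrated with a minimal sketch. It is not the authors' code: the function name `split_by_study`, the image-to-study dictionary layout, and the toy identifiers are assumptions made for illustration, and the sketch does not reproduce the class-distribution balancing mentioned above.

```python
import random


def split_by_study(image_to_study, n_val_studies, seed=0):
    """Split image IDs into train/validation so that no study spans both sets.

    image_to_study: dict mapping image ID -> study ID (hypothetical layout).
    n_val_studies:  number of whole studies held out for validation.
    """
    studies = sorted(set(image_to_study.values()))
    rng = random.Random(seed)          # fixed seed for a reproducible split
    rng.shuffle(studies)
    val_studies = set(studies[:n_val_studies])

    train, val = [], []
    for image_id, study_id in image_to_study.items():
        # Every image follows its parent study into exactly one subset.
        (val if study_id in val_studies else train).append(image_id)
    return train, val


# Toy data: 1635 studies with 13 axial slices each (illustrative counts only).
image_to_study = {
    f"study{s:04d}_slice{k:02d}": f"study{s:04d}"
    for s in range(1635)
    for k in range(13)
}
train_images, val_images = split_by_study(image_to_study, n_val_studies=245)
```

Shuffling study identifiers rather than image identifiers is what prevents slices of one patient from leaking between subsets.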

Additionally, an external data set of 150 studies was reserved for final model accuracy assessment. The inclusion/exclusion criteria were identical to those of the training data set. The external validation set was graded on a scale of absent, mild, moderate, or severe by a panel of 7 radiologist subspecialists. The radiologist interpretations were then compared to the interpretation of the model.

The number of images for each stenosis (central, lateral recess, foraminal) for each severity class (absence, mild, moderate, severe) and for each data subset (train, validation, test) is shown in Supplementary Table 1.

Image preprocessing

We employed an in-house tool referred to as the “preprocessor.” This tool accepts an MRI study as input, where it subsequently separates the MRI sequences, identifies the specific body part, determines the projection, and categorizes the imaging type for each sequence. Here, the term “imaging type” refers to the type based on a relaxation time, such as T1-weighted, T2-weighted, and so on.

With the assistance of this tool and the sagittal segmentation model, which outlines the contours of the discs, vertebral bodies, spinal canal, spinal cord with nerve roots, disc material bulge, and sacrum, we can identify the intervertebral disc or vertebral body level that corresponds to a given axial slice.

Neural network model

In this section, we describe the neural network model employed for the evaluation of lumbar stenosis severity. Our approach utilizes a three-stage model, as illustrated in Fig. 1, to assess the severity of three types of lumbar stenosis: central, lateral recess, and foraminal. This model comprises a total of five sub-models: one segmentation model, one binary multilabel stenosis classification model and three stenosis severity classification models (central, lateral recess, and foraminal). The output of our three-stage model is depicted in Fig. 2.

Fig. 1

Schema of the 3-stage neural network approach. It includes the segmentation model (detection of anatomical structures), the binary classification (presence or absence) for central, lateral recess and foraminal stenoses and the severity classification model (mild, moderate, severe)

Fig. 2

Illustration of the segmentation model output, the stenosis binary classification output (presence/absence) and the severity classification model output for mild, moderate and severe central stenoses on axial plane from MR images

Segmentation model

The segmentation model is designed with a U-Net architecture and operates on grayscale MRI axial images as its input. Its primary function is to produce a pixel-wise mask delineating 17 anatomical objects within the lumbar region. These objects include the disc, thecal sac, nerve, neural foramina, lateral recess, spinal canal, herniation, nerve roots, vertebral arch, spinous process, facet joint, articular process, ligamentum flavum, muscle, vein, artery, and kidney.

Binary multilabel stenosis classification model

Running in parallel with the segmentation model, the binary multilabel stenosis classification model adopts the RegNet architecture [18], specifically implementing the RegNetY800MF configuration. This model takes the same input axial image and provides an output indicating the presence or absence of each of the three types of lumbar stenosis: central, lateral recess, and foraminal.

Severity classification models

For each type of lumbar stenosis (central, lateral recess, and foraminal), we employ a separate severity classification model based on the RegNetY32GF architecture. Our approach involves customizing the input data by incorporating specific masks that correspond to the type of stenosis under examination. In the instance of central stenosis, we include masks for the thecal sac and spinal canal. In the case of lateral recess stenosis, we add the lateral recess object mask, while for foraminal stenosis, we integrate the neural foramina object mask. These masks are then combined with the original axial image, serving as essential inputs for our severity classification models. The reason behind this tailored approach is that these anatomical objects exhibit narrowing when the corresponding type of stenosis is present.

During the inference process, the segmentation model and the binary multilabel stenosis classification model run independently on the input image. Subsequently, for each type of stenosis, the binary classification model's prediction dictates whether to proceed with the corresponding severity classification model. Specifically, if the binary classification model predicts the absence of stenosis, the corresponding severity classification model is not utilized. However, if stenosis is predicted, both the source image and the mask output from the segmentation model are provided as inputs to the relevant severity classification model. The severity classification model then predicts the severity level, which can be categorized as mild, moderate, or severe for the respective type of lumbar stenosis.
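The gating logic of the inference process can be sketched as follows. This is a schematic under stated assumptions, not the authors' implementation: `grade_slice`, `MASKS_FOR`, the mask key names, and the stub models are hypothetical, and the real networks (U-Net and RegNet variants) are replaced by plain callables.

```python
# Segmentation masks passed alongside the image for each stenosis type,
# following the description in the text (key names are illustrative).
MASKS_FOR = {
    "central": ("thecal_sac", "spinal_canal"),
    "lateral_recess": ("lateral_recess",),
    "foraminal": ("neural_foramina",),
}


def grade_slice(image, segment, detect, severity_models):
    """Run the three stages on one axial image and return a grade per type."""
    masks = segment(image)    # stage 1: dict of structure name -> mask
    present = detect(image)   # stage 2: dict of stenosis type -> bool
    grades = {}
    for stenosis, mask_names in MASKS_FOR.items():
        if not present[stenosis]:
            # Binary model predicts absence: the severity model is skipped.
            grades[stenosis] = "absent"
            continue
        # Stage 3: source image plus the stenosis-specific masks.
        inputs = [image] + [masks[name] for name in mask_names]
        grades[stenosis] = severity_models[stenosis](inputs)
    return grades


# Stub models standing in for the trained networks (illustration only).
def fake_segment(image):
    return {name: f"{name}_mask" for names in MASKS_FOR.values() for name in names}


def fake_detect(image):
    return {"central": True, "lateral_recess": False, "foraminal": True}


fake_severity = {
    "central": lambda inputs: "severe",
    "lateral_recess": lambda inputs: "mild",
    "foraminal": lambda inputs: "moderate",
}

result = grade_slice("axial_slice", fake_segment, fake_detect, fake_severity)
```

In this toy run the lateral recess severity model is never called, because the binary stage reports no lateral recess stenosis.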

Hyper-parameters of the different models are presented in Supplementary Table 2.

Statistical analysis

The outcomes of the CNN and the radiologists were compared to the reference standard for detection and grading of LSS. For binary classification of LSS, sensitivity, specificity, and area under the receiver operating characteristic curve (AUROC) were calculated. Unlike sensitivity and specificity, AUROC is a measure of prediction performance that does not depend on the discrimination threshold. For the multi-class grading of LSS, these metrics were computed in a one-vs-all setting, where averages were weighted by the number of observations in each class (weighted average metrics).
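To make these definitions concrete, the following plain-Python sketch computes the binary metrics and the class-weighted one-vs-all averages on toy data. It is not the scikit-learn code used in the study; the function names and example values are assumptions for illustration, and multi-class AUROC is omitted because it would require per-class scores.

```python
def sensitivity_specificity(y_true, y_pred):
    """Binary labels (0/1). Requires at least one positive and one negative."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), tn / (tn + fp)


def auroc(y_true, scores):
    """Probability that a random positive outscores a random negative
    (ties count one half); no decision threshold is involved."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def weighted_one_vs_rest(y_true, y_pred, classes):
    """One-vs-all sensitivity/specificity, weighted by class frequency."""
    total = len(y_true)
    sens_avg = spec_avg = 0.0
    for c in classes:
        t = [int(y == c) for y in y_true]
        p = [int(y == c) for y in y_pred]
        sens, spec = sensitivity_specificity(t, p)
        weight = sum(t) / total
        sens_avg += weight * sens
        spec_avg += weight * spec
    return sens_avg, spec_avg


# Toy binary example: presence (1) vs absence (0), thresholded at 0.5.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1, 0.4, 0.05]
preds = [int(s >= 0.5) for s in scores]
sens, spec = sensitivity_specificity(labels, preds)
auc = auroc(labels, scores)

# Toy multi-class example: grades 0 (absent) to 3 (severe).
grades_true = [0, 0, 0, 0, 1, 1, 2, 3]
grades_pred = [0, 0, 0, 1, 1, 1, 2, 2]
w_sens, w_spec = weighted_one_vs_rest(grades_true, grades_pred, classes=[0, 1, 2, 3])
```

Note that the frequency-weighted one-vs-all sensitivity equals the overall fraction of correctly graded cases, which is why specificity and AUROC are reported alongside it.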

Inter-rater agreement between the reference standard and the model and radiologists was assessed by Cohen’s kappa. Level of agreement was defined as follows: 0.0–0.2 (slight), 0.21–0.4 (fair), 0.41–0.6 (moderate), 0.61–0.8 (substantial), 0.81–1.0 (almost perfect).
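As a worked illustration of the agreement measure, the sketch below computes Cohen's kappa from two lists of grades and maps the value to the bands listed above. The function names and toy ratings are assumptions; the "poor" label for negative values is not part of the text above and follows the common Landis and Koch convention.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected).

    Undefined (division by zero) when expected agreement equals 1,
    for example when both raters always give the same single grade.
    """
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from the product of each rater's marginal frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


def agreement_level(kappa):
    """Map a kappa value to the agreement bands given in the text."""
    if kappa < 0:
        return "poor"  # assumption: band not listed in the text
    bands = [(0.2, "slight"), (0.4, "fair"), (0.6, "moderate"),
             (0.8, "substantial"), (1.0, "almost perfect")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    raise ValueError("kappa cannot exceed 1")


# Toy grades (0 = absent ... 3 = severe) for a reference and a second rater.
reference = [0, 0, 1, 1, 2, 3, 0, 1]
rater = [0, 0, 1, 2, 2, 3, 0, 0]
kappa = cohens_kappa(reference, rater)
level = agreement_level(kappa)
```

In this toy case the raters agree on 6 of 8 grades, and chance agreement is 18/64, which gives a kappa of about 0.65.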

All confidence intervals for metrics in the Results section were calculated for the corresponding model. All calculations were implemented in Python (Version 3.7.4) using scikit-learn (Version 1.0.2).

Results

Binary classification

We first evaluated the presence of stenosis as a classification problem (presence vs absence) and looked at each stenosis type (central, lateral recess and foraminal) separately. Sensitivity, specificity and AUROC are presented in Fig. 3. ROC curves are shown in Supplementary Fig. 1.

Fig. 3

Sensitivity, specificity, and AUROC comparisons of the radiologist panel and the model for presence/absence classification of lumbar spinal stenosis

In the case of central canal stenosis, the sensitivity, specificity, and AUROC for the radiologists were 0.786 (95% confidence interval (CI): [0.619, 0.890]), 0.899 (95% CI: [0.878, 0.918]), 0.842 (95% CI: [0.779, 0.905]), respectively, while the metrics for the model were 0.971 (95% CI: [0.847, 0.999]), 0.864 (95% CI: [0.840, 0.885]), 0.963 (95% CI: [0.946, 0.980]). For binary inter-rater agreement, the average Cohen's kappa between the radiologists and the reference standard was fair (κ = 0.372), while the κ value of the CNN model was 0.431 (moderate).

For lateral recess stenosis, the sensitivity, specificity, and AUROC for the radiologists were 0.713 (95% confidence interval (CI): [0.576, 0.804]), 0.898 (95% CI: [0.877, 0.916]), 0.805 (95% CI: [0.756, 0.854]), respectively, while the metrics for the model were 0.853 (95% CI: [0.689, 0.950]), 0.787 (95% CI: [0.759, 0.813]), 0.907 (95% CI: [0.868, 0.945]). For binary inter-rater agreement, the average Cohen's kappa between the radiologists and the reference standard was fair (κ = 0.323), as was the κ value of the CNN model (0.315).

For foraminal stenosis, the sensitivity, specificity, and AUROC for the radiologists were 0.879 (95% confidence interval (CI): [0.806, 0.932]), 0.877 (95% CI: [0.851, 0.901]), 0.878 (95% CI: [0.846, 0.910]), respectively, while the metrics for the model were 0.942 (95% CI: [0.890, 0.975]), 0.844 (95% CI: [0.817, 0.868]), 0.950 (95% CI: [0.934, 0.965]). For binary inter-rater agreement, the average Cohen's kappa between the radiologists and the reference standard was moderate (κ = 0.596), while the κ value of the CNN model was 0.672 (substantial).

Multiclass classification

Next, we compared the performance over the full LSS grading scale of absence, mild, moderate, and severe. Global metrics are shown in Fig. 4, severity class-specific metrics are shown in Supplementary Table 3, and severity class-specific confusion matrices are shown in Supplementary Table 4. To account for class imbalances, the average metrics were weighted depending on the number of observations per class. In the case of central canal stenosis, the sensitivity, specificity, and AUROC for the radiologists were 0.885 (95% confidence interval (CI): [0.855, 0.912]), 0.792 (95% CI: [0.631, 0.892]), 0.838 (95% CI: [0.772, 0.902]), respectively, while the metrics for the model were 0.855 (95% CI: [0.821, 0.884]), 0.969 (95% CI: [0.849, 0.998]), 0.912 (95% CI: [0.878, 0.947]). For inter-rater agreement, the average Cohen’s kappa between the radiologists and the reference standard was 0.376 and 0.310 between the model and the reference standard.

Fig. 4

Sensitivity, specificity, and AUROC comparisons of the radiologist panel and the model for severity classification of lumbar spinal stenosis

For lateral recess stenosis, the sensitivity, specificity, and AUROC for the radiologists were 0.887 (95% confidence interval (CI): [0.859, 0.910]), 0.772 (95% CI: [0.589, 0.811]), 0.804 (95% CI: [0.755, 0.884]), respectively, while the metrics for the model were 0.783 (95% CI: [0.745, 0.816]), 0.856 (95% CI: [0.698, 0.950]), 0.819 (95% CI: [0.755, 0.884]). For inter-rater agreement, the average Cohen’s kappa between the radiologists and the reference standard was 0.359 and 0.199 between the model and the reference standard.

For foraminal stenosis, the sensitivity, specificity, and AUROC for the radiologists were 0.845 (95% confidence interval (CI): [0.801, 0.884]), 0.890 (95% CI: [0.825, 0.937]), 0.868 (95% CI: [0.831, 0.905]), respectively, while the metrics for the model were 0.836 (95% CI: [0.794, 0.872]), 0.943 (95% CI: [0.895, 0.972]), 0.890 (95% CI: [0.861, 0.918]). For inter-rater agreement, the average Cohen’s kappa between the radiologists and the reference standard was 0.620 and 0.637 between the model and the reference standard.

Once again, the model showed performance comparable to that of the average radiologist subspecialist on the panel across all types of stenosis.

Discussion

The purpose of this study was to develop an artificial intelligence approach to classify the presence and severity of stenosis and to demonstrate its efficacy as an accurate and consistent diagnostic tool. We designed a three-stage convolutional neural network (CNN) approach to segment anatomical structures, classify the presence of lumbar spinal stenoses (central, lateral recess, foraminal), and assess their severity on spine MRI. Evaluation on an external data set demonstrated performance comparable to that of a panel of 7 radiologist subspecialists for the detection and severity classification of LSS. For binary diagnosis of central, lateral recess, and foraminal stenosis, the performance of the CNN model was higher than the radiologist average, with AUROCs of 0.963, 0.907, and 0.950 compared with 0.842, 0.805, and 0.878 for the radiologist average, respectively. Likewise, for multi-class severity grading, the model yielded AUROCs of 0.912, 0.819, and 0.890 compared with radiologist averages of 0.838, 0.804, and 0.868 for central, lateral recess, and foraminal stenosis, respectively.

Our neural network was trained with a large dataset (21,702 images), which is often considered sufficient for training a model with high accuracy, in particular in comparison to previous AI methods that were often trained on approximately 500–10,000 images [14, 15]. However, for severe stenoses, the number of images was on the order of several hundred, which led to slightly lower precision.

Previous studies have investigated the performance of various ML models in detecting and grading LSS on lumbar spine MRI. Most of the studies used two experts to assess the models’ performance, while our approach consisted of an expert panel of 7–8 radiologists. Hallinan et al. designed a two-stage deep learning model, using the first model to place regions of interest at the central canal, lateral recesses, and neuroforamina, with subsequent grading by the second model [11]. In contrast to our results for central canal classification, their study showed higher inter-rater agreement but better performance of the radiologists than of the deep learning model for both binary (0.98) and multi-class (0.89) classification. On a larger data set of 7108 lumbar spine MRI studies with axial and sagittal slices, Lu et al. developed a U-Net architecture based on weakly supervised natural language processing labels derived from radiology reports. The accuracy of multi-class LSS (normal-mild-moderate-severe) grading reached an average of 70.6% [16]. Another work trained a two-stage CNN to localize the thecal sac on axial T2-weighted MRI studies at a single motion segment (L4-5) with secondary grading of the localized region [17]. The grading consistency was 77.5% between two expert labels, while the model achieved accuracies of 77.9–83.0% compared to the expert labels. Most recently, Su et al. developed a multi-task classification network on a large data set of axial lumbar spine T2-weighted MRI for grading of disc herniation, foraminal stenosis, and LSS, with grading by two clinicians as the reference standard [18]. The models’ accuracies ranged from 79.7% for the external data set to 87.0% for the internal data set.

Compared to previous works, our three-stage model provides the most comprehensive assessment of lumbar stenosis as it combines segmentation, stenosis classification and severity grading. The approach of combining input from a source axial image with the stenosis-related anatomical object mask output from the segmentation model provides the classification model with precise spatial information and a contextual understanding of how the stenosis-related anatomical object interacts with the surrounding structures. This results in more accurate predictions when contrasted with the approach of solely passing the source image to the classification model [19]. In summary, our three-stage model effectively addresses the limitations associated with traditional approaches by seamlessly integrating both segmentation and classification techniques.

The current study has limitations, many of which stem from the nature of the undertaking. First, there is no “gold standard” or “true north” against which the model can be compared: there is no firmly established consensus on the MRI grading scale definitions, and the ground truth was established by common radiologic grading. This explains the presence of disagreement between the raters. Second, the labor-intensive manual annotation process limits the number of MRI studies that could be used for training of the neural network model. Furthermore, there was no inter-rater consensus established regarding the labeling process. Third, the model relies on MRI alone; combining MRI and CT imaging could enhance the accuracy and sensitivity of neural networks in detecting stenosis caused by osteophytes, since CT helps with the visualization of solid structures, making it valuable for detecting osteophytes and bony abnormalities. Fourth, patients with severe stenosis can experience few symptoms [20], revealing the need for an AI approach based not only on imaging but also on symptoms.

Conclusions

In conclusion, we have demonstrated that the integrated CNN models are able to diagnose and grade LSS on lumbar spine MRI with high accuracy, reliability, and post-hoc interpretability, comparable or superior to expert radiologists, and may function as a supporting diagnostic tool. Combined with the patient’s symptoms and the failure of conservative therapy, this tool could be very valuable for surgical decision-making. In future work, the CNN model should be expanded to assess a broader spectrum of degenerative findings that occur within the natural course of LSS, including stenosis of the subarticular/lateral recesses and/or neuroforaminal stenosis.