Lumbar spine segmentation in MR images: a dataset and a public benchmark

This paper presents a large publicly available multi-center lumbar spine magnetic resonance imaging (MRI) dataset with reference segmentations of vertebrae, intervertebral discs (IVDs), and spinal canal. The dataset includes 447 sagittal T1 and T2 MRI series from 218 patients with a history of low back pain and was collected from four different hospitals. An iterative data annotation approach was used by training a segmentation algorithm on a small part of the dataset, enabling semi-automatic segmentation of the remaining images. The algorithm provided an initial segmentation, which was subsequently reviewed, manually corrected, and added to the training data. We provide reference performance values for this baseline algorithm and nnU-Net, which performed comparably. Performance values were computed on a sequestered set of 39 studies with 97 series, which were additionally used to set up a continuous segmentation challenge that allows for a fair comparison of different segmentation algorithms. This study may encourage wider collaboration in the field of spine segmentation and improve the diagnostic value of lumbar spine MRI.


Background & Summary
Low back pain (LBP) causes the largest burden of disease worldwide, with the most years lived with disability of any disease.[1] As a consequence, lumbar spine magnetic resonance imaging (MRI) for LBP is one of the most frequently used procedures within musculoskeletal imaging.[2] In the United States, 93% of lumbar MRI referrals were appropriate according to the American College of Radiology guidelines, even though only 13% of the scans contributed to clinical decision making.[3] Automatic image analysis might be the key to improving the diagnostic value of MRI by enabling more objective and quantitative image interpretation. A first step toward automatic assessment of lumbar spine MRI is segmentation of relevant anatomical structures, such as the vertebrae, intervertebral discs (IVDs) and the spinal canal.
With recent advances in machine learning and artificial intelligence (AI), state-of-the-art spine segmentation algorithms are generally learning-based and require well-curated training data. The development of vertebra segmentation algorithms for CT images has benefitted considerably from multiple large publicly available datasets with CT images and reference segmentations.[4,5] Currently, no comparably large, high-quality datasets are available for lumbar spine MRI. Existing datasets are either small, segment only the vertebral body [6,7], or are annotated only in the midsagittal slice (2D) [8,9]. Moreover, most datasets cover only one of the anatomical structures most relevant for assessing multifactorial disorders such as LBP, i.e., only the vertebrae [10,11,12,13,14] or the IVDs [15,16,17].
To advance the development of segmentation algorithms, and ultimately automatic image analysis, for lumbar spine MRI, this study has three primary goals:
1. To present a large multi-center lumbar spine MR dataset with reference segmentations of vertebrae, IVDs and spinal canal.
2. To introduce a continuous lumbar spine MRI segmentation challenge that allows algorithm developers to submit their models for evaluation.
3. To provide reference performance metrics for two algorithms that segment all three spinal structures automatically: a baseline AI algorithm, which was used in the data collection process, and nnU-Net, a popular algorithm for 3D segmentation tasks for which training and inference code is publicly available.
In all included MRI series, all visible vertebrae (excluding the sacrum), intervertebral discs, and the spinal canal were manually segmented. The segmentation was performed by a medical trainee who was trained and supervised by both a medical imaging expert and an experienced musculoskeletal radiologist. Three-dimensional MRI annotation is a complex and laborious task, especially for the vertebral arch of the lumbar vertebrae. Therefore, we used an iterative data annotation approach in which our automatic baseline segmentation method (baseline 1: iterative instance segmentation) was trained on a small part of the dataset, enabling semi-automatic segmentation of the remaining images. During semi-automatic segmentation, the automatic method was used to obtain an initial segmentation, which was subsequently reviewed and manually corrected. This process was repeated several times by retraining the automatic segmentation model until the entire dataset was annotated.
Initially, twenty randomly selected high-resolution T2 (SPACE) series from the university medical center data were manually annotated using 3D Slicer version 5.0.3 [18]. All structures were segmented in their entirety, which for the vertebrae also includes the vertebral arch. This was done because the vertebral arch is essential in the diagnosis of disorders such as foraminal stenosis, facet joint arthrosis, and spondylolysis. The initial manual annotations were performed only on high-resolution series because their near-isotropic resolution enables detailed viewing in the sagittal, axial and coronal directions. Annotations of the corresponding standard sagittal T1 and T2 images were obtained by resampling the T2 SPACE segmentations to the resolution of the T1 and T2 images. The resampled segmentations were reviewed for misalignment due to patient movement between the acquisitions and corrected if needed. All other segmentations were created by first generating initial segmentations with the automatic segmentation method trained on already annotated data, followed by review and manual correction in 3D Slicer.
The vertebrae were not given anatomical labels, since accurately determining the anatomical type of a vertebra requires information from multiple MRI planes, including axial and coronal views in addition to sagittal views. These additional views are essential for accurate identification of the ribs, which is needed to determine the lowest thoracic vertebra and correctly label the lumbar levels. As only sagittal views were available for the majority of studies in this dataset, accurate anatomical labeling of vertebrae was considered infeasible. Therefore, the reference segmentations provided in this dataset are labeled from the bottom up, with the most caudal vertebra (usually L5) labeled as 1.
The dataset was divided into a training set (179 of 218 studies, 82%) and a validation set (39 of 218 studies, 18%). This split was used during training of the iterative instance segmentation algorithm; however, it is not mandatory to maintain the same training and validation split when using this dataset. Series belonging to the same patient were always placed in the same set.

Baseline 1: Iterative instance segmentation
By presenting this baseline algorithm, we establish a reference point for evaluating performance and give users an understanding of the algorithm employed in generating the dataset. This section summarizes the iterative instance segmentation (IIS) method. An automatic AI-based segmentation algorithm for vertebra segmentation [14] was extended to also segment the IVDs and the spinal canal. This algorithm uses a 3D patch-based iterative scheme to segment one vertebra and its corresponding inferior IVD at a time, together with the segment of the spinal canal covered by the image patch. A schematic image of the network architecture is shown in Figure 1.

Instance memory
Because the MR volume is segmented by consecutively analyzing 3D patches, one vertebral level at a time, a method is needed to keep track of the progress. An instance memory volume saves the structures that have already been segmented and serves as an extra input channel that reminds the network which structures can be ignored because they are already segmented. In contrast to the original vertebra-focused method, we introduced separate memory state volumes for the vertebrae, IVDs, and the spinal canal. The spinal canal memory state is only used to save the segmentation progress, not as an extra input for the network, since the spinal canal is an elongated structure that cannot be covered by a single patch. Therefore, the network is trained to always segment any visible portion of the spinal canal, which is then stitched together across all patches that are fed through the network. In total, the network has three input channels: the two memory states and the corresponding image patch.
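As an illustration, the three-channel network input for a single patch can be thought of as a stack of the image patch and the two memory states. The shapes and variable names below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

# Patch dimensions at 2 x 0.6 x 0.6 mm resolution (z, y, x), as described
# in the network architecture section.
patch_shape = (64, 192, 192)

image_patch = np.random.rand(*patch_shape).astype(np.float32)
vertebra_memory = np.zeros(patch_shape, dtype=np.float32)  # already-segmented vertebrae
ivd_memory = np.zeros(patch_shape, dtype=np.float32)       # already-segmented IVDs
# Note: a spinal canal memory state also exists, but it only tracks
# stitching progress and is not fed to the network.

# Channel-first stack: the three input channels of the network.
network_input = np.stack([image_patch, vertebra_memory, ivd_memory], axis=0)
print(network_input.shape)  # (3, 64, 192, 192)
```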

Network architecture
The segmentation approach is based on a single 3D U-net-like fully convolutional neural network. Unlike the vertebra segmentation algorithm described in the original paper [14], a patch size of 64 x 192 x 192 voxels with a resolution of 2 x 0.6 x 0.6 mm was used, as the created dataset contains sagittal MR images exclusively. These generally have a higher slice thickness than the data used by Lessmann et al. [14] This configuration achieves a higher in-plane resolution of the predicted segmentation while still ensuring the patch is large enough for a vertebra to fit completely within one patch. The network has three output channels, one for each anatomical structure.

Iterative segmentation approach
The patch-based scheme is structured such that only relevant parts of the MR volume are processed. The patch systematically moves through the image until it finds a fragment of the first vertebra, in this case always the lowest vertebra. Subsequently, the patch moves to the center of mass of that fragment, after which a new segmentation is made. This process continues until the vertebra's volume stabilizes, which means that the detected vertebra is completely visible within the patch. Binary masks of that vertebra, its underlying IVD, and the spinal canal are then added to their respective memory states. The same patch is segmented again with the updated memory states as input, which causes a fragment of the next vertebra to be segmented. This iterative process, illustrated in Figure 2, continues until no more vertebra fragments are detected or the top of the MR volume is reached.
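The looping mechanism can be illustrated with a deliberately simplified, runnable toy: a 1-D "spine" and a stand-in model that segments the lowest structure not yet present in the instance memory. All names and the 1-D setup are illustrative only, not the authors' code.

```python
import numpy as np

def toy_model(image, memory):
    """Stand-in for the network: return a mask of the most caudal structure
    that is not yet recorded in the instance memory, or None if none remain."""
    candidates = np.unique(image[(image > 0) & (memory == 0)])
    if candidates.size == 0:
        return None
    return (image == candidates.min()).astype(int)

def iterative_segmentation(image):
    """Segment instances one at a time, committing each to the memory so the
    next pass moves on to the next structure (cf. the iterative scheme)."""
    memory = np.zeros_like(image)
    instances = []
    while True:
        mask = toy_model(image, memory)
        if mask is None:        # no more fragments detected: stop
            break
        memory[mask == 1] = 1   # commit the finished structure to memory
        instances.append(mask)
    return instances

# Five "vertebrae" stacked bottom-to-top; label 1 = most caudal.
spine = np.array([0, 1, 1, 0, 2, 2, 0, 3, 3, 0, 4, 4, 0, 5, 5, 0])
print(len(iterative_segmentation(spine)))  # 5
```

The key design point carried over from the real method is that the memory state, not the network, encodes which instance comes next, so a single model can segment any number of vertebral levels.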

Completeness and label prediction
The most cranial vertebra is often only partially visible within the field of view of the MR image. The segmentation method therefore includes an additional compression path after the compression path of the U-net, which outputs a single binary value predicting the completeness of a vertebra. The original vertebra segmentation method also contained a similar compression path for predicting the anatomical label. However, this output was not used in our experiments since no accurate anatomical labels regarding lumbosacral transitional vertebrae were present in our dataset.

Training of the algorithm
Preprocessing of the images consisted of resampling to a standard resolution of 2 x 0.6 x 0.6 mm and reorientation into axial slices. Standard data augmentation steps were applied, such as random elastic deformation, addition of random Gaussian noise, random Gaussian smoothing, and random cropping along the longitudinal axis. The loss function used during training consisted of three parts: (1) the segmentation error, defined as the weighted sum of false positives and false negatives combined with the binary cross-entropy loss; (2) the labeling error, defined as the absolute difference between the predicted label and the ground truth; and (3) the completeness classification error, defined as the binary cross-entropy between the true label and the predicted probability.
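A minimal NumPy sketch of a three-part loss of this form follows. The weight values, function names and the uniform summation of the three terms are assumptions for illustration, not the authors' exact implementation:

```python
import numpy as np

def bce(p, t, eps=1e-7):
    """Binary cross-entropy between predicted probabilities p and targets t."""
    p = np.clip(p, eps, 1 - eps)
    return -(t * np.log(p) + (1 - t) * np.log(1 - p)).mean()

def total_loss(seg_pred, seg_true, label_pred, label_true,
               complete_pred, complete_true, w_fp=1.0, w_fn=1.0):
    # (1) Segmentation: weighted false positives / false negatives plus BCE.
    fp = (seg_pred * (1 - seg_true)).sum()
    fn = ((1 - seg_pred) * seg_true).sum()
    seg_loss = w_fp * fp + w_fn * fn + bce(seg_pred, seg_true)
    # (2) Labeling: absolute difference between predicted and true label.
    label_loss = abs(label_pred - label_true)
    # (3) Completeness: BCE between true label and predicted probability.
    complete_loss = bce(np.array([complete_pred]), np.array([complete_true]))
    return seg_loss + label_loss + complete_loss

perfect = np.array([0.0, 1.0, 1.0, 0.0])
print(total_loss(perfect, perfect, 3, 3, 1.0, 1.0))  # ~0 for a perfect prediction
```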

Baseline 2: nnU-Net
In addition to adapting a segmentation method that was specifically developed for vertebra segmentation, reference results for nnU-Net are provided. nnU-Net is a self-configuring, deep learning-based framework for medical image segmentation.[19] It has been widely accepted in the medical image analysis community as a state-of-the-art approach to 3D image segmentation tasks after winning the Medical Segmentation Decathlon [20] and performing well in several other segmentation challenges. A 3D full-resolution nnU-Net was trained on the training and validation datasets with 5-fold cross-validation, which is its recommended training strategy.[19] Data preprocessing, network architecture and other training details were determined automatically by the nnU-Net framework. The network was trained on both the T1- and T2-weighted MRI series, after which the overall performance was compared to the IIS baseline algorithm.

Evaluation
The segmentation performance was evaluated using two metrics: (1) the Dice coefficient, measuring volume overlap, and (2) the average absolute surface distance (ASD), indicating segmentation accuracy along the surface of all structures. Both metrics were calculated separately for all individual structures and averaged per anatomical structure (vertebrae, IVDs, or spinal canal). Additionally, the average Dice coefficient and average ASD per MRI sequence (T1 vs. T2) were calculated for each anatomical structure. To ensure that the Dice score and ASD are not influenced by labeling differences, the individual structures in the reference segmentation are matched to the structures in the predicted segmentation based on the largest found overlap. The completeness classification performance was determined by the percentage of accurate predictions, as well as the average number of false positives and false negatives. Evaluation was performed on a sequestered test set, which is a subset of the presented dataset.
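The Dice coefficient and the overlap-based instance matching can be sketched as below; a full ASD computation would additionally require surface extraction and is omitted. Function names and the tiny 1-D example are illustrative only:

```python
import numpy as np

def dice(a, b):
    """Dice coefficient between two binary masks."""
    inter = np.logical_and(a, b).sum()
    total = a.sum() + b.sum()
    return 2 * inter / total if total > 0 else 1.0

def match_instances(reference, prediction):
    """Match each reference instance to the predicted instance with which it
    has the largest overlap, so label-numbering differences do not matter."""
    pairs = {}
    for ref_label in np.unique(reference[reference > 0]):
        overlap = prediction[reference == ref_label]
        overlap = overlap[overlap > 0]
        if overlap.size:
            pairs[int(ref_label)] = int(np.bincount(overlap).argmax())
    return pairs

reference  = np.array([0, 1, 1, 2, 2, 0])
prediction = np.array([0, 2, 2, 3, 0, 0])  # same structures, shifted labels
print(match_instances(reference, prediction))  # {1: 2, 2: 3}
```

After matching, per-instance Dice and ASD can be computed between each matched pair and then averaged per anatomical structure, as described above.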

Data Records
To generate this dataset, a total of 218 lumbar MRI studies of patients with low back pain were included. Each study consisted of up to three sagittal MRI series, either T1-weighted or T2-weighted (regular resolution, or high resolution generated using a SPACE sequence), for a total of 447 series. Of all included patients, 63% were female. A total of 3125 vertebrae, 3147 IVDs, and 447 spinal canal segmentations were included over all series combined. An overview of the complete dataset divided by the different hospitals is shown in Table 1. An overview of the training and validation sets and all included structures is shown in Table 2.
All MR images and their corresponding segmentation masks used in this study are stored in MHA format in separate directories. An image and its segmentation mask have the same file name, which is a combination of the MRI study identifier and the specific sequence type (T1, T2, or T2 SPACE). Note that all MRI series from the same MRI study share the same study identifier.
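Under this layout, image/mask pairs can be matched simply by file name. A minimal sketch, in which the directory names are hypothetical:

```python
from pathlib import Path

def pair_image_and_mask(image_dir, mask_dir):
    """Pair each image with its segmentation mask: same file name,
    different directory, as described above."""
    pairs = []
    for image_path in sorted(Path(image_dir).glob("*.mha")):
        mask_path = Path(mask_dir) / image_path.name  # identical name
        if mask_path.exists():
            pairs.append((image_path, mask_path))
    return pairs
```

The MHA (MetaImage) files themselves can then be read with standard medical imaging libraries such as SimpleITK or ITK.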

Technical Validation
The performance of the IIS baseline algorithm, which was used to generate initial segmentation masks of unseen images from the dataset, was assessed on a hidden test set. These results are presented to assess the data annotation strategy, as well as to establish a reference performance for users of the dataset. The results for the different structures and sequences are shown in Table 3. The overall mean (SD) Dice score was 0.93 (± 0.05), 0.85 (± 0.10) and 0.92 (± 0.04) for the vertebrae, IVDs and spinal canal, respectively. The overall mean (SD) ASD was 0.49 mm (± 0.95 mm), 0.53 mm (± 0.46 mm) and 0.39 mm (± 0.45 mm) for the vertebrae, IVDs and spinal canal, respectively. The spinal canal was identified in all scans. One of the 656 vertebrae and nine (three in T1 images and six in T2 images) of the 688 IVDs were not found. The completeness prediction was correct for 650 of the 656 vertebrae (99.1%). An nnU-Net was trained on the same training data to enable comparison between the IIS baseline algorithm and the nnU-Net baseline. The results of both networks are displayed in Table 4. The IIS model demonstrates strong performance on our dataset, comparable to other MR segmentation methods in the literature.[7,21,22] These results are nearly identical to those of the nnU-Net baseline model, which is considered the gold standard in medical image segmentation. This indicates that the IIS baseline model is a reasonable benchmark for comparison and was an accurate tool in the iterative data annotation workflow.
The iterative data annotation approach proved to be an effective strategy. One strength of this approach is its ability to improve the quality of the dataset over time by incorporating corrections of segmentation predictions into the training data. This helps to reduce errors and increase accuracy in subsequent iterations. Additionally, this approach was faster and more efficient than fully manual annotation. However, several limitations should be addressed. Firstly, the iterative process of training the network on a small dataset, generating segmentation predictions on unseen images, and manually correcting the predictions before adding them to the dataset can introduce bias into the final dataset. Moreover, the use of only high-resolution T2 series for the initial manual annotations may not be representative of the entire population, as it is limited to patients from one hospital who underwent this specific sequence.
In the era of machine learning and AI algorithms, lumbar spine segmentation can serve as the basis for automated, accurate lumbar spine MR analysis, assisting clinical radiologists and imaging-minded spinal surgeons in their daily practice. It can generate robust, quantitative MR results that can serve as inputs to larger models of lumbar spine disease in clinical practice and research settings. The availability of public datasets and benchmarks plays a crucial role in advancing the field. While datasets exist for CT vertebra segmentation, such as VerSe, the largest available vertebra segmentation dataset [23], no comparably large public datasets for MRI spine segmentation are currently available. Our dataset is of similar size to VerSe [23] and provides full segmentations of all relevant spinal structures on MR images. This allows for wider participation and collaboration in the field of spine segmentation, as it can be used to train and evaluate algorithms, as well as for comparison with other datasets. The presented algorithms provide the baseline results to which other algorithms can be compared.

Usage Notes
All training and validation data can be found at https://doi.org/10.5281/zenodo.8009680 and are available under the CC-BY 4.0 license. To allow for a fair comparison between different algorithms, including both baseline algorithms, a public segmentation challenge is hosted on the grand-challenge.org platform. The training and validation sets are publicly available for everyone to develop and train their AI algorithms on. The test set will remain hidden on the Grand Challenge platform to avoid overfitting and to enable a fair comparison. The test set consists of 39 lumbar MRI studies of unique patients, including 15 of the 20 fully manually annotated studies. The remaining studies originate from the same four hospitals in a similar distribution as the presented dataset.
Participants are invited to submit a trained algorithm to the platform, which automatically executes the algorithm and determines its performance on the hidden test set.The challenge can be accessed on https://spider.grand-challenge.org/.

Figure 1 :
Figure 1: Schematic drawing of the network architecture. The input image and both memory states are fed into the Spine U-net, which produces predictions for the segmentation of the vertebrae (red), IVD (yellow) and spinal canal (blue), as well as anatomical label and completeness predictions. The vertebra and IVD segmentations are added to their respective memory states, which are used as input to the network in the next iteration (see the section on the iterative segmentation approach for more detail).

Figure 2 :
Figure 2: Illustration of the iterative segmentation approach. The process involves traversing the 3D patch (depicted by the light area with a green border) along the spine, with alternating steps for segmentation of the vertebrae (shown in images 1, 3, and 5) and of the IVD and spinal canal (shown in images 2, 4, and 6). The right-hand side displays the final automatic segmentation result alongside the reference segmentation.
Figure 3 shows a collection of segmentations obtained by both networks.

Figure 3 :
Figure 3: Examples of cases segmented by both baseline algorithms. Each column represents one case. Column A shows a case without any major pathologies and without significant segmentation errors. Columns B and C show cases where nnU-Net (B) and the IIS baseline algorithm (C) made mistakes, indicated by a white arrow. Column D shows a case with severe degenerative features present and no substantial segmentation errors.

Table 1 :
Overview of the dataset. Abbreviations: UMC, University Medical Center; RH, Regional Hospital; OH, Orthopedic Hospital. * This only applies to the regular T1- and T2-weighted images. The T2 SPACE sequence has a voxel size of 0.90 x 0.47 x 0.47 mm.

Table 2 :
Overview of the distribution of data between the training and validation set. Abbreviation: IVDs, Intervertebral Discs.

Table 3 :
Overview of all results of the IIS baseline algorithm.

Table 4 :
Comparison between the iterative segmentation algorithm and nnU-Net.