Automatic semantic segmentation of the lumbar spine: Clinical applicability in a multi-parametric and multi-center study on magnetic resonance images

Significant difficulties in medical image segmentation include the high variability of images caused by their origin (multi-center), the acquisition protocols (multi-parametric), the variability of human anatomy, illness severity, the effects of age and gender, and other notable factors. This work addresses problems associated with the automatic semantic segmentation of lumbar spine magnetic resonance images using convolutional neural networks. We aimed to assign a class label to each pixel of an image, with classes defined by radiologists corresponding to structural elements such as vertebrae, intervertebral discs, nerves, blood vessels, and other tissues. The proposed network topologies are variants of the U-Net architecture, and we used several complementary blocks to define the variants: three types of convolutional blocks, spatial attention models, deep supervision, and a multilevel feature extractor. Here, we describe the topologies and analyze the results of the neural network designs that obtained the most accurate segmentations. Several proposed designs outperform the standard U-Net used as a baseline, primarily when used in ensembles, where the outputs of multiple neural networks are combined according to different strategies.

In the first experiment, different topologies were designed based on the U-Net architecture, with the original U-Net architecture used to obtain baseline results. To do this, we defined a set of distinct interchangeable block types that were strategically combined to form the encoder and decoder branches (see Subsection 4.1). Table 3 describes the combination of configuration parameters that obtained optimal results for each network topology, as indicated in Subsection 5.2.
The lumbar spine MR imaging dataset used in this work was extracted from the MIDAS corpus by randomly selecting studies corresponding to 181 patients (see Subsection 3.1.).
The input to the neural networks is composed of sagittal T1w and T2w slices aligned at the pixel level. The ground-truth metadata consists of bit masks generated from the manual segmentation carried out by two radiologists with high expertise in skeletal muscle pathology.
All variations designed from the U-Net architecture were trained for 300 epochs using the training subset in the three-fold cross-validation iterations. The optimal version of each model at each cross-validation iteration corresponds to the weight values of the epoch in which the model achieved the highest accuracy with the validation subset.
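The per-fold model selection described above (keeping the weights of the epoch with the highest validation accuracy) can be sketched as a framework-agnostic checkpointing loop. The callables `train_step`, `evaluate`, and `get_weights` are placeholders for the actual training framework, which the text does not specify:

```python
import numpy as np

def train_with_checkpointing(train_step, evaluate, get_weights, n_epochs=300):
    """Keep the weights from the epoch with the highest validation accuracy.

    train_step() runs one training epoch; evaluate() returns the validation
    accuracy; get_weights() snapshots the current parameters. All three are
    placeholders for the real training framework.
    """
    best_acc, best_weights, best_epoch = -np.inf, None, -1
    for epoch in range(n_epochs):
        train_step()
        acc = evaluate()
        if acc > best_acc:
            # snapshot only when validation accuracy improves
            best_acc, best_weights, best_epoch = acc, get_weights(), epoch
    return best_weights, best_epoch, best_acc
```

In practice, frameworks such as Keras provide this behavior through a checkpoint callback that monitors the validation metric and saves only the best weights.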
The reported results were computed after labelling every pixel as one of the 12 classes with both Maximum a Posteriori Probability Estimation (MAP) and Threshold Optimisation (TH) (see Subsection 4.3).
• Topology UMD obtained the best results of all the variants tested in this work, outperforming the baseline U-Net architecture (U1) for all classes under both labelling criteria.
• Seven of the proposed topologies (UD, UAD, UMD, UAMD, UDD, UMDD, UDD2) outperform the two baseline architectures: the standard U-Net and the FCN. Four of these topologies use multi-kernels at the input and Deep Supervision at the output generated by the last level of the decoder branch (DS.v3).
• The TH labeling criterion performed significantly better than MAP for all experiments.
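The two labelling criteria compared above can be sketched as follows. MAP is a per-pixel argmax over the normalized scores; for TH, the exact rule is defined in Subsection 4.3, so the rescaling by per-class thresholds shown here is an assumption for illustration only:

```python
import numpy as np

def label_map(scores):
    """MAP labelling: each pixel gets the class with the highest
    normalized score. scores: (H, W, C) array of softmax outputs."""
    return scores.argmax(axis=-1)

def label_th(scores, thresholds):
    """Simplified sketch of threshold-based labelling: scores are divided
    by per-class thresholds (tuned on validation data) before taking the
    argmax, so classes with low thresholds are favoured. The actual TH
    criterion is defined in Subsection 4.3."""
    return (scores / np.asarray(thresholds)).argmax(axis=-1)
```

Per-class thresholds let rare, small structures (e.g., nerves or blood vessels) win pixels that a plain argmax would assign to dominant classes.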

Experiment B. Ensembles - Model Averaging Technique
For this experiment, the output of several networks corresponding to different topologies is combined to form a classifier that is an ensemble of classifiers using the model averaging technique (see Figure 6).
Model averaging is a technique in which R models contribute equally to the ensemble's output. Two ways of computing the output of the ensemble from the outputs of its components were considered: the arithmetic mean (Arith) and the geometric mean (Geo) (see Subsection 4.2.1.). In addition to training and evaluating individual semantic segmentation models designed as variations of the U-Net architecture (see Experiment A), a set of ensembles was created by grouping from 4 to 13 models. Table 4 reports all the ensembles used; note that we used the FCN network only in ensembles E8 and E13, for comparison purposes.
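The two averaging strategies can be sketched directly on the models' per-pixel softmax outputs. This is a minimal illustration; shapes and the small epsilon for numerical stability are assumptions:

```python
import numpy as np

def ensemble_average(scores, mode="arith"):
    """Combine R models' per-pixel normalized score vectors.

    scores: array of shape (R, H, W, C) holding each model's softmax
    output. Returns a (H, W, C) array of combined, renormalized scores.
    """
    if mode == "arith":
        out = scores.mean(axis=0)
    elif mode == "geo":
        # geometric mean computed in log space for numerical stability
        out = np.exp(np.log(scores + 1e-12).mean(axis=0))
    else:
        raise ValueError(f"unknown mode: {mode}")
    # renormalize so each pixel's scores sum to 1 again
    return out / out.sum(axis=-1, keepdims=True)
```

The geometric mean penalizes pixels on which any single model assigns a near-zero score, while the arithmetic mean is more forgiving of individual outliers.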
The experiments for each evaluated network topology or ensemble were carried out following the same three-fold cross-validation procedure (see Section 5).
The proposed network topologies use the softmax activation function in the output layer; their outputs are normalized and sum to 1. We refer to them as vectors of normalized scores.
The model output masks, corresponding to 256×256 patches, are combined to generate a single mask per original slice (medical image) in order to evaluate the quality of the automatic semantic segmentation.
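The patch-to-slice reconstruction can be sketched as below. Non-overlapping, row-major tiling and the function name `stitch_patches` are assumptions for illustration; the actual patch layout is determined by the dataset preparation:

```python
import numpy as np

def stitch_patches(patches, grid_shape, patch=256):
    """Reassemble per-patch score maps into one full-slice map.

    patches: (n_rows * n_cols, patch, patch, C), ordered row-major.
    grid_shape: (n_rows, n_cols). Non-overlapping tiling is assumed.
    """
    n_rows, n_cols = grid_shape
    c = patches.shape[-1]
    full = np.zeros((n_rows * patch, n_cols * patch, c), dtype=patches.dtype)
    for idx, p in enumerate(patches):
        row, col = divmod(idx, n_cols)  # row-major patch index -> grid cell
        full[row * patch:(row + 1) * patch,
             col * patch:(col + 1) * patch] = p
    return full
```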
The output of the ensemble is also one vector of normalized scores per pixel. The reported results were computed after labelling each pixel as one of the 12 classes using the MAP criterion (see Subsection 4.3.1).

Experiment B.1. Results
• Tables B.3 and B.4 show the IoU per class, computed according to (3), and the averaged IoU, calculated according to (4), for all ensembles using the model averaging technique. The averaged IoU including the background class is shown for informational purposes only. Two ways of computing the output of the ensemble from the outputs of its components were used: the arithmetic mean (1) (Arith), with results in Table B.3, and the geometric mean (2) (Geo), with results in Table B.4. The MAP criterion was used to label each pixel as one of the target classes (see Subsection 4.3.1.). The IoU metric is used to evaluate the performance on the twelve classes using equation (3), and the average with/without the background class was computed using equation (4). The best results for each class have been highlighted in bold.

[Table B.3: Performance of automatic semantic segmentation via several ensembles using the Model Averaging technique, computed using the Arithmetic mean (Arith) (see Equation (1)).]

[Table B.4: Performance of automatic semantic segmentation via several ensembles using the Model Averaging technique, computed using the Geometric mean (Geo) (see Equation (2)).]
• Ensemble E13 obtained the best results of all the ensembles.
• No significant differences between the Arith and the Geo are observed. The high similarity between both ways of computing the mean confirms that all the topologies combined in the ensembles perform very similarly.
• Compared to Table A.1 in Experiment A, all ensemble results calculated with the model averaging technique outperform the baseline U-Net architecture (U1) and the best-performing single topology (UMD) in all classes.
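The per-class and averaged IoU metrics used throughout these tables can be computed as below. This is a sketch matching the form of equations (3) and (4); treating class 0 as the background is an assumption:

```python
import numpy as np

def iou_per_class(pred, target, n_classes=12):
    """Per-class IoU in the form of equation (3): intersection over union
    of the predicted and ground-truth pixel sets for each class."""
    ious = []
    for c in range(n_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        inter = np.logical_and(p, t).sum()
        # a class absent from both prediction and ground truth is undefined
        ious.append(inter / union if union else np.nan)
    return np.array(ious)

def averaged_iou(ious, include_background=False):
    """Averaged IoU in the form of equation (4); class 0 is assumed to be
    the background class and is excluded by default."""
    vals = ious if include_background else ious[1:]
    return np.nanmean(vals)
```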

Experiment C. Ensembles - Stacking Model Technique
For this experiment, stacking models learn how to combine the predictions of R single models to obtain a better overall prediction.
An ensemble following the stacking model is implemented in three stages: (a) layer merging, (b) meta-learner, and (c) prediction, as indicated in Subsection 4.2.2.
The stacking model technique was used with two different ways of preparing the input to the layer-merging stage: (a) the output of the softmax activation layer from each model r in the ensemble, i.e., the vector y r , and (b) the 64-channel tensor used as input to the classification block (i.e., the output generated by the last level of the decoder branch or DS.v3, where applicable). The combination of the inputs in the layer-merging stage can be done by concatenation, averaging, or adding.
The set of ensembles created by grouping 4 to 13 models is shown in Table 4.
Ensembles based on the stacking model were trained for 50 epochs using the same data-augmentation transformations used to train each single network (see Subsection 5.4), and following the three-fold cross-validation procedure with the same partitions of the dataset. Table 5 depicts the best-performing ensemble input formats and layer configurations based on the stacking model assembling technique. A three-letter acronym identifies ensemble configurations. The first letter identifies the input type: N and T, which stand for normalized scores (softmax output) and 64-channel tensors, respectively. The second letter indicates the layer-merging operator: averaging (A) or concatenation (C). The third corresponds to the type of meta-learner used; in this case, we only used dense layers, so the third letter is fixed to D.
The ensemble configurations are:
- NAD configuration: the inputs to the stacking model are normalized scores, the layer merging is averaging, and the meta-learner is a dense layer.
- TCD configuration: the inputs to the stacking model are the 64-channel tensors at the input of the classification block of each model, the layer merging is concatenation, and the meta-learner is a dense layer.
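The forward pass of the two configurations can be sketched as below. This illustrates only the merge-then-dense structure; in the actual system the dense meta-learner is trained, and the tensor shapes and a per-pixel dense layer are assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def nad_forward(model_scores, W, b):
    """NAD: average the R models' normalized scores (layer merging), then
    apply a per-pixel dense meta-learner followed by softmax.
    model_scores: (R, H, W, C); W: (C, C); b: (C,)."""
    merged = model_scores.mean(axis=0)  # averaging merge -> (H, W, C)
    return softmax(merged @ W + b)

def tcd_forward(model_tensors, W, b):
    """TCD: concatenate the R models' 64-channel tensors along the channel
    axis, then apply a dense layer mapping R*64 channels to C classes.
    model_tensors: (R, H, W, 64); W: (R*64, C); b: (C,)."""
    r, h, w, ch = model_tensors.shape
    merged = model_tensors.transpose(1, 2, 0, 3).reshape(h, w, r * ch)
    return softmax(merged @ W + b)
```

TCD exposes the richer pre-classification features to the meta-learner, while NAD works only on the already-collapsed class scores.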
The stacking model output masks, corresponding to 256×256 patches, are combined to generate a single mask per original slice (medical image) in order to evaluate the quality of the automatic semantic segmentation. The vector corresponding to each pixel of the reconstructed mask is used to assign the pixel to one of the twelve classes using MAP or TH (see Subsection 4.3).

Experiment C.1. Results
• Tables C.5, C.6, C.7 and C.8 show the Intersection over Union (IoU) per class, computed according to (3), and the averaged IoU, calculated according to (4), for all ensembles using the stacking model technique. The averaged IoU including the background class is shown for informational purposes only.
The results obtained with the two stacking model configurations and the respective method used to label each pixel as one of the target classes are shown in the tables as follows:
- NAD + TH in Table C.5,
- NAD + MAP in Table C.6,
- TCD + TH in Table C.7,
- TCD + MAP in Table C.8.
The best results for each class have been highlighted in bold.
• The ensemble E12+NAD+TH obtained the best overall results. Let us remark that the TH labelling criterion performed better than the MAP criterion in all the performed experiments.
• The ensembles E10+TCD+TH and E12+NAD+TH performed significantly better than the best-performing single topology tested in this work (UMD+TH) for all target classes.
• The ensembles including the FCN topology, E8 and E13, show a significant performance drop. Comparing the E12 and E13 results for the NAD+TH configuration shows that adding the FCN topology significantly degrades performance.