Introduction

Adult spinal deformity (ASD) severely reduces health-related quality of life and is increasingly prevalent in patients older than 65 years (30–68%) [1, 2]. Long construct instrumentation is the surgical treatment for high-degree ASD [3, 4].

The radiological evaluation of sagittal balance is fundamental for the characterization, classification, and subsequent treatment planning of ASD [5,6,7]. Among the most important radiographic parameters are sacral slope (SS), pelvic tilt (PT), pelvic incidence (PI), lumbar lordosis (LL) and sagittal vertical axis (SVA) [8]. These parameters can be identified and measured on total spine radiographs that include the pelvis. Precise and reproducible measurement of these radiographic parameters is therefore essential. For the most part, this evaluation is performed manually with software assistance, which is time-consuming and examiner-dependent [9,10,11].

Artificial intelligence (AI) technologies are employed in various fields of medicine. Machine learning with deep learning (DL) algorithms is currently being developed for precise image analysis. Only a few publications have investigated AI-based automated analysis of the clinically relevant sagittal balance parameters by a single algorithm [12, 13]. To date, none of these algorithms has shown sufficiently high accuracy in the automated measurement of basic sagittal balance parameters.

To date, the limiting factors in establishing a fully automated DL algorithm have been reduced image quality and stitching artefacts. In addition, high-grade spinal deformity and postoperative long construct instrumentation with implant artefacts have led to highly inaccurate analyses and have prevented implementation in routine clinical use [12,13,14].

We recently showed that AI measures spinopelvic parameters with high accuracy in various lumbar pathologies with short instrumentations, although SVA was not detected in that study [15].

The aim of this study was to assess the accuracy of a new, fully automated DL algorithm for the analysis of essential sagittal balance parameters in a large and challenging cohort of patients with ASD, both before and after correction with long construct instrumentation.

Material and methods

The evaluation of the DL performance was conducted retrospectively on a cohort of 141 patients with ASD. Approval by the local ethics committee was obtained before the study began (EA1/342/21). Patients with ASD who underwent corrective surgery with long construct instrumentation spanning more than three segments were included. Exclusion criteria were prior spinal surgery with instrumentation or kyphoplasty.

Radiographic data

Of the 141 identified patients, 118 had preoperative and 125 had postoperative lateral total spine radiographs, obtained at our institution on three different X-ray machines (Kodak Elite CR and Kodak DRX-Evolution X-ray scanners; Carestream Health, Rochester, NY, USA and EOS imaging; ATEC, Paris, France). All postoperative radiographs included long construct spinal instrumentation (pedicle screws, rods, interbody cages). Screws were cement augmented in 21 patients.

Ground truth manual measurements

For comparison with the DL measurements, all preoperative and postoperative radiographs were measured manually and independently by two of the authors (F.A. and J.L.) using the SurgiMap Spine software (Nemaris Inc., New York, NY, USA), as previously reported [9, 10, 15, 16]. SS, PT, PI, LL (measured as L1-S1 lordosis) and SVA were measured and recorded. To assess intraobserver reliability, the measurements were repeated.

Deep learning-based measurements

The DL-based algorithm for automatic computation of sagittal balance parameters comprised three main steps: (1) automatic adjustment of image brightness and contrast and identification of stitching artefacts for segmentation of all relevant anatomical structures (cervical, thoracic, and lumbar vertebral bodies, sacral endplate, femoral heads and instrumentation), (2) landmark detection on the sacrum and L1, and (3) line fitting and computation of all parameters.

The segmentation model was trained using the Mask R-CNN architecture on 946 training images acquired during clinical routine at 22 different clinical sites [17]. The training images were independent of the 118 preoperative and 125 postoperative radiographs measured in this study. Before being passed to the segmentation model, the DICOM images were preprocessed to enhance brightness and contrast. Furthermore, histogram equalization was applied to highlight the bony structures in the images. The segmentation model was trained on masks around the visible anatomical structures and their corresponding categories. The training labels were generated by medical staff with background knowledge of human anatomy. The model was trained for 100 epochs on an NVIDIA GeForce 1080 GPU with a 90–10 training-validation split.
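The histogram equalization step can be illustrated with a minimal sketch. The exact preprocessing used in the study is not described in detail, so the function name `equalize_histogram`, the global (non-adaptive) variant, the bin count and the output range of 0 to 1 are all assumptions made for illustration only.

```python
import numpy as np


def equalize_histogram(image: np.ndarray, n_bins: int = 256) -> np.ndarray:
    """Global histogram equalization of a grayscale image.

    Hypothetical helper: returns a float image in [0, 1] with the same shape.
    """
    pixels = image.astype(np.float64).ravel()
    # Histogram over the full intensity range present in the image.
    hist, bin_edges = np.histogram(pixels, bins=n_bins)
    # Normalized cumulative distribution function (ends at exactly 1.0).
    cdf = hist.cumsum().astype(np.float64)
    cdf /= cdf[-1]
    # Map every pixel through the CDF, interpolating between bin centers.
    centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])
    equalized = np.interp(pixels, centers, cdf)
    return equalized.reshape(image.shape)
```

Because the mapping is monotonic, the intensity ordering of the pixels is preserved while the occupied intensity range is stretched. Adaptive variants that equalize local tiles are a common alternative for radiographs, but whether one was used here is not stated.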

The landmark detection algorithm relied on the locations of the structures detected in the segmentation step, which allowed crops of the sacrum and the vertebral bodies to be generated. Two separate models were trained to place (1) five landmarks on the sacral endplate and (2) six landmarks on L1, with three landmarks on each of the upper and lower endplates. The convolutional neural network (CNN) was based on the U-Net architecture and was trained on 256 × 256 square crops together with the landmarks encoded as heatmaps [18]. The output heatmaps from the model were converted to coordinates as the final prediction. Training used a Euclidean distance error and the AdamW optimizer with a learning rate of 0.001 for 60 epochs [19].
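The conversion between landmarks and heatmaps can be sketched as follows. The study does not state how heatmaps were encoded or decoded, so the Gaussian encoding, the value of `sigma`, the argmax decoding and all function names below are assumptions chosen for illustration.

```python
import numpy as np


def landmark_to_heatmap(x: float, y: float, size: int = 256, sigma: float = 3.0) -> np.ndarray:
    """Encode one landmark as a Gaussian heatmap on a size x size crop (assumed encoding)."""
    ys, xs = np.mgrid[0:size, 0:size]  # ys: row indices, xs: column indices
    return np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))


def heatmap_to_coordinate(heatmap: np.ndarray) -> tuple[float, float]:
    """Decode a heatmap to (x, y) by taking the location of its maximum (assumed decoding)."""
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return float(col), float(row)


def mean_euclidean_error(predicted, target) -> float:
    """Mean Euclidean distance between predicted and target landmarks, in pixels."""
    predicted = np.asarray(predicted, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.linalg.norm(predicted - target, axis=1).mean())
```

For example, `heatmap_to_coordinate(landmark_to_heatmap(100, 40))` recovers the landmark at x = 100, y = 40. Argmax decoding is limited to whole-pixel precision; sub-pixel schemes exist but are not implied by the text.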

The final step compiled all the necessary predictions from the segmentation and landmark placement models to compute the relevant parameters. The vertebral bodies were labelled from sacral/caudal to cranial/cervical, counting five lumbar, twelve thoracic and seven cervical vertebrae. The spinopelvic parameters were computed using line regression on the detected landmarks on the sacrum and L1 and the midpoint of the detected femoral heads. The SVA was computed from the midpoint of C7 and the most posterior landmark of the sacral endplate (Fig. 1).
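The parameter computation can be sketched from the standard geometric definitions of SS, PT, PI, LL and SVA. This is an illustrative reconstruction, not the study's implementation: the total least squares line fit (standing in for the unspecified line regression), the use of the landmark centroid as the sacral endplate midpoint, the image coordinate convention (x to the right, y downward), the `anterior_sign` and `mm_per_pixel` parameters and all function names are assumptions. Angles are returned unsigned and folded into 0 to 90 degrees, so signed or larger angles are not handled.

```python
import numpy as np


def fit_line(points):
    """Total least squares line through 2D points: returns (centroid, unit direction)."""
    pts = np.asarray(points, dtype=float)
    centroid = pts.mean(axis=0)
    _, _, vt = np.linalg.svd(pts - centroid)
    return centroid, vt[0]


def line_angle(d1, d2) -> float:
    """Acute angle in degrees between two undirected lines given by direction vectors."""
    d1 = np.asarray(d1, dtype=float)
    d2 = np.asarray(d2, dtype=float)
    cos = abs(d1 @ d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))
    return float(np.degrees(np.arccos(np.clip(cos, 0.0, 1.0))))


def sagittal_parameters(sacral_pts, l1_upper_pts, femoral_heads, c7_center,
                        mm_per_pixel=1.0, anterior_sign=1.0):
    """Compute SS, PT, PI, LL (degrees) and SVA (mm) from landmarks in image coordinates.

    anterior_sign is +1 if the patient faces +x in the image, -1 otherwise (assumption).
    """
    sacral_pts = np.asarray(sacral_pts, dtype=float)
    sacral_mid, sacral_dir = fit_line(sacral_pts)
    _, l1_dir = fit_line(l1_upper_pts)
    hip_axis = np.asarray(femoral_heads, dtype=float).mean(axis=0)
    to_hip = hip_axis - sacral_mid
    sacral_normal = np.array([-sacral_dir[1], sacral_dir[0]])
    # Most posterior sacral landmark: smallest x when anterior is +x.
    posterior = sacral_pts[np.argmin(anterior_sign * sacral_pts[:, 0])]
    return {
        "SS": line_angle(sacral_dir, [1.0, 0.0]),    # endplate vs horizontal
        "PT": line_angle(to_hip, [0.0, 1.0]),        # hip-sacrum line vs vertical
        "PI": line_angle(to_hip, sacral_normal),     # hip-sacrum line vs endplate normal
        "LL": line_angle(l1_dir, sacral_dir),        # L1 upper endplate vs sacral endplate
        "SVA": anterior_sign * (c7_center[0] - posterior[0]) * mm_per_pixel,
    }
```

A useful sanity check on any such implementation is the geometric identity PI = PT + SS, which holds in the usual configuration where the sacrum lies posterior to the hip axis.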

Fig. 1 AI-based landmark detection and segmentation

Statistical analysis

The mean values, root mean square error (RMSE) and standard deviation (STD) were calculated for the parameters. The correct detection rate of the DL algorithm was reported as the percentage of radiographs in which all parameters could be computed fully automatically. The intra-class correlation coefficient (ICC), the Pearson correlation coefficient and the corresponding p values were calculated for intra- and interobserver as well as intermodal reliability. Statistical significance was defined as p < 0.05. All statistical analyses were conducted with SPSS 27 (IBM Corp., Armonk, NY, USA) and the Python 3 programming language [20].
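The agreement metrics named above can be sketched in a few lines of NumPy. The ICC model used in the study is not specified, so the two-way random effects, absolute agreement, single measurement form, ICC(2,1), is assumed here; the function names are hypothetical and p values are omitted.

```python
import numpy as np


def rmse(a, b) -> float:
    """Root mean square error between two paired measurement series."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sqrt(np.mean((a - b) ** 2)))


def pearson_r(a, b) -> float:
    """Pearson correlation coefficient between two paired measurement series."""
    return float(np.corrcoef(a, b)[0, 1])


def icc_2_1(ratings) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single measurement.

    ratings has shape (n_subjects, k_raters); the choice of this ICC form is an assumption.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ss_rows = k * np.sum((x.mean(axis=1) - grand) ** 2)   # between subjects
    ss_cols = n * np.sum((x.mean(axis=0) - grand) ** 2)   # between raters
    ss_err = np.sum((x - grand) ** 2) - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return float((ms_rows - ms_err) /
                 (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n))
```

For intermodal reliability, `ratings` would hold one column of manual ground truth values and one column of DL values per parameter. Unlike the Pearson coefficient, this ICC form penalizes a systematic offset between the two columns.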

Results

The preoperative detection rate of the DL algorithm was 91.5%, and the postoperative detection rate was 84.8%. The intraobserver ICC (Pearson correlation coefficient) for the SurgiMap-assisted manual measurements was 0.85–0.99 preoperatively and 0.93–0.99 postoperatively. The interobserver ICC (Pearson correlation coefficient) for the SurgiMap-assisted manual preoperative measurements was 0.96 for SS, 0.99 for PT, 0.96 for PI, 0.97 for LL and 0.99 for SVA. For the postoperative measurements, it was 0.99 for each of SS, PT, PI, LL and SVA (Table 1). The ground truth values are given in Table 1. The ICC between the manual measurements and the DL measurements was 0.71–0.99 for the preoperative and 0.72–0.96 for the postoperative analysis (Table 2). The measurement accuracy was not affected by implants or cement augmentation of screws: a subgroup analysis revealed no statistically significant differences in the evaluated parameters between these groups (p > 0.05) (Fig. 1).

Table 1 Ground truth values for manual, preoperative and postoperative measurements with interobserver comparison
Table 2 Inter-modal reliability between manual, ground truth measurements and deep learning algorithm measurements

Discussion

This study is the first to show high accuracy in the measurement of fundamental sagittal balance parameters by a single, fully automated DL algorithm.

The main finding of this study is that the new DL algorithm is a reliable tool owing to its high precision. DL evaluation is most challenging in high-degree degeneration and spinal deformity. All patients in this cohort had ASD and were evaluated preoperatively and postoperatively with long construct instrumentation.

The highest measurement accuracy in this cohort was observed for SVA. This is of particular importance, as SVA is a fundamental radiological parameter for evaluating and classifying ASD and global balance. Assessing sagittal balance in combination with PT, which was also investigated, allows compensatory mechanisms to be taken into account. The lowest accuracy was observed for SS, which is consistent with previously published results and is attributable to sacral endplate irregularities and superimposition of implants in this area [21]. However, SS is clinically less important than PI, for which our results compare favourably with other studies [13, 22].

Previous studies did not show sufficiently high accuracy for the relevant spinopelvic parameters with a single DL algorithm [12,13,14, 22, 23]. The only study that investigated the same four most relevant spinopelvic parameters as our study reported a detection accuracy of 0.69 for PI, and its authors concluded that the DL algorithm was not suitable for implementation in clinical routine [13]. The other parameters investigated in that study showed ICCs comparably high to ours. Other studies reporting high accuracy either did not investigate postoperative radiographs with implants or showed high accuracy for spinopelvic parameters but not for SVA [21, 22, 24].

Previously, we investigated automated DL measurements of sagittal balance in short-segment spinal deformities and mono- and bisegmental instrumentations. The accuracy in the present study is comparable to those findings [15].

To date, only three studies have investigated automated DL-based SVA measurements. The main difficulty in computing SVA is the visibility of C7, as was also observed in our study. In many hospitals, clinical routine is still based on conventional total spine radiographs. DL measurements need to cope with varying radiograph quality to be suitable for clinical use. Among the most important challenges in this study were radiographs obtained from three different X-ray machines with lower image quality (including stitching artefacts), long construct instrumentation and cement augmentation of screws. Prior studies of DL analysis of sagittal balance that reported high accuracy excluded up to 28% of radiographs because of poor radiograph quality [13, 22]. This may improve the measurement performance but says nothing about how the DL algorithm would perform on real clinical data. All postoperative radiographs examined in this cohort included long construct instrumentation. High performance of the algorithm in both preoperative and postoperative analysis is important for clinical implementation. Neither the implant density on postoperative radiographs nor cement augmentation of screws affected the measurement accuracy in our study.

The DL algorithm used in this cohort is multimodal and involves vertebra segmentation and separate landmark placement for each segmented vertebra. Six landmarks are placed on vertebral bodies and five landmarks on the sacrum, which is considerably more than in previously published DL approaches. This contributes to high accuracy and a very robust workflow, given the challenging cohort presented.

The manual measurements were performed with software assistance, which has been shown to be a reliable tool [9, 10, 12]. The intra- and interobserver correlations in this cohort compare equally or favourably with previously published results [22]. As the evaluation of DL accuracy is based on these ground truth measurements, this is a key point of all validation studies.

A limitation of this study is that we investigated only four spinopelvic parameters. Previous studies were able to evaluate more parameters [22]. However, the four parameters presented are among the most relevant and the most challenging to detect, and they are sufficient for clinical use and decision-making.

Conclusion

The new DL algorithm provided high accuracy for fully automated detection of sagittal balance in ASD. For the first time, the precision and robustness of a DL algorithm allow for implementation in clinical routine. In light of the recent discussion, this study demonstrates an effective synergy of DL and human effort for improved analysis of medical imaging.