Reproducibility of thoracic kyphosis measurements in patients with adolescent idiopathic scoliosis

Background Current surgical treatment for adolescent idiopathic scoliosis (AIS) involves correction in both the coronal and sagittal plane, and thorough assessment of these parameters is essential for evaluation of surgical results. However, various definitions of thoracic kyphosis (TK) have been proposed, and the intra- and inter-rater reproducibility of these measures has not been determined. As such, the purpose of the current study was to determine the intra- and inter-rater reproducibility of several TK measurements used in the assessment of AIS. Methods Twenty patients (90% females) surgically treated for AIS with alternate-level pedicle screw fixation were included in the study. Three raters independently evaluated pre- and postoperative standing lateral plain radiographs. For each radiograph, several definitions of TK were measured as well as L1–S1 and nonfixed lumbar lordosis. All variables were measured twice 14 days apart, and a mixed effects model was used to determine the repeatability coefficient (RC), which is a measure of the agreement between repeated measurements. Also, the intra- and inter-rater intra-class correlation coefficient (ICC) was determined as a measure of reliability. Results Preoperative median Cobb angle was 58° (range 41°–86°), and median surgical curve correction was 68% (range 49–87%). Overall intra-rater RC was highest for T2–T12 and nonfixed TK (11°) and lowest for T4–T12 and T5–T12 (8°). Inter-rater RC was highest for T1–T12, T1-nonfixed, and nonfixed TK (13°) and lowest for T5–T12 (9°). Agreement varied substantially between pre- and postoperative radiographs. Inter-rater ICC was highest for T4–T12 (0.92; 95% CI 0.88–0.95) and T5–T12 (0.92; 95% CI 0.88–0.95) and lowest for T1-nonfixed (0.80; 95% CI 0.72–0.88). Conclusions Considerable variation for all TK measurements was noted. Intra- and inter-rater reproducibility was best for T4–T12 and T5–T12. Future studies should consider adopting a relevant minimum difference as a limit for true change in TK. Electronic supplementary material The online version of this article (doi:10.1186/s13013-017-0112-4) contains supplementary material, which is available to authorized users.


Background
Adolescent idiopathic scoliosis (AIS) is characterized by a lateral deviation of the spine in the coronal plane, vertebral rotation in the transverse plane, and often hypokyphosis in the sagittal plane [1,2]. Current surgical treatment for AIS involves multisegmental pedicle screw instrumentation, which results in considerable correction in the coronal plane with limited loss of correction over time [3,4]. However, several studies have reported failure to restore the thoracic kyphosis (TK) to a normal range seen in non-scoliotic subjects, and in recent years, the importance of surgical correction of sagittal malalignment has gained increased focus [5,6].
Although measuring TK in AIS patients on plain radiographs has become commonplace throughout the decades, considerable variation across studies in terms of defining TK exists and no consensus has been established on what should be regarded as an actual change in TK as opposed to expected measurement variation. For one, a recent meta-analysis evaluated the surgical correction of TK in AIS patients; however, the analysis included various studies with different definitions of TK, which made direct comparisons challenging [7]. Moreover, several studies have attempted to define the TK range in normal subjects but have used different definitions without addressing differences in measurement variation [8][9][10]. Furthermore, the Lenke classification [11] is widely used in preoperative planning; however, classification of the sagittal thoracic modifier has shown poor reliability and the measurement agreement for T2-T12 and T5-T12 kyphosis have been found to be reduced compared to the frontal Cobb angle [12,13]. For T2-T5 regional kyphosis, reliability has been shown to be poor [13,14] and other studies have shown that the upper part of the thoracic spine is inherently challenging to visualize due to structural overlap of the shoulder girdle [8,15,16]. As the clinical importance of the spinal sagittal profile becomes increasingly evident, there is a need to ensure that the measuring methods used to evaluate TK are both accurate and reproducible, especially since the rotational component of the curve may alter reproducibility depending on definitions of TK. Traditionally, TK is determined by a fixed limit Cobb technique (fixed TK, e.g., T4-T12); conversely, the definitions of fixed TK vary among studies [17][18][19][20]. A few authors have suggested applying a nonfixed approach where limits of TK are based on the individual sagittal shape of the spine as it has been shown that the cranial and caudal end vertebrae of the nonfixed TK vary among normal adolescents [8,[21][22][23].
Overall, the intra-and inter-rater reproducibility of these various TK measurements has not been established, and there is no consensus as to which measurements offer the least amount of variability. While a few studies have addressed the intra-and inter-rater correlation for certain TK measurements [24], it is of limited application on individual patients and it will be of great clinical and academic value to know the actual expected variation for repeated TK measurements on the same subject. As such, the objective of the following study was to determine intra-and inter-rater reproducibility of commonly used TK measurements.

Methods
Plain radiographs of 20 patients who were at one point diagnosed with AIS and surgically treated at our institution with alternate-level pedicle screw fixation [25,26] were examined. Institutional review board approval was obtained. Gender and patient age was recorded, and curve type was determined based on the Lenke classification [27]. On the coronal radiograph, pre-and postoperative main Cobb angle was measured and correction rate was calculated.
One spine research fellow (rater 1) and two spine surgeons (raters 2 and 3) independently evaluated 20 sets of pre-and postoperative standing lateral radiographs. For each radiograph, the following were determined ( Fig. 1 Each rater independently performed all measurements twice 14 days apart. Before the second round of measurements, the sequence of the radiographs was randomly reassigned and the raters were blinded from the results of the first round. All raters were blinded to patient details. The total analysis produced 1920 data points for further analysis. All radiographs were measured on a high-resolution monitor using the Picture Archiving Communication system, and identification and labeling of individual vertebrae was based on the Radiographic Measurement Manual by the Spine Deformity Study Group [28]. Application of this manual was discussed among the raters, and consensus was established prior to the study. The protocol for the study was based on the Guidelines for Reporting Reliability and Agreement studies [30].

Imaging details
For the scoliosis radiographs, all patients were positioned in erect position with the feet together and in the straightest posture possible. For lateral images, patients were in the clavicle position with flexed shoulders and elbows past 90°with hands pointing at the sternal notch to allow better spine visualization while preventing changes to the sagittal balance [31]. A computed detector was utilized to determine the position of the patient's skull and hip joints and also the length of the image required. The detector was 40 cm in length, and thus, image splitting was required. Up to 2-3 exposures were required depending on the patient's height. The postero-anterior radiographs were taken with 78-peak kilovoltage and 20 mAs of X-ray energy. The lateral radiographs were taken with 88-peak kilovoltage and 32 mAs of X-ray energy. For both images, the focus film distance was 180 cm.

Statistical analysis
All statistical analyses were performed using R version 3.2.3 (R core team, 2014, Vienna, Austria). Data was reported as proportions (%), mean with standard deviation (SD), or median with range, and data distribution was assessed by histograms.
Reproducibility is a term that entails both measurement agreement and reliability. Intra-and inter-rater agreement is defined as the degree to which repeated measurements are identical whereas reliability is defined as the ability of a measurement to differentiate between subjects [30]. Intra-and inter-rater agreement per subject was estimated for each type of TK measurement using the repeatability coefficient (RC), which is the difference in measurements exceeded by only 5% of pairs of measurements on the same subject. Ninety-five percent limits of agreement were defined as ±RC, meaning that a high RC indicated a high variation (poor agreement) in repeated measurements.
Intra-rater agreement for each rater was calculated according to Bland and Altman [32]: Single rater RC = 1.96 * SD of the difference between repeated measurements for each rater.
Overall, intra-and inter-rater RC was calculated using a linear mixed effects model with subjects and rater-withinsubject variation as random effects and timing of radiograph (e.g., pre-or postoperative) as a fixed effect: [24] Overall intra-rater RC = 2.77 * √(residual mean square) Overall inter-rater RC = 2.77 * √(rater:subject mean square + residual mean square) Inter-rater RC was further analyzed for pre-and postoperative radiographs separately.

Discussion
Our study noted a substantial measurement variation for all definitions of TK with the best reproducibility for T4-T12 and T5-T12 both in terms of intra-and inter-rater agreement as well as reliability. Only a few previous studies have addressed the variation of TK measurements in a systematic manner. For example, in a study by Ilharreborde Postoperative 57 ± 9 55 ± 9 55 ± 9 57 ± 9 52 ± 9 54 ± 9 SD standard deviation, TK thoracic kyphosis, LL lumbar lordosis et al. [35], the authors found an intra-rater agreement of 6°a nd 4°for T1-T12 and T4-T12, respectively, and an interrater agreement of 7°and 6°, respectively. This study, however, utilized EOS-imaging, which is a slot-scanning device that may improve the agreement. Moreover, EOS is currently only available in selective centers. Similarly, Kuklo et al. [36] found that the intra-rater agreement for T2-T12 and T5-T12 was 5°and 6°, respectively. However, none of these studies addressed the issue of random effects, so these results are not directly comparable to the present study and likely to underestimate the overall variation seen between randomly chosen raters. Carman et al. [37] measuring nonfixed TK found 95% of the differences between raters to be within 7°and found a trend towards less variation with increased clarity on radiographs. The study also found that an 11°difference in TK was required to rule out measurement error with 95% confidence. Our results are in line with these findings showing that TK measurements have considerable intra-and inter-rater variation and a difference of 8°to 13°(depending on TK definition) may solely be produced by observer error alone.
In order to ensure clinical applicability of our results, our study included both pre-and postoperative radiographs.
Our analysis showed substantial differences in both intraand inter-rater agreement between pre-and postoperative radiographs, showing markedly better agreement in postoperative radiographs for T4-T12 and T5-T12 (Tables 2  and 3). For the remaining TK measurements, analyses of pre-and postoperative subgroups were not conclusive but, generally, we found poorer or unchanged agreement. The reason for these changes may be that the variation seen in T4-12 and T5-T12 is mainly due to the lateral and rotational deformity of the curve which is surgically corrected whereas the variation seen in measurements including T1 and T2 is more likely due to structural overlap (e.g., of the humeral head) and therefore not affected by surgery. Interestingly, our analysis also showed considerable variation for the fixed and nonfixed LL, indicating that the sagittal radiograph, as a whole, is inherently difficult to analyze in a reproducible manner in AIS patients.
Our study focused on the overall TK because we found that a wide range of definitions exist in the literature. Establishing the respective reproducibility of these measurements was our main objective, but we would encourage future studies to include additional clinically important parameters, such as proximal TK (T2-T5) and thoracolumbar alignment (T10-L2) as well as several other clinically relevant measurements.
The ICC analysis showed good to excellent reliability for all measurements. However, while the ICC analysis is frequently reported in studies of this type, it holds limited practical value when assessing potential variation of individual measurements per subject, as it is a measure of the reliability for the measurement to differentiate between subjects. By applying a mixed effects model to our data, the observed variance is split into both the variability between the raters within subjects (inter-rater variation) and a residual error term (representing intra-rater variation). Ultimately, an RC is generated which represents the upper and lower 95% limits of agreement for an individual measurement. By using the rater as random effects, our results represent conservative estimates  T1-nonfixed 12 14 13 Nonfixed TK 14 13 13 L1-S1 9 12 11 Nonfixed LL 11 12 11 RC reliability coefficient, TK thoracic kyphosis, LL lumbar lordosis and we hypothesize that measurement variation found in our study would also apply for other raters. Our results are limited by a substantial variation in single rater RC among raters, which was lowest for T4-T12 and highest for T2-T12 (Table 2). Several steps were taken before the initiation of the study to minimize bias in terms of discrepancies in labeling vertebra, handling of odd number of ribs, or definitions using the nonfixed approach. Rater 1 had more than 3 years of experience in evaluating radiographs from AIS patients, and raters 2 and 3 had 8 and 10 years of experience, respectively. It should be noted that all raters routinely use mainly T5-T12 or T2-12 when evaluating patients with AIS although rater 1 also uses the nonfixed approach on a regular basis. As such, we believe that our results reflect the expected variation between clinicians. In addition, our patient sample did not include lumbar curves, so we cannot infer that our results may be readily applied to this group. Also, the sample size in our study did not allow for analyzing individual curve types, but variation may be greater for thoracic curves since TK has been found to depend on curve type [38]. Nonetheless, we hope that our study can form the foundation whereby future studies can further elaborate upon different curve types.
Our results may guide clinicians and researchers in the evaluation of the sagittal profile following surgery in defining the limits of actual improvement of worsening of TK as opposed to expected measurement variation. Applying such variation in clinical definitions of progression has previously been described in guidelines for evaluation of radiographic results of brace treatment [37,39,40]. Our results indicate that T4-T12 and T5-T12 offer the least amount of observer variation, and while measuring nonfixed TK may offer a more individualized assessment of the spine, we found considerable measurement variation using this approach that may limit the clinical applicability. It is outside the scope of this paper to determine how these various measurements correlate with clinical outcomes; however, we recommend that future studies specifically state the applied definition Fig. 2 Intra-rater intra-class correlation coefficients for all measurements (both pre-and postoperative) with 95% confidence interval (CI). TK thoracic kyphosis Fig. 3 Inter-rater intra-class correlation coefficients of all measurements (both pre-and postoperative) with 95% confidence intervals (CI). TK thoracic kyphosis of TK and also adequately address measurement variation when evaluating treatment results. This will further help with standardization of measurements between studies for comparative purposes.

Conclusions
Our study addresses the intra-and inter-rater reproducibility of TK measurements in AIS patients, and we noted a considerable variation for all TK measurements. Both intraand inter-rater reproducibility were best for T4-T12 and T5-12. Future studies should consider adopting a relevant minimum difference (depending on TK definition) as a limit for indication of true change in TK within a patient. As such, our findings have implications in the decisionmaking of the spine specialist.  Availability of data and materials All original data used for the analyses presented are made available as Additional file 1.
Authors' contributions SO, DS, DH, and JC participated in the conception and design of the study. SO, JC, and KK made radiographic measurements. SO and DH analyzed the data. SO, DS, DH, JC, and KK wrote the manuscript, and BD, MG, and KC made critical revisions. DS supervised the study and provided administrative support. All authors have read and approved the final manuscript.