Introduction

Recently, there has been a shift toward more organ-preserving treatments for rectal cancer. Patients with advanced tumors who show clinical evidence of a complete response (CR) after neoadjuvant chemoradiotherapy (CRT) may be entered into a watch-and-wait (W&W) program, while patients with small tumor remnants may be cured with local treatment options such as transanal excision instead of major resection [1,2,3]. In addition, there are ongoing trials (such as the STAR-TREC trial) investigating the benefit of giving chemoradiotherapy to early-stage tumors with the aim of achieving organ preservation [4]. According to current guidelines, these early tumors are typically managed with direct surgery. These developments have increased the need to accurately monitor response after CRT, and have also given rise to increased interest in predicting treatment response before the start of CRT. If we could differentiate at baseline which patients are likely to respond well and which patients will be non-responders, this could aid in the selection of patients who would be good candidates for CRT while avoiding unnecessary side effects in anticipated non-responders. Pre-treatment knowledge of the anticipated treatment response could also help to further optimize neoadjuvant treatment strategies.

Several studies have investigated the potential role of imaging and image biomarkers as pre-treatment predictors of response [5,6,7,8,9]. So far, these studies have mainly focused on functional imaging techniques such as diffusion-weighted imaging (DWI) and dynamic contrast-enhanced (DCE) MRI, and on multiparametric imaging models developed using artificial intelligence (AI) methods such as radiomics [10,11,12,13,14]. Interestingly, several of these reports have shown that basic tumor descriptors such as the T- and N-stage, morphology, and volume were among the variables showing the best potential to predict response [15, 16]. This indicates that visual morphologic interpretation by radiologists is not only crucial for staging but could also help to derive predictors of treatment response. Van Griethuysen et al. were among the first to develop a method to estimate the likelihood of response based solely on radiologists' visual interpretation and staging of baseline MRI scans [17]. They showed that a confidence scoring system taking into account the size, signal, and shape of the tumor, T- and N-stage, mesorectal fascia (MRF) involvement, and extramural vascular invasion (EMVI) could predict the chance of achieving a good or complete response to CRT on baseline MRI with areas under the curve (AUCs) of 0.67-0.83, when assessed by two expert radiologists. To the best of our knowledge, visual morphologic response prediction methods such as the one proposed by van Griethuysen et al. have not yet been evaluated by larger groups of readers and/or using multicenter MRI data.

This study therefore aims to evaluate the visual response prediction method of van Griethuysen et al. in a multicenter study setting and to compare it to two simplified adaptations of the same scoring system in terms of diagnostic performance and reproducibility among a large international group of radiologists with varying levels of expertise.

Materials and methods

Patient selection

This retrospective diagnostic study was conducted as part of an ongoing institutional review board-approved multicenter project focused on MRI for rectal tumor risk and response assessment, including the imaging and clinical outcome data of 1037 rectal cancer patients from ten centers in the Netherlands acquired between 2010 and 2018. For the current study, we identified from this cohort patients fulfilling the following inclusion criteria: (a) biopsy-proven non-mucinous rectal adenocarcinoma, (b) neoadjuvant treatment consisting of “routine” long course CRT (50.0-50.4 Gy with concurrent capecitabine-based chemotherapy), (c) availability of diagnostic quality primary staging MRI including at least T2-weighted sequences in three planes (sagittal, coronal, transversal), and (d) availability of a final response outcome (histology after surgery or ≥2 years clinical follow-up in case of W&W treatment). From this group, we semi-randomly selected a sample of n=90 patients to be included in the current study cohort, taking into consideration that data from all ten study centers had to be represented and ensuring a clinically representative sample in terms of response outcomes to allow meaningful statistical analyses. This semi-random (selective) approach was chosen because two of the ten centers are referral centers for W&W, which could otherwise have resulted in a relative overrepresentation of complete responders in the cohort. Because of the retrospective nature of the study, informed consent was waived.

MR imaging

All MRIs were performed according to the local protocols of the participating centers at the time of inclusion. From the full protocols, we selected for this study the 2D T2-weighted spin echo sequences in sagittal, oblique-axial (perpendicular to the tumor axis), and oblique-coronal (parallel to the tumor axis) planes, in line with the minimal requirements for primary rectal cancer staging as outlined in recent guidelines [18]. Slice thickness ranged between 3 and 5 mm and in-plane resolution between 0.35 × 0.35 and 0.94 × 0.94 mm.

Image evaluation

MRIs were assessed by 22 radiologists from 14 different countries, including five rectal MRI-experts (each with ≥10 years’ dedicated experience in rectal MRI and rectal cancer research) and 17 abdominal radiologists or general radiologists with a specific interest in abdominal imaging. The 17 abdominal/general radiologists had a median of 6 years’ experience in reading rectal MRI (range 1.5–21 years) with an estimated median of 100 (range 50–250) rectal MRI cases read on a yearly basis. Study readers were recruited via an open call to members of the European Society of Gastrointestinal and Abdominal Radiology (ESGAR), specifically those with an interest in rectal imaging. Image evaluation was performed using an in-house developed web-based platform (iScore) that was designed by one of the authors (N.E.K.) and incorporates the Open Health Imaging Foundation (OHIF) DICOM viewing platform [19].

Study readers were asked to review the baseline MRIs of the 90 study cases using electronic case report forms (eCRFs) that were embedded into iScore. These eCRFs included three different scoring methods designed to estimate the likelihood that patients would achieve a complete or near-complete response to chemoradiotherapy based on the overall tumor risk profile. The first scoring method was the 5-point confidence score published by van Griethuysen et al. that is based on a combination of tumor size, signal heterogeneity, shape (regular/irregular), T-stage, N-stage, EMVI, and MRF invasion [17]. The second scoring method was a simplified 4-point adaptation, taking into account only MRF invasion, high-risk T-stage, EMVI, and N-stage. The final scoring method was a further simplified, dichotomized (2-point score) adaptation. Full details of the three scoring methods are provided in Table 1 and supporting images are provided in Figs. 1 and 2. A visual representation of the scoring setup in iScore including the full eCRFs is provided in Supplement 1. Readers were asked to indicate for each individual case whether they found the respective scoring methods easy, moderately easy/difficult, or difficult to apply. Finally, after completion of all cases, they were asked to give an overall indication of which scoring method they would prefer to use in their daily clinical practice. Readers were blinded to each other’s scorings and to the final response outcomes.

Table 1 Scoring methods used to predict response to chemoradiotherapy on baseline MRI
Fig. 1 Instructions provided to the study readers to assign a 4-point risk score based on the presence/absence of 4 key high-risk features: obvious MRF invasion, high-risk T-stage (bulky/irregular, T3c-4), obvious node-positive disease, and obvious EMVI. Readers were instructed to only select ‘yes’ if they were confident that a respective worrisome feature was present. When in doubt, readers were instructed to select ‘no’.

Fig. 2 Instructions provided to study readers to assign a dichotomized (2-point) risk score. Green = low risk, i.e., tumor likely to achieve a (near-)complete response. Red = high risk, i.e., tumor unlikely to achieve a (near-)complete response.

Standard of reference

The main outcome to be predicted in this study was a (near-)complete response, defined as the absence of viable cancer cells, or the presence of only rare or small clusters of residual cancer cells, at histopathology after surgery. The primary standard of reference in the patients that had undergone surgery was the histopathological Mandard tumor regression grade (TRG), where a (near-)complete response was defined as TRG 1-2 [20]. In patients undergoing W&W, a sustained clinical complete response with a local regrowth-free follow-up period of ≥2 years was considered a surrogate endpoint of a complete response (TRG 1).

Statistical analysis

Statistical analyses were performed using R statistics version 4.1.0 (2021) and IBM SPSS version 27 (2020). The scores from the 22 radiologists were averaged for each patient in order to produce a probability of response that was then used to compute receiver operating characteristic (ROC) curves and calculate mean areas under the curve (AUC) for each scoring method. Optimal cut-off values for the 5-point and 4-point scores were derived from the ROC curves to calculate sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and overall accuracy to predict a (near-)complete response (being the positive study outcome). Results were separately analyzed for the five MRI-experts versus the 17 less experienced readers, and mixed model linear regression was used to assess the impact of reader experience on the diagnostic accuracy figures of each scoring method. To account for the repeated measurements of each patient, a patient-level random intercept was used. A logistic regression was performed to analyze the possibility of an association between the diagnostic accuracy and the interval between completion of CRT and final surgery/entry into a W&W program. To do so, the proportion of correct diagnoses for each patient and method was computed across all readers. This proportion was then used as the response variable, and the interval between completion of CRT and final surgery/entry into a W&W program was used as a covariate. p-values <0.05 were considered statistically significant. Group interobserver agreement (IOA) was calculated using Krippendorff’s alpha (α).
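The pooled-reader analysis described above can be illustrated with a minimal Python sketch. It is not the R/SPSS code used in this study: the function names and toy data are hypothetical, and the sketch assumes that a higher score indicates a higher likelihood of a (near-)complete response. The rule used to select the optimal cut-off from the ROC curve is not reproduced here, so the cut-off is passed in as a parameter.

```python
from statistics import mean


def mean_reader_scores(scores_by_reader):
    """Average the scores of all readers per patient.

    scores_by_reader: list of per-reader lists, each with one score per patient
    (all readers are assumed to have scored the same patients in the same order).
    """
    return [mean(patient_scores) for patient_scores in zip(*scores_by_reader)]


def auc(scores, labels):
    """Rank-based AUC: probability that a randomly chosen responder (label 1)
    has a higher score than a randomly chosen non-responder (label 0);
    ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def _ratio(num, den):
    # Avoid a ZeroDivisionError when a category is empty.
    return num / den if den else float("nan")


def diagnostic_metrics(scores, labels, cutoff):
    """Accuracy figures when 'score >= cutoff' is read as a predicted response."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= cutoff and y == 1)
    fp = sum(1 for s, y in pairs if s >= cutoff and y == 0)
    fn = sum(1 for s, y in pairs if s < cutoff and y == 1)
    tn = sum(1 for s, y in pairs if s < cutoff and y == 0)
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
        "accuracy": _ratio(tp + tn, len(pairs)),
    }


# Hypothetical toy data: 3 readers x 6 patients, 1 = (near-)complete response.
TOY_SCORES = [
    [5, 4, 2, 1, 3, 4],
    [4, 4, 1, 2, 3, 5],
    [5, 3, 2, 1, 2, 3],
]
TOY_LABELS = [1, 0, 0, 0, 1, 1]

averaged = mean_reader_scores(TOY_SCORES)
print("AUC:", auc(averaged, TOY_LABELS))
print(diagnostic_metrics(averaged, TOY_LABELS, cutoff=3.0))
```

Averaging over readers before computing the ROC curve yields one pooled curve per scoring method; per-reader curves could be obtained by calling `auc` on each reader's scores separately.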

Results

Patient characteristics

Table 2 shows the baseline characteristics of the 90 study patients. Fifty-two patients (58%) were male; median age was 65 years (range 41–82). Forty-four patients (49%) were (near-)complete responders, including 27 (30%) complete responders (21 after surgery; 6 clinical complete responders undergoing W&W). Mean time interval between completion of CRT and surgery (or inclusion into a W&W program) was 11±2.5 weeks.

Table 2 Patient characteristics

Diagnostic performance to predict a (near-)complete response

Average performance for all study readers to predict a (near-)complete response to CRT was similar for all three methods with an AUC of 0.71 (95% CI 0.60–0.82) for the 5-point confidence score, AUC 0.74 (95% CI 0.64–0.84) for the 4-point risk score, and AUC 0.72 (95% CI 0.62–0.83) for the dichotomized 2-point risk score; differences in AUC between the three methods were not statistically significant (p=0.10–0.64). Further accuracy figures are provided in Table 3. The 5-point confidence score resulted in slightly lower sensitivity than the other two methods (49% versus 57-59%); the other metrics were similar for the three different scoring methods. There was a tendency toward higher performance for the MRI-experts versus the less expert readers, though these differences did not reach statistical significance (p=0.15–0.99), except for the PPV of the 5-point confidence score, where the MRI-experts scored significantly higher than the non-experts (p=0.03). The time interval between CRT and surgery/W&W had a significant confounding effect (with a tendency toward higher performance with longer intervals).

Table 3 Diagnostic performance and effect of reader experience level

Interobserver agreement and reader preference

Table 4 shows the interobserver agreement for the three scoring methods, including specified results for the expert and non-expert readers. Table 5 shows the reader feedback (i.e., perceived difficulty per case and overall preferred scoring methods). Group IOA (Krippendorff’s alpha) for all readers combined was similar for the 5-point confidence level score (α=0.55) and the 4-point risk score (α=0.57), and lower for the 2-point score (α=0.46). Agreement was higher for the MRI-experts compared to the less experienced readers, especially for the 5-point confidence score (α=0.64 versus 0.53) and for the 4-point risk score (α=0.65 versus 0.55). When looking at the individual variables included in the 4-point risk score, IOA for the assessment of EMVI and MRF involvement was higher than for the assessment of high-risk T-stage and nodal involvement. Most readers found the simplified 4-point and 2-point risk scores easier to apply, compared to the 5-point confidence level score; most readers (55%) selected the 4-point risk score as their preferred method of response prediction.
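For readers less familiar with Krippendorff’s alpha, the following minimal Python sketch shows how the coefficient can be computed from a coincidence matrix. It is an illustration only, not the implementation used for Table 4: the function name and toy ratings are hypothetical, and the sketch assumes a nominal distance metric (any disagreement counts equally), which fits a dichotomized score but may differ from the metric applied to the ordinal 4- and 5-point scores in this study.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal ratings.

    units: list of per-patient rating lists (one rating per reader).
    Patients with fewer than two ratings carry no agreement information
    and are skipped.
    """
    # Coincidence matrix: every ordered pair of ratings within a patient,
    # weighted by 1 / (number of ratings for that patient - 1).
    coincidences = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue
        for a, b in permutations(ratings, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per category and overall number of pairable ratings.
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    # Observed and expected disagreement (a common factor 1/n cancels out).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)

    if expected == 0:
        # All ratings fall in one category; alpha is formally undefined.
        # Returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return 1.0 - observed / expected


# Hypothetical toy data: 2 readers, 4 patients, dichotomized score (0/1).
toy_units = [(0, 0), (1, 1), (0, 1), (1, 0)]
print(krippendorff_alpha_nominal(toy_units))
```

An alpha of 1 indicates perfect agreement and 0 indicates agreement no better than chance, which is why values of 0.46–0.57 are described as moderate.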

Table 4 Interobserver agreement (Krippendorff’s alpha)
Table 5 Reader preference

Discussion

With this study, we investigated the value of MRI to estimate the chance that patients will achieve a (near-)complete response to neoadjuvant chemoradiotherapy based on visual morphologic risk assessment and staging performed on baseline MRI. A previously published 5-point confidence score and two simplified (4-point and 2-point) adaptations were tested and compared in terms of diagnostic performance, interobserver reproducibility, and reader preference. Diagnostic performance to predict a (near-)complete response was similar for the three methods with AUCs ranging between 0.71 and 0.74. When also considering interobserver agreement and reader preference, a 4-point risk score based on a combination of high-risk T-stage, MRF invasion, EMVI, and nodal involvement showed the most favorable results.

Interobserver agreement in our study was at best moderate (α=0.46–0.57), with somewhat better results for the MRI-experts, especially for the 4- and 5-point scores (α=0.64–0.65). The more expert radiologists also showed a tendency toward better diagnostic performance, although the difference in performance did not reach statistical significance in most cases. IOA for the most simplified 2-point risk score was similarly low for the experts and non-experts (α=0.44–0.47). The previously published confidence score proposed by van Griethuysen et al. is a relatively complex composite score that incorporates T-stage, size, signal, shape, N-stage, EMVI, and MRF involvement. We hypothesized that by simplifying this score, we might be able to improve the interreader reproducibility. However, such an effect was not observed. Nevertheless, most readers did show a clear preference for the simplified scoring systems, in particular the 4-point risk score, and indicated that they found this method more straightforward to apply. This scoring system may therefore be easier to adopt in daily practice, especially by more general readers.

Specificity to predict patients unlikely to achieve a (near-)complete response was relatively high (ranging between 68% and 73%) and considerably higher than the sensitivity of only 49%–59% to predict which patients would become (near-)complete responders. These results indicate that the study readers were better at identifying patients likely to end up with residual tumor. We hypothesize that recognizing the really “ugly” tumor cases (unlikely to ever reach organ preservation) may be relatively straightforward, while there is a broader spectrum of “intermediate risk” cases where it will be more challenging to predict which patients will proceed to show a good response versus a (near-)complete response to treatment. Interestingly, our results are also broadly in line with previous reports on assessing response in the restaging setting after completion of CRT, where radiologists are generally also better at identifying poor responders than at identifying complete (or near-complete) responders [21,22,23]. Ultimately, the selection of patients for organ preservation should not be based on imaging only, but informed by a combination of MRI, clinical (digital rectal) examination, and endoscopy [3, 24].

Of note, our current results are based solely on “simple” visual morphologic assessment and baseline staging of anatomical MR images by radiologists, without the need for additional quantitative measurements, advanced (functional) imaging sequences, or computational algorithms. The benefit of such an approach is that it can easily be implemented in daily practice and is relatively comprehensible for clinicians. An important drawback, however, is that it is also observer dependent and influenced by the experience level of radiologists, as also reflected by our results that show a tendency toward higher IOA and diagnostic performance for the more experienced study readers. Though we aimed to provide readers with clear scoring instructions (see Figs. 1 and 2), criteria such as ‘obvious nodal involvement’ and ‘bulky tumor’ remain subjective, which probably contributed to the relatively low IOA. These effects are less of an issue when using more quantitative or AI-based methods, which have formed a major topic of research in recent literature. Functional imaging parameters such as the Apparent Diffusion Coefficient (ADC) derived from diffusion-weighted MRI, and perfusion metrics (e.g., K-trans) derived from dynamic contrast-enhanced MRI, have all shown potential as pre-treatment predictors of response [5, 25]. In addition, “texture” features such as entropy and uniformity that reflect tissue heterogeneity have been associated with the chance of successful tumor response [15, 16, 26]. When combining such quantitative features in multivariable (radiomics) models, published reports have shown varying AUCs ranging between 0.68 and 0.97 to predict rectal tumor response at baseline [27]. Van Griethuysen et al. showed that the predictive performance of a quantitative AI model was similar to that of a visual morphologic response prediction performed by experienced radiologists [17].
Other studies have shown a complementary value for AI (radiomics) and visual morphologic evaluations and have demonstrated that combining these two approaches can increase diagnostic performance to predict response [17, 27,28,29,30]. Nevertheless, reported results for image-based prediction methods (whether visual morphologic, quantitative, or both) are highly variable, with AUCs in many reports not exceeding 0.80; such performance levels will likely not be considered sufficient to impact treatment planning. Response to anti-cancer treatment is a multifactorial process that is not only dependent on tumor size and morphology, but also on other patient factors and aspects of tumor biology that we cannot hope to capture by imaging. Important next steps in research will therefore be to combine image-based prediction methods with other clinical, histopathological, immunohistochemical, and genetic biomarkers that have shown promise as predictors of response and that were unfortunately not available for analysis in the current retrospective study cohort [27, 29,30,31,32,33]. Only in this way can we hope to achieve a strong enough predictive performance to serve as a basis for clinical decision-making, aiming to further boost personalized therapy in rectal cancer.

There are some limitations to our study design, in addition to its retrospective nature. To ensure that it would be feasible for a multitude of readers to complete the full set of study cases within an acceptable timeframe, the cohort size was deliberately kept relatively small. We fully acknowledge that the semi-random selection of patients from a larger cohort (ensuring a balanced sample in terms of representation of data from the different participating centers and response outcomes) may be prone to bias, though we are confident that our cohort including data from ten different centers offers a representative sample reflective of everyday clinical routine. The study dataset dates back to 2010, which entails that some MRIs were acquired with ‘outdated’ study protocols. Though we acknowledge these variations may have had an impact on overall scan quality, we believe that these effects will likely be limited considering that evaluations were mainly based on routine T2-weighted imaging, which will probably show less variation in quality over time than, for example, DWI. While the 17 less expert readers in our cohort were intended to offer a representative sample of radiologists reading rectal MRI in everyday clinical practice, we cannot rule out a certain selection bias considering that readers were recruited via an open call to ESGAR members (with a specific interest in rectal cancer). Finally, our results should be interpreted with some caution as we have shown that response, and corresponding performance to predict response, was influenced by variations in the interval between CRT and surgery/W&W. Prolonging the interval between CRT and surgery is a known factor that generally results in higher response rates [34,35,36,37,38]. Though variations were small (mean interval between CRT and surgery/W&W was 11 weeks with a standard deviation of 2.5 weeks), a confounding effect could nevertheless not be avoided in this retrospective study setting.

Conclusions

In conclusion, this multicenter and multireader study has shown that visual morphologic methods to predict response to chemoradiotherapy on baseline staging MRI have a moderate–good diagnostic performance to estimate the likelihood that patients will achieve a (near-)complete response to CRT. Specificity is relatively high, indicating that imaging is mainly good at identifying the more high-risk patients that are unlikely to achieve organ preservation. Overall interobserver agreement is moderate, with better results for more experienced radiologists. Compared to a previously published confidence-based scoring system, study readers preferred a more simplified 4-point risk score based on high-risk T-stage, MRF involvement, nodal involvement, and EMVI. While the results are too preliminary to base clinical decision-making on, they are encouraging and warrant further multidisciplinary research focused on combining imaging with other clinical predictors of response.