Introduction

Cardiovascular magnetic resonance (CMR) imaging is the reference standard for the assessment of cardiac morphology and function1,2. Among the derived parameters, left ventricular ejection fraction (LVEF) is most commonly used for cardiovascular risk assessment and clinical decision making3,4. Compared to echocardiography, CMR demonstrates superior inter-study reproducibility, resulting in considerably lower sample sizes required to show clinically relevant changes in left ventricular (LV) and right ventricular (RV) dimensions and function5,6. However, in clinical routine CMR requires extensive post-processing, which is time-consuming, tedious and prone to observer variability7,8,9,10. Despite efforts directed towards automation of volume and mass assessments, most approaches require manual preparation and preselection of CMR images11,12. More recently, novel artificial intelligence (AI)-based deep learning algorithms were introduced that allow for fully automated post-processing of LV mass and biventricular volumes, showing promising initial results including risk stratification following acute myocardial infarction13,14. Data on interstudy reproducibility are of high clinical importance for follow-up examinations, since observer experience and variability may significantly impact the identification of subtle clinical changes between exams10. Hence, the current study aimed to assess the impact of fully automated assessments on inter-study variability and reliability in comparison to an experienced and an inexperienced observer, to define the current potential and limitations of fully automated post-processing.

Methods

Study population

The study population consisted of 18 participants who were scanned twice at a median interval of 63 days (range 49–87) using a standardized imaging protocol for anatomy and function15,16. All participants were in stable sinus rhythm during image acquisition. A minimum of 6 weeks between the first and second scan was required to avoid recollection bias of the involved CMR staff. Care was taken that acquisitions were performed at the same levels of the heart and that no change in symptoms or medication occurred in patients with heart failure. Furthermore, new onset of cardiac disease was excluded in healthy subjects. The study was approved by the Ethics Committee of the Charité-University Medicine Berlin and was conducted according to the principles of the Helsinki Declaration. All participants gave written informed consent before randomization. The study was supported by the German Centre for Cardiovascular Research (DZHK).

Cardiovascular magnetic resonance imaging

Electrocardiogram (ECG)-gated balanced steady-state free precession (bSSFP) cine images were acquired in 10–16 equidistant short axis (SA) planes covering both entire ventricles on a clinical MR scanner (1.5 T, Achieva, Philips Healthcare, Best, The Netherlands). Imaging parameters were as follows: 25 frames/cardiac cycle, pixel spacing 0.8 mm × 0.8 mm, 8 mm slice thickness as well as inter-slice gap, TE 1.5 ms, TR 3 ms.

Manual volumetric assessments were performed in SA orientations according to standardized recommendations17 by an experienced CMR operator (observer A, cardiologist, 3 years of CMR experience) and an inexperienced operator (observer B, trainee in cardiology, no experience in reporting or CMR segmentation), who was trained for 45 min by the experienced observer using five cases from the SCMR consensus data18. Long-axis views (4-chamber and 2-chamber) were crosslinked to define RV and LV basal segments. Dedicated commercially available post-processing software was employed for manual assessments (QMass, Version 3.1.16.0, Medis Medical Imaging Systems, Leiden, The Netherlands). Fully automated analyses were performed in SA stacks with suiteHEART (Version 4.0.6, Neosoft, Pewaukee, WI, USA) (Fig. 1). Papillary muscles were included within the myocardium. Fully automated analyses were not manually post-processed or validated, and manual segmentations were not supported by any semi-automated processing, e.g. threshold or edge detection. All operators were blinded to their previous as well as each other’s results. Volumetric analyses comprised LV mass, LV and RV end-diastolic and end-systolic volumes (EDV/ESV) as well as stroke volumes (SV) and EF. Interstudy agreements were evaluated for manual assessment of observer A, manual assessment of observer B as well as fully automated analyses.

Figure 1

Fully automated biventricular segmentation. The figure depicts automated biventricular volume assessments for a representative volunteer at the basal, midventricular and apical level at baseline MRI (Exam A) and follow-up MRI (Exam B). Higher inter-study variability may potentially be induced by the basal segmentation in the example.

Statistical analyses

Statistics were calculated using IBM SPSS Version 24 for Windows (IBM, Armonk, NY, USA) and Microsoft Excel. Continuous parameters are reported as mean and corresponding standard deviation (SD). Changes from Exam 1 to Exam 2 were evaluated using the Wilcoxon signed-rank test for dependent continuous parameters. An alpha level of 0.05 was used as the threshold for statistical significance. Inter-study and inter-observer variability was assessed using intra-class correlation coefficients (ICC) based on absolute agreement (excellent ICC > 0.74, good between 0.60 and 0.74, fair between 0.4 and 0.59 and poor below 0.4)19, the coefficient of variation (CoV, the SD of the mean difference (MD) divided by the mean, SD(MD)/mean) as well as Bland–Altman plots [mean difference between measurements with 95% confidence interval (CI)]20. Intra-observer reproducibility of the automated algorithm has been addressed previously, yielding ICC = 1 and CoV 0%13. Sample sizes were calculated for the detection of absolute changes of 10 g LV mass, 10 ml LV and RV EDV/ESV/SV as well as 5% change in LV/RV-EF for a power of 80% and an α-error of 0.05 using the formula \(n = f(\alpha, P) \cdot \sigma^{2} \cdot \frac{2}{\delta^{2}}\), where n = sample size, f = factor taking α (level of significance) and P (study power) into account (f = 7.85 for α = 0.05 and P = 0.8), σ = interstudy standard deviation of the mean difference between Exam 1 and 2, and δ = the magnitude of the difference to be detected5,6.
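To illustrate the variability metrics described above, the following minimal Python sketch computes the mean difference, the SD of the differences, the CoV and Bland–Altman limits for paired measurements. It is not the SPSS/Excel workflow used in the study. The helper name `interstudy_metrics`, the use of the pooled mean of both exams as the CoV denominator, the limits of agreement at ± 1.96 × SD and the example values are all assumptions made for illustration.

```python
from statistics import mean, stdev


def interstudy_metrics(exam1, exam2):
    """Paired agreement metrics for two repeated exams (hypothetical helper)."""
    if len(exam1) != len(exam2) or len(exam1) < 2:
        raise ValueError("need two equally long series with at least 2 values")

    # Per-participant difference between the second and the first exam.
    diffs = [b - a for a, b in zip(exam1, exam2)]
    bias = mean(diffs)          # mean difference (Bland-Altman bias)
    sd_diff = stdev(diffs)      # sample SD of the differences (n - 1)

    # Assumption: "the mean" in the CoV is the mean of all measurements
    # pooled over both exams.
    overall_mean = mean(list(exam1) + list(exam2))

    return {
        "bias": bias,
        "sd_diff": sd_diff,
        "cov_percent": 100.0 * sd_diff / overall_mean,
        # Assumption: limits of agreement taken as bias +/- 1.96 * SD.
        "loa_lower": bias - 1.96 * sd_diff,
        "loa_upper": bias + 1.96 * sd_diff,
    }


# Made-up LV EDV values in ml, not study data.
exam1 = [100.0, 120.0, 140.0, 160.0]
exam2 = [102.0, 118.0, 143.0, 157.0]
for name, value in interstudy_metrics(exam1, exam2).items():
    print(f"{name}: {value:.2f}")
```

Using the sample SD (n − 1) rather than the population SD is a further choice of this sketch; with 18 participants the two differ only slightly.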

Results

Study population

The study population consisted of 18 participants, 11 with normal biventricular function and 7 with heart failure, the latter including 3 patients with heart failure with preserved ejection fraction (HFpEF) and 4 patients with heart failure with reduced ejection fraction (HFrEF). The mean age was 46 years with an SD of 23. Ten participants were male and 8 female. All SA stacks were assessed by observers A and B as well as by the fully automated software algorithm. Results for LV and RV volumes are reported in Table 1. LV volumes and function were not significantly different between exams 1 and 2 for observers A and B or for automated analyses. Statistically significant differences in RV volumetry were observed for observer A and the automated software algorithm (Table 1). Manual post-processing took on average 8.5 ± 1.7 min and 13.2 ± 2.8 min for the experienced and inexperienced observer, respectively, as opposed to < 1 min per SA stack for automated analyses.

Table 1 Cardiac volumes.

Reproducibility

For interstudy reproducibility, mean differences as well as corresponding SD, ICC and CoV of LV and RV volumes are reported in Table 2; corresponding Bland–Altman plots are displayed in Figs. 2, 3 and 4. LV reproducibility was overall excellent (ICC 0.86–1.00), best for observer A (ICC > 0.98), followed by fully automated analyses (ICC > 0.93) and observer B (ICC > 0.86). Interstudy reproducibility of RV volumes was excellent for observer A (ICC > 0.88), good to excellent for automated analyses (ICC 0.69–0.92) and fair to excellent for observer B (ICC 0.46–0.95). Similarly, the lowest interstudy variability in LV volumes was found for observer A (CoV < 9.6%), followed by fully automated analyses (CoV < 12.4%) and observer B (CoV < 18.8%). Regarding RV analyses, the lowest interstudy variability was found for observer A (CoV < 10.7%), whilst fully automated analyses (CoV < 22.8%) as well as observer B (CoV < 28.7%) demonstrated considerable interstudy variability.
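For readers who want to reproduce this kind of ICC grading, a minimal Python sketch follows. The exact ICC model configured in SPSS is not stated in the text, so the sketch assumes the two-way, single-measurement, absolute-agreement form, often written ICC(A,1). The function names `icc_absolute_agreement` and `classify_icc` and the example values are hypothetical. The cut-offs follow the categories given in the statistical methods, with values between 0.59 and 0.60 assigned to "fair" as a simplifying choice.

```python
def icc_absolute_agreement(ratings):
    """ICC(A,1): two-way model, absolute agreement, single measurement.

    ratings: one row per participant, one column per exam (or observer).
    Assumed model; the study's exact SPSS settings are not specified.
    """
    n = len(ratings)
    k = len(ratings[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a rectangular table with >= 2 rows and columns")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Mean squares of the two-way ANOVA decomposition.
    ms_rows = k * sum((r - grand) ** 2 for r in row_means) / (n - 1)
    ms_cols = n * sum((c - grand) ** 2 for c in col_means) / (k - 1)
    ss_err = sum(
        (ratings[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n)
        for j in range(k)
    )
    ms_err = ss_err / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


def classify_icc(icc):
    """Map an ICC to the agreement categories used in the Methods."""
    if icc > 0.74:
        return "excellent"
    if icc >= 0.60:
        return "good"
    if icc >= 0.40:
        return "fair"
    return "poor"


# Made-up example: exam 2 reads exactly 10 ml higher than exam 1.
pairs = [[100.0, 110.0], [120.0, 130.0], [140.0, 150.0]]
icc = icc_absolute_agreement(pairs)
print(f"ICC = {icc:.3f} ({classify_icc(icc)})")
```

Because absolute agreement is assessed, a systematic offset between exams lowers the ICC even when participants are ranked identically, which is why the constant 10 ml shift in the example does not yield an ICC of 1.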

Table 2 Interstudy reproducibility.
Figure 2

Agreement of short axis volume assessments based on fully automated analyses. Bland–Altman plots are shown for interstudy reproducibility of left (LV) and right (RV) ventricular end-diastolic (EDV) and end-systolic (ESV) volumes as well as corresponding stroke volume (SV) and ejection fraction (EF). LV assessments also included LV mass. Δ = difference between interstudy measurements; red: bias; green: limits of agreement.

Figure 3

Agreement of short axis volume assessments based on the experienced observer. Bland–Altman plots are shown for interstudy reproducibility of left (LV) and right (RV) ventricular end-diastolic (EDV) and end-systolic (ESV) volumes as well as corresponding stroke volume (SV) and ejection fraction (EF). LV assessments also included LV mass. Δ = difference between interstudy measurements; red: bias; green: limits of agreement.

Figure 4

Agreement of short axis volume assessments based on the inexperienced observer. Bland–Altman plots are shown for interstudy reproducibility of left (LV) and right (RV) ventricular end-diastolic (EDV) and end-systolic (ESV) volumes as well as corresponding stroke volume (SV) and ejection fraction (EF). LV assessments also included LV mass. Δ = difference between interstudy measurements; red: bias; green: limits of agreement.

For interobserver reproducibility, mean differences as well as corresponding SD, ICC and CoV of LV and RV volumes are reported in Table S1 (supplementary material), comparing automated with experienced and inexperienced manual analyses as well as experienced with inexperienced manual analyses. Interobserver reproducibility was overall excellent for LV analyses (ICC 0.92–0.99) and fair to excellent for RV metrics (ICC 0.43–0.97). Fully automated LV analyses showed better agreement with experienced than with inexperienced analyses. The automated algorithm overestimated RV EDV (mean difference 12.9 ± 13.8 ml/m2) and RV ESV (mean difference 10.5 ± 12.7 ml/m2) as compared to the experienced observer, while it underestimated RV ESV (mean difference − 14.8 ± 9.0 ml/m2) as compared to the inexperienced observer.

Sample size calculations

Sample sizes required for the detection of absolute changes in volumetric indices (10 g mass, 10 ml in volume or 5% EF) are reported in Table 3. Sample sizes were smallest for observer A, followed by fully automated analyses, and largest for observer B. Whilst sample sizes of automated analyses for LV volumes were similar to those of observer A, sample sizes of automated analyses for RV volumes were similar to those of observer B. LV sample sizes ranged between n = 5 for LV mass and n = 11 for ESV for observer A, between n = 6 for EF and n = 32 for EDV for automated analyses, and between n = 19 for EF and n = 89 for EDV for observer B. RV sample sizes ranged between n = 6 for ESV and n = 9 for SV for observer A, between n = 27 for ESV and n = 77 for EDV for automated analyses, and between n = 42 for ESV and n = 73 for SV for observer B.
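The sample sizes above follow from the formula given in the statistical methods, n = f(α, P) · σ² · 2/δ². A minimal Python sketch of that arithmetic is given below. The function names `power_factor` and `required_sample_size` are hypothetical. The factor is derived here as the squared sum of the two-sided normal quantile for α and the normal quantile for the power, an assumed derivation that reproduces the stated f = 7.85. Rounding up to the next whole participant and the example σ values are further assumptions, not study data.

```python
import math
from statistics import NormalDist


def power_factor(alpha=0.05, power=0.80):
    """Factor f(alpha, P), assumed to be (z_{1-alpha/2} + z_P)^2."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided significance
    z_power = NormalDist().inv_cdf(power)
    return (z_alpha + z_power) ** 2


def required_sample_size(sigma, delta, f=7.85):
    """n = f * sigma^2 * 2 / delta^2, rounded up (rounding is an assumption).

    sigma: interstudy SD of the difference between Exam 1 and Exam 2
    delta: magnitude of the change to be detected (same unit as sigma)
    """
    if sigma < 0 or delta <= 0:
        raise ValueError("sigma must be >= 0 and delta must be > 0")
    return math.ceil(f * sigma ** 2 * 2.0 / delta ** 2)


print(f"f(0.05, 0.80) = {power_factor():.2f}")

# Made-up interstudy SDs (ml) for detecting a 10 ml change.
for sigma in (5.0, 10.0, 20.0):
    print(f"sigma = {sigma:4.1f} ml -> n = {required_sample_size(sigma, 10.0)}")
```

Since n grows with the square of σ, doubling the interstudy SD roughly quadruples the required sample size, which is consistent with the large gap between observer A and observer B reported above.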

Table 3 Sample size calculation.

Discussion

The present study evaluates the interstudy variability of LV mass as well as LV and RV volumes quantified using a fully automated post-processing algorithm. Concerning LV analyses, the results demonstrate similarly high interstudy reproducibility of fully automated analyses as compared to an experienced CMR observer and show superior performance of fully automated analyses as compared to an inexperienced observer. In contrast, reliability of automated RV analyses is notably lower as compared to an experienced CMR observer.

CMR imaging represents the reference standard for the assessment of cardiac morphology and function due to a precise evaluation of bSSFP SA stacks covering the entire LV and RV1. However, in many departments CMR examinations are still not easily available since MR scanners are not always dedicated to CMR, and examinations and post-processing of the images are relatively time-consuming compared to other examinations. As a result, cost-effectiveness is lower compared with competing methodology such as echocardiographic approaches, even though CMR diagnostic information can often be considered of higher value7,9. Nevertheless, mounting evidence emphasizes the need for CMR examinations in an increasing number of cardiac diseases21. To achieve high-quality diagnostic examinations, experience and training are important; they have a distinct effect on volumetric analyses and are particularly required in challenging anatomic conditions, e.g. patients with congenital heart disease10,22. User-independent fully automated assessments have been introduced for the evaluation of biventricular volumes, showing promising results11. Machine learning and AI-based algorithms23 may indeed complement varying levels of user experience. Furthermore, process efficiency may be improved because volumetric analyses of SA stacks may already be performed in parallel with scanning, e.g. during late gadolinium enhancement (LGE) imaging, which might reduce analysis time and ultimately costs. Our results support a reliable use of fully automated LV analyses, showing objective and reproducible results.

Recently, automated analyses demonstrated feasibility and equally predictive prognostic value in 1017 patients following acute myocardial infarction compared to conventional analyses by trained and experienced medical personnel14. Several previous studies applying the proposed automated algorithm showed consistently high interobserver reproducibility with experienced CMR observers13,14. The feasibility and reliability of automated LV analyses in clinical routine imaging is further underlined by the present data demonstrating high interstudy reproducibility. Future applications may expand to automated tissue characterisation, e.g. scar quantification14, as well as deformation imaging24. Deformation imaging has gained recognition for enhanced risk prediction beyond conventional volumetry-derived functional analyses, e.g. following acute myocardial infarction25 as well as in ischemic and non-ischemic cardiomyopathy26. However, ongoing discussions about the reproducibility of deformation-based approaches27 and limited data from large clinical trials still hamper its unrestricted clinical use. At the current time, cardiac volumetric analyses remain the gold standard for quantitative functional assessments, despite their inability to assess regional function.

Guidelines for clinical decision making are inevitably based upon thresholds28. In certain clinical scenarios, decision making heavily relies on changes between serial examinations e.g. recovery of LVEF following acute myocardial infarction to evaluate implantable cardioverter defibrillator (ICD) therapy3. Serial examinations rely on the assumption that changes in cardiac mass and volumes are reliably detectable. However, most CMR imaging laboratories employ several CMR operators, often with different training experience, resulting in potential inter-observer variability if serial CMR examinations are analysed by different observers. This study confirms an overall excellent interstudy reproducibility for LV mass and volumes, best for manual assessments by an experienced observer and user-independent automated analyses and slightly lower for an inexperienced observer. Reproducibility of RV volumes was overall lower compared to LV metrics, which is in line with the available literature6. Whilst the experienced observer still achieved good to excellent reproducibility, variability between exams was high for the inexperienced observer. Automated assessments of RV volumes resulted in a slight improvement of reproducibility as compared to the inexperienced observer. We observed numerical differences for RV volumetry both for manual and automated analysis between the repeated exams. Even though they were statistically significant, their respective clinical relevance with a change of 2% in RV-EF should be interpreted with caution. On the other hand, defined cut-offs (e.g. for arrhythmogenic right ventricular cardiomyopathy (ARVC) end-diastolic volumes beyond 110 ml/m2 for male and 100 ml/m2 for female patients or an EF below 40%29) require precise volume assessments. Thus, inaccuracies in RV volume assessments bear potential clinical consequences. 
The present data support current evidence that precise and correct quantifications of RV metrics remain challenging and still require dedicated training which is probably due to the more complex anatomy of the RV as compared to the LV10,30. Because a strong link between RV functional but not structural changes with prognosis following acute myocardial infarction has been demonstrated31, the field of automated RV assessment and required analysis refinement and improvement warrants further investigation.

Limitations

Sample size calculations and derived conclusions are based on n = 18 participants. Although reports indicate low sample sizes in CMR volume assessments5, statistical evaluations and generalisation may be limited. Detailed specifications of the automated algorithm that incorporates AI and deep learning models developed by the manufacturer are not disclosed; therefore, they cannot be described more precisely. The results of the study therefore apply to this specific cohort. Without knowing the exact types of scans used in the software’s training, it might be difficult to extrapolate the results to other cohorts, which should definitely be addressed in larger future studies. Furthermore, it will be interesting to address whether or not the results can be extrapolated to patients with a more demanding anatomy (e.g. patients with congenital heart disease).

Conclusion

In this cohort, fully automated user-independent analyses allowed reliable serial investigations of LV volumes and function with comparably high interstudy reproducibility in relation to manual analyses performed by an experienced CMR observer. In contrast, fully automated RV assessments did not yet provide satisfying interstudy reproducibility and still require manual post-processing corrections by an experienced reader.