Fully Automated, Quality-Controlled Cardiac Analysis From CMR: Validation and Large-Scale Application to Characterize Cardiac Function

Objectives This study sought to develop a fully automated framework for cardiac function analysis from cardiac magnetic resonance (CMR), including comprehensive quality control (QC) algorithms to detect erroneous output.
Background Analysis of cine CMR imaging using deep learning (DL) algorithms could automate ventricular function assessment. However, variable image quality, variability in phenotypes of disease, and unavoidable weaknesses in training of DL algorithms currently prevent their use in clinical practice.
Methods The framework consists of a pre-analysis DL image QC, followed by a DL algorithm for biventricular segmentation in long-axis and short-axis views, myocardial feature-tracking (FT), and a post-analysis QC to detect erroneous results. The study validated the framework in healthy subjects and cardiac patients by comparison against manual analysis (n = 100) and evaluation of the QC steps’ ability to detect erroneous results (n = 700). Next, this method was used to obtain reference values for cardiac function metrics from the UK Biobank.
Results Automated analysis correlated highly with manual analysis for left and right ventricular volumes (all r > 0.95), strain (circumferential r = 0.89, longitudinal r > 0.89), and filling and ejection rates (all r ≥ 0.93). There was no significant bias for cardiac volumes and filling and ejection rates, except for right ventricular end-systolic volume (bias +1.80 ml; p = 0.01). The bias for FT strain was <1.3%. The sensitivity of detection of erroneous output was 95% for volume-derived parameters and 93% for FT strain. Finally, reference values were automatically derived from 2,029 CMR exams in healthy subjects.
Conclusions The study demonstrates a DL-based framework for automated, quality-controlled characterization of cardiac function from cine CMR, without the need for direct clinician oversight.

Keywords cardiac aging; cardiac function; cardiac magnetic resonance; CMR feature tracking; machine learning; quality control
Cardiac magnetic resonance (CMR) enables full coverage of the heart at high spatial and temporal resolution, without the limited acquisition windows of echocardiography or the ionizing radiation of computed tomography (1). Cine CMR has become the gold standard for non-invasive quantification of cardiac volumes and ejection fraction (EF) (1). However, cine CMR images hold considerably more detailed information, which allows for quantification of advanced markers of cardiac function such as ventricular shape (2), ejection and filling rates (3), myocardial wall motion, and myocardial strain (ε) (4,5). These parameters have been shown to be valuable biomarkers for earlier detection and monitoring of disease (2-5). However, obtaining them is time and labor intensive. Moreover, although large-scale studies have provided meaningful reference values and standards for analysis of cardiac volumes and EF (6,7), such studies are absent for the remaining biomarkers. As a result, the use of these advanced markers in clinical practice has so far been limited.
Recent advances in deep learning (DL) algorithms show great promise for the automation of CMR analysis. Convolutional neural networks (CNNs) have achieved previously unmatched accuracy in many image analysis challenges (8). Using CNNs, a wide set of cardiac functional parameters could potentially be obtained automatically from CMR. Several groups have shown that CNNs can provide accurate end-diastolic and end-systolic cardiac segmentations from CMR in preselected images (9-11). Although these results have gained significant attention, the practical implementation of DL algorithms in clinical practice and research is hindered by a lack of appropriate quality control (QC). Variable image quality, image artefacts, and unusual anatomic variations (not seen during training) are unavoidable in clinical imaging, and can result in significant errors if such images are analyzed automatically. Therefore, robust QC measures to detect (potentially) erroneous output are a prerequisite to the translation of DL algorithms into clinical practice (12).
We aim to address this issue by developing a pipeline for comprehensive analysis of cardiac function (cardiac volumes, filling and ejection dynamics and myocardial strain) that includes robust QC mechanisms, which allows for automated cine CMR analysis without clinician oversight. Using our pipeline, we provide reference values for a range of automatically derived cardiac metrics that have not previously been reported in large subject cohorts.

Image Analysis Pipeline
The developed image analysis pipeline consists of a DL algorithm for segmentation of short-axis (SAX) and 2- and 4-chamber long-axis (LAX) cine CMR stacks, automated calculation of cardiac functional parameters, and 2 QC steps: 1 before the segmentation and analysis steps (QC1) and 1 after (QC2). For an illustration of the pipeline, see the Central Illustration and Video 1. Our pipeline is available for further training and use via the corresponding author.
Step 1: Pre-Analysis Image QC (QC1)
All CMR images were screened for the presence of motion artefacts (artefacts due to inconsistent breath-holding, mistriggering, or arrhythmias) and erroneous planning of the 4-chamber view using 2 CNNs: a 2-dimensional CNN with a recurrent long short-term memory layer trained to detect motion artefacts and a 2-dimensional CNN trained to detect erroneous planning of the 4-chamber view (CNN 4Ch). We have previously published a detailed description of the architecture, training, and validation of both algorithms (13,14).

Step 2: Image Segmentation
After QC1, a 17-layer CNN (CNN segment) was used to segment the left ventricle (LV) and right ventricle (RV), including the LV myocardium, in all frames of the cine CMR. This network was trained using manual segmentations of cine CMR images in 3,975 subjects, comprising both healthy volunteers and patients with a wide variety of cardiac diseases (10).

Step 3: Parameter Calculation
After segmentation, the SAX and LAX imaging stacks were aligned using an iterative alignment process to correct for different breath-hold positions and motion between the different cine-acquisitions (15). Next, LV and RV volume curves and LV mass (LVM) were calculated. From the volume curves, end-diastolic volume (EDV), end-systolic volume (ESV), stroke volume (SV), EF, peak ejection rate, peak early filling rate, atrial contribution (AC), and peak atrial filling rate were obtained.
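To illustrate how the volume-derived parameters relate to the volume curve, the following is a minimal Python sketch and not the pipeline's actual implementation. The function name `volume_curve_metrics`, the uniform frame spacing, and the use of simple finite differences are assumptions made for illustration. The sketch reports a single peak filling rate; separating early from atrial filling would additionally require locating diastasis, which is not attempted here.

```python
from typing import Dict, List


def volume_curve_metrics(volumes_ml: List[float], frame_dt_s: float) -> Dict[str, float]:
    """Derive basic functional parameters from one ventricular volume curve.

    Assumptions of this sketch (not taken from the paper): frames are
    uniformly spaced in time, cover a single cardiac cycle, and the curve
    starts at or near end-diastole.
    """
    if len(volumes_ml) < 3 or frame_dt_s <= 0:
        raise ValueError("need at least 3 frames and a positive frame spacing")

    edv = max(volumes_ml)  # end-diastolic volume: largest volume in the cycle
    esv = min(volumes_ml)  # end-systolic volume: smallest volume in the cycle
    if edv <= 0:
        raise ValueError("volumes must be positive")
    es_idx = volumes_ml.index(esv)

    sv = edv - esv          # stroke volume (ml)
    ef = 100.0 * sv / edv   # ejection fraction (%)

    # Forward finite differences approximate dV/dt in ml/s.
    rates = [
        (volumes_ml[i + 1] - volumes_ml[i]) / frame_dt_s
        for i in range(len(volumes_ml) - 1)
    ]

    # Peak ejection rate: steepest volume decrease before end-systole.
    peak_ejection = max((-r for r in rates[:es_idx]), default=0.0)
    # Peak filling rate: steepest volume increase from end-systole onward.
    peak_filling = max(rates[es_idx:], default=0.0)

    return {
        "EDV_ml": edv,
        "ESV_ml": esv,
        "SV_ml": sv,
        "EF_percent": ef,
        "peak_ejection_rate_ml_s": peak_ejection,
        "peak_filling_rate_ml_s": peak_filling,
    }
```

For example, calling `volume_curve_metrics([150, 120, 80, 60, 70, 110, 120, 150], 0.05)` on this hypothetical 8-frame curve gives an EDV of 150 ml, an ESV of 60 ml, and an EF of 60%.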
Subsequently, CMR feature tracking (FT) was automatically performed on 3 SAX slices and on the 2- and 4-chamber LAX images. We previously published the details of this method (16). Briefly, CMR FT was performed using the Medical Image Registration ToolKit. The end-diastolic LV wall segmentations were used as the region of interest for the FT algorithm. Global circumferential strain (ε circ), radial strain (ε rad), and longitudinal strain (ε long) were computed from the FT results.
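The text does not state the strain definition used; as a hedged illustration, the sketch below assumes the commonly used Lagrangian (engineering) strain, in which the length of a tracked contour at each frame is expressed relative to its end-diastolic length. The function names and the choice of the first frame as the end-diastolic reference are assumptions for illustration only.

```python
from typing import List


def lagrangian_strain_percent(lengths_mm: List[float]) -> List[float]:
    """Strain curve (%) relative to the first frame.

    Assumption of this sketch: the first frame is end-diastole and serves
    as the reference length L0, so strain = 100 * (L - L0) / L0.
    """
    if not lengths_mm or lengths_mm[0] <= 0:
        raise ValueError("reference length must be positive")
    l0 = lengths_mm[0]
    return [100.0 * (length - l0) / l0 for length in lengths_mm]


def peak_strain(strain_curve: List[float]) -> float:
    """Return the strain value with the largest magnitude, keeping its sign."""
    return max(strain_curve, key=abs)
```

With this convention, shortening (circumferential and longitudinal directions) yields negative strain and wall thickening (radial direction) yields positive strain.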

Step 4: Post-Analysis QC (QC2)
In QC2, we first evaluated the orientation of the images, the presence of missing slices, and the coverage of the segmentations over the heart. We automatically compared the aligned LAX and SAX images and segmentations to determine the image plane intersections (e.g., did the LAX images intersect the mitral valve and apex in SAX?), presence of missing slices (e.g., did the SAX stack cover the full length of the LAX segmentation?), and the coverage of segmentations (e.g., did the LAX segmentation reach a similar level as the SAX segmentation and vice versa?). Next, the output parameters were inspected. If there was a >10% difference between LV and RV SV or a >10% difference between ventricular volumes on the first and last cardiac phase, the exams were flagged. Lastly, we implemented 2 support vector machine (SVM) classification algorithms to detect abnormalities in the obtained volume curves (SVM vol) and strain curves (SVM strain). These SVMs were trained using output of the CNN segment and FT algorithm from 500 UK Biobank subjects (300 healthy subjects and 200 subjects with cardiomyopathy). These datasets were classified by an expert CMR cardiologist as right or wrong/unusual on the basis of the shape of the volume and strain curves, as well as the corresponding functional parameters.
All cases detected during the QC steps were flagged for clinician review.
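The two rule-based output checks in QC2 can be sketched as follows. This is a minimal illustration rather than the pipeline's code: the function names and flag labels are hypothetical, and because the text does not specify the denominator of the 10% criterion, the sketch assumes the difference is taken relative to the mean of the two values.

```python
from typing import List, Sequence


def relative_difference(a: float, b: float) -> float:
    """Absolute difference relative to the mean of the two values (assumed convention)."""
    mean = (a + b) / 2.0
    if mean == 0:
        return 0.0
    return abs(a - b) / abs(mean)


def qc2_rule_flags(
    lv_volumes_ml: Sequence[float],
    rv_volumes_ml: Sequence[float],
    threshold: float = 0.10,
) -> List[str]:
    """Return the list of rule-based flags raised for one exam."""
    flags: List[str] = []

    # Rule 1: LV and RV stroke volumes should be similar.
    lv_sv = max(lv_volumes_ml) - min(lv_volumes_ml)
    rv_sv = max(rv_volumes_ml) - min(rv_volumes_ml)
    if relative_difference(lv_sv, rv_sv) > threshold:
        flags.append("lv_rv_stroke_volume_mismatch")

    # Rule 2: the volume curve should close, i.e. the first and last
    # cardiac phases should have similar volumes.
    for name, curve in (("lv", lv_volumes_ml), ("rv", rv_volumes_ml)):
        if relative_difference(curve[0], curve[-1]) > threshold:
            flags.append(f"{name}_first_last_phase_mismatch")

    return flags
```

An exam for which the returned list is non-empty would be passed on for clinician review, alongside cases flagged by the geometric checks and the SVM classifiers, which are not reproduced here.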

Pipeline Validation
We validated our method in 2 ways. First, we compared the results obtained to manual analysis by an experienced CMR cardiologist (Validation1) in 50 healthy volunteers and 50 patients with cardiomyopathy. These cases were not previously used during training of the algorithms and were randomly selected after having successfully passed the algorithm's QC steps. During the manual analysis, ventricular volumes were segmented at each cardiac phase using commercially available CMR analysis software, CVi42 (Version 5.10.1, Circle, Calgary, Alberta, Canada). With the same software, CMR FT was performed to obtain strain values.
Second, we evaluated the ability of the full pipeline to detect errors in the analysis (Validation2) in a further 700 cases (500 healthy subjects and 200 patients with cardiomyopathy) randomly selected from the UK Biobank cohort, again excluding cases used during training. An experienced CMR cardiologist, blinded to the pipeline's verdict, critically reviewed the segmentations, volume and strain curves, and parameters obtained in step 3 and classified them as correct or erroneous. This process was facilitated by visually representing the images with segmentations and outcome parameters for each case in a single panel to ensure apt identification of errors (Supplemental Figure 1, Video 2).

Obtaining Reference Values
After validation, we used the developed pipeline to obtain reference values. Healthy subjects were selected from a total of 9,619 cases in the UK Biobank that underwent CMR (17), excluding all subjects with a history of cardiovascular disease, cardiovascular risk factors, other systemic diseases, those taking medication for any systemic disease, and subjects with a body mass index >30 kg/m² (see all exclusion criteria in Supplemental Table 1).

Statistics
Validation1: Dice coefficients were calculated to compare the manual and automated segmentations. Bland-Altman analysis and Pearson's correlations were used to compare the obtained cardiac volumes, filling and ejection rates, and peak global strains to the manual analysis, and paired t-tests were applied to test whether the mean bias differed from zero. Finally, we compared the mean absolute errors of all parameters between healthy subjects and patients with disease using paired t-tests.
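For readers unfamiliar with these agreement measures, the following minimal Python sketch shows the standard definitions of the Dice coefficient and of the Bland-Altman bias with 95% limits of agreement. It is an illustration only; the function names and the representation of masks as flattened binary sequences are assumptions, not details taken from the pipeline.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def dice_coefficient(mask_a: Sequence[int], mask_b: Sequence[int]) -> float:
    """Dice overlap, 2|A ∩ B| / (|A| + |B|), of two flattened binary masks."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    # Two empty masks are treated as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def bland_altman(
    automated: Sequence[float], manual: Sequence[float]
) -> Tuple[float, float, float]:
    """Return (bias, lower limit, upper limit) of agreement.

    Bias is the mean of the paired differences (automated minus manual);
    the limits are bias ± 1.96 × the sample standard deviation.
    """
    if len(automated) != len(manual) or len(automated) < 2:
        raise ValueError("need at least two paired measurements")
    diffs = [a - m for a, m in zip(automated, manual)]
    bias = mean(diffs)
    sd = stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd
```

A Dice coefficient of 1 indicates identical segmentations and 0 indicates no overlap; a positive bias indicates that the automated values are on average larger than the manual ones.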
Validation2: Sensitivity (% of manually labelled erroneous output that was correctly detected by the pipeline during QC), specificity (% of output manually labelled as error-free that was not flagged by the pipeline during QC), and balanced accuracy were calculated for the total pipeline's performance for volume and strain analysis, as well as for each individual parameter.
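The three detection metrics defined above can be computed as in the following minimal Python sketch, in which "positive" means erroneous output and balanced accuracy is the mean of sensitivity and specificity. The function name and the boolean-list inputs are hypothetical and chosen for illustration.

```python
from typing import Dict, Sequence


def qc_detection_metrics(
    is_erroneous: Sequence[bool], is_flagged: Sequence[bool]
) -> Dict[str, float]:
    """Sensitivity, specificity, and balanced accuracy of error detection.

    is_erroneous: expert label per case (True = output judged erroneous).
    is_flagged:   QC verdict per case (True = flagged for review).
    """
    if len(is_erroneous) != len(is_flagged):
        raise ValueError("inputs must have the same length")

    pairs = list(zip(is_erroneous, is_flagged))
    tp = sum(1 for e, f in pairs if e and f)          # erroneous and flagged
    fn = sum(1 for e, f in pairs if e and not f)      # erroneous but missed
    tn = sum(1 for e, f in pairs if not e and not f)  # correct and not flagged
    fp = sum(1 for e, f in pairs if not e and f)      # correct but flagged

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one erroneous and one error-free case")

    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2.0,
    }
```

Balanced accuracy is useful here because erroneous cases are much rarer than correct ones, so plain accuracy would be dominated by the error-free class.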
Reference values: Data were stratified by sex and by age decade (45 to 54, 55 to 64, and 65 to 74 years), and the means and reference ranges (95% prediction intervals) were defined (18). Outliers, defined a priori as values 3 interquartile ranges below the first or above the third quartile, were removed from the analysis. Cardiac volumes were indexed to body surface area using the Dubois and Dubois formula (19). We used linear regression analysis to assess the impact of age on ventricular volumes, filling and ejection dynamics, and strains. For all analyses, p values were corrected using Bonferroni correction for multiple comparisons. A p value of <0.05 after correction was considered statistically significant.
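As an illustration of two of these preprocessing steps, the minimal Python sketch below implements the Du Bois and Du Bois body surface area formula (BSA = 0.007184 × weight^0.425 × height^0.725, with weight in kg and height in cm) and the 3-interquartile-range outlier rule. The function names are hypothetical, and the quartile estimation method (the default of Python's `statistics.quantiles`) is an assumption, because the text does not state which method was used.

```python
from statistics import quantiles
from typing import List, Sequence


def bsa_du_bois(weight_kg: float, height_cm: float) -> float:
    """Body surface area (m^2) by the Du Bois and Du Bois formula."""
    if weight_kg <= 0 or height_cm <= 0:
        raise ValueError("weight and height must be positive")
    return 0.007184 * weight_kg ** 0.425 * height_cm ** 0.725


def index_to_bsa(volume_ml: float, weight_kg: float, height_cm: float) -> float:
    """Volume indexed to body surface area (ml/m^2)."""
    return volume_ml / bsa_du_bois(weight_kg, height_cm)


def remove_outliers(values: Sequence[float], k: float = 3.0) -> List[float]:
    """Drop values more than k IQRs below Q1 or above Q3.

    Quartiles use the default method of statistics.quantiles
    (an assumption of this sketch).
    """
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lower <= v <= upper]
```

For a hypothetical subject weighing 70 kg with a height of 170 cm, the formula gives a body surface area of about 1.81 m².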

Validation1
Overall, the Dice score between manual and automated segmentations was 0.93 ± 0.03 for the LV blood pool, 0.84 ± 0.02 for the LV myocardium, and 0.91 ± 0.03 for the RV blood pool segmentations. There was a good correlation between automatically and manually obtained cardiac volumes (LVEDV r = 0.99; LVESV r = 0.98; LVM r = 0.94; RVEDV r = 0.98; and RVESV r = 0.91), filling and ejection parameters (peak ejection rate r = 0.98; peak early filling rate r = 0.98; peak atrial filling rate r = 0.97; and AC r = 0.93), and strain (ε circ r = 0.91; ε rad r = 0.85; ε long 2-chamber r = 0.91; and ε long 4-chamber r = 0.89). The Bland-Altman plots for agreement between the pipeline and manual analysis are shown in Figures 1 and 2. There was no significant bias for cardiac volumes and filling and ejection parameters, except for RVESV (bias +1.80 ml; 2.3% of the mean RVESV; p = 0.01) and LVM (bias +2.95 g; 2.7% of the mean LVM; p = 0.001). For strain, there was a significant bias for ε circ (+0.75%; p < 0.001) and 2- and 4-chamber ε long (+1.29%; p < 0.001 and +1.03%; p < 0.001, respectively). Lastly, there was no significant difference in mean absolute error between cardiac patients and healthy volunteers for the output parameters, except for LVESV (4.04 ± 4.04 ml vs. 6.65 ± 5.90 ml; p < 0.01) and AC (2.19 ± 2.17 ml vs. 3.30 ± 2.31 ml; p < 0.01) (Supplemental Table 2).

Validation2
Table 1 shows the results of Validation2. For the total pipeline, sensitivity for volume parameters (volume curves, cardiac volume, and filling and ejection dynamics) was 94.99%, whereas the specificity was 82.93%. Stratified by group, the sensitivity was 94.83% in healthy subjects and 95.39% in cardiac patients. For strain assessment, sensitivity and specificity were 93.21% and 77.14%, respectively, and sensitivity for each subgroup was 92.69% in healthy subjects and 94.41% in cardiac patients. Supplemental Table 3 shows data for all the individual parameters.
The total rate of CMRs flagged by the QCs was 26% in healthy volunteers and 32% in cardiac patients. The final rejection rate of the pipeline after clinician review was 15.2% for healthy subjects and 11% for the cardiac patients.

Obtained Reference Values
A total of 2,029 subjects of the UK Biobank matched our criteria for healthy subjects and were processed using our pipeline (Supplemental Figure 2). During QC1, 222 cases (11%) were rejected for image quality. During QC2, 75 exams (4%) were automatically flagged for errors in cardiac volume output, whereas 119 (7%) were flagged for errors in strain analysis. Baseline characteristics of the remaining subjects are shown in Table 2. Reference values for cardiac volumes, cardiac function, and filling and ejection parameters, as well as ε circ, ε long, and ε rad, stratified by sex, are shown in Tables 3 and 4. Supplemental Table 4 shows the regression analysis of changes in cardiac function in men and women with age.

Discussion
In this study, we presented and validated a pipeline for automated analysis of ventricular function from cine CMR. Our pipeline is not solely a DL image analysis algorithm, but a framework that includes extensive QC steps to allow fully automatic processing of large numbers of CMR datasets without direct clinician oversight. We show that, using our proposed technique, we were able to obtain a detailed description of cardiac function in >2,000 healthy individuals. To the authors' best knowledge, this is the first comprehensive framework for automated cine CMR analysis that approaches clinical standards of QC.

Automated QC
QC is essential in developing DL algorithms for automated processing of clinical data, but has so far been mostly overlooked (12). In our framework, we implemented QC in 2 separate steps: a pre-analysis control of image quality (QC1) and a post-analysis control of the quality of the output parameters (QC2).
QC1 focused on detection of motion artefacts and off-axis planning of the obtained images. Motion artefacts do not result in a static distortion of the image, which would be easily recognized in post-analysis QC. Instead, the dynamic motion of the heart is affected due to incorporation of information from unrepresentative motion states (arrhythmias or mistriggering) or through- and in-plane motion (breathing artefacts). Similar to off-axis planning, these artefacts can have a significant impact on the computed parameters.
In QC2, we used a wide range of relevant criteria to evaluate the output of our pipeline, including clinical knowledge (similarity between LV and RV SV), anatomical relations (coverage of segmentations and images in LAX and SAX) and DL algorithms. This design ensured that erroneous and/or anomalous outputs were detected independent of their nature, even in cases not anticipated during development of the algorithms. This generalization facilitates implementation of the pipeline in clinical scenarios, such as large research databases or clinical practice, where the image quality and disease are not known a priori.
Techniques for automated QC have been previously proposed, such as motion artefact detection in brain magnetic resonance imaging (20), image quality evaluation in fetal (21) and cardiac (22) ultrasound, and detection of missing slices (23), off-axis planning (24), or segmentation errors (25) in CMR. So far, these techniques have been aimed at a single source of error and lack a generalized QC of the output based on clinical criteria. Robinson et al. (25) proposed a method to obtain segmentation quality scores for SAX segmentations from previous ratings in a large cohort of CMR segmentations. Obtaining quality scores from segmentations using this method, or other techniques that include uncertainty into segmentation networks, can complement our framework to further improve the quality of automated CMR analysis.

Pipeline Validation
We validated the performance of the pipeline in 2 separate steps (Validation1 and Validation2). The direct comparison between automated and manual analysis in Validation1 demonstrated that the data obtained using our method were in high agreement with manual analysis for both the segmentations (see Dice scores in Results subsection 'Validation1') and the output parameters (Figures 1 and 2). Only for LVM (+2.95 g), RVESV (+1.80 ml), and ε circ and ε long strain (+0.75% and +1.03% to +1.29%, respectively) was there a small bias. However, these biases are within the range of inter- and intraobserver variabilities previously reported (6,26) and are unlikely to have significant clinical impact. The validation results for cardiac volumes (EDV, ESV, and SV) correspond well to the ones obtained in the original publication of the CNN segment (10), showing its reproducibility. The Dice scores we obtained were slightly lower compared with the original publication of the segmentation network. The original network was trained and tested on segmentations made by the UK Biobank's core analysis lab (6). In our paper, validation was performed against a new set of ground truth segmentations, performed by our own CMR cardiologists. The lower performance is therefore likely a reflection of the slight differences in training paradigms and segmentation strategies between CMR centers.
To investigate the detection of erroneous data by the QC steps of our image-processing pipeline, we evaluated its performance in a second, larger population. Manual analysis of all 700 cases in Validation2 is practically unrealistic. Therefore, we focused on critical review of the segmentations and output parameters to score their validity and evaluated the pipeline's ability to detect the erroneous cases.
The results of Validation2 show that our 2-step QC robustly detects potential erroneous cases. Overall, the sensitivity of the pipeline to detect errors was high for both volume curves (94.99%) and strain (93.21%).
The specificity of the pipeline to correctly detect good cases was lower (82.93% for volume curves and 77.14% for strain). This is likely a consequence of the stringent QC criteria, resulting in flagging of cases with severely distorted anatomy (for example, after cardiac surgery) or abnormal volume curves (restricted ventricles with small volumes, low EF, and shallow early diastolic upslope of the curve). Although the lower specificity leads to unnecessary clinician review, we considered it necessary to flag such cases to create a safe clinical workflow. However, the additional time for manual review is minimal because incorrectly flagged cases can be directly accepted upon review without adjustments.

Europe PMC Funders Author Manuscripts
It is noteworthy that, except for the lower specificity, our method performed as well in patients with cardiomyopathy as in healthy subjects (see the comparisons of mean absolute errors in Validation1 [Supplemental Table 2] and the sensitivity of error detection in Validation2 [Table 1, Supplemental Table 3]). Only for LVSV and AC were there small differences in mean absolute errors, but these are unlikely to have significant clinical impact. As can be appreciated from the Bland-Altman plots, the errors did not significantly increase at very high or very low values of the parameters. This further shows that the network has been robustly trained and is also accurate in outliers, such as patients with severe ventricular dilatation.

Reference Values
After validation, we used our pipeline to obtain sex-specific reference values for the ventricular function parameters in a group of 2,029 healthy volunteers (Tables 3 and 4).
The values for cardiac volumes (EDV, ESV, SV, and LVM) obtained using our automated method are in correspondence with those manually obtained in previous sizable studies (6,7). In addition to these values, we also present reference values for filling and ejection dynamics and strain. The latter parameters have not previously been reported in large cohort studies. However, our results do correspond with the largest available study for filling and ejection parameters (27), and a meta-analysis of normal values for CMR-derived strain (28).
The total analysis time of the network was ~8 min/subject. This is significantly shorter than the time needed for manual or semiautomated segmentation and FT of the full cardiac cycle in SAX and LAX using current state-of-the-art commercial software, which requires frequent manual adjustments of the semiautomated analysis in basal and apical slices of the acquisition.

Study Limitations
At present, this method is designed using data from our Department of Cardiovascular Imaging and UK Biobank. Variability in type of CMR scanners and protocols results in variable image-characteristics between CMR labs. To obtain similar performance in other laboratories, additional training of the neural networks in the framework is needed using data from the new site. However, the principles, including the hardcoded QC measures, remain valid as vital components for automation of CMR analysis in general. If adapted using extra training input, this method can therefore potentially provide robust analysis in other large datasets, research studies, or even clinical CMR services. As part of the Open Science initiative, our method is available for further training and use via the corresponding author.

Conclusions
We presented and validated a pipeline for automated analysis of cardiac function from cine CMR using DL. Our proposed framework includes comprehensive QC designed to detect potential erroneous results for clinician review, allowing fully autonomous processing of CMR exams. We showed that using this tool, we were able to obtain reference values in a large cohort (>2,000) of subjects to characterize cardiac function.

Supplementary Material
Refer to Web version on PubMed Central for supplementary material.
Figure legend (fragment): The p values represent the difference in mean bias from zero using a paired t-test. (F) The mean error in the LV normalized volume curve for all cases and both subgroups is shown.
Table 1 legend (fragment): Sensitivity, specificity, and balanced accuracy (BACC) of the pipeline in detecting inaccurate or unusual output versus correct output with respect to manual assessment are shown.