Concordance of left ventricular volumes and function measurements between two human readers, a fully automated AI algorithm, and the 3D heart model

Background: Echocardiography is essential in cardiovascular medicine for screening, diagnosis, and monitoring. Artificial intelligence (AI) has the potential to improve echocardiography by reducing variability and analysis time. While 3D echocardiography is becoming more accurate, 2D imaging still dominates clinical care. We aimed to evaluate agreement in measures of left ventricular (LV) volumes and function between human readers, a fully automated AI 2D algorithm, and the 3D Heart Model.

Methods: A retrospective analysis was conducted on 109 patients who underwent 2D and 3D transthoracic echocardiography. LV end-diastolic and end-systolic volumes (LVEDV, LVESV) and ejection fraction (LVEF) were measured by two operators, a commercially available AI algorithm (US2ai), and the 3D Heart Model. Global longitudinal strain (GLS) was measured by the integrated semi-automated software and the AI algorithm. Outcomes included measures of agreement [bias, limits of agreement and Pearson's correlation (r)].

Results: For LV volume measurements, the AI algorithm was strongly correlated with the average of the human operators (r = 0.89 for LVEDV and r = 0.92 for LVESV), which was higher than the correlation between the operators (r = 0.74 and r = 0.84, respectively, p < 0.01). The same trend was seen for measures of reliability with respect to LVEDV, but not LVESV. AI demonstrated comparable performance to human operators in measuring LVEF, while the 3D Heart Model had a weaker correlation and reliability compared with human operators and AI measurements. The correlation between human operators and AI for GLS was only moderate.

Conclusion: This study demonstrates AI-based echocardiography as a promising tool for accurately assessing LV volumes and LVEF in clinical practice. AI-based measures demonstrated a significantly lower inter-operator variability, thereby improving the consistency and reliability of these assessments. Moreover, AI may prove particularly effective for conducting retrospective bulk analyses, offering a valuable tool for comprehensive evaluations of past data.


Introduction
Echocardiography holds a pivotal role in multiple aspects of cardiovascular medicine, encompassing screening, prevention (e.g., in patients undergoing cardiotoxic cancer treatments), diagnosis, risk stratification and monitoring for structural and functional abnormalities (1)(2)(3). The integration of artificial intelligence (AI) has already proven its value in various cardiac imaging modalities and has the potential to significantly enhance or simplify echocardiography as well. By eliminating intra-operator and inter-operator variability, AI may minimize the need for extensive training programs for operators, and it can shorten the time required to interpret collected images, leading to more efficient diagnosis and decision-making (4)(5)(6).
Although 3D echocardiography is becoming increasingly easy and accurate, 2D imaging is still the workhorse of everyday echocardiography, primarily due to the technical limitations and limited availability of 3D echocardiography. However, automatic measurements of 3D datasets using near real-time machine learning techniques have revolutionized the clinical applicability of 3D echocardiography, especially in quantifying chamber volumes and ejection fraction. Nonetheless, these methods are often vendor-specific and primarily available in top academic centers.
Automated AI algorithms that are capable of accurately analyzing standard 2D echocardiography are highly desirable, both for routine clinical practice and for retrospective automated analysis of the large number of echocardiograms stored in electronic archives worldwide. Automated analysis could reveal valuable and unexpected longitudinal variations in key parameters and their trajectories, reduce the need for time-consuming assessments by expert human readers, and open new avenues for retrospective analyses of data. However, ensuring that AI-based automatic measurements perform at least as well as manual readings remains a critical requirement.
Our aim was to assess the agreement, correlation, and reliability of measurements performed by (a) a fully automated commercially available AI algorithm (Us2ai) on 2D images, (b) the Heart Model 3D (HM3D) system and (c) human expert readers.

Methods
This was a retrospective analysis of 109 consecutive subjects who underwent transthoracic echocardiography at the cardiology echo lab of the University Hospital of Parma, a tertiary care center, between November 1 and December 1, 2022. The study protocol was approved by the institutional review board.

Transthoracic image analyses
All patients underwent a resting transthoracic echocardiogram according to international guidelines (7). 2D and 3D ultrasound imaging was performed using an EPIQ machine and X5 transducer by Philips Healthcare. The HM3D images were obtained using wide-angle acquisition in "full-volume" mode, optimizing the frame rate by minimizing sector depth and width. Images were later analyzed off-line by two experienced operators who were blinded to each other's measurements and to clinical data. Each experienced operator had EACVI transthoracic echocardiography certification or more than 10 years of echocardiography experience. Left ventricular end-diastolic volume (LVEDV), left ventricular end-systolic volume (LVESV) and left ventricular ejection fraction (LVEF) were calculated using the modified Simpson's rule according to the 2015 American Society of Echocardiography (ASE)/European Association of Cardiovascular Imaging (EACVI) guidelines for cardiac chamber quantification (7). The peak of the R wave and the end of the T wave on the ECG were used to identify end-diastole and end-systole, respectively, for manual measurements (each reader used the same criteria, including when repeating measurements for intra- and inter-observer variability), while the 2D AI and 3D Heart Model systems identify end-diastole and end-systole with proprietary methods. To avoid potential confounders, only cine loops free of arrhythmias were included in the analyses. Global longitudinal strain (GLS) was calculated as the average Lagrangian strain from the apical 4-chamber (A4C), apical 3-chamber (A3C) and apical 2-chamber (A2C) views using the conventional software Autostrain (Philips Healthcare), which is semi-automated (i.e., operators acquiring the images adjust the endocardial border tracings if needed) (8).
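As an illustration of the volume and ejection fraction calculation referred to above, the following is a minimal sketch of the biplane method of disks (modified Simpson's rule) and of LVEF derived from the resulting volumes. It is not the software used in this study; the function names, the use of 20 disks and the toy cylinder are assumptions made for the example.

```python
import math


def biplane_simpson_volume(diams_a4c_cm, diams_a2c_cm, length_cm):
    """Biplane method of disks: V = (pi / 4) * sum(a_i * b_i) * (L / n).

    a_i and b_i are the disk diameters (cm) measured in two orthogonal
    apical views, L is the LV long-axis length (cm), n the number of disks.
    The result is in ml (1 cm^3 = 1 ml).
    """
    if len(diams_a4c_cm) != len(diams_a2c_cm):
        raise ValueError("both views need the same number of disks")
    n = len(diams_a4c_cm)
    disk_height = length_cm / n
    cross_products = sum(a * b for a, b in zip(diams_a4c_cm, diams_a2c_cm))
    return (math.pi / 4.0) * cross_products * disk_height


def ejection_fraction(edv_ml, esv_ml):
    """LVEF (%) = 100 * (EDV - ESV) / EDV."""
    return 100.0 * (edv_ml - esv_ml) / edv_ml


# Sanity check on a hypothetical cylinder (diameter 4 cm, length 8 cm),
# whose exact volume is pi * r^2 * L.
cylinder_ml = biplane_simpson_volume([4.0] * 20, [4.0] * 20, 8.0)

# Hypothetical volumes, not taken from the study data.
lvef = ejection_fraction(edv_ml=120.0, esv_ml=48.0)
```

For a cylinder the disk summation is exact, which makes it a convenient check that the formula has been implemented correctly before applying it to traced contours.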
The semi-automated 3DHM algorithm was used to obtain 3D measures of LVEDV and LVESV solely for the purpose of calculating LVEF, since 3D volumes were deemed not comparable to 2D volumes.
Briefly, 3D datasets were acquired in a single beat during a breath hold lasting a few seconds, ensuring optimal temporal and spatial resolution. The volumetric datasets were immediately evaluated on-board using the DHM software (Heart Model, Philips Healthcare), which automatically identifies LV endo- and epicardial borders at end-diastole and LA borders at end-systole, allowing prompt quantification of the volumes of these chambers. In our study, 3DE images were analyzed using the default settings of the boundary detection sliders (end-diastolic default position = 60/60; end-systolic default position = 30/30).
The fully automated 2D AI-based analyses were performed by the commercially available algorithm from Us2ai (Us2ai, Singapore, Singapore), which automatically calculated LV volumes, LVEF and GLS without any manual correction. The algorithm is based on a deep learning workflow, as previously described for 2D videos and GLS (9,10). In brief, the AI algorithm classifies the 2D video clips into A4C, A3C or A2C views and automatically excludes low-quality images. Automated contouring of the endocardial border for every frame from the A4C, A3C and A2C views is then performed by a convolutional neural network (CNN) model. The end-diastolic and end-systolic frames are identified automatically from video-level volume curves, with confirmation by an accompanying electrocardiogram, if available. The strain module uses the annotated and endocardium-traced video clips of the LV produced in the conventional 2D echo module to measure the circumferential length of the traced endocardium in each frame; these lengths are projected as drift-corrected strain curves based on the cardiac cycle identified by the video-level volume curves.
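To make the strain step more concrete, the sketch below shows how a Lagrangian strain curve could be derived from per-frame endocardial contour lengths. It is not the vendor's implementation, which is proprietary: the linear drift correction, the choice of the first frame as end-diastole, the function names and the contour lengths are all assumptions made for illustration.

```python
def lagrangian_strain_curve(lengths_mm, ed_index=0):
    """Strain (%) per frame relative to the end-diastolic contour length L0:
    strain = 100 * (L - L0) / L0."""
    l0 = lengths_mm[ed_index]
    return [100.0 * (length - l0) / l0 for length in lengths_mm]


def drift_correct(strain):
    """Assumed linear drift compensation: subtract a ramp so that the last
    frame returns to the first frame's value (clip spans one full cycle)."""
    n = len(strain)
    if n < 2:
        return list(strain)
    drift = strain[-1] - strain[0]
    return [s - drift * i / (n - 1) for i, s in enumerate(strain)]


def peak_systolic_strain(strain):
    """Peak longitudinal shortening is the most negative strain value."""
    return min(strain)


def global_longitudinal_strain(peak_by_view):
    """GLS as the average of the per-view peak strains (A4C, A3C, A2C)."""
    return sum(peak_by_view.values()) / len(peak_by_view)


# Hypothetical contour lengths (mm) over one cardiac cycle in a single view.
lengths = [100.0, 95.0, 88.0, 82.0, 86.0, 94.0, 101.0]
raw = lagrangian_strain_curve(lengths)
corrected = drift_correct(raw)
peak = peak_systolic_strain(corrected)

# Hypothetical per-view peaks averaged into GLS.
gls = global_longitudinal_strain({"A4C": -18.5, "A3C": -19.0, "A2C": -20.0})
```

In this toy clip the traced contour ends 1% longer than it started, so the correction removes a ramp that reaches 1% at the final frame and the peak strain shifts from -18.0% to -18.5%.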

Statistical methods
Unless otherwise specified, data are presented as mean ± SD or n (%). Group comparisons were performed using Student's t-test or the Mann-Whitney U test for continuous data, and categorical data were compared with the chi-squared (χ²) test. Bland-Altman plots were used to assess agreement between the methods, including bias (the mean difference between paired measurements) and 95 percent limits of agreement (LoA, bias ± 1.96 × SD of the differences). Paired t-tests were conducted to determine the significance of the biases. Measurement variability was expressed as the mean absolute difference (MAD) between corresponding pairs of repeated measurements within each patient throughout the study group. Correlations were assessed using the Pearson coefficient (r). Reliability was evaluated using the intraclass correlation coefficient, which considers the average of K to determine the degree of reliability among the different methods. A P-value < 0.05 was considered statistically significant.

Correlation plots and Bland-Altman plots of left ventricular end-diastolic volume (LVEDV) measures between two human operators (left) and between AI-based measures and the average of the human operators (right).
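The agreement measures described under Statistical methods can be sketched in a few lines. The example below computes the Bland-Altman bias and 95% limits of agreement, the mean absolute difference and the Pearson coefficient for paired measurements. It is an illustration only: the function names and the five paired values are hypothetical and are not study data.

```python
import math
import statistics


def bland_altman(method_a, method_b):
    """Return (bias, lower LoA, upper LoA) for paired measurements.

    bias = mean of the paired differences (a - b)
    LoA  = bias +/- 1.96 * SD of the paired differences
    """
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = statistics.mean(diffs)
    half_width = 1.96 * statistics.stdev(diffs)  # sample SD (n - 1)
    return bias, bias - half_width, bias + half_width


def mean_absolute_difference(method_a, method_b):
    """Mean of |a - b| over all pairs."""
    return statistics.mean([abs(a - b) for a, b in zip(method_a, method_b)])


def pearson_r(x, y):
    """Pearson correlation coefficient, written out explicitly."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical LVEDV values (ml): operator average vs. automated measure.
operators = [100.0, 120.0, 90.0, 150.0, 110.0]
automated = [104.0, 126.0, 92.0, 158.0, 112.0]

bias, loa_low, loa_high = bland_altman(automated, operators)
mad = mean_absolute_difference(automated, operators)
r = pearson_r(operators, automated)
```

Note that a high correlation does not by itself imply agreement: in this toy example r is close to 1 even though the automated values are systematically higher, which is why bias and limits of agreement are reported alongside r.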

Results
The human operators and the AI algorithm successfully analyzed all 109 (100%) 2D echocardiographic studies included in our study, while the 3DHM algorithm was able to analyze 99 of the studies (89%). The clinical characteristics of the study population are presented in Table 1.
Absolute mean values for each measurement performed with different methods (LVEDV, LVESV, LVEF, GLS) are presented in Table 2.
For measurements of LVESV, the correlation between the two operators was r = 0.84 (0.77-0.89, p < 0.001), with a reliability of k = 0.91 (Figure 2; Table 3). The average bias was 5.7 ml (LoA ± 20.8 ml). Comparing the average of the operators with AI, the correlation was r = 0.92 (0.89-0.95, p < 0.001) with a reliability of k = 0.60. The AI algorithm measured higher LVESV, with an average bias of 11.9 ml (LoA ± 37.6 ml).
GLS was successfully analyzed by human operators and the AI algorithm in 103 subjects (Figure 4). The two methods exhibited a correlation of r = 0.55 (0.85-0.92, p < 0.0001) with a reliability of k = 0.71 and an average bias of 4% (LoA ± 6.3%).
Table 3 also reports full data for intra-operator variability for LVEDV, LVESV and LVEF.

Correlation plots and Bland-Altman plots of left ventricular ejection fraction (LVEF) measures between two human operators (left), between AI-based measures and the average of the human operators (middle) and between the 3D heart model and the average of the human operators (right).

Discussion
In this real-world study of consecutive subjects who underwent transthoracic echocardiography for various clinical indications, we found good correlations and reliability, and a low bias, for measures of LV volumes and LVEF between human operators and a fully automated AI algorithm. The feasibility of the AI algorithm was high, as all images were successfully analyzed. The 3DHM was able to analyze LVEF in 89% of images, which is in agreement with the feasibility reported in the literature (11), and its accuracy, with human operators as the reference, was inferior to that of the AI model.
AI-based measurements of LVEDV showed superior correlation, agreement, and reliability compared to human operators analyzing identical images. This finding may suggest that AI can mitigate the inherent inter-operator variability that affects the accuracy of conventional echocardiography by standardizing the measurements. The same findings were confirmed for LVESV with respect to correlation, but with a higher bias and a lower reliability for the AI-based measurements. This discrepancy may be attributed to different approaches to including myocardial trabeculae, which become particularly prominent relative to the pars compacta in end-systole. However, as there were no such differences between measures of LVEDV and LVESV in three larger datasets using the same algorithm, the discrepancies may also be due to chance (9).
LVEF is perhaps the most important variable for clinical decision-making. Our data suggest that the agreement, correlation, and reliability of the AI-based algorithm compared to the mean of two operators are nearly identical to those observed between the two operators themselves. This implies that the AI-based algorithm can be considered as reliable and consistent as an experienced operator in measuring LVEF. We also compared the performance of the AI-based algorithm with another tool for assessing LVEF, the 3DHM. The 3DHM system exhibited slightly inferior agreement, correlation, and reliability in LVEF measurements compared to the AI-based algorithm when both were compared to the mean of the two experienced operators. This observation does not justify the added complexity and reduced feasibility associated with automatic 3DHM imaging compared to 2D imaging by the AI-based algorithm. Importantly, 3D measures of LV volumes differ substantially from 2D measures, with 3D and 3DHM volumes being closer to LV volumes as measured by cardiac magnetic resonance (11,12). This may have biased the comparison against the 3DHM, as the reference LVEF was based on 2D images read by human operators.
The correlation between semi-automated and AI-based measures of GLS was modest (r = 0.55) and lower than what has previously been reported in larger datasets by the same algorithm (r = 0.84 in a real-world dataset and r = 0.76 in an echo core lab study of patients with HFpEF) (10).However, as the bias and reliability of the measurements were good, the modest correlation may relate to the narrow range of GLS in this study (majority between −15% and −20%).
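The effect of a narrow measurement range on the correlation coefficient can be illustrated with a small deterministic example. In the sketch below the disagreement between two methods is held fixed (alternating +2 and -2 strain units, an arbitrary assumption) while only the spread of the underlying values changes. All values are synthetic and chosen for illustration; they are not study data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient, written out explicitly."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def second_method(values, error=2.0):
    """Simulate a second method whose readings differ by +error / -error
    on alternating subjects, so the disagreement is the same in any range."""
    return [v + (error if i % 2 == 0 else -error) for i, v in enumerate(values)]


# 21 synthetic GLS values (%) spread over a wide and a narrow range.
wide = [-30.0 + 1.0 * i for i in range(21)]     # -30% to -10%
narrow = [-20.0 + 0.25 * i for i in range(21)]  # -20% to -15%

r_wide = pearson_r(wide, second_method(wide))
r_narrow = pearson_r(narrow, second_method(narrow))
```

With identical measurement differences in both scenarios, r is roughly 0.95 for the wide range and roughly 0.60 for the narrow one, while the bias and limits of agreement would be unchanged. This is consistent with the interpretation that a modest r can coexist with acceptable agreement when most values fall within a narrow interval.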
Our study has some limitations. This was a retrospective analysis, which may have introduced selection bias. We did not validate the findings in an independent cohort; however, the algorithms used have previously been tested in other populations (9,10). The population studied is rather small with a tight range of LV volumes and EF, mostly within the normal range. The results are representative only of this specific population and cannot be generalized to the entire range of LV volumes and EF that can be encountered in clinical practice. The images were acquired by equipment from one vendor (Philips Healthcare), and although the AI software is labeled as vendor-independent, the findings cannot necessarily be extrapolated to other vendors.

Conclusions
Chamber quantification in echocardiography is crucial for making informed decisions in everyday cardiology practice. Our analysis strongly suggests that an AI-based method for quantifying left ventricular volumes and LVEF can be effectively employed in clinical practice, as it demonstrates good agreement and correlation when compared to assessments made by two experienced human operators.

FIGURE 4
Correlation plots and Bland-Altman plots of left ventricular global longitudinal strain (GLS) measures between AI-based measures and semi-automated measures (Autostrain).

TABLE 1
Clinical characteristics of the studied population.
CVD, cardiovascular disease; LVEF, left ventricular ejection fraction. a The average of two human operators.

TABLE 3
Bias, correlation and reliability for measures of left ventricular volume, ejection fraction and global longitudinal strain by human operators, 2D AI algorithm and 3D heart model.

Myhre et al. 10.3389/fcvm.2024.1400333. Frontiers in Cardiovascular Medicine. frontiersin.org