Automated Echocardiographic Detection of Heart Failure With Preserved Ejection Fraction Using Artificial Intelligence

Background Detection of heart failure with preserved ejection fraction (HFpEF) involves integration of multiple imaging and clinical features which are often discordant or indeterminate. Objectives The authors applied artificial intelligence (AI) to analyze a single apical 4-chamber transthoracic echocardiogram video clip to detect HFpEF. Methods A 3-dimensional convolutional neural network was developed and trained on apical 4-chamber video clips to classify patients with HFpEF (diagnosis of heart failure, ejection fraction ≥50%, and echocardiographic evidence of increased filling pressure; cases) vs without HFpEF (ejection fraction ≥50%, no diagnosis of heart failure, normal filling pressure; controls). Model outputs were classified as HFpEF, no HFpEF, or nondiagnostic (high uncertainty). Performance was assessed in an independent multisite data set and compared to previously validated clinical scores. Results Training and validation included 2,971 cases and 3,785 controls (validation holdout, 16.8% patients), and demonstrated excellent discrimination (area under receiver-operating characteristic curve: 0.97 [95% CI: 0.96-0.97] and 0.95 [95% CI: 0.93-0.96] in training and validation, respectively). In independent testing (646 cases, 638 controls), 94 (7.3%) were nondiagnostic; sensitivity (87.8%; 95% CI: 84.5%-90.9%) and specificity (81.9%; 95% CI: 78.2%-85.6%) were maintained in clinically relevant subgroups, with high repeatability and reproducibility. Of 701 and 776 indeterminate outputs from the Heart Failure Association-Pretest Assessment, Echocardiographic and Natriuretic Peptide Score, Functional Testing, and Final Etiology (HFA-PEFF) and Heavy, Hypertensive, Atrial Fibrillation, Pulmonary Hypertension, Elder, and Filling Pressure (H2FPEF) scores, the AI HFpEF model correctly reclassified 73.5% and 73.6%, respectively.
During follow-up (median: 2.3 [IQR: 0.5-5.6] years), 444 (34.6%) patients died; mortality was higher in patients classified as HFpEF by AI (HR: 1.9 [95% CI: 1.5-2.4]). Conclusions An AI HFpEF model based on a single, routinely acquired echocardiographic video demonstrated excellent discrimination of patients with vs without HFpEF, more often than clinical scores, and identified patients with higher mortality.

Heart failure (HF) is a clinical syndrome affecting over 64 million people worldwide and has an increasing prevalence. 1,2 Measurement of ejection fraction (EF) is used to categorize HF; while HF with reduced EF is relatively simple to identify, heart failure with preserved ejection fraction (HFpEF) is more complex, leading to differences in diagnostic criteria, 3 and likely contributing to "failed" clinical trials. 4 However, with mounting evidence indicating a beneficial impact of sodium-glucose cotransporter-2 inhibitors across the spectrum of HF, 5 a key focus must now be improving diagnostic capacity 6 in a patient population with poor 5-year survival rates, high hospital readmission rates, and substantial morbidity. 7,8
HFpEF is a heterogeneous syndrome associated with various comorbidities, wherein cardiac and noncardiac factors contribute to elevated intracardiac filling pressure, resulting in signs and symptoms of HF. 3,9 Although transthoracic echocardiography (TTE) is routinely used to estimate intracardiac filling pressure, 9,10 there is considerable variability in its performance and interpretation, and a high burden on skills, time, and expertise for acquiring diagnostic quality information, which may not be feasible beyond expert clinical sites. Clinical algorithms, utilizing multiple sources of patient data, 11,12 may be limited by discordant or incomplete data. 13,14 These factors collectively contribute to variable diagnostic capacity, increasing the requirement for invasive confirmatory tests (eg, right heart catheterization 9,12 ), adding further burden to the patient and health care system, and potentially missing individuals who might benefit from treatment.
Recent work in artificial intelligence (AI) computer vision techniques offers great promise that computational methods can better interpret the vast amount of information that exists within medical data, including images. Whereas recent AI studies have combined clinical parameters and manual echocardiographic measurements to classify diastolic dysfunction and HFpEF, 15-17 fewer have used echocardiographic images. 18,19 Development of an approach using this simple input might obviate the need for complex Doppler assessment, provide supporting information when traditional measures are nondiagnostic, or limit data requirements when such data collection is not feasible.
The objective of this study was to develop an AI model to automatically detect HFpEF by only using the apical 4-chamber (A4C) TTE video clip. This view was selected because it includes much information (chamber sizes, wall thicknesses, annulus motion, etc) and is routinely acquired in imaging protocols. In an independent data set, we tested the hypothesis that the developed AI HFpEF model would demonstrate acceptable classification accuracy, and feasibility superior to current clinical scores for detection of HFpEF.
EF of at least 50% 20 (Supplemental Appendix), obtained using standard echocardiographic procedures at the relevant site, and interpreted by qualified clinicians.
Evidence of elevated intracardiac filling pressure.
Documented evidence of increased intracardiac filling pressure (cases), or lack thereof (controls), was obtained from comprehensive clinical TTE reports, measured in accordance with relevant guidelines 9,10 (Supplemental Appendix, Figure 1).
A convolutional neural network (CNN) 21 model was applied to the A4C video clips. The model comprised 3 series of 3-dimensional (3D) convolutional layers. Each of these 3 series was a sequence of 2 convolutions with a 3 × 3 × 3 kernel, followed by batch normalization and rectified linear unit activation, and then 1 max-pooling operation with kernel size and stride of 3 in every direction. This architecture was chosen because it is well suited to operate on 3-dimensional data (2 in-plane spatial dimensions for each frame plus time). The input of the model comprised all overlapping sequences of 30 frames, with a stride of 10 frames, from the entire A4C video clip, which usually comprised 3 cardiac cycles.
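The windowing of a clip into overlapping 30-frame sequences with a stride of 10 frames can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `window_bounds` is hypothetical, and it assumes that trailing frames which do not fill a complete window are dropped, a detail the text does not state.

```python
from typing import List, Tuple


def window_bounds(n_frames: int, window: int = 30, stride: int = 10) -> List[Tuple[int, int]]:
    """Return (start, end) frame indices, end exclusive, of every full window.

    Assumption: frames left over after the last full window are ignored.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if n_frames < window:
        # A clip shorter than one window yields no model input.
        return []
    return [(start, start + window)
            for start in range(0, n_frames - window + 1, stride)]


# A hypothetical 75-frame clip produces five overlapping sequences.
print(window_bounds(75))  # [(0, 30), (10, 40), (20, 50), (30, 60), (40, 70)]
```

Under these assumptions a clip with fewer than 30 frames produces no sequences at all, which is consistent with such clips being excluded from analysis.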
The fully connected layer used dropout with a probability of 0.5 (Central Illustration).
Grad-CAM identifies "important" regions in the image to differentiate between cases and controls (Supplemental Appendix). In the correct example, the highlighted regions correspond to clearly defined cardiac structures with clinical importance, which suggests that the model is "looking" at appropriate features. In the incorrect example, the strongest (red) signal appears in a less clearly defined structure or region.
INDEPENDENT AI HFpEF MODEL TESTING. In the independent testing data set, from an available 1,292 patients (650 cases and 642 controls), and 1,426 video clips (722 cases, 704 controls), 3 video clips could not be read, and 29 contained fewer than the 30 frames required for the analysis. The final sample size for the independent testing data set was therefore 1,284 patients (646 cases, 638 controls) (Table 1).
Classification accuracy. The AI HFpEF model classified 94 out of 1,284 studies (7.3%) as nondiagnostic due to high model uncertainty. In the remaining data, sensitivity (87.8%; 95% CI: 84.5%-90.9%) and specificity (81.9%; 95% CI: 78.2%-85.6%) both exceeded the a priori benchmarks consistent with average clinical practice (both P < 0.001 for 1-sided binomial exact test), with corresponding positive and negative predictive values of 83.6% (95% CI: 80.2%-87.0%) and 86.5% (95% CI: 83.0%-90.0%), respectively.
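The accuracy measures reported above can be computed from a confusion matrix as sketched below. This is an illustration only: the function names are hypothetical, the example counts are invented because the per-cell counts are not given in the text, and the benchmark proportion passed to the 1-sided exact binomial tail is left as a parameter because the exact null value used in the study is described in the Supplemental Appendix rather than here.

```python
import math


def diagnostic_metrics(tp, fn, tn, fp, nondiagnostic=0):
    """Sensitivity, specificity, PPV, and NPV among classified studies,
    plus the share of all studies that received no classification."""
    total = tp + fn + tn + fp + nondiagnostic
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "nondiagnostic_rate": nondiagnostic / total,
    }


def binom_upper_tail(k, n, p0):
    """P(X >= k) for X ~ Binomial(n, p0), summed in log space.

    This is the p-value of a 1-sided exact binomial test of whether an
    observed proportion k/n exceeds a benchmark proportion p0.
    """
    if not 0.0 < p0 < 1.0:
        raise ValueError("p0 must lie strictly between 0 and 1")
    total = 0.0
    for i in range(k, n + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1)
                    - math.lgamma(n - i + 1)
                    + i * math.log(p0) + (n - i) * math.log(1.0 - p0))
        total += math.exp(log_term)
    return min(total, 1.0)


# Invented counts, for illustration only (not the study data).
metrics = diagnostic_metrics(tp=80, fn=20, tn=90, fp=10, nondiagnostic=10)
print(metrics)

# Share of nondiagnostic studies using the figures quoted in the text.
print(round(100 * 94 / 1284, 1))  # 7.3

# 1-sided test of the invented sensitivity (80/100) against a 74% benchmark.
print(binom_upper_tail(80, 100, 0.74))
```

The predictive values depend on the ratio of cases to controls in the sample, so values computed this way apply to a data set with a similar case mix.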
Compared to their correctly classified counterparts, misclassified controls were older with more evidence of structural heart disease and diastolic dysfunction, whereas the opposite was true for misclassified cases (Table 2).
Although there have been other AI models developed to tackle the burden of HFpEF, they have relied on accurate chamber segmentation to derive a series of features used for classification 19 or used the complete echocardiographic study to automate computation of left ventricular diastolic function parameters. 18 To our knowledge, this is the first model developed on a single routinely acquired video clip, demonstrating feasibility and high classification accuracy consistent with comprehensive clinical and echocardiographic assessment in expert centers, 7,23,24 albeit using substantially less clinical information. In comparison, recent AI developments in diastolic function scoring 17 were developed using complete data sets, a scenario rarely representing the clinical norm. Comparative effectiveness of different models is beyond the scope of this study, but considering the observed proportion of missing data (Table 4), such a model could support existing diagnostic efforts without requiring additional calculation of current or new (eg, left atrial strain) metrics.
The performance of such methods varies considerably, 7,11,23,24 but can be excellent in expert centers, or when missing or discordant data are not an issue. In complex clinical cases, whilst there is guidance for estimating filling pressure when echocardiographic signals are difficult to interpret (atrial fibrillation 10 ), the assessment is often avoided entirely. Compared to current clinical algorithms, or guideline-derived cut-offs for various diagnostic markers, the AI HFpEF model returned fewer nondiagnostic outputs (Table 4), successfully reclassifying almost 75% of those who would be nondiagnostic according to the HFA-PEFF or H2FPEF scores
Data presented are the total sample of patients with data available for use in the classification ("n"), and the number of patients with data available who receive a nondiagnostic output from the AI HFpEF model ("no class"). a Indexing was performed to body surface area. Average filling refers to the calculated mean of the septal and lateral mitral annular early diastolic tissue velocity when both metrics are available, or the available metric when only 1 is available. Pulmonary artery systolic pressure calculated as 4 × (tricuspid regurgitation velocity)² + estimated right atrial pressure (5 mm Hg). HFA-PEFF probability categories calculated according to Pieske et al. 12 Patients with a score of 0 or 1 were considered unlikely likelihood of HFpEF (negative output; predicted control), and 5 or more considered probable likelihood of HFpEF (positive output; predicted case). H2FPEF categorical scores were calculated according to Reddy et al. 11 Patients with a score of 0 or 1 were considered low probability of HFpEF (negative output; predicted control), and 6 or more considered high probability of HFpEF (positive output; predicted case).
STUDY LIMITATIONS. The diagnostic details of each case were not adjudicated. Therefore, it is possible that some controls had subclinical disease, albeit representative of patients in major clinical trials (Supplemental Appendix). Nonetheless, an important progression for the current model is to increase capacity and validate detection of HFpEF earlier in the clinical pathway, particularly when patients might have dyspnea on exertion, but not at rest (eg, patients referred for diastolic stress testing, or invasive filling pressure measurements at rest and with exertion 9,12 ), or when limited echocardiographic imaging occurs earlier in the pathway (eg, point-of-care ultrasound).
Another limitation is that complete matching for age was not possible; patients with HFpEF were older.
ABBREVIATIONS AND ACRONYMS 3D = 3-dimensional; A4C = apical 4 chamber; AI = artificial intelligence; AUROC = area under receiver-
Ultromics Ltd, Oxford, United Kingdom; b Department of Quantitative Health Sciences, Mayo Clinic, Rochester, Minnesota, USA; c Division of Hospital Internal Medicine, Mayo Clinic, Rochester, Minnesota, USA; d Department of Cardiovascular Medicine, Mayo Clinic, Rochester, Minnesota, USA; e Cardiovascular Clinical Research Facility, University of Oxford, Oxford, United Kingdom; f Experimental Therapeutics, Medical Sciences Division, Radcliffe Department of Medicine, University of Oxford, Oxford, United Kingdom; and the g Department of Cardiology, St George's University Hospitals NHS Foundation Trust, London, United Kingdom. *Drs Akerman and Porumb are joint first authors. The authors attest they are in compliance with human studies committees and animal welfare regulations of the authors' institutions and Food and Drug Administration guidelines, including patient consent where appropriate. For more information, visit the Author Center. Manuscript received April 7, 2023; revised manuscript received May 18, 2023, accepted May 29, 2023.
METHODS
DATA SOURCES AND STUDY POPULATION. This retrospective, multisite, and multinational cohort study was approved by the Institutional Review Boards of Mayo Clinic, United States, and St. George's University Hospitals, National Health Service Foundation Trust, United Kingdom. Patients provided written informed consent for inclusion in research; consent for use of TTE analysis and relevant clinical patient information was exempted by the participating Institutional Review Boards due to the use of deidentified data. Data from the United States and United Kingdom were used in the training and validation of the AI model, whereas independent multisite data from the United States were used for testing.
Model training and validation. The Mayo Clinic echocardiography database, which comprises all clinical images and TTE reports since 2002, and matched electronic medical records were screened for patients meeting the ground truth determination for cases and controls. Data were included for patients who had undergone a comprehensive TTE at Mayo Clinic in Rochester, Minnesota between January 2009 and December 2020. Echocardiograms at Mayo Clinic are performed by certified cardiac sonographers and interpreted by experienced level 3 trained physicians prior to the patient's dismissal from the laboratory. A continuous random sampling of the data pool was taken and cross-referenced for preserved EF, and evidence of increased intracardiac filling pressure, until the desired number of cases was compiled (Figure 1). Controls were then randomly sampled to achieve a distribution of age, sex, and year of echocardiogram amongst patients. St. George's Hospital cardiac database was screened in an identical manner to the Mayo Clinic echocardiography database to enrich the data set and facilitate generalizability via multinational data.
Independent testing of the AI HFpEF model. Multicenter independent retrospective data were collected within Mayo Clinic Health System to test the AI HFpEF model. Patients were selected from geographically distinct areas from the data used in model development to ensure generalizability. Data were selected from clinical sites spanning 4 states, and outreach services across 5 states (Supplemental echocardiogram and attempts were made to match for age. To better assess generalizability, up-sampling of non-White and Hispanic populations was used.
IDENTIFICATION OF STUDY GROUPS. The ground truth determination used in model training, validation and independent testing was based on data collected from patient medical records and comprehensive TTE reports. The definition of cases was consistent with the current national guidelines for detection and diagnosis of HF, 9 based on the clinical diagnosis provided by the treating physician, and matching the
clinical patient pathway for this patient cohort. Patients with HFpEF (cases) and patients without HFpEF (controls) were therefore identified via the mechanisms described below and illustrated in Figure 1.
Clinical diagnosis of heart failure. Documented clinical diagnosis of HF, based on an International Classification of Diseases 9 or 10 code, within 1 year of the associated echocardiogram (case) or lack of this diagnosis (control) was collected from the patient medical records (Supplemental Table 2).
Preserved systolic function. Documented evidence of preserved systolic function according to TTE (cases and controls) was obtained from the patient TTE reports. This was evidenced by a left ventricular

FIGURE 1 Flow Diagram Illustrating Identification and Selection of Patients in AI HFpEF Model Development and Testing

JACC: ADVANCES, VOL. 2, NO. 6, 2023
OVERVIEW OF THE AI HFpEF MODEL. Model training and validation were completed using Python (version 3.7.7) with TensorFlow (version 2.2) on a rack-mounted server with a set of 3 Nvidia Tesla V100 graphics processing units, each with 32 GB of video RAM. Model inputs consisted of only A4C TTE video clips. For training and validation of the AI HFpEF model, all A4C video clips for a given patient were used.

All
Figure 1).
COMPARISON OF AI MODEL WITH CURRENT CLINICAL PRACTICE. To test the hypothesis that the classification accuracy of the developed AI HFpEF model, based on analysis of a single A4C video clip, was acceptable, we compared observed sensitivity and specificity in the independent testing data set to average reported data in the literature (sensitivity, 74%; specificity, 65%) (Supplemental Appendix). To demonstrate a 5% increase from these benchmarks, and allowing for 21.9% of nondiagnostic outcomes, ~1,048 patients were required in the independent testing data set (Supplemental Appendix). Classification performance was assessed in a priori determined subgroups of interest related to patient demographics, clinical, and echocardiographic criteria (Supplemental Appendix). The previously validated clinical Heart Failure Association-Pretest Assessment, Echocardiographic and Natriuretic Peptide Score, Functional Testing, and Final Etiology (HFA-PEFF) 12 and Heavy, Hypertensive, Atrial Fibrillation, Pulmonary Hypertension, Elder, and Filling Pressure (H2FPEF) scores 11 were calculated retrospectively (ie, they were not required for the original clinical diagnosis) and categorized as unlikely (0 or 1), indeterminate (2-4), or probable (5-6) likelihood of HFpEF for the HFA-PEFF score, and low probability (0 or 1), indeterminate (2-5), or high probability (6-9) of HFpEF for the H2FPEF score. The impact of

Figure 2 demonstrates representative Grad-CAM images for a correctly classified case, and an incorrectly classified control. The highlighted areas in the
Sensitivity analyses were performed to identify whether bias in age, sex, or year of echocardiogram meaningfully influenced the classification accuracy. In all instances, sensitivity and specificity were higher than the a priori benchmarks (range: 83.7%-87.6% and range: 78.4%-82.4%, respectively) (Supplemental Appendix). Likewise, no a priori identified patient or technical factors meaningfully impacted the classification accuracy, with sensitivity and specificity maintained across subgroups (Supplemental Appendix).
Repeatability and reproducibility of AI HFpEF model. The model demonstrated perfect agreement for repeatability of all model outputs (

(Figure 3). Furthermore, the model identified those with increased risk of mortality (Central Illustration), and its use in clinical practice, particularly in those who would otherwise be indeterminate, might facilitate a higher proportion of patients being managed correctly (Figure 4, Supplemental Appendix). Further research is required to understand whether the added feasibility and high classification performance translate to meaningful clinical endpoints, including reductions in follow-up procedures, hospitalization, or death.
Technological advances provide increased capacity to capture information not readily observed by the human eye, albeit often at the expense of interpretability. Grad-CAM is one approach to facilitate interpretability in AI, identifying important regions in the image to discriminate between cases and controls. In an example of correct classification (Figure 2), the Grad-CAM highlights regions which correspond to clearly defined cardiac structures which might have clinical importance.

H2FPEF = Heavy, Hypertensive, Atrial Fibrillation, Pulmonary Hypertension, Elder, and Filling Pressure; HFA-PEFF = Heart Failure Association-Pretest Assessment, Echocardiographic and Natriuretic Peptide Score, Functional Testing, and Final Etiology; HFpEF = heart failure with preserved ejection fraction; LA = left atrial; LV = left ventricle.
increases in filling pressure, or signs and symptoms of HF not captured by the clinical coding employed herein. Important validation work in the future will involve assessment of model performance in adjudicated HF outcomes and/or invasively measured filling pressure.
However, survival analysis was age-adjusted and sensitivity analysis demonstrated no meaningful change in interpretation in only age-matched patients. Future work will be required for recalibration or updating of the model in other patient groups (eg, increased filling pressure but no HF diagnosis, or indeterminate filling pressure assessment by TTE), validating its application in other echocardiography laboratories and in different demographic groups, and prospective evaluation of comparative effectiveness with clinical scores.
CONCLUSIONS
We present a novel AI HFpEF model which, based on only a single routinely acquired TTE video clip,

FIGURE 3 Alluvial Plot Demonstrating Reclassification of Patients Using Clinical Scores Compared to the AI HFpEF Model

Table 3). From the main testing data set, 2 separate video clips per patient were available for 34 controls and 48 cases to

TABLE 2
Summary of Characteristics for Patients With and Without HFpEF (Cases and Controls, Respectively) Who Were Correctly and Incorrectly Classified Using the AI HFpEF Model or Received No Classification due to High Model Uncertainty

Values are mean ± SD [N] or n (%) [N]. P value refers to statistical test between correct, incorrect, and unclassified groups within controls, and the same comparison within cases. a Indexing was performed to body surface area. Average filling refers to the calculated mean of the septal and lateral mitral annular early diastolic tissue velocity when both metrics are available, or the available metric when only 1 is available. Categories within the "Comorbidities and risk factors" section only refer to individuals with the given condition present. Obesity refers to a BMI >25.0 kg/m². Structural heart disease refers to the presence of an enlarged LA volume index (≥34.0 mL/m²) or LV mass index (≥116/96 g/m² for males and females, respectively), a relative wall thickening >0.42, or a posterior wall thickness ≥12 mm. Pulmonary disease refers to the presence of lung disease or chronic obstructive pulmonary disorder. Previous cardio- or cerebrovascular event refers to the presence of a previous stroke, transient ischemic attack, coronary artery revascularization, or myocardial infarction. Pulmonary artery systolic pressure calculated as: 4 × (tricuspid regurgitation velocity)² + estimated right atrial pressure. HFA-PEFF probability categories calculated according to Pieske et al. 12 Patients with a score of 0 or 1, between 2 and 4, and 5 or more, were denoted as unlikely, indeterminate, and probable likelihood of HFpEF, respectively. H2FPEF continuous and categorical scores were calculated according to Reddy et al. 11 For the categorical score, patients with a score of 0 or 1, 2 to 5, or 6 to 9, were denoted as low, indeterminate, and high probability of HFpEF, respectively.
AI = artificial intelligence; BMI = body mass index; BNP = brain natriuretic peptide; H2FPEF = Heavy, Hypertensive, Atrial Fibrillation, Pulmonary Hypertension, Elder, and Filling Pressure; HFA-PEFF = Heart Failure Association-Pretest Assessment, Echocardiographic and Natriuretic Peptide Score, Functional Testing, and Final Etiology; HFpEF = heart failure with preserved ejection fraction; LA = left atrial; LAVi = left atrial volume index; LV = left ventricle; LVMi = left ventricular mass index; NT-proBNP = N-terminal pro brain natriuretic peptide; SBP = systolic blood pressure.
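The score categories and the pulmonary artery systolic pressure formula given in this footnote can be expressed as a short sketch. The function names are hypothetical, and two details are assumptions: the tricuspid regurgitation velocity is taken to be in m/s (the usual convention for the simplified Bernoulli equation), and the default right atrial pressure of 5 mm Hg is the value quoted in the Table 4 footnote. Calculation of the underlying HFA-PEFF and H2FPEF point totals is not shown.

```python
def hfa_peff_category(score: int) -> str:
    """Map an HFA-PEFF total (0-6) to the likelihood category used here."""
    if not 0 <= score <= 6:
        raise ValueError("HFA-PEFF score must be between 0 and 6")
    if score <= 1:
        return "unlikely"
    if score <= 4:
        return "indeterminate"
    return "probable"


def h2fpef_category(score: int) -> str:
    """Map a categorical H2FPEF total (0-9) to the probability category."""
    if not 0 <= score <= 9:
        raise ValueError("H2FPEF score must be between 0 and 9")
    if score <= 1:
        return "low"
    if score <= 5:
        return "indeterminate"
    return "high"


def pasp_mmhg(tr_velocity_m_s: float, rap_mmhg: float = 5.0) -> float:
    """Pulmonary artery systolic pressure: 4 * velocity^2 + right atrial pressure.

    Assumes velocity in m/s and pressures in mm Hg.
    """
    return 4.0 * tr_velocity_m_s ** 2 + rap_mmhg


print(hfa_peff_category(3))  # indeterminate
print(h2fpef_category(6))    # high
print(pasp_mmhg(2.5))        # 30.0
```

Only the "indeterminate" categories correspond to the patients for whom reclassification by the AI HFpEF model is reported.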

TABLE 3
Repeatability (Same Video Clip Used Twice), and Reproducibility
AI = artificial intelligence; HFpEF = heart failure with preserved ejection fraction.

TABLE 4
Traditional Methods to Classify Patients as High or Low Likelihood of Having HFpEF Using Guideline Echocardiogram Cut Points