Artificial intelligence in fracture detection with different image modalities and data types: A systematic review and meta-analysis

Jongyun Jung; Jingyuan Dai; Bowen Liu; Qing Wu

doi:10.1371/journal.pdig.0000438

Abstract

Artificial Intelligence (AI), encompassing Machine Learning and Deep Learning, has increasingly been applied to fracture detection using diverse imaging modalities and data types. This systematic review and meta-analysis aimed to assess the efficacy of AI in detecting fractures through various imaging modalities and data types (image, tabular, or both) and to synthesize the existing evidence related to AI-based fracture detection. Peer-reviewed studies developing and validating AI for fracture detection were identified through searches in multiple electronic databases without time limitations. A hierarchical meta-analysis model was used to calculate pooled sensitivity and specificity. A diagnostic accuracy quality assessment was performed to evaluate bias and applicability. Of the 66 eligible studies, 54 identified fractures using imaging-related data, nine using tabular data, and three using both. Vertebral fractures were the most common outcome (n = 20), followed by hip fractures (n = 18). Hip fractures exhibited the highest pooled sensitivity (92%; 95% CI: 87–96, p < 0.01) and specificity (90%; 95% CI: 85–93, p < 0.01). Pooled sensitivity and specificity using image data (92%; 95% CI: 90–94, p < 0.01; and 91%; 95% CI: 88–93, p < 0.01) were higher than those using tabular data (81%; 95% CI: 77–85, p < 0.01; and 83%; 95% CI: 76–88, p < 0.01), respectively. Radiographs demonstrated the highest pooled sensitivity (94%; 95% CI: 90–96, p < 0.01) and specificity (92%; 95% CI: 89–94, p < 0.01). Patient selection and reference standards were major concerns in assessing diagnostic accuracy for bias and applicability. AI displays high diagnostic accuracy for various fracture outcomes, indicating potential utility in healthcare systems for fracture diagnosis. However, enhanced transparency in reporting and adherence to standardized guidelines are necessary to improve the clinical applicability of AI.

Review Registration: PROSPERO (CRD42021240359).

Author summary

Artificial Intelligence (AI) is increasingly employed to detect fractures by using various imaging modalities and data types. Our search of Medline (via PubMed), Web of Science, and IEEE revealed numerous primary studies demonstrating AI’s superior performance in fracture detection. This systematic review and meta-analysis is the first to assess and compare the diagnostic accuracy of AI models across different imaging modalities and data types for various fracture outcomes. We found that AI models achieve high accuracy in fracture detection, particularly with radiograph images. However, we identified significant flaws in study design and reporting, limiting real-world applicability. Few studies provided patient characteristics, and only half reported the hyperparameter selection process. Our findings underscore the benefits of using AI models with radiographs for fracture detection, as they outperform other imaging modalities. Despite similar results across modalities, inadequate methodology and reporting in AI model evaluations call for improvement. Considering AI’s high diagnostic performance, integrating it into existing fracture risk assessment tools could enhance patient identification and enable early intervention.

Citation: Jung J, Dai J, Liu B, Wu Q (2024) Artificial intelligence in fracture detection with different image modalities and data types: A systematic review and meta-analysis. PLOS Digit Health 3(1): e0000438. https://doi.org/10.1371/journal.pdig.0000438

Editor: Martin G. Frasch, University of Washington, UNITED STATES

Received: May 3, 2023; Accepted: December 25, 2023; Published: January 30, 2024

Copyright: © 2024 Jung et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All data generated or analyzed during the study are included in the published paper.

Funding: The research and analysis described in the current publication were supported by a grant (R21MD013681 to QW) from the National Institute on Minority Health and Health Disparities and a grant (R01AG080017 to QW) from the National Institute on Aging. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Bone fractures represent a significant public health concern globally [1], particularly for individuals with osteoporosis [2]. Fractures contribute to work absences, disability, reduced quality of life, health complications, and increased healthcare costs, affecting individuals, families, and societies [3,4]. A meta-analysis of 113 studies reported the pooled cost of hospital treatment for a hip fracture after 12 months as $10,075, with total health and social care costs amounting to $43,669 per hip fracture [5].

Artificial Intelligence (AI), encompassing Machine Learning (ML) and Deep Learning (DL), has been extensively employed for fracture outcome prediction due to technological advancements and accessibility. Various imaging modalities, including X-rays [6,7], computed tomography (CT) [8,9], and magnetic resonance imaging (MRI) [10,11], have been used in fracture diagnosis and detection. AI can also predict fractures using tabular data, such as electronic medical records (structured patient-level data). However, few studies [12–14] have applied AI with tabular data in fracture prediction despite its growing importance over the past decade. Recent systematic reviews and meta-analyses have reported high accuracy for AI in fracture detection and classification. Kuo et al. [15] summarized 42 studies with 115 contingency tables, finding pooled sensitivity of 92% (95% CI: 88, 94) and specificity of 91% (95% CI: 88, 93). Yang et al. [16] reviewed 14 studies on orthopedic fractures, reporting pooled sensitivity and specificity of DL models as 87% (95% CI: 78, 93) and 91% (95% CI: 85, 95), respectively.

However, existing systematic reviews and meta-analyses focused solely on image-based analyses, neglecting a comprehensive examination of various imaging modalities and data types (image, tabular, or both). Despite the strong performance of AI in medical image analysis and with tabular data, a critical gap exists in the current literature concerning the optimal choice of image modalities and the choice between image, tabular, or combined data types. There is a lack of comprehensive guidance on the most effective selection of image modalities and data types for fracture diagnosis. This gap in knowledge underscores the need for systematic investigation to determine which image modality, and by extension, which data type, yields the highest diagnostic accuracy and clinical relevance in AI algorithms. Addressing this gap will not only optimize the design of AI-based diagnostic tools but also enable healthcare practitioners to make informed decisions when selecting appropriate imaging modalities and data types for improved patient care.

Thus, this study primarily aims to evaluate the diagnostic accuracy of AI in fracture detection using diverse imaging modalities and data types, reflecting AI’s growing role in healthcare. Additionally, we seek to synthesize current evidence on AI-based fracture detection, offering a concise overview and discerning the strengths and limitations of various data types, whether image, tabular, or combined.

Materials and methods

Identification and selection of studies

This systematic review, registered with PROSPERO (CRD42021240359), follows PRISMA guidelines (S1 PRISMA Checklist) [17]. We searched Medline (via PubMed), Web of Science, and IEEE. The last search was conducted on December 15, 2022, and we manually searched bibliographies, citations, and related articles of included studies. S1 Text lists each search term. Two independent reviewers (JJ and JD) assessed study eligibility, resolving disagreements through discussion or involving a third author (BL) if necessary.

Eligible studies predicted fracture outcomes using structured patient-level health data (electronic health records and cohort study data) or image-related data (MRI, DXA, and X-ray). We excluded reviews, gray literature, and non-human subject studies, as well as studies without machine learning or deep learning models, fracture outcomes, AUC, accuracy, sensitivity, specificity, or validation, and studies with insufficient algorithm development details. We only considered studies published in English, without time restrictions.

Data extraction

All three categories of data were considered: image-related, tabular, and both. Image-type studies used MRI, DXA, CT, or X-ray; tabular-type studies used structured electronic health records data; image and tabular studies used both data types. Two investigators (JJ and JD) independently evaluated study eligibility and extracted relevant data from articles meeting the inclusion criteria. A structured data collection form was used to capture general study characteristics, population, data preprocessing, clinical outcomes, analytical methods, and results. A third author (BL) resolved discrepancies if necessary. We constructed the contingency table (true positive, true negative, false positive, and false negative) for each study from the reported sensitivity, specificity, positive predictive value, and negative predictive value (S4 Table). If a study reported multiple sensitivity and specificity values, we used the highest sensitivity and specificity.
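As an illustration of how a contingency table can be rebuilt from reported accuracy metrics, the following Python sketch assumes that the numbers of fractured and non-fractured cases in the test set are known. The function name, the rounding rule, and the example numbers are illustrative assumptions and are not taken from the review's own extraction procedure.

```python
def reconstruct_contingency(sensitivity, specificity, n_fracture, n_no_fracture):
    """Rebuild (TP, FN, FP, TN) from reported sensitivity and specificity.

    Assumes the test-set counts of fractured and non-fractured cases are known.
    Counts are rounded to the nearest integer (an assumption of this sketch).
    """
    if not (0.0 <= sensitivity <= 1.0 and 0.0 <= specificity <= 1.0):
        raise ValueError("sensitivity and specificity must lie in [0, 1]")
    tp = round(sensitivity * n_fracture)     # fractures correctly flagged
    fn = n_fracture - tp                     # fractures missed
    tn = round(specificity * n_no_fracture)  # non-fractures correctly cleared
    fp = n_no_fracture - tn                  # non-fractures wrongly flagged
    return tp, fn, fp, tn


# Hypothetical study: 100 fractures, 200 non-fractures.
tp, fn, fp, tn = reconstruct_contingency(0.92, 0.90, 100, 200)
```

When only predictive values and the total sample size are reported, the same idea applies, but the class sizes must first be solved for from the reported metrics.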

Statistical analysis

Meta-analyses were performed using a random-effects model to calculate the pooled sensitivity and specificity based on logit transformation [18,19], using the Clopper-Pearson interval to calculate 95% confidence intervals for each study [20]. We used a unified hierarchical summary receiver operating characteristic curve (HSROC) to investigate the relationship between logit-transformed sensitivity and specificity. We calculated the diagnostic odds ratio and used inverse variance weighting for pooling with random-effects models [21].
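The two ingredients named above, exact per-study intervals and inverse-variance pooling on the logit scale, can be sketched in plain Python. This is a minimal illustration, not the R code used in this study: the DerSimonian-Laird estimator of the between-study variance and the 0.5 continuity correction are assumptions of the sketch, and all function names are hypothetical.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); intended for modest n."""
    return sum(math.comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson(x, n, alpha=0.05):
    """Exact two-sided CI for the proportion x / n, solved by bisection."""
    def solve(k, target):
        # binom_cdf(k, n, p) decreases as p grows, so bisect on p.
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2.0
            if binom_cdf(k, n, mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2.0

    lower = 0.0 if x == 0 else solve(x - 1, 1.0 - alpha / 2.0)
    upper = 1.0 if x == n else solve(x, alpha / 2.0)
    return lower, upper


def pool_logit_random_effects(events, totals):
    """Pool proportions (e.g. TP / (TP + FN)) on the logit scale.

    Returns (pooled, lower, upper) with a 95% confidence interval.
    """
    y, v = [], []
    for e, n in zip(events, totals):
        if e == 0 or e == n:          # continuity correction (assumption)
            e, n = e + 0.5, n + 1.0
        y.append(math.log(e / (n - e)))      # logit of e / n
        v.append(1.0 / e + 1.0 / (n - e))    # variance of the logit
    w = [1.0 / vi for vi in v]               # inverse-variance weights
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    # DerSimonian-Laird between-study variance (assumption of this sketch)
    tau2 = max(0.0, (q - (len(y) - 1)) / c) if c > 0 else 0.0
    w_re = [1.0 / (vi + tau2) for vi in v]
    mu = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    z = 1.959963984540054                    # 97.5th normal percentile

    def inv_logit(t):
        return 1.0 / (1.0 + math.exp(-t))

    return inv_logit(mu), inv_logit(mu - z * se), inv_logit(mu + z * se)


# Hypothetical per-study true positives and fracture counts.
pooled, low, high = pool_logit_random_effects([90, 45, 180], [100, 50, 200])
```

Pooling sensitivity and specificity separately in this way ignores their correlation within studies, which is why a hierarchical (HSROC) model is used alongside it.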

Sensitivity analysis

The logit transformation does not account for the correlation between sensitivity and specificity or for threshold effects, so an alternative model is desirable. Barendregt et al. [22] recommend using the Freeman-Tukey double arcsine transformation instead of the logit transformation. Hence, we used the Freeman-Tukey double arcsine transformation with a random-effects model as a sensitivity analysis.
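For reference, the double arcsine transformation can be written in a few lines of Python. The sketch below uses the original Freeman-Tukey form (a sum of two arcsines with variance 1/(n + 0.5)); some software uses half of this sum with a correspondingly rescaled variance, so the exact convention here is an assumption. The back-transformation shown is the simple approximation, not an exact inversion.

```python
import math


def freeman_tukey(x, n):
    """Double arcsine transform of x events in n trials.

    Returns (t, variance). Unlike the logit, it is defined for x == 0 and x == n.
    """
    t = math.asin(math.sqrt(x / (n + 1))) + math.asin(math.sqrt((x + 1) / (n + 1)))
    return t, 1.0 / (n + 0.5)


def back_transform_approx(t):
    """Approximate inverse: proportion ~ sin(t / 2) ** 2."""
    return math.sin(t / 2.0) ** 2


t_zero, var_zero = freeman_tukey(0, 10)   # finite even with zero events
t_half, _ = freeman_tukey(50, 100)
```

Because the variance depends only on the sample size, studies with proportions at or near 0 or 1 do not need a continuity correction under this transformation.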

Subgroup analysis

Two subgroup analyses were conducted: 1) three data types (images, tabular, or images and tabular) and 2) different image modalities among image data used in AI. Statistical analysis was performed using R [23], with the ‘meta’ [24] and ‘mada’ [25] packages. A p-value of < 0.05 was considered statistically significant.

Publication bias

We used the contour-enhanced funnel plot [26] to assess publication bias for each fracture outcome and data type. Each data point in the contour-enhanced funnel plot represents an individual study, and the plot incorporates contour lines that delineate expected areas of symmetry in the absence of bias. The plot provides insights into potential publication bias, with asymmetry suggesting a deviation from expected publication patterns. We further employed the trim-and-fill method [22] to address publication bias. This statistical approach adjusts for studies potentially missing due to publication bias by imputing hypothetical “filled” studies and recalculating the effect size accordingly.

Risk of bias and applicability

Two reviewers (JJ and JD) independently evaluated the risk of bias in each study using Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) [27], assessing four domains: patient selection, index test, reference standard, and flow and timing. Applicability concerns were evaluated for the first three domains.

Results

Study selection and characteristics

Our search identified 1,128 studies, yielding 717 unique ones after removing duplicates (Fig 1). We screened titles and abstracts and selected 496 studies for full-text review based on our inclusion criteria. We then excluded 254 studies for lacking sensitivity and specificity information (149 studies), not having fracture-related outcomes (75 studies), not using ML models (28 studies), or being survey or review articles (2 studies). We further removed 176 studies because no contingency table could be calculated from the provided information. Ultimately, 66 studies were included in our systematic review and meta-analysis.
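The study-flow counts reported above can be cross-checked with a short Python snippet. The variable names are illustrative; the numbers are those stated in the text.

```python
full_text_reviewed = 496

# Exclusions at full-text review, as itemized in the text.
excluded = {
    "no sensitivity/specificity": 149,
    "no fracture-related outcome": 75,
    "no ML model": 28,
    "survey or review article": 2,
}
no_contingency_table = 176

total_excluded = sum(excluded.values())
included = full_text_reviewed - total_excluded - no_contingency_table
print(total_excluded, included)
```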

Fig 1. Flow chart of the literature selection in PubMed, Web of Science, and Institute of Electrical and Electronics Engineers (search conducted on December 15, 2022).

*IEEE: Institute of Electrical and Electronics Engineers.

https://doi.org/10.1371/journal.pdig.0000438.g001

The selected studies were published between 2007 and 2022, with 73% (48 studies) published in the last three years (Table 1). The studies were conducted in various countries, including Asian countries (26 studies) [6,9,11,28–50], North American countries (19 studies) [14,34,36,51–66], European countries (14 studies) [13,59,67–78], Australia (1 study) [79] and Brazil (2 studies) [10,80] (Table 1). Four studies did not provide the country information [81–84].

Table 1. Fracture detection of 66 selected studies using machine learning and deep learning models and general characteristics of the study.

https://doi.org/10.1371/journal.pdig.0000438.t001

Fracture identification was performed using imaging-related data in 54 studies, tabular data in nine studies, and imaging and tabular data in three. Of the 57 studies using imaging-related and combined data, 33 analyzed radiograph images [6,7,28–31,35–38,40–42,45,47–49,52–57,59,61,62,66–68,72–74,78], 12 analyzed computed tomography (CT) images [8,9,39,43,50,63,65,69,75,81–83], and the remaining studies analyzed other imaging modalities (S1 Table and S2 Table). The most common fracture outcome was vertebral fracture (20 studies) [8,10,11,28,31,34,35,38,44,46,50,51,58,59,65,72,77,80,83,84], followed by hip fracture (18 studies) [6,13,29,32,33,37,39–43,48,53,62,64,66,68,79] and other fracture types (Table 1).

AI algorithms summary

Among the 54 studies that used imaging-related data, convolutional neural networks (CNNs), a deep learning approach, were the predominant choice, followed by transfer learning. In some cases, the limited availability of labeled image data prompted the use of transfer learning [53,69], and certain studies used CNNs pre-trained on non-fracture-related radiological images [6,28,85]. Among the nine studies involving tabular data, fully connected artificial neural networks were the prevailing choice. Logistic regression and ensemble learning models, including Random Forest, Gradient Boosting, and XGBoost, were also commonly employed. Among the three studies that used both image and tabular data, a notable trend was the adoption of support vector machines with various kernels [57,68].

Handling imbalanced data and data augmentation

Imbalanced fracture outcomes were reported in 48 studies (S3 Table). Only 12 studies addressed the handling of imbalanced outcomes during model development, using the Synthetic Minority Over-sampling Technique (SMOTE) [86] or undersampling [35]. Data augmentation was frequently used in image studies, including horizontal and vertical rotation [45,50,58,67,69,72], adding Gaussian noise [67], random rescaling and flipping [30,53], mirroring, and lighting and contrast adjustments [56].
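To make the oversampling idea concrete, the following Python sketch generates synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority neighbours, which is the core step of SMOTE. It is a simplified, standard-library illustration with a hypothetical function name and toy data; it is not the implementation used by any of the included studies.

```python
import math
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points from a list of minority-class tuples."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen sample (Euclidean distance)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Toy minority class with two features per sample (hypothetical values).
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like(minority, n_new=5)
```

Oversampling of this kind should be applied to the training split only, so that synthetic samples do not leak into the data used for evaluation.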

Hyperparameter optimization

Thirty-six studies reported the detailed process for optimizing hyperparameters in the final selected models (S3 Table). Beyaz et al. utilized genetic algorithms to identify the optimal hyperparameters for their CNN architecture [67]. Liu et al. explored the impact of varying the number of hidden neurons in the output layer [32]. Nissinen et al. [72] employed two approaches for hyperparameter searches: random search [87] and hyperband [88].
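Random search, one of the strategies mentioned above, can be sketched as follows. The search space, the `toy_score` function, and all parameter names are hypothetical placeholders; in practice the score would be a validation metric from a trained model.

```python
import math
import random


def random_search(score_fn, space, n_trials=20, seed=0):
    """Sample n_trials configurations at random and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.choice(values) for name, values in space.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Stand-in for validation performance; peaks at lr=0.01, batch_size=32."""
    return -abs(math.log10(params["learning_rate"]) + 2) - abs(params["batch_size"] - 32) / 32


space = {"learning_rate": [0.001, 0.01, 0.1], "batch_size": [16, 32, 64]}
best_params, best_score = random_search(toy_score, space, n_trials=50)
```

Hyperband builds on the same random sampling but allocates more training budget to promising configurations and stops poor ones early.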

Data split and validation in an external data set

Fifty-one studies reported the sample split used for model development (training) and validation (testing) (S3 Table). No universal rule for data separation was found. Studies used different splits, e.g., 80% training and 20% testing [10,28,47,57,71], 90% training and 10% testing [32,33,56,81], and 80% training, 10% validation, and 10% testing [40,41,65,69]. Twenty studies reported cross-validation with 20 folds [66], 10 folds [8,14,33,34,39,45,50,53,57,64,72,76,80,81], 5 folds [13,28,32,38,44,46,48,67,74,78,79], or 7 folds [83]. Thirteen studies performed an out-of-sample external validation [6,7,29–31,35,47,49,56,59,62,72,74]. Choi et al. [47] performed external tests using two distinct types of datasets: temporal data, obtained in a different period from that of model development, and geographically separate data, collected at a different center. Li et al. [35] used a dataset from another medical center that used a different plain radiographic technique.
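A k-fold cross-validation split of the kind reported above can be sketched in Python as follows. The function name and the shuffling seed are illustrative assumptions; the sketch splits at the sample level, whereas studies with several images per patient would need to split by patient to avoid leakage.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


# Hypothetical dataset of 10 samples with 5-fold cross-validation.
splits = list(k_fold_indices(10, 5))
```

Each sample is used for testing exactly once, and the k performance estimates are typically averaged.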

Meta-analysis

We extracted 66 contingency tables, one for each selected study (S4 Table). The overall pooled sensitivity and specificity, calculated using logit transformation, were 91% (95% CI: 88, 93) and 90% (95% CI: 88, 92), respectively (Table 2). The pooled sensitivities for hip and vertebral fractures were 92% (95% CI: 87–96) and 86% (95% CI: 82–89), respectively, while the pooled specificities for these fractures were 90% (95% CI: 85–93) and 86% (95% CI: 81–90), respectively (Table 2). The unified hierarchical summary receiver operating characteristic curve for different fracture types is shown in Fig 2. The area under the curve (AUC) was highest for femoral neck fractures at 0.98, followed by other fractures (0.97), multiple fractures (0.93), hip fractures (0.91), wrist fractures (0.86), and vertebral fractures (0.84).

Fig 2. The hierarchical summary receiver operating characteristic curve for different fracture types in the meta-analysis.

A: Hip (18 studies), B: Vertebral (20 studies), C: Wrist (3 studies), D: Femoral Neck (4 studies), E: Multiple (11 studies), and F: Others (10 studies).

https://doi.org/10.1371/journal.pdig.0000438.g002

Table 2. Pooled sensitivities, specificities, and diagnostic odds ratios for 60 studies across different fracture outcomes.

Studies with only one selected fracture outcome (cervical spine, hand, lumbar spine, proximal humerus, supracondylar, and trabecular bone) were omitted.

https://doi.org/10.1371/journal.pdig.0000438.t002