Accuracy of Artificial Intelligence for Cervical Vertebral Maturation Assessment—A Systematic Review

Background/Objectives: To systematically review and summarize the existing scientific evidence on the diagnostic performance of artificial intelligence (AI) in assessing cervical vertebral maturation (CVM). This review aimed to evaluate the accuracy and reliability of AI algorithms in comparison to those of experienced clinicians. Methods: Comprehensive searches were conducted across multiple databases, including PubMed, Scopus, Web of Science, and Embase, using a combination of Boolean operators and MeSH terms. The inclusion criteria were cross-sectional studies with neural network research, reporting diagnostic accuracy, and involving human subjects. Data extraction and quality assessment were performed independently by two reviewers, with a third reviewer resolving any disagreements. The Quality Assessment of Diagnostic Accuracy Studies (QUADAS)-2 tool was used for bias assessment. Results: Eighteen studies met the inclusion criteria, predominantly employing supervised learning techniques, especially convolutional neural networks (CNNs). The diagnostic accuracy of AI models for CVM assessment varied widely, ranging from 57% to 95%. The factors influencing accuracy included the type of AI model, training data, and study methods. Geographic concentration and variability in the experience of radiograph readers also impacted the results. Conclusions: AI has considerable potential for enhancing the accuracy and reliability of CVM assessments in orthodontics. However, the variability in AI performance and the limited number of high-quality studies suggest the need for further research.


Introduction
In the last few years, there has been an increase in the amount of scientific evidence supporting the diagnostic accuracy and effectiveness of AI in various clinical scenarios [1]. Due to the nature of diagnostic imaging and its repetitive analysis of specific image features, radiology is an area of medicine in which AI is developing most rapidly [2]. Owing to its significant use of imaging and emphasis on cephalometric analysis, orthodontics is particularly well suited for the implementation of AI [3]. Recently, the effectiveness of AI has been evaluated in a number of applications associated with orthodontic treatment, including automated landmark detection and cephalometric analysis, dental and temporomandibular joint (TMJ) diagnostics, treatment planning, treatment outcome evaluation, patient monitoring, and skeletal age assessment [4]. The results of scientific research indicate that AI can significantly enhance the efficiency of clinical orthodontic practice and diminish the workload of practitioners [5,6]. However, the impact of AI algorithms on patient care remains a matter of growing concern.
Growth and maturation are critical factors in the field of orthodontics because they are closely linked to the effectiveness of orthodontic treatment. Patients treated with orthodontic appliances tend to achieve optimal growth and develop a harmonious relationship in the masticatory system before attaining skeletal maturity [7]. The growth rate and facial development stage are vital for lasting orthodontic results. Precise assessment of these factors is necessary to minimize undesired post-treatment changes due to ongoing facial growth [8]. Previous studies have shown that properly aligning orthodontic treatment with a patient's growth phases can increase its effectiveness [9,10].
Adolescent growth rates vary significantly; therefore, chronological age alone does not sufficiently predict the extent of remaining growth [11,12]. The use of skeletal age is a widely accepted and reliable method for evaluating individual growth, and it can be determined through two main approaches: cervical vertebral maturation (CVM) and wrist X-rays [9,13–16]. Both growth intensity and growth potential are important factors for proper treatment timing and the optimal choice of treatment strategy. Since the standard diagnostic orthodontic routine does not involve wrist X-rays due to the additional radiation exposure, CVM currently remains the method of choice for skeletal maturity assessment in these patients [17]. CVM utilizes lateral cephalograms frequently acquired during treatment planning and has already shown accuracy and reliability in skeletal age assessment [16,18].
The original CVM method was further modified by Hassel and Farman in 1995 [19] and by Baccetti in 2005 [20]. The method involves evaluating the development and fusion of the cervical vertebrae, particularly the morphology of the second, third, and fourth vertebrae. Since its introduction, the method has been widely utilized in orthodontics to help determine the optimal timing of orthodontic treatment and to monitor skeletal growth [16]. However, this method requires additional training and experience, and some studies have shown its poor reproducibility, particularly in classifying the shapes of the C3 and C4 vertebral bodies [21,22]. Since AI has already shown its ability to detect features that may be hidden to human readers [23,24], its incorporation into CVM assessment may aid clinicians in proper diagnosis. Due to the continuously increasing number of research papers, it was pertinent to conduct a systematic review of the current body of literature.
The present systematic review aimed to identify and summarize the existing scientific evidence concerning the diagnostic performance of AI in CVM assessment.

Search Strategy and Eligibility Criteria
This systematic review was conducted according to the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) statement [25] (Supplementary Material Tables S1 and S2) and the guidelines from the Cochrane Handbook for Systematic Reviews of Interventions [26]. On 16 January 2024, a series of preliminary searches of the following databases was performed: PubMed, PMC, Scopus, Web of Science, Embase, and the Dental & Oral Health Source (EBSCO). The final search was performed on 31 January 2024 using all of the abovementioned search engines. A combination of Boolean operators (AND/OR) and MeSH/non-MeSH terms was used to select appropriate studies:

[artificial intelligence] OR [deep learning] OR [automated] OR [machine learning] AND [cervical vertebral maturation] OR [skeletal maturity]
Additional studies were selected by searching the reference lists of all included articles, and all related papers were also screened through the PubMed database. The final search string included the following terms: ("cervical vertebrae" OR "cervical vertebra") AND ("maturation" OR "CVM" OR "CVMS" OR "skeletal age" OR "skeletal maturation" OR "skeletal development") AND ("deep learning" OR "machine learning" OR "CNN" OR "SVM" OR "decision tree" OR "random forest" OR "convolutional neural network" OR "neural network" OR "Bayesian" OR "artificial intelligence"). EndNote 21 software was used to collect references and remove duplicates. Study selection was independently carried out by two reviewers (WK and MJ) and evaluated through Cohen's kappa coefficient; any disagreements were resolved by a third expert reviewer (JJO). The same two reviewers extracted study characteristics, such as authors, year of publication, algorithm architecture, dataset partition (training and test), and algorithm accuracy metrics. Based on PICO(S) [27], the framework of this systematic review was developed as follows: population: orthodontic patients; comparison: evaluation of the maturation stage of cervical vertebrae according to the assessment of artificial intelligence software and experienced clinicians; outcomes: accuracy of cervical vertebrae assessment according to CVM or CVMS; and studies: cross-sectional studies with neural network research. The included articles discussed the clinical efficiency of neural networks for evaluating cervical vertebral maturation.
Studies were included if they met the following criteria: (1) cross-sectional studies with neural network research for cervical vertebral maturation assessment, (2) studies reporting diagnostic accuracy, (3) human studies, (4) studies with a sample size of at least 30, and (5) studies published in peer-reviewed journals.
The exclusion criteria were as follows: (a) conference papers, (b) case reports, (c) descriptions of technique, (d) research without quantitative evaluation, (e) book chapters, and (f) records unrelated to the topic of the review.No language restrictions were applied.
After the results were retrieved from the search engines to create a database, duplicates were removed. Then, the titles and abstracts were independently analyzed by two authors (WK and NK) following the inclusion criteria. Full-text articles of potentially eligible studies were then retrieved and reviewed for final inclusion. Disagreements were resolved by discussion with the third author (JJO), using a working spreadsheet to verify the results according to the Cochrane Collaboration guidelines [26]. Cohen's kappa coefficient for agreement between the authors was 0.98, indicating almost perfect agreement.
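As an illustration of the agreement statistic used above, unweighted Cohen's kappa can be computed directly from paired screening decisions. The sketch below is not the review's own data; the reviewer decision lists are hypothetical and serve only to show the calculation:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa: chance-corrected agreement of two raters."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Expected agreement if the raters decided independently
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical include/exclude decisions (1 = include) for 10 screened records
reviewer_1 = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]
reviewer_2 = [1, 0, 0, 1, 1, 0, 0, 1, 1, 0]
print(round(cohens_kappa(reviewer_1, reviewer_2), 3))  # → 0.8
```

A kappa of 0.98, as reported here, would require near-total raw agreement between the two screeners.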

Data Extraction and Quality Assessment
Data on study characteristics, such as study design, sample size, AI algorithm used, CVM method used, and accuracy measures, were extracted using a standardized data extraction form. The quality of the included studies was assessed using the Quality Assessment of Diagnostic Accuracy Studies (QUADAS)-2 tool. The tool includes four domains: patient selection, index test, reference standard, and flow and timing. Each domain is evaluated for bias risk, and the first three domains are also evaluated for applicability concerns. The use of signaling questions aids in assessing bias. QUADAS-2 is used in four steps: summarizing the review question, tailoring the tool to provide review-specific guidelines, creating a primary study flow diagram, and evaluating bias and applicability. It enhances the transparency of bias and applicability ratings in primary diagnostic accuracy studies [28].

Search Results
An initial search using queries tailored to each database retrieved a total of 314 articles. After 111 duplicate articles were removed, the remaining 203 studies were initially screened. Subsequently, 165 studies were removed because they were out of the scope of the review. Figure 1 presents the PRISMA flow diagram thoroughly describing the search process. Both reviewers had a high level of agreement in this phase, achieving a Cohen's kappa of 0.98; the few disagreements were resolved by a third reviewer (JJO). Subsequently, 38 articles underwent full-text screening, of which twenty were excluded: seven were reviews of the literature, six did not evaluate AI systems, five did not evaluate cervical maturation, and two did not present a structured methodology with clear results (Supplementary Material Table S3). Ultimately, 18 articles were found to be eligible for inclusion in the review. The data obtained from the studies are presented in Table 1.
In one of the included studies, the final classification accuracy ranking on the test set was ResNet152 > DenseNet161 > GoogLeNet > VGG16. ResNet152 proved to be the best of the four models for CVM classification, with a weighted κ of 0.826, an average AUC of 0.933, and a total accuracy of 67.06%. The F1 score rank for each subgroup was CS6 > CS1 > CS4 > CS5 > CS3 > CS2. The areas of the third (C3) and fourth (C4) cervical vertebrae were activated when the CNNs assessed the images. Another system (psc-CVM) achieved good performance for CVM assessment, with an average AUC (area under the curve) of 0.94 and a total accuracy of 70.42% on the test set. Cohen's kappa between the system and the expert panel was 0.645, the weighted kappa was 0.844, and the overall ICC between the psc-CVM assessment system and the expert panel was 0.946. The F1 score rank for the psc-CVM assessment system was CVS (cervical vertebral maturation stage) 6 > CVS1 > CVS4 > CVS5 > CVS3 > CVS2. The studies were predominantly conducted in Turkey (n = 7), followed by Korea and China (n = 3 each), with additional studies from the USA (n = 2), Iran (n = 2), and France (n = 1). Notably, the eighteen included studies were from only twelve research groups, indicating the niche nature of the topic, which is being developed by a small group of researchers throughout the world. The overall sample included in the review comprised 30,275 cephalograms, with sample sizes varying from 419 to 10,200 among the studies.
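The weighted kappa reported for the psc-CVM system accounts for the ordinal nature of the CVM stages: confusing adjacent stages is penalized less than confusing distant ones, which is why the weighted value (0.844) exceeds the unweighted one (0.645). A minimal sketch of quadratically weighted kappa, with stage labels 0-5 standing in for CS1-CS6 and purely hypothetical AI-versus-expert ratings:

```python
def weighted_kappa(y1, y2, n_classes):
    """Quadratically weighted Cohen's kappa for ordinal labels 0..n_classes-1."""
    n = len(y1)
    # Disagreement weight grows with the squared distance between stages
    w = lambda i, j: (i - j) ** 2 / (n_classes - 1) ** 2
    # Observed weighted disagreement
    obs = sum(w(a, b) for a, b in zip(y1, y2)) / n
    # Expected weighted disagreement under independent marginal distributions
    pa = [y1.count(c) / n for c in range(n_classes)]
    pb = [y2.count(c) / n for c in range(n_classes)]
    exp = sum(w(i, j) * pa[i] * pb[j]
              for i in range(n_classes) for j in range(n_classes))
    return 1 - obs / exp

# Hypothetical AI vs. expert stages: a single adjacent-stage disagreement
ai     = [0, 1, 2, 3, 4, 5]
expert = [0, 1, 2, 3, 4, 4]
print(round(weighted_kappa(ai, expert, 6), 3))  # → 0.968
```

An off-by-one stage call barely lowers the coefficient, whereas mistaking stage 6 for stage 1 would cost the full weight, matching the clinical intuition that adjacent-stage confusion is far less harmful for treatment timing.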

Risk of Bias
The overall risk of bias in the studies included in the review was rather low or unclear; however, not all studies provided proper descriptions of the methods applied, and two studies were at high risk of bias. The main shortcomings in patient selection were the lack of a detailed description of subject enrollment and of randomization of the subjects before manual analysis of the radiographs, which could have resulted in bias. If a study provided accurate patient demographics and vertebral maturity assessments for the included patients, the risk of bias was considered low. If a study did not provide complete demographic data, or an assessment of vertebral maturity was not indicated, the risk of bias was considered unclear. If a study only stated that a certain number of radiographs were included, without providing their characteristics, the risk was considered high. For the same reasons, it remains unclear whether the presented results can be applied to a wider spectrum of populations. One study in particular was rated as high risk, as the authors did not indicate any characteristics of the included radiographs beyond their number. The risk of index test bias was considered low if both intra-rater and inter-rater agreement were examined; if information about either of these examinations was missing, the risk was rated unclear. Notably, in the trials where no error study was performed, the assessments were not made manually by more than one orthodontist. Due to the prevalence and validity of the vertebral evaluation method, the risk of bias due to the reference standard was low, except for the study by Seo et al. [45], who did not describe the reference method. All but one study clearly described the intervals and timing. The applicability concerns regarding patient selection remain the same due to the nature of the study material. In the case of Makaremi et al. [42], the applicability of the index test is unclear, as both a detailed description and the timing of the index test are lacking, while in Seo et al. [46] the description of the index test and reference standard left too many uncertainties; therefore, the risk should be considered high. The summary of the risk of bias assessment is presented in Table 2. The reference standards were set according to three methods. Most of the studies used the method by Baccetti et al. [30,31,33,34,39,43–46], followed by the methods by Hassel and Farman [29,36–38] and by McNamara and Franchi [32,35,39,41,42]. The number of observers evaluating the radiographs, their experience, and their professions varied widely among the studies. One of the Seo et al. studies did not mention the number of readers [45]. In eight of the studies, there was only one reader [32,34,36–38,41,43,44]. However, in the studies by Kök et al., the reader assessed the images twice with a fixed time interval [36–38]. Among the remaining studies, the number of readers was greater, up to four in the case of the study by Amasya et al. [31].

Diagnostic Accuracy
Subgroup analyses based on geographic location, sample size, and AI model type highlighted variations in diagnostic accuracy. The pooled accuracy varied from 0.57 (Akay et al. [29]) to 0.956 (Seo et al. [45]). Sensitivity analyses confirmed the robustness of the findings, with predominantly consistent results across different study designs and populations. However, studies with greater methodological rigor and larger sample sizes tended to report more reliable diagnostic performance. A summary of the diagnostic accuracy metrics reported in the included studies can be found in Table 3; when available, the detailed accuracy metrics for each maturation stage are also included. Graphical presentations of the available accuracy metrics are shown in Figures 2 and 3. Seo, Atici, and Kim presented their results as confusion matrices [33,35,44,45]. The findings of four studies were not included in the table [30,31,34,43]. Amasya et al. presented only the results of concordance calculations between human expert readers and selected AI systems [30,31]. Radwan et al. assessed three sets of stages: prepubertal (stages 1 and 2), pubertal (3 and 4), and postpubertal (5 and 6) [43]. Khazaei et al. assessed the accuracy of the model in three- and two-class scenarios [34].
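For the studies that report confusion matrices, per-stage precision, recall, and F1 can be derived, and a six-stage matrix can be collapsed into the three broader classes of the kind used by Radwan et al. (prepubertal, pubertal, postpubertal). The matrix below is hypothetical and only illustrates the computation and why coarser groupings tend to report higher accuracy:

```python
def per_stage_f1(cm):
    """Per-class (precision, recall, F1) from a confusion matrix
    with rows = true stage and columns = predicted stage."""
    k = len(cm)
    stats = []
    for s in range(k):
        tp = cm[s][s]
        fn = sum(cm[s]) - tp
        fp = sum(cm[r][s] for r in range(k)) - tp
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        stats.append((prec, rec, f1))
    return stats

def collapse(cm, groups):
    """Merge stage rows/columns into broader classes, e.g. CS1-2, CS3-4, CS5-6."""
    idx = {s: g for g, grp in enumerate(groups) for s in grp}
    out = [[0] * len(groups) for _ in groups]
    for i, row in enumerate(cm):
        for j, v in enumerate(row):
            out[idx[i]][idx[j]] += v
    return out

def accuracy(cm):
    return sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))

# Hypothetical 6-stage confusion matrix: all errors fall between adjacent stages
cm = [[8, 2, 0, 0, 0, 0],
      [1, 9, 0, 0, 0, 0],
      [0, 0, 7, 3, 0, 0],
      [0, 0, 2, 8, 0, 0],
      [0, 0, 0, 0, 9, 1],
      [0, 0, 0, 0, 2, 8]]
print(round(accuracy(cm), 3))                            # → 0.817
print(accuracy(collapse(cm, [(0, 1), (2, 3), (4, 5)])))  # → 1.0
```

In this constructed case, every six-stage error stays inside its broader class, so three-class accuracy is perfect while six-stage accuracy is only 0.817; direct comparisons between three-class studies and six-stage studies should therefore be made cautiously.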

Discussion
Human maturation is a continuous process, making the estimation of CVM challenging, with approximately one in three cases misclassified [47]. The task is typically converted into a classification problem by discretizing continuous CVM levels into six classes, posing challenges in achieving satisfactory performance even for experienced radiologists. However, AI has shown promising results in various dental fields by enhancing human performance and accelerating decision-making processes [48]. While AI has demonstrated superior diagnostic accuracy in skeletal age assessment using wrist and index finger X-rays [49,50], its accuracy in the estimation of CVM remains variable. Although there are already two systematic reviews regarding the use of AI in CVM assessment [51,52], a constantly expanding body of literature provides additional original articles that need to be systematically reviewed. Rana et al. included 13 papers [51], whereas Mathew included eight papers [52]. Thus, it was necessary to systematically review the current body of literature. This systematic review aims to map the existing scientific evidence on the diagnostic performance of AI in CVM assessment, with a focus on the diagnostic accuracy and operational characteristics of various AI approaches.
The 18 studies included in this review demonstrated a wide range of pooled diagnostic accuracies, from 57% to over 95%, highlighting both the potential and limitations of AI technologies in CVM assessment. These variations can be attributed to several factors, including the choice of AI models, the nature of the training data, and the methods employed in each study. Most studies utilized convolutional neural networks (CNNs) [29,33–35,39–41,43–46], reflecting the prevailing trend toward employing deep learning techniques for complex image recognition tasks in medical diagnostics. The performance of these CNN models often surpasses that of traditional methods, particularly when pretrained models are adapted for specific tasks [53]. This adaptation likely benefits from transfer learning, where a model developed for one task is repurposed for another related task, bringing in preexisting knowledge that can be fine-tuned with a smaller set of targeted data. However, the integration of AI into clinical practice raises significant concerns about the generalizability of these models. Most studies were geographically concentrated in countries such as Turkey [29–31,36–38,43], Korea [35,44,45], and China [39,40,46], which may influence the diversity of training datasets. Such datasets may not adequately represent the global population, potentially limiting the applicability of these AI models in different demographic settings. Moreover, the reliance on data from specific research groups further narrows the diversity of data, potentially leading to models that perform well on specific types of data but fail to generalize across broader populations.
The methodological approaches used to assess the performance of AI models varied across the studies. Some studies employed cross-validation techniques [29,32,35] to mitigate overfitting and enhance the ability of models to generalize to new data. However, the lack of uniformity in validation methods, such as the variation in the number of folds used in cross-validation [37,46], introduces inconsistencies in assessing model performance. Additionally, the review revealed a high degree of variability in the experience and number of readers evaluating the radiographs, ranging from single-reader assessments [32,36,41,43,44] to multiple readers assessing at different intervals. This variability could introduce additional biases into the training data, as the interpretation of CVM stages is subject to inter-rater and intra-rater variability. The results of some studies were also affected by the lower number of stages assessed (prepubertal, pubertal, and postpubertal) [34,43]. Furthermore, the ethical considerations of deploying AI in clinical settings, such as ensuring the transparency of AI processes, ethical data collection, and patient confidentiality, were not adequately addressed in all studies; this was reflected in the majority of the studies (12 out of 18) scoring unclear to high risk for patient selection in the risk of bias assessment using the QUADAS-2 tool.
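The k-fold cross-validation referred to above can be sketched as follows: the dataset is shuffled once and split into k disjoint folds, and each fold serves exactly once as the held-out test set. The fold count and dataset size below are illustrative only (419 is the smallest sample size among the included studies):

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Shuffle sample indices and yield (train, test) index lists;
    each fold serves exactly once as the held-out test set."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test

# Illustrative: 419 cephalograms split into 5 folds
splits = list(k_fold_splits(419, 5))
print(len(splits), sorted(len(test) for _, test in splits))  # → 5 [83, 84, 84, 84, 84]
```

Because every cephalogram is tested exactly once, the averaged fold accuracies give a less optimistic estimate than a single lucky train/test split; varying k across studies, as noted above, changes the test-fold sizes and thus the variance of the reported estimates.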
A significant problem associated with CVM evaluation is its high inter- and intra-rater variability. A recent paper by Shoretsaniti et al. [54] evaluated the reproducibility and efficiency of CVM assessment. The study included evaluations by six experts in radiology and orthodontics. The intra-rater reliability ranged from 77.0% to 87.3%, meaning that up to a quarter of CVM stage diagnoses were changed on re-reading. The inter-rater agreement was even worse, with an absolute agreement of 42.8%. The study also showed the lowest reproducibility for stage 3, a crucial stage that marks the beginning of pubertal growth. These results align with other studies that show significant discrepancies in CVM assessment [22,55,56]. Such low inter- and intra-rater reproducibility indicates that the assessment of CVM stage is biased by high variability among raters. Therefore, the results of studies showing more than 90% AI accuracy in CVM assessment should be considered very optimistic. It should be emphasized that individual errors and inconsistencies by raters assessing the CVM stage in the training sample significantly impact the learning process of the applied AI model. However, as stated in a Nature paper by Topol [57], AI will likely boost human performance and accelerate decision-making in currently problematic tasks. Figure 4 presents samples of all six stages verified according to the method by Baccetti et al. [20].
With the increased use and popularity of cone beam computed tomography (CBCT) in orthodontic treatment planning, future studies could test the efficacy of AI in assessing CVM using CBCT data. Given the availability and widespread use of CBCT, incorporating this technology could also help reduce multiple radiation exposures. However, to date, no studies have been published on this topic. Additionally, an interesting direction could be the use of MRI in CVM assessment, potentially leading to a radiation-free method of skeletal age assessment. Furthermore, future research should focus on testing AI models on more diverse samples to decrease bias. Since most of the studies evaluated in the present systematic review were conducted in Asia, it is uncertain whether the findings can be generalized to other, more diverse populations. Collaboration among researchers is essential to achieve these goals and enhance the robustness of AI models in clinical applications.
A recent paper by Obuchowski et al. [58] critically evaluated and proposed an appropriate research protocol for multireader-multicase (MRMC) studies. Due to the rapid development of AI and the necessity of assessing the diagnostic accuracy of tested AI models, the MRMC study design continues to play a key role in the translation of novel imaging tools to clinical practice. Unlike most medical studies, MRMC requires a reference standard and sampling from both reader and patient populations, making these studies costly and time-consuming. The authors indicated that investigators often attempt numerous analyses and report only the most promising results. Moreover, evaluations based on a single reader's opinion are highly subjective and can significantly affect model performance metrics, resulting in overly enthusiastic reports. Therefore, the required number of readers, preferably from different institutions and with varying levels of expertise, should be at least five [58,59]. None of the included studies provided such a high number of expert readers, with a predominance of one- or two-reader studies. Given this, together with the significant variability in CVM stage assessment [47], we believe that despite initial optimistic results, AI-based CVM assessment still requires extensive research before it can be routinely applied in clinical practice. However, given these highly encouraging results, we anticipate that future advancements in AI technology will improve the diagnostic accuracy of CVM tools, potentially making them as reliable as wrist X-ray assessments for determining skeletal maturity.
This study has several limitations, including significant heterogeneity among the included studies in terms of study design, sample size, and the AI algorithms used.These variations could impact the generalizability and comparability of the findings.

Conclusions
Despite the promising results, the studies exhibited heterogeneity in the AI algorithms used, sample sizes, and study designs, which could influence the generalizability of the findings.The risk of bias was generally low, although some studies showed unclear risk, mainly due to the lack of detailed methodological descriptions.
In conclusion, AI has considerable potential for enhancing the accuracy and reliability of CVM assessments in orthodontics.The pooled accuracy for CVM stage assessment varied from 0.57 to 0.956.However, the variability in AI performance and the limited number of high-quality studies suggest the need for further research.

Table 1 .
Characteristics of the included studies.

The proposed CNN model, preceded by a layer of tunable directional filters, achieved a validation accuracy of 84.63% in CVM stage classification into five classes, exceeding the accuracy achieved with the other DL models investigated.

Table 2 .
Risk of bias assessment according to the QUADAS-2 tool.

Table 3 .
Comparison of the diagnostic accuracy parameters of the best-performing AI models.
Figure 4. Samples of all six stages verified according to the method by Baccetti et al. [20].