3.1 Literature screening
We exported the full set of retrieved records, imported them into EndNote, and merged and de-duplicated them. Screening then proceeded in two stages: a coarse screening based on reading the abstracts, followed by a fine screening based on reading the full text. Using the predefined search conditions, we retrieved 126 studies from PubMed, 140 from Web of Science, and 295 from other databases. After reading the titles and removing duplicate references, 39 articles remained (Fig. 2). Overall, these 39 studies (Supplementary Table S1) met the inclusion requirements [13–33]. They were randomized controlled studies with a total of 2,959 participants: eight were from the United States; seven from China; six from Canada; five from Germany; three from the Netherlands; two from France; and one each from Russia, Belgium, India, Switzerland, New Zealand, Tunisia, Turkey, Thailand, and Japan.
3.2 Literature quality analysis
The quality analysis shows that the risk of bias was relatively low in most studies (Fig. 3 and Supplementary Figure S2). A few studies lacked information on performance and detection bias [16, 18, 21, 34] because, given the nature of the intervention, it was impractical to blind students and residents during the study. Most studies were judged to have a low risk of selection bias and of attrition bias, owing to random allocation to groups and complete outcome data [14–18, 20–25, 29–32, 35–48]. Selective reporting was judged by whether the results were fully reported in the manuscript or discussion section. Three studies were judged to be at high risk of other bias because the experimental or control group had fewer than 10 participants [15, 19, 49]. Finally, one study was judged to have a high risk of performance bias because the participants had a biased understanding of the assigned interventions during the study period [19].
3.3 Literature information analysis
As shown in Fig. 1, the United States, China, Canada, and Germany contributed the most randomized controlled trials on this topic, followed by the Netherlands.
As shown in Supplementary Figure S1, the largest number of randomized controlled trials on this topic were conducted in neuroanatomy, followed by head and neck, liver, and cardiac anatomy.
3.4 Data merging of test scores
Using Stata/MP 17 (64-bit), we produced a forest plot of test scores. Of the 39 included articles, 35 reported the effect of the intervention on test scores, and data from 25 of these were pooled.
In the overall pooled analysis under the random-effects model, 3D technology significantly improved learners’ test scores compared with traditional learning (SMD = 0.69, 95% CI = 0.24 to 1.14, p < 0.05, I² = 93.8%; Fig. 4).
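For readers who want to see how a random-effects estimate and I² of this kind are obtained, the following is a minimal Python sketch of DerSimonian-Laird pooling. It is not the authors' Stata workflow, and the per-study SMDs and variances are hypothetical placeholders rather than data from the included trials.

```python
import math

def pool_random_effects(effects, variances):
    """DerSimonian-Laird random-effects pooling of per-study effect sizes."""
    k = len(effects)
    w = [1.0 / v for v in variances]                       # inverse-variance weights
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))  # Cochran's Q
    df = k - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)                          # between-study variance
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0  # I-squared, in percent
    w_star = [1.0 / (v + tau2) for v in variances]         # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return {
        "smd": pooled,
        "ci_low": pooled - 1.96 * se,
        "ci_high": pooled + 1.96 * se,
        "tau2": tau2,
        "i2": i2,
    }

# Hypothetical per-study SMDs and sampling variances (illustration only).
smds = [0.2, 0.9, 1.6, -0.1, 0.7]
variances = [0.04, 0.05, 0.06, 0.03, 0.05]
result = pool_random_effects(smds, variances)
print(result)
```

A large I², as reported here, means that most of the observed variation between studies is attributed to heterogeneity rather than sampling error, which widens the random-effects confidence interval.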
In the subgroup analysis, the pooled effect for studies from China was statistically significant (SMD = 1.72, 95% CI = 1.04 to 2.40, p < 0.05, I² = 88.8%; Fig. 4), whereas the pooled effect for studies from outside China was not (SMD = 0.29, 95% CI = -0.19 to 0.77, p > 0.05, I² = 92.9%; Fig. 4).
The medical students subgroup showed a statistically significant effect (SMD = 0.68, 95% CI = 0.17 to 1.19, p < 0.05, I² = 94%), whereas the residents subgroup did not (SMD = 0.74, 95% CI = -0.31 to 1.80, p > 0.05, I² = 94.5%; Supplementary Figure S3).
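A reported SMD and its 95% CI can be cross-checked against the stated significance level. The sketch below assumes a symmetric Wald-type interval and a normal approximation, which may differ slightly from the method Stata applied. It recovers the standard error, z statistic, and two-sided p-value from the country subgroup values reported above; the function name is illustrative.

```python
import math

def p_from_ci(estimate, ci_low, ci_high, z_crit=1.96):
    """Approximate SE, z, and two-sided p from an estimate and its 95% CI."""
    se = (ci_high - ci_low) / (2.0 * z_crit)   # half-width divided by critical value
    z = estimate / se
    p = math.erfc(abs(z) / math.sqrt(2.0))     # two-sided p under the normal model
    return se, z, p

# Values taken from the country subgroup analysis in the text.
china = p_from_ci(1.72, 1.04, 2.40)
outside_china = p_from_ci(0.29, -0.19, 0.77)
print("China:         SE=%.3f z=%.2f p=%.4g" % china)
print("Outside China: SE=%.3f z=%.2f p=%.4g" % outside_china)
```

Under these assumptions, an interval that excludes zero corresponds to p < 0.05 and an interval that spans zero corresponds to p > 0.05.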
3.5 Data merging of satisfaction degree
Ten studies [19, 27, 31, 38, 40, 42, 45, 47, 49, 50] evaluated satisfaction as a secondary outcome (Fig. 5A). The pooled results under the random-effects model show that students were more satisfied with learning through 3D methods than through traditional or 2D teaching methods (SMD = 0.70, 95% CI = 0.32 to 1.07, p < 0.05, I² = 69.0%), which may be related to the more intuitive experience offered by 3D technology. When the literature from China was excluded, the difference in favor of the 3D group remained statistically significant (SMD = 0.79, 95% CI = 0.30 to 1.29, p < 0.05, I² = 69.0%; Fig. 5B).
3.6 Data merging of time and enjoyment degree
We included eight articles on time consumption [15, 19, 20, 24, 43, 48] and four articles in the forest plot of enjoyment [27, 35, 37, 45]. For time consumption, there was no statistically significant difference between the 3D group and the traditional group (SMD = -0.55, 95% CI = -1.23 to 0.14, p > 0.05, I² = 86.5%; Supplementary Figure S4). When the study from China was removed, this result remained unchanged (Supplementary Figure S5). In contrast, the forest plot of enjoyment shows that participants enjoyed learning with 3D technology more (SMD = 3.04, 95% CI = 1.05 to 5.04, p < 0.05, I² = 95.8%; Fig. 6A). When the study from China was removed, this result also remained statistically significant (SMD = 2.73, 95% CI = 0.32 to 5.15, p < 0.05, I² = 96.5%; Fig. 6B).
3.7 Publication bias
The funnel plots are approximately symmetrical: the vertical line in the middle represents the pooled effect estimate, and the studies are distributed roughly evenly on both sides of it, forming an inverted funnel shape (Figures S6–S8). Visually, this suggests no obvious publication bias for test scores, test time, or satisfaction. Consistent with this, Egger’s test for test time and test scores showed no significant asymmetry (p > 0.05); therefore, no apparent publication bias was observed for these outcomes. However, Egger’s test for satisfaction indicated publication bias (p < 0.05); when the Chinese study was removed, no publication bias was found (p > 0.05).
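Egger's test formalizes funnel-plot asymmetry as a regression. The sketch below illustrates the classical form, in which each study's standardized effect (effect divided by its standard error) is regressed on its precision (one over the standard error) and the intercept is tested against zero. The effect sizes and standard errors are hypothetical and were constructed to show a small-study effect. The p-value lookup from the t distribution with k - 2 degrees of freedom is left out to keep the example dependency-free.

```python
import math

def egger_test(effects, std_errors):
    """Egger's regression: standardized effect on precision, by ordinary least squares."""
    k = len(effects)
    x = [1.0 / s for s in std_errors]                   # precision
    y = [e / s for e, s in zip(effects, std_errors)]    # standardized effect
    mx, my = sum(x) / k, sum(y) / k
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx                         # asymmetry term
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(r * r for r in resid) / (k - 2)            # residual variance
    se_intercept = math.sqrt(s2 * (1.0 / k + mx ** 2 / sxx))
    t_stat = intercept / se_intercept
    return intercept, se_intercept, t_stat, k - 2

# Hypothetical data in which smaller studies (larger SE) report larger effects.
effects = [0.46, 0.515, 0.62, 0.655, 0.76, 0.89]
std_errors = [0.10, 0.15, 0.20, 0.25, 0.30, 0.40]
intercept, se_intercept, t_stat, df = egger_test(effects, std_errors)
print(intercept, se_intercept, t_stat, df)
```

An intercept far from zero relative to its standard error indicates asymmetry, which is commonly interpreted as possible publication bias or other small-study effects.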
3.8 Sensitivity analysis
Because heterogeneity was substantial (I² > 75%), we performed a sensitivity analysis to verify the robustness of the results. When any single study was removed from the model, the pooled results for test scores, satisfaction, and test time remained essentially unchanged (Figs. 7A–7B, Figure S9). This indicates that the pooled results are robust.
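The sensitivity analysis described above can be sketched as a leave-one-out loop that omits each study in turn and re-pools the rest. The Python code below is an illustration using DerSimonian-Laird pooling and hypothetical per-study SMDs and variances. It does not reproduce the authors' Stata output.

```python
import math

def dl_pool(effects, variances):
    """Return the DerSimonian-Laird pooled estimate and its standard error."""
    w = [1.0 / v for v in variances]
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(effects) - 1)) / c)
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    return pooled, math.sqrt(1.0 / sum(w_star))

def leave_one_out(effects, variances):
    """Re-pool the data with each study omitted in turn."""
    rows = []
    for i in range(len(effects)):
        rest_e = effects[:i] + effects[i + 1:]
        rest_v = variances[:i] + variances[i + 1:]
        pooled, se = dl_pool(rest_e, rest_v)
        rows.append((i, pooled, pooled - 1.96 * se, pooled + 1.96 * se))
    return rows

# Hypothetical per-study SMDs and sampling variances (illustration only).
smds = [0.2, 0.9, 1.6, -0.1, 0.7]
variances = [0.04, 0.05, 0.06, 0.03, 0.05]
rows = leave_one_out(smds, variances)
for omitted, est, low, high in rows:
    print("omit study %d: SMD=%.2f (95%% CI %.2f to %.2f)" % (omitted, est, low, high))
```

If every leave-one-out interval stays on the same side of zero as the full-sample interval, no single study is driving the conclusion.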
3.9 Meta-regression analysis
To examine the influence of potential moderators, we performed a meta-regression on four factors: participant type, country, course, and year (Table 1). We compared medical students versus other learners, China versus other countries, and neuroanatomy versus other anatomy. The P-value for country was less than 0.05 (P = 0.01), indicating that country had a significant influence on the results, whereas the other factors did not.
Table 1
Meta-regression analysis for different subgroups
Factors | Coefficient | Standard error | P | 95% CI |
Participant | 0.46 | 0.61 | 0.46 | -0.81 to 1.74 |
Year | -0.7 | 0.45 | 0.14 | -1.64 to 0.25 |
Course | -0.63 | 0.5 | 0.22 | -1.68 to 0.42 |
Country | -1.47 | 0.5 | 0.01 | -2.51 to -0.44 |
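To illustrate what a coefficient such as the one for Country represents, the sketch below fits a meta-regression with a single binary moderator by weighted least squares. Several assumptions apply: the between-study variance tau2 is supplied by the user rather than estimated, the p-value uses a normal approximation, and the data and the 0/1 coding are hypothetical. Stata's meta-regression estimates tau² itself and may use a different test statistic, so the numbers here are not comparable with Table 1.

```python
import math

def meta_regression_binary(effects, variances, moderator, tau2=0.0):
    """Weighted least squares of effect size on one 0/1 moderator."""
    w = [1.0 / (v + tau2) for v in variances]            # random-effects weights
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, moderator)) / sw
    my = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, moderator))
    sxy = sum(wi * (xi - mx) * (yi - my)
              for wi, xi, yi in zip(w, moderator, effects))
    slope = sxy / sxx                                     # moderator coefficient
    intercept = my - slope * mx
    se_slope = math.sqrt(1.0 / sxx)                       # model-based standard error
    z = slope / se_slope
    p = math.erfc(abs(z) / math.sqrt(2.0))                # two-sided normal p-value
    return {"intercept": intercept, "coef": slope, "se": se_slope, "z": z, "p": p}

# Hypothetical data: moderator 0 = one country group, 1 = the other group.
smds = [1.8, 1.5, 2.0, 0.3, 0.2, 0.4, 0.1]
variances = [0.05] * 7
group = [0, 0, 0, 1, 1, 1, 1]
fit = meta_regression_binary(smds, variances, group, tau2=0.1)
print(fit)
```

With a binary moderator, the coefficient equals the difference between the two groups' weighted mean effects, so a negative coefficient means the group coded 1 has the smaller pooled effect.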