The diagnostic accuracy of magnetic resonance imaging for anterior cruciate ligament injury in comparison to arthroscopy: a meta-analysis

We performed this meta-analysis to examine the diagnostic accuracy of MRI for the diagnosis of anterior cruciate ligament (ACL) injury in comparison to arthroscopy. We also compared the diagnostic accuracy of MRI with magnetic field intensities (MFI) greater than or equal to 1.5T with those below 1.5T, in addition to different MRI sequences. Studies relevant to the diagnosis of ACL injury by MRI and arthroscopy were analyzed. Computer and manual retrieval were carried out on studies published between January 1, 2006 and May 31, 2016. Twenty-one papers were included. Neither threshold nor non-threshold effects were present (p = 0.40, p = 0.06). The pooled sensitivity (SE), specificity (SP), positive likelihood ratio (LR+), negative likelihood ratio (LR−) and diagnostic odds ratio (DOR) with 95% confidence interval (CI) were 87% (84–90%), 90% (88–92%), 6.78 (4.87–9.44), 0.16 (0.13–0.20) and 44.70 (32.34–61.79), respectively. The area under the curve (AUC) was 0.93. The risk of publication bias was negligible (p = 0.75). In conclusion, examination by MRI is able to provide appreciable diagnostic performance. However, the principle, which states that the higher the MFI, the better the diagnostic accuracy, could not be verified. Additionally, conventional sequences (CSs) associated with proton density-weighted imaging (PDWI) are only slightly better than CSs alone, but not statistically different.


Results
Study selection. A total of 1922 articles were initially retrieved for this meta-analysis: 481 from PubMed, 783 from EMBASE, 470 from Ovid, 129 from BIOSIS Previews, 53 from the Cochrane library and 6 obtained by manual retrieval from relevant references or by e-mailing authors. Of 1232 duplicated reports, 759 were then eliminated because they originated from the same team or the same set of data. In the initial screening, 110 articles were selected according to the inclusion and exclusion criteria after reading the title and abstract (71 from PubMed, 24 from EMBASE, 11 from Ovid and 4 from the Cochrane library), and these articles were marked with 1 star in EndNote software. By evaluating the full text, two researchers (K.L. and J.D.) then selected 31 papers that strictly complied with the inclusion and exclusion criteria and marked them with 2 stars in EndNote. Ten studies were excluded after re-assessing the full text during the third screening. Finally, 21 articles 9,14-33 were chosen and marked with 3 stars; these were the articles for which true positive (TP), false positive (FP), true negative (TN) and false negative (FN) results could be extracted or accurately calculated through 2 × 2 contingency tables (16 from PubMed and 5 from EMBASE). These articles consisted of 16 prospective studies and 5 retrospective studies, for a total of 1722 cases. The literature search, the screening process and the results are shown in Fig. 1. The basic characteristics of the included studies are displayed in Table 1.
Assessment of risk of bias within studies. The methodological quality assessment of risk of bias within eligible studies, performed according to the Quality Assessment of Diagnostic Accuracy Studies 2 (QUADAS-2) tool, is shown in Fig. 2. Overall, across the four domains (patient selection, index test, reference standard, and flow and timing), 39 judgements indicated a low risk of bias and 45 an unclear risk of bias. The number of high, unclear and low concerns regarding applicability was 42, 8 and 13, respectively, for the three domains (patient selection, index test and reference standard).

Heterogeneity test of individual studies.
Results of the heterogeneity test for the threshold effect were as follows: sensitivity (SE) and specificity (SP), and likewise positive likelihood ratio (LR+) and negative likelihood ratio (LR−), were not negatively correlated in the forest plots (Fig. 3A-D). The distribution of accuracy estimates of each independent study did not show the "shoulder arm" shape in the summary receiver operating characteristic (sROC) plane (Fig. 4). The Spearman correlation between the logit of sensitivity and the logit of 1-specificity (r = 0.194, p = 0.40) indicates that the threshold effect was absent. Regarding the heterogeneity test for the non-threshold effect, the result of the Cochran-Q test (p = 0.06) indicates that the non-threshold effect was also absent (Fig. 5).
A random effects model was used for pooled SE [p < 0.001, inconsistency index (I²) = 57.9%], pooled SP (p < 0.001, I² = 72.7%) and pooled LR+ (p < 0.001, I² = 64.3%) (Fig. 3A-C). A fixed effects model was used for pooled LR− (p = 0.12, I² = 27.2%) (Fig. 3D) and pooled diagnostic odds ratio (DOR) (p = 0.06, I² = 34.5%) (Fig. 5). The I² statistic, which is derived from the chi-square statistic Q, was used to quantify the degree of heterogeneity in eligible studies; it expresses the percentage of the total variation observed across studies that is caused by heterogeneity rather than by chance. There is no observed heterogeneity when I² = 0, implying that all the variability observed in the effect estimates is due to sampling error rather than heterogeneity amongst trials. Low, moderate and high heterogeneity correspond to I² < 25%, 50% < I² < 75% and I² > 75%, respectively. Values of I² = 25%, 50% or 75% mean that 1/4, 1/2 or 3/4 of the variability observed in the effect estimates is attributable to inconsistency among trials.
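The relationship between Q and I² can be illustrated with a short calculation. The sketch below assumes the commonly used definition I² = (Q − df)/Q × 100%, truncated at zero, where df is the number of studies minus one; the function name and the Q value are illustrative and are not taken from this study.

```python
def i_squared(q: float, df: int) -> float:
    """Return I^2 as a percentage, assuming I^2 = (Q - df) / Q * 100.

    Negative values are truncated to 0, which corresponds to
    'no observed heterogeneity'.
    """
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0


# Hypothetical Q for a meta-analysis of 21 studies (df = 20).
# This Q is an illustration only, not a value reported in the paper.
print(round(i_squared(30.5, 20), 1))  # prints 34.4
```

When Q does not exceed its degrees of freedom, the function returns 0, meaning that the observed variation is no larger than would be expected from sampling error alone.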
Subgroup analysis. The differences between subgroups were calculated according to the MFI, the year of publication and the type of MRI sequence [conventional sequences (CSs) and CSs with proton density weighted imaging (PDWI)]. The results are listed in Table 2 and include the pooled SE, SP, LR+, LR−, DOR and AUC values.
Publication bias in the literature evaluation. The Deeks' funnel plot for DOR was essentially symmetrical (Fig. 6). Accordingly, the asymmetry test showed no significant risk of publication bias (p = 0.75).

Discussion
Overall, ACL injury is a common clinical form of knee damage. Timely and accurate diagnosis and treatment could prevent the emergence of cartilage degeneration, the progression of bone contusion, the aggravation of traumatic arthritis or the occurrence of knee joint dysfunction 34 .
Magnetic resonance imaging is a noninvasive technique that remains a physician's first choice for the clinical diagnosis of ACL injury. It offers good soft tissue contrast and high spatial resolution, and allows multi-parameter evaluation of morphological changes in an injured ACL. However, it is likely that overuse of the MRI technique in the diagnosis of ACL injury leads to misdiagnosis (estimated at 47%), especially for chronic incomplete tears; this might be due to the particular sensitivity of MRI to the hydrogen atom and could be associated with volume effects and synovial hyperplasia 18 . Additionally, different studies have reported different values for sensitivity and specificity, ranging from 63.6% 14 to 100% 9,19,29 and from 68.4% 26 to 100% 16,19,28 , respectively, owing to the slightly oblique course of the ACL across the knee joint and to the difficulty of displaying the full ACL in the true sagittal plane with a single MRI scan 22 . Meanwhile, the accuracy of MRI diagnosis depends on the scanning technique and the experience of the musculoskeletal radiologist 30 . Thus, the precise diagnostic accuracy of MRI for ACL injury is unknown, and high-level evidence-based research on the accuracy of MRI diagnosis for ACL injury is needed.
Our meta-analysis focused on the diagnostic accuracy of MRI for ACL injury compared with arthroscopy. The pooled SE and SP were 87% (95% CI, 84-90%) and 90% (95% CI, 88-92%), respectively, indicating that the rates of missed diagnosis and misdiagnosis reach 13% and 10%, respectively. Furthermore, a good diagnostic test is generally expected to have an LR+ greater than 10 and an LR− less than 0.1. Our study revealed that the pooled LR+ reaches 6.78 (95% CI, 4.87-9.44), which means that an ACL injury is likely, but not certain, in suspected cases when the MRI result is positive. Moreover, the pooled LR− had a value of 0.16 (95% CI, 0.13-0.20). In other words, there is a real possibility of excluding an ACL injury in suspected injured patients when the MRI result is negative. In addition, DOR represents a summary measure of the power of the test: the higher this measure, the better the performance of the inspection method 35 . The pooled DOR was 44.70 (95% CI, 32.34-61.79) in the present study, which indicates that the odds of obtaining a positive result using MRI are 44.7 times higher for an ACL injury than for an intact knee. In addition, the area under the curve (AUC) was 0.93, which indicates that MRI examination has a high diagnostic accuracy. Low, medium and appreciable accuracies of diagnosis are considered for AUC values ranging from 0.5 to 0.7, 0.7 to 0.9 and ≥0.9, respectively. The maximum AUC value of 1 indicates a diagnostic test that discriminates perfectly. In contrast, an AUC value < 0.5 indicates a poor performance of the diagnostic test 13 .
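The clinical meaning of the pooled likelihood ratios can be illustrated by converting a pre-test probability into a post-test probability through odds (post-test odds = pre-test odds × likelihood ratio). The sketch below is an illustration only: the pre-test probability of 50% is a hypothetical value that is not reported in this study, and the function name is our own.

```python
def post_test_probability(pre_test_prob: float, likelihood_ratio: float) -> float:
    """Convert a pre-test probability into a post-test probability.

    Uses: post-test odds = pre-test odds * likelihood ratio.
    """
    pre_odds = pre_test_prob / (1.0 - pre_test_prob)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)


# Hypothetical pre-test probability of ACL injury (an assumption, not study data).
PRE_TEST = 0.50

# Pooled likelihood ratios reported in this meta-analysis.
print(round(post_test_probability(PRE_TEST, 6.78), 3))  # positive MRI: prints 0.871
print(round(post_test_probability(PRE_TEST, 0.16), 3))  # negative MRI: prints 0.138
```

Under this assumed pre-test probability, a positive MRI would raise the probability of injury to about 87% and a negative MRI would lower it to about 14%; other pre-test probabilities would give different figures.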
The MFI of MRI is one of the most important factors affecting accuracy of diagnosis. Smith et al. (2016) found no evidence that 3T scanners had superior diagnostic efficacy for ACL injury when compared with 1.5T machines 11 . Similarly, Phelan et al. (2015) and Smith et al. (2012) reported that magnetic field strength had no significant effect on accuracy 12,36 . Our results indicate that there are no significant differences in SE, SP, LR+, LR− and DOR between MFI greater than or equal to 1.5T and MFI below 1.5T (p = 0.85, p = 0.76, p = 0.84, p = 0.75, p = 0.84, respectively), which is consistent with and corroborates the results of previous studies.
Another important factor that affects the diagnostic accuracy is the MRI sequence. Oei et al. (2003) reported that improving the MRI sequence could improve diagnostic accuracy 37 . However, no study has yet compared the accuracy of diagnosis between different MRI sequences. Our meta-analysis provides evidence that there are no
In previous reviews, the impact of the study's year of publication was found to be variable. Oei et al. (2003) reported that recent studies had better diagnostic accuracy than older studies 37 , which is likely due to improvements in imaging technology, such as the use of specific knee coils and improved sequences, and to increasing radiologist familiarity with MRI over time. In contrast, Crawford et al. (2007) found a negative trend in diagnostic accuracy in more recent studies 38 , which may be due to differences in the prevalence of ACL tears in the selected studies. They also reported that older studies had better methodological quality than recent studies. Therefore, they included all studies regardless of the year of publication. Our meta-analysis found that SP was significantly different between studies published during the periods 2006-2009 and 2012-2016 (SP = 0.93 vs. SP = 0.89, respectively; p = 0.04).
Through a detailed reading of the literature included in the meta-analysis, we found that this was due to SP values of three articles that reached 100% 16,19,28 , which may be related to
Our meta-analysis has not only updated, verified, supplemented and improved previous studies, but has also provided an objective and systematic evaluation of the value of MRI diagnosis for ACL injury, including its diagnostic accuracy and methodology. Additionally, our research suggests new directions for future diagnostic studies. Firstly, future studies should attempt, when possible, to follow the standards for reporting of diagnostic accuracy (STARD) in their diagnostic tests, and try to evaluate in detail the authenticity, reliability and clinical importance of their diagnostic tests, in order to make their results more accurate, complete and conclusive 39 . Secondly, the diagnostic and control tests should be performed as soon as possible during the study process, and acquisition conditions should be clearly defined. Thirdly, the assessment of the test results should be double-blinded. Finally, by comparing different MFIs and the different sequences used for ACL injury, we provided reference and guidance for clinicians who choose MRI for patients with ACL damage.
Even though this meta-analysis showed optimistic results for the accuracy of MRI in diagnosing ACL injury, the outcomes should be viewed cautiously because of several limitations. Firstly, the selected studies varied greatly in sample size, continuity of enrolled patients and patient race, in addition to scanning conditions. Moreover, the MFI parameter, the method used to blind participants and assessors, and the familiarity of the radiologist were not mentioned in several of the included studies. Secondly, our method cannot identify an accurate cut-off point on the sROC curve, which is in agreement with other meta-analyses of diagnostic accuracy. The reason is that there is no precisely measured value for the MR image and a threshold is not used in clinical examination 13 .
In conclusion, the current evidence from our meta-analysis indicates that MRI examination is able to provide appreciable diagnostic performance in terms of DOR and AUC in the detection of ACL injury, with high SE and SP (greater than 85%). Yet, there is not enough evidence to show that a higher MFI results in better diagnostic accuracy when MFI greater than or equal to 1.5T was compared with MFI below 1.5T. In addition, CSs + PDWI sequences are only slightly better than CSs alone, and the difference is not statistically significant.

Materials and Methods
Inclusion and exclusion criteria. The inclusion and exclusion criteria were formulated based on the PICOS principles (participants, intervention, comparison, outcome and study design) of the preferred reporting items for systematic reviews and meta-analyses (PRISMA) 40 . Studies relevant to the diagnosis of ACL injury by MRI and arthroscopy were included. The inclusion criteria comprised the following conditions.
Participants and intervention measures. Patients suspected of having ACL injury/tear, examined by MRI and arthroscopy. Patients' age, gender or race did not limit inclusion.

Comparison. MRI versus arthroscopy.
Outcomes. We obtained the pooled SE, SP, LR+, LR−, DOR and the sROC curve by extracting (directly or indirectly) the raw data (TP, FP, TN and FN results).
Exclusion criteria. Studies were excluded if they met one of the following conditions: (1) the type of article was a review, an abstract or a conference paper; (2) the study was performed on animals or cadavers; (3) the sample size of the study was less than 25 cases; (4) the raw data were not complete, thus preventing the calculation of TP, FP, FN or TN; (5) the patients were not examined by both MRI and arthroscopy; (6) clinical data were insufficient; (7) repeated reports came from the same team or the same set of data.
Search strategy. Computer retrieval of English-language studies from the PubMed, EMBASE and Ovid databases, in addition to BIOSIS Previews and the Cochrane library, was performed for the period from January 1, 2006 to May 31, 2016. In addition, a manual retrieval was performed based on references, journals, ResearchGate and the national library reference service platform, or by sending e-mails to authors. We used the following MeSH headings and keywords: magnetic resonance imaging AND anterior cruciate ligament AND arthroscopy.
Screening and literature selection. The screening of the original literature strictly followed the inclusion and exclusion criteria. There were four steps in the selection process. Firstly, the two researchers eliminated duplicated reports coming from the same team or the same set of data. Secondly, the two researchers selected the papers by reading titles and abstracts according to the inclusion and exclusion criteria. Thirdly, by evaluating the full text, the two researchers screened the potentially eligible studies conforming to the inclusion and exclusion criteria. Fourthly, by re-assessing the full text, the two researchers chose the studies for which TP, FP, TN and FN could be extracted or calculated. The two researchers completed the screening process independently. When their opinions differed, they discussed the results until they reached a consensus.
Data extraction. The two researchers designed a standardized data extraction form, extracted data independently and cross-checked each other's data. Disagreements relating to values or assessment were resolved by discussion. Extracted variables included: the author, the year of publication, the country where the study had been performed, the study design, MFI, the number of samples, the demographic characteristics, the blinding process and TP, FP, TN, FN, SE and SP values.
Quality evaluation. The methodological quality of eligible studies was graded by two researchers independently, according to the QUADAS-2 tool (Agency for Healthcare Research and Quality, Cochrane Collaboration, and the U.K. National Institute for Health and Care Excellence) 41 , which is recommended for use in systematic reviews of diagnostic accuracy based on sources of bias and variation. Use of the QUADAS-2 tool involves the following four phases: (1) summarize the evaluation question; (2) tailor the tool and produce evaluation guidance; (3) construct a flow diagram for the original study; and (4) judge bias and applicability. The QUADAS-2 tool provides clear grades of bias and applicability for primary diagnostic accuracy studies. It comprises four key domains: (1) patient selection; (2) index test; (3) reference standard; and (4) flow and timing. Each domain contains several signalling questions used to help judge the risk of bias (low, high or unclear) 41 . The two researchers completed the assessment independently. Disagreement in the process of answering questions was discussed until consensus was reached. A final decision of "yes (satisfactorily elaborated)", "no (unsatisfactorily elaborated)" or "unclear (data are insufficient making a judgment difficult)" was made by the researchers after systematic discussion. If the answers to all the signalling questions were "yes", a low risk of bias was attributed to the study; if the answers included one or more "no" or "unclear" values, an unclear risk of bias was attributed; if the answers contained at least one "no" but no "yes" answers, a high risk of bias was attributed. The QUADAS-2 tabular and graphical displays can be retrieved from the Web page, http://www.bris.ac.uk/quadas/quadas-2.
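The decision rule for grading a set of signalling-question answers can be written out explicitly. The following is a minimal sketch rather than the procedure actually coded by the researchers: the function name is hypothetical, and because the "unclear" and "high" rules overlap as worded, the sketch assumes that the "high" rule (at least one "no" and no "yes") takes precedence over the "unclear" rule.

```python
def risk_of_bias(answers):
    """Grade a list of signalling-question answers as 'low', 'unclear' or 'high'.

    Assumed precedence (an interpretation of the rule in the text):
      1. all answers "yes"                  -> low
      2. at least one "no" and no "yes"     -> high
      3. any other mix with "no"/"unclear"  -> unclear
    """
    normalized = [a.strip().lower() for a in answers]
    if not normalized:
        raise ValueError("at least one answer is required")
    invalid = set(normalized) - {"yes", "no", "unclear"}
    if invalid:
        raise ValueError(f"unexpected answers: {sorted(invalid)}")

    if all(a == "yes" for a in normalized):
        return "low"
    if "no" in normalized and "yes" not in normalized:
        return "high"
    return "unclear"


print(risk_of_bias(["yes", "yes", "yes"]))     # prints low
print(risk_of_bias(["yes", "unclear", "no"]))  # prints unclear
print(risk_of_bias(["no", "unclear"]))         # prints high
```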
Statistical analysis. Meta-Disc 1.4 for Windows (XI Cochrane Colloquium, Barcelona, Spain) statistical software was used for the heterogeneity test, the pooling of outcomes and the subgroup analysis 42 . Stata 14.0 (Stata Corp., College Station, TX, USA) was used to assess publication bias. Two-sided statistical tests were used and statistical significance was set at p < 0.05.
Heterogeneity is usually caused by threshold and non-threshold effects. If the threshold effect exists, the pairs of accuracy estimates (SE and SP, or LR+ and LR−) are negatively correlated (equivalently, SE is positively correlated with 1 − SP); the distribution of accuracy estimates of each independent study shows a typical "shoulder arm" shape in the sROC plane; or the Spearman correlation coefficient reflects a significant relationship between the logit of sensitivity and the logit of 1-specificity according to the p and r values. Besides the threshold effect, non-threshold effects also cause heterogeneity; these include population (such as disease severity and complications), test conditions (such as different technologies, laboratory tests and operators), standard tests and so on. They can be detected through Chi-square and Cochran-Q statistical tests. Non-threshold effects are considered present when p < 0.05 43 .
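The Spearman check for a threshold effect can be sketched as follows. This is an illustrative pure-Python version, not the Meta-Disc implementation: the helper names and the four study values are hypothetical, the p value is not computed, and studies with SE or SP of exactly 0 or 1 would need a continuity correction before the logit is taken.

```python
import math


def logit(p):
    """Log-odds of a proportion strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def threshold_effect_r(sensitivities, specificities):
    """Spearman r between logit(SE) and logit(1 - SP) across studies."""
    logit_se = [logit(se) for se in sensitivities]
    logit_fpr = [logit(1.0 - sp) for sp in specificities]
    return pearson(average_ranks(logit_se), average_ranks(logit_fpr))


# Hypothetical per-study estimates (not data from the included studies):
# SE rises as SP falls, the pattern expected under a threshold effect.
se = [0.80, 0.85, 0.90, 0.95]
sp = [0.95, 0.90, 0.85, 0.80]
print(round(threshold_effect_r(se, sp), 3))  # prints 1.0
```

A strongly positive r, as in this constructed example, would suggest a threshold effect; a value close to zero points away from one, provided the accompanying p value is also non-significant.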
A fixed effects model was used when there was no heterogeneity among individual studies (p > 0.05 and I² < 50%). This model of the combined effect assumes that all the variation among the eligible studies is caused by chance. In other words, the model assumes that the measurements over all effects are from the same population. Otherwise, a meta-regression analysis can be used to explore the potential factors of heterogeneity (such as the participants, the test, the standard test, the methodological characteristics, etc.). When persistent heterogeneity among eligible studies exists (p < 0.05 and I² ≥ 50%), a random effects model can be used to account for both the sampling error (variance) and the between-study variance 44 , and to estimate the uncertainty of the results by 95% CI, because of the clinical importance of some indices. This model gives a wider CI than the fixed effects model when the heterogeneity is caused by other potential factors.
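The model-selection rule can be expressed as a small helper. This is a sketch under one stated assumption: the text does not say what to do in mixed cases (for example p > 0.05 with I² ≥ 50%), so the hypothetical function below falls back to the random effects model whenever the fixed effects conditions are not both met.

```python
def choose_model(p_value, i2_percent, alpha=0.05, i2_cutoff=50.0):
    """Return 'fixed' when p > alpha and I^2 < cutoff, otherwise 'random'.

    Falling back to 'random' in mixed cases is an assumption of this sketch.
    """
    if p_value > alpha and i2_percent < i2_cutoff:
        return "fixed"
    return "random"


# Heterogeneity results reported in this meta-analysis.
print(choose_model(0.12, 27.2))   # pooled LR-: prints fixed
print(choose_model(0.06, 34.5))   # pooled DOR: prints fixed
# Pooled SP was reported as p < 0.001; 0.001 is used here as an upper bound.
print(choose_model(0.001, 72.7))  # pooled SP: prints random
```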
A fixed effects model with the Mantel-Haenszel method or a random effects model with the DerSimonian-Laird method was applied, depending on the level of heterogeneity among the eligible studies, to calculate the pooled SE, SP, LR+, LR− and DOR with 95% CI, and the results were presented in forest plots. The sROC curve with 95% CI was established by combining data, which could evaluate the potential association between SE and SP in a metamorphic approach. A value of ½ was added to all cells of a study's 2 × 2 table when any cell contained a zero value.
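For a single study, the accuracy measures and the ½ continuity correction can be computed from the 2 × 2 table as sketched below. The function name and the counts are hypothetical, and the sketch covers only per-study estimates; the Mantel-Haenszel and DerSimonian-Laird pooling steps performed in Meta-Disc are not reproduced here.

```python
def accuracy_measures(tp, fp, fn, tn):
    """Compute SE, SP, LR+, LR- and DOR from one study's 2 x 2 table.

    If any cell is zero, 0.5 is added to all four cells first.
    """
    cells = [float(tp), float(fp), float(fn), float(tn)]
    if any(c == 0 for c in cells):
        cells = [c + 0.5 for c in cells]
    tp, fp, fn, tn = cells

    se = tp / (tp + fn)        # sensitivity
    sp = tn / (tn + fp)        # specificity
    return {
        "SE": se,
        "SP": sp,
        "LR+": se / (1.0 - sp),
        "LR-": (1.0 - se) / sp,
        "DOR": (tp * tn) / (fp * fn),
    }


# Hypothetical counts, not taken from any included study.
result = accuracy_measures(tp=45, fp=5, fn=5, tn=45)
print({k: round(v, 2) for k, v in result.items()})

# A table with a zero cell triggers the continuity correction.
corrected = accuracy_measures(tp=20, fp=0, fn=4, tn=30)
print(round(corrected["SP"], 3))
```

Without the correction, the second table would give a specificity of exactly 1 and an undefined LR+ and DOR, which is why the ½ adjustment is applied before the measures are calculated.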
A subgroup analysis was subsequently performed in more homogeneous groups according to MFI (≥1.5T versus <1.5T), year of publication (2006 to 2009 versus 2012 to 2016) and MRI sequences (CSs versus CSs + PDWI), each of which comprised more than 3 studies. Differences between subgroups were assessed with the t test or the rank sum test 45 .
A Deeks' funnel plot asymmetry test was used with a significance level set at p < 0.05 to detect the existence of publication bias 46 , which is of great concern for meta-analysis of diagnostic studies.
Data Availability. The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.