Prevalence of Inconsistency and Its Association With Network Structural Characteristics in 201 Published Networks of Interventions

Background: Network meta-analysis (NMA) has attracted growing interest in evidence-based medicine. Consistency between different sources of evidence is fundamental to the reliability of the NMA results. The purpose of the present study was to estimate the prevalence of inconsistency and describe its association with different NMA characteristics. Methods: We updated our collection of NMAs with articles published up to July 2018. We included networks with randomised clinical trials, at least four treatment nodes, at least one closed loop, a dichotomous primary outcome, and available arm-level data. We assessed consistency using the design-by-treatment interaction (DBT) model. We estimated the prevalence of inconsistency and its association with different network characteristics (e.g., number of studies, treatments, treatment comparisons, loops), and evaluated heterogeneity in NMA and DBT models. Results: We included 201 published NMAs. The p-value of the DBT model was lower than 0.05 in 14% of the networks and lower than 0.10 in 20% of the networks. Networks comparing few interventions in many studies were more likely to have small DBT p-values (less than 0.10), probably because they yielded more precise estimates and had higher power to detect differences between designs. In the presence of inconsistency (DBT p-value lower than 0.10), the consistency model displayed higher heterogeneity than the DBT model. Conclusions: Our findings show that inconsistency was more frequent than what would be expected by chance, suggesting that researchers should devote more resources to exploring how to mitigate inconsistency. Because of the relatively high prevalence of inconsistency in published networks, the results of this study highlight the need to develop strategies to detect inconsistency, particularly in cases where the existing tests have low power.


Background
Network meta-analysis (NMA) is a useful approach for exploring effects of multiple interventions by simultaneously synthesizing direct and indirect evidence, and in recent years the number of published NMAs has grown continually [1,2]. The reliability of inferences from NMA depends on the comparability of studies evaluating the multiple interventions of interest [3][4][5]. NMA results are valid only when the transitivity assumption holds, i.e., the distribution of effect modifiers is similar across intervention comparisons. Lack of transitivity can create statistical disagreement between the information coming from direct and indirect sources of evidence, that is, inconsistency. The notion of inconsistency also refers to the disagreement between evidence coming from different designs (comparisons of different sets of interventions across studies). Several statistical methods exist to assess inconsistency in the network as a whole or locally on specific comparisons or loops of evidence (i.e., paths in the network of interventions that start and end at the same node) [6][7][8][9][10][11][12]. To date, the design-by-treatment interaction (DBT) model is the only method that both provides a global assessment of inconsistency for a network and is insensitive to the parameterization of studies with multiple arms [6,8].
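To make the notion of disagreement between direct and indirect evidence concrete, the following minimal sketch compares a direct estimate with an indirect one formed through a common comparator in a single loop A-B-C. It illustrates the local, loop-based view of inconsistency rather than the DBT model, and all numerical values (log odds ratios and variances) as well as the function names are hypothetical.

```python
import math


def indirect_estimate(d_ab, var_ab, d_ac, var_ac):
    """Indirect log odds ratio of C vs B through the common comparator A."""
    return d_ac - d_ab, var_ab + var_ac


def inconsistency_factor(direct, var_direct, indirect, var_indirect):
    """Difference between direct and indirect estimates with a two-sided z-test."""
    diff = direct - indirect
    se = math.sqrt(var_direct + var_indirect)
    z = diff / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return diff, se, p_value


# Hypothetical summary log odds ratios and their variances (not from the paper)
d_ab, var_ab = 0.20, 0.04  # B vs A
d_ac, var_ac = 0.50, 0.05  # C vs A
d_bc, var_bc = 0.90, 0.07  # C vs B, direct evidence

ind, var_ind = indirect_estimate(d_ab, var_ab, d_ac, var_ac)
diff, se, p_value = inconsistency_factor(d_bc, var_bc, ind, var_ind)
print(f"indirect = {ind:.2f}, difference = {diff:.2f}, SE = {se:.2f}, p = {p_value:.3f}")
```

Larger variances (for example, due to heterogeneity) widen the standard error of the difference, which is one way to see why such comparisons can have low power.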
The majority of NMAs published in the medical literature in recent years examine whether the prerequisite assumptions in NMA are met [1,2,13,14]. However, a number of reviews still combine direct and indirect evidence in a network of interventions without evaluating the condition of consistency or despite evidence of inconsistency [15,16]. Empirical findings for dichotomous outcomes suggest that inconsistency is present in one in ten loops of evidence when a network is evaluated in multiple parts separately and in one in eight networks when a network is evaluated as a whole [10]. It is encouraging though that NMAs increasingly discuss transitivity and/or inconsistency (0% in 2005 vs 86% in 2015), and use appropriate methods to test for inconsistency (14% in 2006 vs 74% in 2015) [2]. Another important consideration when conducting an NMA is that there is an inverse association between heterogeneity and statistical power to detect inconsistency [3]. The larger the heterogeneity, the less precise the direct and indirect estimates are, and hence statistical inconsistency may not be evident even when it is present. Empirical evidence using 40 networks of interventions suggested that increased heterogeneity was associated with low detection rates of inconsistency, and that the consistency model displayed higher heterogeneity than the inconsistency model [10]. The same study also showed that the choice of the heterogeneity estimator can influence inferences about inconsistency, particularly when few studies are available. In general, inconsistency tests have low power to detect inconsistency, and power may vary depending on the network's characteristics (e.g., number of patients and studies) [17].
The purpose of the present study was to estimate the percentage of NMAs for which strong evidence against the hypothesis of consistency is evident. We updated our previous empirical evaluation using a larger sample of published NMAs with dichotomous outcome data [10]. We also aimed to describe the association between evidence against consistency and the NMA structural characteristics, such as number of studies, interventions, and loops. Finally, we aimed to evaluate heterogeneity in consistency and inconsistency models.

Eligibility criteria for network database
The collection of published NMAs used in this paper has been described elsewhere [1,2]. We included networks published in two distinct periods: 1) up to December 2015 (including NMAs identified from our previous search up to April 2015 [18], and from our updated search up to December 2015), and 2) between 2017 and 2018 (from our updated search up to July 2018 [19]). NMAs were eligible if they included only randomised clinical trials, had at least four intervention nodes (including placebo) in the network, had conducted any form of valid indirect comparison or NMA, included at least one loop, and had a dichotomous primary outcome with available arm-level data.

Synthesis
We performed a descriptive analysis of the eligible networks regarding the following characteristics: number of included studies, interventions, comparisons with direct evidence, presence of at least one intervention comparison informed by a single study, multi-arm studies, loops, number of unique designs, presence of complex interventions (as defined by Welton et al. [20]), type of outcome, and type of intervention comparisons [2,21].
We assessed consistency in each network using the DBT model, which evaluates the network as a whole and encompasses the potential conflict between studies including different sets of interventions, named 'designs' [6,8]. In this model we synthesised evidence in a way that reflects the extra variability due to inconsistency (i.e., beyond what is expected by heterogeneity or random error) [22]. We assessed evidence against the hypothesis of consistency based on the p-value of the DBT test (see Appendix 1). Since the tests of inconsistency are known to have low power [17,23], and considering that empirical evidence showed that 10% of loops are inconsistent [10], we used a cut-off p-value of 0.10 in addition to the commonly used cut-off of 0.05.
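For orientation, the global test in a design-by-treatment interaction model is commonly carried out as a Wald test on the inconsistency parameters. The sketch below assumes that standard form; the symbols $\hat{\omega}$ (vector of estimated inconsistency parameters), $\hat{V}$ (their estimated covariance matrix) and $q$ (number of inconsistency parameters) are notation introduced here for illustration, and Appendix 1 remains the reference for the specification actually used.

```latex
H_0:\ \omega = 0, \qquad
W = \hat{\omega}^{\top} \hat{V}^{-1} \hat{\omega} \ \sim\ \chi^2_{q} \ \text{under } H_0, \qquad
p = \Pr\left(\chi^2_{q} \ge W\right)
```

Under this form, a network with more inconsistency parameters spreads the same signal over more degrees of freedom, which is consistent with the low power noted above.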
We estimated the prevalence of NMAs for which evidence or strong evidence against the hypothesis of consistency was evident (at both 0.05 and 0.10 thresholds) and explored its association with network structural characteristics. We present scatterplots and box plots for the aforementioned descriptive characteristics against the p-value of the DBT test. We visually assessed if inconsistency rate changed per year of study publication in a stacked bar plot. To explore the association between the DBT p-value and prevalence of inconsistency with estimation of heterogeneity in consistency and inconsistency models, we plotted the estimated between-study standard deviation values under the consistency and inconsistency models. We used a different colour scheme for each network to indicate strong evidence against the consistency hypothesis at 0.05 and 0.10 thresholds.
Among the several approaches that have been suggested to estimate the between-study variance, we selected the popular DerSimonian and Laird (DL) method and the restricted maximum likelihood (REML) method, which has been shown to be a better alternative [24,25]. For completeness, we investigated the impact of both ways to estimate the between-study variance on the consistency evaluation. We distinguish heterogeneity in the consistency and inconsistency models as: a) representing within- and between-design heterogeneity in the consistency model, and b) representing within-design heterogeneity only in the inconsistency model. We fitted both consistency and inconsistency models in Stata and R using the network [26] suite of commands and the netmeta [27] package, respectively. We used both Stata and R, since at the time of conducting the analyses the REML estimator for heterogeneity was available only in the network [26] command in Stata, and the DL estimator in the netmeta [27] R package. Currently, both REML and DL estimators for heterogeneity are available in the netmeta [27] package. We also calculated the I-squared statistic for each network using the netmeta package in R and assessed its association with the DBT p-values in a scatterplot.
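As a reminder of what the DL estimator and the I-squared statistic compute, the following minimal sketch implements the method-of-moments formulas for a single pairwise meta-analysis. This is a simplification for illustration only: the analyses described here used the network-level implementations in the network and netmeta software, REML is iterative and is not shown, and the effect sizes and variances below are hypothetical.

```python
def dersimonian_laird(effects, variances):
    """Method-of-moments between-study variance (tau^2), I^2 and Cochran's Q."""
    k = len(effects)
    weights = [1.0 / v for v in variances]  # inverse-variance (fixed-effect) weights
    sum_w = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum_w
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = k - 1
    scale = sum_w - sum(w ** 2 for w in weights) / sum_w
    tau2 = max(0.0, (q - df) / scale)  # truncated at zero
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return tau2, i2, q


# Hypothetical study-level log odds ratios and within-study variances
effects = [0.10, 0.30, 0.60, 0.90]
variances = [0.04, 0.04, 0.04, 0.04]

tau2, i2, q = dersimonian_laird(effects, variances)
print(f"Q = {q:.4f}, tau^2 = {tau2:.4f}, I^2 = {100 * i2:.1f}%")
```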

Description of the network database
From the 456 total NMAs identified from our previous search [2], we located 105 NMAs satisfying the eligibility criteria. Using the same process for the years 2015 (April 2015 to December 2015), 2017 and 2018 we also included another 96 NMAs. Overall, we included 201 NMAs that fulfilled the eligibility criteria (Appendix Fig. 1).
The median number of studies per network was 20 (IQR 13, 35), and the median number of interventions per network was seven (IQR 5, 9). Multi-arm trials were included in 142 networks (70%), with a median number of one multi-arm study (IQR 0, 3). The median number of direct intervention comparisons in the included networks was 25 (IQR 16, 42), whereas the median number of unique designs was 13 (IQR 8, 23). Most networks included at least one comparison (186 networks, 92.5%) informed by a single study. The median number of loops across networks was three (IQR 2, 7), and the median number of inconsistency parameters per network was four (IQR 2, 7) (see also Appendix 1). The median I-squared statistic was 30% (IQR 0%, 59%).

Prevalence of inconsistency
At the 0.05 threshold, strong evidence against the consistency hypothesis was detected in 28 (14%) networks. At the 0.10 level, strong evidence against the consistency hypothesis was detected in 39 (20%) networks of the 201 total networks. Changing from REML to DL estimators for heterogeneity had only a minor impact on the prevalence of inconsistency (see Appendix Table 1) [24]. Most DBT p-values were considerably higher than 0.10, irrespective of heterogeneity estimator. No change in the prevalence of inconsistency was detected across years (Appendix Fig. 2).
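To gauge how these counts compare with what chance alone would produce, one can compute the probability of observing at least this many significant tests among 201 networks if every network were truly consistent. The sketch below is an illustration under simplifying assumptions that are not part of the original analysis: the networks are treated as independent and the DBT test is assumed to have exactly its nominal type I error.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))


n_networks = 201
results = {}
# (threshold, number of networks with a DBT p-value below the threshold)
for alpha, n_flagged in ((0.05, 28), (0.10, 39)):
    observed = n_flagged / n_networks
    tail = binomial_upper_tail(n_flagged, n_networks, alpha)
    results[alpha] = (observed, tail)
    print(f"alpha = {alpha:.2f}: observed proportion = {observed:.3f}, "
          f"P(at least {n_flagged} by chance) = {tail:.2e}")
```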
In the following, results are presented at the 0.05 threshold and according to the REML estimation method for the between-study variance. Results according to the DL estimator are presented in the supplementary files.

Evidence of inconsistency across different network structural characteristics
Lower p-values in the DBT test were more likely in networks with many studies, many direct intervention comparisons, and many designs (Appendix Fig. 3). However, these associations were rather weak (see Appendix Fig. 4 for results using the DL heterogeneity estimator). This pattern was expected, as power in detecting inconsistency is higher in networks with many studies per intervention comparison. It should also be noted that networks with few studies and many interventions were associated with larger heterogeneity, which can mask detection of inconsistency (Appendix Fig. 5).
The type of outcome (p-value = 0.86), type of intervention comparisons in the network (p-value = 0.75), and the presence of complex interventions (p-value = 0.08) did not suggest important differences in the distribution of the p-values calculated in DBT. Similarly, the inclusion of at least one intervention comparison with a single study in the network did not affect the assessment of the global inconsistency in the network (p-value = 0.57) (Fig. 2 and Appendix Fig. 6).

Heterogeneity in consistency and inconsistency models
An increase in the I-squared statistic was associated with a decrease in the DBT p-value (Appendix Fig. 7). In Fig. 3, we plot the estimated between-study standard deviation values under the consistency and inconsistency models by levels of evidence against the consistency hypothesis (see also Appendix Fig. 8 for results using the DL heterogeneity estimator). Evidence against the consistency assumption was associated with heterogeneity being larger in the consistency model compared to the inconsistency model.

Discussion
Our findings suggest that evidence of inconsistency was at least twice as frequent as what would be expected by chance if all networks were truly consistent (when we would expect one in 20 networks for 0.05 level, and one in 10 networks for 0.10 level under the null hypothesis of no inconsistent networks in our sample). Overall, evidence against the hypothesis of consistency (as defined by the p-value of the DBT test) in NMAs with dichotomous data was evident in one in seven networks using the threshold of 0.05. Taking into consideration the low power of the inconsistency test, and in particular the DBT model that has more degrees of freedom in contrast to other inconsistency tests [16,17], we also used the threshold of 0.10, where inconsistency was prevalent in one in five networks. Considering that inconsistency was observed in 14% of the networks, we expect that the truly inconsistent networks range between 12% and 20%, assuming the test has a perfect type I error at 0.05 and power ranges between 50% and 80%.
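One reading that reproduces the 12% to 20% range is to treat the observed rate of significant tests as a mixture of true positives and false positives, observed = π × power + (1 − π) × α, and solve for the true prevalence π. The sketch below works through this arithmetic; the function name and the mixture formulation are assumptions made here for illustration and are not taken from the original text.

```python
def true_prevalence(observed_rate, alpha, power):
    """Solve observed = pi * power + (1 - pi) * alpha for pi."""
    if power <= alpha:
        raise ValueError("power must exceed the type I error rate")
    return (observed_rate - alpha) / (power - alpha)


observed_rate = 0.14  # proportion of networks with DBT p-value below 0.05
alpha = 0.05          # assumed exact type I error

low = true_prevalence(observed_rate, alpha, power=0.80)
high = true_prevalence(observed_rate, alpha, power=0.50)
print(f"true prevalence between {low:.0%} and {high:.0%}")
```

Lower assumed power implies a higher true prevalence for the same observed rate, which is why the upper end of the range corresponds to 50% power.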
Our study showed that structural network characteristics only weakly impact the detection of inconsistency. In particular, we found a mild association between networks including both a high number of studies and a small number of interventions or loops, and lower p-values of the DBT test. This is probably due to the increased power to detect inconsistency in these types of networks. Another key finding of our study was that an important drop in heterogeneity when moving from the consistency to the inconsistency model is associated with evidence of inconsistency. This suggests that heterogeneity estimated in the consistency model may account for discrepancies between direct and indirect evidence in the network. Results were overall consistent among DL and REML heterogeneity estimators.
To the best of our knowledge, this is the largest empirical study used to evaluate the prevalence of inconsistency in networks of trial evidence. Overall, our findings are aligned with our previous study [10], where we evaluated 40 networks of interventions. The present research study includes five times the number of networks included in our previous review, and the exploration of multiple structural network characteristics. In this study, we found slightly higher empirical rates of inconsistency compared with our previous study (i.e., 14% of 201 networks vs. 13% of the 40 networks), suggesting that researchers should devote more resources to exploring how to mitigate inconsistency.
Our study has a few limitations worth noting. First, for the empirical assessment of consistency, we evaluated articles with dichotomous outcome data restricting to the odds ratio effect measure. We expect our findings to be generalisable to other effect measures. Although our previous empirical study showed that in some cases inconsistency was reduced when moving from one effect measure to another, overall, the detected inconsistency rates were similar for different effect measures [10]. For completeness it would be interesting to carry out an empirical study for continuous outcomes to examine possible differences in inconsistency between mean differences, standardized mean differences and ratios of means. Second, in the present study we considered a common within-network heterogeneity. This is often clinically reasonable and statistically convenient. Since most direct intervention comparisons in networks comprise only few studies, sharing the same amount of heterogeneity allows such comparisons to borrow strength from the entire network. However, assuming common within-network heterogeneity, intervention comparisons with a smaller heterogeneity than that of the remaining network will be associated with a larger reported uncertainty around their summary effect, compared to what would be accurate. In such a case, the chances of detecting inconsistency decrease. Although assuming a common within-network heterogeneity can underestimate inconsistency, it better reflects how summary effects are combined in an NMA in practice. Alternatively, when heterogeneity is believed to vary across comparisons, different heterogeneity parameters can be built into the model, but need to be restricted to conform to special relationships according to the consistency assumption [28].
Third, we assessed detection of inconsistency based on a threshold of the DBT p-value, which reflects common practice, and ignored the actual differences between the different designs and the direct and indirect estimates. However, to avoid "vote-counting" of strong evidence against the consistency hypothesis we also explored the distribution of the DBT p-values according to several network structural characteristics. Fourth, we did not exclude potential outlier networks, since this was outside of the scope of the study.
In a systematic review and NMA, investigators should interpret strong evidence against the consistency hypothesis very carefully and be aware that inconsistency in a network can be absorbed into estimates of heterogeneity. Given that inconsistency is frequent in published NMAs, authors should be more careful in the interpretation of their results. Confidence in the findings from NMA should always be evaluated, using for example CINeMA [29] (confidence in network meta-analysis) or GRADE (Grading of Recommendations Assessment, Development, and Evaluation) for NMA approaches [30]. Since inconsistency tests may lack power to identify true inconsistency, we recommend against interpreting 'no evidence for inconsistency' as 'no inconsistency'. We also recommend using both a global (e.g., the DBT model) and a local approach (e.g., loop-specific approach [10] or node-splitting [7] method) for the assessment of inconsistency in a network, before concluding about the absence or presence of inconsistency. However, detection of inconsistency often prompts authors to choose only direct evidence, which is often perceived as less prone to bias, disregarding the indirect information [23]. It is advisable though, instead of selecting between the two sources of evidence, to try to understand and explore possible sources of inconsistency and refrain from publishing results based on inconsistent evidence [5,31].
NMA is increasingly conducted and although assessment of the required assumptions has improved in recent years, there is room for further improvement [1,2]. Systematic reviews and NMA protocols should present methods for the evaluation of inconsistency and define strategies to be followed when inconsistency is present. The studies involved in an NMA should also be compared with respect to the distribution of effect modifiers across intervention comparisons. Authors should follow the PRISMA (Preferred Reporting Items for Systematic Review and Meta-analysis)-NMA guidelines [32] and report their inconsistency assessment results, as well as the potential impact of inconsistency in their NMA findings.

Conclusion
This research provides empirical evidence on the prevalence of inconsistency. Our findings show that evidence of inconsistency is more frequent than would be expected by chance if all networks were consistent. This suggests that inconsistency should be appropriately explored. Detection of inconsistency was mildly sensitive to various network characteristics and their combination. In particular, networks with a high number of studies, and a small number of interventions had larger power to detect inconsistency. Also, inconsistency was likely to manifest as extra heterogeneity when the consistency model was fitted. Lower estimates of heterogeneity in the inconsistency model compared with the consistency model were associated with higher rates of detection of inconsistency. Overall, there was a good empirical agreement of inconsistency when different heterogeneity estimation methods were used.
Given that inconsistency is frequent in nature (with up to 20% of networks expected to be inconsistent), investigators should be more careful in the interpretation of their results. Our results highlight the need for a widespread use of tools that assess confidence in the NMA findings, such as CINeMA [29], that factor in inconsistency concerns in the interpretation. It is also essential to develop strategies to detect inconsistency, particularly in cases where the existing tests have low power. Additional investigation is needed to evaluate the performance of other inconsistency methods (e.g., node-splitting [7], Lu and Ades method [12]) and to understand their power under different meta-analytical scenarios. Further investment is required in developing methods to potentially deal with inconsistent networks (e.g., network meta-regression approaches, NMA methods for classes of interventions) and to educate researchers about their use.

List Of Abbreviations
DBT: Design-By-Treatment

Fig. 3 Plot of the between-study standard deviation under the consistency model against that under the inconsistency model. The black diagonal line represents equality in between-study standard deviation between consistency and inconsistency models. Red points represent the networks consistent at the threshold of 10%, blue points represent the inconsistent networks at α=5%, and green points represent networks inconsistent between the thresholds 5% and 10%. All analyses have used the REML estimator for heterogeneity. Abbreviations: DBT, design-by-treatment interaction model; REML, restricted maximum likelihood
