Measuring the statistical validity of summary meta‐analysis and meta‐regression results for use in clinical practice

An important question for clinicians appraising a meta‐analysis is: are the findings likely to be valid in their own practice—does the reported effect accurately represent the effect that would occur in their own clinical population? To this end we advance the concept of statistical validity—where the parameter being estimated equals the corresponding parameter for a new independent study. Using a simple (‘leave‐one‐out’) cross‐validation technique, we demonstrate how we may test meta‐analysis estimates for statistical validity using a new validation statistic, Vn, and derive its distribution. We compare this with the usual approach of investigating heterogeneity in meta‐analyses and demonstrate the link between statistical validity and homogeneity. Using a simulation study, the properties of Vn and the Q statistic are compared for univariate random effects meta‐analysis and a tailored meta‐regression model, where information from the setting (included as model covariates) is used to calibrate the summary estimate to the setting of application. Their properties are found to be similar when there are 50 studies or more, but for fewer studies Vn has greater power but a higher type 1 error rate than Q. The power and type 1 error rate of Vn are also shown to depend on the within‐study variance, between‐study variance, study sample size, and the number of studies in the meta‐analysis. Finally, we apply Vn to two published meta‐analyses and conclude that it usefully augments standard methods when deciding upon the likely validity of summary meta‐analysis estimates in clinical practice. © 2017 The Authors. Statistics in Medicine published by John Wiley & Sons Ltd.


Introduction
The capacity to aggregate multiple studies and provide a summary estimate for translation into practice was one of the motivations that drove the development of meta-analysis. In this regard, it has achieved undoubted success; however, the blight of heterogeneity, which so often affects meta-analyses, can potentially affect the applicability of results in individual clinical settings, such as individual practices, hospitals, regions, or even countries.
Although methods have been developed to quantify and ascertain the effects of heterogeneity, more recently, particularly in the field of predictive modeling, the focus has been on developing statistical approaches that increase the validity of meta-analysis results when applied in new populations. When evaluating diagnostic and prognostic tests, Riley and colleagues [1] examine approaches to translate test accuracy meta-analysis results to a new population, and propose cross-validation and prediction intervals to evaluate calibration performance of each approach. Debray and colleagues provide a framework for the use of individual patient data (IPD) from multiple studies in prediction modeling using logistic regression, and demonstrate that different model intercepts may be needed in new settings to ensure good predictive

Meta-analysis and mixed-models
Supposing there are multiple studies that each evaluate a particular effect of interest (e.g. an intervention effect, the sensitivity of a test, or the performance of a prognostic model). Of interest is the summary (mean) effect across studies and, for the purposes of this paper, how to apply or tailor such summary meta-analysis estimates to clinical practice. Given the potential variation in the true effects between primary studies, we develop methods from the view point of a random effects model.

(i) Meta-analysis model
For the observed mean effect, y i , in a primary study, we use the following univariate random effects model to aggregate the primary studies in the meta-analysis where μ is the mean (summary) effect across the studies and the key parameter to be estimated, δ i~N (0, τ 2 ) is the study-specific deviation from the overall mean effect with unexplained between-study variance τ 2 , and ε i~N (0, v i ) is the sampling error with variance v i assumed known for each study i. This will be used as the base model in the meta-analysis, where the key result for translation to clinical practice is the summary effect estimate,ŷ.
(ii) Tailored meta-regression model Heterogeneity across settings may be explained by study-level covariates, and such covariates may be important when applying summary meta-analysis results in clinical practice. We consider the effects of covariates on the meta-analysis by incorporating these within a meta-regression model. When there are p À 1 covariates, we have where X i is the row vector with p elements (the first element is 1) associated with study i, β is the p-vector of coefficients to be estimated (the first element being the intercept term), and δ i and ε i are defined as previously. This model allows us to obtain summary results 'tailored' for particular populations of interest, defined by their set of covariate values and estimated by X iβ Ài ð Þ whereβ Ài ð Þ denotes the estimate of β derived from (2) with the ith study omitted. Therefore, following estimation of model (2), the key result for translation to clinical practice is X iβ Ài ð Þ . The merits of using information from the setting of interest have been recently described by Willis and Hyde [5,6] particularly in the case of diagnosis. Essentially, it helps tailor the summary meta-analysis estimate to the setting for clinical application, to potentially improve its validity.
Models (1) and (2) can be estimated using standard techniques, such as methods of moments and restricted maximum likelihood (REML). All meta-analysis and meta-regression models in this article are conducted in R using the package Metafor [7].

Cross-validation approach
To evaluate whether meta-analysis results may be translated into practice requires the development of methods that allow the cross-validation of the derived summary estimates in new independent settings. In prognostic modeling, Royston and colleagues proposed what they called an 'internal-external cross-validation' procedure to establish the generality of a prognostic model developed across different studies [4]. More recently, this method has been elaborated upon by Riley [1] and Debray [2]. Essentially, the procedure bears similarity with the 'Jack-knife method' [8] by omitting each primary study, in turn, from the meta-analysis and deriving a summary estimate from the remaining studies, which is then compared with the observed estimate in the corresponding omitted study. When there are k independent studies, the procedure generates k different meta-analysis estimates, and thus k different validations. As such, the primary studies selected for the meta-analysis are themselves being used as the basis for independent validation.

Validation statistic, Vn
In the remainder of this article, we focus on developing a statistic to test the validity of summary meta-analysis estimates for clinical practice, within the context of the aforementioned cross-validation approach. Let y i be the observed mean effect estimate of interest in the ith study, andŷ Ài ð Þ be the summary meta-analysis estimate (from either model (1) or model (2)) generated from using k À 1 studies with the ith study omitted. Therefore, following the cross-validation exercise, we have a dataset containing k values of y i andŷ Ài ð Þ . In this context, we propose the validation statistic, Vn where var(y i ) is the variance of y i and varŷ Ài ð Þ is the variance ofŷ Ài ð Þ . Assuming y i to be normally distributed, the Vn statistic may be used as an overall test of the null hypothesis that μ (Ài) = μ i for all i, where μ (Ài) is the parameter (true predicted effect) that underlies thê y Ài ð Þ predicted by the meta-analysis model, and μ i is the parameter (true effect) in the omitted study i. By our definition, when the null hypothesis is true, the meta-analysis/regression estimate is a statistically valid estimate for a new setting. Thus, if we define a p value <0.05 for Vn as significant, then a p < 0.05 implies that there is sufficient evidence to conclude the meta-analysis/regression estimate is not statistically valid. In section 3.3, the distribution for Vn is derived for meta-analysis/regression models.

Distribution of Vn for meta-analysis and meta-regression summary estimates
Here, we give an outline of the derivation of the asymptotic distribution of Vn recognising that Vn is a quadratic form and using an approach described in previous studies [9,10]. A more detailed description of the derivation is given in the appendix 1.
Assuming a continuous outcome, let the ith study have observed mean effect y i and variance ¼ σ 2 i =n i where σ 2 i is the variance of the patient-level observations in each study with sample size n i . By writing the weights, , Vn may be written as If we define the k × k matrix A appropriately (see appendix 1), the term inside the squared brackets may be written as Ay, where y is the k-vector with elements (y 1 , y 2 , y 3 , … , y k ), and Vn may be written in the following matrix form The diagonal matrix w * has diagonal elements w Ã , and in general, the diagonal elements of A are all 1. The advantage of writing Vn in this form is that the null hypothesis, μ (Ài) = μ i for all i, is equivalent to Aμ = 0.
By transforming y into a vector z of standard normal variables and applying the spectral decomposition theorem, it may be shown that for eigenvalues Thus, Vn has a distribution which is a linear combination of χ 2 variables of degree 1. This is an exact distribution if the σ 2 i are all known. In practice, the σ 2 i are estimated from the sample data, so it is an asymptotic distribution. Note that Vn has the same form of distribution for both the univariate random effects meta-analysis and tailored meta-regression, but both A and B differ between the two cases (see appendix 1).

Farebrother's algorithm to implement Vn
To apply Vn, the distribution specified in (6) needs to be known. The distribution for a linear combination of chi-square variables has received considerable attention over the years. In general, there is no closed form to the distribution so that it has to be obtained numerically and a number of approaches have been described. Sattherwaite in 1946 described an approximate method based on the observation that when the non-zero eigenvalues all equal one the distribution simplifies to single chi-squared distribution with k degrees of freedom [11]. This suggests that a single distribution with an 'effective' number of degrees of freedom may provide a suitable approximation. Other approaches include inverting the characteristic function (Davies [12,13]) and applying numerical integration to a weighted sum of chi-squared variables (Fleiss [14]).
Ruben made a notable development when he demonstrated that a linear combination of chi-square variables could be written as an infinite series [15]. Importantly, he also showed that the truncation error after n terms had an upper bound which was dependent on a chi-squared distribution, the coefficients of the terms in the expansion, and n-all of which could be estimated accurately [15]. Thus, an estimate of the exact distribution may be obtained for the truncated series with n terms, such that n is set to make the truncation error arbitrarily small [9]. Ruben's method is incorporated within Farebrother's algorithm [16] and it is this algorithm we use when estimating the distribution of Vn. The version of Farebrother's algorithm applied below is from the package CompQuadForm in R [10]. The R source code used to estimate the distribution for Vn for a case example may be found in appendix 2.

Heterogeneity and statistical validity
The Vn statistic may be used as a test of the null hypothesis H 0 : μ i = μ (Ài) for all i. For the base meta- It can be seen from this that the null hypothesis is only satisfied when μ i = μ j for all i ≠ j, that is, when there is no heterogeneity. In short, we have the intuitive result that the base meta-analysis model will provide a summary estimate which is statistically valid to all clinical settings only when the studies comprising the meta-analysis are homogeneous.
Writing Vn in matrix form is useful when considering the null hypothesis. As stated earlier, the null hypothesis equates to Aμ = 0 where μ is the k-vector of parameters; in other words, the above equations are equivalent to the kernel or null space of A on μ. Using Gauss-Jordan elimination, A may be reduced to echelon form, and the above result follows readily.
Heterogeneity is likely to exist in most meta-analyses, and thus, in general, individual clinical settings will have a true effect that differs from the mean (summary) effect; thus, one would expect Vn to lead to the null hypothesis being rejected in most applications of meta-analysis.
In contrast, Vn is likely to be more useful when considering the case of tailored meta-analysis results, as derived from the tailored meta-regression model in (2) where covariates are included. Reducing A to echelon form (see Appendix 1 for definition of A), it follows that if μ (Ài) = μ i for all i then for p À 1 covariates The dimension of the kernel of A, otherwise known as the nullity, is p-this follows from the ranknullity theorem, because the rank(A) = k À p. Therefore, if μ k À p + 1 ,.., μ k are all known, then this constrains μ i to values in p-space (equation (8)). The values of the coefficients α i1 , … α ip depend on the within-study variance, the between-study variance, and the values of the covariates for each study; when the covariates are continuous, there are an infinite number of possible solutions. In summary, within the k-dimensional space of all possible values of (μ 1 , μ 2 ,…, μ k ), there is a p-dimensional sub-space where the (μ 1 , μ 2 ,…, μ k ) are statistically valid. Because in most cases, we expect to find μ i ≠ μ j , that is, the primary studies are heterogeneous, there is still the potential for the meta-regression model to provide predicted summary estimates that are statistically valid.
When the covariates are discrete, the possible values of (μ 1 , μ 2 ,…, μ k ) that are statistically valid share the same p-dimensional sub-space as continuous covariates. However, the number of possible values of (μ 1 , μ 2 ,…, μ k ) is constrained by the number of levels, contrasting a continuous covariate, which may be thought of as having an infinite number of levels. Thus, for a meta-regression model that includes only a single dichotomous covariate each of the μ i ∈ (μ 1 , μ 2 , … , μ k ) can have only one of two values for them to be statistically valid. For example, if μ 1 = 2.5 and μ 2 = 3.8, then the other μ i will be either 2.5 or 3.8. Although there is strict heterogeneity (not all the μ i are equal), the data divide into two homogenous sub-groups. Thus, similar to meta-analysis, statistically valid estimates arise when the sub-groups of studies are homogeneous. In essence, the process of adding covariates to a meta-regression model in order to 'explain' heterogeneity is a one of identifying homogenous sub-groups and this makes statistical valid estimates more likely. In the limit, when the covariates are continuous, this is equivalent to there being an infinite number of homogenous sub-groups.

Comparison of the Q statistic with Vn
In meta-analysis, Cochran's Q statistic [17] is classically used to identify heterogeneity [18]. Specifically, Q is used to test the null hypothesis of homogeneity, namely, H 0 : μ i = μ j for all i ≠ j [18] or equivalently, H 0 : τ 2 = 0 where τ 2 is the between-study variance.
For a meta-regression model, the use of the Q statistic may be extended to detecting residual heterogeneity. In this instance, Q is used to test H 0 : τ 2 r ¼ 0 where τ 2 r is the residual between-study variance and this corresponds to identifying homogenous sub-groups of studies. Thus, if a covariate has m levels, then for each level the sub-group of studies corresponding to that level satisfy μ i = μ j for all i ≠ j.
Whether the Vn statistic is applied to a meta-analysis or meta-regression model, it is a test of the null hypothesis H 0 : μ i = μ (Ài) for all i, which we defined as statistical validity. However, as we have deduced already, this is equivalent to testing H 0 : τ 2 = 0 in the case of meta-analysis and H 0 : τ 2 r ¼ 0 in the case of meta-regression. In essence, Vn and Q test the same hypotheses but from different standpoints, one from a direction of statistical validity of predictions, the other from homogeneity.
The key to our definition of statistical validity is the comparison of the parameter for the model estimate with that of an independent study. This derives from our goal of determining whether the model produces an estimate which accurately represents that seen in a new clinical population. Specifically, we would like to know how close the meta-analysis/regression estimate of the effect is to the effect seen in an independent study. By incorporating a leave-one-out cross validation approach, the Vn statistic directly captures the out of sample prediction error.
In contrast, Cochran's Q statistic measures the deviation of the studies from the overall summary estimate or regression line. As all the studies are used to derive these summary measures, the Q statistic does not directly quantify how well these summary estimates generalise to independent settings.
In the next section, the statistical properties of Vn are studied and compared with Cochran's Q statistic.

A simulation study
In order to study the properties of Vn and compare them with the Q statistic, meta-analyses and meta-regression models were simulated for different values of the following parameters: τ 2 (between-study variance), σ 2 i (variance of patient-level observations), n (sample size for each individual primary study), and k (number of primary studies in each meta-analysis/meta-regression). The following values were used: i and n were varied between meta-analyses/regressions but not between the primary studies within the meta-analysis/regression to allow us to study the contributions of these separately.
In practice, Vn and Q are calculated using the asymptotic estimatesσ 2 i andτ 2 for the parameters σ 2 i and τ 2 , respectively. To simulate these for each primary study, y i was calculated by taking the mean of the n simulated observations from y m;i e N μ i ; σ 2 i À Á for m = 1,..,n. Theσ 2 i was estimated as the variance of the observations y m , i around y i , andτ 2 was estimated using y i andσ 2 i to fit the meta-analysis/regression models via the Metafor package [7]. The models were fitted using REML.
When evaluating the type 1 errors of Vn and Q using a meta-analysis model, we simulated a single μ k~N (0, 1) and set μ 1 = μ 2 = … = μ k . The meta-regression models were simulated using a single continuous covariate for each study x i~N (0, 1). Hence, when investigating the type 1 error of the meta-regression models, two values of μ i~N (0, τ 2 ) for i = k À 1, k, were simulated, and the remaining k À 2 values of μ i were deduced as linear combinations of μ k À 1 and μ k as according to (8).
To evaluate the power of Vn and Q the true effect (parameter) for each primary study, μ i was simulated from a normal distribution according to μ i~N (0, τ 2 ) for i = 1,..,k. These were checked to ensure that Aμ ≠ 0 and rejected if otherwise.
For different combinations of (τ 2 , σ 2 i , n, k), the type I error rate and the power of Vn and Q were determined for a critical value of p = 0.05. The rate of type I error and the power were estimated based on 40 000 meta-analysis/meta-regression replications for each (τ 2 , σ 2 i , n, k) combination.

Type 1 error rates for Vn and Q
In Table I the rates of type 1 error for Vn and Q when applied to the meta-analysis model in (1) are given.
Here, there is no heterogeneity (τ 2 = 0), that is, μ i = μ j for all i ≠ j. In general, the type 1 error rate is dependent on the average sample size, n, in the individual studies and the number of studies in the meta-analysis, k. This reflects the estimatesσ 2 i andτ 2 being less prone to sampling error as n and k increase, respectively.
When there are fewer studies k < 50, and smaller sample sizes, n, the type 1 error rate of Q is closer to the threshold of 5% than Vn. The null distribution of Q applied to meta-analysis is a χ 2 distribution of degree k À 1 when σ 2 i is known exactly [19], and for n = 1000, where sampling error ofσ 2 i is minimised, the type 1 error rate for Q is consistently around the 5% mark across different numbers of studies. Table II provides the type 1 error rates for Vn and Q when applied to the meta-regression model in (2) where one continuous covariate was included. As in the meta-analysis model, when compared with Vn, Q has type 1 error rates which are closer to the 5% threshold. When n = 50 and k = 5, Vn has a type 1 error rate as high as 9.3% compared 5.3% for Q. In general, for both statistics when the sample size of the individual studies is small (n = 50), the type 1 error rate increases with k.

Power of Vn and Q
In Figure 1 the power of Vn is plotted against τ/se (where se ¼ σ i = ffiffi ffi n p ) for models (1) and (2). In both the meta-analysis and meta-regression models, the power of Vn increases with increasing number of studies for a given τ/se. Thus, heterogeneity and large meta-analyses increase the power of Vn.
Comparing the left and right panels, respectively, the power curves for Vn are shifted slightly to the right in the meta-regression model when there are 10 studies or fewer. In short, the probability of rejecting statistical validity when it is known to be false for a given τ/se is more likely when Vn is applied to a meta-analysis model than in a meta-regression model with one covariate. Probabilities are derived from simulations based on 40 000 meta-analysis replications with σ 2 i fixed at 0.1. n = individual study sample size; k = number of studies. In Figure 2, the power of Vn and Q are compared for each of the models when there are 5 and 25 studies, respectively. When there are 5 studies, Vn has greater power than Q for a given τ/se. This difference is more pronounced in the meta-regression model. When there are 25 studies in the analyses, the power curves for Vn and Q are closer, but Vn maintains greater power over Q for a wide range of τ/ se. In short, Vn is more useful if we may assume heterogeneity and that estimates are more likely to be invalid.
Also of note is how the difference in type 1 error rates between the two statistics compares with the difference in power. When a meta-analysis has 25 studies, the type 1 error rate of Vn is 0.001-0.006 higher than Q (this difference is 0.004-0.008 for meta-regression). Although power varies with τ/se, for a similar-sized meta-analysis and τ/se< 1.5 the power of Vn is 0.002-0.014 higher than Q (difference in meta-regression is 0.003-0.016). Essentially, the higher type 1 error rate of Vn compared with Q is compensated by a corresponding increase in power.

Interpretation
Central to proposing Vn is its use in testing the null hypothesis of statistical validity, but its interpretation has to involve weighing up the results in the context of other evidence. Before we interpret Vn, an important consideration is to decide whether the null hypothesis (statistical validity) is likely to be true. We know from 3.6 this is equivalent to deciding whether the studies are homogeneous and the cumulative evidence provided by such measures as the Q statistic [18], I 2 [18] and the width of prediction intervals [19] contribute to this decision. However, we should also be aware that none of these is without shortcomings when making such decisions [20,21].
Depending on whether statistical validity seems likely, we then interpret the results of Vn in terms of the type 1 error rate or power (specifically the type 2 error rate). Here, the simulation study which evaluated these for both Vn and Q may be used to inform decision-making when interpreting results.
If Vn is significant (p < 0.05), then either the meta-analysis/regression estimates are invalid or there is a type 1 error. For τ/se > 3 and a 5% level of significance, the power of Vn is above 85% when there are 5 studies and above 99% when there are 10 studies. Thus, if model estimates are statistically invalid and τ/se is large enough, Vn will nearly always be significant. A type 1 error arises when the estimates are valid but Vn is significant. When there are few studies (k = 5) and the sample size is small (n = 50), Vn can have a type 1 error rate as high as 9% at a level of significance of 5%. Because validity is dependent on homogeneity (as we showed in 3.5), we may use the Q statistic in such instances as it maintains a type 1 error rate of around 5% even when k and n are small.
If Vn is not significant but statistical invalidity seems likely, then we should consider the type 2 error rate (1-power). The power of Vn is dependent on k and τ/se, and these are required for interpretation. We may estimateτ directly from the meta-analysis model and estimate the average standard error se of studies in the meta-analysis based on that proposed by Higgins and Thompson [18], namely where w i ¼ n i =σ 2 i . Thus, any non-significant results should be interpreted by weighing up whether they are consistent with the null hypothesis being true or there being a type 2 error given our estimate,τ=se.
In order to gain some insight into the values ofτ=se that may be expected, we screened the Cochrane database of Systematic Reviews for reviews published in September 2016. Including the two cases that follow, we found 16 reviews which provided summary 2 × 2 table data and had 5 or more primary studies (see Table III). The median number of primary studies was 7 [lowest = 5, highest = 19] per review, and the medianτ=se was 1.00 [lowest = 0, highest = 3.44]. Cochrane reviews tend to be of a higher quality than other reviews so they may not be representative-this may also account for the median primary study count being only 7.
For four studies,τ = 0 and henceτ=se = 0 where both Vn and Q have low power. But in such instances, the power becomes less relevant becauseτ = 0 suggests homogeneity. This reinforces the need to decide on whether statistical validity is likely to be true before interpreting the results. Whenτ=se = 1, the power for Vn is 37.5% for k = 5 and is 52% for k = 10. This rises to 62% and 83%, respectively, whenτ=se =1.5. There were four reviews in whichτ=se, and two of these had 6 studies. The lists of reviews are available in an online appendix.
We will now apply these principles to the following case examples.

Case examples
We now use two data sets to illustrate the use of Vn in testing the statistical validity of summary metaanalysis results. In each case, we first fit the meta-analysis model then the tailored meta-regression model. Note all parameters were estimated from fitting the models using REML.

Berkey
The first dataset is from Berkey et al. [22] who reviewed the primary studies which evaluated the efficacy of the Bacillus Calmette-Guérin (BCG) vaccination in preventing tuberculosis (TB) [22]. For both the meta-analysis model and the tailored meta-regression model, Vn is calculated as follows where d log RR ð Þ Ài ð Þ is the summary estimate from the meta-analysis/regression model fitted with the ith In Table IV the individual study estimates for log(RR) i are given alongside the corresponding meta-analysis and tailored meta-regression estimates for d log RR ð Þ Ài ð Þ . This shows directly how well the meta-analysis and tailored meta-regression estimate predicts that observed in the excluded study.
From Table V, Vn (59.96; p < 0.0001) is significant, and asτ=se = 3.44, the power of Vn is likely to be close to 100% at a 5% level of significance. This strongly suggests the meta-analysis estimate is unlikely to be valid in a new setting. The other statistics in Table V also suggest that there is heterogeneity which would be consistent with the estimate being invalid.
When undertaking a meta-regression, the reported efficacy in terms of the relative risk was associated with the latitude of the setting in which the study was conducted [22]. Each fitted tailored meta-regression equation with the ith study omitted took the form of: The study estimate is the log(relative risk) for the individual study. The meta-analysis (MA) estimate for a study is that derived from aggregating the remaining studies. The tailored meta-regression (TMR) estimate for a study is derived from regressing the remaining studies with the covariate Lat but inserting the Lat value for the excluded study. For example, the MA estimate for study 1 is derived from aggregating studies 2-13. The TMR estimate for study 1 is derived from regressing studies 2-13 but inserting Lat = 19. All estimates are for the log(RR) with 95% confidence intervals in brackets. Lat = latitude.  [22]) and Case 2 (Leeflang et al. [23]). The results are given for the meta-analysis (MA) and meta-regression (MR) (with 1 covariate). k = the number of studies; *Includes 1 covariate (the latitude) # Includes 1 covariate (the logit(prevalence)); PPV-positive predictive value; RR-relative risk; 95% PI-95% prediction interval. † Prediction interval estimated for a latitude of 45°. ‡ Prediction interval estimated for a prevalence of 10%.
whereα Ài ð Þ andβ Ài ð Þ are estimated from fitting the model. It is of interest to know if the summary estimates from this tailored meta-regression are valid in particular countries. Therefore, the cross-validation approach is useful, with Vn used to test the calibration of the meta-regression predicted effects and the actual study effects.
Although including the latitude as a covariate in the model has helped explain some of the heterogeneity (both Q and I 2 have decreased), Vn (25.77; p = 0.0037) remains significant. As noted aboveτ=se = 3.44, and for a tailored meta-regression model, the power of Vn is still around 100%-this points strongly to the tailored meta-regression estimate being invalid.
For the estimate to be statistically valid, we would need to interpret the significant result for Vn in the context of the probability of a type 1 error. Notwithstanding that Table II shows for a level of significance of 0.05 the true type 1 error rate of Vn can be higher than this, a p value of 0.0037 suggests a type 1 error is still very unlikely. This supports our judgment that the tailored meta-regression estimates are likely to be statistically invalid.

Leeflang
The second dataset is derived from a Cochrane systematic review which appraised studies that had evaluated the Galactamannan assay for diagnosing invasive aspergillosis in immunocompromised patients [23]. Here, we use the dataset with the threshold for a positive test result set at an optical density index (ODI) of 0.5. As in general, it is the probability of disease/non-disease given the test result which is most useful to clinicians, the outcome of interest chosen in this example was the positive predictive value (PPV). Thus, the Vn statistic for both the meta-analysis and meta-regression model is calculated as follows  Table VI.
From Table V, for the meta-analysis model, Q (15.39; df = 6; p = 0.0175), I 2 = 59.75% and the 95% prediction interval for the summary logit (PPV) is (À1.84, +0.33) (this is equivalent to a PPV of 14-58%)-these all suggest the studies are heterogeneous. As might be expected, Vn (16.76; p = 0.0083) is also significant which leads to the conclusion that any summary estimates are unlikely to be valid. The study estimate is the logit(PPV) for the individual study. The meta-analysis (MA) estimate for a study is that derived from aggregating the remaining studies. The tailored meta-regression (TMR) estimate for a study is derived from regressing the remaining studies with the covariate lgtprev but inserting the lgtprev value for the excluded study. For example, the MA estimate for study 1 is derived from aggregating studies 2-7. The TMR estimate for study 1 is derived from regressing studies 2-7 but inserting lgtprev = À4.82. All estimates are for the logit(PPV) with 95% confidence intervals in brackets. PPV = positive predictive value; lgtprev = logit(prevalence) # Estimate includes continuity correction of 0.5.
From Bayes' theorem, the PPV is known to depend on the disease prevalence. We implemented a tailored meta-regression approach in which the prevalence of disease was assumed to be known for each primary study [5,6], to study the effects of such information on the potential validity of any estimates. Thus, for the Vn statistic, each fitted tailored meta-regression model took the form: whereα Ài ð Þ andβ Ài ð Þ are estimated from fitting the model with the ith study omitted. In contrast to the first example, Vn (6.04; p = 0.484) is non-significant. Should this be considered as consistent with the null hypothesis of statistical validity or is it a type 2 error? When there are few studies (k = 7), Vn has greater power and therefore a lower type 2 error rate than Q. From Table V,τ=se = 1.21 and inspecting the power curves in Figure 2 we see the power is around 58% (a type 2 error rate of 42%) for a level of significance of 0.05. However, in this instance, the power for Vn will be much higher because p = 0.484. (A simulation study based onτ=se = 1.21 and an average sample size per study of 128 for the 7 studies demonstrates the power to be 90.0% giving a type 2 error rate of 10%). This would suggest that estimates are likely to be valid. This is also supported by the other measures, as I 2 = 0%, and the prediction interval has narrowed to (À1.19, À0.58) equivalent to a PPV of 23-36% suggesting that the heterogeneity has been explained by the addition of the covariate to the model. Based on this evidence, it would be reasonable to implement a PPV estimate from this model in an independent setting and hence practice.
Both of these examples demonstrate why clinicians should not automatically assume summary metaanalysis results are applicable to their population and that even when a summary estimate appears valid this should be judged in the context of the properties of the statistic and other evidence used to make that decision.

Discussion
One of the issues facing medical research in general is determining how well the research results translate into practice. To truly address this, we need a separate evaluation of the research in practice settings which are independent from the research settings. The terms validity or external validation are often reserved for when the research findings may be generalised or translated into clinical practice. With regard to meta-analysis results, however, it is often impractical to conduct further independent studies to assess whether the aggregate estimates are valid. Judgments on validity are therefore generally based on an assessment of quality, in which a combination of qualitative and quantitative characteristics of the comprising studies is weighed up.
Yet the need for further studies is partly circumvented in meta-analysis by making use of the existing studies in a similar way to the Jack-knife method [7]. Such 'cross-validation' is not new and was implemented by Lachenbruch [24] and Stone [25] but has only recently gained some traction in metaanalysis. However, to date, its main use in meta-analysis has been in prediction modeling to evaluate predictive performance and for model recalibration [2][3][4].
To address this, we considered the concept of statistical validity in relation to estimates generated by meta-analysis and meta-regression models. In particular, we defined it as when the model parameter for the effect measure of interest equates to that in an independent setting. From this, and using the previously described cross-validation procedure, we derived a statistic, Vn, and its asymptotic distribution to test the viability of statistical validity.
Homogeneity plays a central role and is integral to statistical validity when evaluating meta-analysis or meta-regression models with discrete categorical covariates. As part of the derivation of Vn, it was demonstrated that in meta-analysis statistical validity follows only when the studies are homogenous. Furthermore, when meta-analysis is extended to include discrete covariates in a tailored meta-regression model, statistical valid estimates arise when the individual sub-groups of studies are homogeneous, equivalent to the covariates being used to 'explain' the heterogeneity.
However, the more general case is when the p covariates are continuous in which case the set of statistical valid estimates spans a p + 1 dimensional sub-space of the k-space of parameters (μ 1 , μ 2 , …, μ k ). Here the individual parameters, μ i , may all have different values (by definition heterogeneity), but the model still provide statistically valid estimates.
Owing to the link between homogeneity and statistical validity, it reinforces the need to explore meta-analyses/regressions for heterogeneity using standard methods [17,18]. As such, Cochran's Q statistic, a measure often used to identify heterogeneity, was compared with Vn. There are clear similarities in that they are both a 'weighted sum of squares' statistic. Because Vn is estimated using cross validation it directly measures the out of sample prediction error which is important to statistical validity and contrasts Q.
In terms of the type 1 error rate and power, they are similar when then are 50 or more studies. However, when there are fewer studies, Vn has greater power and Q has a lower type 1 error rate. Clearly defining the power and type 1 error rates is important as both these statistics are being used to test hypotheses. Furthermore, they provide a basis for recommendations on the use of Vn (and Q) when making decisions on the validity of meta-analysis/regression estimates.
The power of Vn not only depends on the number of studies but also on τ/se, so there is a trade-off between the level of heterogeneity and the average precision of the studies. Thus, when the metaanalysis consists predominantly of high precision studies, Vn may detect differences between the model estimate and those observed which may be too small to be clinical relevant but are, nonetheless, statistically significant. However, our overview of Cochrane reviews showed this not to be a large problem.
As in previous studies [20,26], similar shortcomings were demonstrated here with the Q statistic. This motivated the proposing of statistics, such as the I 2 statistic, that are aimed at measuring the extent of heterogeneity rather than its presence [27]. However, this too is similarly affected by study precision and the number of studies in the meta-analysis [20,27,28]. Such drawbacks should inform the interpretation of these statistics and also suggest that they should not be used in isolation when evaluating heterogeneity or statistical validity.
Like many statistics, Vn provides little information when used in isolation; its usefulness depends upon the context in which it is applied. As statistical validity is intrinsically linked to homogeneity if homogeneity seems likely then a significant Vn result should be interpreted in terms of its potential type 1 error rate. In such an instance, the Q statistic, with its lower type 1 error rates, is likely to be the more informative of the two.
However, if heterogeneity is considered to be likely at the outset (as with many meta-analyses), then Vn's greater power means that it could be more useful than the Q statistic particularly for τ/se < 1.5. In this context, reviewers may be able to use Vn as support for a recommendation which discourages the wider application of the meta-analysis results.
But how should Vn be interpreted when the inclusion of covariates in meta-regression analyses seems to 'explain' the between-study heterogeneity? One of the risks in this instance is that legitimate exploratory analyses could lead to recommendations on validity. When the objective is to determine potential sources of variation and not make assertions on the 'applicability' of results, then the Q statistic and I 2 can be used to inform such analyses. Although in any event, a non-significant Vn should be interpreted against the likelihood of a type 2 error, it should not be used to assert statistical validity in meta-regression analyses when the covariates were identified as part of an exploratory phase. The risks of such data fishing or dredging are well documented [29] but are particularly apposite here where the results could be applied in clinical practice as a result. The best way to mitigate this risk is to pre-specify the covariates that are likely to be causal before embarking upon any meta-regression analyses. In this context, the assertion of statistical validity is more likely to be justified.
As an alternative to the hypothesis testing approach of the Vn statistic, Riley et al. propose using 95% prediction intervals to quantify the potential error in the predictions of true study values from applying meta-analysis/regression models to new settings [1]. This allows the magnitude of error to be examined, which may inform clinical relevance. Indeed, the issue of clinical relevance is pertinent to any methods used to assess validity. Altman and Royston allude to this when they express the importance of differentiating between statistical and clinical validity [30]. In the latter case, biased estimates (which are statistically invalid) may be acceptable to clinicians under certain clinical conditions. However, there are potential limitations to the Riley et al. [1] approach, too. Prediction intervals are not universally accepted in the meta-analysis field, and recent work suggests frequentist equations for the prediction interval have poor coverage [31]. Furthermore, unlike the Vn statistic, the 'error' in the proposed prediction interval, as estimated using Riley's method [1], does not account for the variance of the meta-analysis model's predicted summary estimate.
In this study, we confined our investigations to using study-level data either as part of meta-analyses or as covariates included in meta-regression analyses. As IPD or patient-level data become increasingly available, there is the potential to use IPD meta-analyses to improve the summary estimates translated into practice. An important part of this process would be to evaluate their statistical validity and is worthy of future research.
In conclusion, for summary estimates from meta-analysis to be useful in practice, they need to be statistically valid. As a direct measure of statistical validity, we have proposed the Vn statistic, and it is applicable whenever the meta-analysis model (1) or the tailored meta-regression model (2) is applied to combine effect estimates from multiple studies. It does have limitations as a hypothesis test, and these should be noted. However, as statistical validity relates to identifying homogenous sub-groups of studies, these limitations may be partly circumvented by using it alongside other statistics such as the Q statistic. As such, Vn provides a useful summary of the likely statistical validity of results from meta-analysis/regression models when applied to clinical practice. the expectation, E v T i z À Á ¼ 0 because E(z i ) = 0 and the variance, var v T i z and therefore Vn has a distribution which is a linear combination of χ 2 variables of degree 1 where the coefficients are the eigenvalues of B. This is an asymptotic distribution because σ 2 i and τ 2 are estimated from the sample data. We note the property that the entries of each row of A sum to zero persists in A T w * A. Furthermore, because this matrix is symmetrical, the entries of each column also sum to zero. B = w À1/2 A T w * Aw À1/2 is symmetrical, and if we multiply each row i, by a factor (w 1 w 2 . . w i À 1 w i + 1 . . w k ) À1/2 where w j is the jth element on the diagonal of w, the column entries also sum to zero; thus, B has rank = k À 1. Because Vn is non-negative, B is positive semi-definite and has k À 1 positive eigenvalues, where the kth eigenvalue, λ k = 0. Furthermore, when the non-zero eigenvalues all equal one, Vn has a χ 2 distribution of k À 1 degrees of freedom.
(ii) Tailored meta-regression model (2) To validate the summary estimate X iβ Ài ð Þ from the meta-regression model in (2), it is again of interest to formulate Vn in matrix form. Note X i is the row vector for the ith study with 1 as the first element and each of the p À 1 covariates as the other elements. Let θ Ài ð Þ ¼ X T Ài ð Þ w Ã Ài ð Þ X Ài ð Þ À1 X T Ài ð Þ w Ã Ài ð Þ , where X (Ài) is the matrix of row vectors X j with the ith study excluded. Thus, θ (Ài) is a p × (k À 1) matrix for p À 1 covariates, andβ Ài ð Þ ¼ θ Ài ð Þ y Ài ð Þ . Suppose we partition θ (Ài) into the first i À 1 columns, append an ith column of zeros, then re-append the remaining k À i columns to produceM Ài ð Þ a p × k matrix.
We define the matrix A as having elements whereM j Ài ð Þ is the jth column ofM Ài ð Þ of length p. As a result, Vn may be written in the quadratic form where w * is defined as previously and A T w * A is symmetrical. By similar arguments to those made in Section 3.3(i), we may again define z = w 1/2 y and write B = w À1/2 A T w * Aw À1/2 in order to deduce that The matrix θ (Ài) has rank p because X (Ài) has rank p. The k-vector Ay is analogous to (I À H)y the vector of residuals (y i Àŷ) where H is the hat matrix. Ay represents the vector of predictive residuals y i Àŷ Ài ð Þ across all X i and y i . Thus, A spans the same sub-space as I À H and has rank k-p.
The rank of w * Aw À1/2 is the same as A because both w * and w À1/2 are diagonal and the transpose, A T , also has the same rank as A. Thus, B = w À1/2 A T w * Aw À1/2 has rank k-p. Again, B is positive semi-definite and has k-p positive eigenvalues, where λ k À p + 1 = … = λ k = 0 for p < k.