Mendelian randomization with fine-mapped genetic data: choosing from large numbers of correlated instrumental variables

Mendelian randomization uses genetic variants to make causal inferences about the effect of a risk factor on an outcome. With fine-mapped genetic data, there may be hundreds of genetic variants in a single gene region any of which could be used to assess this causal relationship. However, using too many genetic variants in the analysis can lead to spurious estimates and inflated Type 1 error rates. But if only a few genetic variants are used, then the majority of the data is ignored and estimates are highly sensitive to the particular choice of variants. We propose an approach based on summarized data only (genetic association and correlation estimates) that uses principal components analysis to form instruments. This approach has desirable theoretical properties: it takes the totality of data into account and does not suffer from numerical instabilities. It also has good properties in simulation studies: it is not particularly sensitive to varying the genetic variants included in the analysis or the genetic correlation matrix, and it does not have greatly inflated Type 1 error rates. Overall, the method gives estimates that are not so precise as those from variable selection approaches (such as using a conditional analysis or pruning approach to select variants), but are more robust to seemingly arbitrary choices in the variable selection step. Methods are illustrated by an example using genetic associations with testosterone for 320 genetic variants to assess the effect of sex hormone-related pathways on coronary artery disease risk, in which variable selection approaches give inconsistent inferences.


Background
In a Mendelian randomization investigation, genetic variants that are instrumental variables for a given risk factor are used to assess the causal effect of the risk factor on an outcome [Davey Smith and Ebrahim, 2003;Burgess and Thompson, 2015]. An association between such a genetic variant and the outcome is indicative of a causal effect of the risk factor on the outcome [Didelez and Sheehan, 2007;Lawlor et al., 2008]. When there are multiple uncorrelated genetic variants that are instrumental variables for the same risk factor, power to detect a causal effect is maximized by including all such genetic variants in a single analysis [Pierce et al., 2011]. However, when genetic variants are correlated, it is not clear how to choose which variants to include in the analysis to obtain the most efficient estimate possible without the analysis suffering from numerical instabilities when there are large numbers of highlycorrelated candidate variants (such as with fine-mapped genetic data).

Theoretical viewpoint
If individual-level data are available on the genetic variants (potentially correlated), risk factor, and outcome for the same participants, then the two-stage least squares (2SLS) method provides the most efficient estimate of the causal effect (amongst all instrumental variable estimators using linear combinations of the instruments and under conditional homoscedasticity -the error term in the model relating the outcome to the risk factor has constant variance conditional on the instruments) [Wooldridge, 2009].
The first stage of the 2SLS method regresses the risk factor on all the genetic variants. As the sample size increases, the coefficient of any variant that does not explain independent variation in the risk factor will tend to zero, and so its contribution to the analysis decreases to zero. This implies that an optimally-efficient Mendelian randomization analysis should include all genetic variants associated with the risk factor in a conditional analysis. The inclusion of additional variants not independently associated with the risk factor will not have a negative impact on the analysis asymptotically (as their coefficient for contribution to the analysis will tend to zero), but will not add to the precision of the causal estimate either. As an aside, fitted values from the first-stage of the 2SLS method are equivalent (up to an additive constant) to values of an allele score (also called a genetic risk score). This implies that the optimal weights in an allele score with correlated variants are the conditional (multivariable) associations of the variants with the risk factor.
Estimating a causal effect using summarized data The 2SLS estimate can also be obtained using summarized data on genetic associations with the risk factor and with the outcome from univariable regression analyses of the risk factor or outcome on each genetic variant in turn. This is important as such summarized data from large consortia are often publicly available, enabling Mendelian randomization investigations to be performed on large sample sizes without the need for costly and time-consuming data-sharing arrangements [Burgess et al., 2015b]. This estimate can also be calculated in a two-sample setting, in which genetic associations with the risk factor and with the outcome are estimated in different samples [Inoue and Solon, 2010].
If the genetic association with the risk factor for genetic variant j isβ Xj with standard error se(β Xj ), and with the outcome isβ Y j with standard error se(β Y j ), and assuming that genetic variants are uncorrelated, then the causal estimate is [Johnson, 2013]: This is referred to as the inverse-variance weighted (IVW) estimate [Burgess et al., 2013]. It is the weighted mean of the 2SLS estimates using each genetic variant The variant-specific estimates are combined using the standard formula for a fixed-effect meta-analysis [Borenstein et al., 2009]. This same estimate can be obtained by weighted regression of the genetic associations with the outcomeβ Y j on the genetic associations with the risk factorβ Xj using weights se(β Y j ) −2 and with the intercept term set to zero. The IVW estimate is equivalent to the 2SLS estimate when the genetic variants are uncorrelated [Burgess et al., 2015a]. This formula does not take into account uncertainty in the genetic associations with the risk factor; however, these associations are typically more precisely estimated than those with the outcome, and ignoring this uncertainty does not lead to inflated Type 1 error rates for the IVW estimate in realistic scenarios [Burgess et al., 2013].
When genetic variants are correlated, the IVW method can be extended to account for the correlations between genetic variants [Burgess et al., 2016]. This can be motivated by considering generalized weighted linear regression of the genetic associations with the outcome on the genetic associations with the risk factor using the weighting matrix Ω, where Ω j 1 j 2 = se(β Y j1 ) se(β Y j2 )ρ j 1 j 2 , and ρ j 1 j 2 is the correlation between genetic variants j 1 and j 2 . The causal estimate is: whereβ X ,β Y are vectors of the genetic associations, and T is a vector transpose.
Again, this estimate is equivalent to the 2SLS estimate that is obtained using individuallevel data (see Appendix for proof). It therefore inherits the efficiency property of the 2SLS estimate as the optimally-efficient causal estimate based on all the genetic variants.

Scope of paper
In this paper, we illustrate and provide guidance on choosing variants to include in a Mendelian randomization with fine-mapped genetic data. We first provide a motivating example analysis based on summarized genetic associations for hundreds of correlated genetic variants in a single gene region. We demonstrate and explain why including too many genetic variants in such an analysis can lead to numerical instabilities and inflated Type 1 error rates. We also show that estimates based on a few variants can be highly sensitive to the choice of these variants. A novel approach is presented using principal components analysis to ensure that all variants contribute to the analysis, but without introducing numerical instabilities. We discuss practical implications of these findings for applied Mendelian randomization investigations.
Software code in the R programming language for implementing the analyses discussed in the paper is provided in the Appendix.
Motivating example: serum testosterone and coronary heart disease risk We consider an example Mendelian randomization analysis with serum testosterone as the risk factor and coronary artery disease (CAD) risk as the outcome using genetic variants in the SHBG gene region. The associations of 325 individual SNPs with testosterone are reported by Jin et al. [2012]; associations of 322 of these variants with coronary artery disease risk are reported by the CARDIoGRAMplusC4D Consortium [2015]. Previously, in an independent dataset, Coviello et al. [2012] demonstrated at least six separate signals in the SHBG gene region at a genome-wide level of significance, plus three more variants associated with sex hormone-binding globulin (SHBG) on adjustment for these six variants. In all analyses, correlations between variants are estimated using 1000 Genomes Phase 3 data on 503 individuals of European descent as reference data. A further 2 variants were monomorphic in the reference data; analyses are conducted using the remaining 320 variants. As variants in the SHBG gene region are associated with circulating levels of both testosterone and SHBG, a positive Mendelian randomization finding would not distinguish which of these is a causal risk factor, but would suggest that sex hormone-related mechanisms have a causal role in cardiovascular disease.
Three approaches are taken here to choose which variants to include in a Mendelian randomization analysis. First, we take eight variants from the conditional analysis in the independent dataset reported by Coviello et al. (the association with testosterone in the data under analysis was not available for one variant). Second, we perform a stepwise conditional approach using the summarized associations reported by Jin et al., selecting at each step of the analysis the variant having the lowest p-value for association with the risk factor in the conditional analysis. We proceed until no additional variants are associated with the risk factor at p < 0.0001 or p < 0.001.
This approach is implemented using the GCTA software. Third, we perform a stepwise pruning approach [Yang et al., 2012], selecting at each step of the analysis the variant having the lowest p-value for association with the risk factor in a marginal (univariable) analysis. Once a variant is selected, all other variants whose correlation with the selected variant is greater in magnitude than a given correlation threshold (taken as 0.2, 0.4, 0.6, 0.8, 0.9, and 0.95; equivalent to an r 2 threshold of 0.04, 0.16, 0.36, 0.64, 0.81 and 0.9025) are removed from the analysis. We continue until each variant is either selected or removed. This ensures that a set of variants is chosen for each threshold correlation such that the variants are each marginally associated with the risk factor, and the pairwise correlations are all below the threshold correlation. Although a data-driven approach to selecting variants to include in a Mendelian randomization investigation is often unwise [Burgess et al., 2011], in this case the associations with the risk factor and with the outcome are estimated in non-overlapping samples, and so "winner's curse" bias in the genetic associations with the outcome should not arise.
The Mendelian randomization estimates are presented in Table I. Fixed-effect analysis models that account for correlations between variants are used throughout. Despite the two approaches using a conditional analysis and the pruning approach at a threshold correlation of 0.2 including similar numbers of variants in the analysis, the causal estimates in these three analyses differed substantially -by over two standard errors, and gave opposing substantive conclusions. In the pruning approach, as the threshold correlation increased, more variants were included in the Mendelian ran- substantially lower than those calculated using the conditional approach. This may be due to the extra variants explaining additional variability in the risk factor; the reduction in standard error corresponds to a 97% relative increase in variance explained by the variants at a threshold of 0.4 compared with at 0.2, and a 240% increase at a threshold of 0.6. It is unclear which of the estimates in Table I are reliable, and therefore whether evidence supports testosterone as a causal risk factor for coronary heart disease risk or not. Estimates (standard errors, SE) of causal effect of testosterone on coronary artery disease risk (estimates are log odds ratios per unit increase in log-transformed testosterone) from inverse-variance weighted method (accounting for correlation) with variants selected using three different approaches and (for the pruning method) six different threshold correlations (measured by ρ and by r 2 ).
a The variance estimate was negative, indicating that the weighting matrix was not positive definite, meaning that either the standard errors in the weighting matrix were imprecisely estimated, or else were not compatible with the correlation matrix.

Choosing the right number of variants
To resolve the question of how to choose which variants to include in a Mendelian randomization analysis, we explore reasons why analyses that include too many or too few genetic variants may go wrong, and propose a solution that incorporates associations on large numbers of genetic variants into the analysis, but does not suffer from numerical instabilities.

Too many variants: near-singular genetic correlation matrix
A matrix is singular if it cannot be inverted -formally, if the determinant of the matrix is zero. This occurs when the rows or columns of a matrix are linearly dependent; that Genetic association with testosterone Genetic association with coronary artery disease risk is, at least one column (or row) can be calculated as a linear sum of multiples of the other columns (known as multicollinearity). This will occur for the genetic correlation matrix when two genetic variants are in perfect linkage disequilibrium, or alternatively if a small number of haplotypes are present in the data (perfect multicollinearity can occur even if no pair of variants is highly correlated). In contrast, a near-singular matrix can be inverted, but its determinant is close to zero. This occurs in a regression model when there is substantial, but not perfect, multicollinearity. As sample sizes for estimating genetic correlations increase, singular matrices will become less common, but near-singular genetic correlation matrices are likely to become more common. This is because a discrepant allele count in a single individual (which could represent a genotyping error) can lead to a singular matrix becoming non-singular. A nearsingular matrix is problematic as elements of its inverse can be very large. In the motivating example with correlation thresholds of 0.9 and 0.95, the maximal element of the inverse of the correlation matrix is over 10 million.
If a matrix is exactly singular, then it cannot be inverted, and the analysis will report an error. If a matrix is near-singular, then the analysis may report an estimate without giving any indication that the estimate may be misleading (as observed in Figure 1). In conjunction with discrepancies in the genetic association estimates, nearsingular behaviour can lead to overly-precise as well as highly misleading estimates.
Discrepancies may occur due to include rounding of association estimates (particularly for summarized genetic associations taken from the literature), inaccuracy and uncertainty in correlation estimates, and genetic association estimates and/or correlation estimates being estimated in different samples. When multiplied by the large numbers in the inverse of a near-singular genetic correlation matrix, small discrepancies in association estimates are magnified. Overprecision in the causal estimate will occur when genetic association estimates that should be similar based on the correlation matrix are more dissimilar than expected.

Too few variants: unstable estimates
While theoretical considerations suggest that a Mendelian randomization analysis should be based on only variants associated with the risk factor in a conditional analysis, in practice this results in a Mendelian randomization estimate that only uses data on a small number of variants. In the motivating example, the conditional analyses suggest that less than 10 variants should be included in the analysis; associations with the remaining over 300 variants are ignored. In some cases and in particular in the motivating example, the causal estimate is highly sensitive to the choice of which variants are included in the analysis. This leads to unstable Mendelian randomization estimates -if one of the selected variants in the conditional analysis happened not to be measured, or failed quality control (QC) criteria, then a different set of variants would have been obtained from the conditional analysis, resulting in a different Mendelian randomization estimate.

Just right?: principal components analysis
One potential solution for resolving the problem of multiple correlated variants is principal components analysis (PCA). The use of PCA has been previously suggested for reducing the dimensionality of the instrumental variable space to resolve issues of weak instrument bias [Winkelried and Smith, 2011], and as a tool for grouping variants in a fine-mapped gene region [Cai et al., 2013]. We perform unscaled PCA on a weighted version of the genetic correlation matrix Ψ j The diagonal elements of this matrix are the inverse-variance weights, and so each is equal to the precision of the causal estimate based on that variant alone.
Assuming that associations for all variants are estimated in the same sample size, these diagonal elements are proportional to the amount of variance in the risk factor explained by the genetic variant. This can be seen as the standard errors of the associations with the outcome will be directly proportional to the standard errors of the associations with the risk factor, which in turn relate to the minor allele frequencies MAF j : if the variant is a diallelic SNP, then se(β Xj ) −2 ∝ MAF j (1 − MAF j ) [Burgess et al., 2016]. (The proportion of variance in the risk factor explained by ge- If two variants are perfectly correlated, but the estimates for one are measured in a larger sample size, then the precision of the association with the outcome (se(β Y j ) −1 ) will be greater for this variant, and so it will (correctly) be preferentially selected. The number of principal components to be included in the analysis can be chosen based on a threshold of variance in the weighted genetic correlation matrix. Once the principal components have been selected, we multiply the vector of genetic associations with the risk factor by the matrix of principal components, we multiply the vector of genetic associations with the outcome by the matrix of principal components, and pre-and post-multiply the genetic correlation matrix by the matrix of principal components.
The IVW method is then performed on the transformed vectors of genetic associations and the transformed correlation matrix. If the matrix Ψ = W ΛW T , where W is the matrix of eigenvectors (or loadings), and Λ is the diagonal matrix with the eigenvalues λ 1 > . . . > λ J on the diagonal, then let W k be the matrix constructed of the first k columns of W . Then we define: β X = W T kβX as the transformed genetic associations with the risk factor β Y = W T kβY as the transformed genetic associations with the outcomẽ Ω = W T k ΩW k as the transformed weighting matrix.
Then the PCA-IVW estimate is given by: For the example of testosterone and CAD risk, 99% of the variance in this matrix was explained by the first 8 principal components, and 99.9% by the first 17 principal components. The corresponding estimates using these principal components as instruments were -0.065 (standard error 0.099) and -0.045 (0.083) respectively. These estimates are similar in precision to that using the previous conditional analysis for variable selection, but less precise than those calculated using the GCTA method on the data under analysis or a liberal correlation threshold in the pruning method.

Simulation study
We illustrate statistical issues arising from using too many and too few variants in a series of simulation studies based on the motivating example. Again, fixed-effect analysis models are used throughout.

Sensitivity to choice of genetic variants
First, we repeated the analyses of the motivating example except using only 180 of the 360 genetic variants at a time. This represents a scenario in which only a subset of the genetic variants in the analysis were measured. Sets of 180 variants were chosen at random 10 000 times.

Sensitivity to correlation matrix
Second, we repeated the analyses of the motivating example except varying the correlation matrix. We took a bootstrap sample of the reference data (same size sample as the original data, sampled with replacement), and calculated a correlation matrix based on this sample. This procedure was performed 10 000 times.
For each of these simulation analyses, we performed the pruning method for selecting genetic variants at a threshold correlation of 0.2, 0.4, 0.6 and 0.8, and the PCA method using components that explained 99% and 99.9% of the variance in the summarized association matrix. Results are presented in Table II. In both simulation studies, as the threshold in the pruning approaches increased, the mean standard error of the causal estimates decreased, and the mean causal estimate also changed substantially. For a threshold correlation of ρ = 0.8, causal estimates were unstable, and were particularly sensitive to changes in the correlation matrix. In contrast, estimates using the PCA approach were not so precise, but they were far less variable between iterations. Means of estimates, standard deviations (SD) of estimates, and mean standard errors (SE) for 10 000 iterations based on motivating example: i) varying the choice of variants and ii) varying the correlation matrix. Six approaches for selecting genetic variants are performed: four based on pruning at different correlation thresholds (ρ) and two based on principal components analysis (PCA).
a Excluding 536 iterations in which the standard error was not defined. b Estimates were highly variable and the standard error was not defined for a large proportion of iterations.

Rounding of association estimates
Finally, we simulated genetic associations with the risk factor and with the outcome directly. Genetic associations with the risk factor were drawn for 320 variants from a multivariable normal distribution with mean vector the measured genetic associations with testosterone from the motivating example and variance-covariance matrix Ω X , where Ω Xj 1 j 2 = se(β Xj1 ) se(β Xj2 )ρ j 1 j 2 . The associations with the outcome are drawn from a multivariate normal distribution with mean zero and variance-covariance matrix Ω, where Ω j 1 j 2 = se(β Y j1 ) se(β Y j2 )ρ j 1 j 2 as defined above. This represents a null causal effect. We also set the mean of the distribution of the associations with the outcome as 0.1 times the associations with the risk factor, representing a causal effect of 0.1. We simulated 10 000 datasets for each value of the causal effect, and calculated the Mendelian randomization estimate using the same six approaches for variant selection as above. Additionally, we repeated the analyses but first rounding the genetic associations (and their standard errors) to three and two decimal places. Table III for the standard deviation of estimates, the mean standard error, and the empirical power of the 95% confidence interval (the proportion of datasets in which the confidence interval excluded the null; this is the Type 1 error rate for a null causal effect). The mean estimates (not presented) were close to the true causal effect throughout for all approaches. As in the previous simulations, estimates from the pruning approach became more precise as the threshold correlation increased, although Type 1 error rates were above nominal levels for ρ = 0.8 even when the association estimates were not rounded. Rounding exacerbated false positive findings, and inflated Type 1 error rates were present in all methods when associations were rounded to 2 decimal places. Coverage rates were least affected when pruning at a threshold correlation of ρ = 0.2 or 0.4 and for the PCA approaches. With a positive causal effect, power increased as the threshold increased, although judging estimators by power estimates is misleading when Type 1 error rates are inflated. Power of the PCA approaches was similar to that using a pruning threshold of ρ = 0.2, and was greater using principal components that explained a greater proportion of the weighted correlation matrix. Standard deviations (SD) of estimates, mean standard errors (SE), and empirical power based on the 95% confidence interval for 10 000 simulated datasets using six approaches for selecting genetic variants. Results are also given on rounding the association estimates to a fixed number of decimal places.

Discussion
As the cost of high-density genome sequencing continues to fall, additional signals are likely to be identified within known loci. There will be growing demand for methods to exploit correlated instruments in Mendelian randomization, as the addition of correlated variants can improve power to detect a causal effect. In this paper, we first connected previously known results together to show from theoretical arguments that genetic variants included in a Mendelian randomization analysis should be those that are associated with the risk factor in a conditional analysis. If the variants are combined in an allele score, then the conditional (multivariable) associations with the risk factor should be used as weights in the allele score to obtain the most efficient analysis. If only summarized data are available, then the same analysis can be replicated with the marginal (univariable) associations using an extension to the inverse-variance weighted method to account for correlations between variants.
However, difficulties arise when there are many correlated genetic variants in a single gene region that are associated with the risk factor (fine-mapping genetic data).
Including too few genetic variants in an analysis means that estimates are less precise, but also highly variable, in that different approaches to choosing variants can lead to markedly different estimates. However, including too many variants can lead to numerical instabilities and overly precise estimates with inflated Type 1 error rates.
These numerical instabilities are not computational issues, but arise due to inconsis- occasionally occurred at a threshold correlation of 0.6 (r 2 = 0.36). We note as well that r 2 is not always a good measure of correlation between genetic variants; nearsingular matrices can occur when the pairwise correlations measured by r 2 are low but there are haplotypes represented in the data, or when the minor allele frequencies of variants differ but a common variant 'tags' a rare variant (high D-prime, but low r 2 ).
As an alternative approach, we have proposed a method for selecting instruments based on principal components analysis of a weighted version of the genetic correlation matrix. This approach constructs instruments as linear combinations of genetic variants. As the linear combinations are orthogonal, the approach does not suffer as much with respect to numerical instabilities. Additionally, the method incorporates data on all the genetic variants into the analysis, and consequently causal estimates from the approach are less variable. Estimates from the principal components analysis approach are less precise than those from the variable selection approaches considered here (GCTA and pruning); however, they are less variable with respect to choices of how to implement the analysis (in particular the choice of variants).

Comparison with previous work
The inverse-variance weighted method presented here is a simple application of generalized weighted linear regression, and is not unique to Mendelian randomization.
The same method has been used in a variety of contexts including discovery genetics [Zhu et al., 2016], and prediction and model selection [Chen et al., 2015;Benner et al., 2016;Newcombe et al., 2016]. A number of different solutions have been proposed to the problem of highly-correlated variants, including pruning and clumping at a threshold correlation, and adding a small positive number to the diagonal of the correlation matrix [Gusev et al., 2016]. In the applied example of the paper at a correlation threshold of ρ = 0.8, adding 0.1 to the diagonal of the correlation matrix changed the causal estimate from -0.137 (standard error 0.031) to -0.065 (0.057). Although the substantial change in the causal estimate is indicative of near-singular behaviour, it would seem preferable for estimation to simply use a stricter correlation threshold rather than misspecifying the correlation matrix (and better still to use the principal component approach presented in this manuscript).
We believe that Mendelian randomization differs somewhat from other analysis contexts, as an instrumental variable analysis relies on inferences from a single gene region (for example, for a protein risk factor where the gene region is the coding region for the risk factor) or a small number of gene regions. Another feature of Mendelian randomization is the prevalence of the summarized data and two-sample settings, in which discrepancies in genetic associations are likely to arise.
Principal components approaches have been suggested before for fine-mapping data, with Wallace demonstrating that 70% of the variance in the genetic correlation matrix could be explained by an average of 7 components for 49 test gene regions [Wallace, 2013]. A key innovation here is weighting the genetic correlation matrix, meaning that principal components with the greatest eigenvalues will be those that explain the most variance in the risk factor. This means that it is more likely that an analysis based on a small number of principal components will have reasonable power to detect a causal effect. For example, if there is only one causal variant in the gene region, then 100% of the variance would be explained by one principal component, even if there were other uncorrelated variants in the gene region.
We advocate the principal components analysis method proposed in this paper as a worthwhile approach to analyse fine-mapped genetic data for Mendelian randomization.

A.1 Software code
We provide R code to implement the methods discussed in this paper. The genetic variants are represented by g (a matrix of allele counts for the genetic variants), the risk factor by x, and the outcome by y. Weights for the allele score are represented by wts. If the genetic variants are perfectly uncorrelated, and the weights are the coefficients from univariable regression analyses of the risk factor on each of the genetic variants in turn, then these two analyses are equivalent.
Regression of Y on Z gives beta-coefficientsβ Y = (Z T Z) −1 Z T Y with standard errors the square roots of the diagonal elements of the matrix (Z T Z) −1 σ 2 where σ is the residual standard error. If the instrumental variables are perfectly uncorrelated, then the off-diagonal elements of (Z T Z) −1 σ 2 are all equal to zero. Regression of X on Z gives beta-coefficientsβ X = (Z T Z) −1 Z T X. Weighted linear regression of the beta-coefficientsβ Y on the beta-coefficientsβ X using the inverse-variance weights (Z T Z)σ −2 gives an estimate: The assumption of uncorrelated instrumental variables ensures that the regression coefficients from univariate regressions (as in the regression-based methods) equal those from multivariable regression (as in the two-stage least squares method).
Variants correlated: If the variants are correlated, then the same argument holds, except that the weights in the weighted linear regression of the beta-coefficientŝ β Y on the beta-coefficientsβ X are (Z T Z)P σ −2 , where P is the (symmetric) correlation matrix.