MendelianRandomization v0.5.0: updates to an R package for performing Mendelian randomization analyses using summarized data

The MendelianRandomization package is a software package written for the R software environment that implements methods for Mendelian randomization based on summarized data. In this manuscript, we describe functions that have been added to the package or updated in recent years. These features can be divided into four categories: robust methods for Mendelian randomization, methods for multivariable Mendelian randomization, functions for data visualization, and the ability to load data into the package seamlessly from the PhenoScanner web-resource. We provide examples of the graphical output produced by the data visualization commands, as well as syntax for obtaining suitable data and performing a Mendelian randomization analysis in a single line of code.


Introduction
Mendelian randomization is an epidemiological technique that uses genetic variants to link risk factors to outcomes 1,2 . The MendelianRandomization package is a software package written for the R software environment 3 that implements methods for Mendelian randomization based on summarized data 4 . Summarized data are genetic associations with risk factors and outcomes taken from regression analyses that have been performed for each genetic variant in turn 5 . Such data (beta-coefficients and standard errors) are generated in a genome-wide association study, and have been publicly reported for hundreds of thousands of variants by many large studies and consortia 6 . While the basic functionality and initial features of the package have been discussed previously 7 , several functions have been added to the package in recent years. These features can be divided into four categories: robust methods for Mendelian randomization, methods for multivariable Mendelian randomization, functions for data visualization, and the ability to load data into the package seamlessly from the PhenoScanner web-resource. We discuss each of these categories in turn, describing the various options available to investigators. A list of functions in the package is provided as Table 1. We do not discuss in detail the properties of the various methods Table 1. Functions available in the MendelianRandomization package. Functions are divided into five categories: data entry functions, univariable estimation methods, multivariable estimation methods, data visualization functions, and functions that load data from PhenoScanner.

Amendments from Version 1
We have edited the manuscript text to address the reviewers' comments. Specifically, we have added more information on how to access information on the various estimation commands: the options for inputs and the outputs reported by the various functions. We have added more detail on correlated variants and on multivariable Mendelian randomization. We have provided more information on how the inverse-variance weighted method is implemented in the univariable and multivariable settings. We have also added a new table that lists the various estimation methods, the corresponding commands, some brief strengths and weaknesses, and provided references.
Any further responses from the reviewers can be found at the end of the article REVISED or the reasons for choosing between the various options presented; we would encourage users to read the relevant references for the methods or the recently-published guidelines paper on performing Mendelian randomization investigations 8 . We also encourage users to consult the package documentation, which describes all the options available for each method in greater detail. The aim of this paper is to provide a broad overview of the package.

Implementation
The initial release of the MendelianRandomization package included four functions for the estimation of causal effects based on summarized genetic data in a univariable (that is, one risk factor) Mendelian randomization framework. These were mr_ivw (inverse-variance weighted method, IVW) 9 , mr_egger (MR-Egger method) 10 , mr_median (simple and weighted median methods) 11 , and mr_maxlik (maximum likelihood method) 9 . Each of these estimation functions takes an MRInput object as input, created using the mr_input command. The syntax is: mr_ivw(mr_input(ldlc, ldlcse, chdlodds, chdloddsse)) where ldlc and ldlcse are genetic associations with low-density lipoprotein (LDL) cholesterol and their standard errors for 28 genetic variants as previously reported by Waterworth et al. 12 , and chdlodds and chdloddsse are genetic associations with coronary heart disease risk for the same variants. These data variables are provided with the package. Syntax for the default operation of the mr_egger and mr_median commands (and all the other univariable estimation commands discussed in this paper) is identical, although useroptions and the output from each method is different.
Some methods rely on all variants being uncorrelated; others allow correlated variants using the correl option. Using correlated variants requires the specification of the correlation matrix between genetic variants, on the assumption that the correlations between the genetic variants are the same as the correlations between the genetic association estimates 4 . Correlations are typically estimated from reference data, such as those from European-descent participants of the 1000 Genomes Project that can be obtained using the ld_matrix command in the TwoSampleMR package 13 . Care should be taken that entries in the correlation matrix are harmonized to the same effect and reference alleles as the genetic associations 14 ; if the correlation matrix was calculated with the effect and reference alleles reversed, then the positive and negative signs should be flipped for the relevant column and row of the matrix (the diagonal terms should remain as +1). Exemplar data on genetic associations with calcium and fasting glucose for correlated variants are provided in the package. The IVW method can be applied to these data using the syntax: mr_ivw(mr_input(calcium, calciumse, fastgluc, fastglucse, corr=calc.rho)) where calc.rho is the correlation matrix.
All methods allow confidence intervals to be calculated using a t-distribution rather than a normal distribution (distribution = "t-dist") or based on a different significance level (alpha = 0.05 corresponds to a 95% confidence interval). Other options are specific to particular methods; a list of input options for for each method can be found in the package documentation under the subheading "Arguments"; for the mr_ivw method, this is accessed by the command ?mr_ivw.
Each method provides output in a slightly different format. Generally, the estimate from the method is in the Estimate slot, its standard error is in the StdError slot, and the lower and upper limits of the confidence interval for the estimate are in the CILower and CIUpper slots. For the mr_ivw command, these can be accessed via: mr_ivw(mr_input(ldlc, ldlcse, chdlodds, chdloddsse))$Estimate mr_ivw(mr_input(ldlc, ldlcse, chdlodds, chdloddsse))$StdError mr_ivw(mr_input(ldlc, ldlcse, chdlodds, chdloddsse))$CILower mr_ivw(mr_input(ldlc, ldlcse, chdlodds, chdloddsse))$CIUpper A list of output slots for each method can be found in the package documentation under the subheading "Value"; for the mr_ivw method, this is accessed by the command ?mr_ivw.

Operation
The R software environment runs on a wide variety of UNIX platforms, Windows, and MacOS, and requires minimal computer resources (256 kilobytes of RAM is recommended). The package requires R version 3.0.1 or higher.

Use cases
Robust methods for Mendelian randomization A brief description of each method is given in Table 2. These methods were discussed in greater detail and compared in a review of robust methods for Mendelian randomization 15 .
The IVW method is implemented by weighted linear regression of the genetic associations with the outcome on the genetic associations with the risk factor 4 . There are two options in the mr_ivw method that represent different robust methods. The robust option performs the IVW method method using robust regression (referred to as MR-Robust) 16 . The penalized option performs the IVW method with penalized weights 16 . The syntax is: mr_ivw(mr_input(ldlc, ldlcse, chdlodds, chdloddsse), robust=TRUE) mr_ivw(mr_input(ldlc, ldlcse, chdlodds, chdloddsse), penalized=TRUE) Other methods implemented in the package are the mode-based method (mr_mbe) 17 , the heterogeneity penalized method (mr_hetpen) 18 , the lasso method (mr_lasso) 16 , and the contamination mixture method (mr_conmix) 19 . As for the mr_ivw command, the syntax is: mr_mbe(mr_input(ldlc, ldlcse, chdlodds, chdloddsse)) and similarly for the other methods.
The mr_mbe method has options weighting = "weighted" or weighting = "simple", corresponding to weighted and unweighted versions of the method. It also has options stderror = "simple" or stderror = "delta" corresponding to first-and second-order standard errors. The mr_hetpen method has options prior to set the prior probability of a genetic variant being a valid instrument (default is 0.5), and CIMin, CIMax, and CIStep to allow feasible and efficient calculation of confidence intervals.
The mr_conmix method has options psi to set the value of the standard deviation of the distribution of invalid estimands (that is, how variable are the quantities targeted by genetic variants that are invalid instrumental variables), and CIMin, CIMax, and CIStep as above.
The mr_lasso method has the option lambda to set the tuning parameter in the penalized (lasso) regression model.

Methods for multivariable Mendelian randomization
Multivariable Mendelian randomization is an extension of the standard Mendelian randomization paradigm to include multiple risk factors in a single analysis model 20,21 . Typically, it is employed when several risk factors share genetic predictors, and so it is not possible to find genetic variants that are specific predictors of a particular risk factor. In multivariable Mendelian randomization, it is assumed that the genetic variants are specifically associated with any of a set of risk factors, such that any causal pathway from the genetic variants to the outcome passes via one or other of the risk factors. To perform multivariable Mendelian randomization with summarized data, genetic associations are required for each variant with all of the risk factors.
Methods for multivariable Mendelian randomization take an MRMVInput object as an input, created using the mr_mvinput command. Four functions are included for the estimation of causal effects based on summarized genetic data in a multivariable Mendelian randomization framework. These are mr_mvivw (multivariable IVW method) 22 , mr_mvegger (multivariable MR-Egger method) 23 , mr_mvmedian (multivariable median-based method) 24 , and mr_mvlasso (multivariable lasso method). The syntax is: mr_mvivw(mr_mvinput(bx = cbind(ldlc, hdlc, trig), bxse = cbind(ldlcse, hdlcse, trigse), by = chdlodds, byse = chdloddsse)) where hdlc and hdlcse are genetic associations with high-density lipoprotein (HDL) cholesterol and their standard errors, and trig and trigse are genetic associations with triglycerides and their standard errors for the same 28 variants. Again, these data variables are provided with the package. Syntax for the mr_mvegger, mr_mvmedian, and mr_mvlasso commands is identical.
The multivariable IVW method is implemented similarly to the univariable IVW method, except using multivariable regression of the genetic associations with the outcome on the genetic associations with the risk factors. As in the univariable case, the mr_mvivw command can be implemented using robust regression with the robust = TRUE option 24 . The mr_mvivw and mr_mvegger methods have a correl option to allow for correlated variants. The mr_mvlasso method has the lambda option to set the penalization parameter as in the univariable case. All methods have distribution and alpha options as discussed above.

Functions for data visualization
The initial release of the MendelianRandomization package included two options for data visualization, both implemented using the mr_plot function. Application of the mr_plot function to an MRInput object gave an interactive scatter plot of the genetic associations together with a line representing the IVW estimate. Genetic associations with the risk factor are plotted on the horizontal axis, and genetic associations with the outcome on the vertical axis. Application of the mr_plot function to an MRAll object plotted a similar (although non-interactive) scatter plot of the genetic associations together with lines representing the estimates from various methods. An MRAll object is created by the mr_allmethods function, which returns estimates from various estimation methods.
We have added functionality so that the mr_plot function can now be applied to an MRMVInput object. In this case, we still plot the estimated genetic associations with the outcome on the vertical axis. On the horizontal axis, we plot predicted genetic associations with the outcome. These are fitted values from the multivariable IVW method, which regresses the genetic associations with the outcome on the genetic associations with the risk factors. Horizontal error bars represent confidence intervals for these fitted values. These refiect uncertainty in the multivariable IVW estimates, but not in the genetic associations with the risk factors, which are assumed to be known without error. A diagonal line is plotted with gradient 1 to help the detection of outliers, which may be pleiotropic variants. The syntax is: mr_plot(mr_mvinput(bx = cbind(ldlc, hdlc, trig), # Figure 1 bxse = cbind(ldlcse, hdlcse, trigse), by = chdlodds, byse = chdloddsse)) In the example of Figure 1, we additionally set the option interactive = FALSE to produce a non-interactive version of this plot.
In updating the package, we have added several additional functions for data visualization. The default implementation of the mr_forest function plots the variant-specific estimates in a forest plot, with the pooled estimate from the IVW method at the bottom (Figure 2a). The variant-specific estimates are the ratio estimates from each genetic variant in turn. This plot allows the user to investigate heterogeneity in the variant-specific estimates, which indicates potential pleiotropy in the analysis 25 . Heterogeneity can also be expressed numerically by Cochran's Q statistic (for the IVW method) or Rücker's Q statistic (for the MR-Egger method), which are reported as the "heterogeneity test statistic" by the relevant estimation functions. The mr_forest function can also be used to plot estimates from different methods, either in addition to the variant-specific estimates or without them ( Figure 2b): mr_forest(mr_input(ldlc, ldlcse, chdlodds, chdloddsse)) # Figure 2A mr_forest(mr_input(ldlc, ldlcse, chdlodds, chdloddsse), # Figure 2B snp_estimates=FALSE, methods = c("ivw", "median", "wmedian", "egger", "maxlik", "mbe", "conmix")) (For presentation purposes, in this and subsequent figures we provide plots for the first 9 variants in the package only.) The mr_funnel function is similar, except that the variant-specific estimates are plotted against their precision (that is, the reciprocal of their standard error). This plot also enables the user to investigate heterogeneity in the variant-specific estimates (    The mr_loo function allows the user to investigate sensitivity of the IVW estimate to individual data points. This is implemented by calculating the IVW estimate omitting each variant from the analysis in turn (loo stands for 'leave one out'). The IVW estimate based on all the variants is also plotted for reference ( Figure 4): mr_loo(mr_input(ldlc, ldlcse, chdlodds, chdloddsse)) # Figure 4 Output from each of these commands is a ggplot object, and so basic graphical parameters can be changed using functions from the ggplot2 package 26 . For example, the horizontal axis can be set to run from −5 to +5 using the following code: library(ggplot2) forest = mr_forest(mr_input(ldlc, ldlcse, chdlodds, chdloddsse)) forest2 = forest + coord_cartesian(xlim=c(-5,5)) forest2 Loading data from PhenoScanner The initial release of the MendelianRandomization package included a function called extract.pheno.csv. This function took a .csv file previously downloaded by the user from the PhenoScanner webtool (http://www. phenoscanner.medschl.cam.ac.uk/) and converted the file into an MRInput object, extracting the relevant genetic associations with the risk factor and outcome. PhenoScanner 27,28 is a database of genetic associations that contains over 65 billion associations for over 150 million unique genetic variants, including genetic associations reported by major consortia, as well as those for the UK Biobank study reported by Ben Neale's team (http://www.nealelab.is/uk-biobank).
The extract.pheno.csv function is no longer maintained; however, it has been superseded by the pheno_input command, which calls PhenoScanner directly from R and creates an MRInput object. Using this command, the entire workflow of a Mendelian randomization analysis can be performed in a single line of code. For example: mr_ivw(pheno_input(snps=c("rs12916", "rs2479409", "rs217434", "rs1367117", "rs4299376", "rs629301", "rs4420638", "rs6511720"), exposure = "Low density lipoprotein", pmidE = "24097068", ancestryE = "European", outcome = "Coronary artery disease", pmidO = "26343387", ancestryO = "Mixed")) This code first extracts data on eight genetic variants (their 'rsid' identifiers are listed above), and creates an MRInput object using the genetic associations with "low density lipoprotein" taken from the study with PubMed ID 24097068 29 in individuals of European descent as the summarized associations with the risk factor, and genetic associations with "coronary artery disease" taken from the study with PubMed ID 26343387 30 in a mixed ancestry sample as the summarized associations with the outcome. The triplet of trait name, PubMed ID, and ancestry is necessary to uniquely identify the correct dataset for genetic associations, as some publications report associations with multiple traits, or associations with the same trait in different ancestry groups. While the above code then implements the IVW method on this MRInput object, any other estimation or data visualization command that takes an MRInput object as input could be applied to the output of the pheno_input function.

Summary
In summary, the MendelianRandomization package has added a number of features since its initial release: to implement various robust estimation methods, to implement methods for multivariable Mendelian randomization, to enable a greater range of data visualization options, and to facilitate data entry. We conclude with the same warning that we stated at the end of the manuscript accompanying the initial package release 7 : while this software simplifies the operational aspects of a Mendelian randomization, the truly difficult parts of an analysis are choosing sensible risk factors and outcomes, selecting genetic variants that are plausible instrumental variables, performing a reasonable range of analyses, and interpreting the results with care and caution 31 . Software code for these aspects of an analysis cannot be written 32 .

Data availability
Underlying data All data used in this article are distributed in the software package described, or can be freely downloaded using commands in the software package that are detailed in the text of the article.

The MendelianRandomization package is available via the Comprehensive R Archive Network (CRAN)
The software package is available here: https://cran.r-project.org/web/packages/MendelianRandomization/ index.html.

9.
Open Peer Review package is widely used and thus a detailed description of updates is justified and warranted. I have only very minor suggestions.
The R package manual (updated September 30, 2020) details all functions described in this paper. The introduction or summary should briefly mention what the current paper adds above and beyond the R package manual. 1.
The paper can use some minor re-organizing or heading revisions. Currently, "Introduction", "Methods", "Use cases", "Summary", "Data availability" and "Software availability" are the major sections. The "Use cases" is well written and describes features in four categories which the authors introduce in the beginning and constitutes the key/main section of the paper. The "Methods" section, however, seems out of place or mislabeled for this descriptive paper. Perhaps use "Software overview"? 2.

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Yes

Competing Interests: No competing interests were disclosed.
Reviewer Expertise: genetic epidemiology, nutrition, cardiometabolic traits, aging traits, coffee, caffeine, metabolomics I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Author Response 20 Nov 2020
Stephen Burgess, University of Cambridge, Cambridge, UK

C0. Broadbent et al. summarize updates made to their MendelianRandomization R package. The package is widely used and thus a detailed description of updates is justified and warranted. I have only very minor suggestions.
-We thank the reviewer for their positive view of this submission.
C1. The R package manual (updated September 30, 2020) details all functions described in this paper. The introduction or summary should briefly mention what the current paper adds above and beyond the R package manual.
-The reviewer is correct that all of the material in the current paper is also available through the R package manual. However, there are reasons why we believe that a publication is worthwhile. First, this paper presents the contents of the package in a clearer and more holistic way than the manual. Secondly, it provides examples of usage and interpretation of the output. Thirdly, it creates a citation in the scientific literature so that we can track usage of the package.
-We have added text to the introduction to more clearly define the scope of this paper. We have also referenced the manual more clearly within the text, for example to obtain lists of the input options and output slots for each method.
C2. The paper can use some minor re-organizing or heading revisions. Currently, "Introduction", "Methods", "Use cases", "Summary", "Data availability" and "Software availability" are the major sections. The "Use cases" is well written and describes features in four categories which the authors introduce in the beginning and constitutes the key/main section of the paper. The "Methods" section, however, seems out of place or mislabeled for this descriptive paper. Perhaps use "Software overview"? -Thank you for the suggestion. However, the format of the article is set by the journal to ensure homogeneity of papers across this article class. Hence we do not have complete freedom to determine the structure of the article. We have reviewed the guidelines and rearranged where possible according to the journal guidelines, but the original structure largely remains.
C3. Table 1. Please annotate functions that are new and not discussed in reference 5.
-We have added annotation of those functions that are new and those that are updated since the publication of Yavorska and Burgess 2017 (now reference 7).

Department of Population Health Sciences, University of Bristol, Bristol, UK
This article demonstrates how a range of valuable Mendelian randomization approaches can be easily implemented using the MendelianRandomization R Package. It describes substantial additions to the existing software, which I believe would be of great value to researchers. This is especially true of integrating data from the PhenoScanner web-resource, which can be invaluable in identifying avenues for pleiotropic bias.
The material presented in the article is technically sound, though I believe it could be improved with minor revisions.
In the methods section, some of the wording is at times confusing, for example: "simple and weighted method methods)".

1.
There is an understandably difficult balance between summarising methods previously described, and providing sufficient detail so as for researchers to understand how options relate to methods implemented. For example, while the correl option is referenced several times, there is no indication of appropriate usage (such as providing a suitable correlation matrix). Especially in the Use cases section, the details provided at times read like the help file for various functions when using R. This shifts from providing insufficient detail overall for using such methods (specifically novel options provided in the package) with confidence, to providing at times a puzzling focus on elements such as the standard error options for the modal estimator. If these options are important enough to be included in the manuscript, then this would warrant more description of their usage.

2.
A brief overview of multivariable MR would potentially be helpful, as there is often confusion over whether only SNPs correlated with all exposures should be used, or a set of SNPs where each SNP is correlated with at least one exposure included in the model. I appreciate this has been explained in previous publications, but the clarification would be welcome.

3.
Having performed an example analysis using this software as part of the review process, I commend the authors on providing helpful vignettes and documentation guiding the analysis process in the R package itself.
It is also very encouraging to see issues surrounding the appropriate selection of exposures and outcomes, as well as interpretation of results, afforded attention at times lacking in software showcasing novel statistical approaches.
With the above comments in mind, I believe the MendelianRandomization R package represents a valuable addition to a growing body of MR software ensuring emerging methods are easily implemented in applied research.

Is the rationale for developing the new software tool clearly explained? Yes
Is the description of the software tool technically sound? Yes

Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others? Partly
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? No Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Yes -We appreciate the sentiment of the reviewer's comment. As per the response to A3, our preference is for this paper to be brief and to illustrate the options available rather than to wade into deep discussions about the properties of the various methods or the reasoning for choosing between the different options available. Hence, we are happy to provide the various options that can be chosen, but we would rather not discuss the merits and demerits of the options at length -these are already addressed elsewhere. This is now explicitly stated in the Introduction.
-If there are other specific examples where the reviewer feels that additional detail is required, then we would be happy to consider adding these. As for the specific examples that the reviewer calls out: 1) we are not aware of any novel options that are provided in the package -all the methods and options have been introduced elsewhere (see citations in the new Table 2); 2) we have added explanation and an example of the "correl" option; 3) the discussion of standard errors for the mode-based estimation method is one sentence. We do not think this is excessive. The original version of the mode-based estimation method provided by the authors gave 12 different estimates based on different options for the method (3 options for the phi parameter, weighted and unweighted, simple and secondorder standard errors): we have condensed these down into the three options discussed in the submission. -We appreciate the sentiment of the reviewer. However, as the reviewer states, detailed descriptions of these methods are already available in the literature, and a paper reviewing the methods has already been written (Slob and Burgess, 2020). Our strong preference is that this paper remains short and focuses on the software package, rather than deviating too strongly into technical detail. To this end, we have included a brief description of the methods in a separate table (Table 2), so that this information is available to the interested reader, but it does not add to the length of the text. The aim of this paper is now more clearly stated in the introduction.

A4. Similarly, it would be useful to add some extra description to the MVMR introduction to
explicitly specify what is going on; for example that the basic MVMR is just a regression with first order weights (y ~ x1 +x2) to make it less intimidating for users.
-We have added a brief description of multivariable Mendelian randomization (MVMR), and describe how the multivariable IVW method is implemented using multivariable regression.
A5. Is there a facility to report Cochran's or ruckers Q within the software? The authors discuss the value of visually inspecting plots for heterogeneity but it would wonderful if it were possible to return the statistics for this.
-The mr_ivw and mr_egger functions provide the "Heterogeneity test statistic" as part of the outcome. This is the same as Cochrane's Q statistic for the IVW method and Rücker's Q statistics for the MR-Egger method. This is clarified in the help files for the mr_ivw and mr_egger functions, and now in the manuscript.
A6. I also note that it is enjoyable that the plots can be directly edited as ggplot objects.
-That is wonderful! A7. Finally, it would be excellent to have a brief paragraph explaining the format of the data frame of results being reported. It can be quite frustrating when using a package to have to manually explore the data object to try and sleuth out what exactly is present and what it means. I expect this is in the vignette in R but I recommend a summary of this here, perhaps, even only for the components reported in common for all MR methods.
-We understand the reviewer's request, and sympathize with this problem. As suggested, we have provided details of some key outputs that are common to all (or at least, most) methods. A full description of the output of each method is readily available in the package documentation.
-For example, typing ?mr_ivw opens the description of the mr_ivw function. Under the subheading "Value", there is a clear list of the output for the IVW method. In this case, there are 15 slots: Model, Exposure, Outcome, Correlation, Robust, Penalized, Estimate, StdError, CILower, CIUpper, Alpha, Pvalue, SNPs, RSE, and Heter.Stat. We do not think it would be helpful to the reader to provide a detailed list of every slot for every method in this manuscript, but have added an explanation of how to access this information.
A8. In general, this is a welcome update that is described in a clear manner. With the minor updates I suggest above it will be a very useful tool for genetic epidemiology.
-We thank the reviewer for his comments, which have strengthened and clarified this manuscript.

Competing Interests:
No competing interests were disclosed.