The use of negative control outcomes in Mendelian randomization to detect potential population stratification

Abstract A key assumption of Mendelian randomization (MR) analysis is that there is no association between the genetic variants used as instruments and the outcome other than through the exposure of interest. One way in which this assumption can be violated is through population stratification, which can introduce confounding of the relationship between the genetic variants and the outcome and so induce an association between them. Negative control outcomes are increasingly used to detect unobserved confounding in observational epidemiological studies. Here we consider the use of negative control outcomes in MR studies to detect confounding of the genetic variants and the exposure or outcome. As a negative control outcome in an MR study, we propose the use of phenotypes which are determined before the exposure and outcome but which are likely to be subject to the same confounding as the exposure or outcome of interest. We illustrate our method with a two-sample MR analysis of a preselected set of exposures on self-reported tanning ability and hair colour. Our results show that, of the 33 exposures considered, genome-wide association studies (GWAS) of adiposity and education-related traits are likely to be subject to population stratification that is not controlled for through adjustment, and so any MR study including these traits may be subject to bias that cannot be identified through standard pleiotropy robust methods. Negative control outcomes should therefore be used regularly in MR studies to detect potential population stratification in the data used.


Introduction
When the observed association between an exposure, X, and an outcome, Y, is confounded by an unobserved variable, conventional regression analysis will produce misleading estimates of the effect of the exposure on the outcome. If genetic variants-usually single nucleotide polymorphisms (SNPs)-are available which reliably predict the exposure variable but do not have an effect on the outcome through any other pathway, then they are valid instrumental variables (IVs) and can be used in a Mendelian randomization (MR) analysis to obtain unconfounded evidence of the effect of the exposure on the outcome. 1,2 A key assumption for MR to give consistent estimates of the causal effect of an exposure on the outcome is that the SNPs used as instruments are not associated with the outcome other than through the exposure. 3 One way in which this assumption may be violated is through population stratification, where structure in the population studied causes a correlation at the population level between the distribution of the genetic variants and the distribution of the exposure and/or outcome. 4,5 Population stratification and its implications for genome-wide association studies (GWAS) are described in more detail in Box 1. MR analyses are often conducted by comparing summary data estimates of SNP-exposure and SNP-outcome associations gleaned from two independent but homogeneous study populations. This is referred to as two-sample summary data MR. 6 For the MR estimate of the causal effect of the exposure on the outcome to be a consistent estimate of the effect of the exposure on the outcome, the genetic variants must satisfy the following assumptions.
IV1: the variants must be associated with the exposure X (the 'relevance' assumption). IV2: the variants must be independent of all (observed or unobserved) confounders of X and Y, as represented by U (the 'exchangeability' assumption). IV3: the variants must be independent of the outcome Y given the exposure X, (the 'exclusion restriction').
These assumptions are illustrated in Figure 1 and are explained in detail elsewhere. 3,6,7 Population stratification causes confounding of the instrument and outcome in MR, violating IV2, illustrated in Figure 1. Any MR analyses based on the results from a GWAS will potentially be biased if that GWAS does not fully account for any ancestral population structure that could lead to population stratification. 5,8 This bias is likely to be largest when the outcome phenotype in an MR study is subject to population stratification that has not been fully accounted for. However, it will also bias effect estimates in a two-sample MR analysis when the exposure phenotype is subject to population stratification, by causing the estimated association between the SNP and the exposure to be mis-specified. As well as increasing or decreasing the size of the observed association, this bias could generate evidence of an apparent causal effect of the exposure on the outcome when no such effect exists, or alternatively could mask a true effect.
A method that is often used in observational studies to detect confounding, and help assessment of whether a causal relationship exists between an exposure and an outcome, is negative control outcome analysis. 9-11 'Negative controls' essentially renames the 'specificity of associations' which Hill considered a factor that should be weighed up in evaluating plausibility of causation in epidemiological studies. 12 Given the dismissal of this criterion in some influential epidemiological texts, their rechristening as 'negative controls' allowed their relegitimation. 13 Negative control outcome studies compare the association observed between an exposure and the outcome under investigation with the association observed between that exposure and a negative control outcome. The negative control outcome variable is chosen to be a variable that is not expected to be associated with the exposure of interest, but is expected to be subject to the same unobserved confounding as the exposure and outcome of interest. It follows that if the assumptions hold, any association observed between the exposure and the negative control outcome will be due to confounding in the model. Negative control outcome studies have previously also been proposed to detect selection bias in observational studies. 11 We advance the use of negative control outcomes to identify when exposure and outcome phenotypes in an MR analysis may be subject to population stratification or selection bias that has not been fully accounted for in the GWAS, and so may induce instrument-outcome confounding or mis-estimation of the SNP exposure relationship, and consequently may bias the results obtained.
Negative control outcome methods applied to an MR analysis have been used in a limited number of studies previously to detect potential pleiotropy and provide additional evidence on the validity of the MR study. [14][15][16] These negative controls were, however, not used in the detection of potential population stratification. We illustrate this method through estimation of the potential for population stratification in 33 preselected exposure-indexing GWAS included in MR Base. 17 We detect potential population stratification by estimating the effect of each phenotype on self-reported tanning ability and self-reported natural hair colour, variables that are likely to be highly affected by population stratification but that are largely determined at birth and are not expected to be truly affected by any of the phenotypes considered. Our results from this study show that the GWAS of adiposity-related phenotypes and education are likely to be affected by population stratification. Any MR involving these phenotypes is therefore potentially subject to bias. These results show that negative control outcomes should be routinely used in MR studies to detect population stratification.

Methods
The GWAS of natural hair colour was conducted using data from UK Biobank [https://www.ukbiobank.ac.uk/] under application number 15825. UK Biobank has received ethical approval from the UK National Health Service's National Research Ethics Service (ref 11/NW/ 0382). All other analyses were conducted using publicly Box 1. Population Stratification and the effects on results from GWAS Population stratification occurs where different sub-populations within a population being studied have different distributions of both allele frequencies and the phenotype being considered. Phenotypes, particular social phenotypes, have different distributions geographically, even over relatively small geographical areas. Primarily due to genetic drift, allele frequencies will also differ geographically. Even if this distribution of allele frequencies occurs due to chance, it can lead to an apparent association between those alleles and the phenotype, even when none exists. This will confound the results from the GWAS and lead to potentially spurious or inflated association between SNPs and the phenotype, which are due to the structure of the population and not due to a direct effect of the SNP on the phenotype. 2,18 Additionally it means that even for SNPs that do have a direct effect on the phenotype, the estimated size of that association will be biased. 19 Within GWAS studies, population stratification is often controlled for by adjusting for the top principal components from a principal components analysis of the genetic variants, 20 or by using linear mixed models which allows them to account for genetic confounding of common variants more accurately, and improve power, by jointly modelling the contribution of all measured variants. [21][22][23] However, a number of recent papers have examined the effect of population stratification in datasets such as UK Biobank and have shown that population stratification is likely to present challenges for causal inference and that such adjustment may not be sufficient. 5 Haworth and colleagues show in UK Biobank that genetic variants are associated with a number of variables including location of birth, and this association cannot be fully accounted for by standard principal components analysis. 4 Abdellaoui and colleagues show that many traits in UK Biobank are subject to genetically driven clustering after controlling for ancestry. 24 They propose that this clustering is likely to reflect socioeconomic differences in migration patterns, and that these results suggest that social stratification affects the geographical pattern of allele frequencies. The implication of this is that even very recent patterns of movement within the UK will lead to population stratification for more 'social' phenotypes.
Many GWAS are conducted using multiple independent cohorts, each assumed to have independent ancestral patterns that are likely to be cancelled out when meta-analysed. A recent study has brought this assumption into question, illustrating that within-cohort correction for population stratification tends to be under-powered to fully account for deep ancestral history that is common across all cohorts. 25 Two recent papers show that a polygenic signal for height, observed in European GWAS such as GIANT, is weak or absent in UK Biobank and that the signal observed in the European GWAS is due to population stratification. 26,27 Barton and colleagues highlight why this matters. When multiple genetic variants are used to predict a phenotype, as is the case in an MR study, the association between each variant and the phenotype needs to be unbiased for reliable inference. Population stratification will bias these associations and therefore potentially any inference that is based on them. 19 available summary data generated using the relevant ethics approval for that study.
Two-sample summary data MR compares the association of a set of SNPs with the exposure and outcome to determine the effect of the exposure on the outcome. It is explained in detail elsewhere. 6,7 We propose running two additional MR sensitivity analyses for any MR study where population stratification is thought to potentially affect the result: i. an MR analysis to estimate the effect of the exposure of interest on the negative control outcome; ii. an MR analysis to estimate the effect of the outcome of interest on the negative control outcome (i.e. the outcome of interest becomes the exposure in this analysis).
Any effects detected in these analyses would indicate the potential presence of population stratification in the GWAS of the phenotype of interest and therefore possible bias in an MR analysis including that phenotype. The negative control outcomes should be selected based on the same criteria that have been traditionally used in epidemiological studies; i.e. they should not be expected to be dependent on the phenotypes of interest in the analysis but should be affected by the same confounding. In order to satisfy the assumption that the negative control outcome is not actually caused by the exposure, we propose using phenotypes that are determined before the exposure and the outcome in the negative control MR study. The phenotype for the negative control outcome should also be selected to be thought to be affected by the population stratification. For bias caused by population stratification, such variables could include hair colour, eye colour or skin tone. If there is no instrument-outcome confounding, this analysis will give a null result. As the negative control outcomes are largely predetermined relative to the exposure and outcome and so cannot depend on either, any association of the SNPs with the negative control outcome must be driven by some other mechanism. This could take the form of pleiotropy due to the SNPs having an effect either directly on the negative control outcome or on another phenotype that then affected the negative control outcome, illustrated in Figure 1. However, conventional pleiotropy robust estimation methods will give results that are robust to this pleiotropy if it only affects some of the SNPs included in the estimation. [28][29][30] Alternatively, the (c) Mechanisms through which pleiotropy can cause bias in Mendelian randomization estimates. X is the exposure of interest, Y is the outcome of interest, G x are the genetic variants associated with X used as instruments, G Y are genetic variants associated with Y, C is a confounder of the exposure outcome relationship. In (a): assumption IV1 is illustrated by the bold line from G X to X. Violations of assumptions IV2 and IV3 are given by the dashed lines from C to G X and from G X to Y, respectively. In (b): the presence of population stratification creates an association between G X and Y that does not go through X, violating one of the IV assumptions. In (c): pleiotropy will cause bias in MR estimates if either both edges marked a, or the edge marked b, are present. Pleiotropy in MR studies is explained in detail elsewhere 3 observed effect of the phenotype on the negative control outcome could be due to instrument -outcome confounding. In this case, conventional pleiotropy robust methods would not give results that are robust to this bias as the confounding would affect all of the SNPs included in the estimation. Evidence of an effect of the exposure and outcome on the negative control outcome indicates that an MR study of the exposure on the outcome is also likely to be biased. An illustration of how negative control outcomes could be applied to an MR estimation of body mass index on coronary heart disease is given in Supplementary Figure S1, available as Supplementary data at IJE online.

Applied example
To illustrate the use of negative control outcomes in MR studies, we investigated the effect of a range of exposures on self-reported tanning ability and natural hair colour from UK Biobank, as a negative control outcome in twosample summary data MR to detect population stratification in these exposures. Between 2006 to 2010, the UK Biobank study enrolled 500 000 individuals aged between 40 and 69 at baseline across 22 assessments centres in the UK. 31 Data were collected based on clinical examinations, assays of biological samples, detailed information regarding self-reported health characteristics and genome-wide genotyping. 32 In total, 12 370 749 genetic variants in up to 463 005 individuals were available for analysis, as described previously. 33 UK Biobank received ethical approval from the Research Ethics Committee (REC reference for UK Biobank is 11/NW/0382).
For their tanning response to sun exposures, individuals were asked 'What would happen to your skin if it was repeatedly exposed to bright sunlight without any protection?,' with four potential responses which ranged from get very tanned (given a score of 1) to never tan and always burn (given a score of 4). A higher score is therefore associated with fairer skin that is less prone to tanning. A GWAS of this question was conducted by the MRC IEU 33 and included in MR Base. 17 For natural hair colour, individuals were asked 'What best describes your natural hair colour? (If your hair colour is grey, the colour before you went grey)?', with five valid potential responses; blonde, red, light brown, dark brown or black. We categorized these responses as 1: blonde, 2: red, 3: light brown, 4: dark brown and 5: black, in accordance with a previous GWAS of hair colour which included UK Biobank. 34 The association between genetic variants and outcomes in the UK Biobank study were assessed using the software BOLT-LMM. 21,33 This approach applies a Bayesian linear mixed model to evaluate the association between each genetic variant across the human genome in turn, with the analysed outcome accounting for both relatedness and population stratification. 22 Age at baseline, sex and type of genotyping array were added as covariates in the model. As tanning ability and hair colour are largely determined at birth and are highly dependent on variations in an individual's ancestral background, they should not depend on exposures experienced during an individual's lifetime.
We preselected 50 characteristics or risk factors with GWAS data available in MR base as our example phenotypes. These phenotypes were all selected to have male and female participants from a mixed or European population that did not include UK Biobank. Where multiple GWAS for the same phenotype were available, we chose only the most recent relevant one available in MR base at the time of analysis; however, we retained in the analysis similar (but not exactly equivalent) phenotypes such as body mass index (BMI) and waist-to-hip ratio. We excluded GWAS that included UK Biobank to avoid the potential for winner's curse from selecting the exposure and the outcome from the same sample. A full list of the phenotypes included in the analysis is given in Supplementary Table S1, available as Supplementary data at IJE online. From these preselected phenotypes, we excluded one GWAS due to the information available in MR base not matching that given in the paper, and 16 with fewer than five genome-wide significant SNPs available as instruments, leaving us with 33 phenotypes for analysis.
For each of our 33 exposures, we calculated the inverse variance weighted (IVW) effect for that exposure on tanning ability and hair colour. For those exposures which showed evidence of an effect on each negative control outcome, we also report the MR Egger, 30 weighted mode 29 and weighted median 28 effects as sensitivity analyses. Weighted mode and median estimates give robust estimation results if the association observed is driven by outlying SNPs. However, if population stratification is driving the results seen, this would not be expected to be due to an effect of a small number of outlying SNPs but due to an effect across all of the SNPs used as instruments. We would therefore still expect to estimate an effect of the trait on the negative control outcome in each case. MR Egger accounts for violation of IV assumptions 2 and 3 that satisfy the InSIDE assumption. This assumption states that the bias on the outcome is independent of the strength of the SNP on the exposure. Bias due to population stratification may satisfy this assumption if it applies equally across the SNPs. However, MR Egger has low power to detect effect estimates, and so it is often not possible to determine whether the lack of an association in an MR Egger estimation that was observed in an IVW analysis is due to bias in the IVW estimation or low power in the MR Egger estimates. In each case we included all SNPs that are genome-wide significant for the exposure as our genetic instruments, as these are the SNPs that are usually used for MR analyses with multiple SNPs. 35 All analyses were conducted using the package 'TwoSampleMR' in R. 17 Estimated effects of the genetic liability towards each exposure on tanning ability from our IVW analysis are given in Figure 2, with full details given in Supplementary Table S1. These results show that a number of the exposures considered appear to have a causal effect of genetic liability towards that exposure on tanning ability. Table 1 gives the estimated effect sizes for all results with a P-value of less than 0.05 in the IVW analyses. Although we have conducted multiple tests in this analysis as many of the phenotypes we consider are related, these tests are not independent. We therefore suggest here that this gives a potential indication of whether results warrant further investigation for potential bias rather than a hard cut-off for whether these results are of interest. The traits with an effect on tanning fall into three categories; adiposity-related traits, bowel disease and years of schooling. The majority of the GWAS studies included adjusted for population stratification (using principal components or alternative methods), suggesting that this adjustment alone is not sufficient to remove all structural bias in the data. The MR Egger results suffer from high levels of uncertainty due to low power but estimated the same direction of effect in all but three of these exposures. The weighted mode and weighted median estimates supported the overall results with all of the results showing the same direction of effect as the IVW results and only three of the 12 results not replicating in at least one of the weighted mode or weighted median estimates.
Estimated effects of the genetic liability towards each exposure on hair colour from our IVW analysis are given in Figure 3, with full details given in Supplementary Table  S1. Results from the IVW analyses and sensitivity analyses for exposures with a P-value <0.05 (as a heuristic for Figure 2 Inverse variance weight (IVW) estimates from Mendelian randomization (MR) analyses on self-reported tanning ability. IVW results from MR analyses of 33 preselected traits on tanning ability. A higher score indicates less being less likely to tan and more likely to burn when exposed to strong sunlight. Full results from these analyses are given in Supplementary Table S1 presentation) in the IVW analysis are given in Table 2. These results show a very similar pattern to the results for tanning ability with adiposity-and education-related traits, showing a potential effect on hair colour. Additionally, coeliac disease showed an association with hair colour.
Three traits (triglycerides, years of schooling and obesity class 2) showed evidence of an effect on both tanning ability and hair colour.
These results illustrate the use of negative control outcomes to detect potential population stratification, and Results from inverse variance weight (IVW), MR Egger, weighted mode and weighted median analyses for those phenotypes which indicated a potential effect on tanning ability from an MR analysis of 33 preselected phenotypes on tanning ability.
show that the GWAS results for many of the phenotypes we consider, particularly those related to education and adiposity, are likely to be affected by population stratification.

Discussion
In this paper we describe the use of variables that are predetermined relative to the phenotypes of interest, but are likely to be subject to population stratification as negative control outcomes within an MR analysis to detect population stratification. The method we describe is easy to implement with currently available software and data. Our results suggest that negative control outcomes could be routinely used as part of any MR study, to detect population stratification in GWAS data that could bias the results from the MR estimation.
We propose using this method to examine the potential for population stratification in both the exposure and outcome in any MR study. Population stratification in the outcome can create confounding between the genetic variants and the outcome, which can lead to an apparent association between an exposure and outcome in any MR estimation when no causal effect exists. Population stratification in the exposure will create confounding of the genetic variants and exposure which can bias the causal estimate obtained from the MR estimation, including making a true association appear to be null. Therefore, for reliable interpretation of the results from the MR estimation, including a reliable assessment of whether or not a causal effect exists as well as estimation of the size of that effect, it is important that population stratification does not affect either the exposure or outcome. For this reason, negative control  Supplementary Table S1 outcomes should be applied as described here to both the exposure and the outcome in the MR estimation of interest.
Our applied analysis conducts a negative control outcome MR analysis of 33 preselected phenotypes on tanning ability and natural hair colour, to detect the potential for population stratification in the GWAS of these phenotypes. We find a range of phenotypes are potentially affected by population stratification, particularly a number of phenotypes related to BMI, height and educational attainment. Any association between these variables shown by an MR analysis could be due to population stratification introducing the apparent association. This result is supported by a recent study using within-family MR analyses, which showed the observed associations from MR analyses between height and education and BMI and education attenuated once family effects were controlled for. 8 One key advantage of within-family MR over our method is that is can provide MR causal estimates adjusted for the bias due to population stratification. However, within-family MR requires a large sample of related individuals and cannot be conducted with standard GWAS results. The method we propose can detect potential population stratification in samples that do not contain related individuals and using existing summary data.
Genetic liability for coeliac disease was associated with hair colour; however, this GWAS did not account for population stratification, suggesting that the adjustments for population stratification included in GWAS studies do mitigate the effects of population stratification to some degree. However, a number of the other exposures which were found to be associated with our negative control outcomes did include adjustment for population stratification in the GWAS, suggesting that this adjustment does not fully mitigate the effects. Examination of the extent to which adjustment for population stratification, through inclusion of Results from inverse variance weight (IVW), MR Egger, weighted mode and weighted median analyses for those phenotypes which indicated a potential effect on tanning ability from an MR analysis of 33 preselected phenotypes on self-reported natural hair colour. P-values in parentheses.
principal components or alternative adjustments such as using a BOLT-LMM model 21 in GWAS, can mitigate the problems of bias due to instrument-outcome confounding in MR studies, is an area for future research. Linkage disequilibrium (LD) score regression is a method that attempts to separate out biological and confounded genetic signals and so can also be applied in an MR setting to determine whether there is a causal (biological) effect of an exposure on an outcome or if an observed association is due to confounding. 36 LD score regression however does not give the appropriate results if the GWAS being considered have been performed using a linear mixed model. Our method therefore provides a complementary approach to LD score regression whcih does not depend on the method used to estimate GWAS associations. Additionally, LD score regression incorporates data from the entire genome, whereas the use of negative controls outcomes proposed here only uses SNPs strongly associated with the phenotypes of interest. This is potentially more relevant to bias in MR analyses which use SNPs associated with the exposure to estimate the causal effect of the exposure on the outcome.
The methods described here could equally be applied to detect selection bias. Selection bias, where individuals select to participate in a study or not, based on their particular phenotypes, can also induce bias into any analyses of that study. 37 Particularly, selection bias can induce bias in the associations observed between phenotypes selected on and the genetic variants associated with those phenotypes. 38,39 Selection bias is a form of collider bias which occurs when the variables of interest independently affect a third variable and so conditioning on this third collider variable will induce an association between the variables of interest. 37,40 In this case, the third variable is participation in the study and conditioning on it is unavoidable, as data are only available for the participants. Although selection bias is distinct from population stratification in its source, the subsequent biases in MR studies are similar. 39,41 Negative control outcomes could also be used to detect selection bias in GWAS results by using outcome phenotypes that are expected to affect participation in a study, but that are predetermined relative to the exposure and outcome considered in the MR analysis. Such negative control outcomes could include early life variables such as place of birth or education. Alternatively, participation can be examined directly in birth cohort studies which are followed up over time, and GWAS results from these studies could be used as a negative control outcome. 42 There are a number of weaknesses with our method that should be considered. This method is only able to detect bias as far as it affects the chosen negative control outcome, and therefore no detected effect of the phenotype on the negative control outcome does not mean that the phenotype, and any associated MR analysis, is necessarily free from bias. This limitation can be mitigated by choosing negative control outcomes that are likely to be highly population stratified, as far as they are available.
Negative control outcome calibration has been proposed for observational negative control studies, to adjust the effect of the exposure of interest on the outcome for the bias detected by the negative control outcome. 43 We believe that mechanical application of such an approach should be avoided due to the strong assumptions required for such calibration to give reliable estimates. 44 A key assumption for such an approach to work is that the model fully identifies the effect of the exposure on the outcome and negative control outcome, such that the size of the effect of the bias on the outcome can be determined once and differences in scale of the outcome and negative control have been taken into account. In the context of MR this is not a reasonable assumption, as this assumption would require the instrumentoutcome confounding to have exactly the same effect in the exposure, outcome and negative control outcome, and so if bias is detected, this method does provide a method to correct the estimated effect. However, the size and direction of the estimated effect on the negative control outcome could be used as an indicator for a sensitivity analysis which considered whether bias of up to, for example, five times that estimated by the negative control outcome would change the conclusions from the main MR analyses.
An extension to this method is to consider the use of similar phenotypes, considered as negative control outcomes here, as negative control exposures. Such an approach provides an obvious complement to the approach considered here; however, the assumptions required and implications of such an analysis are notably different from those for a negative control outcome study, and therefore we leave this as an area for future research.

Supplementary data
Supplementary data are available at IJE online.

Funding
This work was supported by the Integrative Epidemiology Unit which is funded by the University of Bristol and the Medical Research Council (MC_UU_00011/1). T.G.R. is a UKRI Innovation Research Fellow (MR/S003886/1).

Data availability
Data for all outcomes considered and 'tanning response to sun exposure' exposure are available as part of the R package 'TwoSampleMR'. Code for the negative control analyses conducted is available at [https://github.com/elea norsanderson/MR-negativecontrols].