Collider Scope: How selection bias can induce spurious associations

1. MRC Integrative Epidemiology Unit at the University of Bristol, Bristol, United Kingdom. 2. UK Centre for Tobacco and Alcohol Studies, School of Experimental Psychology, University of Bristol, Bristol, United Kingdom. 3. School of Social and Community Medicine, University of Bristol, Bristol, United Kingdom. 4. University of Queensland Diamantina Institute, Translational Research Institute, Brisbane, Queensland 4102, Australia.


Introduction
Understanding the impact of genetic and environmental factors on physical and mental health outcomes is critical if we are to develop effective preventive and treatment interventions. Large-scale cross-sectional and cohort studies provide an invaluable resource to support these efforts, in particular with respect to genetic influences, where the small effects associated with common genetic variants require very large samples to achieve adequate statistical power. However, achieving these very large sample sizes in population-based studies may come at the cost of representativeness: people who volunteer to take part in studies may not be representative of the general population (1).
While some studies may be relatively representative at inception, through rigorous efforts to ensure representative recruitment (e.g., birth cohort studies), as they mature the likelihood is that attrition from the study will be non-random, so that the cohort becomes less representative of the general population as time goes on.
There is already clear evidence from existing large-scale population studies that they are subject to a degree of selection bias. For example, higher genetic risk scores for schizophrenia are consistently associated with non-completion of questionnaires by study mothers and children, as well as non-attendance at data collection clinics, in the Avon Longitudinal Study of Parents and Children (ALSPAC) (2) (see Box 1).
Attrition from cohort studies may result in biased estimates of socioeconomic inequalities, and the degree of bias may worsen as participation rates decrease (3).
However, it is often argued that representativeness is not necessary in studies of this kind (4-8), although this is not universally accepted (9). In particular, for genetic variants, where conventional confounding is low (10), it has been argued, even by those concerned about selection bias, that any problems associated with a lack of representativeness may be modest (9, 11). Here we ask: What is the impact of selection bias on the results obtained from these studies?
Insert Box 1 about here.

Collider Bias
It is widely acknowledged that selection bias will distort prevalence estimates.
This can be clearly seen in differences between participants in the original ALSPAC sample and those that attended later clinics (see Box 1), as well as in the UK Biobank study relative to the general population (see Box 2). However, it is often assumed that whilst selection bias will have a strong effect on representativeness and prevalence estimates, it should not have a strong impact on observed associations (4). This overlooks the fact that selection bias can in turn induce collider bias (see Figure 1), which can lead to spurious observational and genetic associations.
Insert Figure 1 and Box 2 about here.
Collider bias occurs when two variables (X and Y) independently cause a third variable (Z). In this situation, Z is a collider, and statistical adjustment for Z will bias the estimated causal effect of X (exposure) on Y (outcome) (see Figure 2). Statistical adjustment of the XY association for a variable Z is equivalent to observing this association in a sub-population where all individuals share the same value of Z (1, 12). Hence if both X and Y cause participation in a study (Z), then investigating associations in the selected sample (i.e., with Z = 1, indicating participation) is equivalent to conditioning on Z, which in turn may induce collider bias.
Insert Figure 2 about here.
Put simply, statistical control is not equivalent to experimental control (1), and so sample selection can induce spurious associations between variables that influence participation or retention in a study, when no such association exists in the wider general population from which the sample is drawn. Alternatively, if two variables are correlated in the wider population, and both cause selection, then the estimated correlation in the selected sample may be biased. Moreover, this selection bias will apply to the genetic correlates (or other ancestors) of these variables, unless the phenotypes are also controlled for. So if genes Gx and Gy cause X (exposure) and Y (outcome) respectively, then in the selected sample Gx will appear to be associated with Y (unless X is also controlled for). More complex situations can also give rise to collider bias, such as when the outcome (Y) does not directly cause selection into the study (i.e., it is a downstream consequence of something else that is causing selection into the study). However, it is necessary that the exposure (X) either directly or indirectly (such as in the situation described above) causes selection into the study.
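The mechanism described above can be illustrated with a minimal simulation sketch. It assumes two independent standard Normal variables and a logistic selection model in which each raises the odds of participation by the same odds ratio. The function name, the default odds ratio of 1.8 and the intercept of 0 (roughly half the population selected) are illustrative assumptions, not values taken from any particular study.

```python
import numpy as np


def collider_demo(n=400_000, odds_ratio=1.8, intercept=0.0, seed=0):
    """Correlation of X and Y before and after selection on both.

    X and Y are independent standard Normal variables. Selection is
    Bernoulli with log-odds = intercept + log(odds_ratio) * (X + Y).
    Returns (r_population, r_selected, selected_fraction).
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    y = rng.standard_normal(n)  # generated independently of x

    # Both variables raise the odds of selection (hypothetical effect sizes).
    log_odds = intercept + np.log(odds_ratio) * (x + y)
    p_select = 1.0 / (1.0 + np.exp(-log_odds))
    selected = rng.random(n) < p_select

    r_population = np.corrcoef(x, y)[0, 1]
    r_selected = np.corrcoef(x[selected], y[selected])[0, 1]
    return r_population, r_selected, selected.mean()
```

Calling `collider_demo()` returns a correlation close to zero in the full population and a negative correlation among the selected, even though X and Y were generated independently. Under this logistic selection model the induced association shrinks as selection becomes rarer, so the intercept matters as much as the odds ratio.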
In other words, traits that are entirely unrelated in the general population may appear to be correlated in selected samples, if both traits influence participation (and therefore contribute to selection), as a result of implicitly conditioning on their common effect (1, 13). There are exceptions to this depending on the distribution of the outcome and the parametric analysis model used. For example, if the outcome (Y) is a binary phenotype, and logistic regression is used, then the odds ratio for the association between the SNP and outcome may be unbiased even when the outcome causes selection (14).
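A short derivation sketches why this exception can arise. It rests on a simplifying assumption made here for illustration: selection S depends on the binary outcome Y alone, with selection probabilities \pi_1 and \pi_0 for Y = 1 and Y = 0, and G denotes the SNP genotype. This is the familiar case-control argument, and it does not cover settings in which the genotype or its phenotype also influences selection.

```latex
% Bayes' rule within the selected sample (S = 1), assuming
% P(S = 1 | Y = y, G = g) = \pi_y does not depend on g:
\[
P(Y = y \mid G = g, S = 1)
  = \frac{\pi_y \, P(Y = y \mid G = g)}{P(S = 1 \mid G = g)}
\]
% The denominator cancels in the odds:
\[
\frac{P(Y = 1 \mid G = g, S = 1)}{P(Y = 0 \mid G = g, S = 1)}
  = \frac{\pi_1}{\pi_0} \cdot
    \frac{P(Y = 1 \mid G = g)}{P(Y = 0 \mid G = g)}
\]
% The factor \pi_1 / \pi_0 is the same for every genotype, so it cancels
% in the odds ratio comparing genotype g + 1 with genotype g:
\[
\mathrm{OR}_{S = 1}
  = \frac{\mathrm{odds}(Y = 1 \mid G = g + 1, S = 1)}
         {\mathrm{odds}(Y = 1 \mid G = g, S = 1)}
  = \frac{\mathrm{odds}(Y = 1 \mid G = g + 1)}
         {\mathrm{odds}(Y = 1 \mid G = g)}
  = \mathrm{OR}_{\text{population}}
\]
```

In a logistic regression this corresponds to a shift of \log(\pi_1 / \pi_0) in the intercept, with the genotype coefficient left unchanged.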
We have previously argued that these effects may be greater in case-control studies than prospective studies, and that since genetic associations have been similar across study designs, the impact of selection bias may in fact be modest (11).
We have also previously argued that because conventional confounding is typically low for single genetic variants, problems of selection bias will be less in this context (9). However, given the rapid growth in studies using data from highly selected samples such as UK Biobank, and the use of genetic scores rather than single genetic variants, we revisited this question, and used simulation to explore the potential impact of even relatively weak effects on participation. Given empirical evidence of selection in cross-sectional and cohort studies, what is the potential impact of this on observed phenotypic and genotypic associations?

Simulations
We simulated data on an allele score, a phenotype and an outcome, where both the phenotype and outcome influence selection into the study, but there was no association between the allele score and the outcome in the underlying population (see Figure 2). The simulation scenario was based on the UK Biobank. All variables were Normally distributed, with standard deviation of 1, and the sample size of the underlying complete population was 9,000,000. We assumed that phenotype and outcome had independent effects (i.e., no interaction on the additive scale) on the odds of selection into the sample, and for convenience we set these effects to be equal, and examined a weak association (OR of 1.2 for missingness for a 1 SD increase in phenotype/outcome) and two stronger associations (ORs of 1.5 and 1.8).
These odds ratios are similar to estimates of the likelihood of participation in UK Biobank for individuals with any educational or vocational qualifications and for nonsmokers, respectively (see Box 2), and indicate a difference in mean phenotype/outcome of 0.2 SD, 0.4 SD and 0.6 SD between those participating and those not participating. We varied the correlation between the allele score and the phenotype (between r = 0.05 and r = 0.30) to simulate genetic instruments explaining between 0.25% and 9% of the variance in phenotypes. These values are in the typical range for the association between common genetic variants, or polygenic risk scores comprising multiple common variants, and complex phenotypes. For example, the rs16969968 variant accounts for approximately 1% of the phenotypic variance in cigarette consumption (15), while the polygenic risk score for height captures approximately 9% of phenotypic variance (16). We controlled the baseline risk of selection into the sample, resulting in a selected sample of approximately 500,000 people. The analysis was an unadjusted regression of outcome on allele score (i.e., not adjusting for the phenotype). In the whole population, the regression coefficient for outcome on allele score is zero, and the confidence interval contains zero 95% of the time. We simulated each scenario 100 times.
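The design described above can be approximated with the following sketch, which is a scaled-down illustration rather than the code used for the reported results. Several details are assumptions: the odds ratios are applied to selection (not missingness), the intercept is calibrated by bisection to hit the target selected fraction, the default population is 900,000 rather than 9,000,000, and the function names are hypothetical.

```python
import numpy as np


def calibrate_intercept(linear_part, target_fraction, iters=50):
    """Bisection for a such that mean(expit(a + linear_part)) ~ target_fraction."""
    lo, hi = -20.0, 20.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        frac = np.mean(1.0 / (1.0 + np.exp(-(mid + linear_part))))
        if frac < target_fraction:
            lo = mid  # too few selected: raise the intercept
        else:
            hi = mid
    return 0.5 * (lo + hi)


def run_scenario(r, odds_ratio, n_pop=900_000,
                 target_fraction=500_000 / 9_000_000, n_reps=20, seed=0):
    """Return (mean slope, share of 95% CIs containing zero, mean selected fraction)."""
    rng = np.random.default_rng(seed)
    slopes, covers, fractions = [], [], []
    for _ in range(n_reps):
        score = rng.standard_normal(n_pop)
        # Phenotype has unit variance and correlation r with the allele score.
        phenotype = r * score + np.sqrt(1.0 - r**2) * rng.standard_normal(n_pop)
        outcome = rng.standard_normal(n_pop)  # null: unrelated to score and phenotype

        linear_part = np.log(odds_ratio) * (phenotype + outcome)
        a = calibrate_intercept(linear_part, target_fraction)
        p_select = 1.0 / (1.0 + np.exp(-(a + linear_part)))
        sel = rng.random(n_pop) < p_select

        # Unadjusted least-squares regression of outcome on allele score.
        g = score[sel] - score[sel].mean()
        y = outcome[sel] - outcome[sel].mean()
        slope = (g @ y) / (g @ g)
        resid = y - slope * g
        se = np.sqrt((resid @ resid) / (g.size - 2) / (g @ g))

        slopes.append(slope)
        covers.append(abs(slope) <= 1.96 * se)
        fractions.append(sel.mean())
    return float(np.mean(slopes)), float(np.mean(covers)), float(np.mean(fractions))
```

For example, `run_scenario(r=0.1, odds_ratio=1.5)` runs one of the moderate scenarios at reduced scale. Because the sample is smaller than in the full design, standard errors are wider and coverage figures from this sketch should not be expected to match Table 1.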
The results of this simulation are shown in Table 1, and indicate that the effects of selection bias are strongest for stronger independent selection effects, and also where the allele score is more strongly associated with the phenotype. However, even for moderate associations between missingness and both phenotype and outcome (OR = 1.5 for both) and between allele score and phenotype (r = 0.1, 1% of variance explained by the allele score), the confidence interval contains zero only 89% of the time. This coverage continues to decrease as the associations of phenotype and outcome with missingness strengthen, and as the association between allele score and phenotype strengthens.
Insert Table 1 about here.

Conclusions
Our results indicate the potential for unrepresentative samples to generate biased and potentially misleading estimates of both phenotypic and genotypic associations. In particular, when polygenic scores associated with a phenotype that combine many genetic variants are used, association between the phenotype and participation will cause the score to be more strongly related to participation than each individual variant is. This, in turn, can potentially lead to serious bias. For this reason, studies using polygenic scores, genome-wide allelic scores (17), and/or whole-genome genetic correlations (18,19) in highly unrepresentative studies are most at risk of producing biased and potentially misleading results.
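The point about polygenic scores can be made concrete with a simple worked case. The assumptions are ours, for illustration only: m mutually independent, standardised variants G_1, ..., G_m, each with the same small correlation \rho with the standardised phenotype X, combined into an equally weighted, standardised allele score A.

```latex
\[
A = \frac{1}{\sqrt{m}} \sum_{j=1}^{m} G_j ,
\qquad
\operatorname{Var}(A) = \frac{1}{m} \sum_{j=1}^{m} \operatorname{Var}(G_j) = 1
\]
\[
\operatorname{Corr}(A, X) = \operatorname{Cov}(A, X)
  = \frac{1}{\sqrt{m}} \sum_{j=1}^{m} \operatorname{Cov}(G_j, X)
  = \sqrt{m}\,\rho
\]
% Illustrative numbers: m = 100 variants with \rho = 0.03 each
% (0.09\% of variance per variant) give
\[
\operatorname{Corr}(A, X) = \sqrt{100} \times 0.03 = 0.30 ,
\qquad
\operatorname{Corr}(A, X)^2 = 9\%
\]
```

Since the score reaches participation only through the phenotype in this setup, its association with participation scales in the same way, which is why a score can be far more exposed to selection bias than any single variant.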
The magnitude of effects we observed in our simulations, based on credible estimates of associations between a phenotype or outcome and missingness, and between a polygenic score and a phenotype, is comparable with many reported associations derived from large but unrepresentative samples, such as between personality and cognitive function, and a range of physical and mental health outcomes (20,21), and between chronotype (i.e., "morningness") and years of education (22). An appreciation of the potential impact of selection bias may also resolve inconsistencies in the literature, and help to explain apparently paradoxical findings. For example, genetic correlations between cognitive ability and a range of psychiatric disorders have been reported to differ in childhood and older age (23).
One possible interpretation is that this is due to age-dependent pleiotropy, but another is that this is an artefact of different selection bias pressures at different ages.
An example serves to illustrate this. Polygenic risk scores that maximally capture schizophrenia liability are associated with increased psychotic experiences in ALSPAC participants, but scores that use more stringent thresholds for including genetic variants are associated with reduced psychotic experiences (24). Since missing data are likely to be greater for participants who report psychotic experiences, as well as for those at higher genetic risk of a psychotic disorder, psychotic experiences may be relatively under-represented in participants with higher genetic risk, compared to those with lower genetic risk (24).
A related issue is the use of case-control studies to examine associations with "secondary" outcomes, that is, phenotypes other than the case/control outcome (25,26). In such studies, the association between genotype and secondary phenotype will be biased if both genotype and secondary phenotype are associated with case-control status. Case-control studies condition on case-control status, and thus again collider bias can bias the association between genotype and secondary phenotype. Various methods have been proposed to overcome this bias, including maximum likelihood and inverse probability weighting. This latter method requires some knowledge about the prevalence of case/control status in the underlying population, or the assumption that the disease is rare (25, 26).
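A minimal sketch of inverse probability weighting in this secondary-phenotype setting follows. It is an illustration under stated assumptions, not the specific estimators of references 25 and 26: the genotype score and secondary phenotype are independent in the population, both raise disease risk through a logistic model with hypothetical coefficients, all cases and an equal number of controls are sampled, and the control sampling fraction is treated as known.

```python
import numpy as np


def weighted_slope(predictor, response, weights):
    """Weighted least-squares slope of response on predictor."""
    pm = np.average(predictor, weights=weights)
    rm = np.average(response, weights=weights)
    num = np.sum(weights * (predictor - pm) * (response - rm))
    den = np.sum(weights * (predictor - pm) ** 2)
    return num / den


def case_control_demo(n_pop=500_000, seed=0):
    """Return (naive slope, IPW slope) of secondary phenotype on genotype score."""
    rng = np.random.default_rng(seed)
    g = rng.standard_normal(n_pop)  # genotype score
    x = rng.standard_normal(n_pop)  # secondary phenotype, independent of g

    # Disease status depends on both (hypothetical coefficients).
    p_disease = 1.0 / (1.0 + np.exp(-(-3.0 + 0.7 * g + 0.7 * x)))
    d = rng.random(n_pop) < p_disease

    # Case-control sample: every case plus an equal number of random controls.
    cases = np.flatnonzero(d)
    controls = rng.choice(np.flatnonzero(~d), size=cases.size, replace=False)
    idx = np.concatenate([cases, controls])

    # Inverse sampling fractions: cases are all sampled (weight 1);
    # each sampled control stands in for several population controls.
    control_weight = (n_pop - cases.size) / controls.size
    w = np.where(d[idx], 1.0, control_weight)

    naive = weighted_slope(g[idx], x[idx], np.ones(idx.size))
    ipw = weighted_slope(g[idx], x[idx], w)
    return naive, ipw
```

In this setup the unweighted slope in the pooled case-control sample is positive, because cases have higher mean values of both variables, while the weighted slope should be close to the population value of zero. The weights here use the true sampling fractions, which in practice would have to come from external knowledge of disease prevalence.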
We have discussed one important way in which selection into or out of a study can induce collider bias and spurious associations. There are other ways in which ascertainment can generate biases (27). For example, Figure 3 (panel B) shows a situation in which entry into a study is conditional upon the value of the phenotype (but not the outcome of interest) and where the phenotype does not cause the outcome, but the phenotype and outcome are correlated in unselected samples (i.e., due to genetic and/or environmental factors U). In this situation, collider bias occurs because conditioning on selection induces an association between SNPs related to the phenotype and the polygenic and/or environmental factors that influence the outcome. Therefore SNPs that cause the phenotype only [...]

[...] year mortality than the UK population as a whole (34). Clearly, agreeing to take part in UK Biobank study is associated with a number of characteristics that will reflect, for example, health status and social position. If non-smoking and having qualifications are both causally related to participation in UK Biobank, we would expect the association between smoking and having qualifications to be positively biased (i.e., we would expect to see a more positive association between genetic variants positively associated with smoking and whether participants had educational qualifications in UK Biobank than in the true population). The problem is possibly compounded in genetic studies using the first release of genomewide association data in UK Biobank, which used two genotyping arrays, one of which was applied to a nested case-control study of smoking and lung function (UK BiLEVE) (35). The first release genetic data are therefore further subject to selection bias relative to UK Biobank as a whole (although this will no longer be the case when the full release of genomewide association data becomes available).

In the entire population there is no association between allele score and outcome. Selection into the study (either through voluntary participation at baseline, or attrition over time) induces an association between allele score and outcome (collider bias).
[Diagram nodes: Phenotype, Outcome, Allele score, Selection into Study]

Figure 3. Scenarios where selection bias would occur.
A. In truth the SNP is not causally associated with the outcome; selection will induce an association (which could be positive or negative).
B. In truth the SNP is not causally associated with the outcome; selection will induce an association (which could be positive or negative).
C. In truth the SNP is causally associated with the outcome; selection could make this larger or attenuate it.
D. In truth the SNP is causally associated with the outcome; selection could make this larger or attenuate it.