REGRESSION RECONSTRUCTION FROM A RETROSPECTIVE SAMPLE

The simplest form of retrospective study allows the reconstruction of the dependence between a binary outcome, Y, representing the contrast between cases and controls, and one or more explanatory variables. A different objective for such situations is considered, in which there are distinct explanatory variables, say (W, X), determining Y. Reconstruction of the originating distribution of (W, X) from the case-control data is considered for both continuous and binary variables. Emphasis is on the linear regression coefficient of W on X. That coefficient, but not the relevant intercept, shows considerable stability, as shown by theory and simulations. An approximation to the value of the coefficient not conditioning on Y is given.


Introduction
One rather general formulation of the challenge of interpreting observational studies is to suppose data available on a sample of individuals with three broad types of observation: outcomes, Y; explanatory variables, W; and background or intrinsic variables, X. The ultimate objective of study is usually the dependence of Y on W, allowing for the presence of X. An experiment or intervention would study this directly, often including an element of randomization of W. Observational studies are constrained in various ways, implying that the distributions generating the data may be only indirectly related to the distribution of interest.
In particular, in a case-control design inclusion in the data depends strongly on an outcome variable. In the present paper we suppose, somewhat unusually, that emphasis lies on the dependence among the explanatory variables, in particular that of W on X in the underlying population, marginalizing over Y. Reconstruction of the underlying dependence of interest is not direct and it has been pointed out (Nagelkerke et al., 1995; Lee et al., 1997; Jiang et al., 2006; Lin and Zeng, 2009; Wei et al., 2013; Xing et al., 2016) that some methods in the literature may be misleading.
An area where this issue often arises is genetic epidemiology. Many genome-wide association studies (GWAS) have a case-control design because their main aim was to discover associations of genetic variants with relatively rare outcomes, but often data on other variables are collected and the data are re-used to assess the associations with other variables in the case-control sample. Lin and Zeng (2009) demonstrated that some approaches commonly employed in applications give biased estimates of the association of interest and proposed methods to address this issue. Monsees et al. (2009) discussed the issues with a focus on GWAS and testing. Dai and Zhang (2014) studied the Mendelian randomisation estimator for the relationship of a continuous exposure with an outcome in a case-control study.
Because the population proportion of cases is in general unknown, corrections for weighted sampling based on unequal but known probabilities of selection (Horvitz and Thompson, 1952) are not available, except possibly as the basis for a sensitivity analysis.
The emphasis in the present note is on the formal relations involved, not on explicit details of estimation procedures.
In section 2 we give the formulation of the problem and theoretical relations involved for linear regression when W is continuous and in particular a formula for the reconstructed regression coefficient. In section 3 we present some results from a simulation study illustrating the theoretical findings.
Section 4 introduces the corresponding relationships when all variables are binary, and in section 5 the conclusions are discussed.

Theory
To study these issues in their simplest form we consider two random variables (X, W) whose population distribution is of interest. The case/control binary outcome Y depends on (X, W) and defines two random samples, conditionally on Y = 1 (cases) and on Y = 0 (controls) respectively (Figure 1). From these we wish to reconstruct the population distribution of (X, W). Our arguments are general but we focus on the linear regression coefficient, β_{WX}, of W on X. We treat both variables as one-dimensional; the results extend directly to vector (W, X).
An instructive but extreme special case arises when Y is conditionally independent of W given X. Then also W is conditionally independent of Y given X so that the form of the regression relation of W on X is the same within cases and within controls and within the population. This is concordant with the general notion that in fitting regression relations the explanatory variables are typically regarded as fixed at their observed values.
The joint distribution of (W, X) is, however, in general different in cases from that in controls. To estimate the linear least squares regression coefficient of W on X in this situation we may find the regression coefficients and their standard errors separately within cases and within controls and, preferably subject to an informal check of consistency, calculate a weighted mean.
More generally we suppose without loss of generality that E(W) = E(X) = 0 and also that

P(Y = 1 | W = w, X = x) = L(α + β_{YW.X} w + β_{YX.W} x),

where L(.) is an increasing function with values in (0, 1). The regression coefficients, such as β_{YW.X}, are defined for a given function L(.), so that if different such functions are involved in a specific study an extended notation would be required. Natural choices for L(.) are the standardized normal integral and the logistic function. Another important possibility, normally useful, however, only over a restricted range, is the linear in probability model,

P(Y = 1 | W = w, X = x) = α + β_{YW.X} w + β_{YX.W} x.    (1)

It is known that if the data are concentrated in a range of probabilities of, say, (0.2, 0.8), empirical choice between different 'dose-response' relations such as the logistic, integrated normal and linear is feasible only with very large amounts of data (Chambers and Cox, 1967). Under (1), marginally in the population, P(Y = 1) = α. For the cases, Y = 1, we have

f_{W,X|Y=1}(w, x) = f_{W,X}(w, x){α + β_{YW.X} w + β_{YX.W} x}/α.

To obtain results for Y = 0, the controls, we replace α by 1 − α and reverse the signs of the regression coefficients. Thus the conditional distribution of W given X is the same in cases, in controls and in the population if and only if β_{YW.X} = 0, consistently with the more general result noted previously. However the conditional mean of W under (1) is, on writing for the population E(W | X = x) = β_{WX} x and simplifying,

E(W | X = x, Y = 1) = β_{WX} x + β_{YW.X} σ²_{W.X}/(α + β_{YW.X} β_{WX} x + β_{YX.W} x),

where σ²_{W.X} is the conditional variance of W around its least squares regression on X.
Thus when all regression coefficients are positive the regression line of W on X among the cases is displaced upwards from its population form but, to a first approximation, has the same slope. The replacement of α by 1 − α for the controls implies that because α is typically small the distortion among the controls is much smaller.
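The displacement just described can be checked numerically. The following is a minimal sketch under illustrative assumptions: the parameter values (β_{WX} = 0.5, β_{YW.X} = β_{YX.W} = 0.03, α = 0.3) and the unit conditional variance are invented for the example and are not taken from the paper.

```python
import numpy as np

# Illustrative check of the linear-in-probability result: among cases the
# regression of W on X keeps (approximately) its slope but is displaced
# upwards by roughly beta_YW * sigma2_WX / alpha.  Parameters hypothetical.
rng = np.random.default_rng(1)
n = 400_000
beta_WX, beta_YW, beta_YX, alpha = 0.5, 0.03, 0.03, 0.3

x = rng.normal(size=n)
w = beta_WX * x + rng.normal(size=n)           # sigma^2_{W.X} = 1
p = alpha + beta_YW * w + beta_YX * x          # model (1)
p = np.clip(p, 0.0, 1.0)                       # guard against rare tail values
y = rng.random(n) < p                          # case indicator

# least squares of W on X within the cases
A = np.column_stack([np.ones(y.sum()), x[y]])
intercept, slope = np.linalg.lstsq(A, w[y], rcond=None)[0]

shift_theory = beta_YW * 1.0 / alpha           # first-order displacement
print(slope, intercept, shift_theory)
```

The fitted within-case slope stays close to β_{WX} while the intercept picks up approximately the predicted displacement β_{YW.X} σ²_{W.X}/α.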
In many applications, however, especially where some probabilities are quite small, the linear in probability model will not be reasonable. Indeed the most common reason for use of a case-control design is that cases are rare in the population, possibly very rare. In such situations the relation between X and W in the population will be close to that in the controls and the linearity of the assumed dependence is suspect. We give a more realistic formulation later.
A more detailed analysis of the linear in probability model shows that if the regression of W on X were studied directly ignoring case/control status then to a first approximation the slope would be unchanged but the position of the line displaced.
For a more refined analysis abandoning the linearity assumption, we assume (X, W) to have a bivariate normal distribution, taken without loss of generality to have zero means. The regression coefficient of W on X is again denoted by β_{WX}. We assume further that instead of (1)

P(Y = 1 | W = w, X = x) = Φ(α + β_{YW.X} w + β_{YX.W} x),

where Φ(.) is the standard normal cumulative distribution function. This leads, after integrating over the conditional distribution of W given X = x, and then over the distribution of X, to

P(Y = 1) = Φ(α/τ),    τ² = 1 + var(β_{YW.X} W + β_{YX.W} X).

Here

f_{W,X|Y=1}(w, x) = f_{W,X}(w, x) Φ(α + β_{YW.X} w + β_{YX.W} x)/Φ(α/τ),

with a complementary expression given Y = 0. If we assume that the population regression of W on X is linear with normal errors we may replace the first factor on the right-hand side by

f_X(x) φ{(w − β_{WX} x)/σ_{W.X}}/σ_{W.X},

where φ(.) is the standardized normal density; when β_{YW.X} = 0 the remaining factor does not involve w, implying that case/control status is independent of W given X. In general, if we standardize W and X to have zero means and unit variances, the regression coefficients involving Y are likely to be numerically small in realistic situations and expansion leads to

f_{W,X|Y=1}(w, x) ≈ f_{W,X}(w, x){1 + λ(α)(β_{YW.X} w + β_{YX.W} x)},

where λ(z) = φ(z)/Φ(z), related to the Mills ratio. For controls, Y = 0, reverse the signs of β_{YW.X} and β_{YX.W} and replace α by −α.
That is, to this order the impact of case-control sampling on the regression of W on X is to leave the slope unchanged but induce translations of the regression line in opposite directions for cases and controls.
In particular it follows that to this order for the cases

E(W | X = x, Y = 1) = β_{WX} x + λ(α) β_{YW.X} σ²_{W.X} + O(β_Y²),

where β_Y in the last term refers to all regression coefficients with Y as outcome variable. For the controls, again reverse the appropriate signs and replace α by −α.
Inclusion of quadratic terms in the β_Y shows that relatively complicated nonlinearities are involved. Typically cases are rare, so that α is large and negative and Φ(α) is small. There is then an upward displacement of the regression line for the cases but to this order no change in slope. The downward shift in the line for the controls is by contrast much smaller because the denominator of λ(−α), namely Φ(−α), is close to one whereas the numerator is usually small.
The conditional expectation of W given X = x among the cases now follows on multiplying by w and integrating:

E(W | X = x, Y = 1) = ∫ w φ{(w − β_{WX} x)/σ_{W.X}} Φ(α + β_{YW.X} w + β_{YX.W} x) dw / {σ_{W.X} P(Y = 1 | X = x)},

where φ(.) is the standardized normal density. For the controls, reverse the signs of α, β_{YW.X} and β_{YX.W}. The integral is best approximated by the delta method, that is, local linearization around the expected value of W, namely β_{WX} x, giving the approximation

E(W | X = x, Y = 1) ≈ β_{WX} x + β_{YW.X} σ²_{W.X} λ(α + β_{YW.X} β_{WX} x + β_{YX.W} x).    (2)

For the controls, Y = 0, change the signs of the arguments of Φ(.).
The second term in (2) specifies a nonlinear dependence on x. It is most simply summarized by its slope at x = 0, thus changing the linear regression coefficient among the cases to

β_{WX} + β_{YW.X} σ²_{W.X} (β_{YW.X} β_{WX} + β_{YX.W}) λ'(α),    (3)

where λ'(z) = −λ(z){z + λ(z)}. Now for negative z a convenient first approximation is that φ(z)/Φ(z) ≈ −z, in effect the leading term of an asymptotic expansion of Φ(z) for large negative z, underestimating the ratio. This leads to a simple approximation to the regression coefficient among the cases of

β_{WX} − β_{YW.X} (β_{YW.X} β_{WX} + β_{YX.W}) σ²_{W.X}.    (4)

For controls, however, a quite different approximation has to be used because the denominator of λ(.), now 1 − Φ(α/τ), the population proportion of controls, is no longer small. The corresponding second term is thus typically small and the regression coefficient in the controls thus only slightly different from that in the population.
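The downward distortion of the within-case slope under the probit form can be checked by simulation. The sketch below assumes illustrative parameter values (deliberately large coefficients involving Y, so the bias is visible) and σ²_{W.X} = 1; since the approximation rests on λ(z) ≈ −z, only rough agreement is expected.

```python
import numpy as np
from math import erf, sqrt

def Phi(z):
    """Standard normal cumulative distribution function via erf."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Illustrative check: under the probit model the within-case slope of W on X
# is roughly beta_WX - beta_YW * (beta_YW*beta_WX + beta_YX) * sigma2_WX.
# All parameter values are hypothetical.
rng = np.random.default_rng(2)
n = 400_000
beta_WX, beta_YW, beta_YX, alpha = 0.5, 0.3, 0.3, -1.5

x = rng.normal(size=n)
w = beta_WX * x + rng.normal(size=n)              # sigma^2_{W.X} = 1
p = np.vectorize(Phi)(alpha + beta_YW * w + beta_YX * x)
y = rng.random(n) < p

A = np.column_stack([np.ones(y.sum()), x[y]])
_, slope_cases = np.linalg.lstsq(A, w[y], rcond=None)[0]

approx = beta_WX - beta_YW * (beta_YW * beta_WX + beta_YX) * 1.0
print(slope_cases, approx)
```

The simulated within-case slope falls clearly below β_{WX}, in the direction and rough magnitude that the crude approximation predicts.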

Some simulations
To illustrate the problem and confirm the results of the previous section we carried out some simulations. We generated a 'population' sample with a given relationship between X and W and then selected a case-control sample within it, with case/control status depending on X and/or W. We selected a few plausible values for the relationships between the three variables involved and examined the resulting estimates of the regression of W on X under various types of analysis: in the full population; in the case-control sample, both ignoring and (incorrectly) conditioning on case/control status; using inverse probability weighting, assuming the proportion of cases known; and applying the proposed correction.
We generated 10^5 values of X ∼ N(0, 1) and W = β_{WX} X + Z, where Z ∼ N(0, 1). Case/control status Y was generated to be equal to 1 with probability L(α + β_{YW.X} W + β_{YX.W} X), and to 0 otherwise. When either β_{YW.X} or β_{YX.W} is zero and β_{WX} is zero, all regressions give the same estimate of β_{WX}. As β_{YW.X} or β_{YX.W} increases, the estimates of β_{WX} from the regression in cases only, in controls only, or in both conditioning on case/control status show a downward bias, the bias increasing with increasing β_{YW.X} and β_{YX.W}. The bias is larger when the proportion of cases in the population is larger. The estimates from the weighted regression have a similar pattern but smaller bias. The estimate from the case-control sample obtained without adjusting for case/control status, β̂_{WX}, has a small upward bias when the cases are rare in the population but becomes unbiased as the proportion of cases in the sample approaches the proportion of cases in the population. The reconstructed estimate has a small upward bias.
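The simulation design just described can be sketched as follows. The parameter values, the probit choice for L(.), and the sampling scheme (all cases plus an equal number of controls) are illustrative assumptions, not the settings of Table 1.

```python
import numpy as np
from math import erf, sqrt

def Phi(z):
    """Standard normal cumulative distribution function via erf."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def ols_slope(x, w, weights=None):
    """(Weighted) least squares slope of w on x, intercept included."""
    if weights is None:
        weights = np.ones_like(x)
    xm = np.average(x, weights=weights)
    wm = np.average(w, weights=weights)
    sxw = np.average((x - xm) * (w - wm), weights=weights)
    sxx = np.average((x - xm) ** 2, weights=weights)
    return sxw / sxx

rng = np.random.default_rng(3)
n = 100_000                                   # population size
beta_WX, beta_YW, beta_YX, alpha = 0.5, 0.5, 0.5, -2.0

x = rng.normal(size=n)
w = beta_WX * x + rng.normal(size=n)
y = rng.random(n) < np.vectorize(Phi)(alpha + beta_YW * w + beta_YX * x)

cases = np.flatnonzero(y)
controls = rng.choice(np.flatnonzero(~y), size=cases.size, replace=False)
s = np.concatenate([cases, controls])         # case-control sample

beta_pop = ols_slope(x, w)
beta_cases = ols_slope(x[cases], w[cases])
beta_controls = ols_slope(x[controls], w[controls])

# 'adjusting for Y': common slope with separate intercepts, equivalently the
# S_xx-weighted average of the within-group slopes
sxx_1 = np.var(x[cases]) * cases.size
sxx_0 = np.var(x[controls]) * controls.size
beta_adj = (beta_cases * sxx_1 + beta_controls * sxx_0) / (sxx_1 + sxx_0)

beta_unadj = ols_slope(x[s], w[s])            # ignoring case/control status

# inverse probability weighting: all cases kept, controls down-sampled
wts = np.where(y[s], 1.0, (n - cases.size) / controls.size)
beta_ipw = ols_slope(x[s], w[s], weights=wts)

print(beta_pop, beta_cases, beta_controls, beta_adj, beta_unadj, beta_ipw)
```

With these (exaggerated) coefficients the qualitative pattern of the text appears clearly: the estimates conditioning on case/control status are biased downwards, the unadjusted estimate is biased upwards, and the weighted estimate lies much closer to the population value.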
As a sensitivity analysis the simulation was repeated with X and Z having a t_10 distribution (Supplementary table). The resulting estimates differed slightly but not substantially from those in Table 1.

Table 1: Simulation results; continuous X and W. Here β̂^{pop}_{WX} is the estimated coefficient from the linear regression of W on X in the population sample; β_{YW.X} is the effect of W on Y given X; β_{YX.W} is the effect of X on Y given W; β̂_{WX.0} and β̂_{WX.1} are the estimated coefficients of the regression of W on X in controls and cases only, respectively; β̂_{WX.Y} is the estimated coefficient from the sample adjusting for Y; β̂_{WX} is the estimated coefficient from the sample ignoring Y; β̂^{IPW}_{WX} is the estimated coefficient from inverse probability weighted regression; β̂*_{WX} is the reconstructed estimate using (5); and L(α) is the proportion of cases in the population. Estimates are averages over 250 simulation runs.

Binary variables
We now describe an argument broadly parallel to that of the previous sections for the case where both variables are binary. Consider two binary variables X and W studied at the two levels of the case/control indicator Y. Then in the population, for i, j = 0, 1, we have the probabilities

p^{WX}_{ij} = π p_{ij|1} + (1 − π) p_{ij|0},

where p^{WX}_{ij} is the joint probability that W = i, X = j, the p_{ij|1} and p_{ij|0} are the corresponding conditional probabilities given Y = 1 (cases) and Y = 0 (controls), and π = P(Y = 1) is the population probability of an individual being a case. Thus any population property, such as, for example, the log odds ratio

ψ = log{(p^{WX}_{11} p^{WX}_{00})/(p^{WX}_{10} p^{WX}_{01})},

can be found in terms of quantities that can be estimated and the population parameter π. In particular, when π is small we have that, with error O(π²),

ψ = ψ_0 + π(Δp_{11}/p_{11|0} + Δp_{00}/p_{00|0} − Δp_{10}/p_{10|0} − Δp_{01}/p_{01|0}),

where Δp_{ij} = p_{ij|1} − p_{ij|0} and ψ_0 is the log odds ratio in the controls, Y = 0. Note that because probabilities sum to one, Δp_{11} + Δp_{00} + Δp_{10} + Δp_{01} = 0.
Thus the adjustment term, the coefficient of π, is particularly important when p_{11|1} − p_{11|0} and p_{00|1} − p_{00|0} have the same sign.
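The small-π expansion above is easily checked numerically; the cell probabilities below are invented for illustration.

```python
import math
import numpy as np

# Check of the small-pi expansion of the population log odds ratio psi
# from case and control cell probabilities.  Cell values are invented.
p1 = np.array([[0.2, 0.2],    # p_{ij|1}: rows W = i, columns X = j (cases)
               [0.2, 0.4]])
p0 = np.array([[0.4, 0.2],    # p_{ij|0}: controls
               [0.2, 0.2]])
pi_ = 0.01                    # population proportion of cases

def log_or(p):
    """Log odds ratio from a 2x2 table of probabilities."""
    return math.log(p[1, 1] * p[0, 0] / (p[1, 0] * p[0, 1]))

p = pi_ * p1 + (1 - pi_) * p0          # population cell probabilities
psi_exact = log_or(p)

d = p1 - p0                            # the Delta p_{ij}
psi_approx = log_or(p0) + pi_ * (d[1, 1] / p0[1, 1] + d[0, 0] / p0[0, 0]
                                 - d[1, 0] / p0[1, 0] - d[0, 1] / p0[0, 1])
print(psi_exact, psi_approx)
```

The exact and first-order values agree to O(π²), while the uncorrected controls-only log odds ratio ψ_0 misses the adjustment term.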
An alternative approach more closely linked to the results of Section 2 is to regard the binary variables as dichotomized versions of unobserved normally distributed random variables. The earlier results may then be adapted.

Discussion
We have discussed the relationships involved when fitting a regression on a pair of variables in a non-random sample, selected on the basis of a third variable. We have found that the regression coefficient is relatively stable under different types of analysis and have proposed a correction to reconstruct the coefficient that would have been obtained under no selection. The intercepts from the different analyses vary substantially, as expected.
The simulations show that the regression coefficient estimated conditioning on case/control status has substantial downward bias, whereas the coefficients estimated by ignoring case/control status or by using inverse probability weighting are closer to the value estimated from the population from which the case-control sample was drawn.
The account given here of sampling based on case/control status can be generalized in various ways, the most immediate of which is to replace the scalar explanatory variable X by a vector. When (X, W ) are both binary the discussion can be extended to include adjustment for other covariates, including continuous ones. Yet another possibility is to consider the use of instrumental variables to assess the 'causal' effect of X on W from case-control sampling. Dai and Zhang (2015) reported that an uncorrected estimate is unbiased in the null case but otherwise biased away from the null.
In the very special case when case/control status, Y , is independent of W given X, there are three sources of information about (W, X), within controls, within cases, and comparison of group means. First approximations to the first two, before adjustment for the special sampling procedure, are given, with their estimated variances, by standard linear regression formulae.
The third estimate, based on the comparison of means, is asymptotically independent of the other two and is β̂_B = (w̄_1 − w̄_0)/(x̄_1 − x̄_0). Its variance is best found conditionally on the values of X, as proportional to the variance of the difference of two independent means. The final estimate is a weighted mean with weights the reciprocals of the estimated variances. A more refined estimate of precision, allowing for errors in the estimated weights, has not been investigated.
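A minimal sketch of this combination, in the special case where Y is independent of W given X; the probit dependence of Y on X alone and all parameter values are illustrative assumptions.

```python
import numpy as np
from math import erf, sqrt

def Phi(z):
    """Standard normal cumulative distribution function via erf."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

rng = np.random.default_rng(4)
n = 50_000
beta_WX = 0.5
x = rng.normal(size=n)
w = beta_WX * x + rng.normal(size=n)                  # sigma^2_{W.X} = 1
y = rng.random(n) < np.vectorize(Phi)(-1.0 + 0.8 * x)  # Y depends on X only

ests, variances, resvar = [], [], []
for g in (False, True):                               # controls, then cases
    xg, wg = x[y == g], w[y == g]
    xc, wc = xg - xg.mean(), wg - wg.mean()
    sxx = (xc ** 2).sum()
    b = (xc * wc).sum() / sxx                         # within-group slope
    r = wc - b * xc
    ests.append(b)
    variances.append(r.var() / sxx)
    resvar.append(r.var())

# between-group estimate from the comparison of group means
x0, w0, x1, w1 = x[~y], w[~y], x[y], w[y]
dx = x1.mean() - x0.mean()
b_between = (w1.mean() - w0.mean()) / dx
v_between = np.mean(resvar) * (1.0 / x1.size + 1.0 / x0.size) / dx ** 2
ests.append(b_between)
variances.append(v_between)

# weighted mean with weights the reciprocals of the estimated variances
wts = 1.0 / np.array(variances)
b_combined = float(np.sum(wts * np.array(ests)) / wts.sum())
print(ests, b_combined)
```

Each of the three estimates is separately consistent for β_{WX} in this special case, and the inverse-variance weighted mean pools them.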
The investigation outlined here is one facet of the broad challenge of the study of dependences that can be investigated only indirectly.