Overcoming the impacts of two-step batch effect correction on gene expression estimation and inference

Summary Nonignorable technical variation is commonly observed across data from multiple experimental runs, platforms, or studies. These so-called batch effects can lead to difficulty in merging data from multiple sources, as they can severely bias the outcome of the analysis. Many groups have developed approaches for removing batch effects from data, usually by accommodating batch variables into the analysis (one-step correction) or by preprocessing the data prior to the formal or final analysis (two-step correction). One-step correction is often desirable due it its simplicity, but its flexibility is limited and it can be difficult to include batch variables uniformly when an analysis has multiple stages. Two-step correction allows for richer models of batch mean and variance. However, prior investigation has indicated that two-step correction can lead to incorrect statistical inference in downstream analysis. Generally speaking, two-step approaches introduce a correlation structure in the corrected data, which, if ignored, may lead to either exaggerated or diminished significance in downstream applications such as differential expression analysis. Here, we provide more intuitive and more formal evaluations of the impacts of two-step batch correction compared to existing literature. We demonstrate that the undesired impacts of two-step correction (exaggerated or diminished significance) depend on both the nature of the study design and the batch effects. We also provide strategies for overcoming these negative impacts in downstream analyses using the estimated correlation matrix of the corrected data. We compare the results of our proposed workflow with the results from other published one-step and two-step methods and show that our methods lead to more consistent false discovery controls and power of detection across a variety of batch effect scenarios. Software for our method is available through GitHub (https://github.com/jtleek/sva-devel) and will be available in future versions of the \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} $\texttt{sva}$\end{document} R package in the Bioconductor project (https://bioconductor.org/packages/release/bioc/html/sva.html).

Model (1.1) the first step of the two-step batch adjustment. In the second step of the two-step batch adjustment, the batch adjusted dataỸ g is obtained as: Y g = Y g − X 2β2g = (I − X 2 (X T 2 P X 2 ) −1 X T 2 P )Y g = (I − H 12 )Y g (1.2) 2

T. Li and others
where H 12 is: Therefore, the batch adjusted data has covariance matrix σ 2 It is noteworthy that X 1 should also include the all-ones vector 1 when there is an reference

The relationship between biological effect estimates and batch design
In this section, we will show the relationship between biological effect estimates and batch design.
Without loss of generality, the regression model (for each gene) is formulated as Y = α + X 1 β 1 + X 2 β 2 + , ∼ N (0, σ 2 I), where α is the background gene expression. X 1 represents the biological groups (assume there are two biological groups) and X 2 represents the batch design. Furthermore, we define the matrix X = [X 2 , X 1 ] and the matrix V = [1, X]. We also define sample variancê σ xx and covariancesσ xy as follows: Our goal is to derive the least square estimateβ 1 and its variance based on the sample variance-covariance matrix of X. It's known that the least square estimate has the matrix form Specifically, V T V is the following block matrix:

Correcting sample correlations and inferences for two-step batch adjustment
3 The inverse of V T V then becomes: Furthermore, the covariance matrix S XX is the following block matrix: where S 22 is the covariance matrix of X 2 , S 21 is the covariance matrix between X 2 and X 1 (b rows and 1 column; b is the number of batch indicators), and S 12 is just the transpose of S 21 .
The inverse of S XX is: Plugging in the above expression of S −1 XX into the block matrix in (2.7) will give the complete form of (V T V ) −1 , whose elements are all sample means or sample variances/covariances. To isolatê β 1 , only the last row of (V T V ) −1 is needed: β 1 is straightforward given the following expression of V T Y : The expression ofβ 1 is then:β (2.14) and its variance is: whereσ 2 is the estimated residual variance in regression.

The relationship between H 12 and S 12
Without loss of generality, we assume there is a reference batch and thus H 12 = X 2 (X T 2 (I − Following the regression framework in the previous section, we can derive the expression of (X T 0 X 0 ) −1 X T 0 X 2 . First, we have: and: ( 3.17) whereX 2 = [X 21 ,X 22 , . . . ,X 2b ], a 1 by b vector whose elements are means of the batch indicators in X 2 .
Taken together, we have the expression of (X T 0 X 0 ) −1 X T 0 X 2 as follows: (3.18)

Special case: balanced designs
For a balanced group-batch design, the elements in S 12 are all 0, which means: Based on (3.19), we can derive the expression of (I − X 0 (X T 0 X 0 ) −1 X T 0 )X 2 as follows: It is clear that (3.20) is just the centered version of X 2 (denoted as X c 2 ), and so we can express H 12 as: With H 12 in the form of (3.21), the correlation between two samples from two different batches in the matrix M is 1 nr where n r is the sample size of the reference batch. If a reference batch is not used, as is the case with most applications of ComBat, the correlations are not as straightforward, but can be derived using a similar procedure, and lead to the same conclusion (covariance only depends on the batch design).
To summarize, when the group-batch design is balanced, H 12 is solely a function of the batch design X 2 and has nothing to do with the group design X 1 . Therefore, removing batch effects in the first step won't result in the endogeneity issue for the second step. When the group-batch design is unbalanced, H 12 depends on both the batch design X 2 and the group design X 1 . The relationship between H 12 and S 12 (and thus X 1 ) can be derived by plugging the expression (3.18) into the expression of H 12 , which will not be detailed here. Most importantly, when the groupbatch design is unbalanced, the covariance vector S 12 will not be 0 and thus H 12 will depend on X 1 and the strength of such dependence is characterized by S 12 .

Additional analysis of the example 4
We ran a simulation based on the example 4: progressors versus non-progressors in tuberculosis and compared the performances of ComBat, ComBat+Cor, SVA and RUV. The results are presented in Table 1 and Figure 1.