High-dimensional genomic data bias correction and data integration using MANCIE

High-dimensional genomic data analysis is challenging because of noise and bias in high-throughput experiments. We present a computational method, matrix analysis and normalization by concordant information enhancement (MANCIE), for bias correction and data integration of distinct genomic profiles on the same samples. MANCIE uses a Bayesian-supported, principal component analysis-based approach to adjust the data so as to achieve better consistency between sample-wise distances in the different profiles. MANCIE can improve tissue-specific clustering in ENCODE data, prognostic prediction in Molecular Taxonomy of Breast Cancer International Consortium and The Cancer Genome Atlas data, and copy number and expression agreement in Cancer Cell Line Encyclopedia data, and it has broad applications in cross-platform, high-dimensional data integration.

If the rows in the associated matrix and the main matrix do not match, the summarization step converts the associated matrix to a summarized associated matrix with matched rows. The combination step integrates the main matrix with the summarized associated matrix into the adjusted matrix.

(b) Adjusted Rand index comparing K-means clustering on the data with the actual tissue-type grouping. K-means clustering was performed 1,000 times with random seeds. Blue: raw data; red: MANCIE-adjusted data; yellow: SVA-adjusted data. (c) Distribution of GC content of all reads for 61 ENCODE DNase-seq samples. (d) Samples whose DNase-seq read GC-content distributions are distinct from the majority are adjusted to a greater extent by MANCIE. Each dot represents a cell line sample, with the x- and y-axes representing the mean and coefficient of variation, respectively, of the GC-content distribution of all reads in the DNase-seq dataset. The size of the dot represents the magnitude of the MANCIE adjustment, measured by the Euclidean distance between the sample data vectors before and after adjustment.

Problem Description. Let $m_i = (m_{i1}, \cdots, m_{iK})$ be the $i$-th row (i.e., feature) of the main matrix $M$, and let $c_i = (c_{i1}, \cdots, c_{iK})$ be its counterpart in the associated matrix $C$, where each $k \in \{1, \cdots, K\}$ stands for one sample or condition. Since $(m_{ik}, c_{ik})^T$ are observations of feature $i$ from different biological experiments, which often contain considerable uncertainty, it is natural to assume that they are noisy versions of the underlying "truth" $(m^*_{ik}, c^*_{ik})^T$, i.e.,

$$(m_{ik}, c_{ik})^T = (m^*_{ik}, c^*_{ik})^T + \varepsilon_{ik},$$

where $\varepsilon_{ik}$ is a two-dimensional noise vector. MANCIE aims to remove noise in $m_i$ by borrowing information from $c_i$, i.e., inferring $m^*_i = (m^*_{i1}, \cdots, m^*_{iK})$ based on both $m_i$ and $c_i$.
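The additive-noise setup above can be illustrated with a small simulation for a single feature. This is a minimal sketch, not the MANCIE implementation: the function name `simulate_feature`, its parameters, and the choice of Gaussian signal and noise with independent noise components are assumptions made for illustration only.

```python
import numpy as np


def simulate_feature(K, rho, sigma_m, sigma_c, noise_sd_m, noise_sd_c, seed=0):
    """Draw K samples of one feature under an additive-noise model.

    Returns (m_true, c_true, m_obs, c_obs), each of shape (K,).
    Gaussian signal and noise are assumptions of this sketch.
    """
    rng = np.random.default_rng(seed)
    # Covariance of the underlying truth (m*, c*) with correlation rho.
    cov_true = np.array([
        [sigma_m ** 2, rho * sigma_m * sigma_c],
        [rho * sigma_m * sigma_c, sigma_c ** 2],
    ])
    truth = rng.multivariate_normal(np.zeros(2), cov_true, size=K)
    # Two-dimensional noise vector; its two components are drawn independently.
    noise = rng.normal(0.0, [noise_sd_m, noise_sd_c], size=(K, 2))
    observed = truth + noise
    return truth[:, 0], truth[:, 1], observed[:, 0], observed[:, 1]


# Example: the associated profile is noisier than the main profile.
m_true, c_true, m_obs, c_obs = simulate_feature(
    K=500, rho=0.8, sigma_m=1.0, sigma_c=1.0, noise_sd_m=0.5, noise_sd_c=1.0
)
print("cor(m*, c*)  :", np.corrcoef(m_true, c_true)[0, 1])
print("cor(m, c)    :", np.corrcoef(m_obs, c_obs)[0, 1])
print("MSE(m vs m*) :", np.mean((m_obs - m_true) ** 2))
```

In such a simulation the observed row `m_obs` deviates from `m_true`, while `c_obs` remains correlated with `m_true`; that shared signal is what makes borrowing information from the associated matrix possible.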

Statistical Model & Inference.
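To simplify the problem, let us assume that both $(m^*_{ik}, c^*_{ik})^T$ and $\varepsilon_{ik}$ follow bivariate normal distributions,

$$(m^*_{ik}, c^*_{ik})^T \sim N(\mu_i, \Sigma_i), \qquad \varepsilon_{ik} \sim N(0, \Delta_i),$$

$$\Sigma_i = \begin{pmatrix} \sigma_{im}^2 & \rho_i \sigma_{im} \sigma_{ic} \\ \rho_i \sigma_{im} \sigma_{ic} & \sigma_{ic}^2 \end{pmatrix}, \qquad \Delta_i = \begin{pmatrix} \delta_{im}^2 \sigma_{im}^2 & 0 \\ 0 & \delta_{ic}^2 \sigma_{ic}^2 \end{pmatrix},$$

that they are independent of each other, and that $\rho_i > 0$. Clearly, $\delta_{im}^2$ and $\delta_{ic}^2$ stand for the noise-to-signal ratios of $m_i$ and $c_i$, respectively, where a larger $\delta^2$ means lower data quality. Here, we assume that $\delta_{ic}^2 \ge \delta_{im}^2$, as the main matrix usually has better quality.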
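Under this model, we have

$$(m_{ik}, c_{ik})^T \sim N(\mu_i, \Sigma_i + \Delta_i),$$

and it is easy to check that

$$\mathrm{cor}(m_i, c_i) = \frac{\rho_i}{\sqrt{(1 + \delta_{im}^2)(1 + \delta_{ic}^2)}} \le \rho_i = \mathrm{cor}(m^*_i, c^*_i),$$

i.e., the correlation coefficient of the observed data $(m_i, c_i)$ is never larger than the true correlation coefficient $\mathrm{cor}(m^*_i, c^*_i)$, and the difference depends on the noise levels $(\delta_{im}^2, \delta_{ic}^2)$.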
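The attenuation of the observed correlation can be checked numerically. The sketch below assumes Gaussian signal and noise, unit signal variances, independent noise components whose variances equal the noise-to-signal ratios, and the closed form $\rho_i / \sqrt{(1 + \delta_{im}^2)(1 + \delta_{ic}^2)}$ that follows from those assumptions. The function names are hypothetical and are not part of MANCIE.

```python
import numpy as np


def observed_correlation(rho, delta2_m, delta2_c):
    """Closed-form correlation of the observed pair under the assumed model."""
    return rho / np.sqrt((1.0 + delta2_m) * (1.0 + delta2_c))


def empirical_observed_correlation(rho, delta2_m, delta2_c, K=200_000, seed=0):
    """Monte Carlo estimate of the same quantity from simulated samples."""
    rng = np.random.default_rng(seed)
    # Unit signal variances, so noise variances equal the noise-to-signal ratios.
    sigma = np.array([[1.0, rho], [rho, 1.0]])
    delta = np.diag([delta2_m, delta2_c])
    truth = rng.multivariate_normal(np.zeros(2), sigma, size=K)
    noise = rng.multivariate_normal(np.zeros(2), delta, size=K)
    obs = truth + noise
    return np.corrcoef(obs[:, 0], obs[:, 1])[0, 1]


rho, d2m, d2c = 0.8, 0.25, 1.0
print("closed form:", observed_correlation(rho, d2m, d2c))
print("simulated  :", empirical_observed_correlation(rho, d2m, d2c))
```

Both values should fall below the true correlation `rho`, and they shrink further as either noise-to-signal ratio grows.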
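Without loss of generality, we can also assume that $\mu_i = 0$ for every feature $i$ (i.e., the observed data are centered). Now, assume that both $\Sigma_i$ and $\Delta_i$ are known. By Bayes' rule, we have

$$(m^*_{ik}, c^*_{ik})^T \mid (m_{ik}, c_{ik})^T \sim N\!\left( \Sigma_i (\Sigma_i + \Delta_i)^{-1} (m_{ik}, c_{ik})^T,\ \Sigma_i - \Sigma_i (\Sigma_i + \Delta_i)^{-1} \Sigma_i \right),$$

which means that the best guess for the unknown $(m^*_{ik}, c^*_{ik})^T$ is the posterior mean

$$E\!\left[ (m^*_{ik}, c^*_{ik})^T \mid (m_{ik}, c_{ik})^T \right] = \Sigma_i (\Sigma_i + \Delta_i)^{-1} (m_{ik}, c_{ik})^T.$$

Since we are interested in improving the main matrix, we will only focus on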