Multi-Omics factor analysis disentangles heterogeneity in blood cancer

Multi-omic studies in large cohorts promise to characterize biological processes across molecular layers including genome, transcriptome, epigenome, proteome and perturbation phenotypes. However, methods for integrating multi-omic datasets in an unsupervised manner are lacking. We present Multi-Omics Factor Analysis (MOFA), a computational method for discovering the principal sources of variation in a multi-omics dataset. MOFA infers a set of (hidden) factors that capture biological and technical sources of variability across data modalities, thereby enabling a variety of downstream analyses, including factor annotation, data imputation and the detection of outlier samples. We applied MOFA to a study of 200 patient samples of chronic lymphocytic leukemia (CLL) profiled for somatic mutations, RNA expression, DNA methylation and ex-vivo responses to a panel of 63 drugs. MOFA discovered known dimensions of disease heterogeneity, including immunoglobulin heavy chain variable region (IGHV) status and trisomy of chromosome 12, as well as previously underappreciated drivers of variation, such as response to oxidative stress. These learnt factors capture key dimensions of inter-patient heterogeneity and enhance prediction accuracy of clinical outcomes.

learnt factors capture key dimensions of inter-patient heterogeneity and enhance prediction accuracy of clinical outcomes.

Introduction
Technological advances increasingly enable multiple biological layers to be probed in parallel, ranging from genome, epigenome, transcriptome, proteome and metabolome to phenome profiling 1 . Integrative analyses that use information across these data modalities promise to deliver more comprehensive insights into the biological systems under study. Motivated by this, multi-omic profiling is increasingly applied in biological domains including cancer genomics [2][3][4][5] , microbiology 6 and host-pathogen interactions 7 . Most recently, it has also become possible to perform multi-omics analyses in single cells [8][9][10][11] . A common aim of such applications is to understand the nature of any extant heterogeneity between samples, as manifested in one or several of the omic data types 12 .
Multi-omics approaches are particularly appealing if not all relevant axes of variation are known a priori, and hence may be missed by studies that consider a single data modality or by hypothesisdriven approaches.
A basic strategy for the integration of omics data is testing for marginal associations between different data modalities. A prominent example is QTL-analysis, where large numbers of marginal association tests are performed between genetic variants and gene expression levels or chromatin marks 13 . While eminently useful for variant annotation, such association lists do not provide a coherent global map of the molecular differences between samples. A second, multivariate strategy is to use kernel-or graph-based methods to combine different data types into a single similarity network between samples 14,15 ; however, it is difficult to identify the principal molecular drivers of heterogeneity from such networks. Finally, there exist generalizations of clustering methods to reconstruct discrete groups samples based on multiple data modalities 16,17 .
A key limitation of existing methods is lack of interpretability, in particular because the underlying factors that drive the variation are not explicitly reconstructed -a necessary step to establish associations between specific sources of variation and external data such as organismal phenotypes or (clinical) covariates. Additionally, available methods are hampered by computational scalability to larger datasets and commonly do not appropriately handle missing values and non-Gaussian data modalities, such as binary readouts or count-based traits.

Results and discussion
We present Multi-Omics Factor Analysis (MOFA), a statistical method for integrating multiple modalities of omic data in an unsupervised fashion. Intuitively, MOFA can be viewed as a versatile and statistically rigorous generalization of principal component analysis (PCA) to multi-omics data.
Given several data matrices with measurements of multiple 'omics data types on the same or on overlapping sets of samples, MOFA infers an interpretable low-dimensional data representation in terms of (hidden) factors. These learnt factors represent the driving sources of variation across data modalities, thus facilitating the identification of cellular states or disease subgroups.
Importantly, MOFA disentangles whether the underlying axes of heterogeneity are unique to a single data modality or are manifested in multiple modalities ( Fig. 1), thereby identifying links between the different 'omics. Once trained, the model output can be used for a range of downstream analyses, including the visualisation of samples in factor space, the automatic annotation of factors using (gene set) enrichment analysis, the identification of outliers (e.g. due to sample swaps) and the imputation of missing values (Fig. 1).
In particular, MOFA combines i) the inference of interpretable factors by using sparse representations of factor weights, ii) the regularization of the relevance of factors for individual data types, iii) scalable inference using variational Bayesian approximations to permit applications to larger datasets, iv) the support of non-Gaussian data modalities, including binary and count data and v) explicit modelling of missing data, including samples for which some, but not all data modalities were acquired. We used simulated data to systematically test the MOFA model and validate its handling of missing values and modelling of non-Gaussian data types (Supp. Fig S2-S5). We compared MOFA to other factor models, finding increased accuracy for identifying true drivers (Supp.

Application to Chronic Lymphocytic Leukaemia
We applied MOFA to a study of chronic lymphocytic leukaemia (CLL), which combined ex-vivo drug response measurements with somatic mutation status, transcriptome profiling and DNA methylation assays (Fig. 2a). Notably, nearly 40% of the 200 samples were profiled with some but not all 'omics types; such a missing value scenario is not uncommon in large cohort studies, and MOFA is designed to cope with it (Methods; Supp. Fig S2).

MOFA identified 10 factors (minimum explained variance 3% in at least one view; Methods).
These were robust to algorithm initialisation as well as subsampling of the data (Supp. Fig. S7,8).
The factors were largely orthogonal, capturing independent sources of variation (Supp. Fig. S9).
Among these, Factors 1 and 2 were active in most views, indicating broad roles in multiple molecular layers (Fig. 2b). However, other factors such as Factor 3 or Factor 5 were specific to two data modalities, and Factor 4 was active in a single data modality only. Cumulatively, the factors explained 41% of variation in the drug response data, 38% in the mRNA data, 24% in the DNA methylation data and and 24% in the mutation data.

MOFA identifies and refines known clinical markers in CLL
As part of the downstream pipeline, MOFA provides different strategies to use the loadings of the features on each factor in order to identify their etiology (Fig. 1). For example, based on the top weights in the mutation view, Factor 1 was aligned with the somatic mutation status of the immunoglobulin heavy-chain variable region gene (IGHV), while Factor 2 aligned with trisomy of chromosome 12 (Fig. 2c). Thus, MOFA correctly pinpointed the two most important clinical markers in CLL and identified them with two major axes of molecular disease heterogeneity, which were active at multiple 'omics layers 23,24 (Fig. 2d).
Factor 1 aligned largely with IGHV status, a surrogate of the differentiation state of the tumor's cell of origin and the level of activation of the B-cell receptor. While in clinical practice this axis of variation is generally considered binary 23 , our results indicate a more complex substructure ( Fig.   3a). At the current resolution, this factor is consistent with three subgroup models such as proposed by 25,26 (Supp. Fig. S10), although there is suggestive evidence for an underlying continuum. MOFA robustly connected this factor to multiple molecular layers (Supp. Fig. 11, 12), including changes in the expression of genes previously linked to IGHV status [27][28][29][30][31] (Fig. 3b,c) and with drugs that target kinases in or downstream of the B-cell receptor (Fig. 3d,e).
Taken together, Factor 1 captures a global cell state that is reflected in almost all molecular layers and represents the differentiation state of the cell of origin and reliance on B-cell receptor signalling, MOFA reveals a previously underappreciated axis of variation in CLL attributed to oxidative stress Despite their clinical importance, the IGHV and the trisomy 12 factors account for less than 20% of the variance explained by MOFA, suggesting the existence of other sources of variation. One example is Factor 5, which is active in the mRNA and drug response views. This factor tagged a set of genes that were highly enriched for oxidative stress and senescence pathways (Fig. 2e, Fig.   4a), with the top weights corresponding to heat shock proteins (HSPs) (Fig. 4b,c), a group of genes that are essential for protein folding and are up-regulated upon stress conditions 32,33 . The expression levels of HSPs are known to be elevated in some cancers and may contribute to prolonged tumour cell survival 34 , but so far have received little attention as biomarkers in CLL.
Consistent with the annotation based on the mRNA view, we observed that the drugs with the strongest weights on Factor 5 were associated with response to oxidative stress, such as target reactive oxygen species, DNA damage response and apoptosis (Fig. 4d,e). Overall, our results suggest that changes in oxidative stress levels are a heterogeneous feature of CLL and could underpin a functionally relevant phenotype.

MOFA identifies outlier samples and accurately imputes missing values
Next, we explored the relationship between inferred factors and clinical annotations, which can be missing, mis-annotated or inaccurate, since they are frequently based on single markers or imperfect surrogates 35 . As in clinical practice patients are divided into two major disease subgroups on the basis of IGHV status, we assessed the consistency between the inferred continuous Factor 1 and these groups. For 176 out of 200 patients, the MOFA factor was in agreement with the clinical IGHV status, and MOFA further allowed for classifying 12 patients that lacked clinically measured IGHV status (Fig. S13a,b). Interestingly, MOFA assigned 12 patients to a different group than suggested by their clinical IGHV label. Upon inspection of the underlying molecular data, nine of these cases showed intermediate molecular signatures, suggesting that they are borderline cases that are not appropriately captured by the binary classification; the remaining three cases were clearly discordant, highlighting that the binary IGHV status is not a perfect biomarker of CLL biology (Fig. S13c,d). Overall, this suggests that latent factors inferred from the multi-omic data can improve biology-based disease stratification.
As incomplete data is a common problem in studies that combine multiple high-throughput assays, we assessed the ability of MOFA to fill in missing values within assays as well as when entire data modalities are missing for some of the samples. For both imputation tasks, MOFA yielded more accurate predictions than established imputation strategies, including imputation by feature-wise mean, SoftImpute 36 and a k-nearest neighbour method 37 (Fig. S14, Fig. S15). These results demonstrate that MOFA can leverage information from multiple omics layers to accurately impute missing values from sparse profiling datasets.

Latent factors inferred by MOFA are predictive of clinical outcomes
Finally, we explored the potential of using the latent factors inferred by MOFA to predict clinical outcomes. Three of the 10 factors identified by MOFA were significantly associated with time to next treatment (Cox regression, Methods, P<10 -4 , Fig. 5a,b): the cell of origin related Factor 1, and two factors associated with TP53/del17p mutational status (Factor 7; Fig. S16) and whether the patient was treated with chemo-immunotherapy before sample collection (Factor 8) respectively.
We also assessed the prediction performance when combining the 10 MOFA factors in a multivariate Cox regression model (assessed using cross-validation, Methods, Fig. 5c). Notably, this model yielded higher prediction accuracy than models using factors derived from conventional PCA (Fig. 5c) or using the individual features (Fig. S17).