Model-based joint visualization of multiple compositional omics datasets

Abstract The integration of multiple omics datasets measured on the same samples is a challenging task: data come from heterogeneous sources and vary in signal quality. In addition, some omics data are inherently compositional, e.g. sequence count data. Most integrative methods are limited in their ability to handle covariates, missing values, compositional structure and heteroscedasticity. In this article we introduce a flexible model-based approach to data integration, COMBI, to address these current limitations. We combine concepts such as compositional biplots and log-ratio link functions with latent variable models, and propose an attractive visualization through multiplots to improve interpretation. Using real data examples and simulations, we illustrate and compare our method with other data integration techniques. Our algorithm is available in the R-package combi.

1 The data integration model

1.1 Model specification
We specify the following statistical model:

g_x(E(X|Z)) = U_x + ZΓ
g_y(E(Y|Z)) = U_y + ZΘ    (1)

whereby g_x and g_y are appropriate link functions. U_x and U_y are offset matrices correcting for differences in baseline expression/abundance and sequencing depth, defining the "independence model" (see next section). Z (n×M) is a low-dimensional matrix of sample scores on latent variables (M = 2 or 3) [1]. Γ (M×p) and Θ (M×q) are view-wise parameter matrices.
Note that model (1) is not a matrix decomposition of X and Y. Instead, it can be regarded as a low-dimensional approximation of the expectation matrices of the saturated models E(X) = X and E(Y) = Y, without calculating the entire decomposition.

Independence model
The first step of the fitting procedure is to estimate the independence models, i.e. the models describing a homogeneous sample composition. These independence models defining the offset matrices are of the form

g_x(E(X)) = d_x 1_p^T + 1_n e_x^T
g_y(E(Y)) = d_y 1_q^T + 1_n e_y^T

whereby d_x and d_y are vectors of length n that quantify total sample abundance/expression, e.g. sequencing depth or array intensity. e_x and e_y are vectors of length p and q that quantify baseline feature means. They correct for baseline differences; independence models are marginal models. For compositional data, the restriction Σ_{j=1}^p g_x^{-1}(e_xj) = 1 is imposed.
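As a toy illustration of such an independence fit, the following sketch uses a closed-form solution for a log link based on row and column sums (all variable names are illustrative; combi itself estimates these offsets iteratively):

```python
import numpy as np

# Toy sequence count matrix: 4 samples x 3 features
X = np.array([[10, 20, 70],
              [ 5, 10, 35],
              [20, 40, 140],
              [ 1,  2,  7]], dtype=float)

# Sample effects d_i capture sequencing depth; feature effects e_j
# capture baseline relative abundance, with sum_j exp(e_j) = 1.
d = np.log(X.sum(axis=1))                # log library sizes
p_hat = X.sum(axis=0) / X.sum()          # pooled feature proportions
e = np.log(p_hat)                        # baseline feature means

# Independence (marginal) model expectation: E(X_ij) = exp(d_i + e_j)
mu = np.exp(d[:, None] + e[None, :])
```

In this toy matrix every sample has the identical composition (0.1, 0.2, 0.7), so the independence model fits the data exactly; any real dataset would show departures from `mu` that the latent variables then pick up.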

Conditioning on baseline covariates
A next, optional step is to condition on known confounding variables such as batch or research center. Although simple in our regression framework, this is an important advantage over decomposition-based methods. The confounding variables need not be identical for all views. Calling the design matrices of the two sets of (potentially overlapping) confounding variables R and S, the mean models can be extended such that:

g_x(E(X|R)) = U_x + RΦ
g_y(E(Y|S)) = U_y + SΞ

In case of compositional data, one restriction is needed to guarantee a solution. In this case we use the additive log-ratio (alr) transform [2] as link function, which is defined with respect to the first feature as:

alr(x)_j = log(x_j / x_1)

This effectively sets the parameter of the first feature of a view to zero for all confounders (φ_.1 = 0).
Another problem occurs for discrete confounders with features that only have zero observations in one of the groups defined by these confounders. One could filter out all these features, but this leads to significant data loss. An alternative solution is offered by bias-reduced estimates [3,4,5]. Instead of correcting the bias of the maximum likelihood estimates (which are infinite under the scenario above), they reduce the bias by directly modifying the estimating equations. This allows for the estimation of the confounder parameters under this scenario. If the (quasi-)score equation for a mean parameter η is given by s_η, then the bias-reduced (quasi-)score equation is obtained by adding a correction term ξ_i = h_i f''_i / (2 f'_i) to every observation's score contribution, with h_i the diagonal elements of the hat matrix R(R^T R)^{-1} R^T, f'_i = dμ_i/dg_x(μ_i) and f''_i = d²μ_i/dg_x(μ_i)². For the log link, f'_i = (d log(μ_i)/dμ_i)^{-1} = μ_i and f''_i = μ_i, such that ξ_i = h_i/2. In practice, these systems of estimating equations are still very hard to solve, because of the near singularity of the Jacobian matrices.
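The effect of the adjustment can be sketched for the simplest case: a Poisson log-linear model with a binary group indicator where one group contains only zeros, so plain maximum likelihood would push the group coefficient to minus infinity. This is a minimal numerical sketch (not the combi implementation; a fixed number of Fisher scoring steps stands in for a proper convergence check):

```python
import numpy as np

def firth_poisson(X, y, n_iter=50):
    """Bias-reduced Poisson regression with log link (a sketch):
    the score X^T (y - mu) is replaced by X^T (y - mu + h/2),
    with h the leverages of the weighted design."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        W = np.diag(mu)
        XtWX = X.T @ W @ X
        # diagonal of the hat matrix W^(1/2) X (X^T W X)^-1 X^T W^(1/2)
        H = np.sqrt(W) @ X @ np.linalg.solve(XtWX, X.T) @ np.sqrt(W)
        h = np.diag(H)
        score = X.T @ (y - mu + h / 2)              # bias-reduced score
        beta = beta + np.linalg.solve(XtWX, score)  # Fisher scoring step
    return beta

# Second group of samples has only zero counts for this feature
X = np.column_stack([np.ones(8), np.repeat([0.0, 1.0], 4)])
y = np.array([3, 5, 4, 6, 0, 0, 0, 0], dtype=float)
beta = firth_poisson(X, y)  # finite estimates despite the all-zero group
```

Solving the adjusted score for this design gives finite fitted group means: 4.625 for the non-zero group (mean 4.5 plus h/2 = 1/8) and 0.125 for the all-zero group, instead of a degenerate zero.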

Restrictions
Model (1) is overspecified, as both the latent variables and the coefficients are unknown and need to be estimated from the data. Hence restrictions need to be applied to guarantee an identifiable model. The columns of Z are restricted to be orthogonal: Z^T Z = diag(ψ), with diag(ψ) the diagonal matrix holding the non-negative entries of the vector ψ on its diagonal. The coefficient matrices are restricted to be orthonormal: ΓΩ_x Γ^T = ΘΩ_y Θ^T = I_M, with Ω_x and Ω_y view-specific, diagonal weight matrices and I_M the identity matrix of dimension M. The choice of these weights follows Hawinkel et al. [6]: all samples are considered equally likely to be drawn from the population and receive equal weights in the restrictions. For the features, more abundant features are considered to be more likely to be drawn from the population and receive weights (on the diagonal of Ω) proportional to their abundance under the independence model g^{-1}(e).
Note that because the model is overspecified, classical inference based on (quasi-)likelihood does not hold. No confidence intervals or p-values can be calculated. The model is intended for data exploration only and relies solely on the point estimates.
Technical note: Centering and orthogonalization restrictions are directly imposed in the optimization procedure through Lagrange multipliers. This is crucial to avoid overflow, i.e. certain numbers becoming too large for the computer to store. Normalization restrictions can be imposed post hoc. This does not cause numerical problems, and the iterative algorithm will not stop until they are fulfilled. This approach is faster, as the initial optimization problem is simpler. Even if the estimating equations were not solved in some iteration, the centering and orthogonalization will afterwards still be enforced through Gram-Schmidt orthogonalization to speed up convergence. Finally, some restriction is needed to render the estimation under compositionality with the centered log-ratio transform feasible, but the centering restrictions already conveniently fulfill this role.

Model estimation
In summary, the fitting algorithm consists of the following steps:
1. Estimate the view-wise independence models by iterating between the estimation of the row and column offsets and possible nuisance parameters (e.g. standard deviations for microarray data, abundance-variance trends for sequence count data).
2. (Optional) Estimate feature parameters for confounding variables, and condition on them by including their contribution to the mean matrix in the offset. This step also occurs independently for each view.
3. Iterate between estimating the latent variables, feature parameters and possible nuisance parameters. When convergence for one dimension is achieved, incorporate this dimension in the offset, and estimate the next dimension conditional on all previous ones.
The iterative procedures in steps 1 and 3 continue until convergence. Convergence is declared when the L2 norm of the change in all estimates drops below a certain tolerance (here 1e-4), e.g. for the latent variables:

||Z^new_.m − Z^old_.m||_2 < 1e-4

with Z^new_.m the current estimates and Z^old_.m the estimates of the previous iteration. It may be prudent to graphically check for convergence. An example of such a convergence plot is shown in Figure S16.
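The stopping rule can be sketched as follows (a minimal illustration; the exact normalization used in combi may differ):

```python
import numpy as np

def converged(Z_new, Z_old, tol=1e-4):
    """Declare convergence when the norm of the change in the
    estimates drops below the tolerance."""
    return bool(np.linalg.norm(Z_new - Z_old) < tol)

# Example: a tiny change between iterations triggers convergence
Z_old = np.zeros((5, 2))
Z_new = np.full((5, 2), 1e-6)
```

The same criterion is applied to each block of parameters in turn, and the overall iteration stops only once all blocks satisfy it.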

Starting values
Iterative algorithms converge much faster when provided with reasonable starting values. For the independence model, simple row and column sums can be used. Starting values for latent variables and feature coefficients can be obtained from the following singular value decomposition. With offset matrices g_x^{-1}(U_x) and g_y^{-1}(U_y), obtain standardized residual matrices, e.g. (Y − g_y^{-1}(U_y)) diag(σ^indep)^{-1} for microarray data, where σ^indep are the column-wise standard deviations under the independence model; see Section 1.2.3 for the denominator in the sequence count case. These residual matrices are concatenated by row into one large matrix D, for which the singular value decomposition D = GΣH^T is then obtained. The first M columns of GΣ are then used as starting values for Z, and the first M columns of H as starting values for the corresponding feature parameters. For a constrained analysis, redundancy analysis [7] on the matrix D and the design matrix of baseline sample variables c is used to obtain starting values for the environmental gradients and feature parameters.
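The SVD-based starting values can be sketched as follows (random matrices stand in for the standardized residuals; the views are joined along the shared sample dimension):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two toy views measured on the same n = 20 samples
n, p, q, M = 20, 30, 15, 2
Dx = rng.standard_normal((n, p))   # standardized residuals, view X
Dy = rng.standard_normal((n, q))   # standardized residuals, view Y

# Concatenate into one large matrix and take its SVD: D = G Sigma H^T
D = np.hstack([Dx, Dy])
G, sing, Ht = np.linalg.svd(D, full_matrices=False)

Z_start = G[:, :M] * sing[:M]      # starting sample scores (G Sigma)
coef_start = Ht[:M, :].T           # starting feature parameters (H)
Gx_start, Gy_start = coef_start[:p], coef_start[p:]
```

Because the columns of G are orthonormal, the starting sample scores are automatically orthogonal, matching the identifiability restrictions of the model.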

Solving estimating equations for compositional data
The Newton-Raphson algorithm used to solve the estimating equations may encounter local extrema and saddle points when applied to compositional data. Therefore, if the Newton-Raphson algorithm does not converge when the old parameter estimates are used as starting values, random standard normal variates are repeatedly used as starting values instead, until the estimating equations are solved. This is a brute-force solution, but it works in most cases. Such problems in high-dimensional estimation are very intractable; more theoretical research is needed into a more general solution in the compositional framework.
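The restart strategy can be sketched on a one-dimensional estimating equation (a toy cubic for which plain Newton-Raphson famously cycles between 0 and 1 when started at 0; all function names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def newton(f, fprime, x0, tol=1e-10, n_iter=100):
    """Plain Newton-Raphson on a scalar estimating equation f(x) = 0."""
    x = x0
    for _ in range(n_iter):
        step = f(x) / fprime(x)
        x = x - step
        if abs(step) < tol:
            return x
    return None  # failed to converge

def solve_with_restarts(f, fprime, x0, n_restarts=50):
    """Retry from random standard normal starting values until solved."""
    sol = newton(f, fprime, x0)
    while sol is None and n_restarts > 0:
        sol = newton(f, fprime, rng.standard_normal())
        n_restarts -= 1
    return sol

def f(x):       # Newton cycles 0 -> 1 -> 0 -> ... from x0 = 0
    return x**3 - 2 * x + 2

def fprime(x):
    return 3 * x**2 - 2

root = solve_with_restarts(f, fprime, x0=0.0)
```

Starting from the previous estimate (here 0) fails, but one of the random restarts lands in the basin of the real root near -1.77 and the equation is solved.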

The abundance-variance trend
The abundance-variance trend models the relationship between log(π_j) and log Var(X_j|Z_.1,m) through a smooth function a_m. The fit is performed on the log scale, since this spreads the observations nicely and avoids undue influence of extreme values. It allows one to predict Var(X_ij|Z_.1,m) as v_m(π_ij)s_i = exp(a_m[log(π_ij)])s_i. This approach shares information between features to reliably estimate variances. The smooth function a_m is updated throughout the iterative procedure and is unique to every dimension. Note that this approach may imply that for some observations X_ij and X_i'j', exp(a_m[log(π_ij)])s_i ≠ exp(a_m[log(π_i'j')])s_i' even though E(X_ij|Z_.1,m) = E(X_i'j'|Z_.1,m).
Anders et al. [8] use a local regression as a smooth function, but this may yield a smoother with irregular derivatives. This destabilizes the Newton-Raphson algorithm used to solve the estimating equations. Instead, we use a natural smoothing spline, which has continuous first and second order derivatives. At the cost of being slightly more rigid, it yields a much more stable algorithm.
Great care should be taken when extrapolating this abundance-variance trend to small relative abundances. This will be necessary, as the modelling process may yield small relative abundances for some features in some samples. It is known that for small means, sequence count data approximately follow the Poisson distribution [9,10], whose variance equals its mean. Hence, as a heuristic, for small abundances it is assumed that Var(X_ij|Z_.1,m) = π_ij s_i. The smooth function is then constrained to have slope 1 and to coincide with the diagonal line below some changeover value no smaller than min_j{log(π_j)} − 1; this value is selected as the one that minimizes the squared error of the resulting fit. This complicated solution is motivated by the need to keep the derivatives continuous for numerical stability.
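The prediction rule, with its Poisson floor for small abundances, can be sketched as follows (a toy trend and a simple polynomial fit stand in for the natural smoothing spline; the changeover value is fixed rather than optimized):

```python
import numpy as np

# Toy abundance-variance trend on the log scale: Poisson variance
# plus extra overdispersion at larger relative abundances
log_pi = np.linspace(-8, -2, 40)                 # log relative abundances
log_var = log_pi + np.log1p(np.exp(log_pi + 6))  # toy trend
a_m = np.polyfit(log_pi, log_var, deg=2)         # quadratic stand-in for the spline

def predict_var(pi, s_i, changeover=np.exp(-7)):
    """Predicted Var(X_ij) = exp(a_m[log(pi)]) * s_i, falling back on
    the Poisson heuristic Var = pi * s_i below the changeover value."""
    v = np.exp(np.polyval(a_m, np.log(pi))) * s_i
    return np.where(pi < changeover, pi * s_i, v)
```

Below the changeover the predicted variance follows the diagonal exactly, so extrapolation to unobserved small abundances never produces variance estimates below the Poisson minimum.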

Microarray data
For modelling microarray data, we mainly follow the tracks of the popular limma package [11]. The array data are log-transformed, and then modelled using a simple linear model with identity link. The estimates of the feature-wise variances are shrunken towards a common value using an empirical Bayes procedure [12]. The estimating equations are then weighted least-squares equations with weights 1/σ²_j,EB, with Y the microarray data matrix and σ²_j,EB the empirical Bayes estimate of the variance for feature j.

Latent variable estimation
The estimating equations for the latent variables are obtained by summing the estimating equations of all different views for every sample. If desired, different weights can be allotted to the different datasets in this way, but we use equal weights by default. If all weight is allotted to a single dataset, this reduces to a single-view problem, as in e.g. the RCM package [6].
One line of reasoning is to weigh the elements of the estimating equations for the latent variables inversely by the number of features in the view. Otherwise, views with many features might have a very strong impact on the estimation of the latent variables, without there being a biological rationale why they should contain more information. Another argument is that datasets with more features carry more information and may deserve more weight in the estimation. Finally, it may be that the dataset with the clearest signal takes preponderance. An answer to these questions is given below in Section 1.3.

Influence measures
The impact of each of the views on the estimation of the latent variables or environmental gradient components can be obtained through influence functions [13]. Influence functions reflect the influence a certain observation has on a parameter estimate, keeping the other types of parameters fixed. Because of the iterative algorithm this latter assumption is incorrect, but the influence functions may still harbour interesting information.
For maximum likelihood estimation, the influence function χ(η|f, x) of a parameter η for a distribution f and data x is defined as the score function evaluated in the observation, premultiplied by the inverse of the Jacobian matrix of the estimating equations. If an observation has a positive influence on a parameter, it means that it tries to "pull its value up". In other words, if the observation were not there, the parameter estimate would be lower. As the orientation of the final graph is of no importance, it is often sufficient to look at absolute influences.
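The "pulls its value up" intuition can be checked numerically on the simplest possible estimator, the sample mean, whose influence function is x − μ (a toy illustration, not COMBI's actual parameters):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 10.0])
mean_hat = x.mean()

# Empirical influence of each observation on the mean: IF_i = x_i - mean
infl = x - mean_hat

# Leave-one-out means confirm the interpretation: removing an
# observation with positive influence lowers the estimate
loo_means = np.array([np.delete(x, i).mean() for i in range(len(x))])
```

For the sample mean the relation is exact: the leave-one-out change equals the influence divided by n − 1, so the outlier 10.0 (positive influence) is precisely the observation whose removal drops the estimate the most.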
As an illustration, a dataset with three views was generated using the negative binomial distribution. The numbers of features are 100, 100 and 1000, respectively. The signal strength (i.e. the fold changes) is the same in all datasets. The first and third datasets have similar levels of overdispersion; the second dataset has high levels of overdispersion. We call these datasets the "regular", "noisy" and "large" datasets. In Figures S2-S4 it is demonstrated that the signal-to-noise ratio of each view drives its influence, rather than the number of features. The noisy dataset has the least influence, whereas the influence of the regular and large datasets is comparable.

Remarks about compositional data analysis
The use of the centered log-ratio transform does not guarantee subcompositional coherence for the integration model. Subcompositional coherence means that the conclusions for features j and j' do not change when a third feature j'' is omitted from the analysis (e.g. filtered out) [14]. However, when a taxon is omitted, the geometric mean of the composition changes, and thus also the outcome of the ordination. Also, because of the iterative nature of the procedure, omitting taxon j'' will change the estimates of the latent variables. This will in turn change the estimates of the feature parameters j and j', so that the procedure is not subcompositionally coherent. For the same reasons, classical CoDa biplots of log-ratio transformed data are not subcompositionally coherent either.

Figure S4: Boxplots of overall absolute influences of the views on the estimation of the latent variables for the synthetic dataset of Figure S1. It is clear that the features from the noisy dataset have less influence.
Neither is our method scale invariant, as is sometimes stated as a requirement for compositional data analysis [14,15]. Scale invariance means that the analysis should not depend on the total size of the composition, e.g. the library sizes in case of sequence count data. While it is true that the conclusions are only drawn on the proportions, the sampling variability of the proportions depends on the library size. A sample with 1,000,000 reads carries much more information than a sample with only 10,000 reads, even when the compositions are identical. Hence analysis of heteroscedastic count data should never be fully scale invariant. Ignoring the mean-variance trend in sequence count data leads to technical artefacts in CoDa biplots (see Hawinkel et al. [16], Supplementary material Section 3.4.1), as is also demonstrated below in Section 5.1.

2 Visualization

Multiplots
The resulting data integration can be visualized in a multiplot as follows: 1) Build an orthogonal axis system with equally scaled axes in 2 (or 3) dimensions. Plot the first dimensions of Z as dots. As the dimensions of Z are orthogonal but not normalized, the distances between the samples reflect the dissimilarities between the samples over all different views. 2) Add labels at the locations defined by Γ and Θ, with arbitrary scaling. As the components are orthogonal, the biplot principle holds [17], and the orthogonal projection of e.g. the vector from the origin to γ_j onto the vector from the origin to Z_i is proportional to the departure from independence of feature j in sample i for the dimensions plotted. Moreover, when the inner product γ_j^T θ_l is large, this indicates that features j from view X and l from view Y are similarly associated with the latent variables Z and are thus correlated. As the feature parameters are also normalized, distances between feature parameter locations cannot be interpreted (it is a so-called form or sample multiplot). To avoid overplotting, it may be necessary to limit the features plotted to the ones with the largest norms (furthest away from the origin), i.e. thresholding.
3) In case of constrained ordination, the components of Λ can be added to the plot, as arrows or as labels, again with arbitrary scaling. The projection of these variable vectors λ_k onto γ_j reveals how sensitive feature j is to changes in variable k. Also, the larger the component λ_k, the more important variable k is in driving the variability over the different views.
The scaling in steps 2 and 3 is usually done such that all coordinates have the same order of magnitude as the sample location, to aid interpretability. Only the relative length of the projections is meaningful.
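The projection arithmetic behind steps 1-3 can be sketched with toy numbers (all values illustrative; only signs and relative magnitudes matter):

```python
import numpy as np

# Toy sample scores and feature parameters for two views (M = 2)
Z = np.array([[1.5, 0.2],     # sample 1
              [-1.0, 0.5]])   # sample 2
gamma_j = np.array([0.8, 0.1])  # feature j from view X
theta_l = np.array([0.7, 0.0])  # feature l from view Y

# Biplot principle: the inner product Z_i . gamma_j is proportional to
# the departure from independence of feature j in sample i
dep = Z @ gamma_j

# Cross-view: a large inner product gamma_j . theta_l indicates the two
# features respond similarly to the latent variables, i.e. are correlated
cross = gamma_j @ theta_l
```

Here sample 1 projects positively onto feature j (above-baseline departure) and sample 2 negatively, while the positive cross-view inner product flags features j and l as co-varying.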
When some of the views consist of compositional data, the interpretation of the plot is complicated. For compositional data, a positive feature parameter γ_mj does not guarantee that feature j is positively associated with the latent variable of dimension m. Neither does clr^{-1}(γ_m)_j > 1/p guarantee this, as wrongly suggested by Xia et al. [18]. The impact of the feature parameter on the mean of a feature j depends on the values of the other feature parameters of that view, as well as on the value of the latent variable. In an extreme case, for z_im → ∞, the composition collapses into a point mass of 1 at the taxon j with the highest γ_mj.
Once the model is fitted, the values of the latent variables and feature parameters are of course known. One could thus simply check, for those features with the largest loadings that would be plotted, that the expected abundance does in fact vary monotonically with the latent variable within the observed range of latent variable values. Unfortunately, in practice this monotonicity fails for many features. An explanation of how to interpret these biplots under this curse of compositionality is given below.

Compositional multiplots
Compositional biplots have been introduced by Aitchison et al. [19] for log-transformed data. The interpretation is the same though in our case of inverse log-transformed parameters, and is less intuitive than for a regular biplot. In a regular biplot, each combination of sample and feature labels is interpretable. In a compositional setting, a feature label can never be interpreted by itself, but should always be interpreted relatively to some other features. Here we discuss the interpretation with respect to 1) all other features of the same view, 2) one other feature of the same view and 3) a feature from another view.

Interpretation with respect to all other features
The interpretation with respect to all other features is the comparison with the geometric mean (gm) of all proportions:

gm(π) = (∏_{j=1}^p π_j)^{1/p}

The gm behaves similarly to the Shannon index [20] in the sense that it can be seen as a measure of evenness. For a perfectly even species composition (π_1 = π_2 = . . . = π_p = 1/p), the gm equals 1/p. As one feature becomes more and more abundant (one π_j → 1), the gm approaches 0.
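The evenness behaviour of the geometric mean is easy to verify numerically (toy compositions):

```python
import numpy as np

def gm(p):
    """Geometric mean of a composition (vector of proportions)."""
    return np.exp(np.mean(np.log(p)))

p = 4
even = np.full(p, 1 / p)                     # perfectly even composition
uneven = np.array([0.97, 0.01, 0.01, 0.01])  # one dominant feature
```

For the even composition the gm equals 1/p = 0.25 exactly, while the dominated composition drives it toward 0, mirroring how the Shannon index penalizes unevenness.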
In our model, the log-ratio of the proportion of one feature a to the gm of the proportions in sample i is a linear function of the latent variables:

log(π_ia / gm(π_i)) = e_a + Z_i^T γ_a

The biplot can be interpreted as a regular biplot in function of this log-ratio: the larger the projection Z_i^T γ_a becomes, the more this log-ratio departs from the independence model for feature a in sample i. Loosely speaking, the larger Z_i^T γ_a becomes, the more dominant feature a becomes. How the proportion π_ia evolves as a function of Z_i depends on the numerical values of e as well as γ. We can make this clear as follows (dropping sample subscripts, and looking at one dimension), knowing that:

π_a = exp(e_a + Zγ_a) / Σ_{j=1}^p exp(e_j + Zγ_j)

Hence, taking the logarithm,

log(π_a) = e_a + Zγ_a − log Σ_{j=1}^p exp(e_j + Zγ_j)

This takes the shape of the regular log-linear model. To know how this proportion evolves with the latent variable, we take the derivative with respect to Z:

d log(π_a)/dZ = γ_a − Σ_{j=1}^p γ_j π_j

The orthonormality restriction guarantees that Σ_{j=1}^p γ_j clr^{-1}(e)_j = Σ_{j=1}^p γ_j π_j^indep = 0 for every dimension (see Section 1.1.3). Hence we can also write

d log(π_a)/dZ = γ_a − Σ_{j=1}^p γ_j (π_j − π_j^indep)

For small departures from independence the second term drops, but for realistic datasets this is not the case. In practice, the second term can even be larger than γ_a in absolute value, upsetting the monotonicity of π_a with γ_a. This formula cannot easily be simplified further; the interpretation will have to account for the compositionality. This potential pitfall in interpreting centered log-ratios is illustrated in Figure 2 in the main text.
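The loss of monotonicity is easy to demonstrate numerically: in a toy three-feature composition, a feature with a positive loading can still end up with a negative derivative once another feature starts to dominate (all numbers illustrative):

```python
import numpy as np

# Compositional model pi = softmax(e + Z * gamma), one dimension
e = np.array([0.0, 0.0, 0.0])
gamma = np.array([0.5, 2.0, -2.5])   # feature 0 has a positive loading

def pi(Z):
    eta = e + Z * gamma
    return np.exp(eta) / np.exp(eta).sum()

def dlogpi_dZ(Z, a=0):
    # d log(pi_a)/dZ = gamma_a - sum_j gamma_j * pi_j(Z)
    return gamma[a] - gamma @ pi(Z)
```

Near Z = 0 the derivative for feature 0 is positive (+0.5), but for larger Z feature 1 dominates the composition and the derivative turns negative: π_0 is not monotone in the latent variable despite γ_0 > 0.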

Interpretation with respect to other features in the same view
As the interpretation with respect to "the rest of the features" (represented by the geometric mean) is so problematic, it may be easier to compare just two features. We look at the log-ratio between the relative abundances of two features π_a and π_b in sample i. According to the model:

log(π_ia / π_ib) = log(π_a^indep / π_b^indep) + Z_i^T (γ_a − γ_b)

with π_j^indep = clr^{-1}(e)_j the proportion of feature j under the independence model. Note that we have eliminated gm(π) from the expression. The expression on the left-hand side has the form of a log odds ratio as in logistic regression, but with the difference that π_ia/π_ib and π_a^indep/π_b^indep are not genuine odds.
The difference (γ_a − γ_b) between the vectors is known as the link in a biplot, i.e. the straight line connecting the points defined by γ_a and γ_b. It is small when the labels γ_a and γ_b lie on approximately the same side of the origin and at the same distance from it (γ_a ≈ γ_b). In that case, the ratio of the relative abundances π_ia/π_ib will not differ much from that under the independence model in any sample. In a compositional setting, a stable ratio means that the features are strongly correlated [14].
In case this link is large, the projection of the latent variable vector Z_i. onto the link (i.e. Z_i^T(γ_a − γ_b)) indicates how much and in which direction the ratio π_ia/π_ib differs from that under the independence model [19]. Note that this implies that feature labels lying at the same side of the origin but at different distances from it can still represent features whose ratio varies strongly over the samples.

Interpretation between features of different views
The interpretation between features of different compositional views, or between a feature from a compositional view and one from a non-compositional view, is even more difficult. Of course, the interpretation with respect to the centered log-ratio is always valid, but it is not intuitive. If the labels of feature a in view 1 and feature b in view 2 lie at the same side of the origin, their centered log-ratio transforms are correlated. This means that, moving along this direction, both features become "more dominant" in their own views, although this need not imply that their abundances also increase. Features from two non-compositional views are correlated if they lie on the same side of the origin.

Real data examples
In this section the data integration plots of real data are shown for the integrations that were not shown in the main paper.

HMP2 data
The Human Microbiome Project 2 (HMP2), or integrative HMP (iHMP), aims to investigate the relationship between the microbiome and host responses. It extends the original, cross-sectional HMP by also including longitudinal samples. Here we focus on the datasets in "The Inflammatory Bowel Disease Multi'omics Database" (IBDMDB), which contains healthy and IBD patients (both forms of IBD, Crohn's disease (CD) and ulcerative colitis (UC), are included); see the project website. A total of 90 subjects were profiled for one year. The HMP2 dataset contains many different types of omics data, from which we selected the following.
The microbiome composition of the stool was assessed through sequencing of the 16S rRNA gene. The proteome was measured in fecal, nasal and blood samples. Proteins were separated by liquid chromatography and then identified using mass spectrometry, which yields counts of proteins. The proteins were then classified biochemically (EC) or phylogenetically (KO); we use the latter convention here. Proteomics data are also to be considered compositional [21]. The composition of the virome of the stool was measured by sequencing marker genes, as for the microbiome data. Data integration multiplots of these datasets can be found in Figures S5-S10.

Figure S8: Data integration plot of microbiome and proteome data from the HMP2 project. Coloured dots represent patients, labels represent features of microbiome (blue) and proteome (green). The red dashed line shows the link between taxa Unc86145 and Unc04y3m, the orange line its projection onto the CD sample vector on the bottom. Despite the two taxa lying at the same side of the origin, the projection onto the sample vector is not zero, and hence the ratio Unc86145/Unc04y3m is larger in this sample than in the average sample.

Zhang data
This study investigated the effect of one or three pulsed antibiotic treatments (1 and 3 PAT) on the onset of type I diabetes in mice [22]. Many views were measured, including gut microbiome, metagenomics, metabolic pathways and intestinal immunity pathways. Microbiome composition was determined through 16S sequencing.
Intestinal immunity pathways were measured using NanoString; this is basically expression profiling, but focused on a subset of genes involved in immunity. The original publication focused on the effect of the PAT on all different views, without attempting to integrate them.

Figure S14: Sample ordination of Zhang microbiome and immunological data by MOFA. Samples are coloured by overall sample sums; sample shapes reflect treatment group.

Gavin data
This is an observational study on T1D onset in humans. The microbiome composition was measured, as well as the human and microbial proteome from the gut.

Figure S17: Constrained integration of Gavin microbiome and human and microbial proteomics data, for the second and third dimensions. Blue labels represent taxa, red labels microbial proteins and green labels human proteins. Black labels represent components of the environmental gradient. The second dimension reveals how IGHA1 is more abundant in new onset patients than in seropositive patients, as was also found by the original authors. The third dimension mainly distinguishes healthy controls from seronegative patients.

Methods comparisons
'Data integration' is a very broad concept, and we do not intend to give an exhaustive overview of all published methods for the integration of genomics data here. Instead, we focus on existing methods that provide at least either sample or feature scores, such that they are (partially) comparable with our method.

Principal components analysis and correspondence analysis
Principal components analysis (PCA) can be applied after concatenating all datasets. This is probably not preferable, but it provides a good benchmark [23], as it yields sample scores as well as feature loadings. Correspondence analysis [24] can also be applied to concatenated matrices, but without log-ratio transform.

Canonical correlation analysis
Canonical correlation analysis (CCA) finds orthogonal pairs of linear combinations of features in X and Y with maximal correlation [25]. Sparse canonical correlation analysis (sCCA) tries to increase the interpretability by imposing sparsity on the loadings [26]. CCA does not yield unique sample scores.

Partial least squares
Partial least squares (PLS) is similar to CCA, but it finds linear combinations of variables with maximal covariance rather than correlation [27] (it might be called canonical covariance analysis). In our case we implement the symmetric version, i.e. we treat the matrices X and Y equally. A sparse version of PLS (sPLS) has also been proposed [28,29]. PLS does not yield unique sample scores.
PCA, CCA and PLS can be applied to the raw data, or to data transformed through the centered log-ratio (clr) transform. For count data, the zero counts are then first imputed using the cmultRepl() function in the zCompositions package [30].
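The preprocessing pipeline for these competitors can be sketched as follows (the 0.5 pseudo-count is a crude stand-in for the Bayesian-multiplicative replacement performed by cmultRepl(); all variable names are illustrative):

```python
import numpy as np

def clr(x):
    """Centered log-ratio transform of strictly positive compositions
    (rows): log proportions centered by their row-wise mean."""
    logx = np.log(x)
    return logx - logx.mean(axis=-1, keepdims=True)

# Toy counts with a zero that must be imputed before taking logs
counts = np.array([[10.0, 0.0, 30.0],
                   [ 5.0, 5.0, 10.0]])
pseudo = np.where(counts == 0, 0.5, counts)        # naive zero replacement
comp = pseudo / pseudo.sum(axis=1, keepdims=True)  # close to proportions
transformed = clr(comp)                            # input for PCA/CCA/PLS
```

Each clr-transformed row sums to zero by construction, which is precisely the singularity that motivates the weight restrictions discussed in Section 1.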

MOFA
The MOFA model employs the same mean model as our data integration method [1]. However, there are no orthogonality restrictions, and hence no biplots can be made. The parameters are estimated in a Bayesian framework, which has the advantage of natively dealing with missing values. For count data, only the Poisson model is implemented. In practice, this model almost never converged on the datasets we presented.

JIVE
JIVE is a matrix decomposition method that decomposes standardized matrices into residuals, joint structure and view-wise structure [31]. The fitting method is also iterative; the ranks of the decomposition matrices are found through permutations. Linked matrix factorization JIVE (LMF_JIVE) is an extension to both row-wise and column-wise integration [32]. Both methods rely heavily on least squares, and require imputation to deal with missing values.

Figure S27: Boxplots of correlations with sample-wise sums (y-axis) of different datasets and overall sum (right panels) for different methods (x-axis) of different dimensions (top panels) for SimSeq data (strategy 2) generated based on the HMP2 microbiome and proteome datasets.

Figure S35: Boxplots of correlations with sample-wise sums (y-axis) of different datasets and overall sum (right panels) for different methods (x-axis) of different dimensions (top panels) for permuted Gavin microbiome and microbial protein datasets (strategy 3).

Figure S37: Boxplots of Wilcoxon rank sum test statistics quantifying correlated taxon identification for different methods (x-axis) and templates (top panels) on data generated with SimSeq (strategy 2).

Figure S38: Boxplots of pseudo-F statistics (y-axis) quantifying sample separation for different methods (x-axis) and templates (top panels) under simulation with SimSeq (strategy 2).