Probabilistic pathway-based multimodal factor analysis

Abstract Motivation Multimodal profiling strategies promise to produce more informative insights into biomedical cohorts via the integration of the information each modality contributes. To perform this integration, however, the development of novel analytical strategies is needed. Multimodal profiling strategies often come at the expense of lower sample numbers, which can challenge methods to uncover shared signals across a cohort. Thus, factor analysis approaches are commonly used for the analysis of high-dimensional data in molecular biology, however, they typically do not yield representations that are directly interpretable, whereas many research questions often center around the analysis of pathways associated with specific observations. Results We develop PathFA, a novel approach for multimodal factor analysis over the space of pathways. PathFA produces integrative and interpretable views across multimodal profiling technologies, which allow for the derivation of concrete hypotheses. PathFA combines a pathway-learning approach with integrative multimodal capability under a Bayesian procedure that is efficient, hyper-parameter free, and able to automatically infer observation noise from the data. We demonstrate strong performance on small sample sizes within our simulation framework and on matched proteomics and transcriptomics profiles from real tumor samples taken from the Swiss Tumor Profiler consortium. On a subcohort of melanoma patients, PathFA recovers pathway activity that has been independently associated with poor outcome. We further demonstrate the ability of this approach to identify pathways associated with the presence of specific cell-types as well as tumor heterogeneity. Our results show that we capture known biology, making it well suited for analyzing multimodal sample cohorts. Availability and implementation The tool is implemented in python and available at https://github.com/ratschlab/path-fa


Introduction
A current trend in biomedical research is to obtain multiple types of molecular data for a given sample (Boehm et al. 2022).These modalities represent different views of the molecular landscape, each measuring different aspects, resolutions, and scales to provide us with complementary information that helps us understand the relationships between molecular mechanisms.A common question regarding multimodal data is to identify factors that explain the range of observed measurements (e.g.gene or protein expression, copy number variations, phosphorylation, etc.) across a given sample population (Ritchie et al. 2015).In molecular oncology, related approaches are used to quantify tumor composition and immune cell content from bulk measurement technologies, leveraging shared signals observed across a larger cohort (Chen et al. 2018).
Thus, extracting useful representations of samples from omics data that are biologically relevant, correlate with celltype abundance, or even clinical variables, remains an important challenge.PLIER (Mao et al. 2019) and MultiPLIER (Taroni et al. 2019) leverage pathway level information to analyze factors that drive the differences in biology observed within a cohort that significantly improve interpretability compared to gene level approaches, but only apply to a single modality of transcriptomics data.MOFA (Argelaguet et al. 2018) conversely, is a multimodal factor analysis approach but operates on a potentially unidentifiable latent space instead of pathways, which can make interpretation challenging.By making less assumptions about the latent factors, it also requires more samples to infer them.
Here, we propose a novel approach for a multimodal factor analysis that operates on the level of biological pathways to integrate the information from multiple modalities.We leverage the concepts from PLIER and MOFA into a novel single framework, PathFA.We use pathway information to integrate multiple modalities into the same factor analysis model.This new multi-omics factor analysis is able to join proteomics and transcriptomics (RNA) data in the space of pathways (PLIER is designed for RNA) instead of unknown latent factors (MOFA).To effectively integrate and balance different observed markers and modalities, we propose a probabilistic pathway-based factor analysis model and efficient Bayesian inference method.
Using PathFA, we show that both (1) the addition of another data modality and (2) utilization of prior knowledge improve reconstruction in a simulation and downstream performance on real data, for example correlation with cell types.Specifically, we improve over MOFA when it is possible to incorporate prior knowledge in form of pathways and over PLIER when multiple omics are available.On matched proteomics and transcriptomics data of the Tumor Profiler Consortium (Irmisch et al. 2021;The Tumor Profiler Consortium 2024a, 2024b), the latent factors of PathFA align with relevant pathways for clinical outcome, tumor heterogeneity, and cell-type composition.Further, we find that proteomics data by itself can be as useful as transcriptomics in our case, although pathways are originally derived on RNA markers.
In summary, in this study, we make the following main contributions: 1) We introduce PathFA, a novel probabilistic factor analysis model integrating multimodal data (transcriptomics and proteomics) using biological pathways.2) We implemented an efficient Bayesian inference method for automatic hyper-parameter optimization, enhancing the analysis of high-dimensional molecular data.3) We demonstrate improved interpretability of molecular datasets by anchoring analysis in known biological pathways.4) We validate PathFA with real-world patient data from the Tumor Profiler consortium, successfully identifying key biological insights related to cell-type composition, survival, and tumor heterogeneity.

Data and preprocessing
We use the transcriptomics (RNA-Seq) and proteomics data (DIA-MS) data (Xuan et al. 2020) generated from Melanoma (The Tumor Profiler Consortium 2024b) and Ovarian patients (The Tumor Profiler Consortium 2024a), as part of the Tumor Profiler(TuPro) study (Irmisch et al. 2021).TuPro is a multicenter study that has deeply phenotyped metastatic tumor across multiple indications (Irmisch et al. 2021).All data will be made available upon release of the Tumor Profiler Marker papers.

Sample selection for transcriptomics and proteomics
Based on the transcriptomics and proteomics data from tumor profiler, we selected a subset of samples for ovarian cancer and melanoma.RNA-seq library preparation was done using the Illumina TruSeq Standed Total RNA Sample Preparation kit (Ribo-Zero Gold).Nova-Seq 6000 was used to sequence the samples.Proteomics data was generated via a Data Independent Acquisition Mass Spectrometry approach.We selected a subcohort of patients that had passed the internal quality control of both transcriptomics and proteomics nodes.RNA-seq data was filtered based on RIN score and fastqc metrics.Samples failed QC when the RIN score was below six, or if three or more FASTQC modules were failed.
For proteomics data, quality control was assessed in context of reference control samples [mix of three Ovarian cell lines (Kuramochi, OVCA3, SKOV3) or for melanoma (MKFY6-1-P15/MKFY6-1-P15/MTG5K-1-P19)].Samples with more than 70%, as well as peptides with more than 75%, missing values were removed.We also filtered for samples that had both modalities profiled, and a complete set of the relevant metadata.This left us with 42 ovarian samples and 34 melanoma samples.

Data preprocessing
Both transcriptomics and proteomics data are processed similarly and according to standard practice.In both cases, the raw data are first standardized by the library size, i.e. the total number of counts per sample.For RNA, this is equivalent to reads per million mapped reads (RPM).The counts are transformed with the following function: logð1 þ xÞ, then, quantile normalized, and z-scored.The prior knowledge about pathways is extracted from the Molecular Signatures Database (MSigDB; Liberzon et al. 2011) in addition to a curated set of pathways, developed by the Tumor Profiler Consortium for the analysis of melanoma pathways (Irmisch et al. 2021).From MSigDB, we use Hallmark (Liberzon et al. 2015) and cell-type signature genesets.However, these are specified in terms of contained genes only.To use the same pathways for proteomics data, we translate these pathways into the corresponding markers, mapping to Emsembl gene ids (Ruffier et al. 2017) via UniProt (Consortium 2019).

Synthetic data-generating process
To generate synthetic data for simulation experiments, we use MSigDB Hallmark pathways, which provide us with 50 pathways and corresponding transcriptomics and translated (Section 2.1.2) proteomics markers.Further, each pathway is associated to one of eight process categories, e.g.development, DNA damage, or immune, which we use as ground truth latent variables.This allows us to generate data by simulating loadings from a mixture of three isotropic Gaussians, simulating clusters of samples.The mean of each cluster is itself sampled from an isotropic Gaussian with higher variance.Further, we use a heteroscedastic observation noise on the markers that follows the shape of observed data in Tumor Profiler data.The average transcriptomics observation noise is 0.95 and that of Proteomics is 0.98 while the spread of noises on the marker level is smaller for Proteomics as shown in Fig. 1B.

Probabilistic pathway-based factor analysis
PathFA uses prior information in terms of associations of pathways to either gene or protein markers.This allows our model to use the pathways as a latent space based on prior knowledge.PLIER (Mao et al. 2019) uses the same prior information in their model but only for transcriptomics data and in combination with a standard factor analysis (FA) model.Below, we introduce our method that is both applicable to transcriptomics and proteomics data and we extend it to the multimodal setting (observing both per sample) in Section 2.4.We denote the data of m gene or protein markers for n samples as matrix Y 2 R m × n .Further, we assume prior knowledge about p relevant pathways that each consist of several markers (gene or protein) as C 2 f0; 1g m × p with C m;p ¼ 1 if the mth marker is part of the pth pathway and otherwise C mp ¼ 0. PathFA infers two parameter matrices: associates the pathways to latent variables such that U p;k denotes the relevance of the pth pathway to the kth latent variable.Associating latent variables through pathways allows to interpret them better than in a standard factor analysis.Finally, each column of the matrix B 2 R k × n specifies the k-dimensional loadings per sample.Informally, we model the observations Y with CUB.Typically, we have m markers in the order of thousands, p pathways in the order of hundreds, and tens of k latent dimensions.
PathFA uses a Gaussian likelihood function with learnable observation noises per marker.This allows PathFA to ignore noisy markers and focus on the informative ones as they are actually heteroscedastic (Argelaguet et al. 2018).The observation noise is parameterized by r 2 R m þ with σ m denoting the observation noise of the mth marker.The likelihood function is then given by where c m is the vector denoting the mth row of C and b n the nth column of B. The likelihood governs that each marker of a sample is reconstructed by the pathways it belongs to (c m ) summed over the samples' abundance of the respective pathways (Ub n ).This reconstruction is visualized in Fig. 1C.In comparison to standard factor analysis, all observations are reconstructed through the pathways.We additionally still maintain lower-dimensional latent variables in B corresponding to the loadings in factor analysis.We place independent zero-mean Gaussian priors over the parameter matrices to ensure sparsity of the pathway-latent association U and to identify the relevant latent dimensions in B. For U, we use the identically-sized matrix of prior precisions Λ 2 R p × k þ with entries Λ p;k denoting the regularization strength of the entry U p;k akin to automatic relevance determination (ARD).ARD is used commonly for biological latent-variable models to achieve sparsity in a probabilistic framework (Li et al. 2002;Tan and F� evotte 2012).Its advantage is that we do not have to set a sparsity level or regularization hyperparameter a priori but can infer it automatically.The precision of B is controlled per latent variable with vector d 2 R k þ .Mathematically, we have the priors To infer U and B, we find their maximum a-posteriori (MAP) values by minimizing their negative log joint distribution with the observations, LðU; BÞ ¼ −log pðY; U; Bjr; Λ; dÞ ¼ −log pðYjU; B; rÞ−log pðUjΛÞ−log pðBjdÞ: (3) To optimize this objective, we use alternating least-squares (ALS) (Hastie et al. 2015), which uses alternating closedform updates for the parameters.The inference procedure is detailed in Section 2.3.
To select the hyperparameters of the model that control the observation noises r, and relevance parameters Λ; d, we use a Bayesian model selection scheme, which, perhaps surprisingly, does not increase the complexity of our algorithm.For example with cross-validation, choosing relevance parameters would be intractable due to the curse of dimensionality as we have m þ pk þ k hyperparameters.In our experiments, this can be more than 10 4 .For our hyperparameter optimization, we use a Laplace approximation to the marginal likelihood that has been successfully used in automatic relevance determination (MacKay 1994) and can Probabilistic pathway-based multimodal factor analysis i191 jointly optimize observation noises r and relevance parameters Λ; d.In particular, we maximize the evidence of the hyperparameters, log pðYjr; Λ; dÞ � log pðY; U � ; B � jr; Λ; dÞ where the Hessian is evaluated at MAP estimates U � and B � .We further approximate the Hessian as blockdiagonal individually for U and B, which typically performs as well but has significant computational benefits (Immer et al. 2021).The hyperparameters are updated in a closed-form procedure similar to MacKay (1994) and Tipping (2001).Interestingly, the closed-form updates require computation of the same quantities as required for the alternating least-squares updates and thus do not increase overall computational complexity.That is, ARD is for free in our model when we use ALS updates for the parameters.

Efficient inference of path-FA
We optimize the parameters of the model using alternating least-squares (ALS), which guarantees improvement in each step of the algorithm due to closed-form updates.Since both ALS and the Laplace approximation rely on the Hessian, computation is shared making hyperparameter optimization asymptotically free.Our updates resemble a Bayesian ALS that jointly optimizes parameters and hyperparameters by repeated integration as done for neural networks (Immer et al. 2021), and might be relevant outside the context of fitting PathFA, for example, in general matrix factorization or factor analysis problems.We first derive the gradients and Hessians of the parameter objective in Equation (3), which then allow to derive the parameter and hyperparameter updates.

Gradients and Hessians
Defining C r ¼ def r −2 � C, where � is the pointwise hadamard product, the gradients of the objective are given by (5) where � denotes the Kronecker product and 1 n the n-dimensional vector of 1 s.For alternating least-squares and the marginal likelihood approximation, we further need the Hessians where vecðΛÞ concatenates the columns of Λ into a vector and diagð�Þ turns a vector into a diagonal matrix with elements given by the argument.The Kronecker-factored structure of the Hessian w.r.t.B allows for efficient computation of inverse and log-determinant required for ALS and marginal likelihood, respectively, by computing these quantities on the individual factors.However, the Hessian w.r.t.U requires evaluation of the Kronecker product for inversion due to the ARD prior controlled by Λ.This is not a problem because U's shape is the number of pathways times number of latents, and therefore, small, i.e. below 10 3 .

Updates and algorithm
We optimize the model parameters U and B efficiently using ALS, or equivalently Newton's method, by preconditioning the gradient with the inverse of the Hessian.The hyperparameters are updated using the same quantities, therefore incurring no additional cost, and use fixed-point updates proposed by MacKay (1992) for neural network regression but we apply these updates already during training, which requires less time to converge (Immer et al. 2021).U is initialized to a zero-matrix and constrained to be non-negative and B is initialized using the right factor of a k-truncated singular-value decomposition of the observation matrix.The observation noises are initialized to a vector of 1 s and prior precisions logarithmically spaced between 10 −2 and 10 2 along the k latent variables.
The parameter updates are given by a Newton update with step size γ: where M is either parameter U or B. For B this further simplifies due to the Kronecker-factored structure of its Hessian: which is efficient to compute since the left-multiplied Kronecker factor of the Hessian is quadratic in latents (k × k) and cheap to invert.To update U, we need to invert a matrix of size pk × pk, which is typically tractable and fast with pk � 10 3 .The hyperparameter updates rely on the inverse of the Hessian as well and therefore require no overhead computational complexity.The updates are similar to the ones for sparse Bayesian learning (MacKay 1992; Tipping 2001) but applied to our particular case of matrix factorization, which has considerably different priors and likelihoods.In particular, we have for the prior precisions Λ: which requires the diagonal elements of the inverse of Us Hessian like the ALS update.For the prior precisions of B and observation noises, we have: Computing these updates poses no overhead in computation over the ALS updates and enables a hyperparameter-free algorithm.This is because only ALS requires a step size γ, i192 Immer et al.
which can be simply set to a common default of 0.1, and the hyperparameter updates are entirely in closed-form.

Multi-omics PathFA
To extend PathFA to simultaneous proteomics and transcriptomics data per sample, we only extend the likelihood but keep the same latent variables U and B, which construct a shared hierarchical latent space (see Fig. 1).Specifically, we have another view and thus likelihood for the proteomics data with translated pathway prior knowledge C p .In principle, analyzing both modalities concurrently should enhance the precision of latent variable inference, even with a reduced number of samples.Moreover, the complementary nature of these modalities can potentially lead to more robust and effective latent representations, which may be beneficial for downstream tasks.

Probabilistic model of multi-omics PathFA
Mathematically, we have prior knowledge about the geneset-to-marker relationships in form of C p 2 f0; 1g mp × p for m p protein markers and C r 2 f0; 1g mr × p for m r RNA markers, respectively.Importantly, both cover the same pathways, which function as representation space.With proteomics observations Y p 2 R mp × n and transcriptomics observations Y r 2 R mr × n , we have the likelihood which is simply the product of two unimodal likelihoods, which are defined in Equation ( 1).This means, we potentially observe more evidence for latent variables U and B. We model both observations as conditionally independent and with their own per-marker noise levels denoted by r r and r p .
Inferring noise levels is important to correctly weight the signal and noise coming from either modality.Further, modality-specific scalar factors c r and c p are multiplied to reconstruct the respective modalities, which can absorb potential inconsistencies in preprocessing or normalization of transcriptomics and proteomics data, respectively.Both scalars can be updated in a closed form and have no prior.The Gaussian priors with zero mean and precision parameters λ U ; λ B on U and B are as in Equation (2).

Inference for multi-omics PathFA
The alternating least-squares parameter updates for multimodal PathFA and the corresponding hyperparameter updates are only slightly different from the ones of the unimodal variant described above.In particular, the additional observation in the multimodal likelihood [Equation ( 11)] gives an additional summand in the gradient and Hessian computation, which are given in Equations ( 5) and ( 6) for the unimodal case, respectively.This gives us only slightly modified gradients G m ðUÞ; G m ðBÞ and Hessians H m ðUÞ; H m ðBÞ with the subscript m denoting the multimodal model.This does not complicate the computation and the same parameter and hyperparameter updates hold as detailed in Section 2.3.2 by simply replacing the Hessian with the multimodal variants.
The final algorithm in pseudo-code is given in Algorithm 1.

Hyperparameters and computational considerations
The overall algorithm can be entirely hyperparameter-free as step sizes γ below 1 lead to convergence using ALS within tens to hundreds of steps and ARD can shrink unnecessary latent variables.In practice, we use γ ¼ 0:1 and k ¼ 10 latents in our experiments.Alternatively, the computed marginal likelihood can be used to select the number of latent variables by running the algorithm for different numbers of latents k as in Bayesian PCA (Bishop 1998).The algorithm scales cubically in the Hessian sizes, which is Oðk 3 Þ for the Hessian of B due to the Kronecker-factored structure and Oðp 3 k 3 Þ for the Hessian of U. The latter can be expensive when the number of latents and pathways is large, in which case a diagonal approximation is still viable and implemented in our code.

Results
We developed and implemented a novel model for factor analysis, PathFA that leverages pathway information for multimodal data integration and incorporates a probabilistic framework with automatic hyperparameter optimization.It bridges the gap between two existing approaches, MOFA and PLIER.PathFA enables pathway-based interpretation of multimodal factor analysis, specifically suited for small cohorts.
In the sections that follow, we demonstrate how the inclusion of prior information enhances the reconstruction loglikelihood.We observe improvements in the reconstruction of each individual modality and in the convergence behavior when a second data modality is incorporated.Additionally, we showcase the practical utility of our approach through its application to a subset of patient samples from the Swiss Tumor Profiler cohort in a real-world setting.

Including prior information improves reconstruction
On the synthetic data generated using MSigDB Hallmark as described in Section 2.1.3,we observe that PathFA has clear advantages over PLIER and MOFA in terms of reconstruction performance, sample efficiency, and inferring observation noises on the markers of both proteomics and Project U to non-negative by clamping to zero 13: Update Λ; d; r r ; r p as in Equations ( 9) and ( 10 We compare the reconstruction log-likelihood across different models using one modality, synthetic proteomics, or synthetic RNA, in Fig. 2. The trends of the reconstruction log-likelihood across different sample sizes are consistent between both modalities.MOFA without pathway information shows the lowest reconstruction log-likelihood.PathFA further improves over PLIER, especially for small sample sizes, due to automatic hyperparameter optimization and focus on pathway-level factor analysis. Figure 3 further shows that using both omics improves reconstruction of them individually, especially for proteomics, which appears to provide less evidence for the loadings.Already around m ¼ 30 samples suffice for the loadings to converge with PathFA.Like MOFA, PathFA can infer marker-level heteroscedastic observation noises and converges faster in terms of the samples needed due to the prior information on pathways, as apparent in Fig. 4.

PathFA performs well in small-sample cohorts
For the purpose of investigating how multimodality contribute towards the performance of PathFA, we compare the change in reconstruction likelihood over sample number between PathFA in a unimodal setting with synthetic trancscriptomics or proteomics data only (blue) versus PathFA with multimodal data (purple) in Fig. 3.We then use the unimodal or multimodal model, to evaluate the reconstruction log-likelihood and the observation noise, respectively, on transcriptomics and proteomics data separately.
Including both modalities strictly improves the reconstruction performance, especially for small sample sizes, which are common in realistic biomedical datasets, as also in our case in Section 3.3.
To show the effect of prior information, we also compare PathFA against multi-omics factor-analysis (MOFA).Unlike MOFA, we integrate modalities on a pathway level.This provides not only prior information but also enables a pathway-based multimodal factor analysis that should simplify the interpretation of the results.Generally, PathFA demonstrates similar behaviour to MOFA with respect to fitting observation noise.However, in the range of small sample numbers, the utility of prior information takes effect leading to much smaller observation noise error in PathFA.

Comparing pathway loadings to cell type composition
To demonstrate the utility of our approach, we also benchmark PathFA on a subset of patient samples from the TuPro cohort.This cohort is well suited for a benchmark study as it provides paired bulk proteomics and transcriptomics measurements for each sample.In addition, CyTOF data, also   generated for all samples, provides single-cell measurements of several protein markers that we able to use as an independent estimate of the ground-truth cell-type composition.We use a subset of 42 samples from ovarian cancer patients as well as 34 samples from the melanoma cancer patients that have available CyTOF compositions, clinical information of the melanoma cohort, and tumor heterogeneity scores of the ovarian cancer cohort.
Here, we train PathFA using the MSigDB Hallmark pathways and evaluate the quality of learned representations by their ability to correlate with cell-type composition.Figure 5 shows the highest correlating pathway loadings from PathFA and baseline approaches with CyTOF estimates of cell-type content in a unimodal and multimodal setting.In the unimodal and multimodal setting, PathFA consistently outperforms standard factor analysis (respectively, MOFA in the multimodal setting), for which we use standard instead of pathway loadings, as well as PLIER with respect to deriving factors that are highly correlated with cell-type content.These results are similar with the corresponding analysis in the melanoma cancer cohort (see supplemental Fig. S1), however, with systematically lower correlation in endothelial and fibroblasts across all approaches.The gain in the multimodal setting over the unimodal approach is small with respect to this evaluation.While the multimodal approach does improve over a proteomic-only approach, the performance over an RNA-only approach is similar.The difference in marker resolution between transcriptomics and proteomics data, may explain this trend.
Next, we identify pathways associated with cell-type composition by computing correlations between the CyTOF estimate and pathway loadings.Thus, we run PathFA using cell-type marking pathways and report the top five highest correlated pathways to each cell type in supplemental Fig. S3 in the Melanoma case.These pathways seem indicative of their respective cell types, melanoma pathway for tumor cell type, T cell pathways for immune cell type, Cancer associated fibroblasts (CAFs) pathways for fibroblast cell type, and endothelial melanoma pathways for endothelial cell types.We observe similar results in the Ovarian cancer cohort (see supplemental Fig. S2).
We further test the efficacy of PathFA when the pathway masks are perturbed and thus the prior information might not be fully accurate.In supplemental Fig. S6, we find that the average cell-type correlation over all four types remains high even for a substanial amount of flipped pathway-marker associations.The same figure shows that on the synthetic data, this is not necessarily the case and the reconstruction log likelihood suffers from flipping associations.

Associating pathway loadings to patient survival in a melanoma cohort
Conceptually, PathFA is designed to enable easier interpretation of the results of a multimodal factor analysis on the level of pathways.To investigate this further, we look at survival data as well as time to progression of the melanoma patient cohort.We analyzed the association of the pathway loadings of PathFA in the multimodal setting (transcriptomics and proteomics) with survival and patient progression.Each patient is given a binary label for four categories, if the patient has survived for at least 6 and 12 months, and if the patient has survive without progression for 6 and 12 months.Then, a t-test is performed between each label and the pathway loadings.A clustermap over the t-test statistics for associated pathway loadings are shown in Fig. 6 We observe three major clusters in the melanoma gene set analysis.Cluster 1, top 5 pathways, consists out of upgregulated gene sets associated with proliferation and DNA damage, while cluster 2, middle 12 pathways, encompasses various process categories, such as pathways, metabolism, and development.The remaining cluster 3, bottom 6 pathways, shows enrichment in downregulated gene sets, primarily related to immune responses, DNA damage, and pathways.Furthermore, in line with current literature reporting downregulated gene sets associated with worse clinical outcomes (Garg et al. 2021), hallmark gene sets like hallmark_allograft_rejection and hallmark_inter-feron_gamma_response are enriched in the group of patients with poor outcomes, including those who either succumbed to the disease or progressed within 12 months (see Fig. 6).When compared to a single-modal setting using transcriptomics and proteomics data separately, it becomes apparent that most of the observed signal originates from the transcriptomics data, with two pathways being uniquely identified through proteomics signal (see supplemental Fig. S4).

Pathway loadings are associated with tumor heterogeneity in ovarian cancer
We proceed to evaluate the performance of PathFA in the ovarian cancer cohort, focusing on the challenge of tumor heterogeneity (V� azquez-Garc� ıa et al. 2022).The tumor heterogeneity is computed based on the Jenson-Shannon divergence from CyTOF data on tumor cells.Similar to the previous tasks, PathFA achieves the highest correlation between latent factors and tumor heterogeneity (see supplemental Fig. S5).When comparing heterogeneity estimates derived from tumor cell populations using CyTOF, we observe significant correlations.Notably, OXIDATIVE_PHOSPHORYLATION and UN FOLDED_PROTEIN_RESPONSE shows positive correlations (see Table 1) including mTORC1 (see Fig. 7).This suggests Multirefers to the multimodal setting, while RNA and Prot refers towards results based on transcriptomic and proteomic data only.FA refers to factor analysis respectively MOFA in the multisetting.For both, correlation with latent factors is computed.PLIER can only be used in unimodal setting.
Probabilistic pathway-based multimodal factor analysis i195 associations between endoplasmic reticulum stress, metabolic processes and tumor heterogeneity in ovarian cancer and is consistent with previous reports on pro-survival mechanisms (Madden et al. 2019) and metabolic processes (Sancho et al. 2016).

Discussion and conclusion
In this study, we introduced PathFA, a novel multimodal factor analysis approach tailored for genomic data, employing Bayesian hyperparameter optimization to seamlessly Figure 6.Significant associations between survival and progression in relationship to pathway loadings from PathFA.The color refers to the normalized mean difference between the two patient groups in each column, for significant associations only.The clustering tree implies the presence of three clusters, two of which seem to be higher expressed [top (5 pathways) and middle (12 pathways)], and the remaining one appears to be down-regulated in reference to the according patient group (y-axis).integrate transcriptomic and proteomic data using pathway sets.This method innovatively addresses heteroscedasticity in markers and modalities, a prevalent challenge in experimental measurements.Through various experiments, we established that incorporating known prior information not only enhances reconstruction quality but also improves data efficiency through Bayesian hyperparameter updates.Our probabilistic formulation, especially the automatic optimization of hyperparameters, marks an improvement over PLIER, which lacks this adaptability.
A notable advancement of PathFA is its ability to provide immediate interpretability of pathway activities, in contrast to traditional approaches that often rely on post-hoc gene set enrichment analysis.This immediate interpretability is exemplified in our analysis of a cancer patient cohort from the Tumor Profiler Project, where PathFA successfully correlates pathway activities with crucial biological aspects such as cell-type composition, survival, and tumor heterogeneity.These capabilities illustrate the potential of PathFA in elucidating complex biological relationships and hypothesis generation.
PathFA is implemented using a Gaussian likelihood.This provides a robust foundation for different data modalities.In the future, it may be desirable to expand this framework to allow other likelihood functions such as the negative binomial, often used in context of read count data.PathFA may also expand across other biological data-types for which groupings (e.g.pathways masks) of individual markers (e.g.genes) is possible across the modalities.Furthermore, our approach extends beyond the typical limitations of analyzing dual modalities.It is readily adaptable to include additional data types, promising broader applicability as genomic profiling technologies evolve.A potential complication of such multimodal datasets lie in the differences in cellular composition between parallel tissues slices.Looking forward, we envision enhancements to PathFA that incorporate modalityspecific prior information that are in addition flexible to these tissues effects, further augmenting its analytical power.
In summary, PathFA represents a significant step forward in multimodal genomic analysis, offering robustness, flexibility, and direct interpretability.Its development aligns with the ongoing advancements in profiling technologies, positioning it as a valuable tool in understanding complex biological systems and disease mechanisms.

Figure 1 .
Figure 1.Schematic overview of our method.(A) Samples are hierarchically represented via pathways and latents inferred jointly from transcriptomics (RNA) and proteomics observations.Corresponding prior pathways translate into the space of both observed modalities.(B) Density plot of observation noises for both RNA and proteomics data shows heteroscedasticity within and across modalities.Proteomics markers have less varying precision while RNA has more spread.(C) Hierarchy of representations for a single sample.A sample is represented by a low-dimensional latent variable, projected into pathway abundances, which can be reconstructed into both proteomics and RNA observations.

Figure 2 .
Figure2.Reconstruction log-likelihood on the synthetic benchmark data as a function of samples for and RNA averaged over 20 runs.Shaded regions denote twice the standard error about the mean.Unimodal PathFA, unimodal PLIER, and MOFA are compared to optimal attainable performance (dotted line).

Figure 4 .
Figure 4. Average absolute error of marker-level observation noises on synthetic benchmark of Multi-PathFA and MOFA for both proteomics and RNA.The lines show the average over 20 runs and shaded regions two standard errors.

Figure 3 .
Figure 3. Reconstruction log-likelihood of multimodal PathFA in comparison to the unimodal variant as depicted in Fig. 2.

Figure 5 .
Figure 5.This figure shows the Pearson correlation coefficients of pathway loadings with cell-type content (based on ground truth computed from CyTOF data) across 42 ovarian cancer samples for the four most common cell-types.MSigDB cell-type pathways were used for PathFA across all experiments.Multirefers to the multimodal setting, while RNA and Prot refers towards results based on transcriptomic and proteomic data only.FA refers to factor analysis respectively MOFA in the multisetting.For both, correlation with latent factors is computed.PLIER can only be used in unimodal setting.
The quantile normalization ensures that the transformed counts for each sample roughly follow a normal distribution and z-scoring standardizes per marker.The preprocessing is identical toMao et al.
(2019)with the difference that we use logð1 þ xÞ instead of filtering genes with low counts and applying logðxÞ.The melanoma dataset had 19 612 genes and 4651 proteins.The ovarian dataset had 19 965 genes and 4068 proteins.
) using H m ðUÞ; H m ðBÞ 14: Closed-form update for scalar factors c r ; c p Probabilistic pathway-based multimodal factor analysis i193 transcriptomics markers.We use the public implementations and propose default parameters of PLIER and MOFA and set the number of latent variables for all methods to the ground truth 8.With access to the ground truth parameters, we can measure the reconstruction performance on test data and fitting of heteroscedastic marker-level observation noises.That is, we use the method in question on a set of m samples to infer the model and then fix the model before inferring loadings for a held-out set of m test samples, Y ðtestÞ With these parameters, we can reconstruct the noise-free observation, Ŷr and Ŷp , and assess this reconstruction under the true model in terms of log-likelihood.Further, we assess how well our probabilistic model can recover the true observation noise of the model in comparison to MOFA, which does not use prior knowledge C.
x ¼ C x UB for any data modality x.

Table 1 .
This is a list of 10 MSigDB Hallmark pathways that have the highest Pearson correlation with tumor heterogeneity score on 42 ovarian cancer samples.Scatter plot of Jensen-Shannon tumor heterogeneity computed on CyTOF measurements of tumor cells (y-axis) and mTORC1 pathway loading generated by PathFA.Each dot represents a ovarian cancer patient sample.The Pearson's correlation is 0.41 (P-value ¼ .0075).