
1 Introduction

The exponential increase in multi-modal data stemming from different instruments and measurements presents both an opportunity and a challenge. With a larger volume of information there is potentially more we can learn about a given process, but coherently combining different data sources to improve the analysis remains a challenging and underdeveloped task. In the medical field, for instance, multiple omics modalities such as proteins or lipids encode related biological information. One might therefore expect that health or disease state is reflected in both of these modalities, despite their different formats. Learning frameworks such as manifold alignment [5, 10] and domain adaptation [6, 8] may not be directly applicable, as they try to find a common latent manifold or to transfer knowledge from a source to a target domain, respectively. Orthogonally to existing methods, we present a way of “mixing” information from multiple domains without imposing hard similarity between them. The motivation is that, for a given outcome, the “core mechanism” (e.g. health or disease state) is reflected in all of these modalities, so this commonality can become more evident when the source domain (e.g. proteins) is allowed to transform the local geometry of the target (e.g. lipids).

2 Approach

We use a stacked regularization setting [11] in which each first-layer model is trained using “mixed manifolds” of various data modalities. In the next subsections we briefly discuss classical stacked regularization, domain alignment, and domain adaptation, and finally propose our mixing algorithm. Regarding notation, we will use capital bold, bold, and no formatting for matrices, vectors, and scalars or functions, respectively (e.g. \(\mathbf {X}, \mathbf {x}, f\) or W). We will also use calligraphic font to denote spaces (e.g. \(\mathcal {X}\)).

2.1 Stacking

Let \(\mathbf {X}\) be a dataset of N samples whose values are sampled from an input space \(\mathcal {X}=\{\mathcal {X}^1,\, \mathcal {X}^2,\, ...,\,\mathcal {X}^M\}\), where \(\mathcal {X}^1\) to \(\mathcal {X}^M\) are subspaces corresponding to different “sources” or “views” \(1,\,...,\,M\), which we will refer to as “domains”. Denote by y the output sampled from an output space \(\mathcal {Y}\). In a supervised setting, the goal is to compute \(p\left( y|\mathbf {x}^{1},\,...,\,\mathbf {x}^{M}\right) \), where \(\mathbf {x}^{i}\) are the coordinates of an instance from \(\mathbf {X}\) in the domain \(\mathcal {X}^i\). In stacked regularization, or stacking, the input is passed to a first layer of \(W_0\) predictors \(g^0_1(\mathbf {x}),\,...,g^0_{W_0}(\mathbf {x})\), with:

$$\begin{aligned} g^0_i(\mathbf {x})=p\left( y|\mathbf {x}^{1},\,...\,\mathbf {x}^{M},\,\varvec{\theta }_{i}^0\right) {,} \end{aligned}$$
(1)

where \(\varvec{\theta }_{i}^0\) are the hyperparameters of the ith model. For our task, we propose passing one data source to each model: \(g^0_i(\mathbf {x})=p\left( y|\mathbf {x}^{i},\,\varvec{\theta }_{i}^0\right) \), so that the width of the first layer \(W_0\) equals the number of domains M. The output of this layer is then passed to one or more layers of \(W_k\) models \(g_{1}^k,\,...,g^k_{W_k}\), which blend the outputs of the previous layer:

$$\begin{aligned} g^k_i(\mathbf {x})=p\left( y|g^{k-1}_1(\cdot ),\,...,\,g^{k-1}_{W_{k-1}}(\cdot ),\,\varvec{\theta }_{i}^k\right) ,\,k\in [1, L]{,} \end{aligned}$$
(2)

where L is the total number of blending layers and \(\varvec{\theta }_{i}^k\) are the hyperparameters of the ith model in the kth layer. The output of the last blending layer is passed to a final model f that produces the output \(f(\mathbf {x})=p\left( y| g^{L}_1(\mathbf {x}),\,...,\,g^{L}_{W_L}(\mathbf {x}),\,\varvec{\theta }^{L+1}\right) \), where \(\varvec{\theta }^{L+1}\) are the hyperparameters of f. The general architecture of the stacked model is shown in Fig. 1. From a frequentist point of view, the goal of stacking is then to find:

$$\begin{aligned} \underset{\varvec{\theta }}{arg min} \,\,\mathcal {L}(\mathbf {y}; f(\mathbf {X}), \varvec{\theta }){,} \end{aligned}$$
(3)

where \(\varvec{\theta }\) is the set of hyperparameter values from all models in the stack, \(\mathbf {y}\) is the output for all of the data, and \(\mathcal {L}\) is the loss incurred when using \(f(\mathbf {x})\) to predict \(\mathbf {y}\). For a fully Bayesian approach, one should instead compute the posterior probability of each model by integrating out the hyperparameter values:

$$\begin{aligned} \begin{aligned} p\left( g_i^k|Z\right) \propto p\left( Z|g_i^k\right) p\left( g_i^k\right) \propto \\ p\left( g_i^k\right) \int p\left( Z|\varvec{\theta }_i^k, g_i^k\right) p\left( \varvec{\theta }_i^k|g_i^k\right) d\varvec{\theta }_i^k{,} \end{aligned} \end{aligned}$$
(4)

where Z is the complete dataset \((\mathbf {X},\mathbf {y})\).

Fig. 1. Our proposed stacked setting.

Although this approach is attractive because it accounts for model uncertainty, it incurs a high computational cost when \(\varvec{\theta }_i^k\) is high-dimensional.

Two important aspects are: (a) optimizing each model independently does not guarantee finding the globally optimal stacked model, and (b) there is an implicit assumption that each model \(g_i^k\) can effectively handle data from different sources, possibly in different formats.
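To make the per-domain first layer and the blending step concrete, here is a minimal Python/scikit-learn sketch (ours, not the authors' implementation); the synthetic data, model choices, and parameter values are purely illustrative.

```python
# Minimal stacking sketch: one first-layer model per domain plus a blending model.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

# Synthetic stand-ins for two modalities measured on the same samples.
X, y = make_classification(n_samples=300, n_features=30, random_state=0)
X1, X2 = X[:, :20], X[:, 20:]                      # e.g. "proteins" and "clinical"

first_layer = [RandomForestClassifier(n_estimators=300, random_state=0),
               RandomForestClassifier(n_estimators=300, random_state=1)]

# Out-of-fold class probabilities avoid leaking labels into the blending layer.
Z = np.column_stack([
    cross_val_predict(m, Xd, y, cv=5, method="predict_proba")[:, 1]
    for m, Xd in zip(first_layer, (X1, X2))
])

blender = RandomForestClassifier(n_estimators=100, random_state=0).fit(Z, y)

# Refit the first-layer models on all data; at test time their probabilities
# are stacked and passed to the blender, mirroring Eqs. (1)-(2).
for m, Xd in zip(first_layer, (X1, X2)):
    m.fit(Xd, y)
```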

2.2 Stacking Optimization

Finding an optimal stacked model can be done by optimizing each sub-model individually or by optimizing all the sub-models jointly. Optimizing each model individually has an important complexity advantage, because the number of possible \(\varvec{\theta }\) combinations grows exponentially with the number of hyperparameters: \(k^{|\varvec{\theta }|}\), where k is the number of values considered for each hyperparameter.
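As an illustration with hypothetical numbers: for a stack of three models, each with four hyperparameters and \(k=10\) candidate values per hyperparameter, a joint grid search evaluates \(10^{12}\) combinations, whereas optimizing each model individually requires only \(3\cdot 10^{4}\) evaluations.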

Lemma 1

For a given dataset \(\mathbf {X}, \mathbf {y}\), stacked model f(x) and parameters \(\varvec{\theta }\), the following relation is true: \(\mathcal {L}\left( \mathbf {y}, f(x), \varvec{\theta }^{*} \right) \le \mathcal {L}\left( \mathbf {y}, f(x), \varvec{\theta }' \right) \), where

\(\varvec{\theta }^{*} =\underset{\varvec{\theta }}{arg min}\,\,\mathcal {L}\left( \mathbf {y}, f(x), \varvec{\theta } \right) \), \(\varvec{\theta }' = \varphi _{k=1}^L \left( \varphi _{i=0}^{W_k}\left( \; \underset{\varvec{\theta }^k_i}{arg min}\,\,\mathcal {L}\left( \mathbf {y}, g_i^k(x), \varvec{\theta }_i^k \right) \right) \right) \), and \(\varphi _{j=1}^{L}\left( \underset{\varvec{\theta }_{j}}{arg min}\,f(\varvec{\theta _j})\right) \) is a sequential composition of minimizations with respect to index j: \(\underset{\varvec{\theta }_{j=L}}{arg min}\left( \underset{\varvec{\theta }_{j=L-1}}{arg min}\left( ...\,\underset{\varvec{\theta }_{j=1}}{arg min}\left( f(\varvec{\theta _j}) \right) \right) \right) \).

Proof

Let \(\mu \) be a measure on the measurable space \((\varTheta , \varvec{\theta })\). Since \(\varvec{\theta }\) is a disjoint set, its measure is just:

$$\begin{aligned} \mu (\varvec{\theta }) = \prod _{k=0}^{L+1}\prod _{i=1}^{W_k}\mu (\varvec{\theta }_i^k) \end{aligned}$$
(5)

Denote by \(\{\varvec{\theta }_i^{k}\}^*\) the set of values that satisfy \(\underset{\varvec{\theta }^k_i}{arg min}\,\,\mathcal {L}\left( \mathbf {y}, g_i^k(x), \varvec{\theta }_i^k \right) \), and \(\{\varvec{\theta }_i^k\}\) the set of values \(\varvec{\theta }_i^k\) can take. Since \(\{\varvec{\theta }_i^{k}\}^* \subset \{\varvec{\theta }_i^k\}\), then \(\mu \left( \{\varvec{\theta }_i^{k}\}^*\right) \le \mu \left( \{\varvec{\theta }_i^k\}\right) ,\,\forall \; i,k\), and so:

$$\begin{aligned} \mu \left( \varvec{\theta }'\right) =\prod _{k=0}^{L+1}\prod _{i=1}^{W_k}\mu \left( \{\varvec{\theta }_i^{k}\}^*\right) \le \prod _{k=0}^{L+1}\prod _{i=1}^{W_k}\mu \left( \{\varvec{\theta }_i^{k}\}\right) \end{aligned}$$
(6)

The minimization yielding \(\varvec{\theta }^{*}\) is thus performed over a larger domain than the one yielding \(\varvec{\theta }'\), and therefore \(\mathcal {L}\left( \mathbf {y}, f(x), \varvec{\theta }^{*} \right) \le \mathcal {L}\left( \mathbf {y}, f(x), \varvec{\theta }' \right) \).

There is thus a trade-off between complexity and performance when optimizing the model. If performance is the goal, the next step is to decide what form the optimization should take. A grid search quickly becomes infeasible for models with multiple hyperparameters, so an attractive alternative is Bayesian optimization [1].
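As a sketch of what this could look like for a single first-layer model (assuming the scikit-optimize package; this is illustrative, not the authors' code, and the search space is hypothetical):

```python
# Bayesian optimization of one sub-model's hyperparameters (illustrative sketch).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from skopt import gp_minimize
from skopt.space import Integer

X_train, y_train = make_classification(n_samples=200, n_features=20, random_state=0)

space = [Integer(50, 500, name="n_estimators"),
         Integer(2, 20, name="max_depth")]

def objective(params):
    n_estimators, max_depth = params
    model = RandomForestClassifier(n_estimators=n_estimators,
                                   max_depth=max_depth, random_state=0)
    # gp_minimize minimizes, so return the negative mean cross-validated AUC.
    return -cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc").mean()

result = gp_minimize(objective, space, n_calls=30, random_state=0)
best_n_estimators, best_max_depth = result.x
```

Joint optimization of the whole stack would instead treat the hyperparameters of all sub-models as a single search space, at the cost discussed above.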

2.3 Domain Alignment

Note that, at this point, there is no information sharing between the first layer's models. However, in many situations it may be desirable to share some information across these models, since they are built using different modalities of the same sample set. The motivation is that, even though the samples come from different distributions, the generating processes should be similar, and thus the samples should lie on similar low-dimensional manifolds. This is the central problem of Manifold Alignment [7]. Our Manifold Mixing is based on a similar motivation, with a crucial difference: we consider that each domain has a contribution of its own, and therefore we do not enforce an exact match between the manifolds but merely a transformation of the local inter-sample geometry using all the domains, indirectly linking the first-layer models of the stack together.

2.4 Domain Adaptation

Given two domains \(\mathcal {X}^s\) and \(\mathcal {X}^t\) that are different but related, the goal of Domain Adaptation is to learn to transfer knowledge acquired from the source \(\mathcal {X}^s\) to the target \(\mathcal {X}^t\). The most common setting is when there are many labeled examples in the source but not in the target, and one therefore tries to learn an estimator h that minimizes the prediction error on both the source and target distributions [2]. In our setting, the source and target domains represent different modalities of the same sample. For clarity, we adopt the source and target terminology as well, and use the former to transform the latter.

2.5 Manifold Mixing

We would like to address the following question: how can we combine data from different domains that have a similar relation to the output? Our approach consists of creating a map between each pair of domains \(\mathcal {X}^{s}\rightarrow \mathcal {X}^{t}\) while deforming the local geometry of the two to become more similar. We drew inspiration from LLE [9] in that we also use the neighbours of a point to predict its position; in our case, however, the neighbours come from the other domains and are used to predict the point's position in the original domain. Consider a set of points S and two mappings taking the points in S to the coordinate systems of domains \(\mathcal {X}^t\) and \(\mathcal {X}^s\): \(\varphi : S \rightarrow \mathbb {R}^{|t|}\), \(\psi : S \rightarrow \mathbb {R}^{|s|}\), and suppose the subsets \(\mathbf {X}^t\), \(\mathbf {X}^s\) of the dataset \(\mathbf {X}\) are measured in these coordinate systems. We introduce a linear approximation \(\mathbf {L}_{s}^{t}\) to the mapping \(\varphi \circ \psi ^{-1}: \mathbb {R}^{|s|}\rightarrow \mathbb {R}^{|t|}\) from the coordinates of domain \(\mathcal {X}^s\) to the coordinates of domain \(\mathcal {X}^t\), obtained by solving \(\underset{\mathbf {L}_{s}^{t}}{\min }\sum _{i=1}^{N}||\mathbf {x}_i^t - \mathbf {L}_{s}^{t}\mathbf {x}_i^{s}||^2\), with \(\mathbf {x}_i^t, \mathbf {x}_i^{s}\) corresponding to the ith entry of \(\mathbf {X}^t\) and \(\mathbf {X}^s\), respectively. The optimal solution is then given by:

$$\begin{aligned} \mathbf {L}_{s}^{t} = (\mathbf {X}^{t})^{\top }\mathbf {X}^{s}\left( (\mathbf {X}^{s})^{\top }\mathbf {X}^{s}\right) ^{-1}{,} \end{aligned}$$
(7)
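As a brief check of Eq. (7) (this one-line derivation is ours): setting the gradient of the least-squares objective with respect to \(\mathbf {L}_{s}^{t}\) to zero gives

$$\begin{aligned} \sum _{i=1}^{N}\left( \mathbf {x}_i^t - \mathbf {L}_{s}^{t}\mathbf {x}_i^{s}\right) (\mathbf {x}_i^{s})^{\top } = \mathbf {0} \;\;\Rightarrow \;\; (\mathbf {X}^{t})^{\top }\mathbf {X}^{s} = \mathbf {L}_{s}^{t}\,(\mathbf {X}^{s})^{\top }\mathbf {X}^{s}, \end{aligned}$$

so that the stated closed form holds whenever \((\mathbf {X}^{s})^{\top }\mathbf {X}^{s}\) is invertible.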
Fig. 2. Target manifold being deformed by the source manifold using the manifold mixing method. The crosses are the neighbours of point \(\mathbf {x}_i\) (shown in red) in the target domain. These neighbours are mapped from the source to the target domain and then used to locate \(\mathbf {x}_i\), causing the target manifold to be locally deformed by the source manifold.

Denote by \(\mathbf {n}_i^t[j]\) the jth neighbour of instance \(\mathbf {x}_i\) in the domain \(\mathcal {X}^t\), and let \(\mathbf {N}_i^{s\leftarrow t} = \left[ \mathbf {x}_{\mathbf {n}_i^t[1]}^s, \mathbf {x}_{\mathbf {n}_i^t[2]}^s, \,\dots ,\mathbf {x}_{\mathbf {n}_i^t[k]}^s\right] \) be the array of the source-domain coordinates of the points that are the neighbours of instance \(\mathbf {x}_i^t\) in the target domain \(\mathcal {X}^t\). Our goal is to ‘mix’ information from different manifolds. This is accomplished by projecting the neighbours of \(\mathbf {x}_i^t\) from the source to the target domain and then finding the linear combination of these projected points that best reconstructs \(\mathbf {x}_i^t\) in the original domain:

$$\begin{aligned} \underset{\mathbf {w}_{i}}{\min }\,||\mathbf {x}_i^t - \tilde{\mathbf {x}}_i^{t\leftarrow s}||^2=\underset{\mathbf {w}_{i}}{\min }\,||\mathbf {x}_i^t - \mathbf {L}_{s}^{t}\mathbf {N}_i^{s\leftarrow t}\mathbf {w}_i||^2, \end{aligned}$$
(8)

where \(\tilde{\mathbf {x}}_i^{t\leftarrow s}\) is the reconstruction of \(\mathbf {x}_i^t\) using domain \(\mathcal {X}^s\). We visualize how substituting \(\mathbf {x}_i\) by \(\tilde{\mathbf {x}}_i^{t\leftarrow s}\) might affect the target manifold in Fig. 2. After setting the derivative w.r.t. \(\mathbf {w}_i\) to zero, the optimal solution corresponds to:

$$\begin{aligned} \mathbf {w}_i = \left( (\mathbf {\tilde{N}}^{t\leftarrow s}_i)^{\top }\mathbf {\tilde{N}}^{t\leftarrow s}_i\right) ^{-1}(\mathbf {\tilde{N}}^{t\leftarrow s}_i)^{\top }\mathbf {x}_i^t{,} \end{aligned}$$
(9)

where \(\mathbf {\tilde{N}}^{t\leftarrow s}_i=\mathbf {L}_{s}^{t}\mathbf {N}^{s\leftarrow t}_i\) are the neighbours of \(\mathbf {x}_i\) in \(\mathcal {X}^t\), projected from their coordinates in \(\mathcal {X}^s\) back to the coordinates of \(\mathcal {X}^t\). We can now transform the original space \(\mathcal {X}^t\) into the space \(\tilde{\mathcal {X}}^t\) reconstructed from the other domains by computing, for each instance, the weighted mean of its reconstructions:

$$\begin{aligned} \tilde{\mathbf {x}}_i^t=\beta _t \mathbf {x}_i^t + \sum _{s\ne t}\beta _{s}\tilde{\mathbf {x}}_i^{t\leftarrow s}{,} \end{aligned}$$
(10)

where \(\beta _j\) can be seen as a prior on domain \(\mathcal {X}^j\)'s relevance, with \(\sum _{j}\beta _j = 1\). When evaluating a new point \(\mathbf {x}_{new}\), its nearest neighbours in the training set are found first, and the reconstruction is then given by \(\mathbf {\tilde{N}}^{t\leftarrow s}_{new} \mathbf {w}_{new}\). The complexity of the algorithm is dominated by the matrix inversion in the coordinate mapping of Eq. 7, and is therefore \(\mathcal {O}(d^3)\), where d is the maximum number of features among all the domains.

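The following numpy/scikit-learn sketch (ours, not the authors' released code) shows the mixing step for a single source/target pair, following Eqs. (7)–(10); the data matrices, the neighbourhood size k, and the mixing priors \(\beta \) are illustrative assumptions.

```python
# Manifold mixing sketch for a single source/target pair (illustrative only).
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
Xs = rng.normal(size=(200, 8))       # source domain, N x |s|
Xt = rng.normal(size=(200, 5))       # target domain, N x |t|
k, beta_t, beta_s = 10, 0.7, 0.3     # neighbourhood size and mixing priors

# Eq. (7): least-squares fit of the linear map from source to target coordinates
# (lstsq returns the transpose of L_s^t).
L_st = np.linalg.lstsq(Xs, Xt, rcond=None)[0].T          # |t| x |s|

# Neighbours of each point, found in the target domain (drop the point itself).
_, idx = NearestNeighbors(n_neighbors=k + 1).fit(Xt).kneighbors(Xt)
idx = idx[:, 1:]

Xt_mixed = np.empty_like(Xt)
for i in range(Xt.shape[0]):
    # Project the neighbours' source coordinates into the target domain.
    N_proj = L_st @ Xs[idx[i]].T                         # |t| x k
    # Eqs. (8)-(9): weights that best reconstruct x_i^t from projected neighbours.
    w, *_ = np.linalg.lstsq(N_proj, Xt[i], rcond=None)
    x_rec = N_proj @ w                                   # reconstruction from the source
    # Eq. (10): weighted mean of the original point and its reconstruction.
    Xt_mixed[i] = beta_t * Xt[i] + beta_s * x_rec
```

In the stacked setting, the mixed matrix (Xt_mixed above) would replace \(\mathbf {X}^t\) as the input to the corresponding first-layer model.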

3 Experimental Section

3.1 Methods

To test our method we used a recent clinical cohort [3] containing data on patients with cardiovascular disease. There are 440 subjects in the dataset, of which 56 suffered from an early cardiovascular event. For each patient, 359 protein levels and 9 clinical parameters are measured. We evaluate the performance of our method (stacked regularization with manifold mixing) for predicting a cardiovascular event. We compare the proposed approach to a standard stacked model with joint Bayesian optimization of the hyperparameters, as well as to a random forest trained on the merged (feature-concatenated) dataset of protein levels and clinical parameters. For both our method and the standard stacking, the architecture consisted of two larger random forest models in the first layer and a smaller one as the output model.

3.2 Data Selection and Preprocessing

We perform random shuffles with a \(90\%\) train split and an even class distribution in the train and test sets, and use the remaining \(10\%\) to test the model. Since the dataset is imbalanced (far more negative than positive subjects), prior to the split at each shuffle we take a random sample from the negative class equal in size to the total number of positive subjects. The proteins were measured with the OLINK technology, which records protein expression levels via targeted, customised panels [3]; since these are standard panels, some of the proteins may have no relation to the outcome at all. For this reason, for each run we pre-select 50 proteins using univariate feature selection on the training set. We then normalize the train and test data independently and measure the average ROC for each of the methods. For hyperparameter estimation we perform 5-fold cross-validation on the training set for the random forests, and Bayesian optimization for the stacked models. Once that is done, we retrain the models with the optimal parameters on the complete training set and test them on the held-out \(10\%\). We repeat this procedure multiple times and report the average ROC-AUC as well as its standard deviation. This strategy is frequently referred to as a stability selection procedure [4].
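A schematic of one evaluation shuffle (our own sketch, with stand-in data and illustrative names; the actual pipeline may differ in details) could look as follows:

```python
# One evaluation shuffle: balance classes, split 90/10, select 50 proteins,
# normalize train/test independently, fit, and score ROC-AUC. Illustrative only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(440, 359))                    # protein levels (stand-in data)
y = np.r_[np.ones(56), np.zeros(384)].astype(int)  # 56 events, 384 controls

# Undersample the negative class to match the number of positives.
pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
keep = np.r_[pos, rng.choice(neg, size=len(pos), replace=False)]
Xb, yb = X[keep], y[keep]

X_tr, X_te, y_tr, y_te = train_test_split(Xb, yb, train_size=0.9,
                                          stratify=yb, random_state=0)

selector = SelectKBest(f_classif, k=50).fit(X_tr, y_tr)   # univariate selection
X_tr, X_te = selector.transform(X_tr), selector.transform(X_te)

# Normalize train and test independently, as described above.
X_tr = StandardScaler().fit_transform(X_tr)
X_te = StandardScaler().fit_transform(X_te)

model = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
```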

3.3 Results

The results are presented in Fig. 3. The proposed approach (MM stacked) outperformed both the regular stacked model and the random forest (RF) trained on the merged data, and both stacked regularization techniques outperformed the standard RF.

Fig. 3. Average AUC for the three methods compared. The highest performance is that of the Manifold Mixing stacked model (MM), and both stacked models outperformed Random Forests (RF) on the merged data.

4 Conclusions and Future Work

In this paper we propose the manifold mixing framework to improve the analysis of multi-modal data stemming from different sources. In our preliminary experiments, the obtained results support the efficacy of our method: we outperform both standard stacked regularization and a model built on feature-concatenated data. In the near future, we plan to perform further tests with a larger number of shuffles, and to test on different datasets and heterogeneous domains. One pitfall of the current algorithm is the linearity of the map between manifolds, which might fail in highly curved regions. A possible solution is to kernelize the method using graph kernels. Another interesting direction is to subdivide the manifold into multiple subregions based on the local curvature and create a mapping per subregion.