Discriminative Multimodal Learning via Conditional Priors in Generative Models

Deep generative models with latent variables have recently been used to learn joint representations and generative processes from multi-modal data. These two learning mechanisms can, however, conflict with each other, and the representations can fail to embed information on the data modalities. This research studies the realistic scenario in which all modalities and class labels are available for model training, but some of the modalities and labels required for downstream tasks are missing. We show, in this scenario, that the variational lower bound limits the mutual information between joint representations and missing modalities. To counteract these problems, we introduce a novel conditional multi-modal discriminative model that uses an informative prior distribution and optimizes a likelihood-free objective function that maximizes mutual information between joint representations and missing modalities. Extensive experimentation demonstrates the benefits of the proposed model: empirical results show that it achieves state-of-the-art performance in representative problems such as downstream classification, acoustic inversion, and image and annotation generation.


Introduction
In multi-modal learning, different measurement modalities x_1, x_2, ..., x_m of an object are used to learn a joint representation z that captures information from all modalities. The representation z can be used for clustering, active and transfer learning, or, where class labels y are available, downstream classification.
Deep neural networks (DNNs) and deep generative models (DGMs) with latent representations have been used in multi-modal learning (Andrew et al., 2013; Wang et al., 2015a, 2017; Suzuki et al., 2016; Du et al., 2018; Wu & Goodman, 2018; Du et al., 2019; Shi et al., 2019; Sutter et al., 2020, 2021). DGMs learn joint latent representations using variational approximations of the posterior distribution, and learn generative models for data modalities by optimizing a variational lower bound on the log-likelihood of the data. These two mechanisms can, however, conflict with each other. Generative models may focus on generating modalities without using the joint latent representation. The posterior distribution for the joint representation then fails to embed information on the modalities, collapsing into a non-informative prior distribution. This is called posterior collapse in the uni-modal domain (Dieng et al., 2019; Lucas et al., 2019), and it harms the performance of downstream tasks based on joint representations, e.g. classification or generation of missing modalities.
In many applications data comes from different channels, e.g. tuples of images and annotations, or of acoustic and articulatory measurements. However, not all observations come in tuples, because annotating images or measuring articulatory movements can be costly (Sutter et al., 2020, 2021) and time-consuming. Hence, we are interested in modeling conditional distributions that are able to learn multi-modal latent representations, which can then be used to generate the missing modalities. Learning such latent representations is important because we can capture relationships between modalities that are valuable for generative and discriminative downstream tasks. Towards this goal, we introduce a conditional multi-modal discriminative (CMMD) model that works in the aforementioned scenario, where all modalities and class labels are available for model training, but where some modalities and class labels required for downstream tasks are missing. We show, in this scenario, that the variational lower bound limits the mutual information (MI) between multi-modal representations and missing modalities. To counteract this limitation and to circumvent posterior collapse, we introduce a novel likelihood-free objective function that optimizes MI, and we also introduce a prior distribution for joint representations that is conditioned on the available modalities.
We show, through extensive experimentation, that our proposed CMMD model does not suffer from multi-modal posterior collapse. We also show that its joint representations embed information from multiple data modalities, which is useful for downstream tasks. We have benchmarked different multi-modal learning models across different representative domains, e.g. image-to-image, acoustic-to-articulatory, image-to-annotation, and text-to-image. The empirical results show that CMMD achieves state-of-the-art results in downstream classification and in the generation of missing modalities at test time.
Our main contributions are:
• A new objective function that counteracts the restriction on MI between joint representations and the missing modalities.
• A generative process that generates missing modalities at test time using a conditional prior.
• Insights into the effect of posterior collapse on downstream classification and on the generative process in multi-modal learning.

Multimodal Learning
We use a common notation. Data modalities are represented by x and distinguished by a subscript. Joint latent representations are denoted by z. In the following we provide an overview of the multi-modal learning models relevant to this work. See Guo et al. (2019) for a comprehensive review.
Deep Neural Networks Deep canonical correlation analysis (DCCA) (Andrew et al., 2013) couples deep neural networks with canonical correlation analysis (Hotelling, 1936) to train neural networks f(·) and g(·) that maximize the correlation ρ(f(x_1), g(x_2)) between views (modalities) x_1 and x_2. DCCA not only handles non-linearities, but also captures high-level data abstractions in each of its multiple hidden layers. Its objective function is, however, a function of the entire data set and therefore does not scale to large data sets. To overcome this limitation, Wang et al. (2015a) developed the deep canonically correlated autoencoder (DCCAE), which is optimized using stochastic gradient descent. DCCAE also introduced reconstruction networks for the data modalities, which minimize their reconstruction error in addition to maximizing the canonical correlation between the learned representations.
Variational Inference A problem with DCCAE is that the canonical correlation term in its objective function dominates the optimization procedure (Wang et al., 2015a), and the reconstruction of the modalities is therefore poor. Wang et al. (2017) therefore developed a variational CCA (VCCA) model to overcome this problem. VCCA uses variational inference and deep generative models to generate latent representations of the input modalities. Du et al. (2019) proposed DMDGM, a supervised extension of VCCA that combines multi-modal learning and classification in a unified framework. The classification in DMDGM uses the available views and not joint representations. DMDGM is, however, not the only model that addresses classification in a unified objective function. Du et al. (2018) developed a semi-supervised deep generative model for missing modalities, with the latent variable shared across modalities. They also modeled the inference process as a Gaussian mixture model (GMM). Modeling the inference process as a GMM, however, harms the tightness of the lower bound, as the entropy of a GMM is intractable.
The model presented by Vedantam et al. (2017) focuses on cross-modality generation, using a product of experts (PoE) in the factorization of the posterior inference distribution. Wu & Goodman (2018) similarly introduced MVAE, which assumes that the joint posterior distribution is proportional to the product of individual conditional posteriors p(z|x_1) · · · p(z|x_n) normalized by the prior distribution p(z); the joint posterior distribution is therefore also a PoE. Shi et al. (2019), following a similar approach, used a mixture of experts (MoE) to develop MMVAE, whose generative process allows conditioning modalities and generation modalities to be interchangeable. MoE and PoE provide elegant ways of cross-generation. The linear combination of marginal distributions, however, learns joint representations that might not be useful for downstream classification (see Section 4.2). Sutter et al. (2021) show that MVAE models the joint posterior distribution as a geometric mean, while MMVAE models it as an arithmetic mean. Further, they generalize these two approaches in a Mixture-of-Products-of-Experts (MoPoE) VAE, which approximates the joint posterior of all subsets of modalities. It is noteworthy that MVAE, MMVAE, and MoPoE approximate the joint posterior distribution, conditioned on all modalities, as a function of unimodal posterior distributions. Such a modeling approach can deal with any combination of missing modalities simultaneously and, therefore, cross-modal generation can be done in any direction efficiently. However, none of these models are discriminative by nature and, as a consequence, they can only handle discriminative tasks in a two-step fashion. CMMD is also able to model any combination of missing modalities, but one at a time. On the other hand, the generative and discriminative models are trained end-to-end in CMMD.
The most recent multi-modal learning research has focused on different ways of learning flexible joint representations that are useful in cross-modality generation. For example, Theodoridis et al. (2020) learn joint representations by introducing a cross-modal alignment of the latent spaces through minimizing Wasserstein distances; Nedelkoski et al. (2020) couple normalizing flows and MVAE to learn more expressive representations; Liu et al. (2021) propose a variational information bottleneck lower bound to force the encoder to discard irrelevant information, keeping only information that is relevant for generating one modality. Chen & Zhu (2022) use generative adversarial networks to simultaneously align the different encoder distributions with the joint decoder distribution. None of these methods, however, has been developed for downstream classification with missing modalities. Javaloy et al. (2022) focus on learning encoders and decoders that are impartial to the unimodal posterior distributions that generate latent representations. To achieve such impartial optimization (IO), the authors propose a novel optimization technique that modifies the gradients of each modality and, as a result, does not neglect the optimization of any specific modality.
Abrol et al. (2020) introduced a uni-modal method that uses, as in our proposed model, conditional priors to generate a discrete mixture of representations in the prior space. These are considered to be local latent variables, while continuous variables in the posterior distribution are considered to be global. For supervised data, local and global variables are aligned using maximum mean discrepancy (Gretton et al., 2007), which optimizes the mutual information between global latent variables and input data. Our proposed CMMD model, however, focuses on multi-modal data and uses conditional priors to generate representations when some modalities are missing. Further, its objective function arises from the restriction that the Kullback-Leibler divergence in the evidence lower bound imposes on mutual information.

Evidence Lower Bound
We have access to labeled multi-modal data {x_O, x_M, y}, where x_O = (x_1, ..., x_n) are n modalities that are always available and x_M = (x_{n+1}, ..., x_{n+m}) are m modalities that are missing at test time. Only x_O is therefore available for downstream tasks, the label y and x_M both missing. At test time our proposed model generates latent representations using a prior distribution p(z|x_O) conditioned on the observed modalities. Latent representations z ~ p(z|x_O) are furthermore used in both the generative process p(x_M|x_O, z) and in the classifier model p(y|z). This encourages the model to learn representations that are useful for classification, and to generate missing modalities at test time. The joint distribution in our proposed model is, in this scenario, given by p(x_M, y, z|x_O) = p(x_M|x_O, z) p(y|z) p(z|x_O), where p(z|x_O) is a prior distribution conditioned on the always available modalities, p(x_M|x_O, z) is the generative process for the missing modalities at test time, and p(y|z) is the density function for class labels. Note that the posterior distribution p(z|x_O, x_M, y), the joint latent representation that we want to learn, requires a marginal distribution that is not available in closed form. We therefore approximate the true posterior p(z|x_O, x_M, y) with the parametric encoder distribution q(z|x_O, x_M, y).

The evidence lower bound (ELBO) L(x_M, x_O, y) of our proposed model is therefore

log p(x_M, y|x_O) >= E_{q(z|x_O,x_M,y)}[ log p(x_M|x_O, z) + log p(y|z) ] - KL( q(z|x_O,x_M,y) || p(z|x_O) ) = L(x_M, x_O, y),   (1)

the inequality being a result of the concavity of log and Jensen's inequality. See Appendix A for details.

Maximizing Mutual Information
We can, in principle, optimize Eq. 1 using the stochastic gradient variational Bayes (SGVB) algorithm (Kingma & Welling, 2013). Eq. 1 does, however, include an average Kullback-Leibler divergence that is an upper bound on the conditional mutual information between z and x_M (see Appendix B), i.e.

I(x_M, z|x_O) <= E_{p(x_O, x_M, y)}[ KL( q(z|x_O, x_M, y) || p(z|x_O) ) ].   (2)
The latent representations may therefore, as a consequence of this, fail to encode information about x_M, which is equivalent to generating x_M based on the prior p(z|x_O) alone. This problem is called posterior collapse in the uni-modal literature (Dieng et al., 2019; Lucas et al., 2019), and it occurs when the variational posterior distribution matches the prior. We therefore add a conditional mutual information term to the lower bound, with ω ∈ [0, 1] weighting the optimization of mutual information. The following likelihood-free objective function for a single data point is therefore obtained:

L_ω(x_M, x_O, y) = E_{q(z|x_O,x_M,y)}[ log p(x_M|x_O, z) + log p(y|z) ] - ω KL( q(z|x_O, x_M, y) || p(z|x_O) ) - (1 - ω) KL( q(z|x_O) || p(z|x_O) ),   (3)

where the last divergence term is called the marginal KL divergence (Hoffman & Johnson, 2016). The full derivation of Eq. 3 is given in Appendix A.
The first KL divergence term in Eq. 3 has an analytical solution. The second KL divergence is intractable due to the marginal distribution q(z|x_O). It can, however, be replaced by any strict divergence (Zhao et al., 2017), e.g. the maximum mean discrepancy (MMD) divergence (Gretton et al., 2007). We select the squared population MMD, since it encourages the average posterior distribution to match the whole prior, which is

MMD²(F, p, q) = || E_{x~p}[ k(x, ·) ] - E_{x'~q}[ k(x', ·) ] ||²_H.   (4)

Here F is the unit ball in a universal reproducing kernel Hilbert space H, p and q are Borel probability measures, and k(·, ·) is a universal kernel. We use a Gaussian kernel in our proposed model. Finally, the objective function for a single data point therefore becomes

L(x_M, x_O, y) = E_{q(z|x_O,x_M,y)}[ log p(x_M|x_O, z) + α log p(y|z) ] - ω KL( q(z|x_O, x_M, y) || p(z|x_O) ) - (1 - ω) λ MMD²( q(z|x_O), p(z|x_O) ),   (5)

where λ counteracts the loss imbalance between the x_M and z spaces and α controls the importance of the classification loss in the objective function.
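As an illustration of how the MMD term in Eq. 5 can be computed in practice, the following is a minimal NumPy sketch of the (biased) empirical estimate of the squared MMD with a Gaussian kernel; the bandwidth value and the function names are our own choices, not taken from the paper.

import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    # k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 * bandwidth^2))
    sq_dists = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def mmd_squared(z_q, z_p, bandwidth=1.0):
    # Biased empirical estimate of MMD^2 between samples z_q ~ q(z|x_O) and z_p ~ p(z|x_O).
    k_qq = gaussian_kernel(z_q, z_q, bandwidth)
    k_pp = gaussian_kernel(z_p, z_p, bandwidth)
    k_qp = gaussian_kernel(z_q, z_p, bandwidth)
    return k_qq.mean() + k_pp.mean() - 2.0 * k_qp.mean()

# Example with two batches of 50-dimensional latent samples.
rng = np.random.default_rng(0)
z_posterior = rng.normal(size=(256, 50))
z_prior = rng.normal(loc=0.1, size=(256, 50))
print(mmd_squared(z_posterior, z_prior))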
Effect of ω on the objective function: The first (term-by-term) KL divergence in Eq. 5 regularizes each posterior distribution towards its prior and is minimized when q(z|x_O, x_M, y) = p(z|x_O) for every data point. The marginal MMD divergence, on the other hand, regularizes the average posterior distribution q(z|x_O) = 1/N Σ_i q(z|x_O, x_M^i, y^i) towards the prior distribution, without sacrificing model power (Hoffman & Johnson, 2016). Makhzani et al. (2015) show that the term-by-term KL divergence simply encourages the average posterior distribution to match the modes of the prior p(z|x_O). The MMD term in Eq. 5, however, encourages the average posterior distribution to match the whole prior, giving an effect similar to the adversarial training proposed by Makhzani et al. (2015). Furthermore, driving only the marginal MMD divergence to 0 may leave the prior unable to sculpt useful latent representations (Hoffman & Johnson, 2016). Setting the term-by-term KL divergence to 0, on the other hand, implies that the joint posterior representation is independent of the modality x_M. Our proposed objective function therefore offers an elegant way of trading off these effects through the ω parameter, recovering the variational lower bound for ω = 1 and, for 1 > ω ≥ 0, optimizing mutual information. The optimal ω value, as can be seen in Sections 4.3.1-4.3.5 and 4.4, is specific to the learning task and must therefore be found by cross-validation.
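To make the interaction of the reconstruction, classification, term-by-term KL, and marginal MMD terms concrete, the sketch below assembles a per-batch version of Eq. 5. It is written in PyTorch under our own module names (encoder, prior_net, decoder, classifier, and mmd_fn are assumptions standing in for the networks and the MMD estimator described in the text; the authors' implementation is in Theano), so it is a sketch of the objective under those assumptions rather than the reference implementation.

import torch
import torch.nn.functional as F

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    # Closed-form KL(q||p) between diagonal Gaussians, summed over latent dimensions.
    return 0.5 * torch.sum(
        logvar_p - logvar_q
        + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
        - 1.0, dim=-1)

def cmmd_loss(x_o, x_m, y, encoder, prior_net, decoder, classifier,
              mmd_fn, omega=0.7, lam=1000.0, alpha=10.0):
    # Conditional prior p(z|x_O) and encoder q(z|x_O, x_M, y).
    mu_p, logvar_p = prior_net(x_o)
    mu_q, logvar_q = encoder(x_o, x_m, y)
    z_q = mu_q + torch.randn_like(mu_q) * (0.5 * logvar_q).exp()  # reparameterization trick
    z_p = mu_p + torch.randn_like(mu_p) * (0.5 * logvar_p).exp()

    # Reconstruction of the missing modality (Gaussian decoder with unit variance assumed here).
    recon = F.mse_loss(decoder(x_o, z_q), x_m, reduction="none").sum(dim=-1)

    # The classifier is fed samples from the conditional prior (see Fig. 1).
    class_loss = F.cross_entropy(classifier(z_p), y, reduction="none")

    # Term-by-term KL (weight omega) and marginal MMD (weight 1 - omega).
    kl_term = kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p)
    mmd_term = mmd_fn(z_q, z_p)

    return (recon + alpha * class_loss + omega * kl_term).mean() + (1 - omega) * lam * mmd_term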
CMMD finally assumes the following density functions for the prior distribution, the classifier, and the encoder:

p(z|x_O) = N( z | μ_θ(x_O), diag(σ²_θ(x_O)) ),  p(y|z) = Cat( y | π_θ(y|z) ),  q(z|x_O, x_M, y) = N( z | μ_φ(x_O, x_M, y), diag(σ²_φ(x_O, x_M, y)) ).

The decoder network is parametrized as

p(x_M|x_O, z) = N( x_M | μ_θ(x_O, z), diag(σ²_θ(x_O, z)) )  or  Bernoulli( x_M | p_θ(x_O, z) ),

depending on the modality (see Appendix C), where N and Cat denote the Gaussian and multinomial distributions respectively, and where each density parameter is the output of a multilayer perceptron (MLP) network f(·) (Rumelhart et al., 1985). This means that the density parameters μ, σ², p, and π_{y|z} are parametrized by neural networks, with learnable parameters denoted by θ and φ. Note that the classifier can handle binary, multi-class, and multi-label classification using a sigmoid, softmax, or multiple sigmoid activation functions respectively at the output layer.
We observed that feeding the classifier with z ~ q(z|x_O, x_M, y) leads to an unstable classification of y. We therefore feed the classifier p(y|z) with z ~ p(z|x_O) during both training and test time. We hypothesize that the prior distribution reproduces the test scenario more accurately than the posterior distribution. Fig. 1 shows the forward propagation during training and test time in our proposed methodology.
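At test time only x_O is observed, so both the generation of x_M and the prediction of y run through the conditional prior. A minimal sketch of this forward pass, using the same hypothetical module names as in the sketch above, is:

import torch

@torch.no_grad()
def cmmd_predict(x_o, prior_net, decoder, classifier):
    # Test-time forward pass: z ~ p(z|x_O) feeds both p(x_M|x_O, z) and p(y|z).
    mu_p, logvar_p = prior_net(x_o)
    z = mu_p + torch.randn_like(mu_p) * (0.5 * logvar_p).exp()
    x_m_generated = decoder(x_o, z)       # cross-modal generation of the missing modality
    y_predicted = classifier(z).argmax(dim=-1)  # downstream classification from the prior sample
    return x_m_generated, y_predicted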

Experiments and Results
This section compares our proposed CMMD model with different multi-modal learning algorithms in downstream classification tasks across three different domains: image-to-image using multi-modal versions of MNIST and SVHN, acoustic-to-articulatory with the XRMB data set, and image-to-annotation using the MIR Flickr data set. The benchmark models are: CCA (Hotelling, 1936), DCCA (Andrew et al., 2013), DCCAE (Wang et al., 2015a), MVCL (Hermann & Blunsom, 2013), RBM-MDL (Ngiam et al., 2011), VCCA (Wang et al., 2017), MVAE (Wu & Goodman, 2018), and MMVAE (Shi et al., 2019). We include a classifier model M-x_O that only uses the always available modality, to allow the impact of joint representations on classification to be assessed. Network architectures and model training details are given in Appendix C. However, given the importance of the ω hyperparameter in the optimization of our proposed model, we mention the value found by cross-validation in each experiment, unless otherwise specified. See Figure C1 for an overview of all ω values.

Data sets
In the following we explain the multi-modal data sets used in this research.

2-modality MNIST:
This data set, introduced by Wang et al. (2015a), consists of 28 × 28 MNIST hand-written digit images (Deng, 2012). The images are randomly rotated at angles in the interval [−π/4, π/4] to generate x_O. The modality x_M is generated by randomly selecting a digit from x_O and adding noise uniformly sampled from [0, 1] to each pixel of the non-rotated image. Each pixel is then truncated to the interval [0, 1].
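The two views can be reproduced in a few lines. The sketch below follows the description above; the same-class pairing for the noisy view is our reading of the Wang et al. (2015a) construction, and the interpolation order is our own choice.

import numpy as np
from scipy.ndimage import rotate

def make_two_view_mnist(images, labels, seed=0):
    # images: (N, 28, 28) floats in [0, 1]; labels: (N,) digit classes.
    rng = np.random.default_rng(seed)
    angles = rng.uniform(-45.0, 45.0, size=len(images))       # [-pi/4, pi/4] in degrees
    x_o = np.stack([rotate(img, a, reshape=False, order=1)    # rotated view
                    for img, a in zip(images, angles)])
    x_m = np.empty_like(images)
    for i, y in enumerate(labels):
        j = rng.choice(np.flatnonzero(labels == y))           # a non-rotated digit of the same class
        x_m[i] = images[j] + rng.uniform(0.0, 1.0, size=images[j].shape)
    return np.clip(x_o, 0.0, 1.0), np.clip(x_m, 0.0, 1.0)     # truncate pixels to [0, 1]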

MNIST-SVHN:
We randomly paired each instance of an MNIST digit (x_O) with an instance of the same digit class in the SVHN data set (Netzer et al., 2011) (x_M), which is composed of street-view house numbers, just as in Shi et al. (2019).

3-modality MNIST:
This data set combines some of the modalities from the previous data sets, i.e. original MNIST, rotated MNIST, and SVHN digits, all of the same digit class.
MNIST-SVHN-Text: This data set was first introduced in Sutter et al. (2020) and is based on the MNIST-SVHN data set. The additional string modality contains 8 characters, all of which are blank spaces except for the digit word, whose starting position is chosen randomly. The 8-character string is finally converted to a 71D one-hot encoding, 71 being the number of possible characters in the dictionary used in Sutter et al. (2020). The experiments using this data set consider all possible combinations of missing and observable modalities; see Section 4.3.2.
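The string modality can be constructed as follows. This is a sketch of our reading of the description above: the digit-word list and the per-position one-hot layout are assumptions, and alphabet stands in for the 71-character dictionary of Sutter et al. (2020), whose exact contents we do not reproduce.

import numpy as np

DIGIT_WORDS = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]

def make_text_modality(label, alphabet, length=8, seed=0):
    # One 8-character string per digit: the digit word at a random start, blanks elsewhere,
    # followed by a one-hot encoding over the (assumed) 71-character alphabet.
    rng = np.random.default_rng(seed)
    word = DIGIT_WORDS[label]
    start = rng.integers(0, length - len(word) + 1)
    text = " " * start + word + " " * (length - start - len(word))
    one_hot = np.zeros((length, len(alphabet)), dtype=np.float32)
    for pos, ch in enumerate(text):
        one_hot[pos, alphabet.index(ch)] = 1.0
    return text, one_hot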

XRMB:
The original XRMB data set (Westbury, 1994) contains simultaneously recorded speech and articulatory measurements from 47 American English speakers. The modality x_O, the acoustic data, is composed of a 13D vector of mel-frequency cepstral coefficients (MFCCs), to which we add their first and second derivatives. This 39D vector is concatenated over a 7-frame window around each frame, resulting in the 273D vector that corresponds to x_O. The modality x_M, the articulatory data, is formed by the horizontal and vertical displacements of 8 pellets on the tongue, lips, and jaw, resulting in a 112D vector. The data set contains 40 phone classes.
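The 273D acoustic vector is simply a 7-frame context window around each 39D MFCC(+Δ, +ΔΔ) frame. A minimal sketch of the stacking is shown below; edge padding by repetition is our own choice for the boundary frames, which the text does not specify.

import numpy as np

def stack_context(frames, context=3):
    # frames: (T, 39) MFCC + delta + delta-delta features.
    # Returns (T, 39 * (2*context + 1)) vectors: each frame concatenated with
    # its 3 left and 3 right neighbours, i.e. 7 * 39 = 273 dimensions.
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    windows = [padded[i:i + len(frames)] for i in range(2 * context + 1)]
    return np.concatenate(windows, axis=1)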
Flickr: The Flickr data set (Huiskes & Lew, 2008) contains 1 million images, 25000 of which are labeled according to 24 classes; note that each image can be assigned to multiple classes. Stricter labeling was also carried out for 14 of the classes, with images only annotated with a category where that category is salient. The data set therefore has 38 classes. We used the same 3857D feature vector (x_O) as Srivastava & Salakhutdinov (2012) to describe the images. The modality x_M is composed of tags related to the image, the tags being constrained to a vocabulary of the 2000 most frequent words.

Posterior Collapse in Multimodal Learning
This section evaluates the impact of posterior collapse in VCCA, MVAE, MMVAE, and our proposed CMMD model. We measure posterior collapse as the proportion of latent dimensions that are within ε KL divergence of the prior for at least 99% of the data sample, as introduced by Lucas et al. (2019).
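The collapse statistic can be computed directly from per-dimension KL values. The sketch below assumes diagonal Gaussian posteriors and priors whose means and variances have already been collected over a data sample; the function names are ours.

import numpy as np

def kl_per_dim(mu_q, var_q, mu_p, var_p):
    # Per-dimension KL between diagonal Gaussians; all inputs have shape (N, d).
    return 0.5 * (np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def collapse_fraction(mu_q, var_q, mu_p, var_p, eps, delta=0.01):
    # Fraction of latent dimensions i with Pr(KL_i < eps) >= 1 - delta over the data sample.
    kl = kl_per_dim(mu_q, var_q, mu_p, var_p)      # (N, d)
    prop_below = (kl < eps).mean(axis=0)           # per-dimension proportion of data points
    return float((prop_below >= 1.0 - delta).mean())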
We trained all models using a 4-fold cross-validation approach, each fold containing 2 speakers from the XRMB data set (Westbury, 1994). Table 1 shows that CMMD, optimized with ω = 0.8, outperforms all other methods in terms of error rates and root mean square errors (rmse) for the generated missing modality. VCCA surprisingly ranks number two in the classification task, despite having a simpler architecture than MVAE and MMVAE. MVAE has lower error rates than MMVAE, even when we train MMVAE using an importance weighted approach and k = 10 samples. MMVAE IWAE generates the missing modality more accurately than MMVAE ELBO, and achieves smaller error rates.
The first two diagrams on the left side of Fig. 2 show the posterior collapse between z|x_O and z, and between z|x_M and z. They show, for both versions of MMVAE, that around 80% of the dimensions in the latent representations collapse to N(0, 1). This implies that the latent representation is independent of the observed modalities. MVAE, however, needs more than 5 nats when conditioned on the modality x_O, and more than 6 nats when conditioned on the view x_M, before 80% of the latent dimensions collapse. None of the latent dimensions in VCCA and CMMD are within 6 nats, and their latent representations therefore embed more information on the observed modalities. This information is useful for downstream classification and, for CMMD, for the generation of the missing modality. The third diagram finally shows the posterior collapse between the representations generated using z|x_O and z|x_M. We want, in this case, z|x_O to collapse into z|x_M, meaning that the model is able to learn joint representations that contain information on x_M. Note that, in MMVAE, the collapse between both marginal distributions is strong, given that both collapse to N(0, 1). On the other hand, the marginal distributions in MVAE embed information on the modalities (see the first two diagrams). MVAE, however, fails to learn joint representations, as suggested by the third diagram. CMMD counteracts posterior collapse through the conditional prior and through directly optimizing mutual information, as shown by the first three diagrams.
We adapted the posterior collapse definition to the analysis of the variance parameters in the generative process, to allow us to understand the rmse results for the generated missing modality. This is shown in the last diagram of Fig. 2. For example, at ε = 0.06, around 79% of the dimensions in x_M generated by CMMD have lower values than ε, whereas only 45% of the variance parameters learned by MMVAE ELBO have lower values than ε. We therefore hypothesize that MMVAE ELBO and MVAE overestimate the variance parameters in the decoder, resulting in higher rmse. The significant improvement for MMVAE IWAE seems to come from a higher-capacity decoder rather than from improved learned representations. Note that the variance collapse for VCCA is included for reference only; it is actually generated using the modality x_M, which in theory is missing.

A mixture of experts and a product of experts provide elegant cross-generation in multi-modal learning, the joint posterior distribution being a linear combination of marginal parameters or distributions. Our approach to learning the posterior distribution is, however, to use a single encoder network, which can capture interactions between all modalities. The model we propose handles missing modalities using a conditional prior modulated by the available modalities. VCCA presents an interesting alternative for learning joint representations, the generative process embedding information on the modalities into z. VCCA cannot, however, generate missing modalities, which its generative model requires. Note that only CMMD has lower error rates than the baseline model M-x_O, which indicates that current variational multi-modal models are not suitable for learning joint representations that are useful for downstream classification. CMMD should therefore be preferred over VCCA, MVAE, and MMVAE given that, in the setting of this research, CMMD outperforms concurrent models in downstream classification and in the generation of missing modalities at test time.

Image-to-Image with MNIST
Table 2 shows that the performance of our proposed CMMD model is on a par with state-of-the-art models, including those that use pre-trained weights. We observed (practically) the same model performance for this data set at different ω values, our best model using a value of 0.4. Note that both DCCA and DCCAE use weights pre-trained with Boltzmann machines (BMs) (Salakhutdinov & Hinton, 2009). For completeness, we therefore also retrained DCCA and DCCAE without pre-trained weights. 2D t-SNEs of the latent space can be found in Appendix F.
We used, in a second analysis, the original version of MNIST as x_O and the SVHN data set as x_M.
Our best model used ω = 0.1 and achieved a higher accuracy than MVAE and MMVAE, as shown in Table 3.
To show that CMMD can handle more than one missing and observed modality, we construct a 3-modality data set by matching the class labels in MNIST (x_1), rotated MNIST (x_2), and SVHN (x_3). We used the same model parameters as in the previous experiment, and considered two test scenarios: i) rotated MNIST and SVHN are both missing, i.e. x_O = x_1 and x_M = (x_2, x_3), and ii) SVHN is missing, i.e. x_O = (x_1, x_2) and x_M = x_3. The top (bottom) row in Table 4 shows the classification performance for the test scenario in which two (one) modalities are missing. Generated modalities are shown in Appendix G.

Image-to-Text with MNIST and SVHN
Note that, given M = 3 modalities, there are 2^M − 1 = 7 combinations of observable modalities x_O. We generate multimodal representations, conditioned on all of the possible combinations of observed modalities, with the CMMD model. After training, we randomly choose 500 representations from the training data set to train a multi-class logistic regression to classify the true digits. Table 5 compares the classification performance of the CMMD model (see Figure C1 and Table H2 for the ω values used in these experiments), under this two-step classification approach, with that of MVAE, MMVAE, and MoPoE in experiments similar to those in Sutter et al. (2021) and Javaloy et al. (2022). We report model accuracy averaged over all 7 combinations of observable modalities and 5 different runs. Models ending with IO are trained with the impartial optimization approach introduced in Javaloy et al. (2022).
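The two-step protocol amounts to fitting a linear classifier on a handful of representations. A sketch with scikit-learn is shown below, where train_repr/test_repr and the digit labels are assumed to come from the trained multi-modal model and the data set; the function name is ours.

import numpy as np
from sklearn.linear_model import LogisticRegression

def two_step_accuracy(train_repr, train_labels, test_repr, test_labels, n_train=500, seed=0):
    # Fit a multinomial logistic regression (scikit-learn defaults) on 500 randomly chosen
    # latent representations and report digit accuracy on held-out representations.
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(train_repr), size=n_train, replace=False)
    clf = LogisticRegression().fit(train_repr[idx], np.asarray(train_labels)[idx])
    return clf.score(test_repr, test_labels)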
We can see that IO increases the classification accuracy of all three models, especially for MMVAE. However, CMMD achieves higher discriminative power in all scenarios of observable modalities. Furthermore, Figure 3 shows some examples of the images generated by CMMD. The left panel shows generated images for the MNIST and SVHN modalities conditioned on the Text modality, while the right panel shows generated images for the SVHN modality conditioned on both the Text and MNIST modalities. Note that the SVHN images in the right panel are sharper than those in the left panel, since the generative model is conditioned on more observed modalities in that case. Details on model architectures and hyperparameter values are given in Appendix H.
Finally, using the same models as before, we evaluate the quality of the missing modalities generated conditioned on all different combinations of observable modalities, i.e. conditional or cross-modal generation. To that end, we use the generative coherence metric, first introduced in Shi et al. (2019). Following previous works and the same architectures as in Sutter et al. (2021), we train a classifier on the original unimodal training data set to classify the generated modalities. If the classifier detects the same attributes in the generated samples, the generation is coherent. Further, we use classification accuracy to measure the quality of the generated samples. Table 6 shows accuracy values of the conditionally generated modalities averaged over 5 different runs. The letter at the top indicates the modality being generated based on the different sets of modalities below, where M, S, and T stand for the MNIST, SVHN, and Text modalities. CMMD achieves higher accuracy in most of the conditional generation scenarios.
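Generative coherence reduces to the accuracy of a pre-trained unimodal classifier on the conditionally generated samples. A minimal sketch is shown below, where digit_classifier is a hypothetical classifier trained on the original unimodal data.

import torch

@torch.no_grad()
def generation_coherence(generated_samples, true_labels, digit_classifier):
    # Accuracy of a classifier, trained on the original unimodal data,
    # when applied to the conditionally generated modality.
    predictions = digit_classifier(generated_samples).argmax(dim=-1)
    return (predictions == true_labels).float().mean().item()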

Acoustic-to-Articulatory with XRMB
The experimental setup and the data pre-processing used in this section are based on Wang et al. (2017). Table 2 shows average error rates for all test speakers, CMMD outperforming the previously proposed methods even though they use a task-specific classifier (Hermansky et al., 2000). For this experiment, CMMD is optimized with ω = 0.7. Note that Wang et al. (2017) used the tandem speech recognizer (Hermansky et al., 2000) as the classifier model in all the experiments they conducted. The tandem speech recognizer successfully couples neural networks and Gaussian mixture models for word recognition and, in the benchmark results of Hermansky et al. (2000), reduced speech classification error rates by 35%. Wang et al. (2017) also used the 39D vector of MFCCs and the joint data representations as input data for the tandem recognizer in all experiments. We hypothesize that this further improves the performance of the tandem recognizer. The CMMD model we propose, however, only uses the shared data representations for classification.

Image-to-Annotation with Flickr
We use the same data set in this section as that used in Srivastava & Salakhutdinov (2012). Most of the Flickr data corresponds to unlabeled images. We therefore used a two-stage training approach. First, we trained our proposed model without the classifier and omitting the class label in the encoder, i.e. using q(z|x_O, x_M). Second, we used the weights from the first stage to initialize the corresponding networks in Eq. 5 and used random initial weights for y in the encoder q(z|x_O, x_M, y).
Following the standards set by previous research, we use the mean average precision (mAP) to measure the classification performance of our proposed CMMD model for 10000 randomly selected images. Table 2 shows that CMMD, optimized with ω = 0.5, and MVAE outperform previously proposed image classification methods.

Acoustic Inversion and Annotation Generation
We tested the generative process p(x_M|x_O, z) in CMMD on image-to-annotation mapping and on acoustic-to-articulatory mapping (called acoustic inversion (AI)). The scarce availability of articulatory data (Badino et al., 2017) makes acoustic inversion an important field. Table 8 shows, on the test data set, that CMMD outperforms the rmse for AI reported in Wang et al. (2015b), which is based on the training and validation data sets (1.17 and 1.96, respectively). Our results also outperform the average rmse of 2.14 obtained on the test data set by Badino et al. (2017).
The second experiment involves generating tags, which can be costly to obtain, that describe a given picture in the Flickr data set. We used our trained model from the previous section and compared it with the deep Boltzmann machine (DBM) model (Srivastava & Salakhutdinov, 2012), MVAE, and MMVAE. We furthermore tested all models on different images with different levels of complexity. Table 7 shows some of the generated tags. More examples are given in Appendix D.
The generative process in CMMD generates quality articulatory and annotation samples at test time.
The results suggest that the prior distribution in our proposed model learns joint representations through the optimization of our proposed objective function, which maximizes mutual information between representations and the missing modality at test time.

Analysis of the Objective Function
In this section, we train the CMMD model using the XRMB data set for all speakers in Table 8, using a speaker-dependent approach, i.e. a 70%-30% training-testing split of the data for each speaker, unless otherwise specified. We furthermore trained the CMMD model in two ways: i) we fine-tuned ω over the grid {0, 0.1, ..., 1}, and ii) we used ω = 0.7 (the optimal value in the previous section) and fixed the variance parameters in the decoder network p(x_M|x_O, z) to the same value as in Wang et al. (2017), i.e. σ² = 0.01.

Impact of Fixed Variance Parameters:
The second diagram in Fig. 4 shows some variability across speakers in the modality x_M. Fixing the variance parameters in p(x_M|x_O, z) therefore deteriorates error rates, as shown in the first diagram. Should we optimize the ELBO? The third panel in Fig. 4 compares error rates for the ELBO (dashed line), recovered for ω = 1, and for our proposed objective function (Eq. 5) with fine-tuned ω. Our proposed objective function achieves lower error rates for all speakers. The rmse values for the generated features in x_M are also smaller when we optimize our proposed objective function.
The top panel in Figure 5 shows the posterior collapse in the CMMD model for ω = 0 and ω = 1; the latter optimizes the ELBO, while the former optimizes mutual information (in addition to the generative and classifier models). Remember that the main motivation for including the mutual information term I(x_M, z|x_O) is to counteract the posterior collapse problem and, from the figure, it is clear that CMMD avoids posterior collapse by optimizing mutual information. However, as shown by the four panels in the middle and bottom of Figure 5 (in which we add the relatively more complex learning task presented in Section 4.2, but varying ω), optimizing only mutual information harms the performance of the generative and classifier models, reflected in the rmse and error rate respectively. Note that only optimizing mutual information amounts to minimizing an average MMD divergence measure; that is, we only minimize the divergence from the average conditional posterior q(z|x_O) to the conditional prior. Our results confirm that minimizing an average divergence measure makes the prior distribution, which is used for downstream tasks, unable to sculpt latent representations, as suggested by Hoffman & Johnson (2016). On the other hand, only optimizing the term-by-term KL divergence leads to latent representations z that are independent of x_M, which turns out to be relatively less harmful for downstream tasks. Fortunately, our proposed objective function offers a way of trading off these two effects and, as can be seen in the middle and bottom rows of Figure 5, there is an ω region in which the generative and classifier models achieve higher performance. Hence, the optimal ω value is specific to the learning task and must be found by cross-validation.
How much overhead does mutual information optimization add? We use the MNIST-SVHN-Text data set to measure training time for ω = 0 and ω = 1. The average training time for processing one batch of 256 observations is 10.59 milliseconds if the ELBO is optimized. On the other hand, the average training time to optimize our proposed objective function, including mutual information, is 11.04 milliseconds, and the training time is the same for any 1 > ω > 0. Our proposed objective function therefore does not add significant overhead, and it achieves higher performance in the downstream tasks considered in this research.

Conclusion
This research studies the effect of posterior collapse on downstream classification and on the generative process of multi-modal learning models. We show that the variational lower bound on the conditional likelihood contains a Kullback-Leibler divergence that limits the amount of information on the modalities embedded in the joint representation. To counteract this effect, we propose a novel likelihood-free objective function that optimizes the mutual information between joint representations and the modalities that we are interested in generating at test time. Our proposed CMMD model furthermore uses an informative prior distribution that is conditioned on the modalities that are always available.
The empirical results show that the objective function we propose achieves higher downstream classification performance and lower rmse for the generated modalities than the regular variational lower bound. The model we propose also successfully counteracts the posterior collapse problem by optimizing mutual information and by using an informative prior. Finally, the higher performance of our proposed CMMD model with respect to the state of the art is consistent across different representative multi-modal problems.

Appendices

A Objective Function
The joint distribution in the CMMD model is p(x_M, y, z|x_O) = p(x_M|x_O, z) p(y|z) p(z|x_O) and, under this model specification, the posterior distribution p(z|x_O, x_M, y) is intractable. Therefore, CMMD uses VI and approximates the true posterior distribution with a variational density q(z|x_O, x_M, y). Hence, the variational lower bound on the marginal log-likelihood of a single observation is

log p(x_M, y|x_O) = log E_{p(z|x_O)}[ p(x_M|x_O, z) p(y|z) ] >= E_{q(z|x_O,x_M,y)}[ log p(x_M|x_O, z) + log p(y|z) ] - KL( q(z|x_O, x_M, y) || p(z|x_O) ) = L(x_M, x_O, y),   (A1)

where the inequality is a result of the concavity of log and Jensen's inequality.
Now we can write the conditional mutual information term I_e(x_M, z|x_O) (which depends on the functional form of the encoder, as denoted by the subscript) as follows:

I_e(x_M, z|x_O) = E_{p(x_M, x_O)}[ KL( p_e(z|x_O, x_M) || p_e(z|x_O) ) ] = E_{p(x_M, x_O)}[ KL( p_e(z|x_O, x_M) || p(z|x_O) ) ] - E_{p(x_O)}[ KL( p_e(z|x_O) || p(z|x_O) ) ],   (A2)

where p_e(z|x_O) = E_{p(x_M|x_O)}[ p_e(z|x_O, x_M) ], and all probability density functions are approximated by variational approximations (the encoder and prior distribution in our proposed model). The expectations E_{p(x_M, x_O)} and E_{p(x_O)} are finally estimated using the empirical data distribution p_D.
Adding the conditional mutual information term (1 − ω) I_e(x_M, z|x_O) to the lower bound in Eq. A1 (mutual information optimization being controlled by ω ∈ [0, 1]) and replacing p_e with the encoder q(z|x_O, x_M, y) gives the likelihood-free objective function for a single data point

L_ω(x_M, x_O, y) = E_{q(z|x_O,x_M,y)}[ log p(x_M|x_O, z) + log p(y|z) ] - ω KL( q(z|x_O, x_M, y) || p(z|x_O) ) - (1 - ω) KL( q(z|x_O) || p(z|x_O) ).

Note that we can obtain unbiased samples from q(z|x_O) by first randomly sampling tuples (x_M, y) ~ p_D and then z ~ q(z|x_O, x_M, y). These are used to estimate the MMD divergence term in Eq. 5.
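In code, this sampling scheme can be realized by re-pairing each x_O in a batch with a randomly drawn (x_M, y) tuple before encoding. This is one possible reading of the scheme (the text does not specify whether (x_M, y) is drawn jointly with x_O), and the module name encoder is hypothetical.

import torch

def sample_aggregate_posterior(x_o, x_m, y, encoder):
    # Draw (x_M, y) ~ p_D by shuffling the batch, then z ~ q(z|x_O, x_M, y)
    # via the reparameterization trick; the resulting z approximate q(z|x_O).
    perm = torch.randperm(x_m.size(0))
    mu, logvar = encoder(x_o, x_m[perm], y[perm])
    return mu + torch.randn_like(mu) * (0.5 * logvar).exp()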

B Upper Bound on Mutual Information
Using the last line in Eq. A2 and replacing p_e with the encoder q(z|x_O, x_M, y), which acknowledges the access to a labeled data set, it follows that

I_e(x_M, z|x_O) = E_{p(x_O, x_M, y)}[ KL( q(z|x_O, x_M, y) || p(z|x_O) ) ] - E_{p(x_O)}[ KL( q(z|x_O) || p(z|x_O) ) ] <= E_{p(x_O, x_M, y)}[ KL( q(z|x_O, x_M, y) || p(z|x_O) ) ],

given that the KL divergence is non-negative. The expectation can be estimated using the empirical data distribution p_D.

C Model Training and Architectures
We minimized Eq. 5 using SGVB and automatic differentiation routines in Theano (Theano Development Team, 2016). Note that the reconstruction term of Eq. 5 can be efficiently estimated using the reparameterization trick (Kingma & Welling, 2013). The KL divergence term has a closed-form expression (Kingma & Welling, 2013; Mancisidor et al., 2020), and the MMD divergence is approximated numerically by drawing samples, as explained in Appendix A. This is the method suggested by Zhao et al. (2017) and Rezaabad & Vishwanath (2020).
CMMD architectures are, to provide a fair comparison in all experiments, chosen to resemble those of previous works. We furthermore use softplus activation functions in all hidden layers, with dropout (Srivastava et al., 2014) at a rate of 0.2. We use the same α and λ parameter values for all CMMD models, set to 10 and 1000 respectively. We furthermore tune the hyperparameter ω over the grid {0, 0.1, ..., 1}; Figure C1 shows the optimal value of ω for each experiment in this research, found by cross-validation. Finally, we use the Adam optimizer (Kingma & Ba, 2014) with a 10^-4 learning rate in all experiments. Our model is implemented in Theano and trained on a GeForce GTX 1080 GPU.
Image-to-Image with MNIST: The encoder network uses 3 hidden layers of 2500 neurons. Both the prior distribution and the decoder use 3 layers of 1024 neurons. The latent variable is a 50D vector and the classifier uses 2 hidden layers of 50 neurons. Given that the second view is almost a continuous variable, we assume that it is Gaussian distributed.
Image-to-Image with MNIST-SVHN: The encoder, decoder, and prior distribution in this experiment have 1 hidden layer of 400 neurons. The latent representation is a 20D vector and the classifier has 2 hidden layers of 50 neurons each.

Table C1: Tags generated using our proposed CMMD model for some labeled images in the Flickr data set.

3-modality MNIST:
We use the same encoder in this experiment as in the image-to-image with MNIST experiment. The decoder architecture is shown in Table H3 (Decoder columns), which is the same architecture as in Shi et al. (2019). We add an extra layer, like the one at the bottom of Table H3 but with stride 1, to generate the two missing modalities (x_2 and x_3). Note that, for the rotated MNIST images, we pad the images to a 32 × 32 matrix during training, and crop back to a 28 × 28 matrix at test time. The decoder loss is, finally, the sum of two cross-entropy terms, one for each missing modality.
Acoustic-to-Articulatory with XRMB: We trained our model using the same 35 speakers as used by Wang et al. (2015a, 2017). The current version of the test data set, however, only contains 8 speakers without silence frames (silence frames were removed for the other 35 speakers). Our model is, for this reason, tested on 8 speakers in a speaker-independent downstream classification task (Table 2).
We use an encoder with 3 hidden layers of 3000 neurons. The prior distribution and decoder each have 3 hidden layers of 1500 neurons. The classifier model has 2 hidden layers of 100 neurons and the latent shared representation is a 70D vector. We assume a Gaussian distribution for the modality x_M in this case. The ω parameter has, for this data set, a significant impact on downstream classification, and our best model uses ω = 0.7.
Image-to-Annotation with Flickr: We use an encoder with 4 hidden layers of 2048 neurons each.
The prior distribution and decoder use 4 hidden layers of 1024 neurons. Given that the modality x_M corresponds to tags, we use a Bernoulli decoder. The shared representation is a 1024D vector and our best model uses ω = 0.5. We deal with multi-label classification in this data set. The classifier for this model therefore uses 2000 neurons with sigmoid activations in the output layer and 2 hidden layers of 1550 neurons. Note that we follow previous works and exclude unlabeled images if they have fewer than 2 tags, given that we are interested in finding joint representations for both data modalities. We finally standardize all features in the modality x_O.

D Generating Tags
Table C1 shows tags generated using our proposed CMMD model for some labeled images in the Flickr data set. Note that none of the images have any tags in the original data set.

E Additional Details on Posterior Collapse
We use the posterior collapse definition introduced in Lucas et al. (2019). This, in our experiments, is Pr(KL[q(·)||p(·)] < ε) ≥ 1 − δ, where δ = 0.01 and ε ∈ [0, 6]. We therefore measure the proportion of latent dimensions i that are within ε KL divergence for at least 1 − δ of the data points. The MMVAE, MVAE, and VCCA models are, in our experiments, trained using the authors' publicly available code.

Figure G1: Generated images using CMMD (top row), MMVAE (middle row), and MVAE (bottom row). For all models, we use the original MNIST digits to draw latent representations, which are then used to generate SVHN digits. Note that the MVAE images are taken from Shi et al. (2019).
Table H1: Accuracy performance, averaged over 5 different runs, for all subsets of observable modalities. We do not include results for the method introduced in Javaloy et al. (2022), given that the authors only provide average values over the different sets of observable modalities (see Table 5).

H Cross-modal Generation with MNIST-SVHN-Text
The network architectures used in the experiments with the MNIST-SVHN-Text data set are shown in Tables H3, H4, and H5, and are the same architectures used in Shi et al. (2019), Sutter et al. (2020, 2021), and Javaloy et al. (2022). The only difference is that we use the encoder architecture of the aforementioned methods for the prior distribution in the CMMD model. The encoder architecture in the CMMD model is a fully-connected neural network with 3 hidden layers, each with 2500 units. All layers in the encoder use softplus activation functions and a dropout layer with 0.2 probability. Following previous work, the multimodal representation is a 20D latent variable, and we use the same values for α and λ as in the other experiments, which are 10 and 1000 respectively. It is noteworthy that the 3 modalities are vectorized and concatenated before being sent through the encoder.
To make a fair comparison with previous methods, we implemented a two-step classification using the multinomial logistic regression model implemented in scikit-learn with default values.

Figure 3: The left panel shows generated images for the MNIST and SVHN modalities conditioned on the observed Text modality using the CMMD model. The right panel shows generated images for the SVHN modality conditioned on the Text and MNIST modalities, which are assumed to be observed at test time.

Figure 4: The 1st and 3rd plots show error rates for the speaker-dependent experiments (Section 4.4). The 2nd plot shows average variances of all generated x_M features. The last plot compares rmse for their generated values. Speaker 28 was removed as the rmse in both cases is roughly 0.

Figure 5: The top panel shows the posterior collapse in the CMMD model for ω = 0 and ω = 1. In both cases, we use data for speaker 7 in the XRMB data set. The two panels in the middle show average rmse and error rate as a function of ω for speakers 7, 16, 20, 21, 23, 28, 31, and 35 in the XRMB data set. Finally, the two panels at the bottom show average rmse and error rate values in the cross-validation approach introduced in Section 4.2.
Figure F1: 2D t-SNEs of the latent space in CMMD, MVAE and VCCA. The scatter color is assigned by the class label.

Fig. 2 shows different measures of collapse for Fold 1 of the Section 4.2 experiments. The far left diagram shows posterior collapse Pr(KL[(z_i|x_O) || (z_i)] < ε) ≥ 1 − δ, where (z_i) ~ N(0, 1) and (z_i|x_O) are drawn from the prior distribution, the joint MoE posterior, the joint PoE posterior, and the shared inference distribution for CMMD, MMVAE, MVAE, and VCCA respectively. The second diagram in Fig. 2 calculates Pr(KL[(z_i|x_M) || (z_i)] < ε) ≥ 1 − δ. However, z_i ~ N(0, 1) and (z_i|x_M) are, for this, drawn from the inference posterior distribution, the joint MoE posterior, the joint PoE posterior, and the inference private distribution in CMMD, MMVAE, MVAE, and VCCA respectively. Finally, the third diagram in Fig. 2 calculates Pr(KL[(z_i|x_O) || (z_i|x_M)] < ε) ≥ 1 − δ, where (z_i|x_O) and (z_i|x_M) are drawn as explained above.

F Latent Space - MNIST

Figure F1 shows 2D t-SNEs (Van der Maaten & Hinton, 2008) of the latent space learned using CMMD, MVAE and VCCA. The t-SNEs for both CMMD and VCCA show well separated class labels. Note that the class label variability is larger for the CMMD embeddings than for VCCA. The t-SNEs for MVAE, however, show some overlapping class labels.

G Generating Multiple Missing Modalities

We, in this section, compare the missing modality/modalities generated at test time by the decoders in CMMD, MMVAE, and MVAE. In panel (a) we assume SVHN digits are missing at test time; in panel (b) both rotated-MNIST and SVHN are missing modalities at test time. In panels (c) and (d) we train MMVAE, optimizing the evidence lower bound (ELBO) and its importance weighted autoencoder (IWAE) version respectively, and generate the missing modality at test time (SVHN digits). For completeness, panel (e) shows the SVHN digits generated using MVAE reported in Shi et al. (2019). Both CMMD (panel (a)) and MMVAE-IWAE (panel (d)) generate high-quality and coherent SVHN digits, matching the MNIST digit in all cases. MVAE (panel (e)), however, generates low-quality SVHN digits and it is difficult to see whether the generated image matches the MNIST digit. CMMD generates two missing modalities in panel (b), which is clearly a more challenging task. Only digits 7 and 0 are generated correctly for both missing modalities. Finally, it is interesting to compare the results obtained with MMVAE using two objective functions. If MMVAE optimizes the evidence lower bound, then the generated SVHN images have relatively low quality and do not match the MNIST class.
Figure 1: Forward propagation in our proposed CMMD model. The orange arrow indicates a forward pass during training, which is replaced by the blue arrow at test time, i.e. the input to p(x_M|x_O, z) is z ~ q(z|x_O, x_M, y) during training, while it is z ~ p(z|x_O) at test time. The black arrow depicts a common forward propagation during training and testing, i.e. the input to p(y|z) is always z ~ p(z|x_O).

Table 2: Error rates (lower is best) for experiments with MNIST and XRMB (average over speakers in the test data set). For the Flickr data set, we report the mean average precision (mAP; higher is best). Results are based on Wang et al. (2017), except for values marked with † (which are from our own tests without pre-trained Boltzmann machine weights) and the results for CMMD.

Table 4: Accuracy results for 3-modality MNIST. The first experiment classifies using representations generated with x_O = x_1, while the second experiment uses x_O = (x_1, x_2).

Table 6: Accuracy values of the conditionally generated modalities averaged over 5 different runs. The letter at the top indicates the modality being generated based on the different sets of modalities below, where M, S, and T stand for the MNIST, SVHN, and Text modalities, respectively.

Table 7: Tags generated with the multi-modal learning deep Boltzmann machine (DBM) (Srivastava & Salakhutdinov, 2012) and with CMMD. DBM fails to generate coherent tags in the first 3 images. CMMD is, however, able to generate meaningful tags. In the last image, both models generate coherent tags.

Table 8: We report rmse for AI and error rates (%) for downstream classification in a speaker-independent experiment for eight speakers. Average and standard deviation (std) values are shown at the bottom.

Figure C1: Optimal ω value, found by cross-validation, for each of the experiments in this research. Experiments are ordered chronologically.

Table H2: Accuracy values (generation coherence), averaged over 5 different runs, of the modalities conditionally generated by the CMMD model, together with the optimal ω values found by cross-validation.