Journal Pre-proofs

Distance assessment and analysis of high-dimensional samples using variational autoencoders

Abstract

An important question in many machine learning applications is whether two samples arise from the same generating distribution. Although an old topic in Statistics, the simple accept/reject decisions given by most hypothesis tests are often not enough: it is well known that the rejection of the null hypothesis does not imply that differences between the two groups are meaningful from a practical perspective. In this work, we present a novel nonparametric approach to visually assess the dissimilarity between two datasets that goes beyond two-sample testing. The key idea of our approach is to measure the distance between two (possibly) high-dimensional datasets using variational autoencoders. We also show how this framework can be used to create a formal statistical test of the hypothesis that both samples arise from the same distribution. We evaluate both the distance measurement and the hypothesis testing approach on simulated and real-world datasets. The results show that our approach is useful for data exploration (as it, for instance, allows for quantification of the discrepancy/separability between categories of images), which can be particularly helpful in the early phases of a machine learning pipeline.


Introduction
An important question in many applications of machine learning and Statistics is whether two samples (or datasets) arise from the same data generating probability distribution [gretton2012kernel, holmes2015two, soriano2015bayesian].
Although an old topic in statistics [Mann47, Smirnov48], the simple accept/reject decisions given by most hypothesis tests are often not enough: it is well known that the rejection of the null hypothesis does not mean that the difference between the two groups is meaningful from a practical perspective [1904.]. Thus, tests that go beyond accept/reject decisions are preferred in practice. In particular, tests that provide not only single and interpretable numerical values, but also a visual way of exploring how far apart the datasets are from each other, are especially useful. This raises the question of how to assess the distance between two groups meaningfully, which is especially challenging in high-dimensional spaces.
In this work, we present a novel nonparametric approach to assess the dissimilarity between two high-dimensional datasets using variational autoencoders (VAE) [vae]. We show how our approach can be used to visually assess how far apart datasets are from each other via a boxplot of their distances and, additionally, provide a way of interpreting the scale of these distances by using the distance between known distributions as a baseline. We also show how a formal permutation-based hypothesis test can be derived within our framework.
The remainder of this paper is organized as follows. In Sections 1.1 and 1.2, we present a brief description of work related to our proposed method. In Section 2, we present a review of variational inference and VAEs, and show that the latter can be interpreted as a density estimation procedure, which is the basis of the proposed method. In Section 3, we show how variational autoencoders can be used as a method for exploring the differences between two samples. In Section 4, we use our method to derive a formal hypothesis testing procedure.
Both sections also show applications of the methods to simulated and real-world datasets. Finally, Section 5 concludes the paper with final remarks. The Appendix contains details on the configurations of the software and neural networks used, as well as a link to our implementation, which is published open source.

Related work on two-sample hypothesis testing
Several nonparametric two-sample testing methods have been proposed in the literature; they date back to Mann47, Smirnov48, WELCH1947: three classical two-sample tests (the Mann-Whitney rank test, the Kolmogorov-Smirnov test and Welch's t-test, respectively) which were designed to work for univariate random variables only. On the other hand, holmes2015two, soriano2015bayesian, ceregatti2018wiks investigate Bayesian univariate methods for this task.
More recently, gretton2012kernel introduced a two-sample test using reproducing kernel Hilbert space theory that works for high-dimensional data. The test, however, does not provide a way to visually assess the dissimilarity between the datasets. two-sample-deep-learning proposes a method for two-sample hypothesis testing utilizing deep learning which, contrary to a permutation-based test, only controls the type I error rate asymptotically; binary-two-sample proposes a test statistic built using a binary classifier in the context of causal inference and causal discovery, also relying on the asymptotic distribution of the test statistic (the distance between the performances of binary classifiers) under the null hypothesis.
Other two-sample tests for high-dimensional data can be found in [mondal2015high, NIPS2016_6209] and references therein. Although these tests are robust and effective in many settings, they do not provide a visual analysis to assess the distance between the groups. Thus, they do not provide ways of checking whether the difference between the datasets is meaningful from a practical perspective, a gap in the literature which is filled by this article.

Related work on two-sample comparison and distance measurement
There has also been some work devoted to two-sample comparison and related tasks. In particular, deAlmeidaIncio2018 provides a framework for assessing the distance between populations using density estimation methods.
However, the method provided in that work is based on MCMC (Markov chain Monte Carlo) Bayesian simulations, and therefore it is unable to scale to large datasets (see betancourt_mcmc_subsampling, for instance) and high-dimensional spaces. In this work, we overcome these issues by using variational autoencoders to estimate densities, and by introducing a specific metric which has an analytic solution even in high-dimensional spaces. pmlr-v97-kornblith19a (and references therein) proposes a new method for comparing neural network representations. pmlr-v48-larsen16 propose a variant of variational autoencoders (VAE) that measures similarities in data space better than a vanilla VAE. an2015variational uses VAEs for anomaly detection, that is, with the goal of identifying whether a single instance is different from an observed sample. 1280752 evaluates existing similarity measurement methods in the context of image retrieval. These papers, however, do not use their methods for performing formal hypothesis tests.
Finally, for closely-related problems and applications, see also pfister2016kernel, ramdas2017wasserstein, 1908.00105 for methods on how to solve the problem of independence testing, and desmistifying-gans, which uses two-sample tests as a tool to evaluate generative adversarial networks.

Variational Autoencoders
In this section, we review key aspects of the variational autoencoder framework. Consider an i.i.d. random sample D = (X_1, X_2, ..., X_n). Variational autoencoders estimate the density of this sample by encoding the information of each X_i using latent random variables Z = (Z_1, Z_2, ..., Z_n), which are linked to (X_1, X_2, ..., X_n) by a parameter θ. More precisely, the model assumes that, conditionally on Z_i = z, each X_i follows a Gaussian distribution with parameters (µ_i, σ_i) = g_θ(z), where Z_i ∼ N(0, 1), g_θ is a complex function (the output of a neural network) with parameter θ (i.e.: the parameters/weights of a neural network), and µ_i and σ_i are the mean and standard deviation of the Gaussian distribution. Inference on such a model is performed by maximizing the evidence P(D = d; θ) := P_θ(D = d). Note that, if g_θ is complex enough, we can actually model any distribution. This is why g_θ is parametrized using an artificial neural network; it leads to flexibility (because of the richness of the space of functions such networks can represent, Hornik1989) as well as scalability.
Unfortunately, maximization of the evidence cannot be directly solved due to the curse of dimensionality [tutorial_vae]. The next section shows how variational inference can be used to overcome this.

Variational inference
The curse of dimensionality can be solved using variational inference, which consists of maximizing a lower bound on the evidence involving the Kullback-Leibler divergence D_KL and a variational distribution Q over the latent variables, where Q is given by the output of an encoder neural network. This framework is called variational autoencoder and it solves the curse of dimensionality [vae]. The training procedure for variational autoencoders is presented in Figure 1; for more details, see vae.
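For reference, the objective alluded to above is the standard evidence lower bound (ELBO); a textbook derivation via Jensen's inequality (written in our notation, which may differ from the paper's elided equations) is:

```latex
\log P_\theta(d)
  = \log \mathbb{E}_{Z \sim Q}\!\left[ \frac{P_\theta(d \mid Z)\, P(Z)}{Q(Z)} \right]
  \geq \mathbb{E}_{Z \sim Q}\!\left[ \log P_\theta(d \mid Z) \right]
     - D_{\mathrm{KL}}\!\left( Q \,\|\, P(Z) \right).
```

Maximizing the right-hand side jointly over θ and the parameters of the encoder defining Q is what the training loop of Figure 1 implements.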

Generative model
The trained model can be used to generate new instances X_j: this can be done by applying P_θ(X_j) = ∫ P_θ(X_j | Z_j = z) P(Z_j = z) dz, i.e.: sample z from N(0, 1), apply the neural network g_θ, and then sample from N(µ, σ) with (µ, σ) = g_θ(z). Therefore, variational autoencoders are Gaussian mixture models (with an infinite number of Gaussians), and thus density estimators.
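This ancestral sampling scheme can be sketched in a few lines of NumPy; here `g_theta` is a hypothetical stand-in for the trained decoder network (any smooth map from z to a valid (µ, σ) pair will do for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def g_theta(z):
    # Stand-in for the decoder network: maps a latent z to the mean and
    # standard deviation of a Gaussian over the data space.
    # (Toy function for illustration; the paper uses a trained neural network.)
    mu = np.tanh(2.0 * z)
    sigma = 0.1 + 0.9 / (1.0 + np.exp(-z))  # keep sigma strictly positive
    return mu, sigma

def sample_vae(n):
    # Ancestral sampling: z ~ N(0, 1), then x | z ~ N(mu(z), sigma(z)^2).
    z = rng.standard_normal(n)
    mu, sigma = g_theta(z)
    return mu + sigma * rng.standard_normal(n)

xs = sample_vae(10_000)  # draws from the (infinite) Gaussian mixture
```

Each draw picks one mixture component (indexed continuously by z) and then samples from it, which is exactly why the model is a density estimator.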

Identifiability of the mixture of Gaussians
As per the structure of variational autoencoders, the distribution of such a mixture of Gaussians is not identifiable [teicher1961identifiability, wechsler2013bayesian]: two different configurations of the parameters, say θ_1 and θ_2, can lead to the exact same distribution of (µ, σ). In other words: if we train a variational autoencoder on a dataset, we will get a "generator" of pairs (µ, σ), say m_1; and if we train a variational autoencoder with identical structure on the same dataset, we will get another "generator" of pairs (µ, σ), say m_2. Generators m_1 and m_2 do not necessarily give the same distribution over samples of pairs (µ, σ).
Nonetheless, the induced final density (i.e., the Gaussian mixture) should be the same analytically (i.e.: ignoring the stochastic variation that estimation methods induce).

Two sample comparison: definition of the distance
We name our approach for assessing the similarity between two datasets D_1 and D_2 vaecompare and describe it as follows. First, we train two variational autoencoders: one for D_1 and one for D_2. Let g_θ1 and g_θ2 be the learned functions for each of the autoencoders. g_θ1 and g_θ2, together with Z ∼ N(0, 1), induce two distributions over the parameter space (µ, σ). Let S_1 = (µ_1, σ_1) and S_2 = (µ_2, σ_2) be two samples generated from the encoders g_θ1 and g_θ2, respectively. We then measure the distance between S_1 and S_2. Now, recall that each pair (µ_i, σ_i) indexes a distribution over the data space. Thus, a meaningful distance between S_1 and S_2 should be in the space of the random variables they generate. The key idea to make the method computationally feasible is to use a symmetric Kullback-Leibler divergence between the distributions induced by S_1 and S_2:

D(S_1, S_2) = [D_KL(P_S1 || P_S2) + D_KL(P_S2 || P_S1)] / (2d),

where d is the dimension of the feature space, P_Si is a (multivariate) Gaussian distribution with parameters (µ_i, σ_i), and D_KL is the Kullback-Leibler divergence. D_KL has an analytical solution in the Gaussian case:

D_KL(P_S1 || P_S2) = Σ_{j=1}^{d} [ log(σ_{2,j}/σ_{1,j}) + (σ_{1,j}^2 + (µ_{1,j} − µ_{2,j})^2) / (2 σ_{2,j}^2) − 1/2 ].

In case X represents an image, we use the standard approach of multidimensional Bernoulli distributions with dimensions independent from each other (see [vae, tutorial_vae], for instance). In this case, the Kullback-Leibler divergence can also be obtained analytically:

D_KL(P_S1 || P_S2) = Σ_{j=1}^{d} [ p_{1,j} log(p_{1,j}/p_{2,j}) + (1 − p_{1,j}) log((1 − p_{1,j})/(1 − p_{2,j})) ].

Using this approach, we can therefore assess the distance between one sample generated from the first autoencoder and one generated from the second. In order to assess the divergence between the datasets D_1 and D_2, we repeat this procedure several times; this gives a sample from the distribution of distances. Now, in order to overcome the identifiability issue discussed in Section 2.4, we train the variational autoencoders multiple times (we call these "refits") for each dataset (using distinct initialization seeds for the network parameters) and use the new pairs (µ, σ) from each of them in equal proportion.
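A minimal NumPy sketch of these closed-form divergences (illustrative only; the 1/(2d) normalization is our reading of the definition, chosen so that the baseline of Section 3.1 equals 1/2, and all function names are ours):

```python
import numpy as np

def kl_gauss(mu1, s1, mu2, s2):
    # Closed-form KL( N(mu1, diag(s1^2)) || N(mu2, diag(s2^2)) ).
    return np.sum(np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5)

def kl_bern(p1, p2):
    # Closed-form KL between products of independent Bernoulli distributions.
    return np.sum(p1 * np.log(p1 / p2) + (1 - p1) * np.log((1 - p1) / (1 - p2)))

def D_gauss(mu1, s1, mu2, s2):
    # Symmetric, dimension-normalized divergence between two (mu, sigma) pairs.
    d = mu1.size
    return (kl_gauss(mu1, s1, mu2, s2) + kl_gauss(mu2, s2, mu1, s1)) / (2 * d)

def D_bern(p1, p2):
    d = p1.size
    return (kl_bern(p1, p2) + kl_bern(p2, p1)) / (2 * d)

rng = np.random.default_rng(1)
mu1, mu2 = rng.normal(size=4), rng.normal(size=4)
s1, s2 = rng.uniform(0.5, 2.0, 4), rng.uniform(0.5, 2.0, 4)
assert np.isclose(D_gauss(mu1, s1, mu1, s1), 0.0)  # zero for identical params
assert np.isclose(D_gauss(mu1, s1, mu2, s2),
                  D_gauss(mu2, s2, mu1, s1))       # symmetric by construction
```

Because both directions of the KL are analytic, a single distance evaluation is O(d), which is what makes repeated sampling of the distance distribution computationally feasible.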
Note that, from the perspective of applying this method to images, it can also be interpreted as a data exploration tool, as it helps explore the separability and uncertainty of classes of images and the relation between their data generating processes.

Assessing the magnitude of the distance
In Section 3, we defined a method to measure the distance between two datasets. A yardstick is still required in order to say what is a "low" and what is a "high" distance. In order to create a baseline to interpret such distances, we proceed in a similar fashion to deAlmeidaIncio2018: we compute the distance between two known distributions.
In the case of Gaussian VAEs, we can work, for instance, with D(N_0, N_1), where N_0 is a multivariate Gaussian with covariance given by an identity matrix and mean given by a vector of zeros, and N_1 is a multivariate Gaussian with covariance given by an identity matrix and mean given by a vector of ones. We have that D(N_0, N_1) = 1/2. For binomial VAEs, we use known binomial distributions as the baseline.

Algorithm 1 (inner loop):
5: Generate a sample s_1 from V_1 (e.g.: a pair (µ, σ) for a Gaussian VAE).
6: Generate a sample s_2 from V_2.
7: Calculate D(s_1, s_2) and store it in S.
8: end for
9: end for
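The baseline value D(N_0, N_1) = 1/2 can be checked numerically with the Gaussian closed form (a sketch; the normalization by 2d is our reading of the distance definition in Section 3):

```python
import numpy as np

def kl_gauss(mu1, s1, mu2, s2):
    # KL( N(mu1, diag(s1^2)) || N(mu2, diag(s2^2)) ), closed form.
    return np.sum(np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5)

def D(mu1, s1, mu2, s2):
    # Symmetric KL divided by 2d, as in the paper's distance definition.
    d = mu1.size
    return (kl_gauss(mu1, s1, mu2, s2) + kl_gauss(mu2, s2, mu1, s1)) / (2 * d)

for d in (1, 10, 3072):  # the baseline does not depend on the dimension
    mu0, mu1 = np.zeros(d), np.ones(d)
    s = np.ones(d)
    assert np.isclose(D(mu0, s, mu1, s), 0.5)
```

With identity covariances, each direction of the KL contributes d/2, so the normalized distance is exactly 1/2 regardless of d, which is what makes it a convenient dimension-free yardstick.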

Evaluation (images)
Next, we apply our method to the CIFAR10 data [cifar10], using the VAE as a generator of binomial distributions. The dataset consists of images from 10 distinct categories (labeled 0 to 9), with each category containing 5000 images. To make the comparison fair both when comparing a category to itself and when comparing a category to another, we chose to work with half of each category's dataset (2500 images) to train each VAE; i.e.: when comparing category 0 to category 1, one VAE is trained with 2500 images from category 0 and the other is trained with 2500 images from category 1; on the other hand, when comparing category 0 to itself, each VAE is trained with one half (2500 images) of the category 0 dataset. We worked with 90 VAE refits for each dataset.
Figure 3 shows the median and mean for each category, as well as the divergence of known Bernoulli distributions (plotted as horizontal lines).
Except for categories 0, 2 and 4, the divergence samples were all concentrated near zero when comparing a category to itself (a desirable behaviour). For these 3 categories, on the other hand, we observe a considerable amount of divergence samples spread far from zero, indicating some uncertainty; but even in this case, a considerable amount of them remained near zero. Note also that the median for these three categories is much closer to zero than the mean, indicating the median's resilience to outliers.

On the other hand, when making comparisons between distinct categories, there are cases with high uncertainty (i.e.: boxplots with wide extensions; generally with few points close to zero), as well as cases with higher certainty (i.e.: boxplots with narrow extensions).
We conclude that the method is therefore useful for the purpose of data exploration, as it works as expected in a complex space such as that of images.

Hypothesis testing
We can additionally use vaecompare to directly test whether two samples come from the same population. One way to do this would be to find a threshold value for the distance above which the samples are declared different; calibrating such a threshold, however, is not straightforward. Given that, we work instead with a simple permutation test, where the datasets are repeatedly permuted against each other (i.e.: their data is mixed), and the average divergence of the samples is used as a test statistic. The p-value is then given by the quantile of the non-permuted dataset's statistic among all the statistics. We note that, for the hypothesis test to work in the sense of being a proper test (uniform under the null), it is not necessary to do VAE refits, but refits increase the test power, as we shall see next. We present the procedure in Algorithm 2.

Evaluation (simulated data)
In this section, we apply the proposed hypothesis testing method to simulated datasets from a known data generating function, parametrized by a vector k, and plot the observed p-value distribution.

Algorithm 2: Obtaining the p-value for hypothesis testing using vaecompare.
Input: dataset D_1, dataset D_2, number of desired samples per refit n, number of desired refits R, number of permutations t, averaging function M (e.g. mean or median).
Output: p-value ρ.

3: Calculate M(S_i) and store the result in K_i.
4: Permute the instances of datasets D_1 and D_2.
5: end for
6: Obtain the number of points q_1 in {K_2, K_3, ..., K_t} which are greater than K_1.
7: Obtain the number of points q_2 in {K_2, K_3, ..., K_t} which are greater than or equal to K_1.
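The permutation scheme of Algorithm 2 can be sketched generically as follows. This is a simplified illustration: a cheap stand-in statistic (absolute difference of sample means) replaces the VAE-based average divergence, there are no refits, and only the conservative q_2 count is used (Algorithm 2's q_1/q_2 bookkeeping additionally randomizes over ties):

```python
import numpy as np

rng = np.random.default_rng(0)

def perm_pvalue(d1, d2, stat, t=200):
    # K1 is the statistic on the original split; K2..Kt come from
    # random re-splits of the pooled data (the "permuted" datasets).
    k1 = stat(d1, d2)
    pooled = np.concatenate([d1, d2])
    ks = []
    for _ in range(t - 1):
        rng.shuffle(pooled)
        ks.append(stat(pooled[:len(d1)], pooled[len(d1):]))
    q2 = np.sum(np.array(ks) >= k1)  # permuted statistics at least as extreme
    return (q2 + 1) / t              # finite-sample-valid p-value

stat = lambda a, b: abs(a.mean() - b.mean())
same = perm_pvalue(rng.normal(0, 1, 200), rng.normal(0, 1, 200), stat)
diff = perm_pvalue(rng.normal(0, 1, 200), rng.normal(2, 1, 200), stat)
```

Because the permuted statistics are exchangeable with the observed one under the null, the resulting p-value controls the type I error rate at any sample size, in contrast with the asymptotic alternatives discussed in Section 1.1.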
For simplicity, we do not use refits here. The value of the vector k is fixed at zero for one of the datasets, and varied for the other. This is done in order to change the dissimilarity between the samples (i.e.: the larger k is, the more dissimilar the sample distributions are) and, from that, to observe the behaviour of the distribution of the p-value.
In Figure 4a, we present the results of this experiment using the permutation test (the empirical cumulative distribution of the p-values), while in Figure 4b we present the same simulation study using a Gaussian asymptotic approximation to the permutation test.
The permutation test fulfilled the required properties of a frequentist hypothesis test: it has an approximately (sub)uniform distribution under the null hypothesis (as expected) and its power increases as the divergence increases.

Notice that, for simplicity, we do not use refits here; if we did, we would expect the power to increase, as is the case in the next section. The asymptotic test, on the other hand, performed poorly.

Evaluation (images)
Here, we also applied the hypothesis testing method to the CIFAR10 dataset, using the same 2500 images for each category as described in Section 3.2. In Tables 1, 2, 3 and 4 we present the p-values obtained in the tests, while in Tables 5, 6, 7 and 8 we present the combinations that gave the correct results and the type II errors for a significance level of 5%. We applied the tests both without VAE refits and with 5 refits; we also tried the median as an alternative to the mean, with the intuition that this might help remove the weight of outlier distance points.
In Table 9, we present a summary of the results. The method performed well under the null for both the mean and the median statistics. Moreover, the method showed a significant increase in test power when used with VAE refits.

Comparing the two statistics, it can be seen that the mean incurred fewer type I errors, while the median incurred more, but still an admissible number given the critical rate of 5%. On the other hand, the performance of the median was considerably better regarding type II errors; this might be related to its robustness to outliers, which were shown to be a frequent problem in Figure 3.

Table 5: Results of the hypothesis testing when applying a critical rate of 5% without refits and averaging using the median. Here G stands for "good" and E2 for type II error.
The figures indicate that vaecompare had competitive performance when compared to the other methods (footnote 8). Additionally, it is the only method that can be used for exploratory data analysis (on two-sample comparison) instead of just hypothesis testing; moreover, as shown in Section 4.2, the test power could potentially increase with a larger number of refits, which was kept small here because of the computational restrictions of the simulation study.
Footnote 7: Such a smoothed version could also be obtained by increasing the number of permutations for each test; this has not been done due to the computational cost of such an increase.
Footnote 8: Note, in particular, that, contrary to the problems of classification and regression, it is not possible to easily split the dataset in order to choose the best hypothesis testing method before applying it to the whole dataset (at least not without causing further problems, such as bias from multiple comparisons).

Discussion and Conclusions
In this work, we proposed a novel method for two-sample distance measurement and hypothesis testing, and applied it to simulated and real-world datasets. We conclude that both the distance measurement and the hypothesis testing procedures were able to satisfactorily perform the intended tasks on the tested simulated and real-world datasets.
The proposed methods could be used for various tasks in the machine learning pipeline, including:

• Distribution shift detection and measurement: a dataset from an experiment done in one month (e.g.: opinions of customers on a product in a specific month) might diverge in distribution from a dataset collected in another month. With our method it is possible to measure and test this divergence.

• Dataset split: to address overfitting, the data is usually split into train, validation and/or test parts. To be able to develop robust models, these parts should be similar, but should also differ enough to ensure generalization.
With the proposed methods, the dataset split can be done in a controlled manner, an important step for state-of-the-art predictive methods (e.g.: see Breiman1996StackedR, Coscrato2020 and references therein).
• Self-supervised clustering: based on the distance, binary or multi-class clustering could be performed by fine-tuning the threshold (cutpoint).
• Anomaly detection: applying the proposed method to processes where anomalies may occur (e.g. malicious attacks, malfunctions, etc.). In this case, the distance measurement gives direct feedback on how much the actual behaviour differs from the normal one.
• To test the quality of data generated from GANs and similar approaches (e.g.: see binary-two-sample).
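As a toy illustration of the distribution-shift use case above, two batches can be compared with the same symmetric-KL distance, with a single fitted diagonal Gaussian standing in for the trained VAE (a deliberate simplification of the actual method; all names are ours):

```python
import numpy as np

def fit_diag_gauss(x):
    # Stand-in for a trained VAE: summarize a batch by a diagonal Gaussian.
    return x.mean(axis=0), x.std(axis=0)

def kl(mu1, s1, mu2, s2):
    # Closed-form KL between diagonal Gaussians.
    return np.sum(np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5)

def shift_score(batch_a, batch_b):
    # Symmetric KL normalized by the feature dimension, as in Section 3.
    (m1, s1), (m2, s2) = fit_diag_gauss(batch_a), fit_diag_gauss(batch_b)
    d = batch_a.shape[1]
    return (kl(m1, s1, m2, s2) + kl(m2, s2, m1, s1)) / (2 * d)

rng = np.random.default_rng(0)
month1 = rng.normal(0.0, 1.0, size=(1000, 5))
month2 = rng.normal(0.0, 1.0, size=(1000, 5))  # same distribution as month1
month3 = rng.normal(1.0, 1.0, size=(1000, 5))  # shifted mean
```

Here `shift_score(month1, month2)` stays near zero while `shift_score(month1, month3)` approaches the 1/2 baseline, giving the direct, interpretable feedback described in the bullet points.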

Appendix: Neural networks configuration, software and package
We work with a dense neural network of 10 layers with 100 neurons in each layer (totaling 195060 parameters), for both the encoder and decoder networks, and the following additional specification:

• Optimizer: we work with the Adamax optimizer [adam-optim] with an initial learning rate of 0.01, and halve the learning rate if no improvement is seen on the validation loss for a considerable number of epochs.
• Initialization: we used the initialization method proposed by [nn-initialization].
• Stop criterion: a 90%/10% split early stopping for small datasets and a higher split factor for larger datasets (increasing the proportion of training instances), with a patience of 50 epochs without improvement on the validation set.
• Normalization: batch normalization, as proposed by [batch-normalization], is used in this work in order to speed up the training process.

• Dropout: we also make use of dropout, as proposed by [dropout], with a dropout rate of 0.5.

Algorithm 3: Algorithm to evaluate the encoder network g(·) presented in Figure 1.
Input: x. Output: µ, σ.
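The stop criterion described above can be sketched as a simple patience counter (a minimal illustration; `patience=50` matches the text, and the surrounding training loop is omitted):

```python
class EarlyStopping:
    """Stop training when the validation loss has not improved for
    `patience` consecutive epochs (the appendix uses patience = 50)."""

    def __init__(self, patience=50):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        # Returns True when training should stop.
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=3)
losses = [1.0, 0.8, 0.9, 0.85, 0.83]  # no new best after the second epoch
stops = [stopper.step(l) for l in losses]  # -> [False, False, False, False, True]
```

The counter resets only on a strict new best, so plateaus and small oscillations above the best value both count toward the patience budget.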