Model Criticism in Latent Space

Model criticism is usually carried out by assessing if replicated data generated under the fitted model looks similar to the observed data, see e.g. Gelman, Carlin, Stern, and Rubin [2004, p. 165]. This paper presents a method for latent variable models by pulling back the data into the space of latent variables, and carrying out model criticism in that space. Making use of a model's structure enables a more direct assessment of the assumptions made in the prior and likelihood. We demonstrate the method with examples of model criticism in latent space applied to factor analysis, linear dynamical systems and Gaussian processes.


Introduction
Model criticism is the process of assessing the goodness of fit between some data and a statistical model of that data. While model criticism uses goodness-of-fit tests to judge aspects of the model, its general objective is to identify deficiencies in the model that can then be addressed by model extension. The extended model(s) can again be subjected to criticism, and the process continues until a satisfactory model is found (O'Hagan, 2003). Model criticism is contrasted with model comparison: model criticism assesses a single model, while model comparison deals with at least two models to decide which is a better fit. Model comparison can be applied to compare the original and the extended model after model criticism and extension (O'Hagan, 2003, p. 2).
Bayesian modelling has become an indispensable tool in statistical learning, and it is widely used to model complex signals, e.g., by Ratmann et al. (2009). With its growing popularity, there is a need for model criticism in this framework. Most work on model criticism makes use of the idea that "if the model fits, then replicated data generated under the model should look similar to observed data" (Gelman et al., 2004, p. 165). In contrast, in this paper we focus on a less well explored idea: for latent variable models, we can probabilistically pull back the data into the space of the latent variables, and carry out model criticism in that space. We can summarize this principle as: if the model fits, then posterior inferences should match the prior assumptions.
To elaborate, consider a model with observed variables X and unobserved variables U with joint distribution P(X, U | γ), where γ are known parameters. In general U may contain latent variables Z, parameters Θ, and hyperparameters λ. Given a sample x_obs from the marginal distribution P(X | γ), and a single posterior sample u* from the conditional distribution P(U | x_obs, γ), the joint sample (x_obs, u*) is a draw from the distribution P(X, U | γ). This property can be used to check the fit of the model in the latent space by checking whether u* is a sample from the marginal distribution P(U | γ). Testing a single sample against a distribution, however, is not an effective approach. But in many widely-used models, groups of unknown variables are independently and identically distributed under the prior. These related variables are easily aggregated together, giving a simple test of the prior assumptions. Figure 1 summarizes the overall approach, which is justified in §3.
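This pull-back principle can be illustrated with a small simulation. The sketch below uses a hypothetical one-dimensional conjugate Gaussian model (not from the paper), where the exact posterior is available, and checks that a single posterior draw per simulated dataset is marginally distributed according to the prior.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 5000

# Toy model (illustrative only): prior U ~ N(0, 1), likelihood X | u ~ N(u, 1).
# The exact posterior is U | x ~ N(x / 2, 1 / 2).
u_true = rng.normal(0.0, 1.0, size=n)          # u ~ P(U | gamma)
x_obs = rng.normal(u_true, 1.0)                # x_obs ~ P(X | u, gamma)
u_star = rng.normal(x_obs / 2, np.sqrt(0.5))   # one posterior draw per dataset

# Statement-1 logic: each (x_obs, u_star) pair is a joint draw from P(X, U),
# so u_star is marginally a draw from the prior N(0, 1).
p_value = stats.kstest(u_star, "norm").pvalue
print("KS p-value against the prior:", p_value)
```

Replacing the posterior step with a buggy sampler (e.g., forgetting the factor 1/2 in the variance) makes the KS test reject, which is exactly the kind of misfit a latent-space check is designed to expose.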
In comparison to model criticism in the observation space, comparing u* with the prior P(U | γ) provides an additional tool for model criticism that does not require crafting an appropriate discrepancy measure, generating replicate observations, or approximating the null distribution. This approach also does not suffer from the "double use" of data (see discussion in §2). These points have also been made by Yuan and Johnson (2012), but were applied to a relatively small-scale hierarchical linear model. We develop the use of model criticism in latent space for large-scale and complex models, yielding new insights and developments. Specifically, we apply this approach to the criticism of linear dynamical systems, factor analysis and Gaussian processes, and discuss its connection to the observation-space-based approach.
The structure of the rest of the paper is as follows: in §2 we describe the methods of model criticism in observation space. §3 provides details of the argument for model criticism in latent space and describes related work, and §4 shows results from applying the method to the three examples. Table 1 summarizes the notation used in the paper, and Table 2 lists the distributions used in the paper.
[Table 1: Notation used in the paper. Table 2: List of distributions used in the paper.]

Model Criticism in Observation Space
A general approach to model criticism is to evaluate whether replicated data generated under the (fitted) model look similar to the observed data. Suppose we are modelling observed data x_obs with a latent variable model parameterized by U, i.e., we have defined the likelihood p(x | u) and (optionally) a prior distribution P(U) over potential parameter values. The principle of model criticism in the observation space is to assess whether x_obs is a reasonable observation under the proposed model. For example, given the maximum likelihood estimator (or another point estimate) û of the parameters, one standard approach is to compute the plug-in p-value (Bayarri and Berger, 2000)

p_plug-in = Pr( D(X_rep, û) ≥ D(x_obs, û) ).

Here D is called a discrepancy function; it resembles a test statistic in hypothesis testing, i.e., a larger value rejects the null hypothesis or indicates incompatibility of data and model, and X_rep is a replicate observation generated under the fitted model, i.e., X_rep ∼ P(X | û).
If the p-value is low, then the probability of generating a dataset more extreme than the observed data is small; in other words, the observed data itself is extreme relative to the model, and thus the model does not adequately describe the dataset. In summary, a low p-value rejects the hypothesis that the data is being adequately modelled. The p-value is usually estimated via an empirical average by generating multiple replicates x_rep_r, r = 1, ..., R, and evaluating

p_plug-in ≈ (1/R) ∑_{r=1}^{R} I( D(x_rep_r, û) ≥ D(x_obs, û) ),

where x_rep_r is a sample from P(X | û). An alternative to point estimation is to consider a Bayesian treatment of the problem, where one can integrate out the contribution of the parameters. The test statistic can be averaged under either the prior distribution or the posterior distribution. The prior-predictive distribution is defined to have the density p(x_rep | γ) = ∫ p(x_rep | u) p(u | γ) du, where γ parameterizes the prior distribution over U. One can generate replicate observations from this distribution, and compute the prior predictive p-value (Box, 1980)

p_prior ≈ (1/R) ∑_{r=1}^{R} I( D(x_rep_r, u_r) ≥ D(x_obs, u_r) ),

where (x_rep, u)_r is a sample from P(X, U | γ). This approach is not reasonable when the prior distribution is improper (cannot be integrated) or uninformative. Additionally, even if the prior distribution is informative, one might not generate enough samples to represent the data distribution well when the parameter space is large. However, notice that one does not need to fit the model to criticise it. On the other hand, one can use the posterior distribution P(U | x_obs), and sample from the posterior-predictive distribution with density p(x_rep | x_obs) = ∫ p(x_rep | u) p(u | x_obs) du. The posterior predictive p-value (Rubin, 1984) is then computed as

p_post ≈ (1/R) ∑_{r=1}^{R} I( D(x_rep_r, u_r) ≥ D(x_obs, u_r) ),

where (x_rep, u)_r is a sample from P(X_rep, U | x_obs), i.e., by generating samples u_r from the posterior distribution instead. The support of the posterior is usually more concentrated than that of the prior, and the posterior distribution may be well-defined even if the
prior distribution is improper. The posterior predictive p-value has been criticised for "double use" of data, once for computing the posterior predictive distribution P(X_rep | x_obs) and once for computing the discrepancy measure D(x_obs, U) (Bayarri and Berger, 2000). This means that p_post does not have a uniform distribution under the null hypothesis, whereas p_prior is a valid p-value. p_plug-in is subject to the same criticism as p_post, since the MLE uses the observed data as well (Bayarri and Berger, 2000). Lloyd and Ghahramani (2015, §7) view the different p-values as arising from "different null hypotheses and interpretations of the word 'model'". They argued that the posterior predictive and plug-in p-values are most useful for highly flexible models, as the aim is to assess the fitted model rather than the whole space of models. Lloyd and Ghahramani (2015) also point out that "it may be more appropriate to hold out data and attempt to falsify the null hypothesis that future data will be generated by the plug-in or posterior distribution", which is also in line with the discussion in (O'Hagan, 2003, §2.1). Further examples of posterior predictive checking can be found in (Belin and Rubin, 1995; Gopalan et al., 2015).
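The Monte Carlo estimates above can be sketched in a few lines. The example below computes a plug-in p-value for a hypothetical Gaussian model with the sample maximum as the discrepancy function; both choices are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: data modelled as N(mu, 1); the plug-in estimate is the MLE.
x_obs = rng.normal(0.0, 1.0, size=100)
mu_hat = x_obs.mean()

def discrepancy(x):
    # Any (non-sufficient) statistic can serve as D; here, the sample maximum.
    return x.max()

# Generate R replicate datasets under the fitted model, and count how often the
# replicate discrepancy is at least as extreme as the observed one.
R = 2000
d_obs = discrepancy(x_obs)
d_rep = np.array([discrepancy(rng.normal(mu_hat, 1.0, size=x_obs.size))
                  for _ in range(R)])
p_plugin = np.mean(d_rep >= d_obs)
print("plug-in p-value:", p_plugin)
```

The prior and posterior predictive p-values follow the same pattern, except that a fresh parameter draw u_r (from the prior or the posterior, respectively) is used for each replicate.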
In all of the model criticism described above, a key quantity is the discrepancy function D used to compare the data and predictive simulations. We agree with Belin and Rubin (1995, p. 753), who wrote of the importance of identifying discrepancy functions "that would not automatically be well fit by the assumed model", and that "there is no unique method of Bayesian model monitoring, as there are an unlimited number of non-sufficient statistics that could be studied". Lloyd and Ghahramani (2015) suggest the Maximum Mean Discrepancy (MMD) as a measure of discrepancy between the observed data and replicates. The motivation of this approach is to maximize the discrepancy over a class of discrepancy functions rather than choosing only one, i.e.,

D(x_obs, x_rep) = sup_{f ∈ F} ( (1/m) ∑_i f(x_obs_i) − (1/n) ∑_j f(x_rep_j) ),

where F is a set of functions. The function that maximizes the discrepancy is known as the witness function. When F is a reproducing kernel Hilbert space (RKHS), the witness function can be derived in closed form as

f(x) ∝ (1/m) ∑_i κ(x_obs_i, x) − (1/n) ∑_j κ(x_rep_j, x),

where κ is the kernel of the RKHS. This estimation does not work well in high dimensions, and therefore the authors suggest reducing the dimensionality of the observation space before applying this statistic (Lloyd and Ghahramani, 2015, p. 4).
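As an illustration of the closed-form witness function (with an illustrative RBF kernel and synthetic one-dimensional data; none of these choices come from the paper), the sketch below shows where observed data are over- and under-represented relative to replicates from a mis-specified model.

```python
import numpy as np

rng = np.random.default_rng(2)

def rbf(a, b, gamma=0.5):
    # RBF kernel matrix between two 1-d sample sets.
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

x_obs = rng.normal(0.0, 1.0, size=300)   # "observed" data
x_rep = rng.normal(1.0, 1.0, size=300)   # replicates from a shifted (wrong) model

def witness(t):
    # Empirical witness: mean kernel similarity to the data minus to the replicates.
    return rbf(x_obs, t).mean(axis=0) - rbf(x_rep, t).mean(axis=0)

grid = np.linspace(-4.0, 5.0, 181)
w = witness(grid)
# The witness is positive where the observed data are over-represented and
# negative where the replicates are over-represented.
print("peak:", grid[np.argmax(w)], "trough:", grid[np.argmin(w)])
```

Plotting w against the grid reproduces the kind of witness-function visualizations used to localize where a model fails.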

Model Criticism in Latent Space
Recall that we have a model P(X, U | γ), with observed variables X, unobserved variables U, and known parameters γ. In general U may contain latent variables Z, parameters Θ, and hyperparameters λ. Our procedure depends on the following two key observations:

1. If x_obs is drawn from the above model, then a sample u* from P(U | x_obs, γ) is a sample from the prior distribution P(U | γ). To see why this is true, observe that a natural way to sample from the joint P(U, X | γ) is to generate a sample u from P(U | γ), and then generate a sample x from P(X | u, γ), in that order. However, it is also valid to draw samples from the joint by first sampling x from P(X | γ) and then sampling u from P(U | x, γ). Thus we have

Statement 1. If x_obs is a sample from P(X | γ), then a sample u* from P(U | x_obs, γ) will be a draw from P(U | γ).
It is important to clarify what Statement 1 is not saying. It is not saying that repeated draws from P(U | x_obs, γ) will explore the full prior distribution P(U | γ), but only that this is a valid way to draw one sample from it if x_obs is a draw from the model. However, testing how well a single draw from a given distribution fits that distribution is difficult. This brings us to our second observation.
2. If U is a collection of variables, i.e., U = (U_1, ..., U_K), and the prior distribution of U decomposes into independent draws from the same distribution, e.g.,

P(U | γ) = ∏_{k=1}^{K} P_u(u_k | γ),

then it is possible to aggregate these variables: instead of testing whether (u*_1, ..., u*_K) is a sample from P(U | γ), one can test whether {u*_1, ..., u*_K} are independent and identically distributed draws from the distribution P_u(· | γ). In other words, rather than testing one sample against a known high-dimensional distribution, one can test whether the collection of K samples are independent and identical draws from a known lower-dimensional distribution P_u. Thus, we define aggregation as pooling variables with the same prior distribution together, and an aggregated posterior sample (APS) is defined as a set of posterior samples that have been aggregated for comparison with a specific reference distribution. The above can be generalized to the situation where U = (U_1, ..., U_K, θ) is a collection of variables and parameters such that

P(U | γ) = P(θ | γ) ∏_{k=1}^{K} P_u(u_k | θ).

Then {u*_1, ..., u*_K} can be aggregated and tested against P_u(· | θ*). Aggregation can also be extended to the case where U consists of groups of variables (U_1, ..., U_G), where aggregation is performed within each group:

P(U_g | U_{−g}, γ) = ∏_k P_g(u_{g,k} | U_{−g}, γ),

where U_{−g} denotes all groups except g. We provide more concrete examples of aggregation in §3.1 and Table 3.
We refer to this approach as aggregated posterior checking (APC), and summarize it in Algorithm 1. Ideas equivalent to Statement 1 and the aggregation of posterior samples can also be found in Yuan and Johnson (2012), but were applied to the case where U contains only model parameters, and to hierarchical linear models. See §3.2 for more details on related work.

Algorithm 1 Aggregated posterior check
Require: Observed data x_obs
Require: Bayesian model P(X | U, γ) P(U | γ) with latent variables U
1: Generate a posterior sample u* from P(U | x_obs, γ)
2: Generate aggregated posterior sample(s) (see Table 3)
3: Compare aggregated posterior sample(s) with the corresponding reference distribution(s) using an appropriate test
4: return p-value of the test(s)

So far, we have addressed the idea of assessing deviations from the prior distribution, and aggregation in the latent space. However, the same idea can be applied to the observation space as well, i.e., to the likelihood, by testing whether x_obs is a sample from P(X | u*, γ). Although it is true that a discrepancy in the choice of likelihood should be reflected in the posterior sample, assessing the discrepancy in the likelihood directly provides better understanding and easier resolution of the discrepancy. Notice that, although we make use of x_obs, our approach is not equivalent to model criticism in the observation space, since we do not compare the observed data x_obs with replicate observations x_rep_r, but only investigate the relation between the latent sample u* and the observation x_obs. Both methods, however, require generating posterior samples u_r (for model criticism in the latent space we use r = 1).
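A minimal end-to-end run of Algorithm 1 can be sketched as follows, using a hypothetical normal-means model with K exchangeable latent variables and an exact conjugate posterior in place of step 1's sampler; all modelling choices here are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
K, sigma = 500, 2.0

# One dataset from a normal-means model: u_k ~ N(0, 1) iid, x_k | u_k ~ N(u_k, sigma^2).
u = rng.normal(size=K)
x_obs = rng.normal(u, sigma)

# Step 1: one exact posterior sample u*_k | x_obs (conjugate Gaussian update).
post_var = 1.0 / (1.0 + sigma ** -2)
post_mean = post_var * x_obs / sigma ** 2
u_star = rng.normal(post_mean, np.sqrt(post_var))

# Steps 2-3: aggregate the K exchangeable draws and test against the N(0, 1) prior.
p_value = stats.kstest(u_star, "norm").pvalue
print("APC p-value:", p_value)  # step 4: small values indicate prior misfit
```

With a well-specified model the aggregated draws match the prior; changing the data-generating prior for u (say, to a heavy-tailed distribution) while keeping the assumed model would drive the p-value towards zero.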

Application to different models
We discuss below the application of model criticism in latent space to factor analysis, linear dynamical systems and Gaussian process regression. These situations are then demonstrated on real data in §4.

Factor analysis model
Consider a factor analysis model with hyperparameters λ = {τ_θ, τ}, parameters (loading matrix) Θ, latent variables (factors) z and data x. Grouping Z = {z_i}_{i=1}^n and similarly for X, Figure 1d illustrates this model. In Gaussian factor analysis, z ∼ N(0, τ_z^{-1} I). There is an identifiability issue in the factor analysis model between Θ and z, which is resolved by fixing the scale of one of the two. In Eq. (7) the dependence of z on λ is taken to be null, i.e., τ_z = 1. (In the case of the example in §4.1 we fix the scale of Θ instead.) Also, p(x | z, Θ, τ) = N(Θz, τ^{-1} I) and τ, τ_θ | α, β ∼ Gamma(α, β). Thus the fixed parameters are γ = {τ_z, α, β}. If X_obs is drawn from the above model, then a sample (λ*, Θ*, Z*) from P(λ, Θ, Z | X_obs) is a sample from the prior P(λ) P(Θ | λ) P(Z | λ). In factor analysis, Z decomposes into independent draws from P(z | τ_z), and therefore one can pool the posterior samples z*_i to assess deviations from P(z | τ_z). Moreover, each z_i usually decomposes into independent draws over the different latent dimensions as ∏_k p(z_ik | τ_z), so one can pool the z*_ik to assess deviations from p(z | τ_z). Similarly, if the prior over the factor loading matrix Θ decomposes as p(Θ | τ_θ) = ∏_{kj} p(θ_kj | τ_θ), then one can pool the θ*_kj and compare with p(θ | τ*_θ). One can also go beyond the marginal z or the full vector z, and assess a subset of the vector such as bivariate interactions (see §4.1).
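The pooling described above can be sketched on synthetic data. The block below simulates a small Gaussian factor analysis model, draws one exact posterior sample of Z given the true Θ and τ (standing in for an MCMC draw), and pools the z*_ik for comparison with the N(0, 1) prior; all sizes and values are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, d, k = 2000, 8, 3

# Synthetic FA model: x_i = Theta z_i + noise, with z_ik ~ N(0, 1) (tau_z = 1).
theta = rng.normal(0.0, 1.0, size=(d, k))
z = rng.normal(0.0, 1.0, size=(n, k))
tau = 4.0                                   # observation noise precision
x = z @ theta.T + rng.normal(0.0, tau ** -0.5, size=(n, d))

# One exact posterior draw of each z_i given x_i and (theta, tau):
# z_i | x_i ~ N(tau S theta^T x_i, S), with S = (I + tau theta^T theta)^(-1).
s = np.linalg.inv(np.eye(k) + tau * theta.T @ theta)
post_mean = tau * x @ theta @ s             # n x k matrix of posterior means
z_star = post_mean + rng.normal(size=(n, k)) @ np.linalg.cholesky(s).T

# Aggregate: pool all z*_ik and compare with the N(0, 1) prior.
pooled = z_star.ravel()
print("pooled KS p-value:", stats.kstest(pooled, "norm").pvalue)
```

Pooling pairs (z*_i1, z*_i2) instead of single entries gives the bivariate check mentioned above, which can reveal dependence that the marginal check misses.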

Linear dynamical system
One can extend the idea of aggregation beyond factor analysis models. For example, Statement 1 holds for general latent variable models with repeated structure. Take, for example, a linear dynamical system model with a latent Markov chain,

z_t = A z_{t−1} + ε_t,  x_t = B z_t + ψ_t,

where U consists of the system and observation matrices A, B, and precisions Q, R. Then according to Statement 1, a sample (Z*, u*) drawn from P(Z, U | X_obs) should be distributed according to the prior over (Z, U). Although the z_t's are not independent (due to the Markov chain), we can consider model criticism for p(z_t | z_{t−1}). For example, for a system model parameterized as ε_t = z_t − A z_{t−1} with ε_t ∼ N(0, Q^{-1}), violations of the model will show up as deviations of the ε*_t's from N(0, Q*^{-1}) (see §4.2). Similarly, for an observation model parameterized as ψ_t = x_t − B z_t with ψ_t ∼ N(0, R^{-1}), violations of the model may also show up as deviations of the ψ*_t's from N(0, R*^{-1}). See §4.2.
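For a concrete (synthetic, one-dimensional) illustration of the latent-residual check, the sketch below simulates an LDS state sequence, forms the residuals ε_t = z_t − a z_{t−1} under a posterior sample (here stood in for by the true parameters), standardizes them, and tests normality.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
T, a, q = 3000, 0.9, 4.0   # series length, system coefficient, state precision

# Simulate z_t = a z_{t-1} + eps_t with eps_t ~ N(0, 1/q).
z = np.zeros(T)
for t in range(1, T):
    z[t] = a * z[t - 1] + rng.normal(0.0, q ** -0.5)

# Latent residuals under the sampled parameters, standardized by sqrt(q).
eps_std = np.sqrt(q) * (z[1:] - a * z[:-1])
print("residual KS p-value:", stats.kstest(eps_std, "norm").pvalue)
```

In the Bayesian check, the true (a, q) would be replaced by a single posterior draw (a*, q*); deviations of the standardized residuals from N(0, 1), or serial correlation among them, indicate misfit of the state model.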

Gaussian process regression
A Gaussian process probabilistic model is defined as

f ∼ GP( m(x | ϑ), κ(x, x′ | ζ) ),  y_i = f(x_i) + ε_i,  ε_i ∼ N(0, τ^{-1}),

where m(x | ϑ) is the mean function parameterized by ϑ, κ(x, x′ | ζ) is the covariance function (or kernel) parameterized by ζ, and τ is the observation noise precision; see, e.g., Rasmussen and Williams (2006).
Alternatively, consider the eigendecomposition K = UΛU^T, where U = [u_1, ..., u_n] is the matrix of eigenvectors u_i and Λ is the diagonal matrix of the corresponding eigenvalues λ_i, so that c_i = u_i^T y ∼ N(0, λ_i). This implies that, according to the model, the projections c_i of the signal y onto the eigenvectors u_i are independent samples from N(0, λ_i). Thus, the normalized projections z_i = c_i / √λ_i should be independent samples from the N(0, 1) distribution. One can thus test the normality of the z_i's to assess the goodness of fit. However, note that if the ith eigenvalue of K is much smaller than the noise variance τ^{-1}, then z_i is dominated by the white-noise contribution. Thus we only include the z_i corresponding to eigenvalues with λ_i > 2τ^{-1} when assessing the fit of the GP model. See §4.3.
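The projection check can be sketched on synthetic data as follows. The kernel, length scale and noise level are illustrative, and the eigendecomposition of the full covariance (kernel plus noise) is used so that the standardized projections are exactly N(0, 1) under the model.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, tau = 400, 25.0                           # number of inputs, noise precision

x = np.linspace(0.0, 10.0, n)
# Squared exponential kernel with unit variance and unit length scale (illustrative).
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)
cov = K + np.eye(n) / tau                    # full covariance of the noisy signal
y = rng.multivariate_normal(np.zeros(n), cov)

# Project y onto the eigenvectors and standardize by the eigenvalues.
lam, U = np.linalg.eigh(cov)
z = (U.T @ y) / np.sqrt(lam)                 # iid N(0, 1) under the model

# Keep only projections not dominated by white noise: K-eigenvalue > 2 / tau.
keep = np.linalg.eigvalsh(K) > 2.0 / tau
print(int(keep.sum()), "projections kept; KS p-value:",
      stats.kstest(z[keep], "norm").pvalue)
```

Fitting the wrong kernel (e.g., omitting a periodic component present in the data) concentrates excess energy in particular projections, which is how the misfit is diagnosed in §4.3.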

Related Work
Cook et al. (2006) consider the situation with (in our notation) a prior p(u) on the parameters of the model, and data p(x | u). They assume that specific parameters u_0 are drawn from the prior, and data x_obs are then drawn from P(X | u_0). They then consider samples u_1, u_2, ..., u_L drawn from P(U | x_obs), and comment (in the caption of their Figure 1, translating to our notation) that "(x_obs, u_ℓ) should look like a draw from P(X, U) for ℓ = 0, 1, ..., L". They then use the 'reverse' of Statement 1 to validate the correctness of posterior samples generated by statistical software, by comparing u_0 with u_1, u_2, ..., u_L. Their recommended method for this is to calculate posterior quantiles for each scalar parameter; if the software is working correctly then the posterior quantiles are uniformly distributed. Although they share with us the observation that (x_obs, u_ℓ) should look like a draw from P(X, U), this is used to answer a totally different question. Also, they do not discuss the inclusion of latent variables in the model. Johnson (2007) and later Yuan and Johnson (2012) also consider a model with parameters U and data drawn from P(X | U). Their interest is in the use of a pivotal quantity d(x, u) that has a known and invariant sampling distribution when data x_obs are generated from a model with data-generating parameters u_0. Yuan and Johnson (2012) then show that if d(X, u_0) is a pivotal quantity distributed according to F, then d(X, u_ℓ) is also distributed according to F, if u_ℓ is drawn from the posterior on U given x_obs. The result of Yuan and Johnson (2012) extends the earlier work of Johnson (2007) to the case where d(x, u) does not depend on the data x; for example, this situation can arise in a Bayesian hierarchical linear regression model, when considering the second level, where parameters for individual units are generated from a hyperprior.
Regression diagnostics is a well-explored example of model criticism. Existing approaches assess certain statistical assumptions made during modelling, e.g., whether the residuals follow a normal distribution with zero mean (e.g., using a Q-Q plot (Wilk and Gnanadesikan, 1968)), whether the residuals are homoscedastic (e.g., using the Breusch-Pagan (Breusch and Pagan, 1979) or White test (White, 1980)), or whether successive residual terms are uncorrelated (e.g., using the Durbin-Watson test (Durbin and Watson, 1950)). Regression diagnostics can be seen as a special case of model criticism in the latent space, since residuals are representatives of errors, which are latent variables of the model. However, our methods are also applicable to more complex models. Meulders et al. (1998) consider a factor analysis model for binary data, using (in our notation) Beta(2, 2) priors on Z and Θ. They carry out posterior sampling using block Gibbs sampling for Z and Θ and compare histograms of these variables against the prior. Discrepancies between the prior and histograms of the sampled aggregated posterior led to model extension, expanding the model to use a mixture of two beta distributions for the parameters. However, the authors do not explain the basis for carrying out this check (cf. Statement 1). Buccigrossi and Simoncelli (1999) consider the posterior distribution of wavelet coefficients (analogous to z in the factor analysis model) in response to image patches. By considering the distribution of a bivariate aggregated posterior, they show that it is not equal to the product of the marginals, but exhibits variance correlations. (This is shown by introducing a "bowtie plot" showing the conditional histogram of z_2 given z_1.) This work is a nice example of how the failure of a diagnostic test can give rise to an extended model (see §4.1).
O'Hagan (2003, §3) considered model criticism tools that can be applied at each node of a graphical model (and of course latent variables can be considered as such). O'Hagan (2003, §3.1) discussed the idea of residual testing at different levels of a hierarchical model as well as in a generic probabilistic model. He suggested checking whether a node in a probabilistic model is misbehaving by comparing the posterior samples at that node to the prior distribution. O'Hagan (2003, §3.2) also emphasized that conflict can arise between the different sources of information about a variable at a particular node, arising from contributions from each neighbouring node in the graph. However, he did not suggest using the aggregated posterior to assess goodness of fit, but considered the posterior at each node separately. Tang et al. (2012) introduce the concept of the "aggregated posterior" as applied to the deep mixtures of factor analysers (MFA) model. Consider the situation as above but where Θ is estimated by maximum likelihood, so it is the posterior over Z that is of interest. Thus p(X, Z | Θ) = ∏_{i=1}^n p(z_i) p(x_i | z_i, Θ). Under this model we also have that p(z) = ∫ p_Θ(z | x) p_Θ(x) dx, where the Θ subscript denotes that both p_Θ(z | x) and p_Θ(x) correspond to distributions under the model. Tang et al. (2012, p. 3) define the aggregated posterior as "the empirical average over the data of the posteriors over the factors", i.e.,

p̃(z) = (1/n) ∑_{i=1}^{n} p_Θ(z | x_i),    (12)

where the integral wrt p_Θ(x) has been replaced by the empirical average over samples. If the data distribution p(x) is equal to the model distribution p_Θ(x), then p̃(z) should agree with p(z). However, differences between p(x) and p_Θ(x) will manifest as differences between the two respective distributions in the latent space.
In practice, however, one does not explicitly construct the aggregated posterior in Eq. (12), since it is only asymptotically equal to the prior. Instead, Tang et al. (2012) compare a collection of n samples z*_i from p_Θ(z | x_obs_i) for i = 1, ..., n to p(z). This is a valid approach since, if {x_obs_1, ..., x_obs_n} follow the distribution p_Θ(x), then {z*_1, ..., z*_n} follow the distribution p(z), as we show in Statement 1. Additionally, as Θ is not known in practice, Tang et al. (2012) replace Θ with the maximum likelihood estimate Θ̂ in the definition of the aggregated posterior in Eq. (12). In Statement 1 we extend this idea to a Bayesian setting where Θ and λ are not fixed parameters but latent variables themselves.
Tang et al. (2012) started with a simple mixture of factor analysers (MFA), and observed that the aggregated posterior for a latent component often does not match the N(0, I) prior. By replacing the prior for a component with another MFA model, they constructed a deep MFA model. The idea of the aggregated posterior (although not the name) can be traced back, e.g., to Hinton et al. (2006), where in deep belief nets the idea was that the posterior distribution over the latents of a restricted Boltzmann machine (RBM) could be modelled by another RBM.

Examples
In this section, we provide three examples of model criticism and extension in the latent space. First, we explore a factor analysis model in the context of image compression (§4.1). The objective of this example is to show how the model can be criticised in the latent space as well as in the observation space. Our analysis leads to changing the latent distribution from a single Gaussian to a scale mixture of Gaussians, which captures both the marginal and the joint structure of the latent space, and improves the model in the observation space as well.

Next, we explore a linear dynamical system model (§4.2) in the context of modelling time series. We show that model criticism in latent space allows us to interrogate not only the standard "innovations" (defined in (20)), but also the latent residuals (defined in (19)).

Finally, we explore a Gaussian process model (§4.3) in the context of modelling time series. The objective of this example is to show when model criticism in the latent space can be a natural choice, whereas model criticism in the observation space can be difficult. Our analysis leads to changing the covariance function from a squared exponential kernel to a combination of periodic and squared exponential kernels.
We implemented all models (except the Gaussian process model) in JAGS (Plummer, 2003), keeping a single sample from the MCMC run after discarding a burn-in of 1000 samples (10,000 samples for §4.2). Note that for model criticism in the latent space, we need only a single sample. We summarize the aggregation process and the corresponding reference distributions used in this section in Table 3.
An alternative latent variable model for the factor analysis model is (see Table 2)

z_ik ∼ L(0, τ_z).    (15, Laplace)

We use the same aggregation strategy as before, and compare the empirical distributions with the univariate distribution L(0, τ*_z) and the bivariate distribution L(z_1; 0, τ*_z) L(z_2; 0, τ*_z), respectively. We observe similar characteristics in the aggregated posterior as before (Figure 2, middle column).
To accommodate these observations, we allow a scale mixture of Gaussian distributions, as used by Wainwright and Simoncelli (2000), with 8 components for the latent variable:

z_ik ∼ ∑_{m=1}^{8} π_m N(0, τ_m^{-1}).    (16, Scale Mixture of Gaussians)

We generate a sample (π*_1, ..., π*_8, τ*_1, ..., τ*_8) as well. We assess the same aggregated distributions as before and compare them with the corresponding univariate and bivariate mixture distributions, respectively. We observe that the empirical marginal distribution follows the mixture distribution well, although a KS test rejects the hypothesis that the aggregated posterior follows the mixture distribution. Additionally, the joint distribution captures the heteroscedasticity in the latent space (Figure 2, right column).
We also show the eigenvectors of the corresponding loading matrix for each of the three cases (Figure 2, bottom row). We show the eigenvectors rather than the loading matrices themselves since, for the Gaussian and Gaussian scale mixture, the columns of the loading matrix may not correspond to any particular pattern due to rotational invariance. We observe that all three loading matrices span a similar space.
The matrix factorization model can be criticised in the observation space with established image statistics as a discrepancy measure. This, however, requires generating replicate data of the same size as the observed data, which in this case is computationally expensive since X_obs ∈ R^{64×50,000}. To avoid generating multiple replicates, i.e., matrices X_rep_r ∈ R^{64×50,000}, we generate only a single replicate for each choice of latent distribution and compare it with the observed data. For all three cases, i.e., Gaussian, Laplace, and scale mixture of Gaussians, we generate latent samples z_rep_i from the fitted parameters τ*_z (and τ*, π* for the scale mixture of Gaussians). We use the rest of the fitted parameters, i.e., Θ*, τ*, and b*, to generate samples x_rep_i from z_rep_i. We generate 50,000 8 × 8 replicate image patches, and compare the observed and replicate data in terms of the distribution of raw pixel values. We show the results in Figure 3. We observe that the distribution of image pixel values in the replicate data follows the observed data more closely for the scale mixture of Gaussians than for the other latent distributions. However, it is not a perfect fit, which tells us that this model can be improved further, potentially by increasing K or by varying the noise characteristics, such as using a full diagonal covariance.

Honey bee data
The honey bee data consist of measurements of the (x, y) coordinates and head angle ν of 6 honey bees. The measurements are usually translated into a 4-dimensional multivariate time series (x, y, cos(ν), sin(ν)), and modelled using a switching linear dynamical system to capture three distinct dynamical regimes, namely left turn, right turn and waggle (Oh et al., 2008). We follow this strategy, and model each time series by a switching linear dynamical system (SLDS) (Fox et al., 2009) as follows:

z_t = A^{(s_t)} z_{t−1} + ε_t,  ε_t ∼ N(0, Q^{(s_t)−1}),
x_t = B z_t + ψ_t,  ψ_t ∼ N(0, R^{−1}),

where s_t can be in one of {1, ..., S} states. We assume that Q^{(·)} (for each state) and R are diagonal matrices with Gamma(α, β) priors over the nonzero entries, that the entries of A^{(·)} (for each state) and B originate from a Gaussian distribution, and that π^{(·)} (for each state) follows a Dirichlet distribution. We group s = {s_i}_{i=1}^n and Z = {z_i}_{i=1}^n. We set α = β = 0.001. We fit two models with S = 1 (a standard linear dynamical system) and S = 3, both with a 4-dimensional latent space. We generate a posterior sample (s*, Z*, A^{(1)*}, ..., A^{(S)*}, Q^{(1)*}, ..., Q^{(S)*}, B*, R*), and aggregate the standardized latent residuals and observation residuals (or innovations)

ε̃_t = Q^{(s*_t)* 1/2} (z*_t − A^{(s*_t)*} z*_{t−1}),    (19)
ψ̃_t = R^{* 1/2} (x_t − B* z*_t).    (20)

For an LDS the standard approach to model criticism is to check that the innovations sequence is zero-mean and white (see, e.g., Candy, 1986, §5.1), although this is usually carried out for known or point-estimated parameters, not in a Bayesian setting. We use this check (extended to the SLDS case) below, but also consider the latent residuals.
First, we focus on the marginal structures ε̃_kt and ψ̃_jt by pooling over k = 1, 2, 3, 4 and j = 1, 2, 3, 4, rather than the 4-dimensional vectors themselves, as shown in Figure 4 (left). We expect the APSs to deviate from normality more (lower p-value) when S = 1 than when S = 3, and we observe this to be true for all honey bee sequences except 2. For sequences 4-6, the latent segmentations of the SLDS in terms of (s*_1, ..., s*_n) agree well with the ground truth; we present the 6th sequence in Figure 4 (right). For sequences 1-3, we observe that the segmentations are rather poor, similar to the results in (Fox et al., 2009, §5).
Next, we focus on the joint structures in the temporal domain by pooling (ψ̃_{j1,t}, ψ̃_{j2,t}) for all j1, j2 = 1, 2, 3, 4 with j1 ≠ j2, (ε̃_{k,t}, ε̃_{k,t+1}) for all k = 1, 2, 3, 4, and (ψ̃_{j,t}, ψ̃_{j,t+1}) for all j = 1, 2, 3, 4. We expect these APSs to deviate from the reference distribution N(0, I_2) more for S = 1 than for S = 3. We compute the correlation coefficients, and observe the respective p-values for S = 1 and S = 3 (Rahman, 1968). We observe that for S = 1, the models are rejected either in the latent domain or in the observation domain except for sequence 3, while for S = 3, the models are rejected either in the latent domain or the observation domain for sequences 2 and 3 only. In other words, for sequences 1, 4, 5 and 6, the model improves for S = 3, whereas for sequence 2 it fails to improve, and for sequence 3 it degrades for S = 3. These observations can again be attributed to the poor segmentation for sequences 1-3. We show the corresponding aggregated posteriors in the latent and observation space for sequence 6 in Figure 5. We observe that the residuals in both latent and observation space display correlations for S = 1, while these are reduced considerably for S = 3.
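The lag-one pooling can be illustrated with a synthetic residual check (nothing here comes from the honey bee analysis itself): pool the pairs (e_t, e_{t+1}) from a residual sequence and test their correlation against zero.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
T = 2000

# White residuals (well-specified model) vs AR(1) residuals (unmodelled dynamics).
e_white = rng.normal(size=T)
e_ar = np.zeros(T)
for t in range(1, T):
    e_ar[t] = 0.5 * e_ar[t - 1] + rng.normal()

def lag1_corr_test(e):
    # Pool (e_t, e_{t+1}) pairs and test their correlation coefficient against zero.
    r, p = stats.pearsonr(e[:-1], e[1:])
    return r, p

r_white, p_white = lag1_corr_test(e_white)
r_ar, p_ar = lag1_corr_test(e_ar)
print("white residuals:", r_white, p_white)
print("AR(1) residuals:", r_ar, p_ar)
```

A tiny p-value for the AR(1) sequence flags serial correlation, i.e., dynamics the model has failed to capture, which is the signature being checked for in the honey bee residuals.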

Carbon emission data
The CO2 emission dataset comprises monthly average atmospheric carbon concentrations y_i (in parts per million) between 1958 and 2017 (707 measurements after removing missing values). Rasmussen and Williams (2006, §5.4.3) show that this time series can be modelled well by a combination of 4 standard covariance functions involving 10 hyperparameters (and an additional parameter to model the additive white noise). Each covariance function is introduced to model a specific aspect of the signal: a squared exponential kernel for the long-term trend, a decaying periodic kernel for the seasonal variation, a rational quadratic kernel for the short-term irregularities, and another squared exponential kernel for the residual correlated noise. We show below how model criticism and extension can be used to justify the use of covariance functions representing similar aspects of the data.
Following Rasmussen and Williams, we use the measurements up to year 2004 (543) as training data and the rest (164) as testing data. We remove the mean of the training data before modelling, and use a zero mean function, i.e., m(x) = 0 or m = 0. We use Gamma(α, β) priors (see Table 2) over ζ and τ⁻¹ to keep them positive. We use the GPstuff toolbox (Vanhatalo et al., 2013) to generate MCMC samples, and initialize the sampler at the maximum likelihood (ML) solution obtained using the GPML toolbox. We set the parameters α and β such that the mean of the prior distribution is at the ML solution, and the variance is equal to the mean. We generate a posterior sample (ϑ*, ζ*, τ*) and aggregate the standardized projections z_i* = c_i*/(λ_i*)^{1/2} with c* = U*ᵀ(y_obs − m*), where U* and Λ* are the eigenvectors and eigenvalues of the kernel matrix K* such that K*_{ij} = k(x_i, x_j) + τ*⁻¹ δ(x_i, x_j), and m* = 0 by design. We first model the time series with the squared exponential or Gaussian kernel k(x, x′) = σ_f² exp(−(x − x′)²/(2l²)), where l is the length scale and σ_f² is the signal variance, i.e., ζ = {σ_f², l}. We obtain ζ_ML = (188, 0.30) and ζ* = (197, 0.29). We present the fitted data along with the unstandardized and standardized projections in Figure 6. The figure shows that the Gaussian kernel fails to model the time series: the prediction quickly falls to the mean of the training signal, and the KS-test p-value is 4 × 10⁻¹⁰. We observe that most of the signal strength (the c_i's) is concentrated at low frequencies (corresponding to the large eigenvalues λ_i, i ∈ {1, . . . , 5}). The respective eigenvectors correspond to an upward trend. Also, a relatively high strength is observed at eigenvalues 92–93. The respective eigenvectors correspond to sinusoids with a period of ∼1 year (see Figure 7a), which indicates a potential need for a periodic covariance function to model this data.
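The projection-based check can be sketched as follows, with a toy squared exponential kernel and placeholder hyperparameters rather than the fitted CO2 values: if y is truly drawn from the assumed GP, the standardized projections z_i = c_i/λ_i^{1/2} onto the eigenbasis of K are i.i.d. N(0, 1), which a KS test should fail to reject.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 300
x = np.linspace(0, 10, n)

def se_kernel(x, sigma2=1.0, ell=1.0):
    # Squared exponential kernel (hyperparameters are placeholders)
    d = x[:, None] - x[None, :]
    return sigma2 * np.exp(-0.5 * (d / ell) ** 2)

tau_inv = 0.1  # white-noise variance (placeholder)
K = se_kernel(x) + tau_inv * np.eye(n)

# Draw y from the assumed GP (zero mean), then project onto the
# eigenbasis of K: z_i = c_i / sqrt(lam_i) should be N(0, 1)
lam, U = np.linalg.eigh(K)
y = rng.multivariate_normal(np.zeros(n), K)
c = U.T @ y          # unstandardized projections
z = c / np.sqrt(lam) # standardized projections

p = stats.kstest(z, 'norm').pvalue
print(p > 0.05)
```

When the kernel is misspecified (as with the Gaussian kernel on the CO2 data), the z_i concentrate on particular eigendirections and the KS test rejects, which is exactly the diagnostic used in the text.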
To tackle this, we use the decaying periodic function k(x, x′) = σ_f² exp(−2 sin²(π(x − x′)/p)/l_p² − (x − x′)²/(2l_d²)) to model the time series (Rasmussen and Williams, 2006, §5.4.3), where p is the period of the covariance function; therefore ζ = (σ_f², p, l_p, l_d). We obtain ζ_ML = (283, 1, 5.13, 5.86) and ζ* = (385, 1, 4.88, 6.09). We observe that this provides a better fit than the squared exponential kernel (p-value 0.08). Although the KS-test fails to reject the fitted model (perhaps due to a lack of samples), we observe that the signal strengths (the c_i's) still deviate from their expected values. In particular, the second, fourth and sixth projections show relatively high values compared to the third, fifth and seventh. The signal Σ_{i∈{2,4,6}} c_i u_i corresponds to an upward trend, which corroborates the need to model the trend further; see Figure 7b. Note that although the CO2 data is a time series, the analysis of the c-samples (see Eq. 10) does not depend on this, and can also be used where the input space is multi-dimensional.
To accommodate the upward trend, we introduce a squared exponential kernel with a relatively large length scale. However, to avoid modelling small-scale variations with the same kernel, we use a combination of two squared exponential kernels with two different length scales. Therefore, ζ = (σ_f², p, l_p, l_d, σ_fs², l_s, σ_fl², l_l), where the last four parameters belong to the two squared exponential kernels with small (s) and large (l) length scales. We obtain ζ_ML = (4.37, 1, 1.78, 74.60, 0.81, 0.92, 4132, 27.14) and ζ* = (2.25, 1, 1.24, 73.88, 0.32, 0.66, 4095, 32.25). We observe that this improves the fit even further, both in terms of the testing data (visually) and in terms of the unstandardized projections. The KS-test uses more samples, and still fails to reject the model (p-value 0.55).
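A sketch of the composite covariance function (decaying periodic plus two squared exponentials) follows, using the posterior means quoted above as illustrative hyperparameter values; the function names and parameter dictionary are ours, not from any toolbox.

```python
import numpy as np

def se(d, sigma2, ell):
    # Squared exponential component
    return sigma2 * np.exp(-0.5 * (d / ell) ** 2)

def decaying_periodic(d, sigma2, p, ell_p, ell_d):
    # Periodic component with a squared exponential decay envelope
    return sigma2 * np.exp(-2 * np.sin(np.pi * d / p) ** 2 / ell_p ** 2
                           - 0.5 * (d / ell_d) ** 2)

def composite(x1, x2, theta):
    d = x1[:, None] - x2[None, :]
    return (decaying_periodic(d, theta['s2_p'], theta['p'],
                              theta['l_p'], theta['l_d'])
            + se(d, theta['s2_s'], theta['l_s'])   # short length scale
            + se(d, theta['s2_l'], theta['l_l']))  # long length scale: trend

# Posterior means from the text, used as illustrative values
theta = dict(s2_p=2.25, p=1.0, l_p=1.24, l_d=73.88,
             s2_s=0.32, l_s=0.66, s2_l=4095.0, l_l=32.25)
x = np.linspace(0, 5, 50)
K = composite(x, x, theta)
print(K.shape, np.allclose(K, K.T))
```

Each summand targets one aspect of the signal (seasonality, short-term variation, long-term trend), which is the modular extension strategy the criticism in latent space motivates.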
Model criticism of Gaussian processes in the observation space has been discussed by Lloyd and Ghahramani (2015). However, their approach differs from the standard posterior predictive check in that they use hold-out data rather than using the observed data twice. Although this approach shows whether the response on hold-out data differs under the fitted model, it does not necessarily point out how the model can be extended.
One could generate a replicate sample y_rep ∼ p(· | ζ*, X_obs), and compare y_rep and y_obs as in a posterior predictive check. However, note that in this case y_rep will be an independent draw from the GP with parameters ζ* and input locations X_obs, and hence could look very different from y_obs; this is why Lloyd and Ghahramani (2015) make use of held-out data. Also, it would be difficult to come up with a suitable discrepancy function in this case. One could consider the χ² discrepancy yᵀK⁻¹y. However, this quantity is fitted when sampling the kernel parameters ζ, and is also (as discussed above) dominated by the noise for small eigenvalues of K. Other discrepancy measures could be investigated, but exploring these alternatives is beyond the scope of this paper.
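For concreteness, the χ² discrepancy mentioned above can be sketched with a toy kernel and a placeholder noise level: when y is drawn from the model, yᵀK⁻¹y follows a χ² distribution with n degrees of freedom, so it carries little information once the kernel parameters have been fitted to it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 200
x = np.linspace(0, 10, n)
dist = x[:, None] - x[None, :]
# SE kernel plus white noise; unit hyperparameters are placeholders
K = np.exp(-0.5 * dist ** 2) + 0.1 * np.eye(n)

# chi^2 discrepancy y^T K^{-1} y; under the model it is chi^2 with n dof
y = rng.multivariate_normal(np.zeros(n), K)
disc = y @ np.linalg.solve(K, y)
p = stats.chi2.sf(disc, df=n)
print(disc, p)
```

Since the discrepancy concentrates around n regardless of which eigendirections carry the signal, it cannot localize misfit the way the per-eigenvalue projections c_i can.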

Discussion
Model criticism explores the discrepancies between a statistical model P(X, U) and observed data x_obs. This is often achieved by generating replicate observations X_rep ∼ P(X | x_obs) from the fitted model, and investigating which aspects D(X, U) of the replicated observations do not match the observed data. Instead, here we have focused on pulling the effect of the data back into the latent space, and investigating whether a posterior sample u* ∼ P(U | x_obs) follows the prior distribution P(U), as it should by Statement 1 if the data were generated by the model. This is tested by aggregating related variables with the same prior distribution and comparing them with the associated prior.
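This principle can be illustrated with a toy conjugate model (a hypothetical normal-normal pair, not one of the paper's models): when each dataset is generated under the model, a single posterior draw per dataset is marginally distributed according to the prior.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
m = 5000  # number of independent toy datasets

# Toy conjugate model: u ~ N(0, 1), x | u ~ N(u, 1)
u = rng.standard_normal(m)       # prior draws
x = u + rng.standard_normal(m)   # data generated under the model

# Posterior P(u | x) = N(x/2, 1/2); draw one u* per dataset
u_star = x / 2 + np.sqrt(0.5) * rng.standard_normal(m)

# Marginally, u* should follow the prior N(0, 1)
p = stats.kstest(u_star, 'norm').pvalue
print(p > 0.05)
```

If the data-generating process differed from the model (say, x drawn with a shifted mean), the pooled posterior draws would no longer match the prior, which is exactly what the aggregated posterior checks detect.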
It should be noted that model criticism is not used to judge whether a model is right or wrong. On the contrary, it is widely accepted that all models are wrong but some are useful (Box and Draper, 1987, p. 424). Model criticism aims at understanding the limitations of the model in the hope that a better model can be found: since all models are simplifications of a more complex process, model criticism inspects whether the simplification is meaningful, and whether the statistical assumptions made are reasonable. Following this principle, we have discussed four examples of model criticism in latent space. We have shown that by analysing the distribution of the aggregated posterior, a model can be extended so that the aggregated posteriors follow the respective prior distributions better.

Figure 2 :
Figure 2: Aggregated posterior [agg. post.] and associated prior at the latent node for (left to right) the Gaussian, Laplace and scale mixture of Gaussians models of image patches. Top: ECDF of aggregated posterior samples (16 × 50,000 samples) and CDF of the respective prior. Middle: conditional mean and standard deviation of the bivariate aggregated posterior samples (16 × 15 × 50,000 ÷ 2 samples, over 100 bins) and the respective prior distribution. Bottom: eigenvectors of the respective loading matrices.

Figure 3 :
Figure 3: Empirical cumulative distribution functions of raw pixel values on observed and replicate data for the different models.

Figure 5 :
Figure 5: Scatterplot of bivariate aggregated posterior samples (607 × 4 samples) for sequence 6. The black line shows the best linear fit, whereas the dotted line shows the expected fit in the absence of (serial) correlation. The values in each plot are the correlation (r) and the respective p-value (p).

Figure 6 :
Figure 6: Latent values of the Gaussian process model for the CO2 emission dataset. Top: original and fitted signal; training and testing sets are separated by a gray line. Middle: unnormalized projections c_i* for i = 1, . . . , n, and the respective 95% confidence interval ±1.96 λ_i^{1/2}. We only show values for which λ_i* > 2τ*⁻¹. The y-axis has been transformed by sgn(y)|y|^{0.3} to show small values. Bottom: ECDFs of aggregated posterior samples [agg. post.] of the normalized projections z_i and the CDF of the prior distribution N(0, 1). p-values correspond to the KS-test.

Figure 7 :
Figure 7: Weighted eigenfunctions of the carbon emission dataset. The kinks in the plot appear due to the short length scale.

Table 1 :
Description of notation used in the paper.

Table 3 :
The table summarizes the observed data x_obs, unknown variables U, aggregated posterior sample(s) (APS), and corresponding reference distribution(s) (as elaborated in Algorithm 1) for the three models discussed in §4, and different scenarios within each model.