The probabilistic tensor decomposition toolbox

This article introduces the probabilistic tensor decomposition toolbox - a MATLAB toolbox for tensor decomposition using Variational Bayesian inference and Gibbs sampling. An introduction and overview of probabilistic tensor decomposition and its connection with classical tensor decomposition methods based on maximum likelihood is provided. We subsequently describe the probabilistic tensor decomposition toolbox which encompasses the Canonical Polyadic, Tucker, and Tensor Train decomposition models. Currently, unconstrained, non-negative, orthogonal, and sparse factors are supported. Bayesian inference forms a principled way of incorporating prior knowledge, prediction of held-out data, and estimating posterior probabilities. Furthermore, it facilitates automatic model order determination, automatic regularization on factors (e.g. sparsity), and inherently penalizes model complexity which is beneficial when inferring hierarchical models, such as heteroscedastic noise modelling. The toolbox allows researchers to easily apply Bayesian tensor decomposition methods without the need to derive or implement these methods themselves. Furthermore, it serves as a reference implementation for comparing existing and new tensor decomposition methods. The software is available from https://github.com/JesperLH/prob-tensor-toolbox/.


Introduction
Tensors, higher-order or n-way arrays are increasingly encountered in all areas of science. While standard two-way matrix analysis methods can be applied by restructuring these higher order arrays into a matrix, such approaches fail to properly exploit the inherent multi-way structure. Instead, using multi-way or tensor methods, such as the Canonical Polyadic (CP) decomposition (also known as PARAFAC and CandeComp) [1][2][3], Tucker model [4,5] or variations thereof are able to account for the intrinsic structure of the data. Tensors are prevalent in both research and industry (for reviews see [6][7][8][9]) and has a large heterogeneous research community and several tensor analysis toolboxes are publicly available. The most prominent being the Matlab based N-way Toolbox [10], Tensorlab [11], and Tensor Toolbox [12] enabling researchers to apply multi-way modeling across domains.
However, these existing prominent multi-way toolboxes are based solely on maximum likelihood (ML) estimation which only provides a point estimate of the underlying parameters and does not account for parameter uncertainty. Instead estimating uncertainty requires repeated model fitting using for instance jackknifing [13] or bootstrapping [14] which are increasingly expensive as the size of the data grows.
In contrast, Bayesian inference results in an approximation of the true posterior distribution either via sampling [15][16][17] or a direct approximation [18][19][20]. Apart from uncertainty quantification, benefits of Bayesian inference includes automatic penalty for increased model complexity, inference on model order, determining sparsity level, incorporation of more realistic noise assumptions, and a principled way to include prior information and performing cross-validation [16,21] on left out slices or blocks of data. distribution of θ i . Furthermore, let × n denote the n-mode matrix multiplication and ⊗, ⊙, and • respectively denote the Kronecker product, Khatri-Rao product, and element-wise product.

Bayesian tensor decomposition
Bayesian methods for tensor modelling presently includes the CP and Tucker model with factors (and core array) following different distributions, as presented in section 1.1. Additionally, the Tensor Train [61], PARAFAC2 [57], and multi-tensor factorization [59,60] model were all recently developed using Bayesian inference.
Tucker decomposition is one of the core tensor models and is here used for illustrate some of the differences between maximum likelihood (ML) and Bayesian estimation. Consider an N th order data tensor with I n observations in the n th mode, e.g. X I1×I2×...×IN , the Tucker model can be written as where A (n) ∈ R In×Dn , n = 1, 2, . . . , N are the N factor matrices, G ∈ R D1×D2×...DN is the core array which contains the coefficients for all multi-linear interactions of the columns of the factor matrices, and E is the noise. For brevity, let M denote the model or data reconstruction where This provides the framing for stating the ML and Bayesian inference schemes below. Note, that the inference is stated as a least squares and Gaussian likelihood for ML and Bayesian inference, respectively. While other choices are possible this is the most common choice.
For Bayesian inference, the first step is to specify the prior distributions, P(θ), which describe the a priori knowledge about the parameters. The next step is to specify the likelihood function, L(X |θ), which is a probability distribution specifying how the data is generated based on the reconstructed array by the imposed model M and assumptions regarding noise. The posterior distribution, P(θ|X ), is then calculated via Bayes rule as specified in equation (4).
This article and the presented toolbox focuses on variational Bayesian (VB) inference and MCMC Gibbs sampling due to their interconnection, ease of use, and wide application within the existing literature on Bayesian tensor modeling.

Gibbs sampling and VB approximation
Given the prior distribution, P(θ), and the likelihood, L(X |θ), the posterior distribution, P(θ|X ) is given by Bayes rule, The posterior is generally intractable as the marginal likelihood or evidence is intractable, P(X ). Gibbs sampling is useful when it is not possible or too expensive to directly sample from the joint posterior, P(θ|X ), but the conditional distribution P(θ i |θ −i , X ) has a closed form or is easy to sample from. Gibbs sampling then draws a sample from the conditional distribution, where θ i is the set of random variable which are sampled and θ −i are the random variables which are conditioned upon. A full sample θ is obtained by cycling through all conditional distributions. When the sampler has converged, then these samples approximate the true posterior distribution [16]. For variational Bayesian inference, the main idea is to approximate the posterior, l P(θ|X), using a set of variational distributions, Q(θ). The Q distributions provides a lower-bound on the log evidence log P(X ) which is called the evidence lower bound (ELBO). Using Jensen's inequality, the ELBO is, The variational distributions are chosen such that the ELBO is tractable [18,21]. This paper uses a mean-field approximation where the probability distributions are assumed to be independent between subsets of random variables, i.e. Q (θ) = ∏ i Q (θ i ). The optimal variational distribution for each subset θ i is then, where ⟨ · ⟩ \Q(θ i ) is the expected value with respect to all variational distributions except Q(θ i ), see [72]. The optimal parameters of the variational distribution are then identified via moment matching in equation (8).

Probabilistic tensor decomposition toolbox
The probabilistic tensor decomposition toolbox is available at https://github.com/JesperLH/ prob-tensor-toolbox/. Presently, it only considers the Gaussian likelihood which is analogous to the least squares error. Given a data tensor X I1×I2×···×IN , the likelihood is, where M is the mean array and m ≡ vec(M) its vectorization, N A and N is the array and multivariate normal distribution, and ϵ is the residual error following a Kronecker structured covariance matrix [26,27,73].
A simpler heteroscedastic noise model where the mode specific covariance is diagonal, , is implemented for Bayesian CP. This noise structure allows quantifying mode-specific noise, such as noisy samples or unreliable features. For the Tucker and the Tensor Train decomposition only homoscedastic noise is considered, which is the same as specifying Σ (n) ϵ ≡ σ 2 I and Σ (m) ϵ ≡ I ∀ m̸ =n for an arbitrarily chosen n.
The CP, Tucker and Tensor Train (TT) models are all specified by the structure of the mean array M. For the Tucker decomposition the mean array is defined as in equation (2). Similarly, CP is described by equation (2) but constraining the core array to the unit hyper-cube G = I. The unit hyper-cube has value one along the hyper-diagonal and off-diagonal elements are zero. For tensor train decomposition, the mean array is, where × [a,b] is tensor contraction along mode a and b and U (n) ∈ R Dn−1×In×Dn is the latent factor or train cart for mode n = 1, …, N. Details on tensor train and its probabilistic extension are found in [61,74].
Handling mode-specific heteroscedastic noise is feasible both for the Tucker decomposition and the TT decomposition. However, when imposing orthogonality constraints on a latent factor heteroscedastic noise is not permitted as the prior on the factor (von Mises-Fisher Matrix distributed) and likelihood (heteroscedastic Gaussian distributed) are not conjugate, i.e. the posterior distribution of each factor no longer follows a von Mises-Fisher Matrix distribution.

Partially observed data
For partially observed data, the missing elements can either be imputed or marginalized. For the latter, the likelihood is specified only on the observed elements of the tensor, where O is the set of observed elements and i = (i 1 , i 2 , . . . , i N ) is the vector index of a single element.
Marginalization is preferred over imputation as it does not introduce any bias in the estimated posterior distribution and preserves any convergence guarantees of the chosen inference method. Imputing missing values is generally faster and works well in many practical applications. Unfortunately, imputation can fail unexpectedly as the number of missing values increase as the inference is highly influenced by past (poor) imputations. For Bayesian inference, imputation is more principled than its maximum likelihood counterpart [16]. In Bayesian inference, the likelihood is just the probability of data X under some model θ, i.e. L(X |θ) ≡ P(X |θ). In the presence of missing data, this can be rewritten as P(X obs , X miss |θ) and the probability of a missing element x i is then, Determining this distribution is intractable, but it can be approximated using VB [21,28] or sampling [16]. The toolbox currently use a coarse approximation and assumes P(x i , |X obs ) = g(M) i for all missing elements i ̸ ∈ O. For VB, g(M) = ⟨ M ⟩ is the reconstructed element using the current expected first moments. For sampling, g(M) = M which is the reconstruction using the most recent sample of the random variables. Due to this coarse approximation, imputation should always be followed by at least one marginalization update to get valid posterior moments.

Factor matrices
The Bayesian Tensor Train model only supports orthogonal train carts and details are provided in the original reference [61]. Therefore, this section only concerns the specification of factor matrices for Bayesian CP and Tucker decomposition, i.e. A (n) for mode n as defined in equation (2). Presently, the following priors are implemented in the toolbox, where each n-mode factor matrix A (n) has observations i = 1, 2, …, I n and latent components d = 1, 2, …, D n . Presently, both Tucker and CP support von Mises-Fisher Matrix (vMF) distributed factor matrices which represent an orthogonal assumption on the factor matrix. For CP, the multivariate normal distribution is used to model unconstrained factors, while the truncated normal and exponential distribution are used to model non-negativity. Furthermore, the uniform distribution can be applied to achieve an uninformative box constraint.

The core array
For the Tucker decomposition an explicit core array G D1×D2×···×DN is inferred which models multi-linear interactions between the components. Here, the core array is specified using its vectorized version g ≡ vec(G) and follows a normal distribution, where Ψ D1D2···DN×D1D2···DN is the precision matrix and the precision of a single element in G is ψ d1,d2,...,dN . The toolbox considers learning a shared scale Ψ ≡ ψI, an automatic relevance determination (ARD) prior (1) [28], and element-wise sparsity diag(Ψ) ≡ ψ [25]. For these priors off-diagonal elements of Ψ are zero and the prior on diagonal elements is assumed independent and from a Gamma distribution, P(ψ · ) ∼ G(α ψ , β ψ ) with shape α ψ and rate β ψ .
In the context of Bayesian analysis, a Kronecker structured precision matrix (1) was explored by [27] and for count data a core array of CP structured sub-cores [33]. The latter is related to the block term decomposition [75] and can be viewed as a special case for count data using Bayesian inference.

Component and Core Precision Matrix
The probabilistic toolbox allows specifying a prior P(Λ (n) i ) on the precision of each observation i and latent components D n for factor matrix A (n) . Similarly, for each element of the core array g d1,d2,...,dN it is possible to specify its precision P(ψ d1,d2,...,dN ). At present time, only precisions following a Gamma distribution are considered. For the core array, the priors are for learning a shared scale Ψ ≡ ψI, element-wise sparsity Ψ ≡ diag(ψ), and prune irrelevant slices Ψ ≡ diag([ψ (1) , . . . ψ (N) ), respectively. Note α ψ and β ψ is the rate and shape of the Gamma distributions which are specified as broad priors. For the latent factors, a full precision matrix for each observation i and mode n is rarely needed. Instead, it is modelled as These specifications model a shared scale, ARD shared over modes, mode-specific ARD, mode specific sparsity, and a mode specific interaction between the latent components. The latter P(Λ (n) |·) follow a Wishart distribution while the elements in each of the former specifications follow a Gamma distribution. The Gamma distribution is used as it is the conjugate prior for the precision of a normal or truncated normal distribution, as well as for the rate of an exponential distribution. The Wishart prior is the conjugate prior of the precision matrix from a normal distribution and is only applicable to normal factor priors equation (14).
Gamma based ARD priors work well in many applications [25,72,76], but they do not necessarily provide the desired shrinkage [77,78].

Inferring the Posterior Distribution
For a given decomposition model, its parameters or all the random variables of interest are denoted as θ. As an example, consider Bayesian CP with three factors A (1) , A (2) , A (3) , shared component precision λ, and homoscedastic noise τ for which θ = {A (1) , A (2) , A (3) , λ, τ } is the set of random variables.
The probabilistic tensor decomposition toolbox seeks to characterize these random variables via their posterior distribution P(θ|X ). As discussed in section 2.2, exact inference is not feasible and the posterior is approximated using either variational Bayesian (VB) inference or Gibbs sampling, see also [16,18] for details.

Prediction on Held-out Slices or Samples
Bayesian inference facilitates a principled way of predicting on held-out slices or samples from a tensor X . This allows splitting the tensor into a disjoint training and test set, X = [X train , X test ]. This approach facilitate greater independence between training and test set when compared with leaving out elements or fibers of the tensor. Test set performance is assessed using the predictive posterior distribution [16,21] where typically its logarithm is used, Here θ and θ ⋆ are the random variables for the training and test set, respectively. Generally, this integral is intractable and the probabilistic tensor decomposition toolbox considers two approximation approaches inspired by [21]. The first approach relies on the assumption that the estimated variational posterior distribution is a good approximation of the true posterior distribution, i.e. P(θ|X train ) ≈ Q(θ|X train ). This provides the following lower-bound, where Q(θ ⋆ ) is a variational approximation of the test set parameters. The second approach assumes a point estimate of the training parameters θ is a good approximation of the posterior distribution, e.g. P(θ|X train ) ≈ ⟨ θ ⟩ where the mean value of θ is used. This approximation then only requires evaluating the integral over testset parameters θ ⋆ , where P(X test |·) is the likelihood of the test data and P(θ ⋆ ) the prior on the test parameters. This integral is sometimes tractable, but otherwise it is approximated via variational inference or sampling.

Results and discussion
This section contains a few experiments to illustrate the use of the probabilistic tensor decomposition toolbox. These experiments demonstrate the benefits of probabilistic modelling, but also serve to highlight potential limitations. These illustrations are primarily based on two common fluorescence spectroscopy datasets. The Amino Acid dataset [79][80][81] contains five samples where fluorescence was measured as different emission levels (250-450 mm) and excitation levels (250-300 nm). The data is represented in the tensor X 5×201×61 where modes are samples, emission levels, and excitation levels. The samples are mixtures of three pure components and the concentrations are known.
The Sugar Process dataset [82,83] consists of 268 equally spaced samples and for each sample fluorescence was measured at 571 emission levels and 7 excitation levels, the data is represented in the tensor X 268×571×7 . The samples were collected every 8 hours from a sugar plant and the dataset contains chemical shifts and is more noisy than the Amino Acid dataset.
For illustrating tensor completion, section 3.3, two additional datasets are considered. The Wakeman dataset [84,85] which contains electroencephalogram (EEG) data from a visual experiment. Before analysis, the data was preprocessed as described in Chapter 42.3 of the SPM12 Manual (https://www.fil.ion.ucl.ac.uk/spm/) except for the addition of an initial filtering step for removing high frequency noise. This step used a band pass filter (FIR) [1,35]Hz with filter order 1815 and applied it to each channel and in both directions. This is further detailed in the script acq_wakeman_eeg_data.m. In this paper, only data from subject 7 and condition Famous is used. The data is represented as a tensor X 70×181×296 where the modes are channels, time points, and trials, respectively.
The Human Allen Brain dataset [86] consists of gene expression data from different brain tissues in six subjects, represented by X 58692×414×6 where the modes are genes, areas, and subjects, respectively. The data is available from http://human.brain-map.org/. For some brain areas, multiple tissue samples where gathered, but the present analysis ignores this aspect and averages over all samples within an area and subject. Previously, this preprocessing scheme was applied when analysing the data using a non-negative CP model based on either maximum likelihood [62] or Bayeasian inference [45]. Further details are found in the script acq_allen_geneexpr_data.m. The toolbox is made to be extensible and future changes might affect how the developed functions are called. Therefore, specific function calls are not included. Instead, the name of the relevant demonstration file is given, i.e. demo_*.m.

Component extraction: amino acid and sugar process dataset
Tensor decomposition methods seek to factorize data, such that the extracted components represent the underlying structure of the data. Commonly, the components of this underlying structure are of interest as they provide a more compact and possibly interpretable representation of the data. The toolbox, presently, includes Bayesian Canonical Polyadic/PARAFAC (CP), Tucker, and Tensor Train (TT) decomposition. Calling these methods are demonstrated in the script demo_component_extraction.m which also visualizes the results obtained by each method.
This section only illustrates the CP decomposition, as neither Tucker nor TT decomposition have simple intuitive visualizations. The CP decomposition is applied to the Amino Acid and Sugar Process datasets, as described in demo_fluorescence.m which also generates figures 1(a)-(d).
Each dataset is standardized to have unit variance and a Bayesian CP with normal factors and shared column-wise precision (ARD) prior is fit. The random variables are θ = {A (1) , A (2) , A (3) , λ, τ } and are inferred using variational inference.
To investigate if irrelevant components are pruned appropriately, the model is initialized with either 10 or 100 components, i.e. D init = 10 or D init = 100. The maximum number of iterations was 100 and 500 for the Amino Acid and Sugar Process datasets, respectively. Results are shown in figure 1.
For the Amino Acid dataset, the model correctly estimates the latent subspace and correctly identifies the components and concentrations of each sample, see figures 1(a) and (b). Note, that the ARD for D init = 10 For the Sugar Process dataset, seven components are estimated for D init = 10. However, this is not the true chemical spectras, as samples contain chemical shifts that violate the tri-linear assumption of the CP model. Therefore, additional components are necessary to adequately represent the data. The effect of this is most visible in the emission mode, as seen in both figures 1(c) and (d). Figure 1(d), the initial number of components are increased D init = 100 and while most are pruned, it does not estimate the same number of components. This illustrates, that if the initial number of components are greatly over-specified, then the correct subspace might not be identified and it can be beneficial to refit the model with a lower number of initial components.
Generally, determining the true number of components for any dataset remains difficult, but the Bayesian-CP model has proven to provide more robust estimation when the number of components are over-specified [39,45]. This added robustness of using a Bayesian formulation has also been observed for other tensor decomposition methods, for instance [27,28,57,[59][60][61].

Heteroscedastic noise estimation
Modelling mode-specific heteroscedastic noise allows quantification of data quality of the samples, as well as investigation of which part of the spectra the model has an uncertain representation of. This example shows an analysis of the Amino Acid data using a Bayesian CP with truncated normal factors, column-wise precision (ARD) prior on the sample mode, and heteroscedastic noise on all modes. The set of random variables is now θ = {A (1) , A (2) , A (3) , λ, τ (1) , τ (2) , τ (3) }. The model is initialized with 10 components and results are shown in figure 2. The analysis is included in script demo_tb_heteroscedasticCP.m.
The estimated noise, presently, suffers from scale indeterminacy as τ (1) ⊙ τ (2) = (τ (1) α −1 ) ⊙ (τ (2) α) for an arbitrary α ∈ R ≥0 . Therefore, the estimates are interpretable within each mode but not in absolute magnitude across modes. For comparison across modes, the estimate has to be invariant to scale changes, c.f. [27]. The heteroscedastic noise estimation is useful for down-weighting specific samples, emission levels, or excitation levels. However, it does not enforce a fixed threshold where an observation i n for some mode n should be left out of the estimation.

Data denoising
Tensor decomposition methods can be used for data denoising where the goal is to remove unstructured and/or structured noise from the data. This is illustrated on the Amino Acid dataset using either a CP, Tucker or Tensor Train decomposition. This reproduces an experiment from the Bayesian TT article [61], but presently also including the Bayesian Tucker decomposition with orthogonal factors.
Similar to [61], the number of components are set to D = 3 in the CP model and D = (1, 6, 5, 1) in the TT model. For the Tucker decomposition, the core size is determined by exhaustive evaluation using maximally D 1 = 5, D 2 = D 3 = 10 resulting in 10 2 · 5 models for which the model achieving the highest evidence lowerbound had core size D = [5,9,9]. Each of the decomposition is fit until convergence or 100 iterations. The toolbox includes the script demo_denoising.m which reproduces this analysis.
The original data, reconstructed data, and the residual error for each sample using each decomposition are shown in figure 3. The differences between the decomposition methods are most apparent for sample 3 and 5 where Rayleigh scattering (structured noise) is also most pronounced. The CP model is the closest approximation to the ground truth model [79,81] and provides the best denoised reconstruction. Structured noise is retained in both the Tucker and TT decomposition which is attributed to these models being more flexible than the CP model.

Tensor completion
The last example is missing value prediction or data completion for which a variety of tensor methods has been proposed, for a review see [87]. Presently, the developed toolbox only allows missing values for CP decomposition which limits this demonstration to the Bayesian CP model. For the Bayesian Tucker model, marginalizing over missing values was presented in [28].
The CP model is applied to the Sugar Process data, subject 7 from the Wakeman EEG data, and the Allen Human Brain data. Missing values are generated by holding out a subset of the observed data, the remaining data is used for training. Model performance is assessed on the held out data, i.e. test data, using the root mean square error (RMSE). Two missing value scenarios are displayed, first elements missing at random and then entire fibers missing. The missing fibers are emission spectras (mode-2 fibers), trials (mode-1 fibers), and samples (mode-1 fibers) for the Sugar Process, Wakeman, and Allen Brain data, respectively. In contrast to the Sugar Process and Wakeman data, the Allen Brain data is not fully observed and has 43.44% missing values due to missing samples (mode-1 fibers).
For each dataset, the true number of components is unknown and a CP with normal factors and ARD is fit using D = 10 initial components. For both VB and Gibbs samplings 250 iterations are run. To reconstruct the data, VB uses the expected value in the final iteration whereas for Gibbs sampling the average reconstruction error of the last 125 samples is used. To mitigate the effect of local minima, each model is fit 10 times and the median error is displayed. The analysis is reproduced by the script demo_tensor_completion.m and results are given in figure 4.
When elements are missing at random, figure 4(a)-(c), the results are similar and performance is mostly unchanged from the initial 10% missing values until around 80% missing values. The test set performance then degrades for both VB and Gibbs sampling.
When entire fibers are missing at random, figure 4(d)-(f), the performance visibly differ between the datasets. Figure 4(d) and (f) show performance degrades earlier when entire fibers are missing, both for VB and Gibbs sampling. The transition for the Allen data is less abrupt than in the Sugar Process data. This is attributed to missing scheme matching the actual reason for missing data in the Allen data. In contrast, for the Sugar Process data, leaving emission fibers out rapidly removes all essential information and does not reflect a valid missing data mechanism.
For the EEG data, the missing fibers (trials) have little effect on the predictive performance. The reason for this is that the time-lock signal is never fully removed due to event related EEG data being highly correlated across channels and trials.
Consider leaving out all timepoints for channel i and trial j , i.e. a mode-2 fiber X i,:,j , then the left out channel is highly correlated with all other channels during trial j due to spatial correlation of the signal. This demonstration illustrates how the amount of missing data and its structure affects tensor completion. Furthermore, it highlights the importance selecting a test set which reflect the actual missing data mechanism. This was illustrated on the Allen data, were leaving out individual elements leads to over-confidence in the results.

Conclusion
In this paper, we introduced the probabilistic tensor decomposition toolbox which allows for Bayesian inference in the Canonical Polyadic/PARAFAC (CP), Tucker, and Tensor Train decomposition models. The toolbox offers flexibility in the choice of priors/constraints on factors, both homo-and heteroscadastic noise modelling and efficient inference via the variational approximation. The results indicate that Bayesian inference successfully prunes factors facing over-specified models. However, in practice violations of the model structure still leads to ambiguity with respect to the number of factors to be included. The results show that heteroscedastic noise modelling is useful for quantification of noise and provides additional useful information for model interpretation, which can help researchers identify noisy samples and indicate potential issues with the data acquisition process. In conclusion, the proposed Bayesian toolbox for probabilistic tensor decomposition naturally extends existing methods based on maximum likelihood estimation, and offers more comprehensive modelling accounting for uncertainty readily applicable in the many disparate fields in which tensor decomposition is already widely applied.

Data availability
Data sharing is not applicable to this article as no new data were created or analysed in this study. The analyzed data are available online.
The scripts for reproducing the analysis are provided as part of the Probabilistic Tensor Toolbox, see https://github.com/JesperLH/prob-tensor-toolbox/.

VB and Gibbs based inference
This appendix describes how to infer each of the random variables using either Gibbs sampling [15,16] or variational Bayesian inference [18,21]. The former relies on determining the conditional distributions and the latter on finding a set of distribution that approximate the true posterior distribution. These two inference methods results in similar updates schemes. To avoid duplicating the appendix, the notation is abused and ⟨ · ⟩ denote either a sample form the relevant conditional distribution (when using Gibbs sampling) or the expected value under the Q-distribution (when using VB).
Updating of the CP factor matrices under homoscedastic noise, heteroscedastic noise, and in the presence of missing values is presented in section 6.1. Section 6.2 concerns inferring the precision prior on factor matrices under the different prior choices. Section 6.3 concerns inferring Tucker factor matrices while the core array and precision prior are presented in section 6.4. Finally, section 6.5 gives updates for inferring homoscedastic or mode-specific heteroscedastic noise precision.
Inference of the Probabilistic Tensor Train model are stated in [61]. These are not restated as this paper does not expand on the model.
Most of the update rules presented in this appendix are not novel, but restating them in a common notation provides an overview and more easy comparison of their differences and similarities.
To monitor the progress of the inference procedure, variational inferences uses the evidence lowerbound (ELBO) which increases monotonically each iteration. Gibbs sampling monitors the log-joint distribution which should generally increase, but not monotonically. The expressions are calculated as, where ⟨ · ⟩ Q is expectation under the Q-distribution for variational inference. For sampling θ t is the sample of random variables at iteration t.

Notation
In addition to the previous notation, the sequential product of N matrices is defined as, which is the sequential Kronecker, Khatri-Rao, and element-wise product, respectively. For the Kronecker and Khatri-Rao products, the order of multiplication matters and here it is assumed to be numerically descending. Additionally, ⊗ n̸ =m and ⊙ n̸ =m will be used for the sequential product of the n = N, N − 1, . . . , m + 1, m − 1, . . . , 1 matrices.

Updating factor priors 6.1.1. Prior follows a normal distribution
Let the n-mode factor matrix, A (n) , follow a normal distribution , then the Q-distribution follows where m (n) i and Σ (n) i are the estimated mean and covariance for observation i in mode n. Generally, the covariance is not observation specific, but this depends on the prior on precision matrix, missing values, and noise modelling.
Under a homoscedastic noise assumption, the factor is updated as If heteroscedastic noise is modelled on all modes, then the factor is updated as Marginalizing over missing values leads to where i = (i 1 , . . . , i n−1 , i n , i n+1 , . . . , i N ) and indexes an observed element in O.
Here O is both the set of observed elements and a binary indicator. The update rules are similar to the fully observed case, but here only the expectation when reconstructing present elements is included.

Prior follows a truncated normal, exponential, or uniform distribution
Let the n-mode factor matrix, with lower and upper-bounds α and β. Under a normal likelihood, then each of these prior distributions has the following Q-distribution, Under a homoscedastic noise assumption, the factor is updated via, where σ 2 C and µ C depend on the prior distribution, P(a (n) ij ), which slightly changes the resulting update. These values are, for the truncated normal distribution, for the exponential distribution, and (µ C , σ 2 C ) = (0, 0) for the uniform distribution.
Similarly, under a heteroscedastic noise assumption, the factor is updated via, Marginalizing over missing values is the same as only accounting for the contribution of present elements and leads to, where i = (i 1 , . . . , i n−1 , i n , i n+1 , . . . , i N ) and provides the index of each present element in O. The expression for log-prior and entropy contributions are easily derived following standard textbook definitions. However, evaluating tail probabilities in the truncated normal distribution can be numerically unstable and the toolbox implements the approach described in [88].

Prior follows a von Mises-Fisher matrix distribution
Let the n-mode factor matrix, A (n) , follow a von Mises-Fisher distribution P( , then the Q-distribution follows, Under a homoscedastic noise assumption, the concentration matrix is, If any factor follow a vMF-distribution, then marginalization of missing values is not supported. Instead, it is possible to impute missing values and perform updates as in the fully observed case. The usual caveats for imputation of missing values then apply. Heteroscedastic noise is not supported on any mode following the vMF-distribution. It is, however, still possible to model heteroscedastic noise on non-orthogonal modes. For heteroscedastic noise and non-orthogonal factors on all but mode n, the factor is updated as, For variational inference, the expected value of the von Mises-Fisher matrix distribution is determined as in [89]. Determining the log prior and entropy requires evaluating the hyper-geometric function 0 F 1 with a matrix argument for which efficient approaches are presented in [89,90]. Presently, sampling is not included, but can be implemented following the work in [91].

Updating factor precision or rate prior
For factor matrices following a normal or truncated normal distribution it is possible to infer the precision matrix Λ (n) i or a restricted form of it as described in section 2.3.4. Similarly, for factors following an exponential distribution it is possible to infer the rate parameter. For the truncated normal and exponential distribution, it is possible to infer either the scale, component relevance (ARD), or element-wise sparsity. Similarly, the scale and component relevance can be inferred for the multivariate normal distribution, as well as the full precision matrix following a Wishart distribution. However, the toolbox presently does not allow multivariate factors with element-wise sparsity.
Note, uniform factors has no inferable parameters and for the von Mises-Fisher matrix distribution inferring the concentration matrix F 0 is not supported.
The toolbox allow sharing precision/rate priors across several modes, where the shared set is defined as n ∈ Ω shared . For each distribution, a constant η n affects how much the n th factor contribute to the prior update. For the normal and truncated normal distribution η n = 0.5 and for the exponential distribution η n = 1. Dependent on the choice of prior P(·), the updates are as follows.
For learning the scale of n ∈ Ω shared , let Λ (n) i ≡ λI D and the distribution and updates are, For automatic relevance determination of the components, let Λ (n) i ≡ diag(λ) and the distributions and their updates are, For element-wise sparsity, a precision prior is placed on each observation, i.e. Λ (n) i ≡ diag(λ i ). This gives the following update, For inferring the full precision matrix under multivariate normal factors, let Λ (n) i ≡ Λ then, Note, the updates presented in this section are not affected by whether the noise is homoscedastic or heteroscedastic.

Updating tucker factor matrix
Specifying the prior on the factor matrices is done identically for the CP and Tucker model. However, updating factors in the Tucker model requires accounting for the core array which is not done in the updates presented in section 6.1. Presently, the toolbox only implements factors following the von Mises-Fisher matrix distribution, i.e. orthogonal factors, where the prior distribution is P(A (n) ) = vMF ( . The Q-distribution then is then, and the estimated concentration matrix is The toolbox assumes F 0 = 0 which is the uniform prior on the D n -sphere.

Updating the core prior and its precision prior
Let the core array G be represented by its vectorization g D1D2···DN×1 = vec(G) whose prior follows a normal distribution P(g|µ G , Ψ) = N ( g|µ G , Ψ −1 ) . Then, the Q-distribution has the following form, Under homoscedastic noise, the core is then updated as, Presently, the toolbox only considers µ G = 0 and A (n) ∼ vMF(F 0 ). The latter makes updating equation (56) exceedingly simple as A (n) ⊤ A (n) = I result in Σ g being a diagonal matrix. As presented in section 2.3.4, how P(Ψ) is specified determines the functionality of the core. The possibilities are inferring the scale, determining element-wise sparsity, or slice-wise pruning.
For inferring the scale, then Ψ ≡ ψI D and the distributions and updates are, For inferring element-wise sparisty, the distributions and updates are, For slice-wise automatic relevance determination Ψ ≡ ⊗ N n=1 diag ( ψ (n) ) and the distribution and updates for each mode n are, , An alternating update scheme is then used, such each mode n = 1, 2, …, N is updated conditioned on all other random variables.

Updating noise precision prior
The toolbox assumes a normal likelihood with either homoscedastic or heteroscedastic noise variance, the latter is only support CP decomposition. For homoscedastic noise, the noise precision τ follows a Gamma distribution, i.e. P(τ ) = G (α τ , β τ ) with shape α τ and rate β τ . The resulting Q-distribution is, For fully observed data, the shape parameter isα τ = α τ + 1 2 ∏ N n=1 I n . For CP decomposition the rate is Tucker decomposition has a similar update, but differ as it has to account for the core array G which changes the estimated rate,

vec(G)
) For the CP model with partially observed data, the estimated shape changes toα τ = α τ + 1 2 |O| while rate changes to,β where O is the set of present values, |O| the number of present values, and (i 1 , i 2 , . . . i N ) indexes a single element in O.

CP with Heteroscedastic Noise
Estimating mode-specific heteroscedastic noise assumes a special Kronecker structured covariance, see section 2.3, equation (9). Modelling heteroscedastic noise on mode n is done by defining . The Q-distribution is then, The noise precision is then estimated for each heteroscedastic mode using the following equations in an alternating update scheme.α whereβ τ (n) is a vector R In×1 >0 of rates.