Supervised multiway factorization

We describe a probabilistic PARAFAC/CANDECOMP (CP) factorization for multiway (i.e., tensor) data that incorporates auxiliary covariates, SupCP. SupCP generalizes the supervised singular value decomposition (SupSVD) for vector-valued observations, to allow for observations that have the form of a matrix or higher-order array. Such data are increasingly encountered in biomedical research and other fields. We describe a likelihood-based latent variable representation of the CP factorization, in which the latent variables are informed by additional covariates. We give conditions for identifiability, and develop an EM algorithm for simultaneous estimation of all model parameters. SupCP can be used for dimension reduction, capturing latent structures that are more accurate and interpretable due to covariate supervision. Moreover, SupCP specifies a full probability distribution for a multiway data observation with given covariate values, which can be used for predictive modeling. We conduct comprehensive simulations to evaluate the SupCP algorithm. We apply it to a facial image database with facial descriptors (e.g., smiling / not smiling) as covariates, and to a study of amino acid fluorescence. Software is available at https://github.com/lockEF/SupCP .


Introduction
Data are often best represented as a multiway array, also called a tensor, which extends the familiar two-way data matrix (e.g., Samples × Variables).Multiway data are increasingly encountered in many application areas.For example, in biomedical applications molecular profiling data may be measured over multiple body regions, tissue types, or time points for the same sample set (e.g., Samples × Molecules × Time × Tissue).A multiway representation is also useful for imaging modalities.In this article we describe an application to aligned color images of the faces of different individuals, from the Labeled Faces in the Wild database [17].In particular we consider images for 4000 individuals, each with pixel dimensions 90 × 90, where each pixel contains three color intensities (red, green, blue); thus the data is given by a 4-way array of dimensions 4000 × 90 × 90 × 3 (Individual × X × Y × Color).
Unsupervised factorization methods for multiway data, which extend wellknown methods such as the singular value decomposition (SVD) and principal component analysis (PCA) for a data matrix, are commonly used in many fields.See [15] for an accessible and detailed survey of multiway factorization methods and their uses.A classical and straightforward method is the PARAFAC/CANDECOMP (CP ) [11] decomposition, in which the data are approximated as a linear combination of rank-1 tensors.Alternative approaches include the Tucker decomposition [30], and the population value decomposition (PVD) for 3-way data [8].All of these approaches reconstruct the data using a small number of underlying patterns in each modality, and they are therefore useful for dimension reduction, de-noising, and exploratory analysis.
In many scenarios, additional variables of interest are available for multiway data objects.For example, [16] provide several attributes for the images in the Faces in the Wild database, which describe the individual (e.g., gender and race) or their expression (e.g., smiling / not smiling).There is a growing literature on analyzing the relationship between multiway data and additional variables via predictive models in which the multiway data are predictors and the additional variables are treated as outcomes.[34] propose a tensor regression model for a clinical outcome, in which the coefficients for multiway data are assumed to have the structure of a low-rank CP decomposition.[24] describe a Bayesian formulation for regression models with multiple outcome variables and multiway predictors, which is applied to a neuroimaging study.Multiple methods have been developed for the classification of multiway data [5,29,22], extending well-known linear classification techniques under the assumption that model coefficients have a factorized structure.
As in matrix factorization techniques such as PCA, the goal of multiway factorization is to capture underlying patterns that explain variation in the data.It is often reasonable to assume that these patterns are at least partially driven by additional measured variables of interest (e.g., gender or expression for facial images).Therefore, supervising on these variables can improve both the accuracy and the interpretation of the results.In this article we introduce a probabilistic CP factorization for multiway data that incorporates additional variables, called SupCP.To our knowledge there is no previous work wherein additional variables inform the factorization or dimension reduction of multiway data.Moreover, SupCP specifies a full probability distribution for a multiway data observation with given covariate values.It can therefore be used for modeling of a multiway outcome given vector-valued predictors (rather than predicting an outcome vector from multiway predictors), for which there is little existing methodology.
There is an existing literature on supervised factorization methods for twoway data (matrices).For example, [4] describe an approach in which PCA is performed on a subset of variables that are selected by the strength of their association with an outcome, and the resulting principal components are used in a predictive model for the outcome.However, in our framework the additional variables are not outcomes to be predicted from multiway data; rather, they are treated as auxiliary covariates to inform a CP model for multiway data.In this respect our framework is most related to the supervised SVD (SupSVD) approach [19] for matrices, which has also been generalized to accommodate sparse or functional PCA models [18] and non-parametric relations between the covariates and principal components [9].
The rest of this manuscript is organized as follows.In Section 2 we introduce some algebraic notation and terminology for multiway factorization.In Section 3.1 we introduce the SupCP model, in Section 3.2 we give its full likelihood, and in Section 3.3 we describe its identifiability properties.In Section 4 we describe a modified EM-algorithm for simultaneous estimation of all parameters in the model.In Section 5 we describe the connections between SupCP and other methods in more detail.In Section 6 we describe a comprehensive simulation study to evaluate the SupCP method, in Section 7 we apply it to a study of amino acid fluorescence, and in Section 8 we describe an application to facial image data from the Faces in the Wild database.

Notation and Preliminaries
Here we give some algebraic notation and terminology for multiway data that will suffice to describe the development in the following sections.For a general and more comprehensive introduction to multiway data and its factorization we recommend [15].
Define a K-way array (i.e., a Kth-order tensor) by X : d 1 × . . .d K , where d k is the dimension of the kth mode.The entries of the array are given by square brackets, X[i 1 , . . ., i K ], where i k ∈ {1, . . ., d k } for k ∈ 1, . . ., K. In this article we use script capital letters to denote arrays of order greater than 2, bold capital letters to denote matrices (X : , and bold lowercase letters to denote vectors (v : d × 1).
For vectors v 1 , . . ., v K of length d 1 , . . ., d K , respectively, define the outer product An array given by the outer product of vectors has rank 1.For matrices V 1 , . . ., V K of the same column dimension R, we introduce the notation where v kr is the rth column of V k .This defines a rank R CP factorization.Often, the columns of each V k are scaled to have unit norm ( v kr F = 1 for r = 1, • • • , R, where • F represents the Frobenius norm) and scalar weights λ r are given for each rank-1 component: The rank R CP approximation to an array of higher-rank X is typically estimated by minimizing the residual sum of squares imsart-ejs ver.2011/01/24 file: SupervisedTensorEJS_Final_arxiv.tex date: April 3, 2018 From (1) it is easy to see how the CP factorization extends the SVD for arrays of order greater than 2.However, there are some important differences.In particular, for K > 2 it is not common to enforce orthogonality among the columns of V k because this restricts the solution space and is not necessary for identifiability.Also, while the dimensions of the row space and column space are equal for K = 2, there is no analogous symmetry for K > 2, and therefore the rank of X may be greater than the dimension of each mode (i.e., d 1 , . . ., d K ).Finally, it is often useful to represent a multiway array in matrix form, wherein it is unfolded along a given mode.For this purpose we let X (k) : i =k d i be the array X arranged so that the jth row contains all values for jth subarray in the kth mode; specifically, for k = 1 and we define X (k) similarly for k = 2, . . ., K.

Latent Variable Model
Suppose we observe a K-way data array (e.g., an image with modes x × y × color) for each of n observational units, hereafter called samples.The full data set can then be represented as a (K + 1)-way array with samples as the first mode, X : In addition, we observe a q-dimensional vector of covariates for each sample (e.g., describable features for facial images), given by the matrix Y : n × q.Without loss of generality, we assume both X and Y are centered across samples.
The SupCP model extends the CP factorization for X (1) to a generative likelihood model in which the covariates in Y inform latent patterns for the sample mode.Specifically, the rank R SupCP model is where • F : n × R has independent and identically distributed (iid) multivariate normal rows with mean 0 and R × R covariance Σ f .
The loadings [[V 1 , . . ., V K ]] form a basis for the sampled d 1 × • • • × d K arrays, that can be decomposed into patterns for each mode.The scores U reconstruct each sample from this basis, and give an efficient lower-dimensional embedding of X if the data have multilinear structure.Thus, to describe the relation of covariates Y on structured variation in X it suffices to consider the relation between Y and U, and this gives an efficient generative model for X given Y.Moreover, the patterns in each mode (V 1 , . . ., V K ) are often interpretable, and supervision on Y can facilitate this interpretation by capturing how they relate to the covariates Y (see, e.g., Section 7).
The sample scores U are decomposed into what is explained by the covariates (YB) and not explained by the covariates (F).Therefore, the SupCP model decomposes the variation in X into variation explained by the covariates, structured variation not explained by the covariates, and independent residual noise; this is easily seen with the following reformulation of (2): The model allows for components that have little relation to Y, are partially explained by Y, or that are almost entirely explained by Y, depending on the relative size of B and Σ f .If X has structured low-rank variation that is orthogonal to Y, then for some components r the sample variance of the rth column of YB will be very small relative to Σ f [r, r].Also, if Σ f F goes to 0, it implies that all structured variation in X is dependent on the variables in Y.In Section 4 we describe a maximum likelihood estimation approach for all parameters simultaneously, including Σ f and σ 2 e ; therefore, the appropriate level of covariate supervision is determined empirically.
In practice, we constrain the columns of each V k to have a unit norm.The scaling of the factorization components, given by the λ r 's in (1), are therefore subsumed into U. Remarkably, this is enough to make the model parameters identifiable up to trivial sign changes and ordering of components, as discussed in Section 3.3.

Likelihood
The full marginal distribution for X under the SupCP model, specified in (2), is a multilinear normal distribution [25].Here we derive its equivalent likelihood by considering the more tractable matricized version of X, X (1) .The ith row of X (1) gives all values for the ith sample.We define d as the overall dimension for each sample, where the rth column is the vectorization of the K-way array given by v 1r • • • • • v Kr .This gives a matrix representation of the SupCP model, It follows that the distribution of X (1) , marginalizing over F, is multivariate normal with log-likelihood where µ X = YBV T mat , and I d is a d×d identity matrix.Because the likelihoods of X and X (1) are equivalent, it suffices to consider (4).However, this likelihood function is very highdimensional and not convex with respect to all parameters.Thus it is difficult to optimize directly over all parameters.In Section 4 we describe an Expectation-Maximization (EM) algorithm to maximize this likelihood over {V 1 , . . ., V K , Σ f , σ 2 e , B}.
The covariance matrix for the d − dimensional matricized data for each sample is given by Σ X .This can be described as a spiked covariance model with rank−R structure and independent noise.The resulting covariance is not in general separable [13], meaning that it can not be factorized into the Kronecker product of separate covariances for each mode.Indeed, our model assumes that structured variation in X is driven by latent factors that affect multiple modes, rather than an independent factor structure for each mode [10].

Identifiability
Here we describe conditions under which the parameters in the SupCP model are identifiable.That is, for two sets of parameters e , B} and Θ = { Ṽ1 , . . ., ṼK , Σf , σ2 e , B}, we give conditions for which equivalency of the likelihoods, L(X | Θ) = L(X | Θ), implies equivalency of the parameters, Θ = Θ.Clearly, the model is not identifiable under arbitrary scaling of B and the We address this by scaling the columns in V k to have unit norm, and restricting the first nonzero entry (typically the first entry) of each column to be positive: The model is also clearly not identifiable under permutation of the R rank-1 components, i.e., applying the same re-ordering to the columns in V 1 , . . ., V K , and B. This is easily addressed by rank-ordering the components under any given criteria; by default we order the components according to the diagonal entries of Σ f , which will be distinct for non-trivial cases: Alternatively the components can be ordered by their variance explained by Y, (YB) T (YB)[r, r], or by their overall variance, (YB) After scaling and ordering, the model is identifiable under general conditions for K ≥ 2. It is remarkable that no other restrictions are required.In particular, the columns of V k need not be orthogonal and Σ f need not be diagonal for identifiability.
There is a vast body of literature on the uniqueness of components in a CP factorization up to scaling and ordering (see Section 3.2 of [15] for a review), which can be used to derive conditions for identifiability of the model given (a) and (b) above.Here we use a result of [27], which requires the notion of k-rank.The k-rank of a matrix A, kr(A), is defined as the maximum number k such that any k columns of A are linearly independent.Note that in general kr(A) ≤ rank(A) and if A is of full column rank, then kr(A) = rank(A).Sufficient conditions for identifiability of the SupCP model are given in Theorem 1.
Theorem 1 For the model defined in (2), if and if Y is of full column rank, then the parameters {V 1 , . . ., V K , Σ f , σ 2 e , B} are identifiable under the restrictions (a) and (b).
The proof of Theorem 1 is given in Appendix A. Generally we expect the solution to be matrices of full rank satisfying kr(YB) = min(n, q, R) and kr(V k ) = min(d k , R), and therefore the model will be identifiable if R is sufficiently small relative to the dimensions of X.Nevertheless, the conditions of Theorem 1 are easily verifiable.Moreover, these conditions are sufficient but not necessary, and so failure to satisfy (5) does not imply that the model is not identifiable.
In practice, although not necessary for identifiability, we also constrain the covariance Σ f to be diagonal: This simplifies the model complexity considerably, and tends to improve performance of the estimation algorithm.We impose this constraint by default, but allow the option of an unconstrained covariance in our implementation.

Model Estimation
Here we describe estimation of all parameters for the model specified in Section 3.1.In Section 4.1 we describe a modified EM-algorithm for optimizing the likelihood (4) for fixed rank R. In Section 4.2 we discuss selection of the rank.

EM algorithm
For our implementation of the modified EM algorithm, the latent random variables U are updated via their conditional moments at each iteration.For this purpose, it is useful to consider the joint log-likelihood of X and U: where and Moreover, the conditional log-likelihood ( 7) can be expressed in terms of the matrices X (1) and V mat used in Section 3.2: and so the joint likelihood (7) gives the sum of two multivariate normal likelihoods.This facilities a modified EM algorithm, which is described in detail in steps 4.1.1through 4.1.3.

Initialization
One could initialize U and V 1 , . . ., V K via a least-squares CP factorization (1), wherein the columns of each V k are scaled to have unit norms and the corresponding weights are absorbed into U.However, this is estimated via an alternating least squares algorithm that can be computationally intensive, and is not guaranteed to find the global minimizer.
Alternatively we consider a random initialization in which the entries of each V k are generated independently from a normal distribution with mean 0, and then the columns of each V k are scaled to have unit norm, for k = 1, . . ., K. The latent factor U is initialized by In practice we find that random initialization performs as well as initialization via a CP factorization (see Appendix D), and so in our implementation we initialize randomly by default, with the option to initialize via CP factorization.
After initializing U and V 1 , . . ., V K , we initialize B by regressing U on Y: We initialize σ 2 e by the sample variance of the entries in the residual array and initialize Σ f by the diagonal entries of the covariance of the residuals for the random factors F = U − YB: where diag(A) sets the off-diagonal values of a square matrix A to zero.

E-Step
We estimate U via its conditional expectation, given the data {X (1) , Y} and fixed parameters: Similarly, the conditional variance of U is This matrix inverse can be efficiently evaluated via the Woodbury matrix identity [26].Thus, the computation of the E-step is very efficient.

M-Step
The parameters V 1 , . . ., V K , B, Σ f , σ 2 e are updated by maximizing the conditional expectation of the log-likelihood.Derivations for the following conditional maximum likelihood estimates, using the expected moments of U from 4.1.2,are given in Appendix B. The where the rth column of W (1) and • is the Hadamard (entry-wise) product.The remaining loading matrices V 2 , . . ., V K are updated similarly.
We then update the parameters B, Σ f , and σ 2 e by setting their respective partial derivative of the expected log-likelihood equal to 0. As a result, we obtain: As a final step, we adjust the sign and scale of each column of V k and rescale B and Σ f accordingly.This step is only for identifiability purpose (see Section 3.3) and does not change the likelihood.Specifically, if Ṽ1 , . . ., ṼK , Ṽmat , B and Σf correspond to their respective unscaled parameters, the scaled versions are given by v kr = ṽkr / ṽkr F for k = 1, . . ., K and r = 1, . . ., R, Optionally, we also restrict Σ f to be diagonal by setting Σ f [r, s] = 0 for r = s.
We iterate between the E-step and the M-step of the algorithm until the marginal likelihood (4) converges.The M-step (Section 4.1.3)maximizes the expected log-likelihood coordinate-wise over each parameter, but may not give a global optimum over the full parameter space {V 1 , . . ., V K , Σ f , σ 2 e , B}.Nevertheless, the marginal likelihood is guaranteed to increase over each iteration and therefore converge.In practice the likelihood may converge to a local, rather than a global, optimum, depending on the initial values.We find that a form of annealing, in which random variation is incorporated into the first several iterations, helps to avoid dependence on the initialization.See Appendix D for more detail.
After estimation,

Rank selection
We recommend selecting the rank R of the model by likelihood cross-validation, in which model parameters are estimated via a training set and evaluated using the likelihood of a test set.Such an approach is desired because it directly assesses the fit of the specified probabilistic model (via the likelihood) in a way that is robust to overfitting.Also for exploratory purposes it provides validation that the estimated components describe real underlying structure.
First, we select n train samples to fit the model, yielding the reduced data array X train : n train × d 1 × • • • × d K and covariate matrix Y train : n train × q.The remaining samples are considered the test set, yielding X test : For candidate values of R, we optimize the likelihood of the training data X train and Y train using the algorithm in Section 4.1 to obtain estimates The resulting estimates are assessed via the log-likelihood of the test set which can be evaluated as in (4).The rank is chosen to be the value of R that gives the lowest log-likelihood of the test set.The above approach works well in simulation studies, as shown in Appendix C.

Special Cases
The SupCP model specified in Section 3.1 reduces to traditional factor analysis without supervision (B = 0) and when K = 1, i.e., when X is a matrix.More generally, when K = 1 SupCP reduces to the SupSVD model [19].Thus, an alternative strategy of decomposing X is to apply SupSVD to the matricized data X (1) .However, this strategy is inefficient and therefore less accurate when the data have multi-linear structure, as shown in Sections 6 and 8; moreover, the patterns within each mode (V k ) that are provided by a multiway factorization framework are often useful to interpret.Lastly, the identifiability results in Section 3.3 are not applicable to matrices; the SupSVD model requires additional restrictions of orthogonality and diagonal covariance for identifiability.Without the presence of covariates Y, SupCP reduces to a probabilistic CP factorization in which factors for the first mode (samples) are considered random.The EM algorithm in Section 4 fits such a model when Y = 0. To our knowledge this particular unsupervised model and estimation approach, which provides a generative model for the sampled mode, is novel.Related maximumlikelihood based CP factorizations have been proposed by [23] and [32], and there is a large body of work on Bayesian models for the CP and other multiway factorizations (see, e.g., [13] and [35]).
If the entries of the residual factor covariance Σ f are large relative to the noise variance σ 2 e , then U, V 1 , . . ., V K reduce to a standard unsupervised leastsquares CP factorization.Specifically, if Σ f is diagonal and it is easy to see that the joint likelihood ( 6) is maximized by the solution that minimizes the squared residuals for the entries of X; the supervision component YB is given by a regression of Y and U, but will not influence the factorization of X.
If the entries of the residual factor covariance Σ f are small, then the residuals F also tend to zero and the underlying structure of X is given entirely by Y: This can be considered a multi-linear regression model for a multiway response (X) from vector valued predictors Y.This has received considerably less attention than the reverse scenario, prediction of a vector outcome from multiway predictors.However, recently [20] proposed a tensor response regression, wherein the tensor response is assumed to have a Tucker factorization with weights determined by vector-valued predictors.[14], describes a general framework for tensor-on-tensor regression, in which the predictor tensor and the outcome tensor have the same number of modes.

Model complexity
An important advantage for using a multiway factorization for multiway data, rather than vectorizing the data and using a matrix factorization, is the decrease in model complexity.Any SupCP model is also included in the support of the SupSVD model for the matricized data, as seen by the matrix form of the model (3).However, model complexity scales with the sum of dimensions in each mode under SupCP and with the product of dimensions in each mode under SupSVD.The total number of free parameters in the SupCP model with diagonal covariance, accounting for the identifiability conditions of Theorem 1, is The number of free parameters for the SupSVD model, accounting for orthogonality conditions required for identifiability, is Thus, model complexity for the same rank can decrease by several orders of magnitude under SupCP, especially for data with several high-dimensional modes.Moreover, incorporating supervision into the likelihood model increases the model complexity by a factor of only Rq, which is negligible if the size of X is much larger than the size of Y.

Simulation
Here we describe a comprehensive simulation study which compares three different methods: the proposed SupCP method, the least-squares CP factorization, and the SupSVD method (on matricized data along the first mode).We simulate data from different settings, and evaluate the performance of different methods in terms of parameter estimation and low-rank signal recovery accuracy.To avoid complication, we set the ranks for different methods to be the true ranks.The simulation study for rank estimation is conducted separately in Appendix C.

Simulation Settings
We first consider a 3-way array X with n = 100 samples, and d 1 = 10 and d 2 = 10 variables in the other two modes respectively.In particular, the generative model for ] is the underlying lowrank signal with true rank R = 5, and E is normally distributed noise with iid entries with mean 0 and variance σ 2 e = 4.The auxiliary data matrix Y contains q = 10 variables, potentially related to the latent variables in U through a linear relation U = YB + F. The coefficient B is a 10 × 5 matrix, and the random matrix F has iid rows with mean 0 and covariance Σ f .
We particularly consider the following settings: • Setting 1 (Non-supervision Setting): We set B to be a zero matrix, indicating the auxiliary data Y have no effect on the underlying structure of X (i.e., X is generated from a probabilistic CP model).The covariance matrix Σ f is a diagonal matrix with diagonal values {25, 16, 9, 4, 1}.• Setting 2 (Mixed Setting): The coefficient matrix B is filled with Gaussian random numbers.The covariance matrix Σ f is the same as in Setting 1. Correspondingly, the latent score U is jointly affected by Y and F.
• Setting 3 (Full-supervision Setting): The coefficient matrix B is the same as in Setting 2, and the covariance matrix Σ f is a zero matrix.Namely, the latent score U is solely determined by Y.
In all of the above settings, the loading matrices V 1 and V 2 are filled with random numbers and normalized to have orthonormal columns.We also consider several additional simulation settings in Appendix E, where 1) loadings are highly collinear; 2) dimensions d 1 and d 2 are high; 3) there are more than 3 modes; 4) the model is misspecified and data do not have multiway structure.We conduct 100 simulation runs for each setting, and compare different methods using various criteria.In particular, to evaluate the overall performance of dimension reduction, we assess the Frobenius norm of the difference between the estimated and true low-rank tensors ] F (denoted by SE).We also consider the estimation accuracy of loadings via the maximal principal angles [6].This only applies to SupCP and CP because SupSVD decomposes X (1) and there is no direct estimation of V 1 and V 2 .In addition, we evaluate the estimation accuracy of other important parameters in SupCP and SupSVD via B − B F , σ 2 e − σ 2 e /σ 2 e (relative error, denoted by RE e ), and the mean relative errors of the diagonal values of Σ f (denoted by RE f ).We also compare the fitting times of different methods on a standard desktop computer (8Gb Ram, 3.30GHz).

Results
The simulation results for Settings 1-3 are presented in Table 1.Additional results can be found in Appendix E. In all settings, SupCP always provides the smallest SE.Namely, it has the highest low-rank signal estimation accuracy.This is true even in Setting 1 where the data are generated from a CP model.We remark that SupCP outperforms CP in the non-supervision setting due to the shrinkage estimate of U in (8) from the EM algorithm.The shrinkage estimate strikes a balance between bias and variance, and thus leads to a smaller SE.A similar result was reported in [19].From the dimension reduction point of view, SupCP outperforms the competing methods and identifies the most accurate underlying structure.
In addition, in Setting 1, SupCP has similar loading estimation losses to CP. Namely, the proposed method automatically approximates the unsupervised method when the auxiliary data are irrelevant.SupCP also outperforms SupSVD in terms of the estimation of other parameters.In Setting 2, SupCP almost uniformly outperforms the competing methods.In Setting 3, where the latent score is solely determined by the auxiliary data, both supervision methods (SupCP and SupSVD) outperform CP in the signal recovery accuracy.Moreover, SupCP significantly improves the loading estimation over CP.In terms of the fitting time, CP is the fastest while SupCP and SupSVD are also very efficient in these settings.However, when dimensions are higher or the number of modes is larger, the computation of the matrix-based method SupSVD quickly becomes infeasible.Both SupCP and CP scale well to larger data sets (see Appendix E).Overall, SupCP is computationally efficient and has a good performance in a range of settings.

Application to Amino Acid Fluorescence
We consider fluorescence data for five laboratory samples, measured over the emission and excitation frequency domains with a spectral fluorometer.Intensities are available on a 2D grid for excitation wavelengths between 250nm and 450nm, and emission wavelengths between 250nm and 310nm, in increments of 1 nm.Thus, the resulting fluorescence array is of dimension X : 5 × 61 × 201.Each sample is comprised of a mixture of three amino acids diluted in water: Tryptphan, Tyrosine, and Phenylalanin.The known concentration of each amino acid in Mole/L is given in the matrix Y : 5 × 3.These data have been previously published [7] and are freely available online at http://www.models.life.ku.dk/nwaydata (accessed 02/28/2018).The structure of excitation/emission fluorescence data is suitably characterized by a CP factorization [3], to decompose latent patterns in the excitation and emission domains.We apply SupCP to the fluorescence array X, supervising on Y to capture the relationship between fluorescence structure and the three amino acids.A rank−3 model is selected to maximize the test likelihood (Section 4.2) under a leave-one-out cross validation scheme.The resulting supervision weights B are given in Table 2, scaled by the standard deviation of each amino acid.These results clearly show that each of the three components are driven by the concentration of a different amino acid.The resulting components are shown in Figure 1, and can be interpreted as the excitation and emission fluorescence spectra that are specific to each amino acid.

Application to Facial Images
The Labeled Faces in the Wild database [17] is a publicly available set of over 13000 images, where each image includes the face of an individual.The images are taken from the internet and the individuals represented are celebrities.Each image is labeled with the name of the individual, but the images are unposed and exhibit wide variation in lighting, image quality, angle, etc. (hence "in the wild").
[16] developed an attribute classifier, which gives describable attributes for a given facial image.These attributes include characteristics that describe the individual (e.g., gender, race, age), that describe their expression (e.g., smiling, frowning, eyes open), and that describe their accessories (e.g., glasses, make-up, jewelry).These attribute were determined on the Faces in the Wild dataset, as well as other facial image databases.In total 72 attributes are measured for each image.Our goal is to use these characteristics to supervise dimension reduction of the images and to develop a probabilistic model for computationally generating faces with given describable characteristics.
PCA and other low-rank matrix factorization approaches are commonly used to analyze facial image data.For example, the terminology eigenfaces [28,31] has been used to describe a broad class of methods in which PCA of vectorized facial images precedes a given task, such as face recognition.PCA can only be applied to vectorized images, wherein vectors for color images).Thus, PCA does not exploit multiway structure.Although facial images are not obviously multi-linear, the CP factorization has been shown to be much more efficient as a dimension reduction tool for facial images than PCA, and marginally more efficient than other multiway factorization techniques [21].
For our application the images are first frontalized, as described in [12].That is, the unconstrained images are rotated, scaled, and cropped so that all faces appear forward-facing and the image shows only the face.After this processing step, we expect the nose, mouth and other facial features to be in approximately the same location across the images.This step is important for linear factorization approaches such as PCA or SupCP, which assume the coordinates are consistent across the images.Some examples of the frontalized images are shown in Figure 2.
We work with a random sample of 4000 frontalized images from unique individuals.Each image is 90 × 90 pixels, and each pixel gives the intensity for colors red, green and blue, resulting in a multiway array of dimensions X : 4000 × 90 × 90 × 3. We center the array by subtracting the "mean face" from each image, i.e., we center each pixel triplet (x × y×color) to have mean 0 over the 4000 samples.The attribute matrix Y : 4000 × 72 is measured on a continuous scale; for example, for the smiling attribute, higher values correspond to a more obvious smile and lower values correspond to no smile.We standardize Y by subtracting the mean and dividing by the standard deviation for each row, hence converting each attribute to its z-scores.
We assess the fit of the SupCP model with a training set of 100 randomly imsart-ejs ver.2011/01/24 file: SupervisedTensorEJS_Final_arxiv.tex date: April 3, 2018  We fit the SupCP model, with rank 200, on the full set of 4000 images.This imsart-ejs ver.2011/01/24 file: SupervisedTensorEJS_Final_arxiv.tex date: April 3, 2018 job required a computing server with 64GB of RAM.We run the EM algorithm for 2000 iterations, including a 500 annealing burn-in, which took approximately 8 hours.After 2000 iterations the log-likelihood appears to converge, as shown in Figure 4.The estimated parameters provide a data-driven generative model for faces with given covariates.In particular, given a feature vector y with scores for the 72 describable attributes, the "mean face" for the given attributes is where V 1 : 90×200 contains loadings in the x-dimension, V 2 : 90×200 contains loadings in the y-dimension, and V 3 : 3 × 200 contains color loadings.The constructed mean facial images for certain attributes are given in Figure 5.To generate these images, the presence of a given features is coded as a z-score of 3 in the covariate vector, and all other scores are set to 0. The resulting images are smooth and remarkably intuitive.For comparison the analogous constructed images using SupSVD with rank 16 are shown in Figure 6, and the features are less distinctive.

Discussion
High-dimensional data with multiway structure are often encountered in modern research, yet present statistical methods to deal with such data are in their infancy and leave much room for further development.Most of the methodological development for the analysis of multiway data has traditionally come from fields other than statistics, such as computer science and chemometrics.Perhaps in part because of this, methods for multiway data are often not based on a generative statistical model.In this article we have developed the SupCP method, which is based on a probabilistic CP factorization model in which additional covariates inform the factorization.Our simulation studies illustrate the advantages of SupCP for dimension reduction and interpretation when the underlying structure of multiway data is multi-linear and at least partially driven by additional covariates.Our application to amino acid data illustrates how supervision can facilitate interpretation of the underlying low-rank components.Our application to facial image data illustrates how SupCP can be used to more generally estimate a generative model for multiway data with given covariates.
Many datasets can be represented as multiway arrays, but we caution that SupCP is not well motivated if the data do not have any multi-linear structure.In particular, SupCP and other approaches that involve multiway factorization are not universally applicable to visual image analysis problems.For our application to facial analysis it is critical that the images are well-aligned to common coordinates, and share common structure on that coordinate system.The data also show some (high-rank) multiway structure, as SupCP is more accurate than the analogous approach that relies on vectorization of the multiway images.State-of-the-art generative deep learning models, such as a properly trained conditional variational auto encoder, have also been applied to the task of facial image generation with promising results [33].
Our model assumes that array data can be decomposed into low-rank structure and independent residual error.Extensions to the model may allow for more complex residual correlation structures, such as a separable covariance for a more general array normal framework [13].
The SupCP framework introduced in this article may be extended in several ways.Here we have focused on the CP factorization because it allows for a straightforward extension of the supervised SVD model and has been shown to be an efficient factorization for facial image data.Supervised and probabilistic approaches to the Tucker factorization, or other multiway dimension reduction techniques, are interesting directions for future work.Moreover, there is a growing body of work on multiway factorization methods for sparse or functional data [1,2].The SupCP framework may be extended to accommodate sparse and functional data, or non-linear covariate effects, which are analogous to recent extensions of the SupSVD framework [18,9].Finally, here we have focused on allowing a random latent variable and covariate supervision for one mode (the sample mode).More general approaches which allow for supervision or random components in more than one mode is another promising direction of future work.
3. The entries of V 1 : 25 × R and V 2 : 25 × R are generated to have orthonormal columns via the SVD of a random matrix.
The error array E : 100 × 25 × 25 is generated by independent N (0, σ 2 ) entries, where we consider differing noise levels given by σ 2 = 1, 5, or 10.For each noise level, we independently generate 10 datasets as above for each of R = 0, 1, . . ., 10.For each of the 3 • 10 • 11 = 330 simulated datasets, we estimate the rank via the likelihood cross-validation scheme described in Section 4.2.We randomly specify a training set and a test set, each of size 50, and consider the ranks r = 0, 1, 2, . . ., 9, or 10, choosing the rank that gives the lowest loglikelihood in the test set.Note that R = 0 corresponds to a null dataset with independent entries, and the results for r = 0 are given by a the independent normal log-likelihood of the test set with variance σ2 given by the maximum likelihood estimate of the training entries.
Figure 7 shows a scatterplot of the estimated rank and the true rank for each noise level.For σ 2 = 1 the estimated rank is a good predictor for the true rank; the correct ranks were chosen in 83 of the simulations, overestimated slightly in 13, and underestimated slightly for 4.This demonstrates that a likelihood cross-validation approach is reasonable to determine the rank of the model.For σ 2 = 5, the estimated rank still shows a clear association with the true rank, but it is less accurate, with a tendency toward under-estimation.When σ 2 = 10 the ranks are severely underestimated, as the low-rank signal is difficult to distinguish from the noise.For all null simulations (R = 0) the approach correctly does not identify any signal.to a local maximum, and we explore different strategies to alleviate this issue.
Our simulation scheme is as follows.We consider a 4-way array X of dimensions 10 × 20 × 40 × 50, and a vector of covariates y : 10 × 1.For rank R = 2, 1.The entries of U : 2 × 10 are generated independently from a normal distribution, variance of each row chosen from Uniform(2, 22) 2. The entries of y are correlated with first column of U; specifically, y = u 1 − f where the entries of f are generated from a standard normal distribution 3.The entries of V 1 : 20 × 2, V 2 : 40 × 2, V 3 : 50 × 2 are generated independently from a standard normal distribution and standardized so that the columns have unit norm.4. The error array E : 10 × 20 × 40 × 50 is given by independent N (0, 1) entries.
We generate 100 datasets under the above scheme, and for each dataset we run the algorithm under different initialization methods.We apply both random initialization and initialization via a CP decomposition, as described in Section 4.1.1.To avoid local maximum, we also consider a form of annealing in which random variation is incorporated during the first few iterations.Specifically, independent normally distributed noise with mean 0 is added to the expected value for U at each iteration (Section 4.1.2),for the first L iterations, and the standard deviation of the noise is proportional to the inverse of the iteration number.
In total we consider six initialization methods: (1) random initialization with no annealing, (2) random initialization with 100 annealing iterations, (3) random initialization with 500 annealing iterations, (4) CP initialization with no annealing, (5) CP initialization with 100 annealing iterations, and (6) CP initialization with 500 annealing iterations.For each method, and each simulated data set, we run the algorithm until convergence twice under different random seeds.
In Table 3 we show the mean absolute difference in log-likelihoods between different runs, and the mean likelihood, under the different methods.Under this scheme, a CP initialization generally gives little benefit over a random initialization, and annealing improves the log-likelihood after convergence and the agreement under different runs.However, even 500 annealing iteration did not result in perfect agreement between different runs of the algorithm.To further reduce convergence to local optimum, one can run the algorithm under multiple initialization and choose that which gives the highest log-likelihood after convergence.

E.1. Collinearity Simulation
Here we investigate the effect of loading collinearity on model estimation.We conduct additional simulation studies under Settings 1-3, while setting the loadings in V 1 and V 2 to be highly collinear, respectively.In particular, we fill V 1 and V 2 with random numbers, and only normalize the columns to have unit norm.The other parameters are kept unchanged from Settings 1-3 in Section 6.
We conduct 100 simulation runs in each setting, and the results are presented in Table 4.In general, the results are very similar to those in Section 6.2 (where true loadings are orthonormal).SupCP significantly outperforms CP and SupSVD in the low-rank signal estimation.In terms of the estimation of each individual parameter, SupCP is better than or at least comparable to the corresponding competing methods.We note that principal angles ∠(V 1 , V 1 ) and ∠(V 2 , V 2 ) for SupCP and CP are generally larger and have higher variabilities under the collinear settings compared to the orthonormal settings.This may be due to the local optimum issue of the low-rank approximation to a tensor array, which calls for further investigation.

E.2. High-Dimension Simulation
We also investigate the performance of different methods in higher dimensions.In particular, we fix the parameters (B and Σ f ) as in Setting 2, and increase the dimensions (d 1 , d 2 ) of the loading matrices from (10, 10) to (50, 50), (100, 100) and (500, 500).We fill V 1 and V 2 with random numbers and normalize them to have orthonormal columns.In order to keep the signal-to-noise ratio (SNR) unchanged from Setting 2, we adjust σ 2 e accordingly.The results of 100 simulation runs under different settings are presented in Table 5.The fitting time for every method increases with the increasing dimension, but SupCP and CP can be fitted within a reasonable time in all settings.However, SupSVD is computationally infeasible when (d 1 , d 2 ) = (500, 500), where there are 250,000 variables in the matricized data.For SupCP, the estimation accuracy of the low-rank signal and loading matrices improves with the increasing dimension and fixed SNR.SupCP significantly outperforms CP regardless of the dimensions.

E.3. Multi-Mode Simulation
We also conduct a simulation study where the observed tensor has more than 3 modes.The setting is similar to Setting 2 in Section 6, but with 2 additional modes (i.e., a 5-way tensor) with d 3 = d 4 = 10 dimensions.We use the same parameters as in Setting 2 and adjust σ 2 e so that the SNR remains the same.The results from 100 simulation runs are shown in Table 6.SupCP still provides the best low-rank structure estimation.We note that the tensor methods (SupCP and CP) tend to have larger variabilities in this multi-mode setting compared to Setting 2. This may be due to the complexity of multi-mode tensors such as the identifiability issue and the local optimum issue.
We remark that the number of entries in a multiway tensor increases exponentially with the number of modes, presenting a huge challenge in computation.Existing methods do not scale well to a very large number of modes (say, > 10).We deem it a future research direction to extend SupCP to higher modes.

E.4. Misspecified simulation
Here we describe a simulation study in which the generated data do not have multiway structure.The setting is analogous to Setting 2 in Section 6, except that loadings V : 100 × 5 are given by the left singular vectors of a 100 × 5 matrix of independent N (0, 1) entries.The underlying structure is then given by the entries of UV T , arranged into an array of dimension 100 × 10 × 10.Thus, X has rank R structure when it is matricized, but does not have low rank multilinear structure in 3 dimensions; i.e., the generative model is supported by the SupSVD model, but not SupCP.We conduct 100 replications and estimate a rank 5 SupCP, CP, and SupSVD fit for each replication.The results are shown in Table 7, where the principal angle ∠(V, V) is found for the matricized loadings imsart-ejs ver.2011/01/24 file: SupervisedTensorEJS_Final_arxiv.tex date: April 3, 2018 Lock and Li/Supervised multiway factorization 1178 V = V mat under the SupCP and CP approaches.As expected, the SupSVD model is superior in this setting, demonstrating that a multiway factorization performs poorly if the data have no multiway structure.

Fig 1 .
Fig 1. Rank-1 components corresponding to Phenylalin, Tryptophan and Tryosine.The bottom two rows give the loadings for emission and excitation wavelengths, and the top row gives their resulting product.

Fig 3 .
Fig 3. Log-likelihood for the estimated SupCP and SupSVD models, for different ranks, on the training images used to fit the model and the test images.

Fig 4 .
Fig 4. Log-likelihood for the model over 2000 EM iterations.The first 500 iterations (a) incorporate random annealing to avoid local modes.

Fig 7 .
Fig 7. True rank vs. estimated rank for 100 randomly generated simulations, under different levels of noise variance.A small amount of jitter is added to show multiple points with the same coordinates.

Table 1
Simulation results under Settings 1, 2 and 3 (each with 100 simulation runs).The median and median absolute deviation (MAD) of each criterion for each method are shown in the table.The best results are highlighted in bold.

Table 2
Scaled coefficients for rank−3 fluorescence model.

Table 3
Results under 6 different initialization approaches over 100 simulated datasets.The mean log-likelihood, mean absolute difference between runs for the same data, and mean computing time for a single run are shown, with standard deviations in parentheses.

Table 4
Simulation results under Settings 1, 2 and 3 with collinear loadings (each with 100 simulation runs).The median and median absolute deviation (MAD) of each criterion for each method are shown in the table.The best results are highlighted in bold.

Table 6
Simulation results for 5-way tensor under Settings 2 (with 100 simulation runs).The median and median absolute deviation (MAD) of each criterion for each method are shown in the table.The best results are highlighted in bold.

Table 7
Simulation results for reduced matrix rank with no multiway structure under Settings 2 (with 100 simulation runs).The median and median absolute deviation (MAD) of each criterion for each method are shown in the table.The best results are highlighted in bold.