Clustering Matrix Variate Longitudinal Count Data

Abstract: Matrix variate longitudinal discrete data can arise in transcriptomics studies when data are collected for N genes at r conditions over t time points; thus, each observation Y n for n = 1, . . . , N can be written as an r × t matrix. When dealing with such data, the number of parameters in the model can be greatly reduced by considering the matrix variate structure, and the components of the covariance matrix then also admit a meaningful interpretation. In this work, a mixture of matrix variate Poisson log-normal distributions is introduced for clustering longitudinal read counts from RNA-seq studies. To account for the longitudinal nature of the data, a modified Cholesky decomposition is utilized for a component of the covariance structure. Furthermore, a parsimonious family of models is developed by imposing constraints on elements of these decompositions. The models are applied to both real and simulated data, and it is demonstrated that the proposed approach can recover the underlying cluster structure.


Introduction
Biological studies are a major source of longitudinal data. When modelling such longitudinal datasets, it is important to take into account the correlations among the measurements at different time points. Gene expression time course studies present important clustering and classification problems. Understanding how different genes are modulated over a period of time during an event of interest can provide key insights into their involvement in various biological pathways [1-4]. Cluster analysis allows us to group genes into clusters with similar patterns, or 'expression profiles', over time.
Model-based clustering utilizes finite mixture models [21] for cluster analysis; it assumes that a population is a mixture of G subpopulations or components, each of which can be modelled using a probability distribution. A random vector Y is said to arise from a parametric finite mixture distribution if its density can be written in the form f (y | Θ) = ∑ G g=1 π g f (y | θ g ), where π g ∈ (0, 1] such that ∑ G g=1 π g = 1 is the mixing proportion of the gth component, f (y | θ g ) is the density of the gth component, and Θ = (π 1 , . . . , π G , θ 1 , . . . , θ G ) denotes the model parameters. The choice of an appropriate probability density/mass function depends on the data type.
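As a toy illustration of the mixture density above, the following sketch evaluates a two-component mixture with univariate Poisson components (the mixing proportions and rates are illustrative only; the components in this paper are matrix variate):

```python
import math

# Minimal sketch of f(y | Theta) = sum_g pi_g f(y | theta_g) using
# univariate Poisson components; parameter values are illustrative.

def poisson_pmf(y, lam):
    return math.exp(-lam) * lam**y / math.factorial(y)

def mixture_pmf(y, pis, lams):
    return sum(p * poisson_pmf(y, l) for p, l in zip(pis, lams))

pis = [0.4, 0.6]      # mixing proportions, summing to 1
lams = [2.0, 10.0]    # component rates

# The mixture is itself a valid pmf, so its probabilities sum to 1.
total = sum(mixture_pmf(y, pis, lams) for y in range(100))
```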
Various approaches have been developed for the clustering of time course gene expression data (e.g., [6,7,22-25]). However, most statistical approaches for clustering gene expression data are tailored for microarray studies. While some of this could be attributed to RNA-seq technology being more recent compared to microarrays, the computational cost that comes with fitting multivariate discrete models needed for RNA-seq data has also led to challenges. The transcriptomics data arising from RNA-seq studies are discrete, high-dimensional, and over-dispersed count data. Efficiently analyzing these data remains a challenge. Because of the restrictive mean-variance relationship that comes with the Poisson distribution, the negative binomial distribution emerged as the univariate distribution of choice [26,27]. However, the multivariate negative binomial distribution [28] is seldom used in practice due to the computational cost that comes with fitting such a model [29].
Recently, [10] proposed mixtures of multivariate Poisson log-normal (MPLN) models for clustering over-dispersed, multivariate count data. In the MPLN model, the counts of the nth gene (with n = 1, 2, . . . , N) are modelled using a hierarchical Poisson log-normal distribution such that Y nj ∼ Poisson(e X nj +log c j ) and X n ∼ N p (µ, Σ), where X n = (X n1 , X n2 , . . . , X np ), N p (µ, Σ) denotes a p-dimensional Gaussian distribution with mean µ and covariance Σ, and c j is a known constant representing the normalized library size of the jth sample. This hierarchical structure allows for over-dispersion similar to the negative binomial distribution, but it also allows for correlation between the variables. An efficient framework for parameter estimation for mixtures of MPLN distributions was developed by [30], utilizing a variational Gaussian approximation.
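The over-dispersion induced by this hierarchy can be made concrete in the simplest univariate case: if Y | X ∼ Poisson(e^X) with X ∼ N(µ, σ²), standard log-normal moments give E[Y] = exp(µ + σ²/2) and Var[Y] = E[Y] + E[Y]²(e^{σ²} − 1), so the variance always exceeds the mean. A hedged sketch (the values of µ and σ² are illustrative only):

```python
import math

# Moments of the univariate Poisson log-normal model:
#   E[Y]   = exp(mu + sigma2 / 2)
#   Var[Y] = E[Y] + E[Y]^2 * (exp(sigma2) - 1)
# so Var[Y] > E[Y], unlike the plain Poisson where Var[Y] = E[Y].

def pln_moments(mu, sigma2):
    m = math.exp(mu + sigma2 / 2.0)
    v = m + m**2 * (math.exp(sigma2) - 1.0)
    return m, v

m, v = pln_moments(mu=1.0, sigma2=0.5)
# v exceeds m: over-dispersion relative to the Poisson mean-variance link
```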
In RNA-seq studies, it is common to obtain expression levels of n genes at r conditions over t occasions in one study. A natural framework for modelling such data is a matrix variate approach such that each observation Y n is an r × t matrix. Alternatively, a multivariate framework can also be utilized, where each observation Y n can be written as a vector vec(Y n ) of dimensionality rt = r × t. In [31], the authors developed mixtures of matrix variate Poisson log-normal (MVPLN) distributions. In the MVPLN model, Y n,ij ∼ Poisson(e X n,ij +log C ij ) and X n ∼ N r×t (M, Φ, Ω), where X n is an r × t matrix, N r×t (M, Φ, Ω) denotes a matrix normal distribution with location matrix M and scale matrices Φ and Ω, and C is an r × t matrix where C ij denotes the normalized library size of the ith condition at the jth time point. Mathematically, the MVPLN model is equivalent to Y n,ij ∼ Poisson(e X n,ij +log C ij ) and vec(X n ) ∼ N rt (vec(M ), Σ = Φ ⊗ Ω), where N rt is an rt-dimensional multivariate normal distribution and ⊗ is a Kronecker product. By adopting a matrix variate form, the large rt × rt covariance matrix of the latent variable X can be written as the Kronecker product of two smaller r × r and t × t scale matrices Φ and Ω, i.e., Σ rt×rt = Φ r×r ⊗ Ω t×t . This can greatly reduce the number of parameters in Σ.
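The parameter savings from the Kronecker structure can be sketched in plain Python (matrix values and sizes below are illustrative only, and `kron` is a hypothetical helper written here for clarity):

```python
# Sigma = Phi ⊗ Omega rebuilds the rt x rt covariance from an r x r and
# a t x t factor, shrinking the free-parameter count considerably.

def kron(A, B):
    """Kronecker product of matrices given as lists of lists."""
    rb, cb = len(B), len(B[0])
    return [[A[i // rb][j // cb] * B[i % rb][j % cb]
             for j in range(len(A[0]) * cb)]
            for i in range(len(A) * rb)]

Phi = [[2.0, 1.0],
       [1.0, 3.0]]        # r x r condition scale matrix (r = 2)
Omega = [[1.0, 0.5],
         [0.5, 1.0]]      # t x t time scale matrix (t = 2)
Sigma = kron(Phi, Omega)  # 4 x 4 covariance of vec(X)

# Free-parameter counts for, say, r = 3 and t = 4: a full rt x rt
# symmetric Sigma versus the separable form. (One parameter in the
# separable form is redundant, since Phi ⊗ Omega = (c Phi) ⊗ (Omega / c).)
r, t = 3, 4
full = (r * t) * (r * t + 1) // 2                 # 78
separable = r * (r + 1) // 2 + t * (t + 1) // 2   # 16
```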
In Section 2, we extend mixtures of the matrix variate Poisson log-normal model for clustering matrix variate longitudinal RNA-seq data by incorporating a modified Cholesky decomposition of the scale matrix Ω that captures the covariances of the t occasions of the latent variable X. Furthermore, by imposing constraints on the components of the scale matrices to be equal or different across groups, a family of eight models is obtained. Parameter estimation is performed using a variational variant of the expectation-maximization algorithm. In Section 3, the proposed models are applied to simulated and real datasets, and Section 4 concludes the paper.

Methods
A matrix variate Poisson log-normal distribution was proposed by [31] for modelling RNA-seq data. It arises hierarchically from conditionally independent Poisson distributions combined with a matrix variate Gaussian distribution. Suppose Y 1 , . . . , Y N are N observations from a matrix variate Poisson log-normal distribution, where the nth observation Y n is an r × t matrix representing r conditions and t time points. The matrix variate Poisson log-normal model for RNA-seq data can be written as Y n,ij | X n,ij ∼ Poisson(e X n,ij +log C ij ) and X n ∼ N r×t (M, Φ, Ω), where M is an r × t matrix of means, C is an r × t matrix of fixed, known constants to account for the differences in library sizes across the samples, and Φ and Ω are r × r and t × t scale matrices, respectively. The least-squares predictor of the unobserved latent variable X n,ij of the nth observation from the ith condition at the jth time point, given its predecessors, can be written as X̂ n,ij = ∑ j−1 k=1 ρ jk X n,ik , so that X n,ij = X̂ n,ij + √d j ε j , where X n,i1 , X n,i2 , . . . , X n,i(j−1) are the unobserved latent variables from the preceding j − 1 time points, ε j ∼ N (0, 1), and the ρ jk and d j are the autoregressive parameters and innovation variances, respectively [32]. Thus, when the responses are recorded over a period of time (i.e., t time points), the parameter Ω that captures the covariance of the t time points can be parameterized to account for the relationship between measurements at different time points. Specifically, Ω can be decomposed using the modified Cholesky decomposition [7,23,32] such that T Ω T ⊤ = D or, equivalently, Ω −1 = T ⊤ D −1 T, where D is a unique diagonal matrix with innovation variances d 1 , . . . , d t , and T is a unique lower triangular matrix with 1s on the diagonal and off-diagonal elements determined by the autoregressive parameters (ρ jk with j > k). In the context of a G-component mixture of matrix variate Poisson log-normal distributions [31], the mixture density corresponding to Equation (1) can be written as f (Y | Θ) = ∑ G g=1 π g f (Y | M g , Φ g , Ω g ), where π g > 0 is the mixing proportion of the gth component such that ∑ G g=1 π g = 1, f (Y | M g , Φ g , Ω g ) is the marginal distribution function of the gth component with parameters M g , Φ g , and Ω g , and Θ denotes all the model parameters.
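For concreteness, the factors of the modified Cholesky decomposition can be displayed for t = 3 (a sketch following the common Pourahmadi-style convention, in which the autoregressive parameters enter T with a negative sign):

```latex
% Modified Cholesky decomposition T \Omega T^\top = D, sketched for t = 3:
T =
\begin{pmatrix}
1 & 0 & 0 \\
-\rho_{21} & 1 & 0 \\
-\rho_{31} & -\rho_{32} & 1
\end{pmatrix},
\qquad
D = \operatorname{diag}(d_1, d_2, d_3),
\qquad
\Omega^{-1} = T^{\top} D^{-1} T .
```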

Longitudinal Data and Family of Models
Due to the longitudinal nature of the t time points, one can utilize the modified Cholesky decomposition for Ω g such that Ω −1 g = T ⊤ g D −1 g T g . The number of parameters in Ω g (i.e., t(t + 1)/2) increases quadratically with the number of time points, and this is further compounded in mixture models, as G different Ω g s need to be estimated. Thus, similar to [7], constraints can be imposed on T g and D g to be equal or different across groups, and an isotropic constraint D g = δ g I, where δ g is a scalar, can also be imposed. Various combinations of these constraints result in a family of eight models (see Table 1).

Table 1. The family of eight models obtained by imposing various constraints on the components of Ω g .

Model   T g across groups   D g across groups   Diagonal of D g
EEA     Equal               Equal               Anisotropic
EEI     Equal               Equal               Isotropic
EVA     Equal               Variable            Anisotropic
EVI     Equal               Variable            Isotropic
VEA     Variable            Equal               Anisotropic
VEI     Variable            Equal               Isotropic
VVA     Variable            Variable            Anisotropic
VVI     Variable            Variable            Isotropic
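Under the naming convention used in the text (first letter: T g equal (E) or varying (V) across groups; second letter: D g equal or varying; third letter: anisotropic (A) or isotropic (I)), the free-parameter counts in {Ω g } can be sketched as follows; `omega_params` is a hypothetical helper written for illustration, not code from the paper:

```python
# Free parameters contributed by {Omega_g} under each constraint
# combination: T has t(t-1)/2 autoregressive parameters, D has t
# innovation variances (or 1 if isotropic), each counted once if
# constrained equal across groups and G times otherwise.

def omega_params(model, G, t):
    t_const, d_const, iso = model[0], model[1], model[2]
    n_T = t * (t - 1) // 2          # autoregressive parameters in T
    n_D = 1 if iso == "I" else t    # innovation variances in D
    total = n_T * (1 if t_const == "E" else G)
    total += n_D * (1 if d_const == "E" else G)
    return total

models = ["EEA", "EEI", "EVA", "EVI", "VEA", "VEI", "VVA", "VVI"]
counts = {m: omega_params(m, G=2, t=4) for m in models}
# "VVA" (fully unconstrained) with G = 2, t = 4 gives 2*(6 + 4) = 20,
# i.e. G * t(t+1)/2, while the most parsimonious "EEI" needs only 6 + 1 = 7.
```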
Parameter estimation is outlined in Section 2.2. In cluster analysis, the best constraint for a specific dataset is also unknown a priori. Selection of the best fitting model among the eight models in the family for a given dataset is performed using a model selection criterion, which is discussed in detail in Section 2.3.

Parameter Estimation
Parameter estimation for mixture models is typically conducted using the traditional expectation-maximization (EM; [33]) algorithm. In the case of an MVPLN model, this requires computing the posterior expectations E(X | Y) and E(XX ⊤ | Y). However, the posterior distribution of the latent variable (i.e., X | Y) does not have a known form; thus, a Markov chain Monte Carlo expectation-maximization (MCMC-EM) algorithm is typically employed to estimate E(X | Y) and E(XX ⊤ | Y) empirically. Such an approach can be computationally intensive [10,31].
Subedi and Browne [30] developed an efficient parameter estimation algorithm for the matrix variate Poisson log-normal distribution using variational approximations [34]. This approach utilizes a computationally convenient approximating density in place of the more complex but 'true' posterior density, chosen by minimizing the Kullback-Leibler (KL) divergence between the true and approximating densities. Suppose we have an approximating density q(X); then, the marginal log of the probability mass function can be written as log f (y) = F (q, Θ) + KL(q(X) ∥ f (X | y)), where F (q, Θ) = E q [log f (y, X)] − E q [log q(X)] is the evidence lower bound (ELBO) and KL(q(X) ∥ f (X | y)) is the KL divergence between the approximating distribution q(X) and the true posterior. A Gaussian density is utilized as the approximating density, yielding a variational Gaussian approximation. Similar to [31], assuming q(X ng ) = N r×t (ξ ng , ∆ ng , κ ng ), an ELBO can be written for each observation Y n from the gth component, and the complete-data log-likelihood for the mixtures of MVPLN distributions can be approximated by replacing each component log-density with its ELBO; the approximation error is exactly the KL divergence between f (X n | Y n , Z ng = 1) and the approximating distribution q(X ng ). As the variational parameters that maximize the ELBO also minimize this KL divergence, estimates of the model parameters are obtained by maximizing the variational approximation of the complete-data log-likelihood. Similar to [31], an iterative EM-type algorithm is utilized for parameter estimation. At the (k + 1)th iteration:

1. Conditional on the variational parameters ξ ng , ∆ ng , and κ ng and on M g , Φ g , T g , and D g , E(Z ng ) is updated. As the marginal distribution of Y n is difficult to compute, E(Z ng ) is approximated using the ELBO, i.e., Ẑ ng ∝ π g exp(F̃ ng ), where F̃ ng denotes the ELBO of observation Y n under the gth component, normalized such that ∑ G g=1 Ẑ ng = 1.

2. Given Ẑ ng , the variational parameters are updated. (a) A fixed-point method is used for updating ∆ ng , where the vector function exp[a] = (e a 1 , . . . , e a r ) is the element-wise exponential of the r-dimensional vector a, diag(κ) = (κ 11 , . . . , κ tt ) puts the diagonal elements of the t × t matrix κ into a t-dimensional vector, and ⊙ is the Hadamard product. (b) A fixed-point method is likewise used for updating κ ng , with exp[a] = (e a 1 , . . . , e a t ), diag(∆) = (∆ 11 , . . . , ∆ rr ), and ⊙ defined analogously for the t- and r-dimensional cases. (c) Newton's method is used to update ξ ng .

3. Given Ẑ ng and the updated variational parameters, the updates of the model parameters π g , M g , and Φ g are obtained in closed form; in particular, π̂ g = ∑ N n=1 Ẑ ng /N.

Estimates of the model parameters T g and D g can be obtained by maximizing the approximation of the log-likelihood with respect to T g and D g . Defining S (k+1) g in Equation (5) as the Ẑ ng -weighted scatter matrix over the t time points at the (k + 1)th iteration, the estimates of T g and D g are analogous to [7]: the elements of T g are estimated from the regressions of each time point j = 2, . . . , t on its predecessors, and the updates of D g are the corresponding innovation variances. With S g so defined, the updates of T g and D g under all eight constraint combinations are likewise analogous to [7]; thus, the R package longclust [35] is utilized for the covariance decomposition.
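The regression view of the T g and D g updates can be sketched in plain Python, assuming only that S plays the role of a scatter/covariance matrix over the t time points (the matrix below is illustrative, and `solve` is a small hypothetical utility, not code from the paper or from longclust):

```python
# Modified Cholesky via sequential regressions: regress each time point
# on all preceding ones; the coefficients fill T (with a negative sign)
# and the residual variances fill the diagonal D, so that T S T^T = D.

def solve(A, b):
    """Solve A x = b by Gaussian elimination (A small, well-conditioned)."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def modified_cholesky(S):
    t = len(S)
    T = [[float(i == j) for j in range(t)] for i in range(t)]
    D = [S[0][0]] + [0.0] * (t - 1)
    for j in range(1, t):
        A = [[S[a][b] for b in range(j)] for a in range(j)]
        b = [S[k][j] for k in range(j)]
        rho = solve(A, b)                 # autoregressive coefficients
        for k in range(j):
            T[j][k] = -rho[k]             # T carries -rho_{jk} below the diagonal
        D[j] = S[j][j] - sum(rho[k] * b[k] for k in range(j))  # innovation var.
    return T, D

S = [[2.0, 0.8, 0.3],
     [0.8, 2.5, 1.0],
     [0.3, 1.0, 3.0]]    # illustrative 3 x 3 positive-definite scatter matrix
T, D = modified_cholesky(S)
# T S T^T is (numerically) the diagonal matrix diag(D)
```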
Convergence of the iterative EM-type approach was determined using the Aitken acceleration-based [36] criterion, which computes an asymptotic estimate of the log-likelihood at each iteration and deems the algorithm to have converged when the successive difference in the asymptotic estimates of the log-likelihood is less than ε [37]. Here, we used ε = 0.05.
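The stopping rule can be sketched as follows: at iteration k, the Aitken acceleration is a (k) = (l (k+1) − l (k) )/(l (k) − l (k−1) ), the asymptotic estimate is l ∞ (k+1) = l (k) + (l (k+1) − l (k) )/(1 − a (k) ), and iteration stops when successive asymptotic estimates differ by less than ε. The log-likelihood trace below is synthetic, purely for illustration:

```python
# Aitken acceleration-based convergence check for an EM-type algorithm.

def asymptotic(l0, l1, l2):
    """Asymptotic log-likelihood estimate from three successive values."""
    a = (l2 - l1) / (l1 - l0)
    return l1 + (l2 - l1) / (1.0 - a)

def converged(lls, eps=0.05):
    """Compare the two most recent asymptotic estimates (needs 4 values)."""
    if len(lls) < 4:
        return False
    prev = asymptotic(*lls[-4:-1])
    curr = asymptotic(*lls[-3:])
    return abs(curr - prev) < eps

# synthetic trace converging geometrically toward -1000
trace = [-1000.0 - 50.0 * 0.5 ** k for k in range(8)]
```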

Model Selection and Performance Assessment
In clustering, the true number of components is typically unknown. Additionally, the best constraint for the covariance structure here is also unknown. It is common practice to fit the model for a range of values of G under all possible models and select the best fitting model a posteriori using a model selection criterion. The Bayesian information criterion (BIC; [38]) remains the most widely used criterion. In our case, similar to [30], we use an approximation of the BIC defined as BIC = 2 log L̃ − p log N, where p is the number of free parameters, N is the sample size, and log L̃ is the ELBO-based approximation of the maximized log-likelihood. This approximation is computationally efficient, as the marginal probability mass function does not have a closed form and the maximized log-likelihood therefore cannot be computed exactly. When the true number of clusters is known, assessment of the clustering performance can be conducted using the adjusted Rand index (ARI; [39]). The ARI provides a measure of the clustering agreement between the true and predicted labels and adjusts for agreement by chance. The ARI for perfect agreement is 1, and the expected value of the ARI under random classification is 0.
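The ARI described above can be computed directly from the contingency table via the standard Hubert-Arabie formula; a minimal sketch (the label vectors are illustrative):

```python
from collections import Counter

# Adjusted Rand index from the contingency table of true vs. predicted
# labels: ARI = (index - expected) / (max_index - expected).

def comb2(n):
    return n * (n - 1) // 2

def ari(true, pred):
    n = len(true)
    pairs = Counter(zip(true, pred))   # contingency-table cell counts
    a = Counter(true)                  # row sums
    b = Counter(pred)                  # column sums
    index = sum(comb2(c) for c in pairs.values())
    sum_a = sum(comb2(c) for c in a.values())
    sum_b = sum(comb2(c) for c in b.values())
    expected = sum_a * sum_b / comb2(n)
    max_index = (sum_a + sum_b) / 2.0
    return (index - expected) / (max_index - expected)

# perfect agreement up to label switching gives ARI = 1
labels = [0, 0, 1, 1, 2, 2]
relab = [1, 1, 2, 2, 0, 0]
```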

Simulation Studies
Simulation studies were conducted to demonstrate the clustering performance of the proposed family of models and to assess parameter recovery.

Scenario 1
In the first scenario, 25 datasets, each of size n = 1000, were generated from a two-component fully unconstrained "VVA" model. Here, each observation in the dataset is a 3 × 4 (i.e., r = 3 and t = 4) matrix. The parameters used to generate the datasets are summarized in Table 2. All eight models with G = 1 to G = 5 were fitted, and the BIC was used for model selection. In all 25 datasets, the approach selected a two-component "VVA" model, with an ARI of 1.00 ± 0.00. The means of the estimated parameters along with their standard errors are also summarized in Table 2. The estimated parameters are close to the true parameters.

Scenario 2
In the second scenario, 25 datasets, each of size n = 1000, were generated from a two-component model with the same set of parameters as Scenario 1 but with a constraint on D such that D 1 = · · · = D G = δI (i.e., a "VEI" model). Again, each observation in the dataset is a 3 × 4 (i.e., r = 3 and t = 4) matrix. All eight models with G = 1 to G = 5 were fitted, and the BIC was used for model selection. In all 25 datasets, the approach selected a two-component model, with an average ARI of 1.00 ± 0.00. In 11 of the 25 datasets, a "VEI" model was selected, and in the remaining 14 datasets, a "VVI" model was selected. Note that a "VVI" model also imposes an isotropic constraint on D g such that D g = δ g I, but in a "VVI" model, δ g varies across groups. The means of the estimated parameters along with their standard errors from the datasets where a two-component "VEI" model was selected are summarized in Table 3. The estimated parameters are close to the true parameters. Table 3. True parameters along with the estimated means and standard deviations of the model parameters for Scenario 2.

Comparison with Other Approaches
The performance of the proposed models was compared to that of other mixtures of discrete distributions. Since other approaches for matrix variate discrete data were not available, the matrix variate data were first converted to multivariate data before comparison. Two model-based clustering techniques for RNA-seq data were used: HTSCluster [9,40,41], which fits a mixture of Poisson distributions, and MBCluster.Seq [8,42], which fits a mixture of negative binomial distributions. The comparison of the proposed approach with the two competing approaches on the simulated datasets from both scenarios is summarized in Table 4. Both HTSCluster and MBCluster.Seq failed to recover the underlying cluster structure in either scenario. This could be partly because both approaches are mixtures of independent univariate distributions, and in the presence of covariance, their performance suffers. This is in line with the findings of [30], who previously showed through simulation studies that when the dataset is generated from mixtures of independent Poisson distributions, HTSCluster can recover the underlying cluster structure; however, in the presence of over-dispersion (i.e., when the data are generated from a model such as the multivariate Poisson log-normal or the negative binomial distribution), the performance of HTSCluster suffers.

Transcriptomics Data Analysis
The proposed approach was used to cluster the transcriptomics dataset fission from the R package fission, available through Bioconductor. The dataset was originally published by [43]. The study consists of a time course RNA-seq experiment on fission yeast in response to oxidative stress (1 M sorbitol treatment) at 0, 15, 30, 60, 120, and 180 min in two types of yeast: wild type and mutant type (atf21∆ strain). Thus, the measurements for each observation can be written using matrix notation such that the two types of yeast are treated as rows (i.e., r = 2) and the time points are treated as columns (i.e., t = 6). We treat the measurements across time points as longitudinal.
For cluster analysis, we focused on the subset of differentially expressed genes provided in the Supplementary Material of [43]. Genes were considered differentially expressed if their mean expression level differed in at least one time point relative to the unstressed reference, and multiple-testing correction was performed to keep the overall FDR below 5%. A total of 3169 genes (out of 5957) were differentially expressed in the wild type yeast, and 3044 genes were differentially expressed in the atf21∆ strain. For our analysis, we included a gene if it was differentially expressed in both the wild type and the atf21∆ strain; thus, a total of 2476 genes were included.
All eight models were fitted to the dataset for G = 1 to G = 20, and the best fitting model was selected by the BIC. A G = 17 component "EVA" model with a constrained T g (i.e., T 1 = T 2 = . . . = T 17 ) and unconstrained anisotropic D g was selected. The constrained T g suggests that the correlation structure (i.e., the autoregressive relationship) among the time points is the same for all groups. However, the unconstrained anisotropic D g suggests that the variances at the different time points differ and vary from cluster to cluster. Visualization of the log-transformed expression values of the genes in each group along with the mean-expression trends is shown in Figure 1. As seen in Figure 1, the clusters have distinctive mean-expression trends.

Conclusions
A novel family of matrix variate Poisson log-normal mixture models was developed for clustering longitudinal transcriptomics data. The approach utilizes a modified Cholesky decomposition of a component of the covariance structure of the latent variable, and constraints imposed on the components of this decomposition yield a family of eight models. The performance of the proposed approach was illustrated using both simulated and real datasets, where it showed good clustering performance.
One limitation of the proposed approach is that it assumes that measurements are taken at the same fixed intervals for all observations, which can be restrictive. Future work will focus on extending these models to allow for varying interval lengths between observations. Furthermore, time is continuous, and discretizing it can result in a loss of information; hence, future work will also focus on developing a modelling framework that treats time as a continuous variable.