Multiple-view clustering for identifying subject clusters and brain sub-networks using functional connectivity matrices without vectorization

In neuroscience, functional magnetic resonance imaging (fMRI) is a vital tool for non-invasively measuring brain activity. Using fMRI, the functional connectivity (FC) between brain regions can be inferred, which has contributed to a number of findings on the fundamental properties of the brain. As an important clinical application of FC, clustering of subjects based on FC has recently drawn much attention, as it can potentially reveal important heterogeneity in subjects, such as subtypes of psychiatric disorders. In particular, multiple-view clustering is a powerful analytical tool, which identifies clustering patterns of subjects depending on their FC in specific brain areas. However, when one applies an existing multiple-view clustering method to fMRI data, the data structure must be simplified by treating the elements of an FC matrix independently, i.e., by vectorizing the correlation matrix. Such a simplification may distort the clustering results. To overcome this problem, we propose a novel multiple-view clustering method based on Wishart mixture models, which preserves the correlation matrix structure without vectorization. The uniqueness of this method is that the multiple-view clustering of subjects is based on particular networks of nodes (or regions of interest, ROIs), optimized in a data-driven manner. Hence, it can identify multiple underlying pairs of associations between a subject cluster solution and an ROI sub-network. The key assumption of the method is independence among sub-networks, which is effectively addressed by whitening correlation matrices. We applied the proposed method to synthetic and fMRI data, demonstrating the usefulness and power of the proposed method.


Introduction
In neuroscience, functional magnetic resonance imaging (fMRI) is a vital tool for non-invasively measuring brain activity. In particular, the functional connectivity (FC) between different brain regions, which is inferred from fMRI, is quite useful for investigating fundamental properties of the brain network, and has contributed to the uncovering of intriguing associations between neural circuits and their functions [1,2,3,4,5]. FC is also promising as a potential biomarker for various psychiatric disorders [6,7]. Furthermore, it has recently been suggested that the clustering of patients based on FC may reveal important subtypes of psychiatric disorders [8,9,10,11]. However, despite its usefulness, the high dimensionality of FC data hinders an effective application of conventional clustering methods. This is mainly because, in addition to the well-known 'curse of dimensionality' [12] for high-dimensional datasets, there are problems specific to cluster analysis. Typically, there is a complex data structure with multiple subject clustering patterns, depending on the selection of features [13,14]. That is, there may exist multiple underlying cluster solutions of subjects, which are characterized by different combinations of features.
To overcome this difficulty, a multiple-view clustering method has been proposed, which is successfully applied to fMRI data to identify subtypes of depressive disorder [15,16]. This method's main principle is to optimally partition features into several groups (called 'view') in which subject clustering is performed separately. In so doing, the partitioning of features and subject clustering are performed simultaneously. Hence, the yielded results are optimal in terms of both feature partition and subject clustering. The method is based on a Gaussian mixture model in which a univariate Gaussian distribution is fitted to each cluster block.
Despite its usefulness and wide areas of application, the method has a drawback when it is applied to FC data from fMRI. Typically, FC denotes the Pearson correlation of blood-oxygenation-level-dependent (BOLD) signals between ROIs, which is represented in the form of a matrix. The elements of a correlation matrix must jointly satisfy the positive definiteness condition, which restricts correlation matrices to a non-Euclidean space [17] (mathematically, such a space is not closed under subtraction; the resulting matrix may no longer be positive definite [18]). Hence, extracting elements from a correlation matrix and treating these elements separately in vector form may distort the underlying cluster structures. Nonetheless, element-wise feature extraction from a correlation matrix is a conventional practice for clustering, owing to its computational simplicity and high interpretability.
In the present paper, we propose a novel multiple-view clustering method for correlation matrices, which keeps the correlation structure intact and hence avoids the distortion mentioned above. Here, we use the term 'node' from graph theory as a synonym for ROI. In the mathematical formulation, it is also a synonym for 'variable' in the generative model of a correlation matrix. In our approach, nodes in a correlation matrix are exclusively partitioned into several views, each of which enables the identification of a sub-network in the graph represented by a given correlation matrix. Further, we assume an object cluster structure for each sub-network, fitted by a mixture of Wishart distributions. The number of views as well as the numbers of object clusters are automatically estimated in a nonparametric Bayesian framework. Such a node-based approach allows us to extend a mixture model from a single view to multiple views in a natural way while preserving the positive definite structure of correlation matrices. The proposed method provides a novel, data-driven perspective on clustering in terms of network analysis.
In the following sections, we first review work related to our proposed method. Next, we present our novel multiple-view clustering method. Subsequently, we evaluate the performance of the proposed method by applying it to artificial data. Further, we apply the proposed method to fMRI data, evaluating its performance on task fMRI data and resting-state fMRI data. Finally, we discuss interpretations of the yielded results from a neuroscientific point of view, along with important features of the proposed method.

Related work
There are several approaches to revealing the underlying multiple-view structure in data (for comprehensive reviews see [19,20]). We can largely categorize these approaches into 'alternative clustering' and 'subspace clustering'. In the alternative clustering, different cluster solutions from a given cluster solution are sequentially identified [21,22,23,24]. The idea behind it is to search the alternative cluster solution that is orthogonal to the given cluster solution. Typically, the alternative clustering is inferred based on the dissimilarity from the given cluster solution, using all features. On the other hand, in the subspace clustering, clustering is performed in several subspaces of features. This approach further entails two major strategies for identification of the subspace: the de-coupled strategy and the coupled strategy [25]. For the de-coupled strategy, the appropriate subspace is identified prior to performing clustering [26,27,28,29], whereas for the coupled strategy the identification of the subspace and clustering are performed simultaneously.
In the latter strategy, there are several variants in terms of clustering methods. In [30], the multiple cluster solutions are inferred by spectral clustering based on optimization of the non-redundant subspace. In [31], the rotation of the feature space and the projection coupled with K-means clustering are simultaneously optimized. For the model-based approach, the Gaussian mixture model is extended to multiple-view structure [32,15]. The proposed method in the present paper belongs to this category of multiple-view clustering. Despite such large variations in multiple-view clustering, to the best of our knowledge, there has been no attempt to perform multiple-view clustering for correlation matrices while keeping the correlation matrix structure intact.
Next, we briefly review the recent progress on the Wishart mixture model, which is the basic probabilistic model for our proposed method. There have been major theoretical developments with regard to the Wishart mixture model despite limited attempts to apply it to real data. The expectation-maximization algorithm has been derived for inferring the relevant parameters [33,34], and a Bayesian nonparametric extension for inferring the number of clusters has also been formulated [35]. However, the area of application has been limited to computer vision analysis [33,36]. Recently, in the field of neuroscience, a Wishart mixture model was applied to develop a hidden Markov model in the context of dynamic FC [37,38].
In a nutshell, our proposed method is theoretically quite unique, revealing the underlying multiple-view structure based on the Wishart mixture model. Practically, it is also the first attempt to apply a multiple-view clustering method to FC correlation matrices without vectorization.

What is multiple-view clustering?
Multiple-view clustering is based on the idea that different cluster solutions of objects (subjects) exist depending on the features that we focus on. In the toy example in Fig.1a, we have different cluster solutions for six letters, depending on the color, style, and number of holes, corresponding to views 1-3, respectively. We apply this idea to the clustering of objects characterized by correlation matrices (Fig.1b). There are two underlying object cluster solutions in this illustration, depending on two subsets of nodes (node sets 1 and 2). In this example, the nodes are sorted into node sets 1 and 2, which facilitates visual inspection of the object cluster structure. If we focus on patterns of correlations in node set 1, we obtain one cluster solution, as in view 1. On the other hand, if we focus on patterns in node set 2, we obtain another cluster solution, as in view 2. An important assumption in this model is that the underlying correlation between node sets 1 and 2 is zero. This suggests that we consider the clustering of objects for each independent sub-network of nodes (Fig.1c).
In real data, of course, it is not a trivial question of how to partition nodes. In the present study, assuming an exclusive partition of nodes where each node belongs to only a single view, we aim to simultaneously optimize a partition of nodes and object cluster solutions using a stochastic model.

Model
The proposed multiple-view clustering method is based on a probability model, which assumes that covariance matrices follow a Wishart distribution conditional on a cluster block. We consider the Wishart distribution to be the most natural and direct probabilistic formulation of an FC correlation matrix, in the same spirit as [38]. In this section, we outline the model, which explicitly incorporates node partitioning and object clustering (for the parameters of the model, please refer to Table 1).
We consider a p × T dimensional data matrix, typically time series data, denoted as p-dimensional X(t) at time point t (t = 1, . . . , T). Here, we suppose p nodes, each of which generates a single time series of length T. In the context of fMRI data, one may suppose BOLD signals where the number of ROIs is p. We assume that X(t) is generated from a multivariate Gaussian distribution,

X(t) \sim N(\mu, \Sigma), \quad (1)

where N(µ, Σ) denotes a multivariate Gaussian distribution with mean vector µ and covariance matrix Σ. Further, assuming µ = 0, the p × p matrix M = \sum_{t=1}^{T} X(t)X(t)' follows a Wishart distribution, which is given by

W(M \mid T, \Sigma) = C_{M,T} \, g(\Sigma; M, T), \quad (2)

where T and Σ are the degree of freedom and a scale parameter, respectively. Here, the constant term C_{M,T} and the non-constant term g(·) are defined as

C_{M,T} = \frac{|M|^{(T-p-1)/2}}{2^{Tp/2} \, \Gamma_p(T)}, \qquad g(\Sigma; M, T) = |\Sigma|^{-T/2} \exp\left( -\tfrac{1}{2} \mathrm{tr}(\Sigma^{-1} M) \right),

where Γ_p(a) = π^{p(p−1)/4} \prod_{j=1}^{p} Γ((a − j + 1)/2) with the gamma function Γ(·). To model multiple-view clustering, we introduce several structures into the Wishart model in Eq.(2). First, we introduce a view structure for the scale parameter Σ in Eq.(2). We partition nodes into several groups (referred to as 'views' hereafter), assuming that the time series between views are independent. To explicitly denote a partition of nodes, we consider a p-dimensional vector of view memberships u. This index defines the view memberships of the nodes, denoting that node i belongs to view u(i). For instance, u = (1, 1, 1, 2, 2) denotes that nodes one, two, and three belong to view one, whereas nodes four and five belong to view two. Because of the assumption of independence between views, the scale parameter Σ in Eq.(2) is partitioned into V views (v = 1, . . . , V) as follows:

\Sigma = \begin{pmatrix} \Sigma_1 & & 0 \\ & \ddots & \\ 0 & & \Sigma_V \end{pmatrix}, \quad (3)

where Σ_v is the scale parameter among the nodes that belong to view v. Note that for simplicity, nodes are sorted in ascending order of view membership in Eq.(3). Conditional on the view structure u, it can be shown that the Wishart density function is given by

W(M \mid T, \Sigma, u) = C_{M,T} \prod_{v=1}^{V} g(\Sigma_v; M_v, T), \quad (4)

where M_v is the covariance matrix for the nodes in view v.
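The factorized density in Eq.(4) is straightforward to evaluate numerically. The following sketch (in Python with NumPy/SciPy; the function name is ours, not from the paper) computes the log of Eq.(4), exploiting the fact that the constant term C_{M,T} ignores the partition while g(·) factorizes over views:

```python
import numpy as np
from scipy.special import multigammaln

def wishart_logpdf_views(M, T, Sigmas, u):
    """Log Wishart density with a block-diagonal scale matrix (Eq.(4)).

    M      : (p, p) scatter matrix, M = sum_t X(t) X(t)'
    T      : degree of freedom
    Sigmas : dict mapping view label v -> scale matrix Sigma_v
    u      : (p,) array of view memberships, u[i] = view of node i
    """
    p = M.shape[0]
    _, logdet_M = np.linalg.slogdet(M)
    # constant term C_{M,T}: depends only on M and T, not on the partition
    log_C = (0.5 * (T - p - 1) * logdet_M
             - 0.5 * T * p * np.log(2.0)
             - multigammaln(0.5 * T, p))   # multivariate gamma function
    # non-constant terms g(Sigma_v; M_v, T), one per view
    log_g = 0.0
    for v, Sigma_v in Sigmas.items():
        idx = np.where(u == v)[0]
        M_v = M[np.ix_(idx, idx)]          # sub-matrix for the nodes of view v
        _, logdet_S = np.linalg.slogdet(Sigma_v)
        log_g += (-0.5 * T * logdet_S
                  - 0.5 * np.trace(np.linalg.solve(Sigma_v, M_v)))
    return log_C + log_g
```

Since Eq.(4) is simply the full p-dimensional Wishart density with a block-diagonal scale matrix, the result can be checked against `scipy.stats.wishart.logpdf` with the blocks assembled into one matrix.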
It is worth noting that C_{M,T} does not depend on the view structure u, which facilitates the inference of the relevant parameters. Second, we consider a collection of n covariance matrices, denoted as D = \{M^{(1)}, \ldots, M^{(n)}\}. In the context of fMRI, one may suppose FC matrices for n subjects. Assuming that the distribution of a covariance matrix follows a Wishart distribution, we cluster objects by fitting each realization of M to a finite mixture of Wishart distributions with K components,

p(M \mid \theta) = \sum_{k=1}^{K} w_k \, W(M \mid T_k, \Sigma_{*,k}), \quad (5)

where T_k and Σ_{*,k} are the degree of freedom and the scale parameter for component k, and w_k is the mixing proportion (\sum_{k=1}^{K} w_k = 1). Note that the suffix * in Σ_{*,k} denotes that the scale parameter is for all nodes without partitioning. For simplicity, we set T_k = T (k = 1, . . . , K). Note that we have not considered a view structure yet. The parameter θ on the left-hand side of Eq.(5) denotes all relevant parameters of the Wishart mixture model on the right-hand side. At this stage, the number of components K is fixed. Next, we introduce a binary K-dimensional vector z, which denotes the latent class label of an object; that is, z(k) = 1 if the object belongs to component k, and z(k) = 0 otherwise. Conditional on the latent class label z, the Wishart density function is given by

p(M \mid z, \theta) = \prod_{k=1}^{K} W(M \mid T, \Sigma_{*,k})^{z(k)}. \quad (6)

This equation indicates that the density function is identical to the Wishart density function of the component k for which z(k) = 1. Next, we incorporate the view structure into Eq.(6). Analogous to Eq.(4), the density function conditional on the view-dependent object cluster memberships z_v becomes

p(M \mid u, \{z_v\}, \theta) = C_{M,T} \prod_{v=1}^{V} \prod_{k=1}^{K_v} g(\Sigma_{v,k}; M_v, T)^{z_v(k)}, \quad (7)

where K_v is the number of object clusters in view v. Moreover, in the spirit of co-clustering [15], we also introduce a cluster structure for nodes within a view. This has the effect of modeling more than one independent network within a view, hence identifying all networks relevant to the object clustering in the same view.
Here, to avoid cluttering, we do not lay it out, but the following model development is basically the same (see Appendix A for details).
Finally, using Eq.(7), the likelihood of the data D conditional on the view memberships and object cluster memberships is given by

p(D \mid T, u, \{z_v\}, \theta) = \prod_{i=1}^{n} C_{M^{(i)},T} \prod_{v=1}^{V} \prod_{k=1}^{K_v} g(\Sigma_{v,k}; M_v^{(i)}, T)^{z_v^{(i)}(k)}. \quad (8)

This equation indicates that the density function of each covariance matrix is evaluated by the Wishart distribution with the appropriate scale parameters, defined by the view memberships and subject cluster memberships. Note that Eq.(8) takes a simple product form because of the assumption of independence among views in Eq.(3). Also note that we consider the degree of freedom T as a parameter to be estimated. In usual circumstances, the degree of freedom T is known in advance. However, if the independence assumption on X(t) in Eq.(1) does not hold, we need to consider an effective degree of freedom, which is often the case for fMRI time series data [39,40]. In the present method, we allow for flexibility in this quantity, which is inferred within the framework of model estimation.

Correlation matrix
We consider the application of the aforementioned model to correlation matrices. We assume that a correlation matrix R is given as R = \sum_{t=1}^{T} X(t)X(t)'/T, where the p-dimensional vector X(t) is generated as in Eq.(1), with the diagonal elements of the covariance matrix Σ being one. The correlation matrix R can thus be interpreted as a normalized covariance matrix. The effect of such normalization is appropriately considered in the prior distributions for the scale parameters in the following section. Further, note that the diagonal elements of R are close to one, but in general not exactly one, because of the stochastic nature of X(t). More specifically, denoting the v-th element of X(t) as x_v(t), the v-th diagonal element of R is \sum_{t=1}^{T} x_v(t)^2 / T. From the Gaussian assumption on X(t), the numerator of this quantity follows a χ² distribution with degree of freedom T (hence, the expectation of \sum_{t=1}^{T} x_v(t)^2/T is one). In contrast, the diagonal elements of an empirical correlation matrix are exactly one (constant). The exact modeling of such an empirical correlation matrix is not straightforward and involves a further complication of the model [41]. Hence, as an approximation, we apply the model described in the previous section to empirical correlation matrices.
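As a quick numerical illustration of this point (a toy sketch, not part of the paper's pipeline): the model-based matrix R has fluctuating diagonal entries, each being a χ²(T) variable divided by T, whereas the empirical correlation matrix has a diagonal of exactly one.

```python
import numpy as np

rng = np.random.default_rng(1)
p, T = 5, 200
X = rng.standard_normal((T, p))    # unit-variance Gaussian signals, mu = 0

# model-based "correlation" matrix: R = sum_t X(t) X(t)' / T
R = X.T @ X / T
print(np.diag(R))                  # close to 1, but not exactly 1

# empirical correlation matrix: diagonal is exactly 1 by construction
R_emp = np.corrcoef(X, rowvar=False)
print(np.diag(R_emp))
```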

Priors
For Bayesian inference, we introduce prior distributions for the relevant parameters. First, for a scale parameter Σ_{v,k}, denoting its size as p′ × p′, we assume a conjugate prior, that is, an inverse Wishart distribution W^{-1}(ν, S), where ν and S are the degree of freedom and the scale parameter, respectively. We used a non-informative setting ν = p′ + 3 for the degree of freedom. When we apply the model to correlation matrices, we set the scale parameter to S = (ν − p′ − 1)I/T, where I is the identity matrix. With this setting of the scale parameter S, the prior expectation of Σ_{v,k} becomes I/T. Hence, the prior expectation of an observed matrix becomes the identity matrix, which fits well with a correlation matrix.
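This hyperparameter choice can be verified from the standard inverse Wishart mean together with the Wishart identity E[M] = TΣ:

```latex
\mathbb{E}[\Sigma_{v,k}] \;=\; \frac{S}{\nu - p' - 1}
  \;=\; \frac{(\nu - p' - 1)\,I/T}{\nu - p' - 1} \;=\; \frac{I}{T},
\qquad
\mathbb{E}[M] \;=\; T\,\mathbb{E}[\Sigma_{v,k}] \;=\; I .
```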
Second, for both the view and cluster labels u and z_v, we assume a Chinese restaurant process (CRP), which enables the estimation of the number of views and clusters:

u \sim \mathrm{CRP}(\alpha), \qquad z_v \sim \mathrm{CRP}(\alpha) \quad (v = 1, \ldots, V),

where z_v denotes the subject cluster memberships for view v. We set the hyperparameter α (concentration) of the CRP to one as a default (see Appendix B for more details).
Third, for the degree of freedom T, we assume a (uniform) categorical distribution T ∼ Cat(c, p), where c and p are the sample space and the event probabilities, respectively. We set c = (T_1, . . . , T_q) and p_i = 1/q (i = 1, . . . , q) (see Appendix C for more details).
Lastly, we assume that all these priors are independent. Hence, the joint prior distribution is simply given by the products of these priors.

Inference
By Bayes' rule, the posterior distribution of our model is given by

p(T, u, \{z_v\}, \{\Sigma_{v,k}\} \mid D) \propto p(D \mid T, u, \{z_v\}, \{\Sigma_{v,k}\}) \, p(T) \, p(u) \, p(\{z_v\}) \, p(\{\Sigma_{v,k}\}),

where the first term on the right-hand side is given by Eq.(8) and the remaining terms form the joint prior over the relevant parameters. For efficient optimization of the posterior, we integrate out all scale parameters {Σ_{v,k}}:

L = p(T) \, p(u) \, p(\{z_v\}) \int p(D \mid T, u, \{z_v\}, \{\Sigma_{v,k}\}) \, p(\{\Sigma_{v,k}\}) \, d\{\Sigma_{v,k}\}. \quad (9)

Because we assume a conjugate prior for the scale parameters {Σ_{v,k}}, this integral can be obtained in closed form. Hence, our objective is to find the remaining parameters T, u, and {z_v} that maximize L in Eq.(9). There are several approaches to optimizing L, such as variational Bayes, Gibbs sampling, and iterated conditional modes (ICM) [42,43]. In the present study, we optimize L by means of ICM, which in turn gives maximum a posteriori (MAP) estimates of T, u, and {z_v}. This algorithm iteratively optimizes the parameters one by one until L converges. Because the algorithm is not guaranteed to find the global optimum, we repeat this procedure for a number of initializations (setting the number of initializations J to 1000 as a default). Among the parameter estimates across initializations, we select the one that gives the maximum value of L (see Appendix D for details).
The motivation for taking this approach is that such a hard assignment of view memberships allows for a substantial reduction in computation time because of the explicit independence structure of the scale parameters in Eq.(3) (for the computational complexity, see Section 5.2). Averaging methods such as variational Bayes would treat view memberships as 'soft' averages. This is not computationally efficient in our context because the whole high-dimensional matrix would then need to be evaluated. It is also noted that Gibbs sampling, although it gives hard assignments of view memberships, is not efficient either, because it takes a long time to assess the convergence of the chains. Rather than simulating long chains, we make the best use of parallel computing, which allows for an efficient MAP search using ICM.
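The ICM loop itself is a generic coordinate-wise MAP ascent. The following is a minimal sketch of that loop (our own toy implementation; the paper's actual updates additionally exploit the closed-form marginal of Eq.(9), and the parameter names here are illustrative):

```python
def icm(log_posterior, init, proposals, max_iter=100):
    """Iterated conditional modes: sweep over the parameters, replacing
    each one by the candidate value that maximizes the log-posterior with
    all other parameters held fixed, until no sweep improves the objective.

    log_posterior : function dict -> float
    init          : dict of initial parameter values (a hard assignment)
    proposals     : dict name -> iterable of candidate values
    """
    params = dict(init)
    best = log_posterior(params)
    improved, it = True, 0
    while improved and it < max_iter:
        improved = False
        for name, candidates in proposals.items():
            for c in candidates:
                trial = dict(params)
                trial[name] = c
                val = log_posterior(trial)
                if val > best + 1e-12:      # accept only strict improvement
                    best, params, improved = val, trial, True
        it += 1
    return params, best

# toy usage: a separable objective with maximum at a=2, b=-1
def log_post(params):
    return -(params["a"] - 2) ** 2 - (params["b"] + 1) ** 2

best_params, best_val = icm(log_post,
                            init={"a": -3, "b": 3},
                            proposals={"a": range(-3, 4), "b": range(-3, 4)})
```

In practice, as in the paper, this local search is restarted from many initializations and the run with the largest objective is kept.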

Preprocessing
To effectively perform our proposed method, we apply the following preprocessing to the data. The first step is the regularization of covariance matrices in the high-dimensional case (large p). The second is the whitening of covariance matrices, which transforms them to meet the assumption of independence between views. Both preprocessing steps apply to correlation matrices as well.

Regularization
For a data matrix X in Eq.(1), an empirical covariance matrix becomes singular when the number of data points is smaller than the number of nodes.
In such a case, we cannot evaluate the covariance matrix using a Wishart distribution. Even if the number of data points is sufficiently large, the same ill-defined case may occur when the independence assumption over data points t does not hold, which is typically the case for fMRI data. Hence, in a high-dimensional case (a large number of nodes), some regularization of the covariance matrices is required. In the present study, we use the regularization method proposed by [44], which effectively shrinks the sample eigenvalues.
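As a minimal illustration of why regularization is needed (this sketch uses a simple linear shrinkage toward a scaled identity as a stand-in; it is not the eigenvalue-shrinkage estimator of [44], and the shrinkage weight is arbitrary):

```python
import numpy as np

def shrink_covariance(S, lam):
    """Linear shrinkage of a sample covariance toward (trace(S)/p) * I.
    Guarantees positive definiteness for 0 < lam <= 1, making an
    otherwise singular matrix usable under a Wishart density."""
    p = S.shape[0]
    target = np.trace(S) / p * np.eye(p)
    return (1.0 - lam) * S + lam * target

rng = np.random.default_rng(2)
p, T = 30, 20                      # fewer time points than nodes
X = rng.standard_normal((T, p))
S = X.T @ X / T                    # rank <= T < p, hence singular
S_reg = shrink_covariance(S, 0.1)  # full rank after shrinkage
print(np.linalg.matrix_rank(S), np.linalg.matrix_rank(S_reg))
```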

Whitening
The proposed method relies heavily on the assumption that nodes are independent between views. This assumption may not hold in real data analysis, in particular for fMRI data. Nonetheless, even in that case, there may be different object cluster structures between views. To reveal such structures using the proposed method, we consider the whitening of covariance matrices [45,46]. The whitening procedure entails the following linear transformation of a covariance matrix:

\tilde{M}^{(i)} = \bar{M}^{-1/2} M^{(i)} (\bar{M}^{-1/2})', \quad (10)

where \bar{M} is the empirical mean covariance matrix of the data, i.e., \bar{M} = \sum_{i=1}^{n} M^{(i)}/n, and the prime denotes matrix transposition. Through the whitening procedure, it is expected that if there is no object cluster structure in the off-diagonal elements of the covariance matrix, these elements become (nearly) zero for all objects. In contrast, if there is an object cluster structure, these elements do not become zero for all objects. Hence, whitening has the effect of making the elements between different views zero.
Incorporating the whitening procedure into the proposed method considerably widens the range of applications to real data. Note that in the case of correlation matrices, we further transform the whitened matrix \tilde{M}^{(i)} into a correlation matrix.
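A minimal sketch of the whitening step of Eq.(10), using the symmetric inverse square root of the mean matrix (the function name is ours). After the transform, the average of the whitened matrices is exactly the identity, so any surviving off-diagonal structure reflects between-object variation:

```python
import numpy as np

def whiten(mats):
    """Whitening as in Eq.(10): transform each covariance matrix by the
    inverse square root of the empirical mean covariance matrix.

    mats : (n, p, p) array of covariance matrices
    returns (whitened matrices, the whitening matrix M_bar^{-1/2})
    """
    M_bar = np.mean(mats, axis=0)
    # symmetric inverse square root via eigendecomposition
    w, V = np.linalg.eigh(M_bar)
    W = V @ np.diag(1.0 / np.sqrt(w)) @ V.T      # M_bar^{-1/2}
    return np.array([W @ M @ W.T for M in mats]), W
```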
Regarding whitening, one may wonder whether there is a correspondence between the whitened nodes and the original nodes. In other words, does node j of \tilde{M}^{(i)} represent node j of M^{(i)} in Eq.(10)? The whitening procedure in question is the one that maximizes the sum of the cross-correlations between the whitened and original nodes [47]. A correspondence between whitened and original nodes is often taken for granted in the literature, for example, in terms of the affine framework [48]. However, as whitening involves different scalings of the nodes, the order of data points is not, in general, preserved between the whitened and original nodes. To the best of our knowledge, there is no theoretical foundation that ensures complete agreement between the two node spaces in terms of cross-correlations. Hence, for the interpretation of whitened ROIs, we consider the following two approaches. One approach is to consider that whitened ROIs represent the corresponding original ROIs (hereafter referred to as the 'heuristic' approach). One may examine the cross-correlations between original ROIs and whitened ROIs, which are given by \bar{M}^{1/2} [47]. The other approach is to explicitly consider the effect of whitening (hereafter referred to as the 'vigorous' approach), as follows. For a given original ROI i, we evaluate its contribution to a particular view (e.g., view v) as the sum of the absolute values of the correlation coefficients between the ROI in question and the whitened ROIs in that view. Denoting such a contribution as Importance(i, v), we formulate it as

\mathrm{Importance}(i, v) = \sum_{u \in \text{view } v} |r_{i,u}|,

where r_{i,u} is the cross-correlation between the original ROI i and whitened ROI u. Importance can be used to select the ROIs relevant to this view.
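Using the fact that the cross-covariances between original and whitened nodes are given by \bar{M}^{1/2} and that whitened nodes have unit variance, the 'vigorous' Importance measure can be sketched as follows (our own illustrative code; the normalization by the original ROI's standard deviation is our reading of the cross-correlation, and `view_of` is an assumed vector of view memberships for the whitened ROIs):

```python
import numpy as np

def roi_importance(M_bar, view_of):
    """Importance(i, v) = sum over whitened ROIs u in view v of |r_{i,u}|.

    M_bar   : (p, p) empirical mean covariance matrix
    view_of : (p,) view memberships of the whitened ROIs
    returns dict: view label -> (p,) vector of Importance values
    """
    w, V = np.linalg.eigh(M_bar)
    C = V @ np.diag(np.sqrt(w)) @ V.T          # cross-covariances = M_bar^{1/2}
    # whitened ROIs have unit variance, so dividing by the sd of the
    # original ROI turns cross-covariances into cross-correlations
    r = C / np.sqrt(np.diag(M_bar))[:, None]
    return {v: np.abs(r[:, view_of == v]).sum(axis=1)
            for v in np.unique(view_of)}
```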

Results
In this section, we report the results of two simulation studies evaluating the proposed method's performance: one application to synthetic data and one to fMRI data. For the synthetic data, we aim to examine the proposed method's workability and compare its performance with other methods. For the fMRI data, we aim to demonstrate the workability of the proposed method in a real situation. To this end, we consider the application to two types of datasets released by the Human Connectome Project (HCP) [49]. The first dataset is task fMRI data, in which we mixed subjects performing two different tasks; hence, two clusters are artificially embedded in the data.
We tested whether the proposed method can recover these 'true' clusters characterized by the difference in tasks. Second, we applied the method to resting-state fMRI data. We considered two sessions of resting-state fMRI of the same subjects. We examined whether the proposed method can identify similar subject clusters between the two datasets. In both cases, we evaluated a correlation matrix for a subject based on the mean BOLD signal in brain parcels of Shen parcellation [50], which defines 268 ROIs in the whole brain.

Synthetic data

Data generation
We consider two types of data: Type 1 and Type 2. For Type 1, we assume that the background correlation is zero, which exactly matches our model assumption. For Type 2, we assume that the background correlation is not zero, which does not match the model assumption and hence requires a whitening procedure. For these two types of data, we used the following settings of the view and cluster structures:

• Number of nodes: p = 30
• Number of objects: n = 100
• Number of views: V = 3 (each view has ten nodes)
• Number of object clusters: K_v = 4 (v = 1, . . . , V) (hence, each cluster has 25 objects)
• Number of node clusters:

We generate data based on Eq.(1), using a correlation matrix Σ_*, which is obtained by combining the correlation matrices of the three views. For each object cluster in a view, a 10 × 10 correlation matrix is generated as follows. First, a 10 × 10 lower triangular matrix L is generated, where each element is drawn from the standard Gaussian distribution (mean 0 and variance 1). Second, we evaluate the matrix product A = LL'. Because of the property of the Cholesky decomposition, the resulting matrix A is positive definite. Third, we transform this covariance matrix into a correlation matrix. Subsequently, we randomly shuffle the nodes of the correlation matrix. We generate such a correlation matrix for each object cluster in each view.
Finally, we construct a correlation matrix over all nodes by diagonally combining the correlation matrices from the views, setting the correlations of nodes between views to zero as in Eq.(3). Such a combination is specific to the object cluster IDs and view IDs. Denoting the resulting matrix as Σ (omitting the symbol for its dependency on the object cluster structure), we obtain Σ_* as a weighted combination of Σ and a 'noisy' background matrix N (whose diagonal elements are one), where the weight w is the noise ratio that we manipulate. Regarding the background matrix N, we consider two cases for its off-diagonal elements: 0 (Type 1) and 0.2 (Type 2). Finally, we set the number of data points to T = p + 10. Next, for each object, we randomly assign a cluster membership in each view. Subsequently, using the correlation matrix Σ_* corresponding to the cluster memberships, data of sample size T are randomly generated for each object from the multivariate Gaussian distribution in Eq.(1). We obtain a collection of correlation matrices from the generated data, which are used as inputs for the clustering methods. We ran 100 replications for each configuration. As a preprocessing step, we do not perform whitening for Type 1 datasets, whereas we do perform whitening for Type 2 datasets.
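The random-correlation-matrix recipe described above can be sketched as follows (an illustrative implementation of the L L' construction with rescaling to unit diagonal; the function name is ours):

```python
import numpy as np

def random_correlation(p, rng):
    """Random p x p correlation matrix: A = L L' with Gaussian
    lower-triangular L is positive definite (Cholesky property);
    rescaling by the diagonal turns it into a correlation matrix."""
    L = np.tril(rng.standard_normal((p, p)))
    A = L @ L.T
    d = np.sqrt(np.diag(A))
    return A / np.outer(d, d)

rng = np.random.default_rng(3)
R = random_correlation(10, rng)    # unit diagonal, symmetric, PD
```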

Other clustering methods
To the best of our knowledge, no method exists that simultaneously infers both view and object cluster structures for correlation matrices. Hence, as competitive methods, we consider several combinations of existing clustering methods.

Hierarchical + K-means
In this method, hierarchical clustering is first performed to partition the nodes, using the mean correlation matrix over objects as in the sub-network analysis of ROIs [51,52]. Subsequently, K-means clustering is performed to cluster the objects using the vectorized correlation matrices in each view. For hierarchical clustering, we set the linkage function to 'average', with the distance defined as (1 − r), where r is the corresponding off-diagonal element of the correlation matrix.

Community detection + K-means
Instead of a hierarchical clustering method, a community detection method [53] is performed for the partitioning of nodes as in the sub-network analysis of [54], followed by the clustering of objects using a K-means clustering method. The community detection method is based on the weighted stochastic block model (WSBM), a generalized stochastic block model for weighted edges.

K-means
The input here is the vectorized correlation matrices. Without considering a view structure, we naively apply K-means clustering to the vectorized correlation matrices.

Multiple-view clustering based on Gaussian mixture models
This is another type of multiple-view clustering, where the input is vectorized correlation matrices, as in the case of 'K-means' above. Instead of nodes, the matrix elements are partitioned into views. Specifically, we use the multiple-view clustering method based on Gaussian mixture models [15], which fits Gaussian mixture models to the matrix elements in each view.
For the hierarchical + K-means and community detection + K-means methods, we fixed the number of views to the true value. In addition, for all competitive methods that use K-means, we fixed the number of object clusters to the true value. On the other hand, for our proposed method and the multiple-view clustering method based on Gaussian mixtures, these quantities were not fixed but were inferred in a data-driven manner. When K-means and multiple-view clustering based on Gaussian mixture models were applied, the features were standardized (i.e., mean zero and standard deviation one), following conventional practice. To alleviate the local-minimum problem of the objective function, we set the number of initializations to 100 for all clustering methods in the present study.

Evaluation of performance
We evaluated the performance of these clustering methods in two ways. One is the recovery of the true view structure, whereas the other is the recovery of the true object cluster structures. To recover the true view structure, we examined the agreement between the true view memberships and estimated view memberships. We evaluated the mean value of the adjusted rand index (ARI) [55] of view memberships over 100 replications of synthetic data. To evaluate the recovery of the true object cluster structures, we examined the extent to which the true object clusters in each view were recovered. Because view labels are arbitrary, that is, they are simply nominal, there is generally no correspondence between the numbering of the true view labels and the numbering of yielded view labels. Hence, we need to match these views.
However, such a matching is not necessarily obvious, particularly when the clustering performance deteriorates. Therefore, for each true view, we simply evaluated the recovery of the true object cluster structure by taking the maximum value of ARI against the estimated cluster solutions in all views. Lastly, we took the grand average of these ARIs over the different true views and over 100 replications as the overall performance.
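The evaluation procedure above can be sketched as follows; a minimal pure-Python implementation of ARI and the view-matching average (function names are illustrative, not from the paper):

```python
from math import comb
from collections import Counter

def ari(a, b):
    """Adjusted Rand Index between two label vectors."""
    n = len(a)
    ct = Counter(zip(a, b))  # contingency counts
    sum_ij = sum(comb(v, 2) for v in ct.values())
    sum_a = sum(comb(v, 2) for v in Counter(a).values())
    sum_b = sum(comb(v, 2) for v in Counter(b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate partitions
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

def cluster_recovery(true_clusters_per_view, est_clusters_per_view):
    """For each true view, take the maximum ARI of its true object-cluster
    labels against the estimated cluster solutions of all views; then
    average over true views (and, in the simulation, over replications)."""
    scores = [max(ari(true_z, est_z) for est_z in est_clusters_per_view)
              for true_z in true_clusters_per_view]
    return sum(scores) / len(scores)
```

Taking the maximum over estimated views sidesteps the arbitrary numbering of view labels, at the cost of being slightly optimistic when performance is poor.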

Results of performance
The performance results are summarized in Fig.2. For Type 1 data (i.e., background correlation zero between views), the proposed method ('Multiple' in Fig.2) performed reasonably well, perfectly recovering both the view and object cluster structures for noise weights w between 0 and 0.6. However, the performance deteriorated as the noise weight w further increased. The method of community detection + K-means ('WSBM-Kmeans') performed better than the proposed method for noise weights w between 0.7 and 0.9. In contrast, the hierarchical + K-means ('Hier-Kmeans') and K-means ('K-means') methods did not perform well. In particular, the multiple-view clustering method based on Gaussian mixture models yielded poor results.
For Type 2 data (i.e., background correlation 0.2 between views), the performance of the proposed method was as good as in the case of Type 1 data. However, the performance of the method of community detection + K-means worsened. This is because the true view structure becomes vague in the mean correlation matrix for Type 2 data, which hinders this method from correctly identifying the true view and object cluster structures. As in the case of Type 1 data, the multiple-view clustering method based on Gaussian mixture models performed poorly.
Overall, this simulation study suggests reasonably good performance of the proposed method. Importantly, its performance is stable regardless of whether the background correlation is zero or non-zero. On the other hand, the competing methods do not perform well, particularly on Type 2 data. The main reason for the poor performance is the lack of information on the view structure in the mean correlation matrix. This suggests that without explicitly modeling view structures, there is a limit to recovering the view structure from the data. Lastly, it is of note that the multiple-view clustering method based on Gaussian mixture models performs rather poorly. This implies that for this method, the multiple-view structure largely disappears through the vectorization and standardization of the given correlation matrices.

Task fMRI
In this experiment, we applied the proposed method to samples of correlation matrices obtained from task fMRI data in the HCP data depository. The objective is to examine whether the proposed method can identify subject clusters characterized by task differences. That is, we aim to identify the relevant view(s) for the task differences.
In HCP data, the task fMRI data consist of seven tasks: motor, working memory, language, social cognition, gambling, relational processing, and emotion processing. Each subject is engaged in all these tasks. Among these seven tasks, we focused on three: motor (MOTOR), working memory (WM), and language processing (LANGUAGE), which have a relatively longer scanning duration than the other tasks. Because our objective was to examine the performance of our proposed method, we used a simply structured dataset in which subjects differed only in two tasks; hence, there were two clusters in a dataset. The data generation procedure was as follows. For each task, we randomly selected 50 subjects from the HCP depository (970 subjects) without overlap, which resulted in three task-specific datasets. We combined two of these three datasets, resulting in three mixed datasets: MOTOR-WM, MOTOR-LANGUAGE, and LANGUAGE-WM, each consisting of 100 subjects. We applied the proposed method to these three types of mixed datasets. It is of note that this way of randomly selecting subjects weakens the control of factors other than the task. Alternatively, one might choose 50 subjects and extract two task datasets for each subject, which would also result in a sample size of 100. This procedure would perfectly control factors other than the task difference.
However, the samples would not be independent because we would have used two datasets for the same subject. To avoid such dependency between samples, we opted to select 100 different subjects in the manner mentioned above, which is analogous to an actual data collection situation.
To evaluate correlation matrices, we used the minimally preprocessed task fMRI data provided by the HCP (e.g., tfMRI_MOTOR_LR.nii). Unlike resting-state fMRI, it would be appropriate to remove the influence of task-related activities by regressing out task events. Nonetheless, it has been reported that such a regression step has only minimal effects on FC analysis [56]. Moreover, in the present study, our objective was to examine the feasibility of the clustering method. Hence, for simplicity, we used the whole time series, including the non-task periods. As part of preprocessing, we removed linear components related to the 12 motion parameters, and applied low-pass filtering and scrubbing.
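The nuisance-regression step can be sketched with numpy as follows; a minimal sketch that omits low-pass filtering and scrubbing, with illustrative variable and function names:

```python
import numpy as np

def regress_out(ts, nuisance):
    """Remove the best linear fit of the nuisance regressors (e.g., the
    12 motion parameters) from each ROI time series.

    ts: (T, n_rois) BOLD time series; nuisance: (T, n_reg) regressors."""
    X = np.column_stack([np.ones(len(nuisance)), nuisance])  # intercept + regressors
    beta, *_ = np.linalg.lstsq(X, ts, rcond=None)
    return ts - X @ beta  # residual time series

def corr_matrix(ts_clean):
    """Pearson correlation matrix between ROIs from the cleaned series."""
    return np.corrcoef(ts_clean, rowvar=False)
```

A correlation matrix per subject, computed from the cleaned series, is then the input to the clustering method.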
Regarding the proposed method, we performed regularization and whitening of the correlation matrices. The number of initializations J was set to 1000. To select a stable model, we used the following heuristic for model selection: first, we selected the top five models in terms of the posterior L in Eq.(9). Among these five models, we then evaluated the agreement of view memberships between models by means of ARI, and identified the pair of models that gave the largest value of ARI. The final model was the member of this pair with the larger value of the posterior L. This procedure allowed us to choose a stable model with a large posterior value, reducing the risk of selecting a spurious model.
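This heuristic can be sketched as follows, with posteriors and view memberships assumed to be available from the random restarts (the agreement function would be ARI in our setting; names are illustrative):

```python
from itertools import combinations

def select_stable_model(models, agreement, top=5):
    """Heuristic model selection over random restarts.

    models: list of (posterior_L, view_membership) pairs.
    agreement: a similarity function on memberships (ARI in the paper).
    Take the top models by posterior, find the pair whose view
    memberships agree the most, and return the member of that pair
    with the larger posterior."""
    top_models = sorted(models, key=lambda m: m[0], reverse=True)[:top]
    best_pair = max(combinations(top_models, 2),
                    key=lambda pair: agreement(pair[0][1], pair[1][1]))
    return max(best_pair, key=lambda m: m[0])
```

The pairwise-agreement step filters out restarts that reached a high posterior by chance but whose view structure is not reproduced by any other good restart.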

Results of clustering
Here, we report on the clustering results in detail for the MOTOR-WM dataset. This performed the best in terms of recovery of the task difference. First, the number of views was estimated as 25, whereas the number of ROIs in the views ranged from 1 to 32 (Fig.3). The number of subject clusters ranged from 1 to 56 (Fig.3, 4). Second, we examined the stability of clustering results in terms of view memberships. To this end, we focused on 30 best models out of 1000 random initializations. We evaluated the stability of view memberships between the optimal model and those 30 models by means of the Dice coefficient [57] (Fig.5; the red points denote 95% upper limits by permutation test). It was observed that the extent of stability varied depending on the view. However, it appeared to be more stable as the number of subject clusters increased.
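The Dice-based stability check can be sketched as follows; a minimal example, assuming each view is represented by its set of node (ROI) IDs (function names illustrative):

```python
def dice(view_a, view_b):
    """Dice coefficient between two views, each a collection of node IDs."""
    a, b = set(view_a), set(view_b)
    if not a and not b:
        return 1.0
    return 2.0 * len(a & b) / (len(a) + len(b))

def view_stability(optimal_views, other_views):
    """For each view of the optimal model, the best Dice agreement with
    any view of a comparison model (e.g., one of the 30 next-best models)."""
    return [max(dice(v, w) for w in other_views) for v in optimal_views]
```

Averaging these per-view scores over the 30 comparison models, and comparing against a permutation baseline, gives the stability profile reported in Fig.5.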

Recovery of task difference
In this experiment, the dataset consisted of two heterogeneous groups of subjects attributed to the different tasks, MOTOR and WM. We examined whether such a heterogeneous structure of subjects was recovered in the clustering results, which was the main objective of this experiment. It was found that the proposed method best recovered the task difference in view 16 (Fig.6). In Fig.5, it can also be observed that view 16 is among the stable views.
Furthermore, to confirm the recovery of the task difference, we examined the distributions of subjects in view 16 (Fig.7). Remarkably, subject cluster 1 was dominated by subjects performing the WM task, whereas subject cluster 2 was dominated by those performing the MOTOR task. Additionally, to examine whether such a clear task difference could be recovered without the multiple-view structure, we applied the proposed method imposing the constraint of a single-view structure (i.e., the number of views was not estimated but fixed to one). It was found that the clear task difference was not recovered in this case.

Relevant brain regions for view 16 in MOTOR-WM
Next, we analyzed the relevant brain regions for view 16. Fig.8 shows that the correlation matrix patterns differ between subject clusters 1 and 2. Differences, though subtle, are also observed in several individual subjects (columns 2 to 5 in Fig.8).
Lastly, we identified the relevant sub-network of the brain in this view. We considered both a heuristic and a rigorous approach to the selection of ROIs. In the heuristic approach, we selected the aforementioned 12 ROIs, confirming that the original ROIs and whitened ROIs were in good agreement (Fig.9). In the rigorous approach, the top 12 ROIs in terms of the importance index in Eq.(11) were the same as the 12 ROIs identified in the heuristic approach (Fig.10). The large gap between the 12th and 13th ROIs in Fig.10 suggests that the main contribution in the original ROI space is due to these top 12 ROIs. Regarding their location in the brain, these ROIs are related to the motor strip and parietal regions (Fig.11). These brain regions are discussed in more detail in section 5.1.

Resting-state fMRI
In this experiment, we applied the proposed method to resting-state fMRI data. We aimed to examine the reproducibility of the clustering results between two resting-state fMRI datasets obtained from the same subjects. Moreover, inspired by the clustering results (shown in the following section), we also examined connectome fingerprinting [59]. The study [59] demonstrated that a subject was identifiable using a specific collection of FCs in predefined brain networks. In contrast, we approached connectome fingerprinting in terms of the clustering of subjects, without prior information on the brain network structures.
In the HCP data, resting-state fMRI was acquired four times for a single subject: twice for each phase encoding, L-R and R-L [49]. In the present experiment, we used the two L-R phase encoding data (REST1_LR_hp2000_clean.nii and REST2_LR_hp2000_clean.nii) to prepare two datasets of correlation matrices for 100 randomly selected subjects. We refer to these datasets as 'Rest1' and 'Rest2', respectively. We performed the same preprocessing as in the case of the task fMRI datasets in the previous section, followed by applying the proposed method to each dataset separately. It is of note that for the whitening of correlation matrices, we used the mean correlation matrix of all subjects in both Rest1 and Rest2. This allows us to compare the clustering results between the two datasets because the whitened ROI space becomes the same for Rest1 and Rest2.
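The shared-whitening step can be sketched as follows. This is a minimal ZCA-style sketch, assuming whitening by the inverse square root of the mean correlation matrix; the paper's exact procedure is defined in its Methods section:

```python
import numpy as np

def inv_sqrt(M, eps=1e-10):
    """Inverse square root of a symmetric positive-definite matrix
    via eigendecomposition."""
    vals, vecs = np.linalg.eigh(M)
    return vecs @ np.diag(1.0 / np.sqrt(np.maximum(vals, eps))) @ vecs.T

def whiten(corrs):
    """Whiten each subject's correlation matrix by the inverse square
    root of the mean correlation matrix over all subjects. Using one
    mean matrix for Rest1 and Rest2 keeps the whitened ROI space the
    same across datasets."""
    mean_c = np.mean(corrs, axis=0)
    W = inv_sqrt(mean_c)
    return np.array([W @ c @ W for c in corrs]), W
```

After this transformation, the average correlation structure is the identity, so residual dependence between node groups (views) is suppressed, which is what the independence assumption of the model requires.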

Results of clustering
The number of views was estimated as 24 and 29 for Rest1 and Rest2, respectively (Fig.12). For Rest1, the number of ROIs in a view ranged from 2 to 29, while for Rest2, it ranged from 4 to 45. There was a large variation over views in the number of subject clusters: it ranged from 2 to 52 for Rest1 and from 2 to 50 for Rest2 (Fig.12, Fig.13). Next, we examined the stability of the clustering results as in the case of task fMRI, focusing on the 30 best models for each dataset (Fig.14). We evaluated the stability of view memberships between the optimal model and these 30 models using the Dice coefficient. The extent of stability varied depending on the view; however, there was a general tendency toward greater stability as the number of subject clusters increased.

Comparison between Rest1 and Rest2
We identified pairs of views of Rest1 and Rest2 in terms of the agreement of both view memberships and subject cluster memberships between Rest1 and Rest2 (Fig.15). It was found that the agreement was significant in both respects based on permutation tests (p < 0.05) for three pairs of views (pairs 1-3). For these pairs, there was good agreement between the subject cluster memberships of Rest1 and Rest2, whereas the number of subject clusters in a view was considerably large. Hence, it was hypothesized that there might be good agreement at the level of individual subjects. To examine this hypothesis, we evaluated the accuracy of the matching of subjects between Rest1 and Rest2, which is analogous to the functional connectome fingerprint [4]. In this matching, we used the 'Wishart' distance between two correlation matrices, focusing on the relevant ROIs in a view. That is, we evaluated the similarity between a correlation matrix A and a correlation matrix B by means of a Wishart likelihood-based measure, setting T to the number of datapoints in the BOLD signals. Note that this measure is not symmetric in A and B. In our context, A denotes the correlation matrix of a matched subject, whereas B denotes that of a matching subject (see more details in the caption of Fig.16). In this manner, we evaluated the accuracy of the matching of subjects in both directions for Rest1 and Rest2 (Fig.16a, b). We found that the accuracy of the matching of subjects was high for those three pairs. In particular, the accuracy reached more than 0.95 for pair 1 and pair 3: for pair 1, the accuracy was 0.96 in Rest1 and 0.97 in Rest2, whereas for pair 3, the accuracy was 0.97 in both Rest1 and Rest2. In contrast, for pair 2, the accuracy deteriorated to 0.64 in Rest1 and 0.61 in Rest2.
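The asymmetric matching can be sketched as follows. Since the exact similarity formula is elided in the text, the sketch below assumes one plausible form: the log-likelihood (up to B-only constants) of T datapoints with sample correlation B under a model whose correlation is A, which is Wishart-based and asymmetric as described:

```python
import numpy as np

def wishart_similarity(A, B, T):
    """Plausible sketch of the asymmetric 'Wishart' distance: Gaussian/
    Wishart log-likelihood (up to B-only constants) of T datapoints with
    sample correlation B under model correlation A. Assumption, not the
    paper's verbatim formula."""
    _, logdet_a = np.linalg.slogdet(A)
    return -0.5 * T * (logdet_a + np.trace(np.linalg.solve(A, B)))

def match_subjects(corrs_ref, corrs_query, T):
    """For each query matrix (matching subject B), return the index of
    the most similar reference matrix (matched subject A)."""
    return [int(np.argmax([wishart_similarity(A, B, T) for A in corrs_ref]))
            for B in corrs_query]
```

Matching accuracy is then the fraction of subjects whose best-scoring reference matrix belongs to the same subject, evaluated in both directions (Rest1→Rest2 and Rest2→Rest1).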
Regarding the sub-network of the brain, the relevant ROIs for pairs 1-3 were in the following brain regions: prefrontal, motor strip, parietal, and temporal (left) for pair 1; temporal and occipital (right and left) for pair 2; and prefrontal, motor strip, parietal, and temporal (right) for pair 3 (Fig.17). The result about the relevant brain regions is discussed in more detail in section 5.1.

Discussion
The main contribution of the present paper is to propose a new multiple-view clustering method for correlation matrices without vectorization, which provides a useful framework for cluster analysis of FC data. We formulated the proposed method based on Wishart mixture models by means of extension from a single view to multiple views in the nonparametric Bayesian framework. With the whitening procedure for correlation matrices, the usefulness of the present method was demonstrated both in synthetic and real fMRI data.
In this section, first, we discuss the clustering results for the fMRI data from a neuroscientific point of view. Focusing on the relevant brain regions, we compared the results with those of previous studies. In addition, we provide a possible interpretation for connectome fingerprinting using the age of the subjects. Second, we discuss important features of the proposed method, including its limitations.

Interpretation of the clustering results
In the task fMRI data, we identified 12 relevant ROIs for the task difference between MOTOR and WM. Regarding the brain regions, these 12 ROIs were located in the motor strip and parietal regions (Fig.11). This result is consistent with previous studies. In [60], it was hypothesized that the MOTOR task was relevant for the motor and somatosensory cortices, whereas WM tasks engaged various brain regions, including the prefrontal cortex, inferior frontal cortex, precentral gyrus, and anterior cingulate. The systematic study [61], which focused on BOLD signal changes, reported that the supplementary motor cortex is most relevant for the MOTOR task, whereas the middle frontal gyrus is most relevant for WM. Further, they suggested that the effect size is larger for the MOTOR task, which gives a possible explanation of why only MOTOR-related ROIs were identified in the present study. It is of note that the inter-task difference in FC is smaller than the difference in FC between task and resting state [56], which suggests that identifying task differences is more challenging than identifying task-rest differences.
In the resting-state fMRI data, the relevant brain regions for the three pairs 1-3 are identified in the heuristic approach as follows: Prefrontal, motor strip, parietal, and temporal (left) for pair 1; temporal and occipital (right and left) for pair 2; and prefrontal, motor strip, parietal and temporal (right) for pair 3 (Fig.17). Further, it was found that the accuracy of the matching of subjects was quite high for pair 1 and pair 3, whereas it deteriorated for pair 2. These results are consistent with a previous study [4], which demonstrated that the frontoparietal networks could distinguish between individuals, whereas the visual networks less so. Our results are consistent with their analysis of the contributions of individual FCs. The uniqueness of our approach is that we obtained these results in a data-driven manner without assuming connectivity networks in advance.
Moreover, we characterized subject clusters for pair 3, turning our attention to clusters with a large sample size. Among the three pairs in question, the largest overlap of subjects between clusters was observed in pair 3: between cluster 1 in view 22 of Rest1 and cluster 1 in view 28 of Rest2 (Fig.18). The number of overlapping subjects was ten, whereas 34 subjects belonged to either of these clusters. We consider that the subjects belonging to the same cluster were homogeneous in terms of FC in the relevant ROIs. Ideally, such subjects would belong to the same cluster in both Rest1 and Rest2. However, because of the small sample size of the overlapping subjects in the present study, we did not require this, focusing instead on the 34 subjects that belonged to either of these clusters. We hypothesized that these subjects had a common attribute other than FC. To test this hypothesis, we characterized these 34 subjects (referred to as 'group 1') against the remainder of the subjects ('group 2'), using basic subject information on sex and age. Statistical tests suggested that the sex difference between the two groups was not significant at the 0.05 level (p-value = 0.58 by χ2 test), whereas the age difference was significant (p = 0.0018 by Mann-Whitney U test). Further analysis of age showed that the subjects in group 1 were older than those in group 2. Considering that the subjects in group 1 were more homogeneous than those in group 2, this result suggests that as age increases, the connectome fingerprint for pair 3 becomes more homogeneous.
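The two group-comparison tests can be reproduced schematically with scipy; the data below are illustrative, not the actual HCP subject information:

```python
from scipy.stats import chi2_contingency, mannwhitneyu

def compare_groups(sex1, sex2, age1, age2):
    """Chi-squared test for a sex difference and Mann-Whitney U test for
    an age difference between two groups of subjects."""
    table = [[sum(s == 'M' for s in sex1), sum(s == 'F' for s in sex1)],
             [sum(s == 'M' for s in sex2), sum(s == 'F' for s in sex2)]]
    _, p_sex, _, _ = chi2_contingency(table)
    _, p_age = mannwhitneyu(age1, age2, alternative='two-sided')
    return p_sex, p_age
```

The non-parametric Mann-Whitney U test is appropriate here because HCP ages are reported in coarse bins and need not be normally distributed.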

Features of the proposed method
The proposed method is unique in that it reveals different clustering patterns of objects depending on a specific subset of nodes. An exclusive partition of nodes defines multiple views in which objects are clustered in a specific subset of nodes. Both the partition of nodes and object clustering are simultaneously optimized based on a probability model. Importantly, the structure of the correlation matrix is kept without vectorization. In the simulation study of synthetic data, it was shown that the proposed method outperformed the other methods. It is worth noting that the multiple-view method based on vectorized matrices (i.e., multiple-view clustering based on Gaussian mixture models) failed to recover both view and object cluster structures. A possible reason for this poor performance is as follows. Due to the generating model of synthetic data, the distributions of vectorized features differ significantly. This makes it difficult to fit the data with a single univariate Gaussian distribution, which is assumed in the multiple-view method based on vectorized matrices. This suggests a limitation of this type of multiple-view clustering method, which relies on the vectorization of a correlation matrix.
The proposed method's key assumption is independence between views, which is effectively addressed by means of whitening of correlation matrices.
To elucidate the view structure, the idea of whitening the correlation matrices is novel. Combined with this whitening procedure, the usefulness of the proposed method was demonstrated on both synthetic and fMRI data.
The proposed method has only a few tuning parameters: the concentration parameter α of the CRP for inferring the number of views and the number of clusters. In all applications of the proposed method in the present paper, we set the concentration parameter α to one, which is the default setting in conventional applications of the CRP [62]. As regards the computing complexity, the proposed method has a specific property. As can be seen in Eq.(3), the dimension of the correlation matrix decreases as the number of views increases, leading to a reduction of the computing complexity. To take this effect into account, we discuss the computing complexity focusing on the number of views V. For simplicity, we fix both the number of subject clusters and the number of iterations without explicitly including them in the evaluation. Further, we assume the same number of nodes over different views, denoting it as p′, i.e., p′ = p/V, where p is the number of all nodes. In this setting, the computing complexity for updating the subject cluster membership is given by O(nV p′³), where n is the sample size. Note that we take the computing complexity of the determinant of a p′ × p′ matrix to be p′³. On the other hand, the computing complexity for updating the view membership is of order O(nV p′² + pV p′³), where the first term is due to the evaluation of the mean correlation matrix in a subject cluster. By substituting p′ = p/V, the overall computing complexity becomes O((n + p)p³/V²). This result implies that the proposed method can work efficiently for a larger number of views, given a fixed number of nodes. Next, we evaluate the computing complexity of the multiple-view clustering based on Gaussian mixture models, which is applied to vectorized matrices. For simplicity, we consider the Gibbs sampling version of [15].
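Under the stated assumptions (equal view sizes p′ = p/V, fixed numbers of clusters and iterations), the substitution can be written out as a short check; the view-update term O(pVp′³) below is our reading of the dominant cost (one p′ × p′ determinant per node and candidate view), not a formula quoted from the paper:

```latex
% Subject-cluster update: one p' x p' determinant per sample and view
O(nVp'^3) = O\!\left(nV\,\frac{p^3}{V^3}\right) = O\!\left(\frac{np^3}{V^2}\right)
% View-membership update: one p' x p' determinant per node and candidate view
O(pVp'^3) = O\!\left(pV\,\frac{p^3}{V^3}\right) = O\!\left(\frac{p^4}{V^2}\right)
% Sum of the dominant terms, matching the stated overall complexity
O\!\left(\frac{np^3}{V^2} + \frac{p^4}{V^2}\right)
  = O\!\left(\frac{(n+p)\,p^3}{V^2}\right)
```

Both dominant terms carry the factor 1/V², which is why the cost per iteration decreases as the inferred number of views grows.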
The computing complexity of such a method scales with the total number of vectorized elements, p², and therefore does not decrease as the number of views increases.

It is expected that the present method may provide a useful framework to better understand the brain-related heterogeneity of subjects, such as the subtyping of psychiatric disorders, which is currently an active area of research. The task fMRI analysis in the present study implies that the present method can identify subject groups that differ in FC patterns in a particular sub-network of the brain. It is worth noting that in the additional experiment on task fMRI, the task difference of MOTOR-WM was not recovered when we imposed the constraint of a single-view structure. This suggests the usefulness of the proposed method, which explicitly models the underlying multiple-view structure. However, the reproducibility of subject clusters could not be achieved in the analysis of resting-state fMRI. Nonetheless, this does not rule out the possibility of identifying subtypes of psychiatric disorders using resting-state fMRI, which may stably differ in connectivity patterns. Alternatively, it might be a good option to focus on views with better reproducibility, utilizing resting-state fMRI scanned at least twice.
Finally, we discuss the limitations of the proposed method. First, the proposed method applies only to a correlation matrix or, more generally, a Gram matrix, which is positive definite. For other types of connectivity matrices, one may consider a probability distribution other than the Wishart distribution; nonetheless, the incorporation of a multiple-view structure into a probability model would be similar. Second, the proposed method provides a cluster solution of objects based on different correlation matrix patterns, but the difference between clusters is not necessarily obvious by visual inspection. This is because a cluster solution is based not on a single element but on several elements of the correlation matrix. To address this problem, dimension-reduction methods may be useful. Third, the proposed method is based on the assumption that the correlation matrix follows a Wishart distribution in each cluster block. This assumption may not hold for real FC data; in that case, a yielded cluster should be considered just as a mixture component.
Note that such a situation may in general arise in clustering using mixture models. Since it is expected that a non-Wishart cluster may be fitted by several Wishart mixture components, the merging of several components in close proximity can theoretically provide information on the true cluster structure. However, we did not address this issue in the present paper because of the complication of the problem.

Conclusion
For the analysis of fMRI data, we proposed a novel multiple-view clustering method for correlation matrices. The main feature of this method is to optimally partition nodes into several exclusive groups (views), while objects are clustered using only the relevant nodes for each view. The assumption of independence between views is addressed by means of whitening correlation matrices. It is of particular note that the structure of correlation matrices is preserved, being evaluated with the Wishart distribution. The partitioning of nodes and the clustering of objects, including the number of views and the number of clusters in each view, are performed simultaneously with no tuning parameters beyond the CRP concentration parameter. The power of this novel method was shown in a simulation study on synthetic data, where it outperformed other clustering methods.
The proposed method provides a useful framework for the analysis of FC data. In the analysis of task fMRI data, it was shown that the proposed method identifies task differences in a data-driven manner. In addition, in the analysis of resting-state fMRI data, the proposed method identifies ROI networks related to connectome fingerprints in a data-driven manner. It is expected that the proposed method may provide a new framework for clustering problems in various contexts, such as the subtyping of psychiatric disorders.
We thank Dr. Hiroshi Morioka at RIKEN for his useful comment on a whitening procedure.
Figure 1 (caption fragment): Color denotes Pearson's correlation (the diagonal part is set to zero). Note that in the conventional terminology, a correlation matrix denotes the association between two variables; here, instead of 'variable', we use the term 'node', which is more appropriate as a synonym for ROI in the present paper. If we focus on the first 50 nodes (nodes 1), we have a two-cluster solution, as denoted in view 1. On the other hand, if we focus on the last 50 nodes (nodes 2), we have a three-cluster solution, as in view 2. Note that we assume independent networks nodes 1 and nodes 2 (the between-network correlation is zero), and an exclusive partition of nodes in which each node belongs to only a single view. Panel c: Illustration of Panel b in terms of undirected weighted graphs. For simplicity, we consider three nodes each for nodes 1 and nodes 2. The strength of connectivity (weight) between two nodes is denoted by the color of an edge; the absence of an edge denotes zero weight.

Figure 2: Results of the simulation study on synthetic data. Panels a, b: Recovery of views and recovery of object clusters for Type 1 data (background correlation 0 between views) for the proposed method ('MultipleWishart'), hierarchical + K-means ('Hier-Kmeans'), community detection + K-means ('WSBM-Kmeans'), K-means ('K-means'), and multiple-view clustering based on Gaussian mixture models ('MultipleGauss'). Panels c, d: The same for Type 2 data (background correlation 0.2 between views). For all panels, the horizontal axis denotes the noise weight w in Eq.(12). Recovery is evaluated by the mean Adjusted Rand Index over 100 replications of datasets for each configuration. Note that recovery of views is not applicable for K-means, which does not yield a solution for the view structure, nor for the multiple-view clustering based on Gaussian mixture models, which yields views based on matrix elements rather than nodes.

Caption (subject-cluster color bars): The horizontal axis denotes view ID, whereas the vertical axis denotes the demarcation of subject clusters. In each color bar, subjects are sorted in ascending order of subject cluster ID, and subject cluster IDs are sorted in advance in descending order of cluster size. Cluster differences are denoted by color, but note that the same color may be reused for clusters that are far apart.

Caption (correlation matrices in view 16): The first column denotes the mean correlation matrices over subjects for subject cluster 1 (top) and subject cluster 2 (bottom). The second to fifth columns display a correlation matrix for several subjects in the corresponding subject cluster. Both axes denote ROI ID in view 16; note that these are 'whitened' ROIs after the whitening pre-processing.

B Chinese Restaurant Process (CRP)
A prior distribution of the view membership u (a p-dimensional vector) is given by a Chinese Restaurant Process (CRP) as follows [62,63]: the j-th node joins an existing view k with probability p(u_j = k | u_1, …, u_{j−1}) = N_k / (j − 1 + α), and a new view with probability α / (j − 1 + α), where K is the number of views; N_k is the number of nodes that belong to view k; and α is a hyperparameter (concentration parameter). An important feature of the CRP is that there is no need to fix the number of views K in advance.
Theoretically, we can consider an infinite number of views (K = ∞), the optimal number being inferred in a data-driven manner based on the posterior distribution of u. We set the concentration parameter α to one. For membership vectors z v and y v , priors are defined in a similar manner based on CRP.
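A minimal sampler for this prior, assuming the standard sequential (seating) scheme of the CRP (the function name is illustrative):

```python
import random

def sample_crp(p, alpha=1.0, seed=0):
    """Draw a view membership for p nodes from a Chinese Restaurant
    Process: node j joins existing view k with probability
    N_k / (j + alpha) and opens a new view with probability
    alpha / (j + alpha)."""
    rng = random.Random(seed)
    u, counts = [], []
    for j in range(p):
        weights = counts + [alpha]   # existing views plus one new view
        r = rng.uniform(0, j + alpha)  # total weight is j + alpha
        acc, k = 0.0, 0
        for k, w in enumerate(weights):
            acc += w
            if r < acc:
                break
        if k == len(counts):
            counts.append(1)  # open a new view
        else:
            counts[k] += 1
        u.append(k)
    return u
```

Because larger views attract new nodes with higher probability, the number of occupied views grows only logarithmically with p for fixed α, which is what makes the data-driven inference of K tractable.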
C Prior for degree of freedom T
We assume a uniform categorical distribution for the prior of the degree of freedom T in the Wishart distribution as follows: T ∼ Cat(·|c, p), where c and p are the sample space and the event probabilities, respectively. In the present paper, we set c = (T_1, …, T_q) and p_i = 1/q (i = 1, …, q). These hyperparameters are given by T_1 = p + 5, T_2 = p + 5 + δ, …, T_q = p + 5 + (q − 1)δ, where p is the number of (all) nodes; δ = 3; and q = max{n ∈ Z | T_n ≤ T*}. We assume the upper bound T* = max{2p, T_ori}, where T_ori is the number of datapoints of the original time series X.
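Under the definitions above (T_1 = p + 5, step δ = 3, truncation at T* = max{2p, T_ori}), the sample space c and the uniform probabilities can be constructed as follows (the function name is illustrative):

```python
def dof_grid(p, T_ori, delta=3):
    """Sample space for the degree-of-freedom prior: T_1 = p + 5 and
    T_i = p + 5 + (i - 1) * delta, truncated at T* = max(2p, T_ori);
    each value has probability 1/q."""
    T_star = max(2 * p, T_ori)
    grid = []
    T = p + 5
    while T <= T_star:
        grid.append(T)
        T += delta
    q = len(grid)
    probs = [1.0 / q] * q
    return grid, probs
```

The lower bound p + 5 keeps every candidate T safely above the dimension p, so the Wishart distribution remains well defined for all grid values.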

D Algorithm

Initialization
We initialize relevant parameters as follows.
• Degree of freedom T : T = max{2p, T ori }, where T ori is the number of datapoints in the original time series data X; p is the number of nodes.
• View membership u: Randomly drawn from a prior distribution based on CRP in Eq. (14).
• Object cluster membership z v : Randomly drawn from a prior distribution based on CRP in Eq. (14).
• Node cluster membership y v : all ones (assuming a single node cluster per view).

Optimization
Starting from the aforementioned initial values, we iteratively update the parameters to find MAP estimates (Algorithm 1). We repeat this procedure for 1000 random initializations. By default, we select the best model in terms of the posterior. However, to select a stable model in the applications to fMRI data, we use the model-selection procedure described in the main manuscript.