Optimal Clustering with Missing Values

Missing values frequently arise in modern biomedical studies for various reasons, including skipped tests and the complexity of profiling technologies for different omics measurements. Missing values complicate the application of clustering algorithms, whose goal is to group points based on some similarity criterion. A common practice for dealing with missing values in the context of clustering is to first impute the missing values and then apply the clustering algorithm to the completed data. We consider missing values in the context of optimal clustering, which finds an optimal clustering operator with reference to an underlying random labeled point process (RLPP). We show how the missing-value problem fits neatly into the overall framework of optimal clustering by incorporating the missing-value mechanism into the random labeled point process and then marginalizing out the missing-value process. In particular, we demonstrate the proposed framework for the Gaussian model with arbitrary covariance structures. Comprehensive experimental studies on both synthetic and real-world RNA-seq data show the superior performance of the proposed optimal clustering with missing values when compared to various clustering approaches. Optimal clustering with missing values obviates the need for imputation-based pre-processing of the data, while at the same time achieving smaller clustering errors.

appropriate number of clusters. For example, in a Bayesian framework, utilizing the Bayes factor can incorporate both a priori knowledge of different models and the goodness of fit of the parametric model to the observed data. Moreover, nonparametric models such as Dirichlet-process mixture models [5] provide a more flexible approach to clustering by automatically learning the number of components. In small-sample settings, model-based approaches that incorporate model uncertainty have proved successful in designing robust operators [6][7][8][9], and in objective-based experiment design to expedite the discovery of such operators [10][11][12].
Whereas classification theory is grounded in feature-label distributions, with the error being the probability that the classifier mislabels a point [6,13], clustering algorithms operate on random labeled point processes (RLPPs), with the error being the probability that a point will be placed into the wrong cluster (partition) [14]. An optimal (Bayes) clusterer minimizes the clustering error and can be found with respect to an appropriate representation of the clustering error [15].
A common problem in clustering is the existence of missing values. These are ubiquitous with high-throughput profiling technologies, such as microarrays [16] and RNA sequencing (RNA-seq) [17]. For instance, with microarrays, missing data can occur due to poor resolution, image corruption, or dust or scratches on the slide [18], while for RNA-seq, the sequencing machine may fail to detect genes with low expression levels owing to the random sampling nature of sequencing technologies. As a result of these missing-data mechanisms, gene expression data from microarray or RNA-seq experiments are usually in the form of large matrices, with rows and columns corresponding to genes and experimental conditions or different subjects, respectively, with some values missing. Imputation methods, such as MICE [19], Amelia II [20] and missForest [21], are usually employed to complete the data matrix before clustering analysis; however, in small-sample settings, which are common in genomic applications, these methods face difficulties, including collinearity due to potentially high correlation between genes in samples, which precludes the successful imputation of missing values.
In this paper we follow a different direction by incorporating the generation of missing values with the original generating random labeled point process, thereby producing a new RLPP that generates the actual observed points with missing values. The optimal clusterer in the context of missing values is obtained by marginalizing out the missing features in the new RLPP. One potential challenge arising here is that in the case of missing values with general patterns, conducting the marginalization can be computationally intractable, and hence resorting to approximation methods such as Monte Carlo integration is necessary.
Although the proposed framework for optimal clustering can incorporate the probabilistic modeling of arbitrary types of missing data mechanisms, to facilitate analysis, throughout this work we assume data are missing completely at random (MCAR) [22]. In this scenario, the parameters of the missingness mechanism are independent of other model parameters and therefore vanish after the expectation operation in the calculation of the posterior of label functions for clustering assignment.
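The reason the missingness parameters vanish can be sketched in one line (notation ours: r is the vector of missingness indicators, ψ its parameters, and θ the RLPP parameters, with ψ and θ a priori independent):

```latex
P(x_{\mathrm{obs}}, r \mid \theta, \psi)
  = P(r \mid \psi)\, P(x_{\mathrm{obs}} \mid \theta)
\quad\Longrightarrow\quad
P(\phi_S \mid x_{\mathrm{obs}}, r)
  \propto P(\phi_S)\, P(x_{\mathrm{obs}} \mid \phi_S),
```

since under MCAR the factor P(r | ψ) does not depend on the label function and cancels in the normalization.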
We derive the optimal clusterer for different scenarios in which features are distributed according to multivariate Gaussian distributions. The performance of this clusterer is compared to various methods, including k-POD [23] and fuzzy c-means with optimal completion strategy [24], which are methods for directly clustering data with missing values, and also k-means [25], fuzzy c-means [26] and hierarchical clustering [27] with the missing values imputed. Comprehensive simulations based on synthetic data show the superior performance of the proposed framework for clustering with missing values over a range of simulation setups. Moreover, evaluations based on RNA-seq data further verify the superior performance of the proposed method in a real-world application with missing data.

Optimal clustering
Given a point set S ⊂ R^d, where d is the dimension of the space, denote the number of points in S by η(S). A random labeled point process (RLPP) is a pair (Ξ, Λ), where Ξ is a point process generating S and Λ generates random labels on point set S. Ξ maps from a probability space to [N; 𝒩], where N is the family of finite sequences in R^d and 𝒩 is the smallest σ-algebra on N such that for any Borel set B in R^d, the mapping S → η(S ∩ B) is measurable. A random labeling is a family, Λ = {Φ_S : S ∈ N}, where Φ_S is a random label function on the point set S in N. Denoting the set of labels by L = {1, 2, ..., l}, Φ_S has a probability mass function on L^S defined by P_S(φ_S) = P(Φ_S = φ_S | Ξ = S), where φ_S : S → L is a deterministic function assigning a label to each point in S.
A label operator λ maps point sets to label functions, λ(S) = φ_{S,λ} ∈ L^S. For any set S, label function φ_S and label operator λ, the label mismatch error is defined as

ε_λ(S, φ_S) = (1/η(S)) Σ_{x∈S} I[φ_S(x) ≠ φ_{S,λ}(x)],

where I[A] is an indicator function equal to 1 if A is true and 0 otherwise. The error of the label function λ(S) is computed as ε_λ(S) = E_{Φ_S}[ε_λ(S, Φ_S) | Ξ = S], and the error of label operator λ for the corresponding RLPP is then defined by ε[λ] = E_Ξ[ε_λ(Ξ)]. Clustering involves identifying partitions of a point set rather than the actual labeling, where a partition of S into l clusters has the form P_S = {S_1, S_2, ..., S_l} such that the S_i are disjoint and S = S_1 ∪ S_2 ∪ ... ∪ S_l. A cluster operator ζ maps point sets to partitions, ζ(S) = P_{S,ζ}. Considering the label-switching property of clustering operators, let us define F_ζ as the family of label operators that all induce the same partitions as the clustering operator ζ. More precisely, a label function φ_S induces partition P_S = {S_1, S_2, ..., S_l} if S_i = {x ∈ S : φ_S(x) = l_i} for distinct l_i ∈ L. Thereby, λ ∈ F_ζ if and only if φ_{S,λ} induces the same partition as ζ(S) for all S ∈ N. For any set S, label function φ_S and cluster operator ζ, define the cluster mismatch error by

ε_ζ(S, φ_S) = min_{λ∈F_ζ} ε_λ(S, φ_S).

As shown in [15], error definitions for partitions can be represented in terms of risk with intuitive cost functions.
Specifically, define G_{P_S} such that φ_S ∈ G_{P_S} if and only if φ_S induces P_S. The error of a partition can be expressed as

ε_ζ(S) = Σ_{P_S ∈ K_S} c_S(ζ(S), P_S) P_S(P_S),

where K_S is the set of all possible partitions of S, P_S(P_S) = Σ_{φ_S ∈ G_{P_S}} P_S(φ_S) is the probability mass function on partitions P_S of S, and the partition cost function between partitions P_S and Q_S of S is defined as

c_S(P_S, Q_S) = min_{φ_S ∈ G_{Q_S}} (1/η(S)) Σ_{x∈S} I[φ_{S,P_S}(x) ≠ φ_S(x)],

with φ_{S,P_S} being any member of G_{P_S}. A Bayes cluster operator ζ* is a clusterer with the minimal error ε[ζ*], called the Bayes error, obtained by a Bayes partition ζ*(S) for each set S ∈ N:

ζ*(S) = arg min_{P_S ∈ K_S} Σ_{Q_S ∈ K_S} c_S(P_S, Q_S) P_S(Q_S).

The Bayes clusterer can be solved for each fixed S individually. More specifically, the search space in the minimization and the set of partitions with known probabilities in the summation can be constrained to subsets of K_S, denoted by C_S and R_S, respectively. We refer to C_S and R_S as the set of candidate partitions and the set of reference partitions, respectively. Following [15], we can search for the optimal clusterer based on both optimal and suboptimal procedures (detailed in the "Results and discussion" section) with derived bounds that can be used to optimally reduce the size of C_S and R_S.
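As an illustration, the Bayes partition for a small point set can be found by brute force: collapse the label-function pmf onto partitions, evaluate the risk of every candidate partition under the misclustering cost, and take the minimizer. The sketch below is our own toy construction (in practice the pmf would come from the posterior of the RLPP):

```python
from itertools import permutations

def partition_of(labels):
    """Canonical partition induced by a label tuple: a frozenset of blocks."""
    blocks = {}
    for i, lab in enumerate(labels):
        blocks.setdefault(lab, set()).add(i)
    return frozenset(frozenset(b) for b in blocks.values())

def cost(P, Q, n):
    """Misclustering cost c_S(P, Q): fraction of points on which the best
    matching of blocks (to handle label switching) still disagrees."""
    P, Q = list(P), list(Q)
    m = max(len(P), len(Q))
    P += [frozenset()] * (m - len(P))   # pad so both have m blocks
    Q += [frozenset()] * (m - len(Q))
    best = min(n - sum(len(P[i] & Q[perm[i]]) for i in range(m))
               for perm in permutations(range(m)))
    return best / n

def bayes_partition(pmf, n):
    """Exhaustive Bayes clusterer; pmf maps label tuples to probabilities."""
    ppart = {}
    for labels, p in pmf.items():        # collapse label functions to partitions
        part = partition_of(labels)
        ppart[part] = ppart.get(part, 0.0) + p
    risk = lambda P: sum(q * cost(P, Q, n) for Q, q in ppart.items())
    return min(ppart, key=risk)
```

Here the candidate set is restricted to partitions with positive posterior mass; the full Bayes clusterer would minimize over all of K_S, which is why the bounds of [15] for pruning C_S and R_S matter in practice.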

Gaussian model with missing values
We consider an RLPP model that generates the points in the set S according to a Gaussian model, where features of x ∈ S can be missing completely at random due to a missing-data mechanism independent of the RLPP. More precisely, the points x ∈ S with label φ_S(x) = i are drawn independently from a Gaussian distribution with parameters ρ_i = {μ_i, Σ_i}. Assuming n_i sample points with label i, we divide the observations into G_i ≤ n_i groups, where all n_ig points in group g have the same set, J_ig, of observed features with cardinality |J_ig| = d_ig. Denoting by S_ig the set of sample points in group g of label i, we represent the pattern of missing data in this group using a d_ig × d matrix M_ig, where each row is a d-dimensional vector with a single non-zero element with value 1 corresponding to the observed feature's index. Thus, the non-missing portion of a point x ∈ S_ig is M_ig x, which is Gaussian with mean M_ig μ_i and covariance M_ig Σ_i M_ig^T. Given ρ = {ρ_1, ρ_2, ..., ρ_l} of independent parameters, to evaluate the posterior probability of random labeling function φ_S ∈ L^S, we have

P(φ_S | S) ∝ P(φ_S) Π_{i=1}^l E_{ρ_i}[ Π_{g=1}^{G_i} Π_{x∈S_ig} N(M_ig x; M_ig μ_i, M_ig Σ_i M_ig^T) ],

where P(φ_S) is the prior probability on label functions, which we assume does not depend on the specific points in S.
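The bookkeeping above can be implemented directly: group points by which features they have observed, and build the selection matrix M_ig for each group. A minimal sketch, with NaN marking missing entries (function names are ours):

```python
import numpy as np

def selection_matrix(observed_idx, d):
    """d_ig x d matrix M_ig: one row per observed feature, with a single 1
    at that feature's index, so M_ig @ x extracts the observed entries."""
    M = np.zeros((len(observed_idx), d))
    for row, j in enumerate(observed_idx):
        M[row, j] = 1.0
    return M

def group_by_pattern(X):
    """Group row indices of X (NaN = missing) by observed-feature pattern."""
    groups = {}
    for idx, x in enumerate(X):
        key = tuple(np.flatnonzero(~np.isnan(x)))
        groups.setdefault(key, []).append(idx)
    return groups
```

With M = selection_matrix([0, 2], 4), the observed part of a point x is M @ x, and its marginal covariance under label i is M @ Sigma_i @ M.T, matching the text.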

Known means and covariances
When the mean and covariance parameters of the label-conditional distributions are known, the expectation over ρ_i in the posterior is unnecessary and the likelihood can be evaluated directly. We define the group-g statistics of label i as

m_ig = (1/n_ig) Σ_{x∈S_ig} M_ig x,   Ψ_ig = Σ_{x∈S_ig} (M_ig x − m_ig)(M_ig x − m_ig)^T,

where m_ig and Ψ_ig are the sample mean and scatter matrix, employing only the observed n_ig data points in group g of label i. The posterior probability of labeling function (7) then can be expressed as

P(φ_S | S) ∝ P(φ_S) Π_{i=1}^l Π_{g=1}^{G_i} (2π)^{−n_ig d_ig/2} |Σ_ig|^{−n_ig/2} exp{ −(1/2) [ tr(Σ_ig^{−1} Ψ_ig) + n_ig (m_ig − M_ig μ_i)^T Σ_ig^{−1} (m_ig − M_ig μ_i) ] },

where Σ_ig = M_ig Σ_i M_ig^T is the covariance matrix corresponding to group g of label i.
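Under known parameters, each group's contribution is just a Gaussian likelihood on the observed sub-vectors. A minimal numerical sketch (function names ours):

```python
import numpy as np

def group_loglik(X_obs, mu, Sigma, M):
    """Log-likelihood of the observed sub-vectors in one group under
    N(M @ mu, M @ Sigma @ M.T); X_obs has one observed sub-vector per row."""
    mu_g = M @ mu
    Sig_g = M @ Sigma @ M.T
    k = len(mu_g)
    _, logdet = np.linalg.slogdet(Sig_g)
    inv = np.linalg.inv(Sig_g)
    ll = 0.0
    for x in X_obs:
        r = x - mu_g
        ll += -0.5 * (k * np.log(2 * np.pi) + logdet + r @ inv @ r)
    return ll
```

Summing group_loglik over all groups and labels, and adding the log-prior of φ_S, gives the unnormalized log-posterior of a labeling for this known-parameters case.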

Gaussian means and known covariances
Under this model, data points are generated according to Gaussians whose mean parameters are random and whose covariance matrices are fixed. Specifically, for label i we assume μ_i ∼ N(m_i, Σ_i/ν_i), where ν_i > 0 and m_i is a length-d real vector. Thus the posterior of the label function given the point set S can be derived as

P(φ_S | S) ∝ P(φ_S) Π_{i=1}^l ∫ N(μ_i; m_i, Σ_i/ν_i) Π_{g=1}^{G_i} Π_{x∈S_ig} N(M_ig x; M_ig μ_i, Σ_ig) dμ_i.

By completing the square and using the normalization constant of the multivariate Gaussian distribution, the integral in this equation can be expressed in closed form.
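The effect of the completing-the-square step can be checked numerically in one dimension: for a single fully observed point, integrating the N(μ; m, σ²/ν) prior against the N(x; μ, σ²) likelihood gives the marginal N(x; m, (1 + 1/ν)σ²). A sketch comparing the closed form against brute-force quadrature (an illustration of the one-point case only, not the paper's general multi-point expression):

```python
import numpy as np

def marginal_pdf_closed(x, m, s2, nu):
    """Closed-form marginal of one point: x ~ N(m, (1 + 1/nu) * s2)."""
    v = (1.0 + 1.0 / nu) * s2
    return np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

def marginal_pdf_numeric(x, m, s2, nu):
    """Same quantity by dense Riemann-sum quadrature over the random mean."""
    mus = np.linspace(m - 10.0, m + 10.0, 20001)
    prior = np.exp(-(mus - m) ** 2 / (2 * s2 / nu)) / np.sqrt(2 * np.pi * s2 / nu)
    lik = np.exp(-(x - mus) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)
    return np.sum(prior * lik) * (mus[1] - mus[0])
```

The agreement of the two functions is the scalar analogue of the completing-the-square identity used in the text.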

Gaussian-inverse-Wishart means and covariances
Under this model, data points are generated from Gaussian distributions with random mean and covariance parameters. More precisely, the parameters associated with label i are distributed as μ_i | Σ_i ∼ N(m_i, Σ_i/ν_i) and Σ_i ∼ IW(κ_i, Ψ_i), where the covariance has inverse-Wishart density

IW(Σ_i; κ_i, Ψ_i) = ( |Ψ_i|^{κ_i/2} / (2^{κ_i d/2} Γ_d(κ_i/2)) ) |Σ_i|^{−(κ_i+d+1)/2} exp{ −(1/2) tr(Ψ_i Σ_i^{−1}) },

with Γ_d the multivariate gamma function. To compute the posterior probability of labeling function (6), we first marginalize out the mean parameters μ_i in a similar fashion to (10), obtaining an expectation over Σ_i alone. This integration has no closed-form solution, so we resort to Monte Carlo integration to approximate it. Specifically, denoting the term in the brackets in Eq. (15) as g(Σ_i), we draw J samples Σ_i^{(1)}, ..., Σ_i^{(J)} from the inverse-Wishart distribution and approximate the expectation by the sample average (1/J) Σ_{j=1}^J g(Σ_i^{(j)}).
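The Monte Carlo step can be sketched with plain NumPy by exploiting the fact that if Σ⁻¹ ∼ Wishart(κ, Ψ⁻¹) with integer κ ≥ d, then Σ ∼ IW(κ, Ψ). Function names are ours; a real implementation would average g over such draws:

```python
import numpy as np

def sample_invwishart(kappa, Psi, rng):
    """One draw Sigma ~ IW(kappa, Psi), for integer kappa >= d, via
    Sigma^{-1} ~ Wishart(kappa, Psi^{-1}) built from normal outer products."""
    d = Psi.shape[0]
    L = np.linalg.cholesky(np.linalg.inv(Psi))   # L @ L.T = Psi^{-1}
    Z = rng.standard_normal((kappa, d)) @ L.T    # rows ~ N(0, Psi^{-1})
    return np.linalg.inv(Z.T @ Z)

def mc_expectation(g, kappa, Psi, J=4000, seed=0):
    """Monte Carlo approximation of E[g(Sigma)] under Sigma ~ IW(kappa, Psi)."""
    rng = np.random.default_rng(seed)
    return sum(g(sample_invwishart(kappa, Psi, rng)) for _ in range(J)) / J
```

A quick sanity check: E[Σ] = Ψ/(κ − d − 1), so with d = 2, κ = 10 and Ψ = 7I the (0, 0) entry should average to about 1.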

Results and discussion
The performance of the proposed method for optimal clustering with values missing at random is compared with some suboptimal versions, two other methods for clustering data with missing values, and also classical clustering algorithms with imputed missing values. The performance comparison is carried out on synthetic data generated from different Gaussian RLPP models with different missing-probability setups, and also on a publicly available breast cancer dataset generated by the TCGA Research Network (https://cancergenome.nih.gov/). In our experiments, the results of the exact optimal solution for the RLPP with missing at random (Optimal) are provided for smaller point sets, i.e. wherever computationally feasible. We have also tested two suboptimal solutions, similar to the ideas in [15], for an RLPP with missing at random. In the first one (Subopt. Pmax), the set of reference partitions (R_S) is restricted to a closed ball of a specified radius centered on the partition with the highest probability, where the distance between two partitions is defined as the minimum Hamming distance between labels inducing the partitions. In both Optimal and Pmax, the reference set is further constrained to the partitions that assign the correct number of points to each cluster, but the set of candidate partitions (C_S) includes all 2^{n−1} possible two-cluster partitions of n points. In the second suboptimal solution (Subopt. Pseed), a local search within Hamming distance 1 is performed starting from five random initial partitions to approximately find the partition with (possibly local) maximum probability. Then the sets of reference and candidate partitions are constrained to the partitions with correct cluster sizes within a specified Hamming distance from the found (local) maximum-probability partition. The bounds derived in [15] for reducing the sets of candidate and reference partitions are used, where applicable, in Optimal, Pseed, and Pmax.
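The Pseed-style seeding step can be sketched as a greedy Hamming-1 local search over binary label vectors (our own minimal version; log_post stands in for the log-posterior of a labeling):

```python
import numpy as np

def local_search(log_post, n, n_starts=5, seed=0):
    """Greedy search: from each random start, flip single labels
    (Hamming-distance-1 moves) as long as the posterior improves."""
    rng = np.random.default_rng(seed)
    best, best_val = None, -np.inf
    for _ in range(n_starts):
        labels = rng.integers(0, 2, n)
        val = log_post(labels)
        improved = True
        while improved:
            improved = False
            for i in range(n):
                labels[i] ^= 1          # try flipping label i
                v = log_post(labels)
                if v > val:
                    val, improved = v, True
                else:
                    labels[i] ^= 1      # revert the flip
        if val > best_val:
            best, best_val = labels.copy(), val
    return best, best_val
```

The returned (local) maximizer then anchors the restricted reference and candidate sets, as described above.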
In all scenarios, k-POD and fuzzy c-means with optimal completion strategy (FCM-OCS) are directly applied to the data with missing values. In the simulations in [24], where FCM-OCS is introduced, the authors initialize the cluster centers by applying ordinary fuzzy c-means to the complete data, i.e. using knowledge of the missing values. To have a fair comparison with the other methods, we calculate the initial cluster centers for FCM-OCS by applying fuzzy c-means to the subset of points with no missing features for lower missing rates. For higher missing rates we impute the missing values by the mean of the corresponding feature values across all points, and then apply fuzzy c-means to all the points to initialize the cluster centers. In order to apply the classical algorithms, the missing values are imputed according to [28], by employing a multivariate Gibbs sampler that iteratively generates samples for missing values and parameters based on the observed data. The classical algorithms in our experiments are k-means (KM), fuzzy c-means (FCM), hierarchical clustering with single linkage (Hier. (Si)), and hierarchical clustering with complete linkage (Hier. (Co)). Moreover, results for a completely random clusterer (Random) are also included for performance comparison.

Simulated data
In the simulation analysis, the number of clusters is fixed at 2, and the dimensionality of the RLPPs (number of features) is set to 5. Additional results for 20 features are provided in Additional file 1. Point generation is done based on a Gaussian mixture model (GMM). Three different scenarios for the parameters of the GMM are considered: i) fixed known means and covariances; ii) known covariances and unknown means with Gaussian distributions; iii) unknown means and covariances with Gaussian-inverse-Wishart distributions. We select the values of the parameters of the point-generation process to have an approximate Bayes error of 0.15. The selected values are shown in Table 1. For point-set generation, the number of points from each cluster is fixed a priori. The distributions are first drawn from the assumed model, and then the points are generated based on the drawn distributions. A subset of the points' features is randomly selected to be hidden, missing at random with different missing probabilities. Four different setups for the number of points are considered in our simulation analysis: 10 points from each cluster (n_1 = n_2 = 10), 12 points from one cluster and 8 from the other (n_1 = 12, n_2 = 8), 35 points from each cluster (n_1 = n_2 = 35), and 42 points from one cluster and 28 from the other (n_1 = 42, n_2 = 28). With unequal-sized clusters, in half of the repetitions n_1 points are generated from the first distribution and n_2 points from the second, and vice versa in the other half. In each simulation repetition, all clustering methods are applied to the points to generate a vector of labels that induces a two-cluster partition. The label vector predicted by each method is compared with the true label of each point in the point set to calculate the error of that method on that point set.
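Hiding features completely at random is one line per entry; a sketch of the masking step used in the simulations (our helper, with NaN marking missing values):

```python
import numpy as np

def mcar_mask(X, p, seed=0):
    """Hide each entry of X independently with probability p (MCAR).
    Returns a new float array; missing entries are set to NaN."""
    rng = np.random.default_rng(seed)
    X = X.astype(float)                   # astype copies, original untouched
    X[rng.random(X.shape) < p] = np.nan
    return X
```

Because the mask is drawn independently of the points and labels, this matches the MCAR assumption under which the missingness parameters drop out of the posterior.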
In other words, for each method the number of points assigned to a cluster different from their true one is counted (after accounting for the label-switching issue) and divided by the total number of points (n = n_1 + n_2) to calculate the clustering error of that method on the point set. These errors are then averaged across repetitions. The average results for the fixed mean vectors and covariance matrices across 100 repetitions are shown in Fig. 1. Here, the Hamming distance threshold for reference partitions in Pmax and Pseed is fixed at 1. It can be seen that Optimal, Pmax, and Pseed outperform all the other methods in all the smaller sample-size settings, and Pmax and Pseed have virtually the same performance as Optimal. For the larger sample-size settings, where only Pseed is applied, its superior performance compared with the other methods is clear from the figure. Under the Gaussian means and fixed covariances model, the optimal and suboptimal solutions again have lower average errors than all the other methods. We can see that for balanced clusters the suboptimal and optimal solutions have very close performance, but for the unbalanced-clusters case with higher missing probabilities the difference between Optimal and Pmax and Pseed becomes noticeable. For larger sample sizes Pseed consistently outperforms the other methods, although for lower missing probabilities it has closer competitors. In all cases, as the missing probability increases, the superior performance of the proposed methods becomes more significant. The average results under the unknown mean vectors and covariance matrices with Gaussian-inverse-Wishart distribution over 40 repetitions are provided in Fig. 3. In the smaller sample-size cases, the Hamming distance threshold in Pmax and Pseed is fixed at 8, and we can see that the proposed suboptimal (Pmax and Pseed) and optimal clustering with missing values have very close average errors, all much lower than the other methods' errors.
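For two clusters, the per-point-set clustering error described at the start of this subsection reduces to taking the better of the two label matchings:

```python
import numpy as np

def clustering_error(pred, true):
    """Fraction of misclustered points for a two-cluster labeling,
    minimized over the two possible label matchings (label switching)."""
    pred, true = np.asarray(pred), np.asarray(true)
    mis = int(np.sum(pred != true))
    return min(mis, len(true) - mis) / len(true)
```

For more than two clusters the minimum would instead run over all permutations of cluster labels.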
For larger sample sizes, only the results for missing probability equal to 0.15 are shown vs. the Hamming distance threshold used to define the reference partitions in Pseed. Again, Pseed performs better than the other methods.

RNA-seq data
In this section the performance of the clustering methods is examined on a publicly available RNA-seq dataset of breast cancer. The data are available from The Cancer Genome Atlas (TCGA) [29] and were procured with the R package TCGA2STAT [30]. The dataset consists of matched tumor and normal samples, with 97 points from each. The original data are in terms of the number of sequence reads mapped to each gene. RNA-seq data are integer-valued, highly skewed, and over-dispersed [31][32][33][34][35]. Thus, we apply the variance-stabilizing transformation (VST) [36] implemented in the DESeq2 package [37], which transforms the data to a log2-like scale normalized with respect to library size. For all subsequent analysis, other than for calculating clustering errors, we assume that the labels of the data are unknown. Feature selection is performed in a completely unsupervised manner, since in clustering no labels are known in practice. The top 10 genes in terms of variance-to-mean ratio of expression are picked as features to be used in the clustering algorithms. In general, for setting prior hyperparameters, external sources of information such as signaling pathways can be leveraged where available [38,39]. Here, we only use a subset of the discarded gene expressions, i.e. the next 90 top genes (in terms of variance-to-mean ratio of expression), for prior hyperparameter calibration for the optimal/suboptimal approaches. We follow the approach in [40] and employ the method of moments for prior calibration, but unlike [40], a single set of hyperparameters is estimated and used for both clusters, since the labels of the data are not available. It is well known that in small-sample settings, estimation of covariance matrices, scatter matrices and even mean vectors may be problematic. Therefore, similar to [40], we assume a structured form for the hyperparameters and estimate only five scalars (m, σ², ρ, κ, ν) from the data.
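The unsupervised feature-selection step can be sketched as ranking genes by their variance-to-mean ratio (a minimal helper of our own; expr is a samples × genes matrix of transformed expression values):

```python
import numpy as np

def top_vmr_genes(expr, k):
    """Indices of the k genes (columns) with the largest
    variance-to-mean ratio of expression."""
    vmr = expr.var(axis=0) / expr.mean(axis=0)
    return np.argsort(vmr)[::-1][:k]
```

In the setup above, the top 10 genes feed the clustering algorithms, and the next 90 are reserved for prior hyperparameter calibration.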
In each repetition, stratified sampling is done, i.e. n_1 and n_2 points are sampled randomly from each group (normal and tumor). When n_1 ≠ n_2, in half of the repetitions n_1 and n_2 points are randomly selected from the normal and tumor samples, respectively, and vice versa in the other half. Prior calibration is performed in each repetition, and 15% of the selected feature values are considered missing. Similar to the experiments on the simulated data, the clustering error of each method in each iteration is calculated by comparing the predicted and true labels of the sampled points (accounting for the label-switching issue), and the average results over 40 repetitions are provided in Fig. 4. It can be seen that the proposed optimal clustering with missing values and its suboptimal versions outperform the other algorithms. It is worth noting that the performance of Pseed is more sensitive to the selected Hamming distance threshold for reference partitions than in the results on simulated data.

Conclusion
The methodology employed in this paper is very natural. As with any signal processing problem, the basic problem is to find an optimal operator from a class of operators given the underlying random process and a cost function, which is often an error associated with operator performance. While it may not be possible to compute the optimal operator, one can at least employ suboptimal approximations to it while knowing the penalties associated with the approximations.
In this paper, we have, in effect, confronted an old problem in signal processing: If we wish to make a decision based on a noisy observed signal, is it better to filter the observed signal and then determine the optimal decision on the filtered signal, or to find the optimal decision based directly on the observed signal? The answer is the latter. The reason is that the latter approach is fully optimal relative to the actual observation process, whereas, even if in the first approach the filtering is optimal relative to the noise process, the first approach produces a composite of two actions, filter and decision, each of which is only optimal relative to a portion of the actual observation process. In the present situation involving clustering, in the standard imputation-followed-by-clustering approach, it is typically the case that neither the filter (imputation) nor the decision (clustering) is optimal, so that even more advantage is obtained by optimal clustering over the missing-value-adjusted RLPP.