Convex Covariate Clustering for Classification



Introduction
Interpretability is paramount to communicate classification and regression results to the domain expert. Especially in high-dimensional problems, sparsity of the solution is often assumed to increase interpretability. As a consequence, various previous works in statistics focus on covariate selection, with the ℓ1 penalty being a particularly popular choice [11].
However, even after successful covariate selection, the remaining solution might still contain several hundred or thousand covariates. We therefore suggest to further cluster the covariates. In particular, we propose a clustering method that uses a-priori similarity information between covariates while respecting the response variable of the classification problem. The resulting covariate clusters can help to identify meaningful groups of covariates and, in this way, arguably increase the interpretability of the solution.
As a simple motivating example, consider the situation of document classification, where the covariates are words in a document. Without further context, we might want to cluster the words "apple" and "cherry" together. This cluster might be appropriate for classifying a document into either category "cooking" or category "sports". However, this cluster might be inappropriate for classifying a document as either "sports" or "computer", due to the ambiguity of "apple" also meaning "Macintosh".
This demonstrates that a simple two-step approach, first step: covariate clustering; second step: classification, might lead to clusters that are difficult to interpret. Therefore, we propose to formulate the clustering problem as a joint sample classification and covariate clustering problem. In particular, we use a logistic regression loss with a pair-wise group lasso penalty for each pair of covariate weight vectors. Our formulation leads to a convex optimization problem, like convex clustering [6, 12]. As a consequence, we have the desirable properties of a unique global minimum, and the ability to plot a clustering hierarchy, instead of just a single clustering.
Our proposed method is conceptually related to joint convex covariate clustering and linear regression as proposed in [19]. However, the change from linear regression to logistic regression is computationally challenging. The main reason is that the objective function is no longer decomposable per covariate. Furthermore, the logistic regression loss is convex, but not, in general, strictly convex, and therefore prohibits the application of the alternating minimization algorithm (AMA) as in [6].
Therefore, we propose a specialized alternating direction method of multipliers (ADMM) with an efficient gradient descent step for the non-decomposable primal update. Our solution allows us to scale the covariate clustering to problems with several thousand covariates.
Since we often want to decide on one clustering result, we also propose a model selection criterion using an approximated marginal likelihood. The motivation for this criterion is similar to the Bayesian Information Criterion (BIC) [18], but it avoids the issue of a singular likelihood function. The proposed criterion circumvents the need for hyper-parameter selection with cross-validation, which is computationally infeasible here.
The outline of this article is as follows. In the next section, we introduce our proposed objective function, and describe an efficient ADMM algorithm for solving the convex optimization problem. In Section 3, we describe our model selection criterion for selecting a plausible clustering result out of all clustering results that were found with the proposed method. Next, in our experiments, Sections 4 and 5, we compare the proposed method to k-means clustering and convex clustering [6]. In particular, in Section 4, we evaluate the proposed method and the model selection criterion on several synthetic data sets. In Section 4.2, we evaluate the computational runtime of the proposed ADMM algorithm and compare to standard convex optimization solvers. In Section 5, we evaluate the quality of the clustering result on two real text data sets. Finally, in Section 6, we summarize our conclusions.
A note on our notation: we denote a matrix by a capital letter, e.g. B ∈ R^{c×d}, and a column vector in bold font, e.g. x ∈ R^d. Furthermore, the i-th row of B is denoted by B_{i,•} and is a row vector. The j-th column of B is denoted by B_{•,j}, or simply b_j, and is a column vector.

Proposed Method
Let B ∈ R^{c×d}, where c is the number of classes, and d is the number of covariates. B_{l,•} is the weight vector for class l. Furthermore, β_0 ∈ R^c contains the intercepts. We now assume the multi-class logistic regression classifier defined by p(y = l | x) ∝ exp(B_{l,•} x + β_{0,l}).
We propose the following formulation for jointly classifying samples x_s and clustering the covariates (Equation (1)): minimize, over B and β_0, the logistic regression loss plus the penalty ν Σ_{i1 < i2} S_{i1,i2} ||B_{•,i1} − B_{•,i2}||_2, where S_{i1,i2} ≥ 0 defines a similarity measure between covariates i_1 and i_2 and is assumed to be given a-priori. The last term is a group lasso penalty on the class weights for any pair of covariates i_1 and i_2. The penalty is large for similar covariates, and therefore encourages B_{•,i1} − B_{•,i2} to be 0, that is, B_{•,i1} and B_{•,i2} to be equal. The clustering of the covariates can be found by grouping two covariates i_1 and i_2 together if B_{•,i1} and B_{•,i2} are equal.
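The objective just described can be sketched as follows; the function and variable names are illustrative, and the log-sum-exp form of the multinomial logistic loss is the standard one (the exact notation of Equation (1) may differ):

```python
import numpy as np

def objective(B, beta0, X, y, S, nu):
    """Joint classification / covariate-clustering objective (sketch).

    B: (c, d) class weight matrix, beta0: (c,) intercepts,
    X: (n, d) samples, y: (n,) labels in {0, ..., c-1},
    S: (d, d) a-priori covariate similarity, nu >= 0 penalty strength.
    """
    scores = X @ B.T + beta0                        # (n, c)
    scores = scores - scores.max(axis=1, keepdims=True)  # stabilise softmax
    # multinomial logistic negative log-likelihood
    logZ = np.log(np.exp(scores).sum(axis=1))
    nll = np.sum(logZ - scores[np.arange(len(y)), y])
    # pair-wise group lasso on columns of B, weighted by the similarity S
    d = B.shape[1]
    penalty = 0.0
    for i1 in range(d):
        for i2 in range(i1 + 1, d):
            if S[i1, i2] > 0:
                penalty += S[i1, i2] * np.linalg.norm(B[:, i1] - B[:, i2])
    return nll + nu * penalty
```

Note that when two columns of B are exactly equal, the corresponding penalty term vanishes, which is exactly the fusion effect that defines the clusters.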
The advantage of this formulation is that the problem is convex, and we are therefore guaranteed to find a global minimum.
Note that this penalty shares some similarity with convex clustering as in [6, 12]. However, one major difference is that we do not introduce latent vectors for each data point, and our method can jointly learn the classifier and the clustering.
We remark that it is straightforward to additionally add a sparsity penalty on the columns of B to jointly select and cluster all covariates. In the following, we omit such extensions and focus on the computationally difficult part, the clustering. Our implementation allows to perform joint or sequential covariate selection and clustering.
Finally, we remark that the similarity matrix can be represented as an undirected weighted graph with an edge between covariates i_1 and i_2 iff S_{i1,i2} > 0. In practice, S might be manually crafted by a domain expert (e.g. given by some ontology), or learned from a different data set (as we do in Section 5).

Optimization using ADMM
Here, we focus on the optimization problem from Equation (1), which we can rewrite in consensus form, where d_i denotes the number of adjacent covariates of i. Assuming that the adjacent covariates of i are ordered from 1 to d_i, the function e_i(j) returns the global covariate id of the j-th adjacent covariate of i. We can then formulate the problem with auxiliary copy variables, where we denote b_j := B_{•,j} ∈ R^c. Therefore, z_{i→a} can be read as "a copy of b_i for the comparison with b_a", where a is an adjacent node of i.
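A minimal sketch of this variable-splitting setup, with illustrative names; the dictionary keys (i, a) play the role of the directed-edge copies z_{i→a}, and u holds the corresponding scaled dual variables:

```python
import numpy as np

def make_admm_copies(S, B):
    """Build the adjacency structure of the similarity graph and one
    consensus copy z[(i, a)] of b_i per directed edge (i, a) (sketch)."""
    d = B.shape[1]
    # adjacency: a is adjacent to i iff S[i, a] > 0 (off-diagonal)
    adj = {i: [a for a in range(d) if a != i and S[i, a] > 0]
           for i in range(d)}
    z = {(i, a): B[:, i].copy() for i in adj for a in adj[i]}
    u = {key: np.zeros(B.shape[0]) for key in z}   # scaled duals
    return adj, z, u
```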

Update of primal variables
The update of B can be solved with an approximate gradient Newton method [5], where fast calculation of the gradient and function evaluation is key to an efficient implementation. Let g(B) denote the objective of the primal update of B (the logistic loss plus the quadratic ADMM penalty over all adjacent pairs). Due to the second term, a naive calculation of g(B) is in O(d · max_i d_i), and therefore O(d²) for dense graphs. However, the double sum in the second term can be expanded, where we define q_ij := z^k_{i→e_i(j)} + u^k_{i→e_i(j)}, q_all := Σ_{i=1}^{d} Σ_{j=1}^{d_i} q_ij^T q_ij, and q↑_i := Σ_{j=1}^{d_i} q_ij. Since q_all and q↑_i can be precalculated, each repeated calculation of g(B) is in O(d).
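The expansion underlying this speed-up can be checked numerically. The sketch below contrasts the naive double sum with the precomputed form; names are illustrative, and q is indexed here by directed edges (i, a) rather than by a local index j:

```python
import numpy as np

def quad_naive(B, q, adj):
    # double sum over all directed edges: sum_i sum_a ||b_i - q_{i,a}||^2
    return sum(np.sum((B[:, i] - q[(i, a)])**2) for i in adj for a in adj[i])

def quad_fast(B, q, adj):
    # expansion ||b - q||^2 = ||b||^2 - 2 b^T q + ||q||^2 gives
    #   sum_i ( d_i ||b_i||^2 - 2 b_i^T q_up_i ) + q_all,
    # where q_all and q_up_i are precomputed once per ADMM iteration
    q_all = sum(qv @ qv for qv in q.values())
    q_up = {i: sum((q[(i, a)] for a in adj[i]), np.zeros(B.shape[0]))
            for i in adj}
    return q_all + sum(len(adj[i]) * (B[:, i] @ B[:, i])
                       - 2.0 * (B[:, i] @ q_up[i]) for i in adj)
```

After the one-off precomputation of q_all and the q_up_i, every further evaluation only loops once over the covariates, which is the O(d) cost stated above.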
Furthermore, we note that the gradient of g(B) can be expressed with the same precomputed quantities, and can therefore also be evaluated in O(d).

Update of auxiliary variables
The update of z_{i→j} and z_{j→i} for each unordered pair {i, j} can be performed independently. This optimization problem has a closed-form solution, which was proven in a different context in [10]: the pair of copies is obtained by interpolating towards the common average, with an interpolation weight θ that is thresholded at 0.5 (Equation (2)).
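A plausible sketch of this closed-form pair update, assuming the standard proximal form argmin (ρ/2)(||z1 − a_i||² + ||z2 − a_j||²) + λ||z1 − z2||, with λ absorbing the weight ν S_ij; the exact constants depend on the formulation:

```python
import numpy as np

def pair_prox(a_i, a_j, lam, rho):
    """Closed-form update for the pair (z_{i->j}, z_{j->i}) (sketch).
    theta is thresholded at 0.5, so the two copies become exactly
    equal (the pair fuses) once lam is large relative to rho * ||a_i - a_j||."""
    diff = np.linalg.norm(a_i - a_j)
    theta = 0.5 if diff == 0 else max(0.5, 1.0 - lam / (rho * diff))
    z1 = theta * a_i + (1.0 - theta) * a_j
    z2 = (1.0 - theta) * a_i + theta * a_j
    return z1, z2
```

The thresholding of θ at 0.5 is what later makes the equality z_{i→j} = z_{j→i} checkable exactly, without numerical tolerance.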

Stopping criteria
As stopping criterion we use, as suggested in [4], the ℓ2 norms of the primal residual r^k and the dual residual s^k at iteration k. We stop after iteration k if both residual norms fall below a tolerance scaled with the problem size, where l is the total number of edges in our graph, and ε is set to 0.00001. We also stop in the case where k ≥ 10000.
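A sketch of this stopping test; the exact scaling of the tolerance with the number of edges l and classes c is an assumption here, since the corresponding equation is not reproduced above:

```python
import numpy as np

def should_stop(B, z, z_prev, rho, k, eps=1e-5, k_max=10000):
    """ADMM stopping test (sketch). z maps directed edges (i, a) to copies;
    z_prev holds the copies from the previous iteration."""
    # primal residual: disagreement between b_i and its copies
    r = np.sqrt(sum(np.sum((B[:, i] - z[(i, a)])**2) for (i, a) in z))
    # dual residual: change of the copies between iterations, scaled by rho
    s = rho * np.sqrt(sum(np.sum((z[e] - z_prev[e])**2) for e in z))
    l = len(z) // 2                      # number of undirected edges
    c = B.shape[0]
    tol = eps * np.sqrt(2 * l * c)       # assumed problem-size scaling, cf. [4]
    return (r <= tol and s <= tol) or k >= k_max
```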

Identifying the covariate clusters
Although in theory the optimization problem ensures that certain column differences of B are exactly 0, due to the use of ADMM this is not true in practice. We therefore follow the strategy proposed in [6]: after convergence of the ADMM, we investigate Z, and place an edge between nodes i and j iff z_{i→j} = z_{j→i} (Equation (3)). The equality (3) can be checked exactly (without numerical difficulties) due to the thresholding of θ at 0.5 in Equation (2).
In the resulting graph, there are two possible ways to identify clusters:
• identify the connected components as clusters;
• consider only fully connected components (cliques) as clusters.
Of course, in theory, since the optimization problem is convex, after complete convergence the two ways must result in the same clustering. However, in practice, we find that the latter leads to too many covariates not being clustered (i.e. each covariate forms its own singleton cluster). The latter is also computationally difficult, since identifying the largest fully connected component is NP-hard. Therefore, we proceed here with the former strategy.
We denote the identified clusters as C_1, ..., C_m, where m is the number of clusters and C := {C_1, ..., C_m} is a partition of the set of covariates.
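Identifying the connected components needs no external graph library; a simple union-find sketch over the edges placed by the test above:

```python
def covariate_clusters(edges, d):
    """Identify clusters C_1, ..., C_m as the connected components of the
    graph on d covariates (union-find with path halving, sketch)."""
    parent = list(range(d))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    for i, j in edges:
        union(i, j)
    comps = {}
    for i in range(d):
        comps.setdefault(find(i), []).append(i)
    return list(comps.values())
```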

Approximate Bayesian Model Selection
Note that different hyper-parameter settings for ν will result in different clusterings. For our experiments, we consider ν ∈ {n · 2^(−0.1a) | a ∈ {0, 1, 2, ..., 299}}. This way, we get a range of clusterings. Like convex clustering, this allows us to plot a clustering hierarchy. However, in most situations, we are interested in finding the most plausible clustering, or ranking the clusterings according to some criterion.
One obvious way is to use cross-validation on the training data to estimate the held-out classification accuracy for each setting of ν. However, the computational costs are enormous. Another, more subtle issue is that the group lasso terms in Equation (1) jointly perform clustering and shrinkage, controlled by only one parameter ν. However, the optimal clustering and the optimal shrinkage might not be achieved by the same value of ν, an issue well known for the lasso penalty [14]. Therefore, we consider here a different model selection strategy: an approximate Bayesian model selection with low computational costs.
After we have found a clustering, we train a new logistic regression classifier with the parameter space for B limited to the clustering. Assuming some prior over B, the marginal likelihood p(y) provides a trade-off between model fit and number of parameters (= number of clusters). A popular choice is the Bayesian information criterion (BIC) [18], which assumes that the prior can be asymptotically ignored. However, BIC requires that the unpenalized maximum-likelihood estimate is defined. Unfortunately, this is not the case for logistic regression, where linearly separable data will lead to infinite weights. For this reason, we suggest here to use a Gaussian prior on B and then estimate the marginal likelihood with a Laplace approximation.
In detail, in order to evaluate the quality of a clustering, we suggest to use the marginal likelihood p(y | X, C, β_0), where we treat the intercept vector β_0 as a hyper-parameter. X ∈ R^{n×d} denotes the design matrix, and y ∈ {1, ..., c}^n the responses.
We define the following Bayesian model, where π denotes our prior on B, and x_C denotes the projection of the covariates x on the clustering defined by C, i.e. x_C = T_C x with a cluster-membership matrix T_C ∈ R^{m×d}. Assuming a Gaussian prior N(0, σ²) on each entry of B ∈ R^{c×m}, the log joint distribution for one data point (y, x) is log f(y | x, B, β_0) + log π(B). For simplicity, let us denote r_i := (B_{i,•})^T; note that r_i ∈ R^m. The gradient of the log joint distribution with respect to r_i, and the Hessian with respect to r_i, r_j, then have closed forms. Let us denote by S(B) the Hessian of the log joint probability of the observed data X = {x_1, ..., x_n} and the prior on B.

Here, S(B)_{ij} ∈ R^{m×m} denotes the (i, j)-th block of S(B) ∈ R^{cm×cm}. By the Bayesian central limit theorem, we then have that the posterior is approximately normal with covariance matrix −S(B̂)^{−1}. Applying the Laplace approximation (see e.g. [2]), we get an approximation for the marginal likelihood p(y | X, β_0), where B̂ and β̂_0 are the maximum a-posteriori (MAP) estimates from model (4).
For a large number of clusters m, the calculation of |−S(B̂)^{−1}| is computationally expensive. Instead, we suggest to use only the diagonal of the Hessian, which can be calculated for one sample (y, x) per diagonal index u, with i := ⌈u/m⌉ and z := u − (i − 1)m. We consider the intercept terms β_0 as hyper-parameters and use empirical Bayes to estimate them, i.e. we set them to β̂_0. The hyper-parameter σ cannot be estimated with empirical Bayes, since the unpenalized maximum likelihood might not be defined, which would lead to σ → ∞. Therefore, we estimate σ using cross-validation on the full model, i.e. without clustering. This is computationally feasible, since the MAP is an ordinary logistic regression with ℓ2 penalty, and cross-validation needs to be done only once, and not for every clustering.
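A sketch of the resulting diagonal-Hessian Laplace approximation for one clustering; names and the exact arrangement of the constant terms are illustrative:

```python
import numpy as np

def log_marginal_laplace(Bhat, beta0, Xc, y, sigma):
    """Diagonal-Hessian Laplace approximation to log p(y | X, C) (sketch).
    Bhat: (c, m) MAP weights on the m cluster features Xc: (n, m),
    y: (n,) labels in {0, ..., c-1}, sigma: prior std of each weight."""
    c, m = Bhat.shape
    scores = Xc @ Bhat.T + beta0                       # (n, c)
    scores = scores - scores.max(axis=1, keepdims=True)
    P = np.exp(scores); P /= P.sum(axis=1, keepdims=True)
    n = len(y)
    loglik = np.sum(np.log(P[np.arange(n), y]))
    logprior = (-0.5 * np.sum(Bhat**2) / sigma**2
                - 0.5 * c * m * np.log(2 * np.pi * sigma**2))
    # diagonal of the negative Hessian of the log joint:
    # sum_s x_z^2 * p_i * (1 - p_i) per block (i, i), plus the prior term
    neg_hess_diag = (P * (1 - P)).T @ (Xc**2) + 1.0 / sigma**2   # (c, m)
    return (loglik + logprior
            + 0.5 * c * m * np.log(2 * np.pi)
            - 0.5 * np.sum(np.log(neg_hess_diag)))
```

Larger models pay through both the prior term and the log-determinant term, which gives the BIC-like trade-off between fit and number of clusters.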

Synthetic Data Experiments
In this section, we investigate the performance of our proposed method for covariate clustering, as well as the performance of our proposed model selection criterion, on synthetic data.
For all synthetic datasets, we set the number of classes c to 4. The intercept vector β_0 is set to the zero vector. We group the covariates evenly into 10 clusters. The weight vector b_i for covariate i is set to the all-zero vector, except for one position, which is set to 5.0. That means each covariate is associated with exactly one class. If two covariates i and j belong to the same cluster, then b_i and b_j are set to be equal. Finally, we generate samples from a multivariate normal distribution as follows: given a positive definite covariate similarity matrix S, we generate a sample x from class y by x ∼ N(B_{y,•}, S). For each class we generate the same number of samples.
For S we consider two scenarios.

S agrees with class label information: If covariates i and j are in different clusters, then S_ij = 0; otherwise S_ij = 0.9. S_ii is set to 1.0. An example, where each cluster has four covariates, is shown in Figure 1.

Figure 1: Shows the first 20 covariates with their associated cluster labels (ground truth), associated classes, and similarities between the covariates. Grey bars at the same height mean that the covariates are similar to each other, i.e. S_ij = 0.9. Here, the similarity S agrees with the class information.
S partly disagrees with class label information: This setting considers the case where the prior information from S partly disagrees with the class label information. For that purpose, we modify the clustering from before. In particular, a new cluster is introduced that evenly covers covariates from two different clusters that are associated with different class labels. This is repeated once more for two other clusters. This way we get two new clusters, which we denote C_1 and C_2. S is defined as follows. If covariates i and j belong to the same new cluster, then S_ij = 0.9. If covariates i and j belong to the same old cluster, and neither belongs to a new cluster, then S_ij = 0.9. In all other cases S_ij is set to 0, and S_ii = 1.0. An example is shown in Figure 2, where the similarity S disagrees with the class information.
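The generation procedure above, in the agreeing scenario, can be sketched as follows; function name and default sizes are illustrative:

```python
import numpy as np

def make_synthetic(d=40, n_per_class=10, c=4, n_clusters=10, seed=0):
    """Synthetic data (sketch): covariates grouped evenly into clusters,
    each cluster tied to one class; samples drawn as x ~ N(B[y, :], S)."""
    rng = np.random.default_rng(seed)
    cluster = np.repeat(np.arange(n_clusters), d // n_clusters)
    cls = cluster % c                    # class associated with each cluster
    B = np.zeros((c, d))
    B[cls, np.arange(d)] = 5.0           # one nonzero entry per covariate
    # agreeing scenario: 0.9 within a cluster, 0 across, 1.0 on the diagonal
    S = 0.9 * (cluster[:, None] == cluster[None, :]).astype(float)
    np.fill_diagonal(S, 1.0)
    X = np.vstack([rng.multivariate_normal(B[yc], S, size=n_per_class)
                   for yc in range(c)])
    y = np.repeat(np.arange(c), n_per_class)
    return X, y, B, S, cluster
```

The disagreeing scenario only changes how S is filled in, not the sampling itself.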

Clustering Evaluation
We consider d ∈ {40, 200} and n ∈ {40, 400, 4000}. We compare the proposed method to k-means with the initialization proposed in [3], and to convex clustering [6]. For k-means, we consider k ∈ {1, 2, ..., d}. For convex clustering, the hyper-parameter γ is tested in the range 0 to 5, evenly spaced with step size 1/300, and the weights w_ij are set to the exponential kernel with φ = 0.5. We repeat each experiment 10 times and report the average and standard deviation of the adjusted normalized mutual information (ANMI) [21] when compared to the true clustering. The ANMI score ranges from 0.0 (agreement with the true clustering at pure chance level) to 1.0 (complete agreement with the true clustering). All results are summarized in Table 1. For selecting a clustering, we use the approximate marginal likelihood criterion described in Section 3. If the correct clustering was not found with the marginal likelihood model selection criterion, we also show the oracle model selection performance, i.e. the highest ANMI score among all the clusterings found (small font in Table 1). This allows us to distinguish the failure of creating the correct clustering from the failure of selecting the correct clustering.
Agreement and disagreement of the similarity measure with class-label information. As expected, for the agreeing case, both k-means and convex clustering can find the correct clustering; at least for k-means, this is true when focusing on the oracle model selection performance. However, when the a-priori similarity measure disagrees with the class-label information (disagreeing case), then k-means and convex clustering fail to create the correct clustering. The proposed method finds the correct clustering in all cases.

Quality of marginal likelihood model selection criterion
For the disagreeing case, we see a discrepancy between the selected clustering and the best clustering when clustering with k-means. However, in all other cases, we find that the marginal likelihood model selection criterion proposed in Section 3 always selects the best clustering. In particular, for the proposed method, we see that the correct clustering is always selected.

Runtime Experiments
In order to check the efficiency of our proposed ADMM solution, we compare it to two standard solvers for convex optimization, namely ECOS [9] and SCS [16], using the CVXPY interface [8, 1]. Note that SCS uses a generic ADMM method, for which we use the same stopping criterion described in Section 2.2. We run all experiments on a cluster with 32 Intel(R) Xeon(R) CPUs at 3.20GHz.
The results for the synthetic data set (disagreement of the similarity measure with class-label information) with d ∈ {40, 200, 1000} and n ∈ {40, 400, 4000} are listed in Table 2. We repeat each experiment 10 times and report the average runtime. In cases where one experiment did not finish after 12 hours, we stopped the evaluation.
Our evaluation shows that the proposed method scales well with the number of samples n, but less well with the number of variables d. One reason is that the number of auxiliary variables grows quadratically with the number of variables d, and therefore the cost per iteration increases by Ω(d²). Nevertheless, our proposed method is considerably faster than the standard solvers. Importantly, our method scales to several thousand variables, which can be crucial when applying it to real data.

Experiments on Real Data
For our experiments on real data, we used the movie review corpus IMDB [13], which consists of in total 100k (labeled and unlabeled) documents. IMDB is a balanced corpus, with 50% of the movies being reviewed as "good movie", and the remaining as "bad movie". As the second dataset, we used the 20 Newsgroups corpus (Newsgroup20) with around 19k documents categorized into 20 classes (topics like "hockey", "computer", ...). In order to check whether our model selection criterion correlates well with accuracy on held-out data, we use 10000 documents for training and clustering selection, and the remaining documents as held-out data. We removed all duplicate documents, and performed tokenization and stemming with Senna [7].

Covariate Selection
First, for each dataset, we extract the 10000 most frequent words as covariates, and represent each document s by a vector x_s, where each dimension contains the tf-idf score of a word in the document. Finally, we normalize the covariates to have mean 0 and variance 1.
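A minimal sketch of this feature construction; the exact tf-idf variant used is not specified above, so a plain tf · log(n/df) weighting is assumed here:

```python
import numpy as np

def tfidf_standardized(counts):
    """Build tf-idf covariates from a term-count matrix and standardize
    them to mean 0 and variance 1 (sketch; the tf-idf variant is an
    assumption). counts: (n_docs, n_words) raw term counts."""
    n = counts.shape[0]
    tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1.0)
    df = (counts > 0).sum(axis=0)                 # document frequency
    idf = np.log(n / np.maximum(df, 1))
    X = tf * idf
    mu, sd = X.mean(axis=0), X.std(axis=0)
    return (X - mu) / np.where(sd > 0, sd, 1.0)   # constant columns stay 0
```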
For such high-dimensional problems, many covariates are only negligibly relevant. Therefore, in order to further reduce the dimension, we apply a group lasso penalty on the columns of B, as in [20], and select the model with the highest approximate marginal likelihood. This leaves us with 2604 and 1538 covariates for IMDB and Newsgroup20, respectively.

Covariate Similarity Measure
From additional unlabeled documents, we determine the similarity between two covariates i and j as follows.
First, using the unlabeled datasets, we create for each covariate i a word embedding w_i. For IMDB, we create 50-dimensional word embeddings with word2vec [15], using 75k documents from the IMDB corpus. For Newsgroup20, since the number of samples is rather small, we use the 300-dimensional word embeddings from GloVe [17] that were trained on Wikipedia + Gigaword 5.
Finally, the similarity between covariates i and j is calculated from their word embeddings w_i and w_j.
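The concrete similarity formula is not reproduced above; a natural choice consistent with the requirement S_ij ≥ 0 is cosine similarity between the embeddings, clipped at zero. This is an assumption for illustration, not necessarily the exact formula used:

```python
import numpy as np

def embedding_similarity(W):
    """Similarity matrix S between covariates from word embeddings (sketch).
    W: (d, k) embedding matrix, one row per covariate. Cosine similarity,
    clipped at zero so that S_ij >= 0, with S_ii = 1."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    S = np.clip(Wn @ Wn.T, 0.0, None)
    np.fill_diagonal(S, 1.0)
    return S
```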

Quantitative Results
As a quantitative measure of interpretability, we suggest to use the number of effective parameters, which here is just the number of clusters. The ideal covariate clustering should lead to similar, or even better, accuracy on held-out data. For selecting a clustering, we use the proposed marginal likelihood criterion (Section 3), and the held-out data is only used for the final evaluation. The results for IMDB and Newsgroup20 are shown in Tables 3 and 5, respectively. We find that the marginal likelihood criterion tends to select models whose accuracy on held-out data is similar to the full model, but with fewer effective parameters. Furthermore, for IMDB, we find that the proposed method leads to considerably more compact models than convex clustering and k-means, while having similar held-out accuracy. On the other hand, for Newsgroup20, the accuracy of the proposed method and k-means clustering is similar, indicating that there is good agreement between the similarity measure S and the classification task.
In order to confirm that these conclusions hold independently of the model selection criterion, we also show the held-out accuracy of clusterings with a number of effective parameters around 100, 500 and 1000. Note that, in contrast to k-means, the proposed method and convex clustering can only control the number of clusters indirectly, through their regularization hyper-parameter. The results for IMDB and Newsgroup20, shown in Tables 4 and 6, confirm that the proposed method can lead to better covariate clusterings than k-means and convex clustering.

Qualitative Results
Like convex clustering, our proposed method can be used to create a hierarchical clustering structure by varying the hyper-parameter ν. In particular, for large enough ν, all covariates collapse to one cluster. On the other side, for ν = 0, each covariate is assigned to a separate cluster. In order to ease interpretability, we limit the visualization of the hierarchical clustering structure to all clusterings with ν ≤ ν_0, where ν_0 corresponds to the clustering that was selected with the marginal likelihood criterion. Furthermore, for the visualization of IMDB, we only show the covariates with odds ratios larger than 1. In the supplementary material, we give the clustering results of all methods for all classes. Each node represents one cluster. If the cluster is a leaf node, we show the covariate; otherwise we show the size of the cluster (in the following called the cluster description). From the logistic regression model of each clustering, we associate each cluster to the class with the highest entry in the weight matrix B, and then calculate the odds ratio to the second largest weight. The color of a node represents its associated class. All classes with coloring are explained in Figures 3 and 4 for IMDB and Newsgroup20, respectively. The odds ratio of each cluster is shown below the cluster description. Further details of the creation of the hierarchical structure and visualization are given in the supplementary material.

For IMDB, the qualitative differences between the proposed method and k-means are quite obvious. K-means clustering tends to produce clusters that include covariates associated with different classes. An example is shown in Figure 5, where all number ratings are clustered together. However, these have very different meanings, since for example 7/10, 8/10, 9/10, 10/10 rate good movies and 0, 1/10, 2/10, ... rate bad movies. We observe a similar result for convex clustering, as shown in Figure 6. In contrast, our proposed method distinguishes them correctly by assigning them to different clusters associated with different classes (see Figures 7 and 8).
For Newsgroup20, the qualitative differences of the proposed method with k-means are more subtle, but still we can identify some interesting differences.For example, k-means assigns "apple" and "cherry" to the same cluster which is associated with the class "Macintosh" (comp sys mac hardware) (see Figure 9).In contrast, our proposed method correctly distinguishes these two concepts in the context of "Macintosh" (see Figure 10).

Conclusions
We presented a new method for covariate clustering that uses an a-priori similarity measure between covariates and, additionally, samples with class label information. In contrast to k-means and convex clustering, the proposed method creates clusters that are a-posteriori plausible, in the sense that they help to explain the observed data (samples with class label information). Like convex clustering [6], the proposed objective function is convex, and therefore insensitive to initialization heuristics (as needed by k-means clustering).
Solving the convex objective function is computationally challenging. Therefore, we proposed an efficient ADMM algorithm which allows us to scale to several thousand variables. Furthermore, in order to avoid computationally expensive cross-validation, we proposed a marginal likelihood criterion similar to BIC. On synthetic and real data, we confirmed the usefulness of our proposed clustering method and the marginal likelihood criterion.

Figure 2 :
Figure 2: Shows the first 20 covariates with their associated cluster labels (ground truth), associated classes, and similarities between the covariates. Grey bars at the same height mean that the covariates are similar to each other, i.e. S_ij = 0.9. Here, the similarity S disagrees with the class information.

Figure 3 :
Figure 3: Coloring of each class in IMDB.

Figure 10 :
Figure 10: Shows part of the clustering result of the proposed method on Newsgroup20 for the class Macintosh (hardware).

Table 2 :
Average runtime (in minutes, with standard deviation in brackets) of the proposed ADMM solution, and of the two standard solvers ECOS [9] and SCS [16]. NA (not available) marks experiments where one run did not stop after 12 hours.

Table 3 :
Results for the IMDB dataset. Shows the top 5 clustering results measured by the marginal likelihood, the number of effective parameters, and the accuracy on held-out data. Sorted according to marginal likelihood.

Table 4 :
Results for the IMDB dataset. Comparison of held-out accuracy of the proposed method, convex clustering and k-means for around the same number of effective parameters: 99, 491, and 1010 (for convex clustering: 85, 509, and 1008).

Table 5 :
Results for the Newsgroup20 dataset. Shows the top 5 clustering results measured by the marginal likelihood, the number of effective parameters, and the accuracy on held-out data. Sorted according to marginal likelihood.
Part of the clustering results of the proposed method, convex clustering and k-means are shown in Figures 5, 6, 7, 8, and Figures 9, 10, for IMDB and Newsgroup20, respectively.

Figure 6: Shows part of the clustering result of convex clustering on IMDB for the class bad movie.

Figure 7: Shows part of the clustering result of the proposed method on IMDB for the class bad movie.

Figure 8: Shows part of the clustering result of the proposed method on IMDB for the class good movie.

Figure 9: Shows part of the clustering result of the k-means clustering on Newsgroup20 for the class Macintosh (hardware).
Figure 5: Shows part of the clustering result of k-means clustering on IMDB for the class bad movie.