Fusion of Centroid-Based Clustering With Graph Clustering: An Expectation-Maximization-Based Hybrid Clustering

This article extends the expectation-maximization (EM) formulation for the Gaussian mixture model (GMM) with a novel weighted dissimilarity loss. This extension results in the fusion of two different clustering methods, namely, centroid-based clustering and graph clustering, in the same framework in order to leverage their advantages. The fusion of centroid-based clustering and graph clustering results in a simple “soft” asynchronous hybrid clustering method. The proposed algorithm may start as a pure centroid-based clustering algorithm (e.g., $k$-means), and as time evolves, it may eventually and gradually turn into a pure graph clustering algorithm [e.g., basic greedy asynchronous distributed interference avoidance (GADIA) (Babadi and Tarokh, 2010)] as the algorithm converges, and vice versa. The “hard” version of the proposed hybrid algorithm includes the standard Hopfield neural networks [and, thus, Bruck’s Ln algorithm (Bruck, 1990) and the Ising model in statistical mechanics], Babadi and Tarokh’s basic GADIA (2010), and the standard $k$-means (Steinhaus, 1956), (MacQueen, 1967) [i.e., the Lloyd algorithm (Lloyd, 1957, 1982)] as its special cases. We call the “hard” version of the proposed clustering “hybrid-nongreedy asynchronous clustering (H-NAC).” We apply the H-NAC to various clustering problems using well-known benchmark datasets. The computer simulations confirm the superior performance of the H-NAC compared to the $k$-means clustering, $k$-GADIA, spectral clustering, and a very recent clustering algorithm, structured graph learning (SGL) by Kang et al. (2021), which represents one of the state-of-the-art clustering algorithms.


I. INTRODUCTION AND MOTIVATION
CLUSTERING is a fundamental mechanism in data processing and machine learning applications, and thus, it is a fundamental research area. It is the task of grouping a set of objects so that objects in the same group are more similar to each other than to those in other groups, and thus, it helps in understanding and discovering the natural grouping in a dataset. The clustering algorithms in the literature can be put into different categories depending on their features and mathematical models, such as centroid-based clustering, graph clustering, spectral clustering, distribution-based clustering, hierarchical clustering, graph-based subspace clustering, and multiview clustering. There is no single clustering method that solves all different types of challenging real-life clustering problems with the best performance. Naturally, there are potential shortcomings for all existing clustering techniques depending on their mathematical models and the characteristics of datasets. For example, generally speaking, the k-means clustering cannot find nonconvex clusters. Distribution-based clustering suffers from an overfitting problem although it has a strong theoretical foundation. Another prominent method, the Gaussian mixture model (GMM), assumes Gaussian distributions, which is a rather strong assumption for various real-life datasets. The spectral clustering-based methods usually have high time complexity due to the high time cost of the eigenvector calculations of a Laplacian matrix [49]–[51].
Eventually, an algorithm that is designed for a specific kind of dataset model may fail on a different type of dataset model in practice. Thus, for a given particular clustering problem and its datasets, we often end up determining the most appropriate clustering algorithm experimentally. Spectral clustering [50], [51], [53]–[56] aims at exploring the local nonlinear manifold (spectral graph) structure [57], [58] and has attracted great attention in recent years. Spectral clustering methods usually perform eigendecomposition of the similarity matrix first and then use k-means to obtain the final clustering results from the eigenvectors. Some popular spectral clustering algorithms are the ratio cut [86], normalized cut [50], multiclass spectral clustering [87], spectral embedded clustering [56], constrained Laplacian rank [88], and self-balanced min cut [89], among many other methods [57], [82], [83]. They have shown their power in many applications, such as image clustering [89], image segmentation [87], [90], gene expression data [91], and face recognition [48]. A novel large-scale balanced min cut (LABIN) algorithm is presented in [81].
Subspace clustering [84] has recently been one of the most common methods used to recover the low-dimensional representations of high-dimensional data with some theoretical guarantees [92], [93]. The subspace clustering has successfully been applied to numerous areas, such as image segmentation, human motion segmentation [94], sequential data clustering [95], medical imaging [96], and bioinformatics [97]. For further references, see [49].
Various authors have, in recent years, reported that, among multiple approaches, graph-based subspace clustering often generates the best performance [98], [99]. As a result, graph-based subspace clustering methods have become a hot research topic very recently. Their goal is to find relevant dimensions spanning a subspace for each cluster [84], [85]. Kang et al. [46] propose a novel graph-based subspace clustering framework, which is called structured graph learning (SGL). Specifically, the SGL [46] is based on the ideas of anchor points and bipartite graphs. Kang et al. [46] establish a theoretical connection between the SGL and the k-means clustering and present extensive experiments demonstrating the efficiency and effectiveness of the SGL with respect to many state-of-the-art clustering methods in [46]. Wang et al. [47] have recently introduced an efficient method for graph clustering, which performs spectral embedding and spectral rotation simultaneously.
Furthermore, the idea of multiview spectral clustering [59], [60] has been another hot research topic in recent years as well due to the fact that combining information from multiview feature spaces generally outperforms single-view clustering [52]. These views may be obtained from multiple sources or different feature subsets. Multiview spectral clustering examines local manifold structures by seeking eigenvalue-eigenvector decompositions [52]. A great many methods for multiview data clustering have been proposed for a wide spectrum of applications, such as similarity search [100] and human action recognition [101], among others. Some very recent methods are multiview subspace clustering [61], multiview spectral clustering [64], [65], graph-based multiview clustering [62], multiview clustering with graph learning [63], and other multiview spectral clustering algorithms [64]–[70].
Among all the clustering methods mentioned above, there is a very simple yet important algorithm that has been "shining" for more than 50 years: the k-means clustering [3]–[6], [12]. Indeed, if we were to make a list of the "top-ten clustering algorithms in history," most researchers and experts would probably give the k-means clustering a place in this top list. The reason why we say "shining" for the k-means is twofold.
1) The k-means has never lost its popularity since the 1980s.
2) The k-means has taken important roles in not only "old-fashioned" clustering methods [71] but also the very latest state-of-the-art clustering methods. Consequently, it would be fair to say that one of the most popular and efficient clustering methods is the k-means method (e.g., [33], [42]). The Lloyd algorithm [5], [6] for the k-means clustering seeks to find k clusters that minimize the sum of squared Euclidean distances between each observation and its respective centroid (cluster mean). In its simplest form, the k-means clustering iteratively alternates between two steps: 1) for a given set of cluster centers, assign each observation to the cluster with the nearest center and 2) for a given assignment of observations to clusters, update each centroid as the sample mean of all points in that cluster.
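The two alternating steps above can be sketched as follows. This is a minimal NumPy rendering of Lloyd's algorithm for illustration; the initialization by sampling k observations and the convergence test are our own simplifications, not part of the original formulation.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by sampling k distinct observations.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 1: assign each observation to the nearest centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step 2: move each centroid to the sample mean of its cluster.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

At convergence, each returned centroid equals the sample mean of the observations assigned to it, which is exactly the fixed point of the two alternating steps.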
The reasons why the k-means clustering has been so popular in various applications are given as follows.
1) The k-means (Lloyd's algorithm [5], [6]) can be used in many different machine learning problems when we need to choose k different but prototypical samples from a large dataset.
2) The computational cost is low, and it is very fast and easily implemented even at a large scale.
3) It is often used as a preprocessing step (and sometimes as a postprocessing step) for many other algorithms.
On the other hand, the theory of graphs has been well established since the 19th century, and the associated applications from this theory have extended to mathematics, physics, chemistry, linguistics, biology, computer science, and social sciences. Graph clustering has recently broadened its areas of application to social network analysis, big data networks, biological (protein interaction and gene coexpression) networks, and wireless systems, among others. For the graph clustering (partitioning) theory literature, see [11]. For some recent applications of graph clustering to wireless systems, see [16]–[18].
In this article, we extend the well-known expectation-maximization (EM) formulation for the GMM with a novel weighted dissimilarity loss. This extension yields the fusion of two different clustering methods, namely, centroid-based clustering and graph clustering, in the same framework in order to leverage their advantages. Our investigations on the fusion of centroid-based clustering and graph clustering in this article yield a novel hybrid clustering method, which touches various multidisciplinary subjects, such as Bruck's Ln algorithm [1] and the Hopfield neural network (HNN) [7], [8] in the artificial neural networks literature (and, thus, the Ising model in statistical mechanics), Babadi and Tarokh's pioneering greedy asynchronous distributed interference avoidance (GADIA) algorithm [2] in wireless communications, and the k-means clustering and graph clustering in machine learning.
The HNN [7], [8] has been one of the important milestones in the history of artificial neural networks. The HNN's applications have varied from associative memory systems design problems to mobile ad hoc network routing problems, combinatorial optimization to image restoration, constrained optimization problems to analog-to-digital conversion, and system identification to channel allocation in wireless systems, among others. For references related to the real HNN and the complex HNN, see [9] and [10], respectively. Bruck presented the so-called Ln algorithm in his 1990 article [1] in order to show an alternative proof for the convergence of the discrete-time real HNN. A recent article [9] has further investigated the behavior of any neuron in not only the discrete-time HNN but also the continuous-time HNN when the corresponding energy function is minimized during an optimization process. In this article, we further examine Bruck's Ln algorithm [1] and extend it from the two-cluster case to the arbitrary k-cluster case. This yields a graph clustering (partitioning) algorithm, which we call the k-Ln algorithm in this article after Bruck's Ln algorithm in [1].

A. Contributions of This Article
In this article, using the well-known EM formulation, we unite two commonly used but different types of clustering methods, namely, the centroid-based clustering and the graph clustering, in the same framework in order to design a hybrid clustering method with more merits than each of the two. Our investigations yield various novel results, some of which are given as follows.
1) This article extends the EM formulation for the GMM with a novel weighted dissimilarity loss. This extension results in the fusion of two different clustering methods, namely, centroid-based clustering and graph clustering. The proposed algorithm may start as a pure centroid-based clustering algorithm (e.g., k-means), and as time evolves, it may eventually and gradually turn into a pure graph clustering algorithm (e.g., basic GADIA [2]) as the algorithm converges, and vice versa.
2) The "hard" version of the proposed method, called the hybrid-nongreedy asynchronous clustering (H-NAC) algorithm, includes the standard HNNs (and, thus, Bruck's Ln algorithm in [1] and the Ising model in statistical mechanics), Babadi and Tarokh's basic GADIA in [2], and the standard k-means [3], [4] (and the Lloyd algorithm [5], [6]) as its special cases.
3) We compare the clustering performance of the proposed H-NAC algorithm with those of the k-means, k-GADIA, spectral clustering, and the SGL in [46], which represents a state-of-the-art clustering algorithm, as applied to various well-known benchmark datasets. The computer simulations confirm the superior performance of the H-NAC compared to these algorithms.
4) Furthermore, we also extend Bruck's Ln algorithm [1] from the two-cluster case to the arbitrary k-cluster case, which we call the k-Ln algorithm in this article. The extended k-Ln clustering algorithm turns out to be equivalent to the basic version of the pioneering algorithm GADIA of Babadi and Tarokh [2].
The article is arranged as follows. In Section II, we summarize the standard EM algorithm for the GMM. Sections III and IV present the soft version and the hard version of the proposed hybrid clustering method, respectively. Computer simulations are presented in Section V, followed by the conclusions in Section VI.

II. RELATED WORK: EM ALGORITHM AND EM FOR GMM

A. EM Algorithm
Let us denote the L-dimensional data as $\{x_i\}_{i=1}^{N} \in \mathbb{R}^L$. In the problem definition, the cluster label of each data point is not known. This unobserved/hidden label is called a latent variable, which is represented as $z = \{z_i\}_{i=1}^{N}$ in this article. The latent variable of sample i, represented as $z_i$, is multinomial and indicates which distribution a given sample $x_i$ belongs to. Thus, the model is the probability distribution function $p(x, z; \theta)$, in which only x is observed. Assuming independent data samples $\{x_i\}_{i=1}^{N} \in \mathbb{R}^L$, our aim is to maximize the likelihood $l(\theta) = \prod_{i=1}^{N} p(x_i; \theta)$. However, for the sake of mathematical simplicity, instead of the likelihood $l(\theta)$, it is common to maximize its log-likelihood

$L(\theta) = \sum_{i=1}^{N} \ln p(x_i; \theta).$ (1)

Incorporating the latent variable $z_i$ into (1) and using Jensen's inequality, one obtains

$L(\theta) = \sum_{i=1}^{N} \ln \sum_{k=1}^{K} q_i(z_i = k)\, \frac{p(x_i, z_i = k; \theta)}{q_i(z_i = k)} \;\ge\; \sum_{i=1}^{N} \sum_{k=1}^{K} q_i(z_i = k) \ln \frac{p(x_i, z_i = k; \theta)}{q_i(z_i = k)}$ (2)

where $q_i(z_i = k)$ represents the probability of $z_i$ taking the value k under the distribution $q_i$. The details of the derivations can be found in any related textbook or lecture notes. Let us represent the lower bound in (2) as $Q_{LB}(\theta)$. It is well known that: 1) (2) holds for any distribution $q_i(z_i = k)$ and 2) if and only if the distribution $q_i(z_i = k)$ is chosen as the posterior

$q_i(z_i = k) = p(z_i = k \mid x_i; \theta)$ (3)

then $L(\theta) = Q_{LB}(\theta)$, i.e., the lower bound $Q_{LB}(\theta)$ becomes exactly equal to the log-likelihood $L(\theta)$ in (1) for the given parameters $\theta$.
Let us represent the posterior as $\gamma_{z_i}^{(k)} = p(z_i = k \mid x_i; \theta)$, which is the posterior distribution of $z_i$ given the observation $x_i$ and the parameters $\theta$. Similarly, the conditional likelihood is $p(x_i \mid z_i = j; \theta)$, which represents the probability of $x_i$ given that the data sample has been drawn from cluster j for a fixed parameter set $\theta$, and the prior is $\varphi_j = p(z_i = j; \theta)$, which represents the probability of cluster j for a given parameter set $\theta$. Finally, the posterior probability, i.e., after we observe a specific observation (data sample) $x_i$, is obtained by the Bayes rule

$\gamma_{z_i}^{(k)} = \frac{p(x_i \mid z_i = k; \theta)\, \varphi_k}{\sum_{j=1}^{K} p(x_i \mid z_i = j; \theta)\, \varphi_j}.$ (4)

The EM algorithm makes the log-likelihood increase monotonically, converging to a locally optimal solution. Thus, the likelihood that the given dataset is drawn from the distribution is maximized by the EM algorithm. Further details and proofs can be found in any related textbook. Table I presents the notation used in Sections II and III for convenience and clarity. Table II summarizes the standard EM algorithm.

B. EM for GMM
In this section, we summarize the standard EM algorithm for the well-known GMM. The GMM model with K different Gaussian distributions (K mixtures) has the following parameters.
$\mu_k$: the mean of the kth Gaussian distribution.
$C_k$: the covariance matrix of the kth Gaussian distribution.
Here, k = 1, 2, ..., K. Each Gaussian distribution is given by

$\mathcal{N}(x; \mu_k, C_k) = \frac{1}{(2\pi)^{L/2}\,|C_k|^{1/2}} \exp\!\Big(-\frac{1}{2}(x-\mu_k)^T C_k^{-1}(x-\mu_k)\Big).$

From (4), the posterior probability (i.e., after we observe a particular data sample $x_i$, the probability that this observed data sample $x_i$ is drawn from cluster k) in the GMM is obtained by

$\gamma_{z_i}^{(k)} = \frac{\varphi_k\, \mathcal{N}(x_i; \mu_k, C_k)}{\sum_{j=1}^{K} \varphi_j\, \mathcal{N}(x_i; \mu_j, C_j)}.$ (8)

The goal is to maximize the log-likelihood in (1) for the GMM model, i.e., $\max_{\mu, C, \varphi} L(\theta)$; as in (2), its lower bound is maximized instead. In the E-step, by (3), data sample $x_i$ is assigned to cluster k with the probability in (8). Applying the Lagrange multipliers' optimization approach yields the optimum parameters for the M-step of the EM algorithm for the GMM. The optimum GMM parameters (for the M-step) are obtained as follows, which can be found in any related textbook or lecture notes (see [19]):

$\mu_k = \frac{\sum_{i=1}^{N} \gamma_{z_i}^{(k)} x_i}{\sum_{i=1}^{N} \gamma_{z_i}^{(k)}}$ (11)

$C_k = \frac{\sum_{i=1}^{N} \gamma_{z_i}^{(k)} (x_i - \mu_k)(x_i - \mu_k)^T}{\sum_{i=1}^{N} \gamma_{z_i}^{(k)}}$ (12)

$\varphi_k = \frac{1}{N} \sum_{i=1}^{N} \gamma_{z_i}^{(k)}.$ (13)

Equation (13) is obtained by the constraint $\sum_{j=1}^{K} \varphi_j = 1$ (for the details of the derivations by the Lagrange multipliers, see [19]).
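The E-step posterior (8) and the M-step updates (11)-(13) can be sketched compactly as follows. This is our own minimal NumPy illustration (random data-point initialization, a small covariance regularizer, and a fixed iteration count are our simplifications), not the article's reference implementation.

```python
import numpy as np

def gaussian_pdf(X, mu, C):
    """Multivariate Gaussian density N(x; mu, C) for each row of X."""
    L = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(C)
    norm = 1.0 / np.sqrt((2 * np.pi) ** L * np.linalg.det(C))
    return norm * np.exp(-0.5 * np.einsum('ij,jk,ik->i', diff, inv, diff))

def em_gmm(X, K, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    N, L = X.shape
    mu = X[rng.choice(N, K, replace=False)]   # means
    C = np.array([np.eye(L)] * K)             # covariance matrices
    phi = np.full(K, 1.0 / K)                 # priors
    for _ in range(n_iter):
        # E-step: posterior responsibilities gamma, as in (8).
        num = np.stack([phi[k] * gaussian_pdf(X, mu[k], C[k])
                        for k in range(K)], axis=1)
        gamma = num / np.maximum(num.sum(axis=1, keepdims=True), 1e-300)
        # M-step: closed-form updates (11)-(13).
        Nk = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            d = X - mu[k]
            C[k] = (gamma[:, k, None] * d).T @ d / Nk[k] + 1e-6 * np.eye(L)
        phi = Nk / N
    return gamma, mu, C, phi
```

By construction, each row of `gamma` sums to one, and the priors `phi` sum to one after every M-step, mirroring the constraint behind (13).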

III. FUSION OF THE GMM AND THE GRAPH CLUSTERING BY THE EM

In this section, we extend the EM formulation for the GMM with a novel weighted dissimilarity loss, as explained in the following.

A. Soft Version
In this section, we propose and analyze the "soft version" of our hybrid clustering algorithm. The same log-likelihood maximization framework as in Section II is used. First, let $w_{ij}$ denote the dissimilarity between samples $x_i$ and $x_j$. Here, we define a new concept called the "sum of weighted dissimilarities between the sample $x_i$ and cluster k" as follows.
Definition: The sum of weighted dissimilarities between the sample $x_i$ and cluster k at time (t) is defined as

$\Delta_i^k(t) = \sum_{j=1,\, j \neq i}^{N} w_{ij}\, \gamma_{z_j}^{(k)}(t-1)$ (14)

where $\gamma_{z_j}^{(k)}(t-1)$ represents the probability that the sample $x_j$ is assigned to cluster k (assuming that each mixture component stands for a cluster) in the Expectation (E)-step at time (t − 1).
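All the sums of weighted dissimilarities in (14) can be computed at once in matrix form. The following small sketch uses our own variable names: `W` is the N×N dissimilarity matrix and `Gamma` is the N×K matrix of posteriors from the previous E-step.

```python
import numpy as np

def weighted_dissimilarities(W, Gamma):
    """Delta[i, k] = sum over j != i of W[i, j] * Gamma[j, k], as in (14).

    W: (N, N) dissimilarity matrix (the j = i term is excluded below).
    Gamma: (N, K) posterior cluster probabilities from the previous E-step.
    """
    W = np.array(W, dtype=float, copy=True)
    np.fill_diagonal(W, 0.0)   # exclude the j = i term
    return W @ Gamma
```

With hard (one-hot) posteriors, each entry reduces to the sum of dissimilarities between $x_i$ and the samples currently in cluster k, which is the hard-version quantity used later in the article.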
We introduce $\alpha$ and $\Delta_i^k$ into the argument of the Gaussian distribution in the GMM and split the argument of each Gaussian distribution into two parts as follows:

$p(x_i \mid z_i = k; \theta) \propto \exp\!\Big(-\frac{1}{2}(1-\alpha)(x_i-\mu_k)^T C_k^{-1}(x_i-\mu_k) - \alpha\, \Delta_i^k\Big)$ (15)

where $0 \le \alpha \le 1$. If $\alpha = 0$ in (15), then the proposed prior is exactly equal to that of the standard GMM. Similar to (8), the posterior probability (after observing a specific sample $x_i$) that the sample $x_i$ has been drawn from mixture component (cluster) k of the proposed hybrid clustering is obtained by applying the Bayes rule to (15)

$\gamma_{z_i}^{(k)} = \frac{\varphi_k\, p(x_i \mid z_i = k; \theta)}{\sum_{j=1}^{K} \varphi_j\, p(x_i \mid z_i = j; \theta)}.$ (16)

In other words, assuming that each mixture component represents a cluster, the posterior $\gamma_{z_i}^{(k)}$ represents the probability that this specific sample $x_i$ is assigned to the cluster (mixture component) k for a given parameter set $\theta$. The goal of our proposed hybrid model is to maximize the log-likelihood in (1), i.e., $\max L(\theta) = \sum_{i=1}^{N} \ln p(x_i; \theta)$. As in the GMM case in (9), we maximize its lower bound for the proposed hybrid model, denoted as $Q_{LB}^{H}(\mu_k, C_k, \varphi_k, \alpha)$, where $\varphi_k = p(z_i = k; \theta)$ and $\sum_{j=1}^{K} \varphi_j = 1$. Applying the Lagrange multipliers' optimization approach and following the derivation of the standard GMM yield the same optimum parameters in (11)–(13) for the M-step simply because the terms $\Delta_i^k$ and $\alpha$ are constants as far as $\mu_k$, $C_k$, and $\varphi_k$ are concerned. The main difference between the standard GMM and the proposed hybrid model is in determining the posterior probability $\gamma_{z_i}^{(k)} = p(z_i = k \mid x_i; \theta)$, as seen by comparing (8) and (16).
The posterior $\gamma_{z_i}^{(k)}$ shows the probability that a specific sample $x_i$ has been drawn from mixture component (cluster) k after observing the sample $x_i$. It is obvious that the value of $\gamma_{z_i}^{(k)}$ plays a critical role in the clustering performance of either of the two models.
In order to illustrate how the prior and posterior probabilities of the proposed hybrid clustering are completely changed as compared to the standard GMM case, we compare an arbitrary snapshot of a two-cluster example in Fig. 1. In Fig. 1(a), the standard GMM clearly gives $\gamma_{z_i}^{(2)} > \gamma_{z_i}^{(1)}$. For the "soft" GMM, the probability that the sample $x_i = 3.65$ is assigned to cluster 2 is much higher than that of cluster 1. Therefore, for the "hard" GMM, the sample $x_i = 3.65$ is assigned to cluster 2 at this step. Thus, the "winning cluster" of the standard GMM is cluster 2 in this case, and any other choice is not possible. However, for the proposed hybrid algorithm, the "winning cluster" can be either of the two clusters: the sample $x_i = 3.65$ in Fig. 1 may be assigned to cluster 2 or cluster 1 depending on the values of $\alpha$, $\Delta_i^1$, and $\Delta_i^2$. As an example, if $\Delta_i^1 = 1$ and $\Delta_i^2 = 4$, then the "winning cluster" is 1, i.e., the proposed hybrid algorithm assigns the same sample $x_i = 3.65$ to cluster 1, as shown in Fig. 1(b), which is the opposite of the GMM.
In summary, for the snapshot in Fig. 1, while the standard GMM dictates that the "winning" cluster is 2, the proposed hybrid algorithm allows cluster 1 to be the "winner" depending on $\alpha$ and $\Delta_i^1$ in the E-step, as seen in Fig. 1(b).
The soft version of the proposed hybrid clustering is summarized in Table III.

B. Hard Version
In order to convert a soft clustering into a hard clustering, each data sample is often assigned, with probability 1, to the Gaussian distribution to which it most likely belongs. Here, we will follow the same approach. First, we calculate the sum of weighted dissimilarities $\Delta_i^k(t)$ between the sample $x_i$ and cluster k using (14). Then, let us define

$\beta_i^{(j)}(t) = \varphi_j \exp\!\Big(-\frac{1}{2}(1-\alpha)(x_i-\mu_j)^T C_j^{-1}(x_i-\mu_j) - \alpha\, \Delta_i^j(t)\Big)$ (18)

where j = 1, 2, ..., K. Then, the posterior probability in (16) in the hard clustering turns into the following:

$\gamma_{z_i}^{(k)}(t) = \begin{cases} 1, & k = \arg\max_j \beta_i^{(j)}(t) \\ 0, & \text{otherwise} \end{cases}$ (19)

where k = 1, 2, ..., K. Thus, in the hard version, a specific data sample $x_i$ is assigned to only one cluster with probability 1 for a given parameter set $\{\mu_k, C_k, \varphi_k, \alpha\}$ during the E-step. Thus, (19) yields K solid sets

$S_j(t) = \{\, i : \gamma_{z_i}^{(j)}(t) = 1 \,\}$ (20)

each of which includes the indices of those data samples that are assigned to cluster j with probability 1 at time t in the E-step, where j = 1, 2, ..., K. Note that $\gamma_{z_i}^{(j)} = 1$ denotes that the data sample $x_i$ is assigned to cluster (component) j with probability 1 (in the hard version) in the corresponding E-step, and $\sum_{j=1}^{K} \gamma_{z_i}^{(j)} = 1$ for any data sample $x_i$. Writing (19) and (20) into (14), the dissimilarity $\Delta_i^k(t)$ in (14) turns into the following equation:

$\Delta_i^k(t) = \sum_{j \in S_k(t-1),\, j \neq i} w_{ij}$ (21)

where $S_k(t-1)$ is the set of indices of data samples that were assigned to cluster k in the previous step at time (t − 1). On the other hand, in the M-step, we maximize the lower bound $Q_{LB}^{H}(\mu_k, C_k, \varphi_k, \alpha)$. The optimum parameters $\mu_k$, $C_k$, and $\varphi_k$ in the M-step of the hard version are found by writing (19)–(21) into (11)–(13), which yields

$\mu_k = \frac{1}{N_k} \sum_{i \in S_k(t)} x_i$ (22)

$C_k = \frac{1}{N_k} \sum_{i \in S_k(t)} (x_i - \mu_k)(x_i - \mu_k)^T$ (23)

$\varphi_k = \frac{N_k}{N}$ (24)

where $N_k$ represents the number of data samples assigned to cluster k with probability 1 in the E-step at time t. The hard version of the proposed hybrid clustering is summarized in Table IV.
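One pass of this hard E-step, followed by the centroid update of the M-step, can be sketched as follows. This is our own simplified rendering under the special case used later in the article ($C_k$ fixed to the identity and $\varphi_k = 1/K$, with $w_{ij} = \|x_i - x_j\|^2$), where maximizing $\beta_i^{(j)}$ reduces to minimizing a hybrid of squared distance and $\Delta_i^j$.

```python
import numpy as np

def hard_e_step(X, centroids, labels_prev, alpha):
    """Hard E-step: reassign each sample to the cluster minimizing the
    hybrid of squared distance d and dissimilarity sum Delta, assuming
    identity covariances and uniform priors, with w_ij = ||x_i - x_j||^2."""
    K = len(centroids)
    # Squared distances d to the current centroids.
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    # Pairwise dissimilarities w_ij = ||x_i - x_j||^2, zero diagonal.
    W = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(W, 0.0)
    # Delta[i, k]: dissimilarity of x_i to the samples currently in cluster k.
    onehot = np.eye(K)[labels_prev]   # hard posteriors, as in (19)-(20)
    Delta = W @ onehot
    tau = (1 - alpha) * d + alpha * Delta
    return tau.argmin(axis=1)

def m_step(X, labels, K):
    """M-step (22): each centroid is the mean of its assigned samples."""
    return np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                     else X.mean(axis=0) for k in range(K)])
```

Setting `alpha=0` recovers a plain nearest-centroid assignment; setting `alpha=1` ignores the centroids and assigns each sample to the least-dissimilar cluster.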

IV. H-NAC: HYBRID-NONGREEDY ASYNCHRONOUS CLUSTERING
It is well known that, if the covariance matrix in (12) is chosen as the unit matrix, then the standard hard GMM yields the standard k-means clustering (see [24]). Because the soft version of the proposed hybrid clustering in Table III is reduced to the standard GMM by setting α = 0, it follows that, if the covariance matrices in (23) are chosen as unit matrices and α = 0, then the hard version of the proposed hybrid algorithm in Table IV results in the standard k-means clustering. If 0 < α < 1, then the proposed clustering in Table III (soft version) and Table IV (hard version) is a hybrid algorithm. In order to analyze the proposed algorithm, we first need to summarize some definitions from graph theory [15].
Definition: Cut of a Simple Graph: A cut means a partition of the vertices of the graph into two sets.
Definition: Size of a Cut: The size of a cut is the sum of the weights of the edges crossing between the two sets of the cut.
Definition: Max-Cut of a Graph: A maximum cut is a cut whose size is not smaller than the size of any other cut.
Definition: k-Cut of a Graph: A k-cut is defined as the sum of the weights of the edges crossing between the k groups of a partition of the vertices into k sets.
Definition: Max k-Cut of a Graph: A max k-cut is a k-cut whose size is not smaller than the size of any other k-cut.
Definition: The dissimilarity between any two data samples $x_i$ and $x_j$, represented as $w_{ij}$, represents how dissimilar they are, and it can be defined by different measures, such as spatial distance or correlation. Let $f(x_i, x_j)$ represent the dissimilarity function between samples $x_i$ and $x_j$. Thus, $w_{ij} = f(x_i, x_j)$.

Definition: $I_k$: The sum of intracluster dissimilarities for cluster k, k = 1, 2, ..., K, is defined as

$I_k(t) = \sum_{i \in S_k(t)} \sum_{j \in S_k(t)} w_{ij}$ (25)

where $w_{ij} \ge 0$ represents the dissimilarity between samples $x_i$ and $x_j$ (with $w_{ii} = 0$).
Definition: vol(W): The volume of a matrix W, where $w_{ij} \ge 0$, is defined as

$\mathrm{vol}(W) = \sum_{i=1}^{N} \sum_{j=1}^{N} w_{ij}.$ (26)

Definition: $Q_G(t)$: The sum of intracluster dissimilarities over all clusters is defined as follows:

$Q_G(t) = \sum_{k=1}^{K} I_k(t).$ (27)

Definition: $Q_{km}(t)$: Sum of Intracluster Error Squares: Given a dataset $\{x_i\}_{i=1}^{N} \in \mathbb{R}^L$ in a deterministic setting, the sum of intracluster error squares is given by

$Q_{km}(t) = \sum_{k=1}^{K} \sum_{i \in S_k(t)} \|x_i - c_k(t)\|_2^2$ (28)

where the set $S_k(t)$ represents the indices of data samples in cluster k at time t and $c_k(t)$ is the centroid of cluster k. Note that the k-means clustering (Lloyd's algorithm) minimizes the sum of intracluster error squares (i.e., the variances) in (28). Now, we are ready to present one of our findings. In this section, without loss of generality and for the sake of simplicity, we define the dissimilarity $w_{ij}$ between arbitrary samples $x_i$ and $x_j$ as

$w_{ij} = \|x_i - x_j\|_2^2.$ (29)

We present our hybrid clustering algorithm called H-NAC as follows.

Definition: $Q_H(t)$: The H-NAC cost function at time t is defined as follows:

$Q_H(t) = (1-\alpha)\, Q_{km}(t) + \alpha\, Q_G(t)$ (30)

where $0 \le \alpha \le 1$, $Q_{km}(t)$ is the k-means cost function (quantization error) defined by (28), and $Q_G(t)$ is the sum of intracluster dissimilarities defined by (27). Let $N_k$ represent the number of samples in cluster k at a particular time. Using $\|x_i - c_k\|_2^2 = \|x_i\|_2^2 + \|c_k\|_2^2 - 2\langle x_i, c_k\rangle$ and $c_k = (1/N_k)\sum_{i \in S_k} x_i$, it is straightforward to obtain the following:

$\sum_{i \in S_k} \|x_i - c_k\|_2^2 = \sum_{i \in S_k} \|x_i\|_2^2 - N_k \|c_k\|_2^2$ (31)

$\frac{1}{2N_k} \sum_{i \in S_k} \sum_{j \in S_k} \|x_i - x_j\|_2^2 = \sum_{i \in S_k} \|x_i\|_2^2 - N_k \|c_k\|_2^2.$ (32)

Using (31) and (32) in (30) yields the following H-NAC cost function:

$Q_H(t) = \sum_{k=1}^{K} \Big(\frac{1-\alpha}{2N_k} + \alpha\Big) \sum_{i \in S_k(t)} \sum_{j \in S_k(t)} w_{ij}.$ (33)

Proposition 1: Given a data sample set $\{x_i\}_{i=1}^{N} \in \mathbb{R}^L$ in a deterministic setting and the cluster number K, if:
1) $0 \le \alpha \le 1$;
2) the prior is chosen as $\varphi_j = 1/K$;
3) the covariance matrix $C_j$ is chosen as the identity matrix $I_L$; and
4) the dissimilarity $w_{ij}$ is chosen as in (29), then the hard version of the proposed hybrid clustering in Table IV locally minimizes the hybrid cost function $Q_H(t)$ in (30).

[Note that the H-NAC cost function $Q_H(t)$ is given by (33) if $w_{ij}$ is defined by (29).]
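The cost in (30) and its pairwise form (33) can be cross-checked numerically. The following small sketch uses our own function names, with $w_{ij} = \|x_i - x_j\|^2$ as in (29):

```python
import numpy as np

def hybrid_cost(X, labels, alpha, K):
    """Q_H = (1 - alpha) * Q_km + alpha * Q_G, as in (30), with (29)."""
    W = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    q_km, q_g = 0.0, 0.0
    for k in range(K):
        idx = np.flatnonzero(labels == k)
        if idx.size == 0:
            continue
        c_k = X[idx].mean(axis=0)
        q_km += ((X[idx] - c_k) ** 2).sum()     # quantization error, (28)
        q_g += W[np.ix_(idx, idx)].sum()        # intracluster dissimilarities, (27)
    return (1 - alpha) * q_km + alpha * q_g

def hybrid_cost_pairwise(X, labels, alpha, K):
    """Equivalent pairwise form (33): per-cluster weight ((1-alpha)/(2 N_k) + alpha)."""
    W = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    total = 0.0
    for k in range(K):
        idx = np.flatnonzero(labels == k)
        if idx.size:
            total += ((1 - alpha) / (2 * idx.size) + alpha) * W[np.ix_(idx, idx)].sum()
    return total
```

The two functions agree for any partition and any $\alpha$, which is exactly the identity obtained from (31) and (32).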

Proof:
1) The E-step of the hard version in Table IV assigns each data sample by (19) and (20) at time t. Thus, $S_j(t)$ becomes the set of indices of data samples that are assigned to cluster j with probability 1 at time t. From (19), (20), and the definitions above, we write the sum of intracluster weights as

$\mathrm{vol}(W) = \sum_{k=1}^{K} \Big( I_k(t) + \sum_{i \in S_k(t)} \sum_{j \in \bar{S}_k(t)} w_{ij} \Big)$ (34)

where $\bar{S}_k(t)$ represents the set of indices of all other data samples except those in cluster k, and $I_k(t)$ is the sum of intracluster dissimilarities defined by (25). Let us denote the "winner cluster index" for data sample $x_i$ at time (t + 1) as $z_i^w$ so that

$\gamma_{z_i}^{(w)}(t+1) = 1.$ (35)

Furthermore, let us denote the "winner cluster index" for data sample $x_i$ at the previous time (t) as $z_i^p$, which means that

$\gamma_{z_i}^{(p)}(t) = 1.$ (36)

From (19), (20), (35), and (36), we have

$i \in S_{z_i^p}(t) \quad \text{and} \quad i \in S_{z_i^w}(t+1)$ (37)

i.e., the sample $x_i$ is moved from cluster $z_i^p$ to cluster $z_i^w$ at time (t + 1). Let $d_i^k(t)$ represent the squared distance between sample $x_i$ and centroid $c_k$ at time t

$d_i^k(t) = \|x_i - c_k(t)\|_2^2$ (38)

where $k \in \{1, 2, \ldots, K\}$. We also define the following "hybrid dissimilarity" $\tau_i^k(t)$ between the sample $x_i$ and cluster k at time t as follows:

$\tau_i^k(t) = (1-\alpha_t)\, d_i^k(t) + \alpha_t\, \Delta_i^k(t)$ (39)

where $d_i^k$ is given by (38), $\Delta_i^k$ is in (21), and $0 \le \alpha_t \le 1$. From (15), (18)–(20), (35), and (39), we obtain

$z_i^w = \arg\min_k \tau_i^k(t).$ (40)

In other words, (40) is obtained because of the fact that the smallest $\tau_i^k(t)$ in (39) yields the highest $\beta_i^{(k)}(t)$ in (18) if $\varphi_j = 1/K$ is the same for each cluster. Using (27), (28), (39), and (40), the E-step by (19) and (20) results in the following H-NAC cost function in (30) at time (t) and time (t + 1), respectively:

$Q_H(t) = (1-\alpha_t)\big(d_i^{z_i^p}(t) + Q_{km,i}^{\mathrm{others}}(t)\big) + \alpha_t\big(\Delta_i^{z_i^p}(t) + Q_{G,i}^{\mathrm{others}}(t)\big)$ (41)

$Q_H(t+1) = (1-\alpha_t)\big(d_i^{z_i^w}(t) + Q_{km,i}^{\mathrm{others}}(t+1)\big) + \alpha_t\big(\Delta_i^{z_i^w}(t) + Q_{G,i}^{\mathrm{others}}(t+1)\big)$ (42)
where $\Delta_i^k$ is given by (21), $Q_{G,i}^{\mathrm{others}}$ represents the sum of all terms without $x_i$ in (27), $d_i^k(t)$ is given by (38), and $Q_{km,i}^{\mathrm{others}}$ represents the sum of all terms without $x_i$ in (28). Using (40)–(42) and the facts that $Q_{G,i}^{\mathrm{others}}(t) = Q_{G,i}^{\mathrm{others}}(t+1)$ and $Q_{km,i}^{\mathrm{others}}(t) = Q_{km,i}^{\mathrm{others}}(t+1)$, we conclude that

$Q_H(t+1) \le Q_H(t).$ (43)

In the M-step, keeping the posterior $\gamma_{z_i}^{(k)}(t)$ constant as found in the E-step, we update the optimum H-NAC cluster centroids $\mu_k$ by (22). Applying the E-step and the M-step in the next time step, $Q_H(t)$ is further decreased due to (43). Because the global minimum of the H-NAC cost function $Q_H(t)$ in (33) is finite, the H-NAC algorithm converges to a local minimum after a finite number of steps. This completes the proof.
The H-NAC algorithm in Proposition 1 (i.e., $0 \le \alpha_t \le 1$, Table IV) is simplified in Table V. In the H-NAC in Table V, the update of centroid $\mu_k$ can be done either after every $q_i(z_i)$ is calculated by (19) (i.e., sample-based update) or after all the $\{q_i(z_i)\}_{i=1}^{N}$ are calculated (i.e., epoch-based update). Here, we denote the latter one as H-NAC-e.

Proposition 2: The H-NAC algorithm in Proposition 1 (i.e., $0 \le \alpha_t \le 1$, $\varphi_k = 1/K$, $C_k = I_L$, and $w_{ij} = \|x_i - x_j\|_2^2$) includes the k-means clustering (Lloyd algorithm [5], [6]) in computer science, the standard HNN in electrical engineering, the basic GADIA [2] in wireless communications, and the Ising model in statistical mechanics as its special cases. To be precise, the H-NAC in Table V (in Proposition 1) is reduced to:
1) the k-means clustering (Lloyd algorithm [5], [6]) if α = 0;
2) the basic GADIA in [2], maximizing the K-cut of the dissimilarity matrix W (graph clustering), if α = 1;
3) the HNN [7], [8] and, equivalently, Bruck's Ln algorithm in [1], if K = 2 and α = 1;
4) the Ising model in statistical mechanics if K = 2 and α = 1.
Proof:
1) Taking α = 0 in (39) and (40) yields the "winning cluster index" for the data sample $x_i$ in the E-step as

$z_i^w = \arg\min_k d_i^k(t) = \arg\min_k \|x_i - c_k(t)\|_2^2.$ (44)

The E-step in (44) and the M-step (22) finding the cluster centroid $\mu_k$ yield the standard k-means clustering.
2) By definition, the K-cut of matrix W is equal to

$K\text{-}\mathrm{cut}(W) = \mathrm{vol}(W) - Q_G(t).$ (45)

Taking α = 1 in (39) and (40) yields $\min Q_H(t) = \min Q_G(t)$ and the "winning cluster index" for the data sample $x_i$ in the E-step as

$z_i^w = \arg\min_k \Delta_i^k(t).$ (46)

Using (21), (27), (30), (34), and (46), and the fact that vol(W) is constant, we conclude that, in the light of Corollary 2 and Proposition 6 of [9], the K-cut(W) in (45) strictly increases each time an arbitrary $x_i$ is reassigned from $z_i^p$ to $z_i^w$ by (46) in the E-step of the H-NAC. Because the maximum of K-cut(W) is finite, the H-NAC algorithm converges to a local maximum of the K-cut(W) after a finite number of steps. Comparing the cluster update of the H-NAC algorithm in (46) with the channel allocation update of the basic version of the pioneering algorithm GADIA in [2], we observe that they are equivalent: while $\Delta_i^k$ represents the sum of pairwise dissimilarities between sample $x_i$ and cluster k in the proposed H-NAC, it corresponds to the sum of cochannel interference experienced by transmitter i in channel k in the basic GADIA in [2]; the quantity minimized by the basic GADIA corresponds to the sum of intracluster dissimilarities $I_k(t)$ defined in (25) above. This shows that the H-NAC algorithm in Table V is equivalent to the basic GADIA in [2] (although their problems, motivations, and formulations are completely different) if α = 1.
3) Bruck [1] and Uykan [9] further investigate the working principle of the HNN and show that maximizing the K-cut(W) in (45) is equivalent to minimizing the HNN's energy (Lyapunov) function

$E_{\mathrm{HNN}} = -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} w_{ij}\, y_i y_j - \sum_{i=1}^{N} b_i y_i$ (47)

where $y_i \in \{-1, +1\}$ is the state of neuron i, $b_i$ is the bias of neuron i, i = 1, 2, ..., N, and $w_{ij}$ is defined in (29).
Uykan [9] proves that the HNN [7], [8] is a special case of the basic GADIA in [2]. This shows that the H-NAC algorithm in Table V is equivalent to the HNN and, thus, to Bruck's Ln algorithm in [1] if α = 1 and K = 2. 4) It is well-known that the HNN's energy function in (47) corresponds to the Hamiltonian of the classical Ising model in statistical mechanics. The Ising model is a mathematical model of ferromagnetism in classical statistical mechanics and consists of discrete variables that can be in one of two states {−1, +1}, representing the magnetic dipole moments of atomic spins. Proposition 2 above and (47) imply that the H-NAC includes the Ising model as its special case, which completes the proof. The findings in Proposition 2 are summarized in Table VI (comparison of the H-NAC with the k-means clustering, HNNs, and the basic GADIA). We extend Bruck's Ln algorithm [1] from the two-cluster case to an arbitrary k-cluster case in Appendix A; the extension is called the k-Ln algorithm in this article. A connection between the k-Ln algorithm and the spectral clustering formulation, obtained by relaxing the discrete-solution constraint, is presented in Appendix B. Furthermore, the extended k-Ln clustering algorithm turns out to be equivalent to the basic version of the pioneering algorithm GADIA of Babadi and Tarokh [2]. We refer to the basic GADIA in [2], when applied to the data clustering problem, as the k-GADIA.
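As a concrete illustration of Proposition 2, the hybrid E-step can be sketched in a few lines. The sketch below is ours, not the article's pseudocode: the function name `hybrid_assign` and the array layout are assumptions, and w_ij is taken as the squared Euclidean dissimilarity, as in Proposition 2.

```python
import numpy as np

def hybrid_assign(i, X, centroids, labels, alpha):
    """Winning-cluster index for sample i under the hybrid dissimilarity.

    tau_i^k = (1 - alpha) * ||x_i - mu_k||^2 + alpha * sum_{j in S_k} w_ij,
    with w_ij = ||x_i - x_j||^2. Setting alpha = 0 recovers the k-means
    E-step; alpha = 1 recovers the basic GADIA / k-Ln graph update.
    """
    K = centroids.shape[0]
    # Centroid-based term d_i^k for every cluster k
    d = np.sum((centroids - X[i]) ** 2, axis=1)
    # Graph term: summed pairwise dissimilarities to the members of each cluster
    w_i = np.sum((X - X[i]) ** 2, axis=1)  # w_ij for all j
    g = np.array([w_i[(labels == k) & (np.arange(len(X)) != i)].sum()
                  for k in range(K)])
    tau = (1.0 - alpha) * d + alpha * g
    return int(np.argmin(tau))
```

With alpha = 0, the assignment depends only on the nearest centroid; with alpha = 1, the centroids are ignored entirely and only the pairwise dissimilarities decide the winning cluster.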

A. Increasing/Decreasing the Parameter α t
The parameter α_t (0 ≤ α_t ≤ 1) determines how much the proposed hybrid algorithm behaves like a centroid-based clustering algorithm (the k-means) and how much it behaves like a graph clustering algorithm (the k-Ln). The smaller α_t is, the more it resembles the k-means and the less it resembles the k-Ln, and vice versa. In the proposed H-NAC in Table V, the parameter α is a constant. It is straightforward to show that the parameter α_t at time t (0 ≤ α_t ≤ 1) may gradually be increased all the way up to 1 or, alternatively, gradually decreased all the way down to 0 as follows: let ϵ > 0 be a small positive number. 1) Gradually increasing α_t to 1 (i.e., gradually turning the H-NAC into a graph-clustering algorithm, such as the basic k-GADIA): if the centroid-based dissimilarity term exceeds the graph-based term for the current sample, then α_t is gradually increased, i.e., α_{t+1} = min{α_t + ϵ, 1}; otherwise, α_{t+1} = α_t. This update satisfies Q_H(t + 1) ≤ Q_H(t). The case of gradually decreasing α_t to 0 is analogous.
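The schedule above can be sketched as follows. This is a minimal sketch under our own naming (`update_alpha`, `eps`); the trigger condition is written abstractly as "the centroid-based term exceeds the graph-based term for the current sample."

```python
def update_alpha(alpha, centroid_term, graph_term, eps=0.1):
    """One step of the schedule driving alpha_t toward 1 (pure graph clustering).

    eps is the small positive increment; the step is taken only when the
    centroid-based dissimilarity term exceeds the graph-based term for the
    current sample, which keeps the hybrid cost Q_H non-increasing.
    """
    if centroid_term > graph_term:
        return min(alpha + eps, 1.0)  # clamp at 1
    return alpha                      # otherwise leave alpha unchanged
```

The decreasing schedule (toward a pure centroid-based algorithm) mirrors this with alpha_{t+1} = max{alpha_t - eps, 0} under the reversed condition.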

B. How to Choose the Parameter α
If the parameter α of the H-NAC is kept constant, then, as a rule of thumb, α is first chosen on the order of 1/(N/K), where N is the number of data points and K is the number of clusters, as will be seen in most of the simulations in Section V. The reason is the following: the assignment of x_i to cluster k depends on the "hybrid dissimilarity" τ_i^k(t) in (39), which consists of d_i^k in (38) (the centroid-based clustering part) and the graph term in (21) (the graph clustering part). Because d_i^k is between the data sample x_i and the centroid of cluster k, while the graph term is between the data sample x_i and all other data samples in cluster k, very roughly speaking, the magnitude of the graph term is on the order of (N/K) times the magnitude of d_i^k, assuming that the number of samples in cluster k is around N/K. In other words, in the hybrid dissimilarity τ_i^k, the two terms "compete" with each other; however, the graph term is, roughly speaking, about (N/K) times larger than d_i^k. In order to balance these two terms in τ_i^k, as a rule of thumb, α is first chosen on the order of 1/(N/K). This choice of α gives priority to neither the centroid-based clustering part nor the graph clustering part. Otherwise, if α ≫ 1/(N/K), then the H-NAC behaves more like a graph clustering algorithm, and if α ≪ 1/(N/K), then it behaves more like a centroid-based clustering algorithm.
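The rule of thumb above amounts to starting from α ≈ K/N and scanning a handful of multiples of it, as done in the simulations of Section V. A minimal sketch (the function name `alpha_candidates` is ours):

```python
def alpha_candidates(N, K):
    """Candidate alpha values around the rule-of-thumb 1/(N/K) = K/N.

    The graph term is roughly (N/K) times larger than the centroid term,
    so alpha ~ K/N balances the two; the last candidate (alpha = 1)
    corresponds to pure graph clustering.
    """
    n_k = N / K  # rough number of samples per cluster
    return [m / n_k for m in (0, 0.5, 1, 2, 3, 4, 5, n_k)]
```

For example, with N = 300 samples and K = 3 clusters, the balanced starting point is alpha = 0.01, and the grid spans from 0 (pure k-means) to 1 (pure graph clustering).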

C. Discussion About the H-NAC, the k-Means, and Graph Clustering
As explained in Sections II and III, the proposed H-NAC is a hybrid clustering method that touches both the k-means and the k-GADIA (graph clustering) simultaneously. Our starting point in Section II was the EM formulation as applied to the GMM, whose argument has been extended by the dissimilarities of the samples (in the feature space). Unlike spectral clustering and most spectral-clustering-based graph clustering methods, the proposed H-NAC does not calculate any eigenvector, nor does it apply an eigendecomposition to a Laplacian matrix (although the k-Ln and the k-GADIA are also related to the spectral clustering formulation, as shown in Appendix B). Instead, the H-NAC is implemented like the k-means clustering while also taking the pairwise dissimilarities into account at the cost of an additional computational burden. Because the dissimilarities are precalculated offline and are read from a lookup table, the H-NAC's computational cost is only slightly higher than that of the k-means clustering. Therefore, as far as the implementation is concerned, it would be fair to say that the H-NAC resembles the k-means more than any graph clustering algorithm. That is why we also provided a detailed literature review of the k-means clustering in Section I. It is well-known that the k-means minimizes the intracluster distances (dissimilarities), hoping that the intercluster distances (dissimilarities) are also maximized. In other words, the k-means does not explicitly take the intercluster dissimilarities into account during optimization. On the other hand, the proposed H-NAC takes both into account simultaneously: it minimizes the sum of intracluster distances (dissimilarities) and maximizes the sum of intercluster dissimilarities during optimization [see Corollary 1 in Appendix A and (51) in Appendix B]. What is more, the H-NAC also performs graph clustering due to the weighted dissimilarity loss added to the argument.
As explained in Section II, we do not impose any assumption on how the dissimilarity between samples (features) is defined. For the sake of brevity and clarity, we define the dissimilarity as the squared Euclidean norm in this article.
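The offline lookup table mentioned above can be precomputed in vectorized form. A minimal sketch assuming the squared-Euclidean dissimilarity used in this article (the function name `dissimilarity_table` is ours):

```python
import numpy as np

def dissimilarity_table(X):
    """Precompute the pairwise dissimilarity (lookup) table W offline,
    using the squared Euclidean norm w_ij = ||x_i - x_j||_2^2.

    Uses the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b to avoid
    forming all pairwise difference vectors explicitly.
    """
    sq = np.sum(X ** 2, axis=1)
    W = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(W, 0.0)
    return np.maximum(W, 0.0)  # guard against tiny negatives from roundoff
```

Any other user-chosen dissimilarity would simply replace the body of this function; the rest of the algorithm only reads entries of W.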

V. SIMULATION RESULTS
In this section, we compare the clustering performance of the proposed H-NAC with those of the k-means, the k-GADIA, spectral clustering, and the SGL in [46], which represents one of the state-of-the-art clustering algorithms. The reason why we have chosen the SGL in [46] as a reference algorithm is simply that the SGL [46] has shown superior performance compared to many state-of-the-art clustering methods, such as the accelerated low-rank representation (ALRR) published in 2018 [72], the K-multiple-means (KMM) in 2019 [73], the efficient sparse subspace clustering (ESSC) algorithm in 2020 [74], the fast normalized cut (FNC) (a normalized-cut-based clustering method) in 2018 [75], and the sparse subspace clustering-orthogonal matching pursuit (SSC-OMP) (a popular sparse subspace clustering (SSC) algorithm) [76]. In [46], the SGL's multiview version (MSGL) is also compared to the state-of-the-art multiview clustering algorithms recently published in [77]–[80]. Furthermore, we also examine the performance of the traditional spectral clustering [53], [54] algorithm in our simulation campaigns for comparison purposes. All the algorithms (the k-means, the H-NAC, spectral clustering, and the SGL [46]) run under the same initial conditions using the same features for the sake of fairness. Toy examples are presented in the supplementary material.
Here, we use the clustering accuracy (ACC), the normalized mutual information (NMI), and the purity to evaluate the clustering performance, as in many other articles (e.g., [46]). We conduct all our experiments in MATLAB R2016a on a computer with a 2-GHz Intel Core i7 CPU and 16 GB of RAM.
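For reference, ACC and purity can be computed as follows. This is a minimal NumPy sketch (names ours); the brute-force label matching in `clustering_accuracy` is our simplification for small K (in practice, the Hungarian algorithm is used), and NMI is available in standard libraries.

```python
import numpy as np
from itertools import permutations

def purity(y_true, y_pred):
    """Purity: each predicted cluster is credited with its majority
    ground-truth class; the matched counts are summed and normalized."""
    total = 0
    for c in np.unique(y_pred):
        _, counts = np.unique(y_true[y_pred == c], return_counts=True)
        total += counts.max()
    return total / len(y_true)

def clustering_accuracy(y_true, y_pred):
    """ACC: best accuracy over all relabelings of the predicted clusters
    (brute force over permutations; feasible only for small K)."""
    classes = np.unique(y_true)
    clusters = np.unique(y_pred)
    best = 0.0
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        relabeled = np.array([mapping[c] for c in y_pred])
        best = max(best, float(np.mean(relabeled == y_true)))
    return best
```

Both metrics are invariant to how the clusters are numbered, which is why a label-matching step is needed before comparing against the ground truth.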

A. Example 1: 20NewsGroup Dataset
The k-means clustering has been widely applied to text document clustering, which is an important and challenging problem (see [29] and [30]). We run a large simulation campaign using the well-known 20Newsgroups dataset [20].
Each document is represented as a 26 214-D normalized tf-idf [21], [22] sparse vector [20]. The best parameter α is chosen over the following range as a rule of thumb in all the simulations below, without any further tuning or optimization whatsoever: α ∈ (1/N_k){0, 0.5, 1, 2, 3, 4, 5, N_k}, where N_k is the number of samples per cluster. The initial conditions are always the same for all clustering algorithms (the k-means, the basic GADIA (i.e., the k-Ln algorithm), the spectral clustering [53], and the SGL [46]) for the sake of fairness. The clustering performance (ACC, NMI, and purity) of the algorithms is shown in Fig. 2 for a subset of the 20NewsGroup dataset with three clusters and 100 samples in each cluster and in Fig. 4 for the 20NewsGroup dataset with 20 clusters and 100 samples in each cluster, respectively. The evolution of the H-NAC cost function Q_H(t) in (33) for the three-cluster case is shown in Fig. 3, which confirms Proposition 1. As seen from Figs. 2 and 4, the best performance in terms of the ACC, NMI, and purity is achieved by the proposed H-NAC.

B. Example 2: MNIST Dataset
Each sample in the well-known Modified National Institute of Standards and Technology (MNIST) database of handwritten digits has 784 features. The number of clusters is 10, and we take 300 samples from each cluster. In Fig. 5, we examine how sensitive the clustering performance is to the parameter α over α ∈ (1/300){0, 0.5, 1, 2, 3, 4, 5, 300}. The figure shows that the best performance of the H-NAC is very stable over a large range of α values for the MNIST dataset. From Fig. 5, the H-NAC parameter α is chosen as α = 0.01. The clustering performance (ACC, NMI, and purity) of the algorithms for the MNIST dataset (with ten clusters and 300 samples in each cluster) is shown in Fig. 6. The best performance of the SGL [46] is obtained with m = 50, β = 0.01, and α = 0.1. The evolution of the H-NAC cost function Q_H(t) in (33) is shown in Fig. 7, which confirms Proposition 1. As seen from Fig. 6, the best performance in terms of the ACC, NMI, and purity is achieved by the proposed H-NAC for this subset of the MNIST dataset.

C. Example 3: TR45 Dataset
Each sample in the well-known TR45 dataset has 8261 features. The number of clusters is 10, and we take 14 samples from each cluster because only 14 samples exist in one of the clusters of the TR45 dataset (in this article, we assume that the samples are distributed to the clusters evenly). In Fig. 8, we examine how sensitive the clustering performance is to the parameter α for this dataset. On the other hand, the best performance of the SGL [46] is obtained with m = 50, β = 1, and α = 10 for this particular dataset.
The clustering performance (ACC, NMI, and purity) of the algorithms for the TR45 dataset (with ten clusters and 14 samples in each cluster) is shown in Fig. 9. The evolution of the H-NAC cost function Q H (t) in (33) is shown in Fig. 10, which confirms Proposition 1. As seen from Fig. 9, the best performance is achieved by the proposed H-NAC in terms of the ACC, NMI, and purity for the subset of the TR45 dataset.

D. Example 4: Caltech Dataset
The Caltech-7 is a multiview object recognition dataset. Because we focus on single-view clustering in this article, we take only one view, namely, the Gabor features of dimension 48 (we call it the Caltech-Gabor dataset). The number of clusters is 7, and we take 34 samples from each cluster because only 34 samples exist in one of the clusters of this single-view Caltech-Gabor dataset (in this article, we assume that the samples are distributed to the clusters evenly). The best H-NAC performance is obtained with α = 0.03 out of eight different values. On the other hand, the best performance of the SGL [46] is obtained with m = 25, β = 0.01, and α = 10 out of 24 different combinations.
The clustering performance (ACC, NMI, and purity) of the algorithms for the Caltech-Gabor dataset (with seven clusters and 34 samples in each cluster) is shown in Fig. 11. The evolution of the H-NAC cost function Q_H(t) in (33) is shown in Fig. 12, which confirms Proposition 1. As seen from Fig. 11, the proposed H-NAC gives performance comparable to the SGL [46] and better than the other algorithms for this subset of the Caltech-Gabor dataset.

VI. CONCLUSION AND FUTURE WORK
In this section, we summarize the main conclusions of our article as follows.
1) This article extends the EM formulation for the GMM with a novel weighted dissimilarity loss. This extension results in the fusion of two different clustering methods, namely, centroid-based clustering and graph clustering, yielding a simple "soft" asynchronous hybrid clustering method. The proposed algorithm may start as a pure centroid-based clustering algorithm (e.g., k-means) and, as time evolves, may eventually and gradually turn into a pure graph clustering algorithm (e.g., the basic GADIA [2]) as the algorithm converges, and vice versa. 2) The proposed H-NAC takes both the intracluster distances (dissimilarities) and the intercluster dissimilarities (chosen by the user) simultaneously and explicitly into account during optimization. In short, the H-NAC simultaneously minimizes the intracluster distances and maximizes the intercluster dissimilarities during optimization.
3) The "hard" version of the proposed algorithm, called H-NAC, includes the standard HNNs (and, thus, Bruck's Ln algorithm in [1] and the Ising model in statistical mechanics), Babadi and Tarokh's basic GADIA in [2], and the standard k-means [3], [4] (i.e., the Lloyd algorithm [5], [6]) as its special cases. 4) We apply the proposed H-NAC algorithm to various clustering problems using well-known benchmark datasets. The computer simulation results confirm the findings and the superior performance of the H-NAC compared to the k-means clustering, the k-GADIA, spectral clustering, and the very recent state-of-the-art clustering algorithm SGL in [46]. 5) We extend Bruck's Ln algorithm [1] from the two-cluster case to an arbitrary k-cluster case, which is called the k-Ln algorithm in this article. The extended k-Ln clustering algorithm turns out to be equivalent to the basic version of the pioneering algorithm GADIA of Babadi and Tarokh [2]. As far as the implementation of the proposed H-NAC is concerned, it is actually like the k-means clustering while also taking the pairwise dissimilarities into account at the cost of an additional computational burden. Because the dissimilarities are precalculated offline and are read from a lookup table, the H-NAC's computational cost is only slightly higher than that of the k-means clustering. This implies that possible imminent applications of the proposed H-NAC are abundant in signal processing, computer vision, astronomy, image processing, machine learning, and so on. Due to the multidisciplinary nature of this article's research topic, touching HNNs and optimization in computer science, the Ising model in statistical mechanics, interference avoidance in wireless communications systems, k-means clustering, graph clustering, and feature extraction in machine learning, the presented results allow us to build new connections among all these different areas.
In this article, we present the simulation results only for the hard version of the proposed hybrid clustering, called the H-NAC. Our current and near-future research subjects include examining the performance of the soft version of the proposed hybrid clustering (presented in Section III). Last but not least, in this article, we focus on the basic version of the k-means clustering. However, the variants of the k-means and the variants of the k-GADIA (i.e., the k-Ln) can also be combined in the same way by the proposed hybrid model. In Appendix A, we extend Bruck's Ln algorithm in [1] to the arbitrary k-cluster case, as shown in Fig. 13.

APPENDIX A EXTENSION OF BRUCK'S LN ALGORITHM IN [1]
Let us assign N data samples to K clusters, and let the sets {S_k}_{k=1}^K = {S_1, S_2, . . . , S_K} represent the cluster sets, i.e., the indices of the data samples in each cluster. The cost function to be minimized by the k-Ln clustering is Q_G({S_k(t)}_{k=1}^K) in (27), that is, the sum of the intracluster dissimilarities over all clusters.
Bruck's Ln algorithm in his 1990 article [1] provided an alternative proof for the convergence of the discrete-time real HNN. Its working principle is sketched in Fig. 13(a). Because the HNN has two possible output values (−1 and +1), it represents a two-cluster case, which yields the corresponding two-cluster update. The extension of the Ln algorithm in [1] to an arbitrary k-cluster case is called the k-Ln algorithm and is sketched in Fig. 13(b) for an arbitrary data sample j at an arbitrary time t. The steps of the k-Ln algorithm are given in Table VII.
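The sweep structure of the k-Ln (equivalently, basic-GADIA) update can be sketched as follows. This is our own minimal implementation, not the article's exact pseudocode from Table VII; `k_ln` and `max_sweeps` are assumed names.

```python
import numpy as np

def k_ln(W, labels, max_sweeps=100):
    """Asynchronous k-Ln / basic-GADIA sketch: each sample moves to the
    cluster that minimizes its summed intracluster dissimilarity.

    Each move strictly decreases Q_G (the sum of intracluster
    dissimilarities), which is bounded below, so the sweeps terminate:
    we stop when a full pass over the samples makes no change.
    """
    N = len(labels)
    K = int(labels.max()) + 1
    labels = labels.copy()
    for _ in range(max_sweeps):
        changed = False
        for i in range(N):
            # Summed dissimilarity between sample i and each cluster
            g = np.array([W[i, (labels == k) & (np.arange(N) != i)].sum()
                          for k in range(K)])
            w = int(np.argmin(g))
            if w != labels[i]:
                labels[i] = w
                changed = True
        if not changed:
            break
    return labels
```

Minimizing the intracluster sum for each sample is exactly what maximizes the k-cut of W, which is the graph-clustering view of the same update.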
Corollary 1: Given a dataset {x_i}_{i=1}^N ∈ R^L in a deterministic setting, the k-Ln algorithm in Table VII above 1) locally minimizes the sum of the intracluster dissimilarities Q_G(t) in (27); 2) locally maximizes the k-cut of the graph W; and 3) is equivalent to the basic version of the pioneering algorithm GADIA of Babadi and Tarokh [2].
The proof of Corollary 1 is obvious in the light of the findings in Section III.

APPENDIX B CONNECTION BETWEEN THE k-LN ALGORITHM AND SPECTRAL CLUSTERING
Define the cluster indicator matrix H ∈ R^{N×k} as in (50), i.e., H_ij = 1/√|S_j| if x_i ∈ S_j and H_ij = 0 otherwise, where i = 1, . . . , N and j = 1, . . . , k. Then, from [16] and [53], it is straightforward to obtain (51), where Tr(·) denotes the trace operation for a matrix and L is the unnormalized Laplacian matrix of the dissimilarity matrix W (for further details, see [16] and [53]). Equation (51) is a standard trace maximization problem, and relaxing the discrete-solution constraint of (50), it is well-known from linear algebra that the optimum solution is given by choosing H as the matrix whose columns are the eigenvectors associated with the greatest k eigenvalues of the unnormalized Laplacian matrix L (e.g., [53]). The fact that the k-Ln algorithm partitions the graph W by maximizing the sum of the intercluster dissimilarities (as shown in Corollary 1 in Appendix A) implies a connection between the k-Ln (i.e., the k-GADIA) and the spectral clustering formulation.
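The relaxed solution of the trace problem in (51) can be obtained numerically as follows. A minimal sketch (the name `spectral_relaxation` is ours), using the unnormalized Laplacian L = D − W and keeping the eigenvectors of the k largest eigenvalues:

```python
import numpy as np

def spectral_relaxation(W, k):
    """Relaxed solution of the trace maximization in (51): the columns
    of H are the eigenvectors of the unnormalized Laplacian L = D - W
    associated with its k greatest eigenvalues (discreteness dropped)."""
    D = np.diag(W.sum(axis=1))   # degree matrix of the dissimilarity graph
    L = D - W                    # unnormalized Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)  # eigenvalues in ascending order
    return eigvecs[:, -k:]       # eigenvectors of the k greatest eigenvalues
```

The returned columns are orthonormal, as the relaxed constraint H^T H = I requires; recovering a discrete partition from them would need an additional rounding step, which is exactly where the k-Ln works directly in the discrete domain.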