SEC: More Accurate Clustering Algorithm via Structural Entropy

Authors

  • Junyu Huang, Central South University, Changsha, China; Xiangjiang Laboratory, Changsha, China
  • Qilong Feng, Central South University, Changsha, China; Xiangjiang Laboratory, Changsha, China
  • Jiahui Wang, Central South University, Changsha, China; Xiangjiang Laboratory, Changsha, China
  • Ziyun Huang, Penn State Erie, The Behrend College
  • Jinhui Xu, State University of New York at Buffalo, NY, USA
  • Jianxin Wang, Central South University, Changsha, China; Xiangjiang Laboratory, Changsha, China; The Hunan Provincial Key Lab of Bioinformatics, Central South University, Changsha, China

DOI:

https://doi.org/10.1609/aaai.v38i11.29152

Keywords:

ML: Clustering, ML: Applications

Abstract

Clustering is one of the most popular tools in unsupervised machine learning and is widely used in practical applications. Although numerous clustering methods have been proposed, a common issue is that existing methods rely heavily on local neighborhood information during optimization, which leads to suboptimal performance on real-world datasets. In addition, most existing methods measure the similarity between data points using Euclidean distances or densities, which can limit their effectiveness on datasets with irregular patterns. A key challenge is therefore how to capture the global structural information of a clustering instance effectively in order to improve clustering quality. In this paper, we propose a new clustering algorithm, called SEC, which uses global structural information extracted from an encoding tree to guide the clustering optimization process. A sparse graph of the clustering instance is first constructed from the relations between its data points. Using this sparse graph, we propose an iterative encoding tree method in which hierarchical abstractions of the encoding tree are iteratively extracted as new clustering features to obtain better clustering results. To reduce the influence of easily misclustered data points located on the boundaries of the clustering partitions, which we call "fringe points", we propose an iterative pre-deletion and reassignment technique that deletes and then reassigns the fringe points to obtain more resilient and precise clustering results. Experiments on both synthetic and real-world datasets demonstrate that the proposed algorithm outperforms state-of-the-art clustering methods.
On synthetic datasets, clustering accuracy (ACC) increases by 1.7% and normalized mutual information (NMI) by 7.9% on average compared with the current state-of-the-art (SOTA) algorithm. On real-world datasets, our method outperforms other clustering methods with average increases of 12.3% in ACC and 5.2% in NMI.
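
The abstract does not spell out how structural entropy is computed, so the following is only a minimal sketch of the quantity the title refers to, not the SEC algorithm itself. It assumes the standard two-dimensional structural entropy of a partition of an undirected, unweighted graph: for each node, a term based on its degree relative to its cluster's volume, plus, for each cluster, a term based on the number of edges leaving it. The function name `structural_entropy_2d`, the edge-list input, and the toy graph are all hypothetical choices made for illustration; the paper's sparse graph construction, encoding tree iteration, and fringe point handling are not reproduced here.

```python
import math
from collections import defaultdict


def structural_entropy_2d(edges, partition):
    """Two-dimensional structural entropy of a graph partition (illustrative).

    edges:     list of (u, v) pairs of an undirected, unweighted graph
    partition: dict mapping each node to a cluster label
    """
    # Node degrees and total volume (sum of degrees = 2 * number of edges).
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    total_volume = sum(degree.values())

    # Cluster volume: sum of degrees of the nodes in the cluster.
    volume = defaultdict(int)
    for node, d in degree.items():
        volume[partition[node]] += d

    # Cluster cut: number of edges with exactly one endpoint in the cluster.
    cut = defaultdict(int)
    for u, v in edges:
        if partition[u] != partition[v]:
            cut[partition[u]] += 1
            cut[partition[v]] += 1

    entropy = 0.0
    # Within-cluster term: uncertainty of locating a node inside its cluster.
    for node, d in degree.items():
        entropy -= (d / total_volume) * math.log2(d / volume[partition[node]])
    # Between-cluster term: uncertainty of locating the cluster, paid on cut edges.
    for label, vol in volume.items():
        entropy -= (cut[label] / total_volume) * math.log2(vol / total_volume)
    return entropy


# Toy graph (assumed for illustration): two triangles joined by one bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
by_triangle = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
mixed = {0: "a", 3: "a", 4: "a", 1: "b", 2: "b", 5: "b"}

print(structural_entropy_2d(edges, by_triangle))
print(structural_entropy_2d(edges, mixed))
```

Under this definition, a partition that follows the graph's community structure (here, one cluster per triangle) yields a lower value than one that cuts through it, which is the sense in which minimizing structural entropy can guide a clustering.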

Published

2024-03-24

How to Cite

Huang, J., Feng, Q., Wang, J., Huang, Z., Xu, J., & Wang, J. (2024). SEC: More Accurate Clustering Algorithm via Structural Entropy. Proceedings of the AAAI Conference on Artificial Intelligence, 38(11), 12583-12590. https://doi.org/10.1609/aaai.v38i11.29152

Issue

Section

AAAI Technical Track on Machine Learning II