Computer Science > Distributed, Parallel, and Cluster Computing
[Submitted on 30 Jun 2023]
Title: Hashing-Based Distributed Clustering for Massive High-Dimensional Data
Abstract: Clustering analysis is of substantial significance for data mining. The properties of big data increase the demand for more efficient and economical distributed clustering methods. However, existing distributed clustering methods mainly focus on the size of the data and ignore possible problems caused by its dimensionality. To address this, we propose a new distributed algorithm, referred to as Hashing-Based Distributed Clustering (HBDC). Motivated by the outstanding performance of hashing methods in nearest-neighbor search, the algorithm applies the learning-to-hash technique to the clustering problem, which brings significant advantages in data storage, transmission, and computation. Following a global-sub-site paradigm, HBDC consists of distributed training of a hashing network and spectral clustering of the hash codes at the global site. The sub-sites use the learnable network as a hash function to convert massive high-dimensional original data into a small number of hash codes and send them to the global site for final clustering. In addition, a sample-selection method and lightweight network structures are designed to accelerate the convergence of the hash network. We also analyze the transmission cost of HBDC, including its upper bound. Our experiments on synthetic and real datasets illustrate the superiority of HBDC over existing state-of-the-art algorithms.
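The sub-site step and its transmission cost can be illustrated with a minimal sketch. It is not the authors' method: random-hyperplane sign hashing stands in for the learned hash network, and the names (make_hyperplanes, subsite_encode, upper_bound_bits), the code length of 8 bits, and the 32-bit float size are assumptions made for illustration. The bound shown is simply the observation that a site cannot hold more than min(n, 2^b) distinct b-bit codes. It is not necessarily the bound derived in the paper. Clustering at the global site is omitted.

```python
import random
from collections import Counter

def make_hyperplanes(dim, bits, seed=0):
    """Random Gaussian hyperplanes: an assumed stand-in for the learned hash network."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]

def hash_point(x, planes):
    """Map one point to a tuple of 0/1 bits using the sign of each projection."""
    return tuple(1 if sum(w * v for w, v in zip(p, x)) >= 0 else 0 for p in planes)

def subsite_encode(points, planes):
    """Hash every local point and keep only the distinct codes with their counts."""
    return Counter(hash_point(x, planes) for x in points)

def transmission_bits(code_counts, bits):
    """Bits needed to send the distinct codes of one site (counts not included)."""
    return len(code_counts) * bits

def upper_bound_bits(n_points, bits):
    """A site holds at most min(n, 2^bits) distinct codes, each of length bits."""
    return min(n_points, 2 ** bits) * bits

# Two hypothetical sub-sites, each holding 500 synthetic 64-dimensional points.
rng = random.Random(1)
dim, bits = 64, 8
planes = make_hyperplanes(dim, bits)  # shared by all sites
sites = [[[rng.gauss(c, 1.0) for _ in range(dim)] for _ in range(500)]
         for c in (-3.0, 3.0)]

encoded = [subsite_encode(points, planes) for points in sites]
global_codes = set().union(*encoded)  # distinct codes received by the global site

sent = sum(transmission_bits(c, bits) for c in encoded)
raw = sum(len(points) for points in sites) * dim * 32  # raw data as 32-bit floats

print("distinct codes at global site:", len(global_codes))
print("bits sent as hash codes:", sent)
print("bits if raw data were sent:", raw)
```

In this sketch each site transmits only its distinct codes, so the cost depends on the code length and the number of occupied hash buckets rather than on the original dimensionality.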