A new clustering method based on multipartite networks

The clustering problem is one of the most studied and challenging in machine learning, as it attempts to identify similarities within data without any prior knowledge. Among modern clustering algorithms, the network-based ones are some of the most popular. Most of them convert the data into a graph in which instances of the data represent the nodes and a similarity measure is used to add edges. This article proposes a novel approach that uses a multipartite network in which layers correspond to attributes of the data and nodes represent intervals for the data. Clusters are intuitively constructed based on the information provided by the paths in the network. Numerical experiments performed on synthetic and real-world benchmarks are used to illustrate the performance of the approach. As a real application, the method is used to group countries based on health, nutrition, and population information from the World Bank database. The results indicate that the proposed method is comparable in performance with some of the state-of-the-art clustering methods, outperforming them for some data sets.


INTRODUCTION
Clustering methods aim to identify groups of similar instances of data to reveal common characteristics that are not visible by other means. They represent one of the major classes of machine learning methods with multiple practical applications (Ezugwu et al., 2022) as they reveal connections within the data without using supplementary information. There are many criteria for their classification, but recent trends group them into two major categories: traditional and modern (Xu & Tian, 2015; Anand & Kumar, 2022). Traditional methods are textbook approaches that tend to be used more by practitioners in other research fields for practical applications. While they may be outperformed by some newer approaches for some data sets, they have also passed the test of time, as they offer consistent and competitive results that are interpretable and reliable. They are also relatively easy to use and available in various implementations. Modern approaches tend to use newly developed concepts such as deep neural networks or graph theory. A separate trend is to propose general improvement methods, e.g., methods that can enhance the results of the clustering by ameliorating the data set (Li et al., 2023; Wang et al., 2018).
The literature is saturated with variations of traditional and modern methods, proposed either as stand-alone clustering algorithms or as solutions for particular practical applications that use specific information related to the data. This should not be seen as dismissing the exploration of new approaches. An attempt at a new approach is proposed in this article: the Multipartite Network Clustering (MN-C) algorithm. MN-C uses a multipartite network to identify clusters in the data. The layers of the multipartite network correspond to attributes of the data. The nodes of each layer represent intervals for the corresponding attribute and are populated with the instances whose value of this attribute lies within that interval. Edges of the network connect the nodes that contain the attributes of the same data instance. The clusters approximately correspond to paths in the network. Numerical experiments illustrate the behaviour of the approach on a set of synthetic and real-world benchmarks.

METHODS
Most clustering methods approach the problem in one of the following manners: using a representation for the clusters in the form of central points or distribution in representative or partition-based clustering; constructing clusters by successively aggregating/dividing data in hierarchical clustering; using data density to define clusters (Bhattacharjee & Mitra, 2020). Graph-based models, in which the data is converted into a graph, and spectral models, closely connected to the latter, in which matrix representations of graphs are further analysed, are also popular (Hloch, Kubek & Unger, 2022). There are many variants of these main approaches, using concepts from related fields, such as fuzzy computing or neural networks (Ayyub et al., 2022).
The standard versions of the approaches mentioned above are considered traditional methods (Xu & Tian, 2015). They include variants of partition-based methods, such as Kmeans, partitioning around medoids (PAM), or Clustering Large Applications (CLARA). DBSCAN and its variants are density-based methods in this group. Other traditional methods are hierarchical trees, expectation-maximization clustering, and fuzzy analysis clustering. Most of them can be found in textbooks (Zaki & Meira Jr., 2014) and have implementations available for free or within commercial data analysis software packages.
Modern methods extend the traditional ones by including techniques from related fields, such as network analysis or deep learning (Wierzchoń & Klopotek, 2017). They include the class of spectral clustering methods (Nascimento & de Carvalho, 2011), affinity propagation clustering (Frey & Dueck, 2007), density peaks clustering methods (Hou, Zhang & Qi, 2020), and various deep clustering methods (Anand & Kumar, 2022; Zhou et al., 2022). Advanced methods using graphs aim to perform the clustering on graphs, such as graph neural networks (Tsitsulin et al., 2020), a marginalized graph autoencoder in Wang et al. (2017) and a local high-order graph clustering in Yin et al. (2017).
The present article presents a network-based clustering method that identifies clusters from a multipartite network constructed from attributes of the data. In what follows, other clustering methods that use network techniques are succinctly reviewed, and the newly proposed method, Multipartite Network Clustering (MN-C), is introduced.

Related work
Let $X \in \mathbb{R}^{n \times d}$ be a data set containing $n$ instances from $\mathbb{R}^{d}$, i.e., with $d$ attributes/features. We denote by $X_j$ the $n$-dimensional vector corresponding to attribute $j$, $j = 1,\dots,d$. The clustering problem consists of grouping instances based on some similarity indicator. While it is ideal for a method to determine the number of clusters in the data, sometimes, for specific applications, the desired number of clusters $k$ is indicated or required.
Most graph-based clustering methods construct a graph by treating the instances of the data as the nodes in the graph. A similarity measure is used to compute the weight of the edge connecting two nodes. Similar nodes will form cliques or communities that can be detected using various tools, thus revealing the clusters in the data. Spectral methods analyse matrix representations of the graph to extract information about communities (Washio & Motoda, 2003; Foggia et al., 2009). A discussion of the importance of constructing the graph and suggested solutions can be found in Nie et al. (2016) and Maier, Luxburg & Hein (2008).
In Huang et al. (2020), an ultra-scalable spectral clustering algorithm, called UNSPEC, designed for extremely large-scale data sets with limited resources is presented.

Multipartite network clustering (MN-C)
Multipartite Network Clustering (MN-C) clusters the data by using a multipartite graph constructed from the data set. The network layers correspond to attributes of the data, and we assume that we are searching for $k$ classes. The main components of the MN-C algorithm are described in detail in what follows.

The multipartite graph
A multipartite graph (Van Dam, Koolen & Tanaka, 2016) is a graph whose nodes can be divided into disjoint sets, which are also called layers, within which no two nodes are adjacent. A $d$-partite graph contains $d$ layers.
MN-C uses a multipartite graph denoted by $G(V, E|X)$, where $V$ is the set of nodes and $E$ is the set of edges constructed from the data set $X$. The number of layers of $G$ is equal to the number of attributes of $X$, denoted by $d$ in this article. Each layer has a maximum of $k$ nodes, where $k$ is the number of classes that we are searching for.

Layers and nodes
Each layer of the graph corresponds to an attribute in the data set. Each attribute $X_j$, $j = 1,\dots,d$, is divided into $k$ intervals denoted by $I_{jl}$, $l = 1,\dots,k$, and a network node $V_{j,l}$ is created for each interval. The way intervals are computed defines the structure of the network. They can be designed in various manners, e.g., of equal length, to contain an equal number of instances, or by including some problem-dependent information. In this approach, MN-C constructs intervals of equal lengths between $\min(X_j)$ and $\max(X_j)$.
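The equal-length interval construction can be illustrated with a minimal sketch. The function name `interval_index` is hypothetical, and two details are assumptions that the text does not specify: intervals are treated as half-open, with the maximum value assigned to the last interval, and a constant attribute is mapped to a single interval.

```python
def interval_index(value, lo, hi, k):
    """Return the 0-based index of the equal-width interval of [lo, hi]
    that contains value, for k intervals (hypothetical helper)."""
    if hi == lo:
        # Assumption: a constant attribute yields a single interval.
        return 0
    width = (hi - lo) / k
    idx = int((value - lo) / width)
    # Assumption: the maximum value falls into the last interval.
    return min(idx, k - 1)


# One attribute (column) of a toy data set, split into k = 4 intervals.
column = [0.0, 1.0, 2.5, 4.9, 5.0, 10.0]
k = 4
lo, hi = min(column), max(column)
indices = [interval_index(v, lo, hi, k) for v in column]
print(indices)  # [0, 0, 1, 1, 2, 3]
```

Here the third interval receives a single value and the others one or two, which shows how equal-length intervals can be unevenly populated.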
The network has $d$ layers and a maximum of $d \times k$ nodes. Each node 'contains' a number of instances. An instance $x_i \in X$ will be placed in one node in each layer of the network, corresponding to the interval to which each of its components belongs, i.e., $x_i$ is placed in node $V_{j,l}$ whenever $x_{ij} \in I_{jl}$.

Algorithm 1 Construction of Multipartite Network
input: data set $X$, number of classes $k$;
output: network $G = (V, E|X)$;
$V = \emptyset$;
for each attribute $j$ do
    Compute $k$ intervals $I_{jl}$, $l = 1,\dots,k$, of equal size between $\min(X_j)$ and $\max(X_j)$;
    Add nodes $v_{jl}$, $l = 1,\dots,k$, to $V$, where $j$ is the layer of the node;
end for
for each data instance $x_i \in X$ do
    Create path $v_{1 l_1}, v_{2 l_2}, \dots, v_{d l_d}$ with $x_{ij} \in I_{j l_j}$ for all $j = 1,\dots,d$, by adding edges $(v_{j l_j}, v_{j+1, l_{j+1}})$, $j = 1,\dots,d-1$, to $E$; if an edge already exists, increase its weight by 1;
end for
Remove empty nodes from $V$;
return $d$-partite graph $G = (V, E|X)$ and the intervals $I = \{I_{jl}\}$, $j = 1,\dots,d$, $l = 1,\dots,k$, corresponding to the nodes;

Edges
Each instance $x_i$ connects the nodes to which its attributes belong. Thus, if instance $x_i$ is placed in nodes $V_{j,l_j}$ and $V_{j+1,l_{j+1}}$, for $j = 1,\dots,d-1$, an edge is added between the two nodes if it does not yet exist. If an edge already exists between the two nodes, its weight is increased by one. Thus, each instance in the data creates a path between the first layer of the network and the last one. The weight of an edge represents the number of instances that are placed in both connected nodes, i.e., that belong to both corresponding intervals. The number of instances placed in each interval will vary depending on their distribution. Intervals with no instances will be ignored, i.e., empty nodes will be removed from the network.
Example 1 Consider a data set with 100 instances, six attributes, and four well-separated clusters, generated with the make_classification function from the scikit-learn package in Python (Pedregosa et al., 2011) using class_sep=10. Figure 1 presents the corresponding multipartite network with six layers, each having two or three nodes.
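The construction of nodes, paths, and weighted edges described above can be sketched in plain Python. This is an illustrative sketch rather than the authors' implementation: the names `interval_index` and `build_network` are hypothetical, nodes are encoded as (layer, interval) pairs, and the maximum of each attribute is assumed to fall in the last interval. Only non-empty nodes are ever created, so no separate removal step is needed.

```python
from collections import Counter


def interval_index(value, lo, hi, k):
    """0-based index of the equal-width interval containing value."""
    if hi == lo:
        return 0  # assumption: constant attribute -> single interval
    return min(int((value - lo) / ((hi - lo) / k)), k - 1)


def build_network(X, k):
    """Return (nodes, edge weights, per-instance paths) for data X,
    given as a list of equally long rows of numbers."""
    d = len(X[0])
    bounds = [(min(r[j] for r in X), max(r[j] for r in X)) for j in range(d)]
    paths, edges = [], Counter()
    for row in X:
        # The path of an instance: one interval index per layer.
        path = tuple(interval_index(row[j], *bounds[j], k) for j in range(d))
        paths.append(path)
        # Add or reinforce the edges between consecutive layers.
        for j in range(d - 1):
            edges[((j, path[j]), (j + 1, path[j + 1]))] += 1
    nodes = {(j, l) for path in paths for j, l in enumerate(path)}
    return nodes, edges, paths


# Toy data: four instances, three attributes, k = 2 intervals per layer.
X = [[0.0, 0.0, 0.0], [0.1, 0.2, 0.1], [1.0, 1.0, 1.0], [0.9, 0.8, 1.0]]
k = 2
nodes, edges, paths = build_network(X, k)
print(paths)
print(dict(edges))
```

In this toy case the first two instances share one path and the last two share another, so every edge has weight 2 and the edge weights sum to n × (d − 1) = 8.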

Initial clusters
Each instance of the data set represents a network path from the first layer to the last one. It is reasonable to assume that instances representing the same path may belong to the same cluster. In the first step of MN-C, instances that share a common path throughout all the layers of the network are placed in the same cluster.
Depending on the structure of the data, this procedure may result in any number of clusters, which may be greater than $k$, due to the number of possible paths in the network. Not all paths in the network represent instances. We denote by $k_0$ the number of clusters resulting from this initialization of the clustering. In what follows, a procedure to merge clusters in order to reduce their number is performed, if necessary, i.e., if $k_0 > k$.
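The initialization step amounts to grouping instances by identical paths. A minimal sketch follows, assuming each path is given as a tuple of interval indices (one per layer); the name `initial_clusters` is hypothetical.

```python
from collections import defaultdict


def initial_clusters(paths):
    """Group instance indices by their full network path."""
    clusters = defaultdict(list)
    for i, path in enumerate(paths):
        clusters[tuple(path)].append(i)
    return dict(clusters)


# Five instances on a three-layer network.
paths = [(0, 0, 0), (0, 0, 0), (1, 1, 1), (0, 1, 1), (1, 1, 1)]
clusters = initial_clusters(paths)
k0 = len(clusters)  # number of initial clusters
print(k0, clusters)
```

With these paths, k0 is 3, so merging would be required whenever fewer than three clusters are requested.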

Merging clusters
A cluster $C$ represents a path starting from the first layer to the last layer of the network $G$. In the initial stage, each cluster contains one path. Clusters are merged by finding paths that have the most common elements. Merging is performed in the order of the number of instances corresponding to each cluster, starting with the smallest ones. The size of the cluster is denoted by $\mu(C)$ and is the number of instances assigned to it. In the initial stage, it represents the number of instances that belong to all node intervals from the cluster.
It is assumed that larger clusters are more stable, while those containing fewer instances should be merged. In order to find, for each cluster $C$, the cluster with which it should be merged, we compute the following strength indicator for two clusters $C$ and $C'$:
$$S(C, C') = \sum_{e \in E_C \cap E_{C'}} w_e, \qquad (1)$$
where $E_C$ denotes the set of network edges on the paths of cluster $C$ and $w_e$ is the weight of edge $e$. Thus, $S$ adds the weights of the common edges of two clusters. Each cluster is merged with the one for which the strength indicator is the highest, i.e., the one with which it has the most edges/instances in common. Thus
$$C^* = \arg\max_{C' \neq C} S(C, C'). \qquad (2)$$
Clusters $C$ and $C^*$ are merged by placing all instances from $C$ in $C^*$. All clusters are merged except the $k$ largest ones. This process reduces the number of clusters from $k_0$ to $k_1$. To reach the desired number of clusters $k$, the process is repeated while the number of clusters is greater than $k$, and stops when the number of clusters reaches $k$ or falls below it.
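Remark 1 If $S(C, C') = 0$ for all $C' \neq C$, then the merging cluster is found by looking for the cluster having the most nodes in common with $C$. We write $C^* = \arg\max_{C' \neq C} |V_C \cap V_{C'}|$, where $V_C$ denotes the set of nodes on the paths of cluster $C$. If there is no such cluster, then $C$ is not merged with any other cluster.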
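The choice of a merge target, including the fallback of Remark 1, can be sketched as follows. This is a sketch under stated assumptions, not the authors' code: the helper names are hypothetical, a cluster is represented by the set of paths it contains, edge weights are the global network weights, and ties are broken by iteration order (the text does not specify a tie-breaking rule).

```python
from collections import Counter


def path_edges(path):
    """Edges between consecutive layers along one path."""
    return {((j, path[j]), (j + 1, path[j + 1])) for j in range(len(path) - 1)}


def path_nodes(path):
    """Nodes, as (layer, interval) pairs, visited by one path."""
    return set(enumerate(path))


def strength(edges_a, edges_b, weights):
    """Sum of the weights of the edges shared by two clusters."""
    return sum(weights[e] for e in edges_a & edges_b)


def merge_target(c, cluster_paths, weights):
    """Return the id of the cluster that c should be merged into, or None."""
    edges = {i: set().union(*(path_edges(p) for p in ps))
             for i, ps in cluster_paths.items()}
    nodes = {i: set().union(*(path_nodes(p) for p in ps))
             for i, ps in cluster_paths.items()}
    others = [i for i in cluster_paths if i != c]
    if not others:
        return None
    best = max(others, key=lambda o: strength(edges[c], edges[o], weights))
    if strength(edges[c], edges[best], weights) > 0:
        return best
    # Fallback: no shared edges, so use the number of shared nodes.
    best = max(others, key=lambda o: len(nodes[c] & nodes[o]))
    return best if nodes[c] & nodes[best] else None


# Toy network: 11 instances on three layers.
instance_paths = ([(0, 0, 0)] * 5 + [(1, 1, 1)] * 4
                  + [(0, 0, 1)] + [(0, 1, 0)])
weights = Counter(e for p in instance_paths for e in path_edges(p))
cluster_paths = {"A": {(0, 0, 0)}, "B": {(1, 1, 1)},
                 "C": {(0, 0, 1)}, "D": {(0, 1, 0)}}
target_c = merge_target("C", cluster_paths, weights)
target_d = merge_target("D", cluster_paths, weights)
print(target_c, target_d)
```

Cluster C shares an edge of weight 6 with A, so it is merged into A. Cluster D shares no edge with any cluster; it has two nodes in common with A and one with each of B and C, so the fallback also selects A.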

Outcome
The outcome of MN-C is a set of clusters, corresponding to network paths. Each instance in the data is assigned to a cluster.

MN-C clustering procedure, steps 8 and 9:
8: Order clusters in ascending order of size $\mu(C)$;
9: for each cluster $C_l$, $l = 1,\dots,k_{it} - k$ do

NUMERICAL EXPERIMENTS
The performance of MN-C is evaluated on a set of synthetically generated and real-world data sets for clustering and classification, and compared with that of other standard state-of-the-art clustering models.

Experimental set-up
In this subsection, the data used for the experiments is described, as well as the methods used for the comparisons and the performance metric used for evaluating the results.
All reasonable combinations of parameters were considered, i.e., excluding the settings with more classes than instances.

Real-world data sets
A selection of data sets used for clustering and classification from the UCI Machine Learning Repository (Dua & Graff, 2017) is presented. The names and characteristics of the data are listed in Table 1.

Comparisons with other methods
MN-C is compared with four clustering methods that aim to find a given number of clusters: Kmeans, Gaussian Mixture (GM), Affinity Propagation (AP), and Birch (Zaki & Meira Jr., 2014; Frey & Dueck, 2007; Zhang, Ramakrishnan & Livny, 1996). Their corresponding implementations in the sklearn Python library (Pedregosa et al., 2011) are used with their default parameters. In addition, for the real-world data sets, the results are also compared to two spectral clustering methods: the standard implementation in Python, which we will call SC, and the Ultra-Scalable Spectral Clustering (UNSPEC), using the code provided by the authors (Huang et al., 2020).

Performance evaluation
In order to evaluate and compare the results, the normalized mutual information indicator (NMI) is used (Zaki & Meira Jr., 2014). NMI takes values between 0 and 1 and can be used to compare results provided by a clustering method with a baseline represented by the known clusters. Values closer to 1 indicate a better match.
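For illustration, NMI can be computed directly from two label vectors. The sketch below is not the evaluation code used in the experiments. It assumes normalization by the geometric mean of the two entropies, which is one common choice (library implementations may default to a different normalization), and it returns 1 by convention when both labelings consist of a single cluster.

```python
from collections import Counter
from math import log, sqrt


def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two labelings of the same instances."""
    n = len(a)
    joint, ca, cb = Counter(zip(a, b)), Counter(a), Counter(b)
    return sum((nij / n) * log(nij * n / (ca[x] * cb[y]))
               for (x, y), nij in joint.items())


def nmi(a, b):
    """NMI with geometric-mean normalization (an assumption here)."""
    ha, hb = entropy(a), entropy(b)
    if ha == 0 or hb == 0:
        # Convention: two trivial labelings match perfectly.
        return 1.0 if ha == hb else 0.0
    return mutual_information(a, b) / sqrt(ha * hb)


truth = [0, 0, 1, 1]
same_up_to_renaming = nmi(truth, [1, 1, 0, 0])  # cluster names do not matter
independent = nmi(truth, [0, 1, 0, 1])          # no shared information
print(same_up_to_renaming, independent)
```

The first comparison gives a value of 1 (up to floating-point rounding) because NMI is invariant to the naming of clusters, while the second gives 0.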

Synthetic data sets
We generated 588 synthetic data sets. The results are summarised in Fig. 2 and Table 2. Table 2 reports, for each characteristic of the data sets, the percentage of results for which MN-C obtained a higher NMI than the other method. Next to each number, an asterisk (*) indicates that, considering all data sets with the same characteristic, a t-test finds the NMI values obtained by MN-C to be significantly greater than those obtained by the other method. A minus sign (-) indicates that overall there is no significant difference between the methods. An (x) indicates that the results obtained by the other method are significantly better.
The results indicate that MN-C is a competitive method, providing better results than Kmeans, Gaussian Mixture, and Affinity Propagation, but not better than the Birch method, for the synthetic data sets. However, Fig. 2 shows that while the results obtained by MN-C are worse than those of Birch in most cases, they are close to them, as the points align with the first bisector. In general, MN-C outperforms the other three methods more often in the most difficult settings, for example for class separator values of 0.1, a smaller number of instances, a large number of attributes, and a larger number of classes.

Varying the number of clusters
MN-C uses the number of clusters as a parameter. For different values of this parameter, it will divide the data into clusters that also contain grouped data, in a manner similar to that of the other clustering methods. For example, given a synthetic data set with 100 instances, 150 attributes, 30 classes, and a class separation parameter of 0.1, the NMI values obtained by all the methods, except AP, which determines the number of clusters on its own, follow a similar trend, as illustrated in Fig. 3. The increasing values of NMI for all the methods indicate that when more clusters are obtained, instances that belong to the same clusters are grouped together in sub-clusters. Since NMI is an external performance measure, it cannot be used to detect the number of clusters, but it can show that the performance of the method may be considered robust with respect to this parameter, as it places instances from the same cluster together. It may also indicate that any method of determining the number of clusters in the data based on some internal quality measure, such as the elbow method (Yager & Filev, 1994; Thorndike, 1953), or information-based methods (Sugar & James, 2003), can be used with MN-C in a manner similar to how they are used with other clustering methods (Fu & Perry, 2020). Figure 4 illustrates the values of three internal indicators that are known to be used with an elbow method to evaluate the number of clusters based on the results of a learner, for the same data set as in Fig. 3. The silhouette score, distortion, and inertia all have descending values; an almost linear trend may be observed to the left of the vertical line marking the number of clusters in the data set.

Real-world benchmarks
Results obtained by the four methods on the real-world benchmarks are presented in Table 4. The value of the NMI and the number of clusters determined by each method are indicated. The data sets vary in the number of clusters, instances, and attributes. We show a variety of situations in which the results obtained by MN-C are better than those obtained by the other methods (including Birch, unlike the case of the synthetic data sets).
The results also illustrate the reality of the variability in clustering performance, with large differences between the methods for some data sets, e.g., R6, for which MN-C obtains a value of 0.662 for the NMI, while the other methods obtain 0.041, 0.026, 0.453, and 0.031, respectively. AP obtains 0.453 but with 11 clusters instead of two. This situation arises also

A real-world application
A country's health and nutrition situation is often assumed to be directly linked with economic indicators. Each year, the World Bank categorizes countries into four income groups, Low income, Lower middle income, Upper middle income, and High income, based on GNI per capita. GNI per capita stands for Gross National Income per capita. It is a measure used to assess the economic well-being of a country and its residents. GNI represents the total income a country's residents earn, including income earned domestically and income generated abroad. 'Per capita' means that the GNI is divided by the country's total population, giving an average income figure for each individual. GNI per capita is often used as an indicator to compare the average income levels between different countries or to track a single country's economic growth and development over time. It provides a useful metric for understanding the average income and standard of living of the population in a particular nation. Figure 5 presents a map of the distribution of countries into the four groups in 2022. The DataBank (https://databank.worldbank.org/, accessed June 2023) also reports a multitude of yearly values of indicators related to various socio-economic statuses of countries around the world. Among them, we find Health, Nutrition, and Population statistics (https://databank.worldbank.org/source/health-nutrition-and-population-statistics, accessed June 2023). There are 467 indicators for 266 countries, for health, nutrition, and population. To illustrate the behaviour of MN-C on real data, we have used the health, nutrition, and population indicators from 2021 to group countries. When retrieving the data from the database, we found that more values were available for this year than for the latest, 2022. We ran all the algorithms on the resulting data set. MN-C obtained an NMI of 0.32 for the four clusters. All the other methods obtained NMIs below 0.05.
Figure 6 illustrates the four country groups identified by MN-C. While these groups do not overlap with the income level classes, their distributions are not dissimilar. We find that most countries in the upper-middle and high income groups are in the same group in terms of the given health indicators. For example, while China and Russia are in a different income group than the United States of America and Canada (upper middle income versus high income), MN-C places them all together, along with high-income countries from Europe, based on the health indicators.
The clusters identified by MN-C may also indicate that there is a common level for health, nutrition, and population status among countries with similar income, as well as among countries in the same geographic region. They also show that lower-income countries have similar values for these indicators.

CONCLUSIONS
A simple network-based approach to the clustering problem has been presented. Data attributes are separated into intervals and placed in the nodes of a layer of a multipartite network. Thus, the network has a number of layers equal to the number of attributes. Each data instance adds a path to the network from the first to the last layer. MN-C identifies clusters by finding instances that are on the same or close paths in the network. Numerical experiments show that the approach is competitive against some standard state-of-the-art clustering techniques on a set of synthetic and real-world benchmarks. A real-world application that groups countries based on Health, Nutrition, and Population information available from the World Bank database is presented. Future research directions can explore different ways of constructing intervals for the network nodes and finding ways to identify or recommend a number of clusters.

Figure 1. Example 1: the network corresponds to a data set having 100 instances, six attributes, and four classes. The data are well separated. The weight of an edge indicates the number of instances that have components in the two intervals represented by the nodes. Full-size DOI: 10.7717/peerj-cs.1621/fig-1

MN-C clustering procedure, steps 4 to 7:
4: Assign all data instances belonging to the same network path from the first to the last layer to the same cluster, forming $k_0$ clusters;
5: it = 0;
6: Change = True;
7: while $k_{it} > k$ and Change do

Figure 2 presents scatter plots of the values of the NMI obtained by MN-C and each of the other methods. It is used to illustrate the overall performance of MN-C compared to the other method: points located above the first bisector (represented in each image by a grey dashed line) indicate that MN-C performs better. The points located below it indicate that the NMI values obtained by the other method are better. For each method, the plot is separated based on the values of the class separator parameter, as this is the one that controls the difficulty of the clustering problem.

Figure 2. Overview of results for the synthetic data sets. Scatter plots of NMI values obtained by each method compared with MN-C. A point above the first bisector, represented as a dashed line, indicates that the NMI obtained by MN-C for the data set is greater than that obtained by the other method. Compared with Kmeans, 84.01% of MN-C results are above the line; with GM, 86.39% of results are above the line; with Birch, 17.68% of results are above the line; and with AP, 88.94% of results are above the line. Full-size DOI: 10.7717/peerj-cs.1621/fig-2

Figure 3. Evolution of values of NMI for different numbers of clusters k for a synthetic data set with 100 instances, 150 attributes, 30 classes, and a class separation parameter of 0.1. The vertical line indicates the number of clusters in the data set. Full-size DOI: 10.7717/peerj-cs.1621/fig-3

Figure 4. Three internal performance indicators for different numbers of clusters, for the same data set as in Fig. 3. The vertical lines indicate the number of clusters in the data set. An almost linear trend can be noticed to the left of this line. Full-size DOI: 10.7717/peerj-cs.1621/fig-4