Article

Entropy-Randomized Clustering

Yuri S. Popkov, Yuri A. Dubnov and Alexey Yu. Popkov *
Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences, 119333 Moscow, Russia
* Author to whom correspondence should be addressed.
Mathematics 2022, 10(19), 3710; https://doi.org/10.3390/math10193710
Submission received: 18 August 2022 / Revised: 1 October 2022 / Accepted: 5 October 2022 / Published: 10 October 2022
(This article belongs to the Special Issue Mathematical Modeling, Optimization and Machine Learning)

Abstract

This paper proposes a clustering method based on a randomized representation of an ensemble of possible clusters with a probability distribution. The concept of a cluster indicator is introduced as the average distance between the objects included in the cluster, and the indicators averaged over the entire ensemble are taken as characteristics of the ensemble. The optimal distribution of clusters is determined using the randomized machine learning approach: an entropy functional is maximized with respect to the probability distribution subject to constraints imposed on the averaged indicator of the cluster ensemble. The resulting entropy-optimal cluster corresponds to the maximum of the optimal probability distribution. The method is developed for binary clustering as a basic procedure, and its extension to t-ary clustering is considered. Illustrative examples of entropy-randomized clustering are given.

1. Introduction

Cluster analysis is a branch of machine learning in which labels provided by a supervisor are replaced by internal characteristics of the objects or external characteristics of the clusters. The internal characteristics include the distances between objects within a cluster [1,2] and the similarity of objects [3]; among the external characteristics are the distances between clusters [4]. As a mathematical problem, clustering has no universal statement, so clustering algorithms are often heuristic [5,6].
A highly developed area of research is the cluster analysis of large text corpora. As a rule, latent features are first detected by latent semantic analysis [7] and then used for clustering [8,9]. Recently, works based on the concept of ensemble clustering have appeared [10,11].
Most clustering algorithms involve distances between objects, measured in an accepted metric, together with enumerative search algorithms under heuristic control [12]. Clustering results depend significantly on the metric, so it is very important to quantify the quality of clustering [13,14,15].
This paper proposes a clustering method based on a randomized representation of an ensemble of possible clusters with a probability distribution [16]. The concept of a cluster indicator is introduced as the average distance between the objects included in the cluster. Since clusters are treated as random objects, the indicators averaged over the entire ensemble are taken as characteristics of the ensemble. The optimal distribution of clusters is determined using the randomized machine learning approach: an entropy functional is maximized with respect to the probability distribution subject to constraints imposed on the averaged indicator of the cluster ensemble. The resulting entropy-optimal cluster corresponds in size and composition to the maximum of the optimal probability distribution.
The optimal distribution of clusters is computed with the randomized maximum entropy estimation (MEE) method described in [16]. The method has proved effective in many machine learning and data mining problems; among other things, it gives rise to the problem of entropy-randomized clustering. This article presents that problem in more detail, including a proof of convergence of the multiplicative algorithm and logical schemes of the clustering procedures.

2. An Indicator of Data Matrices

Consider a set of n objects characterized by row vectors $x^{(1)},\dots,x^{(n)}$ from the feature space $R^m$. Using these vectors, we construct the following n-row matrix:
$$X(1,\dots,n)=\begin{pmatrix}x^{(1)}\\ \vdots\\ x^{(n)}\end{pmatrix}.$$
Let the distance between the ith and jth rows be defined as
$$\varrho\big(x^{(i)},x^{(j)}\big)=\big\|x^{(i)}-x^{(j)}\big\|_{R^m},$$
where $\|\cdot\|_{R^m}$ denotes an appropriate metric (norm) in the feature space $R^m$. Next, we construct the distance matrix
$$D(n\times n)=\begin{pmatrix}0 & \varrho(x^{(1)},x^{(2)}) & \cdots & \varrho(x^{(1)},x^{(n)})\\ \varrho(x^{(2)},x^{(1)}) & 0 & \cdots & \varrho(x^{(2)},x^{(n)})\\ \vdots & \vdots & \ddots & \vdots\\ \varrho(x^{(n)},x^{(1)}) & \varrho(x^{(n)},x^{(2)}) & \cdots & 0\end{pmatrix}.$$
We introduce an indicator of the matrix $X(1,\dots,n)$ as the average value of the elements of the distance matrix D:
$$\mathrm{dis}(X)=\frac{2}{n(n-1)}\sum_{\substack{i,j=1\\ j>i}}^{n}\varrho\big(x^{(i)},x^{(j)}\big).$$
Below, the objects will be included in clusters depending on the distances in (2). Therefore, the important characteristics of the matrix $X(1,\dots,n)$ are the minimum and maximum elements of the distance matrix D:
$$\inf(D)=\min_{i\neq j}\varrho\big(x^{(i)},x^{(j)}\big),\qquad \sup(D)=\max_{i\neq j}\varrho\big(x^{(i)},x^{(j)}\big).$$
Note that the elements of the distance matrices of the clusters belong to the interval $I=[\inf(D),\sup(D)]$.
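To make these definitions concrete, the quantities in (2)–(5) can be computed as in the following Python/NumPy sketch, which assumes the Euclidean metric used later in Section 5.1; the function names are ours and are introduced only for illustration.

```python
import numpy as np

def distance_matrix(X):
    """Pairwise Euclidean distances between the rows of X, i.e., the matrix D of (3)."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def indicator(D):
    """Indicator dis(X) of (4): the average off-diagonal element of a distance matrix."""
    n = D.shape[0]
    off = D[~np.eye(n, dtype=bool)]
    return off.sum() / (n * (n - 1))

def bounds(D):
    """inf(D) and sup(D) of (5), taken here over the off-diagonal entries."""
    n = D.shape[0]
    off = D[~np.eye(n, dtype=bool)]
    return float(off.min()), float(off.max())
```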

3. Randomized Binary Clustering

The binary clustering problem is to arrange n objects between two clusters $K(s^*)$ and $K(n-s^*)$ of sizes $s^*$ and $(n-s^*)$, respectively:
$$K(s^*)=\{i_1,\dots,i_{s^*}\},\qquad K(n-s^*)=\{j_1,\dots,j_{n-s^*}\},\qquad \{i_1,\dots,i_{s^*}\}\cap\{j_1,\dots,j_{n-s^*}\}=\varnothing,$$
$$i_\alpha,\,j_\beta=\overline{1,n},\qquad \alpha=\overline{1,s^*},\qquad \beta=\overline{1,n-s^*}.$$
It is required to find the size $s^*$ and composition $\{i_1,\dots,i_{s^*}\}$ of the cluster $K(s^*)$.
For each fixed cluster size s, the clustering procedure consists of selecting a submatrix $X(s)$ of $s<n$ rows from the matrix $X(1,\dots,n)$. If the matrix $X(s)$ is selected, then the remaining rows form the matrix $X(n-s)$, and the set of their numbers forms the cluster $K(n-s)$.
Clearly, the matrix $X(s)$ can be formed from the rows of the original matrix $X(1,\dots,n)$ in $C_n^s$ different ways (the number of s-combinations of a set of n elements); each such choice determines the complementary matrix $X(n-s)$ formed from the remaining rows.
According to the principle of randomized binary clustering, the matrix $X(s)$ is a random object, and its particular instances are realizations of this object. The set of its elements and the number of rows s are therefore random.
A realization of the random object is a set of row vectors from the original matrix:
$$X(s)=X(s)(i_1,\dots,i_s)=\begin{pmatrix}x^{(i_1)}\\ \vdots\\ x^{(i_s)}\end{pmatrix}.$$
We renumber this set as follows:
$$\{i_1,\dots,i_s\}\;\Leftrightarrow\;k,\qquad k=\overline{1,K(s)},\qquad K(s)=C_n^s.$$
Thus, the randomization procedure yields a finite ensemble of the form
$$\mathcal{X}(s)=\big\{X(s)(1),\dots,X(s)(K(s))\big\}.$$
Recall that the matrices in this ensemble are random. Hence, we assume the existence of probabilities $p(s,k)$ of realizing the ensemble elements, where s and k denote the cluster size and the cluster realization number, respectively:
$$X(s)(k)\quad\text{with probability}\quad p(s,k),\qquad s=\overline{1,n-1},\quad k=\overline{1,K(s)}.$$
Then, the randomized binary clustering problem reduces to determining a discrete probability distribution $p(s,k)$, $s=\overline{1,n-1}$, $k=\overline{1,K(s)}$, which is appropriate in some sense.
Let such a function $p^*(s,k)$ be obtained; according to the general variational principle of statistical mechanics, the realized matrix will be
$$X(s^*)\big(k^*(s^*)\big),\qquad\text{where}\quad \big(s^*,k^*(s^*)\big)=\arg\max_{s,k}\,p^*(s,k).$$
This matrix corresponds to the most probable cluster of $s^*$ objects with the numbers
$$K_1^*=\{i_1^*,\dots,i_{s^*}^*\}\;\Leftrightarrow\;k^*(s^*).$$
The other cluster consists of the remaining $(n-s^*)$ objects with the numbers
$$K_2^*=\{j_1^*,\dots,j_{n-s^*}^*\}=\{1,\dots,n\}\setminus\{i_1^*,\dots,i_{s^*}^*\}.$$
Generally speaking, there are many such clusters, but they all contain the same $(n-s^*)$ objects.
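On a computer, the ensemble of randomized clusters can be realized by enumerating all s-element index subsets, as in the following sketch (Python, itertools); the serial number k, here 0-based, plays the role of the cluster realization number. The naming is ours, not part of the paper.

```python
from itertools import combinations

def cluster_ensemble(X, s):
    """Enumerate the ensemble of Section 3: every s-element subset of row indices of X,
    numbered k = 0, 1, ..., C(n, s) - 1, together with the submatrix X(s)(k)."""
    n = X.shape[0]
    for k, idx in enumerate(combinations(range(n), s)):
        yield k, idx, X[list(idx), :]
```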

4. Entropy-Optimal Distribution p*(s,k)

Consider the cluster $K_1$ of size s, the associated matrix
$$X(s)(i_1,\dots,i_s)=\begin{pmatrix}x^{(i_1)}\\ \vdots\\ x^{(i_s)}\end{pmatrix}=X(s)(k),\qquad \{i_1,\dots,i_s\}\;\Leftrightarrow\;k,$$
and the distance matrix
$$D(s)(i_1,\dots,i_s)=\begin{pmatrix}0 & \varrho^{(k)}\big(x^{(i_1)},x^{(i_2)}\big) & \cdots & \varrho^{(k)}\big(x^{(i_1)},x^{(i_s)}\big)\\ \vdots & \vdots & \ddots & \vdots\\ \varrho^{(k)}\big(x^{(i_s)},x^{(i_1)}\big) & \varrho^{(k)}\big(x^{(i_s)},x^{(i_2)}\big) & \cdots & 0\end{pmatrix}=D(s)(k).$$
We define the matrix indicator (4) for the cluster $K(s)$ as
$$\mathrm{dis}\big(X(s)(k)\big)=\frac{2}{s(s-1)}\sum_{\substack{t,h=1\\ t<h}}^{s}\varrho^{(k)}\big(x^{(i_t)},x^{(i_h)}\big).$$
Since the matrices $X(s)(k)$ are treated as random objects, their ensemble has a probability distribution $p(s,k)$. We introduce the average indicator in the form
$$\mathsf{M}\big\{\mathrm{dis}\big(X(s)(k)\big)\big\}=\sum_{s=1}^{n-1}\sum_{k=1}^{K(s)}p(s,k)\,\mathrm{dis}\big(X(s)(k)\big).$$
To determine the discrete probability distribution $p(s,k)$, we apply randomized machine learning with the Boltzmann–Shannon entropy functional [17]:
$$H_B[p(s,k)]=-\sum_{s=1}^{n-1}\sum_{k=1}^{K(s)}p(s,k)\ln p(s,k)\;\longrightarrow\;\max$$
subject to the constraints
$$0\le p(s,k)\le 1,\qquad s=\overline{1,n-1},\quad k=\overline{1,K(s)},$$
$$\inf(D)\;\le\;\sum_{s=1}^{n-1}\sum_{k=1}^{K(s)}p(s,k)\,\mathrm{dis}\big(X(s)(k)\big)\;\le\;\sup(D).$$
Here, the lower bound $\inf(D)$ and the upper bound $\sup(D)$ for the elements of the distance matrix are given by (5), and the indicator $\mathrm{dis}\big(X(s)(k)\big)$ is given by (16).
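Building on the sketches above, the cluster indicators (16) and the ensemble-averaged indicator can be evaluated as follows; this is again an illustrative sketch with our own naming, and enumeration over all sizes s is obtained by concatenating the per-size arrays.

```python
import numpy as np

def ensemble_indicators(X, s):
    """Indicator dis(X(s)(k)) of (16) for every cluster of size s,
    reusing distance_matrix, indicator and cluster_ensemble from the sketches above."""
    return np.array([indicator(distance_matrix(Xk))
                     for _, _, Xk in cluster_ensemble(X, s)])

def average_indicator(p, dis):
    """Ensemble-averaged indicator for a distribution p(s, k); p and dis are
    flattened over all (s, k) pairs in the same order."""
    return float(np.sum(np.asarray(p) * np.asarray(dis)))
```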

5. The Parametric Problem (18)–(20)

We treat problem (18)–(20) as finite-dimensional: the objective function (entropy) and the constraints depend on the finite-dimensional vector $\mathbf{p}$ composed of the values of the two-dimensional probability distribution $p(s,k)$:
$$\mathbf{p}=\Big\{p(1,k),\;k=\overline{1,K(1)};\;\dots;\;p(n-1,k),\;k=\overline{1,K(n-1)}\Big\}.$$
The dimension of this vector is
$$M=\sum_{s=1}^{n-1}K(s).$$
The constraints in (19) can be omitted by taking the Fermi entropy [18] as the objective function. Performing standard transformations, we arrive at a finite-dimensional entropy-linear programming problem [19] of the form
$$H(\mathbf{p})=-\sum_{s=1}^{n-1}\sum_{k=1}^{K(s)}\Big[p(s,k)\ln p(s,k)+\big(1-p(s,k)\big)\ln\big(1-p(s,k)\big)\Big]\;\longrightarrow\;\max,$$
$$0\le p(s,k)\le 1,\qquad s=\overline{1,n-1},\quad k=\overline{1,K(s)},$$
$$\sum_{s=1}^{n-1}\sum_{k=1}^{K(s)}p(s,k)\,\overline{\mathrm{dis}}\big(X(s)(k)\big)\;\ge\;1,\qquad \overline{\mathrm{dis}}\big(X(s)(k)\big)=\frac{\mathrm{dis}\big(X(s)(k)\big)}{\inf(D)},$$
$$\sum_{s=1}^{n-1}\sum_{k=1}^{K(s)}p(s,k)\,\underline{\mathrm{dis}}\big(X(s)(k)\big)\;\le\;1,\qquad \underline{\mathrm{dis}}\big(X(s)(k)\big)=\frac{\mathrm{dis}\big(X(s)(k)\big)}{\sup(D)}.$$
To solve this problem, we employ the Karush–Kuhn–Tucker theorem [20], expressing the optimality conditions in terms of Lagrange multipliers and a Lagrange function. For problem (23), the Lagrange function has the form
$$L[\mathbf{p},\lambda_1,\lambda_2]=H(\mathbf{p})+\lambda_1\Big(1-\sum_{s=1}^{n-1}\sum_{k=1}^{K(s)}p(s,k)\,\overline{\mathrm{dis}}\big(X(s)(k)\big)\Big)+\lambda_2\Big(1-\sum_{s=1}^{n-1}\sum_{k=1}^{K(s)}p(s,k)\,\underline{\mathrm{dis}}\big(X(s)(k)\big)\Big).$$
The optimality conditions for the saddle point of the Lagrange function (24) are written as
$$\nabla_{\mathbf{p}}L(\mathbf{p}^*,\lambda_1^*,\lambda_2^*)=0,\qquad \frac{\partial L(\mathbf{p}^*,\lambda_1^*,\lambda_2^*)}{\partial\lambda_i}\ge 0,$$
$$\lambda_i^*\,\frac{\partial L(\mathbf{p}^*,\lambda_1^*,\lambda_2^*)}{\partial\lambda_i}=0,\qquad \lambda_i^*\ge 0,\qquad i=1,2.$$
The first condition in (25) is analytically solvable with respect to the components of the vector p :
$$p^*(s,k\,|\,\lambda_1,\lambda_2)=\frac{\exp\Big(-\lambda_1\,\overline{\mathrm{dis}}\big(X(s)(k)\big)-\lambda_2\,\underline{\mathrm{dis}}\big(X(s)(k)\big)\Big)}{1+\exp\Big(-\lambda_1\,\overline{\mathrm{dis}}\big(X(s)(k)\big)-\lambda_2\,\underline{\mathrm{dis}}\big(X(s)(k)\big)\Big)},\qquad s=\overline{1,n-1},\quad k=\overline{1,K(s)}.$$
The second condition in (25) yields the inequalities
$$\frac{\partial L}{\partial\lambda_1}\big(p^*(s,k\,|\,\lambda_1^*,\lambda_2^*)\big)=1-\sum_{s=1}^{n-1}\sum_{k=1}^{K(s)}p^*(s,k\,|\,\lambda_1^*,\lambda_2^*)\,\overline{\mathrm{dis}}\big(X(s)(k)\big)\ge 0,$$
$$\frac{\partial L}{\partial\lambda_2}\big(p^*(s,k\,|\,\lambda_1^*,\lambda_2^*)\big)=1-\sum_{s=1}^{n-1}\sum_{k=1}^{K(s)}p^*(s,k\,|\,\lambda_1^*,\lambda_2^*)\,\underline{\mathrm{dis}}\big(X(s)(k)\big)\ge 0,$$
and the condition in (26) yields the following equations:
$$\lambda_1^*\,\frac{\partial L}{\partial\lambda_1}\big(p^*(s,k\,|\,\lambda_1^*,\lambda_2^*)\big)=0,\qquad \lambda_2^*\,\frac{\partial L}{\partial\lambda_2}\big(p^*(s,k\,|\,\lambda_1^*,\lambda_2^*)\big)=0,\qquad \lambda_1^*\ge 0,\quad\lambda_2^*\ge 0.$$
The non-negative solution of these inequalities and equations can be found using a multiplicative algorithm [19] of the form
$$\lambda_1^{q+1}=\lambda_1^{q}\Big(1+\gamma\,\frac{\partial L}{\partial\lambda_1}\big(p^*(s,k\,|\,\lambda_1^{q},\lambda_2^{q})\big)\Big),\qquad \lambda_2^{q+1}=\lambda_2^{q}\Big(1+\gamma\,\frac{\partial L}{\partial\lambda_2}\big(p^*(s,k\,|\,\lambda_1^{q},\lambda_2^{q})\big)\Big),\qquad \lambda_1^{0}>0,\;\lambda_2^{0}>0.$$
Here, $\gamma>0$ is a step parameter chosen according to the G-convergence conditions of the iterative process (30).
The algorithm in (30) is said to be G-convergent if there exist a set $G\subset R_+^2$ and scalars $a(G)$ and $\gamma$ such that, for all $(\lambda_1^0,\lambda_2^0)\in G$ and $0<\gamma\le a(G)$, the algorithm converges to the solution $(\lambda_1^*,\lambda_2^*)$ of Equation (29), and the rate of convergence in a neighborhood of $(\lambda_1^*,\lambda_2^*)$ is linear.
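Before turning to convergence, the iteration (30) can be written compactly as the following Python/NumPy sketch. The sign conventions follow the reconstruction of (27) and (28) above; the step γ, the iteration count, and the function name are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def multiplicative_lagrange(dis_bar, dis_under, gamma=1e-3, iters=20000, lam0=(1.0, 1.0)):
    """Multiplicative iteration (30) for the Lagrange multipliers of problem (23).

    dis_bar, dis_under -- arrays of the normalized indicators dis/inf(D) and
    dis/sup(D), flattened over all ensemble members (s, k)."""
    l1, l2 = lam0
    for _ in range(iters):
        e = np.exp(-l1 * dis_bar - l2 * dis_under)
        p = e / (1.0 + e)                       # p*(s, k | l1, l2), Eq. (27)
        dL1 = 1.0 - np.sum(p * dis_bar)         # dL/dlambda_1, Eq. (28)
        dL2 = 1.0 - np.sum(p * dis_under)       # dL/dlambda_2, Eq. (28)
        l1 *= 1.0 + gamma * dL1                 # Eq. (30)
        l2 *= 1.0 + gamma * dL2
    return l1, l2
```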
Theorem 1.
The algorithm in (30) is G-convergent to the solution of Equation (29).
Proof. 
Consider an auxiliary system of differential equations obtained from (30) as $\gamma\to 0$:
$$\frac{d\lambda_i}{dt}=\lambda_i\,\frac{\partial L}{\partial\lambda_i}\big(p^*(s,k\,|\,\lambda_1,\lambda_2)\big),\qquad i=1,2.$$
First, we have to establish its stability in the large, i.e., under any initial deviations in the space $R_+^2$.
Second, we have to demonstrate that the algorithm in (30) is a Euler difference scheme for Equation (31) with an appropriate step $\gamma$.
Let us describe some details of the proof. We define the following function on $R_+^2$:
$$V(\lambda_1,\lambda_2)=\sum_{i=1}^{2}\lambda_i^*\big(\ln\lambda_i^*-\ln\lambda_i\big).$$
This function is strictly convex on $R_+^2$. Its Hessian is
$$\Gamma=\mathrm{diag}\Big[\frac{\lambda_i^*}{\lambda_i^2}\;\Big|\;i=1,2\Big]\succ 0.$$
Hence, $\min_{R_+^2}V(\lambda_1,\lambda_2)=0$, and the minimum is achieved at the point $(\lambda_1^*,\lambda_2^*)$. Thus, $V(\lambda_1,\lambda_2)>0$ for $(\lambda_1,\lambda_2)\in R_+^2$, $(\lambda_1,\lambda_2)\neq(\lambda_1^*,\lambda_2^*)$, and $V(\lambda_1^*,\lambda_2^*)=0$.
We compute the time derivative of V along the trajectories of (31):
$$\frac{dV}{dt}=-\lambda_1^*\,\frac{\partial L}{\partial\lambda_1}\big(p^*(s,k\,|\,\lambda_1,\lambda_2)\big)-\lambda_2^*\,\frac{\partial L}{\partial\lambda_2}\big(p^*(s,k\,|\,\lambda_1,\lambda_2)\big).$$
According to (28), on $R_+^2$,
$$\frac{dV}{dt}\;\begin{cases}<0, & \lambda_1^*>0,\ \lambda_2^*>0,\\ =0, & \lambda_1^*=\lambda_2^*=0.\end{cases}$$
Hence, the function V is a Lyapunov function for Equation (31) on the space $R_+^2$, and all solutions of Equation (31) are asymptotically stable under any initial conditions $\lambda_1^0>0$ and $\lambda_2^0>0$.
The algorithm in (30) is a Euler difference scheme for (31). Due to the asymptotic stability of the solutions of (31), there always exist a step $\gamma>0$ and a domain of initial conditions for which the Euler scheme converges. □
By the general principle of statistical mechanics, the realized cluster corresponds to the maximum of the probability distribution:
$$K_{s^*,k^*}:\qquad (s^*,k^*)=\arg\max_{s,k}\,p^*(s,k).$$

5.1. Randomized Binary Clustering Algorithms

In this section, the clustering procedures are presented as algorithms in the form of logical schemes.

5.1.1. Algorithm R2K(s) with a Given Cluster Size s

1. Calculating the numerical characteristics of the data matrix $X(1,\dots,n)$:
(a) Constructing the row vectors
$$x^{(1)}=\{x_{11},\dots,x_{1m}\},\;\dots,\;x^{(n)}=\{x_{n1},\dots,x_{nm}\}.$$
(b) Calculating the elements of the (Euclidean) distance matrix $D(n\times n)$ in (3):
$$\varrho\big(x^{(i)},x^{(j)}\big)=\varrho_{ij}=\sqrt{\sum_{l=1}^{m}\big(x^{(i)}_{l}-x^{(j)}_{l}\big)^2},\qquad i,j=\overline{1,n}.$$
(c) Calculating the data matrix indicator in (4):
$$\mathrm{dis}(X)=\frac{2}{n(n-1)}\sum_{\substack{i,j=1\\ j>i}}^{n}\varrho_{ij}.$$
(d) Calculating the lower and upper bounds for the elements of the matrix $D(n\times n)$ in (3):
$$\inf(D)=\min_{i\neq j}\varrho_{ij},\qquad \sup(D)=\max_{i\neq j}\varrho_{ij}.$$
2. Forming the matrix ensemble $X(s)(i_1,\dots,i_s)$:
(a) Forming the correspondence table
$$\{i_1,\dots,i_s\}\;\Leftrightarrow\;k,\qquad i_j\in\{1,\dots,n\},\quad j=\overline{1,s};\qquad k=\overline{1,K(s)},\quad K(s)=C_n^s.$$
(b) Constructing the matrices $X(s)(i_1,\dots,i_s)$.
(c) Calculating the elements of the distance matrices $D(s)(k)$ in (15):
$$\varrho^{(k)}\big(x^{(i_h)},x^{(i_q)}\big)=\sqrt{\sum_{l=1}^{m}\big(x^{(i_h)}_{l}-x^{(i_q)}_{l}\big)^2},\qquad i_h\neq i_q.$$
(d) Calculating the indicator of the matrix $X(s)(k)$:
$$\mathrm{dis}\big(X(s)(k)\big)=\frac{2}{s(s-1)}\sum_{\substack{i_h,i_q\\ i_h<i_q}}\varrho_{i_h,i_q}.$$
3. Determining the Lagrange multipliers $\lambda_1$ and $\lambda_2$ for the finite-dimensional problem:
(a) Specifying the initial values of the Lagrange multipliers:
$$\lambda_1^{(0)}>0,\qquad \lambda_2^{(0)}>0.$$
(b) Applying the iterative algorithm (30):
$$\lambda_1^{q+1}=\lambda_1^{q}\Bigg(1+\gamma\Bigg[1-\sum_{s=1}^{n-1}\sum_{k=1}^{K(s)}\frac{\exp\big(-\lambda_1^{q}\,\overline{\mathrm{dis}}(X(s)(k))-\lambda_2^{q}\,\underline{\mathrm{dis}}(X(s)(k))\big)}{1+\exp\big(-\lambda_1^{q}\,\overline{\mathrm{dis}}(X(s)(k))-\lambda_2^{q}\,\underline{\mathrm{dis}}(X(s)(k))\big)}\,\overline{\mathrm{dis}}\big(X(s)(k)\big)\Bigg]\Bigg),$$
$$\lambda_2^{q+1}=\lambda_2^{q}\Bigg(1+\gamma\Bigg[1-\sum_{s=1}^{n-1}\sum_{k=1}^{K(s)}\frac{\exp\big(-\lambda_1^{q}\,\overline{\mathrm{dis}}(X(s)(k))-\lambda_2^{q}\,\underline{\mathrm{dis}}(X(s)(k))\big)}{1+\exp\big(-\lambda_1^{q}\,\overline{\mathrm{dis}}(X(s)(k))-\lambda_2^{q}\,\underline{\mathrm{dis}}(X(s)(k))\big)}\,\underline{\mathrm{dis}}\big(X(s)(k)\big)\Bigg]\Bigg),$$
where
$$\overline{\mathrm{dis}}\big(X(s)(k)\big)=\frac{\mathrm{dis}\big(X(s)(k)\big)}{\inf(D)},\qquad \underline{\mathrm{dis}}\big(X(s)(k)\big)=\frac{\mathrm{dis}\big(X(s)(k)\big)}{\sup(D)}.$$
(c) Determining the optimal probability distribution:
$$p^*(k\,|\,s,\lambda_1^*,\lambda_2^*)=\frac{\exp\big(-\lambda_1^*\,\overline{\mathrm{dis}}(X(s)(k))-\lambda_2^*\,\underline{\mathrm{dis}}(X(s)(k))\big)}{1+\exp\big(-\lambda_1^*\,\overline{\mathrm{dis}}(X(s)(k))-\lambda_2^*\,\underline{\mathrm{dis}}(X(s)(k))\big)}.$$
(d) Determining the most probable cluster $K_1^*$:
$$k^*=\arg\max_{k}\,p^*(k\,|\,s,\lambda_1^*,\lambda_2^*),\qquad K_1^*=\{i_1^*,\dots,i_s^*\}\;\Leftrightarrow\;k^*(s).$$
(e) Determining the cluster $K_2^*$:
$$K_2^*=\{j_1^*,\dots,j_{n-s}^*\}=\{1,\dots,n\}\setminus\{i_1^*,\dots,i_s^*\}.$$
A minimal end-to-end sketch of this procedure in Python is given below.
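The sketch reuses the helper functions introduced earlier; the naming, the step γ, and the iteration count are our illustrative choices, not the authors' implementation. For a fixed s, the sums of step 3b reduce to the K(s) clusters of that size; the sketch also assumes inf(D) > 0, the degenerate case being handled by the ε-parameterized formulation of Section 6.

```python
import numpy as np
from itertools import combinations

def r2k_fixed_s(X, s, gamma=1e-3, iters=20000):
    """Sketch of algorithm R2K(s) for a given cluster size s, reusing the helpers above."""
    D = distance_matrix(X)                                   # step 1b
    lo, hi = bounds(D)                                       # step 1d: inf(D), sup(D)
    subsets = list(combinations(range(X.shape[0]), s))       # step 2a
    dis = np.array([indicator(distance_matrix(X[list(idx), :]))
                    for idx in subsets])                     # step 2d
    dis_bar, dis_under = dis / lo, dis / hi                  # normalized indicators
    l1, l2 = multiplicative_lagrange(dis_bar, dis_under, gamma, iters)   # step 3b
    e = np.exp(-l1 * dis_bar - l2 * dis_under)
    p = e / (1.0 + e)                                        # step 3c
    k_star = int(np.argmax(p))                               # step 3d
    K1 = set(subsets[k_star])
    K2 = set(range(X.shape[0])) - K1                         # step 3e
    return K1, K2, p
```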

5.1.2. Algorithm R2K with an Unknown Cluster Size s ∈ [1, n − 1]

1. Applying step 1 of R2K(s):
$$\varrho\big(x^{(i)},x^{(j)}\big),\quad i,j=\overline{1,n};\qquad \mathrm{dis}(X),\quad \inf(D),\quad \sup(D).$$
2. Organizing a loop over the cluster size $s=\overline{1,n-1}$:
(a) Applying step 2 of R2K(s):
$$\{i_1,\dots,i_s\}\;\Leftrightarrow\;k,\qquad i_j\in\{1,\dots,n\},\quad j=\overline{1,s};\qquad k=\overline{1,K(s)},\quad K(s)=C_n^s,$$
$$\varrho^{(k)}\big(x^{(i_h)},x^{(i_q)}\big)=\sqrt{\sum_{l=1}^{m}\big(x^{(i_h)}_{l}-x^{(i_q)}_{l}\big)^2},\qquad i_h\neq i_q,$$
$$\mathrm{dis}\big(X(s)(k)\big)=\frac{2}{s(s-1)}\sum_{\substack{i_h,i_q\\ i_h<i_q}}\varrho_{i_h,i_q}.$$
(b) Applying step 3 of R2K(s):
$$p^*(s,k)=\frac{\exp\big(-\lambda_1^*\,\overline{\mathrm{dis}}(X(s)(k))-\lambda_2^*\,\underline{\mathrm{dis}}(X(s)(k))\big)}{1+\exp\big(-\lambda_1^*\,\overline{\mathrm{dis}}(X(s)(k))-\lambda_2^*\,\underline{\mathrm{dis}}(X(s)(k))\big)}.$$
(c) Storing $p^*(s,k)$ in memory.
(d) Calculating the conditionally maximum value of the entropy:
$$H_B^*[p^*(s,k)]=-\sum_{k=1}^{K(s)}p^*(s,k)\ln p^*(s,k)=H^*(s).$$
(e) Storing $H^*(s)$ in memory.
(f) If $s<n-1$, returning to Step 2a.
(g) Determining the maximum element of the array $H^*(s)$, $s=\overline{1,n-1}$:
$$s^*=\arg\max_{1\le s\le n-1}H^*(s).$$
(h) Extracting the probability distribution
$$p^*(s^*,k),\qquad k=\overline{1,K(s^*)}.$$
(i) Executing Steps 3d and 3e of R2K(s):
$$k^*(s^*)=\arg\max_{k}\,p^*(s^*,k),\qquad K_1^*=\{i_1^*,\dots,i_{s^*}^*\}\;\Leftrightarrow\;k^*(s^*),$$
$$K_2^*=\{j_1,\dots,j_{n-s^*}\}=\{1,\dots,n\}\setminus\{i_1^*,\dots,i_{s^*}^*\}.$$
A compact sketch of this loop in Python is given below.
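The loop over s can be sketched as follows, again reusing the functions above. Brute-force enumeration of all C(n,s) subsets for every s is exponential in n, so this is practical only for small illustrative datasets such as those in Section 7; the clipping constant is an implementation detail of ours.

```python
import numpy as np

def r2k_unknown_s(X, gamma=1e-3, iters=20000):
    """Sketch of algorithm R2K: loop over s = 1, ..., n-1, select the size s* with
    the largest conditional entropy H*(s), then take the most probable cluster."""
    n = X.shape[0]
    best = None
    for s in range(1, n):
        K1, K2, p = r2k_fixed_s(X, s, gamma, iters)          # steps 2a-2c
        q = np.clip(p, 1e-12, 1.0 - 1e-12)
        H = -np.sum(q * np.log(q))                           # step 2d: H*(s)
        if best is None or H > best[0]:
            best = (H, s, K1, K2)
    return best[1], best[2], best[3]                         # s*, K1*, K2*
```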

6. The Functional Problem (18)–(20)

Consider a parametric family of constrained entropy maximization problems of the form
$$H[p(s,k),\varepsilon]=-\sum_{s=1}^{n-1}\sum_{k=1}^{K(s)}\Big[p(s,k)\ln p(s,k)+\big(1-p(s,k)\big)\ln\big(1-p(s,k)\big)\Big]\;\longrightarrow\;\max,$$
$$\sum_{s=1}^{n-1}\sum_{k=1}^{K(s)}p(s,k)\,\mathrm{dis}\big(X(s)(k)\big)=\inf(D)+\varepsilon\big(\sup(D)-\inf(D)\big)=\Delta(\varepsilon),\qquad 0\le\varepsilon\le 1.$$
For some (unknown) value of the parameter ε, the solutions of (22) and (23) coincide. This value can be determined by solving problem (23) for a range of ε and recording the corresponding values of the entropy functional: the maximum value corresponds to the desired ε*.
Let us turn to Equation (23) with a fixed value of the parameter ε . It belongs to the class of Lyapunov-type problems [21]. We define a Lagrange functional as
$$L[p(s,k),\varepsilon,\lambda]=H[p(s,k),\varepsilon]+\lambda\Big(\Delta(\varepsilon)-\sum_{s=1}^{n-1}\sum_{k=1}^{K(s)}p(s,k)\,\mathrm{dis}\big(X(s)(k)\big)\Big).$$
Using the technique of Gâteaux derivatives, we obtain the stationarity conditions for the functional in (24) in the primal (functional) p ( s , k ) and dual (scalar) λ variables; for details, see [22,23]. The resulting optimal distribution parameterized by the Lagrange multiplier λ is given by
$$p^*(s,k\,|\,\lambda(\varepsilon))=\frac{\exp\big(-\lambda\,\mathrm{dis}(X(s)(k))\big)}{1+\exp\big(-\lambda\,\mathrm{dis}(X(s)(k))\big)}.$$
The Lagrange multiplier λ satisfies the equation
$$\sum_{s=1}^{n-1}\sum_{k=1}^{K(s)}\frac{\exp\big(-\lambda\,\mathrm{dis}(X(s)(k))\big)}{1+\exp\big(-\lambda\,\mathrm{dis}(X(s)(k))\big)}\,\mathrm{dis}\big(X(s)(k)\big)=\Delta(\varepsilon).$$
The solution $\lambda^*(\varepsilon)$ of this equation belongs to $(-\infty,+\infty)$ and depends on ε. Hence, the value of the entropy functional $H[p^*(s,k\,|\,\varepsilon),\varepsilon]$ depends on ε. We choose
$$\varepsilon^*=\arg\max_{\varepsilon}\,H[p^*(s,k\,|\,\varepsilon),\varepsilon].$$
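Under the stated assumptions, the scalar equation for λ and the entropy-maximizing choice of ε can be computed numerically, for instance as in the following sketch (SciPy's brentq root finder and the numerically stable logistic function expit). The bracket for λ and the grid of ε values are illustrative assumptions and may need adjusting to the data.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import expit

def lambda_for_eps(dis, lo, hi, eps, bracket=(-200.0, 200.0)):
    """Solve the scalar equation above: sum_k p(k|lambda) * dis_k = Delta(eps)."""
    delta = lo + eps * (hi - lo)
    # expit(-lam * d) = exp(-lam * d) / (1 + exp(-lam * d)), evaluated stably
    g = lambda lam: float(np.sum(expit(-lam * dis) * dis) - delta)
    return brentq(g, *bracket)

def best_eps(dis, lo, hi, grid=np.linspace(0.01, 0.99, 50)):
    """Sweep eps over a grid, compute the Fermi entropy of p*(.|eps),
    and return the eps with the largest entropy together with lambda*(eps)."""
    def fermi_entropy(p):
        q = np.clip(p, 1e-12, 1.0 - 1e-12)
        return -np.sum(q * np.log(q) + (1.0 - q) * np.log(1.0 - q))
    best = None
    for eps in grid:
        lam = lambda_for_eps(dis, lo, hi, eps)
        H = fermi_entropy(expit(-lam * dis))
        if best is None or H > best[0]:
            best = (H, eps, lam)
    return best[1], best[2]
```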
The randomized binary clustering procedure can be repeated t / 2 times to form t clusters. At each stage, two new clusters are generated from the remaining objects of the previous stage.

7. Illustrative Examples

Consider the binary clustering of iris flowers using Fisher's Iris dataset; in our examples, each flower is described by two numerical features, $x_1$ and $x_2$ (see Tables 1 and 3). The database contains feature information for three types of flowers, “setosa” (1), “versicolor” (2), and “virginica” (3), with 50 two-dimensional data points per type. Below, we consider types 1 and 2, taking 10 data points of each type.
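The subsets used below are listed explicitly in Tables 1 and 3. For readers who want to reproduce the setting, a comparable 20-point subset can be drawn from the scikit-learn copy of the Iris dataset as in the following sketch; this is an illustrative selection, not the authors' exact sampling.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
# Keep classes 0 ("setosa") and 1 ("versicolor") and the two petal features
# (columns 2 and 3 in scikit-learn's feature ordering).
mask = iris.target < 2
X_all, y_all = iris.data[mask][:, 2:4], iris.target[mask]
# Take 10 points of each class; this is an arbitrary draw -- the exact 20 rows
# used in the paper are those listed in Tables 1 and 3.
rng = np.random.default_rng(0)
idx = np.concatenate([rng.choice(np.where(y_all == c)[0], size=10, replace=False)
                      for c in (0, 1)])
X, y = X_all[idx], y_all[idx]
```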
Example 1.
The data matrix contains the numerical values of the two features for types 1 and 2; see Table 1.
Figure 1 shows the arrangement of the data points on the plane.
First, we apply the algorithm R2K(10); see Section 5.1.
The minimum and maximum elements of the distance matrix are
$$\inf\big(D(20\times 20)\big)=0,\qquad \sup\big(D(20\times 20)\big)=3.73.$$
The data matrix indicator is $\mathrm{dis}(X)=1.7382$. Let $\varepsilon=0.15$.
The ensemble of possible clusters has the size $K(10)=C_{20}^{10}=184756$. The cluster with number $k=256$ has the composition $i_1=1$, $i_2=2$, $i_3=3$, $i_4=4$, $i_5=5$, $i_6=6$, $i_7=7$, $i_8=14$, $i_9=15$, $i_{10}=20$. The distance matrix $D(10)(256)$ is presented in Table 2.
The indicator of the matrix $X(10)(256)$ corresponding to the cluster $K(10)(256)$ is
$$\mathrm{dis}\big(X(10)(256)\big)=1.5021.$$
The indicators for the clusters $k=1,\dots,184756$ are shown in Figure 2.
The entropy-optimal probability distribution for s = 10 has the form
$$p^*(k\,|\,10)=\frac{\exp\big(-\lambda^*\,\mathrm{dis}(X(10)(k))\big)}{1+\exp\big(-\lambda^*\,\mathrm{dis}(X(10)(k))\big)},\qquad \lambda^*=12.1153.$$
The cluster $K_1$ with the maximum probability is numbered by
$$k^*=166922,\qquad K_1=\{4,5,6,7,11,14,15,16,17,20\},\qquad \mathrm{dis}(K_1)=0.1354.$$
The cluster $K_2$ consists of the following data points: $\{1,2,3,8,9,10,12,13,18,19\}$.
The arrangement of the clusters K 1 and K 2 is shown in Figure 3.
A direct comparison with Figure 1 indicates a perfect match of 10/10: no clustering errors.
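To connect these numbers with the sketches from the previous sections, Example 1 can be approximated end to end as follows. This is a hypothetical usage of our earlier helper functions, not the authors' code; the brute-force enumeration of all C(20,10) = 184756 subsets takes a noticeable amount of time, and the recovered cluster index depends on the enumeration order.

```python
import numpy as np
from itertools import combinations
from scipy.special import expit

# The 20 x 2 data matrix of Table 1 (rows in the same order as in the table).
X = np.array([
    [4.5, 1.5], [4.6, 1.5], [4.7, 1.4], [1.7, 0.4], [1.3, 0.2],
    [1.4, 0.3], [1.5, 0.2], [3.9, 1.4], [4.5, 1.3], [4.6, 1.3],
    [1.4, 0.2], [4.7, 1.6], [4.0, 1.3], [1.4, 0.2], [1.4, 0.2],
    [1.5, 0.2], [1.5, 0.1], [4.9, 1.5], [3.3, 1.0], [1.4, 0.2],
])

D = distance_matrix(X)
lo, hi = bounds(D)
subsets = list(combinations(range(20), 10))
dis = np.array([indicator(distance_matrix(X[list(idx), :])) for idx in subsets])
lam = lambda_for_eps(dis, lo, hi, eps=0.15)   # single-multiplier version of Section 6
p = expit(-lam * dis)                         # entropy-optimal p*(k | 10)
k_star = int(np.argmax(p))
K1 = {i + 1 for i in subsets[k_star]}         # 1-based object numbers, as in the paper
K2 = set(range(1, 21)) - K1
print(sorted(K1), sorted(K2))
```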
Example 2.
Consider another data matrix from the same dataset (Table 3).
Figure 4 shows the arrangement of the data points.
As in Example 1, we apply the algorithm R2K(10).
We construct the distance matrix $D(20\times 20)$ and find its minimum and maximum elements:
$$\inf\big(D(20\times 20)\big)=0,\qquad \sup\big(D(20\times 20)\big)=3.73.$$
Let $\varepsilon=0.15$.
The ensemble of possible clusters has the size $K(10)=184756$. The indicators for the clusters $k=1,\dots,184756$ are shown in Figure 5.
The entropy-optimal probability distribution for s = 10 has the form
$$p^*(k\,|\,10)=\frac{\exp\big(-\lambda^*\,\mathrm{dis}(X(10)(k))\big)}{1+\exp\big(-\lambda^*\,\mathrm{dis}(X(10)(k))\big)},\qquad \lambda^*=100.$$
The cluster $K_1$ with the maximum probability is numbered by
$$k^*=177570,\qquad K_1=\{5,6,7,8,11,14,15,16,17,20\},\qquad \mathrm{dis}(K_1)=0.4420.$$
The cluster $K_2$ consists of the following data points: $\{1,2,3,4,9,10,12,13,18,19\}$.
The arrangement of the clusters K 1 and K 2 is shown in Figure 6. A direct comparison with Figure 4 indicates a match of 8/10.

8. Discussion and Conclusions

This paper has developed a novel concept of clustering. Its fundamental difference from conventional approaches is the generation of an ensemble of random clusters, each characterized by an indicator (the average inter-object distance within the cluster), with the indicators then averaged over the entire ensemble. Random clusters are parameterized by the number of objects s and the set of their numbers $k\Leftrightarrow\{i_1,\dots,i_s\}$. Therefore, the characteristic of the ensemble is the probability distribution of the clusters in it, which depends on s and k. A generalized variational principle of statistical mechanics, consisting in the conditional maximization of the Boltzmann–Shannon entropy, has been proposed to find this distribution. Algorithms for solving the resulting finite-dimensional and functional optimization problems have been developed.
An advantage of the proposed randomized clustering method is its complete algorithmization, independent of the properties of the clustered data, whereas existing clustering methods involve, to a greater or lesser extent, various empirical techniques related to data properties.
However, the method requires significant computational resources to form the ensemble of random clusters and their indicators.

Author Contributions

Conceptualization, Y.S.P.; Data curation, A.Y.P.; Methodology, Y.S.P., A.Y.P., and Y.A.D.; Software, A.Y.P. and Y.A.D.; Supervision, Y.S.P.; Writing—original draft, Y.S.P., A.Y.P. and Y.A.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Ministry of Science and Higher Education of the Russian Federation, project no. 075-15-2020-799.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Mandel, I.D. Klasternyi Analiz (Cluster Analysis); Finansy i Statistika: Moscow, Russia, 1988.
2. Zagoruiko, N.G. Kognitivnyi Analiz Dannykh (Cognitive Data Analysis); GEO: Novosibirsk, Russia, 2012.
3. Zagoruiko, N.G.; Barakhnin, V.B.; Borisova, I.A.; Tkachev, D.A. Clusterization of Text Documents from the Database of Publications Using FRiS-Tax Algorithm. Comput. Technol. 2013, 18, 62–74.
4. Jain, A.; Murty, M.; Flynn, P. Data Clustering: A Review. ACM Comput. Surv. 1999, 31, 264–323.
5. Vorontsov, K.V. Lektsii po Algoritmam Klasterizatsii i Mnogomernomu Shkalirovaniyu (Lectures on Clustering Algorithms and Multidimensional Scaling); Moscow State University: Moscow, Russia, 2007.
6. Leskovec, J.; Rajaraman, A.; Ullman, J. Mining of Massive Datasets; Cambridge University Press: Cambridge, UK, 2014.
7. Deerwester, S.; Dumais, S.T.; Furnas, G.W.; Landauer, T.K.; Harshman, R. Indexing by Latent Semantic Analysis. J. Am. Soc. Inf. Sci. 1990, 41, 391–407.
8. Zamir, O.E. Clustering Web Documents: A Phrase-Based Method for Grouping Search Engine Results. Ph.D. Thesis, The University of Washington, Seattle, WA, USA, 1999.
9. Cao, G.; Song, D.; Bruza, P. Suffix-Tree Clustering on Post-Retrieval Documents Information; The University of Queensland: Brisbane, QLD, Australia, 2003.
10. Huang, D.; Wang, C.D.; Lai, J.H.; Kwoh, C.K. Toward Multidiversified Ensemble Clustering of High-Dimensional Data: From Subspaces to Metrics and Beyond. IEEE Trans. Cybern. 2021.
11. Khan, I.; Luo, Z.; Shaikh, A.K.; Hedjam, R. Ensemble Clustering Using Extended Fuzzy k-Means for Cancer Data Analysis. Expert Syst. Appl. 2021, 172, 114622.
12. Jain, A.; Dubes, R. Clustering Methods and Algorithms; Prentice-Hall: Hoboken, NJ, USA, 1988.
13. Pal, N.R.; Biswas, J. Cluster Validation Using Graph Theoretic Concept. Pattern Recognit. 1997, 30, 847–857.
14. Halkidi, M.; Batistakis, Y.; Vazirgiannis, M. On Clustering Validation Techniques. J. Intell. Inf. Syst. 2001, 17, 107–145.
15. Han, J.; Kamber, M.; Pei, J. Data Mining: Concepts and Techniques; Morgan Kaufmann Publishers: Burlington, MA, USA, 2012.
16. Popkov, Y.S. Randomization and Entropy in Machine Learning and Data Processing. Dokl. Math. 2022, 105, 135–157.
17. Popkov, Y.S.; Dubnov, Y.A.; Popkov, A.Y. Introduction to the Theory of Randomized Machine Learning. In Learning Systems: From Theory to Practice; Sgurev, V., Piuri, V., Jotsov, V., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 199–220.
18. Popkov, Y.S. Macrosystems Theory and Its Applications; Lecture Notes in Control and Information Sciences, Vol. 203; Springer: Berlin, Germany, 1995.
19. Popkov, Y.S. Multiplicative Methods for Entropy Programming Problems and Their Applications. In Proceedings of the 2010 IEEE International Conference on Industrial Engineering and Engineering Management, Xiamen, China, 29–31 October 2010; pp. 1358–1362.
20. Polyak, B.T. Introduction to Optimization; Optimization Software: New York, NY, USA, 1987.
21. Joffe, A.D.; Tihomirov, A.M. Teoriya Ekstremalnykh Zadach (Theory of Extremal Problems); Nauka: Moscow, Russia, 1974.
22. Tihomirov, V.M.; Alekseev, V.N.; Fomin, S.V. Optimal Control; Nauka: Moscow, Russia, 1979.
23. Popkov, Y.; Popkov, A. New Methods of Entropy-Robust Estimation for Randomized Models under Limited Data. Entropy 2014, 16, 675–698.
Figure 1. Data points on the two-dimensional plane.
Figure 2. Indicators for k ∈ [1, 184756].
Figure 3. Randomized clustering results.
Figure 4. Data points on the two-dimensional plane.
Figure 5. Indicators for k ∈ [1, 184756].
Figure 6. Randomized clustering results.
Table 1. Data matrix.

No.   x_1   x_2   Type
1     4.5   1.5   2
2     4.6   1.5   2
3     4.7   1.4   2
4     1.7   0.4   1
5     1.3   0.2   1
6     1.4   0.3   1
7     1.5   0.2   1
8     3.9   1.4   2
9     4.5   1.3   2
10    4.6   1.3   2
11    1.4   0.2   1
12    4.7   1.6   2
13    4.0   1.3   2
14    1.4   0.2   1
15    1.4   0.2   1
16    1.5   0.2   1
17    1.5   0.1   1
18    4.9   1.5   2
19    3.3   1.0   2
20    1.4   0.2   1
Table 2. Distance matrix for cluster K(256).

No.   1     2     3     4     5     6     7     8     9     10
1     0     0.1   0.22  3.01  3.45  3.32  3.27  3.36  3.36  3.36
2     0.1   0     0.14  3.1   3.55  3.42  3.36  3.45  3.45  3.45
3     0.22  0.14  0     3.16  3.61  3.48  3.42  3.51  3.51  3.51
4     3.01  3.1   3.16  0     0.45  0.32  0.28  0.36  0.36  0.36
5     3.45  3.55  3.61  0.45  0     0.14  0.2   0.1   0.1   0.1
6     3.32  3.42  3.48  0.32  0.14  0     0.14  0.1   0.1   0.1
7     3.27  3.36  3.42  0.28  0.2   0.14  0     0.1   0.1   0.1
8     3.36  3.45  3.51  0.36  0.1   0.1   0.1   0     0     0
9     3.36  3.45  3.51  0.36  0.1   0.1   0.1   0     0     0
10    3.36  3.45  3.51  0.36  0.1   0.1   0.1   0     0     0
Table 3. Data matrix.

No.   x_1   x_2   Type
1     6.4   3.2   2
2     6.5   2.8   2
3     7.0   3.2   2
4     5.4   3.9   1
5     4.7   3.2   1
6     4.6   3.4   1
7     4.6   3.1   1
8     5.2   2.7   2
9     5.7   2.8   2
10    6.6   2.9   2
11    5.1   3.5   1
12    6.3   3.3   2
13    5.5   2.3   2
14    4.4   2.9   1
15    4.9   3.0   1
16    5.0   3.4   1
17    4.9   3.1   1
18    6.9   3.1   2
19    4.9   2.4   2
20    5.0   3.6   1
