Weighted k-Prototypes Clustering Algorithm Based on the Hybrid Dissimilarity Coefficient

The k-prototypes algorithm is a hybrid clustering algorithm that can process both Categorical Data and Numerical Data. In this study, the method of initial Cluster Center selection was improved and a new Hybrid Dissimilarity Coefficient was proposed. Based on the proposed Hybrid Dissimilarity Coefficient, a weighted k-prototypes clustering algorithm (WKPCA) was proposed. The proposed WKPCA algorithm not only improves the selection of initial Cluster Centers, but also proposes a new method to calculate the dissimilarity between data objects and Cluster Centers. Real datasets from UCI were used to test the WKPCA algorithm. Experimental results show that the WKPCA algorithm is more efficient and robust than other k-prototypes algorithms.


Introduction
Cluster analysis belongs to unsupervised learning and is an important research direction in the field of machine learning [1]. As an important data analysis tool, clustering analysis can divide data objects into different subclusters by calculating the dissimilarity of data objects without labeled samples. The aim is that data objects in the same cluster have low dissimilarity, while data objects in different clusters have high dissimilarity.
Clustering aims to find the correlations between subclusters in a dataset and to evaluate the dissimilarity among the data objects in these subclusters [2]. In the field of Categorical Data clustering, the classical k-modes algorithm [3] uses the modes vector to represent the Cluster Centers. The modes vector is the combination of the most frequently occurring eigenvalue of each feature in the subcluster. The dissimilarity between the data objects to be clustered and the cluster is calculated by the simple Hamming distance, and only Categorical Data can be processed. In the field of Numerical Data clustering, the classical k-means algorithm [4] uses the means vector to represent the Cluster Centers, and the means vector is the average value of each eigenvalue in the subcluster. The Euclidean distance is used to calculate the dissimilarity between the cluster and the data objects to be clustered, and only Numerical Data can be processed. Both the k-modes algorithm and the k-means algorithm can only handle a single type of data.
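As an illustrative sketch of the two classical dissimilarities described above, the following minimal functions (names are ours, not from the paper) compute the k-modes-style Hamming count for categorical rows and the k-means-style squared Euclidean distance for numerical rows:

```python
def hamming_dissimilarity(x, q):
    """k-modes style: count mismatching categorical eigenvalues."""
    return sum(1 for a, b in zip(x, q) if a != b)

def squared_euclidean(x, q):
    """k-means style: squared Euclidean distance between numerical rows."""
    return sum((a - b) ** 2 for a, b in zip(x, q))

print(hamming_dissimilarity(["red", "S", "yes"], ["red", "M", "no"]))  # 2 mismatches
print(squared_euclidean([1.0, 2.0], [4.0, 6.0]))                       # 3^2 + 4^2 = 25.0
```

Each handles only its own data type, which is exactly the limitation the k-prototypes algorithm addresses.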
In actual clustering tasks, people not only need to deal with Categorical Data and Numerical Data separately, but also need to deal with a large number of mixed-type datasets composed of both. Because there is a big difference between Categorical Data and Numerical Data, and mixed-type datasets are usually high dimensional, it is very complicated to deal with mixed-type data in cluster analysis. A simple way to deal with mixed-type data is data preprocessing, which directly converts each Categorical Feature into a Numerical Feature. In other words, the mixed-type data are directly converted into Numerical Data, and then a Numerical Data clustering algorithm is applied. For example, each Categorical Feature is converted to a binary string, and then an algorithm for Numerical Data is used to do the clustering division. However, there are four disadvantages to using binary encoding for data preprocessing: (1) the original structure of the Categorical Data is destroyed, resulting in meaningless binary features after conversion; (2) the implicit dissimilarity information is ignored, so the structure of the dataset cannot be truly reflected; (3) if the range of eigenvalues is large, the converted binary eigenvalues will have a large dimension; and (4) maintenance is difficult: if new eigenvalues are added to a Categorical Feature, then all data objects will change [5].
To solve these problems, researchers have carried out a series of exploratory studies. The k-prototypes algorithm [6] and its variants are mixed-type data clustering algorithms that take into account the Dissimilarity Coefficients of Categorical Features and Numerical Features at the same time. Such algorithms can process both Categorical Data and Numerical Data simultaneously, but their clustering parameters need to be set artificially. The OCIL algorithm [7] is a parameter-free hybrid clustering algorithm in which a uniform Dissimilarity Coefficient is given based on entropy. Like k-prototypes and its variants, the OCIL algorithm uses the k-means paradigm to process mixed-type data. It is an iterative algorithm, sensitive to initialization, and more suitable for spherically distributed data. Ji et al. [8] improved the k-prototypes algorithm by considering the influence of feature weights on the clustering process and proposed a Dissimilarity Coefficient based on feature importance. Renato et al. [9] further improved Ji's algorithm, endowed different features with different weights, and used the Lp Distance Function as the new Dissimilarity Coefficient. Yao et al. [10] proposed an anonymous algorithm for hierarchical clustering based on k-prototypes (KLS for short). The KLS algorithm improves the formula of the Dissimilarity Coefficient and unifies the weight setting of the Categorical Features and Numerical Features, but the weights need to be specified in advance by experts. The DPC-KNN-PCA algorithm [11] introduced the density peak algorithm into the k-prototypes algorithm to determine the initial Cluster Centers and improved the local neighborhood density ρ_i through the nearest neighbor algorithm. However, the selection of the nearest neighbor value k is easily affected by the dataset distribution. Dongwei et al. [12] proposed a k-prototypes algorithm based on the adaptive determination of the initial centroids (KP-ADIC for short).
The KP-ADIC algorithm can determine the initial Cluster Centers adaptively, but its Dissimilarity Coefficient cannot fully capture the dissimilarity between the data. Sangam et al. [13] proposed an equi-biased k-prototypes algorithm for clustering mixed-type data (EKACMD for short) in 2018. The EKACMD algorithm is a variant of k-prototypes, which improves the Dissimilarity Coefficient by considering the relative frequency and distribution of each Categorical Feature. The EKACMD algorithm can fully consider the structural characteristics of the data in some cases and improve the clustering accuracy, but it is still not applicable when the occurrence frequencies of the eigenvalues of the Categorical Data are equal.
Cui et al. [14] applied rough sets to the k-prototypes algorithm and proposed the RS-KP algorithm, which uses rough sets to calculate the dissimilarity between eigenvalues. Although the RS-KP algorithm can deal with outliers in the clustering of mixed-type data, it is difficult to cluster discretized data when the eigenvalue ranges overlap; that is, the clustering results of the RS-KP algorithm are easily affected by the discretization.
In this paper, the k-prototypes algorithm and its variants were analyzed and compared, the automatic determination method of initial Cluster Centers was improved, and a new Hybrid Dissimilarity Coefficient is proposed. The value of these improvements lies in the following: (1) avoiding the randomness of the selection of the initial Cluster Centers; (2) making the clustering method more suitable for the characteristics of mixed-type data; (3) there is no need to manually set parameters in the clustering process, such as the number of clusters k and the weight parameter γ; and (4) there is no limitation on the types of clustering data: Categorical Data, Numerical Data, and mixed-type data can all be processed, which not only makes the clustering results more ideal, but also provides a new idea for the analysis and mining of real-world data. The organizational structure of this paper is as follows: Section 2 introduces the symbols related to this paper. Section 3 introduces the k-prototypes algorithm. Section 4 details the design of the WKPCA algorithm, and Section 5 gives the experimental results and analysis. Finally, Section 6 is a summary of this paper. Table 1 shows the symbols used in this article.

The k-Prototypes Algorithm
Huang [6] proposed the k-prototypes algorithm for clustering mixed-type data, which combines the ideas of the k-means algorithm [2] and the k-modes algorithm [3]. The k-prototypes algorithm divides the dataset into k (k ∈ N+) different subclusters so as to minimize the value of the Cost Function. The Cost Function is shown in the following formula:

F(U, Q) = ∑_{l=1}^{k} ∑_{i=1}^{n} u_{il} d(x_i, q_l). (1)

The k-prototypes algorithm combines the "means" of the numerical part and the "modes" of the categorical part to build a new hybrid Cluster Center, the "prototype." On the basis of the "prototype," it builds a Dissimilarity Coefficient formula and a Cost Function applicable to mixed-type data. The parameter γ is introduced to control the influence of the Categorical Features and the Numerical Features on the clustering process. It is assumed that the mixed-type dataset has p Numerical Features and m − p Categorical Features. For any x_i, q_l ∈ D, the Dissimilarity Coefficient of k-prototypes is defined as shown in the following formula:

d(x_i, q_l) = ∑_{s=1}^{p} (x^N_{i,s} − q^N_{l,s})² + γ ∑_{s=p+1}^{m} δ(x^C_{i,s}, q^C_{l,s}), (2)

where δ(x^C_{i,s}, q^C_{l,s}) = 0 when x^C_{i,s} = q^C_{l,s}, and δ(x^C_{i,s}, q^C_{l,s}) = 1 when x^C_{i,s} ≠ q^C_{l,s}. The k-prototypes algorithm divides the Dissimilarity Coefficient of the mixed-type data into two parts that are calculated separately. The categorical part adopts the simple Hamming distance, and the numerical part adopts the square of the Euclidean distance [15]. The proportion of the two types of data in the Dissimilarity Coefficient is adjusted by the parameter γ [12], an important adjustable parameter of the k-prototypes algorithm. The purpose of introducing the parameter γ is to prevent the clustering result from being biased toward either the Categorical Features or the Numerical Features and to control the relative weight of dissimilarity between Categorical Data and Numerical Data. The basic steps of the k-prototypes algorithm are described as follows:

Step 1: k data objects are randomly selected from dataset D as the initial Cluster Centers.
Step 2: formula (2) is used to calculate the dissimilarity between x_i and q_l. According to the calculation result, x_i is allocated to the nearest cluster.
Step 3: according to the current Cluster Centers, the dissimilarity of each data object is recalculated, and the data objects are reassigned to the nearest subcluster. The Cluster Centers are then updated: the categorical part takes the value with the highest frequency, and the numerical part takes the average value.
Step 4: repeat Steps 2 and 3 until the Cost Function no longer changes. If the Cost Function no longer changes, the algorithm ends. Otherwise, skip to Step 2 to continue.

Table 1: Symbols used in this paper.
x_{i,s}: the s-th eigenvalue of the i-th data object; x_{i,s} ∈ DOM(A_s), s = 1, 2, ..., m, where m is the feature dimension of the dataset, the first p dimensions are Numerical Features, and the latter m − p dimensions are Categorical Features
d(x_i, q_l): the dissimilarity between data objects
C_l: the l-th cluster; C = {C_1, ..., C_l, ..., C_k} is the set of all clusters contained in dataset D
q_l: the Cluster Center corresponding to the l-th cluster; Q = {q_l}, l = 1, 2, ..., k, is the set of Cluster Centers
|C_l|: the number of data objects in cluster C_l
F(U, Q): the Cost Function considering the weights
d̄: the average distance between two data objects
ρ_i: local neighborhood density
d_c: cutoff distance
L_i: relative distance

3.1. Description of Problem. The k-prototypes algorithm can cluster mixed-type data, and its principle is simple and easy to implement, but it still has some shortcomings in the clustering process: (1) the random selection of the initial Cluster Centers results in uncertainty and randomness of the clustering results, and the number of clusters k must be determined manually; (2) the simple Hamming distance is used to calculate the dissimilarity between the Categorical Data and the Cluster Centers, resulting in a loss of information and an inability to objectively reflect the real relationship between the data objects and the clusters, which leads to inaccurate clustering results; (3) the parameter γ used to adjust the proportion between Categorical Data and Numerical Data must be determined manually; and (4) the structural characteristics of Categorical Data and Numerical Data and the overall distribution of the dataset are not fully considered.
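The k-prototypes dissimilarity of formula (2) can be sketched in a few lines (an illustrative sketch, assuming the first p features are numerical and the rest categorical; the function name is ours):

```python
def kprototypes_dissimilarity(x, q, p, gamma):
    """Formula (2): squared Euclidean on the first p (numeric) features
    plus gamma times the Hamming count on the remaining (categorical) ones."""
    numeric = sum((x[s] - q[s]) ** 2 for s in range(p))
    categorical = sum(1 for s in range(p, len(x)) if x[s] != q[s])
    return numeric + gamma * categorical

x = [1.0, 2.0, "A", "B"]
q = [0.0, 0.0, "A", "C"]
print(kprototypes_dissimilarity(x, q, p=2, gamma=0.5))  # (1 + 4) + 0.5 * 1 = 5.5
```

Step 2 of the algorithm assigns each object to the center minimizing this value.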

Problem with the Dissimilarity Coefficient.
With the help of the artificial dataset D_1 shown in Table 2, the disadvantages of the user-specified parameter γ in the clustering process are discussed. D_1 contains 27 data objects, and each data object is described by two Numerical Features and one Categorical Feature. The Categorical Feature A_3 has three eigenvalues, A, B, and C; a solid triangle, a solid circle, and a solid square represent data objects whose A_3 value is A, B, and C, respectively. When the parameter γ = 0, the clustering result of D_1 depends only on the two features A_1 and A_2. The clustering results are shown in Figure 1. D_1 has three clusters; to facilitate observation, the three clusters are separated by dotted lines. When γ > 0, the data object x_7 can be moved to cluster C_2, because the value of feature A_3 of x_7 is the same as that of most data objects in cluster C_2. Similarly, the data object x_10 can be moved to cluster C_1. When the value of the parameter γ changes, the cluster assignments of x_7 and x_10 change accordingly; they may be divided into cluster C_1 or cluster C_2. Data objects x_1 and x_20 remain in their original clusters because they are too far away from the other clusters, even though they share eigenvalues with data objects in those clusters. In summary, it is important to define the parameter γ on an appropriate scale. For a related discussion, see [13].
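The sensitivity to γ can be reproduced on a toy example (not dataset D_1; values and names are ours): one object whose numerical part favors one center while its categorical part favors the other flips clusters as γ grows.

```python
def dissim(x, q, p, gamma):
    """k-prototypes dissimilarity, formula (2) style."""
    num = sum((x[s] - q[s]) ** 2 for s in range(p))
    cat = sum(1 for s in range(p, len(x)) if x[s] != q[s])
    return num + gamma * cat

x  = [1.0, "B"]
q1 = [0.0, "A"]   # numerically closer, but categorical mismatch
q2 = [2.5, "B"]   # numerically farther, but categorical match
for gamma in (0.0, 2.0):
    nearest = min((1, 2), key=lambda l: dissim(x, (q1, q2)[l - 1], p=1, gamma=gamma))
    print(gamma, "-> cluster", nearest)   # gamma=0 -> cluster 1; gamma=2 -> cluster 2
```

This is exactly the behavior of x_7 and x_10 described above, and it motivates replacing the hand-tuned γ with data-driven weights.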

Problems in Initial Cluster Center Selection.
The classical k-prototypes algorithm is very sensitive to the initial Cluster Centers, which are selected by random initialization or manual setting, both of which lead to instability of the clustering results to a certain extent. Initial Cluster Centers with different locations and k values will produce different clustering results. As shown in Figure 2, the actual cluster number of this dataset is k = 3. Figure 2 shows the clustering results generated by different initial Cluster Centers when the initial cluster number is set to k = 2, k = 3, and k = 4 (the contents described in Figure 2 from left to right are the random selection of initial Cluster Centers, the clustering iteration process, and the final clustering result). Therefore, it is very important for the clustering algorithm to find suitable initial Cluster Centers.

Weighted k-Prototypes Clustering Algorithm Based on the Hybrid Dissimilarity Coefficient (WKPCA)

The motivation for the proposed algorithm is (1) to provide an effective method for expressing the Dissimilarity Coefficient in mixed-type data clustering and (2) to avoid the uncertainty caused by the random selection of initial Cluster Centers.
In order to solve the problem of quantitatively measuring information, Shannon borrowed the concept of thermal entropy from thermodynamics in 1948 and proposed the concept of "Information Entropy," defined over the occurrence probabilities of discrete random events. The amount of information carried by an event is related to its probability: the smaller the probability of an event, the more information its occurrence conveys. For example, the event "heavy rain in a place where it does not rain frequently" carries a large amount of information. Conversely, the larger the probability of an event, the less information it conveys. For example, the event "the sun rises in the east" will definitely happen, so it carries very little information. Shannon's Information Entropy formula [16] is defined as follows:

En(A) = −∑_{i=1}^{n} p(x_i) log p(x_i), (3)

where p(x_i) = |x_i|/|D|, 1 ≤ i ≤ n, represents the probability of the random event x_i, and x_i is a subset of the partitioned dataset D. When D satisfies p(x_1) = p(x_2) = ... = p(x_n), En(A) takes its maximum value log n; when D satisfies p(x_1) = 1, n = 1, En(A) takes its minimum value 0. Information Entropy has the following basic properties:

Nonnegativity: the negative sign in the Information Entropy formula keeps the value nonnegative; it reflects the reduction or elimination of the disordered state once information about the system is obtained, that is, the amount of uncertainty eliminated.

Symmetry: the variables of the function are interchangeable without affecting the value of the function.
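Formula (3) is easy to check empirically (a minimal sketch; the function name is ours). A uniform distribution over n values attains the maximum log n, and a certain event gives entropy 0:

```python
import math
from collections import Counter

def information_entropy(values):
    """Shannon entropy of an empirical distribution, as in formula (3)."""
    counts = Counter(values)
    n = len(values)
    # sum p * log(1/p), written as (c/n) * log(n/c)
    return sum((c / n) * math.log(n / c) for c in counts.values())

print(information_entropy(["A", "B", "C", "D"]))  # uniform over 4 values: log 4
print(information_entropy(["A", "A", "A", "A"]))  # certain event: 0.0
```

The natural logarithm is used here; any base only rescales the values.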

Dissimilarity Coefficient of Categorical Data Based on Entropy Weight

Information Entropy can be used to calculate the discreteness of data and to assign an appropriate weight to each feature to improve the clustering effect.

Table 2: Artificial dataset D_1.

In the clustering process, the importance of a Categorical Feature is inversely proportional to its dissimilarity [15]. To some extent, the Information Entropy of each Categorical Feature reflects its weight w_s. Therefore, according to the uncertainty of the values of each Categorical Feature, this paper uses Information Entropy to calculate the importance of each Categorical Feature in the clustering process and assigns the weight w_s to the Dissimilarity Coefficient.
Definition 1 (the intracluster relative frequency of eigenvalues). Suppose that most of the data objects in cluster C_l contain the same categorical eigenvalue A^C_{s,t}, which means that the eigenvalue A^C_{s,t} appears frequently in cluster C_l, so the intracluster dissimilarity of the eigenvalue A^C_{s,t} will be low. The intracluster relative frequency of the eigenvalue is defined as follows:

P(A^C_{s,t} | C_l) = |{x_i ∈ C_l : x^C_{i,s} = A^C_{s,t}}| / |C_l|. (4)

Definition 2 (the intercluster distribution frequency of eigenvalues). The intercluster distribution frequency of an eigenvalue refers to the occurrence frequency of the eigenvalue A^C_{s,t} in a cluster relative to the total frequency of the eigenvalue in all clusters. The intercluster distribution frequency of the eigenvalue is defined as follows:

G(A^C_{s,t} | C_l) = |{x_i ∈ C_l : x^C_{i,s} = A^C_{s,t}}| / ∑_{j=1}^{k} |{x_i ∈ C_j : x^C_{i,s} = A^C_{s,t}}|. (5)

Definition 3 (Dissimilarity Coefficient of categorical). Let d_C(x_i, q_l) represent the dissimilarity of the categorical portion of the mixed-type dataset; it is defined in formula (6).

Definition 4 (the entropy of a Categorical Feature). From the perspective of information theory, the importance of a feature can be seen as the dissimilarity of the dataset relative to the feature. Basak [17] mentions that if the information content of a feature is high, the dissimilarity of the feature is also high. Let x_i be a discrete random variable belonging to the finite dataset D, and let P(x_i) be the probability function of the discrete random variable x_i. Because the eigenvalue domain of Categorical Data is fixed, the eigenvalues in the domain can be regarded as discrete and independent. Suppose a Categorical Feature A^C_s has n_s different eigenvalues, and the probability of occurrence of each eigenvalue is p(A^C_{s,t}), 1 ≤ t ≤ n_s; then the importance of the Categorical Feature can be calculated by formula (7) [17]:

E_n(A^C_s) = −∑_{t=1}^{n_s} p(A^C_{s,t}) log p(A^C_{s,t}). (7)
P(A^C_{s,t}) is the intracluster relative frequency of the eigenvalue A^C_{s,t} mentioned in Definition 1. The larger the value of P(A^C_{s,t}), the larger the proportion of the eigenvalue A^C_{s,t} in the feature A^C_s, and the smaller the intracluster dissimilarity between data objects holding the eigenvalue A^C_{s,t} and the cluster C_l. Observing formula (7), we can find that the more possible values the feature A^C_s has, the larger its entropy tends to be. In practice, however, a larger value domain does not imply higher importance: the more distinct values a feature takes, the less influence this feature has on clustering. In order to reduce the influence of Categorical Features with too many distinct values on clustering [18], formula (7) is further modified as follows.
Definition 5 (quantized entropy). When defining the entropy of the feature A^C_s, we divide by the number n_s of possible values of the feature A^C_s. The quantized entropy of the Categorical Feature A^C_s is defined as follows:

E'_n(A^C_s) = −(1/n_s) ∑_{t=1}^{n_s} p(A^C_{s,t}) log p(A^C_{s,t}). (8)

Definition 6 (weight w_s of the s-th Categorical Feature). The eigenvalue distribution of each feature dimension is different, and differently distributed eigenvalues give the Categorical Features of different dimensions different importance. In order to better discover all or part of the "prototypes" hidden in the dataset, the weight of each dimension should be taken into account when defining the Dissimilarity Coefficient. Let the weights of the feature dimensions be w_1, ..., w_s, ..., w_m with w_s > 0, s = 1, 2, ..., m; the weighted data objects and the weight vector W are defined in formula (9). The weight of each Categorical Feature dimension is defined as the ratio of the redundancy of that feature to the sum of the overall redundancy, where the entropy redundancy is computed as 1 − E'_n(A^C_s). All quantized entropy values in the dataset are normalized to obtain the entropy weight of each Categorical Feature. The definition of w_s is as follows:

w_s = (1 − E'_n(A^C_s)) / ∑_j (1 − E'_n(A^C_j)), (10)

where the sum in the denominator runs over all Categorical Features.

Theorem 2 (0 ≤ w_s ≤ 1, and the weights of all Categorical Features sum to 1).
The denominator of formula (10), the sum of the redundancies 1 − E'_n(A^C_s) over all Categorical Features, normalizes the weights. The larger the redundancy 1 − E'_n(A^C_s), the larger the weight w_s of the feature A^C_s.
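The entropy-weight computation of Definitions 5 and 6 can be sketched as follows (a sketch of formulas (8) and (10) as described in the text; function names are ours). A skewed column has low quantized entropy, hence high redundancy and a larger weight:

```python
import math
from collections import Counter

def quantized_entropy(column):
    """Definition 5 sketch: Shannon entropy of a categorical column,
    divided by the number of distinct eigenvalues n_s (formula (8))."""
    counts = Counter(column)
    n = len(column)
    ent = sum((c / n) * math.log(n / c) for c in counts.values())
    return ent / len(counts)

def entropy_weights(columns):
    """Definition 6 sketch: normalize the redundancies 1 - E'_n so that
    the weights of all categorical features sum to 1 (formula (10))."""
    redundancy = [1.0 - quantized_entropy(col) for col in columns]
    total = sum(redundancy)
    return [r / total for r in redundancy]

cols = [["A", "A", "A", "B"],   # skewed  -> low entropy -> higher weight
        ["A", "B", "C", "D"]]   # uniform -> high entropy -> lower weight
w = entropy_weights(cols)
print(w, sum(w))
```

As Theorem 2 states, the weights are in [0, 1] and sum to 1.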
Definition 7 (Dissimilarity Coefficient of categorical based on entropy weight). For any x_i, q_l ∈ D, the Dissimilarity Coefficient of categorical based on entropy weight between x_i and q_l is defined in formula (11). The proposed Dissimilarity Coefficient is demonstrated using the artificial dataset D_2, which is shown in Table 3. For the data object x_16, the calculated dissimilarities are 0.632130 and d(x_16, q_3) = 0.946240. According to the above calculation, the correct clustering division of x_16 can be carried out using the WKPCA algorithm.
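The frequency quantities of Definitions 1 and 2, on which the entropy-weighted coefficient builds, can be computed directly (an illustrative sketch of formulas (4) and (5); the helper names and toy clusters are ours):

```python
def intracluster_relative_frequency(cluster, s, value):
    """Definition 1 sketch: fraction of objects in the cluster whose
    s-th categorical feature equals `value` (formula (4))."""
    return sum(1 for x in cluster if x[s] == value) / len(cluster)

def intercluster_distribution_frequency(clusters, l, s, value):
    """Definition 2 sketch: occurrences of `value` in cluster l relative
    to its total occurrences over all clusters (formula (5))."""
    total = sum(1 for c in clusters for x in c if x[s] == value)
    in_l = sum(1 for x in clusters[l] if x[s] == value)
    return in_l / total if total else 0.0

c0 = [["A"], ["A"], ["B"]]
c1 = [["A"], ["B"], ["B"]]
print(intracluster_relative_frequency(c0, 0, "A"))               # 2 of 3 objects
print(intercluster_distribution_frequency([c0, c1], 0, 0, "A"))  # 2 of 3 occurrences
```

An eigenvalue that is frequent inside a cluster and concentrated in that cluster contributes a low categorical dissimilarity, as the text describes.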

Quantized Numerical Dissimilarity Coefficient
Definition 8 (quantized numerical Dissimilarity Coefficient). The classical k-prototypes algorithm uses the Euclidean distance to calculate the dissimilarity of the numerical part. Calculating directly on data of different orders of magnitude not only increases the difficulty of the calculation, but also causes a large error between the calculated results and the real situation. Therefore, Numerical Data should be made dimensionless before calculation. The paper adopts Max-Min Standardization, and the processing method is shown in formula (12):

x'_{i,s} = (x_{i,s} − min_s) / (max_s − min_s), (12)

where min_s and max_s are the minimum and maximum values of the s-th Numerical Feature. The quantized numerical Dissimilarity Coefficient is defined in formula (13). Therefore, for arbitrary data objects in the dataset D, the overall Dissimilarity Coefficient between them is defined in formula (14).
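Max-Min Standardization as in formula (12) maps each numerical feature into [0, 1] (a minimal sketch; the guard for constant features is our addition, since the formula would divide by zero there):

```python
def max_min_normalize(column):
    """Formula (12) style: x' = (x - min) / (max - min)."""
    lo, hi = min(column), max(column)
    if hi == lo:                      # constant feature: map to 0 to avoid /0
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

print(max_min_normalize([10.0, 20.0, 30.0]))  # [0.0, 0.5, 1.0]
```

After this step, numerical dissimilarities are computed on comparable scales, which is what the quantized numerical coefficient relies on.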

Determination of Initial Cluster Centers.
The classical k-prototypes algorithm is very sensitive to the selection of the Cluster Centers. Appropriate initial Cluster Centers and an appropriate cluster number k are particularly important for the k-prototypes algorithm.
Definition 10 (the average distance). The average distance between two data objects x_i and x_j in dataset D is defined as follows:

d̄ = (2 / (n(n − 1))) ∑_{i=1}^{n−1} ∑_{j=i+1}^{n} d(x_i, x_j). (15)

Definition 11 (local neighborhood density). Following the density peak formulation [11], the local neighborhood density is defined as shown in the following formula:

ρ_i = ∑_{j≠i} χ(d(x_i, x_j) − d_c), where χ(a) = 1 if a < 0 and χ(a) = 0 otherwise, (16)

and d_c is the cutoff distance, a value that limits the search scope.
Definition 12 (distance threshold). L is defined as the distance threshold between arbitrary data objects x_i and x_j in dataset D, as shown in formula (17). Cluster Centers generally satisfy the following two assumptions. First, the local neighborhood density of the center point of a cluster is higher than that of the surrounding noncenter points. Second, the relative distance between the center points of different clusters is large. Based on the above assumptions, this section presents the specific process of self-determining the initial Cluster Centers:

Step 1: formula (2) is used to calculate the distance matrix of the data.
Step 2: formula (16) is used to calculate the local neighborhood density value ρ_i.
Step 3: formula (17) is used to calculate the distance threshold L_i.
Step 4: sort the data objects in descending order of local neighborhood density to obtain the sequence D′ = {x′_1, x′_2, x′_3, ..., x′_n}. x′_1 is the first initial Cluster Center q_1, and q_1 is stored in the Cluster Centers set Q.
Step 5: for x′_i ∈ D′, determine whether dist(x′_i, q_l) > L holds for every q_l ∈ Q. If it holds, x′_i is taken as the next Cluster Center and put into the Cluster Centers set Q. Otherwise, proceed to the next data object x′_{i+1}.
Step 6: determine whether all the data objects in D′ have been accessed. If not, skip to Step 5 to continue execution. Otherwise, the elements in the set Q are the initial Cluster Centers, and |Q| is the cluster number k.
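Steps 1-6 above can be sketched as follows. Since the extracted text omits formulas (16) and (17), two common choices are assumed here and are not the paper's exact definitions: the density ρ_i counts neighbors within the cutoff d_c, and the threshold L is taken as the average pairwise distance; Euclidean distance stands in for the hybrid coefficient.

```python
def euclid(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def select_initial_centers(data, d_c):
    n = len(data)
    # Step 1: distance matrix
    dist = [[euclid(data[i], data[j]) for j in range(n)] for i in range(n)]
    # Step 2: local neighborhood density (assumed: count of neighbors within d_c)
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c) for i in range(n)]
    # Step 3: distance threshold (assumed: average pairwise distance)
    L = sum(dist[i][j] for i in range(n) for j in range(i + 1, n)) / (n * (n - 1) / 2)
    # Step 4: sort by density, seed with the densest point
    order = sorted(range(n), key=lambda i: rho[i], reverse=True)
    centers = [order[0]]
    # Steps 5-6: accept a point only if it is farther than L from every center
    for i in order[1:]:
        if all(dist[i][c] > L for c in centers):
            centers.append(i)
    return [data[i] for i in centers]

data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(select_initial_centers(data, d_c=2.0))  # one center per well-separated group
```

On this toy dataset the procedure returns k = 2 centers, one per dense group, without k being supplied in advance.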

Cost Function considering Weights
Definition 13 (Cost Function considering weights). The WKPCA algorithm seeks k subclusters that minimize the Cost Function shown in formula (18):

F(U, Q) = ∑_{l=1}^{k} ∑_{i=1}^{n} u_{il} d(x_i, q_l), (18)

where u_{il} represents the membership degree of the data object x_i to the cluster C_l, and U_{n×k} is the membership degree matrix. u_{il} = 1 indicates that the data object x_i belongs to the cluster C_l, and u_{il} = 0 indicates that it does not. F(U, Q) is the cost of the division, that is, the sum of the dissimilarities of all data objects in each cluster C_l to the center of their cluster. When the value of the Cost Function reaches a minimum under the constraints u_{il} ∈ {0, 1}, 1 ≤ i ≤ n, 1 ≤ l ≤ k; ∑_{l=1}^{k} u_{il} = 1, 1 ≤ i ≤ n; and 0 < ∑_{i=1}^{n} u_{il} < n, 1 ≤ l ≤ k, the clustering process ends. The WKPCA algorithm steps are described as follows:

Input: dataset D containing n data objects
Output: k subclusters after clustering

Step 1: initialization. Formula (2) is used to calculate the dissimilarity between the data objects.
Step 2: according to the automatic selection method of initial Cluster Centers in Section 4.4, k data objects are selected from dataset D as the initial Cluster Centers.
Step 3: iterative process. Formula (11) is used to calculate the dissimilarity between each data object and the Cluster Centers, and x_i is assigned to the nearest cluster according to the calculation results.
Step 4: according to the current Cluster Centers, the dissimilarity of the data objects is recalculated, and the Cluster Centers are updated.
Step 5: repeat Steps 3 and 4 until the Cost Function no longer changes. If the Cost Function no longer changes, the algorithm ends. Otherwise, skip to Step 3 to continue. The flowchart of the WKPCA algorithm is shown in Figure 3. The time complexity of the algorithm in this paper is higher than that of the classical k-prototypes algorithm, with the extra cost mainly consumed in the process of selecting the initial Cluster Centers. However, after the optimal initial Cluster Centers are determined, the number of iterations is reduced and satisfactory clustering results are obtained, which compensates for the higher time complexity to some extent. The datasets used in this paper are described in Table 4. Some of them have missing data, such as the ACA dataset. Deleting the records with missing data directly from the dataset does not affect the clustering results.

Experimental Results and Analysis
Therefore, before the experiments, all records with missing values were deleted to ensure the integrity of the dataset and the accuracy of the clustering results. The complete version of the ACA dataset has 690 records, and the paper selects the 623 records with complete eigenvalues to form a cleaned dataset. In addition, the Numerical Features are normalized by using the Max-Min Normalization method.

Performance Index.
In order to evaluate the quality of the clustering, the index AC (clustering accuracy) shown in formula (19) is used as the evaluation criterion. The indicator AC represents the ratio of the number of data objects correctly divided into their clusters to the total number of data objects. The closer the clustering result is to the real partition of the dataset, the larger the AC value, and the better the clustering result, that is, the better the clustering algorithm:

AC = (1/n) ∑_{l=1}^{k} NUM_l, (19)

where NUM_l indicates the number of data objects correctly assigned to cluster C_l.
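AC as described above can be sketched as follows (a minimal sketch that assumes predicted cluster labels have already been matched to the true ones; a full evaluation would first align labels, which is omitted here):

```python
def clustering_accuracy(true_labels, predicted_labels):
    """Formula (19) style: fraction of objects assigned to the correct cluster."""
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / len(true_labels)

print(clustering_accuracy([0, 0, 1, 1, 2], [0, 0, 1, 2, 2]))  # 4/5 = 0.8
```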

Analysis of Experimental Results.
In order to verify the universality of the WKPCA algorithm and to eliminate accidental results from a single experiment, we experimented with multiple real UCI datasets. The experiments compared the performance of the WKPCA algorithm with the k-prototypes algorithm proposed by Huang [6] and the EKACMD algorithm proposed by Sangam [13] on mixed-type datasets. Each algorithm was executed 30 times on each dataset and the average value was taken; the statistics of the clustering results are summarized in Tables 5-9. For the k-prototypes algorithm and the EKACMD algorithm, five different cluster numbers, k = 2, k = 4, k = 6, k = 8, and k = 10, were set in the experiments. Since neither EKACMD nor WKPCA needs the weight parameter γ, the paper sets it separately only for k-prototypes. The Bank dataset has 41,188 data objects, 10 Categorical Features, 10 Numerical Features, and 2 clusters. The Bank dataset was sampled at a sampling rate of 4.8%. Table 5 shows that when k = 2, the AC values of WKPCA are 9.43% and 5% higher than those of k-prototypes and EKACMD, respectively.
The Zoo dataset has 101 data objects, 15 Categorical Features, 1 Numerical Feature, and 7 clusters. Table 6 shows that when k = 7, the AC values of WKPCA were 11.88% and 4.95% higher than those of k-prototypes and EKACMD, respectively.
The Heart Disease dataset has 303 data objects, 7 Categorical Features, 6 Numerical Features, and 2 clusters. Table 7 shows that when k = 2, the AC values of WKPCA were 10.35% and 2.7% higher than those of k-prototypes and EKACMD, respectively.
The Lym dataset has 148 data objects, 7 Categorical Features, 6 Numerical Features, and 4 clusters. Table 8 shows that when k = 4, the AC values of WKPCA are 9.62% and 6.45% higher than those of k-prototypes and EKACMD, respectively.
The ACA dataset has 690 data objects, 9 Categorical Features, 5 Numerical Features, and 2 clusters. Table 9 shows that when k = 2, the AC values of WKPCA are 5.02% and 2.99% higher than those of k-prototypes and EKACMD, respectively. The results in Tables 5-9 show that, in terms of clustering accuracy, the proposed algorithm achieves better clustering results than the other algorithms. The Hybrid Dissimilarity Coefficient considers the importance of each feature in the clustering process and can automatically calculate the weights of the different features. These reasons enable the algorithm in this paper to obtain better clustering results. Figure 4 shows the clustering accuracy of WKPCA, k-prototypes, and EKACMD on the Bank dataset with different parameters k. It can be seen from Figure 4 that WKPCA has a good clustering result. Figure 5 shows the clustering accuracy of WKPCA, k-prototypes, and EKACMD on the Zoo dataset with different parameters k. It can be seen that the curve of WKPCA is higher than those of k-prototypes and EKACMD. Figure 6 shows the clustering accuracy of WKPCA, k-prototypes, and EKACMD on the Heart dataset with different parameters k. It can be seen that when k = 4 and k = 6, the clustering accuracy of k-prototypes and EKACMD is relatively close. The true cluster number of the Heart dataset is k = 2, and the clustering accuracy of WKPCA at k = 2 is much better than that of EKACMD. Figure 7 shows the clustering accuracy of WKPCA, k-prototypes, and EKACMD on the Lym dataset with different parameters k. It can be seen that the effect of the WKPCA algorithm is better than that of k-prototypes and EKACMD overall. Figure 8 shows the clustering accuracy of WKPCA, k-prototypes, and EKACMD on the ACA dataset with different parameters k. When k = 2, the clustering accuracy order is WKPCA > EKACMD > k-prototypes.
From Figures 4-8, it can be seen that, under random initialization, the proposed parameter-free WKPCA algorithm is superior to the k-prototypes and EKACMD algorithms in clustering accuracy. As can be seen from the detailed dataset information shown in Table 4, the ratio of Categorical Features to Numerical Features differs considerably across the selected datasets. For example, the Zoo dataset has 1 Numerical Feature and 15 Categorical Features; the Lym dataset has 2 Numerical Features and 16 Categorical Features. Although the two feature types are distributed very unevenly in these datasets, the WKPCA algorithm still achieves satisfactory clustering results. This indicates that the proposed Hybrid Dissimilarity Coefficient is applicable to various complex datasets, and no parameters need to be set manually to adjust the weights of the Categorical Features and the Numerical Features.
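The parameter-free weighting described above can be illustrated with a small sketch. Since the exact formulas of the Hybrid Dissimilarity Coefficient are not reproduced in this section, the following Python code is only a minimal illustration, assuming entropy-based weights for the Categorical Features and pre-normalized Numerical Features; the function names and the exact combination rule are assumptions, not the paper's definitive formulation.

```python
import math

def entropy_weights(cat_data):
    """Per-feature entropy weights for categorical features.

    Features whose values are more evenly distributed (higher entropy)
    receive larger weights; the weights are normalized to sum to 1,
    so no manual weighting parameter is needed.
    """
    n = len(cat_data)
    m = len(cat_data[0])
    ents = []
    for j in range(m):
        counts = {}
        for row in cat_data:
            counts[row[j]] = counts.get(row[j], 0) + 1
        ent = -sum((c / n) * math.log(c / n) for c in counts.values())
        ents.append(ent)
    total = sum(ents) or 1.0
    return [e / total for e in ents]

def hybrid_dissimilarity(x_num, x_cat, c_num, c_cat, cat_w):
    """Hybrid dissimilarity between an object and a cluster center:
    squared Euclidean distance on (pre-normalized) numerical features
    plus entropy-weighted Hamming mismatch on categorical features."""
    d_num = sum((a - b) ** 2 for a, b in zip(x_num, c_num))
    d_cat = sum(w for w, a, b in zip(cat_w, x_cat, c_cat) if a != b)
    return d_num + d_cat
```

Because the categorical weights are derived from the data distribution itself, the same code applies unchanged to datasets with very different categorical-to-numerical feature ratios, such as Zoo and Lym.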

Conclusions
The weighted k-prototypes clustering algorithm based on the Hybrid Dissimilarity Coefficient is an extension of the classical k-prototypes clustering algorithm. The method for automatically selecting the initial Cluster Centers is improved by means of the average distance, the local neighborhood density, and the relative distance. Because the spatial distribution of the data is taken into account, the selected Cluster Centers better reflect the actual structure of the data, and the clustering uncertainty caused by different initial Cluster Center selections is avoided. For Categorical Data, a categorical Dissimilarity Coefficient based on entropy weights is used. For Numerical Data, the values are standardized using a quantized numerical Dissimilarity Coefficient. For mixed-type data, a weighted Hybrid Dissimilarity Coefficient is used. The proposed Hybrid Dissimilarity Coefficient not only retains the characteristics of the different data types but also effectively improves the clustering accuracy and clustering effectiveness, and its robustness is better than that of other k-prototypes clustering algorithms. Finally, the WKPCA algorithm is proposed to realize mixed-type data clustering. In Step 1, the WKPCA algorithm automatically determines the initial Cluster Centers by calculating the average distance and the local neighborhood density. Compared with other k-prototypes algorithms, this takes more time, but more accurate Cluster Centers can be selected at the initial stage of clustering. It ensures that the Cluster Centers lie in the regions with the highest sample density and that the distances between them are maximized, which reduces the number of algorithm iterations. The proposed algorithm improves clustering accuracy at the cost of time performance; therefore, future work will focus on reducing the time complexity. In summary, although the time performance of the proposed algorithm needs to be improved, its clustering accuracy and clustering effectiveness are significantly improved.
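The initial Cluster Center selection of Step 1 can be sketched as follows. This is a simplified reading of the procedure, not the paper's exact formulation: it assumes the local neighborhood density of a point is the number of points within the average pairwise distance, and that each subsequent center maximizes density times the minimum (relative) distance to the centers already chosen. The function name `select_initial_centers` and the scoring rule are assumptions.

```python
def select_initial_centers(dist, k):
    """Pick k initial centers from a precomputed pairwise distance
    matrix. The first center is the densest point (most neighbors
    within the average pairwise distance); each further center is the
    point maximizing density times the minimum distance to the
    centers already chosen, so centers land in dense regions that
    are far apart from each other.
    """
    n = len(dist)
    # Average over all distinct pairs (the density neighborhood radius).
    avg = sum(dist[i][j] for i in range(n) for j in range(i + 1, n)) / (n * (n - 1) / 2)
    # Local neighborhood density: neighbors within the average distance.
    density = [sum(1 for j in range(n) if j != i and dist[i][j] <= avg)
               for i in range(n)]
    centers = [max(range(n), key=lambda i: density[i])]
    while len(centers) < k:
        def score(i):
            if i in centers:
                return -1.0
            return density[i] * min(dist[i][c] for c in centers)
        centers.append(max(range(n), key=score))
    return centers
```

Computing the full pairwise distance matrix is the source of the extra initialization cost noted above: it is quadratic in the number of data objects, which is the time overhead the paper trades for more stable initial centers.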

Data Availability
The data used to support the findings of this study are included within the article.

Conflicts of Interest
The authors declare that they have no conflicts of interest.