Load Clustering Method Based on Improved Density Peak and Gaussian Mixture Model

A load clustering method based on an improved density peak algorithm and a Gaussian mixture model is proposed to address two problems in the cluster analysis of residential electricity-consumption behavior: the standard Gaussian mixture model (GMM) clustering algorithm, with randomly initialized cluster centers, easily falls into local optima, and the density peak clustering (DPC) algorithm requires cluster centers to be selected manually. The method combines cosine distance, which reflects the similarity of curve shape, with Euclidean distance, and selects cluster centers automatically through the sine of the relative angle between pairs of points in the γ descending graph. The cluster centers selected by the improved DPC algorithm are used to initialize the parameters of the GMM, and load clustering is then carried out by the GMM. The results of a calculation example show that the proposed method can effectively improve the efficiency of daily load curve clustering.


Introduction
With the construction of smart grids, projects such as intelligent buildings and intelligent communities have emerged. Power companies can collect load data from power users by installing various data-collection terminal equipment and setting different sampling periods [1]. The power data contain valuable information such as energy-consumption characteristics, and reasonable analysis and use of these data can create a win-win situation for power companies and power users [2]. Therefore, power load classification has gradually become a research focus. Load classification is divided into supervised and unsupervised classification. Supervised classification usually refers to artificial neural network methods [3,4], and unsupervised classification usually refers to cluster analysis. Cluster analysis generally uses the fuzzy C-means clustering algorithm (FCM) [5,6], hierarchical clustering algorithms [7], the K-means clustering algorithm [8][9][10][11], the Gaussian mixture model (GMM) [12], and other methods. Among them, GMM clustering is a form of "soft classification": the model is not simply a sum of multiple Gaussian distributions, but a superposition of Gaussian distributions, each assigned a weight. In the end, each cluster is represented by a component of a Gaussian mixture composed of multiple Gaussian distributions, so GMM clustering accommodates more flexible cluster shapes. However, the random initialization of the GMM clustering model parameters makes it easy to fall into local optima.
In GMM clustering, the number of clusters must be determined manually and the model parameters are initialized randomly, which makes the algorithm prone to local optima, while high-dimensional data reduce its efficiency. This paper therefore proposes a load clustering method based on an improved density peak algorithm and a Gaussian mixture model for the cluster analysis of daily load data.

Gaussian mixture model clustering
The Gaussian mixture model (GMM) is a generative model based on probability. It assumes that all samples in the data set are generated by multivariate Gaussian distributions with given parameters [12]. Clustering then amounts to evaluating, for each sample, the probability that it belongs to each mixture component, selecting the component with the highest probability, and assigning the sample to that cluster. For Gaussian mixture models with unknown parameters, the expectation-maximization (EM) algorithm is usually used to estimate the parameters of the GMM.

Fig. 1 Gaussian mixture model

Given the number of clusters $K$, for a sample $x$ in the sample set, the probability density function of a GMM can be represented by a mixed distribution composed of $K$ multivariate Gaussian distributions:

$$p(x) = \sum_{i=1}^{K} w_i \, \mathcal{N}(x \mid \mu_i, \Sigma_i)$$

where $\mu_i$ is the mean vector of the $i$-th component, $\mathcal{N}(x \mid \mu_i, \Sigma_i)$ is the probability density function of the multivariate Gaussian distribution with covariance matrix $\Sigma_i$, and $w_i$ is the weight of the $i$-th multivariate Gaussian distribution in the mixture model, with $\sum_{i=1}^{K} w_i = 1$. Each component $z_j$ of the mixture model is one such distribution.

For a sample $x$, the posterior probability that it was generated by the $j$-th component is

$$\gamma(z_j) = \frac{w_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j)}{\sum_{i=1}^{K} w_i \, \mathcal{N}(x \mid \mu_i, \Sigma_i)}$$

where, in the EM parameter updates, $m$ denotes the total number of data points in the data set. That is, each sample is assigned to the cluster of the sub-model from which it has the highest probability of coming, and finally $K$ clusters are obtained.
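As a minimal illustration of this process, the sketch below fits a GMM by EM and reads off the posterior probabilities using scikit-learn's GaussianMixture; the data shape, component count, and covariance type are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch: GMM clustering of daily load curves via EM (scikit-learn).
# The data, K, and covariance type are illustrative assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.random((249, 96))              # stand-in load matrix: 249 households x 96 points

K = 3                                   # assumed number of clusters
gmm = GaussianMixture(n_components=K, covariance_type="diag", random_state=0)
gmm.fit(X)                              # w_i, mu_i, Sigma_i estimated by EM

resp = gmm.predict_proba(X)             # gamma(z_j): posterior probability per component
labels = resp.argmax(axis=1)            # assign each curve to its most probable component
```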

DPC algorithm
The clustering by fast search and find of density peaks (DPC) algorithm is based on two assumptions: a cluster center is surrounded by neighboring samples with lower local density, and it lies at a relatively large distance from any other point with higher local density. The number of clusters is determined from the decision graph [13].
Given a data set $X = \{x_1, x_2, \dots, x_n\}$, for any sample $x_i$, DPC calculates two quantities: the local density $\rho_i$ and the minimum distance $\delta_i$. The local density is defined as

$$\rho_i = \sum_{j \neq i} \chi(d_{ij} - d_c), \qquad \chi(a) = \begin{cases} 1, & a < 0 \\ 0, & a \geq 0 \end{cases}$$

where $d_{ij}$ represents the Euclidean distance between samples $x_i$ and $x_j$, and $d_c$ represents the cut-off distance. The value of $d_c$ is generally considered appropriate when the average number of samples within range $d_c$ of each sample is 1%~2% of the total number of samples in the data set.

Since the cut-off distance affects the local density, and thus the clustering result to a certain extent, a Gaussian kernel function is adopted to compute the local density:

$$\rho_i = \sum_{j \neq i} \exp\!\left(-\frac{d_{ij}^2}{d_c^2}\right)$$

The minimum distance $\delta_i$ of sample $x_i$ is defined as

$$\delta_i = \begin{cases} \min\limits_{j:\, \rho_j > \rho_i} d_{ij}, & \text{if } \exists j \text{ such that } \rho_j > \rho_i \\ \max\limits_{j} d_{ij}, & \text{otherwise} \end{cases}$$

Let $\gamma_i = \rho_i \delta_i$; sort $\gamma$ in descending order and draw the $\gamma$ descending decision diagram. Finally, points with larger values of $\gamma$ are selected manually from the decision diagram as cluster centers.
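Both quantities can be computed directly from the pairwise distance matrix. The following sketch, assuming NumPy/SciPy and a user-supplied cut-off distance, implements the Gaussian-kernel density and the minimum-distance rule:

```python
# Sketch of the DPC quantities rho_i, delta_i, and gamma_i = rho_i * delta_i.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def dpc_quantities(X, dc):
    """X: (n, d) samples; dc: cut-off distance chosen so each point has
    roughly 1%~2% of the samples within range."""
    d = squareform(pdist(X))                        # pairwise Euclidean distances d_ij
    rho = np.exp(-(d / dc) ** 2).sum(axis=1) - 1.0  # Gaussian kernel; subtract self term
    order = np.argsort(-rho)                        # indices sorted by decreasing density
    delta = np.empty(len(X))
    delta[order[0]] = d[order[0]].max()             # densest point: farthest distance
    for k in range(1, len(order)):
        i = order[k]
        delta[i] = d[i, order[:k]].min()            # nearest higher-density neighbor
    return rho, delta, rho * delta                  # gamma for the decision diagram
```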

DPC algorithm based on automatic cluster center selection
The traditional DPC algorithm uses Euclidean distance to calculate the distance between samples. Euclidean distance measures the absolute distance between samples, is strongly correlated with their position coordinates, and mainly reflects absolute numerical differences between samples. However, as the dimensionality increases, the Euclidean metric gradually deteriorates, and the Euclidean distance between two high-dimensional vectors with high morphological similarity may be very large. Therefore, this article uses cosine similarity to correct the Euclidean distance, making similarity differences between high-dimensional data more significant and improving the rationality of the sample distance measurement in the traditional DPC algorithm.

Cosine distance. The cosine distance between two loads is defined as

$$D_c(x_i, x_j) = 1 - S_c(x_i, x_j)$$

where $S_c$ is the cosine similarity of the loads, that is, the cosine of the angle between the two vectors:

$$S_c(x_i, x_j) = \frac{x_i \cdot x_j}{\|x_i\| \, \|x_j\|}$$

The greater the cosine similarity, the more similar the two vectors; cosine similarity emphasizes the difference in orientation between the vectors rather than the difference in magnitude. Conversely, the larger the cosine distance, the lower the similarity between the two load vectors.
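In code, the cosine distance is a one-liner on top of the dot product; this small helper (NumPy assumed) follows the definition above:

```python
import numpy as np

def cosine_distance(x, y):
    """D_c = 1 - S_c, with S_c the cosine of the angle between load vectors x and y."""
    s_c = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return 1.0 - s_c        # ranges over [0, 2]; smaller means more similar shapes
```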

Two-scale distance measurement.
Aiming at the problems of Euclidean distance in high-dimensional distance measurement, this paper combines cosine distance and Euclidean distance, comprehensively considering both the absolute distance between loads and the similarity of load shapes, and proposes a two-scale distance measurement, defined as

$$D_{ec} = D_e \times D_c$$

where $D_e$ is the Euclidean distance between the two loads and $D_c$ is the cosine distance between the two loads. The Euclidean distance between vectors increases as the vector dimension increases, whereas the cosine distance has a fixed value range of [0, 2]: the higher the similarity between the vectors, the closer the cosine distance is to 0; otherwise it is closer to 2. The cosine distance therefore acts as a scaling factor, reducing the Euclidean distance between loads with high morphological similarity and making the differences in similarity between high-dimensional vectors more significant.
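Taking the two-scale EC distance as the product of the two measures, consistent with the validity check in the example section, a sketch looks like:

```python
import numpy as np

def ec_distance(x, y):
    """Two-scale EC distance: Euclidean distance scaled by cosine distance."""
    d_e = np.linalg.norm(x - y)                                         # magnitude gap
    d_c = 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))  # shape gap in [0, 2]
    return d_e * d_c    # near-identical shapes shrink the effective distance toward 0
```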

Automatic selection of cluster centers. The traditional DPC algorithm calculates $\gamma_i = \rho_i \delta_i$, draws the $\gamma$ descending decision diagram, and manually selects points with larger values of $\gamma$ as cluster centers. However, when the distribution of the data set is complex, subjective factors greatly reduce the accuracy of cluster-center selection. To address this problem, this paper proposes automatic selection of cluster centers based on the sine of the relative angle between pairs of points in the $\gamma$ descending graph.
The relative angle $\theta_i$ is defined as the smaller angle between the horizontal axis and the line connecting point $i$ to the minimum point of the $\gamma$ descending curve.

Step 1: Samples whose relative-angle sine value is sufficiently large are included in the cluster-center candidate set candidates. The remaining samples are directly classified as ordinary points.

Step 2: Calculate the distance between each pair of data points in the candidate set candidates, and define $\delta_{\min}$ as the distance between the two closest samples in the set. For a sample $x_j$ in the candidate set, if it lies too close to another candidate, the two are regarded as belonging to the same cluster and only one of them is retained; the candidates that remain are the automatically selected cluster centers.
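Since the exact sine threshold and pruning condition are not fully reproduced above, the following sketch only illustrates the idea of Step 1: normalize the $\gamma$ descending curve, compute each point's relative angle to the curve's minimum point, and keep the points whose sine value is large. The threshold value and the normalization are assumptions, not the paper's exact settings.

```python
# Illustrative sketch of Step 1; threshold and normalization are assumptions.
import numpy as np

def candidate_centers(gamma, sin_threshold=0.5):
    order = np.argsort(-gamma)                      # sample indices, gamma descending
    g = gamma[order]
    x = np.linspace(0.0, 1.0, len(g))               # normalized rank coordinate
    y = (g - g.min()) / (g.max() - g.min() + 1e-12) # normalized gamma coordinate
    # sine of the relative angle between each point and the minimum (last) point
    dx = x[-1] - x[:-1]
    dy = y[:-1] - y[-1]
    sin_theta = dy / np.hypot(dx, dy)
    return order[:-1][sin_theta > sin_threshold]    # candidate set for Step 2's pruning
```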

Clustering quality assessment
It is generally believed that high-quality clusters have the characteristics of "high similarity within clusters and low similarity between clusters". This paper uses the Silhouette Coefficient (SC), the Calinski-Harabasz (CH) score, and the Davies-Bouldin Index (DBI) as the evaluation indicators of clustering quality.

Silhouette Coefficient
The calculation formula of the SC score is:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$$

where $a(i)$ is the average distance between sample $i$ and the other samples in its own cluster, and $b(i)$ is the minimum average distance between sample $i$ and the samples of any other cluster. The overall SC score is the mean of $s(i)$ over all samples; values closer to 1 indicate better clustering.

Calinski-Harabasz Score
The formula for calculating the CH score is:

$$CH(K) = \frac{\operatorname{tr}(B_K)/(K-1)}{\operatorname{tr}(W_K)/(N-K)}$$

$$B_K = \sum_{j=1}^{K} n_j (z_j - z)(z_j - z)^{\mathrm{T}}, \qquad W_K = \sum_{j=1}^{K} \sum_{x \in C_j} (x - z_j)(x - z_j)^{\mathrm{T}}$$

where $z$ represents the mean value of all data in the data set, $z_j$ is the mean value of the $j$-th cluster $C_j$, $n_j$ is the number of samples in that cluster, $N$ is the number of samples, and $K$ is the current number of clusters. The value of $CH(K)$ is positively correlated with the quality of the clustering result.

Davies-Bouldin Index
The calculation formula of DBI is:

$$DBI = \frac{1}{K} \sum_{i=1}^{K} \max_{j \neq i} \frac{\bar{s}_i + \bar{s}_j}{d(c_i, c_j)}$$

where $\bar{s}_i$ is the average distance from the samples of cluster $i$ to its center $c_i$, and $d(c_i, c_j)$ is the distance between the centers of clusters $i$ and $j$. The smaller the DBI, the better the clustering result.
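All three indices are available in scikit-learn, so the evaluation can be reproduced in a few lines; the data and labels below are placeholders, not the paper's experimental data.

```python
# Computing SC, CH, and DBI with scikit-learn; X and labels are placeholders.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

rng = np.random.default_rng(0)
X = rng.random((249, 96))                                  # stand-in load matrix
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print("SC :", silhouette_score(X, labels))                 # higher is better, in [-1, 1]
print("CH :", calinski_harabasz_score(X, labels))          # higher is better
print("DBI:", davies_bouldin_score(X, labels))             # lower is better
```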

Calculation example analysis
This paper selects the historical electricity load of residents collected from a community in Beijing as the experimental data; 500 data points are collected for each household every day. After preprocessing the original daily load data, the variation of the daily load data of the 249 households is as shown in the figure.

Analysis of clustering results
This paper uses the K-means algorithm, the traditional GMM algorithm, and the AD-DPC-GMM algorithm, which automatically selects cluster centers based on the improved density peak method, to perform cluster analysis on the collected data; the clustering results of the three methods are then analyzed and compared to verify the rationality and superiority of the AD-DPC-GMM algorithm.

Table 1 shows that the index values of the AD-DPC-GMM algorithm are significantly improved compared with the other clustering methods, indicating that in its clustering results the distance between clusters is relatively large and the distance within clusters is relatively small. At the same time, this demonstrates the effectiveness of the AD-DPC method in automatically searching for cluster centers and the improvement that AD-DPC brings to GMM clustering performance.

Figure 4 shows the three cluster centers selected by the AD-DPC algorithm in the descending decision diagram. Figure 5 shows the clustering results of the AD-DPC-GMM algorithm on the daily load curves of the 249 users, where the black curve represents the overall average of each type of load. In the clustering results, the numbers of load curves in the clusters are 80, 78, and 91, respectively. Analyzing the average load of these three types of users: the peak power consumption of the first type occurs at 11:00~13:00 and 18:00~20:00, which are usual meal times and conform to the nine-to-five routine of office workers. The main power-consumption peak of the second type appears after 18:00, with smaller peaks in the morning and afternoon, consistent with the characteristics of group tenants. The overall power consumption of the third type is lower than that of the first two, and its power consumption ends earlier, consistent with the behavior of elderly households.

Figure 6 shows three users' daily load curves from the data set used in this article. It can be seen from the figure that the shapes of the daily load curves of user 1 and user 3 are more similar than those of user 1 and user 2. However, calculating the Euclidean distances between the three curves shows that the Euclidean distance between user 1 and user 3 is greater than that between user 1 and user 2. Judged by Euclidean distance alone, the curve similarity between user 1 and user 2 would be higher, which is inconsistent with the actual situation. This is because, as the dimensionality increases, the Euclidean distance grows, so that the Euclidean distance between two curves with higher shape similarity can become very large.
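Putting the pieces together, one plausible reading of the AD-DPC-GMM pipeline is sketched below: the centers returned by the improved DPC step seed the GMM means in place of random initialization. The function name and the diagonal covariance are assumptions, not the paper's exact implementation.

```python
# Hedged sketch of the AD-DPC-GMM pipeline: DPC-selected centers seed the GMM.
import numpy as np
from sklearn.mixture import GaussianMixture

def ad_dpc_gmm(X, center_idx):
    """X: (n, d) load curves; center_idx: indices of the DPC-selected centers."""
    K = len(center_idx)
    gmm = GaussianMixture(n_components=K,
                          covariance_type="diag",
                          means_init=X[center_idx],  # replaces random initialization
                          random_state=0)
    return gmm.fit_predict(X)                        # cluster label per load curve
```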

Two-scale EC distance validity proof
Multiplying the Euclidean distances between the three curves by their corresponding cosine distances yields the dual-scale EC distances. As shown in Table 2, the dual-scale EC distance between user 1 and user 3 is significantly smaller than that between user 1 and user 2. The dual-scale EC distance is consistent with the actual situation, indicating that, for distances between high-dimensional data, the dual-scale EC distance measurement is significantly better than the Euclidean distance.
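A toy check of the same effect, with synthetic curves standing in for the three users (the shapes are invented for illustration, not taken from Table 2):

```python
# Toy check: same-shaped curves are far apart in Euclidean distance but close
# in EC distance; the three synthetic "users" are invented for illustration.
import numpy as np

t = np.linspace(0.0, 2.0 * np.pi, 96)
user1 = 1.0 + np.sin(t)              # base shape
user3 = 3.0 * (1.0 + np.sin(t))      # same shape, larger magnitude
user2 = 1.0 + np.cos(t)              # different shape, similar magnitude

def ec(x, y):
    d_e = np.linalg.norm(x - y)
    d_c = 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return d_e, d_e * d_c

# Euclidean rates user2 closer to user1; the EC distance correctly rates user3 closer.
print("user1 vs user2 (euclid, ec):", ec(user1, user2))
print("user1 vs user3 (euclid, ec):", ec(user1, user3))
```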

Conclusion
In this paper, a load clustering method based on an improved density peak algorithm and a Gaussian mixture model is proposed, and collected daily load data of users are taken as an example for cluster analysis. The experimental results demonstrate the effectiveness of the algorithm in automatically selecting cluster centers; it effectively avoids the problem of the traditional GMM clustering algorithm easily falling into local optima, improves the clustering quality, and has good engineering feasibility.