Research on fast marking method for indicator diagram of pumping well based on K-means clustering

The indicator diagram is the key basis for fault diagnosis of pumping wells in oil exploitation. With the rapid development of machine learning, fault diagnosis of indicator diagrams based on deep learning has garnered increasing attention. Such methods train neural network models on marked samples, then feed images into the trained models to output their categories. At present, the preparation of an indicator diagram sample set relies on experts analyzing indicator diagram images one by one. This involves extensive manual work, manual marking is prone to errors, and the marked samples are therefore often insufficient in quantity. In order to quickly mark a large number of indicator diagram samples, oil well data were plotted as standardized indicator diagrams, and three feature extraction methods for indicator diagrams were proposed: feature extraction based on the original vector, on the three-dimensional pixel tensor, and on a convolutional neural network. These methods convert each indicator diagram into a feature vector, which is then clustered using the K-means algorithm, so that the corresponding indicator diagrams are classified into different categories according to the clustering results. Using 20,000 randomly selected records from 100 pumping wells, this study clusters the sample set with the three proposed methods. The results indicate time consumptions of 0.2, 8.3, and 0.7 h, with accuracy rates of 98%, 92%, and 95%, respectively. For indicator diagrams, the clustering method based on the original vector shows outstanding efficiency and accuracy. This provides an automatic tool for preparing pumping well fault diagnosis datasets, improving efficiency by tens of times compared with manual marking.


Introduction
The pumping unit serves as the main equipment for oil exploitation [1], and fault diagnosis of pumping units is one of the crucial issues in the field of oil exploration and development. The indicator diagram is composed of a series of displacement and load sample points recorded while the pumping unit is in operation [2], and it serves as the key basis for fault diagnosis of pumping units. Researchers have extensively investigated pumping well fault diagnosis using two main approaches. The first is based on theoretical analysis [3]: the operating principles and corresponding mathematical models of the sucker rod pumping system are established, followed by simulation of various pumping well fault types. However, given the complexity of the rod pumping system and the interactions among the various fluids in the wellbore, this method has limited applications. The second is based on machine learning [4]: a series of sample sets is prepared from the different shapes of indicator diagrams and the corresponding fault types, and an appropriate machine learning model is then selected to learn the sample set, facilitating accurate identification of faults. This approach avoids the need for a complex physical mechanism description and has emerged as the mainstream method for pumping well fault diagnosis.
With the advancement of machine learning methods, deep learning has emerged as a powerful tool whose application has been expanding rapidly [5]. Methods based on deep learning have demonstrated significant potential and prospects in the field of fault diagnosis [6]. Jia et al. employed a deep auto-encoder for intelligent diagnosis of rotating machinery [6], Gan and Wang utilized a deep belief network for fault pattern recognition of rolling element bearings [7], Zhao et al. adopted a recurrent neural network for machine health monitoring [8], and Zhang et al. applied a convolutional neural network for bearing fault diagnosis [9]. Notably, the convolutional neural network has emerged as the leading architecture and has exhibited the best performance in numerous tests [10]. In recent years, deep learning has also been extensively applied to fault diagnosis of pumping units [11][12][13][14].
The fault diagnosis method for pumping wells based on deep learning heavily relies on the quantity and quality of the sample set. In current research, the number of samples usually ranges from hundreds to tens of thousands, with fault types mostly ranging from several to dozens. To verify their proposed method, Feng et al. collected 700 samples of five typical faults (discharge valve or pipe blockage, suction valve or pipe blockage, suction valve leakage, discharge valve leakage, and combined discharge and suction valve leakage) and extracted the required features [11]. Cheng et al. considered eight fault diagnosis types, including normal operation, downstroke pump bumping, upstroke pump bumping, combined leaking standing and traveling valves, gas interference, insufficient liquid supply, sand production, and abnormal dynamometer card; for each type, they selected 1000 samples for identification and classification [12]. Du et al. compared different models on 18,500 manually marked fault samples of pumping units in ten categories [15]. Lv et al. collected 7887 indicator diagrams under different working conditions from Shengli Oilfield in China to prove the effectiveness and performance of their proposed fault diagnosis method, with each sample marked manually [13]. Zhang et al. conducted a related study [17]. A summary of the above studies is shown in Table 1.
With the construction of oilfield intelligence, automatic data acquisition has become a reality, and massive data have accumulated into big data. For instance, Shengli Oilfield in China generates an average of 48 indicator diagrams per well per day, resulting in hundreds of millions of indicator diagrams each year [18]. These indicator diagrams cover almost all fault types that may occur in the operation of pumping wells. However, manually marking such an enormous amount of data is unfeasible, so this massive information cannot realize its value. Consequently, the sample size has emerged as a critical challenge that hinders fault diagnosis: the conventional method of one-by-one manual marking is inadequate for batch marking of massive sample data.
The clustering analysis methods of unsupervised learning are being increasingly applied to various fields [19][20][21][22][23][24][25]; they help make full use of large numbers of unmarked samples and save substantial manual work and time. There are also many applications that combine clustering analysis with machine learning algorithms. Hao utilized clustering to analyze physique data of college students and subsequently trained a BP neural network [26]. Li et al. utilized K-means clustering to classify grape leaves and subsequently used a random forest algorithm to detect and grade grape downy mildew [27]. Ida et al. applied K-means clustering to classify the waveforms of volcanic earthquakes and used their findings to observe magma rising events in Sakurajima [28].
At present, a large number of unmarked samples are not fully utilized in the fault diagnosis of pumping wells, and sample marking requires extensive manual labor. In order to realize automatic and fast clustering of pumping well indicator diagrams, we introduce the K-means clustering method and compare the clustering performance of three different dynamometer image feature extraction methods: feature extraction based on the original vector, on the three-dimensional pixel tensor, and on a convolutional neural network. This study randomly selects 20,000 samples and conducts comparative experiments to analyze the efficiency and accuracy of these methods. The remainder of this paper is organized as follows. Section 2 introduces the drawing process of the standardized indicator diagram. Section 3 describes the feature extraction methods for indicator diagrams. Section 4 introduces the clustering of indicator diagrams based on the K-means algorithm. Section 5 presents the experimental results and analysis. Finally, Section 6 concludes the paper.

Standardized indicator diagram drawing
A typical pumping well system is shown in Fig. 1. The engine converts rotary motion into the up-and-down reciprocating motion of the horsehead through the walking beam. The rod string is connected between the horsehead and the pump at the bottom of the well. A dynamometer installed at the top of the rod string records the displacement and load during the movement of the horsehead.
Usually, during one up-and-down reciprocating cycle of the horsehead, the dynamometer records 200 data points each for displacement and load, forming a displacement vector W = [w1, w2, …, w200] and a load vector Z = [z1, z2, …, z200]. Through the informatization construction of oilfields, nearly all pumping wells are equipped with dynamometers, allowing continuous collection and transmission of displacement and load data to the oilfield data center [29]. Table 2 presents a portion of the displacement and load vectors of one well in an oilfield on July 10, 2021. Taking information transmission and storage capabilities into account, this oilfield collects displacement and load vectors every half hour per well.
To facilitate manual analysis, displacement and load vectors are often plotted as indicator diagrams. An indicator diagram takes the displacement vector as the x-axis and the load vector as the y-axis, forming a closed-loop curve. It shows the load changes of the sucker rod pump over a reciprocating cycle, which reflects the working condition of the pumping well. In this study, all indicator diagrams are drawn with reference to the standards commonly used in oilfields, as follows.
• Ticks and tick labels: off.

Fig. 2 presents three examples of standardized indicator diagrams (normal operation, insufficient liquid supply, rod parting). Standardized drawing of indicator diagrams helps eliminate some misjudgments of fault types. If the default settings of a drawing tool such as Python's are used, the closed curve of rod parting will occupy the entire diagram despite the small difference between the maximum and minimum load, making it easy to confuse with normal operation; this should be avoided.
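As a sketch of the standardized drawing described above, a 200 × 100 pixel indicator diagram can be produced with matplotlib. The figure size, margins, and the fixed load-axis range below are illustrative assumptions; the paper's full oilfield drawing standard is not reproduced here.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np

def draw_indicator_diagram(displacement, load, path=None):
    """Draw a standardized 200 x 100 px indicator diagram.

    Assumed conventions: fixed canvas size, ticks and tick labels off,
    no margins, and a fixed load-axis range so that low-amplitude curves
    (e.g. rod parting) stay visually small.
    """
    fig, ax = plt.subplots(figsize=(2, 1), dpi=100)   # 2 in x 1 in at 100 dpi = 200 x 100 px
    ax.plot(displacement, load, linewidth=1)
    ax.set_xticks([])                                  # ticks and tick labels: off
    ax.set_yticks([])
    ax.set_ylim(0, 120)                                # assumed fixed load range (kN)
    fig.subplots_adjust(left=0, right=1, bottom=0, top=1)
    if path:
        fig.savefig(path)
    return fig

# Example: a synthetic closed-loop curve standing in for real dynamometer data
t = np.linspace(0, 2 * np.pi, 200)
disp = (1 - np.cos(t)) / 2 * 3.0          # 0..3 m stroke (synthetic)
loadv = 60 + 30 * np.sin(t)               # synthetic load, kN
fig = draw_indicator_diagram(disp, loadv)
```

Fixing the y-axis range (rather than letting it autoscale) is what prevents a rod-parting curve from filling the whole frame.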
Images can be represented as matrices of pixel values. Binary images are 2-dimensional matrices containing only two values, 0 and 1, where 0 represents black and 1 represents white. Grayscale images need only one channel, with pixel values ranging from 0 to 255: white is 255 and black is 0. Such black-and-white pictures are widely used in medical and image recognition fields [30]. RGB images are 3-dimensional matrices that represent chromatic images and require three channels. Each pixel in an RGB image can represent a different color, with each primary color having 0 to 255 levels of brightness; for example, blue is (0, 0, 255). The indicator diagram is an RGB image.

Feature extraction based on original vector
In this method, we directly use the displacement and load vectors acquired by the dynamometer to form an original vector as the feature: O = [w1, w2, …, w200, z1, z2, …, z200]. Table 3 gives some examples of the processed original vectors of indicator diagrams. Through concatenation, each indicator diagram corresponds to a vector of length 400, where positions 1–200 hold the displacement vector and positions 201–400 hold the load vector.
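The concatenation above is a one-line operation; a minimal sketch with synthetic data (the value ranges are illustrative, not from the paper's dataset):

```python
import numpy as np

# Concatenate displacement (length 200) and load (length 200) into the
# original feature vector O of length 400.
rng = np.random.default_rng(0)
W = rng.uniform(0.0, 3.0, 200)      # displacement vector, m (synthetic)
Z = rng.uniform(20.0, 90.0, 200)    # load vector, kN (synthetic)

O = np.concatenate([W, Z])          # indices 0-199: displacement, 200-399: load
```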
This method uses the combination of the displacement and load vectors to characterize the indicator diagram. Since these vectors are the raw data collected by the dynamometer, the method can be considered to retain all the information of the indicator diagram. Nevertheless, the resulting feature vectors are not intuitive for human vision: they must be converted back into indicator diagrams before humans can effectively distinguish between samples.

Feature extraction based on three-dimensional pixel tensor
The indicator diagram is a color image, represented in a computer as a three-dimensional array, or tensor, of size 200 × 100 × 3. Here, 3 denotes the three channels corresponding to red, green, and blue; a colored image consists of three primary color channels. Taking Fig. 2(a) as an example, the indicator diagram can be split into 3 channels; Fig. 3 shows each channel, and the pixel tensor is shown in Fig. 4.
Scalars are often referred to as 0-dimensional tensors, vectors as 1-dimensional tensors, matrices as 2-dimensional tensors, and RGB images as 3-dimensional tensors. Therefore, we can think of tensors as multidimensional arrays and the indicator diagram as a three-dimensional array or tensor. After a 200 × 100 × 3 indicator diagram is read into a computer and converted into a three-dimensional tensor, it contains 60,000 pixel values. In this method, the first step is to use the displacement and load vectors to draw a standardized indicator diagram. The drawing specification follows the oilfield standard, ensuring that all the information of the indicator diagram is preserved. Unlike feature extraction based on the original vector, this method provides more intuitive visual information. To prevent information loss, the feature vector dimension in this method reaches 60,000.
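A minimal sketch of flattening the pixel tensor into the 60,000-dimensional feature vector; a blank white canvas stands in for a real diagram:

```python
import numpy as np

# A 200 x 100 RGB indicator diagram in memory is a 3-D tensor.
# Flattening it gives the 60,000-dimensional feature vector of method 2.
image = np.full((100, 200, 3), 255, dtype=np.uint8)   # white canvas (H x W x C)
feature = image.reshape(-1)                            # length 100*200*3 = 60,000
```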

Feature extraction based on convolutional neural network
The convolutional neural network (CNN) is an efficient recognition method that has emerged in recent years and has garnered widespread attention and discussion in various fields [31]. A convolutional layer, composed of multiple convolution kernels, is also known as the feature extraction layer; it is mainly used to learn features in the image, and the weight values in the convolution kernels are learned and updated automatically. In recent years, CNNs have been applied to object detection [32], medical treatment [33,34], image recognition [35,36], etc. We use a CNN to extract the features of the indicator diagrams, and there is a series of publicly available CNN models pre-trained on the ImageNet dataset to choose from.
The ImageNet dataset is widely regarded as one of the most essential and frequently used datasets for image classification, detection, and localization in deep learning. It comprises tens of thousands of images in diverse categories such as animals, flowers, fruits, and people. From 2010 to 2017, the ILSVRC (ImageNet Large Scale Visual Recognition Challenge) competition utilized the ImageNet dataset, leading to the creation of deep learning models such as AlexNet (2012), VGG (2014), and ResNet (2015). By learning the basic features of thousands of objects on the ImageNet dataset, a CNN forms a pre-trained model with weight parameters [37] and a strong feature extraction ability. The MobileNetV2 model, illustrated in Fig. 5, was proposed by Google in 2018 [38]; compared with its predecessor, MobileNetV1, it is not only more accurate but also more compact. In this study, we employ the pre-trained MobileNetV2 model to extract indicator diagram features. The model maps local features of the indicator diagram to various neurons, which are then aggregated to form global information; the output is a numeric feature vector capable of discerning between distinct types of indicator diagrams. Since the model is used for feature extraction rather than classification, its classification layer is removed. Prior to feature extraction, it is essential to normalize the indicator diagram. Normalization is a data processing technique that restricts the data to between 0 and 1; it significantly reduces fluctuations around the optimal solution and speeds up convergence. The normalization is performed as follows (assuming the ith column).
x_i' = (x_i − min(x_i)) / (max(x_i) − min(x_i))

where x_i is the original characteristic parameter, max(x_i) represents the maximum value, and min(x_i) represents the minimum value.
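The min-max normalization above can be sketched as follows (a hypothetical helper, not the authors' code):

```python
import numpy as np

def min_max_normalize(x):
    """Min-max normalization: x' = (x - min(x)) / (max(x) - min(x)),
    mapping the data into [0, 1]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

scaled = min_max_normalize([10.0, 20.0, 30.0])
```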
To ensure consistent image size, each indicator diagram (200 × 100 × 3) is preprocessed before being imported into the picture dataset, then normalized to improve the convergence rate. Upon inputting the image to the convolutional layers, MobileNetV2 maps all features containing local information (the height, width, and number of channels of the feature) to 7 × 4 × 1280 dimensions. Fig. 6 illustrates the parameter changes of each layer when MobileNetV2 extracts features from an indicator diagram.
The output of the model is a 7 × 4 × 1280 array, which is ultimately flattened into a vector of length 35,840.
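The 7 × 4 × 1280 output shape follows from MobileNetV2's overall downsampling factor of 32 with 'same' padding; a quick check of the shape arithmetic and the final flatten, using a stand-in array rather than the real network:

```python
import math
import numpy as np

# With an overall stride of 32 and 'same' padding, a 200 x 100 input yields
# ceil(200/32) x ceil(100/32) = 7 x 4 spatial positions with 1280 channels each.
h, w = math.ceil(200 / 32), math.ceil(100 / 32)
features = np.zeros((h, w, 1280))     # stand-in for the real CNN output
flat = features.reshape(-1)           # the 35,840-dimensional feature vector
```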
On the basis of method 2 (feature extraction based on the three-dimensional pixel tensor), the pre-trained MobileNetV2 is introduced for further feature extraction in this method. Through a sequence of neural network operations such as convolution and pooling, the information is compressed and transformed, reducing the feature dimension from 60,000 to 35,840. However, due to the lack of interpretability in neural networks, the level of information retention cannot be predicted in advance; it must be evaluated according to the actual clustering results.

Clustering of indicator diagrams
Each indicator diagram has now been converted into a feature vector. Subsequently, these vectors are clustered so as to maximize the differences between classes and minimize the differences within each class.

K-means clustering
Clustering mainly refers to classifying data by identifying features that describe their correlation or differentiation [39]. The effectiveness of clustering is determined by the degree of intra-class similarity, which should be high, and inter-class similarity, which should be low: the distance between classes should be as large as possible, and the distance between samples within a class and the class center as small as possible. As one of the most commonly used techniques in data analysis, clustering has been widely used in pattern recognition [40], machine learning [41], image processing [42], data mining [43], etc. Data clustering algorithms fall into two major categories, namely hierarchical clustering algorithms and partitional clustering algorithms [44], with partition-based methods being the most commonly used. They divide all objects into mutually exclusive categories, each object belonging to exactly one category, with the purpose of increasing intra-cluster similarity and reducing inter-cluster similarity. Common partition-based algorithms include K-means, K-medoids, and K-prototype. In this study, the K-means algorithm, currently the most widely used clustering method, was selected for feature vector clustering due to its simple calculation principle, easy implementation, high efficiency, and success in various fields.
K-means clustering computes the distance between each data object and the cluster centroids, and data objects close to the same centroid are assigned to the same category.
The specific steps of the K-means algorithm are:

(1) Specify K initial centroids, each of which defines a class.
(2) For each remaining sample, compute its distance to all centroids and assign it to the nearest cluster.
(3) After this division, recalculate the centroid of each class, then recompute the distance from each sample to each centroid and reclassify the samples.
(4) Repeat steps (2) and (3) until the centroids no longer change.
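The steps above can be sketched as a minimal NumPy implementation. Random data-point initialization is assumed here for brevity; K-means++ seeding, discussed next, replaces step (1) in practice.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means following steps (1)-(4) above.
    Initial centroids are k random samples (an illustrative assumption)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # step (2): Euclidean distance from every sample to every centroid
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # step (3): recompute the centroid of each class
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):                 # keep old centroid if class is empty
                new_centroids[j] = members.mean(axis=0)
        # step (4): stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated synthetic clusters of 400-dimensional "original vectors"
rng = np.random.default_rng(42)
A = rng.normal(0.0, 0.1, size=(20, 400))
B = rng.normal(5.0, 0.1, size=(20, 400))
labels, centroids = kmeans(np.vstack([A, B]), k=2)
```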
However, different initial cluster centers may produce different clustering results. To overcome the negative impact of initial center selection on the clustering results, the initial cluster centers should be as far away from each other as possible; the K-means++ algorithm is used for this optimization.
The specific steps of K-means++ algorithm are.
(1) Randomly select a sample from the sample set as the first initial cluster center.
(2) Calculate the distance D(x) from each remaining sample x to its nearest existing cluster center.
(3) Calculate the probability P(x) that each sample becomes the next cluster center: P(x) = D(x)^2 / Σ D(x)^2.
(4) Select the sample with the maximum probability as the next cluster center, and repeat steps (2)–(4) until K centers have been chosen.
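A sketch of the seeding procedure above. Note that step (4) as stated takes the maximum-probability sample; the standard K-means++ algorithm instead samples the next center in proportion to P(x), which this sketch points out in its comments.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """K-means++ seeding following steps (1)-(4) above.
    Taking the argmax of D(x)^2 matches step (4) as written; canonical
    K-means++ would sample proportionally to P(x) = D(x)^2 / sum D(x)^2."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]           # step (1): random first center
    for _ in range(k - 1):
        # step (2): squared distance from each sample to its nearest center
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        # steps (3)-(4): pick the sample with the largest D(x)^2 weight
        centers.append(X[np.argmax(d2)])
    return np.array(centers)

# Two tight groups: the two chosen centers should land far apart
X = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 10.0], [10.1, 10.0]])
centers = kmeans_pp_init(X, 2)
```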

Distance measurement
Common distance measures include the Euclidean, Manhattan, and Chebyshev distances [45]. The K-means implementation in the machine learning library scikit-learn uses the Euclidean distance, which applies to any spatial structure: it represents the actual length between two points in an n-dimensional space, or the length of a vector from the origin; in 2- or 3-dimensional space it is simply the straight-line distance between two points. The Euclidean distance d between points x and y is

d(x, y) = sqrt( Σ_{i=1}^{n} (x_i − y_i)^2 )

where n is the number of components of x and y, and x_i and y_i denote their ith components.
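The distance formula can be checked directly (a hypothetical helper, valid for any dimension n):

```python
import numpy as np

def euclidean(x, y):
    """d(x, y) = sqrt(sum_i (x_i - y_i)^2), for vectors of any dimension n."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sqrt(np.sum((x - y) ** 2)))

# 3-D example: sqrt(1 + 4 + 4) = 3
d = euclidean([0, 0, 0], [1, 2, 2])
```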

Evaluation index of clustering
The K-means algorithm requires optimization to ensure that samples within the same category exhibit high similarity [46]. Many indicators exist for evaluating clustering effectiveness; the most commonly used are SSE, SC, and H.

SSE (sum of squares due to error) is

SSE = Σ_{i=1}^{K} Σ_{x∈K_i} |x − c_i|^2

where K is the number of clusters, K_i is the ith category, x is a point in K_i, and c_i is the centroid of the ith category.

The Silhouette Coefficient (SC) is one of the main indexes for evaluating clustering results [47]. It combines cohesion and separation and can be understood as describing how well-defined the contour of each category is after clustering. Cohesion reflects the similarity between a sample and the other elements within its class, while separation reflects the difference between a sample and the elements outside its class. The SC of sample i is

S(i) = (b(i) − a(i)) / max(a(i), b(i))

where a(i) represents the cohesion of the sample, i.e. the average distance from vector i to the other vectors in the same cluster:

a(i) = (1 / (n − 1)) Σ_j distance(i, j)

where n is the number of samples in the cluster, j ranges over the other samples in the same class as sample i, and distance(i, j) is the distance between i and j. The smaller a(i), the more compact the class. b(i) represents the dissimilarity between clusters, i.e. the average distance from vector i to the samples of other clusters, computed analogously. S(i) lies in [−1, 1]; the larger the SC, the better the clustering effect.

If a cluster contains only one category of indicator diagrams, it satisfies Homogeneity (H). H can also be interpreted as accuracy: the proportion of correctly classified samples in each cluster relative to the total number of samples in that cluster. It is computed as

H = (Σ_{i=1}^{K} R_i) / N

where K is the number of clusters, K_i is the ith category, R_i is the number of correctly classified samples of the ith category, and N is the total number of samples.
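The three indices can be computed with scikit-learn plus a short per-cluster majority-vote computation for H. The two-blob data below is synthetic, and the H computation mirrors the paper's per-cluster-accuracy definition rather than sklearn's `homogeneity_score`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two separable synthetic blobs with known ground-truth labels
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (30, 2)), rng.normal(4, 0.2, (30, 2))])
true = np.array([0] * 30 + [1] * 30)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
sse = km.inertia_                      # SSE: sum of squared distances to centroids
sc = silhouette_score(X, km.labels_)   # Silhouette Coefficient, in [-1, 1]

# H as used in the paper: fraction of samples matching their cluster's majority label
h = sum(np.bincount(true[km.labels_ == j]).max() for j in range(2)) / len(X)
```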

Pseudo codes of the proposed methods
Combining the three feature extraction methods with the K-means clustering algorithm introduced in this section yields three clustering algorithms with different feature extraction strategies. Their pseudo codes are shown in Algorithms 1 to 3.

X. Wang et al.
Pseudo code of Algorithm 1.

Pseudo code of Algorithm 2.

Experimental data
From a pool of 2000 pumping wells, 100 wells were randomly selected and 200 samples were collected from each, resulting in a total of 20,000 samples. Each sample contained one displacement vector of length 200 and one load vector of length 200. The dimensions of the drawn indicator diagrams were 200 × 100 × 3.
Three feature extraction methods were applied, resulting in 20,000 feature vectors for each method. Method 1 produced feature vectors of length 400, method 2 of length 60,000, and method 3 of length 35,840.

Experimental result
For the 20,000 samples mentioned above, the K-means algorithm is employed for clustering, with the number of clustering categories set to 80, corresponding to 80 folders. Upon completion of the clustering process, the indicator diagrams, numbered 0–19999, are placed in the respective folders according to their clustering results.
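The folder layout described above can be sketched as follows; the paths and placeholder files are illustrative, and the real pipeline would save the drawn indicator diagram images instead.

```python
import os
import tempfile

def save_by_cluster(labels, out_dir):
    """Place each sample (numbered 0..N-1) into one folder per cluster label,
    mirroring the 80-folder layout described above (sketch only)."""
    for idx, label in enumerate(labels):
        folder = os.path.join(out_dir, f"category_{label:02d}")
        os.makedirs(folder, exist_ok=True)
        # in the real pipeline: fig.savefig(os.path.join(folder, f"{idx}.png"))
        with open(os.path.join(folder, f"{idx}.txt"), "w") as f:
            f.write(str(label))

out_dir = tempfile.mkdtemp()
save_by_cluster([0, 1, 0, 2], out_dir)
```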
After clustering the original vector data in Table 3, the category of each sample is saved in the last column, as shown in Table 4. Each vector is then drawn as a standardized indicator diagram and saved in the appropriate folder; partial indicator diagrams of six categories are illustrated in Fig. 10.
Table 5 compares the time consumption of the three indicator diagram clustering methods discussed in this paper. The total time includes the time consumed by feature extraction and by clustering.
To investigate the impact of sample quantity on clustering time, sample sets ranging from 4000 to 20,000 were examined and the clustering time was measured for each. Fig. 7 illustrates how clustering time changes as the number of samples increases. Considering that the three feature extraction methods produce feature vectors of different lengths, the relationship between feature vector length and clustering time is plotted in Fig. 9. As the feature vector length increases, the clustering time increases at an exponential rate, which further indicates the importance of selecting an appropriate feature extraction method to improve clustering efficiency.
It is worth noting that clustering of indicator diagrams may produce certain errors, whereby indicator diagrams of different types are occasionally grouped into the same folder. Nonetheless, the majority of folders are clustered correctly. Six folders with incorrect clustering were selected for each of the three methods, as demonstrated in Figs. 10–12.
From an intuitive perspective, among the folders with clustering errors, the first method has the lowest proportion of erroneous samples. To further measure the clustering, we employ H as the average accuracy of each method. After statistical analysis of each folder, the accuracy distributions of the three methods are shown in Fig. 14. The computed H values of the three methods are 98%, 92%, and 95%, respectively.
As can be seen from Fig. 14, the first clustering method has the largest number of folders with 100% accuracy and the fewest folders with accuracy below 80%. Moreover, the accuracy of the first method is predominantly above 90%, with comparatively few folders below 90%. These results indicate that the first clustering method is highly reliable and accurate, producing a considerable number of precise outcomes with minimal discrepancies. Figs. 7 and 8 show that feature extraction based on the three-dimensional tensor takes significantly longer than the other two clustering methods. Furthermore, Fig. 14 reveals that the second method does not have the highest number of folders with 100% accuracy, while it has more folders with accuracy below 80%. In addition, the SC values of the three methods are 0.53, 0.20, and 0.23, respectively, indicating that the higher the H, the larger the SC. From the SC perspective, the feature extraction method based on the original vector also demonstrates its advantages.
Considering that the feature vectors obtained from three different feature extraction methods have different lengths, the relationship between the length of the feature vectors and the clustering accuracy metrics H and SC is plotted in Fig. 13.From the figure, it can be observed that method 1, with a feature vector length of only 400, achieves the highest H and SC values, indicating the best clustering accuracy.Method 3, with a feature vector length of 35,840, performs second best, followed by method 2 with a feature vector length of 60,000.This suggests that the clustering accuracy is not directly related to the length of the feature vectors, but rather to the amount of information preserved within them.
Although method 2 has the longest feature vectors, over 90% of each vector consists of zeros, resulting in low information density and lower clustering accuracy. Method 3 compresses the feature vectors of method 2, sacrificing some information but increasing information density, thereby improving clustering accuracy over method 2. Furthermore, both method 2 and method 3 first transform the original data into indicator diagrams for visual information extraction. Although visual approaches are more intuitive and easier for humans to interpret, the clustering results show that they offer no advantage for the clustering algorithm.
Due to its extensive computation and long running time, the second method does not achieve better clustering results than the others, so further research on it is not pursued. Different feature extraction methods can be compared from a macro perspective by clustering different numbers of samples into the same number of categories, or the same number of samples into different numbers of categories. The findings of these two experimental schemes are summarized in Tables 6 and 7.
Table 6 shows that clustering accuracy decreases as the number of samples increases. Table 7 shows that clustering accuracy increases with the number of categories; meanwhile, SSE continues to decrease and SC reaches a maximum, indicating that SC can be used to help determine the optimal number of clustering categories. From Table 7, the best number of categories obtained by the first method is 112 and that obtained by the second method is 144, the latter dividing the dataset more finely. Fig. 15 shows the changes in SC: as the number of categories increases, the SC initially increases, but beyond a certain threshold it reaches a plateau and no longer improves. Based on the comparative analysis of the two clustering methods in Tables 6 and 7, the clustering method based on the original vector yields a smaller SSE and a larger SC, suggesting a better clustering effect.
The effect of this study was further analyzed from the perspective of practical application. If 20,000 well samples need to be marked by working condition type, an experienced oilfield engineer takes approximately 20 s to mark one sample, so the full marking process takes approximately 111.111 h. If the fast marking method based on the original vector proposed in this study is used to assist the marking, the tests above show that clustering 20,000 samples takes just 0.235 h with an accuracy of 98%. Only about 400 samples would then require manual intervention, taking approximately 2.222 h, so the entire process can be completed in just 2.457 h. Compared with manual marking, the proposed method achieves a roughly 45-fold increase in efficiency.
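The time comparison above can be verified with a few lines of arithmetic:

```python
# 20,000 samples at ~20 s each by hand, versus 0.235 h of clustering
# plus manual review of the ~2% (400 samples) that are mislabeled.
manual_h = 20000 * 20 / 3600              # fully manual marking, hours
assisted_h = 0.235 + 400 * 20 / 3600      # clustering + manual touch-up, hours
speedup = manual_h / assisted_h           # efficiency gain
```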

Conclusions
The traditional sample set preparation for fault diagnosis of pumping wells relies on expert experience for marking, which is both time-consuming and labor-intensive. This research proposed three fast marking methods, each adopting a different feature extraction strategy coupled with the K-means clustering algorithm.
A performance comparison of the methods was conducted through experimental testing. The total times consumed by the three methods were 0.235 h, 8.266 h, and 0.652 h, respectively. The accuracy measures H and SC were 98%, 92%, 95% and 0.53, 0.20, 0.23, respectively. Method 1 is the most outstanding of the three in both efficiency and accuracy.
For the indicator diagram clustering task, the clustering time is influenced by the number of samples, the number of cluster categories, and the feature dimension: it increases exponentially with the number of samples and the feature dimension, while it grows linearly with the number of cluster categories. However, there is no direct correlation between clustering accuracy and feature dimension; accuracy depends more on the amount and density of information. Although plotting the original data as indicator diagrams makes them more visually distinguishable to humans, this does not yield better accuracy for the clustering algorithm.
In practice, marking pumping well working conditions with the clustering algorithm proposed in this study is about 45 times faster than manual marking. Nevertheless, clustering too many samples at once may still cause excessive computational load and prolonged calculation time. In addition, clear criteria for determining the number of clustering categories are still lacking.
The proposed method can also be applied to other sample marking problems, such as well logging interpretation. Note, however, that the data or image characteristics of other problems may differ from those of indicator diagrams, which could bias the achievable accuracy and efficiency.
Future research can focus on improving algorithmic efficiency, exploring fast marking for large sample sets, addressing sample imbalance, and balancing computational efficiency with information preservation.

Data availability statement
Data will be made available on request.

X. Wang et al.
This study proposes three dynamometer image feature extraction methods, including feature extraction based on the original vector, feature extraction based on a three-dimensional pixel tensor, and feature extraction based on a convolutional neural network. This study randomly selects 20,000 samples and conducts comparative experiments to analyze the efficiency and accuracy of the aforementioned methods. The remainder of this paper is organized as follows. Section 2 introduces the drawing process of the standardized indicator diagram. Section 3 describes the feature extraction methods for indicator diagrams. Section 4 introduces clustering of indicator diagrams based on the K-means algorithm. Section 5 presents the experimental results and analysis. Finally, Section 6 concludes the paper.
A computer represents images as multidimensional arrays, and the indicator diagram in particular as a three-dimensional array, or tensor. After a 200 × 100 × 3 indicator diagram is read into the computer and converted into a three-dimensional tensor, it contains 60,000 values (20,000 pixels across three color channels).
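The conversion described above can be sketched as follows; the zero-filled array below is a hypothetical stand-in for an indicator diagram image read from disk:

```python
import numpy as np

# Hypothetical stand-in for a 200 x 100 RGB indicator diagram read
# into memory: a (200, 100, 3) uint8 tensor holding
# 20,000 pixels x 3 channels = 60,000 values.
img = np.zeros((200, 100, 3), dtype=np.uint8)
assert img.size == 60_000

# Flatten the tensor into a normalized 1-D feature vector, the form
# a clustering algorithm such as K-means consumes.
feature = img.reshape(-1).astype(np.float32) / 255.0
print(feature.shape)  # (60000,)
```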
We employ the pre-trained CNN model MobileNetV2 to extract indicator diagram features. The model maps local features of the indicator diagram to individual neurons, which are then aggregated into global information; the final output is a numeric feature vector capable of discriminating between distinct types of indicator diagrams. This approach makes full use of the feature extraction ability of CNNs. Because the model is used for feature extraction rather than classification, the classification layer of MobileNetV2 is removed, and the remaining network is used to extract the characteristics of the indicator diagram.
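A minimal sketch of this setup using the Keras implementation of MobileNetV2 (not necessarily the authors' exact pipeline): removing the classification head via `include_top=False` and applying global average pooling maps each image to a 1280-dimensional feature vector. To keep the example self-contained and offline, `weights=None` is used here; real feature extraction would pass `weights="imagenet"`, and the input size and batch below are illustrative assumptions.

```python
import numpy as np
import tensorflow as tf

# MobileNetV2 without its classification head: include_top=False drops
# the final classifier, pooling="avg" collapses the spatial feature maps
# into one 1280-dim vector per image. weights=None avoids a network
# download for this sketch; use weights="imagenet" for real features.
extractor = tf.keras.applications.MobileNetV2(
    input_shape=(96, 96, 3), include_top=False, weights=None, pooling="avg"
)

batch = np.random.rand(4, 96, 96, 3).astype("float32")  # 4 dummy diagrams
features = extractor.predict(batch, verbose=0)
print(features.shape)  # one 1280-dim feature vector per image
```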

Pseudo code of Algorithm 3.
In order to analyze the impact of the number of clustering categories on clustering time, a range of 16-80 categories was examined and the clustering time was recorded for each. Fig. 8 illustrates how clustering time changes as the number of categories increases. As depicted in Figs. 7 and 8, as the number of indicator diagrams or the number of clustering categories increases, the clustering time of feature extraction based on the three-dimensional pixel tensor grows rapidly and becomes very long, which does not meet the research goal of fast clustering for indicator diagrams. Figs. 7 and 8 also show that these two growth trends are exponential and linear, respectively.
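The timing behavior discussed above can be reproduced in miniature with scikit-learn's K-means on synthetic data; the sample count, dimensions, and cluster count below are illustrative assumptions, not the paper's experimental configuration:

```python
import time
import numpy as np
from sklearn.cluster import KMeans

# Illustrative timing comparison (assumed setup, not the authors' data):
# cluster random "feature vectors" at a fixed category count and observe
# how wall time grows with the feature dimension.
rng = np.random.default_rng(0)
n_samples, n_clusters = 2000, 16

for dim in (64, 1024):
    X = rng.random((n_samples, dim), dtype=np.float32)
    t0 = time.perf_counter()
    labels = KMeans(n_clusters=n_clusters, n_init=3,
                    random_state=0).fit_predict(X)
    print(f"dim={dim}: {time.perf_counter() - t0:.2f} s")
```

Sweeping `n_samples` or `n_clusters` in the same loop reproduces the other two trends (sample count and category count versus time).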

Fig. 7. Relationship between the number of indicator diagrams and clustering time.

Fig. 8. Relationship between the categories of clustering and clustering time.

Fig. 9. Relationship between feature vector length and clustering time.


Fig. 13. Relationship between length of feature vector and clustering accuracy.

Fig. 15. Changes in SC for the two methods.
displacement-load data from an oil plant to create indicator diagrams and manually marked 7500 data sets of five fault types [16]. Wei et al. collected 2000 indicator diagrams of 12 fault types in Y Oilfield and manually marked each diagram before diagnosis.

Table 1. A summary of related works.

Table 3. Original vector of indicator diagram.

Fig. 3. Three channels of indicator diagram.

Table 5. Comparison of time consumed.

Table 4. Vector clustering results.

Table 6. Comparison of clustering effects for different sample quantities.

Table 7. Comparison of clustering effects for different numbers of categories.