Metrologically reasonable clustering of measurement results

This paper is dedicated to the metrological aspects of the clustering procedure. It is shown that taking into account information on the uncertainty of the data to be clustered makes it easy to reach reasonable, metrologically supported decisions. The paper shows that problems which are usually difficult to solve in clustering practice become simpler: the proposed approach allows determining the maximum number of recognizable clusters, thereby protecting against unreasonable conclusions, and it simplifies the preconditioning of the data to be clustered.

grounds and that is why they may be unreasonable. The clustering procedure is not an exception: some of the discovered clusters can in reality be parts of one cluster and only look separate because of random errors in the initial data; some of the identified clusters may consist of several unrecognizable smaller subclusters that we are unable to see, because they have merged into one since their boundaries are blurred and overlap due to the errors of the clustered measurement results. Detecting such situations is an extremely difficult problem in practice, and it is supposed to be solved using only the data to be clustered, which in real-life situations usually do not contain any corresponding information. That is why, to accompany any decision with a characteristic of its quality, we need to take into account the metrological characteristics of the data we process. Accounting for the information on the initial data uncertainty brings benefits in many cases: we can easily answer questions that are difficult when the data are treated as absolutely accurate. Applied to clustering, the algorithm shown below allows estimating the maximum reasonable number of recognizable clusters. Obtaining such a number is a hard problem in practice that is often solved subjectively on the basis of a priori information; but if we have information on the metrological characteristics of the data to be clustered, this quantity can be obtained easily and used as a restriction on, or a reinforcement of, conclusions and decisions.
The traditional approach to clustering consists of the following main stages. 1) Collecting the initial data for clustering. The array Ω = {ω1, ω2, …, ωn} of states of the studied object, process or phenomenon of our interest is composed at this stage. Each element ωi is described by a vector of values xi^T = (xi1, xi2, …, xim) having the same structure for all i. Its elements are a set of measured characteristics of the studied object and are disturbed by measurement errors. All measured vectors xi, i = 1, 2, …, n, form a set X of observations of the studied object, process or phenomenon.
2) Preparing the data before clustering. The vectors xi can be inconvenient to deal with during clustering, so they can be transformed into vectors yi = (yi1, yi2, …, yir) of possibly smaller dimension or of a form more convenient to work with. The elements of yi inherit the uncertainty of the measurement results xij and are inaccurate too. All vectors yi form a set Y consisting of the principal components of the quantitative description of the state of the studied object, process or phenomenon.
The preconditioning of the data can use various and rather complex algorithms, including, for example, neural networks and self-organizing maps [14].
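For the simplest, purely linear case, the following Python sketch illustrates how the maximum possible error bounds follow the data through the preconditioning; the function name and the column-wise scaling to [-1, 1] are illustrative assumptions, not the paper's prescribed transform.

```python
import numpy as np

def precondition(X, dX):
    """Scale each column of the raw measurements X (n x m) into [-1, 1].

    For this purely linear transform the maximum possible absolute errors
    dX scale with the same factor; the (small) uncertainty of the scale
    factor itself is neglected in this sketch.
    """
    scale = np.max(np.abs(X), axis=0)   # per-column scale factor
    return X / scale, dX / scale        # Y and its inherited error bounds dY
```

Nonlinear preconditioning (e.g. a neural network) would instead require one of the uncertainty propagation methods discussed below.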
3) Choosing a measure ρ of distance between elements of the set Y.
The measure ρ(yi, yj) that quantifies the degree of proximity of the vectors yi and yj to each other is chosen at this stage. Since the vectors yi and yj are associated with elements of the array Ω by the chain of transformations Ω → X → Y, the value ρ(yi, yj) allows drawing a conclusion on the degree of proximity of the corresponding elements ωi and ωj.

4) Choosing a clustering algorithm.
At this stage, a rule f(yi) should be chosen that assigns to each vector yi a number from the set C = {1, 2, …, K}, which determines the cluster that contains yi and, as a consequence, ωi. The results of clustering by a particular algorithm depend essentially on the measure ρ(yi, yj) used in the calculations (one algorithm can produce different results for different measures, just as different clustering methods do when they use the same measure ρ). In practice, the choice of a particular clustering algorithm may depend on the number of analyzed vectors n and their length m.

5) Clustering
The clustering procedure, in the introduced terms, has the meaning of the mapping f: Y → C. For each cluster, its center's coordinates ck, k = 1, 2, …, K, should be determined. The number of clusters K is given by the user and, as a rule, is based on available a priori information. The quantitative expression of the quality of the performed clustering is usually the sum of the measures of proximity between the vectors yi and the closest cluster centers:

R(K) = Σ_{i=1,…,n} min_{k=1,…,K} ρ(yi, ck).

The listed stages are the main ones of the clustering procedure; in concrete situations it may differ only in some particular details.
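A minimal sketch of this quality functional, assuming a user-supplied measure rho such as rho_euclidean above (the function name is ours):

```python
def clustering_quality(Y, centers, rho):
    """R(K): for every vector y the proximity to the nearest of the K
    cluster centers is taken, and these minimal values are summed."""
    return sum(min(rho(y, c) for c in centers) for y in Y)
```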
We can conclude that the value R(K) computed from the measurement results xij inherits their uncertainty and is therefore uncertain too. This circumstance creates the possibility of determining the maximum possible number Kmax of clusters that can be recognized from the initial data. To achieve this goal, it is necessary to accompany metrologically all the calculations listed above: for any computation, we should obtain not only the calculation results but also a metrological characteristic of their accuracy. Then we will be able to connect the uncertainty of the value R(K), as the quantitative quality estimate, with the accuracy of the initial data, i.e. the measurement results xij. For this, it is necessary at the first stage of the clustering procedure to involve information on the possible absolute errors Δxij of the values xij. In the most frequent practical case this will be the maximum possible error Δij: |Δxij| ≤ Δij, i = 1, 2, …, n and j = 1, 2, …, m. If the values xij are measured, then the values of Δij are known from the technical documentation of the measuring instruments used. If not only the maximum possible error values are available from the manuals, then the other metrological characteristics should be taken into account as well.
To estimate the metrological characteristics of the results of calculations with the values xij, we can use one of the many approaches and methods developed to date for propagating the uncertainty of measurement results through a computational procedure. These methods can be grouped into two big sets: methods that use randomization (like the Monte Carlo approach) and methods that perform automatic analysis by overloading the operations performed during the calculations (linearization of the calculated functions with automatic differentiation, or interval arithmetic). All of these techniques can be applied to any computational problem; clustering is not an exception. We propose to use the approach of paper [15] for propagating the uncertainty of the inaccurate initial data through all the computations, because this approach has a metrological orientation, but any other suitable method can be chosen as well.
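As one possible randomization-type method (not the approach of [15]), the following sketch estimates the limits ΔR(K) by perturbing the data within their error bounds; it reuses clustering_quality from the sketch above, and the number of trials is an arbitrary illustrative choice.

```python
import numpy as np

def propagate_uncertainty_mc(Y, dY, centers, rho, trials=200, seed=0):
    """Randomization-type estimate of the error limits dR(K) of the quality
    functional: every component of Y is perturbed within its maximum
    possible absolute error dY, and the scatter of R(K) is observed."""
    rng = np.random.default_rng(seed)
    R = clustering_quality(Y, centers, rho)
    deviations = []
    for _ in range(trials):
        Y_pert = Y + rng.uniform(-dY, dY)   # an admissible realization of the errors
        deviations.append(abs(clustering_quality(Y_pert, centers, rho) - R))
    return R, max(deviations)               # R(K) and an estimate of dR(K)
```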
If there is no need for preliminary processing of the initial data before clustering, then a computationally cheap, simple procedure can be used in the case of small uncertainties of the initial data. It can be described as follows. Let, for each vector xi^T = (xi1, xi2, …, xim) of the initial data, the vector Δxi^T = (Δi1, Δi2, …, Δim) of the maximum possible absolute errors of its components be known (from the technical documentation). Since no processing is provided, yi = xi and Δyi = Δxi. Otherwise, we need to use one of the mentioned methods to propagate the information on the errors Δxi during the calculation of the values yi and, as a result, estimate their maximum possible absolute errors Δyi. To automatically determine the maximum possible number of clusters Kmax, the following iterative procedure should be applied.
1) Assign a zero value to the number of clusters K and set R(0) = +∞.
2) Assign K the value (K + 1), perform the clustering procedure by the chosen method and obtain the value of R(K) together with the limits ΔR(K) of its possible absolute error inherited from the initial data.
3) If |R(K) − R(K−1)| > ΔR(K), then go to step 2. Otherwise, the procedure should be finalized and Kmax = K.
Thus, the idea of the proposed approach is as follows. If the refinement of the clustering results caused by an increase in the number of clusters turns out to be more significant than the results' error inherited from the inaccuracy of the initial data, then the number of clusters should be increased. Otherwise, the clustering procedure should be stopped, because the maximum number of clusters that can be recognized in such inaccurate data has been reached. This idea is a particular case of a more general one: if an iterative procedure deals with inaccurate initial data, then it should be stopped earlier, in accordance with the data accuracy [16].
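A sketch of this stopping rule, using scikit-learn's k-means as one possible clustering method and the Monte-Carlo estimate of ΔR(K) from the previous sketch; the function and parameter names are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def max_recognizable_clusters(Y, dY, rho, K_limit=None, trials=200, seed=0):
    """Iterative procedure from the text: K grows while the improvement
    |R(K) - R(K-1)| still exceeds the error limits dR(K) inherited from
    the data; the K at which it stops exceeding them is returned as Kmax."""
    Y = np.asarray(Y, dtype=float)
    K_limit = K_limit if K_limit is not None else len(Y)
    R_prev = np.inf                              # step 1: K = 0, R(0) = +inf
    for K in range(1, K_limit + 1):              # step 2: K -> K + 1
        centers = KMeans(n_clusters=K, n_init=10,
                         random_state=seed).fit(Y).cluster_centers_
        R, dR = propagate_uncertainty_mc(Y, dY, centers, rho, trials, seed)
        if abs(R - R_prev) <= dR:                # step 3: refinement within the error
            return K
        R_prev = R
    return K_limit
```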
Obtaining the value Kmax protects us from unreasonable conclusions about the cluster structure. Dealing with the uncertainty characteristics at every step of the clustering procedure also creates the opportunity to apply the clustering results better to the classification of further observations of the state of the studied object, process or phenomenon: we will obtain the error characteristics Δck of the cluster centers' coordinates ck and use them to find out whether there are reasons to include a new observation xn+1 in several clusters at once. The usual practice is to attribute the vector to the closest cluster, but in reality the distances to the clusters' centers are distorted by errors inherited from the initial data used during clustering, and in some cases this brings us to a situation where we cannot find out which cluster is the closest one. Taking into account the uncertainty of the measurement results allows us to detect this situation and draw the right conclusion.
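A sketch of such an assignment with ambiguity detection, assuming the error bounds Δck of the centers are available; the names and the crude worst-case bound (valid for a Euclidean-type measure) are our illustrative choices.

```python
import numpy as np

def classify_with_uncertainty(y_new, dy_new, centers, d_centers):
    """Attribute a new observation to the closest cluster center, but also
    report every cluster that could be the closest one once the errors of
    the observation (dy_new) and of the centers (d_centers) are allowed for."""
    dist  = np.array([np.linalg.norm(y_new - c) for c in centers])
    slack = np.array([np.linalg.norm(dy_new + dc) for dc in d_centers])
    best = int(np.argmin(dist))
    # a cluster remains a candidate if its smallest possible distance does not
    # exceed the largest possible distance to the nominally closest center
    candidates = [k for k in range(len(centers))
                  if dist[k] - slack[k] <= dist[best] + slack[best]]
    return best, candidates
```

If `candidates` contains more than one cluster, the data accuracy does not allow an unambiguous attribution of the new observation.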
Another great possibility that follows from taking into account the uncertainty of the initial data is that the proposed approach gives a reasonable and objective way to compare different measures ρ (or clustering algorithms as a whole) and to choose the most suitable one. Indeed, the best measure ρ from a set {ρ1, ρ2, ρ3, …} is the one that, for the same data to be processed, produces the largest value Kmax: this means that this measure allows recognizing the maximum number of clusters. At the moment, the choice of the measure ρ is a poorly formalized procedure and is usually based on subjective grounds or is ill-founded. The proposed approach provides objective reasons for the choice. The presented method can be applied to any dataset that should be clustered and allows determining the measure of proximity individually, which is new in clustering.
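A sketch of such a comparison, assuming Y and dY hold the preconditioned data and its error bounds and reusing the functions defined in the earlier sketches:

```python
# Hypothetical comparison of candidate measures: the one that lets the
# procedure above recognize the most clusters is preferred.
candidates = {"euclidean": rho_euclidean, "manhattan": rho_manhattan}
scores = {name: max_recognizable_clusters(Y, dY, rho)
          for name, rho in candidates.items()}
best_measure = max(scores, key=scores.get)
```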

Usage example
Let us examine a metrological situation as an example of using the proposed approach. Let a physical process be studied that takes place during an electrochemical reaction in a substance. Let the flow of this process be describable with only two electrical potentials x1 and x2 that can be measured directly. Different modes of the process flow can be recognized by varying the external conditions on which x1 and x2 depend. The measurements of x1 and x2 are performed independently but with the same voltmeter. All obtained values form the vectors xi^T = (xi1, xi2) that are processed in the following manner: the measurement results are scaled to fit the range [−1; 1] V. So, using the notation defined in the previous section of the paper, the vectors yi^T = (yi1, yi2) are obtained by the preliminary data processing performed before the clustering procedure is applied to determine the different states of the process. Let the number of measurement results be n = 40 and, as before, the index i = 1, 2, …, n.
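A runnable stand-in for this setup is sketched below so that the earlier sketches can be exercised; the four modes, the noise level and the voltmeter error limit of 0.05 V are assumptions made only for illustration, not values taken from the experiment.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
# Hypothetical stand-in for the measured potentials x1, x2 (V).
modes = np.array([[0.6, 0.6], [-0.6, 0.6], [0.6, -0.6], [-0.6, -0.6]])
X = modes[rng.integers(0, len(modes), n)] + rng.normal(0.0, 0.05, (n, 2))
dX = np.full((n, 2), 0.05)              # same voltmeter for both channels
scale = np.max(np.abs(X), axis=0)
Y, dY = X / scale, dX / scale           # results scaled to [-1, 1] V

K_max = max_recognizable_clusters(Y, dY, rho_euclidean)
```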
Let the described physical process tend to have Ktrue = 4 separate modes of its flow. In figure 1a, all the obtained vectors yi are shown. Different colors are used to mark the data corresponding to different states of the process and, consequently, to different clusters. In practice we almost never know the actual value Ktrue, so we have to assign the maximum number of clusters Kmax ourselves, and its value can potentially be equal to any number from 1 to n. The concrete value depends on additional information and on some subjective factors: the intuition and personal experience of the researcher. If we cannot take them into account, the only way to produce reasonable clustering results is to use information on the accuracy of the data to be processed. Figure 1b contains a graphical representation of the data and the bounds of its potential absolute errors inherited from the measurement instruments used.

Let us apply the proposed approach to the data shown in Fig. 1 and take the measurement uncertainties into account. As the clustering procedure, the k-means method was chosen because of its popularity in practice. Earlier we supposed that there is no additional information on the nature of the process, so we need to use random guesses as the initial approximation of the cluster centers' positions. In