CONSTRUCTING STABLE CLUSTERING STRUCTURE FOR UNCERTAIN DATA SET

The paper deals with the problem of constructing a stable clustering structure for an uncertain data set. The problem of explaining the stability of the clustering structure in automatic classification of objects under varying attribute values is formulated. The proposed method of uncertain data clustering is based on heuristic algorithms of possibilistic clustering. Basic concepts of the heuristic approach to possibilistic clustering based on the concept of allotment among fuzzy clusters, a validity measure and techniques of data preprocessing are considered. A method of constructing the set of values of the most possible number of fuzzy clusters for the uncertain data is provided, and a technique of constructing the stable clustering structure is proposed. The proposed technique is illustrated on the oil data set. An analysis of the experimental results is given and preliminary conclusions are formulated.


INTRODUCTION
Some preliminary remarks are given in the first subsection. Types of clustering structures are defined in the second subsection.

Preliminary remarks
The need for mechanisms that help to treat ambiguous, fuzzy and vague knowledge explains the growing interest in fuzzy systems. In particular, fuzzy clustering methods have been applied effectively in image processing, data analysis and modeling. Heuristic, hierarchical and optimization methods of fuzzy clustering were proposed by different researchers.
The most widespread approach in fuzzy clustering is the optimization approach. Optimization methods of fuzzy clustering are based on the concept of fuzzy $c$-partition, which is expressed as follows:

$$\sum_{l=1}^{c} u_{li} = 1, \quad 0 \le u_{li} \le 1, \quad i = 1, \dots, n, \qquad (1)$$

where $c$ is the number of fuzzy clusters and $u_{li}$ is the membership degree. So, the fuzzy $c$-partition can be arrayed as a $(c \times n)$ matrix $P = [u_{li}]$. Objective function-based fuzzy clustering algorithms can be divided into two types: object versus relational. The best known object approach to fuzzy clustering is the method of fuzzy $c$-means [1]. On the other hand, the most popular examples of fuzzy relational clustering are the RFCM-algorithm [2] and the ARCA-algorithm [3].
The most important problem of fuzzy clustering is neither the choice of the numerical procedure nor the distance to use, but the number $c$ of fuzzy clusters to look for. This is the so-called cluster validity problem. The classical approach to cluster validity for fuzzy clustering is based on directly evaluating the fuzzy $c$-partition. Many authors have proposed measures of cluster validity associated with fuzzy $c$-partitions. For example, the partition coefficient is described in [1], and the compactness and separation index was defined in [4]. The compactness and separation index is the most popular cluster validity criterion. It is notable that the index is appropriate for the ARCA-algorithm because the ARCA-algorithm, though being a relational clustering algorithm, generates prototypes.
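As a small illustration of a validity measure evaluated directly on a fuzzy $c$-partition, the sketch below computes the partition coefficient in its usual textbook form, the mean over objects of the sum of squared membership degrees. The function name and the two toy partition matrices are assumptions made for this example and are not taken from the paper.

```python
def partition_coefficient(u):
    """Partition coefficient of a fuzzy c-partition.

    u is a list of c rows; u[l][i] is the membership degree of object i
    in fuzzy cluster l. The value lies between 1/c (fully fuzzy) and 1 (crisp).
    """
    n = len(u[0])  # number of objects
    return sum(x * x for row in u for x in row) / n


# Hypothetical partitions of 4 objects into 2 clusters.
crisp = [[1.0, 1.0, 0.0, 0.0],
         [0.0, 0.0, 1.0, 1.0]]
fully_fuzzy = [[0.5, 0.5, 0.5, 0.5],
               [0.5, 0.5, 0.5, 0.5]]

print(partition_coefficient(crisp))        # crisp partition gives the maximum
print(partition_coefficient(fully_fuzzy))  # uniform memberships give 1/c
```

When the coefficient is computed for several candidate values of $c$, the value of $c$ with the largest coefficient is usually preferred.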
A possibilistic approach to clustering was proposed by Krishnapuram and Keller in [5] and developed by other researchers. This approach can be considered as a direction within the optimization approach to fuzzy clustering because all methods of possibilistic clustering are objective function-based methods. The concept of possibilistic partition is the basis of possibilistic clustering methods, and the membership values $\mu_{li}$ can be interpreted as values of typicality degree. For each object $x_i$, $i = 1, \dots, n$, the grades of membership should satisfy the conditions of a possibilistic partition:

$$\sum_{l=1}^{c} \mu_{li} > 0, \quad 0 \le \mu_{li} \le 1. \qquad (2)$$

So, the family of fuzzy sets is a possibilistic partition of the set of objects if condition (2) is met. Objective function-based fuzzy clustering algorithms are the most widespread methods in fuzzy clustering.
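The difference between the two kinds of partition can be checked mechanically. The sketch below assumes the usual constraints: in a fuzzy $c$-partition the membership degrees of every object sum to one, while in a possibilistic partition every object only needs a positive total membership. The function names and the toy matrix are hypothetical.

```python
def is_fuzzy_c_partition(u, tol=1e-9):
    """True if all degrees lie in [0, 1] and every column (object) sums to 1."""
    n = len(u[0])
    in_range = all(0.0 <= x <= 1.0 for row in u for x in row)
    sums_to_one = all(abs(sum(row[i] for row in u) - 1.0) <= tol for i in range(n))
    return in_range and sums_to_one


def is_possibilistic_partition(mu):
    """True if all degrees lie in [0, 1] and every object has positive total membership."""
    n = len(mu[0])
    in_range = all(0.0 <= x <= 1.0 for row in mu for x in row)
    covered = all(sum(row[i] for row in mu) > 0.0 for i in range(n))
    return in_range and covered


# Hypothetical typicality degrees of 3 objects in 2 clusters.
typicality = [[1.0, 0.8, 0.1],
              [0.2, 0.9, 1.0]]

print(is_possibilistic_partition(typicality))  # columns are positive
print(is_fuzzy_c_partition(typicality))        # columns do not sum to 1
```

Every fuzzy $c$-partition passes the possibilistic check, but not the other way round, which is why typicality degrees are less constrained than probabilistic-style memberships.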
However, heuristic algorithms of fuzzy clustering display a low level of complexity and a high level of essential clarity. Some heuristic clustering algorithms are based on a definition of the cluster concept, and the aim of these algorithms is the detection of clusters that conform to a given definition. These algorithms are called algorithms of direct classification or direct clustering algorithms.
An outline for a new heuristic method of fuzzy clustering was presented in [6], where a basic version of a direct clustering algorithm was described; this version of the algorithm is called the D-AFC(c)-algorithm [7]. The D-AFC(c)-algorithm can be considered as a direct algorithm of possibilistic clustering, as demonstrated in [7]. The heuristic approach to possibilistic clustering was developed, for example, in [8].

Types of clustering structures
Most fuzzy clustering techniques are designed for handling crisp data with their class membership functions. However, the data can be uncertain. Different types of uncertainty can characterize the initial data which must be processed by clustering algorithms. For example, a brief review of uncertain data clustering methods is given in [9]. Interval uncertainty of the initial data is a basic type of uncertainty in clustering problems. Interval-valued data are a particular case of three-way data, and this type of uncertainty is the subject of the consideration. So, a problem of three-way clustering arises.
The problem of clustering the three-way data can be formulated as follows [9]. Let $X = \{x_1, \dots, x_n\}$ be the set of objects, each described by $m_1$ attributes whose values are observed in $m_2$ situations. So, the three-way data can be presented by a poly-matrix as follows:

$$\hat{X}_{n \times m_1 \times m_2} = [\hat{x}_i^{t_1(t_2)}], \quad i = 1, \dots, n, \quad t_1 = 1, \dots, m_1, \quad t_2 = 1, \dots, m_2. \qquad (3)$$

In other words, the three-way data are data observed as the values of $m_1$ attributes with respect to $n$ objects for $m_2$ situations. The purpose of the clustering is to classify the set $X = \{x_1, \dots, x_n\}$ into $c$ fuzzy clusters, and the number of clusters $c$ can be unknown because it depends on the situation.
In the situation of interval uncertainty, the only information that we have about the actual value $\hat{x}_i^{t_1}$ of some attribute is that the value belongs to some interval, and the situation can be described by three-way data with $m_2 = 2$, where the two situations are the lower and upper bounds of the interval. If $m_2 = 1$, then the initial data are ordinary object data and can be presented as the usual matrix of attributes $\hat{X}_{n \times m_1} = [\hat{x}_i^{t_1}]$.
The aim of the work is a detailed consideration of a method of discovering the unique clustering structure which corresponds to the most natural allocation of objects among fuzzy clusters for the uncertain data set. In particular, the allocation among the most "plausible" unknown number $c$ of fuzzy clusters must be detected.
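The relation between interval-valued data and three-way data can be made concrete with a nested list indexed by object, attribute and situation. The sketch below assumes that the two situations of an interval are its lower and upper bounds; the toy values and function names are hypothetical and are not the oil data.

```python
# Hypothetical toy data: 3 objects, 2 attributes,
# each attribute value known only as an interval (lower, upper).
intervals = [
    [(0.2, 0.4), (10.0, 12.0)],
    [(0.5, 0.5), (11.0, 15.0)],
    [(0.1, 0.3), (9.0, 9.5)],
]


def to_three_way(rows):
    """Return data[i][t1][t2] with t2 = 0 (lower bound) or 1 (upper bound)."""
    return [[[low, high] for (low, high) in row] for row in rows]


def shape(poly):
    """Sizes (n, m1, m2) of the poly-matrix."""
    return (len(poly), len(poly[0]), len(poly[0][0]))


def to_ordinary_matrix(poly):
    """Collapse to an n x m1 matrix if every interval is a single point, else None."""
    if all(cell[0] == cell[1] for row in poly for cell in row):
        return [[cell[0] for cell in row] for row in poly]
    return None


poly = to_three_way(intervals)
print(shape(poly))               # n objects, m1 attributes, m2 = 2 situations
print(to_ordinary_matrix(poly))  # None, because some intervals have nonzero width
```

When every interval degenerates to a point, the third index carries no information and the data reduce to an ordinary objects-by-attributes matrix.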
So, the contents of this paper are as follows: in the second section, basic concepts of the heuristic method of possibilistic clustering, a validity measure and techniques of data preprocessing are considered; in the third section, the method of constructing the stable clustering structure is proposed; in the fourth section, an example of application of the proposed method to Ichino and Yaguchi's oil data set [11] is given; in the fifth section, some final remarks are stated.

HEURISTIC POSSIBILISTIC CLUSTERING
Basic concepts of the heuristic method of possibilistic clustering based on the concept of allotment among fuzzy clusters are considered in the first subsection. A validity measure for the D-AFC(c)-algorithm is presented in the second subsection, and techniques of data preprocessing are given in the third subsection.

Basic definitions
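Direct heuristic algorithms of possibilistic clustering can be divided into two types: prototype-based [8] versus relational [6], [7]. The concept of fuzzy tolerance is the basis for the concept of fuzzy $\alpha$-cluster. That is why the definition of fuzzy tolerance must be considered first.
Let $X = \{x_1, \dots, x_n\}$ be the initial set of elements and $T$ some binary fuzzy relation on $X$, with $\mu_T(x_i, x_j) \in [0, 1]$, $\forall x_i, x_j \in X$, being its membership function. Fuzzy tolerance is the fuzzy binary intransitive relation which possesses the symmetricity property

$$\mu_T(x_i, x_j) = \mu_T(x_j, x_i), \quad \forall x_i, x_j \in X, \qquad (4)$$

and the reflexivity property

$$\mu_T(x_i, x_i) = 1, \quad \forall x_i \in X. \qquad (5)$$

Different fuzzy tolerances were considered in [6]. However, the essence of the method considered here does not depend on the kind of fuzzy tolerance, and the basic concepts are described for any fuzzy tolerance $T$.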
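Fuzzy similarity relation $S$ is the fuzzy binary relation which possesses the symmetricity property (4), the reflexivity property (5), and the (max-min)-transitivity property:

$$\mu_S(x_i, x_k) \ge \max_{x_j \in X} \min\big(\mu_S(x_i, x_j), \mu_S(x_j, x_k)\big), \quad \forall x_i, x_k \in X. \qquad (6)$$

Let some fuzzy binary relation be represented by a matrix $R$ of size $n$ and define

$$R^1 = R, \quad R^k = R^{k-1} \circ R, \quad k = 2, \dots, n. \qquad (7)$$

The fuzzy binary relation $\tilde{R}$ is a transitive closure of $R$ if

$$\tilde{R} = R^1 \cup R^2 \cup \dots \cup R^n, \qquad (8)$$

where the operation $\cup$ for two different fuzzy relations $R_d$ and $R_g$ is defined by the expression

$$\mu_{R_d \cup R_g}(x_i, x_j) = \max\big(\mu_{R_d}(x_i, x_j), \mu_{R_g}(x_i, x_j)\big), \qquad (9)$$

and the composition of fuzzy relations $R_d$ and $R_g$ is defined in [12] as follows:

$$\mu_{R_d \circ R_g}(x_i, x_j) = \max_{x_k \in X} \min\big(\mu_{R_d}(x_i, x_k), \mu_{R_g}(x_k, x_j)\big). \qquad (10)$$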
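The construction of a transitive closure can be sketched in a few lines. The code below assumes the standard max-min composition and the maximum as the union of relations, and accumulates the union of the first $n$ powers of the relation matrix. The function names and the small tolerance matrix are hypothetical.

```python
def maxmin_compose(a, b):
    """Max-min composition of two fuzzy relations given as n x n matrices."""
    n = len(a)
    return [[max(min(a[i][k], b[k][j]) for k in range(n)) for j in range(n)]
            for i in range(n)]


def union(a, b):
    """Elementwise maximum of two fuzzy relations."""
    return [[max(x, y) for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def transitive_closure(r):
    """Union of the powers R^1, ..., R^n under max-min composition."""
    n = len(r)
    closure = [row[:] for row in r]
    power = [row[:] for row in r]
    for _ in range(n - 1):
        power = maxmin_compose(power, r)
        closure = union(closure, power)
    return closure


# Hypothetical fuzzy tolerance on 3 objects (symmetric, reflexive, not transitive).
T = [[1.0, 0.8, 0.2],
     [0.8, 1.0, 0.5],
     [0.2, 0.5, 1.0]]

for row in transitive_closure(T):
    print(row)
```

In this toy matrix the entry linking the first and third objects rises from 0.2 to 0.5, because the chain through the second object provides a stronger max-min path than the direct link.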
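The transitive closure $\tilde{T}$ of some usual fuzzy tolerance $T$ is a fuzzy similarity relation $S$.
Let $R$ denote a binary fuzzy relation with membership function $\mu_R(x_i, x_j)$. A level fuzzy relation is defined as

$$R_{(\alpha)} = \{((x_i, x_j), \mu_R(x_i, x_j)) \mid (x_i, x_j) \in R_\alpha\}, \quad \alpha \in (0, 1], \qquad (11)$$

where $R_\alpha = \{(x_i, x_j) \mid \mu_R(x_i, x_j) \ge \alpha\}$ is the $\alpha$-level of the fuzzy relation $R$. The membership function $\mu_{R_{(\alpha)}}(x_i, x_j)$ of the level fuzzy relation $R_{(\alpha)}$ for some $\alpha$ can be defined as follows:

$$\mu_{R_{(\alpha)}}(x_i, x_j) = \begin{cases} \mu_R(x_i, x_j), & \mu_R(x_i, x_j) \ge \alpha, \\ 0, & \text{otherwise}. \end{cases} \qquad (12)$$

Let $X = \{x_1, \dots, x_n\}$ be the initial set of objects. Let $T$ be a fuzzy tolerance on $X$ with $\mu_T(x_i, x_j)$ being its membership function, and let $\alpha \in (0, 1]$ be the $\alpha$-level value of $T$. The lines or columns of the matrix of $T$ are fuzzy sets $A^1, \dots, A^n$ on $X$. The $\alpha$-level fuzzy set $A^l_{(\alpha)} = \{(x_i, \mu_{A^l}(x_i)) \mid \mu_{A^l}(x_i) \ge \alpha\}$ is a fuzzy $\alpha$-cluster, and $\mu_{li}$ is the membership degree of the element $x_i \in X$ for some fuzzy cluster $A^l_{(\alpha)}$, $l \in \{1, \dots, n\}$. The value of $\alpha$ is the tolerance threshold of the elements of fuzzy clusters. The membership degree of the element $x_i \in X$ can be defined as

$$\mu_{li} = \begin{cases} \mu_{A^l}(x_i), & x_i \in A^l_\alpha, \\ 0, & \text{otherwise}, \end{cases} \qquad (13)$$

where the $\alpha$-level $A^l_\alpha = \{x_i \in X \mid \mu_{A^l}(x_i) \ge \alpha\}$ is the support of the fuzzy cluster $A^l_{(\alpha)}$. The value of the membership function of each element of the fuzzy cluster is the degree of similarity of the object to some typical object of the fuzzy cluster. Moreover, the membership degree defines a possibility distribution function for the fuzzy cluster $A^l_{(\alpha)}$, and the possibility distribution function is denoted by $\pi_l(x_i)$. Note that the number $c$ of fuzzy clusters can be equal to the number of objects, $n$.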
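The thresholding that turns a fuzzy tolerance into fuzzy clusters can be illustrated as follows. The sketch assumes that a membership degree is kept when it reaches the tolerance threshold and is set to zero otherwise, that each row of the thresholded matrix is one fuzzy cluster, and that the typical points of a cluster are its elements of maximal membership. All names and values are hypothetical.

```python
def level_relation(t, alpha):
    """Keep tolerance values that reach the threshold alpha, zero out the rest."""
    return [[x if x >= alpha else 0.0 for x in row] for row in t]


def support(cluster, alpha):
    """Indices of objects whose membership degree reaches alpha."""
    return [i for i, x in enumerate(cluster) if x >= alpha]


def typical_points(cluster, alpha):
    """Indices of objects with the maximal membership degree within the support."""
    members = support(cluster, alpha)
    best = max(cluster[i] for i in members)
    return [i for i in members if cluster[i] == best]


# Hypothetical fuzzy tolerance on 3 objects.
T = [[1.0, 0.8, 0.2],
     [0.8, 1.0, 0.5],
     [0.2, 0.5, 1.0]]

alpha = 0.5
clusters = level_relation(T, alpha)  # row l holds the degrees of cluster l
for l, cluster in enumerate(clusters):
    print(l, cluster, support(cluster, alpha), typical_points(cluster, alpha))
```

Each row yields one candidate fuzzy cluster, so a matrix of size $n$ yields $n$ candidates before any selection is made.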
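Let $T$ be a fuzzy tolerance on $X$, where $X$ is the set of elements, and let $\{A^1_{(\alpha)}, \dots, A^n_{(\alpha)}\}$ be the family of fuzzy clusters for some $\alpha$. The point $\tau_e^l \in A^l_\alpha$ for which

$$\tau_e^l = \arg\max_{x_i} \mu_{li}, \quad \forall x_i \in A^l_\alpha, \qquad (14)$$

is called a typical point of the fuzzy cluster $A^l_{(\alpha)}$. A fuzzy cluster can have several typical points, and the symbol $e$ is the index of the typical point.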
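Let $R_z^\alpha(X) = \{A^l_{(\alpha)} \mid l = 1, \dots, c, \ 2 \le c \le n\}$ be a family of fuzzy clusters for some value of tolerance threshold $\alpha$, which are generated by some fuzzy tolerance $T$ on the initial set of elements $X = \{x_1, \dots, x_n\}$. If the condition

$$\sum_{l=1}^{c} \mu_{li} > 0, \quad \forall x_i \in X, \qquad (15)$$

is met for all $A^l_{(\alpha)}$, $l = 1, \dots, c$, $c \le n$, then the family is an allotment of the elements of the set $X$ among fuzzy clusters for the given value of $\alpha$. Condition (15) requires that every object be assigned to at least one fuzzy cluster with the membership degree higher than zero. Obviously, the definition of the allotment among fuzzy clusters (15) is similar to the definition of the possibilistic partition (2).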
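The allotment $R_I^\alpha(X) = \{A^l_{(\alpha)} \mid l = 1, \dots, n\}$ of the set of objects among $n$ fuzzy clusters for some threshold $\alpha$ is the initial allotment of the set $X = \{x_1, \dots, x_n\}$. In other words, if a matrix of some level fuzzy tolerance in the sense of formulae (11) and (12) is given, then the lines or columns of the matrix are level fuzzy sets.
If some allotment corresponds to the formulation of a concrete problem, then this allotment is an adequate allotment. If the conditions

$$\bigcup_{l=1}^{c} A^l_\alpha = X \qquad (16)$$

and

$$\mathrm{card}(A^l_\alpha \cap A^m_\alpha) \le w, \quad \forall A^l_{(\alpha)}, A^m_{(\alpha)}, \ l \ne m, \qquad (17)$$

are met for all fuzzy clusters $A^l_{(\alpha)}$, $l = 1, \dots, c$, of some allotment, then the allotment is an allotment among particularly separate fuzzy clusters, where $w$ is the maximum number of elements in the intersection of different fuzzy clusters. The choice of the unique allotment from the set of adequate allotments must be made on the basis of evaluation of allotments. The criterion

$$F(R_z^\alpha(X), \alpha) = \sum_{l=1}^{c} \frac{1}{n_l} \sum_{i=1}^{n_l} \mu_{li} - \alpha \cdot c, \qquad (18)$$

where $c$ is the number of fuzzy clusters in the allotment $R_z^\alpha(X)$ and $n_l = \mathrm{card}(A^l_\alpha)$ is the number of elements in the support of the fuzzy cluster $A^l_{(\alpha)}$, can be used for evaluation of allotments. The maximum of criterion (18) corresponds to the best allotment of objects among $c$ fuzzy clusters. So, the classification problem can be characterized formally as determination of the solution $R^*(X)$ satisfying

$$R^*(X) = \arg\max_{R_z^\alpha(X) \in B(c)} F(R_z^\alpha(X), \alpha), \qquad (19)$$

where $B(c)$ is the set of adequate allotments corresponding to the formulation of the clustering problem.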
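The evaluation of allotments can be sketched numerically. The code below assumes that the criterion has the form usually given in the heuristic possibilistic clustering literature: the sum over clusters of the mean membership degree within the cluster support, minus the tolerance threshold multiplied by the number of clusters. That form, the function names and the toy allotments are assumptions of this example.

```python
def allotment_criterion(clusters, alpha):
    """Sum of mean in-support memberships minus alpha times the number of clusters."""
    total = 0.0
    for mu in clusters:
        in_support = [x for x in mu if x >= alpha]
        if not in_support:
            raise ValueError("every fuzzy cluster needs a non-empty support")
        total += sum(in_support) / len(in_support)
    return total - alpha * len(clusters)


def best_allotment(candidates, alpha):
    """Pick the candidate allotment with the largest criterion value."""
    return max(candidates, key=lambda clusters: allotment_criterion(clusters, alpha))


# Two hypothetical allotments of 3 objects among 2 fuzzy clusters, alpha = 0.5.
alpha = 0.5
first = [[1.0, 0.8, 0.0], [0.0, 0.5, 1.0]]
second = [[1.0, 0.8, 0.0], [0.0, 0.0, 1.0]]

print(allotment_criterion(first, alpha))
print(allotment_criterion(second, alpha))
print(best_allotment([first, second], alpha) is second)
```

Under this form of the criterion, compact clusters whose members are all close to the typical point score higher than clusters that admit borderline members.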
Condition (19) must be met for some unique allotment $R^*(X)$. Otherwise, the number $c$ of fuzzy clusters in the allotment sought $R^*(X)$ is suboptimal. Detection of a fixed number $c$ of particularly separate fuzzy clusters can be considered as the aim of classification. A general plan of the relational D-AFC(c)-algorithm is given, for example, in [6], [7] and [9].
On the other hand, detection of the unique allotment $R^*(X)$ among an unknown number $c$ of fully separated fuzzy clusters is the matter of the prototype-based D-AFC-TC-algorithm. The transitive closure $\tilde{T}$ of some usual fuzzy tolerance $T$ is constructed according to formulae (7)-(10). The fuzzy relation $\tilde{T}$ possesses properties (4), (5) and (6). The transitive closure is used in the clustering procedure together with the idea of a leap in the ordered sequence of threshold values, and a distance for fuzzy sets [12] is a parameter of the D-AFC-TC-algorithm. A plan of the D-AFC-TC-algorithm is presented in [8]. The allotment $R^*(X)$ among the unknown number $c$ of fully separated fuzzy clusters, the corresponding value of the tolerance threshold $\alpha$ and the normalized coordinates of the prototypes of fuzzy clusters are the results of classification.

A validity measure
The quadratic measure of fuzziness of the allotment was defined in [13] by formula (20), which is based on equation (21) and on an auxiliary quantity entering that equation. Using the validity quadratic measure (20), the optimal number of fuzzy clusters can be obtained by maximizing the index value.

Notes on the data preprocessing
The D-AFC(c)-algorithm can be applied directly to data given as a matrix of fuzzy tolerance $T = [\mu_T(x_i, x_j)]$, $i, j = 1, \dots, n$. This means that it can be used with objects-by-attributes data by choosing a suitable metric to measure similarity. The three-way data can be normalized according to formula (23) [9]. So, in the case of three-way data each object can be considered as a type-two fuzzy set, and the normalized attribute values are its membership functions.
Dissimilarity coefficients between the objects can be constructed on the basis of generalizations of distances between fuzzy sets [9], and these generalizations take into account dissimilarities between object attributes as well as between attribute situations. In particular, a generalization of the squared normalized Euclidean distance for type-two fuzzy sets is described by expression (24). For ordinary object data, formula (24) reduces to the usual squared normalized Euclidean distance [12]:

$$d(x_i, x_j) = \frac{1}{m_1} \sum_{t_1=1}^{m_1} \big(\mu_{x_i}(x^{t_1}) - \mu_{x_j}(x^{t_1})\big)^2, \quad i, j = 1, \dots, n. \qquad (25)$$

The matrix of fuzzy tolerance $T = [\mu_T(x_i, x_j)]$ can be obtained by applying the complement operation

$$\mu_T(x_i, x_j) = 1 - \mu_I(x_i, x_j), \quad \forall i, j = 1, \dots, n, \qquad (26)$$

to the matrix of fuzzy intolerance $I = [\mu_I(x_i, x_j)]$. So, formula (24) can be rewritten in the form (27) for all $i, j = 1, \dots, n$.
Different distances and similarity measures for interval-valued fuzzy sets were proposed in other publications [14], [15]. For example, a similarity measure (28) was defined by Ju and Yuan in [14]. On the other hand, the normalized Euclidean distance (29) between interval-valued fuzzy sets, based on the Hausdorff metric, was defined by Grzegorzewski in [15].
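For ordinary objects-by-attributes data, the passage from normalized attribute values to a matrix of fuzzy tolerance can be sketched as below. The code assumes that the data are already normalized to the unit interval, that the squared normalized Euclidean distance is the mean squared difference over attributes, and that tolerance is the complement of that distance. The function names and toy data are hypothetical, and the interval-valued generalizations mentioned in the text are not covered.

```python
def squared_normalized_euclidean(a, b):
    """Mean squared difference between two vectors of membership values in [0, 1]."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def tolerance_matrix(data):
    """Complement of the pairwise distance: 1 means identical, 0 means opposite."""
    return [[1.0 - squared_normalized_euclidean(a, b) for b in data] for a in data]


# Hypothetical normalized data: 3 objects, 2 attributes.
data = [[0.0, 0.0],
        [1.0, 1.0],
        [0.5, 0.5]]

T = tolerance_matrix(data)
for row in T:
    print(row)
```

The resulting matrix is symmetric with ones on the diagonal, so it satisfies the symmetricity and reflexivity properties required of a fuzzy tolerance.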

THE PROPOSED METHOD
A procedure for constructing the set of values of the most possible number of fuzzy clusters in some sought structure is described in the first subsection. The second subsection presents the technique of constructing the stable clustering structure. Let us recall the concept of fuzzy number, which is useful for constructing the set of values of the most possible number of fuzzy clusters [16].
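Let $L$ and $R$ be decreasing shape functions from $\mathbb{R}^+$ to $[0, 1]$ with $L(0) = R(0) = 1$. A fuzzy number $V$ is of LR-type if there exist $m$, $a > 0$ and $b > 0$ such that

$$\mu_V(x) = \begin{cases} L\left(\dfrac{m - x}{a}\right), & x \le m, \\ R\left(\dfrac{x - m}{b}\right), & x \ge m, \end{cases} \qquad (30)$$

where $m$ is called the mean value of $V$, and $a$ and $b$ are called the left and right spreads. Among LR-type fuzzy numbers, the triangular and Gaussian fuzzy numbers are most commonly used.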
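A triangular fuzzy number is the simplest LR-type case and can be written directly. The sketch below assumes linear shape functions, so the membership degree falls from one at the mean value to zero at a distance of one spread on either side. The function name and the numeric values are hypothetical.

```python
def triangular(x, m, a, b):
    """Membership degree of x in a triangular fuzzy number.

    m is the mean value, a > 0 the left spread and b > 0 the right spread.
    Both shape functions are taken to be linear: max(0, 1 - z).
    """
    if x <= m:
        return max(0.0, 1.0 - (m - x) / a)
    return max(0.0, 1.0 - (x - m) / b)


# Hypothetical fuzzy number "about 3 clusters" with spreads 1 and 2.
for x in (2.0, 2.5, 3.0, 4.0, 6.0):
    print(x, triangular(x, m=3.0, a=1.0, b=2.0))
```

Unequal spreads make the number asymmetric, which allows a larger number of clusters to remain more possible than a smaller one, or the reverse.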
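The concept of S-norm is also important for the consideration. If $A$ and $B$ are fuzzy sets in a universe $U$ and $\mu_A(u)$, $\mu_B(u)$ are their membership functions, then their union can be defined by an S-norm, for example by the maximum operation

$$\mu_{A \cup B}(u) = \max(\mu_A(u), \mu_B(u)), \quad \forall u \in U. \qquad (31)$$

In the procedure for constructing the set of values of the most possible number of fuzzy clusters, a fuzzy set $D$ is built according to a selected S-norm, and the final, fifth step constructs the $\alpha$-level fuzzy set for the fuzzy set $D$.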
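The last two steps of such a procedure can be illustrated with fuzzy sets over candidate numbers of clusters. The sketch assumes the maximum operation as the S-norm and represents each fuzzy set as a dictionary from a candidate number of clusters to its possibility degree. The names and values are hypothetical and do not reproduce the steps of the procedure that precede the union.

```python
def fuzzy_union_max(a, b):
    """Union of two fuzzy sets under the maximum S-norm."""
    keys = set(a) | set(b)
    return {k: max(a.get(k, 0.0), b.get(k, 0.0)) for k in keys}


def alpha_level(d, alpha):
    """Sorted elements of the universe whose degree in d reaches alpha."""
    return sorted(k for k, degree in d.items() if degree >= alpha)


# Hypothetical possibility degrees of the number of clusters from two sources.
A = {2: 1.0, 3: 0.5}
B = {3: 0.7, 4: 0.2}

D = fuzzy_union_max(A, B)
print(sorted(D.items()))
print(alpha_level(D, 0.6))
```

A different S-norm, such as the probabilistic sum, would give different degrees in the union and therefore possibly a different level set for the same threshold.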

Constructing the stable clustering structure
A technique of constructing the stable clustering structure for the uncertain data set can be considered as a two-step process, where the set $D_\alpha$ of values of the most possible number of fuzzy clusters is a preliminary result of classification. The allotment $R^*(X)$ among an a priori unknown number of fuzzy clusters can be considered as the sought clustering structure. So, the technique of constructing the allotment among fuzzy clusters for the uncertain data set can be summarized as follows:
1. The initial data are contained in the poly-matrix of attributes, and the procedure of constructing the set of values of the most possible number of fuzzy clusters should be applied to the data set;
2. The matrix of tolerance coefficients $T = [\mu_T(x_i, x_j)]$ can be constructed from the normalized initial data by choosing a suitable distance for type-two or interval-valued fuzzy sets;
3. The D-AFC(c)-algorithm using some cluster validity index can be applied directly to the matrix of tolerance coefficients.
The proposed technique can be generalized very simply to the case of the fuzzy $c$-partition (1). An application of the proposed technique to a classification problem is illustrated on an interval-valued data example in the next section.

AN ILLUSTRATIVE EXAMPLE
Ichino and Yaguchi's interval-valued oil data set is described in the first subsection, and the results of the data processing are presented in the second subsection.

The oil data set
Let us consider the set of interval-valued data [11] which is presented in Table 1. The data set consists of the specific gravity, iodine value, and saponification value measured for 8 types of oils. The analysis of the types of oils in Table 1 shows that the first six oils are vegetable and the remaining two are animal. That is why we expect to find two clusters in this data set.
The first class is formed by 1 element and the second class includes 7 elements. So, five misclassifications are present in the resulting allotment $R^*(X)$. For comparison with the relational ARCA-algorithm of fuzzy clustering [3], we observed that the minimal value of the compactness and separation index [4] corresponds to three fuzzy clusters and the maximal value of the partition coefficient [1] corresponds to four fuzzy clusters in the fuzzy $c$-partition. So, the presented results of the proposed technique of constructing the stable clustering structure seem to be appropriate.

FINAL REMARKS
Preliminary conclusions are discussed in the first subsection. The second subsection deals with perspectives on future investigations.

Conclusions
A technique of constructing the stable clustering structure for an uncertain data set is proposed in the paper. The results of applying the proposed technique to the oil data set show that the technique is an effective tool for solving the classification problem under uncertainty of the initial data.
Two heuristic algorithms of possibilistic clustering are the basis of the proposed technique. However, the results obtained from these algorithms depend on the selection of the dissimilarity measure and the method of normalizing the initial data. Moreover, the set of values of the most possible number of fuzzy clusters in the sought clustering structure depends on the type of the selected S-norm. So, we can conclude that relying on a single dissimilarity measure may cast serious doubt on the results. A reasonable way is to use various dissimilarity measures and compare the obtained clustering results.
The allocation of objects among the a priori unknown number of fuzzy clusters, which is the result of application of the proposed technique to the initial uncertain data set,


Here $\mu_R(x_i, x_j)$ is the membership function of the fuzzy relation $R$. The level fuzzy sets are fuzzy clusters. Membership functions of these fuzzy clusters are defined by formula (13) for some value $\alpha \in (0, 1]$, and the fuzzy clusters can be considered as clustering components.
The allotment which corresponds to the result of classification for the given number $c$ of fuzzy clusters is the result of classification obtained from the D-AFC(c)-algorithm.

The initial data were normalized according to formula (23). The procedure for constructing the set of values of the most possible number of fuzzy clusters was applied to the normalized data using the squared normalized Euclidean distance (25) for the D-AFC-TC-algorithm and the maximum operation (31) as the fuzzy union. So, the set of values of the most possible number of fuzzy clusters in the sought allotment and the corresponding possibility degrees are shown in Fig. 1.

Fig. 1. Possibility degrees constructed according to the maximum operation (the set $D_\alpha$ and the corresponding possibility degrees)

Fig. 2. Plot of the quadratic measure of fuzziness of the allotment as a function of the number of clusters using formula (27)

Membership functions of two classes of the allotment $R^*(X)$ are presented in Fig. 3.

Fig. 3. Membership functions of two classes obtained from the D-AFC(c)-algorithm using formula (27)

Fig. 6. Membership functions of two classes obtained from the D-AFC(c)-algorithm using formula (29)

Table 1. Ichino and Yaguchi's oil data set