Modified Structural and Attribute Clustering Algorithm for Improving Cluster Quality in Data Mining: a Quality Oriented Approach

Data mining is needed because of the explosive growth of data from terabytes to petabytes. Data mining preprocessing aims to produce quality mining results in descriptive and predictive analysis. The quality of a clustering result depends on both the similarity measure used by the method and its implementation. A straightforward way to combine structural and attribute similarities is to use a weighted distance function. Clustering results are obtained based on attribute similarities, and the clusters balance the attribute and structural similarities. The existing Structural and Attribute Cluster (SAC) algorithm is analyzed and a modified algorithm is proposed. The two algorithms are compared and the results are analyzed. It is found that the modified algorithm gives better quality clusters.


INTRODUCTION
Data mining: Data mining is needed because of the explosive growth of data from terabytes to petabytes. As databases grow larger, decision-making from the data becomes too complicated, so data mining is needed to derive knowledge from the stored data. Data mining is also called Knowledge Discovery (mining) in Databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc. Data mining preprocessing aims to produce quality mining results in descriptive and predictive analysis (Jeyabalaraja and Edwin Prabakaran, 2012). Data mining techniques are used in different applications to analyze and predict data for decision support systems. Data mining refers to extracting or mining knowledge from large amounts of data.
Cluster quality: A good clustering method will produce high quality clusters with high intra-class similarity and low inter-class similarity. The quality of a clustering result depends on both the similarity measure used by the method and its implementation. The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns (Han et al., 2011).
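The notion of high intra-class similarity and low inter-class similarity can be illustrated with a minimal sketch. It assumes Euclidean distance as the dissimilarity measure; the function name `mean_intra_inter` and the sample points are hypothetical and are not taken from the study.

```python
from itertools import combinations
from math import dist


def mean_intra_inter(points, labels):
    """Return (mean intra-cluster distance, mean inter-cluster distance).

    Euclidean distance is an assumption made for illustration only.
    """
    intra, inter = [], []
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        # Pairs with the same label contribute to the intra-cluster mean,
        # all other pairs to the inter-cluster mean.
        (intra if lp == lq else inter).append(dist(p, q))

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return mean(intra), mean(inter)


# Hypothetical toy data: two tight, well separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels = [0, 0, 1, 1]
intra, inter = mean_intra_inter(points, labels)
```

Under this reading, a good clustering is one where `intra` is small relative to `inter`.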
Structural similarity: Clustering that considers only the structure of the graph, so that the outcome is based on vertex connectivity, is said to rely on structural similarity.
Similarity is expressed in terms of a distance function. A straightforward way to combine structural and attribute similarities is to use a weighted distance function. Distances are normally used to measure the similarity or dissimilarity between two data objects.
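A weighted distance function of this kind can be sketched as follows. The convex combination with a single weight `alpha` is an assumption made for illustration; the text does not fix the exact weighting scheme, and the name `weighted_distance` is hypothetical.

```python
def weighted_distance(d_struct, d_attr, alpha=0.5):
    """Combine a structural and an attribute distance.

    alpha weights the structural part and (1 - alpha) the attribute part.
    This convex combination is an illustrative assumption.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * d_struct + (1.0 - alpha) * d_attr


# Equal weighting of a structural distance of 2.0 and an attribute distance of 4.0.
combined = weighted_distance(2.0, 4.0, alpha=0.5)
```

The difficulty with this approach is choosing `alpha`, which is one motivation for a unified distance measure.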

Attribute similarity: Clustering results are obtained based on attribute similarities. The clusters balance the attribute and structural similarities. Vertex distances and similarities are measured by the random walk principle. A unified framework based on the neighbourhood random walk is used to integrate structural and attribute similarities.
The goal is to partition the attributed graph into k clusters whose members have similar attributes within each cluster. This partitioning is complicated because attribute and structural similarities are independent. In this study, a dataset of scores obtained by university students in two subjects is considered. Each pair of scores represents a vertex of the graph. The techniques adopted in this study are: (1) propose a unified neighbourhood random walk distance measure to combine attribute and structural similarities; (2) give theoretical methods to strengthen the representation of attribute similarity in the unified neighbourhood random walk distances used to study the closeness of the vertices; (3) apply a weight self-adjustment method to analyze the degree of contribution of the attributes to the random walk distances; (4) perform suitable experiments using the designed clustering algorithm.

LITERATURE REVIEW
Graph clustering techniques have been analyzed in many directions and have primarily concentrated on topological structures. Strehl and Ghosh (2002) have studied ensemble analysis, which improves classification accuracy and the general quality of the cluster solution. They have also discussed the availability of multiple segmentation solutions within an ensemble; the method is a meta-clustering algorithm based on the notion of "clustering on clusters". Pons and Latapy (2006) have proposed short random walks of a fixed length to measure the similarity between two vertices in a graph for community detection. Tsai and Chui (2008) have developed a feature weight self-adjustment mechanism for k-means clustering on relational datasets. Here, an optimization model is designed to find feature weights for which the separations within clusters are minimized and those between clusters are maximized. Orme and Johnson (2008) have discussed ensemble analysis for improving k-means cluster analysis, and the methods have been described with the help of numerical illustrations. Zhou et al. (2009) have proposed a graph clustering algorithm based on both structural and attribute similarities and estimated the effectiveness of SA cluster as compared with three other clustering methods through experimental analysis. Rai and Singh (2005) have summarized and described the types of clusters and different clustering methods. Zanghi et al. (2009) have adopted a generative process and proposed a probabilistic model to cluster attributed graphs. Tajunisha and Saravanan (2011) have proposed a method to find initial centroids for k-means and have used a similarity measure to find the informative genes. The goal of their clustering approach is to perform better cluster discovery on samples with informative genes. Cheng et al. (2011) have studied graph clustering using unified random walk distance measures. A comparative analysis of clusters and their efficiencies has been carried out.
Ghasemi et al. (2013) have presented a new procedure to find functional modules in PPI networks. The main idea is to model a biological concept and to use this concept for finding good functional modules in PPI networks. In order to evaluate the quality of the obtained clusters, the results of the proposed algorithm are compared with those of some other widely used clustering algorithms. Wang and Davidson (2010) have proposed a natural extension of unconstrained spectral clustering that is interpreted as finding the normalized min-cut of a labeled graph. The effectiveness of this approach has been validated by empirical results on real-world data sets, with applications to constrained image segmentation and clustering benchmark data sets with both binary and degree-of-belief constraints. Jayabrabu et al. (2012) have analyzed the quality of the formulated clusters based on quality parameters by using data mining agents. Clustering algorithms produce clusters based on the given input data. However, not all of the resulting clusters are good clusters.

METHODOLOGY
Distance measures: A distance measure between two objects $O_1$ and $O_2$ in a universe of objects, denoted $d(O_1, O_2)$, is always a non-negative real number. Distance measures are used to obtain the similarity or dissimilarity between any pair of objects. In general, distance measures are defined for numeric attributes (Minkowski metric) (Han et al., 2011), binary attributes, nominal attributes, ordinal attributes and mixed type attributes.
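For numeric attributes, the Minkowski metric mentioned above can be sketched as follows. The function name `minkowski` and the sample score pairs are hypothetical; p = 1 gives the Manhattan distance and p = 2 the Euclidean distance.

```python
def minkowski(x, y, p=2):
    """Minkowski distance of order p between two numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if p < 1:
        raise ValueError("p must be at least 1 for a valid metric")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


# Hypothetical pair of (Subject I, Subject II) scores.
euclidean = minkowski((60, 70), (63, 74), p=2)
manhattan = minkowski((60, 70), (63, 74), p=1)
```

The result is always non-negative, in line with the definition of a distance measure.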
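Unified neighbourhood random walk distance: Let $P$ be the transition probability matrix of the augmented graph $G$, formed by using the transition probabilities from vertex $v_i$ to vertex $v_j$ through attribute and structure edges.
Given the length of the random walk as $l$ and the probability of restart $c \in (0, 1)$, the unified neighbourhood random walk distance $d(v_i, v_j)$ from $v_i$ to $v_j$ in $G$ is defined as:

$$d(v_i, v_j) = \sum_{\tau:\, v_i \rightsquigarrow v_j,\ \mathrm{length}(\tau) \le l} p(\tau)\, c\, (1 - c)^{\mathrm{length}(\tau)} \qquad (1)$$

where $\tau$ is the path from $v_i$ to $v_j$ whose length is denoted as $\mathrm{length}(\tau)$, with transition probability $p(\tau)$. Eq. (1) can be written in matrix form as:

$$R = \sum_{\gamma = 1}^{l} c\, (1 - c)^{\gamma} P^{\gamma} \qquad (2)$$

Here, $R$ is the neighbourhood random walk distance matrix.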
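The matrix form of the neighbourhood random walk distance can be sketched in plain Python. The sketch assumes the form R = sum over gamma = 1..l of c(1 - c)^gamma P^gamma, following Zhou et al. (2009); the helper names `mat_mul` and `random_walk_distance` and the two-vertex transition matrix are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def random_walk_distance(P, length, c):
    """Accumulate c * (1 - c)**gamma * P**gamma for gamma = 1..length.

    P is a row-stochastic transition matrix, length the walk length limit
    and c the restart probability; the formula itself is an assumption
    taken from Zhou et al. (2009).
    """
    n = len(P)
    R = [[0.0] * n for _ in range(n)]
    power = [row[:] for row in P]  # holds P**gamma, starting at gamma = 1
    for gamma in range(1, length + 1):
        coeff = c * (1.0 - c) ** gamma
        for i in range(n):
            for j in range(n):
                R[i][j] += coeff * power[i][j]
        power = mat_mul(power, P)
    return R


# Hypothetical two-vertex graph in which each vertex only links to the other.
P = [[0.0, 1.0], [1.0, 0.0]]
R = random_walk_distance(P, length=2, c=0.5)
```

For this toy graph the one-step walk contributes 0.25 to the off-diagonal entries and the two-step walk contributes 0.125 to the diagonal entries.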
Clustering process: The clustering process separates the data into clusters of objects with the same or different characteristics. Selecting good initial centroids is more effective than selecting initial centroids at random.
In order to select the centroids, the density function of a vertex is defined.
The density function of a vertex $v_i$ is the sum of the influence functions of $v_i$ on all vertices in $V$.
The influence function is stated as:

$$f_B^{v_j}(v_i) = 1 - e^{-\frac{d(v_i, v_j)^2}{2\sigma^2}} \qquad (3)$$

Hence, the density function is written as:

$$f_B^{D}(v_i) = \sum_{v_j \in V} \left(1 - e^{-\frac{d(v_i, v_j)^2}{2\sigma^2}}\right) \qquad (4)$$

It is noted that the influence of $v_i$ on $v_j$ increases with the random walk distance from $v_i$ to $v_j$: a larger random walk distance gives more influence. If $v_i$ has a large density value, then $v_i$ connects to many vertices.
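The density-based choice of initial centroids can be sketched as follows. The sketch assumes the influence function has the form 1 - exp(-d^2 / (2 sigma^2)) as in Zhou et al. (2009); the function names and the small distance matrix are hypothetical.

```python
from math import exp


def influence(d, sigma):
    """Influence for a random walk distance d (assumed Gaussian-style form)."""
    return 1.0 - exp(-(d * d) / (2.0 * sigma * sigma))


def densities(R, sigma):
    """Density of each vertex: the sum of its influences on all vertices."""
    return [sum(influence(d, sigma) for d in row) for row in R]


def top_k_by_density(R, sigma, k):
    """Indices of the k vertices with the highest density, highest first."""
    dens = densities(R, sigma)
    return sorted(range(len(dens)), key=lambda i: dens[i], reverse=True)[:k]


# Hypothetical random walk distance matrix for three vertices.
R = [[0.0, 0.9, 0.8],
     [0.9, 0.0, 0.1],
     [0.8, 0.1, 0.0]]
initial_centroids = top_k_by_density(R, sigma=1.0, k=1)
```

Vertex 0 has the largest random walk distances to the other vertices, so it receives the highest density and is picked first.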
By using the density functions given in (4), the vertices are arranged in descending order of their densities and the top $k$ vertices are selected as the initial centroids $c_1^0, c_2^0, \ldots, c_k^0$. After a large number of iterations are performed, the centroids in the $t$-th iteration are $c_1^t, c_2^t, \ldots, c_k^t$. Let $\omega_0$ be the initial weight of the structure edges and $\omega_1, \omega_2, \ldots, \omega_m$ the initial weights of the attribute edges, which are relative to $\omega_0$. Assume $\omega_0 = 1.0$ and $\omega_1 = \omega_2 = \cdots = \omega_m = 1.5$. Let $W^t = \{\omega_1^t, \omega_2^t, \ldots, \omega_m^t\}$ be the attribute weights in the $t$-th iteration. The increment $\Delta\omega_i^t$ is the weight update of attribute $a_i$ between the $t$-th and $(t+1)$-th iterations. The weight of $a_i$ in the $(t+1)$-th iteration is defined as the average of its weight in the $t$-th iteration and its increment. That is:

$$\omega_i^{t+1} = \frac{1}{2}\left(\omega_i^t + \Delta\omega_i^t\right) \qquad (5)$$

The expression for the increment of the weight in the $t$-th iteration is then substituted into (5) to obtain the updated weight.
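The averaging rule in (5) can be sketched as follows. The increments are passed in as given values because their expression is not reproduced here; the function name `update_weights` and the sample increments are hypothetical.

```python
def update_weights(weights, increments):
    """Apply (5): the new weight is the mean of the old weight and its increment."""
    if len(weights) != len(increments):
        raise ValueError("one increment is required per attribute weight")
    return [0.5 * (w + dw) for w, dw in zip(weights, increments)]


# Two attributes starting at the initial weight 1.5, with hypothetical increments.
new_weights = update_weights([1.5, 1.5], [2.0, 1.0])
```

An attribute whose increment exceeds its current weight gains weight in the next iteration, and one whose increment is smaller loses weight.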

EXISTING STRUCTURAL AND ATTRIBUTE CLUSTER APPROACH
The following algorithm is used to evaluate cluster centroids and adjusted weights for different iterations. This clustering result balances the structural and attribute similarities.

Algorithm: Attributed graph clustering by Structural and Attribute (SA) Cluster.
Input: An attributed graph $G$, a length limit $l$ of random walk paths, a restart probability $c$, a parameter $\sigma$ of the influence function and the cluster number $k$.
Step 1: Set the initial attribute weights equal, i.e., $\omega_1 = \cdots = \omega_m = 1.5$, and fix $\omega_0 = 1.0$.
Step 2: Measure the unified random walk distances between the vertices and construct the unified random walk distance matrix $R$.
Step 3: Calculate the density functions of the vertices and choose the vertices with the highest density values, which is the key to selecting the initial centroids. By doing this, the initial centroids are obtained.
Step 4: Perform a large number of iterations; the centroids in the $t$-th iteration are obtained as $c_1^t, c_2^t, \ldots, c_k^t$.
Step 5: Assign each vertex $v_i$ to its closest centroid, i.e., the centroid with the largest random walk distance from $v_i$.
Step 6: Using the random walk distance vectors and their average, update each cluster centroid with the most centrally situated vertex in the cluster.
Step 7: Update the weight of the $i$-th attribute in each iteration.
Step 8: Recalculate the unified random walk distances between the vertices and reconstruct the unified random walk distance matrix $R$.
Last step: The required clusters are evaluated as $V_1, \ldots, V_k$.
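Steps 5 and 6 of the algorithm above can be sketched as follows, given a precomputed random walk distance matrix. Two points are assumptions of this sketch rather than statements of the original method: a centroid always stays in its own cluster, and the "most centrally situated" vertex is taken to be the member whose distance vector is closest, in squared Euclidean terms, to the cluster's average distance vector. The function names and the matrix are hypothetical.

```python
def assign_to_centroids(R, centroids):
    """Step 5: give each vertex to the centroid with the largest R[v][c]."""
    clusters = {c: [] for c in centroids}
    for v in range(len(R)):
        # Assumption: a centroid is always a member of its own cluster.
        best = v if v in clusters else max(centroids, key=lambda c: R[v][c])
        clusters[best].append(v)
    return clusters


def update_centroid(R, members):
    """Step 6: pick the member closest to the cluster's average distance vector."""
    n = len(R)
    avg = [sum(R[v][j] for v in members) / len(members) for j in range(n)]
    return min(members,
               key=lambda v: sum((R[v][j] - avg[j]) ** 2 for j in range(n)))


# Hypothetical distance matrix with two obvious groups: {0, 1} and {2, 3}.
R = [[0.0, 0.6, 0.1, 0.1],
     [0.6, 0.0, 0.1, 0.1],
     [0.1, 0.1, 0.0, 0.6],
     [0.1, 0.1, 0.6, 0.0]]
clusters = assign_to_centroids(R, centroids=[0, 2])
new_centroids = [update_centroid(R, members) for members in clusters.values()]
```

In a full run these two steps would alternate with the weight update and the recalculation of the distance matrix until the clusters stop changing.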
Modified SA cluster: The Modified Structural and Attribute Clustering (MSAC) algorithm is an advancement of the SA clustering technique. SAC is used only for graph based clustering, whereas the proposed MSAC can be used for any type of data. The input for MSAC, i.e., $C_1, C_2, \ldots, C_k$, is the set of clusters generated by the DBSCAN algorithm. In step 1, the values of the structural and attribute similarity are calculated for the clusters $C_1, \ldots, C_k$. In step 2, centroids are chosen at random for each cluster with a low similarity and attribute value. In step 3, the members with a high similarity and attribute value are reallocated to another cluster so as to maintain the low value. This process is repeated for all the cluster members of $C_1, \ldots, C_k$. In step 6, the members of each cluster are checked for the distance between the centroid and the member with the lowest similarity and attribute value in that cluster.
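Since MSAC takes its input clusters, and therefore the value of k, from DBSCAN, a minimal sketch of the standard DBSCAN procedure may help. This is a textbook version written in plain Python, not the authors' implementation; the parameters `eps` and `min_pts` and the sample score pairs are hypothetical.

```python
from math import dist


def dbscan(points, eps, min_pts):
    """Return one label per point: 0..k-1 for clusters, -1 for noise."""
    noise = -1
    labels = [None] * len(points)  # None marks a point not yet visited

    def neighbours(i):
        # All points within eps of point i, including i itself.
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = noise  # may later be relabelled as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point, keep expanding
        cluster += 1
    return labels


# Hypothetical (Subject I, Subject II) score pairs.
scores = [(50, 52), (51, 53), (52, 51), (80, 82), (81, 80), (82, 83), (10, 95)]
labels = dbscan(scores, eps=3.0, min_pts=2)
k = len(set(labels) - {-1})  # number of clusters handed on to MSAC
```

Here the number of clusters is a by-product of the density parameters rather than a value supplied by the user.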

RESULTS AND DISCUSSION
As stated in the Introduction, 50 pairs of scores obtained by students in two mutually dependent subjects are considered. Each pair of scores is considered as a vertex of a graph. After fixing the vertices and their edges in the two dimensional graph, the adjacency matrix, the transition probability matrix and the neighbourhood random walk distance matrix are found by using the Structural and Attribute Clustering (SAC) algorithm. Subsequently, the distance matrix is applied in the influence functions, which give the density functions (Densities). These results are given in the Appendix. Finally, the vertices are grouped into several clusters. It is observed that the vertices are grouped into 5 clusters (Fig. 1) by the SAC algorithm.
Alternatively, the number of clusters can be found easily. When the Modified Structural and Attribute Clustering (MSAC) algorithm is adopted to classify the vertices into clusters, DBSCAN plays the main role of giving the number of clusters of the vertices. After the number of clusters is obtained, the adjacency matrix and the subsequent measures up to the density functions are computed. These results exhibit 6 clusters (Fig. 2). On comparing both algorithms, MSAC is better than SAC, since MSAC identified more clusters, and clusters of better quality, than SAC.

CONCLUSION
The results for the modified structural and attribute clustering algorithm show that the clusters are of good quality when compared with the existing SA cluster. In the SAC algorithm, the k value is given by the user, but in the MSAC algorithm, the k value is calculated by DBSCAN. It is concluded that MSAC is more effective than the existing algorithm.

Densities (50): 14.79778, 30.98239, 30.99801, 26.99986, 16.8756, . . . , 30.99269, 23.935, 11.50352, 0, 0.

Fig. 2: Clustering by MSAC algorithm
Structural and Attribute Cluster (SAC) algorithm: considers both structural and attribute similarities. The quality clusters $Q_1, \ldots, Q_k$ are obtained finally.
Algorithm Modified SA Cluster. Input: Set of clusters $C_1, \ldots, C_k$ from DBSCAN.
Database used: A student database was used for the existing SAC algorithm and the proposed Modified SAC algorithm. The student details table consists of 2 attributes and 5547 rows. The 2 attributes are the scores of Subject I and Subject II.