Study of the parameters of the shared nearest neighbour algorithm for clustering documents

Document clustering is one way of automatically managing documents, extracting document topics and quickly filtering information. Preprocessing of the documents by text mining consists of: keyword extraction using Rapid Automatic Keyphrase Extraction (RAKE) and representing each document as a concept vector using Latent Semantic Analysis (LSA). The clustering process is then carried out on the preprocessed documents, so that documents with similar topics end up in the same cluster. The Shared Nearest Neighbour (SNN) algorithm is a clustering method based on the number of shared nearest neighbours. The parameters of the SNN algorithm are: k, the number of nearest neighbour documents; ε, the number of shared nearest neighbour documents; and MinT, the minimum number of similar documents that can form a cluster. The SNN algorithm is characterised by this shared-neighbour property. Each cluster is formed by keywords that are shared by its documents, and the SNN algorithm allows a cluster to be built from more than one keyword if the frequency of those keywords in the documents is also high. The choice of parameter values for the SNN algorithm affects the document clustering results. A higher value of k increases the number of neighbour documents of each document, so the similarity among neighbouring documents is lower and the accuracy of each cluster is also low. A higher value of ε causes each document to retain only neighbour documents with a high similarity when building a cluster, which also leaves more documents unclustered (noise). A higher MinT value decreases the number of clusters, since a set of similar documents smaller than MinT cannot form a cluster. The parameters of the SNN algorithm therefore determine the quality of the clustering result and the amount of noise (unclustered documents).
The silhouette coefficient shows almost the same result, above 0.9, in many experiments, which means that the SNN algorithm works well with different parameter values.


Introduction
Document clustering aims to recognise clusters or groups of documents that have the same attribute values [1]. One of the goals of document clustering is to help search document data quickly and accurately. The development of information technology requires a tool that can present information in accordance with user requests.
Text mining is part of document clustering and is the indexing process for searching document databases. The text-mining process includes document keyword extraction and concept extraction. In this paper, document keywords are extracted using the Rapid Automatic Keyphrase Extraction (RAKE) method [2]. The RAKE method considers word associations by computing a co-occurrence matrix of one word against another. The matrix is used to score each keyword candidate, and the candidates are then ranked [3]. Keyword candidates are taken with an individual, document-based approach that does not rely on the whole document collection, which speeds up computation. Keyword extraction with RAKE has five main stages: extracting keyword candidates, calculating the co-occurrence matrix, calculating the ratio values, calculating the basic feature values, and selecting the keywords with the highest feature values. Concepts are extracted using the Latent Semantic Analysis (LSA) method. Concept-vector formation with LSA has two stages: constructing a tf-idf weighted matrix and applying the Singular Value Decomposition (SVD) method to that matrix. LSA is a method for finding links and similarities between documents, fragments and the words that appear in documents [4]. The tf-idf weighting describes how important a term is in a document, and SVD is used to reduce the matrix dimension [5]; the SVD decomposition yields the document concept vectors. A clustering algorithm must deal with how to construct clusters of comparable size and density, how to cope with noise, and how to determine the number of clusters, so that the members of a cluster are much more similar to each other than to members of different clusters.
A general constraint of clustering is that when the data set is high-dimensional, the accuracy of the clustering results decreases, because similarity values between data objects become more uniform, making the clustering process harder [6]. Clustering with the SNN algorithm overcomes this by first determining the nearest neighbours of every data object and then deriving new similarity values between objects from the number of neighbours they share. The disadvantage of the SNN algorithm is that parameter values must be specified to obtain the desired clustering [7]. A study by Adriano Moreira, Maribel Y. Santos and Sofia Carneiro showed that, for certain datasets, SNN clustering gives better results than the k-means algorithm [8]. The SNN algorithm requires three input parameters as minimum thresholds to build clusters: k, the number of nearest neighbours; ε, the density threshold (number of shared neighbours); and MinT, the minimum number of cluster members. After the input parameters are specified, the algorithm finds the k nearest neighbours of each document. The similarity between two documents is then calculated from the number of nearest neighbours they share. Using this similarity measure, the density of each document is calculated as the number of documents that share at least ε neighbours with it. Next, a document is classified as a core point if its density is greater than the core-point threshold [9]. This study examines parameter values for clustering that can handle data of various sizes, control cluster quality, and handle noise.
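The SNN procedure just described (k nearest neighbours, shared-neighbour similarity, density, core points) can be sketched in Python. This is a minimal sketch, not the exact procedure of [9]: the similarity matrix is assumed to be precomputed, and the cluster-growing step (linking documents whose shared-neighbour count reaches ε, expanding only through core documents) is one common variant.

```python
def snn_clusters(sim, k, eps, min_t):
    """Sketch of Shared Nearest Neighbour (SNN) clustering.
    sim   : n x n similarity matrix between documents (precomputed)
    k     : number of nearest neighbours per document
    eps   : shared-neighbour threshold
    min_t : minimum density for a core point
    Returns one label per document; None marks noise (unclustered)."""
    n = len(sim)
    # k nearest neighbours of each document (excluding itself)
    nn = [set(sorted((j for j in range(n) if j != i),
                     key=lambda j: -sim[i][j])[:k]) for i in range(n)]
    # SNN similarity: number of shared nearest neighbours
    shared = [[len(nn[i] & nn[j]) if i != j else 0 for j in range(n)]
              for i in range(n)]
    # density: how many documents share at least eps neighbours with i
    density = [sum(1 for j in range(n) if shared[i][j] >= eps)
               for i in range(n)]
    core = {i for i in range(n) if density[i] >= min_t}
    labels = [None] * n
    cid = 0
    for c in core:
        if labels[c] is not None:
            continue
        labels[c] = cid
        stack = [c]
        while stack:  # grow the cluster, expanding only through core documents
            p = stack.pop()
            for q in range(n):
                if labels[q] is None and shared[p][q] >= eps:
                    labels[q] = cid
                    if q in core:
                        stack.append(q)
        cid += 1
    return labels
```

With two well-separated groups of documents, the sketch recovers one cluster per group; raising `eps` toward `k` or raising `min_t` leaves more documents labelled `None`, matching the noise behaviour discussed above.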

Document keyword extraction using the rapid automatic keyphrase extraction (RAKE) method
The experiment uses 525 documents. Clustering uses the research title of each document as the input to RAKE. The RAKE method has the following stages:
1. Candidate keyword extraction. Extraction of keyword candidates begins by splitting the text at stopwords and punctuation. Suppose d_i is the i-th document of the data. After the document is extracted, we obtain the keyword candidates Td_i = {t_1, t_2, t_3, ..., t_n}, where t_1, t_2, t_3, ..., t_n are the words or phrases of the keyword candidates.
2. Building the matrix. After the keyword candidates are obtained, the next step is to calculate the co-occurrence matrix, which represents the frequency of occurrence of each keyword word and phrase.
3. Calculating the ratios. The ratio value is the ratio between word degree and word frequency. The degree of a word is the number of occurrences of the word in the document plus the number of phrases that contain the word; in the co-occurrence matrix it is obtained by summing one column or one row. The word frequency is the number of times the word appears in the text, and can be read off the diagonal of the co-occurrence matrix. The basic feature value can then be calculated as

Nfd = ratio(t) + ratio(ft)

where Nfd is the basic feature value, ratio(t) is the keyword ratio of t, and ratio(ft) is the ratio of the phrase containing the keyword t.
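The candidate extraction and degree/frequency ratio computation can be sketched in Python. This is a minimal sketch under stated assumptions: the stopword list is a small illustrative set (not RAKE's actual stoplist), and each candidate phrase is scored by the sum of its member-word ratios.

```python
import re
from collections import defaultdict

# Illustrative stopword list only; real RAKE uses a full stoplist.
STOPWORDS = {"of", "the", "a", "an", "on", "and", "using", "for", "to", "in", "is"}

def rake_keywords(text):
    """Sketch of RAKE scoring: split into candidate phrases at stopwords
    and punctuation, then score each word by degree/frequency."""
    # 1. candidate extraction: split on punctuation, then at stopwords
    phrases = []
    for chunk in re.split(r"[.,;:!?()]", text.lower()):
        phrase = []
        for word in chunk.split():
            if word in STOPWORDS:
                if phrase:
                    phrases.append(phrase)
                phrase = []
            else:
                phrase.append(word)
        if phrase:
            phrases.append(phrase)
    # 2. co-occurrence statistics: frequency and degree of each word
    freq, degree = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # word co-occurs with all phrase members
    # 3. ratio = degree / frequency for each word
    ratio = {w: degree[w] / freq[w] for w in freq}
    # 4. candidate score = sum of member-word ratios
    scores = {" ".join(p): sum(ratio[w] for w in p) for p in phrases}
    # 5. rank candidates by score, highest first
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

On a title such as "Document clustering using shared nearest neighbour algorithm", the longer multi-word phrase accumulates a higher degree for each member word and therefore ranks first.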

Concept extraction using latent semantic analysis (LSA) method
The steps of concept-vector formation follow the two stages described above: constructing the tf-idf weighted matrix and applying SVD to it. The results of text-mining preprocessing with RAKE and LSA are then used in simulations with different SNN parameter values to cluster the documents, so that the characteristics of the SNN parameters can be studied. The experiments are described below. The value of ε must be less than or equal to the value of k. Based on the experimental results, the value of k does not by itself determine the number of clusters; the outcome also depends on the ε and MinT values. Setting k above 10 with a small ε value tends to produce a single cluster, regardless of the MinT value, while setting a small k value (less than 1% of the amount of data) tends to cluster a larger share of the documents than a k value greater than 1% of the amount of data. When ε approaches the k value, the number of clusters increases as the MinT value decreases. A large ε value forms clusters that contain documents with high topic similarity; each cluster then has a high accuracy value because the relevance of the documents to the keyword is also high, so only a few irrelevant documents appear in each cluster. MinT sets a minimum threshold on the number of documents in a cluster, and the experiments show that a greater MinT value forms fewer clusters. A noise document is an unclassified document, and the noise percentage is the ratio of unclassified documents to the total number of documents. This percentage increases when the difference between the k and ε values decreases, and increases further as MinT increases. The experiments show that a smaller ε gives a smaller noise percentage; a small noise percentage means that more documents can be clustered.
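The two-stage concept-vector formation (tf-idf weighting followed by SVD) can be sketched with NumPy. This is a minimal sketch, assuming a raw term-document count matrix as input; the particular tf normalisation and the chosen rank are illustrative, not prescribed by the paper.

```python
import numpy as np

def concept_vectors(term_doc_counts, rank):
    """Sketch of LSA concept-vector formation: tf-idf weighting
    followed by a rank-reduced SVD.
    term_doc_counts: terms x documents matrix of raw counts."""
    counts = np.asarray(term_doc_counts, dtype=float)
    n_terms, n_docs = counts.shape
    # tf-idf weighting: tf * log(N / df), one illustrative variant
    tf = counts / np.maximum(counts.sum(axis=0, keepdims=True), 1)
    df = np.count_nonzero(counts, axis=1)          # document frequency per term
    idf = np.log(n_docs / np.maximum(df, 1))
    weighted = tf * idf[:, None]
    # SVD: A = U S V^T; keep only the top `rank` singular values
    u, s, vt = np.linalg.svd(weighted, full_matrices=False)
    # each column of S_r V_r^T is one document's concept vector
    return (np.diag(s[:rank]) @ vt[:rank]).T       # documents x rank
```

Documents with identical term profiles map to identical concept vectors, while documents built from disjoint terms map to clearly different ones, which is what makes the reduced representation usable as input to the SNN similarity computation.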
The silhouette coefficient refers to a method of interpretation and validation of consistency within clusters of data. The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighbouring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters. The silhouette coefficient shows almost the same result, above 0.9, in many experiments, which means that the SNN algorithm works well with different parameter values.
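The silhouette computation can be sketched directly from its definition, given a precomputed distance matrix and cluster labels. This sketch assumes noise documents have already been excluded and that there are at least two clusters.

```python
def silhouette(dist, labels):
    """Sketch of the mean silhouette coefficient.
    dist   : n x n distance matrix (precomputed)
    labels : cluster label per object; assumes >= 2 clusters."""
    n = len(dist)
    clusters = {c: [i for i in range(n) if labels[i] == c] for c in set(labels)}
    scores = []
    for i in range(n):
        own = clusters[labels[i]]
        if len(own) == 1:        # singleton clusters score 0 by convention
            scores.append(0.0)
            continue
        # a(i): mean distance to the other members of i's own cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b(i): smallest mean distance to the members of any other cluster
        b = min(sum(dist[i][j] for j in members) / len(members)
                for c, members in clusters.items() if c != labels[i])
        scores.append((b - a) / max(a, b))
    return sum(scores) / n
```

For two tight, well-separated clusters the mean silhouette approaches 1, which is consistent with the values above 0.9 reported in the experiments.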

Conclusion
Here are the conclusions obtained from the clustering process with the SNN algorithm:
1. Each cluster is formed by keywords that are shared by the documents in that cluster. The SNN algorithm allows a cluster to be built from more than one keyword if the frequency of those keywords is also high. Thus, a cluster is formed by the keywords used together by the documents in it.
2. The choice of parameter values for the SNN algorithm affects the document clustering results:
a. The value of ε must be less than or equal to the value of k for clustering to take place. A higher k value increases the number of nearest-neighbour documents for each document, which can decrease the similarity among neighbouring documents; the clustering results then show documents with low similarity values placed in one cluster.
b. A high ε value causes more noise, but the documents within one cluster show a high similarity. The high ε value means that a document only 'pulls in' neighbouring documents with a high similarity when forming a cluster.
c. The MinT value limits the number of clusters through the minimum number of documents that can form a cluster. If the MinT value is high, the number of clusters decreases, since a group of fewer than MinT documents cannot form a cluster.
The choice of these parameter values depends on the requirements of the problem. For document clustering, a low k value, a high ε value (with k > ε) and a low MinT value give a good clustering.