Analysis of Document Clustering Based on Cosine Similarity and K-Means Algorithms

Clustering is a useful technique that organizes a large number of unordered text documents into a small number of meaningful and coherent clusters. Effective and efficient organization of documents is needed to support intuitive and informative navigation mechanisms. In this paper, we propose clustering documents using cosine similarity and K-means. The experimental results show that the accuracy of our method is 84.3%.


INTRODUCTION
Document clustering has become an increasingly important technique for unsupervised document organization, automatic topic extraction, and rapid information retrieval. Categorizing electronic documents such as scientific papers by topic requires extra time and effort, which makes it difficult for users to categorize relevant documents. Clustering methods can be used to group documents automatically according to specific topics. In this setting, efficient clustering is highly desirable because of the high data volumes involved. Clustering algorithms are mainly categorized into hierarchical and partitioning methods. Hierarchical methods work by grouping data objects into a tree of clusters, while partitioning methods group the documents into partitions [1].
In this paper we use the K-means method to cluster documents into 3 topic categories. After the documents are converted to plain text format, preprocessing is carried out, including tokenization, stop-word filtering, stemming, and conversion of all characters to lowercase. Each document is then mapped to a high-dimensional vector with one dimension per "term". We use cosine similarity to measure the similarity between two vectors. After this introduction, the second part describes our document clustering method, followed by the third part, which presents the results and discussion. Finally, we give our conclusion and future work.
B.K. Triwijoyo, Kartarina | 165
Text documents are represented as a set of words, where words are assumed to appear independently and their order is immaterial. This "bag of words" model is widely used in information retrieval and text mining [2]. Words are counted in the bag, which differs from the mathematical definition of a set. Each word corresponds to a dimension in the resulting data space, and each document then becomes a vector of non-negative values, one per dimension. Here we use the frequency of each term as its weight, which means that terms appearing more often are considered more important and descriptive for the document. Let D = {d1,…,dn} be a set of documents and T = {t1,…,tm} the set of distinct terms occurring in D. A document d is represented as an m-dimensional vector $\vec{t_d}$ whose components are the frequencies of the terms of T in d. Although more frequent words are assumed to be more important, this is not usually the case in practice. For example, words such as "a" and "the" may be the most frequent words in English text, but neither is descriptive or important for the subject of a document.
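The bag-of-words representation described above can be illustrated with a short Python sketch. It is only a minimal example under stated assumptions: the function names `build_vocabulary` and `tf_vector` and the toy documents are hypothetical and do not come from the experiments in this paper.

```python
from collections import Counter

def build_vocabulary(docs):
    """Return the sorted list of distinct terms T occurring in the collection D."""
    return sorted({term for doc in docs for term in doc})

def tf_vector(doc, vocabulary):
    """Map one tokenized document d to (tf(d, t_1), ..., tf(d, t_m))."""
    counts = Counter(doc)  # missing terms count as 0
    return [counts[term] for term in vocabulary]

# Hypothetical toy collection, already tokenized (an assumption for illustration).
docs = [
    ["retina", "image", "retina"],
    ["image", "network"],
]
vocabulary = build_vocabulary(docs)
vectors = [tf_vector(d, vocabulary) for d in docs]
```

Each document becomes a vector with one non-negative entry per distinct term, so word order is discarded and only the counts remain.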

Figure 1. Angles in two-dimensional space
More complex strategies, such as the inverse document frequency (idf) weighting scheme, are therefore widely used. Documents are represented as vectors, and the similarity between two documents is measured as the correlation between their vectors, which can be quantified as the cosine of the angle between them. Figure 1 shows angles in a two-dimensional space, but in practice the document space usually has tens to thousands of dimensions. The terms are essentially the words of the documents, after some standard transformations are applied to the basic term vector representation. First, stop words are removed, that is, words that are not descriptive of the document topic, such as "a", "and", "or" and "do". The English stop-word list contains approximately 527 words.
Second, words are stemmed using the Porter algorithm [3], so that words with different endings are mapped to a single word. For instance, production, produce, produces and product will be mapped to the stem product. The underlying assumption is that different morphological variations of words with the same root or stem are thematically similar and should be treated as a single word. Third, words whose frequency falls below a given threshold are eliminated, because in many cases they are not very descriptive of the subject of the document and contribute little to the similarity between two documents.
Rare terms can also be removed from the grouping process and make similarity calculations more efficient.
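The transformations above (lowercasing, tokenization, stop-word removal and pruning of rare terms) can be sketched as follows. This is an illustration under assumptions, not the pipeline used in the experiments: the stop list is a tiny hypothetical sample, the threshold `min_count` is an arbitrary choice, and stemming is left out to keep the example self-contained.

```python
import re
from collections import Counter

# Tiny illustrative stop list (an assumption); real lists hold several hundred words.
STOP_WORDS = {"a", "an", "and", "or", "do", "the", "of", "is", "to", "in"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop list."""
    return [t for t in tokens if t not in STOP_WORDS]

def prune_rare_terms(docs, min_count=2):
    """Drop terms whose total frequency in the collection is below min_count."""
    totals = Counter(t for doc in docs for t in doc)
    return [[t for t in doc if totals[t] >= min_count] for doc in docs]

# Hypothetical raw documents.
raw = [
    "The retina and the network.",
    "A network of images is a network.",
]
docs = prune_rare_terms([remove_stop_words(tokenize(r)) for r in raw])
```

In this toy collection only "network" survives, because "retina" and "images" each occur once and fall below the assumed threshold.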
The clustering process compares similarities between two clusters or between a cluster and an object. In hierarchical clustering, this is usually calculated as the complete-link, single-link or average-link distance [4]. In partitional clustering algorithms, however, a cluster is usually represented by a centroid object. For example, in the K-means algorithm the centroid of a cluster is the average of all objects in the cluster; that is, the centroid value in each dimension is the arithmetic mean of that dimension over all objects in the cluster. Let C be a set of documents. The centroid is defined as:

$$\vec{t_C} = \frac{1}{|C|} \sum_{d \in C} \vec{t_d}$$

which is the mean of all term vectors in the set. The vectors are then normalized. As noted above, the most frequently occurring terms are not always the most informative. Conversely, terms that appear often in a small number of documents but rarely in the others tend to be more relevant and specific to that group of documents, and therefore more useful for finding similar documents. The tf-idf (term frequency and inverse document frequency) weighting scheme captures this by combining the term frequency tf(d,t) of a term t in document d with a factor that discounts its importance according to its appearances in the whole document collection. It is defined as:

$$tfidf(d,t) = tf(d,t) \times \log\frac{|D|}{df(t)}$$

where df(t) is the number of documents in which the term appears, d is the document and t is the term.

Clustering, in general, is an important and useful technique that automatically organizes a collection of a large number of data objects into a small number of coherent groups [4][5]. For text documents in particular, clustering has proven to be an effective approach and is widely applied in several search engines to help users quickly identify and focus on the relevant set of results, as well as to provide collaborative recommendations. In collaborative bookmarking or tagging, groups of users who share certain characteristics are identified from their annotations.
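The centroid and the tf-idf weighting discussed above can be sketched in a few lines of Python. The sketch assumes the common form tf(d,t) × log(|D| / df(t)) for the weight; the function names and the toy term-frequency matrix are hypothetical, and library implementations often use smoothed variants of the idf factor.

```python
import math

def tfidf_vectors(tf_vectors):
    """Weight each term frequency tf(d, t) by log(|D| / df(t))."""
    n_docs = len(tf_vectors)
    n_terms = len(tf_vectors[0])
    # df[t] = number of documents in which term t appears at least once
    df = [sum(1 for vec in tf_vectors if vec[t] > 0) for t in range(n_terms)]
    return [
        [vec[t] * math.log(n_docs / df[t]) if df[t] else 0.0 for t in range(n_terms)]
        for vec in tf_vectors
    ]

def centroid(vectors):
    """Arithmetic mean of the vectors, computed dimension by dimension."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Hypothetical term-frequency vectors for two documents and three terms.
tf = [[2, 1, 0], [0, 1, 3]]
weighted = tfidf_vectors(tf)
center = centroid(tf)
```

The second term occurs in both documents, so its weight drops to zero, while terms confined to a single document keep a positive weight.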
Many clustering methods have been proposed, such as k-means [6], naïve Bayes or Gaussian models [7][8][9], single-link [7] and DBSCAN [10]. From different perspectives, these clustering methods can be classified as agglomerative or divisive, hard or fuzzy, and deterministic or stochastic. Document clustering operates in a very high-dimensional space, ranging from several hundred to thousands of dimensions, so it is desirable to first project the documents into a lower-dimensional subspace in which the semantic structure of the document space becomes clear. In the low-dimensional semantic space, traditional clustering algorithms can then be applied. For this purpose, spectral clustering [11][12], clustering using LSI [13] and clustering based on non-negative matrix factorization [14][15] are the most commonly used techniques.
Text document clustering groups similar documents into coherent clusters, while documents that differ are separated into different clusters. However, the definition of whether a pair of documents is similar or different is not always clear and usually varies with the actual problem setting. For example, when clustering research papers, two documents are considered similar if they share the same thematic topic. This type of clustering can be useful for further analysis and use of the dataset, such as information retrieval and information extraction, because it groups the same types of information sources together. Accurate clustering requires a precise definition of the closeness between a pair of objects, in terms of either similarity or distance. Various similarity and distance measures have been proposed and widely applied, such as cosine similarity and the Jaccard correlation coefficient. Similarity between documents has also been measured using the simple matching coefficient [16] and using the Vector Space Model to determine the similarity percentage of each document [17]. Meanwhile, similarity is often conceived in terms of dissimilarity or distance as well [18]. Measures such as Euclidean distance and relative entropy have been applied in clustering to calculate the distance between a pair of objects.
Spectral clustering has shown its ability to handle highly non-linear data (the data space has high curvature in every local area). Its strong connections to differential geometry also enable it to discover the underlying structure of the document space. Spectral clustering usually clusters the data points using the top eigenvectors of the graph Laplacian, which is defined on the affinity matrix of the data points. Spectral clustering tries to find the best cut of the graph so that a predefined criterion function is optimized. Many criterion functions, such as ratio cut [19], average association [11], normalized cut [11], and minimum cut [9], have been proposed along with the corresponding eigen-problems for finding their optimal solutions. From the perspective of dimensionality reduction, spectral clustering embeds the data points into a low-dimensional space where traditional clustering algorithms such as K-means can be applied. One of the main disadvantages of spectral clustering algorithms is that they use a dimensionality reduction that is defined only on the training data. They must use all data points to learn the embedding, so when the data set is very large the computation becomes expensive, which limits the application of spectral clustering to large data sets.
Latent Semantic Indexing (LSI) [20] is one of the most popular linear document indexing methods that produce low-dimensional representations. LSI aims to find the best subspace approximation to the original document space in the sense of minimizing the global reconstruction error. In other words, LSI seeks to uncover the most representative features rather than the most discriminative features for document representation. Therefore, LSI may not be optimal for discriminating documents with different semantics, which is the ultimate goal of clustering. Xu et al. applied the Non-negative Matrix Factorization (NMF) algorithm to document clustering [14][15]. They model each cluster as a linear combination of the data points, and each data point as a linear combination of the clusters, and they compute the linear coefficients by minimizing the global reconstruction error of the data points using Non-negative Matrix Factorization. Thus, the NMF method still focuses on the global geometric structure of the document space. In addition, the iterative update method for solving the NMF problem is computationally expensive.

METHODS
As shown in Figure 2, our method consists of seven stages. First, the document files are collected in a folder; in this study we used 83 randomly selected papers on 3 topics. The files, which were still in PDF format, were then converted to plain text format using the Zilla PDF to TXT converter application. The second step is tokenization. Tokenization is an integral part of an information retrieval system; it involves pre-processing the given documents and producing the individual tokens [21].
The tokenization technique counts tokens to obtain the word count or token count, which can be used in the indexing and ranking process. Figure 3 shows the tokenization algorithm. The third step is filtering, which removes stop words from the document. A stop-word list, or stop list, is a list of words that carry no information. The concept was introduced by Luhn, a computer scientist and information expert who paved the way for automatic indexing and information retrieval. Removing stop words from the index can reduce the space and time required by 30-50%. This innovation was adopted by van Rijsbergen [22], who suggested a list of 250 English stop words.
The Porter stemmer, used in the stemming step, is a context-sensitive suffix-removal algorithm. It is the most widely used stemmer, and implementations are available in many languages. A number of definitions need to be made before its steps can be explained; the following definitions are presented in [23]. A consonant is a letter other than A, E, I, O or U, and other than Y preceded by a consonant. For example, in the word "boy" the consonants are B and Y, but in "try" they are T and R. A vowel is any letter that is not a consonant. A list of consonants of length greater than or equal to one is denoted by C, and a similar list of vowels by V [23].
Any word can therefore be represented by the single form

$$[C](VC)^m[V]$$

where the superscript m denotes m repetitions of VC and the square brackets [ ] denote the optional presence of their contents [23]. The value m is called the measure of a word; it can take any value greater than or equal to zero and is used to decide whether a given suffix should be removed. All rules are of the form

(condition) S1 → S2

which means that the suffix S1 is replaced by S2 if the letters remaining before S1 satisfy the condition [23]. The last part of the first step changes a terminal y to i. The remaining steps are relatively straightforward and contain rules for dealing with different order classes of suffixes, initially converting double suffixes into a single suffix and then removing suffixes provided the relevant conditions are met [23].
The fifth step transforms all characters in the document to lowercase before similarity is measured. In the sixth step, we use the cosine similarity function to calculate document similarity [24]. For two documents $d_i$ and $d_j$, the similarity between them can be calculated as:

$$\cos(d_i, d_j) = \frac{d_i \cdot d_j}{\|d_i\| \, \|d_j\|}$$

Since the document vectors are of unit length, the above equation simplifies to:

$$\cos(d_i, d_j) = d_i \cdot d_j$$

When the cosine value is 1 the two documents are identical; when it is 0 they have nothing in common, that is, their document vectors are orthogonal to each other [24].
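The cosine similarity computation described above can be sketched as follows. The function names are hypothetical, and returning 0.0 for an all-zero vector is a convention assumed here, not something specified in the paper.

```python
import math

def norm(u):
    """Euclidean length of a vector."""
    return math.sqrt(sum(a * a for a in u))

def normalize(u):
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    length = norm(u)
    return [a / length for a in u] if length else list(u)

def cosine_similarity(u, v):
    """Cosine of the angle between two document vectors."""
    length_u, length_v = norm(u), norm(v)
    if length_u == 0.0 or length_v == 0.0:
        return 0.0  # assumed convention for empty documents
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (length_u * length_v)

# Hypothetical term-frequency vectors.
same_direction = cosine_similarity([1, 2], [2, 4])
orthogonal = cosine_similarity([1, 0], [0, 1])

# For unit-length vectors the cosine reduces to the dot product.
u, v = normalize([3, 4]), normalize([4, 3])
dot_of_units = sum(a * b for a, b in zip(u, v))
```

Because term weights are non-negative, the value stays between 0 and 1, and it depends only on the direction of the vectors, not on document length.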
The final step is clustering. For our analysis, we chose the K-means algorithm to group the documents. It is an iterative partitional clustering process that aims to minimize the least-squares error criterion [25]. As mentioned earlier, partitional clustering algorithms are recognized as better suited to large document datasets than hierarchical ones because of their relatively low computational requirements [26][27][28]. The standard K-means algorithm works as follows. Given a set of data objects D and a predetermined number of clusters k, k data objects are selected at random to initialize the k clusters, each one serving as the centroid of a cluster. The remaining objects are then assigned to the cluster represented by the nearest or most similar centroid. Next, a new centroid is computed for each cluster, and all documents are in turn reassigned based on the new centroids. This step is repeated until a converged solution is reached, in which all data objects remain in the same cluster after the centroids are updated. The resulting clustering solution is a local optimum for the given data set and initial seeds. Different choices of initial seed sets can produce very different final partitions. Methods for finding good starting points have been proposed [29]. However, we use the basic K-means algorithm because optimizing the clustering is not the main focus of this paper. The K-means algorithm works with distance measures, which basically aim to minimize the within-cluster distances. Similarity measures therefore do not fit directly into the algorithm, because smaller similarity values indicate dissimilarity.
Table 1 shows the results of the clustering process. Of the 83 document files, 16 were assigned to cluster 0, with the topic Hypertension Retinopathy; of these 16 files, 15 were categorized correctly and one was not. A further 42 files were assigned to cluster 1, with the topic Convolutional Neural Network (CNN), of which 34 were categorized correctly and 8 were not. To measure the accuracy of the clustering results, we use the following formula from [24], where the accuracy r is defined as:

$$r = \frac{1}{n} \sum_{i=1}^{k} a_i$$

where $a_i$ is the number of documents correctly categorized in cluster i, k is the number of clusters, and n is the number of documents in the dataset. Based on the clustering results in Table 1, where 70 of the 83 documents are correctly classified, the accuracy of the clustering is 70/83 ≈ 84.3%.
Preprocessing the 83 documents, which consists of tokenization, English stop-word filtering, Porter stemming and transformation of all characters to lowercase, generated 4366 attributes (words), each with a centroid value in cluster 0, cluster 1 and cluster 2. Table 2 shows the centroid values in each cluster for 6 keywords, 2 randomly selected keywords per cluster. For the words "retinopathi" and "hypertens", the largest centroid values, 0.1033 and 0.1411, are in cluster 0, which agrees with the topic of cluster 0, Hypertension Retinopathy. For the words "cnn" and "convolut", the largest centroid values, 0.0433 and 0.0351 respectively, are in cluster 1, which agrees with the topic of cluster 1, Convolutional Neural Network (CNN). For the words "deep" and "learn", the largest centroid values, 0.0313 and 0.0337 respectively, are in cluster 2, which agrees with the topic of that cluster, Deep Learning. These values differ only slightly from the values in cluster 1, because the phrase "deep learning" often appears in documents that discuss CNN.
Figure 5 shows the term frequency and inverse document frequency index value for the word "retinopathi", where the largest value occurs in document id 49. This is consistent, since document id 49 belongs to cluster 0 with the topic Hypertension Retinopathy. Figure 6 shows the term frequency and inverse document frequency index value for the word "convolut", where the largest value occurs in document id 20. This is also consistent, since document id 20 belongs to cluster 1 with the topic Convolutional Neural Network (CNN). A different result is shown in Figure 7, where the largest index value for the word "deep" occurs in document id 52, which belongs to cluster 1 with the topic CNN rather than to cluster 2 with the topic Deep Learning. This happens because CNN is one of the deep learning architectures, so documents that discuss CNN contain many occurrences of the words "deep learning". The term frequency and inverse document frequency index value can be used as an indicator of how often keywords appear in each document, and thus provides information for choosing the documents most relevant to a desired topic.
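The basic K-means procedure described in this section (random initial centroids, assignment to the nearest centroid, centroid update, repetition until assignments stop changing) can be sketched in plain Python. This is a minimal illustration and not the implementation used in the experiments: the function names, the `seed` and `max_iter` parameters and the toy points are assumptions. It uses squared Euclidean distance; when two vectors both have unit length, this distance equals 2 − 2·cos, so a smaller distance corresponds to a larger cosine similarity.

```python
import random

def squared_distance(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def mean(vectors):
    """Centroid: arithmetic mean in every dimension."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def k_means(points, k, seed=0, max_iter=100):
    """Basic K-means; returns (cluster label per point, list of centroids)."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # converged: no point changed cluster
            break
        labels = new_labels
        # Update step: recompute each centroid from its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean(members)
    return labels, centroids

# Hypothetical two-dimensional points forming two obvious groups.
points = [[0, 0], [0, 1], [10, 10], [10, 11]]
labels, centroids = k_means(points, k=2, seed=0)
```

As the text notes, the result is only a local optimum, so on real document vectors different seeds can yield different partitions.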

CONCLUSION
This paper presented the results of an experimental study on grouping 83 scientific document files into 3 topics: Hypertension Retinopathy, Convolutional Neural Network, and Deep Learning. After the document files are converted to plain text, they are preprocessed by tokenization, English stop-word filtering, Porter stemming and transformation of all characters to lowercase. Next, document similarity is calculated using cosine similarity. Finally, the documents are grouped using the standard K-means algorithm. Our results show that our method achieves an accuracy of 84.3%.

Let tf(d,t) denote the frequency of term t ∈ T in document d ∈ D. Then the vector representation of document d is $\vec{t_d} = (tf(d,t_1), \ldots, tf(d,t_m))$.

Since then, these stop words have formed a classic stop list, used by default or as a baseline in text databases.

The fourth step is stemming. Stemming is the process of reducing inflected or derived words to their base form. In this study, we use the Porter stemming algorithm. The Porter stemmer is a conflation stemmer developed by Martin Porter at the University of Cambridge in 1980.

Figure 3. Tokenization Algorithm [21].

The first step of the Porter algorithm is designed to deal with plurals and past participles. This step is the most complex and is separated into three parts in the original definition. The first part deals with plurals, for example sses → ss and deletion of a final s. The second part removes ed and ing, or performs eed → ee where appropriate. The second part continues only if ed or ing is removed, and transforms the remaining stem to ensure that certain suffixes are recognized later. The third part changes a terminal y to i.
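Two pieces of the Porter algorithm can be sketched briefly: the measure m of a word in the form [C](VC)^m[V], and the plural-handling part of the first step. The function names are hypothetical and lowercase input is assumed. The rules ies → i and ss → ss are taken from Porter's original definition rather than from the description above, and the sketch covers only this part of the algorithm, not the complete stemmer.

```python
def is_consonant(word, i):
    """A letter other than a, e, i, o, u, and other than y preceded by a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y counts as a consonant at the start of a word or after a vowel
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Return m for a stem written as [C](VC)^m[V]."""
    pattern = ""
    for i in range(len(stem)):
        kind = "C" if is_consonant(stem, i) else "V"
        if not pattern or pattern[-1] != kind:  # collapse runs into one symbol
            pattern += kind
    return pattern.count("VC")

def step_1a(word):
    """Plural handling: sses -> ss, ies -> i, ss -> ss, s -> (removed)."""
    if word.endswith("sses"):
        return word[:-2]
    if word.endswith("ies"):
        return word[:-2]
    if word.endswith("ss"):
        return word
    if word.endswith("s"):
        return word[:-1]
    return word
```

For instance, "tree" maps to the pattern CV and has measure 0, while "trouble" maps to CVCV and has measure 1; later rules use this measure to decide whether a suffix may be removed.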

Figure 5. Graph of the index value of the word "retinopathi" in each document

Figure 6. The index value of the word "convolut" in each document.

Figure 7. Graph of the index value of the word "deep" in each document.