INTER AND INTRA CLUSTER ON SELF-ADAPTIVE DIFFERENTIAL EVOLUTION FOR MULTI-DOCUMENT SUMMARIZATION
Alifia Puspaningrum et al.

Multi-document summarization is more challenging than single-document summarization because of its larger search space and the variety of topics across documents. Existing optimization algorithms therefore consider several criteria when producing a summary, such as relevancy, content coverage, and diversity. These weighted criteria rest on the assumption that the documents already belong to the same cluster. In some conditions, however, the documents span several categories, and this also needs to be considered. In this paper, we propose an inter- and intra-cluster method that consists of four weighted criteria functions (coherence, coverage, diversity, and inter-cluster analysis) optimized with SaDE (Self-Adaptive Differential Evolution) to obtain the best summary. The proposed method therefore deals not only with the compactness of each cluster but also with the separation between clusters. Experimental results on the Text Analysis Conference (TAC) 2008 dataset yield better summaries, with average ROUGE-1 scores of 0.77, 0.07, and 0.12 for precision, recall, and F-measure respectively, compared to another method that considers only intra-cluster analysis.


Introduction
Documents can contain long texts that present information on specific topics. At the same time, the growth in the number and size of documents makes determining useful information a challenging task, and a solution is needed to handle this problem efficiently. Recently, one of the recognized solutions for determining useful information is text summarization. Text summarization is the process of transforming a text into a shorter form without losing its information [1]. The summary of a text gives the user a quick glance at the text's main topic. It therefore simplifies the acquisition of useful information and helps the user save time [2].
Text summarization methods can be divided into two types: extractive and abstractive. The extractive method uses sentences from the source text that are deemed to represent its main topic. The abstractive method tries to generate new text from the source text. Furthermore, text summarization can be single-document or multi-document, according to the number of source documents. Single-document summarization produces a short summary from only one document, whereas multi-document summarization produces a short summary from two or more documents [3]. Extracting important sentences is more challenging in multi-document summarization than in single-document summarization because its search space is larger [2].
Several studies on multi-document summarization have sought to produce an optimal summary based on the abstractive summarization method. Some of them use nature-inspired optimization algorithms, such as Differential Evolution [4], Cuckoo Search [2], and Cat Swarm [3]. Differential Evolution has been used in many fields, especially for optimization. In addition, its stochastic search operators (crossover, mutation, and selection) make Differential Evolution a robust and effective algorithm.
Optimization algorithms consider several criteria when producing a summary, such as relevancy, content coverage, and diversity. However, those criteria rest on the assumption that the documents already belong to the same cluster. In some conditions, the documents span several categories and need to be clustered first; text summarization can then be applied to the result of the document clustering process. Consequently, prior studies did not consider whether the topic of the resulting summary overlaps with other clusters, even though document clustering is one of the fundamental tools for understanding documents [4]. The work in [5] considers clustering analysis in multi-document summarization by proposing inter- and intra-cluster similarity for each sentence. However, this method only calculates the sentence value with respect to its own cluster, without considering that the resulting summary may contain information that is either related or unrelated to the main topic.
There are several clustering techniques, such as k-means clustering, hierarchical clustering, and fuzzy clustering. K-means clustering has a better time complexity than hierarchical clustering because it is linear in the number of data objects, which makes it suitable for large datasets [13]. Moreover, k-means clustering minimizes the dispersion within clusters [14].
In this paper, we propose an inter- and intra-cluster method for multi-document summarization. It consists of four weighted criteria functions (coherence, coverage, diversity, and inter-cluster analysis) that are optimized with SaDE (Self-Adaptive Differential Evolution) to obtain the best summary. The proposed method therefore deals not only with the separation between clusters but also with the compactness within each cluster.
The paper is organized as follows. Section 2 describes the general framework of the proposed method and each of its stages. Section 3 presents the experimental setup, the dataset, the results, and their analysis. Section 4 gives the conclusions and future work.

Methods
Multi-document summarization is the process of automatically compressing the text of multiple documents into a short summary without losing its useful information [4]. The proposed method is inspired by SaDE (Self-Adaptive Differential Evolution) [3]. It has five main steps: clustering phase, preprocessing phase, input representation phase, summary optimization, and final summary. The general framework of the proposed method is shown in Figure 1. Multiple documents with different topics are given as input to the proposed method. The documents are then clustered based on their topics. After that, the results are passed to the preprocessing phase and the input representation phase. Finally, summary optimization is applied to extract the final summary.

Clustering Phase
Document clustering is one of the fundamental tools for understanding documents [4]. The main objective of the clustering phase is to group the document set into several clusters, where documents in the same cluster have a similar topic. We apply the k-means clustering method to the multiple documents because it is easy to implement and converges rapidly. However, k-means clustering depends on the number of clusters, which must be specified in advance [9]. In the proposed method, the number of clusters is restricted to two. Therefore, each test is run on multiple documents from two topics.
The first step in document clustering is to transform the documents into a feature space that represents the weight of the words in each document. The word weights are then used to compute a similarity representation of each document. Finally, the input documents are clustered based on the similarity representation generated in the previous step.
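The three clustering steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' MATLAB implementation: the TF-IDF weighting, the cosine-based assignment, the deterministic seeding, and the names `tfidf_vectors`, `cosine`, `centroid`, and `two_means` are all hypothetical choices made for the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn tokenized documents into sparse TF-IDF dictionaries."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return [
        {term: tf * math.log(n_docs / doc_freq[term]) for term, tf in Counter(doc).items()}
        for doc in docs
    ]

def cosine(a, b):
    """Cosine similarity of two sparse vectors (0.0 if either has zero norm)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def centroid(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)  # Counter.update adds the weights term by term
    return {t: w / len(vectors) for t, w in total.items()}

def two_means(vectors, iterations=10):
    """Group vectors into two clusters; returns one label (0 or 1) per vector."""
    # Assumed seeding: the first document and the document least similar to it.
    far = min(range(len(vectors)), key=lambda i: cosine(vectors[0], vectors[i]))
    centroids = [vectors[0], vectors[far]]
    labels = []
    for _ in range(iterations):
        new_labels = [max((0, 1), key=lambda c: cosine(v, centroids[c])) for v in vectors]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in (0, 1):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = centroid(members)
    return labels

docs = [
    ["storm", "flood", "rain"],
    ["flood", "rain", "damage"],
    ["election", "vote", "poll"],
    ["vote", "poll", "ballot"],
]
labels = two_means(tfidf_vectors(docs))
print(labels)
```

Standard k-means uses Euclidean distance and random seeding; the cosine assignment here is a swapped-in simplification that keeps the toy example deterministic.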

Preprocessing Phase
The preprocessing phase transforms the results of the clustering phase into distinct terms, which are used to calculate a weight for each sentence. Figure 2 shows the three sub-processes in this phase: 1) sentence extraction, 2) sentence normalization, and 3) tokenization. Sentence extraction is the first sub-process and aims to extract the sentences of each document that relate to its main content. The result of sentence extraction is represented as a sentence list. Afterwards, normalized sentences are generated using stopword removal, punctuation removal, and stemming. Stopword removal uses the Journal of Machine Learning Research stopword list, and stemming uses the Porter stemmer algorithm. The next sub-process tokenizes each normalized sentence into a list of distinct terms. The remaining phases are performed for each resulting cluster.
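The three preprocessing sub-processes can be sketched as follows. This is only an illustration: the tiny `STOPWORDS` set and the suffix-stripping `naive_stem` are hypothetical stand-ins for the published stopword list and the Porter stemmer named above, and the function names are assumptions made for the example.

```python
import re

# Tiny illustrative stopword set; a stand-in for the published list.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "is", "are", "to", "was"}

def extract_sentences(text):
    """Split raw text into sentences at '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def naive_stem(word):
    """Placeholder for the Porter stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(sentence):
    """Lowercase, drop punctuation and stopwords, then stem each word."""
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    return [naive_stem(w) for w in words if w not in STOPWORDS]

def distinct_terms(token_lists):
    """Sorted vocabulary of all terms that occur in the tokenized sentences."""
    return sorted({term for tokens in token_lists for term in tokens})

text = "The rivers flooded the town. Rescue teams are helping residents!"
sentences = extract_sentences(text)
tokens = [tokenize(s) for s in sentences]
print(tokens)
print(distinct_terms(tokens))
```

Keeping the per-sentence token lists (with repeats) alongside the distinct vocabulary makes it possible to count term frequencies in the next phase.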

Input Representation Phase
For each cluster, the distinct terms obtained from the previous process are used to calculate term weights. The term weight is calculated using term frequency-inverse sentence frequency (TF-ISF), formulated in equation (1) and equation (2).
In equation (1), $N$ is the number of sentences in the documents to be summarized, $N_m$ is the number of sentences containing term $m$, and $isf_m$ is the inverse sentence frequency of term $m$. In equation (2), $w_{nm}$ denotes the weight of distinct term $m$ in sentence $n$ of the source documents to be summarized, and $tf_{nm}$ denotes the frequency of term $m$ in sentence $n$.
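Equations (1) and (2) are not reproduced in this text. A standard TF-ISF form that is consistent with the definitions above is given below as an assumption; the logarithm base and any smoothing used by the authors may differ.

```latex
\begin{align}
isf_m &= \log\frac{N}{N_m} \tag{1} \\
w_{nm} &= tf_{nm} \times isf_m \tag{2}
\end{align}
```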
After calculating the weight of each term in each sentence, we calculate the similarity between sentences using cosine similarity, formulated in equation (3).
Sentence similarity can serve as the basis for calculating the summary criteria functions because it captures the similarity between the main content of the original documents and a summary candidate [10].
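The term weighting and sentence similarity of this phase can be sketched as below. The sketch assumes the common TF-ISF form (term frequency times the natural log of N over N_m) and the usual cosine formula; `tf_isf` and `cosine_similarity` are hypothetical names, and sentences are assumed to be already tokenized.

```python
import math
from collections import Counter

def tf_isf(sentences):
    """Weight every term of every tokenized sentence by tf * log(N / N_m)."""
    n = len(sentences)
    # N_m: number of sentences that contain term m.
    sentence_freq = Counter(term for s in sentences for term in set(s))
    return [
        {term: tf * math.log(n / sentence_freq[term]) for term, tf in Counter(s).items()}
        for s in sentences
    ]

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse weight vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

sentences = [["flood", "river", "flood"], ["river", "town"], ["vote", "poll"]]
weights = tf_isf(sentences)
print(cosine_similarity(weights[0], weights[1]))
print(cosine_similarity(weights[0], weights[2]))
```

A term that occurs in every sentence gets weight zero under this form, so it contributes nothing to the similarity.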

Summary Optimization
Sentence summarization is completed during this phase. As explained in prior work [4], the summary optimization process is composed of several sub-processes: initialization, binarization, sentence ordering, solution evaluation, mutation, crossover, and selection. Apart from initialization, these sub-processes are performed iteratively for a fixed number of generations. Every generation yields a set of solutions, and the last generation is regarded as producing the most optimal set of solutions. The iteration stops once the specified maximum number of generations is reached. The summary optimization flowchart is shown in Figure 3.
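The control flow of this phase can be sketched as a generic generation loop. The skeleton below is an assumption-laden outline, not the authors' code: `optimize` and its callable parameters are hypothetical, `fitness` is assumed to perform binarization and sentence ordering internally, and the greedy comparison stands in for the selection rule.

```python
def optimize(population, fitness, mutate, crossover, max_generations):
    """Run a fixed number of generations and return the best vector seen."""
    best = max(population, key=fitness)
    for t in range(1, max_generations + 1):
        next_population = []
        for target in population:
            mutant = mutate(target, population, best, t)
            trial = crossover(target, mutant)
            # Selection: keep the trial only if it scores higher than the target.
            keep_trial = fitness(trial) > fitness(target)
            next_population.append(trial if keep_trial else target)
        population = next_population
        best = max(population + [best], key=fitness)
    return best

# Toy usage: one-element vectors that should move towards the value 3.
result = optimize(
    population=[[0.0], [1.0]],
    fitness=lambda v: -abs(v[0] - 3.0),
    mutate=lambda target, population, best, t: [target[0] + 1.0],
    crossover=lambda target, mutant: mutant,
    max_generations=5,
)
print(result)
```

Passing the operators in as callables keeps the loop independent of how mutation, crossover, and fitness are actually defined.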

Initialization
In this sub-process, an initial set of solutions is generated to be processed further in the next sub-processes. It is performed only once for the entire summary optimization process. A set of solutions is generated, and each solution is represented by a vector whose elements represent the sentences in a cluster. Each element of a solution vector is assigned a real value calculated with equation (4).
In equation (4), $u_{P,n}(t)$ denotes the nth element of the target vector of solution P in the t-th generation. $u^{\min}$ and $u^{\max}$ are the real-valued lower bound and upper bound respectively, specified by the user, and $rand_{P,n}$ is a uniform random value between 0 and 1. The result of this sub-process, a set of vectors with real-valued elements, is called the target vectors.
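A minimal sketch of this initialization is shown below, assuming that equation (4) scales a uniform random number into the interval between the lower and upper bound, as the description suggests. The function name `init_population` and its parameter names are hypothetical.

```python
import random

def init_population(pop_size, n_sentences, u_min, u_max, rng=None):
    """Create pop_size target vectors with elements drawn uniformly from [u_min, u_max]."""
    rng = rng or random.Random()
    return [
        [u_min + (u_max - u_min) * rng.random() for _ in range(n_sentences)]
        for _ in range(pop_size)
    ]

# Population size 3 and bounds -5 and 5, as in the experimental setup;
# the six sentences and the seed are arbitrary choices for the example.
targets = init_population(3, 6, -5.0, 5.0, random.Random(0))
print(targets[0])
```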

Binarization
The binarization sub-process transforms target vectors, whose elements are real numbers, into binary vectors, whose elements are binary values. In the summary optimization phase, this sub-process is performed twice in each generation, because both the mutation and crossover sub-processes take real-valued target vectors as input and consequently yield real-valued vectors. Binarization is therefore required to transform the real-valued vectors into binary vectors after both sub-processes are completed. The binary values in the resulting vector represent the inclusion of sentences in a summary solution. If the ith element of binary vector P is 1, the ith sentence in the cluster is included in summary solution P; otherwise, the sentence is not included in summary solution P.
The transformation from target vectors to binary vectors is performed using equation (5). According to Alguliev et al. [1], the sigmoid function of each target vector element can be calculated with equation (6).
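Equations (5) and (6) are not reproduced in this text. The sketch below assumes the sigmoid-threshold scheme commonly used with Differential Evolution for sentence selection: an element becomes 1 when a uniform random number falls below the logistic sigmoid of its real value. The names `sigmoid` and `binarize` are hypothetical, and the exact rule in the paper may differ.

```python
import math
import random

def sigmoid(u):
    """Logistic function mapping a real value into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-u))

def binarize(vector, rng=None):
    """Set an element to 1 when a uniform random draw is below its sigmoid value."""
    rng = rng or random.Random()
    return [1 if rng.random() < sigmoid(u) else 0 for u in vector]

# Strongly positive values are almost always selected, strongly negative
# values almost never; values near zero are selected about half the time.
print(binarize([4.0, -4.0, 0.0, 2.5], random.Random(42)))
```

With bounds such as -5 and 5, no element is ever forced to 0 or 1, so every sentence keeps a small chance of entering or leaving the summary.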

Sentence Ordering
The sentence ordering sub-process aims to improve the coherence of a summary solution by arranging the order of its sentences. The arranged order is stored in the sentence order vector of the summary solution, where each element holds a sentence index in the arranged order.
Umam et al. proposed two ordering algorithms [10], dubbed Algorithm A and Algorithm B. Algorithm A arranges the sentence order based on the similarity between neighboring sentences, whereas Algorithm B puts the most similar pair of sentences at the beginning of the summary paragraph. The prior study shows that Algorithm A performed better than Algorithm B. Therefore, Algorithm A is used in this summary optimization method.

Solutions Evaluation
To find the optimal solution in every generation, Umam et al. used three criteria to evaluate summary solutions: coverage, diversity, and coherence [10]. The coverage criterion represents the conformity of a solution summary to the main content of the source text, and hence corresponds to the intra-cluster analysis.
Coverage can be calculated with equation (7). In the equation, $X_P(t)$ denotes the binary vector of summary solution P at the t-th generation, and N is the number of sentences in the cluster. $O$ denotes the centroid vector of all sentences in the cluster, and $O^{S}_P(t)$ denotes the centroid vector of all sentences in summary solution P at the t-th generation. $S_n$ denotes the nth sentence vector, whose elements represent term weights, and $x_{P,n}$ denotes the nth element of the binary vector of summary solution P.
In this paper, we introduce the inter-cluster analysis, which calculates the distance between a solution summary and the other text sources. The distance calculation can minimize topic overlap between the solution summary and the other text sources.
This inter-cluster analysis criterion, henceforth called heterogeneity, is calculated with equation (8). Most notations in equation (8) have the same meaning as in equation (7), except $C$, which denotes the number of clusters, and $O_c$, which denotes the centroid vector of the cth cluster, where c is not the current cluster.
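Equation (8) is not reproduced in this text, so the following is only one plausible reading of the heterogeneity idea: the average cosine distance between the centroid of the summary sentences and the centroids of the other clusters. The formula and the names `cosine`, `centroid`, and `heterogeneity` are assumptions made for illustration, not the authors' definition.

```python
import math

def cosine(a, b):
    """Cosine similarity of two dense vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def centroid(vectors):
    """Component-wise mean of equally long dense vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def heterogeneity(summary_vectors, other_centroids):
    """Mean cosine distance between the summary centroid and other cluster centroids."""
    summary_centroid = centroid(summary_vectors)
    distances = [1.0 - cosine(summary_centroid, o_c) for o_c in other_centroids]
    return sum(distances) / len(distances)

summary = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
print(heterogeneity(summary, [[0.0, 0.0, 1.0]]))  # other cluster shares no terms
print(heterogeneity(summary, [[1.0, 0.5, 0.0]]))  # other cluster matches the summary
```

Under this reading, a higher value means the summary overlaps less with the topics of the other clusters.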
The diversity criterion prevents information redundancy in a solution summary. This criterion calculates the similarity between a sentence and the other sentences in a solution summary, as shown in equation (9).
The coherence criterion ensures the quality of the information flow in a solution summary. Continuity of information between sentences can improve the readability of the solution summary. This criterion calculates the similarity of adjacent sentences, as shown in equation (10).
In equation (10), $Y_P(t)$ denotes the sentence order vector of summary solution P at the t-th generation, whose ith element is denoted by $y_{P,i}(t)$. $|S_P(t)|$ denotes the number of sentences in summary solution P at the t-th generation.
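Equations (9) and (10) are not reproduced in this text. The sketch below shows one simple reading of the two criteria over a precomputed sentence similarity matrix: diversity as one minus the mean pairwise similarity of the selected sentences, and coherence as the mean similarity of adjacent sentences in the chosen order. Both formulas and the names `diversity` and `coherence` are assumptions for illustration.

```python
from itertools import combinations

def diversity(sim, selected):
    """One minus the mean pairwise similarity of the selected sentences."""
    pairs = list(combinations(selected, 2))
    if not pairs:
        return 1.0  # a summary with fewer than two sentences has no redundancy
    return 1.0 - sum(sim[i][j] for i, j in pairs) / len(pairs)

def coherence(sim, order):
    """Mean similarity between each sentence and the one that follows it."""
    if len(order) < 2:
        return 0.0
    adjacent = list(zip(order, order[1:]))
    return sum(sim[i][j] for i, j in adjacent) / len(adjacent)

# Symmetric similarity matrix for three sentences (illustrative values).
sim = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.3],
    [0.1, 0.3, 1.0],
]
print(diversity(sim, [0, 1, 2]))
print(coherence(sim, [0, 1, 2]), coherence(sim, [1, 0, 2]))
```

The same three sentences score differently on coherence depending on their order, which is why ordering runs before evaluation.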
The fitness function, formulated in equation (11), is used to find the optimal solution summary. The target vector of the local best solution summary in generation t is stored in $U^{best}(t)$, and the target vector of the local worst solution summary in generation t is stored in $U^{worst}(t)$. The target vector of the global best solution summary, $U^{gbest}(t)$, is also updated in each generation.

Mutation
Mutation is a sub-process in which target vectors are transformed into mutant vectors using the local best summary's target vector, denoted by $U^{best}(t)$, and the global best summary's target vector, denoted by $U^{gbest}(t)$. The mutant vector is calculated with equation (13), where $V_P(t)$ denotes the mutant vector of summary solution P at the t-th generation, whose nth element is denoted by $v_{P,n}(t)$. In equation (13), $U_r(t)$ is a random target vector chosen from the set of summary solutions at the t-th generation, where $r \neq P$. The mutation factor at the t-th generation, denoted by $F(t)$, is calculated with equation (12).
To keep the values of the mutant vector within the boundary constraints, value conformation is applied according to equation (14).

Crossover
The crossover sub-process combines target vectors and mutant vectors from the set of summary solutions. The results of this sub-process are henceforth called trial vectors. Each element of a trial vector is chosen from either the target vector or the mutant vector, as shown in equation (15). In equation (15), $Z_P(t)$ denotes the trial vector of summary solution P at the t-th generation, whose nth element is denoted by $z_{P,n}(t)$. $rand_{P,n}$ denotes a uniform random number and $CR_P$ denotes the crossover rate, which is obtained with equation (17). The coefficient k is a random integer ranging from 1 to n, which ensures that at least one component of the mutant vector is used to form the trial vector.
In equation (17), to calculate the crossover rate, the relative distance, denoted by $RD_P$, first has to be calculated with equation (16), where the fitness function, denoted by $f(\cdot)$, is calculated with equation (11). The hyperbolic tangent function, denoted by $\tanh(\cdot)$, can be calculated using equation (18).
After this sub-process is completed, binarization is performed to transform the resulting trial vectors, which contain real values, into binary vectors.
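The element-wise choice between target and mutant described for the crossover sub-process matches the usual binomial crossover of Differential Evolution, which can be sketched as follows. The sketch takes the crossover rate as a plain argument instead of deriving it from the relative distance, because equations (16) and (17) are not reproduced here; the name `crossover` and its parameters are hypothetical, and indices are 0-based where the text counts from 1.

```python
import random

def crossover(target, mutant, cr, rng=None):
    """Build a trial vector by picking each element from the mutant or the target."""
    rng = rng or random.Random()
    k = rng.randrange(len(target))  # this position always comes from the mutant
    return [
        m if (rng.random() <= cr or n == k) else u
        for n, (u, m) in enumerate(zip(target, mutant))
    ]

target = [0.0] * 5
mutant = [1.0] * 5
print(crossover(target, mutant, 0.5, random.Random(7)))
```

With a crossover rate of 1 the trial vector equals the mutant; with a rate near 0 only the forced position k is inherited from the mutant.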

Selection
Selection is a sub-process that produces a new set of target vectors for the next generation. The new target vectors are composed of the old target vectors and trial vectors with the highest fitness function values. The next-generation target vector of summary solution P, denoted by $U_P(t+1)$, is taken either from the current-generation trial vector, denoted by $Z_P(t)$, or from the current-generation target vector, denoted by $U_P(t)$, based on the fitness scores of both vectors, as shown in equation (19).

... for each topic in a document set. Therefore, a total of 128 summaries (64 × 2) are produced from 64 sets of documents by a summarization method.

Results and Analysis
The proposed method is compared with the CoDiCo method [10] from prior work and with the Luhn [11] and Kullback-Leibler [12] text summarization algorithms. The number of clusters is set to 2 for every set of documents. Both the proposed method and CoDiCo use 0.9 as the sentence similarity threshold in the sentence ordering phase. In the initialization phase, both methods use 3, 11, -5, and 5 as the values of the population size, maximum generation, lower bound, and upper bound, respectively. We include CoDiCo to compare four weighted criteria against three weighted criteria. The performance of the proposed method is also tested by clustering the documents according to each topic of the dataset.
The experiment is implemented in MATLAB 2016a on the Windows 10 operating system. The experimental results are evaluated using Recall-Oriented Understudy for Gisting Evaluation-N (ROUGE-N) [7], where N indicates the n-gram length. ROUGE-1 and ROUGE-2 are used in this experiment. Evaluation metrics such as recall, precision, and F-measure are calculated using ROUGE-N. ROUGE-N is measured based on the summary's quality factors, such as coverage, diversity, coherence, and heterogeneity.
Table 1 shows that, according to the ROUGE-1 and ROUGE-2 scores, the proposed method outperforms CoDiCo in all aspects. When extracting a summary, both methods focus not only on the relevance of sentences to the whole sentence collection but also on how well the sentences represent the topic. CoDiCo considers only the intra-cluster quality criteria, namely coverage, diversity, and coherence. In contrast, the proposed method deals not only with intra-cluster compactness but also with the separation between clusters. The proposed method can therefore summarize multiple documents even when the input spans multiple categories.
However, according to Figure 4 and Figure 5, the deviation values of the proposed method, CoDiCo, Luhn, and Kullback-Leibler do not differ significantly in either unigram or bigram evaluation. Several factors may cause this, such as the selection of documents in the clustering process. The result of k-means clustering is used as input for the next step without any clustering evaluation, so if a mistake occurs in this step, some documents may be assigned to the wrong cluster. The clustering result is then processed by SaDE in the optimization process to obtain the candidate summaries. In short, the result of the clustering process influences the final summary.
Figure 4 and Figure 5 also show that Kullback-Leibler and Luhn cannot give results as good as the proposed method. Kullback-Leibler uses the probability of word frequency for each sentence, and sentences with higher values are used in the summary. Luhn, in addition, only uses the significance of words to summarize the documents, without considering the frequency or the similarity between words and sentences.
The summarization results were evaluated using ROUGE. ROUGE recall measures how many n-grams of the reference summary also exist in the generated summary, and ROUGE precision measures how many n-grams of the generated summary also exist in the reference summary. One reason why the precision value is low compared to the recall value is that the generated summary contains more sentences than the reference summary, so a smaller share of its n-grams overlaps with the reference.
Figure 6 and Figure 7 show no clear difference between cluster 1 and cluster 2. One possible factor is that the topics used in the experiment were not entirely different, so both clusters sometimes use the same terms in their documents. However, in some conditions the k-means clustering result is good, yet the ROUGE value of the proposed method is not significantly different from that of CoDiCo. One possible factor is the effect of the fitness function. Both methods use the fitness function to choose the best summary from the existing candidate summaries. In the proposed method, the fitness function is calculated from the values of coverage, diversity, coherence, and heterogeneity. Nevertheless, the value of each criterion has a different interval, which can influence the resulting summary. Among these criteria, diversity has the largest value, so the summary mainly represents the spread of document terms. In future research, the fitness function can be replaced by a weighted function that has a coefficient for each criterion.

Conclusion
This paper proposed an inter- and intra-cluster method that uses four criteria for multi-document summarization. The method considers not only the compactness within each cluster (intra-cluster) but also the separation between clusters (inter-cluster). Experimental results on TAC 2008 demonstrate the effectiveness of the model. In addition, the proposed method outperforms CoDiCo, a model that considers only intra-cluster quality using three weighted criteria. In future research, we will investigate the performance of other clustering algorithms and use a weighted value for each fitness function criterion.