Arabic Text Summarization Using Latent Semantic Analysis

This work was carried out in collaboration between all authors. Authors FMB-A and GHG designed the study, performed the statistical analysis, and wrote the mathematical model.


INTRODUCTION
The rapid increase in the amount of online text information causes many problems for users due to information overload. One of these problems is the lack of an effective technique for locating the required information. Text search and text summarization are two important techniques for handling this problem [1,2]. A search engine is used to find the set of relevant documents, while a text summarization tool is used to distill the desired information from those documents [2]. Automatic text summarization (ATS) is the process of reducing a text document with a computer program in order to create a summary that retains the most important points of the original document and is less than half of the original document's length (typically around 30%) [3].

The process can be partitioned into three phases: analysis, transformation, and composition. The analysis phase is concerned with extracting text features and selecting the important ones. The transformation phase is concerned with representing the summary based on the features selected during the previous phase. The composition phase is concerned with generating an appropriate summary. The resulting summary should contain the necessary information in a cohesive and coherent manner. Cohesion is concerned with the surface-level structure of the text: it is defined by the grammatical and lexical structures that relate parts of the text to each other, such as pronouns, conjunctions, and time references. Coherence, in contrast, is concerned with the semantic-level structure of the text.

A summary can be created from a single document or from multiple documents. Generally, there are two approaches to automatic summarization: extraction and abstraction. Extractive methods work by selecting a subset of existing words, phrases, or sentences in the original text to form the summary. In contrast, abstractive methods build an internal semantic representation and then use natural language generation techniques to create a summary that is closer to what a human might generate; such a summary may contain words not explicitly present in the original.

Summarization systems for Arabic text, however, are still not as sophisticated or as reliable as those developed for English and other European languages. The resources and software tools available for Arabic text summarization are still limited, and researchers and software developers should do more in this area. The main goal of this paper is to develop and implement a generic text summarization algorithm based on the latent semantic analysis (LSA) method. LSA is a semantic technique for analyzing relationships between a set of sentences. It deals with both the word description and the sentence description of each concept or topic. LSA creates the word-by-sentence semantic matrix of a document or documents; each word in a matrix row is represented by one of its variations, such as its root, its stem, or the original word. An Arabic corpus is used for evaluating the algorithm's performance [2].

RELATED WORK
Many different text summarization methods exist in the literature. Most of them are extractive, while others are abstractive. Extractive methods are concerned with extracting the most important topics of the input documents and with selecting the sentences most related to those topics to generate the desired summary. Such methods are based on surface-level information, statistics, knowledge bases (ontologies and lexicons), and so on. They can be classified into six classes:

Surface Level Method
The idea behind this method is term frequency: the most frequent terms are considered the most important ones. Sentences that include those frequent terms are considered more important than other sentences and are selected for inclusion in the output summary.
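To make the idea concrete, the following minimal Python sketch scores sentences by the document frequencies of their terms and keeps the top-k; the function name and the whitespace tokenizer are illustrative assumptions, not part of any surveyed system.

```python
from collections import Counter

def surface_level_summary(sentences, top_k=3):
    """Rank sentences by the document frequency of their terms (a sketch)."""
    # Document-wide term frequencies.
    freq = Counter(w for s in sentences for w in s.lower().split())
    # A sentence's score is the sum of its terms' document frequencies.
    ranked = sorted(sentences,
                    key=lambda s: sum(freq[w] for w in s.lower().split()),
                    reverse=True)
    return ranked[:top_k]
```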

Statistical Method
The idea behind this method is to use relevance information extracted from lexicons such as WordNet together with natural language processing techniques. For instance, the count for the term "automobile" is incremented whenever the word "car" is encountered.
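As an illustration only (not code from the surveyed systems), the following sketch uses NLTK's WordNet interface to fold synonym occurrences into a shared count, so that seeing "car" also increments "automobile"; it assumes NLTK is installed and its WordNet corpus has been downloaded (`nltk.download('wordnet')`).

```python
from collections import Counter
from nltk.corpus import wordnet as wn  # requires the NLTK WordNet corpus

def synonym_aware_counts(tokens):
    """Count tokens and also credit their WordNet synonyms (a sketch)."""
    counts = Counter()
    for token in tokens:
        counts[token] += 1
        # Credit every lemma that shares a synset with this token.
        for syn in wn.synsets(token):
            for lemma in syn.lemma_names():
                if lemma != token:
                    counts[lemma] += 1
    return counts
```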

Text Connectivity Based Method
This method extracts semantic relations between terms, such as synonymy and antonymy, using lexicons and WordNet. Lexical chains based on these semantic relations are constructed and used to extract the important sentences of the documents.

Graph Based Method
This method uses graph concepts: each node in the graph represents a sentence, and each edge represents the similarity between the connected sentences.
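A hedged sketch of this family of methods (in the spirit of TextRank, not the paper's own algorithm): given a precomputed sentence-similarity matrix, a PageRank-style power iteration scores each sentence by its centrality in the graph.

```python
import numpy as np

def graph_rank(sim, d=0.85, iters=50):
    """PageRank-style centrality over a sentence-similarity graph (a sketch)."""
    S = sim.copy().astype(float)
    np.fill_diagonal(S, 0.0)            # no self-loops
    col = S.sum(axis=0)
    col[col == 0] = 1.0                 # guard against empty columns
    M = S / col                         # column-stochastic transition matrix
    n = S.shape[0]
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - d) / n + d * M @ r     # damped power iteration
    return r                            # higher score = more central sentence
```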

Machine Learning Based Method
This method assumes that text features are either independent or dependent. Machine learning based summarization algorithms use techniques such as Hidden Markov Models, log-linear models, decision trees, and neural networks.

Latent Semantic Analysis Method
This method computes the similarity between sentences and terms based on singular value decomposition. A few existing projects concern text summarization; those most closely related to this work are surveyed below.

Md. Monjurul Islam and A. S. M. Latiful Hogue [1] developed an automated essay grading system, AEG, using Generalized Latent Semantic Analysis (GLSA), which builds an n-gram by document matrix instead of a word by document matrix. They evaluated the system using a detailed representation and reported that the proposed AEG system achieved a higher level of accuracy compared to a human grader.
Yingjie Wang and Jun Ma [2] proposed a comprehensive LSA-based text summarization algorithm that combines term description with sentence description for each topic. They reported that their approach obtains higher ROUGE scores than several well-known methods.
Rui Yang et al. [3] proposed a Chinese summarization method based on Affinity Propagation (AP) clustering and latent semantic analysis (LSA). They reported obtaining more comprehensive, higher-quality summaries.
Madhuri Singh, Member IAENG, and Farhat Ullah Khan [4] developed a summarizer that produces an effective and compact summary using a probabilistic approach to LSA. They used incremental EM instead of standard EM and compared the performance of the two; they stated that the experimental results prove that incremental EM makes the summarizer faster than standard EM.
Jen-Yuan Yeh et al. [5] proposed two approaches to text summarization: a modified corpus-based approach (MCBA) and an LSA-based T.R.M. approach (LSA + T.R.M.). They evaluated LSA and T.R.M. both on single documents and at the corpus level, at several compression rates, on a corpus of 100 political articles. They reported that at a compression rate of 30%, average f-measures of 49% for MCBA, 52% for MCBA + GA, and 44% and 40% for LSA + T.R.M. at the single-document and corpus levels, respectively, were achieved.
A N K Zaman et al. [6] evaluated the use of English stop word lists in Latent Semantic Indexing based information retrieval systems with large text datasets. They compared three different lists: two compiled by IR groups at the University of Glasgow and the University of Tennessee, and one of their own developed at the University of Northern British Columbia. They reported that tailored stop word lists improve retrieval performance compared to non-tailored ones.
Makbule Gulcin Ozsoy et al. [7] extracted important information from large amounts of text data using two Latent Semantic Analysis (LSA) algorithms. They evaluated both algorithms on Turkish documents and compared their performance using ROUGE-L scores; one of the two produced the best scores.
Thomas Hofmann [8] proposed a novel method for unsupervised learning, called Probabilistic Latent Semantic Analysis (PLSA), which is based on a statistical latent-class model. He reported experimentally verifying its claimed advantages in terms of perplexity on text and linguistic data, as well as in an application to automated document indexing, achieving substantial performance gains in all cases. PLSA is thus considered a promising novel unsupervised learning method with a wide range of applications in text learning, computational linguistics, information retrieval, and information filtering.
Michal Campr and Karel Jezek [9] developed a method for comparative summarization and compared its results with those of a similar method based on Latent Semantic Analysis.
The authors of [10] proposed a method using sentence location. They stated that the method significantly improved automatic speech summarization performance at a 10% summarization ratio. They also reported that correlation analysis between subjective and objective evaluation scores confirmed that objective metrics, including summarization accuracy, sentence F-measure, and ROUGE-N, are effective for evaluating summarization techniques.
Rasha Mohammed Badry et al. [11] introduced an approach that summarizes a text using semantics-oriented analysis to determine the important sentences. They used an algebraic method, Latent Semantic Analysis (LSA), to determine the important sentences and stated that they obtained successful results.
The authors of [12] applied both LSA and PLSA in a system for grading essays written in Finnish, called Automatic Essay Assessor (AEA). They compared PLSA and LSA results on three essay sets from various subjects and stated that the methods were found to be almost equal in accuracy, as measured by the Spearman correlation between the grades given by the system and by a human.

Jasminka Dobša and Bojana Dalbelo Bašić [13] introduced a method for handling the addition of new documents to a collection when the documents are represented in a lower-dimensional space by concept indexing. They mentioned that the proposed method was tested on an information retrieval task.
On the other hand, abstractive methods have also been introduced. Abstractive summarization algorithms attempt to understand the input text, even when topics are not explicitly stated, and create new sentences as the output summary. Such algorithms resemble the way humans summarize; unfortunately, matching human summary performance is very difficult. These algorithms produce the summary based on ontologies, fusion, compression, and extracted concepts.

LATENT SEMANTIC ANALYSIS
Latent Semantic Analysis (LSA) is an algebraic technique used to analyze relationships between a set of sentences by producing a set of concepts related to those sentences. LSA assumes that words which are close in meaning will occur close together in text, so it can handle both the problem of identifying synonymy and the problem of polysemy. LSA uses Singular Value Decomposition (SVD) to decompose matrices. SVD is a mathematical process that is often used for data reduction, but also for classification, document search, and text summarization. Consider the SVD of a matrix A of size m×n whose rank is r, with m ≥ n: there exist an orthogonal matrix U = (u1, u2, …, un) of size m×n, a diagonal matrix Σ = diag(σ1, σ2, …, σn) of size n×n whose singular values appear in decreasing order, with σ1 ≥ σ2 ≥ … ≥ σr > 0 and the remaining singular values equal to zero, and an orthogonal matrix V = (v1, v2, …, vn) of size n×n, such that A = U Σ V^T.
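As a minimal illustration of the decomposition described above, the following Python/NumPy sketch computes the thin SVD of a toy term-by-sentence matrix and reconstructs it from its k strongest concepts; the matrix values are invented for illustration only.

```python
import numpy as np

# A toy term-by-sentence matrix A (m=5 terms, n=4 sentences, m >= n).
A = np.array([[1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [1., 1., 0., 0.],
              [0., 0., 1., 1.],
              [1., 0., 0., 1.]])

# Thin SVD: A = U @ diag(s) @ Vt, singular values s in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k strongest concepts (dimensionality reduction).
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```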

Dimensionality Reduction and Document Analysis
First, stop words are eliminated from the document. Next, the document is segmented into sentences, which are the smallest units of extractive summarization, and each sentence is segmented into tokens (words/terms) using white space and punctuation marks as boundary markers. Each word is then analyzed into prefixes, suffixes, infixes, stem, and root [15,16]. Obtaining the stem and root of each term is a very important step in determining term/word frequencies, since it reduces the number of distinct terms; without stemming, the term frequencies would be misleading. The algorithm used in this paper for computing the stem is Ahmed Khorsi's stemmer [17,18], while the algorithm used for computing the root is Abderrahim Boudlal's algorithm [19]. After preprocessing, the document is represented by a matrix whose rows correspond to words and whose columns correspond to sentences. Since each term has three variations (word, stem, and root), the matrix is constructed three times, once per variation. In other words, a document D with m terms and n sentences, such that m > n, is represented as A = [a_ij] of size m×n, where each cell a_ij can be filled using one of three different methods. Once the matrix A is created, SVD decomposes it into three matrices U, Σ, and V^T, where U and V^T are the left and right orthogonal matrices and Σ is the diagonal matrix whose positive singular values appear on the diagonal in decreasing magnitude. The effectiveness of each word-variation representation is measured, and the root is found to be the most efficient representative. The experiment is then conducted again using a part-of-speech (POS) tagger together with the root representation: the part of speech of each word in a given document is identified in order to reduce the document's dimensionality and to resolve ambiguity [19,20].
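The following sketch illustrates the matrix construction described above. The `normalize_term` parameter is a placeholder standing in for the paper's stemmer and root extractor (Khorsi's stemmer, Boudlal's root algorithm), which are not reproduced here; by default it leaves words unchanged.

```python
import numpy as np
from collections import Counter

def build_term_sentence_matrix(sentences, normalize_term=lambda w: w):
    """Build the m x n term-by-sentence matrix A (a sketch)."""
    # Normalize each token (identity here; stem/root extraction in the paper).
    tokenized = [[normalize_term(w) for w in s.split()] for s in sentences]
    vocab = sorted({w for sent in tokenized for w in sent})
    index = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(sentences)))
    for j, sent in enumerate(tokenized):
        for w, tf in Counter(sent).items():
            A[index[w], j] = tf  # raw frequency; see the weighting schemes below
    return A, vocab
```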

Summary Composition
The summary is composed from the important concepts contained in the target text. Each concept can be represented by sentence and word descriptions: its sentences have the largest index values in the corresponding right singular vector, while its words have the largest index values in the corresponding left singular vector. Assume that a document D is decomposed into sentences, D = {s1, s2, …, sn}, where n is the number of sentences, and these sentences form a set C of candidate sentences. M is a predefined number indicating how many sentences are to be included in the summary S, α indexes the currently selected concept, and β is the number of sentences related to the α-th concept. As mentioned earlier, A is decomposed by SVD into U = (u1, u2, …, un), Σ = diag(σ1, σ2, …, σn), and V^T = (v1, v2, …, vn). In the right singular vector space, each sentence j is described by the column vector ψ_j = [v_j1, v_j2, …, v_jr]^T of V^T; in the left singular vector space, each word i is described by the row vector χ_i = [u_i1, u_i2, …, u_ir]. For sentence and word selection, the algorithm starts by sorting both V^T and U by largest index value. For concept α, the α-th right singular vector of V^T is selected; the sentence with the largest index value in that vector is selected and included in the summary, V^T is updated, and the number of sentences for concept α is incremented by 1. The top entries of the α-th left singular vector u_α give a set of words W = {w_p, w_q, …, w_s}, whose size is the number of words describing the concept, as specified in the experiment. Sentence selection for concept α then proceeds by deleting from W the words common to W and the current sentence; the process continues selecting sentences for the same concept, updating V^T, W, and the sentence count for the current concept, until W becomes empty (W = ɸ). The algorithm then increments the concept index and repeats the sentence selection for the next concept. Algorithm 1 gives the formal description of sentence selection for each concept [1,21].
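A simplified sketch of the selection loop described above (Algorithm 1 itself is not reproduced in the text); the coverage test, tokenization, and tie-breaking here are assumptions made for illustration.

```python
import numpy as np

def lsa_select_sentences(A, vocab, sentences, m_summary=3, top_words=5):
    """Concept-driven sentence selection from the SVD of A (a sketch)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    chosen, used = [], set()
    for concept in range(Vt.shape[0]):
        if len(chosen) >= m_summary:
            break
        # W: top words describing this concept (left singular vector u_alpha).
        W = {vocab[i] for i in np.argsort(-np.abs(U[:, concept]))[:top_words]}
        # Pick sentences for this concept, by descending right-singular value,
        # until its word set W is covered.
        for j in np.argsort(-Vt[concept]):
            if not W or len(chosen) >= m_summary:
                break
            if j in used:
                continue
            sent_words = set(sentences[j].split())
            if W & sent_words:
                chosen.append(j)
                used.add(j)
                W -= sent_words  # delete words common to W and the sentence
    return [sentences[j] for j in sorted(chosen)]
```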

Basic Weighting Methods
There are different methods for filling each cell a_ij of the matrix A, and the cell values can change the resulting summary S. In this experiment a comprehensive method is used that combines a global weight, a local weight, and an adjacent weight for word i in sentence j, where word i has three variations: word stem, word root, and the original word. The method is described by equation (1):

a_ij = L(w_ij) × G(w_ij) + N(w_ij)    (1)

where L(w_ij) is the local weight of word i in sentence j, G(w_ij) is the global weight of word i in the whole document, and N(w_ij) is the adjacent weight of word i in sentence j, which covers four adjacent sentences [22,23]. These sentences are considered in order to capture the semantic features behind the content of the target sentence: two of them occur before the target sentence, while the other two occur after it [24]. On the one hand, the local weight is represented by different patterns, based on alternative formulae applied to the stem, the root, and the word (with and without pronouns):

1. Binary representation: the cell is filled with 1/0, as in equation (2).
2. Word frequency: the cell is filled with the frequency of word i in sentence j, as in equation (3).
3. Augmented weight: the cell is filled with the modified frequency of word i in sentence j, as in equation (4).
4. Logarithm weight: the cell is filled with the logarithm of the modified frequency of word i in sentence j, as in equation (5).

On the other hand, the global weight G(w_ij) can be computed by one of the following strategies:

2. Inverse sentence frequency weight: the cell is filled with the value computed by formula (7), where n is the total number of sentences in the document and n_i is the number of sentences in which word i occurs.
3. Word frequency-inverse sentence frequency: the cell is filled with the wf-isf value of the word; a higher wf-isf value indicates that the word is much more representative of that sentence than of the others in the document, as in equation (8).
4. Log entropy: the cell is filled with the log-entropy value of the word, which indicates how informative the word is in the sentence; it is calculated by formula (9).

Fig. 1. Sentence selection flowchart
In some cases, concepts and topics cannot be realized or disambiguated from one specific sentence alone, but only from the context of a set of preceding or following sentences. In addition, a pronoun may refer back to a specific word in a preceding sentence or in a complement sentence. For these reasons, the adjacent-sentence weight is extended to four sentences rather than two, for a better understanding of the concepts. The adjacent-sentence weight is thus computed as in equation (10), where ψ = 0.5 in this experiment.
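Equations (2)-(10) are not reproduced in the text, so the following sketch assumes standard textbook forms: augmented term frequency for the local weight, log entropy for the global weight, and a ψ-weighted presence indicator over the four neighbouring sentences (two before, two after) for the adjacent weight.

```python
import numpy as np

def weight_matrix(tf, psi=0.5):
    """Sketch of a_ij = L(w_ij) * G(w_ij) + N(w_ij) from equation (1).
    The exact equations (4), (9), and (10) are assumed, not reproduced."""
    m, n = tf.shape
    # Local weight: augmented tf, 0.5 + 0.5 * tf / max tf in the sentence.
    col_max = np.maximum(tf.max(axis=0), 1.0)
    L = np.where(tf > 0, 0.5 + 0.5 * tf / col_max, 0.0)
    # Global weight: log entropy, 1 + sum_j p_ij * log(p_ij) / log(n).
    row_tot = np.maximum(tf.sum(axis=1, keepdims=True), 1.0)
    p = np.where(tf > 0, tf / row_tot, 1.0)  # p=1 so log(p)=0 where absent
    G = 1.0 + np.where(tf > 0, p * np.log(p), 0.0).sum(axis=1) / np.log(max(n, 2))
    # Adjacent weight: psi-weighted presence of the word in the four
    # neighbouring sentences (two before, two after).
    present = (tf > 0).astype(float)
    N = np.zeros_like(tf, dtype=float)
    for d in (-2, -1, 1, 2):
        if d > 0:
            N[:, :n - d] += psi * present[:, d:]
        else:
            N[:, -d:] += psi * present[:, :n + d]
    return L * G[:, None] + N
```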

Data Set and Experiment Setting
The data set used in this experiment is produced and distributed by the Linguistic Data Consortium (LDC) at the University of Pennsylvania, USA. The LDC provides two Arabic collections: the Arabic GIGAWORD corpus and the Arabic NEWSWIRE-a corpus [23]. The source documents are UTF-8 files that include meta-data as well as tags. The dataset contains one hundred documents, which are used as input to the proposed summarizer. The output results (machine summaries), along with the original documents, are distributed to one hundred independent evaluators who are experts, researchers, or lecturers in Arabic Linguistics and Journalism departments. In this experiment, three linguistic models of document representation are used with the proposed summarizer: word root, word stem, and original word. Once the best representative model has been determined empirically, it is combined with another linguistic model, a part-of-speech (POS) tagger [21,25], which is used to improve LSA performance. The combined model is associated with different weighting techniques that specify the cell weights of matrix A. The weighting techniques are derived from the main formula (1):

T1: a_ij = Binary Representation (BR) * Entropy Frequency (EF) + four Adjacent Sentences (4ADJ).
T2: a_ij = Augmented Weight (AW) * Entropy Representation (ER) + four Adjacent Sentences (4ADJ).
T3: a_ij = Logarithm Weight (LW) * Entropy Frequency (EF) + four Adjacent Sentences (4ADJ).
T4: a_ij = Augmented Weight (AW) * Inverse Sentence Frequency (IF) + four Adjacent Sentences (4ADJ).
T5: a_ij = Augmented Weight (AW) * Entropy Representation (ER) + two Adjacent Sentences (2ADJ).
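For reference, the five configurations can be summarized as (local weight, global weight, adjacency window) triples; the sketch below simply restates the list above as a Python mapping.

```python
# The five weighting configurations T1-T5, restated as
# (local weight, global weight, adjacency window) triples.
TECHNIQUES = {
    "T1": ("binary",    "entropy", 4),
    "T2": ("augmented", "entropy", 4),
    "T3": ("logarithm", "entropy", 4),
    "T4": ("augmented", "isf",     4),
    "T5": ("augmented", "entropy", 2),
}
```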

Evaluation
There are two types of summary measures: form measures and content measures. The first is associated with assessing the summary's grammar, organization, and coherence, while the second is associated with assessing precision and recall. There are also automatic evaluation measures such as ROUGE-n. The assessment of the proposed algorithm's results is carried out both manually and automatically: the manual assessment depends on the overall responsiveness of the text, while the automatic assessment depends on the ROUGE-n measure [22,25]:

ROUGE-n = ( Σ_{S ∈ S_h} Σ_{gram_n ∈ S} Count_match(gram_n) ) / ( Σ_{S ∈ S_h} Σ_{gram_n ∈ S} Count(gram_n) )

where S_m is the machine summary (candidate summary), S_h is the human summary (reference summary), n is the length of the n-gram gram_n, and Count_match(gram_n) is the maximum number of n-grams co-occurring in the machine summary and the human summary. For the manual assessment, human evaluators are given three types of summaries, generated by the representative models. The evaluators are asked to evaluate these summaries and to generate one hundred independent ideal summaries for the documents under the following constraints: each extracted summary is assigned an integer grade in the range 1 to 5 based on its overall responsiveness; every word of the summary must belong to the words of the original documents; a summary is assigned 5 if it covers the important concepts of the related documents with language fluency and readability [26]; a summary is assigned zero if it is unreasonable or unreadable, or if it contains very limited information from the related documents; and each summary's size should be about 25% of the original document. Once the human summaries are collected, one of the hundred is selected by Arabic Linguistics and Journalism experts to serve as the reference.
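A minimal sketch of the ROUGE-n computation defined above, for a single reference summary; the whitespace tokenizer is an assumption.

```python
from collections import Counter

def rouge_n(machine, reference, n=1):
    """Recall-oriented ROUGE-n between a machine summary and one human
    reference, following the Count_match definition above (a sketch)."""
    def ngrams(text):
        toks = text.split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    m, r = ngrams(machine), ngrams(reference)
    # Clipped matches: an n-gram counts at most as often as it appears
    # in the machine summary.
    overlap = sum(min(cnt, m[g]) for g, cnt in r.items())
    total = sum(r.values())
    return overlap / total if total else 0.0
```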

Experiment Results Analysis
In this section, the algorithm's results and their assessment are analyzed and presented. As mentioned in section 5.3, the assessment is conducted both manually and automatically: the manual assessment is based on overall human responsiveness, while the automatic assessment is based on the ROUGE method. The overall grading of the representation models and the human responsiveness scores are shown in Table 1, Fig. 2, Table 2, and Fig. 5, respectively. The implementation of formula (10) is also presented along with the linguistic representation models.
The root model outperforms the other two models (stem and original word): the F-score and average ROUGE of the root model are 0.6267361 and 0.485, respectively. In this experiment, the root model is thus the most representative of the linguistic models considered. Further results are shown in Table 3, Table 4, Fig. 3, and Fig. 4. The experiment is therefore repeated with the root as the text representative in order to compare the weighting techniques T1, T2, T3, T4, and T5 derived in section 5.2. After implementing these techniques with the root representative model, the performance is measured and an F-score of 0.6779 is obtained. The new results are shown in Table 5 and Fig. 6.
Since the weighting technique T2 has the highest F-score (0.6779), as shown in Table 5, it outperforms the other weighting techniques included in this experiment. Although T2 and T5 combine the same features, T2 outperforms T5 because it uses four adjacent sentences rather than only two. To further improve the summarizer's performance, the same experiment is repeated using the part-of-speech (POS) tagger as a text preprocessor to resolve ambiguities in the text content, such as pronouns. The weighting techniques are then applied again to the same dataset, and the results, recorded in Table 6 and Fig. 6, confirm that T2 is more efficient than the other techniques: its ROUGE-1 is 0.67408465 and its average ROUGE is 0.595.

CONCLUSION
In this paper, an improved Arabic text summarization algorithm based on the LSA method is proposed. The algorithm concentrates on the word and sentence descriptions (specifications) of each concept. Each word is represented by one of the Arabic word variations (root, stem, and original word), and the algorithm determined that the root is the most efficient representative of the word, with a computed F-measure and average ROUGE of 0.6267 and 0.46, respectively. The algorithm was therefore run again with the root representation and several different weighting techniques, and the optimal combination was determined empirically as the most efficient and accurate for text summarization.
The efficiency and accuracy arise when the algorithm combines features such as augmented weighting, entropy representation, and four adjacent sentences. This combination, called T2, is the most efficient technique among those included in the experiment, with a computed F-score of 0.6779. Finally, the POS tagger is used as a preprocessing tool to disambiguate the input text, and the algorithm is run again, obtaining an average ROUGE of 0.595, as shown in Table 6 and Fig. 7. The empirical results indicate that the proposed algorithm obtains higher scores than several well-known methods. The algorithm's performance nevertheless has limitations, so future work should build on neural networks and genetic algorithms.