A Framework for Word Embedding Based Automatic Text Summarization and Evaluation

Abstract: Text summarization is the process of producing a concise version of a text (a summary) from one or more information sources. If the generated summary preserves the meaning of the original text, it helps users make fast and effective decisions. However, how much meaning of the source text is preserved is becoming harder to evaluate. The most commonly used automatic evaluation metrics, like Recall-Oriented Understudy for Gisting Evaluation (ROUGE), strictly rely on the overlapping n-gram units between reference and candidate summaries, which makes them unsuitable for measuring the quality of abstractive summaries. Another major challenge in evaluating text summarization systems is the lack of consistent ideal reference summaries. Studies show that human summarizers can produce variable reference summaries of the same source, which can significantly affect the automatic evaluation scores of summarization systems. Humans are biased toward certain situations while producing summaries; even the same person may produce substantially different summaries of the same source at different times. This paper proposes a word embedding based automatic text summarization and evaluation framework, which can successfully determine the salient top-n sentences of a source text as a reference summary, and evaluate the quality of system summaries against it. Extensive experimental results demonstrate that the proposed framework is effective and able to outperform several baseline methods with regard to both text summarization systems and automatic evaluation metrics when tested on a publicly available dataset.


Introduction
With the tremendous growth of smartphones and web technologies, the amount of text data on the web is increasing exponentially, and users must spend more and more time finding relevant information. This has inspired the development of automatic text summarization systems for producing a concise summary that preserves the core idea of the original document [1]. In other words, any automatic text summarization system is intended to distill the most relevant information from the original document to create a shortened version. Further, Torres-Moreno [2] provided six reasons why automatic text summarization systems are needed: (1) they create summaries that reduce users' reading time; (2) when researching documents, the generated summaries make the selection process easier; (3) they improve the effectiveness of indexing; (4) automatic summarization systems are less biased than human summarizers; (5) the produced summaries are also useful in question-answering systems; and (6) automatic summarization enables commercial abstract services to increase the number of texts they are able to process.
Text summarization systems can be classified into several groups from different perspectives [3,4]. For instance, based on the input type, summarization can be classified into single and multi-document summaries. As the name suggests, single document summarization systems are designed to generate summary from a single document whereas multi-document summarization systems are intended to generate summary from multiple documents. Based on the purpose, text summarization can also be categorized into generic, domain specific and query-based summary. Generic summaries are produced with regardless of specific topic or domain whereas domain specific and query-based summaries are generated based on specific area and request made by the users respectively. In the literature, the most commonly mentioned types of summaries are extractive and abstractive types that are classified on the basis of output type.
Extractive text summarization involves identification, ranking and merging most important units from the source whereas an abstractive summarization involves selecting and compressing content of the source text to generate a summary of perhaps entirely new sentences. Further, it has been demonstrated that mixed summary can be produced by generating new words and copying fragments from the original text [5]. In order to identify salient sentences in the source text, numerous features have been considered for computing the relevance score of sentences. For instance, word/phrase frequency [6]; common lexical tokens, and location of the sentence [7]; and term frequency-inverse document frequency (TF-IDF) based cosine similarity, and longest common match [8], and different levels of text embeddings for computing cosine similarity [9]. Very recently, following sequence-to-sequence learning schemes in machine translation domain, several extractive [10][11][12], abstractive [13][14][15], and mixed [5,16,17] text summarization models have been proposed.
According to [18], extractive text summarization is the easier and more successful approach when compared to abstractive summarization. However, the reason behind the success of extractive text summarization seems to be that it has been favored by existing automatic evaluation metrics. For example, the most widely used automatic evaluation metrics, like ROUGE, inherently disregard an abstractive summary that contains the same information as an extractive one. Almost all existing automatic evaluation metrics, including ROUGE, have been designed based on ordinary co-selection measurements such as precision, recall, and f-score. These measurements are used to complement techniques such as TF-IDF based latent semantic analysis, term frequency schemes, and n-gram matches between system and reference summaries [19]. We argue that evaluating abstractive summarization models based solely on lexical overlaps is completely unfair. Thus, there has been an awakening interest in meaning-oriented automatic evaluation metrics.
Another challenge in the automatic evaluation of text summarization systems is the lack of a consistent ideal reference summary. Studies have shown that there is low agreement among human summarizers in determining sentences for producing an ideal summary [20]. Even the same person may produce drastically different summaries of the same source text at different times. According to [20], there is more variability among human summaries than among system summaries, by a large margin, and very little agreement between human and system sentence selections.
In this paper, we propose a word embedding based automatic text summarization and evaluation framework. We use the acronyms WETS and WEEM4TS to refer to our proposed word embedding based text summarization method and word embedding based evaluation metric for text summarization respectively, and the code is available at GitHub (https://github.com/TuluTilahun/Text-Summarization). WETS and WEEM4TS are both simple yet surprisingly powerful text summarization and evaluation methods.
Our research questions are as follows: RQ1: For salient top-n sentences determination, how can we leverage publicly available pre-trained word embedding models? In order to answer this research question, we develop a system called word embedding based text summarization (WETS for short), and compare with the baseline systems.
RQ2: Are publicly available pre-trained word embedding models useful for developing automatic evaluation metrics that are suitable to evaluate all kinds of system summaries? To answer this research question, we design a word embedding based evaluation metric for text summarization (WEEM4TS), and compare it with the commonly used automatic evaluation metrics.
Our contributions compared to previous work are as follows:
• We propose an automatic evaluation metric called WEEM4TS for evaluating the performance of text summarization systems. WEEM4TS is designed to evaluate the quality of system summaries with regard to the preserved meaning of the original document. Hence, we believe it represents an appropriate evaluation metric for all types of system summaries: extractive, abstractive, and mixed.
• We propose a method called WETS for determining the most important sentences of the original document that can be used as a reference summary, which helps to address the lack of ground truth summaries. This reference summary is produced carefully so that WEEM4TS can evaluate system summaries against it.
• By comparing with six baseline text summarization systems, we validate the utility of the summaries generated by WETS. We also evaluate the performance of WEEM4TS by correlating it with human judgments. Further, we compare WEEM4TS with commonly used automatic evaluation metrics in the domains of text summarization and machine translation. The experimental results demonstrate that both WETS and WEEM4TS achieve promising performance.
The remainder of this paper is organized as follows: In Section 2, we discuss prior work on text summarization and evaluation of text summarization systems. In Section 3, we describe the proposed approaches. Experimental datasets, tools, baselines, results and discussion are presented in Section 4. Finally, the conclusions and some future research directions are discussed in Section 5.

Related Work
Our review focuses on two lines of literature most relevant to our work: text summarization and the evaluation of text summarization systems.

Text Summarization
Recently, numerous automatic text summarization approaches have been proposed; they can be categorized based on the input type, purpose, and output type. Based on the input type, summarization can be classified into single- and multi-document summarization. Based on the purpose, text summarization can be divided into three main groups: generic, domain-specific, and query-based summarization. Based on the output type, it can also be categorized into extractive and abstractive summarization.
Extractive summarization approaches are intended to generate summaries by selecting important units of the original document. In contrast, abstractive summarization approaches are aimed at generating a summary with new words or phrases. In order to preserve the core idea of the original text, an abstractive summarization approach requires advanced NLP tools such as ideal semantic parsers and language generation systems. Therefore, pure extractive summarization methods are more applicable than abstractive methods [18]. On the other hand, as it requires extraction and preprocessing at an early stage, it is difficult to generate pure abstractive summaries. For instance, in order to generate an abstract summary from multiple documents, the model in [21] has been trained to extract relevant sentences first and then perform compression. Similarly, [22] used statistical methods to extract promising sentences to train a model that can mimic human summarizers in generating abstractive summaries.
In the literature, most extractive summarization systems are purely heuristic. For instance, the study in [7] encompasses several features such as sentence position and pragmatic words; [12,23] pick the first three sentences of the source text as a summary; the longest common substring and term frequency were used for text ranking in [8]; and the overlapping of frequent words with the title, heading, and query words was used in [6,24]. The presence of frequently occurring words in a sentence might contribute to the relevance of the sentence [25]. In contrast, some common words (like stop words) make no significant contribution in conveying the importance of a sentence. To address this discrepancy, inverse document frequency has been employed in [26,27]. Further, to extract the most relevant sentences from the source text, the fuzzy logic technique has been applied in [28].
Supervised and unsupervised learning approaches have also been considered to perform extractive summarization as classification [29] and clustering [30] tasks respectively. A trainable summarization model has also been proposed in [31]. The goal is to generate a summary by capturing several features: sentence position; positive and negative keywords in the sentence; sentence similarity with other sentences; sentence resemblance to the title; occurrence of proper nouns in the sentence; inclusion of numeric data in the sentence; relative sentence length; bushy path of the sentence; and summation of similarities for each sentence. The authors first investigated the effect of each sentence feature on the summarization task. Then, to obtain a suitable combination of feature weights, they used the score functions of all features to train a genetic algorithm and mathematical regression models, which learn relevant features from summaries produced by human experts.
Perhaps humans often generate a summary by creating new sentences rather than simply extracting and concatenating fragments of the source text. For automatic summarizers, the problem of generating new condensed text from a long source text is relatively difficult. In line with this, the automatic evaluation of abstractive summaries is a very challenging task. Consequently, abstractive text summarization has received comparatively less attention. However, recently, several methods have been proposed to generate abstractive summaries. For instance, the study in [32] followed the statistical machine translation scheme to generate a short title. Similarly, the syntactic tree and hidden Markov model have been used to produce summaries as headlines [33]. In some studies, manually crafted rules have been used to remove less important parts of the sentences and retain the most relevant components for compression [22,34]. More recently, several methods or features have been considered to generate good quality summaries: employing embedding models at different levels [9,35,36], query-based summarization [37], attention mechanism oriented sequence learning [14,[38][39][40], a pointer-generator network applied to hybrid text summarization [5], and hierarchical or graph based summarization models [41][42][43]. Studies in [9,35,36] leveraged word embedding models for extractive text summarization, which makes them the most related to ours. However, the way we use the vector values of the words and compute the sentence relevance score is different. For instance, in [9,35,36], the vector values of the words have been used to form sentence-level embeddings, which are later used for calculating the sentence relevance score, whereas sentence embedding is not required in our proposed text summarization system. A direct comparison with these studies is difficult because the dataset used in each work is different, and it was not feasible for us to reproduce their results.
For example, experimental datasets used in [9] are multi-document and multilingual, which are not appropriate in our case because our text summarization method is intended for single document and monolingual investigation.
Moreover, deep learning approaches have also been proposed for generating summaries [44][45][46]. Noticeably, deep learning approaches require large datasets. Thus, to satisfy this demand, several studies have proposed ways of collecting large original-summary pairs. For instance, [47] employed a deep learning approach to identify and prepare large-scale, high-quality document-summary pairs. Likewise, a high-quality dataset of 1.3 million article-summary pairs was prepared and made publicly available by [23]. The study in [48] also proposed a large-scale corpus for the Chinese text summarization task. These can help to address the lack of publicly available datasets for improving the quality of text summarization models. Further, there are also a few public and private datasets that have been used to train, validate, and test text summarization models. For example, the CNN/Daily Mail dataset [49]; DUC 2004 Task [50]; Gigaword summarization dataset [40,51]; NEWSROOM dataset used to train the Abs-N summarization model [23]; Webis-TLDR-17 Corpus [52]; and Google dataset [53] have been used to train summarizer models for generating different types of summaries: extractive [10][11][12], abstractive [13][14][15]53,54], and mixed [5,16,17].

Evaluation of Text Summarization Systems
Automatic evaluation metrics can be broadly classified into two categories: extrinsic and intrinsic methods [55].
Intrinsic methods are intended to evaluate quality of system summaries in terms of ideal reference summaries [7] or by comparing with the important units of original document [56,57]. In contrast, extrinsic evaluation methods do not compare system and human reference summaries, and rather assess the impact of generated summaries on the performance of other NLP systems. For example, the quality of generated summaries can be determined by their suitability for surrogating original documents for categorization [58,59], information retrieval [60], and question-answering tasks [61]. Thus, in the literature, extrinsic evaluation methods are sometimes referred to as task-based evaluation methods.
For intrinsic evaluation, standard information retrieval measurements such as precision, recall, and f-score have been used in several automatic evaluation metrics. For instance, the most widely used automatic evaluation metric, ROUGE (Recall-Oriented Understudy for Gisting Evaluation), relies on these measurements [62]. Although ROUGE has been found to correlate well with human judgments [63], it is not capable of evaluating abstractive system summaries. Further, as extractive summarization sometimes dangles anaphoric references, these metrics are limited in assessing the coherence of system summaries. To tackle this problem, the study in [64] engaged two jurors to complement the automatic coherence evaluation of generated summaries. Another semi-automatic method called Pyramid involves manual judgments to quantify the relative importance of the facts to be conveyed [65].
In order to improve the performance of automatic evaluation metrics, a series of text summarization evaluation campaigns have been organized since the late 1990s [3]; for example, SUMMAC (1996-1998) [66], DUC (the Document Understanding Conference, 2000-2007) [50], and more recently TAC (the Text Analysis Conference, 2008-present). However, the automatic evaluation of text summarization systems remains a challenging issue. The lack of a standard evaluation metric has made summary evaluation difficult [3]. Existing automatic evaluation metrics that have been developed based on co-selection measurements, n-gram matching, and term frequency are not appropriate for evaluating abstractive summaries. They all fail to provide equal or close scores for system summaries that convey the same information with different words and phrases. According to [67], ordinary recall and precision are unsuitable measurements for text summaries because a small difference in the system summary can noticeably affect the evaluation score. In order to address this problem, [68] proposed a Roget's Thesaurus based sentence-ranking method. However, because of the acknowledged challenges of thesaurus construction, it is difficult to obtain an ideal thesaurus [69]. It is preferable to employ other publicly available resources, as in the work most similar to ours [70]. However, the word weighting and final score computing techniques in our study differ from those of [70].

Methods
As depicted in Figure 1, the proposed framework consists of two major components: generation of reference summaries and evaluation of system summaries. We describe them in the following sections; an example is provided in Appendix A.
Figure 1. The general workflow architecture of the proposed framework for automatic text summarization and evaluation. Given original sentences O = O_1, O_2, O_3, ..., O_m with their corresponding sequences of words o_ij = o_i1, o_i2, o_i3, ..., o_in, assign the highest cosine similarity value to the words by comparing them with the keywords. Keywords are non-stop words from the first sentence of the original document and the top-k frequent words from the document (we experimented with values between 1 and 15, and found the optimal value to be k = 6). Then sum up all weights (Σ) and divide by the number of words (n) in the corresponding sentence for ranking. The top-y sentences are then considered as a reference summary.
Although human generated summaries have become the de facto ground truth, they are not available for most application domains. Another challenging issue in the domain of text summarization is the lack of semantic oriented automatic evaluation metrics. Thus, in this study, we propose a word embedding based text summarization and evaluation framework. Word embedding is the most commonly used vector representation of words. The study in [71] introduced two efficient, high-quality word representation model architectures that were successfully trained on millions of words. The authors improved the quality of the word vectors in their later work [72]. Word embeddings have the capability of preserving syntactic and semantic regularities [73]. Further, sub-word level embedding was introduced by [74], which helps to preserve morphological regularity. In this study, we use word embedding models that have been developed based on three different algorithms: Word2Vec [71], GloVe [75], and FastText [74].

Word Embedding Based Text Summarization
In order to answer the first research question (RQ1), we propose word embedding based text summarization (WETS) method for identifying, ranking and concatenating salient top-y sentences that can be used as a reference summary.
The most commonly used sentence relevance judgment method in the literature is checking for the presence of frequent and pragmatic words. In view of that, a sentence containing more of these words is considered most important. However, we argue that this technique has two major drawbacks: first, it encourages redundancy in the new summary; second, it fails to assign an appropriate score to very important sentences composed of other words. Thus, a redundancy handling mechanism and a meaning oriented sentence relevance assessment technique are vital.
Accordingly, in this study, the preliminary task is removing irrelevant tokens like stop words from the original document. Then, we use the words of the first sentence together with frequent words as keywords. The main reason we focus on the words of the first sentence is that the linguistic literature shows that an explicit thesis statement is mostly located at the beginning of the paragraph [76], which indicates that important words might exist in the first sentence. In addition, at least a few words of the title might also exist in the first sentence. It would be possible to conduct a comparative analysis by considering words in the middle sentences or words in the last sentence; however, this is beyond the scope of this paper.
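The keyword selection step described above can be sketched in a few lines of Python. The stop-word list, helper names, and sample text below are illustrative assumptions, not the paper's exact implementation (which would use a full stop-word list):

```python
from collections import Counter
import re

# Illustrative stop-word list; a real system would use a full list (e.g., NLTK's).
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "and", "on", "it"}

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def extract_keywords(document, k=6):
    """Keywords = non-stop words of the first sentence plus the top-k most
    frequent non-stop words of the whole document (the study found k = 6)."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", document.strip()) if s]
    first_sentence_words = [w for w in tokenize(sentences[0])
                            if w not in STOP_WORDS]
    counts = Counter(w for w in tokenize(document) if w not in STOP_WORDS)
    frequent = [w for w, _ in counts.most_common(k)]
    return set(first_sentence_words) | set(frequent)
```

For example, for a short news item beginning "Tornado hit the city.", the keyword set would contain "tornado", "hit", and "city", plus the most frequent non-stop words of the rest of the document.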
As depicted in Figure 1 and Algorithm 1, the cosine similarity between each word of each sentence in the document and the keywords is computed, and the value of the most similar keyword is taken as the weight of the target word. If the target word does not exist in the vocabulary of the word embedding model, it is assigned a weight of 0. This weighting technique helps to allot a fair score to relevant sentences composed of words different from the keywords. Further, in order to discourage redundancy, words of other sentences that also exist in the first sentence are ignored. Thus, the relevance scores of sentences composed of words from the first sentence become lower. The core loop of Algorithm 1 is:

for sentence in sentences:
    sentweight ← 0
    for token in tokenized(sentence):
        weight ← max(cosine_similarity(token, keywords))
        sentweight ← sentweight + weight
    relevancescore ← sentweight / len(nonstopwords)
top-y ← put_sentence_in_order(relevancescore)
return top-y

The obtained cosine similarity values (weights) of all words in each sentence are added and then divided by the updated length of the corresponding sentence, as in Equation (1). Updated length means the length of the sentence after the removal of redundant and stop words. Based on the relevance scores, the sentences are ranked from top to bottom. Finally, the top-y sentences are concatenated as per the required length. It should be noted that although the sentence relevance score is calculated on the basis of semantic similarity, WETS is an extractive text summarization system. In this study, the length of the reference summary is variable, i.e., it can be adjusted according to the required length of the system summaries. This helps to compare system summaries of different lengths. For instance, assume an original document comprises 200 words, and the summaries of two text summarization systems, A and B, comprise 150 and 100 words respectively. Then the length of the reference summary can be adjusted to 150 words for system A and 100 words for system B by picking the required number of words from top to bottom. However, a longer system summary might be favored. To control this, the length of the system summary should comply with a pre-determined threshold or compression ratio. In this study, a longer system summary is indirectly penalized by the modified bigram precision, as in Equation (3).

RelevanceScore(O_i) = (Σ_j w_ij) / |O_i|        (1)

where w_ij is the highest cosine similarity value between the word o_ij and the keywords, and |O_i| represents the updated length of the sentence at the i-th position in the original document.
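The sentence scoring of Equation (1) can be sketched as follows, assuming a toy in-memory vector table in place of a real pre-trained Word2Vec/GloVe/FastText model; the words, vector values, and function names are illustrative only:

```python
# Toy word vectors standing in for a pre-trained embedding model;
# all words and values below are illustrative assumptions.
VECTORS = {
    "storm":   [0.9, 0.1, 0.0],
    "tornado": [0.8, 0.2, 0.1],
    "city":    [0.1, 0.9, 0.2],
    "damage":  [0.2, 0.8, 0.5],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def sentence_score(words, keywords):
    """Equation (1): each word's weight is its highest cosine similarity to
    any keyword (0 if out of vocabulary); the score is the mean weight."""
    if not words:
        return 0.0
    weights = []
    for w in words:
        sims = ([cosine(VECTORS[w], VECTORS[k]) for k in keywords if k in VECTORS]
                if w in VECTORS else [])
        weights.append(max(sims) if sims else 0.0)  # OOV words weigh 0
    return sum(weights) / len(weights)
```

Sentences are then sorted by this score and the top-y are concatenated as the reference summary.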

Pre-processing
Remove irrelevant units such as stop words and characters from both system and reference summaries.

Word weighting
By using a word embedding model, we compute the cosine similarity between the words in the system and reference summaries. Accordingly, if a word is shared by the system and reference summaries, it receives a score of +1. If a target word in the system summary does not appear in the reference summary but exists in the word embedding vocabulary, the cosine similarity value between the word and the closest reference-summary word in the vector space is taken as the weight of the target word. If neither is the case, a score of 0 is given to the target word.
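This word weighting step can be sketched as follows; the toy vector table stands in for a real pre-trained embedding model, and all names and values are illustrative assumptions:

```python
# Toy vectors standing in for a pre-trained embedding model; the words and
# values are illustrative assumptions, not the paper's actual model.
VECTORS = {
    "good":   [0.7, 0.3, 0.1],
    "decent": [0.6, 0.4, 0.1],
    "boy":    [0.1, 0.2, 0.9],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def word_weight(word, reference_words):
    """Step 2: +1 for an exact match; otherwise the highest cosine similarity
    to any reference word; 0 if the word is out of vocabulary."""
    if word in reference_words:
        return 1.0
    if word in VECTORS:
        sims = [cosine(VECTORS[word], VECTORS[r])
                for r in reference_words if r in VECTORS]
        return max(sims) if sims else 0.0
    return 0.0
```

With these toy vectors, a shared word like "boy" gets weight 1, while "decent" falls back to its similarity with the closest reference word ("good").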

Computing modified unigram recall
Based on the weight values of the words in step 2 above, we compute a modified unigram recall, as in Equation (2). It should be noted that the term modified indicates that the recall used in this study is different from the commonly used recall: the standard recall is based on surface form matching, whereas the modified version relies on both surface form and word embedding based matching. For example, given the system summary [He is a decent boy.] and the reference summary [He is a good boy.], the tokens 'He', 'is', and 'a' are all ignored because they are stop words. The word 'boy' in the system summary also exists in the reference summary, so it receives a score of +1, whereas the word 'decent' in the system summary does not exist in the reference summary. Thus, the highest cosine similarity value between 'decent' and the non-stop words in the reference summary is taken as the weight of the word; in other words, the cosine similarity value between 'decent' and 'good', computed via Word2Vec.
R_unigram = (Σ_i w_i) / n        (2)

where n is the number of words in the reference summary and w_i is the weight value of each word in the system summary. Based on the pre-determined compression ratio, the number of words in the system summary (n′) is expected to comply with the number of words in the reference summary, i.e., n′ ≈ n.
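The modified unigram recall can be sketched as follows; the similarity lookup table is a hypothetical stand-in for embedding-based cosine values, and the pair and value in it are illustrative:

```python
# Hypothetical cosine-similarity lookup standing in for an embedding model.
SIM = {("decent", "good"): 0.9}

def weight(word, reference_words):
    # +1 for a shared word, else the best embedding similarity, else 0.
    if word in reference_words:
        return 1.0
    sims = [SIM.get((word, r), SIM.get((r, word), 0.0)) for r in reference_words]
    return max(sims, default=0.0)

def modified_unigram_recall(system_words, reference_words):
    """Equation (2): the sum of the system words' weights divided by the
    number of (non-stop) words in the reference summary."""
    total = sum(weight(w, reference_words) for w in system_words)
    return total / len(reference_words)
```

For the running example, the system words ["decent", "boy"] against the reference ["good", "boy"] yield (0.9 + 1.0) / 2 = 0.95, instead of the 0.5 that surface matching alone would give.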

Computing modified bigram precision
We also compute the sequential bigram agreement between the system and reference summaries. The bigrams are counted not only based on exact matches, but also on the cosine similarity of the words. Thus, the term modified indicates that the bigram counting technique in this study is different from the conventional one. Accordingly, if the words next to a shared word in the system and reference summaries are the most similar words, then these words together with their preceding word can be counted as a bigram match for computing the modified bigram precision, as in Equation (3). For example, given the system summary [Yesterday tornado hit the city] and the reference summary [Yesterday cyclone caused great loss], 'yesterday tornado' can be counted as a bigram match with 'yesterday cyclone', since 'cyclone' is perhaps the most similar word to 'tornado'.
P_bigram = (number of modified bigram matches) / (n − 1)        (3)

where n is the number of words in the system summary, so that n − 1 is the number of bigrams it contains.
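A hedged sketch of this modified bigram counting follows. The similarity lookup and threshold are assumptions, as is the denominator (taken here as the number of system bigrams); the paper's exact matching rule may differ:

```python
# Hypothetical similarity lookup and threshold; both are illustrative assumptions.
SIM = {("tornado", "cyclone"): 0.8}

def similar(a, b, threshold=0.5):
    # Exact match, or embedding similarity above the (assumed) threshold.
    return a == b or SIM.get((a, b), SIM.get((b, a), 0.0)) >= threshold

def modified_bigram_precision(system_words, reference_words):
    """Equation (3), reconstructed sketch: count system bigrams whose first
    word is shared with a reference bigram and whose second word is either
    shared or the most similar word, over the n - 1 system bigrams."""
    sys_bigrams = list(zip(system_words, system_words[1:]))
    ref_bigrams = list(zip(reference_words, reference_words[1:]))
    matches = sum(1 for (a, b) in sys_bigrams
                  if any(a == c and similar(b, d) for (c, d) in ref_bigrams))
    return matches / len(sys_bigrams) if sys_bigrams else 0.0
```

On the tornado/cyclone example above, 'yesterday tornado' matches 'yesterday cyclone', giving 1 match out of 4 system bigrams.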
In order to compute the final WEEM4TS score, the modified unigram recall and bigram precision are linearly combined with different importance levels, as in Equation (4).
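The final score computation might look like the following sketch, where alpha is a hypothetical weight standing in for the importance levels of Equation (4); its value is not taken from the source:

```python
def weem4ts(unigram_recall, bigram_precision, alpha=0.8):
    """Hedged sketch of Equation (4): a linear combination of the modified
    unigram recall and modified bigram precision. The weight alpha (and its
    default value) is an illustrative assumption, not the paper's setting."""
    return alpha * unigram_recall + (1 - alpha) * bigram_precision
```

Giving the recall the larger weight reflects that the recall term measures preserved meaning, while the bigram precision term indirectly penalizes overly long system summaries.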

Dataset
We tested our proposed methods on the NEWSROOM summarization dataset (https://summari.es/) [23]. The corpus contains 1.3 million articles paired with summaries written by human experts in the newsrooms of 38 major publishers between 1998 and 2017. The articles and the corresponding summaries were extracted from the HTML body content and metadata of different domains respectively. These summaries exhibit a diversity of summarization styles: extractive, abstractive, and mixed.
Extractive NEWSROOM summaries were produced by concatenating fragments of the original documents, whereas abstractive summaries of this corpus comprise novel words. NEWSROOM summaries that contain both extractive and abstractive content are referred to as mixed summaries. In order to compare our proposed text summarization method (WETS) with a baseline text summarization method called Lede-3, we randomly sampled documents from the NEWSROOM test set described in Table 1. In addition, we compare summaries generated by WETS with those of six text summarization systems, including Lede-3, on a dataset prepared for manual evaluation (https://github.com/lil-lab/newsroom/tree/master/humaneval) as shared by [23]. The shared manual evaluation dataset comprises the original texts, system summaries, and human judgment scores. We used the scores provided by human assessors for evaluating the performance of our proposed automatic evaluation metric (WEEM4TS). In order to perform manual evaluation of the system summaries, the authors disseminated 60 summaries of seven systems to three assessors along with the original texts via Amazon Mechanical Turk. The assessors were required to evaluate summaries in terms of four criteria: informativeness, relevance, fluency, and coherence. Informativeness and relevance were used to assess the semantic nature of the summaries with respect to the original articles, whereas fluency and coherence were intended for collecting syntactic judgments.

Pre-Trained Word Embedding Models
For the purpose of developing both text summarization system and automatic evaluation metric, we adopted publicly available pre-trained word embedding models developed based on three different algorithms: Word2Vec [71], GloVe [75], and FastText [74], see Table 2.

Baselines
We compared our proposed text summarization method (WETS) with the Lede-3 text summarization system on a randomly selected NEWSROOM test dataset described in Table 1. We also compared WETS with six baseline text summarization systems, including Lede-3, on the human evaluation dataset shared by [23]; see the sample in Table 3. Likewise, we compared our proposed automatic evaluation metric (WEEM4TS) with the variants of three baseline automatic evaluation metrics. In the next subsections, we describe the baseline text summarization systems and automatic evaluation metrics in turn. Table 3. Sample dataset: original text, reference summary, baseline system summaries, and the summary generated by our summarization system (WETS).

Original Text: NEWSROOM Article
Mortgage rates are still pretty cheap, even though they have risen a full percentage point since hitting record lows about a year ago. And with the stronger economy pulling housing along, "this is a good time to get into the market," Anika Khan, Wells Fargo Securities senior economist, told CNBC's "Squawk Box" on Friday. But many first-time homebuyers are being left on the sidelines, watching all that cheap money inch higher because lending requirements remain tight. The average rate on a 30-year loan ticked up to 4.41 percent from 4.40 percent last week. Fifteen-year mortgages increased to 3.47 percent from 3.42 percent. In this video, Khan gives three reasons why it is still so hard for would-be buyers to purchase their first home.

NEWSROOM Reference Summary
Many first-time homebuyers are being left on the sidelines watching all that cheap money inch higher because among other reasons lending requirements remain tight.

Lede-3
Mortgage rates are still pretty cheap, even though they have risen a full percentage point since hitting record lows about a year ago. And with the stronger economy pulling housing along, "this is a good time to get into the market," Anika Khan, Wells Fargo Securities senior economist, told CNBC's "Squawk Box" on Friday. But many first-time homebuyers are being left on the sidelines, watching all that cheap money inch higher because lending requirements remain tight.

Pointer-S
mortgage rates are still pretty cheap, even though they've risen a full percentage point since hitting record lows about a year ago many reasons why the stronger economy pulling housing along, "this is a good time to get into the market, anika khan, wells fargo securities senior economist, told cnbc is" squawk box on friday many first-time homebuyers are being left on the sidelines, watching all that cheap money inch higher because lending requirements remain tight.

Pointer-N
mortgage rates are still pretty cheap-even though they have risen a full percentage point since hitting record lows about a year ago. and with the stronger economy pulling housing along

WETS
Mortgage rates are still pretty cheap, even though they have risen a full percentage point since hitting record lows about a year ago.

Text Summarization Systems
Lede-3, used in [23], is similar to the Lead-3 baseline employed in [12] and [5]. It simply takes the first three sentences of the original text as the summary. According to [23], though simple, Lede-3 is an extractive baseline that is competitive with state-of-the-art systems. Thus, we used Lede-3 as a baseline to compare with our method in two dataset configurations: (1) we considered the first three sentences of 1081 randomly selected documents from the NEWSROOM test data as summaries, and (2) we used the 60 summaries generated with Lede-3 as shared by [23].
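Because Lede-3 is purely positional, it can be sketched in a few lines. The sentence splitting below is a deliberately naive assumption for illustration; a production system would use a trained sentence tokenizer:

```python
def lede3(article: str) -> str:
    """Lede-3 baseline: return the first three sentences as the summary."""
    # Naive split on '.' -- an assumption for illustration only; real
    # pipelines would use a proper sentence tokenizer.
    sentences = [s.strip() for s in article.split(".") if s.strip()]
    return ". ".join(sentences[:3]) + "." if sentences else ""
```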
TextRank is a graph-based unsupervised sentence-level extractive summarization system in which pragmatic words, title and heading words, and sentence locations are considered to calculate sentence relevance scores [40]. The study in [8] improved TextRank by considering three additional components: the longest common substring, TF-IDF based cosine similarity, and another TF-IDF variant called BM25. For comparison, we used the 60 summaries generated with TextRank as shared by [23].
Abs-N stands for an abstractive summarization model trained on the NEWSROOM dataset by [23]. For comparison, we used the 60 Abs-N summaries generated with this model and made publicly available by [23]. To train the model, the authors followed a TensorFlow implementation of the study in [1]. For further information, refer to [23].
The Pointer model was trained to generate a mixed summary consisting of new tokens and fragments of the source text [5]. The model was developed based on the pointer mechanism [77] and attention history [78]. In order to compare with our method, we used summaries generated by the pointer models trained on three different datasets by [23]: Pointer-C was trained on the CNN/Daily Mail dataset; Pointer-N was trained on the NEWSROOM dataset; and Pointer-S was trained on a random subset of the NEWSROOM training data. The authors used each model to generate 60 summaries and made them publicly available; we used these summaries to compare with our method. For more information, refer to [5,77,78], which the authors in [23] followed.

Automatic Evaluation Metrics
ROUGE (https://pypi.org/project/rouge/) stands for recall-oriented understudy for gisting evaluation [62], see Equation (5). ROUGE is the most commonly used automatic evaluation metric in the domain of text summarization. It counts the number of overlapping units between system and reference summaries. It has four major variants: ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S. ROUGE-N measures n-gram recall between system summaries and reference summaries. ROUGE-L matches the longest sequence of words between system and reference summaries to compute an f-measure; it does not require consecutive matches, only the same order of words in both candidate and reference summaries. ROUGE-W was proposed to value the gap between words in the longest sequence match. Another variant is ROUGE-S, which stands for skip-bigram co-occurrence statistics; it counts any overlapping pair of words while allowing arbitrary gaps. In this study, for comparison, we considered the unigram (R-1), bigram (R-2), and longest sequence match (R-L) variants.
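As a concrete illustration of the ROUGE-N recall just described, the following sketch computes clipped n-gram overlap between a candidate and a set of references (whitespace tokenization is an assumption for brevity; the linked package handles tokenization properly):

```python
from collections import Counter

def rouge_n(candidate: str, references: list, n: int = 1) -> float:
    """N-gram recall of a candidate summary against a set of reference summaries."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand = ngrams(candidate.lower().split(), n)
    match = total = 0
    for ref in references:
        ref_counts = ngrams(ref.lower().split(), n)
        # Count_match: co-occurring n-grams, clipped by the candidate's counts
        match += sum(min(c, cand[g]) for g, c in ref_counts.items())
        total += sum(ref_counts.values())
    return match / total if total else 0.0
```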

$$\text{ROUGE-N} = \frac{\sum_{S \in \{\mathrm{ReferenceSummaries}\}} \sum_{gram_n \in S} Count_{match}(gram_n)}{\sum_{S \in \{\mathrm{ReferenceSummaries}\}} \sum_{gram_n \in S} Count(gram_n)} \quad (5)$$

where n stands for the length of the n-gram, gram_n, and Count_match(gram_n) is the maximum number of n-grams co-occurring in a candidate summary and a set of reference summaries.
BLEU (https://www.nltk.org/_modules/nltk/translate/bleu_score.html), i.e., bilingual evaluation understudy, was designed to evaluate the quality of machine translation output against reference translations [79], see Equation (6). The primary task of this metric is to count position-independent n-gram matches between system and reference translations. To discourage repeated words in the candidate translation, redundant words are clipped, and a modified precision is computed by dividing the clipped n-gram count by the total number of n-grams in the candidate translation. To calculate the final BLEU score, the modified precision scores are combined and multiplied by the brevity penalty. Although it was proposed to evaluate system translations, the precision-oriented BLEU is on par with the recall-oriented ROUGE in the evaluation of system summaries [80]. It is also possible to compute BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores; BLEU-4 is the default and is commonly referred to simply as BLEU. In our experiment, we used all variants.
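A minimal sentence-level sketch of this computation, assuming whitespace tokenization and no smoothing (production code should use the NLTK implementation linked above):

```python
import math
from collections import Counter

def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Sentence-level BLEU sketch: clipped n-gram precisions and a brevity penalty."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    log_precisions = []
    for n in range(1, max_n + 1):
        c = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        r = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        clipped = sum(min(v, r[g]) for g, v in c.items())  # clip repeated n-grams
        total = sum(c.values())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(clipped / total))
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)
```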
$$\text{BLEU} = \min\left(1, \frac{\text{output length}}{\text{reference length}}\right)\left(\prod_{n=1}^{4} precision_n\right)^{1/4} \quad (6)$$

CHRF (https://www.nltk.org/_modules/nltk/translate/chrf_score.html) calculates a simple F-score that combines the recall and precision of character n-grams of maximum length 6 with different settings of the β parameter (β = 1, 2, or 3) [81], see Equation (7). Based on the β value, CHRF is referred to as CHRF1, CHRF2, and CHRF3 for β = 1, β = 2, and β = 3, respectively. β = 1 implies equal weights for recall and precision, whereas β = 3 entails that recall has three times more weight than precision. Variants of CHRF performed best in several translation directions among the automatic evaluation metrics involved in the WMT16 [82] and WMT17 [83] metrics shared tasks. Though it is not common for evaluating text summarization systems, based on its performance in the domain of machine translation, we used all variants of the CHRF metric to compare with our proposed automatic evaluation metric.

$$\text{CHRF}\beta = (1 + \beta^2)\,\frac{ChrP \cdot ChrR}{\beta^2 \cdot ChrP + ChrR} \quad (7)$$

where ChrP and ChrR stand for character n-gram precision and recall, arithmetically averaged over all n-grams.
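The CHRF computation can be sketched as follows, assuming spaces are removed before extracting character n-grams (a common but not universal choice; the NLTK implementation linked above is the reference):

```python
from collections import Counter

def chrf(candidate: str, reference: str, beta: int = 3, max_n: int = 6) -> float:
    """Character n-gram F-score: precision/recall averaged over n = 1..max_n."""
    def char_ngrams(text, n):
        s = text.replace(" ", "")
        return Counter(s[i:i + n] for i in range(len(s) - n + 1))

    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        c, r = char_ngrams(candidate, n), char_ngrams(reference, n)
        overlap = sum(min(v, r[g]) for g, v in c.items())
        precisions.append(overlap / sum(c.values()) if c else 0.0)
        recalls.append(overlap / sum(r.values()) if r else 0.0)
    chr_p, chr_r = sum(precisions) / max_n, sum(recalls) / max_n
    if chr_p == 0 and chr_r == 0:
        return 0.0
    return (1 + beta ** 2) * chr_p * chr_r / (beta ** 2 * chr_p + chr_r)
```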

Evaluation Results of Text Summarization Systems
In Table 4, we report the ROUGE scores of our proposed text summarization method (WETS) and the baseline method (Lede-3) on a randomly selected dataset from the NEWSROOM test data. The dataset statistics are described in Table 1. We further compare our proposed text summarization method with several baseline systems, including Lede-3, on the human evaluation dataset (https://github.com/lil-lab/newsroom/tree/master/humaneval) made publicly available by [23], and report the results in Table 5. The shared manual dataset comprises system summaries, the corresponding human judgment scores, and the original article text. Among the seven systems, we selected six as baseline systems, excluding the extractive oracle fragments system, because it is unfairly favored by having access to the reference summary. The provided dataset does not include the reference summaries; thus, in order to evaluate and compare system summaries, we fetched the corresponding reference summaries from the NEWSROOM test data.
In order to produce a WETS summary, we used three different publicly available pre-trained word embedding models trained with Word2Vec, GloVe, and FastText. However, we noticed that although their relevance scores differ, there are no significant differences among the WETS summaries generated based on these models. Thus, we considered only Word2Vec-based WETS summaries, and report the results in Table 4. Table 4. ROUGE scores of WETS and Lede-3. For this experiment, we used the pairs of article-summary data described in Table 1. The articles are the source texts from which Lede-3 and WETS summaries were generated, while the paired summaries are used as references (ground truth) to evaluate both Lede-3 and WETS. Better results are highlighted in bold.

It is apparent from Table 4, which reports scores separately for the extractive, abstractive, and mixed styles, that out of nine comparisons WETS performs better than Lede-3 seven times. Similarly, from the results in Table 5, we can see that WETS was best in four out of six comparisons, beaten by Lede-3 only twice, in R-1 and R-2. Moreover, closer inspection of Table 5 shows that the results of WETS, Lede-3, and TextRank are relatively close to each other. This could be because all of them are extractive text summarization systems.

Correlation Results of Automatic Evaluation Metrics
In order to evaluate the performance of automatic evaluation metrics for text summarization, we used the shared file consisting of the original text, system summaries, and human evaluation results [23]. The task organizers disseminated the same copy of 60 summaries of seven systems to three assessors, along with the original text, via Amazon Mechanical Turk. Based on the provided original text, the assessors rated each summary out of five in four dimensions: coherence, fluency, informativeness, and relevance. For each dimension, the annotators were provided with a question to help them judge the summaries effectively. For instance, to rate the informativeness of a summary, the assessors were told to use the prompt: How well does the summary capture the key points of the article? Similarly, the annotators rated the relevance of a summary based on the prompt: Are the details provided by the summary consistent with details in the article? For the fluency evaluation, the annotators rated a summary based on this question: Are the individual sentences of the summary well-written and grammatical? Likewise, for the coherence judgment, the annotators were told to evaluate a system summary based on this prompt: Do phrases and sentences of the summary fit together and make sense collectively?
Thus, as three annotators were involved in the evaluation process, three human evaluation scores were collected for each summary with regard to coherence, fluency, informativeness, and relevance. This means 1260 human evaluation scores were collected in total: 60 summaries multiplied by three annotators and again multiplied by seven systems. In order to use the human evaluation scores effectively, we deduplicated similar article-summary pairs by averaging the three annotators' scores for each dimension, so that the number of rows decreased to 420 for the 7 systems. We then computed the mean of the coherence, fluency, informativeness, and relevance scores of each summary of every system to correlate with the automatic evaluation metric scores. Finally, we ended up with one average score for each of the 60 summaries per system.
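The aggregation just described can be sketched as follows; the row layout is a hypothetical shape for the shared file, not its actual schema:

```python
from statistics import mean

# Hypothetical row shape (an assumption for illustration):
# (system, summary_id, annotator, {dimension: score out of 5})
rows = [
    ("Lede-3", 1, "a1", {"coherence": 4, "fluency": 5, "informativeness": 4, "relevance": 4}),
    ("Lede-3", 1, "a2", {"coherence": 3, "fluency": 4, "informativeness": 4, "relevance": 5}),
    ("Lede-3", 1, "a3", {"coherence": 5, "fluency": 4, "informativeness": 3, "relevance": 4}),
]

def summary_score(rows):
    """Average each dimension over the annotators, then average the four dimensions."""
    per_dim = {}
    for _system, _sid, _annotator, scores in rows:
        for dim, s in scores.items():
            per_dim.setdefault(dim, []).append(s)
    return mean(mean(v) for v in per_dim.values())
```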
In order to conduct a correlation analysis between automatic evaluation metrics and human judgments, we computed the scores of the different automatic evaluation metrics on the 60 summaries of each system. We computed WEEM4TS scores via three publicly available pre-trained word embedding models trained with the Word2Vec, GloVe, and FastText algorithms, and denote the variants of WEEM4TS as WEEM4TS_w (W_w), WEEM4TS_g (W_g), and WEEM4TS_f (W_f), respectively.
Consequently, using Pearson's correlation coefficient (r) [84], we computed the correlation between automatic evaluation metrics and human judgments, as in Equation (8), and report the results in Table 6.

$$r = \frac{\sum_{i}(H_i - \bar{H})(M_i - \bar{M})}{\sqrt{\sum_{i}(H_i - \bar{H})^2}\,\sqrt{\sum_{i}(M_i - \bar{M})^2}} \quad (8)$$

where $H_i$ is the human assessment score of each system summary and $M_i$ is the corresponding score predicted by an automatic evaluation metric, and $\bar{H}$ and $\bar{M}$ are their respective means. Table 6. Pearson's (r) correlation between automatic evaluation metric scores and human judgments. Automatic evaluation metric scores are computed based on reference summaries generated by WETS.
The asterisk (*) shows significance is at p ≤ 0.05. The best score in that system is highlighted in bold.
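The Pearson coefficient used here is a direct transcription of its standard formula, assuming equal-length lists of human and metric scores:

```python
import math

def pearson_r(h, m):
    """Pearson correlation between human scores h and metric scores m."""
    n = len(h)
    h_bar, m_bar = sum(h) / n, sum(m) / n
    cov = sum((hi - h_bar) * (mi - m_bar) for hi, mi in zip(h, m))
    var_h = sum((hi - h_bar) ** 2 for hi in h)
    var_m = sum((mi - m_bar) ** 2 for mi in m)
    return cov / math.sqrt(var_h * var_m)
```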

Table 6 presents the Pearson's correlation results between the variants of four automatic evaluation metrics and human judgments on the summaries of seven text summarization systems. As can be seen from the table, no significant correlation was found between automatic evaluation metrics and human judgments on the summaries generated by the systems depicted in the last four columns of the table. Further, the correlation results of ROUGE-1, ROUGE-2, ROUGE-L, and BLEU are not significant on the summaries generated by Lede-3 and TextRank. What is interesting about the results in this table is that all variants of our proposed evaluation metric (WEEM4TS) perform relatively better than all other metrics on almost all system summaries. Moreover, closer inspection of the table shows that, among the variants of WEEM4TS, WEEM4TS_w performs best in five out of seven comparisons, whereas WEEM4TS_g is best in the remaining two.

Discussion
Very little was found in the literature on leveraging publicly available pre-trained word embedding models for text summarization and evaluation. We exploit linguistic regularities in publicly available pre-trained word embedding models for extractive summarization, which is the focus of our first research question (RQ1). RQ1: For salient top-n sentences determination, how can we leverage publicly available pre-trained word embedding models?
In order to answer RQ1, we developed a word embedding based text summarization (WETS) system. After the removal of stop words from the original text, all words of the first sentence and the top-n frequent words are considered as keywords. The highest cosine similarity value between a word and the keywords is regarded as the weight value of that word. Subsequently, the sentence relevance score is computed by summing the obtained weight values of all words in the sentence and dividing by the updated length of the sentence. Finally, based on the relevance scores of the sentences, we concatenated the most important sentences up to the required length. It should be noted that words that are not available in the word embedding vocabulary and have no exact matches are assigned a weight value of zero. The cumulative effect of the heuristic rules considered in our text summarization system (WETS) is reported in Tables 4 and 5.
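The weighting and scoring rules above can be sketched as follows. The two-dimensional vectors are toy stand-ins for a full pre-trained model such as Word2Vec, and whitespace tokenization is assumed for brevity:

```python
import math

# Toy 2-d vectors standing in for a pre-trained model (an assumption: the
# real WETS loads a full embedding matrix with hundreds of dimensions).
EMB = {
    "loves": [0.9, 0.1], "adoring": [0.85, 0.2],
    "people": [0.2, 0.9], "researchers": [0.25, 0.8],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def word_weight(word, keywords):
    """Weight of a word: 1 for a keyword; else its highest cosine similarity
    to any keyword; else 0 when the word is out of vocabulary."""
    if word in keywords:
        return 1.0
    if word not in EMB:
        return 0.0
    sims = [cosine(EMB[word], EMB[k]) for k in keywords if k in EMB]
    return max(sims, default=0.0)

def sentence_score(sentence, keywords):
    """Relevance score: mean word weight over the sentence's tokens."""
    words = sentence.lower().split()
    return sum(word_weight(w, keywords) for w in words) / len(words)
```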
As can be seen from Table 4, WETS performs better than its counterpart Lede-3 in all cases except the R-1 and R-2 of the extractive summarization style. A possible explanation for this might be that WETS can determine the most important sentences in the original document better than the first three sentences considered by Lede-3. In a given long paragraph, the topic sentence (aka focus sentence) might reside in the middle and/or at the end [85]; Lede-3 might fail to catch it, whereas WETS is capable of identifying it as a salient sentence via keywords. Furthermore, the results shown in Table 5 demonstrate the superiority of WETS.
From Tables 4 and 5, it is possible to conclude that publicly available pre-trained word embedding models are vital for developing text summarization systems. The results further confirm the capability of pre-trained word embeddings for the text summarization task [9,41]. WETS-based summaries can therefore be considered as reference summaries for the automatic evaluation of system summaries. We believe this is a valuable alternative to reference summaries generated by human experts for the automatic evaluation task.
Turning now to the results of the automatic evaluation metrics, for comparison purposes we used variants of the ROUGE, BLEU, and CHRF metrics. ROUGE is most commonly used to evaluate text summarization systems, whereas BLEU and CHRF are used for evaluating machine translation systems. These metrics, however, are not suitable for evaluating the quality of abstractive and mixed summaries, which is the focus of our second research question (RQ2).
RQ2: Are publicly available pre-trained word embedding models useful for developing automatic evaluation metrics that are suitable to evaluate all kinds of system summaries?
Thus, in order to answer RQ2, we propose an automatic evaluation metric for text summarization, referred to as WEEM4TS, and compare it with the metrics identified as baselines. In WEEM4TS, we established the following similarity measurement criteria. If a word appears in both the system and reference summaries, it receives a +1 score. If a word in the system summary does not appear in the reference summary but exists in the word embedding vocabulary, the highest cosine similarity value with the words in the reference summary is considered as the weight value of that word. If neither of these two conditions holds, a weight value of 0 is given to that word. These weight values are used to compute the modified unigram recall and the modified bigram precision, which are linearly combined with different importance levels to obtain the final WEEM4TS score. Consequently, we conducted a correlation analysis between all automatic evaluation metric scores, including WEEM4TS, and human judgments, as presented in Table 6.
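The three weighting rules and the linear combination can be sketched as follows. The toy vectors, the whitespace tokenization, the surface-only bigram matching, and the combination weight ALPHA are all assumptions for illustration; the paper's Equation (4) defines the actual weights, and its bigram matching is also semantic:

```python
import math

# Toy vectors standing in for a pre-trained embedding model (an assumption).
EMB = {"surely": [0.9, 0.3], "unquestionably": [0.88, 0.35], "loved": [0.2, 0.9]}

# Hypothetical combination weight, NOT the value fixed in Equation (4).
ALPHA = 0.8

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def word_weight(word, reference_words):
    if word in reference_words:    # rule 1: exact match scores +1
        return 1.0
    if word not in EMB:            # rule 3: out-of-vocabulary word scores 0
        return 0.0
    sims = [cosine(EMB[word], EMB[r]) for r in reference_words if r in EMB]
    return max(sims, default=0.0)  # rule 2: best similarity to any reference word

def weem4ts(system, reference, alpha=ALPHA):
    sys_w, ref_w = system.lower().split(), reference.lower().split()
    # modified unigram recall: summed weights over the reference length
    recall = sum(word_weight(w, set(ref_w)) for w in sys_w) / len(ref_w)
    sys_bi = list(zip(sys_w, sys_w[1:]))
    ref_bi = set(zip(ref_w, ref_w[1:]))
    # modified bigram precision (surface matches only in this sketch)
    precision = sum(b in ref_bi for b in sys_bi) / len(sys_bi) if sys_bi else 0.0
    return alpha * recall + (1 - alpha) * precision
```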
Intuitively, we make the following general observations: (1) the correlation of the WEEM4TS variants with human judgments is the best in most cases when compared with the other automatic evaluation metrics; (2) when we compare the WEEM4TS variants, WEEM4TS_w is best five times out of seven comparisons, whereas WEEM4TS_g wins two times; (3) the scores of the WEEM4TS_f variant are also very close to the scores of the best metric for each system; (4) the correlation results clearly demonstrate the superiority of our proposed automatic evaluation metric. A possible explanation for this might be that the considered heuristic rules are empowered by prominent word embedding models that help to uncover shared meaning between system and reference summaries.
Further, from Table 6, it is interesting to note that there is a consistent correlation among the WEEM4TS variants. In order of performance, WEEM4TS_w is first, followed by WEEM4TS_g and WEEM4TS_f. It seems that the vocabulary size of the word embedding models positively influences the results of the WEEM4TS variants. The results further support the idea that word embedding models are capable of placing similar words near each other [73,74].
Moreover, we examined the relationship among the automatic evaluation metrics in two ways. First, we picked the best scores among the variants of each metric and compared them, as shown in Figure 2. Accordingly, the best score of the WEEM4TS variants is the highest in almost all cases, followed by the best scores of the CHRF and ROUGE variants. Second, we used a heat map to show the association between automatic evaluation metrics in terms of the correlation results over all system summaries (Figure 3). As shown in Figure 3, all variants of WEEM4TS are positively correlated with all variants of the CHRF and ROUGE metrics. On the other hand, although BLEU is reported as an on-par metric with ROUGE in [80], according to the results in Figure 3 it has a weak correlation with all metrics considered in our experiment.

Conclusion and Future Work
With the emergence of smart phones and web technologies, the amount of text on the web is increasing enormously. As a result, text summarization has received more attention in several application domains, and several text summarization methods have been proposed to satisfy the demand. However, meaning-oriented text summarization remains a challenging task. This paper explores ways of using word embedding based sentence relevance scores to rank the top-n sentences as a summary, which can then be used as a ground truth. Another aim of this study was to examine a semantic-based automatic evaluation metric for evaluating the quality of system summaries.
Our extensive experimental studies on different data configurations confirm that our proposed text summarization method (WETS) and automatic evaluation metric (WEEM4TS) achieve significant performance when compared to baseline methods. However, we are aware that our research may have three limitations. First, as the focus of our proposed text summarization approach was on concatenating top-n sentences, pronouns in the generated summary may refer back to incorrect nouns. Second, being limited to the words of the first sentence and frequent words, the study did not explicitly evaluate the use of words in the middle and/or last sentence(s) as keywords.
Third, although our proposed text summarization method and automatic evaluation metric constitute promising meaning-oriented approaches, their performance was not evaluated on languages other than English; further research could be conducted to determine the effectiveness of the proposed methods on various languages. In the future, alternative identification of relevant keywords and addressing the issue of anaphora in text summarization would also call for more research.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A Demonstration of how word embedding models are used in both WETS and WEEM4TS
Assume we are given an original text from which to generate a three-sentence WETS summary, which we then evaluate against a pre-determined reference text. Here, we would like to show the procedure in two phases: first generate the WETS summary, and then evaluate the generated summary against the reference text. Phase 1: Perform the following activities for generating the WETS summary.

1.
For the demonstration purpose, we used an original text and a reference text shown in the following box.
Original Text: A person who loves someone is surely loved in turn by the others {0}. Researchers show that the more a person loves people around him/her the better healthy life he/she has {1}. People who love others without any condition are mostly lead happy life {2}. Contrary, there are people who are ignorant and get satisfaction by hurting others {3}. Some of them develop this behavior from their childhood {4}. Adoring others will give you immense happiness and peace {5}. Reference Text: A person who loves someone is unquestionably loved successively by the others. In fact, there are also some people who get satisfaction by hurting others. Adoring others will provide you with immense happiness and peace.

2.
Using the full stop as a delimiter, we split the original text into six sentences. We refer to each sentence using the index value enclosed in curly brackets.

3.
After removing stop words, all words of the first sentence are considered as the initial keywords.

4.
Update the keywords identified in step 3 by adding relevant frequent words. In this particular example, all of the frequent words are also found in the first sentence except the word 'people' that occurs two times in the original text. So, we updated the keywords by adding the word 'people' to the list.

5.
In order to compute the relevance score of each sentence, we sum the weight values of all words in that sentence and divide by the number of words in that sentence, Equation (1). Obviously, the first sentence is favored as the first top salient sentence. Before calculating the relevance scores of the other sentences, the weight value of each word in a sentence is determined based on the following rules: If the word also exists in the first sentence, we temporarily remove that word rather than assigning a weight value to it; this is to discourage redundancy. If the word does not exist in the first sentence but does in the keywords, we assign a weight value of +1. If the word does not exist in the keywords but does in the vocabulary of the word embedding model, we compute the cosine similarity between the word and all words in the keywords and take the highest cosine similarity value as the weight value of that word. For instance, the first word of the second sentence {1}, 'Researchers', exists neither in the first sentence nor in the list of keywords. Hence, to determine the weight value for the word 'Researchers', we compute the cosine similarity between 'Researchers' and all words in the keywords via Word2Vec, in which the word 'people' is found to be the most similar word, with the highest cosine similarity value equal to 0.104. Following the same process, we assign a weight value to every word, which is consequently used to compute the relevance score. Based on the obtained relevance scores, from highest to lowest, the following sentences are identified as the salient top-3, in order: {0}, {5}, and {3}. Accordingly, the WETS summary is: A person who loves someone is surely loved in turn by the others. Adoring others will give you immense happiness and peace. Contrary, there are people who are ignorant and get satisfaction by hurting others.

Phase 2: Perform the following activities for evaluating system summaries against the reference text.

1.
Assume we are required to evaluate system generated summary, in this case WETS and Lede-3 summaries, against human generated reference summary.

2.
We use WEEM4TS to evaluate both the WETS and Lede-3 summaries against the human generated reference summary. As described in Section 3.2, the WEEM4TS score is calculated by a linear combination of the modified unigram recall (Equation (2)) and the modified bigram precision (Equation (3)). In the modified unigram recall, each word in the WETS summary is assigned the highest cosine similarity value among the words in the reference summary; the sum of these values is then divided by the number of words in the reference summary. In the modified bigram precision, the number of bigram matches is counted and divided by the number of bigrams in the system summary. It should be noted that the matching is based not only on the surface form but also on semantic similarity. This makes the recall and precision employed in this study different from standard recall and precision, and we use the term modified to designate the difference. Accordingly, we compute the WEEM4TS variant scores for the WETS summary and the Lede-3 summary: Reference Text: A person who loves someone is unquestionably loved successively by the others. In fact, there are also some people who get satisfaction by hurting others. Adoring others will provide you with immense happiness and peace. Lede-3 summary: A person who loves someone is surely loved in turn by the others. Researchers show that the more a person loves people around him/her the better healthy life he/she has. People who love others without any condition are mostly lead happy life.