ESTIME: Estimation of Summary-to-Text Inconsistency by Mismatched Embeddings

We propose a new reference-free summary quality evaluation measure, with emphasis on faithfulness. The measure is designed to find and count all possible minute inconsistencies of the summary with respect to the source document. The proposed ESTIME, Estimator of Summary-to-Text Inconsistency by Mismatched Embeddings, correlates with expert scores on the summary-level SummEval dataset more strongly than other common evaluation measures, not only in consistency but also in fluency. We also introduce a method of generating subtle factual errors in human summaries, and show that ESTIME is more sensitive to such errors than other common evaluation measures.


Introduction
Summarization must preserve the factual consistency of the summary with the text. Human annotation of factual consistency can be accompanied by a detailed classification of factual errors, giving hope that the annotation scores are reasonably objective (Kryscinski et al., 2020; Huang et al., 2020; Vasilyev et al., 2020b; Gabriel et al., 2020; Maynez et al., 2020).
Factual consistency, or 'faithfulness', of a summary is one of several summary qualities; for the purpose of human annotation these qualities can be specified in different ways (Xenouleas et al., 2019b; Kryscinski et al., 2020; Fan et al., 2018; Vasilyev et al., 2020b; Fabbri et al., 2020). Summarization models nowadays create satisfactorily fluent, coherent and informative summaries, but factual consistency still has a lot of room for improvement. Some factual errors (swapped named entities and crude hallucinations) are easily noticeable and make a summary look bad right away; other factual errors can be hardly noticeable even for annotators (Lux et al., 2020; Vasilyev et al., 2020b), which is arguably even worse.
Existing summary evaluation measures are based on several approaches, each of which may be more sensitive to one quality or another. A question-answering based evaluation estimates how helpful the summary is in answering questions about the source text (Xenouleas et al., 2019a; Eyal et al., 2019; Scialom et al., 2019a; Deutsch et al., 2020; Durmus et al., 2020; Wang et al., 2020). A text reconstruction approach estimates how helpful the summary is in guessing parts of the source text (Vasilyev et al., 2020a,b; Egan et al., 2021). Evaluation measures that use some kind of text similarity can estimate how similar the summary is to special human-written reference summaries (Zhang et al., 2020; Zhao et al., 2019; Lin, 2004), or, more realistically, how similar the summary is to the source text (Gao et al., 2020; Louis and Nenkova, 2009).
In order to assess how well summary factual consistency is evaluated, it is necessary either to have a dataset of human-annotated imperfect machine-generated summaries (Bhandari et al., 2020; Fabbri et al., 2020), or a dataset of artificially introduced factual errors in originally factually correct human-written summaries (Kryscinski et al., 2020).
In this paper we focus on presenting a new evaluation measure with emphasis on factual consistency. Our contribution: 1. We introduce ESTIME, Estimator of Summary-to-Text Inconsistency by Mismatched Embeddings.

Methods
In order to estimate the consistency of a summary with the text, we attempt to find all summary tokens that could be related to a factual error. Our motivation is that transformer-made token embeddings are highly contextual (Ethayarajh, 2019). We assume that even if something is phrased somewhat differently in the summary, the context should suggest approximately the same token embedding as the one suggested by the corresponding context in the text. Thus, we check the embeddings of all summary tokens that have one or more occurrences in the text. For each such token embedding we find its match: the most similar embedding in the text. We assume that if the summary is factually correct, then the matched tokens are the same. If the tokens are not the same, we count the mismatch as an indicator of inconsistency of the summary with the text. The ESTIME algorithm is shown in more detail in Figure 1.
This approach is different from matching similar embeddings for the sake of measuring similarity (e.g. the similarity between a summary and a reference summary in BERTScore (Zhang et al., 2020)). It also differs from using a model trained to replace wrong tokens with correct ones (Cao et al., 2020; Kryściński et al., 2019).
In preliminary evaluations, similar to the ones presented in the next sections, we have not found any improvement from adding one or another flavour of heuristics, such as covering only named entities or only certain parts of speech. For finding the most similar embedding we use a simple unnormalized dot-product; replacing it with cosine similarity makes all the results presented in the next sections slightly worse. The embeddings are taken from the pretrained BERT model (Devlin et al., 2018) bert-large-uncased-whole-word-masking of the transformers library (Wolf et al., 2020). While there is no crucial difference with other varieties of BERT, ALBERT and RoBERTa, this model showed a better overall performance, and we used it for the evaluations in the next sections. We found no strong dependency on the parameters W, L, M of ESTIME; we set the input size W = 450, close to the maximal BERT input length, and the distances L = 8 and M = 50 reasonably large. We present results for H = 12 as ESTIME-12 and for H = 24 as ESTIME-24, corresponding to embeddings taken from the middle and from the top of the large BERT.
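The final mismatch-counting step can be sketched as follows. This is a minimal illustration with toy embeddings: the function name `count_mismatches` and the argument names are our own, and a real implementation would obtain the embeddings from masked BERT passes as described above rather than take them as inputs.

```python
import numpy as np

def count_mismatches(summary_tokens, summary_embs, text_tokens, text_embs):
    """Count ESTIME inconsistencies: summary tokens (also present in the
    text) whose most similar text embedding belongs to a different token."""
    text_vocab = set(text_tokens)
    num_inconsistencies = 0
    for token, emb in zip(summary_tokens, summary_embs):
        if token not in text_vocab:
            continue  # only tokens occurring in the text are checked
        sims = text_embs @ emb       # unnormalized dot-product similarity
        best = int(np.argmax(sims))  # index of most similar text embedding
        if text_tokens[best] != token:
            num_inconsistencies += 1  # mismatch: possible inconsistency
    return num_inconsistencies
```

With toy one-hot embeddings, a summary token whose context-driven embedding lands nearest a different text token counts as one inconsistency; tokens absent from the text are simply skipped.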
Performance on human-annotated machine-generated summaries

We used the SummEval dataset (Fabbri et al., 2020; https://github.com/Yale-LILY/SummEval) for comparing ESTIME with a few well known or promising evaluation measures. The part of the SummEval dataset that we use consists of 100 texts; each text is accompanied by 16 summaries generated by 16 different models, making altogether 1600 text-summary pairs. Each text-summary pair is annotated (on a scale of 1 to 5) by 3 experts for 4 qualities: consistency, relevance, coherence and fluency. We took the average of the expert scores for each quality of a text-summary pair. Each text is also accompanied by 11 human-written reference summaries, for the measures that need them. We calculated scores of ESTIME and other measures for all 1600 summaries, and present their correlations with the average expert scores in Table 1. The measures in Table 1 are split into the group of reference-free measures (top) and the measures requiring human-written references (bottom). BLANC-help (Vasilyev et al., 2020a) is calculated in two versions (https://github.com/PrimerAI/blanc#blanc-on-summeval-dataset), which differ by the underlying models: BLU (bert-large-uncased) and AXXL (albert-xxlarge-v2). ESTIME and Jensen-Shannon (Louis and Nenkova, 2009) values are negated. SummaQA (Scialom et al., 2019b) is represented by SummaQA-P (prob) and SummaQA-F (F1 score). SUPERT (Gao et al., 2020) is calculated as single-doc with 20 reference sentences ('top20'). BLEU (Papineni et al., 2002) is calculated with NLTK. BERTScore (Zhang et al., 2020) is represented as F1 (BERTScore-F), precision (BERTScore-P) and recall (BERTScore-R). For ROUGE (Lin, 2004), ROUGE-L is calculated as rougeLsum.
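The summary-level correlations reported in Table 1 can be computed, for instance, with scipy. This is a sketch on hypothetical toy scores (the numbers below are not values from the paper; variable names are ours):

```python
import numpy as np
from scipy.stats import spearmanr, kendalltau

# Hypothetical scores for 5 summaries (not actual values from the paper):
measure_scores = np.array([0.2, 0.5, 0.1, 0.9, 0.4])
expert_scores = np.array([2.0, 3.5, 1.5, 4.5, 3.0])  # averaged over experts

rho, p_rho = spearmanr(measure_scores, expert_scores)
tau, p_tau = kendalltau(measure_scores, expert_scores, variant="c")  # tau-c
```

In the toy data above the two score arrays are perfectly rank-concordant, so both ρ and τ equal 1; in practice the 1600-element score arrays yield the correlations and p-values shown in Table 1.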
By design ESTIME should perform well for consistency, and indeed it beats the other measures in the table. Being a one-sided summary-to-text estimator of inconsistencies, ESTIME should not, and does not, perform well for relevance. Interestingly, ESTIME performs better than the other measures for fluency, and reasonably well for coherence.
In Table 2 we show correlations at the system level, meaning that the scores (of automated measures and of human experts) are averaged over the 100 texts, so that each array of scores has length only 16 rather than 1600 (Fabbri et al., 2020). The purpose of this is a comparison of the summarization models. We present results for consistency in the table; for the other qualities some measures have p-values higher than 0.05.
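The system-level aggregation amounts to averaging over texts before correlating. A sketch, assuming the summary-level scores are arranged in a hypothetical (texts × systems) array (the names and random values are ours, for illustration only):

```python
import numpy as np

# Hypothetical summary-level scores: rows = 100 texts, columns = 16 systems
rng = np.random.default_rng(0)
summary_level = rng.random((100, 16))

# System-level score: average over the 100 texts, one value per system;
# correlations are then computed on these length-16 arrays
system_level = summary_level.mean(axis=0)
```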

Performance on human summaries with generated subtle errors
Machine-generated summaries, even those by abstractive summarization models, generally follow the source text by frequently reproducing large spans from it.
It is no surprise, then, that the most similar context should point to the same token, helping ESTIME to be a good factual consistency measure. Human summaries are more varied in describing the source text, and it is interesting how useful ESTIME can be for evaluating them. Fundamentally, we are asking how flexible the embeddings are in understanding the context. In order to answer this question, we made a random selection of 2000 text-summary pairs from the CNN/Daily Mail dataset (Hermann et al., 2015). For each human-written summary we then added the same summary modified by generated factual errors. We thus made 4000 text-summary pairs. We assigned 'golden' scores of 1 to each clean summary, and 0 to each summary with errors.
Our 'subtle errors' generation method is simple, heuristic-free and easily reproducible. In order to generate an error, we randomly select a whole-word token in the summary, mask it, and predict it with a masked language model (we used bert-base-cased). We then select the top predicted candidate that is not equal to the real token, and substitute it for the real token. The resulting 'subtle errors' are indeed subtle and similar to real machine-generated mishaps and hallucinations, with fluency preserved.
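The generation step can be sketched as follows. This is a hedged sketch: `predict_candidates` stands in for a masked LM such as bert-base-cased returning ranked fill-in candidates for the masked position, and all names are our own, not from the paper's code.

```python
import random

def generate_subtle_error(tokens, predict_candidates, seed=0):
    """Replace one randomly chosen whole-word token with the top LM
    candidate that differs from it, keeping the rest of the summary intact."""
    rng = random.Random(seed)
    pos = rng.randrange(len(tokens))  # randomly select a token to mask
    # predict_candidates(tokens, pos) should yield candidates, best first
    for candidate in predict_candidates(tokens, pos):
        if candidate != tokens[pos]:  # top candidate differing from real token
            corrupted = list(tokens)
            corrupted[pos] = candidate
            return corrupted
    return list(tokens)  # no differing candidate: leave summary unchanged
```

Because the replacement is the LM's own top-ranked alternative in context, the corrupted summary stays fluent while becoming factually wrong in exactly one place.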
Thus, we made the evaluation task doubly difficult: the summaries are human-written, and the errors are subtle. For the evaluation presented in Table 3 we created 3 subtle errors in each 'score=0' summary. ESTIME is more sensitive to such generated errors than the other measures, as shown in Table 3. All p-values in the table are less than 10^-3, except 0.023 for BLANC-AXXL, 0.002 for Jensen-Shannon and 0.001 for SummaQA-P.

Conclusion
We introduced ESTIME, Estimator of Summary-to-Text Inconsistency by Mismatched Embeddings: a new reference-free summary quality measure, with emphasis on measuring factual inconsistency of the summary with respect to the text. In view of its good performance, we intend to use ESTIME for improving the faithfulness of summarization. We also introduced, and used for evaluation, a method for generating 'subtle errors'; the method has potential for creating consistent and realistic benchmark datasets for factual consistency.
# Get all embeddings from the text:
Starting from token t, in the input window mask each L-th token with embedding not yet taken.
Feed the input to the model; take the H-th hidden layer embeddings for all the masked tokens.
Place the obtained embeddings into their corresponding locations in embeddings_text.
# Get all embeddings from the summary:
Repeat the above for the summary. Result: embeddings_summary.
# Count summary-to-text inconsistencies:
num_inconsistencies = 0
For each embedding E_s in embeddings_summary:
    Find in embeddings_text an embedding E_t with highest similarity (dot-product) to E_s.
    If the tokens corresponding to E_s and E_t are not equal, then num_inconsistencies += 1.

Figure 1: ESTIME: Estimation of Summary-to-Text Inconsistency by Mismatched Embeddings.

Table 1 :
Summary-level correlations ρ (Spearman) and τ (Kendall tau-c) of quality estimators with human expert scores. The evaluation measures in the top rows are reference-free, separated from those in the lower rows, which need human references. In each column the highest correlation is in bold. The only p-values above 0.01 in this table are p=0.03 for BERTScore-P and p=0.01 for ROUGE-2.

Table 2 :
System-level correlations ρ (Spearman) and τ (Kendall tau-c) of quality estimators with human expert scores of consistency. The top rows show reference-free evaluation measures.

Table 3 :
Correlations ρ (Spearman) and τ (Kendall tau-c) of quality estimators with the presence of generated subtle errors in human summaries. The dataset of 4000 text-summary pairs was created by randomly picking 2000 text-summary pairs from the CNN/Daily Mail dataset, duplicating these 2000 pairs, and generating subtle errors in the 2000 duplicated summaries.