The (Un)Suitability of Automatic Evaluation Metrics for Text Simplification

Abstract
In order to simplify sentences, several rewriting operations can be performed, such as replacing complex words with simpler synonyms, deleting unnecessary information, and splitting long sentences. Despite this multi-operation nature, evaluation of automatic simplification systems relies on metrics that moderately correlate with human judgments on the simplicity achieved by executing specific operations (e.g., simplicity gain based on lexical replacements). In this article, we investigate how well existing metrics can assess sentence-level simplifications where multiple operations may have been applied and which, therefore, require more general simplicity judgments. For that, we first collect a new and more reliable data set for evaluating the correlation of metrics and human judgments of overall simplicity. Second, we conduct the first meta-evaluation of automatic metrics in Text Simplification, using our new data set (and other existing data) to analyze the variation of the correlation between metrics’ scores and human judgments across three dimensions: the perceived simplicity level, the system type, and the set of references used for computation. We show that these three aspects affect the correlations and, in particular, highlight the limitations of commonly used operation-specific metrics. Finally, based on our findings, we propose a set of recommendations for automatic evaluation of multi-operation simplifications, suggesting which metrics to compute and how to interpret their scores.


Introduction
Text Simplification consists of modifying the content and structure of a text in order to make it easier to read and understand, while preserving its main idea and as much as possible of its original meaning. Human editors simplify through several rewriting operations, such as lexical paraphrasing (i.e., replacing complex words/phrases with simpler synonyms and some rewording for fluency), changing the syntactic structure of sentences (e.g., splitting or reordering components), or removing information deemed non-essential to understand the main idea of the original text (Petersen 2007; Aluísio et al. 2008; Bott and Saggion 2011; Xu, Callison-Burch, and Napoles 2015). Modern systems for Automatic Text Simplification are sentence-level, and attempt to replicate this multi-operation rewriting process by leveraging corpora of parallel original-simplified sentence pairs (Alva-Manchego, Scarton, and Specia 2020). However, the simplicity of automatic sentence-level simplifications is measured with metrics that evaluate single specific operations. For instance, SARI (Xu et al. 2016) was designed to estimate simplicity gain when just lexical paraphrasing was being assessed, whereas SAMSA (Sulem, Abend, and Rappoport 2018b) attempts to quantify structural simplicity by verifying the correctness of sentence splitting. A recent study showed that, for the same set of original sentences, human judges preferred manual simplifications where multiple edit operations had been applied over those where only one operation had been performed (i.e., only lexical paraphrasing or only splitting). However, the authors also provided preliminary evidence that both a general metric like BLEU (Papineni et al. 2002) and an operation-specific one like SARI had poor correlations with judgments of overall simplicity when computed using multi-operation manual references.
In this article, we study the extent to which evaluation metrics can estimate the simplicity of automatic sentence-level simplifications where multiple rewriting operations may have been applied. In order to do so, we: (1) create a new data set with direct assessments of simplicity; (2) perform the first meta-evaluation of automatic metrics for sentence-level Text Simplification, focused on their correlation with human judgments on simplicity; and (3) propose a set of guidelines for automatic evaluation of sentence-level simplifications, seeking to improve the interpretation of automatic scores, especially for multi-operation simplifications.
In the remainder of the article, we first review manual and automatic evaluation methods in Sentence Simplification (Section 2). Then, we describe two existing data sets with human judgments on simplicity gain and structural simplicity of system outputs, whose limitations motivate the collection of a new data set with overall simplicity scores crowdsourced through Direct Assessment (Section 3). After that, we study the variation in sentence-level correlations between automatic metrics and human judgments under three test conditions: the level of perceived simplicity, the approach implemented by the simplification systems, and the set of manual simplification references (Section 4). For direct assessments of simplicity, in particular, we show that: (a) metrics can more reliably score low-quality simplifications; (b) most metrics are better at scoring system outputs from neural sequence-to-sequence models; and (c) computing metrics using all available manual references for each original sentence does not significantly improve their correlations. We also propose explanations for the low-to-moderate correlations achieved by simplification-specific metrics.
Based on our findings, we propose a set of recommendations for better evaluation of automatic sentence-level simplifications and suggest ways to improve current practices (Section 5). Among these, we suggest first computing BERTScore (Zhang et al. 2020) to verify that the system output is of high quality, and then using SARI and/or SAMSA to measure the gains in simplicity. Finally, we summarize our results, highlighting our contributions and conclusions (Section 6).

Background
The preferred method for evaluating the quality of automatic simplifications is eliciting human judgments on grammaticality, meaning preservation, and simplicity. However, these can be costly to obtain while tuning simplification models, especially at large scale. This creates scenarios where automatic metrics act as proxies for human judgments, so it is important to understand how these metrics behave under different circumstances, to better interpret their scores. We first review common practices for collecting human judgments on the simplicity of system outputs against which metrics are evaluated, and motivate our choice of Direct Assessment as our data labelling methodology. Then, we briefly explain the main automatic metrics to assess simplicity and motivate conducting a meta-evaluation on them.

Human Evaluation of Simplicity
When obtaining human judgments on the simplicity of system outputs, there are three components to consider: the question to elicit the judgment, what the judges are shown, and how they submit their judgment. It is generally agreed to show both the original and simplified sentences so that raters can determine if the latter is simpler than the former. However, several variations have been tested for the other two components.
Most work does not specify what "being simpler" entails, and trusts human judges to use their own understanding of the concept. In contrast, Xu et al. (2016) experimented with Simplicity Gain, asking judges to count "how many successful lexical or syntactic paraphrases occurred in the simplification". The authors argue that this framing of the task allows for easier judgments and more informative interpretation of the scores, while reducing the bias toward models that perform minimal modifications. In a similar fashion, Nisioi et al. (2017) and Cooper and Shardlow (2020) asked judges to count the number of changes made by automatic systems, and then to identify how many of them were "correct" (i.e., preserved meaning and grammaticality, while making the sentence easier to understand). In a different line of work, Sulem, Abend, and Rappoport (2018b,c) focused on Structural Simplicity, requesting judges to use a −2 to +2 scale to answer "is the output simpler than the input, ignoring the complexity of the words?" This is intended to focus the evaluation on a specific operation: sentence splitting.
For the data set collected as part of our study (Section 3.3), we follow common practice and present human judges with both original sentences and their automatic simplifications. Furthermore, because the focus of this article is on multi-operation simplifications, we rely on a general definition of simplicity instead of one for a specific (set of) operation(s). Finally, as in previous work, we experiment with collecting continuous scores following the Direct Assessment methodology, since they can be standardized to remove individual raters' biases, resulting in higher inter-annotator agreement (Graham et al. 2013).

Automatic Evaluation of Simplicity
BLEU (Papineni et al. 2002) and SARI (Xu et al. 2016) are the most commonly used metrics in Sentence Simplification. Although BLEU scores can be misleading for several text generation tasks (Reiter 2018), in the case of Simplification they have been shown to correlate well with human assessments of grammaticality and meaning preservation (Wubben, van den Bosch, and Krahmer 2012; Štajner, Mitkov, and Saggion 2014; Xu et al. 2016; Sulem, Abend, and Rappoport 2018a). SARI, on the other hand, is better suited for evaluating the simplicity of system outputs produced via lexical paraphrasing. It does so by comparing the automatic simplification to both the original sentence and multiple manual references, and measuring the correctness of the words added, kept, and deleted. Although not widely adopted, SAMSA (Sulem, Abend, and Rappoport 2018b) is another simplicity-specific metric, but focused on sentence splitting. It validates that each simple sentence resulting from splitting a complex one is correctly formed (i.e., it corresponds to a single Scene with all its Participants).
Studies on the correlation of human judgments on simplicity and automatic scores have been performed when introducing new metrics or data sets. Xu et al. (2016) argued that SARI correlates with crowdsourced judgments of Simplicity Gain when the simplification references had been produced by lexical paraphrasing, while SAMSA was shown to correlate with expert judgments of Structural Simplicity. When introducing HSplit (Sulem, Abend, and Rappoport 2018a), a data set of manual references for sentence splitting, the authors argued that BLEU (Papineni et al. 2002) was not a good estimate for (Structural) Simplicity. However, these studies did not analyze if the absolute correlations varied in different subgroups of the data. In contrast, our study shows that correlations are affected by the perceived quality of the simplifications, the types of the simplification systems, and the set of manual references used.

Data Sets with Human Judgments on Simplicity
In this section, we describe the data sets that will be used in our meta-evaluation study. Each data set is composed of a set of original sentences, their automatic simplifications produced by various simplification systems, and human evaluations on some form of simplicity for all system outputs. These data sets were chosen (or created) because:

1. Each provides a different simplicity judgment: Simplicity Gain (Xu et al. 2016), Structural Simplicity (Sulem, Abend, and Rappoport 2018c), and Direct Assessments of Simplicity (new). This allows studying the behavior of metrics along varied ways of measuring simplicity (Section 4.2).
2. Each includes system outputs from different types of simplification approaches. This allows analyzing the impact of the system type on the correlation of metrics (Section 4.3). Table 1 presents brief descriptions of the most representative models in these data sets.
3. All original sentences come from TurkCorpus (Xu et al. 2016). This allows exploiting the alignment between TurkCorpus, HSplit (Sulem, Abend, and Rappoport 2018a), and ASSET to investigate the effect on the correlations of using different sets of manual references when computing the metrics (Section 4.4).
Furthermore, we compare the data sets in terms of their human evaluation reliability using both inter-annotator agreement (IAA) and correlation coefficients, as suggested in Amidei, Piwek, and Willis (2019). For IAA, we compute intraclass correlation (ICC, Shrout and Fleiss 1979) with the implementation available in pingouin (Vallat 2018). For computing ratings correlations, we account for multiple annotators per instance by simulating two raters as follows: (1) we randomly choose one score as rater A and the average of the others as rater B; (2) we compute the Spearman's rank correlation coefficient between raters A and B using SciPy; and (3) we repeat this process 1,000 times to report the mean and variance of all iterations. For interpreting the values of both calculations, we use the scale of Landis and Koch (1977) for IAA and the scale of Rosenthal (1996) for nonparametric correlation coefficients.

Table 1 Descriptions of simplification systems included in the studied data sets. Similarly to Alva-Manchego, Scarton, and Specia (2020), we classified them into phrase-based MT (PBMT), syntax-based MT (SBMT), neural sequence-to-sequence (S2S), and semantics-informed rules (Sem) by themselves or coupled with one of the previous types (i.e., Sem+PBMT, Sem+S2S).
Type | Name | Description
PBMT | PBMT-R (Wubben, van den Bosch, and Krahmer 2012) | Phrase-based MT model that chooses the candidate simplification that is most dissimilar to the original sentence.
[...]
[...] | [...] | Transformer-based encoder-decoder (Vaswani et al. 2017) and memory-augmentation with paraphrasing rules from the Simple Paraphrase Database.
[...] | ACCESS | Transformer-based encoder-decoder that conditions the generation of simplifications on explicit desired text attributes (e.g., length and/or dissimilarity with original input).
Sem | DSS (Sulem, Abend, and Rappoport 2018c) | Hand-crafted rules for sentence splitting based on either automatic or manual UCCA (Abend and Rappoport 2013) semantic annotations.
Sem+PBMT | Hybrid (Narayan and Gardent 2014) | Phrase-based statistical MT model coupled with semantic analysis to learn to split sentences.
Sem+S2S | SENTS (Sulem, Abend, and Rappoport 2018c) | Uses DSS for sentence splitting and then the resulting output goes through a MT-based model for further paraphrasing.

Simplicity Gain Data Set
Xu et al. (2016) created this data set to study the suitability of metrics for measuring the Simplicity Gain of automatic simplifications. The authors simplified 93 original sentences using four Sentence Simplification systems: PBMT-R, SBMT-BLEU, SBMT-FKBLEU, and SBMT-SARI. For the Simplicity Gain judgments, workers on Amazon Mechanical Turk (AMT) were asked to count the number of "successful lexical or syntactic paraphrases occurred in the simplification" (Xu et al. 2016). The judgments from five different workers were averaged to get the final score for each instance. In order to measure human evaluation reliability, we computed an ICC of 0.176 and a Spearman's ρ of 0.299 ± 0.036. The ICC only points to slight agreement between the annotators, and the Spearman's ρ implies a small correlation between the human ratings.

Figure 1 Distribution of Simplicity Gain scores in the data set of Xu et al. (2016).

This data set has limitations that could prevent generalizing findings based on its data. For instance, the number of evaluated instances (372) is small, and they were produced by only four automatic systems, three of which have very similar characteristics. In addition, as shown in Figure 1, the evaluated systems did not perform significant simplification changes (as judged by humans), since most instances were rated with Simplicity Gain scores below 1, with a high frequency of values between 0 and 0.25.
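The two-rater simulation used for the Spearman reliability figures in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: it replaces the SciPy call with a hand-written rank correlation so that it runs with the standard library only, the function names and the toy ratings are hypothetical, and the use of population variance is an assumption.

```python
import random
from statistics import mean, pvariance


def average_ranks(values):
    """Rank values from 1 to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson's r between two equally long sequences."""
    mean_x, mean_y = mean(x), mean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho, computed as Pearson's r on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def simulated_rater_correlation(ratings, iterations=1000, seed=0):
    """ratings: one list of scores per instance (at least two scores each)."""
    rng = random.Random(seed)
    rhos = []
    for _ in range(iterations):
        rater_a, rater_b = [], []
        for scores in ratings:
            pick = rng.randrange(len(scores))
            # Rater A is one randomly chosen score; rater B averages the rest.
            rater_a.append(scores[pick])
            rater_b.append(mean([s for i, s in enumerate(scores) if i != pick]))
        rhos.append(spearman(rater_a, rater_b))
    return mean(rhos), pvariance(rhos)


# Hypothetical toy ratings: five instances, three annotators each.
toy_ratings = [[1, 2, 1], [3, 3, 4], [5, 4, 5], [2, 2, 3], [4, 5, 4]]
mean_rho, var_rho = simulated_rater_correlation(toy_ratings, iterations=200)
```

Fixing the seed makes the simulation repeatable; with real data the per-instance lists would hold the raw scores of each annotator.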

Structural Simplicity Data Set
Sulem, Abend, and Rappoport (2018c) created this data set to evaluate the performance of Sentence Simplification models that mix hand-crafted rules (based on semantic parsing) for sentence splitting, with standard MT-based architectures for lexical paraphrasing. Sulem, Abend, and Rappoport (2018a) further exploited this data to examine the suitability of BLEU for assessing Structural Simplicity. The authors simplified 70 sentences using 25 automatic systems: Hybrid; SBMT-SARI; four versions of NTS mixing initialization with default or word2vec embeddings, and selecting the highest or fourth-best hypothesis according to SARI; two versions of DSS, with either automatic or manual semantic annotations; eight versions of SENTS that first use a version of DSS for sentence splitting and then the resulting output goes through a version of NTS; and nine versions of SENTS where NTS is replaced by Moses (Koehn et al. 2007).
Native English speakers were asked to use a 5-point Likert scale (−2 to +2 scores) to measure Structural Simplicity: "is the output simpler than the input, ignoring the complexity of the words?" (Sulem, Abend, and Rappoport 2018c). The judgments from three different annotators are averaged to obtain the final score for each instance. Our computation of human evaluation reliability found an ICC of 0.465 and a Spearman's ρ of 0.508 ± 0.013. The ICC points to a moderate agreement between the annotators, and the Spearman's ρ implies a large correlation between the human ratings.
Compared to the Simplicity Gain data set, this one is bigger (1,750 instances) and has more variability in the system outputs collected. In addition, Figure 2 shows that the distribution of scores spans all possible values, indicating that some systems even hurt the Structural Simplicity of the original sentence. Despite the overrepresentation of simplifications with scores between 0 and 0.5, around 32% of instances improve Structural Simplicity, indicating that an analysis based on perceived quality across different levels is possible.

The New Simplicity-DA Data Set
We introduce a new data set with human judgments of simplification quality elicited via Direct Assessment (DA, Graham et al. 2017), a commonly used methodology in Machine Translation Shared Tasks (Barrault et al. 2019). Leveraging publicly available system outputs on the test set of TurkCorpus (Xu et al. 2016), we collected simplifications from six systems: PBMT-R, Hybrid, SBMT-SARI, Dress-Ls, DMASS-DCSS, and ACCESS. For each system, we randomly sampled 100 automatic simplifications, not necessarily all from the same set of original sentences, but ensuring that the system output was not identical to the original sentence. Then, we crowdsourced human ratings using AMT. Workers were asked to assess the quality of the automatic simplifications in three aspects: fluency, meaning preservation, and simplicity. For each aspect, raters needed to submit a score between 0 and 100, depending on how much they agreed with a specific question. For simplicity, in particular, they were asked: Rate your level of agreement to the statement: "The Simplified sentence is easier to understand than the Original sentence". This is inspired by the DA methodology and, thus, we refer to this kind of simplicity judgment as Simplicity-DA. Each Human Intelligence Task (HIT) in AMT consisted of five sentences, with a maximum time of one hour for completion, and a payment of $0.50 per HIT. For quality control, workers had to pass a qualification test before participating in the rating task. All submissions to this test were manually reviewed to ensure understandability of the instructions.
This crowdsourcing methodology is similar to that of an earlier preliminary study of metrics' correlations. However, our new Simplicity-DA data set includes more automatic simplifications than those collected before (600 vs. 100), allowing better generalization of our findings. For each simplification instance, we collected 15 ratings per quality aspect (fluency, meaning preservation, and simplicity), which are then standardized by the mean and standard deviation of each worker to reduce individual biases. The average of all 15 standardized ratings (also called z-score) is the final score for the instance per quality aspect. Our computation of human evaluation reliability found an ICC of 0.386 and a Spearman's ρ of 0.607 ± 0.026. The ICC points to a fair agreement between the annotators, and the Spearman's ρ implies a large correlation between the human ratings.

Table 2 Summary of characteristics of the data sets with human ratings of simplicity used for the meta-evaluation study. Columns: Simplicity Gain; Structural Simplicity; Simplicity-DA.
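The per-worker standardization of Direct Assessment ratings described in this section can be illustrated with a short sketch. It is not the study's code: the tuple layout, the function name, the use of the population standard deviation, and the decision to skip workers whose ratings never vary are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardize_by_worker(ratings):
    """ratings: list of (worker_id, instance_id, raw_score) tuples.

    Returns a dict mapping each instance to the average of its
    worker-standardized scores (its z-score).
    """
    scores_by_worker = defaultdict(list)
    for worker, _, score in ratings:
        scores_by_worker[worker].append(score)
    # Mean and standard deviation of each worker over all their ratings.
    stats = {w: (mean(s), pstdev(s)) for w, s in scores_by_worker.items()}

    z_by_instance = defaultdict(list)
    for worker, instance, score in ratings:
        mu, sigma = stats[worker]
        if sigma == 0:
            # Assumption: a worker with constant ratings carries no signal.
            continue
        z_by_instance[instance].append((score - mu) / sigma)
    return {inst: mean(z) for inst, z in z_by_instance.items()}


# Hypothetical ratings: a strict worker and a lenient one agree on the order.
toy = [
    ("w1", "s1", 20), ("w1", "s2", 40),
    ("w2", "s1", 60), ("w2", "s2", 80),
]
z_scores = standardize_by_worker(toy)
```

In the toy data the two workers use different parts of the 0 to 100 scale but rank the instances identically, so after standardization both contribute the same z-scores.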

General Statistics
The annotation reliability for the collected ratings in our data set is higher than that for the Simplicity Gain data set, and comparable to that of the Structural Simplicity data set. In addition, our data set is bigger in size and offers more variability of system outputs than the Simplicity Gain data set. In particular, we included state-of-the-art neural sequence-to-sequence models, the current trend in automatic simplification systems. See Table 2 for a summary comparing the characteristics of the three data sets. Furthermore, Figure 3 shows that the Simplicity-DA ratings are more diversely distributed across all score values than those in the other data sets. This benefits our meta-evaluation since one of our intended dimensions of study is the perceived low or high quality (in terms of simplicity) of the system outputs. Overall, we argue that the newly collected Simplicity-DA data set provides a valid alternative view of human judgments of simplicity. In particular, it is more reliable for analyzing automatic metrics in a multi-operation simplification scenario since the judgments are not tied to the correctness of a specific rewriting operation.

Meta-Evaluation of Automatic Evaluation Metrics
In this section, we study how the correlations between automatic scores and human judgments vary across different dimensions. Our investigation is inspired by research in Machine Translation evaluation, in particular by the WMT Metrics Shared Tasks that compare standard and new metrics in a common setting using human judgments collected through Direct Assessment, primarily in the latest years (Bojar et al. 2016; Bojar, Graham, and Kamran 2017; Ma, Bojar, and Graham 2018; Ma et al. 2019). Data from these WMT Shared Tasks has made it possible to further study the behavior of metrics at sentence level across different dimensions (Fomicheva and Specia 2019), to analyze the protocols for evaluating metrics at system level (Mathur, Baldwin, and Cohn 2020), and to study the effect of the quality of references used to compute metrics (Freitag, Grangier, and Caswell 2020), among others.
In our study, we analyze the behavior of automatic metrics at sentence level since the data sets described previously contain human judgments for each individual simplification instance. Also, metrics explicitly developed to measure some form of simplicity, such as SARI and SAMSA, operate by definition at the sentence level. Our meta-evaluation analyzes the variation of correlations between automatic metrics and human judgments across three dimensions: the level of simplicity of the system outputs, the approaches used by the simplification systems, and the set of manual references used to compute the metrics.

Experimental Setting
Our study focuses on metrics developed to estimate the simplicity of system outputs, or that have been traditionally used for this task: BLEU, SARI, SAMSA, FKGL (Kincaid et al. 1975), FKBLEU (Xu et al. 2016), and iBLEU (Sun and Zhou 2012). We also experiment with the arithmetic mean (AM) and geometric mean (GM) of BLEU-SARI and SARI-SAMSA. Finally, we include BERTScore (Zhang et al. 2020), a reference-based metric that computes the cosine similarity between tokens in a system output and in a manual reference using contextual embeddings, namely, BERT (Devlin et al. 2019). This metric provides three types of scores: BERTScore Recall matches each token in the reference to its most similar token in the system output, BERTScore Precision matches each token in the system output to its most similar token in the reference, and BERTScore F1 combines the two. When multiple references are available, BERTScore compares the system output against all references and returns the highest value. In the context of Sentence Simplification, a modified version of BERTScore has been used to create artificial data for training a model that ranks candidate simplifications, obtaining promising results (Maddela, Alva-Manchego, and Xu 2021).
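The greedy token matching behind the three BERTScore variants can be sketched with toy vectors. This is an illustration only: the two-dimensional vectors stand in for contextual BERT embeddings, the function names are hypothetical, details of the actual BERTScore implementation are omitted, and taking the maximum of each score type independently across references is an assumption about how "the highest value" is selected.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match(output_vecs, reference_vecs):
    """Return (precision, recall, f1) for one output and one reference."""
    # Precision: each output token is matched to its most similar reference token.
    precision = sum(
        max(cosine(o, r) for r in reference_vecs) for o in output_vecs
    ) / len(output_vecs)
    # Recall: each reference token is matched to its most similar output token.
    recall = sum(
        max(cosine(r, o) for o in output_vecs) for r in reference_vecs
    ) / len(reference_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def best_over_references(output_vecs, references):
    """Score against every reference and keep the highest value per score type."""
    scores = [greedy_match(output_vecs, ref) for ref in references]
    return tuple(max(column) for column in zip(*scores))


# Hypothetical token embeddings (one vector per token).
output = [[1.0, 0.0], [0.0, 1.0]]
reference_a = [[1.0, 0.0], [1.0, 1.0]]
reference_b = [[1.0, 0.0], [0.0, 1.0]]
precision, recall, f1 = best_over_references(output, [reference_a, reference_b])
```

Because the second toy reference is identical to the output, it dominates the maximum, which mirrors how a near-copy among the references can drive a very high score.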
We used the implementations of these metrics provided by EASSE (Alva-Manchego et al. 2019). Most of the metrics are sentence-level by definition, with the exception of BLEU and its derivations. For these, we used a smoothed version with method floor and default value 0.0 in SacreBLEU (Post 2018). For a fair comparison, we detokenized and recased all original sentences and system outputs in the three data sets. Then, we set EASSE to compute all metrics with the same configuration: tokenization using SacreMoses and case-sensitive calculation.
In order to compare the automatic evaluation metrics, we followed the methodology of recent editions of the WMT Metrics Shared Task (Ma, Bojar, and Graham 2018;Ma et al. 2019). First, we computed the correlations between automatic scores and human judgments via Pearson's r for each metric. Because the simplicity ratings in our human evaluation data sets are absolute instead of relative rankings between instances, this method is better suited and easier to apply than Kendall's Tau. Furthermore, we performed Williams significance tests (Williams 1959) to determine if the increase in correlation between two metrics is statistically significant or not.
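As an illustration of this comparison, the sketch below computes Pearson's r and the Williams test statistic for two metrics correlated with the same human scores. It is a minimal sketch, not the study's code: the function names and toy scores are hypothetical, and it stops at the t statistic and its degrees of freedom (n − 3), leaving the p-value lookup in Student's t distribution to a statistics library.

```python
import math


def pearson(x, y):
    """Pearson's r between two equally long sequences of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def williams_t(r_h1, r_h2, r_12, n):
    """Williams test for two dependent correlations sharing one variable.

    r_h1, r_h2: correlation of metric 1 and metric 2 with the human scores.
    r_12: correlation between the two metrics.
    n: number of scored instances.
    Returns (t statistic, degrees of freedom).
    """
    # Determinant of the 3x3 correlation matrix.
    k = 1 - r_h1 ** 2 - r_h2 ** 2 - r_12 ** 2 + 2 * r_h1 * r_h2 * r_12
    numerator = (r_h1 - r_h2) * math.sqrt((n - 1) * (1 + r_12))
    denominator = math.sqrt(
        2 * k * (n - 1) / (n - 3)
        + ((r_h1 + r_h2) ** 2 / 4) * (1 - r_12) ** 3
    )
    return numerator / denominator, n - 3


# Hypothetical scores for six instances.
human = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]
metric_1 = [0.2, 0.5, 0.3, 0.9, 0.6, 0.8]
metric_2 = [0.5, 0.1, 0.6, 0.4, 0.9, 0.3]
t_stat, dof = williams_t(
    pearson(human, metric_1),
    pearson(human, metric_2),
    pearson(metric_1, metric_2),
    len(human),
)
```

A positive statistic favors the first metric; whether the difference is significant depends on comparing it against the t distribution with the returned degrees of freedom.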

Metrics across Simplicity Quality Levels
Our first dimension of analysis is the perceived quality of the automatic simplifications. We investigate whether it is easier or harder for metrics to evaluate low-quality or high-quality simplifications, as determined by their human judgments on simplicity. In order to do this, we split the instances in each data set into two groups according to their simplicity score, and compute the Pearson's r between metrics and human judgments for the top 50% ("High"), the bottom 50% ("Low"), and "All" available instances.

Simplicity-DA.
Table 3 presents the correlations in each quality split of this data set. Reference-based metrics were computed using manual simplifications from ASSET, since the Simplicity-DA judgment is not limited to a particular operation being performed, and simplifications in ASSET were created applying several of them.
When "All" instances are considered, BERTScore Precision shows a strong correlation with direct assessments of Simplicity, and no metric is better than that one. Flesch-based metrics (FKGL and FKBLEU) have the lowest correlations, providing further evidence that this type of metric is unsuitable for sentence-level evaluation. Simplification-specific metrics, SARI and SAMSA, also fare poorly. One possible explanation is that they were developed to assess the execution of particular simplification operations (lexical paraphrasing and sentence splitting, respectively), whereas the Simplicity-DA judgments are not operation-specific, but rather perceptions of general simplicity. Computing their arithmetic or geometric means does not yield good correlations in this data set either. BLEU shows a moderate correlation, and combining it with SARI through arithmetic or geometric mean does not significantly improve the correlation with Simplicity-DA judgments in this data set.
When comparing the correlations between the "Low" and "High" splits, we can notice that the ones in the latter are much lower. This could be interpreted as: low scores of some metrics indicate "bad" quality of a simplification (in terms of Simplicity-DA), but high scores do not necessarily imply "good" quality. Figure 4 further illustrates this behavior for three representative metrics. This could be explained by how (most of) the metrics assess the system outputs (i.e., by computing their similarity to the manual references), and by the question used to elicit Simplicity-DA judgments.
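The quality-split analysis used in this section can be sketched as follows. The example is hypothetical: the function names and toy scores are invented, and sending the middle instance to the "High" half when the count is odd is an assumption.

```python
import math


def pearson(x, y):
    """Pearson's r between two equally long sequences of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def split_correlations(human_scores, metric_scores):
    """Absolute Pearson's r for the bottom half, top half, and all instances."""
    # Order instances by their human simplicity score.
    paired = sorted(zip(human_scores, metric_scores), key=lambda pair: pair[0])
    half = len(paired) // 2
    splits = {"Low": paired[:half], "High": paired[half:], "All": paired}
    return {
        name: abs(pearson(*zip(*pairs))) for name, pairs in splits.items()
    }


# Hypothetical data: the metric tracks the low half but not the high half.
human = [1, 2, 3, 4, 5, 6, 7, 8]
metric = [1, 2, 3, 4, 6, 5, 6, 5]
correlations = split_correlations(human, metric)
```

In the toy data the metric follows the human scores closely in the bottom half and only weakly in the top half, which is the pattern the "Low" versus "High" comparison is designed to expose.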
Table 3 Absolute Pearson correlations between Simplicity-DA and metrics scores computed using references from ASSET, for low/high/all quality splits (N is the number of instances in the split). Correlations of metrics not significantly outperformed by any other in the quality split are boldfaced. Columns: Metric; Low (N = 300); High (N = 300); All (N = 600).

One possible reason is that simplifying a sentence may be limited to a few important changes that improve its readability (e.g., replacing some words or splitting a long sentence into two), while keeping the rest of the original sentence as is. Not performing these key modifications or executing unnecessary ones would be penalized by the human judges, resulting in low Simplicity-DA scores. However, similarity-based metrics could still provide high scores that, in fact, are indicative of the overlap between the system output and the references due to some degree of meaning preservation, but not of the changes that improve simplicity. The first example in Table 4 illustrates this scenario. The reference selected by BERTScore Precision as the most similar to the system output is a clever simplification that uses the adverb "successfully" to replace the clause "and was victorious" from the original sentence. Because the rest of the sentence is unchanged, it has a high overlap with the system output that merely deleted the "and was victorious" clause.
Finally, there could be a disagreement between the changes the human judges deemed necessary for a good Simplicity-DA score, and what the editors that created ASSET considered as valid simplifications. The second and third examples in Table 4 illustrate this scenario. The selected references are almost identical to the corresponding system outputs, and thus BERTScore Precision scored them very high. However, the human judges considered the changes insufficient to grant a high value of Simplicity-DA for improved simplicity. This may not be indicative that references in ASSET are incorrect, but rather that not all of them have the same degree of simplicity.

Simplicity Gain.
Table 5 presents the correlations in each quality split of this data set. Reference-based metrics were computed using manual simplifications from TurkCorpus, since the Simplicity Gain judgment is limited to counting lexical paraphrases, and references in TurkCorpus were created by only applying that operation.
In this data set, SARI has a moderate correlation, and the highest among all metrics when "All" evaluation instances are considered, similar to the results in Xu et al. (2016). Just like in the Simplicity-DA data set, Flesch-based metrics and SAMSA show low correlations, while BLEU and its variants have correlations in the middle of the group. The different versions of BERTScore are second-best, and have similar performances, i.e., there is no statistically significant difference between them. Also, combining SARI with BLEU does not improve its individual correlation. When comparing the correlations between the "Low" and "High" quality splits (also see Figure 5), most metrics have lower Pearson's r in "High." However, this is not a consistent behavior, and the differences are not as considerable as observed in the Simplicity-DA data set.

Table 4 Examples of original sentences with some of their simplification references in ASSET, and system outputs with corresponding human and automatic scores from the Simplicity-DA data set. The reference selected by the automatic metric as most similar to the system output is emphasized.

Original Sentence
In 1998, Culver ran for Iowa Secretary of State and was victorious.

System Output
In 1998, Culver ran for Iowa Secretary of State.

Sample References
Culver ran and won Iowa's secretary of State in 1998.
In 1998, Culver successfully ran for Iowa Secretary of State.
In 1998, Culver ran for Iowa Secretary of State. He won the election.

Original Sentence
Below are some useful links to facilitate your involvement.

System Output
Below is some useful links to help with your involvement

Sample References
Here are good links to get you to do it.
Below are some useful links to help with your involvement.
Here are some useful links to help you.

Original Sentence
He was appointed Companion of Honour (CH) in 1988.

System Output
He was appointed Companion of Honour in 1988.

Sample References
He was made the Companion of Honour (CH) in 1988.
He was appointed Companion of Honour in 1988.
In 1988 he was chosen as a Companion of Honour.
Simplicity-DA 0.436, BERTScore Precision 1.000

We hypothesize that the overall moderate-to-low correlations are due to most of the metrics not directly measuring Simplicity Gain. Almost all metrics compute the similarity between the system output and the references. However, measuring Simplicity Gain implies identifying the changes made by the system, and then verifying that they are correct. In order to do this, it is necessary to take the original sentence into consideration, and not just the system output and the references. SARI is the only metric that attempts to follow this logic, by computing the correctness of the n-grams kept, deleted, and added. Lexical paraphrasing is, however, strongly related to performing replacements, an operation that SARI does not directly identify and measure. The examples in Table 6 show how this limitation hurts the metric: Whereas in the second instance there are fewer correct replacements than in the first one (1 < 3), the SARI score is higher (0.587 > 0.462). By not directly counting correct replacements, the metric is affected by the conservative nature of the outputs and references that copy most of the original sentences. It is the correctness of kept and deleted n-grams that contributes to getting a high score. Consequently, SARI is not measuring Simplicity Gain, which explains why the correlation with human judgments is barely moderate. 13

Figure 5
Scatter plots showing the correlation (r) between BERTScore Precision, BLEU, and SARI, with human rating of Simplicity Gain, for different quality levels.

Table 6
Examples of original sentences and system outputs with corresponding human and automatic scores from the Simplicity Gain data set. Changes related to lexical paraphrasing are boldfaced.

Original Sentence
Jeddah is the principal gateway to Mecca, Islam's holiest city, which able-bodied Muslims are required to visit at least once in their lifetime.

System Output
Jeddah is the main gateway to Mecca, Islam's holiest city, which sound Muslims must visit at least once in life.

Original Sentence
The Great Dark Spot is thought to represent a hole in the methane cloud deck of Neptune.

System Output
The Great Dark Spot is thought to be a hole in the methane cloud deck of Neptune.
Simplicity Gain 1.25, SARI 0.587

The concept of Simplicity Gain is easy to understand: It is the number of correct changes. If metrics were able to measure it accurately, automatic scores would be more straightforward to interpret, facilitating the comparison of simplifications generated by different systems. However, collecting this type of human judgment is difficult, especially in instances where multiple rewriting operations may have been applied, and identifying where the changes happened (and counting them) is not trivial. In addition, the Simplicity Gain data set from Xu et al. (2016) that we use in this study is quite small (only 372 evaluated instances), and contains automatic simplifications from only four systems, three of which are of similar characteristics (SBMT-based), without any current state-of-the-art neural models. All of this impedes generalizations that could be relevant in Sentence Simplification research.

Structural Simplicity.
Table 7 presents the correlations in each quality split of this data set. Reference-based metrics were computed using manual simplifications from HSplit, since the Structural Simplicity judgment is limited to qualifying sentence splitting, and references in HSplit were created by only applying that operation.
In this data set, most metrics have moderate correlations with human judgments when "All" evaluated instances are used. BLEU obtains the highest correlation, but it is not the best overall because its differences with BLEU-SARI (GM) and BERTScore Recall are not statistically significant. This would seem to contradict the findings of Sulem, Abend, and Rappoport (2018a), who argued that BLEU does not correlate well with Structural Simplicity. However, as will be shown in the next section, the magnitude of the correlation depends on the approach of the systems included in the study. Whereas Sulem, Abend, and Rappoport (2018a) only used models tailored for sentence splitting to reach that conclusion, in this first analysis we are using all available system outputs in the data set. The low correlation of SAMSA is surprising, since this metric was specifically designed to evaluate sentence splitting, and it showed better performance in the data set of Sulem, Abend, and Rappoport (2018b). However, they measured the correlation at the system-level, whereas we are analyzing it at the sentence-level. Finally, BERTScore Precision, the best metric in the Simplicity-DA data set, has the poorest correlation in the "All" data split. From previous results, we know that BERTScore Precision is good at measuring the similarity between a system output and a reference. As such, its low correlation would indicate that simple similarity matching is not enough to measure Structural Simplicity.
When comparing the correlations between the "Low" and "High" splits (also see Figure 6), we can notice that the ones in the former are much lower for all metrics but BERTScore Precision. In fact, this metric has the highest correlation in the "Low" split, with a substantial increase over its own correlation in the "All" data split. This could also be explained by our previous argument. A low score in Structural Simplicity implies that the system output does not contain any sentence splitting, or that the changes made are not structural. In these situations, BERTScore Precision would not be able to match a reference in HSplit, since they most likely contain only sentence splitting. In turn, the metric returns a low score that correlates well with a low human judgment.

Table 7
Absolute Pearson correlations between Structural Simplicity and metrics scores computed using references from HSplit, for low/high/all quality splits (N is the number of instances in the split). Correlations of metrics not significantly outperformed by any other in the quality split are boldfaced.

We further analyze the behavior of SAMSA, a metric specifically designed to evaluate Structural Simplicity. By design, SAMSA first uses a semantic parser to identify the Scenes in the original sentence, and a syntactic parser to identify the sentence splits in the system output. Then, it counts how many of the words corresponding to the Participants of each Scene align with words in each sentence split. Ideally, all Participants of a single Scene should appear in a single sentence split. The first example in Table 8 illustrates a case where this logic may be problematic. SAMSA identifies that there is only one Scene in the original sentence and only one sentence split in the system output. Because both sentences are identical, the word alignment is perfect and SAMSA gives the simplification the highest possible score. However, the human judges gave the instance a score of 0 because no changes were performed. On the one hand, this could suggest that SAMSA should only be used when sentence splitting was actually performed in the simplification instance. On the other hand, it could be argued that the original sentence was already structurally simple, and that no splitting was necessary, making the human score of 0 unfair. This points to possible issues in the data collection, and suggests that a −2 to +2 scale is perhaps unsuitable for these scenarios.
Other examples in Table 8 suggest that there are indeed problems. The second example shows that a perfectly reasonable and correct splitting (with a SAMSA score of 1.0) received a low score from the judges. More worryingly, the third example presents a sentence where no splitting was performed (and with substantial compression) that received the highest score for Structural Simplicity. This could indicate that the human judges did not consider sentence splitting as the only mechanism for improving the simplicity of the structure of a sentence.
In an attempt to quantify this phenomenon, Figure 7 presents the distribution of Structural Simplicity scores for instances where sentence splitting was performed and where it was not. Instances with splitting only amount to 17% (306/1,750) of the total of instances in the data set. Although this is a small proportion, their human scores span all possible values for Structural Simplicity. It is encouraging that most instances where no splitting was performed received a human score close to 0. However, there are many that were judged with high values of Structural Simplicity. We hypothesize that this is caused by a misunderstanding of the rating instructions, since many of these instances also contain substantial levels of compression (as in the third example of Table 8), which could not be considered as a type of rewriting that improves the structural simplicity of a sentence.
Improvement in Structural Simplicity is a relevant feature to evaluate in automatic simplifications. Isolating its assessment both manually and through metrics can contribute to a more fine-grained analysis of the performance of automatic systems. However, it is important to establish adequate quality control mechanisms that ensure the trustworthiness of the collected data, so that we can develop metrics that accurately resemble the intended human judgments.

Metrics across Types of Systems
We now investigate if metrics' correlations are affected by the type of system that generated the simplifications. For this study, we do not use the Simplicity Gain data set because it only provides simplifications produced by PBMT and SBMT systems.

Figure 7
Distribution of Structural Simplicity scores in the data set of Sulem, Abend, and Rappoport (2018c) for instances with and without sentence splitting in the system output.

4.3.1 Simplicity-DA. Table 9 presents the correlations of each metric for the different system types in this data set, with reference-based metrics computed using simplifications from ASSET. BERTScore Precision achieves the highest correlations in all groups, and for S2S and Sem+PBMT models, in particular, no other metric is statistically equal. Most metrics show higher correlations in the S2S group than in others. However, because the number of data points is smaller in the latter, stronger conclusions cannot be formulated. Overall, because the current trend is to develop S2S models, it is encouraging that modern metrics are capable of evaluating them, but keeping in mind the nuances we signalled in the previous section regarding quality levels.

Table 9
Pearson correlations between Simplicity-DA human judgments and automatic metrics scores computed using references from ASSET, for splits based on system type (N is the number of instances in the split). Correlations of metrics not significantly outperformed by any other in the system type split are boldfaced. Metrics are grouped in Reference-based (top) and Non-Reference-based (bottom).
Columns: SBMT (N = 100), PBMT (N = 100), S2S (N = 300), Sem+PBMT (N = 100).

4.3.2 Structural Simplicity. Table 10 presents the correlations of each metric in the different system type groups in this data set. Reference-based metrics were computed using manual simplifications from HSplit. All metrics achieve their highest correlations in the S2S group, except for BERTScore Precision. As presented in the previous section, this metric is particularly good at judging instances with low Structural Simplicity, which seem to be those from the PBMT and SBMT groups, mainly.
Previously, we observed that BLEU had high correlation with high-scoring quality judgments (in terms of Structural Simplicity). Here, we notice that this behavior is limited to simplifications produced by S2S and Sem+S2S systems. This appears to contradict the observations of Sulem, Abend, and Rappoport (2018a), who used this same data set to conclude that BLEU is a bad estimator of Structural Simplicity. The reason behind this disagreement is that for their sentence-level study "HSplit as Reference Setting," the systems they chose were those within the Sem and Sem+PBMT groups, for which BLEU, indeed, shows poor correlations. A possible reason for choosing this setup is explained by Figure 8. While S2S and Sem+S2S have more instances that were scored with good Structural Simplicity, these groups contain very few system outputs where sentence splitting was performed. Therefore, we believe that Sulem, Abend, and Rappoport's (2018a) conclusion should be more nuanced: BLEU is a bad metric to estimate Structural Simplicity in system outputs where sentence splitting was performed.
Nevertheless, not considering system outputs in the S2S and Sem+S2S groups reduces the future impact of the previous statement, since the current trend in Sentence Simplification research is developing that type of model. For their system-level study "Standard Reference Setting," Sulem, Abend, and Rappoport (2018a) included systems from the S2S group, but computed BLEU using references from Simple Wikipedia and TurkCorpus, which are not focused on sentence splitting. We believe that this experimental setting is unfair to BLEU, and that more cautious analysis should be performed to determine whether a metric should be used to assess Structural Simplicity in S2S models.

Table 10
Pearson correlations between Structural Simplicity judgments and automatic metrics scores computed using references from HSplit, for splits based on system type (N is the number of instances in the split). Correlations of metrics not significantly outperformed by any other in the system type split are boldfaced. Metrics are grouped in Reference-based (top) and Non-Reference-based (bottom).

Figure 8
Distribution of Structural Simplicity scores in the data set of Sulem, Abend, and Rappoport (2018c) for instances with and without sentence splitting in the system output and for each system type.

Effect of Simplification References
The third dimension of analysis for our meta-evaluation is the set of simplification references used to compute automatic evaluation scores. Because there can be multiple correct simplifications for the same original sentence, it is possible that a reference-based metric becomes more reliable if it has access to more manual references for comparison. It is worth remembering that whereas BLEU and SARI take all references for each original sentence into account when computing their scores, BERTScore takes one at a time and returns the maximum score. In this section, we investigate whether the correlations of reference-based metrics vary depending on using all available simplification references or particular subsets of them. We only experiment with the Simplicity-DA data set, because its simplicity judgments are not tied to performing a specific type of simplification operation, as is the case for the other data sets. Thus, having a more varied set of references could be beneficial for reference-based metrics in this scenario. In addition, we take advantage of the fact that the original sentences in the Simplicity-DA data set have corresponding manual simplifications in three multi-reference data sets: ASSET (10 references), TurkCorpus (8 references), and HSplit (4 references). Recall that the manual simplifications in each data set were produced via different operations: lexical paraphrasing in TurkCorpus; sentence splitting in HSplit; and lexical paraphrasing, compression, and sentence splitting in ASSET.
4.4.1 ASSET vs. All References. Table 11 presents the correlations of each reference-based metric computed using the 10 manual references from ASSET or their union with those from TurkCorpus (8 references) and HSplit (4 references), that is, what we refer to as "All References" (22). We further divide this data into "Low," "High," and "All" quality splits as in a previous section. As such, the left-hand side of Table 11 is the same as Table 3. We do not add the system type dimension since the number of instances in each subgroup would be too small to allow drawing strong conclusions.

Table 11
Pearson correlations between Simplicity-DA judgments and reference-based metrics scores grouped by the set of manual references used. Within each group, we divide the data into Low/High/All quality splits. Correlations of metrics not significantly outperformed by any other in their group and quality split are boldfaced. The scores in the left-hand side (under ASSET) are the same ones as in Table 3.

When using "All" instances, most metrics have a slight increase in their Pearson's r when All References are used, with BERTScore Precision achieving the highest correlations, and being statistically superior to every other metric. This improvement seems to be caused by better detection of "Low" quality simplifications. In fact, using All References slightly affects BERTScore Precision and most metrics when detecting system outputs of "High" Simplicity-DA. As in a previous section, we hypothesize that this is caused by the different degrees of simplicity that each manual reference has in each data set. By having more references available, BERTScore Precision is more likely to match one with a system output, and then return a high score. However, high similarity with a reference does not necessarily mean high improvements in simplicity, since the manual reference could correspond to a valid simplification but with a relatively low degree of simplicity.
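The effect of scoring against one reference at a time and keeping the maximum can be illustrated with a toy sketch. The similarity function here is plain unigram precision, chosen only as a stand-in and not BERTScore itself; `unigram_precision`, `max_over_references`, and `pooled_precision` are hypothetical names introduced for this illustration. The sentences are the system output and sample references of the third example in Table 4.

```python
from typing import Callable, List


def _tokens(text: str) -> List[str]:
    # Lowercased whitespace tokens with surrounding punctuation stripped.
    stripped = (tok.strip(".,;:!?()\"'").lower() for tok in text.split())
    return [tok for tok in stripped if tok]


def unigram_precision(candidate: str, reference: str) -> float:
    # Fraction of candidate tokens that also occur in the reference.
    cand = _tokens(candidate)
    ref = set(_tokens(reference))
    return sum(tok in ref for tok in cand) / len(cand)


def max_over_references(
    candidate: str,
    references: List[str],
    sim: Callable[[str, str], float] = unigram_precision,
) -> float:
    # One reference at a time, keep the best match (the strategy the text
    # attributes to BERTScore, here with a toy similarity function).
    return max(sim(candidate, ref) for ref in references)


def pooled_precision(candidate: str, references: List[str]) -> float:
    # All references considered jointly: a candidate token counts as matched
    # if it occurs in any reference (a crude analogue of multi-reference
    # n-gram matching, without count clipping).
    cand = _tokens(candidate)
    pool = {tok for ref in references for tok in _tokens(ref)}
    return sum(tok in pool for tok in cand) / len(cand)


output = "He was appointed Companion of Honour in 1988."
references = [
    "He was made the Companion of Honour (CH) in 1988.",
    "He was appointed Companion of Honour in 1988.",
    "In 1988 he was chosen as a Companion of Honour.",
]

without_exact_match = max_over_references(output, [references[0], references[2]])
with_exact_match = max_over_references(output, references)
print(without_exact_match, with_exact_match)
```

Under this toy similarity, adding a reference can only keep the maximum the same or raise it, which is consistent with the observation that a larger reference pool makes a high-scoring match more likely regardless of how simple the matched reference is.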

ASSET vs. Selected References.
In the previous analysis, we changed the set of references for all sentences that are being assessed at the same time. We now analyze the effect of changing the set of references for each sentence individually. More concretely, we devise an experiment where, for each automatically simplified sentence, reference-based metrics compare it to a subset of all available references based on the simplification operations that were performed. Therefore, for each sentence:

1. Identify the operations that were performed. We use the annotation algorithms in EASSE to label deletions, replacements, and splits at the sentence-level. For deletions and replacements, these algorithms leverage automatic word alignments between the original sentence and the automatic simplification, extracted using SimAlign (Jalili Sabet et al. 2020). If a word in the original sentence is aligned but not to an exact match in the simplification, then it is considered a replacement. If a word in the original sentence is not aligned, then it is considered as deleted. For identifying splits, we compute the number of sentences in the original and simplified sides using NLTK, 14 and register a split if the number in the simplified side is higher than the one in the original side. In preliminary experiments with a sample of 250 sentences, these algorithms achieved F1 scores of 0.76 for deletions, 0.78 for replacements, and 0.87 for splits. More details can be found in Alva-Manchego (2020, chapter 3).

2. Determine the references to use. Based on the operations identified, we treat three possible cases: (1) the system performed only sentence splitting; (2) the system performed only lexical paraphrasing and/or deletion; (3) the system performed another possible combination of operations. 15 Depending on the case, a different set of references would be used: HSplit for (1), TurkCorpus and ASSET for (2), and ASSET for (3). ASSET was added for case (2) because it also contains manual references where only lexical paraphrasing was applied.

3. Compute the metrics. Calculate the metrics' scores using the selected set of manual references.
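The first two steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the EASSE implementation: the word alignment is passed in as a plain dictionary (standing in for SimAlign output), a regular expression stands in for the NLTK sentence tokenizer, and the function names are hypothetical. Treating a sentence with no detected operation as case (3) is also an assumption, since the text does not say how that situation is handled.

```python
import re
from typing import Dict, List, Optional, Set


def _norm(token: str) -> str:
    # Compare tokens case-insensitively and without surrounding punctuation.
    return token.strip(".,;:!?()\"'").lower()


def count_sentences(text: str) -> int:
    # Crude stand-in for a real sentence tokenizer.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([p for p in parts if p])


def identify_operations(
    original: str,
    simplified: str,
    alignment: Dict[int, Optional[int]],
) -> Set[str]:
    """Label deletions, replacements, and splits at the sentence level.

    `alignment` maps each original token index to the index of its aligned
    token in the simplification, or to None when the token is unaligned.
    """
    orig_tokens = original.split()
    simp_tokens = simplified.split()
    operations: Set[str] = set()
    for i, token in enumerate(orig_tokens):
        j = alignment.get(i)
        if j is None:
            operations.add("deletion")      # unaligned original word
        elif _norm(simp_tokens[j]) != _norm(token):
            operations.add("replacement")   # aligned, but not an exact match
    if count_sentences(simplified) > count_sentences(original):
        operations.add("split")
    return operations


def select_reference_sets(operations: Set[str]) -> List[str]:
    if operations == {"split"}:
        return ["HSplit"]                   # case (1): only splitting
    if operations and operations <= {"replacement", "deletion"}:
        return ["TurkCorpus", "ASSET"]      # case (2): paraphrasing and/or deletion
    return ["ASSET"]                        # case (3): any other combination (assumed to include "no operation")


original = "Below are some useful links to facilitate your involvement."
simplified = "Below are some useful links to help with your involvement."
# Hand-written alignment for illustration: "facilitate" is aligned to "help".
alignment = {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 8, 8: 9}

operations = identify_operations(original, simplified, alignment)
print(operations, select_reference_sets(operations))
```

The third step then amounts to calling whichever metric implementation is in use with the references drawn from the returned data sets.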
Column "Selected References" in Table 11 presents the correlations of reference-based metrics computed following the previous process. All metrics but SARI improve their correlations when instances of "All" qualities are used. As before, this is caused by better detection of "Low" quality simplifications.

Recommendations for Automatic Evaluation
Our meta-evaluation has allowed us to better understand the behavior of traditional and more modern metrics for assessing automatic simplifications. Based on those findings, in this section we set out a list of recommendations related to the present and future of automatic evaluation of Sentence Simplification systems.

Evaluation of Current Simplification Systems
Automatic Metrics. It is difficult to determine an overall "best" metric across all types of simplicity judgments. For Simplicity-DA, BERTScore Precision achieved the highest correlations in all dimensions of analysis. For Simplicity Gain, SARI is better than all BERTScore variants, but that difference is not statistically significant when assessing low and high quality simplifications separately. In addition, there is not enough data to determine if that behavior translates to modern sequence-to-sequence models. The comparison is even less clear for Structural Simplicity, since the correlations are heavily dependent on the system type or, rather, on evaluating simplifications where sentence splitting was actually performed, instances of which are insufficient in the data set used. SAMSA was specifically developed for this type of simplicity evaluation, and manual inspection suggests that it is doing what it was designed for. As such, even though our analysis does not seem to support its use, we argue that this is caused by the lack of adequate data with judgments on Structural Simplicity. Overall, we suggest using multiple metrics, and mainly BERTScore Precision, for reference-based evaluation. SARI could be used when the simplification system only executed lexical paraphrasing, and SAMSA may be useful when it is guaranteed that splitting was performed.
Simplification References. Simplifications in ASSET are well suited for reference-based evaluation. Incorporating references from TurkCorpus and HSplit seems to only slightly improve the correlations. In addition, it appears that selecting which references to use for each sentence individually benefits the computation of metrics. However, for both cases, the improvements are limited to evaluation of low-quality simplifications.
Interpretation of Automatic Scores. For Simplicity-DA, low scores of most metrics appear to be good estimators of low quality, whereas high scores do not necessarily suggest high quality. This indicates that metrics could be more useful for development stages of simplification models. Following the recommendation of using multiple metrics, we suggest using BERTScore Precision to get a first evaluation. If the score is low, then it signals that the quality of the output is also low. However, when the score is high, it is important to look at other metrics, such as SARI or SAMSA, to verify the correctness of the simplification operations. Nevertheless, for final arguments on the superiority of one system over another, human evaluation should be preferred. For Simplicity Gain, metrics' correlations are low to moderate in general, so it is unclear if they are actually measuring this type of human judgment. In the case of Structural Simplicity, inconsistencies in the human judgments (i.e., high scores for instances where no splitting was performed) hinder the interpretation of results.
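The asymmetric reading of BERTScore Precision recommended above for Simplicity-DA can be written as a small decision rule. This is only a sketch: the function name `triage` is hypothetical, and the threshold is an illustrative parameter, since no cut-off value is given in this article. The example value is the BERTScore Precision of the third instance in Table 4.

```python
def triage(bertscore_precision: float, low_threshold: float) -> str:
    """Give a coarse reading of a BERTScore Precision value."""
    if bertscore_precision < low_threshold:
        # Low scores appear to be good estimators of low quality.
        return "likely low quality"
    # High scores do not necessarily mean high quality, so defer the verdict.
    return "inconclusive: inspect SARI/SAMSA and, for final claims, human ratings"


# Third example of Table 4: BERTScore Precision 1.000, Simplicity-DA 0.436.
# The 0.8 threshold is an arbitrary illustrative value, not a recommendation.
verdict = triage(1.000, low_threshold=0.8)
print(verdict)
```

In that example a perfect automatic score accompanies a modest human score, which is exactly the situation in which the rule defers to other evidence.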

Development of New Metrics
Considering the advantages and disadvantages of current metrics, as well as the problems identified in the data used for evaluating them, we provide some suggestions for the development of new resources for automatic evaluation.
Collection of New Human Judgments. We experimented with crowdsourcing simplicity judgments following a methodology inspired by Direct Assessment, which has been successful in Machine Translation research. We believe that submitting continuous scores on how much simpler a system output is over the original sentence gives raters more flexibility in their judgments, and facilitates subsequent analyses. However, although the type of score collected (continuous or discrete) influences the ratings, it is even more important to ensure that raters submit judgments that follow the kind of simplicity that is intended to be measured. As such, it is paramount to train raters before they perform the actual task, and establish quality control mechanisms throughout the data collection process. In relation to the kind of simplicity judgment to elicit, both Simplicity Gain and Structural Simplicity have advantages over requesting absolute simplicity scores. Therefore, we recommend collecting more human judgments based on them, using modern simplification models and simplification instances with adequate characteristics for what we are trying to evaluate.
Characteristics of New Metrics. For Simplicity-DA, Simplicity Gain, and Structural Simplicity, raters had to compare the automatic simplification to the original sentence, and then submit a particular kind of judgment. Therefore, if humans submit evaluations taking both the original sentence and the simplification into consideration, then we should expect that automatic metrics do so too. Both SARI and SAMSA follow this logic, and we would expect that new metrics take that idea even further, for example, by replacing n-gram matching in SARI and syntax-based word alignments in SAMSA with similarity of contextual word embeddings, as is done in BERTScore. Furthermore, we have explained that not every manual simplification in multi-reference data sets (i.e., ASSET, TurkCorpus, and HSplit) has the same simplicity level. Therefore, it could be useful to enrich references with human judgments on their simplicity. In this way, an automatic score would not be only based on the similarity to a reference, but also on the potential level of simplicity that the system output could achieve if it were an exact match with that particular reference. Perhaps metrics could even combine how similar the system output is to a reference with the simplicity level that could be achieved.
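What it means for a metric to take the original sentence into account can be made concrete with a simplified unigram version of the kept/added/deleted bookkeeping attributed to SARI earlier. This sketch is not SARI: the actual metric works over n-grams and multiple references and aggregates the counts into a single score. The function names are hypothetical, and the reference sentence is invented for illustration; the original sentence and system output are the second example in Table 6.

```python
from typing import Dict, Set


def _token_set(text: str) -> Set[str]:
    # Lowercased whitespace tokens with surrounding punctuation stripped.
    stripped = (tok.strip(".,;:!?()\"'").lower() for tok in text.split())
    return {tok for tok in stripped if tok}


def edit_sets(original: str, rewritten: str) -> Dict[str, Set[str]]:
    # Describe a rewrite relative to the original sentence.
    orig, new = _token_set(original), _token_set(rewritten)
    return {"kept": orig & new, "added": new - orig, "deleted": orig - new}


def correct_edits(original: str, output: str, reference: str) -> Dict[str, Set[str]]:
    # An edit made by the system counts as correct if the reference made it too.
    system = edit_sets(original, output)
    gold = edit_sets(original, reference)
    return {op: system[op] & gold[op] for op in system}


original = ("The Great Dark Spot is thought to represent a hole in the "
            "methane cloud deck of Neptune.")
output = ("The Great Dark Spot is thought to be a hole in the "
          "methane cloud deck of Neptune.")
# Hypothetical reference, invented for this illustration only.
reference = ("The Great Dark Spot is believed to be a hole in the "
             "methane cloud deck of Neptune.")

correct = correct_edits(original, output, reference)
print({op: len(tokens) for op, tokens in correct.items()})
```

The single replacement of "represent" by "be" shows up as one addition and one deletion rather than as a replacement, while the kept tokens dominate the counts, which mirrors the argument that conservative outputs are rewarded mainly for what they copy.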
Analysis Beyond Absolute Correlations. Our meta-evaluation has shown that different factors influence the correlation of human judgments with automatic scores, namely: perceived quality level, system type, and set of references used for computation. As such, new automatic metrics should not only be evaluated on their absolute overall correlation. It is important to also analyze the reasons behind that value considering the different factors that could be affecting it. In this way, we can determine in which situations the new metrics prove more advantageous than others.
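A breakdown of this kind is straightforward to compute once each evaluated instance carries the factor of interest. The sketch below groups instances by an arbitrary key (system type here) and reports Pearson's r per group alongside the overall value. The records are synthetic numbers made up for illustration, not data from any of the data sets discussed, and the function names are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    # Plain Pearson correlation; assumes both sequences have non-zero variance.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlations_by_group(
    records: List[dict],
    group_key: str,
    human_key: str = "human",
    metric_key: str = "metric",
) -> Dict[str, float]:
    # Collect instances per group, plus an "All" bucket with every instance
    # (assumes no group is itself named "All").
    groups: Dict[str, List[dict]] = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record)
        groups["All"].append(record)
    return {
        name: pearson_r([r[human_key] for r in recs], [r[metric_key] for r in recs])
        for name, recs in groups.items()
    }


# Synthetic toy records: the metric tracks the human score in one group only.
records = [
    {"system": "S2S", "human": 0.2, "metric": 0.3},
    {"system": "S2S", "human": 0.5, "metric": 0.6},
    {"system": "S2S", "human": 0.9, "metric": 0.9},
    {"system": "PBMT", "human": 0.1, "metric": 0.8},
    {"system": "PBMT", "human": 0.4, "metric": 0.7},
    {"system": "PBMT", "human": 0.6, "metric": 0.9},
]

by_system = correlations_by_group(records, group_key="system")
for name, r in by_system.items():
    print(f"{name}: r = {r:.3f}")
```

The same function can be reused with a quality-split label or a reference-set label as `group_key`; a significance test between groups would still be needed before drawing conclusions.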

Conclusions
In this article, we studied the degree to which current evaluation metrics measure the ability of automatic systems to perform sentence-level simplifications, especially when multiple operations were applied.
We collected a new data set for evaluation of automatic metrics following the Direct Assessment methodology to crowdsource human ratings on fluency, meaning preservation and simplicity. The data set consists of 600 automatic simplifications generated by six different systems, three of which are based on modern neural sequence-to-sequence architectures. This makes it bigger and more varied than the Simplicity Gain data set. In addition, we collected 15 ratings per simplification instance to increase annotation reliability, contrasting with the Simplicity Gain data set that has five raters, and the Structural Simplicity data set that only has three. Our data collection process can be fine-tuned, and more system outputs should be included. However, our data set's features are sufficient to offer an alternative view of simplicity judgments over system outputs.
We used our newly collected data set (Simplicity-DA), together with the Simplicity Gain and Structural Simplicity data sets, to conduct, to the best of our knowledge, the first meta-evaluation study of automatic metrics in Sentence Simplification. We analyzed the variations of the correlations of sentence-level metrics with human judgments along three dimensions: the perceived simplicity level, the system type, and the set of references used to compute the automatic scores. For the first dimension, we found that metrics can more reliably score low-quality simplifications in terms of Simplicity-DA, while this effect is not apparent in Simplicity Gain and no strong conclusions could be drawn for Structural Simplicity due to inconsistencies in the ratings. For the second dimension, correlations change based on the system type. In the Simplicity-DA data set, most metrics are better at scoring system outputs from neural sequence-to-sequence models. While this difference in correlation is more significant in the Structural Simplicity data set, it seems to be caused by low representation of sentence splitting in the data, rather than differences in system type. This highlights the importance of analyzing outputs of several types of systems (e.g., neural and non-neural) with all the characteristics under study (e.g., split sentences), to prevent obtaining conclusions that are limited to a certain subgroup of models. For the third dimension, combining all multi-reference data sets does not significantly improve metrics' correlations over using only ASSET in the Simplicity-DA data set. Further analyses on the diversity of the manual references across ASSET, TurkCorpus, and HSplit should be performed in order to explain this result. In addition, preliminary experiments on per-sentence reference selection based on the performed operations showed promising results.
Based on the findings of our meta-evaluation, we designed a set of guidelines for automatic evaluation of current simplification models. In particular for multi-operation simplifications, we suggest using BERTScore with references from ASSET during the development stage of simplification models, and manual evaluation for final comparisons. The main reason is that BERTScore is very good at identifying references that are similar to a system output. However, because not all references have the same simplicity level, a high similarity with a reference does not necessarily indicate high (improvements in) simplicity. Finally, we proposed a set of desiderata for the characteristics of new resources for automatic evaluation, namely: (1) to further collect Simplicity Gain and Structural Simplicity ratings with better quality controls and diversity of system outputs; (2) to develop metrics that take both the original sentence and the automatic simplification into consideration; (3) to enrich manual references with their simplicity level; and (4) to evaluate new metrics along several dimensions and not just overall absolute correlation with human ratings on some form of simplicity.