Revisiting Meta-evaluation for Grammatical Error Correction

Abstract

Metrics are the foundation of automatic evaluation in grammatical error correction (GEC), and the evaluation of the metrics themselves (meta-evaluation) relies on their correlation with human judgments. However, conventional meta-evaluations in English GEC face several challenges, including biases caused by inconsistencies in evaluation granularity and an outdated setup that uses classical systems. These problems can lead to misinterpretation of metrics and may hinder the applicability of GEC techniques. To address these issues, this paper proposes SEEDA, a new dataset for GEC meta-evaluation. SEEDA consists of corrections with human ratings at two different granularities, edit-based and sentence-based, covering 12 state-of-the-art systems including large language models, and two human corrections with different focuses. The improved correlations obtained by aligning granularity in sentence-level meta-evaluation suggest that edit-based metrics may have been underestimated in existing studies. Furthermore, the correlations of most metrics decrease when moving from classical to neural systems, indicating that traditional metrics are relatively poor at evaluating fluently corrected sentences with many edits.


Introduction
Grammatical error correction (GEC) is the task of automatically detecting and correcting errors, including grammatical, orthographic, and semantic errors, within a given sentence. The prevailing approach in GEC involves the use of a sequence-to-sequence method (Bryant et al., 2023).
Automatic evaluation metrics play an important role in the progress of GEC. These metrics are essential for a fast and efficient improvement cycle in system development because they can replace costly and time-consuming human evaluations and immediately reflect system performance. GEC has made progress by enabling a fair comparison of performance on common benchmarks using these metrics in shared tasks (Dale and Kilgarriff, 2011; Dale et al., 2012; Ng et al., 2013, 2014; Bryant et al., 2019).
GEC metrics are categorized into edit-based and sentence-based types according to their evaluation granularity, and each has its own objectives. Edit-Based Metrics (EBMs), such as M2 (Dahlmeier and Ng, 2012) and ERRANT (Bryant et al., 2017), focus on evaluating the quality of the edits themselves, whereas Sentence-Based Metrics (SBMs), such as GLEU (Napoles et al., 2015), evaluate the quality of the entire sentence after correction. Since the system output consists only of sentences without explicit edits, EBMs require edits to be extracted from the system output by some method. These metrics are primarily evaluated based on their correlation with human judgments (i.e., meta-evaluation).
Most of the previous meta-evaluations in English GEC have relied on Grundkiewicz et al. (2015)'s dataset with human judgments (henceforth referred to as GJG15). However, existing meta-evaluations based on GJG15 (Grundkiewicz et al., 2015; Chollampatt and Ng, 2018b; Yoshimura et al., 2020; Gong et al., 2022) have several significant issues. First, the performance of EBMs may be underestimated due to biases resulting from inconsistencies in evaluation granularity. As an example of such bias, while EBMs assign the lowest score (or the highest score in sentence-level evaluation) to the uncorrected sentence, sentence-based human evaluation, such as GJG15, assigns scores across the entire range. Furthermore, as the actual data in Table 1 show, human evaluations may yield different results depending on granularity, which suggests a need to separate evaluations

Context: It is hereditary. [...] In retrospect, its is also ones duty to ensure that he or she undergo periodic healthchecks in their own.
Source: Do one who suffered from this disease keep it a secret of infrom their relatives?
Output A: Should someone who suffered from this disease keep it a secret or inform their relatives?
Output B: Does someone who suffers from this disease keep it a secret from their relatives?
Edit-based human evaluation: Output A: Rank 1; Output B: Rank 1
Sentence-based human evaluation: Output A: Rank 1; Output B: Rank 5

Table 1: Actual data taken from our dataset, showing that the results of human evaluation vary depending on the granularity. In the edit-based evaluation, output B was assigned the highest rank (tied with output A), while in the sentence-based evaluation, output B received the lowest rank. The results suggest that, even if all edits are considered valid, there are instances where the corrected sentence may lack fluency and naturalness in context.
for edits and sentences. Second, GJG15 was manually evaluated against the set of classical systems in the CoNLL-2014 shared task, such as the statistical machine translation approach (Junczys-Dowmunt and Grundkiewicz, 2014) and the classifier-based approach (Rozovskaya et al., 2014). The gap between the classical systems in GJG15 and modern GEC systems based on deep neural networks therefore limits the applicability of meta-evaluation. Third, a single correlation computed from the current fixed set of systems may not sufficiently capture the performance of metrics, leading to the possibility of drawing incorrect conclusions. For example, Deutsch et al. (2021)'s study on meta-evaluation of summarization revealed that certain metrics can exhibit a spectrum of correlation values, ranging from weak negative to strong positive. Mathur et al. (2020)'s study also showed that outlier systems have a strong influence on correlations in meta-evaluation of machine translation. We are therefore concerned that a similar scenario could occur in GEC.
To address these issues, we propose SEEDA,1 a new dataset to improve the validity of meta-evaluation in English GEC. Specifically, we carefully designed SEEDA to address the first and second issues by performing human evaluations corresponding to metrics of two different granularities (i.e., EBMs and SBMs), covering 12 state-of-the-art system corrections including large language models (LLMs), and two human corrections with different focuses (§3 and §4). Also, through meta-evaluation using SEEDA, we investigate whether EBMs, such as M2 and ERRANT, are underestimated and demonstrate how the correlation varies between classical systems and neural systems (§6). Furthermore, to address the third issue, we investigate the inadequacy of GEC meta-evaluation based solely on a single correlation by analyzing the presence of outliers and using window analysis (§7). Finally, we discuss best practices and provide recommendations for future researchers to properly meta-evaluate GEC metrics and evaluate their GEC models (§8).

1 SEEDA stands for Sentence-based and Edit-based human Evaluation DAtaset for GEC. We have made this dataset publicly available at https://github.com/tmu-nlp/SEEDA.
Our contributions are summarized as follows.
(1) We construct a new dataset that allows for bias-free meta-evaluation suited to modern neural systems. (2) Our dataset analysis shows that sentence-level human evaluation results vary depending on the evaluation granularity. (3) Through meta-evaluation, we identify two findings: aligning the granularity between human evaluation and the metric enhances correlations, and correlations differ between classical and neural systems. (4) By investigating the influence of outliers and system sets, we discover that meta-evaluation under a single setting cannot reveal the detailed characteristics of a metric. We also find that existing metrics lack the precision to differentiate between the performances of top-tier systems.
Related work

Meta-evaluation: Grundkiewicz et al. (2015) proposed a dataset (GJG15) with sentence-based human ratings for system outputs on the CoNLL-2014 test set and found that M2 has a moderate positive correlation with human judgments. Simultaneously, Napoles et al. (2015) constructed a dataset through a similar human evaluation and observed that their proposed metric, GLEU, correlates more strongly than M2. Both studies found no correlation with I-measure (Felice and Briscoe, 2015). Chollampatt and Ng (2018b) carried out significance tests between various metrics using GJG15. They concluded that there was no clear distinction in performance between M2 and GLEU, with I-measure proving to be the most robust metric. However, these experiments are based on classical systems and thus deviate from modern neural systems. MAEGE, proposed by Choshen and Abend (2018a), applies multiple partial edits to the uncorrected sentence and assigns pseudo-scores based on the number of edits, aiming for a meta-evaluation independent of human evaluation. MAEGE does not consider system outputs and human evaluations, so it should be distinguished from existing meta-evaluations that rely on humans. Moreover, since it does not account for errors that machines might make but humans would not, the need for human evaluation of system outputs persists. Furthermore, Napoles et al. (2019) constructed GMEG-Data by collecting human judgments on continuous scales for the CoNLL-2014 test set and three domain-specific datasets. Their findings highlighted diverse correlations across the different domains. They explored neural systems, but these deviate from mainstream systems pretrained with pseudo data and fine-tuned based on the Transformer (Vaswani et al., 2017). While SEEDA offers greater validity due to its focus on contemporary target systems and on evaluation granularity, GMEG-Data has the advantage of allowing meta-evaluation using the entire CoNLL-2014 benchmark across various domains.
Reference-based evaluation: In GEC evaluation, the commonly used metrics rely on reference sentences. Some of the most prevalent are M2, ERRANT, and GLEU. Both M2 and ERRANT calculate an F0.5 score by comparing the edits in the corrected sentence to those in the reference. In contrast, GLEU evaluates based on the matching of N-grams between the corrected sentence and the reference. I-measure evaluates the degree of improvement over the original sentence using the weighted precision of edits. There are also newer metrics such as GoToScorer (Gotou et al., 2020), which takes into account the difficulty of corrections, and PT-M2 (Gong et al., 2022), which extends M2 (and ERRANT) with pretraining-based metrics. It is worth noting that these reference-based evaluations can lose validity when reference coverage is limited.
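The edit-comparison core shared by these F0.5-based metrics can be sketched as follows. This is a simplified illustration, assuming edits are already extracted as (start, end, replacement) tuples; it omits M2's dynamic edit alignment and ERRANT's linguistic merging, and the `edit_f05` helper and its toy edits are ours, not part of any released scorer.

```python
def edit_f05(hyp_edits, ref_edits, beta=0.5):
    """F_beta over edit sets: the scoring core of M2/ERRANT-style metrics.

    Edits are hashable tuples such as (start, end, replacement). This
    sketch simply intersects pre-extracted edit sets and skips the
    metrics' own edit-extraction machinery.
    """
    hyp, ref = set(hyp_edits), set(ref_edits)
    tp = len(hyp & ref)
    precision = tp / len(hyp) if hyp else 1.0
    recall = tp / len(ref) if ref else 1.0
    if precision + recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical edits: one agrees with the reference, one does not.
hyp = [(0, 1, "Should someone"), (9, 10, "or inform")]
ref = [(0, 1, "Should someone"), (9, 10, "from")]
print(edit_f05(hyp, ref))  # 0.5: precision = recall = 0.5
```

Because beta = 0.5 weights precision twice as heavily as recall, a system is penalized more for proposing bad edits than for missing good ones, which is the usual convention in GEC.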
Reference-less evaluation: Evaluations without reference sentences aim to overcome the coverage issue. GBM (Napoles et al., 2016b) estimates grammaticality by identifying the number of errors in a sentence. However, it may be insensitive to semantic changes. To address this limitation, GFM (Asano et al., 2017) was proposed; it incorporates sub-metrics to estimate grammaticality, fluency, and meaning preservation. Additionally, USim (Choshen and Abend, 2018b) was developed specifically to estimate semantic faithfulness. SOME (Yoshimura et al., 2020) draws inspiration from GFM and optimizes each sub-metric against human evaluation using BERT (Devlin et al., 2019). Scribendi Score (Islam and Magnani, 2021) relies on various factors, including GPT-2 perplexity, the token sort ratio, and the Levenshtein distance ratio, to evaluate correction quality. IMPARA (Maeda et al., 2022) fine-tunes BERT using only parallel data to quantify the impact of corrections. In terms of quality estimation, Chollampatt and Ng (2018a) introduced the first neural approach that does not rely on handcrafted features, while Liu et al. (2021) considered interactions between hypotheses using inference graphs.

The SEEDA dataset
The SEEDA dataset consists of corrections annotated with human ratings at two different evaluation granularities, edit-based and sentence-based, covering 12 state-of-the-art neural systems including LLMs, and two human corrections. The dataset is denoted SEEDA-E for the edit-based evaluation and SEEDA-S for the sentence-based evaluation. In this section, we describe the SEEDA dataset: how we generated the corrections (§3.1) and how we collected the annotations (§3.2).

We use the CoNLL-2014 test set (Ng et al., 2014) as our input data, consisting of test essays and their error annotations. The test essays were written by non-native English-speaking students at the National University of Singapore and cover two genres: genetic testing and social media. Error annotations for the test essays were produced by two native English speakers. The data comprise a total of 50 essays, consisting of 1,312 sentences and 30,144 tokens.

GEC Systems
To align with the current landscape of GEC, we collect corrections using the two mainstream neural approaches: sequence-to-sequence and sequence tagging (Bryant et al., 2023). To investigate how discriminative current metrics are, top-tier systems should be included among the target systems, including the LLMs that have received increased attention in recent years. Following these requirements, we carefully selected 11 systems, ensuring that the count is no less than the number of systems in GJG15. Among these, eight systems are sequence-to-sequence models that generate each token autoregressively: TemplateGEC (Li et al., 2023), TransGEC (Fang et al., 2023), T5 (Rothe et al., 2021), LM-Critic (Yasunaga et al., 2021), BART (Lewis et al., 2019), BERTfuse (Kaneko et al., 2020), Riken Tohoku (Kiyono et al., 2019), and UEDIN-MS (Grundkiewicz et al., 2019). The remaining three systems are sequence tagging models that predict edit tags in parallel: GECToR-ens (Tarnavskyi et al., 2022), GECToR-BERT (Omelianchuk et al., 2020), and PIE (Awasthi et al., 2019). Following the recent LLM trend, we also consider GPT-3.5 (text-davinci-003) with two-shot learning (Coyne et al., 2023). We include INPUT (the source from the CoNLL-2014 test set), since GEC evaluation requires consideration of uncorrected sentences. We also consider REF-M (minimal edit references by experts) and REF-F (fluency edit references by experts), introduced by Sakaguchi et al. (2016), to compare system performance with human corrections, bringing the total to 15 sentence sets.
Figure 1 shows the M2 score (F0.5) and word edit rate for the classical systems in GJG15, the neural systems in SEEDA, and human sentences. Comparing these systems, the neural systems in SEEDA make more edits and demonstrate better correction performance from the perspective of M2. This performance comparison uses the most common GEC evaluation method, reproducing results reported in existing studies. On the other hand, it contains intuitive contradictions, such as the lower performance of human-corrected sentences and LLMs. Therefore, we investigate and report how modern system comparisons deviate from human judgments (§4.2). Note that few-shot systems such as GPT-3.5 are known not to ground to target sentences as well as fine-tuned models and may produce fluent but lengthy corrections that do not preserve the meaning of the source (Maynez et al., 2023).
[Figure 2: Overview of the edit-based annotation flow. Step 1 covers error identification on the source (e.g., "It is againt his or her human rights and it is against the law's spirit.") and the checking of system edits on the output; Step 2 covers corrected error coverage on the source.]

Annotation scheme
Edit-based human evaluation: In the edit-based human evaluation, we evaluate only the edits in the system output. We perform step-by-step sequence labeling using the doccano annotation tool (Nakayama et al., 2018). We decided to divide the process into two steps to avoid complicating the annotation.
Figure 2 shows an overview of the annotation flow and an example of edit-based human evaluation. In Step 1, errors are detected in the source and the edits in the output are checked. During the initial error detection, annotators refer to the 25 error categories of Bryant et al. (2017) to identify error locations in the source, enabling them to label errors at the minimal unit level. In the subsequent edit checking, annotators make a binary decision as to whether applying the edits in the output would improve the source. To reduce annotation costs, ERRANT is used to extract edits. When there are conflicting edits (e.g., for a subject-verb agreement error), the one that aligns with the context is deemed effective, while the other is considered ineffective. Furthermore, for edits that depend on each other (e.g., [law's→ ] and [ →of the law] in Figure 2), each is assigned an independent label, but they are deemed effective only if all dependent edits are present. In Step 2, the annotator makes a binary decision as to whether each edit in the output effectively corrects the errors found in Step 1. Finally, we compute F0.5 based on precision and recall for each corrected sentence and rank the set of corrected sentences accordingly. Supplementary information about the annotation is provided in Appendix A.
Sentence-based human evaluation: Following Grundkiewicz et al. (2015), sentence-based human evaluation is performed using the Appraise evaluation scheme (Federmann, 2010). Annotators read the context in the same way as in the edit-based human evaluation. The corrected sentences are then ranked relative to one another from best to worst, with ties allowed. The judgment of whether a sentence is good or bad is left to the subjectivity of each annotator.
Annotator and sampling method: Each annotation was performed by three native English speakers with extensive knowledge of the language. To observe differences by evaluation granularity, they were responsible for the same set of edit-based and sentence-based annotations. Following Grundkiewicz et al. (2015), we sampled 200 subsets from the 1,312 correction sets for the CoNLL-2014 test set using a parameterized distribution that favors more diverse outputs. To measure inter- and intra-annotator agreement, we duplicated at least 12.5% of the subsets. One subset may contain up to five sentences, and the annotator creates a ranking from those sentences.

Dataset analysis
In this section, we analyze SEEDA with a focus on evaluation granularity. First, we present the dataset statistics (§4.1). Second, we produce human rankings of the systems using rating algorithms to conduct a system-level meta-evaluation (§4.2). Third, we quantitatively analyze disparities in human evaluations across evaluation granularities (§4.3).

Dataset statistics
Table 3 presents the statistics for pairwise judgments by annotators. Each annotator created 200 rankings, resulting in a total of 600 rankings. We take all pairs of sentences (A, B) within a ranking, make a pairwise judgment (A>B, A=B, A<B), and count their numbers. To investigate the frequency of duplicate corrections, the raw data was expanded by treating systems that produced the same output independently. As a result, the number of pairwise evaluations increased significantly. This finding, similar to that for classical systems in Grundkiewicz et al. (2015), suggests that even high-performing neural systems that make many edits often generate duplicated corrections. In the following experiments, we use the raw data of pairwise judgments. Table 4 shows average inter- and intra-annotator agreement, measured with Cohen's kappa coefficient (κ) (Cohen, 1960). Compared to the results in Grundkiewicz et al. (2015), the high inter- and intra-annotator agreement indicates that the annotators provided more consistent evaluations.
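Cohen's kappa over such pairwise judgments can be computed as in the following small sketch; the two toy label lists are illustrative, not taken from the dataset.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators' pairwise judgments
    (e.g. 'A>B', 'A=B', 'A<B'): observed agreement corrected for
    the agreement expected by chance."""
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_e = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
              for c in categories)
    return (p_o - p_e) / (1 - p_e) if p_e != 1 else 1.0

# Illustrative judgments from two hypothetical annotators.
ann1 = ["A>B", "A=B", "A<B", "A>B", "A>B", "A<B"]
ann2 = ["A>B", "A>B", "A<B", "A>B", "A=B", "A<B"]
print(round(cohens_kappa(ann1, ann2), 3))  # 0.455
```

Intra-annotator agreement is obtained the same way, by comparing an annotator's judgments on the duplicated subsets against their own earlier judgments.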

Human rankings
Following Grundkiewicz et al. (2015), we employed two rating algorithms, TrueSkill (TS) from Sakaguchi et al. (2014) and Expected Wins (EW) from Bojar et al. (2013), to create human rankings based on pairwise judgments. Table 2 shows the human rankings generated using TS for both edit-based and sentence-based evaluations. In contrast to the classical systems in GJG15, all the neural systems receive ranks surpassing INPUT. This indicates a tendency of these systems to improve uncorrected sentences through correction. Systems based on GPT and T5 architectures (e.g., GPT-3.5, T5, TransGEC) achieve higher rankings than REF-M. This suggests that these systems may offer corrections that even surpass human ones.

[Figure caption: Comparing the orange and blue regression lines to the gray regression line allows us to observe the degree of influence of each outlier on the distribution trend. For example, the leftward tilt of the orange regression lines for M2, PT-M2, ERRANT, and GLEU indicates a negative impact from fluent sentences as outliers.]
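The Expected Wins idea can be sketched as follows: a system's score is its average win rate against every other system it was paired with. This is a simplified version (the WMT formulation differs in details such as tie handling), and the system names in the example are illustrative.

```python
from collections import defaultdict

def expected_wins(pairwise):
    """Expected Wins sketch: score each system by its average win rate
    against each system it was compared with. Ties count as comparisons
    but not as wins in this simplified version."""
    wins = defaultdict(lambda: defaultdict(int))
    total = defaultdict(lambda: defaultdict(int))
    for sys_a, sys_b, outcome in pairwise:  # outcome: 'a', 'b', or 'tie'
        total[sys_a][sys_b] += 1
        total[sys_b][sys_a] += 1
        if outcome == 'a':
            wins[sys_a][sys_b] += 1
        elif outcome == 'b':
            wins[sys_b][sys_a] += 1
    return {s: sum(wins[s][o] / total[s][o] for o in total[s]) / len(total[s])
            for s in total}

# Hypothetical pairwise judgments over three systems.
judgments = [("T5", "PIE", "a"), ("T5", "PIE", "a"), ("T5", "BART", "tie"),
             ("BART", "PIE", "b"), ("T5", "BART", "a")]
scores = expected_wins(judgments)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['T5', 'PIE', 'BART'] under these toy judgments
```

TrueSkill instead fits a latent skill distribution per system and updates it judgment by judgment, but both algorithms ultimately turn the same pairwise judgments into a total order over systems.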

Difference in human evaluation by granularity
We perform a quantitative analysis of the variations in human evaluation based on granularity.
To measure sentence-level agreement, we calculate the average intra-annotator κ between edit-based and sentence-based evaluations. The result, a modest 0.36, indicates low agreement. On the other hand, the system-level κ computed from pairwise judgments over the human rankings stands at a much higher 0.83, revealing negligible disparity. This indicates a pronounced difference at the sentence level but a relatively minor one at the system level, suggesting that biases are more prominent in sentence-level meta-evaluation.

Edit-based metrics
M2 (Dahlmeier and Ng, 2012). It compares the edits in the corrected sentence with those in the reference. It dynamically searches for edits that optimize alignment with the reference edits using Levenshtein alignment (Levenshtein, 1966).

SentM2. It is a variant of M2 that calculates the F0.5 score at the sentence level.

PT-M2 (Gong et al., 2022). It is a hybrid metric that combines M2 and BERTScore (Zhang et al., 2019). It can measure the semantic similarity between pairs of sentences rather than just comparing edits.
ERRANT (Bryant et al., 2017). It is similar to M2 but uses linguistically enhanced Damerau-Levenshtein alignment for extracting edits. It is characterized by its ability to calculate the F0.5 score for each error type.

SentERRANT. It is a variant of ERRANT that computes the sentence-level F0.5 score.

Table 5: System-level and sentence-level meta-evaluation results excluding outliers. We use Pearson (r) and Spearman (ρ) for system-level meta-evaluation, and Accuracy (Acc) and Kendall (τ) for sentence-level meta-evaluation. The sentence-based human evaluation dataset is denoted SEEDA-S and the edit-based one SEEDA-E. Scores in bold represent the metrics with the highest correlation at each granularity. There is a trend of improving correlation when aligning the metrics at the sentence level (SEEDA-S vs. SEEDA-E) and of decreasing correlation when changing the target systems from classical to neural (GJG15 vs. SEEDA-S).

PT-ERRANT. It is a variant of PT-M2 where the base metric has been changed from M2 to ERRANT.

GoToScorer (Gotou et al., 2020). It calculates the F0.5 score while considering the difficulty of correction. The difficulty is computed based on the number of systems that were able to correct the error.

Sentence-based metrics
GLEU (Napoles et al., 2015). It is based on BLEU (Papineni et al., 2002), which is commonly used in machine translation. It rewards N-grams in the output that match the reference but are not in the source, while penalizing N-grams in the source that do not match the reference. For better evaluation, we use GLEU without tuning (Napoles et al., 2016a).
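The reward/penalty counting at the heart of GLEU can be sketched as follows. This simplified single-reference version keeps the intuition (credit hypothesis n-grams found in the reference, penalize n-grams carried over from the source that the reference removed) but omits the brevity penalty and the exact clipping of the published metric, so it is an illustration rather than the official scorer.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def gleu_sketch(source, hypothesis, reference, max_n=2):
    """Simplified single-reference GLEU-style score: per n-gram order,
    reward hypothesis n-grams found in the reference and penalize n-grams
    shared with the source but absent from the reference, then take the
    geometric mean over orders."""
    precisions = []
    for n in range(1, max_n + 1):
        hyp = ngrams(hypothesis.split(), n)
        ref = ngrams(reference.split(), n)
        src = ngrams(source.split(), n)
        reward = sum((hyp & ref).values())
        penalty = sum(((hyp & src) - ref).values())
        total = sum(hyp.values()) or 1
        precisions.append(max(reward - penalty, 0) / total)
    if min(precisions) == 0:
        return 0.0
    return math.exp(sum(math.log(p) for p in precisions) / len(precisions))

src = "He go to school ."
ref = "He goes to school ."
print(gleu_sketch(src, "He goes to school .", ref))  # 1.0
print(gleu_sketch(src, "He go to school .", ref))    # 0.0 (unchanged source)
```

Note how leaving the source unchanged is penalized: its uncorrected n-grams match the source but not the reference, which is exactly the asymmetry that distinguishes GLEU from plain BLEU.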
Scribendi Score (Islam and Magnani, 2021). It evaluates corrections by combining the perplexity calculated by GPT-2 (Radford et al., 2019), the token sort ratio, and the Levenshtein distance ratio.

SOME (Yoshimura et al., 2020). It optimizes each sub-metric against human evaluations by fine-tuning BERT separately for each of the following criteria: grammaticality, fluency, and meaning preservation.
IMPARA (Maeda et al., 2022).It combines a quality estimation model fine-tuned with parallel data using BERT and a similarity model to consider the impact of edits.
Revisiting meta-evaluation for GEC

We investigate how correlations are affected by resolving granularity inconsistencies and by changing from classical systems to modern neural systems, through system-level (§6.1) and sentence-level (§6.2) meta-evaluations. Figure 3 shows scatter plots of the human evaluation against the metric scores, indicating that uncorrected sentences (INPUT) and fluently corrected sentences (REF-F, GPT-3.5) stand out as outliers and influence the correlation. Therefore, we consider 12 systems, deliberately excluding the uncorrected sentences (INPUT) and the fluently corrected sentences (REF-F, GPT-3.5). We calculate metric scores on the subset targeted in the human evaluations.

System-level meta-evaluation
Setup: For our system-level meta-evaluation, we report correlations using system scores obtained from human rankings. Metrics such as M2, PT-M2, ERRANT, GoToScorer, and GLEU can calculate system scores directly, while the other metrics use the average of sentence-level scores as the system score. We use Pearson correlation (r) and Spearman rank correlation (ρ) to measure the closeness between the metric and human evaluation.
Result: According to the system-level meta-evaluation results in Table 5, aligning the granularity between the metrics and human evaluation improves the correlation for EBMs on SEEDA-E, while the correlation for SBMs on SEEDA-S tends to decrease. One reason for the inconsistent results even when the granularity is aligned is that system-level human evaluations exhibit relatively small variations across evaluation granularities.

Table 6: Meta-evaluation results when outliers are included. Green indicates an increase in correlation compared to the meta-evaluation in Table 5, while red indicates a decrease. "+Min" in parentheses denotes the addition of 11 minimal edit references, and "+Flu" the addition of three fluency edit references. "All systems" is the case where all outliers are considered. For most metrics, INPUT acts as an outlier that improves correlation, while REF-F and GPT-3.5 function as outliers that decrease correlation.
Through a comparison between GJG15 and SEEDA-S, we discovered that as we move from classical systems to neural systems, correlations for all metrics except GoToScorer and GLEU decrease. This result suggests that the majority of current metrics cannot adequately evaluate the more extensively edited and fluent corrections produced by neural systems, in contrast to those generated by classical systems. Note that our meta-evaluation results for GJG15 cannot be compared directly with existing studies (Grundkiewicz et al., 2015; Choshen and Abend, 2018a), because we exclude INPUT to alleviate the scoring bias between EBMs and sentence-based human evaluation.

Sentence-level meta-evaluation
Setup: In the sentence-level meta-evaluation, we use the pairwise judgments in Table 3 to calculate correlations. We use Kendall's rank correlation (τ) and Accuracy (Acc) to measure the performance of the metrics. Kendall (τ) can measure performance in the common use case of comparing corrected sentences with each other.
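Both quantities can be computed directly from pairwise judgments, as in the following sketch. The treatment of ties varies across studies, so this shows one common convention, with hypothetical sentence IDs and scores.

```python
def pairwise_acc_tau(judgments, scores):
    """Accuracy and pairwise Kendall tau from human pairwise judgments.

    judgments: list of (id_a, id_b, label) with label in {'>', '<', '='};
    scores: dict mapping sentence id -> metric score. Ties enter Accuracy
    but are dropped from tau = (concordant - discordant) /
    (concordant + discordant), one common convention."""
    correct = conc = disc = 0
    for a, b, human in judgments:
        diff = scores[a] - scores[b]
        pred = '>' if diff > 0 else '<' if diff < 0 else '='
        correct += pred == human
        if human != '=' and pred != '=':
            if pred == human:
                conc += 1
            else:
                disc += 1
    acc = correct / len(judgments)
    tau = (conc - disc) / (conc + disc) if conc + disc else 0.0
    return acc, tau

# Hypothetical sentence-level metric scores and human judgments.
scores = {"s1": 0.9, "s2": 0.4, "s3": 0.6}
judgments = [("s1", "s2", ">"), ("s1", "s3", ">"),
             ("s2", "s3", "<"), ("s2", "s3", "=")]
print(pairwise_acc_tau(judgments, scores))  # (0.75, 1.0)
```

In the example, the metric agrees with all three strict human preferences (τ = 1.0) but misses the tie, which only Accuracy registers.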

Result:
In contrast to the system-level results, the sentence-level meta-evaluations showed more significant improvements in correlation when the granularity was aligned. The substantial variation in sentence-level human evaluations across granularities likely contributed to these more consistent results. In other words, it became evident that correlations in sentence-level meta-evaluation are underestimated when granularity is not aligned.
When we compared GJG15 and SEEDA-S, we observed a decrease in correlations for most metrics, especially for EBMs, similar to the system-level results. Consistently high correlations were found for SOME and IMPARA, indicating the effectiveness of fine-tuned BERT.

Further analysis
As further analysis, we investigate the influence of outliers (§7.1) and of variations in the system set (§7.2) on the correlations of the metrics. We test the hypothesis on which this study focuses: that correlations may span a wide range under flexible settings in GEC. Based on the best practices obtained in §6, granularity is aligned in the subsequent meta-evaluations.

Influence of outliers
Table 6 shows the results when the uncorrected sentences (INPUT) and/or the fluently corrected sentences (REF-F, GPT-3.5) are added.

System-level analysis: The system-level results show that simply including INPUT increases the correlations for most metrics to the point where comparisons are difficult. This suggests that INPUT serves as a strong outlier that skews the correlation positively and prevents accurate meta-evaluation. One reason is that most EBMs assign the lowest score to INPUT, which also ranks lowest in human evaluations. Therefore, in meta-evaluation using neural models, a fair comparison cannot be made when INPUT is included.
On the other hand, the addition of REF-F and GPT-3.5 causes a sharp drop in overall correlation. The results suggest that metrics other than SOME and IMPARA cannot properly assess fluently corrected sentences. Adding references to the commonly used metrics (M2, ERRANT, GLEU) improves the correlation slightly, but still does not yield the same evaluation as humans. We observed the same tendency as Maynez et al. (2023): overlap-based metrics do not correctly evaluate LLMs used for few-shot learning.
Sentence-level analysis: The results of the sentence-level meta-evaluation showed a similar trend to the system-level results, but with some differences. Adding INPUT improved correlations for most metrics, but GoToScorer and Scribendi Score decreased, which may be attributed to their inability to properly perform sentence-based evaluation. Furthermore, when adding REF-F and GPT-3.5, not only did many metrics show a decrease in correlation, but SOME and IMPARA also exhibited a slight reduction.
The improved correlations of M2 (+Min) and GLEU (+Min) when REF-F and GPT-3.5 were added indicate that fluency corrections may no longer act as outliers for commonly used metrics if the low coverage of reference-based evaluation is mitigated. To address the issue of reference coverage, an approach similar to Choshen and Abend (2018a), which involves splitting and combining edits for each reference, could enhance the effective utilization of references. However, the result that fluency edit references were useful only for GLEU suggests that they may be effective on an N-gram basis but not on an edit-extraction basis. Possible reasons include the difficulty and complexity of edit extraction for fluent sentences in EBMs, as well as the inability of only three fluent references to overcome the low coverage.

Influence of variations in the system set
Next, we investigate the extent to which the correlations of the metrics vary with changes in the system set. To create a difficult setting for the metrics, correlations are computed for sets of systems with close performance, obtained by sorting the systems in order of human ranking. Figure 4 shows the variation in correlations using window analysis. Common to most metrics, Pearson (r) tends to be highly variable, ranging from positive to negative, when evaluating four systems, but is relatively stable when evaluating eight systems. This suggests that most metrics do not have enough precision to identify performance differences within a set of high-performance neural systems, and there is still a need to develop metrics that allow more precise evaluation. Furthermore, M2, ERRANT, and GLEU were often uncorrelated or negatively correlated, indicating that the commonly used metrics are not highly robust. On the other hand, the BERT-based metrics maintained relatively high correlations, with SOME in particular being the most robust. Kendall (τ) shows no significant change because of the large number of pairwise judgment samples.
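The window analysis itself can be sketched as follows: sort systems by human score, slide a fixed-size window, and compute the within-window Pearson correlation. The system names and scores here are hypothetical, chosen only to show how windows over similarly strong systems can flip the correlation.

```python
def window_analysis(human_scores, metric_scores, window=4):
    """Slide a fixed-size window over systems sorted by human score and
    report the within-window Pearson correlation, probing whether a
    metric can separate systems of similar quality."""
    def pearson(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = sum((a - mx) ** 2 for a in x) ** 0.5
        sy = sum((b - my) ** 2 for b in y) ** 0.5
        return cov / (sx * sy) if sx and sy else 0.0
    systems = sorted(human_scores, key=human_scores.get, reverse=True)
    results = []
    for i in range(len(systems) - window + 1):
        w = systems[i:i + window]
        results.append(pearson([human_scores[s] for s in w],
                               [metric_scores[s] for s in w]))
    return results

# Hypothetical scores: one metric agrees with humans everywhere, the other
# swaps two mid-ranked systems.
human = {"A": 5.0, "B": 4.0, "C": 3.0, "D": 2.0, "E": 1.0}
agreeing = {"A": 50, "B": 40, "C": 30, "D": 20, "E": 10}
disagreeing = {"A": 50, "B": 20, "C": 30, "D": 40, "E": 10}
print(window_analysis(human, agreeing, window=3))     # all windows near 1.0
print(window_analysis(human, disagreeing, window=3))  # middle window negative
```

A metric that ranks all systems correctly stays near r = 1 in every window, while a single local swap drives the window containing it strongly negative, which is exactly the instability the four-system windows expose.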

Discussion
We provide more practical guidelines for meta-evaluation (§8.1) and evaluation (§8.2) methodologies in future GEC research by considering the experimental results so far.

Towards valid meta-evaluation in GEC
We recommend that meta-evaluation in GEC be conducted at each evaluation granularity. Specifically, EBMs should use SEEDA-E, and SBMs should use SEEDA-S. Meta-evaluation using SEEDA should use the 12 systems, excluding outliers, as a baseline, and add REF-F and GPT-3.5 to find out how well the metrics can evaluate fluent corrections. This allows meta-evaluation of modern neural systems without granularity bias. Additionally, conducting experiments with various methodologies is crucial to validate the characteristics of metrics. Therefore, if resources permit, experiments using GMEG-Data for domain-specific meta-evaluation of SBMs and meta-evaluation with MAEGE, irrespective of granularity, should also be considered.
The further analysis in §7, which yielded results unavailable in §6, demonstrates that conducting meta-evaluation in only a single setting is inadequate in GEC. Therefore, it is necessary to measure correlations across multiple experimental settings, considering the presence of outliers and more realistic sets of systems with similar performance. Additionally, establishing the reliability of GEC meta-evaluation using confidence intervals for correlations, as in Deutsch et al. (2021), is important. Furthermore, annotation based on Multidimensional Quality Metrics (Lommel et al., 2014) can take error types and severity into account, potentially providing interesting insights when compared with results from WMT (Freitag et al., 2021, 2022).
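A confidence interval for a metric-human correlation can be estimated by bootstrap resampling over systems, in the spirit of Deutsch et al. (2021); the sketch below uses a percentile bootstrap with hypothetical scores, not the actual SEEDA data.

```python
# Percentile bootstrap CI for Pearson r, resampling systems with replacement.
import random
from statistics import mean

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def bootstrap_ci(human, metric, n_boot=2000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    n = len(human)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        h = [human[i] for i in idx]
        m = [metric[i] for i in idx]
        if len(set(h)) > 1 and len(set(m)) > 1:  # skip degenerate resamples
            stats.append(pearson(h, m))
    stats.sort()
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical system-level scores.
human = [0.92, 0.88, 0.85, 0.81, 0.78, 0.74, 0.70, 0.66, 0.61, 0.55, 0.48, 0.40]
metric = [0.61, 0.64, 0.58, 0.60, 0.55, 0.57, 0.50, 0.52, 0.45, 0.41, 0.38, 0.30]
lo, hi = bootstrap_ci(human, metric)
print(f"95% CI for Pearson r: [{lo:.2f}, {hi:.2f}]")
```

A wide interval signals that a single correlation value over so few systems should not be over-interpreted, which is exactly the uncertainty the window analysis exposes.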

Best practices for GEC evaluation
We recommend using both EBMs and SBMs in GEC. In light of the trend toward more fluent correction systems such as the GPT models, the current combination of the CoNLL-2014 test set and M² will no longer be adequate for proper evaluation. Therefore, it is essential to use highly correlated metrics such as SOME or IMPARA in addition to M², to enable the evaluation of LLMs and achieve a more human-like and robust evaluation. Alternatively, exhaustive fluency references should be prepared to improve M² correlations, or datasets such as JFLEG (Napoles et al., 2017) that can account for fluency should be used. Furthermore, using LLMs as evaluators, reported in recent studies to be effective for other generative tasks (Chiang and Lee, 2023; Liu and Fabbri, 2023; Kocmi and Federmann, 2023), may also prove beneficial in GEC. If resources allow, it would be good to conduct additional human evaluations.
EBMs and SBMs each have different strengths. EBMs can calculate Precision, Recall, and F-score, allowing a detailed evaluation of system performance. From the perspective of second language acquisition, the evaluation of each edit provides information about the error location, type, and amount, which can improve the quality of feedback and learning efficiency. Most SBMs, on the other hand, can evaluate without references, circumventing the problem of underestimating corrections due to limited reference coverage. Also, unlike EBMs, SBMs do not automatically assign the lowest score to uncorrected sentences. This allows a quantifiable measurement of whether a sentence has been improved or worsened by correction.
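The edit-based scores that EBMs report follow directly from edit counts: valid edits are true positives, invalid edits false positives, and missed errors false negatives. A minimal sketch, with hypothetical counts (GEC conventionally weights precision via F0.5):

```python
# Precision, Recall, and F-beta from edit counts (beta=0.5 favors precision).
def precision_recall_fbeta(tp, fp, fn, beta=0.5):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    if p == 0.0 and r == 0.0:
        return p, r, 0.0
    b2 = beta ** 2
    f = (1 + b2) * p * r / (b2 * p + r)
    return p, r, f

# Hypothetical counts: 40 valid edits, 10 invalid edits, 30 missed errors.
p, r, f05 = precision_recall_fbeta(tp=40, fp=10, fn=30)
print(f"P={p:.3f} R={r:.3f} F0.5={f05:.3f}")
```

Note that an uncorrected sentence has tp=0, so any F-score collapses to 0 regardless of how acceptable the source already was, which is the EBM limitation contrasted with SBMs above.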

Conclusion
To address issues in conventional meta-evaluation in English GEC, we construct a meta-evaluation dataset (SEEDA) consisting of corrections with human ratings along two different evaluation granularities, covering corrections from 12 state-of-the-art systems including LLMs, and two human corrections with different focuses. The dataset analysis reveals that the results of sentence-level human evaluation differ between granularities and that GEC systems based on GPT and T5 can correct as well as or better than humans. Also, through meta-evaluation using SEEDA, we demonstrate that EBMs may be underestimated in existing meta-evaluations and that matching the evaluation granularity of metrics with that of human evaluations tends to improve sentence-level correlations. Through further analysis, we discovered the uncertainty of conclusions based on a single correlation and found that most metrics lack the precision to distinguish differences among high-performance neural systems. Finally, we propose a methodology for meta-evaluation and evaluation in GEC. We hope that this paper contributes to further advancements in GEC.

A Supplement of annotations
Figure 5 shows a screenshot of doccano used in the edit-based human evaluation. The source is enclosed in a <t> tag, and each corrected sentence is emphasized with an <s> tag along with the system number. In Step 1, there are error labels for the source and True and False labels for each edit. In Step 2, True and False labels with the system number are used to indicate whether the errors in the source were corrected. Due to the specifications of doccano, even if the same edit appears in multiple corrections, annotators need to label each occurrence separately. For information on Appraise used in the sentence-based human evaluation, refer to Grundkiewicz et al. (2015).
Example outputs (edit-based human evaluation). Output A: [Do → Should] [one → someone] who suffered from this disease keep it a secret [of → or] [inform → inform] their relatives? Output B (Rank 1): [Do → Does] [one → someone] who [suffered → suffers] from this disease keep it a secret [of inform → from] their relatives?

Figure 1: M² score (F0.5) and word edit rate for classical systems in GJG15, neural systems in SEEDA, and human sentences. The neural systems generate more edits and better corrections compared to classical systems.
Example (edit-based): It is [againt → against] his or her human rights and it is against the [law's → ] spirit [ → of the law].

Figure 2: An overview of the annotation flow and an example of edit-based human evaluation. In Step 1, the annotator identifies errors in the source. Then, they categorize each edit in the output as either valid or not. In Step 2, the annotator determines whether each edit in the output effectively corrects the errors found in Step 1. TP, FP, and FN represent True Positive, False Positive, and False Negative, respectively.

Figure 3: Scatter plots of the human score and the metric score. "Base" indicates the 12 systems excluding uncorrected sentences (INPUT) and fluent sentences (REF-F, GPT-3.5). Each line represents a regression line, and the shaded area indicates the confidence interval of the estimated regression, obtained using bootstrap. Comparing the orange and blue regression lines to the gray regression line shows the degree of influence of each outlier on the distribution trend. For example, the leftward tilt of the orange regression lines for M², PT-M², ERRANT, and GLEU indicates a negative impact from fluent sentences as outliers.

Figure 4: Variation of correlation when different systems are considered using window analysis. The x-axis represents the human ranking of the 12 systems excluding outliers. "n" denotes the number of systems considered, with solid lines representing four systems and dashed lines representing eight systems. For example, for n=4, a point at x=5 corresponds to a human evaluation using systems ranked 2 to 5. The orange line represents Pearson (r) and the blue line represents Kendall (τ). The correlations of the main metrics (M², ERRANT, GLEU) show significant variability, while pretraining-based metrics (SOME, IMPARA) exhibit relatively stable correlations.

Figure 5: Screenshot of doccano used in the edit-based human evaluation.

Table 2: Human rankings for each evaluation granularity using TS. Systems based on GPT and T5 architectures (GPT-3.5, T5, TransGEC) consistently achieve higher rankings than REF-M, suggesting the potential for these systems to outperform humans in providing corrections.

Table 3: Dataset statistics for pairwise judgments by annotators. The numbers in parentheses represent the number of ties, with the left being edit-based and the right being sentence-based.

Table 4: Average inter- and intra-annotator agreement (Cohen's κ) on pairwise judgments. The numbers in parentheses represent the κ for GJG15.