Abstract

In the automatic evaluation of translations, precision and recall are two indices that show how precisely (precision) and how completely (recall) a system recognizes the well-translated portions of a translation. Ideally, the two indices would be weighted equally in an evaluation system, since both accuracy and completeness are important criteria in the evaluation of human translation (HT). This is not easy, however, because the two indices are negatively correlated. Papineni et al. (2002), for example, opted for precision, whereas Lavie et al. (2005) used both indices but gave recall nine times more weight than precision. The aim of this work is to examine which of the two indices correlates better with the judgments of professional evaluators and how much weight should be assigned to precision and to recall, respectively. For this purpose, 459 translated texts were scored with precision, recall, F1 (the harmonic mean of precision and recall) and Fmean (recall weighted nine times more heavily than precision), and were also rated by professional evaluators. The results show that recall correlates better with human evaluation than precision in almost all cases, whereas Fmean does not correlate better than F1; the two were equivalent in all but one case. This indicates that recall is indeed the more important index, but that a weight as high as nine on recall is not ideal for HT evaluation.
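For reference, the standard formulas behind the four scores named above can be sketched as follows, assuming the usual matched-unit definitions of precision and recall; the Fmean form is the METEOR-style combination in which recall receives nine times the weight of precision, i.e. the general F-beta measure with beta^2 = 9. The exact matching procedure applied to the 459 texts is the paper's own and is not reproduced here.

% Precision and recall over matched translation units
\[
P = \frac{\lvert \text{matched units} \rvert}{\lvert \text{units in the candidate translation} \rvert},
\qquad
R = \frac{\lvert \text{matched units} \rvert}{\lvert \text{units in the reference translation} \rvert}
\]
% F1 (equal weights) and the recall-heavy Fmean used in METEOR
\[
F_1 = \frac{2PR}{P + R},
\qquad
F_{\text{mean}} = \frac{10PR}{R + 9P}
= \frac{(1+\beta^2)\,PR}{\beta^2 P + R}
\quad \text{with } \beta^2 = 9
\]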

Keywords

automatic evaluation, translation quality, precision, recall, F1, Fmean

References (19)

  1. [Book] 권철민 / 2020 / 파이썬 머신러닝 완벽 가이드 [Python Machine Learning Complete Guide] / 위키북스

  2. [Journal] 정혜연 / 2020 / 번역자동평가에서 풀리지 않은 과제 [Unresolved Issues in Automatic Translation Evaluation] / 번역학연구 21 (1) : 9 ~ 29

  3. [Report] 박혜주 / 2007 / 문학번역 평가 시스템 연구 [A Study on a Literary Translation Evaluation System]

  4. [Journal] 정혜연 / 2021 / 인간번역 자동평가에서 정답자와 평가자가 다르다면 [When the Reference Provider and the Evaluator Differ in Automatic Evaluation of Human Translation] / 독일언어문학 (93) : 75 ~ 95

  5. [Journal] 정혜연 / 2021 / 임베딩을 활용한 인간번역의 자동평가 - 기계가 의미를 평가할 수 있을까 [Automatic Evaluation of Human Translation Using Embeddings: Can Machines Evaluate Meaning?] / 통번역학연구 25 (3) : 141 ~ 162

  6. [Conference] 한국외대 번역평가인증 연구팀 [HUFS Translation Evaluation and Certification Research Team] / 2016 / 번역인증제도 (실무편) [The Translation Certification System (Practice)] / 한국외대 통번역연구소 학술대회 <언어, 통번역의 평가 및 인증> 발표집 [Proceedings of the HUFS Interpreting and Translation Research Institute Conference on Evaluation and Certification of Language, Interpreting and Translation] : 23 ~ 33

  7. [Conference] Banerjee, Satanjeev / 2005 / METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments / Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization : 65 ~ 72

  8. [Journal] Buckland, Michael / 1994 / The Relationship between Recall and Precision / Journal of the American Society for Information Science 45 (1) : 12 ~ 19

  9. [Journal] Chung, Hye-Yeon / 2020 / Automatische Evaluation der Humanübersetzung: BLEU vs. METEOR / Lebende Sprachen 65 (1) : 181 ~ 205

  10. [Conference] Han, Lifeng / 2018 / Machine Translation Evaluation Resources and Methods: A Survey / IPRC-2018 (Ireland Postgraduate Research Conference)

  11. [Journal] Kunilovskaya, Maria / 2015 / How Far Do We Agree on the Quality of Translation? / English Studies at NBU 1 (1) : 18 ~ 31

  12. [Journal] Lai, Tzu-Yun / 2011 / Reliability and Validity of a Scale-based Assessment for Translation Tests / Meta 56 (3) : 713 ~ 722

  13. [Online] Lavie, Alon / The Significance of Recall in Automatic Metrics for MT Evaluation

  14. [Conference] Papineni, Kishore / 2002 / BLEU: A Method for Automatic Evaluation of Machine Translation / Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL) : 311 ~ 318

  15. [Online] Sasaki, Yutaka / The Truth of the F-measure

  16. [Journal] Waddington, Christopher / 2001 / Should Translations Be Assessed Holistically or through Error Analysis? / HERMES Journal of Language and Communication in Business 26 : 15 ~ 37

  17. [Journal] Waddington, Christopher / 2001 / Different Methods of Evaluating Student Translations: The Question of Validity / Meta 46 (2) : 311 ~ 325

  18. [Book] van Rijsbergen, Cornelius / 1979 / Information Retrieval / Butterworth

  19. [Conference] Zhang, Tianyi / 2020 / BERTScore: Evaluating Text Generation with BERT / Conference Paper at ICLR 2020 : 1 ~ 14