Abstract
Machine translation systems have been widely adopted in daily life, making it easier and more convenient. Unfortunately, erroneous translations may result in severe consequences, such as financial losses. Improving the accuracy and reliability of machine translation systems is therefore essential. However, testing these systems is challenging because of the complexity and intractability of the underlying neural models. To tackle these challenges, we propose syntactic tree pruning (STP), a novel metamorphic testing approach for validating machine translation systems. Our key insight is that a pruned sentence should preserve the crucial semantics of the original sentence. Specifically, STP (1) applies a core semantics-preserving pruning strategy, based on basic sentence structures and dependency relations, at the level of the syntactic tree representation; (2) generates source sentence pairs based on the metamorphic relation; and (3) reports suspicious issues whose translations violate the consistency property, as measured by a bag-of-words model. We evaluate STP on two state-of-the-art machine translation systems (i.e., Google Translate and Bing Microsoft Translator) with 1,200 source sentences as inputs. The results show that STP finds 5,073 unique erroneous translations in Google Translate and 5,100 unique erroneous translations in Bing Microsoft Translator (400% more than state-of-the-art techniques), with 64.5% and 65.4% precision, respectively. The reported erroneous translations vary in type, and more than 90% of them are not found by state-of-the-art techniques. In total, 9,393 erroneous translations are unique to STP, which is 711.9% more than state-of-the-art techniques. Moreover, STP is effective at detecting translation errors in the original sentences, reaching a recall of 74.0% and improving on state-of-the-art techniques by 55.1% on average.
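As a rough illustration of step (3), the consistency check between the translation of an original sentence and the translation of its pruned counterpart can be sketched with a bag-of-words comparison. The sketch below is not the authors' implementation: the function names `bow_distance` and `is_suspicious`, the whitespace tokenization, the particular distance definition (tokens of the pruned translation that are missing from the original translation), and the `threshold` parameter are all assumptions made for illustration. English strings stand in for target-language output.

```python
from collections import Counter


def bow_distance(original_translation: str, pruned_translation: str) -> int:
    """Count tokens of the pruned translation that the original translation lacks.

    Assumption: tokens are obtained by lowercasing and splitting on whitespace;
    a real target language may need a proper tokenizer.
    """
    original_bag = Counter(original_translation.lower().split())
    pruned_bag = Counter(pruned_translation.lower().split())
    # Counter subtraction keeps only positive counts, i.e. the tokens that
    # appear more often in the pruned translation than in the original one.
    return sum((pruned_bag - original_bag).values())


def is_suspicious(original_translation: str,
                  pruned_translation: str,
                  threshold: int = 0) -> bool:
    """Flag a translation pair whose bag-of-words distance exceeds the threshold.

    The threshold is a hypothetical tuning knob, not a value from the paper.
    """
    return bow_distance(original_translation, pruned_translation) > threshold


if __name__ == "__main__":
    original = "the tall man bought a red car yesterday"
    consistent = "the man bought a car"
    inconsistent = "the man sold a car"
    print(is_suspicious(original, consistent))
    print(is_suspicious(original, inconsistent))
```

The intuition is that pruning only removes words from the source, so the translation of the pruned sentence should introduce few or no words that are absent from the translation of the original sentence. A larger `threshold` would tolerate more lexical variation at the cost of missing some inconsistencies.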
Index Terms
- Machine Translation Testing via Syntactic Tree Pruning