Abstract
Parallel corpora are the basic resource for many multilingual natural language processing models. Recent advances in, e.g., neural machine translation have shown that the quality of the alignment in the corpus has a crucial impact on the quality of the resulting model, renewing interest in filtering automatically aligned corpora to increase their quality. In this contribution, we investigate the use of a fast change point detection method to detect possibly problematic parts of a parallel corpus. We demonstrate its performance on German-English corpora of 11k and 31k sentences, achieve a boundary identification performance above 80%, and improve the detection of genuine parallel sentences up to 88%. To our knowledge, this is the first application of change point detection to the problem of error detection in noisy corpora.
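To illustrate the general idea of locating span boundaries with change point detection, the following is a minimal sketch only. The abstract does not specify the detection method or the alignment score, so everything here is an assumption: the input is taken to be one alignment score per sentence pair, and boundaries are found by a simple mean-shift binary segmentation, which is not necessarily the method used in the paper. The names `detect_change_points`, `penalty`, and `min_size` are hypothetical.

```python
from typing import List


def _sse(prefix: List[float], prefix_sq: List[float], i: int, j: int) -> float:
    """Sum of squared deviations from the mean over scores[i:j]."""
    n = j - i
    s = prefix[j] - prefix[i]
    sq = prefix_sq[j] - prefix_sq[i]
    return sq - s * s / n


def detect_change_points(scores: List[float], penalty: float,
                         min_size: int = 2) -> List[int]:
    """Return sorted indices where the mean score appears to shift.

    Assumption: a segment is split at k only if doing so reduces the
    squared error by more than `penalty` (a hypothetical tuning knob).
    """
    # Prefix sums let us evaluate the cost of any segment in O(1).
    prefix, prefix_sq = [0.0], [0.0]
    for x in scores:
        prefix.append(prefix[-1] + x)
        prefix_sq.append(prefix_sq[-1] + x * x)

    boundaries: List[int] = []

    def split(lo: int, hi: int) -> None:
        # Too short to hold two segments of at least min_size each.
        if hi - lo < 2 * min_size:
            return
        whole = _sse(prefix, prefix_sq, lo, hi)
        best_gain, best_k = 0.0, None
        for k in range(lo + min_size, hi - min_size + 1):
            gain = (whole
                    - _sse(prefix, prefix_sq, lo, k)
                    - _sse(prefix, prefix_sq, k, hi))
            if gain > best_gain:
                best_gain, best_k = gain, k
        if best_k is not None and best_gain > penalty:
            boundaries.append(best_k)
            split(lo, best_k)   # recurse into the left segment
            split(best_k, hi)   # recurse into the right segment

    split(0, len(scores))
    return sorted(boundaries)


# Toy data (invented for illustration): a low-scoring span in the middle.
scores = [0.9] * 10 + [0.1] * 10 + [0.9] * 10
print(detect_change_points(scores, penalty=0.5))
```

Under these assumptions, a segment whose mean score is low relative to its neighbours would be a candidate misaligned span, and its boundaries could then be used to filter the corpus.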
Notes
Note that the scope of this paper is not to develop a better sentence alignment score, but to identify document boundaries to improve corpus filtering.
Copyright information
© 2019 Crown
About this paper
Cite this paper
Pagotto, A., Littell, P., Wang, Y., Goutte, C. (2019). Identifying Misaligned Spans in Parallel Corpora Using Change Point Detection. In: Meurs, MJ., Rudzicz, F. (eds) Advances in Artificial Intelligence. Canadian AI 2019. Lecture Notes in Computer Science, vol 11489. Springer, Cham. https://doi.org/10.1007/978-3-030-18305-9_16
DOI: https://doi.org/10.1007/978-3-030-18305-9_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-18304-2
Online ISBN: 978-3-030-18305-9
eBook Packages: Computer Science, Computer Science (R0)