Identifying Misaligned Spans in Parallel Corpora Using Change Point Detection

  • Conference paper
Advances in Artificial Intelligence (Canadian AI 2019)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11489)

Abstract

Parallel corpora are the basic resource for many multilingual natural language processing models. Recent advances, e.g. in neural machine translation, have shown that the quality of the alignment in the corpus has a crucial impact on the quality of the resulting model, renewing interest in filtering automatically aligned corpora to increase their quality. In this contribution, we investigate the use of a fast change point detection method to detect possibly problematic parts of a parallel corpus. We demonstrate its performance on German-English corpora of 11k and 31k sentences, achieving a boundary identification performance above 80% and improving the detection of genuine parallel sentences to up to 88%. To our knowledge, this is the first application of change point detection to the problem of error detection in noisy corpora.
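
To make the general idea concrete, the following is a minimal sketch of change point detection applied to a sequence of per-sentence-pair alignment scores. It is not the method evaluated in the paper: it swaps in plain binary segmentation on the mean, chosen only because it fits in a few lines. The function names, the thresholds `min_gain`, `min_size` and `threshold`, and the scores themselves are all hypothetical.

```python
from typing import List, Sequence, Tuple


def sse(values: Sequence[float]) -> float:
    """Sum of squared deviations from the mean (0.0 for an empty span)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(scores: Sequence[float], lo: int, hi: int,
               min_size: int) -> Tuple[float, int]:
    """Best single split of scores[lo:hi]; returns (gain, index) or (0.0, -1)."""
    total = sse(scores[lo:hi])
    best_gain, best_idx = 0.0, -1
    for k in range(lo + min_size, hi - min_size + 1):
        gain = total - sse(scores[lo:k]) - sse(scores[k:hi])
        if gain > best_gain:
            best_gain, best_idx = gain, k
    return best_gain, best_idx


def detect_change_points(scores: Sequence[float], min_gain: float = 0.2,
                         min_size: int = 2) -> List[int]:
    """Binary segmentation: keep splitting while the error reduction is large."""
    found: List[int] = []
    stack = [(0, len(scores))]
    while stack:
        lo, hi = stack.pop()
        gain, k = best_split(scores, lo, hi, min_size)
        if k != -1 and gain >= min_gain:
            found.append(k)
            stack.extend([(lo, k), (k, hi)])
    return sorted(found)


def flag_low_spans(scores: Sequence[float], threshold: float = 0.5,
                   **kwargs) -> List[Tuple[int, int]]:
    """Return the [start, end) spans whose mean score falls below threshold."""
    bounds = [0] + detect_change_points(scores, **kwargs) + [len(scores)]
    return [(a, b) for a, b in zip(bounds, bounds[1:])
            if b > a and sum(scores[a:b]) / (b - a) < threshold]


# Hypothetical scores: two well-aligned stretches around a misaligned span.
scores = ([0.9, 0.8, 0.85, 0.9, 0.95, 0.9, 0.85, 0.8, 0.9, 0.9]
          + [0.2, 0.1, 0.15, 0.2, 0.1]
          + [0.9, 0.85, 0.9, 0.8, 0.95, 0.9, 0.85, 0.9, 0.9, 0.8])
print(detect_change_points(scores))
print(flag_low_spans(scores))
```

Treating the corpus as a sequence and flagging whole spans, rather than thresholding each sentence pair on its own, means a single noisy score inside an otherwise well-aligned stretch is not enough to trigger a boundary.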

Notes

  1. https://cran.r-project.org/web/packages/ocp/index.html.

  2. http://www.statmt.org/wmt13/translation-task.html.

  3. Note that the scope of this paper is not to develop a better sentence alignment score, but to identify document boundaries to improve corpus filtering.

Author information

Correspondence to Cyril Goutte.

Copyright information

© 2019 Crown

About this paper

Cite this paper

Pagotto, A., Littell, P., Wang, Y., Goutte, C. (2019). Identifying Misaligned Spans in Parallel Corpora Using Change Point Detection. In: Meurs, MJ., Rudzicz, F. (eds) Advances in Artificial Intelligence. Canadian AI 2019. Lecture Notes in Computer Science, vol 11489. Springer, Cham. https://doi.org/10.1007/978-3-030-18305-9_16

  • DOI: https://doi.org/10.1007/978-3-030-18305-9_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-18304-2

  • Online ISBN: 978-3-030-18305-9

  • eBook Packages: Computer Science, Computer Science (R0)
