Automatic Parallel Data Mining After Bilingual Document Alignment

  • Conference paper
Recent Advances in Information Systems and Technologies (WorldCIST 2017)

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 569)

Abstract

Precise translation of texts from different parts of the world has become essential, but it is often difficult to fill translation gaps as quickly as needed. Many dictionaries and online translators can help bridge this linguistic gap, but even these resources can fall short of their purpose. Translators can render the individual words of a phrase very accurately, yet they often miss the true essence of the language. The research presented here describes a method that can help close this gap by extending certain aspects of the alignment task for WMT16. We pursue this goal using a range of classifiers and algorithms together with advanced computation. We carried out various experiments that allowed us to extract parallel data at the sentence level. This data proved capable of improving overall machine translation quality.

Notes

  1. http://www.statmt.org/wmt16/bilingual-task.html.

  2. http://www.statmt.org/wmt16/translation-task.html.

  3. https://www.ted.com.

  4. http://workshop2015.iwslt.org/.

  5. https://github.com/machinalis/yalign.

  6. https://github.com/machinalis/yalign/issues/, 3 accessed 10.11.2015.

  7. https://github.com/krzwolk/yalign.

  8. https://mega.nz/#F!hkEjFC4Q!lV9OJplRnsbtgveSLcc94g.

  9. http://www.statmt.org/wmt15/.

Acknowledgements

This work was financed as part of the investment in the CLARIN-PL research infrastructure, funded by the Polish Ministry of Science and Higher Education, and was backed by the PJATK legal resources.

Author information

Corresponding author

Correspondence to Krzysztof Wołk.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Wołk, K., Wołk, A. (2017). Automatic Parallel Data Mining After Bilingual Document Alignment. In: Rocha, Á., Correia, A., Adeli, H., Reis, L., Costanzo, S. (eds) Recent Advances in Information Systems and Technologies. WorldCIST 2017. Advances in Intelligent Systems and Computing, vol 569. Springer, Cham. https://doi.org/10.1007/978-3-319-56535-4_32

  • DOI: https://doi.org/10.1007/978-3-319-56535-4_32

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-56534-7

  • Online ISBN: 978-3-319-56535-4

  • eBook Packages: Engineering, Engineering (R0)
