Automatic Parallel Data Mining After Bilingual Document Alignment

Wołk, Krzysztof; Wołk, Agnieszka

doi:10.1007/978-3-319-56535-4_32

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 569)

Included in the following conference series:

World Conference on Information Systems and Technologies

Abstract

Precise translations of texts from different parts of the world have become essential, but translation gaps often cannot be filled as quickly as needed. Dictionaries and online translators help bridge the language barrier in many cases, yet even these resources can fall short. Such translators can render the meaning of individual words in a phrase very accurately, but they often miss the true essence of the language. The research presented here describes a method that can help close this gap by extending certain aspects of the bilingual document alignment task of WMT16. The method combines different classifiers and algorithms with advanced computation. We carried out various experiments that allowed us to extract parallel data at the sentence level, and this data proved capable of improving overall machine translation quality.

Notes

1. http://www.statmt.org/wmt16/bilingual-task.html
2. http://www.statmt.org/wmt16/translation-task.html
3. https://www.ted.com
4. http://workshop2015.iwslt.org/
5. https://github.com/machinalis/yalign
6. https://github.com/machinalis/yalign/issues/3, accessed 10.11.2015
7. https://github.com/krzwolk/yalign
8. https://mega.nz/#F!hkEjFC4Q!lV9OJplRnsbtgveSLcc94g
9. http://www.statmt.org/wmt15/

Acknowledgements

This work was financed as part of the investment in the CLARIN-PL research infrastructure, funded by the Polish Ministry of Science and Higher Education, and was backed by the PJATK legal resources.

Author information

Authors and Affiliations

Polish-Japanese Academy of Information Technology, Koszykowa 86, 02-008, Warsaw, Poland
Krzysztof Wołk & Agnieszka Wołk

Corresponding author

Correspondence to Krzysztof Wołk.

Editor information

Editors and Affiliations

DEI/FCT, Universidade de Coimbra, Coimbra, Baixo Mondego, Portugal
Álvaro Rocha
Nova IMS, Universidade Nova de Lisboa, Lisboa, Portugal
Ana Maria Correia
College of Engineering, The Ohio State University, Columbus, Ohio, USA
Hojjat Adeli
DSI/EEUM, Universidade do Minho, Guimarães, Portugal
Luís Paulo Reis
DIMES, Università della Calabria, Arcavacata di Rende, Italy
Sandra Costanzo

About this paper

Cite this paper

Wołk, K., Wołk, A. (2017). Automatic Parallel Data Mining After Bilingual Document Alignment. In: Rocha, Á., Correia, A., Adeli, H., Reis, L., Costanzo, S. (eds) Recent Advances in Information Systems and Technologies. WorldCIST 2017. Advances in Intelligent Systems and Computing, vol 569. Springer, Cham. https://doi.org/10.1007/978-3-319-56535-4_32

DOI: https://doi.org/10.1007/978-3-319-56535-4_32
Published: 28 March 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-56534-7
Online ISBN: 978-3-319-56535-4
eBook Packages: Engineering, Engineering (R0)