Abstract
It has become essential to have precise translations of texts from different parts of the world, but it is often difficult to fill the translation gaps as quickly as might be needed. Undoubtedly, there are multiple dictionaries that can help in this regard, and various online translators exist to help cross this lingual bridge in many cases, but even these resources can fall short of serving their true purpose. The translators can provide a very accurate meaning of given words in a phrase, but they often miss the true essence of the language. The research presented here describes a method that can help close this lingual gap by extending certain aspects of the alignment task for WMT16. It is possible to achieve this goal by utilizing different classifiers and algorithms and by use of advanced computation. We carried out various experiments that allowed us to extract parallel data at the sentence level. This data proved capable of improving overall machine translation quality.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Notes
- 1.
- 2.
- 3.
- 4.
- 5.
- 6.
https://github.com/machinalis/yalign/issues/, 3 accessed 10.11.2015.
- 7.
- 8.
- 9.
References
Wołk, K., Marasek, K.: Real-Time Statistical Speech Translation. New Perspectives in Information Systems and Technologies, vol. 1, pp. 107–113. Springer, Switzerland (2014)
Wołk, K., Marasek, K.: Polish–English speech statistical machine translation systems for the IWSLT 2013. In: Proceedings of the 10th International Workshop on Spoken Language Translation, Heidelberg, Germany, pp. 113–119 (2013)
Koehn, P.: Statistical Machine Translation. Cambridge University Press, Cambridge (2009)
García Berrotarán, G., Carrascosa, R., Vine, A.: Yalign documentation. https://yalign.readthedocs.org. Accessed 01/2015
Dieny, R., Thevenon, J., Martinez-del-Rincon, J., Nebel, J.-C.: Bioinformatics inspired algorithm for stereo correspondence. In: International Conference on Computer Vision Theory and Applications, Algarve, Portugal, Vilamoura (2011)
Musso, G.: Sequence alignment (Needleman-Wunsch, Smith-Waterman) (2007). http://www.cs.utoronto.ca/~brudno/bcb410/lec2notes.pdf
Cetollo, M., Girardi, C., Federico, M.: Wit3: web inventory of transcribed and translated talks. In: Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT), pp. 261–268 (2012)
Mohammadi, M., GasemAghaee, N.: Building bilingual parallel corpora based on Wikipedia. In: 2010 Second International Conference on Computer Engineering and Applications (ICCEA), pp. 264–268. IEEE (2010)
Tyers, F.M., Pienaar, J.A.: Extracting bilingual word pairs from Wikipedia. In: Collaboration: Interoperability Between People in the Creation of Language Resources for Less-Resourced Languages, vol. 19, pp. 19–22 (2008)
Yasuda, K., Sumita, E.: Method for building sentence-aligned corpus from Wikipedia. In: AAAI Workshop on Wikipedia and Artificial Intelligence (WikiAI 2008), pp. 263–268 (2008)
Pal, S., Pakray, P., Naskar, S.K.: Automatic building and using parallel resources for SMT from comparable corpora. In: Proceedings of the 3rd Workshop on Hybrid Approaches to Translation (HyTra)@ EACL, pp. 48–57 (2014)
Plamada, M., Volk, M.: Mining for domain-specific parallel text from wikipedia. In: Proceedings of the Sixth Workshop on Building and Using Comparable Corpora, ACL 2013, pp. 112–120 (2013)
Strötgen, J., Gertz, M., Junghans, C.: An event-centric model for multilingual document similarity. In: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 953–962 (2011)
Paramita, M.L., Guthrie, D., Kanoulas, E., Gaizauskas, R., Clough, P., Sanderson, M.: Methods for collection and evaluation of comparable documents. In: Sharoff, S., Rapp, R., Zweigenbaum, P., Fung, P. (eds.) Building and Using Comparable Corpora, pp. 93–112. Springer, Heidelberg (2013). doi:10.1007/978-3-642-20128-8_5
Wu, D., Fung, P.: Inversion transduction grammar constraints for mining parallel sentences from quasi-comparable corpora. In: Dale, R., Wong, K.-F., Su, J., Kwong, O.Y. (eds.) IJCNLP 2005. LNCS (LNAI), vol. 3651, pp. 257–268. Springer, Heidelberg (2005). doi:10.1007/11562214_23
Clark, J.H., et al.: Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers, vol. 2, pp. 176–181. Association for Computational Linguistics (2011)
Wołk, K., Marasek, K.: A sentence meaning based alignment method for parallel text corpora preparation. In: Rocha, Á., Correia, A.M., Tan, F.B., Stroetmann, K.A. (eds.) New Perspectives in Information Systems and Technologies, Volume 1. AISC, vol. 275, pp. 229–237. Springer, Cham (2014). doi:10.1007/978-3-319-05951-8_22
Axelrod, A., He, X., Gao, J.: Domain adaptation via pseudo in-domain data selection. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 355–362. Association for Computational Linguistics (2011)
Wołk, K., Marasek, K.: Tuned and GPU-accelerated parallel data mining from comparable corpora. In: Král, P., Matoušek, V. (eds.) TSD 2015. LNCS (LNAI), vol. 9302, pp. 32–40. Springer, Cham (2015). doi:10.1007/978-3-319-24033-6_4
Khaladkar, C.S.: An Efficient Implementation of Needleman-Wunsch Algorithm on Graphical Processing Units. P.h.D. thesis, School of Computer Science and Software Engineering, The University of Western Australia (2009)
Junczys-Dowmunt, M., Szał, A.: SyMGiza++: symmetrized word alignment models for statistical machine translation. In: Bouvry, P., Kłopotek, M.A., Leprévost, F., Marciniak, M., Mykowiecka, A., Rybiński, H. (eds.) SIIS 2011. LNCS, vol. 7053, pp. 379–390. Springer, Heidelberg (2012). doi:10.1007/978-3-642-25261-7_30
Durrani, N., et al.: Integrating an unsupervised transliteration model into statistical machine translation. In: EACL, pp. 148–153 (2014)
Heafield, K.: KenLM: Faster and smaller language model queries. In: Proceedings of the Sixth Workshop on Statistical Machine Translation, pp. 187–197. Association for Computational Linguistics (2011)
Acknowledgements
Work financed as part of the investment in the CLARIN-PL research infrastructure funded by the Polish Ministry of Science and Higher Education and was backed by the PJATK legal resources.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Wołk, K., Wołk, A. (2017). Automatic Parallel Data Mining After Bilingual Document Alignment. In: Rocha, Á., Correia, A., Adeli, H., Reis, L., Costanzo, S. (eds) Recent Advances in Information Systems and Technologies. WorldCIST 2017. Advances in Intelligent Systems and Computing, vol 569. Springer, Cham. https://doi.org/10.1007/978-3-319-56535-4_32
Download citation
DOI: https://doi.org/10.1007/978-3-319-56535-4_32
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-56534-7
Online ISBN: 978-3-319-56535-4
eBook Packages: EngineeringEngineering (R0)