The undergraduate learner translator corpus: a new resource for translation studies and computational linguistics

  • Project Notes
  • Language Resources and Evaluation

Abstract

Interest in learner translator corpora, which are invaluable resources for teaching and research, is growing worldwide. This paper introduces a new resource to support researchers in fields such as computational linguistics, descriptive translation studies, computer-aided translation technology, Arabic machine translation applications, cognitive science, and translation pedagogy. Motivated by the lack of learner translator resources that provide data about learners translating from and into Arabic, the undergraduate learner translator corpus (ULTC) is an ongoing, error-tagged, sentence-aligned parallel corpus of English, Arabic, and French, with Arabic as its main language. The present corpus, consisting of parallel texts produced by female learners translating from English or French into Arabic, is the first of its kind in terms of the languages represented, the tasks covered, and the number of students involved. It is also unique in combining many complementary corpora of cross-lingual data, each of which has its own web-based query interface and corpus analysis tools. This paper describes the ULTC compilation process, preliminary findings, and planned future expansion and research.

Acknowledgements

The author would like to thank the anonymous reviewers for the detailed and constructive review that helped to clarify many points and improve the structure of the manuscript. The author is greatly indebted to PNU instructors, course coordinators, and learners for their contributions.

Author information

Corresponding author

Correspondence to Reem F. Alfuraih.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Alfuraih, R.F. The undergraduate learner translator corpus: a new resource for translation studies and computational linguistics. Lang Resources & Evaluation 54, 801–830 (2020). https://doi.org/10.1007/s10579-019-09472-6

  • DOI: https://doi.org/10.1007/s10579-019-09472-6
