ABSTRACT
This paper presents the compilation of the CroCo Corpus, an English-German translation corpus. Corpus design, annotation and alignment are described in detail. To ensure that the corpus is searchable and exchangeable, XML stand-off mark-up is used as the representation format for the multi-layer annotation. On this basis, it is shown how the corpus can be queried using XQuery. Finally, the generalisation of results with respect to linguistic and translational research questions is briefly discussed.
- Multi-dimensional annotation and alignment in an English-German translation corpus