Abstract
In this paper we present the coreferential tagging of part of the EPEC Corpus of Basque. Although coreference is a pragmatic linguistic phenomenon highly dependent on the situational context, it shows language-specific patterns that vary according to the features of each language. Because Basque is not an Indo-European language, its grammar differs considerably from that of the languages spoken in surrounding areas. We explain these features and the decisions made in each case. After describing the criteria defined for coreferential tagging in Basque, we explain the annotation process. Our annotation is based on a morphologically and syntactically annotated corpus, which provides a manageable environment in which the specific structures that are part of a reference chain can be identified more easily. Part of the corpus was tagged by two annotators who marked up the same text independently, and by a third annotator who acted as judge, resolving cases of disagreement. This whole process has been automated as a result of previous studies carried out in this field. The automatic detection of mentions (Soraluze et al., in: Proceedings of Konvens, 2012) has provided us with a better working environment and made it possible to build a first significant corpus for later computational treatment of automatic coreference resolution.
Notes
Most of the examples in this paper come from the EPEC corpus, described in Sect. 4.
In general, the English examples may contain more linguistic expressions that refer to the same entity, but we mark only the equivalents of the elements annotated in Basque. See Sect. 3.1 for a detailed explanation.
As mentioned before, in Basque the pronouns are formed by demonstratives (Laka 1996).
References
Aduriz, I., Ceberio, K., & Díaz de Ilarraza, A. (2005). Euskarazko anafora pronominala: Ikuspuntu konputazionala eta corpus baten garapena. Gogoa, 5(1), 91–116.
Aduriz, I., Aranzabe, M. J., Arriola, J. M., Atutxa, A., Díaz de Ilarraza, A., Ezeiza, N., et al. (2006a). Methodology and steps towards the construction of EPEC, a corpus of written Basque tagged at morphological and syntactic levels for the automatic processing. In A. Wilson, P. Rayson, & D. Archer (Eds.), Corpus linguistics around the world. Book series: Language and computers (Vol. 56, pp. 1–15). Netherlands: Rodopi.
Aduriz, I., Aranzabe, M. J., Arriola, J. M., & Díaz de Ilarraza, A. (2006b). Sintaxi partziala. In B. Fernández & I. Laka (Eds.), Andolin gogoan: Essays in honour of professor Eguzkitza (pp. 31–49). Bilbo: UPV/EHU Publishing Services.
Alegria, I., Artola, X., Sarasola, K., & Urkia, M. (1996). Automatic morphological analysis of Basque. Literary & Linguistic Computing, 11(4), 193–203.
Alegria, I., Ezeiza, N., & Fernandez, I. (2006). Named entities translation based on comparable corpora multi-word-expressions in a multilingual context. In Proceedings of workshop on EACL06 (pp. 1–8). Trento (Italy).
Arriola, J. M., Aduriz, I., Aldezabal, I., Aranzabe, M. J., Ceberio, K., Estarrona, A., Iruskieta, M., Lersundi, M., Pociello, E., Uria, L., & Urizar, R. (2013). Reusing the CG-2 grammar for processing Basque complex postpositions. In A. D. Iñaki Alegria & J. Villena (Eds.), Actas del XXIX Congreso de la Sociedad Española del Procesamiento del Lenguaje Natural (SEPLN 2013) (pp. 20–27). Madrid (España).
Borthen, K. (2004). Predicative NPs and the annotation of reference chains. In Proceedings of Coling2004 (pp. 1175–1178). Geneva, Switzerland.
Botley, S., & McEnery, T. (Eds.). (2000). Corpus-based and computational approaches to discourse anaphora. Amsterdam: John Benjamins.
Ceberio, K., Aduriz, I., Díaz de Ilarraza, A., & Garcia-Azkoaga, I. (2008). Erreferentziakidetasunaren azterketa eta anotazioa euskarazko corpus batean. In X. Artiagoitia & J. A. Lakarra (Eds.), Gramatika Jaietan. Patxi Goenagaren omenez, ASJU (Vol. 51, pp. 153–172). Bilbo: UPV/EHU & Gipuzkoako Foru Aldundia.
Cornish, F. (1999). Anaphora, discourse and understanding: Evidence from English and French. Oxford: Clarendon.
Doddington, G., Mitchell, A., Przybocki, M., Ramshaw, L., Strassel, S., & Weischedel, R. (2004). The automatic content extraction (ACE) program—Tasks, data, and evaluation. In Proceedings of LREC 2004 (pp. 837–840), Lisbon.
Euskaltzaindia. (1985). Euskal Gramatika: Lehen urratsak-I. Bilbo: Euskaltzaindia.
Euskaltzaindia. (2002). Euskal Gramatika Laburra: Perpaus Bakuna. Bilbo: Euskaltzaindia (2nd ed.).
Garcia-Azkoaga, I. M. (2003). Kohesio anaforikoa hiru testu generotan. Adinaren araberako azterketa. Bilbo: EHU-UPV Publishing Services.
Hualde, J. I., & Ortiz de Urbina, J. (Eds.). (2003). A grammar of Basque. Berlin, New York: Mouton de Gruyter.
Kleiber, G. (1994). Anaphores et pronoms. Louvain-la-Neuve: Duculot.
Laka, I. (1996). A brief grammar of Euskara, the Basque language. EHU/UPV Publishing Services: Leioa (Spain). Retrieved December, 2016, from http://www.ehu.eus/eu/web/eins/a-brief-grammar-of-euskara.
McCarthy, J. F., & Lehnert, W. G. (1995). Using decision trees for coreference resolution. In Proceedings of the 14th international joint conference on Artificial intelligence (Vol. 2, pp. 1050–1055). San Francisco, CA, USA.
Mitkov, R. (2002). Anaphora resolution. London: Longman.
Moirand, S. (1990). Une grammaire des textes et des dialogues. Paris: Hachette.
Müller, C., & Strube, M. (2006). Multi-level annotation of linguistic data with MMAX2. In S. Braun, K. Kohn, & J. Mukherjee (Eds.), Corpus technology and language pedagogy. New resources, new tools, new methods (English Corpus Linguistics, Vol. 3, pp. 197–214). Frankfurt: Peter Lang.
Nicolov, N., Salvetti, F., & Ivanova, S. (2008). Sentiment analysis: Does coreference matter? In AISB 2008 convention communication, interaction and social intelligence (pp. 37–40).
Nilsson Björkenstam, K. (2013). SUC-CORE: A balanced corpus annotated with noun phrase coreference. Northern European Journal of Language Technology (NEJLT), 3, 19–39.
Ortiz de Urbina, J. (1989). Parameters in the grammar of Basque: A GB approach to Basque syntax. Dordrecht: Foris.
Peral, J., Palomar, M., & Ferrández, A. (1999). Coreference-oriented interlingual slot structure & machine translation. In Proceedings of the workshop on coreference and its applications, (CorefApp 1999) (pp. 69–76). Stroudsburg, PA, USA.
Poon, H., Christensen, J., Domingos, P., Etzioni, O., Hoffmann, R., Kiddon, C., Lin, T., Ling, X., Ritter, A., Schoenmackers, S., Soderland, S., Weld, D., Wu, F., & Zhang, C. (2010). Machine reading at the University of Washington. In Proceedings of the NAACL HLT 2010 first international workshop on formalisms and methodology for learning by reading, (FAM-LbR 2010) (pp. 87–95). Stroudsburg, PA, USA.
Pradhan, S. S., Ramshaw, L., Weischedel, R., MacBride, J., & Micciulla, L. (2007). Unrestricted coreference: Identifying entities and events in OntoNotes. In Proceedings of ICSC 2007 (pp. 446–453). Irvine, California.
Recasens, M. (2010). Coreference: Theory, annotation, resolution and evaluation. Ph.D. thesis, University of Barcelona, Spain.
Rodriguez, K. (2010). Resources for linguistically motivated multilingual anaphora resolution. Ph.D. thesis, University of Trento, Italy.
Saeed, J. I. (2009). Semantics (3rd ed.). New York: Wiley.
Stede, M. (2011). Discourse processing. San Rafael, California: Morgan & Claypool Publishers.
Steinberger, J., Poesio, M., Kabadjov, M. A., & Ježek, K. (2007). Two uses of anaphora resolution in summarization. Information Processing and Management, 43(6), 1663–1680.
Soraluze, A., Arregi, O., Arregi, X., Ceberio, K., & Díaz de Ilarraza, A. (2012). Mention detection: First steps in the development of a Basque coreference resolution system. Proceedings of Konvens, 2012, 128–136.
Stoyanov, V., Gilbert, N., Cardie, C., & Riloff, E. (2009). Conundrums in noun phrase coreference resolution: Making sense of the state-of-the-art. In Proceedings of the joint conference of the 47th annual meeting of the ACL and the 4th international joint conference on natural language processing of the AFNLP (pp. 656–664). Suntec, Singapore.
Vicedo, J. L., & Ferrández, A. (2006). Coreference in Q&A. In Advances in open domain question answering (of text, speech and language technology) (Vol. 32, pp. 71–96). Berlin/New York: Springer.
Zabala, I. (1996). Testu-lotura: Lotura tematikoa eta erreferentzia-sareak testu teknikoetan. In Igone Zabala (Ed.), Testu-loturarako baliabideak: Euskara Teknikoa (pp. 15–44). Bilbo: EHU-UPV Publishing Services.
Zabala, I., & Odriozola, J. C. (2004). Los complejos posposicionales en vasco. In G. E. Perez, Zabala I. Igone, & L. Gràcia Sole (Eds.), Las Fronteras de la Composición (pp. 281–315). Donostia: University of Deusto.
Zhekova, D., & Kübler, S. (2010). UBIU: A language-independent system for coreference resolution. In Proceedings of the 5th international workshop on semantic evaluation (SemEval 2010) (pp. 96–99). Stroudsburg, PA, USA.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Ceberio, K., Aduriz, I., Díaz de Ilarraza, A. et al. Coreferential Relations in Basque: The Annotation Process. J Psycholinguist Res 47, 325–342 (2018). https://doi.org/10.1007/s10936-018-9559-6
DOI: https://doi.org/10.1007/s10936-018-9559-6