Abstract
In this paper we present and justify the methodological principles and syntactic criteria used to build Cast3LB, a treebank for Spanish. As preliminary work, several automatic and semiautomatic processes were carried out: automatic morphological analysis and disambiguation; manual validation of the tagging, which guarantees the quality of the data; and, finally, automatic shallow parsing. The syntactic annotation consists of labelling constituents, including some elliptical elements, and syntactic functions. We focus on this second phase, the syntactic annotation, presenting its basic guidelines and the boundaries of the work being done.
About this article
Cite this article
Civit, M., Martí, M.A. Building Cast3LB: A Spanish Treebank. Res Lang Comput 2, 549–574 (2004). https://doi.org/10.1007/s11168-004-7429-x
DOI: https://doi.org/10.1007/s11168-004-7429-x