Building Cast3LB: A Spanish Treebank

Research on Language and Computation

Abstract

In this paper we present and justify the methodological principles and syntactic criteria used to build Cast3LB, a treebank for Spanish. As preliminary work, several automatic and semiautomatic processes were carried out: automatic morphological analysis and disambiguation; manual validation of the tagging, which guarantees the quality of the data; and, finally, automatic shallow parsing. The syntactic annotation consists of labelling constituents, including some elliptical elements, and syntactic functions. In this paper we focus on the second phase, presenting the basic guidelines for syntactic annotation and the boundaries of the work being done.

Author information

Corresponding author

Correspondence to Montserrat Civit.

About this article

Cite this article

Civit, M., Martí, M.A. Building Cast3LB: A Spanish Treebank. Res Lang Comput 2, 549–574 (2004). https://doi.org/10.1007/s11168-004-7429-x

  • DOI: https://doi.org/10.1007/s11168-004-7429-x