Dependency structure annotation in the IULA Spanish LSP Treebank

This paper presents the IULA Spanish LSP Treebank, an open-source treebank of over 40,000 sentences, developed in the framework of the European project METANET4U. The IULA Spanish LSP Treebank is the first technical corpus of Spanish annotated at surface syntactic level, following the dependency grammar theory. We present the method we used to create the resource and the linguistic annotations that the treebank provides, using examples and comparing with similar resources. We also provide the statistics of the treebank and the evaluation results.


Introduction
Syntactically annotated corpora (treebanks) constitute a crucial resource for quantitative and qualitative studies of a wide range of phenomena in lexis, grammar, semantics, discourse, language variation, etc., as well as for natural language processing (NLP) research, where they serve as training and evaluation data for data-driven parsing systems. Thus, in the past decades, there has been increasing interest in the construction of treebanks that provide constituent structure and/or dependency structure annotations.
Early treebank projects were strongly influenced by phrase structure grammar, and most syntactically annotated corpora provided hierarchical constituent structures, along the lines of the pioneering Lancaster Parsed Corpus (Garside et al. 1992) and Penn English Treebank (Marcus et al. 1993). However, due to recent advances in both grammar-driven dependency-based parsing systems (Karlsson 1990; Karlsson et al. 1995; Maruyama 1990; Harper and Helzerman 1995; Järvinen and Tapanainen 1998; Menzel and Schröder 1998; Eisner 2000; Covington 2001; Oflazer 2003) and data-driven dependency-based parsing systems (Eisner 1996a, b; Collins et al. 1999; Samuelsson 2000; Bangalore 2003; Yamada and Matsumoto 2003; Nasr and Rambow 2004; Nivre et al. 2004; Wang et al. 2004; McDonald et al. 2005), interest has increasingly shifted to the annotation of syntactic structure in the form of labeled dependency trees, derived from dependency grammar theories. The seminal work in this direction was the Prague Dependency Treebank (Böhmová et al. 2003), and dependency treebanks have now been built, on a smaller or larger scale, for at least 30 languages, either natively or by means of algorithms that automatically convert available phrase structure treebanks to dependency structure notations. Descriptions of available annotated corpora can be found in Abeillé (2003) and in the proceedings of the annual editions of the International Workshop on Treebanks and Linguistic Theories (TLT). Kübler et al. (2009) give a thorough introduction to the methods that are most widely used today in dependency parsing, and Kakkonen (2005) presents the state of the art of dependency-based treebanks.
This paper presents the IULA Spanish LSP (Language for Special Purposes) Treebank, a dependency treebank developed in the framework of the European project METANET4U. While other existing Spanish treebanks (Moreno et al. 2000; Taulé et al. 2008) consist of newspaper text, the IULA Spanish LSP Treebank is the first technical corpus of Spanish annotated at surface syntactic level following the dependency grammar theory and, therefore, it is the only one that can be used in linguistic studies and NLP research activities for this text type in Spanish. The treebank is accessible through a browser that provides concordance-based search functions, and it has been made publicly and freely available on the META-SHARE platform under a Creative Commons Attribution 3.0 Unported License.
After this introductory section, this article is organized as follows: Sect. 2 presents the main features of the corpus; in Sect. 3 we set out the annotation process; in Sect. 4 we give details about the generated dependency representations; Sect. 5 gives the statistics of the treebank; Sect. 6 presents the methods used to evaluate the quality of the resource; and, finally, Sect. 7 summarizes and presents the conclusions.
The main purpose of this project was to add syntactic information to the Corpus Tècnic de l'IULA, a PoS-tagged collection of contemporary written texts from the fields of Law, Economics, Computer Science, Environment, and Medicine, whose vertical variation ranges from popular science articles to research papers (Cabré et al. 2006; Vivaldi 2009). This corpus, of 1,389 documents, contains 31,436,451 words distributed among 412,707 sentences. Sentences whose length ranges from 4 to 30 words represent 65.1 % of the corpus. Figure 1 shows the ratio of number of sentences per sentence length for the different domains and Table 1 shows a few sample sentences from the Computer Science field.
The IULA Spanish LSP Treebank contains over 40,000 syntactically annotated sentences of 4 to 30 words. In order to annotate a representative sample of the Corpus Tècnic de l'IULA which could be used to create a model for automatic annotation, these sentences were chosen at random from the above-mentioned range, keeping the same proportion in terms of number of sentences per length and domain. Table 2 shows the number of sentences per sentence length and Table 3 shows the number of sentences, words, distinct words, and distinct lemmas distributed among the different domains.
The corpus was linguistically annotated in three different steps.
First, it was analyzed by means of the FreeLing toolkit (Padró and Stanilovsky 2012), which morphologically annotated each word according to the EAGLES tagset for Spanish by dictionary look-up, performed Hidden Markov Model disambiguation, and identified punctuation symbols, multi-word units, and Named Entities (NEs). Figure 2 shows an annotation sample with sentence (1), which includes the tags 'NP00000' and 'W' that FreeLing assigned to proper names and dates, respectively (other tags for NEs are: 'Z' for numbers, 'Zm' for currencies, and 'Zu' for measures).
(1) El Reino de España aplicará íntegramente el arancel aduanero común a partir de el 1 de enero de 1993.
    The Kingdom of Spain apply.FUTURE.3RD.SG fully the tariff customs common to start from the 1st of January of 1993
    'The Kingdom of Spain will apply the common customs tariff from 1 January 1993.'
Second, we parsed the PoS-tagged sentences using the Deep Linguistic Processing with HPSG Initiative (DELPH-IN) open-source processing framework, also used in several treebank projects (Oepen et al. 2002; Hashimoto et al. 2007; Branco et al. 2010; Kordoni and Zhang 2009; Marimon 2010; Flickinger et al. 2012). The DELPH-IN processing framework offers a range of facilities: (a) the annotation is based on the manual selection of the correct analysis among all the analyses produced by a symbolic grammar, instead of relying on human annotation only; (b) a stochastic learner, based on the decisions taken by the annotators, sorts and reduces the outputs generated by the grammar, which reduces the manual annotation effort, especially for long sentences; and (c) the disambiguation decisions can be reused to update the treebank semi-automatically with a revised version of the grammar.
In our treebank project, we used the Spanish Resource Grammar (SRG), a broad-coverage Head-driven Phrase Structure Grammar (HPSG) (Pollard and Sag 1987, 1994) grammar of Spanish implemented in the Linguistic Knowledge Builder (LKB) system (Copestake 2002). A detailed description of the grammar may be found in Marimon (2013).
The manual selection of the correct parse was done with the help of an interface, the [incr tsdb()] profiling environment (Oepen and Carroll 2000), that allowed the annotator to reduce the set of parses incrementally through the choice of so-called discriminants (Carter 1997); i.e. by selecting (or, alternatively, rejecting) the features that distinguished between the different parses, until only the appropriate parse was left. If none of the displayed parses was the correct one, all parses were rejected and an error analysis was performed to evaluate the performance of both the grammar and FreeLing and to identify and correct the grammar deficiencies with respect to coverage, errors, and unanticipated interactions. Then, we updated the treebank with the revised version of the grammar.
As is always the case with symbolic grammars, SRG produces several hundred analyses for a sentence. The DELPH-IN framework, however, provides a MaxEnt-based stochastic ranker (Toutanova et al. 2005) that sorts the parses produced by the grammar. While this stochastic ranker could not be used to select the correct parse automatically without introducing a considerable number of errors, it allowed the annotators to reduce the forest to the top 500 readings, such that the disambiguation should not require more than 9 decisions. Statistics were gathered from disambiguated parses and the MaxEnt model was updated as the number of annotated sentences increased.
Fig. 2 Example of a sentence analyzed by FreeLing
Finally, an automatic converter was developed to transform the selected parses, represented as derivation trees, into dependency trees.
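As an aside on the disambiguation step described above, the bound of about nine decisions for a forest of 500 readings is what one would expect if each discriminant splits the remaining parses roughly in half. The following is a minimal sketch under that assumption; the function names, the feature names, and the encoding of a parse as a set of features are all hypothetical and are not taken from [incr tsdb()].

```python
def decisions_needed(n_parses: int) -> int:
    """Fewest yes/no decisions that can isolate one parse out of n_parses,
    assuming each decision discards about half of the remaining parses."""
    if n_parses < 1:
        raise ValueError("n_parses must be positive")
    return (n_parses - 1).bit_length()  # equals ceil(log2(n_parses))


def apply_discriminant(parses, feature, accept):
    """Keep the parses that contain `feature` (accept=True) or lack it (accept=False).

    Parses are modelled as sets of feature names; this is a toy encoding."""
    return [p for p in parses if (feature in p) == accept]


# Hypothetical forest of four readings for one sentence.
forest = [
    {"pp-attaches-to-verb", "np-is-subject"},
    {"pp-attaches-to-noun", "np-is-subject"},
    {"pp-attaches-to-verb", "np-is-object"},
    {"pp-attaches-to-noun", "np-is-object"},
]

# The annotator accepts one discriminant and rejects another.
remaining = apply_discriminant(forest, "pp-attaches-to-verb", accept=True)
remaining = apply_discriminant(remaining, "np-is-object", accept=False)

print(len(remaining))          # number of parses left after two decisions
print(decisions_needed(500))   # decisions needed for a forest of 500 readings
```

In practice discriminants rarely split the forest evenly, so the figure should be read as an idealized estimate rather than a guarantee.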

The conversion procedure: the source and target formats
The linguistic analysis produced by the LKB system for each parsed sentence is represented as a parse tree with standard HPSG typed feature structures at each phrasal and lexical node. From this richly annotated output, all kinds of linguistic information can be extracted and represented in different formats that include (a) a binary branching phrase structure tree representing constituency structure, (b) a Minimal Recursion Semantics (MRS) representation (Copestake et al. 2006) representing structural semantics (i.e. predicate-argument relations), and (c) a derivation tree representing coarse-grained dependency relations. For the purpose of this paper, we restrict ourselves to the derivation tree, which is the only format we considered to generate the dependency trees.
In the LKB format, derivation trees are encoded in a nested, parenthesized structure whose elements correspond to the identifiers of the phrase structure rules and the lexical items involved in the parsing. Phrase structure rules, marked by the suffix '_c' (for construction), identify the daughter sequence, separated by a hyphen, and, in headed-phrase constructions, a basic dependency relation between sentence constituents, namely: subject-head (sb-hd), head-complement (hd-cmp), head-adjunct (hd-ad), specifier-head (sp-hd), clitic-head (cl-hd), filler-head (flr-hd), and head-punctuation (hd-pt). Lexical items are annotated with part-of-speech information (e.g. NCMS000, VMIF3S0) and their lexical entry identifier (e.g. arancel_n, aplicar_v-np). Figure 3 shows an example of a derivation tree with sentence (1).
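To make the nested, parenthesized encoding concrete, here is a minimal sketch of a reader for such structures. It assumes a simplified bracketing in which each phrasal node is written as `(rule daughter daughter)` and each leaf is a bare word; the actual LKB derivation format carries more information per node (such as the part-of-speech tags and lexical entry identifiers mentioned above), and the function name is hypothetical.

```python
import re


def parse_derivation(text):
    """Parse a simplified bracketed derivation into nested (rule, children) tuples.

    Leaves are returned as plain strings. Malformed input (for example an
    unclosed bracket) raises IndexError or ValueError."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        tok = tokens[pos]
        if tok == "(":
            rule = tokens[pos + 1]      # identifier of the phrase structure rule
            pos += 2
            children = []
            while tokens[pos] != ")":
                children.append(node())
            pos += 1                    # consume the closing bracket
            return (rule, children)
        pos += 1
        return tok                      # lexical item

    tree = node()
    if pos != len(tokens):
        raise ValueError("unexpected trailing input")
    return tree


# Hypothetical, heavily abridged derivation for a fragment of sentence (1).
example = "(sb-hd_c (sp-hd_c el reino) (hd-cmp_c aplica (sp-hd_c el arancel)))"
tree = parse_derivation(example)
print(tree[0])          # rule identifier at the root
print(len(tree[1]))     # number of daughters of the root
```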
From this derivation tree, we obtained the information for the dependency structures that the IULA Spanish LSP Treebank provides in the CoNLL standardized format (Buchholz and Marsi 2006). In this target annotation, the syntactic analysis assigned to a sentence takes the form of lexical elements linked by asymmetrical dependency relations in which one of the related elements is considered the head of the relation and the other one is its dependent. In this dependency structure, only dependencies between actual words in the sentence are marked, so no elliptical elements are inserted, for example to identify null subjects in finite clauses (pro-drop) or in infinitive-VP complements. Figure 4 shows the dependency tree for the same sentence as in Fig. 3.
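As an illustration of the target format, the sketch below reads a small block in the ten-column, tab-separated CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL). The rows are a hand-made fragment for the noun phrase el arancel aduanero común and are not copied from the treebank; the labels MOD and ROOT, the coarse tag, and the helper names are assumptions made for the example.

```python
from typing import NamedTuple


class Token(NamedTuple):
    id: int
    form: str
    head: int      # 0 means the token has no head inside the fragment
    deprel: str


def read_conll(block: str):
    """Read one sentence in the 10-column CoNLL-X layout into Token records."""
    tokens = []
    for line in block.strip().splitlines():
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}")
        tokens.append(Token(int(cols[0]), cols[1], int(cols[6]), cols[7]))
    return tokens


# Hypothetical rows; underscores mark fields left unspecified in this sketch.
ROWS = [
    ("1", "el", "_", "_", "_", "_", "2", "SPEC", "_", "_"),
    ("2", "arancel", "arancel", "n", "NCMS000", "_", "0", "ROOT", "_", "_"),
    ("3", "aduanero", "_", "_", "_", "_", "2", "MOD", "_", "_"),
    ("4", "común", "_", "_", "_", "_", "2", "MOD", "_", "_"),
]
SAMPLE = "\n".join("\t".join(row) for row in ROWS)

tokens = read_conll(SAMPLE)
# Each token has exactly one head, which is what makes the relation asymmetrical.
dependents_of_noun = [t.form for t in tokens if t.head == 2]
print(dependents_of_noun)
```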
Footnote 16 continued: dependency parser (Nivre et al. 2007), which will be used in the near future to enlarge the treebank. We have performed some experiments using a set of 1,428 sentences and obtained the following results: 445 sentences were selected out of the 1,428 sentences (31.2 %), precision (number of correctly selected sentences among all the selected sentences) stood at 90.6 % (403/445), while recall (number of correctly selected sentences among all the sentences whose correct parse was actually ranked first) was 46.6 % (403/864). See further details in Marimon et al. (2014).
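The percentages reported in this footnote can be rechecked directly from the counts it gives. The short sketch below does only that arithmetic; the variable and function names are ours.

```python
def pct(numerator: int, denominator: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


total_sentences = 1428        # sentences in the experiment
selected = 445                # sentences selected automatically
correctly_selected = 403      # selected sentences whose parse was correct
correct_ranked_first = 864    # sentences whose correct parse was ranked first

selection_rate = pct(selected, total_sentences)
precision = pct(correctly_selected, selected)
recall = pct(correctly_selected, correct_ranked_first)
print(selection_rate, precision, recall)
```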
The conversion is a fully automatic and unambiguous process. A deterministic conversion algorithm makes use of the identifiers of the phrase structure rules to identify the heads, the dependents, and some dependency types that are directly transferred onto the dependency structure, e.g. subject, specifier, and modifier. The identifiers of the lexical entries, which include the syntactic category of the subcategorized elements, enable the identification of the argument-related dependency labels, e.g. direct object and indirect object.
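The idea of reading heads and dependents off the rule identifiers can be sketched as follows. This is not the converter used for the treebank: it assumes binary rules whose identifier names the two daughters in order (as in sb-hd_c or hd-cmp_c), represents words as plain strings, and uses a placeholder label ARG where the real procedure would consult the lexical entry identifier. The label names SUBJ and MOD are likewise assumptions.

```python
# Labels transferred directly from the rule identifier (assumed names).
DIRECT_LABELS = {"sb": "SUBJ", "sp": "SPEC", "ad": "MOD"}


def to_dependencies(node, arcs):
    """Return the lexical head of `node`, appending (dependent, head, label) arcs.

    A node is either a word (str) or a triple (rule, left_daughter, right_daughter)."""
    if isinstance(node, str):
        return node
    rule, left, right = node
    name = rule[:-2] if rule.endswith("_c") else rule
    daughters = name.split("-")          # e.g. ["sb", "hd"] or ["hd", "cmp"]
    left_head = to_dependencies(left, arcs)
    right_head = to_dependencies(right, arcs)
    if daughters[0] == "hd":
        head, dependent, role = left_head, right_head, daughters[1]
    else:
        head, dependent, role = right_head, left_head, daughters[0]
    # Complements get a placeholder; the real label depends on the lexical entry.
    arcs.append((dependent, head, DIRECT_LABELS.get(role, "ARG")))
    return head


# Hypothetical, abridged derivation for a fragment of sentence (1).
derivation = ("sb-hd_c",
              ("sp-hd_c", "el", "reino"),
              ("hd-cmp_c", "aplica", ("sp-hd_c", "el", "arancel")))
arcs = []
root = to_dependencies(derivation, arcs)
print(root)
for arc in arcs:
    print(arc)
```

Because words are plain strings here, repeated forms such as el are not distinguished; a fuller version would carry token positions.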

The syntactic annotation
Stemming from the theory of structural syntax developed by Tesnière (1959), dependency grammar comprises a large family of grammatical theories and formalisms that include Word Grammar (Hudson 1984, 1990), Functional Generative Description (Sgall et al. 1986), Dependency Unification Grammar (Hellwig 1986, 2003), Meaning-Text Theory (Mel'čuk 1988), and Constraint Dependency Grammar (Maruyama 1990; Harper and Helzerman 1995; Menzel and Schröder 1998).
Centered upon the notion of dependency, the syntactic structure of a sentence that these frameworks provide is based on asymmetrical relations between a head and a dependent. These frameworks also provide a common analysis for a core of syntactic constructions. However, they differ in the criteria for identifying the head and the dependent in the relations, as well as in the analysis of certain types of syntactic constructions. In addition, the connections between theoretical frameworks and computational systems are often rather loose for dependency-based analysis (cf. Nivre et al. 2005), and the same loose connection is observed between theoretical frameworks and dependency treebanks, especially in those treebanks which have been derived from phrase structure annotations.
In this section we present, using examples and comparing with other treebanks, the linguistic annotations that the IULA Spanish LSP Treebank provides following the dependency grammar model. We start with the criteria for identifying the head and the dependent in the relations and the presentation of the dependency relations that we have compiled; then we discuss the analysis that the treebank provides for coordination constructions and headless constructions.

Dependency relations, heads, and dependents
While head-complement and head-modifier structures receive a fairly unified analysis in existing dependency treebanks, differing only in the inventory of specific dependency labels, no agreement has been reached on the analysis of constructions that involve function words, e.g. determiners, auxiliaries, complementizers, and case-marking prepositions. Based on semantic criteria, some frameworks treat the content word as the head of the relation. Other theories and formalisms make the opposite assumption, following syntactic criteria. The IULA Spanish LSP Treebank annotation mostly follows syntactic criteria, as we detail below.
• Nouns are the heads of noun phrases, and all determiners are their dependents, labeled as SPEC (specifier) (see, for instance, the analysis of the noun phrase el arancel aduanero común ('the common customs tariff') in Fig. 4). This is the standard approach in dependency treebanks; only in the Danish Dependency Treebank (Kromann 2003) is the determiner considered to be the head of the noun phrase.
• For prepositional phrases, the treebank also provides the most standard approach in dependency-based analysis: prepositions govern their noun phrases, which, in turn, are represented inside the prepositional phrase (i.e. nouns govern their dependents) (see, for instance, the analysis of the PP por la electricidad atmosférica ('by atmospheric electricity'), which appears in sentence (2), shown in Fig. 5). Alternative analyses are followed, for instance, in [...] (2008), where the preposition gets an auxiliary label, even though it is the head.
(2) Los incendios causados por la electricidad atmosférica
    The fires caused by the electricity atmospheric
    'Fires caused by atmospheric electricity.'
• In the analysis of subcategorized subordinate clauses introduced by the complementizer que ('that'), two approaches are followed in dependency analysis: (1) the complementizer is attached to the predicate of the embedded clause (Brants et al. 2004; Afonso et al. 2002; Taulé et al. 2008), and (2) the complementizer lies between the two clauses; i.e. it constitutes the head of the subordinate clause and it depends on the verb of the matrix clause. In this second approach, the label of the relation of the subordinate clause to its parent can be assigned to the complementizer (Montemagni et al. 2003; Husain et al. 2010) or to the clausal predicate (Böhmová et al. 2003; Dzeroski et al. 2006). The IULA Spanish LSP Treebank follows the second approach: the complementizer is the head of the subordinate clause and it depends on the verb of the matrix clause. This is illustrated in Fig. 6 with sentence (3), which also shows that in our treebank the label of the relation of the complement clause, in this case DO, is assigned to the complementizer.
(1) Sería preferible que estuviesen más acordes.
    be.CONDITIONAL.3RD.SG preferable that be.SUBJUNCTIVE.PRESENT.3RD.PL more consistent
    'It would be preferable that they were more consistent.'
(3) Creo que estaríamos ante una discriminación de base.
    think.PRESENT.1ST.SG that be.CONDITIONAL.1ST.PL before a discrimination of base
    'I think that we would have a basic discrimination.'
• Finally, different criteria may also be used to define the dependency relations in the analysis of verb groups: auxiliaries and modals are either linked to the main verb by a label AUX (auxiliary) (Taulé et al. 2008; Montemagni et al. 2003) or considered the head of the group (Afonso et al. 2002; Beek et al. 2002; Brants et al. 2004; Nilsson et al. 2005). Heads typically bear the label describing the relation of the verb group, subjects and negative particles are often attached to heads, and complements are often attached to content elements (the non-finite forms of the verb groups). Our treebank follows a mixed approach; only the auxiliary verbs haber ('to have') and ser ('to be') take the label AUX, and modal verbs are considered heads of the verb group. In addition, all dependents are attached to the non-finite verb. The reason for treating auxiliaries and modals differently is that we treated them differently in SRG; i.e. modal verbs are treated as main verbs. Figure 7 shows an example with sentence (4).
(4) La empresa no hubiera comprado las máquinas.
    The company not had bought the machines
    'The company had not bought the machines.'
Table 4 shows the dependency relations that the treebank distinguishes. We have compiled 26 dependency relations, which we discuss below.
• The SPEC (specifier) dependency relation can be realized by determiners depending on nouns and degree adverbs depending on adjectives and adverbs. [...] interrogative), or an infinitival verbal form. Human-denoting DOs in Spanish are realized by a noun phrase marked by the preposition a; in this case a takes the label DO and the head of the NP takes the label COMP, as we show in Fig. 8 with sentence (5).
(5) Identificó a dos ladrones.
    Identify.PAST.3RD.SG to two robbers
    'S/he identified two robbers.'
• IO (indirect object), which in Spanish is realized by a prepositional phrase headed by the marking preposition a, or by a dative clitic pronoun. As in human-denoting DOs, a takes the label IO and the head of the NP takes the label COMP.
• OBLC (oblique object), for marked noun phrases or verb phrases.
• BYAG, for the by-agent complement in passive constructions, which is realized by a prepositional phrase introduced by the preposition por ('by').
• ATR (attribute), for the complement of ser or estar ('to be').
• PRD (predicative complement) and OPRD (object predicative complement), for predicative complements of non-copular verbs.
• PP-LOC (locative complements) and PP-DIR (directional complements), realized by a locative prepositional or adverbial phrase and a directional prepositional or adverbial phrase, respectively.
• ADV (adverb), for non-locative and non-directional adverbs subcategorized by a very small number of verbs like portarse ('to behave').
Verbs may also govern the following dependency labels:
• PRNM (pronominal marker), for clitic pronouns found with so-called inherently reflexive verbs (or pronominal verbs); i.e. verbs that require a clitic pronoun co-indexed with the subject and which lack the corresponding non-reflexive form, as in (6).
(6) A ello me referiré en la parte final de mi exposición.
    To it CLITIC refer.FUT.1ST.SG in the part final of my presentation
    'I will refer to it in the last part of my presentation.'
• IMPM (impersonal marker), realized by the grammatical marker se that appears in impersonal se-constructions, as in (7).
(7) Se trata de una encuesta descriptiva y transversal.
    CLITIC is of a survey descriptive and transversal
    'It's a descriptive and transversal survey.'
• PASSM (passive marker), realized by the grammatical marker se that appears in passive se-constructions, as in (8).
(8) La salmuera se recubre con una capa de agua dulce.
    The brine CLITIC cover.PRESENT.3RD.SG with a layer of sweet water
    'Brine is covered with a layer of freshwater.'

Coordination structures
Coordination structures are often cited as one of the major problems in dependency grammar, since, even in their simplest case (i.e. a coordinating conjunction joins two syntactically compatible words or phrases), it is not possible to characterize them as asymmetrical relations between a head and a dependent. In an early solution proposed by Tesnière (1959), all conjuncts are attached directly to the head by horizontal edges and to the conjunction by vertical edges. Since then, many different alternatives have been proposed and adopted in the different treebank projects.
In a different analysis, originally proposed by Mel'čuk (1988), the first conjunct is the head of the other elements, which are organized in a chain; i.e. the conjunction is a dependent of the first conjunct and the second conjunct is a dependent of the conjunction (in multi-conjunct coordination, the conjunction depends on the penultimate conjunct and the last conjunct on the conjunction) (Fig. 9b). This approach has been followed in Brants et al. (2004), Järvinen and Tapanainen (1998), Nilsson et al. (2005), Vincze et al. (2010), and Seeker and Kuhn (2012).
Finally, in a third analysis, the first conjunct is the head and all other elements (including the conjunction) are attached directly to it (Afonso et al. 2002; Montemagni et al. 2003; Simov and Osenova 2005; Taulé et al. 2008) (Fig. 9c).
Figure 9 shows schematics of the three possibilities.
Various arguments can be found to support the particular choices. The approach of Sgall et al. (1986) provides more expressive power than that of Mel'čuk (1988), since it makes it possible to distinguish between dependents that affect the entire coordination and dependents that affect only a particular conjunct. Mel'čuk's (1988) approach has been shown to be the best suited to dependency parsing (Nilsson et al. 2006).
Since one of the purposes of the treebank is to serve as a training corpus for the development of data-driven dependency parsers, we have taken Mel'čuk's (1988) approach, and coordinated elements are organized in a chain. This is illustrated in Fig. 10 with the noun phrase (9).
(9) Una agenda y algunas relaciones sociales o profesionales
    An agenda and some relationships social or professional
    'An agenda and some social or professional relationships'
In multi-conjunct coordinated constructions, we use the label ENUM, instead of CONJ, in all but the last coordinated element, as we illustrate in Fig. 11 with the sentence (10).
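The chain analysis described in this section can be sketched as a small function that, given the conjuncts and the conjunction, emits the dependency arcs. This is an illustration under assumptions: the label COORD for the conjunction is hypothetical (the text above only names CONJ and ENUM), the first conjunct is taken to carry the external function of the whole coordination, and words stand in for tokens.

```python
def chain_coordination(conjuncts, conjunction):
    """Return (dependent, head, label) arcs for a chain-style coordination.

    The first conjunct heads the chain; every non-final conjunct after it
    hangs from the previous one with ENUM; the conjunction hangs from the
    penultimate conjunct; the last conjunct hangs from the conjunction."""
    if len(conjuncts) < 2:
        raise ValueError("coordination needs at least two conjuncts")
    arcs = []
    for previous, current in zip(conjuncts, conjuncts[1:-1]):
        arcs.append((current, previous, "ENUM"))
    arcs.append((conjunction, conjuncts[-2], "COORD"))   # assumed label
    arcs.append((conjuncts[-1], conjunction, "CONJ"))
    return arcs


# Two conjuncts, using the head nouns of example (9).
two = chain_coordination(["agenda", "relaciones"], "y")
# Three conjuncts, with made-up placeholders for the words.
three = chain_coordination(["A", "B", "C"], "y")
print(two)
print(three)
```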

Elliptical structures
Headless constructions (i.e. elliptical structures where the head is elided) are one of the weaker points in dependency grammar approaches, since they assume that the dependency representation is restricted to the surface of the sentence and it is not clear where the dependents of an elided head should be attached. For elliptical noun phrases, the standard strategy in dependency corpora is the following: the modifier of the elided head is chosen to become the head of the construction and it is labeled with the syntactic function of the elided head.
For elliptical finite verbs in e.g. gapping constructions, a number of idiosyncratic solutions have been proposed. One solution is to choose one of the dependents of a headless phrase to become the head, as in Brants et al. (2004).
In another solution, Mel'čuk (1988) introduces a zero word form (i.e. a phonetically empty head) into the dependency tree to provide an attachment point for the dependents of the ellipsis. Empty heads are used in several treebank projects, such as Bosco et al. (2000), Chaitanya et al. (2011), Vincze et al. (2010), and Seeker and Kuhn (2012), since they make it possible to preserve the parallelism in the syntactic structure of the two conjuncts; however, no statistical dependency parsing method models phonetically empty heads convincingly and performs at a satisfactory level.
A third solution is the analysis we have adopted, also found in Järvinen and Tapanainen (1998), Böhmová et al. (2003), Kromann (2003), and Oflazer et al. (2003), where the coordinating conjunction (or a comma) represents the missing verb and inherits all its properties, such that subjects, complements, and adjuncts are linked to it, marked as SUBJ-GAP, COMP-GAP, and MOD-GAP.
Figure 12 illustrates the analysis of both elliptical noun phrases and gapping that the treebank provides with the sentence (11), in which the subject of the elliptical finite verb is a noun phrase whose head is elided.
(11) Se necesita el mínimo cuando están en orden inverso y el máximo cuando están ordenados.
     CLITIC needs the minimum when are in order reverse and the maximum when are ordered
     'The minimum is required when they are in reverse order and the maximum when they are ordered.'

Statistics of the treebank
As we have already mentioned in Sect. 2, the IULA Spanish LSP Treebank contains 40,106 sentences distributed among different domains. The detailed statistics are shown in Table 3. It is worth mentioning that 11.59 % of the words in the treebank are tagged as verbs, 25 % as common nouns, and 9.81 % as adjectives, as we show in Table 5, which displays the relative frequency of the syntactic categories. Table 6 gives some figures for the occurrences of the dependency tags identified in the treebank.

Quality evaluation
Guaranteeing consistency, that is, ensuring that the same phenomena get the same annotation throughout the corpus, is crucial for any of the possible uses of the treebank. We measured the consistency level of the annotation using inter-annotator agreement analysis. Three tests were done on three different series of 100 sentences (chosen at random from all domains) with two annotators (Ph.D. students) at three different moments in a six-month period. We measured full agreement in selecting the analysis produced by the HPSG grammar. The agreements achieved in these tests were 75, 60, and 78 %.
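Full agreement in this sense can be computed as the share of sentences for which both annotators selected the same analysis. The sketch below shows the calculation on a made-up set of four sentences; the parse identifiers are hypothetical, and using None for a sentence whose parses were all rejected is our own convention.

```python
def full_agreement_pct(choices_a, choices_b):
    """Percentage of sentences on which two annotators chose the same analysis."""
    if len(choices_a) != len(choices_b) or not choices_a:
        raise ValueError("need two non-empty lists of equal length")
    matches = sum(a == b for a, b in zip(choices_a, choices_b))
    return 100.0 * matches / len(choices_a)


# Toy data: one selected parse per sentence; None means all parses were rejected.
annotator_1 = ["parse-3", "parse-1", None, "parse-7"]
annotator_2 = ["parse-3", "parse-2", None, "parse-7"]

agreement = full_agreement_pct(annotator_1, annotator_2)
print(agreement)
```

This raw measure does not correct for chance agreement, which is a reasonable simplification when each sentence has hundreds of candidate analyses.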
Close examination of the disagreements allowed us to categorize them as follows:
• Misunderstanding of the annotation guidelines, which discussed the meaning of the HPSG syntactic rules and lexical entries to select and the disambiguation decisions agreed upon, for instance, in cases of spurious ambiguity. Disagreement due to misunderstanding of the guidelines was much higher in the first test than in the following ones, since after the first test we improved the guidelines.
• Distinction between complements and modifiers. Whether a phrase is a nominal or verbal complement or an adjunct raised doubts among the annotators, as this is an arguable matter, especially in the case of nominal heads.
• The choice of the appropriate lexical entry for some words in certain contexts. For example, in sentence (12), the annotator had to choose among the different lexical entries for seguro which the grammar included to deal with its different meanings ('secure', 'safe', 'certain', 'reliable', 'sure').
(12) En estas situaciones, la laparoscopia suele permitir un diagnóstico más seguro.
     In these situations, the laparoscopy uses allow.INFINITIVE a diagnosis more accurate
     'In these situations, laparoscopy usually allows a more accurate diagnosis.'
• Adverbs, which caused analysis differences among annotators, especially in relation to scope ambiguities of focus adverbs.
• The inherent linguistic ambiguity of PP-attachment.
• Complex syntactic issues. This included several types of syntactic constructions, such as se-constructions, raising and control, some types of relative clauses, elision, etc. The complexity of the syntax also increased with the length of the sentence.
• Annotators' fatigue. Although some types of errors decreased as the annotators gained confidence with the tools and guidelines, fatigue errors remained fairly constant.

Conclusions
This paper has described the IULA Spanish LSP Treebank, a dependency treebank of over 40,000 sentences, developed in the framework of the European project METANET4U. The IULA Spanish LSP Treebank is the first technical corpus of Spanish annotated at surface syntactic level, following the dependency grammar theory. We have described the different steps of the annotation process and we have given details about the generated dependency structures that the treebank provides, discussing the annotation options chosen. We have also provided the statistics of the treebank and the evaluation results. In the future, we plan to annotate more sentences and to include semantic roles in the treebank by extracting them from the MRS semantic representation.

Fig. 1 Ratio of number of sentences per sentence length in the Corpus Tècnic de l'IULA

Fig. 3 Example of an LKB derivation tree, source annotation of the treebank

Fig. 5 Example of a prepositional phrase

Fig. 9 Three annotation options for coordination

Fig. 12 Example of an elliptical noun phrase and a gapping construction

Table 1 Sample sentences from the Corpus Tècnic de l'IULA

Table 2 Number of sentences per sentence length in the IULA Spanish LSP Treebank

[...] (2002) and Brants et al. (2004), prepositions in our treebank get the label of the relation of the prepositional phrase, and thus they differ from Böhmová et al. (2003), Dzeroski et al. (2006), and Smrz et al. [...]

[...] (Seeker and Kuhn 2012), where prepositions are attached to nouns, and in the CoNLL-2009 Shared Task data set for German, where prepositions govern their noun phrases, but all dependents of the noun are attached to the preposition (cf. Seeker and Kuhn 2012).

Table 4 Dependency relations in the IULA Spanish LSP Treebank

Clitic pronouns can appear either attached to the right side of the host verb, the so-called enclitics, or as independent lexical units in front of the verb, known as proclitics. Infinitives, gerunds, and non-negated imperatives have enclitics, verbs in personal forms always require proclitics, and past participles cannot have clitics. As in the Spanish treebank described by McDonald et al. (2013), enclitics are not split and only proclitics are annotated.

Table 6 Occurrences of dependency labels in the IULA Spanish LSP Treebank