
Parsing Hebrew CHILDES transcripts


Abstract

We present a syntactic parser of (transcripts of) spoken Hebrew: a dependency parser of the Hebrew CHILDES database. CHILDES is a corpus of child–adult linguistic interactions. Its Hebrew section has recently been morphologically analyzed and disambiguated, paving the way for syntactic annotation. This paper describes a novel annotation scheme of dependency relations reflecting constructions of child and child-directed Hebrew utterances. A subset of the corpus was annotated with dependency relations according to this scheme, and was used to train two parsers (MaltParser and MEGRASP) with which the rest of the data were parsed. The adequacy of the annotation scheme to the CHILDES data is established through numerous evaluation scenarios. The paper also discusses different annotation approaches to several linguistic phenomena, as well as the contribution of morphological features to the accuracy of parsing.


Notes

  1. However, the Hebrew transcriptions are not always consistent with respect to these markings. Corrections have been made to these corpora where possible, but some problematic instances may remain, with a potentially negative impact on the quality of the parsing.

  2. This one-to-one alignment is automatically verified by the Chatter program: http://talkbank.org/software/chatter.html.

  3. The standard terminology, of course, is subject for Aagr and object for Anonagr. We use formal, rather than functional, labels for consistency and to avoid theory-specific controversies.

  4. A dependency tree is projective if it has no crossing edges. For some languages, especially when word order is rather free, non-projective trees are preferred for better explaining sentence structure (Kübler et al. 2009, p. 16); a minimal projectivity check is sketched after these notes.

  5. The number of occurrences of the relation Root is not identical to the number of utterances since in some cases an elision relation (e.g., Aagr-Root) was used instead.

  6. Examples in this section do not necessarily label the dependencies, either because the original work did not label them or because the label names are not relevant in our context.

  7. In the annotation scheme for written Hebrew (Goldberg 2011), the marker ʔet is the head of the nominal element. According to Goldberg (2011), the reason for this decision is to accommodate cases where ʔet appears but the subsequent nominal element is elided. Such sentences are rather formal and we do not expect to encounter them in spoken language; no such constructions occur in our corpus.
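The projectivity condition in note 4 can be stated operationally: a tree is projective iff, for every arc, each token lying between the head and the dependent is dominated by that head. The following is a minimal, illustrative Python sketch of such a check, assuming a simple head-array encoding of the tree; it is not part of the paper's tooling.

```python
def is_projective(heads):
    """Return True iff the dependency tree encoded by `heads` is projective.

    `heads[i - 1]` is the position of the head of token i (1-based),
    with 0 denoting the artificial root.  The tree is projective iff,
    for every arc (h, d), every token strictly between h and d is
    (transitively) dominated by h, i.e. no two arcs cross.
    """
    def dominates(ancestor, node):
        # Climb the head chain from `node` until we hit `ancestor` or the root.
        while node != 0:
            if node == ancestor:
                return True
            node = heads[node - 1]
        return ancestor == 0

    for d in range(1, len(heads) + 1):
        h = heads[d - 1]
        lo, hi = min(h, d), max(h, d)
        if not all(dominates(h, k) for k in range(lo + 1, hi)):
            return False
    return True

# Two tiny examples (token positions only, no word forms):
print(is_projective([2, 0, 4, 2]))  # True: no arcs cross
print(is_projective([2, 0, 2, 1]))  # False: the arc 1 -> 4 crosses the arc root -> 2
```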

References

  • Albert, A., MacWhinney, B., Nir, B., & Wintner, S. (2014). The Hebrew CHILDES corpus: Transcription and morphological analysis. Language Resources and Evaluation.

  • Ballesteros, M., Herrera, J., Francisco, V., & Gervás, P. (2012). Analyzing the CoNLL-X shared task from a sentence accuracy perspective. SEPLN: Sociedad Española Procesamiento del Lenguaje Natural, 48, 29–34.


  • Ballesteros, M., & Nivre, J. (2012). MaltOptimizer: A system for MaltParser optimization. In Proceedings of the eighth international conference on language resources and evaluation (LREC’12), Istanbul, Turkey, May 2012. European Language Resources Association (ELRA). ISBN 978-2-9517408-7-7.

  • Berman, R. A. (1978). Modern Hebrew structure. Tel Aviv: University Publishing Projects.


  • Berman, R. A., & Weissenborn, J. (1991). Acquisition of word order: A crosslinguistic study. Jerusalem, Israel: German-Israel Foundation for Research and Development (GIF); In Hebrew.

  • Blevins, J. P. (2006). Word-based morphology. Journal of Linguistics, 42(4), 531–573. doi:10.1017/S0022226706004191.


  • Bohnet, B. (2010). Very high accuracy and fast dependency parsing is not a contradiction. In Proceedings of the 23rd international conference on computational linguistics (pp. 89–97). Stroudsburg, PA: Association for Computational Linguistics. http://dl.acm.org/citation.cfm?id=1873781.1873792.

  • Brown, R. (1973). A first language: The early stages. Cambridge, MA: Harvard University Press.


  • Danon, G. (2001). Syntactic definiteness in the grammar of Modern Hebrew. Linguistics, 39(6), 1071–1116. doi:10.1515/ling.2001.042.


  • de Marneffe, M.-C., MacCartney, B., & Manning, C. D. (2006). Generating typed dependency parses from phrase structure trees. In Proceedings of LREC-2006. http://nlp.stanford.edu/pubs/LREC06_dependencies.pdf.

  • de Marneffe, M.-C., & Manning, C. D. (2008). The Stanford typed dependencies representation. In COLING workshop on cross-framework and cross-domain parser evaluation. http://nlp.stanford.edu/pubs/dependencies-coling08.pdf.

  • Dromi, E., & Berman, R. A. (1982). A morphemic measure of early language development: Data from Modern Hebrew. Journal of Child Language, 9, 403–424. ISSN 1469-7602. http://journals.cambridge.org/article_S0305000900004785.

  • Eryiğit, G., Nivre, J., & Oflazer, K. (2008). Dependency parsing of Turkish. Computational Linguistics, 34(3), 357–389. ISSN 0891-2017. doi:10.1162/coli.2008.07-017-R1-06-83.


  • Goldberg, Y. (2011). Automatic syntactic processing of Modern Hebrew. PhD thesis, Ben Gurion University of the Negev, Israel.

  • Goldberg, Y., & Elhadad, M. (2009). Hebrew dependency parsing: Initial results. In Proceedings of the 11th international workshop on parsing technologies (IWPT-2009), 7–9 October 2009 (pp. 129–133). Paris, France: The Association for Computational Linguistics.

  • Hajič, J., & Zemánek, P. (2004). Prague Arabic dependency treebank: Development in data and tools. In Proceedings of the NEMLAR international conference on Arabic language resources and tools (pp. 110–117).

  • Haugereid, P., Melnik, N., & Wintner, S. (2013). Nonverbal predicates in Modern Hebrew. In S. Müller (Ed.), The proceedings of the 20th international conference on head-driven phrase structure grammar. CSLI Publications.

  • Kübler, S., McDonald, R. T., & Nivre, J. (2009). Dependency parsing. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers.

  • Lembersky, G., Shacham, D., & Wintner, S. (2014). Morphological disambiguation of Hebrew: A case study in classifier combination. Natural Language Engineering. ISSN 1469-8110. doi:10.1017/S1351324912000216.

  • MacWhinney, B. (2000). The CHILDES project: Tools for analyzing talk (3rd ed.). Mahwah, NJ: Lawrence Erlbaum.


  • Marton, Y., Habash, N., & Rambow, O. (2013). Dependency parsing of Modern Standard Arabic with lexical and inflectional features. Computational Linguistics, 39(1), 161–194.


  • McDonald, R., Crammer, K., & Pereira, F. (2005). Online large-margin training of dependency parsers. In Proceedings of the 43rd annual meeting on association for computational linguistics (pp. 91–98). Stroudsburg, PA, USA: Association for Computational Linguistics. doi:10.3115/1219840.1219852.

  • Ninio, A. (2013). Dependency grammar and Hebrew. In G. Khan (Ed.), Encyclopedia of Hebrew language and linguistics. Leiden: Brill.

  • Nir, B., MacWhinney, B., & Wintner, S. (2010). A morphologically-analyzed CHILDES corpus of Hebrew. In Proceedings of the seventh conference on international language resources and evaluation (LREC’10) (pp. 1487–1490). European Language Resources Association (ELRA), ISBN 2-9517408-6-7.

  • Nivre, J. (2003). An efficient algorithm for projective dependency parsing. In Proceedings of the eighth international workshop on parsing technologies (IWPT-2003) (pp. 149–160).

  • Nivre, J. (2005). Dependency grammar and dependency parsing. Technical report, Växjö University.

  • Nivre, J. (2009). Non-projective dependency parsing in expected linear time. In Proceedings of the joint conference of the 47th annual meeting of the ACL and the 4th international joint conference on natural language processing (pp. 351–359). Stroudsburg, PA, USA: Association for Computational Linguistics. http://dl.acm.org/citation.cfm?id=1687878.1687929.

  • Nivre, J., Hall, J., Kübler, S., McDonald, R., Nilsson, J., Riedel, S., et al. (2007). The CoNLL 2007 shared task on dependency parsing. In Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL (pp. 915–932), Prague.

  • Nivre, J., Hall, J., & Nilsson, J. (2006). Maltparser: A data-driven parser-generator for dependency parsing. In Proceedings of LREC-2006 (pp. 2216–2219).

  • Nivre, J., Kuhlmann, M., & Hall, J. (2009). An improved oracle for dependency parsing with online reordering. In Proceedings of the 11th international conference on parsing technologies (IWPT-09) (pp. 73–76).

  • Plank, B. (2011). Domain adaptation for parsing. Ph.D. Thesis, University of Groningen.

  • Rosen, H. B. (1966). Ivrit Tova (Good Hebrew). Jerusalem: Kiryat Sepher. In Hebrew.

  • Sagae, K., Davis, E., Lavie A., MacWhinney, B., & Wintner, S. (2010). Morphosyntactic annotation of CHILDES transcripts. Journal of Child Language, 37(3), 705–729. doi:10.1017/S0305000909990407.


  • Sagae, K., & Lavie, A. (2006). A best-first probabilistic shift-reduce parser. In Proceedings of the COLING/ACL poster session (pp. 691–698). Association for Computational Linguistics.

  • Sagae, K., & Tsujii, J. (2007). Dependency parsing and domain adaptation with LR models and parser ensembles. In Proceedings of the CoNLL shared task session of EMNLP-CoNLL 2007 (pp. 1044–1050). http://www.aclweb.org/anthology/D/D07/D07-1111.

  • Seddah, D., Tsarfaty, R., & Foster, J. (Eds.). (2011). Proceedings of the second workshop on statistical parsing of morphologically rich languages. Dublin, Ireland: Association for Computational Linguistics. http://www.aclweb.org/anthology/W11-38.

  • Sima’an, K., Itai, A., Winter, Y., Altman, A., & Nativ, N. (2001). Building a tree-bank of Modern Hebrew text. Traitement Automatique des Langues, 42(2), 247–380.


  • Smrž, O., & Pajas, P. (2004). MorphoTrees of Arabic and their annotation in the TrEd environment (pp. 38–41). ELDA.

  • Tsarfaty, R., & Goldberg, Y. (2008). Word-based or morpheme-based? Annotation strategies for Modern Hebrew clitics. In Proceedings of the sixth international conference on language resources and evaluation (LREC’08). European Language Resources Association (ELRA). ISBN 2-9517408-4-0. http://www.lrec-conf.org/proceedings/lrec2008/.

  • Tsarfaty, R., Nivre, J., & Andersson, E. (2012). Joint evaluation of morphological segmentation and syntactic parsing. In Proceedings of the 50th annual meeting of the association for computational linguistics (vol. 2, pp. 6–10).

  • Tsarfaty, R., Seddah, D., Goldberg, Y., Kübler, S., Candito, M., Foster, J., et al. (2010). Statistical parsing of morphologically rich languages (SPMRL): What, how and whither. In Proceedings of the NAACL HLT 2010 first workshop on statistical parsing of morphologically-rich languages (pp. 1–12). Stroudsburg, PA, USA: Association for Computational Linguistics. http://dl.acm.org/citation.cfm?id=1868771.1868772.

  • Tsarfaty, R., Seddah, D., Kübler, S., & Nivre, J. (2013). Parsing morphologically rich languages: Introduction to the special issue. Computational Linguistics, 39(1), 15–22.


  • Wintner, S. (2004). Hebrew computational linguistics: Past and future. Artificial Intelligence Review, 21(2), 113–138. ISSN 0269-2821. doi:10.1023/B:AIRE.0000020865.73561.bc.

  • Zhang, Y., & Clark, S. (2011). Syntactic processing using the generalized perceptron and beam search. Computational Linguistics, 37(1), 105–151. doi:10.1162/coli_a_00037.



Acknowledgments

This research was supported by Grant No. 2007241 from the United States-Israel Binational Science Foundation (BSF). We are grateful to Alon Lavie for useful feedback, and to Shai Cohen for helping with the manual annotation. We benefitted greatly from the comments of two anonymous reviewers.

Author information


Corresponding author

Correspondence to Shuly Wintner.

Appendices

Appendix 1: Dependency relations

Table 18 summarizes the basic dependency relations we define in this work. We list below all the relations, providing a brief explanation and a few examples.

  • Agreement Argument (Aagr)    Specifies the relation between an argument and a predicate that mandates agreement.

  • Non-agreement Argument (Anonagr)    Specifies any argument of a verb which need not agree with the verb, including indirect arguments.

  • Subordinate Clause (SubCl)    Specifies the relation between a complementizer and the main verb of a subordinate clause.

  • Argument of Preposition (Aprep)    Specifies the relation between a preposition and its argument.

  • Non-finite Argument (Ainf)    Specifies the relation between a verb or a noun in the main clause and its non-finite verbal argument.

  • Argument of Copula (Acop)    Specifies the relation between a copula and its predicate (either nominal or adjectival). See Sect. 7.1 for further discussion regarding this relation.

  • Argument of Existential (Aexs)    Specifies a relation between an existential element and a nominal or adjectival predicate. See Sect. 7.1 for further discussion regarding this relation.

  • Mdet Specifies a relation between a determiner and a noun.

  • Madj Specifies a relation between an adjective and a noun.

  • Mpre Specifies a relation between a dependent preposition and a head noun or a verb.

  • Mposs Specifies a relation between a noun and a subsequent possessive marker, denoted by the token 's̆el', headed by the noun.

  • Mnoun Specifies a noun–noun relation, where the first noun, the head, is in the construct state.

  • Madv Specifies a relation between a dependent adverbial modifier and the verb it modifies.

  • Mneg Specifies a negation of a verb or a noun.

  • Mquant Specifies a relation between a noun and a nominal quantifier, headed by the noun.

  • Msub Specifies a relation between a nominal element and a relativizer of a relative clause, headed by the nominal element. The main predicate of the subordinate clause is marked as the dependent of the relativizer with a RelCl relation.

  • Voc Specifies a vocative.

  • Com Specifies a communicator.

  • Coordination (Coord)    Specifies a coordination relation between coordinated items and conjunctions, most commonly we- “and”, headed by the conjunction.

  • Serialization (Srl)    Specifies a serial verb.

  • Enumeration (Enum)    Specifies an enumeration relation.

  • Unknown (Unk)    Specifies an unclear or unknown word—most commonly a child invented word—which appears disconnected from the rest of the utterance and often functions as a filler syllable.

  • Punctuation (Punct)    Specifies a punctuation mark, always attached to the root.

Table 18 Taxonomy of dependency relations
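To make the scheme concrete, the following hand-constructed sketch (in Python, for convenience) shows how a simple child-directed utterance might be annotated with a few of these relations. The utterance, its segmentation (with the article ha- as a separate token) and the attachment of punctuation to the root predicate are illustrative assumptions, not material from the corpus.

```python
# A hand-constructed, illustrative analysis of the hypothetical utterance
# "ha yeled roce kadur ." ("the boy wants a ball"), using labels from Table 18.
# Each entry is (id, form, gloss, head_id, relation); head_id 0 is the artificial root.
utterance = [
    (1, "ha",    "the",   2, "Mdet"),     # determiner, dependent of the noun
    (2, "yeled", "boy",   3, "Aagr"),     # agreeing (subject-like) argument of the verb
    (3, "roce",  "wants", 0, "Root"),     # main predicate
    (4, "kadur", "ball",  3, "Anonagr"),  # non-agreeing (object-like) argument
    (5, ".",     ".",     3, "Punct"),    # punctuation, attached to the root predicate
]

def dependents(head_id, tree=utterance):
    """Return the (id, form, relation) triples of the dependents of a given head."""
    return [(i, form, rel) for (i, form, _, h, rel) in tree if h == head_id]

print(dependents(3))  # [(2, 'yeled', 'Aagr'), (4, 'kadur', 'Anonagr'), (5, '.', 'Punct')]
```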

Appendix 2: The effect of MaltOptimizer

2.1 The features chosen by MaltOptimizer

The Stack non-projective eager algorithm uses three data structures: a stack Stack of partially processed tokens; a queue Input, which holds nodes that have been on Stack; and a queue Lookahead, which contains nodes that have not been on Stack. This algorithm facilitates the generation of non-projective trees using a SWAP transition which reverses the order of the top two tokens on Stack by moving the top token on Stack to Input. The recommended feature set for the All–All configuration is depicted in Table 19. The features reflect positions within these data structures, where ‘0’ indicates the first position. For example, the feature ‘POSTAG (Stack[0])’ specifies the part-of-speech tag of the token in the first position (i.e., the top) of the Stack data structure. The NUM, GEN, PERS and VERBFORM features are short for the number, gender, person and verb form morphological features, respectively. Merge and Merge3 are feature map functions which merge two and three feature values, respectively, into one. ldep returns the leftmost dependent of the given node; rdep returns the rightmost dependent; head returns the head of the node. For definitions of the rest of the features, refer to Nivre et al. (2007).

Table 19 In-domain, All–All configuration, MaltOptimizer recommended feature set
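To illustrate how such feature templates are evaluated, here is a minimal Python sketch over a toy parser-state representation. The Token class, the state encoding (index 0 is the top of Stack / front of Input) and the '|' separator used by Merge are assumptions made for the example; they are not MaltParser's internal data structures.

```python
# Toy realization of feature templates such as POSTAG(Stack[0]) and
# Merge(POSTAG(Stack[0]), GEN(Stack[0])).  Index 0 is the top of the stack
# (or the front of the Input/Lookahead queues), as in Table 19.

class Token:
    def __init__(self, form, postag, feats=None):
        self.form = form
        self.postag = postag
        self.feats = feats or {}   # e.g. {"GEN": "Masc", "NUM": "Sing"}
        self.head = None           # the head token, once attached
        self.deps = []             # dependents attached so far, left to right

def address(structure, i):
    """The token at position i of a data structure (Stack, Input, ...), or None."""
    return structure[i] if i < len(structure) else None

# Attribute-extracting feature functions.
def POSTAG(tok): return tok.postag if tok else None
def GEN(tok):    return tok.feats.get("GEN") if tok else None
def NUM(tok):    return tok.feats.get("NUM") if tok else None

# Address-modifying functions: leftmost/rightmost dependent and head.
def ldep(tok): return tok.deps[0] if tok and tok.deps else None
def rdep(tok): return tok.deps[-1] if tok and tok.deps else None
def head(tok): return tok.head if tok else None

def Merge(*values):
    """Combine several feature values into a single composite feature value."""
    return "|".join("-" if v is None else str(v) for v in values)

# A hypothetical state with a single noun on top of the stack.
stack = [Token("kadur", "NOUN", {"GEN": "Masc", "NUM": "Sing"})]
print(POSTAG(address(stack, 0)))                                 # NOUN
print(Merge(POSTAG(address(stack, 0)), GEN(address(stack, 0))))  # NOUN|Masc
```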

2.2 MaltParser’s default features

MaltParser’s default parsing algorithm is Nivre arc-eager (Nivre 2003), which uses two data structures: a stack Stack of partially processed tokens and a queue Input of remaining input tokens. The feature set used by the arc-eager algorithm is depicted in Table 20.

Table 20 In-domain, All–All configuration, MaltParser default feature set
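For readers unfamiliar with the algorithm, the following is a compact, illustrative Python sketch of the arc-eager transition system (Nivre 2003), applied to the same hypothetical utterance used in the sketch in Appendix 1; it omits the preconditions on transitions and is not MaltParser's implementation.

```python
# Arc-eager transitions over a configuration (stack, buffer, arcs); tokens are
# 1-based positions and 0 is the artificial root.  Preconditions (e.g. that
# LEFT-ARC may not detach a token that already has a head) are omitted.

def shift(stack, buffer, arcs):
    stack.append(buffer.pop(0))                  # push the front of the buffer

def left_arc(stack, buffer, arcs, label):
    dep = stack.pop()                            # top of stack becomes a dependent
    arcs.append((buffer[0], label, dep))         # ... of the front of the buffer

def right_arc(stack, buffer, arcs, label):
    dep = buffer.pop(0)                          # front of buffer becomes a dependent
    arcs.append((stack[-1], label, dep))         # ... of the top of the stack
    stack.append(dep)                            # and is pushed onto the stack

def reduce(stack, buffer, arcs):
    stack.pop()                                  # pop a token that has found its head

# Gold-style derivation for the hypothetical "ha yeled roce kadur".
stack, buffer, arcs = [0], [1, 2, 3, 4], []
shift(stack, buffer, arcs)                       # push ha
left_arc(stack, buffer, arcs, "Mdet")            # ha <- yeled
shift(stack, buffer, arcs)                       # push yeled
left_arc(stack, buffer, arcs, "Aagr")            # yeled <- roce
right_arc(stack, buffer, arcs, "Root")           # root -> roce
right_arc(stack, buffer, arcs, "Anonagr")        # roce -> kadur
print(arcs)  # [(2, 'Mdet', 1), (3, 'Aagr', 2), (0, 'Root', 3), (3, 'Anonagr', 4)]
```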

2.3 The features chosen by MaltOptimizer for the out-of-domain configuration

The feature set for the out-of-domain configuration suggested by MaltOptimizer is depicted in Table 21. The similarities between the suggested MaltOptimizer configurations of the in-domain and out-of-domain scenarios are not surprising, as the training set of the in-domain scenario is a subset of the training set of the out-of-domain scenario.

Table 21 Out-of-domain, All–All configuration, MaltOptimizer recommended feature set


Cite this article

Gretz, S., Itai, A., MacWhinney, B. et al. Parsing Hebrew CHILDES transcripts. Lang Resources & Evaluation 49, 107–145 (2015). https://doi.org/10.1007/s10579-013-9256-x
