Investigating the Relationship Between Linguistic Representation and Computation through an Unsupervised Model of Human Morphology Learning

Research on Language and Computation

Abstract

We develop an unsupervised algorithm for morphological acquisition to investigate the relationship between linguistic representation, data statistics, and learning algorithms. Children acquire the morphological inflections of a language monotonically; we model this with an algorithm that uses a bootstrapped, frequency-driven learning procedure to acquire rules monotonically. The algorithm learns a morphological grammar in terms of a Base and Transforms representation, a simple rule-based model of morphology. When tested on corpora of English child-directed speech from CHILDES (MacWhinney 2000), the algorithm learns the most salient rules of English morphology, and its order of acquisition is similar to that of children as observed by Brown (1973). Investigations of statistical distributions in corpora reveal that the algorithm is able to acquire morphological grammars because it exploits Zipfian distributions in morphology through type-frequency statistics. These investigations suggest that the computation and frequency-driven selection of discrete morphological rules may be important factors in children’s acquisition of basic inflectional morphological systems.
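The idea of selecting discrete rules by type frequency can be illustrated with a small sketch. This is not the algorithm from the article: it only handles suffix-adding rules, it ranks rules in a single pass rather than bootstrapping, and the function names, the `max_suffix_len` parameter, and the toy word list are all assumptions made for illustration.

```python
from collections import Counter


def score_suffix_rules(word_types, max_suffix_len=3):
    """Count, for each suffix, how many word types it derives from another type.

    Hypothetical helper: a rule "add suffix" is credited once for every
    (base, base + suffix) pair in which both forms occur in the lexicon.
    """
    lexicon = set(word_types)  # types, not tokens: each word counts once
    scores = Counter()
    for derived in lexicon:
        # Keep the base non-empty by never stripping the whole word.
        longest = min(max_suffix_len, len(derived) - 1)
        for k in range(1, longest + 1):
            base, suffix = derived[:-k], derived[-k:]
            if base in lexicon:
                scores[suffix] += 1
    return scores


def rank_rules(word_types, n_rules=2):
    """Return the n_rules suffixes with the highest type-frequency score."""
    scores = score_suffix_rules(word_types)
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [suffix for suffix, _ in ranked[:n_rules]]


# Toy lexicon (an assumption, not data from CHILDES).
words = ["walk", "walks", "walked", "walking", "jump", "jumps",
         "jumped", "talk", "talks", "dog", "dogs"]
print(rank_rules(words))
```

Because each rule is scored by the number of distinct word types it relates, a few productive rules collect most of the support, which is the kind of skewed distribution the abstract refers to.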

References

  • Albright, A., & Hayes, B. (2002). Modeling English past tense intuitions with minimal generalization. In Proceedings of the special interest group on computational phonology.

  • Argamon, S., Akiva, N., Amir, A., & Kapah, O. (2004). Efficient unsupervised recursive word segmentation using minimum description length. In Proceedings of the international conference on computational linguistics.

  • Baayen R. H., Piepenbrock R., van Rijn H. (1996) The CELEX2 lexical database (CD-ROM). Linguistic Data Consortium, Philadelphia

  • Bacchin M., Ferro N., Melucci M. (2005) A probabilistic model for stemmer generation. Information Processing and Management 41: 121–137

  • Baroni, M., & Ueyama, M. (2006). Building general- and special-purpose corpora by web crawling. In Proceedings of the 13th NIJL international symposium, language corpora: Their compilation and application.

  • Beesley K., Karttunen L. (2003) Finite state morphology. CSLI Publications, Stanford

  • Biemann, C. (2006). Unsupervised part-of-speech tagging employing efficient graph clustering. In Proceedings of the Association for Computational Linguistics.

  • Bordag, S. (2007). Elements of knowledge-free and unsupervised lexical acquisition. Dissertation, University of Leipzig.

  • Brent M., Cartwright T. (1996) Distributional regularity and phonotactic constraints are useful for segmentation. Cognition 61: 93–125

  • Brown R. (1973) A first language: The early stages. Harvard University Press, Cambridge

  • Bybee J. L. (1985) Morphology: A study of the relation between meaning and form. John Benjamins, Amsterdam

  • Can, B., & Manandhar, S. (2009). Unsupervised learning of morphology by using syntactic categories. In Working notes for the cross language evaluation forum (CLEF), MorphoChallenge.

  • Carreras, X., Chao, I., Padró, L., & Padró, M. (2004). FreeLing: an open-source suite of language analyzers. In Proceedings of the language and resources evaluation conference.

  • Carlson L. (2005) Inducing a morphological transducer from inflectional paradigms. In: Arppe A., Carlson L., Lindén K., Piitulainen J., Suominen M., Vainio M., Westerlund H., Yli-Jyrä A. (eds) Inquiries into words, constraints and contexts: Festschrift for Kimmo Koskenniemi on his 60th birthday. CSLI Publications, Stanford, CA

  • Chan, E. (2008). Structures and distributions in morphology learning. Dissertation, University of Pennsylvania.

  • Chomsky N., Halle M. (1968) The sound pattern of English. Harper & Row, New York

  • Clark, A. (2001). Learning morphology with pair hidden markov models. In Proceedings of the student workshop at the 39th annual meeting of the Association for Computational Linguistics.

  • Clark, A. (2002). Memory-based learning of morphology with stochastic transducers. In Proceedings of the Association for Computational Linguistics.

  • Clark, A. (2003). Combining distributional and morphological information for part of speech induction. In Proceedings of the 10th conference of the European chapter of the Association for Computational Linguistics.

  • Corbett G. G., Fraser N. M. (1993) Network morphology: a DATR account of Russian nominal inflection. Journal of Linguistics 29: 113–142

  • Creutz, M. (2003). Unsupervised segmentation of words using prior distributions of morph length and frequency. In Proceedings of the Association for Computational Linguistics.

  • Creutz, M., & Lagus, K. (2004). Induction of a simple morphology for highly-inflecting languages. In Proceedings of the special interest group in computational phonology.

  • Daelemans, W., Berck, P., & Gillis, S. (1996). Unsupervised discovery of phonological categories through supervised learning of morphological rules. In Proceedings of the 16th international conference on computational linguistics.

  • Dasgupta, S., & Ng, V. (2007). Unsupervised part-of-speech acquisition for resource-scarce languages. In Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning.

  • Deerwester S., Dumais S. T., Furnas G. W., Landauer T. K., Harshman R. (1990) Indexing by latent semantic analysis. Journal of the American Society for Information Science 41: 391–407

  • Demberg, V. (2007). A language-independent unsupervised model for morphological segmentation. In Proceedings of the Association for Computational Linguistics.

  • Dressler, W. U. (2005). Morphological typology and first language acquisition: some mutual challenges. In G. Booij, E. Guevara, A. Ralli, S. Sgroi, & S. Scalise (Eds.), Morphology and linguistic typology, On-line proceedings of the fourth Mediterranean morphology meeting, Catania, 21–23 September 2003, University of Bologna. http://morbo.lingue.unibo.it/mmm.

  • Dreyer, M., Smith, J., & Eisner, J. (2008). Latent-variable modeling of string transductions with finite-state methods. In Proceedings of the conference on empirical methods in natural language processing.

  • Erjavec, T. (2006). The English-Slovene ACQUIS corpus. In Proceedings of the language and resources evaluation conference.

  • Forsberg, M., & Ranta, A. (2004). Functional morphology. In Proceedings of the international conference on functional programming.

  • Francis W. N., Kucera H. (1967) Computational analysis of present-day American English. Brown University Press, Providence, RI

  • Freitag, D. (2004). Toward unsupervised whole-corpus tagging. In Proceedings of the international conference on computational linguistics.

  • Freitag, D. (2005). Morphology induction from term clusters. In Proceedings of the conference on computational natural language learning.

  • Gambell, T., & Yang, C. (2004). Statistical learning and universal grammar: modeling word segmentation. In Proceedings of the international conference on computational linguistics.

  • Gerken L. A. (2006) Decisions, decisions: infant language learning when multiple generalizations are possible. Cognition 98: B67–B74

  • Gerken L. A., Bollt A. (2008) Three exemplars allow at least some linguistic generalizations: Implications for generalization mechanisms and constraints. Language Learning and Development 4: 228–248

  • Gildea D., Jurafsky D. (1996) Learning bias and phonological rule induction. Computational Linguistics 22: 497–530

  • Goldberg A. E. (1995) Constructions: A construction grammar approach to argument structure. University of Chicago Press, Chicago

  • Golding A. R., Thompson H. S. (1985) A morphology component for language programs. Linguistics 23: 263–284

  • Goldsmith J. A. (2001) Unsupervised learning of the morphology of a natural language. Computational Linguistics 27: 153–198

  • Goldsmith J. A. (2006) An algorithm for the unsupervised learning of morphology. Natural Language Engineering 12: 1–19

  • Goldwater S., Griffiths T. L., Johnson M. (2006) Interpolating between types and tokens by estimating power-law generators. In: Weiss Y., Schölkopf B., Platt J. (eds) Advances in neural information processing systems. The MIT Press, Cambridge

  • Graff D., Gallegos G. (1999) Spanish newswire text. Linguistic Data Consortium, Philadelphia

  • Gustafson-Capková S., Hartmann B. (2006) Manual of the Stockholm Umeå Corpus version 2.0. Department of Linguistics, Stockholm University, Stockholm

  • Hafer M., Weiss S. (1974) Word segmentation by letter successor varieties. Information Storage and Retrieval 10: 371–385

  • Hajic J. et al (2006) Prague Dependency Treebank 2.0, CDROM, LDC2006T01. Linguistic Data Consortium, Philadelphia

  • Halle M. (1973) Prolegomena to a theory of word-formation. Linguistic Inquiry 4: 3–16

  • Harris Z. (1955) From phoneme to morpheme. Language 31: 190–222

  • Harris Z. (1970) Papers in structural and transformational linguistics. D. Reidel, Dordrecht

  • Higgins, D. (2003). Unsupervised learning of Bulgarian POS tags. In Workshop on morphological processing of Slavic languages.

  • Hockett C. F. (1954) Two models of grammatical description. Word 10: 210–231

  • Hooper, J. B. (1979). Child morphology and morphophonemic change. Linguistics, 17, 21–50. (Also in J. Fisiak (Ed.), Historical morphology (pp. 157–187). The Hague: Mouton).

  • Hu, Y., Matveeva, I., Goldsmith, J. A., & Sprague, C. (2005a). The SED heuristic for morpheme discovery: a look at Swahili. In Proceedings of the second workshop on psychocomputational models of human language acquisition.

  • Hu, Y., Matveeva, I., Goldsmith, J. A., & Sprague, C. (2005b). Using morphology and syntax together in unsupervised learning. In Proceedings of the second workshop on psychocomputational models of human language acquisition.

  • Itai A., Wintner S. (2008) Language resources for Hebrew. Language Resources and Evaluation 42: 77–98

  • Johnson, M. (1984). A discovery procedure for certain phonological rules. In Proceedings of the international conference on computational linguistics and the Association for Computational Linguistics.

  • Kaplan R., Kay M. (1994) Regular models of phonological rule systems. Computational Linguistics 20: 331–378

  • Karttunen, L. (1998). The proper treatment of optimality in computational phonology. In Proceedings of FSMNLP’98. International workshop on finite-state methods in natural language processing.

  • Karttunen L. (2003) Computing with realizational morphology. In: Gelbukh A. (ed) Computational linguistics and intelligent text processing, vol. 2588 of Lecture Notes in Computer Science. Springer-Verlag, Heidelberg, pp 205–216

  • Kazakov D., Manandhar S. (2001) Unsupervised learning of word segmentation rules with genetic algorithms and inductive logic programming. Machine Learning 43: 121–162

  • Klein, D., & Manning, C. (2004). Corpus-based induction of syntactic structure: models of dependency and constituency. In Proceedings of the Association for Computational Linguistics.

  • Kurimo, M., Virpioja, S., Turunen, V. T., Blackwood, G. W., & Byrne, W. (2009). Overview and results of Morpho Challenge 2009. In Working notes for the CLEF 2009 workshop.

  • Linguistic Data Consortium (1994). ECI multilingual text. CDROM, LDC94T5. Linguistic Data Consortium, Philadelphia, PA

  • Lignos, C., Chan, E., Marcus, M. P., & Yang, C. (2009). A rule-based unsupervised morphology learning framework. In Working notes for cross-linguistic evaluation forum, MorphoChallenge.

  • Lignos, C., Chan, E., Marcus, M. P., & Yang, C. (2010). Evidence for a morphological acquisition model from development data. In Proceedings of the 34th annual Boston University conference on language development.

  • Lin, Y. (2005). Learning features and segments from waveforms: a statistical model of early phonological acquisition. Dissertation, UCLA.

  • Ling C. X. (1994) Learning the past tense of English verbs: the symbolic pattern associator versus connectionist models. Journal of Artificial Intelligence Research 1: 202–229

  • MacWhinney B. (2000) The CHILDES-Project: Tools for analyzing talk. 2nd edn. Erlbaum, Hillsdale

  • Manandhar, S., Džeroski, S., & Erjavec, T. (1998). Learning multilingual morphology with CLOG. In Proceedings of inductive logic programming (ILP), 8th international conference. Lecture notes in artificial intelligence (Vol. 1446, pp. 135–144). Heidelberg: Springer.

  • Marcus G. F., Pinker S., Ullman M., Hollander M., Rosen T. J., Xu F., Clahsen H. (1992) Overregularization in language acquisition. Monographs of the Society for Research in Child Development 57(4): 1–182

  • Màrquez, L., Taulé, M., Marti, A., Garcia, M., Real, F., & Ferrés, D. (2004). Senseval-3: The Catalan lexical sample task. In Proceedings of Senseval-3: The third international workshop on the evaluation of systems for the semantic analysis of text.

  • McClelland J. L., Patterson K. (2002) Rules or connections in past-tense inflections: What does the evidence rule out? Trends in Cognitive Sciences 6: 465–472

  • Molnar, R. A. (2001). Generalize and sift as a model of inflection acquisition. Masters thesis, Massachusetts Institute of Technology.

  • Mooney R. J., Califf M. E. (1996) Learning the past tense of English verbs using inductive logic programming. In: Wermter S., Riloff E., Scheler G. (eds) Symbolic, connectionist, and statistical approaches to learning for natural language processing. Springer, Heidelberg

  • Naradowsky, J., & Goldwater, S. (2009). Improving morphology induction by learning spelling rules. In Proceedings of the international joint conference on artificial intelligence.

  • Newman M. E. J. (2005) Power laws, Pareto distributions and Zipf’s law. Contemporary Physics 46: 323–351

  • Ninio A. (2006) Language and the learning curve: A new theory of syntactic development. Oxford University Press, Oxford

  • Oflazer K., Nirenburg S., McShane M. (2001) Bootstrapping morphological analyzers by combining human elicitation and machine learning. Computational Linguistics 27: 59–85

  • Papageorgiou, H., Prokopidis, P., Giouli, V., & Piperidis, S. (2000). A unified POS tagging architecture and its application to Greek. In Proceedings of the language and resources evaluation conference.

  • Parkes, C., Malek, A. M., & Marcus, M. P. (1998). Towards unsupervised extraction of verb paradigms from large corpora. In Proceedings of the sixth workshop on very large corpora.

  • Pinker S. (1999) Words and rules: The ingredients of language. Harper Collins, New York

  • Pinker S., Prince A. (1988) On language and connectionism: Analysis of a parallel distributed processing model of language acquisition. Cognition 28: 73–193

  • Pinker S., Ullman M. T. (2002) The past and future of the past tense. Trends in Cognitive Sciences 6: 456–463

  • Plisson, J., Lavrac, N., & Mladenic, D. (2004). A rule based approach to word lemmatization. In SiKDD 2004 at multiconference IS-2004, Ljubljana, Slovenia.

  • Poon, H., Cherry, C., & Toutanova, K. (2009). Unsupervised morphological segmentation with log-linear models. In Proceedings of the North American chapter of the Association for Computational Linguistics—Human Language Technologies Conference.

  • Prince, A., & Smolensky, P. (1993). Optimality theory: Constraint interaction in generative grammar. Technical Report, Rutgers University center for cognitive science and computer science Department, University of Colorado at Boulder. Also published by Blackwell Publishers, 2004.

  • Redington M., Chater N., Finch S. (1998) Distributional information: A powerful cue for acquiring syntactic categories. Cognitive Science 22: 425–469

  • Rumelhart D. E., McClelland J. L. (1986) On learning the past tenses of English verbs. In: McClelland J. L., Rumelhart D. E. (eds) The PDP research group, Parallel distributed processing: Explorations in the microstructure of cognition 2. The MIT Press, Cambridge

  • Schone, P., & Jurafsky, D. (2000). Knowledge-free induction of morphology using latent semantic analysis. In Proceedings of the conference on computational natural language learning.

  • Schone, P., & Jurafsky, D. (2001). Knowledge-free induction of inflectional morphologies. In Proceedings of the North American chapter of the Association for Computational Linguistics.

  • Schütze, H. (1993). Part-of-speech induction from scratch. In Proceedings of the Association for Computational Linguistics.

  • Segal, E. (1999). Hebrew morphological analyzer for Hebrew undotted texts. Master’s thesis, Technion, Israel Institute of Technology, Haifa.

  • Shalonova, K., & Flach, P. (2007). Morphology learning using tree of aligned suffix rules. In Proceedings of the workshop on challenges and applications of grammar induction.

  • Slobin D. I. (1973) Cognitive prerequisites for the development of grammar. In: Ferguson C. A., Slobin D. I. (eds) Studies of child language development. Holt, Rinehart & Winston, New York

  • Slobin, D. I. (Ed.) (1985) (2 vols.), (1992), (1997) (2 vols.). The crosslinguistic study of language acquisition. Hillsdale, NJ: Erlbaum.

  • Snover, M., & Brent, M. (2001). A Bayesian model for morpheme and paradigm identification. In Proceedings of the Association for Computational Linguistics.

  • Snover, M., Jarosz, G., & Brent, M. (2002). Unsupervised learning of morphology using a novel directed search algorithm: taking the first step. In Proceedings of the special interest group in computational phonology.

  • Sproat R. (1992) Morphology and computation. The MIT Press, Cambridge

  • Stroppa, N., & Yvon, F. (2005). An analogical learner for morphological analysis. In Proceedings of the Conference on Computational Natural Language Learning.

  • Stump G. T. (2001) Inflectional morphology: A theory of paradigm structure. Cambridge University Press, Cambridge

  • Theron, P., & Cloete, I. (1997). Automatic acquisition of two-level morphological rules. In Proceedings of the conference on applied natural language processing.

  • Tomasello M. (2003) Constructing a language: A usage-based theory of language acquisition. Harvard University Press, Cambridge

  • Weide, R. (1998). The Carnegie Mellon pronouncing dictionary [cmudict.0.6]. Carnegie Mellon University. http://www.speech.cs.cmu.edu/cgi-bin/cmudict.

  • Wicentowski, R. (2002). Modeling and learning multilingual inflectional morphology in a minimally supervised framework. Dissertation, Johns Hopkins University.

  • Wicentowski, R. (2004). Multilingual noise-robust supervised morphological analysis using the WordFrame model. In Proceedings of special interest group on computational phonology (SIGPHON).

  • Wothke, K. (1986). Machine learning of morphological rules by generalization and analogy. In Proceedings of the international conference on computational linguistics (COLING).

  • Yarowsky, D., & Wicentowski, R. (2000). Minimally supervised morphological analysis by multimodal alignment. In Proceedings of the Association for Computational Linguistics.

  • Yip, K., & Sussman, G. J. (1997). Sparse representations for fast, one-shot learning. A.I. Memo No. 1633, Artificial Intelligence Laboratory, Massachusetts Institute of Technology.

  • Zajac, R. (2001). Morpholog: constrained and supervised learning of morphology. In Proceedings of the Special Interest Group in Computational Phonology.

  • Zipf G. K. (1935) The psycho-biology of language, an introduction to dynamic philology. The Riverside Press, Cambridge

  • Zipf G. K. (1949) Human behavior and the principle of least effort. Addison-Wesley, Cambridge

Author information

Corresponding author

Correspondence to Erwin Chan.

About this article

Cite this article

Chan, E., Lignos, C. Investigating the Relationship Between Linguistic Representation and Computation through an Unsupervised Model of Human Morphology Learning. Res on Lang and Comput 8, 209–238 (2010). https://doi.org/10.1007/s11168-011-9077-2

  • DOI: https://doi.org/10.1007/s11168-011-9077-2
