Abstract
We develop an unsupervised algorithm for morphological acquisition to investigate the relationship between linguistic representation, data statistics, and learning algorithms. Children acquire the morphological inflections of a language monotonically; we model this with a bootstrapped, frequency-driven learning procedure that likewise acquires rules monotonically. The algorithm learns a morphological grammar expressed in the Base and Transforms representation, a simple rule-based model of morphology. When tested on corpora of English child-directed speech from CHILDES (MacWhinney 2000), the algorithm learns the most salient rules of English morphology, and its order of acquisition is similar to that of children as observed by Brown (1973). Investigations of statistical distributions in corpora reveal that the algorithm can acquire morphological grammars because it exploits Zipfian distributions in morphology through type-frequency statistics. These investigations suggest that the computation and frequency-driven selection of discrete morphological rules may be important factors in children’s acquisition of basic inflectional morphological systems.
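To make the idea of type-frequency-driven rule selection concrete, the following is a minimal illustrative sketch and not the authors' algorithm. It assumes that a candidate rule is a pair of suffixes `(s1, s2)` shared by one stem, that the form with the shorter suffix plays the role of the base, and that a rule's score is the number of distinct word types supporting it. The function name `count_transforms` and the parameter `max_suffix_len` are hypothetical and chosen only for this example.

```python
from collections import Counter, defaultdict
from itertools import combinations


def count_transforms(word_types, max_suffix_len=3):
    """Score candidate suffix rules (s1, s2) by type frequency.

    Illustrative assumption: a rule (s1, s2) is supported by a stem when
    both stem + s1 and stem + s2 occur in the lexicon. Token counts are
    ignored; every distinct word type counts once.
    """
    suffixes_by_stem = defaultdict(set)
    for word in set(word_types):
        # Try every split that leaves at least one character in the stem.
        for k in range(min(max_suffix_len, len(word) - 1) + 1):
            stem, suffix = word[:len(word) - k], word[len(word) - k:]
            suffixes_by_stem[stem].add(suffix)

    counts = Counter()
    for suffixes in suffixes_by_stem.values():
        # Order each pair so that the shorter suffix comes first (the "base").
        ordered = sorted(suffixes, key=lambda s: (len(s), s))
        for s1, s2 in combinations(ordered, 2):
            counts[(s1, s2)] += 1
    return counts


if __name__ == "__main__":
    words = ["walk", "walks", "walked", "talk", "talks", "talked",
             "jump", "jumps", "play", "played", "run", "runs"]
    # Print the three best-supported candidate rules with their type counts.
    for rule, n in count_transforms(words).most_common(3):
        print(rule, n)
```

On this toy lexicon the best-supported candidate is the empty-suffix-to-`s` pair, but spurious pairs such as `("k", "ks")` also receive support. A learner of the kind described in the abstract would need a further selection step, applied iteratively, to keep only the well-supported rules.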
References
Albright, A., & Hayes, B. (2002). Modeling English past tense intuitions with minimal generalization. In Proceedings of the special interest group on computational phonology.
Argamon, S., Akiva, N., Amir, A., & Kapah, O. (2004). Efficient unsupervised recursive word segmentation using minimum description length. In Proceedings of the international conference on computational linguistics.
Baayen R. H., Piepenbrock R., van Rijn H. (1996) The CELEX2 lexical database (CD-ROM). Linguistic Data Consortium, Philadelphia
Bacchin M., Ferro N., Melucci M. (2005) A probabilistic model for stemmer generation. Information Processing and Management 41: 121–137
Baroni, M., & Ueyama, M. (2006). Building general- and special-purpose corpora by web crawling. In Proceedings of the 13th NIJL international symposium, language corpora: Their compilation and application.
Beesley K., Karttunen L. (2003) Finite state morphology. CSLI Publications, Stanford
Biemann, C. (2006). Unsupervised part-of-speech tagging employing efficient graph clustering. In Proceedings of the Association for Computational Linguistics.
Bordag, S. (2007). Elements of knowledge-free and unsupervised lexical acquisition. Dissertation, University of Leipzig.
Brent M., Cartwright T. (1996) Distributional regularity and phonotactic constraints are useful for segmentation. Cognition 61: 93–125
Brown R. (1973) A first language: The early stages. Harvard University Press, Cambridge
Bybee J. L. (1985) Morphology: A study of the relation between meaning and form. John Benjamins, Amsterdam
Can, B., & Manandhar, S. (2009). Unsupervised learning of morphology by using syntactic categories. In Working notes for the cross language evaluation forum (CLEF), MorphoChallenge.
Carreras, X., Chao, I., Padró, L., & Padró, M. (2004). FreeLing: an open-source suite of language analyzers. In Proceedings of the language and resources evaluation conference.
Carlson L. (2005) Inducing a morphological transducer from inflectional paradigms. In: Arppe A., Carlson L., Lindén K., Piitulainen J., Suominen M., Vainio M., Westerlund H., Yli-Jyrä A. (eds) Inquiries into words, constraints and contexts, Festschrift for Kimmo Koskenniemi on his 60th Birthday. CSLI Publications, Stanford, CA
Chan, E. (2008). Structures and distributions in morphology learning. Dissertation, University of Pennsylvania.
Chomsky N., Halle M. (1968) The sound pattern of English. Harper & Row, New York
Clark, A. (2001). Learning morphology with pair hidden Markov models. In Proceedings of the student workshop at the 39th annual meeting of the Association for Computational Linguistics.
Clark, A. (2002). Memory-based learning of morphology with stochastic transducers. In Proceedings of the Association for Computational Linguistics.
Clark, A. (2003). Combining distributional and morphological information for part of speech induction. In Proceedings of the 10th conference of the European chapter of the Association for Computational Linguistics.
Corbett G. G., Fraser N. M. (1993) Network morphology: a DATR account of Russian nominal inflection. Journal of Linguistics 29: 113–142
Creutz, M. (2003). Unsupervised segmentation of words using prior distributions of morph length and frequency. In Proceedings of the Association for Computational Linguistics.
Creutz, M., & Lagus, K. (2004). Induction of a simple morphology for highly-inflecting languages. In Proceedings of the special interest group in computational phonology.
Daelemans, W., Berck, P., & Gillis, S. (1996). Unsupervised discovery of phonological categories through supervised learning of morphological rules. In Proceedings of the 16th international conference on computational linguistics.
Dasgupta, S., & Ng, V. (2007). Unsupervised part-of-speech acquisition for resource-scarce languages. In Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning.
Deerwester S., Dumais S. T., Furnas G. W., Landauer T. K., Harshman R. (1990) Indexing by latent semantic analysis. Journal of the American Society for Information Science 41: 391–407
Demberg, V. (2007). A language-independent unsupervised model for morphological segmentation. In Proceedings of the Association for Computational Linguistics.
Dressler, W. U. (2005). Morphological typology and first language acquisition: some mutual challenges. In G. Booij, E. Guevara, A. Ralli, S. Sgroi, & S. Scalise (Eds.), Morphology and Linguistic Typology, On-line Proceedings of the Fourth Mediterranean Morphology Meeting, Catania, 21–23 September 2003, University of Bologna. http://morbo.lingue.unibo.it/mmm.
Dreyer, M., Smith, J., & Eisner, J. (2008). Latent-variable modeling of string transductions with finite-state methods. In Proceedings of the conference on empirical methods in natural language processing.
Erjavec, T. (2006). The English-Slovene ACQUIS corpus. In Proceedings of the language and resources evaluation conference.
Forsberg, M., & Ranta, A. (2004). Functional morphology. In Proceedings of the international conference on functional programming.
Francis W. N., Kucera H. (1967) Computational analysis of present-day American English. Brown University Press, Providence, RI
Freitag, D. (2004). Toward unsupervised whole-corpus tagging. In Proceedings of the international conference on computational linguistics.
Freitag, D. (2005). Morphology induction from term clusters. In Proceedings of the conference on computational natural language learning.
Gambell, T., & Yang, C. (2004). Statistical learning and universal grammar: modeling word segmentation. In Proceedings of the international conference on computational linguistics.
Gerken L. A. (2006) Decisions, decisions: infant language learning when multiple generalizations are possible. Cognition 98: B67–B74
Gerken L. A., Bollt A. (2008) Three exemplars allow at least some linguistic generalizations: Implications for generalization mechanisms and constraints. Language Learning and Development 4: 228–248
Gildea D., Jurafsky D. (1996) Learning bias and phonological rule induction. Computational Linguistics 22: 497–530
Goldberg A. E. (1995) Constructions: A construction grammar approach to argument structure. University of Chicago Press, Chicago
Golding A. R., Thompson H. S. (1985) A morphology component for language programs. Linguistics 23: 263–284
Goldsmith J. A. (2001) Unsupervised learning of the morphology of a natural language. Computational Linguistics 27: 153–198
Goldsmith J. A. (2006) An algorithm for the unsupervised learning of morphology. Natural Language Engineering 12: 1–19
Goldwater S., Griffiths T. L., Johnson M. (2006) Interpolating between types and tokens by estimating power-law generators. In: Weiss Y., Schölkopf B., Platt J. (eds) Advances in neural information processing systems. The MIT Press, Cambridge
Graff D., Gallegos G. (1999) Spanish newswire text. Linguistic Data Consortium, Philadelphia
Gustafson-Capková S., Hartmann B. (2006) Manual of the Stockholm Umeå Corpus version 2.0. Department of Linguistics, Stockholm University, Stockholm
Hafer M., Weiss S. (1974) Word segmentation by letter successor varieties. Information Storage and Retrieval 10: 371–385
Hajic J. et al (2006) Prague Dependency Treebank 2.0, CDROM, LDC2006T01. Linguistic Data Consortium, Philadelphia
Halle M. (1973) Prolegomena to a theory of word-formation. Linguistic Inquiry 4: 3–16
Harris Z. (1955) From phoneme to morpheme. Language 31: 190–222
Harris Z. (1970) Papers in structural and transformational linguistics. D. Reidel, Dordrecht
Higgins, D. (2003). Unsupervised learning of Bulgarian POS tags. In Workshop on morphological processing of Slavic languages.
Hockett C. F. (1954) Two models of grammatical description. Word 10: 210–231
Hooper, J. B. (1979). Child morphology and morphophonemic change. Linguistics, 17, 21–50. (Also in J. Fisiak (Ed.), Historical morphology (pp. 157–187). The Hague: Mouton).
Hu, Y., Matveeva, I., Goldsmith, J. A., & Sprague, C. (2005a). The SED heuristic for morpheme discovery: a look at Swahili. In Proceedings of the second workshop on psychocomputational models of human language acquisition.
Hu, Y., Matveeva, I., Goldsmith, J. A., & Sprague, C. (2005b). Using morphology and syntax together in unsupervised learning. In Proceedings of the second workshop on psychocomputational models of human language acquisition.
Itai A., Wintner S. (2008) Language resources for Hebrew. Language Resources and Evaluation 42: 77–98
Johnson, M. (1984). A discovery procedure for certain phonological rules. In Proceedings of the international conference on computational linguistics and the Association for Computational Linguistics.
Kaplan R., Kay M. (1994) Regular models of phonological rule systems. Computational Linguistics 20: 331–378
Karttunen, L. (1998). The proper treatment of optimality in computational phonology. In Proceedings of FSMNLP’98. International workshop on finite-state methods in natural language processing.
Karttunen L. (2003) Computing with realizational morphology. In: Gelbukh A. (ed) Computational Linguistics and Intelligent Text Processing, vol. 2588 of Lecture Notes in Computer Science. Springer-Verlag, Heidelberg, pp 205–216
Kazakov D., Manandhar S. (2001) Unsupervised learning of word segmentation rules with genetic algorithms and inductive logic programming. Machine Learning 43: 121–162
Klein, D., & Manning, C. (2004). Corpus-based induction of syntactic structure: models of dependency and constituency. In Proceedings of the Association for Computational Linguistics.
Kurimo, M., Virpioja, S., Turunen, V. T., Blackwood, G. W., & Byrne, W. (2009). Overview and results of Morpho Challenge 2009. In Working notes for the CLEF 2009 workshop.
Linguistic Data Consortium (1994). ECI multilingual text. CDROM, LDC94T5. Linguistic Data Consortium, Philadelphia, PA
Lignos, C., Chan, E., Marcus, M. P., & Yang, C. (2009). A rule-based unsupervised morphology learning framework. In Working notes for cross-linguistic evaluation forum, MorphoChallenge.
Lignos, C., Chan, E., Marcus, M. P., & Yang, C. (2010). Evidence for a morphological acquisition model from development data. In Proceedings of the 34th annual Boston University conference on language development.
Lin, Y. (2005). Learning features and segments from waveforms: a statistical model of early phonological acquisition. Dissertation, UCLA.
Ling C. X. (1994) Learning the past tense of English verbs: the symbolic pattern associator versus connectionist models. Journal of Artificial Intelligence Research 1: 202–229
MacWhinney B. (2000) The CHILDES-Project: Tools for analyzing talk. 2nd edn. Erlbaum, Hillsdale
Manandhar, S., Džeroski, S., & Erjavec, T. (1998). Learning multilingual morphology with CLOG. In Proceedings of inductive logic programming (ILP), 8th international conference. Lecture notes in artificial intelligence (Vol. 1446, pp. 135–144). Heidelberg: Springer.
Marcus G. F., Pinker S., Ullman M., Hollander M., Rosen T. J., Xu F., Clahsen H. (1992) Overregularization in language acquisition. Monographs of the Society for Research in Child Development 57(4): 1–182
Màrquez, L., Taulé, M., Marti, A., Garcia, M., Real, F., & Ferrés, D. (2004). Senseval-3: The Catalan lexical sample task. In Proceedings of senseval-3: The third international workshop on the evaluation of systems for the semantic analysis of text.
McClelland J. L., Patterson K. (2002) Rules or connections in past-tense inflections: What does the evidence rule out? Trends in Cognitive Sciences 6: 465–472
Molnar, R. A. (2001). Generalize and sift as a model of inflection acquisition. Master’s thesis, Massachusetts Institute of Technology.
Mooney R. J., Califf M. E. (1996) Learning the past tense of English verbs using inductive logic programming. In: Wermter S., Riloff E., Scheler G. (eds) Symbolic, connectionist, and statistical approaches to learning for natural language processing. Springer, Heidelberg
Naradowsky, J., & Goldwater, S. (2009). Improving morphology induction by learning spelling rules. In Proceedings of the international joint conference on artificial intelligence.
Newman M. E. J. (2005) Power laws, Pareto distributions and Zipf’s law. Contemporary Physics 46: 323–351
Ninio A. (2006) Language and the learning curve: A new theory of syntactic development. Oxford University Press, Oxford
Oflazer K., Nirenburg S., McShane M. (2001) Bootstrapping morphological analyzers by combining human elicitation and machine learning. Computational Linguistics 27: 59–85
Papageorgiou, H., Prokopidis, P., Giouli, V., & Piperidis, S. (2000). A unified POS tagging architecture and its application to Greek. In Proceedings of the language and resources evaluation conference.
Parkes, C., Malek, A. M., & Marcus, M. P. (1998). Towards unsupervised extraction of verb paradigms from large corpora. In Proceedings of the sixth workshop on very large corpora.
Pinker S. (1999) Words and rules: The ingredients of language. Harper Collins, New York
Pinker S., Prince A. (1988) On language and connectionism: Analysis of a parallel distributed processing model of language acquisition. Cognition 28: 73–193
Pinker S., Ullman M. T. (2002) The past and future of the past tense. Trends in Cognitive Sciences 6: 456–463
Plisson, J., Lavrac, N., & Mladenic, D. (2004). A rule based approach to word lemmatization. In SiKDD 2004 at multiconference IS-2004, Ljubljana, Slovenia.
Poon, H., Cherry, C., & Toutanova, K. (2009). Unsupervised morphological segmentation with log-linear models. In Proceedings of the North American chapter of the Association for Computational Linguistics—Human Language Technologies Conference.
Prince, A., & Smolensky, P. (1993). Optimality theory: Constraint interaction in generative grammar. Technical Report, Rutgers University Center for Cognitive Science and Computer Science Department, University of Colorado at Boulder. Also published by Blackwell Publishers, 2004.
Redington M., Chater N., Finch S. (1998) Distributional information: A powerful cue for acquiring syntactic categories. Cognitive Science 22: 425–469
Rumelhart D. E., McClelland J. L. (1986) On learning the past tenses of English verbs. In: McClelland J. L., Rumelhart D. E. (eds) The PDP research group, Parallel distributed processing: Explorations in the microstructure of cognition 2. The MIT Press, Cambridge
Schone, P., & Jurafsky, D. (2000). Knowledge-free induction of morphology using latent semantic analysis. In Proceedings of the conference on computational natural language learning.
Schone, P., & Jurafsky, D. (2001). Knowledge-free induction of inflectional morphologies. In Proceedings of the North American chapter of the Association for Computational Linguistics.
Schütze, H. (1993). Part-of-speech induction from scratch. In Proceedings of the Association for Computational Linguistics.
Segal, E. (1999). Hebrew morphological analyzer for Hebrew undotted texts. Master’s thesis, Technion, Israel Institute of Technology, Haifa.
Shalonova, K., & Flach, P. (2007). Morphology learning using tree of aligned suffix rules. In Proceedings of the workshop on challenges and applications of grammar induction.
Slobin D. I. (1973) Cognitive prerequisites for the development of grammar. In: Ferguson C. A., Slobin D. I. (eds) Studies of child language development. Holt, Rinehart & Winston, New York
Slobin, D. I. (Ed.) (1985) (2 vols.), (1992), (1997) (2 vols.). The crosslinguistic study of language acquisition. Hillsdale, NJ: Erlbaum.
Snover, M., & Brent, M. (2001). A Bayesian model for morpheme and paradigm identification. In Proceedings of the Association for Computational Linguistics.
Snover, M., Jarosz, G., & Brent, M. (2002). Unsupervised learning of morphology using a novel directed search algorithm: taking the first step. In Proceedings of the special interest group in computational phonology.
Sproat R. (1992) Morphology and computation. The MIT Press, Cambridge
Stroppa, N., & Yvon, F. (2005). An analogical learner for morphological analysis. In Proceedings of the Conference on Computational Natural Language Learning.
Stump G. T. (2001) Inflectional morphology: A theory of paradigm structure. Cambridge University Press, Cambridge
Theron, P., & Cloete, I. (1997). Automatic acquisition of two-level morphological rules. In Proceedings of the conference on applied natural language processing.
Tomasello M. (2003) Constructing a language: A usage-based theory of language acquisition. Harvard University Press, Cambridge
Weide, R. (1998). The Carnegie Mellon pronouncing dictionary [cmudict.0.6]. (Carnegie Mellon University: http://www.speech.cs.cmu.edu/cgi-bin/cmudict).
Wicentowski, R. (2002). Modeling and learning multilingual inflectional morphology in a minimally supervised framework. Dissertation, Johns Hopkins University.
Wicentowski, R. (2004). Multilingual noise-robust supervised morphological analysis using the WordFrame model. In Proceedings of special interest group on computational phonology (SIGPHON).
Wothke, K. (1986). Machine learning of morphological rules by generalization and analogy. In Proceedings of the international conference on computational linguistics (COLING).
Yarowsky, D., & Wicentowski, R. (2000). Minimally supervised morphological analysis by multimodal alignment. In Proceedings of the Association for Computational Linguistics.
Yip, K., & Sussman, G. J. (1997). Sparse representations for fast, one-shot learning. A.I. Memo No. 1633, Artificial Intelligence Laboratory, Massachusetts Institute of Technology.
Zajac, R. (2001). Morpholog: constrained and supervised learning of morphology. In Proceedings of the Special Interest Group in Computational Phonology.
Zipf G. K. (1935) The psycho-biology of language, an introduction to dynamic philology. The Riverside Press, Cambridge
Zipf G. K. (1949) Human behavior and the principle of least effort. Addison-Wesley, Cambridge
About this article
Cite this article
Chan, E., Lignos, C. Investigating the Relationship Between Linguistic Representation and Computation through an Unsupervised Model of Human Morphology Learning. Res on Lang and Comput 8, 209–238 (2010). https://doi.org/10.1007/s11168-011-9077-2
DOI: https://doi.org/10.1007/s11168-011-9077-2