Cognition

Volume 117, Issue 2, November 2010, Pages 107-125

Modeling human performance in statistical word segmentation

https://doi.org/10.1016/j.cognition.2010.07.005

Abstract

The ability to discover groupings in continuous stimuli on the basis of distributional information is present across species and across perceptual modalities. We investigate the nature of the computations underlying this ability using statistical word segmentation experiments in which we vary the length of sentences, the amount of exposure, and the number of words in the languages being learned. Although the results are intuitive from the perspective of a language learner (longer sentences, less training, and a larger language all make learning more difficult), standard computational proposals fail to capture several of these results. We describe how probabilistic models of segmentation can be modified to take into account some notion of memory or resource limitations in order to provide a closer match to human performance.

Introduction

Human adults and infants, non-human primates, and even rodents all show a surprising ability: presented with a stream of syllables with no pauses between them, individuals from each group are able to discriminate statistically coherent sequences from sequences with lower coherence (Aslin et al., 1998, Hauser et al., 2001, Saffran et al., 1999, Saffran et al., 1996, Toro and Trobalon, 2005). This ability is not unique to linguistic stimuli (Saffran et al., 1999) or to the auditory domain (Conway and Christiansen, 2005, Kirkham et al., 2002), and is not constrained to temporal sequences (Fiser & Aslin, 2002) or even to the particulars of perceptual stimuli (Brady & Oliva, 2008). This “statistical learning” ability may be useful for a large variety of tasks but is especially relevant to language learners who must learn to segment words from fluent speech.

Yet despite the scope of the “statistical learning” phenomenon and the large literature surrounding it, the computations underlying statistical learning are at present unknown. Following an initial suggestion by Harris (1951), work on this topic by Saffran and colleagues (Saffran, Aslin, & Newport, 1996; Saffran, Newport, & Aslin, 1996) proposed that learners could succeed in word segmentation by computing transitional probabilities between syllables and using low-probability transitions as one possible indicator of a boundary between words. More recently, a number of investigations have used more sophisticated computational models to attempt to characterize the computations performed by human learners in word segmentation (Giroux & Rey, 2009) and visual statistical learning (Orbán, Fiser, Aslin, & Lengyel, 2008) tasks.
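To make the transitional-probability proposal concrete, the sketch below estimates forward transitional probabilities from a syllable stream and posits a word boundary at each local dip. This is a minimal illustration, not the procedure used in any of the studies cited above; the toy lexicon, the local-minimum criterion, and all other details are assumptions made for the example.

```python
import random
from collections import Counter

def transitional_probabilities(syllables):
    """Forward transitional probabilities: TP(B | A) = count(A B) / count(A)."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    first_counts = Counter(syllables[:-1])
    return {(a, b): n / first_counts[a] for (a, b), n in pair_counts.items()}

def segment_at_tp_dips(syllables):
    """Posit a word boundary wherever the TP is lower than both neighboring TPs."""
    tp = transitional_probabilities(syllables)
    trans = [tp[(a, b)] for a, b in zip(syllables, syllables[1:])]
    words, current = [], [syllables[0]]
    for i in range(1, len(syllables)):
        here = trans[i - 1]
        prev = trans[i - 2] if i >= 2 else float("inf")
        nxt = trans[i] if i < len(trans) else float("inf")
        if here < prev and here < nxt:        # local minimum -> boundary
            words.append("".join(current))
            current = []
        current.append(syllables[i])
    words.append("".join(current))
    return words

# Toy stream: three made-up words concatenated in random order (illustrative only).
lexicon = [("go", "la", "bu"), ("tu", "pi", "ro"), ("da", "ko", "ti")]
rng = random.Random(0)
stream = [syll for _ in range(200) for syll in rng.choice(lexicon)]
print(set(segment_at_tp_dips(stream)))   # mostly the three words of the toy lexicon
```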

The goal of the current investigation is to extend this previous work by evaluating a larger set of models against new experimental data describing human performance in statistical word segmentation tasks. Our strategy is to investigate the fit of segmentation models to human performance. Because existing experiments show evidence of statistical segmentation but provide only limited quantitative results about segmentation under different conditions, we parametrically manipulate basic factors that make the task difficult for human learners, creating a relatively detailed dataset with which to evaluate models.

The plan of the paper is as follows. We first review some previous work on the computations involved in statistical learning. Next, we make use of the adult statistical segmentation paradigm of Saffran et al. (1996) to measure human segmentation performance as we vary three factors: sentence length, amount of exposure, and number of words in the language. We then evaluate a variety of segmentation models on the same dataset and find that although some of the results are well-modeled by some subset of models, no model captures all three results. We argue that the likely cause of this failure is the lack of memory constraints on current models. We conclude by considering methods for modifying models of segmentation to better reflect the memory constraints on human learning.

This work makes three contributions: we present new human data on segmentation under a range of experimental conditions, we show an important limitation of a number of proposed models, and we describe a broad class of models (memory-limited probabilistic models) that we believe should be the focus of future investigations.

Investigations of the computations underlying statistical learning phenomena have followed two complementary strategies. The first evaluates model sufficiency: whether a particular model, given some fixed amount of data, will converge to the correct solution. If a model does not converge to the correct solution within the amount of data available to a human learner, either the model is incorrect or the human learner relies on other sources of information to solve the problem. The second strategy evaluates fidelity: the fit between model performance and human performance across a range of different inputs. To the extent that a model correctly matches the pattern of successes and failures exhibited by human learners, it can be said to provide a better theory of human learning.

Investigations of the sufficiency of different computational proposals for segmentation have suggested that transitional probabilities may not be a viable segmentation strategy for learning from corpus data (Brent, 1999b). For example, Brent (1999a) evaluated a number of computational models of statistical segmentation on their ability to learn words from infant-directed speech and found that a range of statistical models were able to outperform a simpler transitional probability-based model. A more recent investigation by Goldwater, Griffiths, and Johnson (2009) built on Brent’s modeling work by comparing a unigram model, which assumed that each word in a sentence was generated independently of each other word, to a bigram model which assumed sequential dependencies between words. The result of this comparison was clear: the bigram model substantially outperformed the unigram model because the unigram model tended to undersegment the input, mis-identifying frequent sequences of words as single units (e.g. “whatsthat” or “inthehouse”). Thus, incorporating additional linguistic structure into models may be necessary to achieve accurate segmentation. In general, however, the model described by Goldwater et al. (2009) and related models (Johnson, 2008, Liang and Klein, 2009) achieve the current state-of-the-art in segmentation performance due to their ability to find coherent units (words) and estimate their relationships within the language.
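The undersegmentation bias described above can be illustrated with a toy calculation. The sketch below scores a corpus under a bare maximum-likelihood unigram model, with no prior over the lexicon (unlike the Bayesian models of Brent and Goldwater et al.); the corpus and word forms are invented for the example. Because the model treats words as independent, merging a frequent collocation such as “whatsthat” into a single unit raises the corpus probability; the priors used in the actual models penalize large lexicons and prevent total collapse, but the pull toward frequent collocations remains.

```python
import math
from collections import Counter

def unigram_logprob(tokens):
    """Log probability of a token sequence under maximum-likelihood unigram
    probabilities estimated from that same sequence."""
    counts = Counter(tokens)
    n = len(tokens)
    return sum(c * math.log(c / n) for c in counts.values())

# Invented corpus: the collocation "whats that" occurs in every utterance.
gold   = ["whats", "that", "doggie"] * 10 + ["whats", "that", "kitty"] * 10
merged = ["whatsthat", "doggie"] * 10 + ["whatsthat", "kitty"] * 10

print(unigram_logprob(gold))    # about -79.8
print(unigram_logprob(merged))  # about -41.6: the undersegmented parse scores higher
```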

It is possible that human learners use a simple, undersegmenting strategy to bootstrap segmentation but then use other strategies or information sources to achieve accurate adult-level performance (Swingley, 2005). For this reason, investigations of the sufficiency of particular models are not alone able to resolve the question of what computations human learners perform either in artificial language segmentation paradigms or in learning to segment human language more generally. Thus, investigations of the fidelity of models to human data are a necessary part of the effort to characterize human learning. Since data from word segmentation tasks with human infants are largely qualitative in nature (Saffran et al., 1996, Jusczyk and Aslin, 1995), artificial language learning tasks with adults can provide valuable quantitative data for the purpose of distinguishing models.

Three recent studies have pursued this strategy. All three have investigated the question of the representations that are stored in statistical learning tasks and whether these representations are best described by chunking models or by transition-finding models. For example, Giroux and Rey (2009) contrasted the PARSER model of Perruchet and Vinter (1998) with a simple recurrent network, or SRN (Elman, 1990). The PARSER model, which extracts and stores frequent sequences in a memory register, was used as an example of a chunking strategy, and the SRN, which learns to predict individual elements on the basis of previous elements, was used as an example of a transition-finding model. Giroux and Rey compared these models' fit to data from a human experiment testing whether adults were able to recognize the sub-strings of valid sequences in the exposure corpus, and found that PARSER fit the human data better, predicting that sub-string recognition performance would not increase with greater amounts of exposure. These results suggest that PARSER may capture some aspects of the segmentation task that are not accounted for by the SRN. But because each model in this study represents only one particular instantiation of its class, a success or failure by one or the other does not provide evidence for or against the entire class.
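For readers unfamiliar with the chunking approach, the sketch below gives a deliberately simplified learner in the spirit of PARSER; it is not a reimplementation of Perruchet and Vinter's (1998) model, and the percept sizes, gain, decay, and threshold values are arbitrary choices made for illustration. The learner reads the stream in small percepts, stores each percept as a weighted chunk, lets all weights decay over time, and allows strong chunks to act as single units in later percepts.

```python
import random

def parser_like_chunker(stream, n_percepts=3000, gain=1.0, decay=0.05,
                        threshold=1.0, seed=0):
    """A simplified chunking learner inspired by PARSER (not the published model):
    percepts of 1-3 units are stored as chunks whose weights grow with repetition
    and decay over time; chunks above threshold act as single perceptual units."""
    rng = random.Random(seed)
    memory = {}            # chunk (tuple of syllables) -> weight
    pos = 0
    for _ in range(n_percepts):
        percept = []
        for _ in range(rng.randint(1, 3)):
            if pos >= len(stream):
                pos = 0
            strong = [c for c, w in memory.items()
                      if w >= threshold and tuple(stream[pos:pos + len(c)]) == c]
            unit = max(strong, key=len) if strong else (stream[pos],)
            percept.extend(unit)
            pos += len(unit)
        for c in memory:                            # forgetting: all chunks decay
            memory[c] = max(0.0, memory[c] - decay)
        key = tuple(percept)
        memory[key] = memory.get(key, 0.0) + gain   # reinforce the perceived chunk
    return sorted(memory.items(), key=lambda kv: -kv[1])[:5]

# Example stream: random concatenation of three made-up words (illustrative only).
words = [("go", "la", "bu"), ("tu", "pi", "ro"), ("da", "ko", "ti")]
rng = random.Random(1)
stream = [s for _ in range(400) for s in rng.choice(words)]
print(parser_like_chunker(stream))
```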

In the domain of visual statistical learning tasks, Orbán et al. (2008) conducted a series of elegant behavioral experiments with adults that were also designed to distinguish chunking and transition-finding strategies. (Orbán et al. referred to this second class of strategies as associative rather than transition-finding, since transitions were not sequential in the visual domain.) Their results suggested that the chunking model, which learned a parsimonious set of coherent chunks that could be composed to create the exposure corpus, provided a better fit to human performance across a wide range of conditions. Because of the guarantee of optimality afforded by the ideal learning framework that Orbán et al. (2008) used, this set of results provides slightly stronger evidence in favor of a chunking strategy. While Orbán et al.’s work still does not provide evidence against all transition-finding strategies, their results do suggest that it is not an idiosyncrasy of the learning algorithm employed by the transition-finding model that led to its failure. Because this result was obtained in the visual domain, however, it cannot be considered conclusive for auditory statistical learning tasks, since it is possible that statistical learning tasks make use of different computations across domains (Conway & Christiansen, 2005).

Finally, a recent study by Endress and Mehler (2009) familiarized adults with a language containing three-syllable words that were each generated by perturbing one syllable of a “phantom word” (so called because the word itself was never presented in the experiment). At test, participants were able to distinguish words that actually appeared in the exposure corpus from distractor sequences with low internal transition probabilities, but not from phantom words. These data suggest that participants do not simply store frequent sequences; if they did, they would not have indicated that phantom words were as familiar as sequences they actually heard. However, the data are consistent with at least two other possible interpretations. First, participants may have relied only on syllable-wise transition probabilities (which would lead to phantom words being judged as probable as the observed sequences). Second, participants might have been chunking sequences from the familiarization corpus and making the implicit inference that many of the observed sequences were related to the same prototype (the phantom word). This inference would in turn lead to a prototype enhancement effect (Posner & Keele, 1968), in which participants believe they have observed the prototype even though they have only observed non-prototypical exemplars centered around it. Thus, although these data are not consistent with a naïve chunking model, they may well be consistent with a chunking model that captures other properties of human generalization and memory.

To summarize: although far from conclusive, the current pattern of results is most consistent with the hypothesis that human performance in statistical learning tasks is best modeled by a process of chunking which may be limited by the basic properties of human memory. Rather than focusing on the question of chunking vs. transition-finding, our current work begins where this previous work leaves off, investigating how to incorporate basic features of human performance into models of statistical segmentation. Although some models of statistical learning have incorporated ideas about restrictions on human memory (Perruchet and Vinter, 1998, Perruchet and Vinter, 2002), for the most part, models of segmentation operate with no limits on either memory or computation. Thus, one goal of the current work is to investigate how these limitations can be modeled and how modeling these limitations can improve models’ fit to human data.

We begin by describing three experiments which manipulate the difficulty of the learning task. Experiment 1 varies the length of the sentences in the segmentation language. Experiment 2 varies the amount of exposure participants receive to the segmentation language. Experiment 3 varies the number of words in the language. Taken together, participants’ mean performance in these three experiments provides a set of data with which we can investigate the fit of models.

Section snippets

Experiment 1: sentence length

When learning to segment a new language, longer sentences should be more difficult to understand than shorter sentences. Certainly this is true in the limit: individually presented words are easy to learn and remember, while those presented in long sentences with no boundaries are more difficult. In order to test the hypothesis that segmentation performance decreases as sentence length increases, we exposed adults to sentences constructed from a simple artificial lexicon. We assigned

Experiment 2: amount of exposure

The more exposure to a language learners receive, the easier it should be for them to learn the words. To measure this relationship, we conducted an experiment in which we kept the length of sentences constant but varied the number of tokens (instances of words) participants heard.

Experiment 3: number of word types

The more words in a language, the harder the vocabulary of that language should be to remember. All things being equal, three words will be easier to remember than nine words. On the other hand, the more words in a language, the more diverse the evidence that you get. For a transition-finding model, this second fact is reflected in the decreased transition probabilities between words in a larger language, causing part-word distractors to have lower probability. For a chunking model, the same
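As a worked illustration of the transition-probability point above (assuming, as is typical in this paradigm, that words are concatenated uniformly at random with no immediate repetition; the exact design of Experiment 3 may differ in detail), in a language with $n$ word types:

$$\mathrm{TP}(\text{within a word}) = 1, \qquad \mathrm{TP}(\text{across a word boundary}) = \frac{1}{n-1}.$$

A part-word distractor spans exactly one word boundary, so its weakest internal transition falls from $1/2$ when $n = 3$ to $1/8$ when $n = 9$, while words retain internal transition probabilities of 1; on transition statistics alone, then, part-words should become easier to reject as the language grows.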

Model comparison

In this section, we compare the fit of a number of recent computational proposals for word segmentation to the human experimental results reported above. We do not attempt a comprehensive survey of models of segmentation.1 Instead we sample broadly from the space of available models, focusing on those models whose fit or lack of
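A generic way to compare such models against forced-choice data (a sketch of one common linking approach, not necessarily the one used in this paper) is to convert each model's scores for a target word and its part-word distractor into a predicted two-alternative choice probability, and then correlate the predictions with observed accuracy across conditions. The temperature parameter and all numbers below are illustrative assumptions.

```python
import math

def choice_probability(score_word, score_partword, temperature=1.0):
    """Luce choice rule: predicted probability of picking the true word in a
    two-alternative forced-choice trial, from nonnegative model scores."""
    a = score_word ** (1.0 / temperature)
    b = score_partword ** (1.0 / temperature)
    return a / (a + b)

def pearson_r(xs, ys):
    """Pearson correlation between model predictions and human accuracies."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical numbers only: model scores and human accuracy in four conditions.
predicted = [choice_probability(w, p) for w, p in [(0.9, 0.3), (0.8, 0.4),
                                                   (0.7, 0.5), (0.6, 0.5)]]
observed = [0.85, 0.78, 0.66, 0.60]
print(pearson_r(predicted, observed))
```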

Adding resource constraints to probabilistic models

Memory limitations provide a possible explanation for the failure of many models to fit human data. To test this hypothesis, the last section of the paper investigates the issue of adding memory limitations to models of segmentation. We explore two methods. The first, evidence limitation, implements memory limitations as a reduction in the amount of the evidence available to learners. The second, capacity limitation, implements memory limitations explicitly via imposing limits on models’
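To preview the two approaches, the sketch below illustrates one simple version of each; the functions, parameter values, and eviction rule are illustrative assumptions, not the implementations evaluated in the paper. Either transformation can be applied to the input or internal counts of an otherwise unlimited segmentation model before its predictions are compared with the human performance curves.

```python
import random
from collections import Counter

def limit_evidence(utterances, keep_prob=0.5, seed=0):
    """Evidence limitation (one simple possibility): the learner effectively
    encodes only a random fraction of the exposure corpus."""
    rng = random.Random(seed)
    return [u for u in utterances if rng.random() < keep_prob]

def limited_capacity_counts(items, capacity=4):
    """Capacity limitation (one simple possibility): keep counts for at most
    `capacity` distinct items, evicting the weakest entry when a new item
    arrives (the capacity of 4 echoes Cowan, 2001, but is illustrative here)."""
    counts = Counter()
    for item in items:
        if item in counts or len(counts) < capacity:
            counts[item] += 1
        else:
            weakest, _ = min(counts.items(), key=lambda kv: kv[1])
            del counts[weakest]
            counts[item] = 1
    return counts
```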

General discussion

We presented results from three adult artificial language segmentation experiments. In each of these experiments we varied one basic aspect of the composition of the language that participants learned while holding all others constant, producing a set of three average performance curves across a wide variety of input conditions. The results in all three conditions were intuitive: longer sentences, less data, and a larger number of words all made languages harder to segment. However, a variety

Acknowledgements

We gratefully acknowledge Elissa Newport and Richard Aslin for many valuable discussions of this work and thank LouAnn Gerken, Pierre Perruchet, and two anonymous reviewers for comments on the paper. Portions of the data in this paper were reported at the Cognitive Science conference in Frank, Goldwater, Mansinghka, Griffiths, and Tenenbaum (2007). We acknowledge NSF Grant #BCS-0631518, and the first author was supported by a Jacob Javits Graduate Fellowship and NSF DDRIG #0746251.

References (56)

  • E. Newport et al. (2004). Learning at a distance I. Statistical learning of non-adjacent dependencies. Cognitive Psychology.
  • P. Perruchet et al. (1998). PARSER: A model for word segmentation. Journal of Memory and Language.
  • J.R. Saffran et al. (1999). Statistical learning of tone sequences by human infants and adults. Cognition.
  • J.R. Saffran et al. (1996). Word segmentation: The role of distributional cues. Journal of Memory and Language.
  • D. Swingley (2005). Statistical clustering and the contents of the infant vocabulary. Cognitive Psychology.
  • J. Aitchison (2003). Words in the mind: An introduction to the mental lexicon.
  • R.N. Aslin et al. (1998). Computation of conditional probability statistics by 8-month-old infants. Psychological Science.
  • T. Brady et al. (2008). Statistical learning using real-world scenes: Extracting categorical regularities without conscious intent. Psychological Science.
  • M.R. Brent (1999). An efficient, probabilistically sound algorithm for segmentation and word discovery. Machine Learning.
  • C.M. Conway et al. (2005). Modality-constrained statistical learning of tactile, visual, and auditory sequences. Journal of Experimental Psychology: Learning, Memory, and Cognition.
  • N. Cowan (2001). The magical number 4 in short-term memory: A reconsideration of mental storage capacity. Behavioral and Brain Sciences.
  • N. Daw et al. The pigeon as particle filter.
  • P. Dayan et al. Explaining away in weight space.
  • A. Doucet et al. (2001). Sequential Monte Carlo methods in practice.
  • T. Dutoit, V. Pagel, N. Pierret, F. Bataille, & O. Van Der Vrecken (1996). The MBROLA project: Towards a set of...
  • J. Fiser et al. (2002). Statistical learning of new visual feature combinations by infants. Proceedings of the National Academy of Sciences.
  • M. Frank et al. (2007). Modeling human performance on statistical word segmentation tasks.
  • A. Gelman et al. (2004). Bayesian data analysis.