Neural Networks, Volume 117, September 2019, Pages 249-267
A neural network architecture for learning word–referent associations in multiple contexts

https://doi.org/10.1016/j.neunet.2019.05.017

Abstract

This article proposes a biologically inspired neurocomputational architecture which learns associations between words and referents in different contexts, considering evidence collected from the Psycholinguistics and Neurolinguistics literature. The multi-layered architecture takes as input raw images of objects (referents) and streams of words' phonemes (labels), builds an adequate representation, recognizes the current context, and associates labels with referents incrementally, by employing a Self-Organizing Map which creates new association nodes (prototypes) as required, adjusts the existing prototypes to better represent the input stimuli, and removes prototypes that become obsolete or unused. The model takes the current context into account to retrieve the correct meaning of words with multiple meanings. Simulations show that the model can reach up to 78% word–referent association accuracy in ambiguous situations and that it approximates well the learning rates of humans as reported by three different groups of authors in five Cross-Situational Word Learning experiments, also displaying similar learning patterns across the different learning conditions.

Introduction

Language is surely a vital and distinctive trait of human beings. Even though language acquisition by young children is an active research topic in the cognitive sciences, a number of open issues persist despite the achievements of the field. For instance, we do not know exactly how humans acquire the meaning of words, an essential part of the language acquisition process. In this article, we propose a word learning model composed of a set of neural modules, or schemas (Arbib, 2008), that simultaneously compete and cooperate to perform higher-level tasks. The model was designed in light of the evidence from the neurolinguistics and psycholinguistics literature about the word learning capabilities displayed by humans. As a result, the proposed model is able to simulate multiple statistical characteristics displayed by humans when they learn new words.

We assume that word learning may be studied in isolation from other aspects of language acquisition, such as the acquisition of grammar, semantics, and pragmatics. According to Bloom (2002), in order to learn the meaning of a word, an individual must learn three different elements: (i) the concept or meaning of the word (referent); (ii) the sound or lexical representation of the word (label); and (iii) the association between referent and label. Each of these challenging tasks is addressed in this article.

A classic example (Quine, 1960) illustrates the difficulties that children and foreign language learners face when trying to correctly match words and referents. When a native speaker of an unknown language sees a white rabbit and pronounces “gavagai”, one might take this as clear evidence that the word “gavagai” means rabbit. However, the sound could also mean “white”, “furry”, “food”, “let's go hunting”, or even something completely unrelated to rabbits, such as “it is going to rain”. The expression “gavagai” could even be a composition of two or three words with their own meanings.

One possible strategy to address the problem described by Quine (1960) is known as “cross-situational word learning” (CSWL) (Yu & Smith, 2007). In this type of learning, words are not learned after a single exposure; instead, the learning process combines information from multiple learning trials. Thus, a learner who is unable to decide unambiguously the meaning of a word after a single trial forms tentative knowledge that is further strengthened or weakened by new evidence.
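To make this strategy concrete, the following minimal sketch (our own illustration, not a specific model from the CSWL literature) accumulates co-occurrence evidence across trials; the learner cannot resolve “gavagai” after one ambiguous trial, but repeated co-occurrence eventually singles out the rabbit:

```python
from collections import defaultdict

class CrossSituationalLearner:
    """Accumulates word-referent co-occurrence evidence across trials."""

    def __init__(self):
        self.counts = defaultdict(float)  # (word, referent) -> evidence

    def observe_trial(self, words, referents):
        # Within a trial, every word co-occurs with every referent,
        # so each candidate pairing receives a small amount of evidence.
        for w in words:
            for r in referents:
                self.counts[(w, r)] += 1.0

    def best_referent(self, word, referents):
        # The referent that co-occurred most often with the word wins.
        return max(referents, key=lambda r: self.counts[(word, r)])

learner = CrossSituationalLearner()
learner.observe_trial(["gavagai", "blicket"], ["rabbit", "tree"])
learner.observe_trial(["gavagai", "dax"], ["rabbit", "river"])
print(learner.best_referent("gavagai", ["rabbit", "tree", "river"]))  # rabbit
```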

It is currently argued that word learning requires a set of cognitive abilities that are not yet fully understood (Bloom, 2002), such as theory of mind (the ability to simulate and understand the thoughts of others), concept acquisition, and fast mapping (the ability to associate referents and labels within few, or even single, trials). In this article, we focus on the last two abilities of this list.

Concept acquisition may be seen as the ability to recognize and group similar referents together so that the category itself (concept) could be further associated with a label. Harnad (2005) points out that “To Cognize is to Categorize” and Perlovsky (2006) describes the mind as a hierarchy of multiple layers of concept-models, from simple elements like edges or moving dots to more abstract concept-models of objects, relationships, complete scenes, and so on.

The proposed model is compatible with these views because it frames the learning tasks mentioned above as a subspace clustering problem (Bassani and Araujo, 2015, Hu and Pei, 2018, Kriegel et al., 2005), in which the cluster prototypes capture the concept-models. In its current state, the model focuses on the lower levels of the concept-model hierarchy mentioned by Perlovsky, learning the referents, labels, and their associations for concrete nouns that can be depicted in static images, such as chair, table, and pen, in their different usage contexts (basic concept-models). The model learns such elements incrementally by creating new prototype nodes as required, adjusting the existing prototypes to better represent the auditory and visual input stimuli, and removing prototypes that become obsolete or unused.
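The sketch below illustrates this create/adjust/remove dynamic in its simplest form; the radial activation function, the threshold, and the update rule are simplifying assumptions of ours, not the actual LARFDSSOM equations:

```python
import numpy as np

class IncrementalPrototypeMap:
    """Toy incremental prototype map: create, adjust, and prune nodes."""

    def __init__(self, activation_threshold=0.7, learning_rate=0.1):
        self.prototypes = []   # prototype vectors (np.ndarray)
        self.wins = []         # usage counters, used for pruning
        self.a_t = activation_threshold
        self.lr = learning_rate

    def _activation(self, proto, x):
        # Radial activation in (0, 1]; equals 1 when x matches the prototype.
        return float(np.exp(-np.linalg.norm(proto - x)))

    def present(self, x):
        x = np.asarray(x, dtype=float)
        if self.prototypes:
            acts = [self._activation(p, x) for p in self.prototypes]
            winner = int(np.argmax(acts))
            if acts[winner] >= self.a_t:
                # Adjust the winning prototype toward the stimulus.
                self.prototypes[winner] += self.lr * (x - self.prototypes[winner])
                self.wins[winner] += 1
                return winner
        # No node represents the stimulus well enough: create a new one.
        self.prototypes.append(x.copy())
        self.wins.append(1)
        return len(self.prototypes) - 1

    def prune(self, min_wins=2):
        # Remove prototypes that became obsolete/unused.
        keep = [i for i, w in enumerate(self.wins) if w >= min_wins]
        self.prototypes = [self.prototypes[i] for i in keep]
        self.wins = [self.wins[i] for i in keep]

m = IncrementalPrototypeMap()
m.present([0.0, 0.0])   # creates node 0
m.present([0.1, 0.0])   # similar stimulus: adjusts node 0
m.present([5.0, 5.0])   # distant stimulus: creates node 1
```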

To achieve this, we specify a neurocomputational architecture composed of four layers: (i) the first layer extracts perceptions from raw visual data (the referents) and auditory data (the labels); (ii) the second layer creates a more suitable representation for labels and referents; (iii) the third layer recognizes the current context; and (iv) the fourth layer creates the associations between labels and referents in the different contexts in which they are used, thus forming the prototypes that represent the basic concept-models learned by the model.
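Purely as a structural illustration of this data flow, the four layers can be sketched as a function pipeline; every function below is a stand-in stub of our own, not one of the actual modules described in Section 4:

```python
import numpy as np

def extract_perception(raw):                 # layer (i): sensory -> perception
    return np.asarray(raw, dtype=float)

def represent(percept):                      # layer (ii): suitable representation
    n = np.linalg.norm(percept)
    return percept / n if n else percept

def recognize_context(label_r, referent_r):  # layer (iii): context recognition
    # Toy context id derived from the (rounded) referent representation.
    return hash(tuple(np.round(referent_r, 1))) % 10

def associate(label_r, referent_r, context): # layer (iv): association
    return {"context": context, "pair": (label_r, referent_r)}

def process_stimulus(image_vec, phoneme_vec):
    referent_r = represent(extract_perception(image_vec))
    label_r = represent(extract_perception(phoneme_vec))
    return associate(label_r, referent_r, recognize_context(label_r, referent_r))

result = process_stimulus([0.2, 0.7, 0.1], [0.9, 0.3, 0.4])
```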

In order to evaluate the proposed model, we simulate the CSWL experiments carried out with human beings by Trueswell, Medina, Hafri, and Gleitman (2013), Yu and Smith (2007), and Yurovsky, Yu, and Smith (2013). These experiments provide sound evidence on the operation of word learning mechanisms. Any model aiming to represent the functioning of these learning mechanisms must be able to reproduce, to some extent, the word learning patterns described in the following paragraphs.

Yu and Smith (2007) designed experiments to evaluate the ability of humans to acquire correct word–referent pairings and found compelling evidence that adult humans are able to learn label–referent pairings through CSWL. In their experiments, the stimuli consisted of slides containing 2, 3, or 4 pictures of unusual objects paired with 2, 3, or 4 pseudowords presented in auditory form. These artificial words were generated by a computer program using standard English phonemes. The label–referent pairs were formed from randomly chosen unique objects and were used in three training conditions with different degrees of ambiguity.

The training conditions differ only in the number of labels and referents simultaneously presented to the subjects. Fig. 1 illustrates a 4 × 4 condition, in which four objects (referents) were presented simultaneously on the screen while the sounds of 4 pseudowords (labels) were played from the speakers. The results showed that the individuals were able to discover, on average, more than 16 of the 18 pairs in the 2 × 2 condition and more than 13 of the 18 pairs in the 3 × 3 condition.

Yurovsky et al. (2013) expanded the previous experiment including situations in which labels could be associated with more than one referent. They were interested in evaluating if there was competition occurring in the learning process and if it was local (among referents presented in the same trial) or global (among referents presented in different trials). Their results suggested that global competition is most likely to occur.

The computational models proposed in the literature for CSWL can be divided into two categories (Yu & Smith, 2007): the Hypothesis-Testing Models, in which the learner maintains a list of hypothesized pairings to be further confirmed or rejected due to a mutual exclusivity constraint, and the Associative Models, a basic form of Hebbian learning which strengthens the associations between observed word–referent pairs.

Trueswell et al. (2013) designed experiments to compare the two hypotheses, and their results suggested that subjects did not keep track of multiple candidate meanings for each label; hence, according to the authors, such experiments weaken the hypothesis that humans employ some kind of statistical learning of the word–referent pairings.
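For contrast with the associative sketch above, a single-hypothesis learner in the spirit of the “propose but verify” account of Trueswell et al. (2013) can be sketched as follows (a simplification of the idea, not the authors' exact procedure):

```python
import random

class ProposeButVerifyLearner:
    """Keeps a single conjectured referent per word, replaced when disconfirmed."""

    def __init__(self, seed=0):
        self.hypothesis = {}  # word -> single conjectured referent
        self.rng = random.Random(seed)

    def observe_trial(self, word, referents):
        guess = self.hypothesis.get(word)
        if guess is None or guess not in referents:
            # No conjecture yet, or the old one was disconfirmed:
            # propose a fresh referent from the current trial.
            self.hypothesis[word] = self.rng.choice(referents)
        # Otherwise the conjecture is verified and simply retained.

learner = ProposeButVerifyLearner()
learner.observe_trial("gavagai", ["rabbit", "tree"])
learner.observe_trial("gavagai", ["rabbit", "river"])
print(learner.hypothesis["gavagai"])  # the single surviving conjecture
```

Unlike the associative learner, this learner stores no evidence about pairings it has not conjectured, which is what makes the two families empirically distinguishable.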

Current studies have focused on comparing these two modeling approaches in terms of how well they fit experimental data, but no consensus has emerged yet. For instance, Kachergis, Yu, and Shiffrin (2017) found that an associative model which includes competition between familiarity and uncertainty biases better reproduces the individual and combined effects of frequency and contextual diversity on human learning. Khoe, Perfors, and Hendrickson (2019) found that this associative model better captures the full range of individual differences and conditions when learning is cross-situational, although the hypothesis-testing approach outperforms it when there is no referential ambiguity during training.

The model proposed in this article differs from these studies by focusing on real-world data (raw images and phoneme sequences) and by employing a neural network architecture that can be used to simulate models of both categories, although in the present work the associative approach was adopted.

The obtained results show that the proposed model is able to replicate the patterns of CSWL presented by humans. Additionally, the proposed model was also tested in scenarios in which there was ambiguity about the correct word–referent pairings, with more than one valid association. We show that the model can take the context into account to resolve ambiguity and choose the correct referent for ambiguous words.

The following sections of this article are structured as follows: Section 2 discusses the Associationism theory and presents the experimental evidence on word–referent associations. Section 3 describes related models for language acquisition. Section 4 presents the proposed modular architecture for replicating the CSWL experiments, while Sections 5 and 6 detail the two neural network models employed in the learning tasks, LARFDSSOM and ART2 with Context. Section 7 describes the CSWL experiments performed by Trueswell et al. (2013), Yu and Smith (2007), and Yurovsky et al. (2013), along with the simulations carried out with the proposed model to replicate them. Finally, Section 8 discusses and summarizes the main conclusions drawn from the obtained results.

Section snippets

Associationism and experimental evidence about how humans learn word–referent associations

Associationism is one of the most widely held theories of learning, dating back to John Locke in the 17th century. According to it, learning is grounded in the human brain's sensitivity to covariation. Richards and Goldfarb (1986) proposed that children could learn the meaning of a word by repeatedly associating its verbal label with their perceptual experience at the time the label is used. For those perceptual properties that repeatedly co-occur with the label, the association strengthens.
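In its simplest form, this covariation-based strengthening can be written as a Hebbian-style update (our notation, for illustration only):

$$a_{w,r} \leftarrow a_{w,r} + \lambda \, x_w \, y_r$$

where $a_{w,r}$ is the association strength between label $w$ and perceptual property $r$, $x_w = 1$ if the label is heard in the current situation (0 otherwise), $y_r = 1$ if the property is perceived, and $\lambda > 0$ is a learning rate, so only co-occurring label–property pairs are strengthened.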

Previous language acquisition models based on self-organizing maps

Considering that children are able to acquire language without explicit feedback, several language acquisition models are based on unsupervised learning methods. Self-Organizing Maps (Kohonen, 1982) and Adaptive Resonance Theory (ART) (Grossberg, 1976a, Grossberg, 1976b) are two of the most prominent unsupervised learning neural networks. ART was employed for modeling human behavior in the task of memorizing word lists (Araujo et al., 2010, Pacheco, 2004), while several computational models

Proposed modular architecture

Fig. 2 illustrates the proposed architecture, which is stratified into four layers. The first two layers are composed of parallel modules specialized for each kind of stimulus (auditory or visual), while the third and fourth layers each contain one module performing multisensory integration. Below we present a general description of each layer:

  • A – Perception: It extracts relevant information (perceptions) from the sensory data. The sensory-perception mapping modules present in this layer

Subspace clustering with self-organizing maps

The Self-Organizing Map (SOM), proposed by Kohonen (1982), is a neural network trained with unlabeled data (unsupervised learning). It maps high-dimensional data onto a lower-dimensional (usually two-dimensional) grid of N nodes (or neurons), compressing information while preserving the topological relationships of the original data.
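A minimal sketch of the standard SOM training step follows; the grid size, learning rate, and neighborhood radius are arbitrary illustration values, not parameters from this work:

```python
import numpy as np

rng = np.random.default_rng(42)
grid_h, grid_w, dim = 8, 8, 3
weights = rng.random((grid_h, grid_w, dim))          # N = 64 node weight vectors
coords = np.dstack(np.meshgrid(np.arange(grid_h),
                               np.arange(grid_w), indexing="ij"))

def train_step(weights, x, lr=0.5, radius=2.0):
    # 1. Best-matching unit (BMU): the node whose weights are closest to x.
    dists = np.linalg.norm(weights - x, axis=2)
    bmu = np.unravel_index(np.argmin(dists), dists.shape)
    # 2. Gaussian neighborhood: nodes near the BMU on the grid learn more.
    grid_d2 = np.sum((coords - np.array(bmu)) ** 2, axis=2)
    h = np.exp(-grid_d2 / (2 * radius ** 2))
    # 3. Move weights toward x; neighborhood learning preserves topology.
    weights += lr * h[..., None] * (x - weights)

for _ in range(500):
    train_step(weights, rng.random(dim))             # unlabeled samples
```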

The following characteristics of SOM are worth highlighting here:

  • It creates an abstraction and a simplified representation of the input data distribution (Haykin,

Context formation and recognition with ART2

Since words can have different meanings in different contexts, taking context into account when recognizing words is a fundamental task in word learning. In this work, we employ for this task a neural network called ART2 with Context (Araujo et al., 2010), based on ART2 (Carpenter & Grossberg, 1987), a model from the Adaptive Resonance Theory. This unsupervised incremental learning network is capable of grouping patterns, associating stimuli of different natures, and adjusting the degree of
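As a generic illustration of the resonance/vigilance mechanism underlying this family of models (not the actual ART2-with-Context equations), a category search step can be sketched as:

```python
import numpy as np

class SimpleARTLikeClusterer:
    """Toy ART-style matcher: resonate with a category or create a new one."""

    def __init__(self, vigilance=0.8, learning_rate=0.5):
        self.categories = []      # stored category prototypes
        self.rho = vigilance      # match threshold in [0, 1]
        self.lr = learning_rate

    def _match(self, proto, x):
        # Cosine similarity as a stand-in for ART's match function.
        return float(np.dot(proto, x) /
                     (np.linalg.norm(proto) * np.linalg.norm(x) + 1e-12))

    def present(self, x):
        x = np.asarray(x, dtype=float)
        # Search existing categories from best match to worst.
        for j in sorted(range(len(self.categories)),
                        key=lambda j: -self._match(self.categories[j], x)):
            if self._match(self.categories[j], x) >= self.rho:
                # Resonance: the matching category (e.g., a context) is adapted.
                self.categories[j] += self.lr * (x - self.categories[j])
                return j
        # No category passes the vigilance test: a new one is created.
        self.categories.append(x.copy())
        return len(self.categories) - 1
```

Here each stored category plays the role of a context; a higher vigilance value yields more, finer-grained contexts.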

Simulations

The simulations aimed to reproduce the CSWL experiments available in the literature, following the methodology introduced by Yu and Smith (2007) and further extended by others. Sections 7.3 to 7.7 describe the CSWL experiments considered in this work and the respective simulations carried out with the proposed model. Section 7.2 describes the dataset used in the simulations. Notice that we employ the term “Experiment” to refer to the actual experiments carried out by Trueswell

Discussion

The experimental paradigm of cross-situational word learning has proven to be a very useful tool for evaluating hypotheses about the mechanisms that allow us to learn word–referent associations. The model described in this article was proposed considering pieces of evidence accumulated in psycholinguistics and neurolinguistics studies, organized in a modular architecture that allows us to better understand and communicate about the functions required for word–referent

Acknowledgments

The authors would like to thank CNPq (Conselho Nacional de Desenvolvimento Científico e Tecnológico), Brazil and FACEPE (Fundação de Amparo à Ciência e Tecnologia do Estado de Pernambuco), Brazil for supporting project #APQ-0880-1.03/14.

References (86)

  • Markman, E. M., et al. (1988). Children's use of mutual exclusivity to constrain the meanings of words. Cognitive Psychology.
  • Medeiros, H. R., et al. (2019). Dynamic topology and relevance learning SOM-based algorithm for image clustering tasks. Computer Vision and Image Understanding.
  • Miikkulainen, R. (1997). Dyslexic and category-specific aphasic impairments in a self-organizing feature map model of the lexicon. Brain and Language.
  • Plunkett, K. (1997). Theories of early language acquisition. Trends in Cognitive Sciences.
  • Plunkett, K., et al. (2008). Labels can override perceptual categories in early infancy. Cognition.
  • Soja, N. N., et al. (1991). Ontological categories guide young children's inductions of word meaning: object terms and substance terms. Cognition.
  • Trueswell, J. C., et al. (2013). Propose but verify: fast mapping meets cross-situational word learning. Cognitive Psychology.
  • Abdelsamea, M. M., Mohamed, M. H., & Bamatraf, M. (2015). An effective image feature classification using an improved...
  • Aggleton, J. P., et al. (1999). Episodic memory, amnesia, and the hippocampal-anterior thalamic axis. Behavioral and Brain Sciences.
  • Allen, J. (1994). Natural language understanding.
  • Araujo, A. F. R., et al. (2010). Occurrence of false memories: A neural module considering context for memorization of word lists.
  • Bassani, H. F., et al. Dimension selective self-organizing maps for clustering high dimensional data.
  • Bassani, H. F., et al. (2015). Dimension selective self-organizing maps with time-varying structure for subspace and projected clustering. IEEE Transactions on Neural Networks and Learning Systems.
  • Bloom, P. (2002). How children learn the meanings of words.
  • Born, R. T., et al. (2005). Structure and function of visual area MT. Annual Review of Neuroscience.
  • Bunce, J. P., et al. (2017). Finding meaning in a noisy world: exploring the effects of referential ambiguity and competition on 2.5-year-olds' cross-situational word learning. Journal of Child Language.
  • Carey, S. (1995). Conceptual change in childhood.
  • Carey, S., et al. (1978). Acquiring a single new word. Papers and Reports on Child Language Development.
  • Carpenter, G. A., et al. (1987). ART 2: self-organization of stable category recognition codes for analog input patterns. Applied Optics.
  • Collins, G. Visual co-orientation and maternal speech.
  • Dollaghan, C. (1985). Child meets word: “fast mapping” in preschool children. Journal of Speech and Hearing Research.
  • Feijoo, S., et al. (2017). When meaning is not enough: Distributional and semantic cues to word categorization in child-directed speech. Frontiers in Psychology.
  • Grossberg, S. (1976). Adaptive pattern classification and universal recoding: I. Parallel development and coding of neural feature detectors. Biological Cybernetics.
  • Grossberg, S. (1976). Adaptive pattern classification and universal recoding: II. Feedback, expectation, olfaction, illusions. Biological Cybernetics.
  • Guenther, F. H., et al. (1996). The perceptual magnet effect as an emergent property of neural map formation. Journal of the Acoustical Society of America.
  • Harnad, S. (2005). To cognize is to categorize: Cognition is categorization.
  • Harris, M., et al. (1983). The nonverbal content of mothers' speech to infants. First Language.
  • Haykin, S. (1998). Neural networks: A comprehensive foundation.
  • He, K., et al. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification.
  • Heibeck, T., et al. (1987). Word learning in children: An examination of fast mapping. Child Development.
  • Hu, J., et al. (2018). Subspace multi-clustering: a review. Knowledge and Information Systems.
  • Hubel, D., et al. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. Journal of Physiology.
  • Hubel, D., et al. (1977). Functional architecture of macaque visual cortex. Proceedings of the Royal Society B.