Simulating vocal learning of spoken language: Beyond imitation



Introduction
Investigating how children learn to speak is a promising path towards understanding the human speech system, language acquisition, and cognition. With increases in computational power and advances in the speech sciences, computational approaches have emerged as an important means of modelling this complex process (Dupoux, 2018). This is evident in recent work on vocal learning in particular. Simulations of early-stage babbling (Serkhane et al., 2007; Nam et al., 2013) have proposed mechanisms that may explain the typical distributions of early vocalisations in pre-linguistic learners (Davis and MacNeilage, 1995). Late-stage or canonical babbling is associated with the emergence of spoken language, including vowels and reduplicated consonant-vowel (CV) syllables (Oller and Eilers, 1988), and has been simulated as an imitative or goal-directed process (Bailly, 1997; Howard and Huckvale, 2005; Howard and Messum, 2007; Philippsen et al., 2014; Philippsen, 2021; Rasilo et al., 2013; Rasilo and Räsänen, 2017). Despite recent progress, auditory-acoustic imitation is confronted with two significant obstacles: the speaker normalisation (Howard and Huckvale, 2005; Rasilo and Räsänen, 2017) and correspondence (Philippsen, 2021; Messum and Howard, 2015) problems.
The normalisation problem refers to the difficulty of comparing phonologically equivalent utterances by different speakers. Differences in vocal tract size and shape between infants and adults make such comparisons especially difficult.

Fig. 1. Sources of information (dashed-line blocks) and the nature of sensory signals (between blocks) that are applicable to articulatory exploration. The shaded areas illustrate how grounded auditory perception can integrate information that has not been included in previous vocal learning simulations.

Fig. 1 presents a simplified view of the sources of information and the nature of sensory signals that are applicable to articulatory exploration. The shaded areas show how grounded auditory perception can naturally facilitate the inclusion of information sources needed to address the speaker normalisation and correspondence problems. That is, by combining the semantic context and ambient speech data from multiple speakers, the perceptual mapping can perform implicit speaker normalisation and produce a representation with phonological correspondence. By contrast, the unshaded areas represent aspects that have been discussed in previous work based on acoustic imitation (Bailly, 1997; Howard and Huckvale, 2005; Howard and Messum, 2007; Philippsen et al., 2014; Philippsen, 2021; Rasilo et al., 2013; Rasilo and Räsänen, 2017). This includes the possible role of caregiver interaction to provide semantic and articulatory information (Rasilo et al., 2013; Messum and Howard, 2015; Murakami et al., 2015) and the importance of somatosensory feedback (Tourville and Guenther, 2011). While computational works exist that construct a phonological perceptual space as a basis for vocal learning (Philippsen, 2021; Kröger et al., 2009, 2014; Barnaud et al., 2019), these simulations differ materially from the proposal in Fig. 1 in that they construct the perceptual space either through self-organisation of utterances generated by the learner, or from the stimuli of a single external speaker who serves as teacher or ''master agent''. This does not consider the effect that (1) multiple speakers in ambient speech stimuli and (2) perceptual development before the onset of vocal exploration may have on learning.
In this work, we simulate articulatory exploration, or babbling, driven by auditory-perceptual goals which include the sources of information highlighted in Fig. 1. We do not model the ontogenesis of the perceptual mapping but construct a functional equivalent based on an existing speech corpus to investigate the use of perceptual representations, auditory percepts, as a basis for goal-directed vocal exploration. Specifically, we seek to answer the following questions:
1. Can auditory percepts be used to reproduce individual utterances by different speakers regardless of their vocal tract characteristics?
2. If the encoding of the relevant phonological units is known, can auditory percepts be used to find the articulation of ideal phonetic realisations? That is, to produce utterances that represent phonologically appropriate generalisations over the speech inputs they are derived from.
We answer these questions by subjecting the outputs of the simulation to recognition experiments (see Section 4) to obtain a quantitative measure of success and show that both of these outcomes are possible.
The empirical results confirm that goal-directed vocal exploration can succeed on the basis of reproducing low-dimensional auditory percepts. The perceptual mapping used in this work may also be a natural way of aggregating auditory experience to support autonomous exploration and vocal learning from memory.

Approach
We view articulatory exploration in abstract terms such that it is also directly applicable in the field of articulatory speech synthesis. The process is formulated as an optimisation task using the VocalTractLab (VTL) articulatory synthesiser (Birkholz, 2005, 2013) to produce candidate utterances. The objective function combines articulatory specifications and an auditory-perceptual mapping to evaluate articulatory gestures. An outline of the approach is illustrated in Fig. 2 and motivated in the following subsections.

Articulatory exploration
In our simulation, goal-directed exploration is the process of minimising auditory and articulatory losses to discover a phonologically appropriate utterance. The central block in Fig. 2 is the global optimisation task

g* = arg min_{g ∈ G} L(g)

of finding an articulatory gesture g* that minimises the loss function L, which depends on the speaker vocal tract and the auditory-perceptual mapping described in Section 2.2. This is done by the optimisation algorithm sampling each gesture g from the articulatory space G, which is also determined by the speaker model.

Auditory perceptual objectives
We construct an auditory-perceptual mapping that is functionally equivalent to the proposal in Fig. 1, by means of two practical restrictions. Firstly, in the absence of raw stimuli from ecological environments as proposed in Dupoux (2018), we make use of a transcribed speech recognition corpus. This data source satisfies both of the conditions to construct a grounded mapping: it contains the linguistic (semantic) context associated with the speech signal and utterances from multiple speakers. Secondly, the mapping is derived before the onset of the simulation and not updated continuously throughout the process. These restrictions should not affect the validity of the conclusions regarding the questions raised in Section 1 and can, in principle, be relaxed to expand the scope of future work. Furthermore, we are not concerned with modelling the emergence of a phonological representation, but focus on the task of vocal exploration. With this aim, we adopt the syllable as the unit of perception for the following reasons:
1. It provides the necessary context to account for signal variation known from acoustic-phonetic descriptions of coarticulation (Adriaans, 2018; Liu et al., 2022).
2. It corresponds directly to the articulatory representations that we adopt in this work on articulatory-phonetic grounds. That is, a single percept would correspond to a fixed set of articulatory targets per syllable (refer to Section 2.4).
3. Syllables are plausible early perceptual units that may be involved in overcoming the segmentation problem (Jusczyk, 1997; Räsänen et al., 2018) and require further attention in computational studies of language acquisition (Schatz et al., 2021; Räsänen, 2012).

Fig. 2. A process for discovery of phonological articulatory gestures. The central exploration task is goal-directed and relies on articulatory sampling, speech production, and auditory and articulatory objectives (Section 2). The speech production component is implemented using VTL and is responsible for generating two different types of output given the speaker model and sampled articulatory targets: (1) synthesised audio representing a candidate utterance, which is processed by the syllable encoder (Section 3.1) to evaluate the auditory-perceptual objective, and (2) the vocal tract tube areas and transfer function, which are used to implement the articulatory objectives (Section 3.3).
The result is a syllable encoder that consumes a pre-segmented acoustic signal and produces an auditory percept vector or embedding that is used as objective during optimisation.

Articulatory objectives
Articulatory information that forms part of the optimisation goal in our simulation can be described as either somatosensory or regularisation objectives.
Somatosensory objectives represent specifications, obtainable from non-auditory signals, that the learner can consciously monitor through somatosensory feedback (Nasir and Ostry, 2006) once a correspondence between external stimuli and imitating actions is established (Brass and Heyes, 2005). For example, it is known that sighted vocal learners may benefit from visual examples of articulation (Mills, 1988) from which they may learn to expect the somatosensory feedback associated with lip closure. We do not model the process of establishing articulatory correspondence between the different senses and somatosensory targets but simply substantiate their inclusion.
Regularisation objectives may originate in physiologically motivated processes (Serkhane et al., 2007;Nam et al., 2013) or constraints (Oohashi et al., 2013) that are not under conscious control, but nevertheless determine the typical articulatory solution space. Regularisation objectives are explicit in our work since we rely on a general optimisation algorithm with uniform priors and VTL only provides basic physical restrictions. That is, the synthesiser does not implicitly enforce articulatory coordination or implement higher-level organisation such as the direct control of constriction that may be relevant in natural speech production (Saltzman and Munhall, 1989).

Speech production
To produce articulatory trajectories, we use the target-approximation model (TAM) (Xu and Wang, 2001) implemented using a 5th-order critically damped linear system (Birkholz et al., 2018). This model is implemented in VTL, allowing the complete specification of trajectories, and synthesis of utterances, from a set of articulatory targets, durations for each segment, and time constants which correspond to ''articulatory effort'' (Birkholz, 2007).
This compact parameterisation of articulatory dynamics combined with assumptions of synchronisation (Birkholz et al., 2011) has enabled the discovery of simple CV syllables using derivative-free optimisation by significantly reducing the associated degrees of freedom (Xu et al., 2019; Van Niekerk et al., 2020). Furthermore, articulatory targets may be more relevant phonologically than complete trajectories (Turk and Shattuck-Hufnagel, 2020) and provide a means of decoupling temporal parameters that vary over different prosodic forms and speech rates (Birkholz et al., 2011).
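To make the target-approximation dynamics concrete, the sketch below approximates a critically damped Nth-order linear system as a cascade of N first-order exponential smoothers driven by a step to the articulatory target. The time constant, sampling step, and discretisation are illustrative assumptions, not VTL's exact implementation.

```python
# Sketch of a target-approximation trajectory: a critically damped
# N-th order system approximated by N cascaded first-order smoothers.
# tau plays the role of the "articulatory effort" time constant.

def tam_trajectory(start, target, tau, duration, dt=0.001, order=5):
    """Discretised approach from `start` towards `target`."""
    n_steps = int(duration / dt)
    alpha = dt / (tau + dt)          # first-order smoothing coefficient
    stages = [start] * order         # state of each cascaded stage
    out = []
    for _ in range(n_steps):
        x = target                   # step input: the articulatory target
        for i in range(order):
            stages[i] += alpha * (x - stages[i])
            x = stages[i]            # feed each stage into the next
        out.append(x)
    return out
```

With a higher order, the trajectory starts slowly and accelerates before settling on the target, which is the smooth, inertia-like behaviour the 5th-order system provides.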

Auditory perceptual mapping
To construct the syllable encoder we used supervised learning to find a mapping from speech signals of up to 1 second in duration to fixed-size vectors that encode each syllable type uniquely. The clean subset of the Librispeech speech recognition corpus (Panayotov et al., 2015) was annotated using the CMU pronunciation dictionary (Carnegie Mellon University, 2000), with a phoneset appropriate for American English, to extract CV syllables. Combinations of 3 consonants and 15 vowels (including diphthongs) were included in the selection to represent a complete set of vowels and a minimal set of consonants for the experimental conditions described in Section 4 (Table 2). Using this dataset, containing 487 male and 453 female speakers, a recurrent neural network was trained to map a sequence of mel-frequency cepstral coefficient vectors (MFCCs) to a single output vector, defined as the concatenation of the one-hot encoded phonetic labels in the training data.
While the training data labels are categorical, representing ideal points in the output space, the model was trained using the mean squared error (MSE) loss to construct a regression model. It is therefore interpreted as a mapping between acoustic realisations of CV syllables and points in an 18-dimensional continuous perceptual space. Since utterances from many speakers share each label, the mapping must perform implicit speaker normalisation, that is, the output space should be speaker-invariant. See Appendix A.1 for a more complete description of the dataset, model architecture and signal processing involved. This forms an auditory goal-space used to specify a goal percept p_goal which can be compared to each trial percept p, obtained during exploration, using a metric such as the Euclidean distance:

L_aud(p) = ||p_goal − p||_2.    (2)

In our experiments (Section 4), the values of p_goal are obtained either by mapping a specific acoustic realisation to the perceptual space (Experiment 1) or by specifying ideal syllables in terms of their one-hot encoding (Experiment 2). For example, an ideal target for the syllable /bae/, as in ''bad'', would correspond to a vector [1, 0, 0, 1, 0, …, 0] given that /b/ is the first component for the 3 consonants and /ae/ the first component for the 15 vowels.
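The goal-space construction above can be sketched as follows. The consonant and vowel inventories are placeholders: the text fixes /b/ and /ae/ as the first component of each group, while the remaining symbols are assumed for illustration only.

```python
import math

# Sketch of the 18-dimensional perceptual goal space: concatenated
# one-hot encodings of 3 consonants and 15 vowels. Only /b/ and /ae/
# are placed per the text; the other inventory entries are hypothetical.
CONSONANTS = ["b", "d", "g"]                      # assumed ordering
VOWELS = ["ae"] + [f"v{i}" for i in range(14)]    # /ae/ first, rest placeholders

def ideal_percept(consonant, vowel):
    """One-hot goal vector for an ideal CV syllable, e.g. /bae/."""
    vec = [0.0] * (len(CONSONANTS) + len(VOWELS))
    vec[CONSONANTS.index(consonant)] = 1.0
    vec[len(CONSONANTS) + VOWELS.index(vowel)] = 1.0
    return vec

def auditory_loss(goal, trial):
    """Euclidean distance between goal and trial percepts."""
    return math.sqrt(sum((g - t) ** 2 for g, t in zip(goal, trial)))
```

For /bae/ this yields the vector [1, 0, 0, 1, 0, …, 0] from the example in the text, and the loss of a percept against itself is zero.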

Speech production
For speech synthesis, VTL (Version 2.3, available at https://www.vocaltractlab.de) was used to realise articulatory targets with the ''JD2'' male speaker and geometric glottis model. While an experimental scaled-down ''child'' vocal tract model does exist, it is not guaranteed to be as realistic as JD2, which is based on magnetic resonance imaging (MRI) data of a real speaker, and the adult male speaker model suffices to answer the questions posited in Section 1.
Since we focused on investigating the upper vocal tract parameters and the production of voiced speech, the glottal parameters were kept constant at the preset values for ''modal voice'', with the exception of the chink area and relative amplitude, which were optimised to allow control of the voice onset time (VOT) and voice quality of the consonant (Abramson, 1977). All of the upper vocal tract parameters were optimised, except the velum opening (VO), which was kept closed since we did not account for utterances with nasality (Table 2), and the tongue root (TRX, TRY) parameters, which were derived from the tongue body values. The temporal aspects of an utterance were determined by a preset fixed duration for each of the two segments and the target-approximation trajectories for the transition between consonant and vowel targets, controlled by two free parameters: one time constant each for the glottal and upper vocal tract parameters. Table 1 details the full set of parameters and the corresponding optimisation configuration described above.

Articulatory objectives
The somatosensory objectives were implemented using proprioceptive or tactile feedback approximated by evaluating the VTL tube area function associated with each articulatory target. Three such objectives were used: the vowel, closure, and visual objectives (see Sections 4.1 and 4.2). Similarly, two regularisation objectives were implemented:
• Precision objective: prefers consonant targets with a precise closure at a single place of articulation.
• Coarticulation objective: prefers a smaller articulatory distance between the consonant and vowel targets.
Fig. 3 shows an example of consonant and vowel targets producing an utterance [dae] with their associated tube areas and the influence of the articulatory objectives described here. See Appendix A.2 for precise definitions of the loss terms.

Table 1. Target parameters for VTL's ''JD2'' speaker with geometric glottis model. Neutral (initial) parameters and optimisation ranges are shown; single values in the Range column indicate constants.
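The coarticulation term can be sketched as a simple distance between the two articulatory target vectors; the precise weighted form is defined in Appendix A.2, so the unweighted Euclidean distance here is an illustrative assumption.

```python
import math

# Sketch of the coarticulation regularisation term: it penalises the
# distance between the consonant and vowel articulatory target vectors,
# preferring gestures where the consonant is produced "close to" the
# vowel configuration. The unweighted distance is an assumption.

def coarticulation_loss(consonant_target, vowel_target):
    return math.sqrt(sum((c - v) ** 2
                         for c, v in zip(consonant_target, vowel_target)))
```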

Optimisation process
We used the Tree-structured Parzen Estimator (TPE) approach (Bergstra et al., 2011) as the optimisation algorithm to drive the articulatory sampling, as implemented in the hyperopt software package (v0.2.5, https://github.com/hyperopt/hyperopt). The loss function was defined as a linear combination of the auditory and articulatory loss functions described earlier (term weights were not optimised, but all articulatory loss terms were scaled by a factor of 0.2 to ensure that the auditory loss dominates). The optimisation algorithm samples articulatory targets which are synthesised by VTL, and each sample is evaluated by the auditory-perceptual mapping and tube area function to determine the resulting loss. The optimisation algorithm was configured to sample the first 5% of trials uniformly, including the neutral position. To improve the computational efficiency of the process, synthesis and auditory evaluation are only performed when the somatosensory objectives are satisfied. In the case of failure to achieve these objectives, the loss function is set to an arbitrarily large value proportional to the articulatory loss.
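The gated evaluation described above can be sketched as below. `synth_and_encode`, `somato_ok`, and the loss callables stand in for VTL, the tube area checks, and the encoder; the `PENALTY` constant and uniform random search (in place of TPE) are assumptions for illustration, while the 0.2 articulatory weight follows the text.

```python
import random

# Sketch of the gated loss: synthesis and auditory scoring only run
# when the somatosensory objectives are met; otherwise a large penalty
# proportional to the articulatory loss is returned.
ART_WEIGHT = 0.2      # scaling so the auditory loss dominates (per the text)
PENALTY = 1e3         # assumed "arbitrarily large" base penalty

def evaluate(gesture, goal_percept, somato_ok, art_loss, synth_and_encode, aud_loss):
    l_art = art_loss(gesture)
    if not somato_ok(gesture):
        return PENALTY * (1.0 + l_art)      # skip synthesis entirely
    percept = synth_and_encode(gesture)
    return aud_loss(goal_percept, percept) + ART_WEIGHT * l_art

def explore(space_sample, loss_fn, n_trials=100, seed=0):
    """Stand-in for TPE: uniform random search over the gesture space."""
    rng = random.Random(seed)
    best, best_loss = None, float("inf")
    for _ in range(n_trials):
        g = space_sample(rng)
        l = loss_fn(g)
        if l < best_loss:
            best, best_loss = g, l
    return best, best_loss
```

In the real system the sampler would be hyperopt's TPE suggestion step rather than uniform sampling, but the gating logic is the same.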

Experimental setup
We designed two experiments to address the questions posed in Section 1:
• Experiment 1: The simulation is set up to find articulatory gestures that reproduce specific instances of CV utterances produced by male and female speakers. The proposed system based on the male vocal tract is compared with a baseline which uses acoustic matching instead of the syllable encoder described in Section 3.1.
• Experiment 2: The simulation is configured to find articulatory gestures that produce CV utterances with specific phonetic identities. Ideal auditory-perceptual objectives are defined and different sets of articulatory objectives are compared.
On the surface these tasks are distinct in character; however, we consider the success of the outcomes in terms of the production of phonologically relevant utterances. That is, the system should produce or reproduce goal utterances that are equivalent in the particular spoken language context. Consequently, the test utterances are taken from a specific set of words shown in Table 2. These words contain the variation in simple vowels and consonants considered, and all end with the coda consonant /d/ to reduce the complexity of the experiments. These word contexts are used in two ways to enable both tasks to be evaluated in terms of recognition experiments, that is, to obtain a quantitative measure of success related to the function of spoken language (see Section 4.3). Firstly, the goal utterances produced by the male and female speakers (Experiment 1) are extracted from recordings of these CVC words. Secondly, the CV gestures found by vocal exploration (Experiments 1 and 2) are embedded in the same CVC words by appending an articulatory target for the coda /d/ using the predefined set of articulatory parameters from VTL. Thus, each evaluated sample represents an instance of a known word from Table 2.

Experiment 1
Recordings of the words in Table 2 were made by a British male and a female speaker and manually segmented to extract the CV portion of each. Each of these template utterances was used to drive the proposed optimisation process by inputting the audio to the syllable encoder to obtain a goal percept. The exploration algorithm was set to produce trial utterances that were also processed with the syllable encoder to obtain their encoded percepts. The auditory loss was calculated using these two outputs (Eq. (2)). Note that although the syllable encoder was trained using American English data, it is used here to compare individual utterances relative to each other. That is, it is not important that the goal utterance does not correspond to a specific phonetic category in the perceptual space, as long as the language dialects are close enough. For articulatory objectives, only the vowel and closure somatosensory objectives that ensure a basic CV utterance were applied here. Neither the visual nor regularisation objectives were included.
For comparison, in the baseline system the auditory loss was replaced by the MSE calculated frame-by-frame over the sequences of acoustic features extracted for the template and trial utterances. The features used were identical to those used in the syllable encoder (Appendix A.1), and frame alignment was ensured by using the template segment durations during synthesis of the trial utterances.
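The acoustic-matching baseline reduces to a mean squared error over two duration-aligned feature sequences; a minimal sketch, assuming each sequence is a list of equal-length feature vectors:

```python
# Sketch of the acoustic-matching baseline: frame-by-frame MSE over
# two aligned MFCC sequences. Alignment is guaranteed in the text by
# synthesising trials with the template's segment durations.

def acoustic_mse(template_frames, trial_frames):
    assert len(template_frames) == len(trial_frames), "sequences must be aligned"
    total, count = 0.0, 0
    for tf, rf in zip(template_frames, trial_frames):
        for a, b in zip(tf, rf):
            total += (a - b) ** 2
            count += 1
    return total / count
```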

Experiment 2
The goal percepts in this experiment were the one-hot encoded representations of the CVs shown in Table 2. That is, using the chosen perceptual encoding to represent the consonant and vowel identities directly, without the use of template utterances. In this case, the goal utterances are generalised American English pronunciations of the CV syllables as derived from the transcribed speech corpus.
Here we make a comparison between different degrees of articulatory feedback to determine whether this benefits the expected outcomes under the constraint of a finite number of exploratory trials. Independent simulations were initiated to test the following sets of articulatory objectives:
• Minimal: includes only the vowel and closure objectives as in Experiment 1 and serves as a baseline condition.
• Visual: includes the minimal set of objectives and the visual objective. The fact that the visual objective depends on the consonant type implies a dependence on the auditory-perceptual goal.
• Visual + precise: adds the precision objective.
• Visual + precise + coart.: adds the coarticulation objective.
For the last condition the optimisation process involved two passes. In the first pass both the consonant and vowel targets are optimised using the Visual + precise configuration, followed by the second pass optimising only the consonant targets using Visual + precise + coart. with the vowel parameters from the first pass fixed. This has the effect that the coarticulation objective only affects the consonant targets given the vowel found in the first pass. Jointly optimising the consonant and vowel using the coarticulation objective can reduce the chances of finding an appropriate vowel target. Since the two-pass procedure is not directly comparable with the rest, the Visual + precise configuration was also applied in two passes for comparison with the coarticulation condition.

Evaluation
Two methods were used to recognise the synthesised single-word utterances (Table 2) and so determine their intelligibility:
• Automatic speech recognition (ASR) or speech-to-text is an inexpensive and objective mechanism which provides reliable results for VTL speech on this task (Van Niekerk et al., 2020). This allows rapid evaluation of experimental configurations and was used in both experiments.
• An open-ended transcription task that asks listeners to enter a word for each utterance. This is more expensive, with practical limitations, but offers more precise feedback and may be considered more relevant than ASR results. This method was only used in Experiment 2.
Since the exploratory process is non-deterministic, depending on the set of random initial trials, each experimental condition was evaluated through independent repeated experiments. That is, for each configuration described in Sections 4.1 and 4.2, the simulation was run N times for each of the words in Table 2 with different random seeds. The results presented in the next section are in terms of the mean accuracy or recognition rate over all the independent instances of the process and represent the expected intelligibility for the experimental condition over this set of words. For example, given the 13 words and N = 20 seeds, the recognition rate is calculated over 13 × 20 = 260 utterances. In all cases the best candidate, according to the objective function, was selected after the process sampled 5,000 valid utterances. That is, the process was terminated after synthesising a fixed number of utterances, excluding articulatory targets that did not satisfy the basic somatosensory objectives.
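The evaluation bookkeeping amounts to a proportion over 13 × 20 = 260 binary outcomes. A minimal sketch, assuming a normal-approximation 95% confidence interval for the error bars (the exact interval used in the figures is not specified):

```python
import math

# Sketch of the recognition-rate computation over repeated runs:
# mean proportion correct plus a normal-approximation 95% CI half-width
# for a binomial proportion (an assumption about the error bars).

def recognition_rate(outcomes):
    """`outcomes`: list of booleans, one per evaluated utterance."""
    n = len(outcomes)
    p = sum(outcomes) / n
    half_width = 1.96 * math.sqrt(p * (1 - p) / n)
    return p, half_width
```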
For the ASR evaluation, we used the state-of-the-art Google Speech-to-Text service. The service automatically determines the appropriate back-end model to use based on the input; however, we explicitly selected the language and dialect: British English for Experiment 1 and American English for Experiment 2. In addition to the audio samples, the set of 13 words in Table 2 was submitted as ''speech contexts''. This adapts the language model in favour of this set of words and is considered best practice for recognising short utterances. A single request was made to the service for each sample and the associated response was in the form of an ordered n-best list of orthographic transcriptions or an empty list (null result), which is interpreted as the rejection of an unintelligible utterance. The system responses were automatically post-processed to obtain a single transcription (or null result) for each sample by applying two operations: (1) only the most likely candidate transcription was retained from the n-best list, and (2) the text was normalised by lowercasing and removal of excess whitespace characters. For all ASR evaluations, N = 20 instances of each word were evaluated.
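The two post-processing operations on the ASR responses can be sketched directly:

```python
# Sketch of ASR response post-processing: keep only the top candidate
# from the n-best list (an empty list is a null result), then normalise
# by lowercasing and collapsing excess whitespace.

def postprocess_asr(n_best):
    if not n_best:
        return None                       # rejection of unintelligible audio
    top = n_best[0]
    return " ".join(top.lower().split())  # lowercase + strip excess whitespace
```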
The transcription task was set up as an online experiment using the Gorilla platform. Each listener was presented with randomised utterances consisting of N = 10 instances of each word from the baseline and best conditions (Minimal and Visual + precise + coart.). The process consisted of basic user and consent forms, a soundcheck to verify the use of headphones (Milne et al., 2020), a short practice transcription task, and the main transcription task. For the main task, listeners were expected to type in the word played back through headphones, or indicate if the utterance was unintelligible, after listening to it no more than 3 times. Care was taken not to include any implicit or explicit information about the content or quality of the utterances that could introduce bias in the responses. Therefore the practice transcription task contained unrelated 1- or 2-syllable words produced by a female speaker, and given the experimental setup it is reasonable to assume that the extent of listeners' prior expectation was to hear short single-word or unintelligible utterances. American English participants were recruited on Amazon Mechanical Turk, resulting in the following participant funnel: 110 visited the task; 57 passed the soundcheck and 43 submitted a completed task. Three of the completed tasks were excluded after manual inspection revealed anomalies in the response time and/or distributions of answers. The result was a total of 40 participants with valid responses. Responses were post-processed by performing two operations: (1) the text was automatically normalised by lowercasing and removal of excess whitespace characters, and (2) cases of unambiguous spelling or typographical errors were manually corrected. The conditions for applying a correction were that the initial response was not a valid word and that the set of possible corrections consisted of only one likely valid word (exceptions were made to refrain from changing responses where the listener may have attempted to transcribe disfluencies), for example, ''beed'' → ''bead'', ''gaurd'' → ''guard'', and so forth, but not ''booke'' → ''book''.

Fig. 4. Experiment 1: ASR recognition rates comparing the acoustic matching and syllable encoder auditory objectives for reproducing utterances by a male and female speaker (error bars indicate the 95% confidence intervals). The recognition rate for the female-acoustic utterances is significantly lower than the other conditions.
Transcriptions obtained from the ASR system typically consisted of 1-2 words or a null result, and most responses from human listeners were single words (compare Figs. 7 and 8, which are discussed later). To determine the recognition rate, transcriptions were either compared directly to the orthographic reference, referred to hereafter as the ''word level'', or to phonetic representations of the onset, vowel and coda. The latter were obtained by manually mapping the orthographic forms using the CMU dictionary as reference or, where this was not possible, to a null symbol. For this open-vocabulary transcription task, the set of possible outputs (transcriptions) and its prior probability distribution are unknown, preventing direct calculation of a chance-level recognition rate. However, for our experimental setup the expected recognition rate for a random utterance generator is less than 5% for the large-vocabulary ASR system and 2% for the human transcription task on the word level. Refer to Appendix B for a detailed note on the interpretation of absolute recognition rates. Similarly, vowel and onset recognition rates should be interpreted carefully, not as independent classification tasks, but rather as providing insight into their relative contributions to the word-level error rate. The analysis presented in Appendix B establishes that the recognition rates for all the test conditions are consequential. In the following section we therefore focus on the relative performance of different experimental conditions that address the questions posed in Section 1.
To quantify the significance of results throughout, effect sizes are reported in terms of Cohen's d, and two-tailed p values at the 95% level are obtained from Welch's unequal variances t-test.
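For reference, the reported statistics can be computed as follows; this is a standard-formula sketch (Cohen's d with pooled standard deviation and the Welch-Satterthwaite degrees of freedom), and obtaining p-values would additionally require a t-distribution CDF.

```python
import math

# Sketch of the reported statistics: Cohen's d (pooled SD) and Welch's
# unequal-variances t statistic with Welch-Satterthwaite degrees of freedom.

def _mean_var(x):
    n = len(x)
    m = sum(x) / n
    v = sum((a - m) ** 2 for a in x) / (n - 1)   # sample variance
    return n, m, v

def cohens_d(x, y):
    nx, mx, vx = _mean_var(x)
    ny, my, vy = _mean_var(y)
    pooled = math.sqrt(((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2))
    return (mx - my) / pooled

def welch_t(x, y):
    nx, mx, vx = _mean_var(x)
    ny, my, vy = _mean_var(y)
    se2x, se2y = vx / nx, vy / ny
    t = (mx - my) / math.sqrt(se2x + se2y)
    df = (se2x + se2y) ** 2 / (se2x ** 2 / (nx - 1) + se2y ** 2 / (ny - 1))
    return t, df
```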

Experiment 1
The results of Experiment 1 are presented in Fig. 4. Experimental conditions relying on the syllable encoder (whether male or female templates) or acoustic matching with the male templates result in similar recognition rates: no significant differences are found. This indicates that the syllable encoder performs comparably regardless of the sex of the speaker and that acoustic matching is also effective when the speaker sex is matched to the male vocal tract. By comparison, the word recognition rate when using acoustic matching against the mismatched (female) templates is significantly lower (d = 0.27, with t(501.56) = −3.32, p = .001), demonstrating the speaker normalisation problem. Further inspection of the results shows that the syllable encoder sustained or improved recognition rates of all vowels for the female templates and the high vowel /i/ for the male templates. For the male templates, the vowels /6/ and /I/ were significantly more successful when using acoustic matching. However, the types of errors made with acoustic matching were indicative of high variance compared to a consistent bias with the syllable encoder. That is, errors with the syllable encoder involved outputs that were perceptually close, whereas acoustic matching led to less predictable confusions.

Experiment 2
The results for Experiment 2 based on the ASR and online transcriber tasks are presented in Figs. 5 and 6 respectively. The ASR results show a trend of improvement with the inclusion of additional articulatory objectives, with a significantly higher word recognition rate for the precise objective compared to the baseline and visual conditions (d = 0.24 with t(517.87) = −2.74, p = .006 between Visual + precise and Minimal, and d = 0.20 with t(517.95) = −2.29, p = .022 between Visual + precise and Visual). There are no significant differences amongst the conditions that include regularisation objectives, despite the two-pass results involving double the number of trials of the consonant. It may be noted, however, that informal inspection and subsequent results do suggest that articulatory solutions are more ''prototypical'' with the regularisation objectives, especially in terms of coarticulation. The online transcription task (Fig. 6) also exhibits a significant improvement of outcomes for the two-pass process with precise and coarticulation objectives over the baseline (t(10381.89) = −2.60, p = .009 for the word recognition rates), albeit with a smaller measurable effect than in the ASR results (d = 0.05). There is no significant difference in recognition rate for the coda consonant, which is expected since all utterances were based on the same preset coda target for /d/. Lastly, a comparison of the word recognition rates for the ASR and online transcription tasks confirms the difference in the nature of the tasks, which is illustrated further in Figs. 7 and 8. Fig. 7 shows the full confusion matrix for the ASR results of the best performing condition. It is clear that the inclusion of the 13 words as ''speech contexts'' adjusts the prior probabilities in favour of these words, to the extent that an utterance is most likely to be recognised within this set or judged as unintelligible (the null result indicated with ''?'' in the figure). By contrast, Fig. 8 reflects an open transcription task without prior knowledge of the set of reference words; 571 unique responses were observed and listeners were less likely to label utterances as unintelligible. Even so, when the results are viewed in terms of their consonant and vowel constituents, the recognition rates on the two tasks are comparable.
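The word- and segment-level rates discussed above can be derived from confusion counts such as those in Figs. 7 and 8. A small sketch of this bookkeeping (the words, counts, and decomposition helpers below are hypothetical, not data from the figures):

```python
def recognition_rates(confusions, onset, vowel):
    """Word and segment recognition rates from a confusion matrix.

    confusions: {reference_word: {response: count}}; '?' marks the
    null (unintelligible) response. onset/vowel map a word label to
    its onset consonant and vowel, so segment-level hits can be
    credited even when the whole-word response is wrong.
    """
    word_hits = cons_hits = vowel_hits = total = 0
    for ref, responses in confusions.items():
        for resp, n in responses.items():
            total += n
            if resp == ref:
                word_hits += n
            if resp != "?":  # null results count against all rates
                cons_hits += n * (onset(resp) == onset(ref))
                vowel_hits += n * (vowel(resp) == vowel(ref))
    return word_hits / total, cons_hits / total, vowel_hits / total

# Hypothetical counts for a single reference word:
conf = {"bad": {"bad": 7, "dad": 2, "?": 1}}
word, cons, vow = recognition_rates(conf, onset=lambda w: w[0], vowel=lambda w: w[1])
# word-level 0.7; consonant 0.7; vowel 0.9
```

This mirrors how a wrong word response such as ''dad'' for ''bad'' still credits the vowel, explaining why segment-level rates on the two tasks can be comparable even when word-level rates differ.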
To contextualise the recognition rates obtained in this simulation, the intelligibility of natural speech can be evaluated using the ASR system to provide an expected upper bound for this experimental setup. For this purpose, approximately 20 instances of each of the target words (Table 2) were extracted from the Librispeech corpus. The resulting word accuracy is 80.6 ± 4.9%, compared to 55.4 ± 6.1% for the best configuration. Although the recognition rate for isolated words (not uttered in sentence context) may be higher, the syllable encoder on which the simulation is based has the same intrinsic limitation, i.e., it is trained on continuous speech. This recognition rate is comparable with the test set classification accuracy of 81.7% reported in Appendix A.1.
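If the ± figures are read as 95% normal-approximation (Wald) intervals on a proportion, an assumption here since the text does not name the interval type, the half-width follows directly from the accuracy and the number of test tokens:

```python
import math

def wald_half_width(p, n, z=1.96):
    """Half-width of a 95% normal-approximation (Wald) interval
    for a proportion p estimated from n trials."""
    return z * math.sqrt(p * (1 - p) / n)

# e.g. roughly 20 tokens of each of the 13 target words (n of about 260):
half = wald_half_width(0.806, 260)  # close to the reported +/-4.9%
```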

Discussion
The results for Experiments 1 and 2 are not intended to be comparable and should be viewed separately. Firstly, the tasks differ fundamentally due to different definitions of the goal percepts; secondly, the experimental conditions differ significantly: for example, the two different ASR systems used as evaluators will have distinct performance characteristics due to dialect and construction. It is, however, notable that there is a significant difference in the absolute recognition rates even for the baseline condition. This could be expected since the agent's auditory discrimination task and the evaluation task are more closely aligned in Experiment 2: (1) the goal percept is an ideal point in the American English perceptual space, whereas the template utterances are not guaranteed to be optimal British English examples, and (2) the perceptual spaces of the discriminator and evaluator are matched: both are American English. The remainder of this section discusses the results for the individual experiments with reference to the research questions posed in Section 1. Experiment 1 demonstrates that articulatory exploration using language oriented perception (Fig. 1) is more successful than acoustic matching at reproducing phonological utterances when a vocal tract mismatch exists. This suggests that the auditory-perceptual objective can be used in an interactive setting where the learner imitates an arbitrary caregiver's stimuli. A secondary observation is that, despite an output representation and training data based on the American English vowel space, the syllable encoder supports the reproduction of British English utterances with comparable success to acoustic matching when considering the male (matched vocal tract) condition. This confirms that it maps to a continuous perceptual space with the ability to represent (interpolate) vowels that are not characteristic of American English (see Section 3.1).
Experiment 2 demonstrates that low-dimensional auditory percepts can be used to produce utterances that reflect aggregated auditory experience. This may be useful for vocal learning in an autonomous setting or, when the mapping is known, to enumerate phonological units. Furthermore, although the perceptual mapping was only trained to discriminate three voiced consonants, it was possible to obtain reasonable recognition rates through the inclusion of basic articulatory objectives and glottal constraints. This suggests that an incomplete discriminative model can still be useful at early stages of development. Experiment 2 also shows that the inclusion of a regularisation objective that prefers more precise articulation of closures results in significantly better recognition rates. The reason for this should be investigated formally in future work; however, inspection of articulatory solutions suggests that imprecise or double articulations are perceptually ambiguous or sensitive to the articulatory effort controlled by the time constant parameter. Lastly, we have included a condition that prefers solutions where the onset consonant is maximally coarticulated with the vowel (Xu, 2020;Liu et al., 2022). The fact that this configuration is among the best performing conditions is further computational evidence for intra-syllable synchronisation (Xu et al., 2019;Van Niekerk et al., 2020).

Relationship to other work
The present work is related to other goal-directed simulations of babbling that produce spoken language utterances such as vowels and CVs (Bailly, 1997;Howard and Huckvale, 2005;Howard and Messum, 2007;Philippsen et al., 2014;Philippsen, 2021;Rasilo and Räsänen, 2017). However, a fundamental distinction of this study is the inclusion of language oriented perception that may affect how ambient language, including multiple speakers, influences vocal exploration (Fig. 1): (1) The ambient language may influence auditory perception early on, before the onset of late-stage babbling (Kuhl, 2004). (2) During the development of auditory perception, the learner may rely on multisensory signals to partially resolve some acoustic ambiguity (Frank et al., 2014). This clarifies the notion of a language oriented goal-space, which allows quantitative evaluation in terms of recognition-based experiments, an approach that had not previously been applied in this context.
An interactive process and the role of caregiver feedback during vocal learning have been proposed as mechanisms that may alleviate the speaker normalisation (Rasilo and Räsänen, 2017;Plummer, 2012) and correspondence problems (Messum and Howard, 2015). The present work does not preclude the integration of these information sources but asserts that earlier perceptual development should also be accounted for. Under the current conception, the benefits of interactive feedback could be described in terms of continuous development of the perception model with inputs and feedback from the caregiver and exploration process (see Fig. 1). It is also possible to interpret the auditory perception function in our simulation more abstractly, that is, as representing the joint auditory experience of the overall system (i.e., the learner-caregiver combination). In this case, an independent model that is updated during the learning process could represent the learner's own auditory experience, which should eventually approximate the joint experience so that the learner becomes independent.
The present study, and Fig. 1 in particular, explicitly draws links between simulations of babbling and computational work on speech perception (Frank et al., 2014;Schatz et al., 2021). While the focus in this article is on vocal exploration, our simulation presents a well-defined task and methodology based on quantitative evaluation that may be useful for testing assumptions about speech perception. The simplifications in perceptual modelling described in Section 2.2 may be relaxed to investigate the simultaneous development of perception and production, or the model can be constructed in different ways to investigate different points during development (Dupoux, 2018).
Our implementation of articulation, which relies on syllable synchronisation and the target-approximation model, is based on the idea that the syllable is central to simplifying the biomechanical and cognitive demands of articulatory coordination (Xu, 2020). This is supported by observations of intra-syllable coarticulation (Liu et al., 2022), which in turn is the primary reason for selecting the syllable as the temporal domain of perception (see Section 2.2). Furthermore, the alignment of articulatory and perceptual units has the advantage of allowing for a context-free mapping between their respective representations. This significantly simplifies the structure of mappings in either direction; that is, both forward and inverse models could be implemented using a simple feedforward network which maps between percepts and articulatory targets. This interface, forged during vocal learning, may be directly observable (Casile et al., 2011).
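The context-free mapping argued for here can be made concrete: with percepts and targets aligned per syllable, an inverse model reduces to a plain feedforward network from an 18-dimensional percept to the 38-dimensional target/time-constant vector of Appendix A.2, and the forward model to its mirror image. The single hidden layer and its size in this numpy sketch are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(dims):
    """Randomly initialised feedforward net as a list of (W, b) layers."""
    return [(rng.standard_normal((i, o)) * np.sqrt(2.0 / i), np.zeros(o))
            for i, o in zip(dims[:-1], dims[1:])]

def forward(net, x):
    for k, (W, b) in enumerate(net):
        x = x @ W + b
        if k < len(net) - 1:  # ReLU on hidden layers only
            x = np.maximum(x, 0.0)
    return x

inverse_model = mlp([18, 64, 38])  # percept -> articulatory targets
forward_model = mlp([38, 64, 18])  # articulatory targets -> percept

percept = rng.standard_normal(18)
targets = forward(inverse_model, percept)            # shape (38,)
predicted_percept = forward(forward_model, targets)  # shape (18,)
```

Trained on percept-target pairs collected during exploration, such networks would realise the forward and inverse models described in the paragraph above.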
Lastly, three important questions fall outside the scope of the current framework: (1) Adaptive control plays an important role in the speech motor system (Houde and Jordan, 1998) and has been the basis of successful simulations of articulatory control (Parrell et al., 2019). The importance of somatosensory signals is acknowledged in our work; however, there is no risk of external perturbations in our simulation. This means that it is possible to rely on the empirically derived kinematics of the target-approximation model (Xu and Wang, 2001;Birkholz, 2007) to determine articulatory dynamics. Moreover, modelling adaptive control would require a more complete physical simulation of the vocal tract plant, including properties such as mass and elasticity.
(2) Intrinsic motivation may be responsible for forming the developmental stages of speech acquisition (Moulin-Frier et al., 2014). That is, it could be viewed as a possible mechanism that initiates or controls instances of the current simulation. To be specific, it may determine how to enumerate the optimisation goals and/or replace the general optimisation algorithm used here. (3) Segmentation of continuous speech or the mapping to a sequence of percepts representing syllables is not considered here but assumed possible (Jusczyk, 1997;Räsänen et al., 2018).

Limitations and future work
Our experimental setup relied on synthesis of utterances with predefined segmental durations. This means that the system was constrained to evaluating spectral properties and only basic temporal features affected by articulatory effort. Duration and additional aspects of the glottal model need to be incorporated into the set of optimised parameters to cover a larger set of phonological units in English and other languages. Furthermore, duration and articulatory effort should be allowed to vary to accommodate differences in speaking rate and prosody (Birkholz et al., 2011). The simulation could also benefit from allowing tolerances instead of finding articulatory targets that depend on a specific value of duration or articulatory effort.
During the course of our experiments, we inspected plots of the articulatory solutions to compare the qualitative impact of different sets of regularisation objectives. As could be expected, including the objectives for precision and coarticulation leads to a greater proportion of ''prototypical'' solutions that correspond to articulatory phonetic descriptions. However, questions of articulatory correspondence are beyond the scope of the current work. Two questions could be addressed in future work: (1) what are the conditions (stimuli, constraints or processes) that could lead to establishing sets of articulatory objectives similar to those implemented here, and (2) how do the solutions found in the simulation compare to prototypical articulatory descriptions? For the latter, future work could attempt to quantify this objectively by selecting an appropriate articulatory reference dataset and implementing a procedure for comparing vocal tract configurations. Lastly, it may be necessary to investigate additional articulatory feedback mechanisms and exploration strategies that facilitate learning of more complex syllable types towards complete language coverage.

Conclusions
By considering computational work on speech perception and production, we have presented an implementation of vocal exploration which includes semantic, auditory, and articulatory signals. It was suggested that language oriented auditory-perceptual representations can facilitate the inclusion of these information sources to account for the speaker normalisation and phonological correspondence problems associated with imitative vocal learning, and the possibility of using such low-dimensional percepts was demonstrated. This approach extends existing work on vocal learning by constructing an appropriate goal-space for vocal learning and adopting a recognition-based methodology for quantitative evaluation. Moreover, the proposed optimisation-based framework was shown to be an effective way of exploring the vocal tract domain, which may be useful for self-generating grounded data for developing articulatory synthesisers in new languages or for learning forward and inverse models of articulation (Jordan and Rumelhart, 1992).

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability
Data will be made available on request.

Funding
This work was supported by the Leverhulme Trust, United Kingdom (Research Project Grant RPG-2019-241: ''High quality simulation of early vocal learning'').

A.1. Syllable encoder
More than 380,000 CV syllables were extracted from the ''train clean'' subset of the Librispeech corpus using phone-level forced alignments obtained with the Kaldi ASR toolkit (Povey et al., 2011) and partitioned into training, development and test sets as illustrated in Table A.4. From the raw audio, 12-dimensional Mel-frequency cepstral coefficients (including energy) without delta or acceleration coefficients were extracted every 5 ms in a 20 ms Hamming window (zero-padded to 512 samples at 22050 Hz) using librosa (McFee et al., 2015) and z-normalised using the statistics of the training data set. These sequences were pre-padded to have a length of 200 samples (spanning 1 s) and used as input for training the encoder. The model thus learns to map a sequence of these acoustic features, spanning a syllable, to a single 18-dimensional vector (as described in Section 3.1). For example, a syllable extracted from the corpus and transcribed as being part of the word ''bad'' will be assigned the pronunciation /bæ/, which has an ideal one-hot vector representation [1, 0, 0, 1, 0, … , 0] given that /b/ is the first component for the 3 consonants and /æ/ the first component for the 15 vowels (from Table 2).
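The pre-padding step described above can be sketched as follows, with random data standing in for real MFCC features; the front-truncation rule for over-long syllables is an assumption, not a detail from the text:

```python
import numpy as np

N_FRAMES, N_MFCC = 200, 12  # 200 frames at a 5 ms hop span 1 s

def prepad(seq, n_frames=N_FRAMES):
    """Left-pad a (T, 12) feature matrix with zeros to (n_frames, 12)."""
    t = seq.shape[0]
    if t >= n_frames:
        return seq[-n_frames:]  # assumption: keep the most recent frames
    out = np.zeros((n_frames, seq.shape[1]), dtype=seq.dtype)
    out[-t:] = seq  # syllable frames sit at the end; zeros lead
    return out

# A synthetic 87-frame "syllable" of z-normalised MFCCs:
syllable = np.random.default_rng(1).standard_normal((87, N_MFCC))
encoder_input = prepad(syllable)  # shape (200, 12), zeros first
```

Pre-padding (rather than post-padding) keeps the informative frames adjacent to the final recurrent state, a common choice for LSTM encoders.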
TensorFlow was used to train the network, consisting of 2 bidirectional long short-term memory (LSTM) recurrent network layers (Hochreiter and Schmidhuber, 1997) followed by 6 dense feedforward layers with dropout regularisation, as shown in Table A.3. Since we view the output simply as a point (embedding) in a continuous space, the activations of the output layer are used directly; that is, they are not normalised to represent probabilities over the output dimensions or parts thereof. Training proceeded with early stopping based on the validation set loss with a patience of 6 epochs. When applied as a classifier on the test data, the resulting model obtained an overall accuracy of 79.9%, with 96.5% and 81.7% for consonant and vowel identities respectively. This gives an indication of the quality of the data, labels and alignments, and the difficulty of the perception task, as well as an explicit upper limit for the experimental results described in Section 5.2.
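Reading the continuous 18-dimensional output as a classifier can be done by taking the argmax over the 3 consonant components and the 15 vowel components independently, consistent with the one-hot layout described above (that this is the exact decision rule behind the reported accuracies is an assumption):

```python
import numpy as np

N_CONSONANTS = 3  # first 3 components; remaining 15 are vowels

def classify(embedding):
    """Map an 18-dim embedding to (consonant_index, vowel_index)."""
    e = np.asarray(embedding)
    return int(np.argmax(e[:N_CONSONANTS])), int(np.argmax(e[N_CONSONANTS:]))

# e.g. a noisy embedding of the first consonant with the first vowel:
emb = np.array([0.9, 0.1, 0.0] + [0.8] + [0.05] * 14)
c_idx, v_idx = classify(emb)  # -> (0, 0)
```

Because the two argmax decisions are independent, consonant accuracy (96.5%) can far exceed vowel accuracy (81.7%), as observed.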

A.2. Articulatory loss functions
Let a CV be represented by the 38-dimensional concatenation vector x = [t_C, t_V, τ] of the articulatory targets for the consonant and vowel (one 18-dimensional vector each) and the two target-approximation time constants (refer to the free parameters shown in Table 1). Furthermore, let θ represent the set of constant speaker-specific parameters for ''JD2'', on which all the VTL functions are implicitly dependent. The loss terms can then be defined as follows (with some examples illustrated in Fig. 3), where ε is the smallest value depending on the numerical resolution of the simulation and a_max is the maximum tube area possible given the speaker θ.
(4) The precise closure loss applies a threshold to the tube closure lengths function12 (as illustrated in Fig. 3) to incentivise consonant closures over a short section of the vocal tract, where A is the relevant set of articulators and A′ is its complement when the consonant is a bilabial, and vice versa when the consonant is not a bilabial.

10 Using the C++ function vtlGetTransferFunction.
11 Using the function vtlTractToTube.
12 Also using vtlTractToTube.
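Since the full loss definitions do not survive in this extract, the following numpy sketch conveys only the flavour of the precise closure loss: penalise closures (tube areas below a small threshold ε) that extend beyond a short section of the tract. The threshold values and the hinge form are illustrative assumptions; the exact definitions accompany Fig. 3 in the original:

```python
import numpy as np

def closure_length_loss(tube_areas, area_eps=0.05, max_closed_tubes=3):
    """Penalty proportional to how far the closed region of the vocal
    tract (tube areas below area_eps) exceeds the permitted length,
    measured in tube sections."""
    closed = np.asarray(tube_areas) < area_eps
    excess = int(closed.sum()) - max_closed_tubes
    return float(max(excess, 0))
```

A short closure incurs no penalty, while a closure smeared over many tube sections is penalised linearly, which incentivises the precise closures discussed in Section 5.2.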

Appendix B. Interpreting recognition rates
The expected recognition rate for a random utterance generator cannot be determined directly for the free transcription or large vocabulary recognition task, because neither the effective number of output classes of the discriminator (whether the proprietary ASR system or human listeners) nor the prior probability distribution over the set of possible outputs (i.e., the language model) is known. This obstacle extends to the recognition rates in terms of the vowel or consonant, since these are subject to a word-level decision by the discriminator, which involves both the prior probabilities of words and the phonotactics of the language (i.e., the context-dependent acoustic model).
However, a reasonable estimate of the upper bound of the expected recognition rate for a random generator would be to take a uniform distribution over the number of output classes observed for the specific experimental condition. For example, we can estimate this for the best performing condition for the ASR and human evaluators from the information presented in Figs. 7 and 8 respectively. For the ASR system, the set of output labels has size 23, resulting in an expected recognition rate of 1/23 ≈ 4.3%. For the human listeners, if we consider the 38 frequent responses representing approximately 70% of the probability mass, the result is 0.7/38 ≈ 1.8%.
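These chance estimates amount to dividing the considered probability mass by the number of observed output classes:

```python
def uniform_chance(n_classes, mass=1.0):
    """Expected recognition rate for a uniform distribution over
    n_classes outputs covering the given probability mass."""
    return mass / n_classes

asr_chance = uniform_chance(23)         # 23 ASR output labels
human_chance = uniform_chance(38, 0.7)  # 38 responses, ~70% of the mass
```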
These estimates should be interpreted as upper bounds since for a process with lower precision, such as random sampling of the articulatory space, the following is expected: (1) Higher variance in the outputs should result in the discriminator producing a larger set of output classes which under the assumption of a uniform distribution will decrease the expected recognition rate.
(2) Acknowledging that the prior probabilities modelled by the discriminator are not uniformly distributed, it is expected that more of the probability mass will be distributed outside of the 13 test classes. Concretely, it is expected that a random utterance generator will produce a larger proportion of rejected utterances, at least in the case of the ASR system where the null result is the most probable in the posterior distribution.
Furthermore, since these estimates are derived from the most precise (lowest variance) condition, similar estimates for the other conditions presented in the study are expected to be lower.