Local and non-local dependency learning and emergence of rule-like representations in speech data by Deep Convolutional Generative Adversarial Networks

This paper argues that training GANs on local and non-local dependencies in speech data offers insights into how deep neural networks discretize continuous data and how symbolic-like rule-based morphophonological processes emerge in a deep convolutional architecture. Acquisition of speech has recently been modeled as a dependency between latent space and data generated by GANs in Beguš (2020b; arXiv:2006.03965), which models learning of a simple local allophonic distribution. We extend this approach to test learning of local and non-local phonological processes that include approximations of morphological processes. We further parallel outputs of the model to results of a behavioral experiment where human subjects are trained on the data used for training the GAN network. Four main conclusions emerge: (i) the networks provide useful information for computational models of speech acquisition even if trained on a comparatively small dataset of an artificial grammar learning experiment; (ii) local processes are easier to learn than non-local processes, which matches both behavioral data in human subjects and typology in the world's languages. This paper also proposes (iii) how we can actively observe the network's progress in learning and explore the effect of training steps on learning representations by keeping latent space constant across different training steps. Finally, this paper shows that (iv) the network learns to encode the presence of a prefix with a single latent variable; by interpolating this variable, we can actively observe the operation of a non-local phonological process. The proposed technique for retrieving learning representations has general implications for our understanding of how GANs discretize continuous speech data and suggests that rule-like generalizations in the training data are represented as an interaction between variables in the network's latent space.


Introduction
The discussion between connectionist and symbolic approaches to language and human cognition in general has long been a focus of computational cognitive science (Marcus 2001, i.a.). Phonetic and phonological data are uniquely appropriate for addressing this problem. The over-a-century-long tradition of scientific study of acoustic and perceptual phonetics (for an overview, see MacMahon 2013), which deals with physical properties of speech sounds, provides a solid understanding of the continuous data that hearing infants acquire language from: raw acoustic speech. Phonology is the study of how humans analyze, discretize, self-organize, and manipulate continuous speech data into discretized mental representations called phonemes. The scientific study of phonology, too, has an over-a-century-long history (for an overview, see van der Hulst 2013), which has resulted in a solid understanding of local and non-local discrete dependencies in human speech. Phonetic and phonological data and analysis are thus uniquely appropriate for probing what deep convolutional networks can and cannot learn, how discrete representations can emerge in deep neural networks, and how their performance can be paralleled to human behavior. Despite these advantages, the majority of neural network interpretability studies focus on non-linguistic visual data or on syntactic/semantic levels, the latter of which lack a continuous component.
Computational models of speech acquisition have a long history. The majority of models, however, operate with abstract and already discretized data rather than raw acoustic inputs (McClelland and Elman, 1986;Gaskell et al., 1995;Plaut and Kello, 1999). Deep neural network models of phonetic and phonological data operating with raw acoustic inputs emerged only recently. Several proposals model phonetic learning with deep autoencoder models (Räsänen et al., 2016;Alishahi et al., 2017;Eloff et al., 2019;Shain and Elsner, 2019;Chung et al., 2020). Autoencoders learn to reduce data and encode data distributions in latent representations: they are trained on reproducing inputs by generating outputs from a reduced latent space. Inputs are thus directly connected to the outputs with an intermediate latent space that is reduced in dimensionality. Clustering analyses on the latent space show that the networks trained on phonetic data learn approximations of phonetic features based on phonetic similarity (Räsänen et al., 2016;Alishahi et al., 2017;Eloff et al., 2019;Shain and Elsner, 2019).
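The encode-then-reconstruct setup described above can be illustrated in a few lines. The following is a minimal sketch, not an implementation of any of the cited models: a linear autoencoder in plain NumPy, trained by gradient descent to reconstruct synthetic stand-ins for acoustic frames. The data dimensions, learning rate, and number of steps are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for acoustic frames: 200 frames, 64 dims each,
# generated from a 4-dimensional latent structure plus noise.
latents = rng.normal(size=(200, 4))
mixing = rng.normal(size=(4, 64))
X = latents @ mixing + 0.01 * rng.normal(size=(200, 64))

# Minimal linear autoencoder: encode 64 -> 4 dims, decode 4 -> 64.
W_enc = rng.normal(scale=0.1, size=(64, 4))
W_dec = rng.normal(scale=0.1, size=(4, 64))

def reconstruction_loss():
    return float(np.mean((X @ W_enc @ W_dec - X) ** 2))

loss_before = reconstruction_loss()
lr = 1e-3
for _ in range(2000):
    Z = X @ W_enc                          # reduced latent code
    err = Z @ W_dec - X                    # reconstruction error
    W_dec -= lr * (Z.T @ err) / len(X)     # gradient of mean squared error
    W_enc -= lr * (X.T @ (err @ W_dec.T)) / len(X)

loss_after = reconstruction_loss()
```

The point of the sketch is structural: inputs are connected to outputs only through the reduced latent code `Z`, so training forces the latent space to encode the data distribution, which is what the clustering analyses cited above probe.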
While the reduced dimensionality in the autoencoder architecture approximates phonetic features based on phonetic similarity, the proposals do not model phonological processes. The human language learner has to acquire not only the identity of individual sounds based on acoustic similarity (as approximately modeled by the proposals using the autoencoder architecture), but also to manipulate those sounds in a given phonetic context. For example, a voiceless bilabial stop /p/ in English can surface as aspirated [p h ] (produced with aspiration or a puff of air) before stressed vowels or as unaspirated [p] (without aspiration or a puff of air) if a fricative [s] precedes it. A minimal pair illustrating this distribution is ["p h It] 'pit' and ["spIt] 'spit'. The learner needs to learn not only to output voiceless bilabial stop, but also to shorten the aspiration time (VOT) when an [s] precedes it. Autoencoders are also trained on replicating output data as closely as possible to the input data, which is not desirable in models of language acquisition. While dimensionality reduction in autoencoders is unsupervised, input-output pairing is not.
To model phonetic learning simultaneously with the learning of simple allophonic processes, Beguš (2020b) proposes that speech acquisition can be modeled as a dependency between the latent space and generated data in Generative Adversarial Networks. Generative Adversarial Networks (GANs), first proposed by Goodfellow et al. (2014), had not previously been used for modeling language acquisition, despite several advantages that this architecture offers for computational models of language learning. GAN models are unsupervised and fully generative, which means that a deep convolutional network outputs innovative data that have no direct link to the training data (unlike, for example, in the autoencoder architecture). In other words, deep convolutional networks in the GAN architecture need to learn to output data from some random distribution. Beguš (2020b) argues that deep convolutional networks in the GAN architecture encode discretized phonetic and phonological representations in the latent space. A computational experiment is conducted on a GAN implementation for audio (as proposed in Donahue et al. 2019 based on Radford et al. 2015) by training the networks on a phonologically local allophonic distribution in English, where voiceless stops surface as aspirated word-initially before a stressed vowel (e.g. in ["p h It] 'pit'), except if a sibilant [s] precedes the stop (e.g. in ["spIt] 'spit'). The network learns the allophonic distribution and encodes phonetically and phonologically meaningful features in its latent space.
Based on this local allophonic distribution, Beguš (2020b) proposes a technique for identifying and manipulating variables in the latent space in the GAN architecture that correspond to desired phonetic and phonological representations. Beguš (2020b) argues that the network uses a subset of latent variables to encode the presence of a sound in the output (e.g. [s]). By manipulating the identified variables, especially well beyond the training range (as proposed in Beguš 2020b), we can actively force the sound in and out of the generated outputs. Moreover, a linear interpolation of the chosen latent variables from marginal values results in an almost linear reduction of the amplitude of the frication noise of [s] - a linguistically meaningful unit (Beguš, 2020b).
The goal of this paper is to argue that using the technique proposed in Beguš (2020b), we can model not only simple allophonic processes, such as English deaspiration, but also local and non-local phonological processes that are based on what would be approximated as morphology (morphophonological alternations) and that resemble rule-like behavior. We also argue that we can parallel human behavioral experiments with the performance of deep convolutional networks trained on the same data as used in the behavioral experiments. In general, natural languages strongly prefer local over non-local processes, both in phonology and on other levels such as morphology and syntax (Finley, 2011, 2012; McMullin and Hansson, 2019; White et al., 2018). In fact, the vast majority of phonological processes in the world's languages are local (targeting adjacent sounds) (Finley, 2011), with only a few processes, such as harmony, operating on non-adjacent sounds. Behavioral experiments show that local processes are easier to learn than non-local processes (Finley, 2011, 2012; McMullin and Hansson, 2019; White et al., 2018). In this paper, we test the learning of local and non-local phonological dependencies, and show that local processes (such as postnasal or intervocalic devoicing) are easier to learn for the networks than non-local vowel harmony. We parallel success rates in the computational model to behavioral data - an artificial grammar learning experiment in which human subjects are trained on the same data (Section 4). Combining artificial grammar learning experiments and computational models in this way has the potential to reveal similarities in learning biases between human subjects and deep convolutional networks, and to shed light on how domain-general learning biases that require no language-specific mechanisms can result in the typological prevalence of local processes and the rarity of non-local processes.
Specifically, we test the learning of non-local vowel harmony and several local devoicing patterns. Vowel harmony is a phonological process, usually non-local, in which a vowel becomes more similar to another vowel in a word. For example, the plural morpheme in Turkish surfaces as [lAr] after root vowels that are back and as [ler] if the root vowel is front (Kabak, 2011): [dAl-lAr] 'branches' and [jer-ler] 'places'.
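The Turkish-style alternation can be made concrete with a toy symbolic rewrite. This is a deliberately simplified sketch, not an implementation of Turkish phonology: the transliteration is ASCII-plus, the vowel inventory is reduced, harmony applies only to the last root vowel, and the function name is invented for illustration.

```python
FRONT = set("ieöü")   # front vowels (simplified inventory)
BACK = set("aıou")    # back vowels

def harmonize_suffix(root: str, suffix: str = "lAr") -> str:
    """Rewrite the archiphoneme A in the suffix as front 'e' after a
    front root vowel and as back 'a' otherwise (simplified)."""
    vowels = [c for c in root if c in FRONT | BACK]
    front = vowels[-1] in FRONT      # harmonize with the last root vowel
    return root + "-" + suffix.replace("A", "e" if front else "a")
```

With this toy rule, `harmonize_suffix("dal")` yields `dal-lar` and `harmonize_suffix("jer")` yields `jer-ler`, mirroring the two plural forms above: a single stored form plus a feature-copying rule derives both surface outputs.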
In formal phonological analysis, phonological computation is formalized with rewrite rules that operate as symbolic feature manipulation (Chomsky and Halle, 1968). As argued by Marcus et al. (1999) and several other works (Chomsky and Halle 1968; Heinz 2010; Berent 2013, i.a.), "algebraic rules" are required to derive a set of surface outputs such as Turkish [dAl-lAr] and [jer-ler] from stored inputs. The stored mental representation of the suffix can be posited as /lAr/. The role of phonological grammar is to derive the two surface forms (outputs) from the stored mental representation (input).
Sounds are represented with matrices of binary features that distinguish meaning (e.g. [+syllabic, +front] means a front vowel). Vowel harmony can be formalized with a simple rewrite rule (in 1) that identifies vowels ([+syllabic]) and assigns them the same value (α) of the feature [±front] as in the vowel that follows (with any number of intervening consonants, C 0 ). The formalism is illustrated in (1).

(1) [+syllabic] → [αfront] / __ C 0 [+syllabic, αfront]

The discussion of symbolic representation vs. connectionism has a long tradition in phonology. An influential proposal called Optimality Theory models phonology as an input-output pairing rather than a rule-based symbolic representation (Prince and Smolensky, 1993/2004; Legendre et al., 1990). Optimality Theory was directly influenced by earlier work on connectionism. Vowel harmony within this framework is modeled with the Agreement-by-Correspondence proposal (Hansson, 2010; Rose and Walker, 2004): two sounds (such as the two vowels [A] in Turkish [dAl-lAr]) are in correspondence and share features, which, through surface optimization in the grammar, results in a harmony pattern. Several independent facts support the approach of input-output optimization in phonology. However, both Optimality Theory and other proposals in phonology using neural networks (McClelland and Elman, 1986; Gaskell et al., 1995; Plaut and Kello, 1999) model local and non-local phonology with pre-assumed levels of abstraction, meaning that learning is not modeled from raw acoustic data but from data that are already pre-discretized, or it requires language-specific mechanisms.
We argue that approximations of rule-based behavior emerge in deep convolutional networks even without any pre-assumed levels of abstraction (the networks are trained on raw acoustic inputs) and when models contain no language-specific parameters. The network discretizes the representation of a prefix in the output and uses only one latent variable (out of 100) to encode the presence of the prefix. Equivalents to non-local phonological rules emerge from an interaction between the variable that represents the prefix and a variable that generates some desired phonological process. We also argue that the same data used for training in the GAN architecture can be used to test phonological learning in artificial grammar learning experiments with human subjects. In fact, the paper argues that training GANs on relatively few data points yields, somewhat surprisingly, highly informative results (Section 3.1). This observation should open numerous opportunities for paralleling performance in deep neural networks and behavioral outcomes of artificial grammar learning experiments with human subjects. Finally, we outline a procedure to observe how the network learns dependencies as the training progresses and claim that the Generator's search through the space of phone-level combinations is linguistically interpretable (Section 3.2).

Model
The main characteristic of the Generative Adversarial Network architecture (Goodfellow et al., 2014), and more specifically of the DCGAN proposal by Radford et al. (2015), is that two deep convolutional neural networks are trained in a minimax setting. The Discriminator learns to estimate how realistic the data are and to minimize its own error rate (Brownlee, 2019). The Generator network learns to output data from a set of latent variables and to maximize the Discriminator network's error. Initially, the Generator produces noise, but as training progresses it becomes increasingly successful at outputting data, such that the Discriminator becomes less successful at distinguishing actual from generated data.
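The minimax dynamic can be illustrated with a deliberately tiny example: a one-dimensional Generator (an affine map of latent noise) and a logistic Discriminator trained against each other in plain NumPy. Everything here, including the scalar target distribution, the losses, and the learning rate, is an illustrative assumption; the actual model is a deep convolutional WaveGAN, not this toy.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# Real data: samples from N(3, 1). The Generator maps latent noise
# z ~ N(0, 1) through an affine map x = a*z + c; the Discriminator
# is a logistic classifier D(x) = sigmoid(w*x + b).
a, c = 1.0, 0.0            # Generator parameters
w, b = 0.0, 0.0            # Discriminator parameters
lr, batch = 0.05, 64

for _ in range(3000):
    real = rng.normal(3.0, 1.0, batch)
    z = rng.normal(0.0, 1.0, batch)
    fake = a * z + c

    # Discriminator step: logistic loss with labels real=1, fake=0.
    d_real, d_fake = sigmoid(w * real + b), sigmoid(w * fake + b)
    w -= lr * (np.mean((d_real - 1) * real) + np.mean(d_fake * fake))
    b -= lr * (np.mean(d_real - 1) + np.mean(d_fake))

    # Generator step: non-saturating loss, gradient ascent on log D(G(z)).
    d_fake = sigmoid(w * fake + b)
    dx = (1 - d_fake) * w          # d/dx of log D(x)
    a += lr * np.mean(dx * z)
    c += lr * np.mean(dx)
```

The Generator never sees the real data directly: its parameter `c` drifts toward the real mean only because the Discriminator's gradient points that way, which is the adversarial analogue of the "no direct link to the training data" property discussed above.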
The majority of GANs are trained on two-dimensional visual data; a shift to apply the architecture to the audio domain has occurred only recently with the work of Donahue et al. (2019) (WaveGAN). The model in Donahue et al. (2019), used for training here, is based on the DCGAN architecture (Radford et al., 2015) and features most of the same hyperparameters. The two main differences are that the Generator involves an additional layer and generates a one-dimensional output that corresponds to approximately 1 second of audio. The cost function is taken from the Wasserstein GAN proposal with gradient penalty (WGAN-GP; Arjovsky et al. 2017; Gulrajani et al. 2017). Trained in this architecture, the Generator in Beguš (2020b) occasionally outputs stops with aspiration after [s], where the VOT duration can be longer than in any #sTV sequence in the training data. Additionally, the network occasionally outputs innovative sequences that lack a stop (e.g. #sV) or concatenate two stops (e.g. #TTV). In other words, the Generator learns the conditional allophonic distribution, but imperfectly so (Beguš, 2020b). The outputs with long VOT (aspiration) in the [s]-condition parallel stages in language acquisition: language-acquiring children also occasionally output stops with long VOT (aspiration) in the [s]-condition (Bond and Wilson, 1980).
In addition to observing learning in the GAN architecture through surface forms, we can identify individual latent variables that correspond to phonetic and phonological representations. Beguš (2020b) proposes a technique for identifying these variables by regressing the annotated outputs on the randomly sampled latent space. Predictions of several regression models are tested in Beguš (2020b) to avoid assumptions of linearity: generalized additive models with various shrinkage techniques, linear logistic regression, Lasso logistic regression, and random forest models. The technique identifies latent variables (z; see Figure 1) that correspond to the presence of [s] in the output. Moreover, it is shown that the relationship between the individual latent variables (e.g. those identified as representing [s]) and the presence of [s] in the generated data is often linear, even when non-linear regression is used for testing.
Given this linear relationship, we can identify variables that correspond to a desired phonetic property and determine whether the property correlates with positive or negative values of the variable. Individual z-variables are uniformly distributed during training on the interval (−1, 1). When a variable is set to a value identified as corresponding to the presence of a desired phonetic feature, the output contains a significantly higher proportion of this property. Crucially, Beguš (2020b) shows that manipulating the identified variables beyond the values in the training range (−1, 1), such as to ±4.5, results in an increased presence and amplitude of the desired phonetic representation. In other words, as we interpolate a variable identified as representing an [s] in the output, the amplitude of [s] increases or decreases. We can thus actively force a phonetic or phonological feature into the output. That the proposed technique indeed identifies variables corresponding to the presence of [s] is suggested by an independent generative test in Beguš (2020b). While explorations of latent space and representation learning in GANs have been conducted before on visual data (Radford et al., 2015), those proposals, to the author's knowledge, do not use single variables to explore their meaningful equivalents in the output and do not utilize interpolation to extreme values beyond the training range. 1

Beguš (2020b) thus argues that the Generator network learns a local allophonic distribution and learns to encode phonetic and phonological representations with a subset of variables in the latent space. While the Generator network represents [s] in the latent space with a subset of variables in Beguš (2020b), the cutoff between the variables associated with the presence of [s] and the rest of the latent space is not completely categorical. The Generator network does not associate the presence of [s] with a single variable: seven z-variables are associated with the representation of [s].
There is a notable cutoff between the regression estimates of the seven highest variables and the rest of the latent space, but the difference is not substantial or categorical. The training data in Beguš (2020b) are sliced from TIMIT (Garofolo et al., 1993) and are considerably more variable than the training data in this experiment. As is argued in Section 3.3, discretization of some morphophonological representations (e.g. the presence of the prefix) is substantial in the current experiment. It appears that less variable data result in more rapid discretization.
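The interpolation procedure itself is mechanically simple to sketch. Below, a toy stand-in generator in plain NumPy has the amplitude dependence on one latent dimension built in by hand (in the actual model this mapping is learned, not constructed); the sketch only shows the mechanics of setting one identified variable to values well beyond the training interval and measuring output amplitude. All names and dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy stand-in generator: latent z has 100 dims; dimension 16 alone
# (by construction here, not learned) scales the amplitude of a noise
# burst standing in for [s]-frication.
W = rng.normal(size=(100, 256)) * 0.01

def toy_generator(z):
    burst = np.clip(z[16], 0.0, None) * rng.normal(size=256)  # "frication"
    return z @ W + burst

# Interpolate the identified variable, including values well beyond
# the training range (-1, 1), and measure the output RMS amplitude.
amplitudes = []
for value in [-4.5, -1.0, 0.0, 1.0, 4.5]:
    z = rng.uniform(-1, 1, size=100)   # rest of the latent space random
    z[16] = value                      # only the identified variable is set
    out = toy_generator(z)
    amplitudes.append(float(np.sqrt(np.mean(out ** 2))))
```

In the real experiment the same sweep is run through the trained Generator and the amplitude of the frication noise is measured acoustically; the near-linear amplitude response reported in Beguš (2020b) is the learned analogue of the hand-built dependence here.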

Data
The training data (from Beguš 2020a) contain evidence for one non-local phonological process (vowel harmony) and four local processes. The items are all nonce words in English, so that the same dataset can be used in the behavioral experiment with human subjects (Section 4).

Non-local processes
Non-local vowel harmony is triggered by the first vowel of the base: the vowel of the prefix surfaces as front when the first vowel of the base is front, and as back when the first vowel of the base is not front. The experiment thus features a similar case of vowel harmony as the Turkish example (see Section 1).
The computational experiment presented here tests the learning of non-local vowel harmony. That the process tested here is phonologically non-local is clear from Table 1: the sounds in correspondence (the vowel of the prefix and the first vowel of the lexical item) are always separated by one or two consonants.

Local processes
In addition to non-local vowel harmony, the training data contain evidence for four local processes that are triggered by the prefix. Two processes are triggered by a nasal sound in the prefix VN-; 16 unprefixed-prefixed pairs (32 items total) contain evidence for post-nasal devoicing.

Because the learning of non-local processes is predicted to be more difficult than that of local processes, the training data contain substantially more evidence for the non-local process. All items in which C 1 is constant, as well as those in which it changes, contain evidence for the non-local vowel harmony process. Of the 270 training items, there are 117 unprefixed items with 117 corresponding prefixed forms, all of which contain evidence for vowel harmony (234 total). The remaining 36 items only include unprefixed forms (for testing learning). There is thus a substantial difference in the amount of training data that contain evidence for the non-local process (117 pairs, 234 items altogether) and the four local processes (16 pairs each). Even if all four local processes are pooled together, the data still contain only 64 pairs with evidence for the four local processes (128 items altogether). Table 1 illustrates the training data: each slot is filled with a transcribed example from the training data. The entire training data in IPA transcription are given in Appendix Tables A.3, A.4, A.5, A.6, A.7, A.8, and A.9.
In addition to the local and non-local processes described above, the data contain evidence for a local assimilation process which is somewhat less relevant to our experiment: if the prefix contains a nasal stop (VN-), the place of articulation of the nasal stop depends on the first consonant of the root (C 1 ). The nasal surfaces as labial [m] before the labials ([p] and [f]), and as an alveolar [n] elsewhere. Spectral differences are minimal between the two conditions, which is why a detailed analysis of this process is not possible in the computational experiment; the main purpose for including this assimilation in the data is for the behavioral experiment to include an English-like process (to not raise the attention of the subjects) and to facilitate the reading task for the speaker who recorded the stimuli.
The computational experiment tests the learning of the local devoicing processes and non-local vowel harmony that target the prefix (VN- or V-). In order to control for the potential effects of other segments on the learning of the targeted processes, we balance the experimental design as much as possible. The number of lexical items with the front vowel in V 2 is, in all but three pairs, equivalent for every C 1 condition. In other words, if there are four [d]-initial items that devoice and have frontness harmony (V 2 is front), there are also four items with backness harmony (V 2 is not front) for this condition. 2 We also aim to balance the identity of C 3 and V 4 as much as possible, but balancing these positions is limited by the requirement that the items not be real words of English or too similar to real words (due to the artificial grammar learning experiment). Only [m, n, l, ô, s] can be members of C 3 , and these along with V 4 are relatively well balanced across the groups with changing C 1 (e.g. an approximately equal number of the same consonants across voiced-initial items that devoice and those that undergo devoicing with fricativization or occlusion), but not across other groups. A fully balanced design is difficult to achieve due to the different groups and the nonce-word requirement, but given the relatively well-balanced design, we do not expect undesired dependencies to affect the learning of the distributions of interest. The 270 items described above were presented in a simplified transcription (see Appendix Tables A.3, A.4, A.5, A.6, A.7, A.8, and A.9) and read by a single female speaker of American English (see also Beguš 2020a). The words were of the shape C 1 V 2 C 3 V 4 or C 1 V 2 C 3 , with or without the prefix. The speaker was unaware of the exact objectives and details of the study and was compensated for her work.
Recordings of the training data were made in a sound-attenuated booth using a USBPre 2 (Sound Devices) pre-amp and a Shure Beta 53 omnidirectional condenser head-mounted microphone in Audacity (originally sampled at 44.1 kHz and then downsampled to 16 kHz).
The data, in the form of sliced audio files for each item (approximately 1 s long, padded with silence), are fed to the model randomly in mini-batches of 64. The bare unprefixed and prefixed forms are not paired in any way during training.
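The data pipeline just described can be sketched as follows. The synthetic waveforms, the padding scheme, and the window length are illustrative assumptions (WaveGAN-style models use windows of 16384 samples, roughly 1 s at 16 kHz); the actual training code is Donahue et al.'s (2019) implementation, not this sketch.

```python
import numpy as np

rng = np.random.default_rng(3)
N_SAMPLES = 16384     # ~1 s window at 16 kHz (WaveGAN-style, assumed)

# Stand-ins for the 270 sliced recordings, of variable raw length.
recordings = [rng.normal(size=int(rng.integers(8000, 16384)))
              for _ in range(270)]

def pad_to_length(x, n=N_SAMPLES):
    """Zero-pad (or truncate) a waveform to a fixed length."""
    out = np.zeros(n, dtype=np.float32)
    out[: min(len(x), n)] = x[:n]
    return out

def minibatches(data, batch_size=64):
    """Shuffle items and yield unpaired random mini-batches: prefixed
    and unprefixed forms are never matched up during training."""
    order = rng.permutation(len(data))
    for start in range(0, len(data), batch_size):
        idx = order[start : start + batch_size]
        yield np.stack([pad_to_length(data[i]) for i in idx])

batches = list(minibatches(recordings))
```

With 270 items and batches of 64, one pass yields four full batches plus a remainder batch of 14; the random permutation each epoch is what keeps unprefixed and prefixed forms unpaired.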

Results
One advantage of the GAN architecture is that the Generator network outputs innovative data that are linguistically interpretable (Beguš, 2020b). Innovative outputs are often sporadic and do not allow for a full quantitative analysis, which nonetheless does not make them less informative. It is important to describe innovative outputs and how they can inform us about the learning of speech data in deep convolutional networks. In Sections 3.1 and 3.2 we present results from an exploratory study of the network's innovative outputs based on an acoustic analysis of spectra. In Sections 3.3, 3.4, and 3.5 we present a quantitative analysis of the generated outputs. 3

Small data sets
The network is trained on a total of 270 unique data points (audio recordings of the words with the structure described in Section 2.2). Despite the small amount of training data, the model generates outputs that closely resemble human speech and are interpretable, analyzable, and highly informative. This stands in contrast to some recent studies of neural network models on the syntactic level that require very large training datasets and do not improve substantially with more data (van Schijndel et al., 2019). As is argued below, the GANs do not overfit, but produce innovative data that are linguistically interpretable despite the small training dataset. This finding should open up numerous possibilities for further exploration of learning representations in deep convolutional networks: it is generally assumed that GANs and deep convolutional networks require large amounts of data, which could be prohibitive for research questions that require smaller training datasets.
We analyze outputs of the Generator network at four training steps: after 7453 (∼ 8833 epochs), 9740 (∼ 11543 epochs), 14900 (∼ 17659 epochs), and 20990 (∼ 24877 epochs) steps. The number of steps chosen is based on maximizing clarity of the acoustic outputs that need to be appropriate for acoustic analysis and minimizing the number of steps used for training (for guidelines, see Beguš 2020b).
Some generated outputs are phonetically very similar to their input equivalents, as illustrated in Appendix A Figure A.11. The network, however, also generates outputs that substantially violate the input data. The Generator network trained after 7453 steps, for example, outputs a sequence that can be transcribed as ["dinO], yet the training data lack this sequence altogether. The closest neighbor to the innovative ["dinO] in the training data is ["dEnO] (see Figure A.11). There are numerous other such generated outputs that violate the training data but are linguistically valid and interpretable. For example, 23.2% of outputs violate the training data with respect to vowel harmony (see Section 3.4).
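Closest neighbors like ["dEnO] for the innovative ["dinO] can be found automatically with string edit distance over transcriptions. The sketch below uses ASCII stand-ins for the IPA transcriptions, and both helper names are invented for illustration; it is not the procedure used in the paper, where neighbors were identified by inspection.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two transcribed strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def closest_neighbor(output: str, training_items: list[str]) -> str:
    """Training item with the smallest edit distance to a generated output."""
    return min(training_items, key=lambda item: edit_distance(output, item))
```

With hypothetical training items `["dEnO", "bOrO", "fulO"]`, `closest_neighbor("dinO", ...)` returns `"dEnO"`, one substitution away, matching the comparison made above.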
To further quantify the proportion of innovative outputs that are linguistically interpretable, we transcribe 200 randomly generated outputs from a network trained after 20990 steps. The phonemic structure is impossible to determine in only 13 of the 200 outputs (6.5%). In the majority of these 13 outputs, the generated audio resembles speech and includes periodic vibration, but the spectrogram structure is too noisy for identification of a clear phonetic structure for parts of the output or for the entire output. In the majority of cases (187, or 93.5%), on the other hand, the generated outputs have a clear and identifiable phonetic structure. Moreover, the Generator clearly learns the structural phonotactic properties of the input data. In all outputs with an identifiable structure, the network outputs items with the structure CVCV, CVC, prefix-CVCV, or prefix-CVC. The network also learns more specific distributional patterns. For example, the training data lack nasal consonants in the initial position (C 1 in C 1 V 2 C 3 V 4 or C 1 V 2 C 3 is never a nasal, but either an obstruent or [l, ô, j]; see also Tables A.11, A.3, A.4, A.5, A.6, A.7, A.8, A.9 in Appendix). Likewise, C 3 never features an obstruent ([p h , t h , b, d, f, v, z]) in the training data, with the exception of [s]. Finally, obstruents are always voiceless in prefixed forms (e.g. [En"t h ilO] for ["dilO]). All 187 outputs conform to all of these distributional patterns.
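These phonotactic generalizations can be stated as a template check over simplified transcriptions. The segment classes below are rough ASCII stand-ins for the IPA inventory of the training data and are assumptions for illustration, not the paper's annotation scheme.

```python
import re

C1 = "[ptbdfvzslrj]"       # initial consonants: obstruents or l/r/j, never nasal
V = "[ieaouEO]"            # vowels (simplified)
C3 = "[mnlrs]"             # medial consonants: no obstruents except s
PREFIX = f"(?:{V}[mn]?)"   # V- or VN- prefix

TEMPLATE = re.compile(f"{PREFIX}?{C1}{V}{C3}{V}?")

def conforms(transcription: str) -> bool:
    """True if a simplified transcription matches one of the attested
    shapes: CVCV, CVC, prefix-CVCV, or prefix-CVC."""
    return TEMPLATE.fullmatch(transcription) is not None
```

Under these assumptions, `conforms("dilO")` and `conforms("EntilO")` are true, while a nasal-initial string like `"mOrO"` fails the template, mirroring the distributional patterns listed above.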
Crucially, the Generator does not simply replicate inputs. While all 187 outputs conform to the global distributional patterns of the training data, 78/187 are unique combinations of sequences that are absent from the training data. 15 of these 78 outputs are disharmonic cases. Yet even if these are set aside, the Generator produces 63/187 (33.7%) outputs that conform to the distributional and phonotactic patterns of the input data but feature unique phoneme sequences absent from the training data. For example, the network outputs ["bOôO], [O"t h On@], and ["t h ini], which conform to the phonotactic patterns of the training data but are not present in precisely these combinations of segments in the training data.
Innovative outputs that violate training data distributions in linguistically interpretable ways constitute strong evidence against overfitting in the GAN architecture: even with very small datasets and a relatively high number of epochs, the Generator does not overfit. This is in line with previous evidence that GANs generally do not overfit (Adlam et al., 2019; Donahue et al., 2019), but here we additionally argue that GANs do not overfit even with small training datasets (N = 270).

Progression of learning
One advantage of the exploratory study of GAN outputs is that we can follow how dependencies in speech are learned by the network at different training steps. We propose that the progression of learning can be observed by keeping the latent space constant and generating data at different training stages of the Generator network. This provides crucial information on how the number of training steps influences the Generator's outputs and learning representations - an area that is relatively understudied. Testing the effect of training steps on learning representations using speech data should reveal further insights into neural network interpretability, as is argued below.
We propose that by analyzing generated outputs at different training steps with the latent space kept constant, we can actively follow how the network corrects the outputs that violate distributions in the data. For example, at 7453 steps, the network generates an innovative output that violates the training data: ["bEnO]. At 9740 training steps, the network outputs ["bEmO] for the same latent space variables. This output still violates the data: none of the words in the training data was of the exact shape ["bEmO]. At 14900 steps, the network outputs ["bEôO] (for the same latent space), which corresponds to ["bEôO] in the training data (Figure 2). 4

In a related example, the proposed method allows us to follow how the network searches through the space of possible segment combinations using linguistically valid strategies. Figure 2 shows an output ["zilO] for which there is no direct equivalent in the training data. The spectrogram shows a clear voicing bar and frication noise in the high frequencies, characteristic of a [z]. At 9740 steps, the network devoices the initial consonant C 1 but keeps its frication noise (and also changes the high front vowel [i] to a back vowel [u]) for an output ["sulO]. This output is likewise not attested in the training data. Finally, at 14900 steps, the network transforms the frication noise from a higher to a lower kurtosis that corresponds to a labial fricative [f] in the training data (["fulO]). At 20990 steps, it appears as if the network is introducing a period of aspiration noise and turning the fricative into a stop with the same following sequence (["t h ulO]). None of these outputs are attested in the training data, but the examples illustrate that the Generator searches for segment combinations with phonological processes that are valid in human language, such as devoicing, occlusion, or a change in the distribution of frication noise.
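Mechanically, the procedure amounts to freezing one latent vector and generating from Generator weights restored at different checkpoints. The sketch below simulates this with random stand-in weight matrices; in practice the weights would be restored from saved checkpoints of the trained model, and the output length here is reduced for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

# One latent vector, sampled once and then held constant.
z = rng.uniform(-1, 1, size=100)

# Stand-ins for Generator weights restored at different training steps
# (in practice: one restored checkpoint per step, not random matrices).
checkpoints = {step: rng.normal(scale=0.1, size=(100, 1024))
               for step in [7453, 9740, 14900, 20990]}

# Generate one output per checkpoint from the SAME latent vector, so
# differences across steps reflect training progress, not latent sampling.
outputs = {step: z @ W for step, W in checkpoints.items()}
```

Because `z` is fixed, each generated waveform can be transcribed and compared across steps, which is exactly how the ["bEnO] → ["bEmO] → ["bEôO] progression above was traced.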
Using this technique, we can not only observe how the network repairs distributional violations, but also how it searches through the space of possible segment combinations to repair violations of phonological rules in the data. Because the error rate of local phonological processes is relatively low in the output data (1.8% at 20990 steps), the study of how the network repairs outputs that violate phonological processes can only be exploratory at this point. An example that illustrates how learning progress can be directly observed with this method is given in Figure 3. At a later training step, the network devoices the fricative in the output, which means the output now conforms to the devoicing rule in the training data. In other words, [z], which violates the phonological rule of devoicing after a prefix, devoices to [s], which conforms to the training data. At 14900 steps, the output thus fully conforms to the distributions in the training data (harmony and devoicing): [O"sOlO] (Figure 3). The output, while conforming to the rules of the training data, is still innovative: none of the training inputs contains exactly this sequence. Spectrograms in Figure 3 illustrate how the network applies learning representations in its continuous outputs at different training steps that correspond to phonological processes in natural language: devoicing and vowel-lowering.

Latent space
To test how the network encodes prefixation in its latent space, we used the technique described in Beguš (2020b) and Section 2 to identify dependencies between the latent space and generated data. 500 outputs of the Generator network trained for 20990 steps were transcribed and annotated for the presence of the prefixes V- and VN-. The number of steps for this analysis was chosen based on the analysis of the progression of learning in Section 3.2: a number of disharmonic outputs are repaired by 20990 steps, while further training ceases to repair additional disharmonic outputs. That the network is successful in outputting data that approximate human speech in the training data is suggested by the fact that the author was unable to reliably transcribe only approximately 25 out of 500 outputs (5%). The data were fit to a Lasso logistic regression model with the presence of the prefix as the dependent variable and the 100 latent variables of the Generator network as predictors (with the glmnet package; Simon et al. 2011). Lambda values were estimated with 10-fold cross-validation. Estimates in Figure 4 suggest that the network uses a single latent variable to encode the presence of the prefix in the output: there is a clear and substantial drop in regression estimates between z16 and the rest of the latent space (the other 99 z-variables). Such a substantial drop in regression estimates suggests that the network discretizes the representation of the prefix into a single latent variable.
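The variable-identification step can be illustrated with a small self-contained sketch. The paper fits the Lasso with glmnet in R; below is a hypothetical numpy-only analogue (L1-penalised logistic regression via proximal gradient) on synthetic data in which a single variable, index 16, is constructed to control the outcome:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 100
Z = rng.uniform(-1, 1, size=(n, p))   # stand-in for sampled latent vectors
logits = -4.0 * Z[:, 16]              # assumption: variable 16 alone drives the outcome
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(float)

def lasso_logistic(Z, y, lam=0.02, lr=0.1, iters=3000):
    """L1-penalised logistic regression fit by proximal gradient (ISTA)."""
    w, b = np.zeros(Z.shape[1]), 0.0
    for _ in range(iters):
        pred = 1 / (1 + np.exp(-(Z @ w + b)))
        w -= lr * (Z.T @ (pred - y)) / len(y)
        b -= lr * np.mean(pred - y)
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)  # soft-thresholding
    return w, b

w, _ = lasso_logistic(Z, y)
top = int(np.argmax(np.abs(w)))       # index of the strongest predictor
```

As with the regression estimates in Figure 4, the diagnostic is the gap between the largest coefficient and the rest: the L1 penalty zeroes out most of the 100 coefficients, leaving the controlling variable clearly separated.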
To test the effect of z16 on generated data, we generate 100 outputs with the value of z16 set to −4.5 (for the method, see Beguš 2020b and Section 2.1). Out of 100 generated samples, 100 (or 100%) contain a prefix V- or VN-. When z16 is set to its opposite value (4.5), only 1 out of 100 generated samples (1%) contains a prefix. This generative test suggests that the network encodes the presence of the prefix in the output as a single variable in its latent space. By manipulating this feature, we can actively control the presence of the prefix in the output.6
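The structure of this generative test is easy to sketch. In the following toy Python version, the Generator is replaced by a hypothetical stand-in that deterministically keys prefixation to the sign of z[16]; this rule mirrors the finding for illustration and is not the trained network:

```python
import numpy as np

rng = np.random.default_rng(1)

def generate(z):
    """Toy stand-in for the trained Generator: negative z[16] yields a
    prefixed output (hypothetical rule used only for illustration)."""
    return "prefixed" if z[16] < 0 else "bare"

def prefix_rate(z16_value, n=100, dim=100):
    """Generate n outputs with z16 clamped to a fixed value (all other
    latent variables freshly sampled) and return the proportion prefixed."""
    hits = 0
    for _ in range(n):
        z = rng.uniform(-1, 1, dim)
        z[16] = z16_value          # clamp the variable under test
        hits += generate(z) == "prefixed"
    return hits / n
```

With the real Generator, the two rates were 100% at −4.5 and 1% at 4.5; note that the clamped value lies well outside the training interval (−1, 1), which amplifies the variable's effect.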

Local and non-local processes
The training data contain evidence for both local and non-local phenomena. Devoicing and occlusion after the prefixes V- and VN- are local; vowel harmony is non-local, as one or two segments intervene between the target vowel and the triggering vowel.
To test error rates in the output data, 500 outputs from the Generator network trained for 20990 steps were analyzed. 211 outputs (42.2%) were analyzed as involving a prefix VN- or V-. Of the 211 prefixed outputs, 162 (or 76.8%) were analyzed as harmonious.7 Harmonious outcomes are consistently more frequent than non-harmonious ones, both for front and back V2 and across the two prefixes, V- and VN-. The distribution of the harmonious and disharmonious outputs across front and back triggering vowels and across the two prefixes is given in Table 2.
To test whether the Generator's higher rates of harmonious outcomes are significantly above chance, we fit the data to a logistic regression model with harmonious and non-harmonious outcomes as the dependent variable (harmonious coded as success) and vowel frontness (with two sum-coded levels, front and back) and prefix identity (with two sum-coded levels, V- and VN-) as the independent variables with their interaction. Harmonious outcomes are significantly more frequent than disharmonious outcomes at the means of all predictors: β = 1.34, z = 7.2, p < 0.0001. None of the interactions is significant. All estimates are given in Appendix Table A.10. Predicted values of the model are plotted in Figure 10. The results suggest that the network learns the non-local phonological process of vowel harmony, but imperfectly so: it violates the training data in approximately 23% of outputs. The violations are linguistically interpretable: the prefix vowel in the non-harmonious condition does not have a random formant structure, but consists of formants characteristic of [O] or [E].
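The paper's test is the sum-coded logistic regression just described. As a simpler sanity check of the same claim, an exact binomial tail probability (standard library only) already shows that 162 harmonious outputs out of 211 is far above the 50% chance level:

```python
from math import comb

def binom_sf(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): the probability of observing at
    least k harmonious outputs if harmony were at chance."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(k, n + 1))

# counts from the analyzed outputs: 162 harmonious out of 211 prefixed forms
p_chance = binom_sf(162, 211)
```

This intercept-only check is not a substitute for the regression, which additionally controls for vowel frontness and prefix identity, but it confirms the direction and strength of the effect.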
Local processes are substantially more frequent in natural languages and easier to learn than non-local processes. To test whether such a distribution also emerges in deep convolutional networks, we can compare the error rate of the non-local process and the error rate of the local processes in the generated outputs. Out of 168 prefixed outputs containing a stop or a fricative, only three (1.8%) violate the devoicing rule in the training data, by which stops and fricatives are always voiceless in prefixed forms (Figure 5). This error rate is significantly lower than the error rate of the non-local process (OR = 16.2 [5.1, 83.0], p < 0.0001, Fisher test). While the phonetic cues for harmony and devoicing are different and challenging to compare, it would be difficult to argue that the magnitude of the phonetic cues for vowel formants (front vs. back) is substantially smaller than the cue for voicing. The distribution aligns well with behavioral data in human subjects, where local processes have been shown to be easier to learn than non-local processes in many studies (Finley, 2011, 2012; McMullin and Hansson, 2019; White et al., 2018).

6 For a generative test showing that regression estimates indeed identify variables that correspond to a given phonetic/phonological representation, see Beguš (2020b).
7 In one output excluded from the analysis, the prefix vowel is analyzed as [A].
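The error-rate comparison can be reproduced with a two-sided Fisher exact test. Below is a standard-library-only sketch; the paper's reported OR of 16.2 is presumably a conditional maximum-likelihood estimate (as in R's fisher.test), so the sample odds ratio computed here differs slightly:

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test on a 2x2 table [[a, b], [c, d]]:
    sum the hypergeometric probabilities of all tables no more likely
    than the observed one (row and column margins held fixed)."""
    n = a + b + c + d
    r1, c1 = a + b, a + c
    denom = comb(n, c1)
    lo, hi = max(0, c1 - (n - r1)), min(r1, c1)
    probs = {x: comb(r1, x) * comb(n - r1, c1 - x) / denom for x in range(lo, hi + 1)}
    p_obs = probs[a]
    return sum(pr for pr in probs.values() if pr <= p_obs * (1 + 1e-9))

# counts from the paper: 3/168 local-process errors vs. 49/211 harmony errors
p = fisher_exact_two_sided(3, 165, 49, 162)
odds_ratio = (49 / 162) / (3 / 165)   # sample OR, close to the reported 16.2
```

The resulting p-value is well below the paper's p < 0.0001 threshold, confirming that the local error rate is significantly lower than the non-local one.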

Emergence of rule-like behavior
In the framework of symbolic representations, vowel harmony can be derived with an algebraic rule. The harmony of the prefix vowel ([E]/[O]) is triggered by the following vowel V2 via a rule that sets the feature [±front] in the vowel of the prefix according to the value of the same feature in the following vowel (see the formalism in 1). Alternatively, the grammar can also operate on a morphophonological level: a prefix as a morphological unit can be chosen based on the value of the following vowel.
We propose here that, using the technique in Beguš (2020b), we can elicit such rule-like behavior in deep convolutional neural networks. The analysis in Section 3.3 suggests that the Generator learns to associate z16 with the presence of a prefix. There is a substantial drop in regression estimates after the estimate for z16, which suggests that the network discretizes the continuous phonetic input and uses a single variable to encode the presence of some phonetic/phonological material that corresponds to a morphological unit: a prefix. To elicit rule-like behavior, we can identify another variable in the latent space: the variable that corresponds to the frontness/backness of the vowel V2. To identify such a variable, the 500 generated outputs are annotated for vowel (V2) frontness. We fit the data to two logistic regression models: one in which outputs with the front vowels (V2) [E, i] are coded as successes and another in which [A, O, u] are coded as successes. The independent variables are the values of the 100 latent variables z randomly sampled for each of the 500 annotated generated outputs. The models are fit using the glmnet package (Simon et al., 2011) in R (R Core Team, 2018). Lambda values are estimated with 10-fold cross-validation. Estimates of the two models are given in Figure 6. Both models uniformly suggest that z17 is the latent variable most strongly associated with determining the frontness of the triggering vowel V2. Regression estimates again suggest that the Generator network learns to encode vowel frontness with a single latent variable: there is a substantial drop of estimates after the single latent variable z17.
Negative values of z17 correspond to the presence of front [E, i]; positive values to back [A, O, u]. To elicit rule-like behavior, we force the prefix in the output and simultaneously force vowel V2 to turn from a front vowel [E, i] into a back vowel [A, O, u]. To achieve this effect, we simultaneously manipulate z16 (presence of the prefix) and z17 (frontness of the vowel). If the Generator network learned vowel harmony, then the vowel of the prefix should change together with the forced change of vowel quality. Such behavior would parallel rule-based computation: setting a single variable to a value that forces prefixation in the output and manipulating the variable that changes the conditioning environment (V2) results in a process that changes the target vowel according to the condition: vowel harmony.
To test this hypothesis, we set the value of z16 to −2.5, which forces the prefix in the output. Additionally, we generate outputs with z17 interpolated from −6 to 6 in increments of 1. 60 such sets of 13 generated samples (with z17 from −6 to 6) are generated and acoustically analyzed (780 outputs total). That z16 indeed causes the prefix to appear in the output is suggested by the count of prefixed forms: 635 out of 780 generated samples (or 81.4%) were analyzed as featuring a prefix (for an independent test of the effect of z16 on the presence of the prefix, see Section 3.3).
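The construction of the interpolated latent batches can be sketched as follows. This is a hypothetical helper, not the paper's code; z16 and z17 are written as 0-based array indices 16 and 17 purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

def interpolation_sets(n_sets=60, dim=100, z16_value=-2.5):
    """For each set, sample one base latent vector, clamp z16 (prefix) to
    -2.5, and sweep z17 (vowel frontness) from -6 to 6 in steps of 1,
    yielding 13 latent vectors per set that differ only in z17."""
    batch = []
    for _ in range(n_sets):
        base = rng.uniform(-1, 1, dim)
        base[16] = z16_value
        for v in range(-6, 7):
            z = base.copy()
            z[17] = float(v)
            batch.append(z)
    return np.stack(batch)

Z = interpolation_sets()   # 60 sets x 13 interpolation steps = 780 latent vectors
```

Holding everything but z17 fixed within a set is what licenses reading the resulting trajectories as the effect of a single latent variable.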
That z17 indeed changes the triggering vowel V2 from front [E, i] to back [A, O, u] is strongly suggested by the generated outputs. We annotate the 635 prefixed forms from the 60 sets of generated interpolated outputs for the frontness or backness of the triggering vowel V2. We fit the annotated data to a generalized additive mixed logistic regression model (GAMM; Wood 2011) with an intercept and thin-plate smooths that estimate how the presence of a front or back vowel in the output changes with the interpolated values. A random smooth for each trajectory (each of the 60 generated sets) is added to the model (estimates in Table A.12). Figure 7 suggests that z17 causes the triggering vowel to change from a front one at values in the negative range to a back one at positive values. The relationship appears to be linear even though the model does not assume linearity (GAMM). If we refit the data to a linear logistic mixed effects regression (with a random intercept for trajectory and by-trajectory random slopes), we get a significant negative correlation between the values of z17 (from −6 to 6) and the proportion of front vs. back outputs (β = −1.04, z = −5.38, p < 0.0001). Figure 7 illustrates how the rate of front vowel V2 in the output changes from almost 100% at one end of the spectrum to 0% (or 100% back vowel) at the other end.
To test whether the prefix vowel is harmonious even when the variable changing the triggering vowel is interpolated, we annotate the 635 prefixed forms from the 60 sets for the frontness of the triggering vowel V2 and for vowel harmony. The data are annotated for harmony (successes vs. failures) and fit to a generalized additive mixed effects logistic regression model. The independent variables are the frontness of the vowel (treatment-coded with back as reference) and a thin-plate smooth for the values of z17, as well as by-trajectory random smooths (estimates in Table A.13). The estimates of the parametric term suggest that the prefix vowel is harmonious both for front and back triggering vowels V2. Harmonious outputs with a back triggering vowel V2 ([A, O, u]) are significantly more frequent than non-harmonious outputs: β = 1.43, z = 4.23, p < 0.0001. That the same is true for the front vowel is clear from the estimates in Figure 7 (confidence intervals do not cross zero) and from the fact that the estimates for the front triggering vowel V2 do not differ from the estimates for the back vowel. This is confirmed if we refit the model with a sum-coded frontness factor (β = 1.41, z = 6.30, p < 0.0001). We also observe a slight negative trend in harmonious outcomes as we increase z17 and a slight positive trend for harmony in the back vowel condition, although the estimates for the smooths are not significant. This likely results from a trend that we observe in the data: as we force the triggering vowel to be front (by setting z17 to −6), the prefix is harmonious. When the vowel changes as we interpolate the value of z17, we get a higher proportion of disharmonious outputs, because the underlying value of the triggering vowel is apparently not "strongly" front or back. As the value of z17 increases towards 6 and the back vowel is forced more strongly in the output, we again get a higher proportion of harmonious outputs (now with back-vowel harmony).
Figure 8 illustrates the gradual change of the forced prefix from front (containing an [E]) to back (containing an [O]) as z17 changes the vowel V2 from a front to a back vowel. In other words, as we force a change of the triggering vowel quality from front to back with a single latent variable, the prefix (also forced with a single variable) automatically changes in order to remain harmonious.
The deep convolutional network thus appears to represent what would approximate rule-like computation in phonology: as we force the prefix in the output and change the quality of the triggering vowel from front to back by manipulating only two latent variables, vowel harmony emerges automatically. The appearance of rule-based computation is not categorical but, as is always the case in connectionist models, probabilistic: the prefix does not always change to be harmonious, and other features can change alongside the observed changes. This is, to the author's knowledge, the closest approximation of rule-based phenomena, especially considering that the models contain no language-specific mechanisms and are trained in an unsupervised manner on raw acoustic data.
It is possible that the emergence of rule-like behavior results from the choice of distribution of z-variables or other hyperparameters in the model. For example, z-variables can take a variety of distributions, from uniform and Gaussian to Bernoulli distributions. Testing how hyperparameters influence the behavior of the models and what implications this has for cognitive modeling is left for future work. A related experiment in which the latent variables have Bernoulli distributions, however, shows very similar behavior when tested on another morphophonological process, reduplication (Beguš, 2021). In the present experiment, z-variables are uniformly distributed on the interval (−1, 1). In the experiment testing reduplication (Beguš, 2021), a subset of latent variables (code variables) is Bernoulli-distributed (0 or 1), constituting a one-hot vector. Even with this distribution, interpolation and setting variables to marginal values outside of the training interval result in rule-like behavior and a near one-to-one correspondence between the Bernoulli-distributed variables and an identity-based morphophonological pattern.10 Future work should test the effects of normally distributed variables and other hyperparameters, such as the number of convolutional layers and the number of latent variables.

Paralleling neural networks and artificial grammar learning experiments
To parallel the performance of the computational experiment with results from a behavioral experiment, we combine novel data presented here for the first time with the results of an experiment in Beguš (2020a). The subjects were trained on the same data as used in the computational experiment, but divided into two separate experiments: one in which subjects were trained on data with the VN- prefix and another on data with the V- prefix. Subjects were recruited via Amazon MTurk,11 completed informed consent before participating, and were presented with experimental stimuli in Experigen (Becker and Levine, 2013). In the behavioral experiment, the unprefixed and prefixed forms were presented to subjects in pairs, where the prefixed form carried the function of a plural. Subjects were presented with a picture of a Martian creature: a single creature was associated with the unprefixed form; four creatures were associated with the prefixed form. The experimental interface is illustrated in Figure 9.
Subjects whose first language was not English or who had self-reported linguistic education were removed from the analysis. Altogether, 333 subjects who provided 1987 responses on the vowel harmony test are analyzed.12 The training phase in the VN-experiment consisted of 58 pairs of bare and prefixed forms. All examples were harmonious, and some included evidence for the local processes of post-nasal devoicing and post-nasal devoicing and occlusion (as described in detail in Section 2.2 on the data used in the computational experiment). In the V-experiment, the training phase consisted of 60 pairs of bare and prefixed forms, all of which contained evidence for harmony and some of which contained evidence for the local processes of devoicing and devoicing and fricativization (see Section 2.2). All items used in the behavioral experiment are listed in Appendix Tables A.3, A.4, A.5, A.6, A.7, A.8, and A.9.
After the training phase, the subjects were tested on six bare forms with C1 either [r] or [l] (three with a front V2 and three with a back V2) and had to choose between harmonious and non-harmonious responses in a forced choice task (see Test -local in Figure 9), as well as between various local processes. For example, subjects were presented with a stimulus ["lirO], presented auditorily and orthographically, and had to choose between the plural form eliro (harmonious) and oliro, presented only orthographically. While the behavioral experiments do not directly test whether non-local processes are more difficult to learn than local processes (this has already been confirmed experimentally in several studies; see Finley 2011, 2012; McMullin and Hansson 2019; White et al. 2018), the local process is made more difficult to learn in the experiment: subjects were explicitly instructed to learn the (non-local) distribution of prefixes (vowel harmony), but never instructed about the local processes. Moreover, the learning of local processes is tested exclusively with auditory stimuli.

10 The experiment in Beguš (2021) is trained on an InfoGAN extension (Chen et al., 2016) in which another network is introduced that forces the Generator to output informative data.
11 That the results of the experiment are not heavily influenced by the participants in the behavioral experiments being recruited via Amazon MTurk is suggested by the fact that vowel harmony outcomes are very similar to a related experiment with similar training data that was performed in person under the supervision of a research assistant and in which subjects were recruited from the general public (Beguš, 2020a).
12 For detailed discussion of the exclusion criteria, see Beguš (2020a). In the V-condition, we excluded participants with non-unique Amazon MTurk IDs as well as those IDs that had already taken the VN-experiment.
To test the learning of the non-local process in the behavioral experiment, the responses were fit to a linear mixed effects logistic regression model (lme4 package; Bates et al. 2015). First, we fit the full model with harmonic vs. non-harmonic responses (successes vs. failures) as the dependent variable, frontness of the vowel (front vs. back, sum-coded) and the shape of the prefix (VN- vs. V-, sum-coded) as the independent variables (with their interaction), and random intercepts for subject and item with by-subject and by-item random slopes for frontness. The final model was chosen based on the Akaike Information Criterion (AIC) by removing random slopes first and then interactions. The final model includes the frontness × prefix interaction and random intercepts for subject and item.
The results show that subjects learn the vowel harmony pattern from the training data (β = 0.56, z = 5.0, p < 0.0001). In other words, harmonious responses are significantly above chance level, which suggests that subjects do learn the harmonious pattern. However, the error rate is quite high: the 95% profile CI for the preference for the harmonious response is quite low ([57.6%, 69.2%]), especially given that 234 of the 270 items are bare-prefixed pairs, each of which contains evidence for vowel harmony. All regression estimates are in Table A.11.
We can directly compare subjects' responses in the behavioral experiments with the outputs of the computational experiment. The Generator network violates local distributions in the data in only three out of 168 generated outputs with a prefix and a stop or a fricative (1.8%). On the non-local task, however, the Generator's error rate is substantially higher and similar to the error rate in the artificial grammar learning experiment conducted on human subjects. Figure 10 illustrates the similarity.
To be sure, there are substantial differences between the computational and behavioral experiment. First, the comparison is necessarily superficial, because this paper does not claim that humans learn phonological patterns in the same way as deep convolutional networks; however, this does not preclude us from comparing their performance. The number of epochs in the computational experiment is ∼ 24877, while subjects were only exposed to training data once. On the other hand, human subjects were adults with full language capacity and already established phonological inventories, phonological grammar, and articulatory and perceptual mechanisms. The Generator network has to learn to produce speech-like outputs from random noise and does not contain any language-specific learning mechanisms.
This comparison in performance between human subjects and the computational model suggests that non-local processes are similarly costly for humans and for computational models of language acquisition, to the degree that the error rates across the two conditions are similar. That non-local processes are computationally costly has of course been shown before, but to our knowledge, this is the first such confirmation with a deep convolutional neural network model that is trained on the same data as human subjects and that learns speech representations from raw acoustic data.

Discussion
This paper tests learning of local and non-local processes in human speech with deep convolutional networks in the GAN architecture. More specifically, we test the learning of non-local vowel harmony and local devoicing processes in a setting that approximates morphological and phonological processes in language: the model is trained on raw speech data with bare and prefixed forms in random order.
First, we argue that deep convolutional GANs output highly informative data despite being trained on extremely small datasets (N = 270) with a high number of epochs. The outputs are acoustically analyzable and linguistically interpretable. The Generator learns local processes and phonotactic restrictions with low error rates, which suggests that training is successful for at least a subset of training objectives. As has been shown before (Beguš, 2020b; Beguš, 2021), however, the Generator also outputs innovative data that violate the training data. These violations are not random, but linguistically interpretable. 23.2% of outputs are disharmonious, and 33.7% are innovative outputs (harmonious or unprefixed) that conform to the phonotactic and distributional properties of the training data, but include unique sequences that are never present in the training data (Section 3.1). In only ∼5% of annotated outputs is the output not linguistically interpretable. Innovative outputs also suggest that the Generator does not overfit despite the high number of epochs, in line with previous work on overfitting in GANs. The finding that GANs can be trained on very small datasets should open up several new possibilities for research on deep convolutional networks, speech, and internal representations in deep convolutional networks.
An exploratory study of innovative outputs suggests that, in order to repair its violations of the data, the network uses strategies that approximate processes in human phonology: devoicing, occlusion, and redistribution of frication noise. We propose that these repairs can be followed directly as learning progresses by keeping the random latent variables constant while generating data from the network at different training steps. Acoustic analysis of outputs at different training steps in Section 3.2 identifies the strategies that the network uses to repair violations of data distributions.
One of the objectives of this paper is to explore how deep convolutional networks trained in the GAN framework on raw speech discretize linguistically meaningful representations in the latent space, especially with respect to non-local morphophonological processes. The raw acoustic data hearing human infants are faced with is continuous. Phonological computation discretizes the continuous space into discrete representations and manipulates these representations, which results in phonological processes such as vowel harmony. Using the technique in Beguš (2020b), we identify variables in the Generator's latent space that correspond to linguistically meaningful units, such as the presence of a prefix or the frontness of a vowel. Lasso regression estimates suggest that the network uses a minimal number of variables to represent the presence of a prefix in the output. In other words, the steep drop in the regression estimates after the variable with the highest estimate suggests that the network discretizes some continuous phonetic content in its internal space. The same is true for a phonetic feature such as the frontness of the first vowel in bare forms (V2): the network appears to primarily use a single variable to encode this phonetic property of the outputs. An independent generative test suggests that manipulating this one variable on a linear scale well outside the training range (from −6 to 6) results in a gradual and linear transition from a front to a back first vowel (Figure 7). This paper argues that an approximation of a symbolic rule emerges as an interaction between latent variables in deep convolutional networks. To test the learning of the non-local vowel harmony, we force a prefix in the output with a single variable (z16 at −2.5) and force the change of the triggering vowel from front to back with a linear interpolation of a single variable (z17).
The statistical tests in Section 3.5 suggest that the generated outputs remain harmonious in the majority of cases despite the change of the triggering vowel. In other words, rule-like vowel harmony emerges automatically in a deep convolutional network from an interaction of the variable that forces some morphophonological entity (the prefix) in the output and the variable that changes the triggering segment. While harmonic outputs are significantly more frequent than non-harmonic outputs, the distribution is probabilistic rather than categorical. Another trend emerges from the statistical tests: the outputs are more likely to be non-harmonic in the transition period when the triggering vowel changes from front to back. It is likely that the relative strength of frontness and backness affects the rates of harmonic vs. non-harmonic outcomes. In other words, it appears that prefix harmony is not triggered until the frontness/backness feature of the triggering vowel is strong enough, i.e. has a high enough latent variable value. That phonological features bear inherent weights (which can be conceptualized as strength or latent variable values in our model) has been argued before in the Optimality Theoretic framework (Smolensky and Goldrick, 2016; Smolensky et al., 2019).
Phonological computation has been shown to favor local processes over non-local processes. Many studies show experimentally that the learning of non-local processes is more difficult (Finley, 2011, 2012; McMullin and Hansson, 2019; White et al., 2018). This learning bias is also reflected in typology: the majority of phonological processes in the world's languages are local (Finley, 2011). A clear preference for locality emerges in our computational experiment as well: despite substantially more evidence for the non-local process in the training data, the error rate is significantly higher in the non-local condition in the Generator's outputs. Whether the prevalence of some patterns in human speech results from articulatory factors (e.g. the articulation of sounds is most strongly affected by the immediately preceding or following sounds) or from learnability (e.g. the learning of non-local processes is more difficult) has been a focal topic of discussion in phonology, linguistics, and cognitive science in general. While this result does not offer an answer as to whether the preference for locality in typology results from learning or from a language's cultural transmission (Beguš, 2020a), it does provide evidence that locality preferences can be explained with domain-general cognitive mechanisms using deep neural networks.
It is possible that the Generator network violates the non-local vowel harmony relatively frequently (in 23.2% of the outputs) because it is not fully trained and potentially converges on a local optimum. Even if this is the case, the results are nevertheless informative for our objectives. First, the Generator is clearly well trained on the local processes: error rate for the local process of devoicing is 1.8%. Second, the Generator is well trained on the phonotactic restrictions in the training data: the error rate for the phonotactic restrictions is 0% if we exclude unanalyzable outputs (constituting only 6.5% of the outputs). Since our primary objective is to compare the learning of local and non-local processes in speech, the fact that local processes are well learned, and significantly better compared to the non-local process (see Section 3.4), suggests that non-local processes are more difficult to learn than local processes in deep convolutional networks in the GAN framework. Finally, this paper illustrates the importance of analyzing the models at different training steps (as proposed in Section 3.2) when the primary objective is probing learning representations, neural network interpretability, cognitive modeling, or linguistic relevance of the models. One of the potential concerns in fully trained models is the so-called ceiling effect. If the model were able to perform equally well on both local and non-local processes, we might erroneously conclude that local and non-local processes are equally learnable, whereas one could have been learned substantially earlier in the training than the other.
Because GANs trained on small datasets produce informative results, we can use the same stimuli for training deep convolutional networks and in artificial grammar learning experiments on human subjects. We compare data from a behavioral experiment that tested the learning of vowel harmony. The results show a similar error rate across the computational and artificial grammar learning experiments. It is true that the Generator network does not output vowel harmony categorically (as opposed to local processes, which are near-categorical), but neither do the human subjects tested in the behavioral experiment perform at a categorical level. This suggests that non-local processes are, from a learnability viewpoint, similarly costly for the deep convolutional network and for human subjects.