Songbirds are excellent auditory discriminators, irrespective of age and experience

https://doi.org/10.1016/j.anbehav.2021.02.018

Human infants but not adults possess the ability to perceive differences between non-native language phoneme categories. The predominant explanation for this age-related decline in discriminative ability is the effect of statistical learning driven by sensory exposure: phoneme categories of the native language take precedence, have a higher frequency of occurrence and may encompass category distinctions in non-native languages. Alternatively, one could explain the decline through a reduction in discriminative abilities attributable to ageing. Thus, to what extent is auditory perception influenced either by experience or by age-related processes? Here, we attempted to answer this question, which cannot easily be disentangled in humans, in songbirds, which share many properties with humans: both learn the statistical distribution of sounds in their environment, both possess neural circuits to process vocalizations of their own species, and plasticity in these circuits is subject to critical periods. To study the effects of experience and ageing, we trained zebra finches, Taeniopygia guttata, to discriminate short from long versions of a single zebra finch song syllable type. Birds in four groups distinguished by their age (old versus young) and level of auditory experience (with song experience versus completely isolated from song) could learn to discriminate arbitrarily fine differences between song syllables, although we found a trend that upholds the statistical learning hypothesis: birds with song experience performed better than birds with no experience. Furthermore, birds in all groups were able to generalize their learning to new stimuli of the same type, and they were able to rapidly adapt their learned discrimination boundaries.
Finally, we found that songbirds could accurately discriminate randomly selected renditions of a stereotyped adult song syllable, revealing a flexible ability to discriminate conspecific vocalizations. © 2021 The Authors. Published by Elsevier Ltd on behalf of The Association for the Study of Animal Behaviour. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

A crucial open question is whether categorical perception is the result of experience, of ageing, or of both. In support of experience-dependent mechanisms, language exposure is widely thought to shape the formation of auditory category boundaries (Abramson & Lisker, 1973; Kuhl, 2004), in agreement with theoretical models that adopt a statistical learning perspective (Kuhl, 2004; Vallabha, McClelland, Pons, Werker, & Amano, 2007).
In contrast with the statistical learning hypothesis, an ageing brain can have a significant impact on behaviour through changes in neural structure during critical periods of synaptic plasticity (Gordon & Stryker, 1996; Hensch, 2005). For example, Peña, Werker, and Dehaene-Lambertz (2012) showed that infants born 3 months preterm have, at a given maturational age, been exposed to their native language for longer than full-term infants, but that this additional exposure does not accelerate speech acquisition, indicating that age-related maturational processes constitute a bottleneck for language acquisition.
The age versus experience question is difficult to test in humans without isolating infants from exposure to their native language. For this reason, we tried to disentangle the roles of experience and age by studying auditory discrimination in a model system, namely songbirds (Passeriformes), a group of vocal learners, like humans and whales. In most songbird species, songs are used to signal individual identity for territorial defence or to signify sexual intent, and are considered vital for sexual selection (Byers & Kroodsma, 2009). Similar to humans, in songbirds, developmental vocal learning happens early in life and is triggered by exposure to adult vocalizations (Barrington, 1773; Brainard & Doupe, 2000; Doupe & Kuhl, 1999). Songbirds such as the zebra finch, Taeniopygia guttata, also show striking parallels to humans in the sensory processing of vocalizations, as exemplified by the manner in which they distinguish both human vocalizations (Ohms, Escudero, Lammers, & ten Cate, 2010, 2012; ten Cate, 2014) and conspecific vocalizations (Sturdy, Phillmore, Price, & Weisman, 1999). Some songbirds also show categorical perception of vocalizations, for example swamp sparrows, Melospiza georgiana (Nelson & Marler, 1962; Prather, Nowicki, Anderson, Peters, & Mooney, 2009).
Few comparative studies in songbirds have directly addressed the respective roles of age and experience in song perception. Braaten, Petzoldt, and Colbath (2006) and Sturdy, Phillmore, Sartor, and Weisman (2001) previously compared juvenile songbirds to conspecific adults, as well as isolated to vocalization-experienced songbirds, in auditory categorization tasks. However, the acoustic stimuli in both studies were not ideal for assessing categorical perception as described in the psychology of speech perception: Braaten et al. trained zebra finches to discriminate conspecific songs from the same songs played in reverse order, while Sturdy et al. (2001) used synthetic pure-tone stimuli. Furthermore, neither study examined all four conditions (i.e. young, old, experienced and isolated) simultaneously, which is a prerequisite for weighing age against experience.
Here, we trained male and female zebra finches following an established Go/NoGo operant conditioning paradigm (Canopoli, Herbst, & Hahnloser, 2014; Tokarev & Tchernichovski, 2014) to discriminate short from long versions of a single zebra finch song syllable. We used this arbitrary category choice to test whether zebra finches can categorize their native sounds into classes that seemingly carry no functional meaning.
We divided the birds into four groups factored by age and auditory experience: adult (A; >90 days old) or juvenile (J; 35 days old at the start of the experiment), with experience, i.e. exposed to song (+), or isolated from song exposure (-) (notation: A+, A-, J+ and J-).
Our experimental design was inspired by the infant versus adult comparison in the Werker and Tees (1981) study, in which infants with English as their native language were exposed to a stream of phonemes belonging to a phonetic contrast (e.g. /ta/ versus /Ta/ from Hindi or /ba/ versus /da/ from English). When a phonetic boundary was crossed (e.g. /ta/ -> /Ta/), the infants were conditioned to respond with a head turn to indicate the change. English-learning infants were as good as Hindi-speaking adults at discriminating Hindi phonemes, even though the distinction between /ta/ and /Ta/ has no functional meaning in English. Given our stimulus choice as an analogy to the human infant work, we hypothesized that older, more experienced birds (A+ group) would fare the worst at the task, and younger, inexperienced birds (J-) would fare the best. We performed additional experiments to characterize the percepts learned by the birds.

Animals
We tested 34 zebra finches, 15 males and 19 females. Birds were partitioned into four groups, factored by age and song experience. Here, the term 'song experience' refers to auditory exposure to normal adult zebra finch songs.
(1) The A+ group (adult with song experience; six females, three males) were raised in a cage by their parents and siblings until about 65 days posthatch, and then they were transferred to a cage with same-sex older juveniles and adults. They were introduced to the experimental set-up at an age >90 days posthatch. (2) The A- group (adult without song experience; five females, four males) were transferred to a sound isolation chamber (custom design) at 15 days posthatch with their mother and female siblings and reared until >90 days posthatch, following which they were introduced to the experimental set-up. Because female zebra finches do not sing, the absence of adult male birds during development ensures a lack of song experience (A-). (3) The J- group (juvenile without song experience; five females, three males) were reared in song isolation (as for A- birds above) from 15 to 33 days posthatch and were subsequently introduced to the set-up at 34 or 35 days posthatch.
(4) The J+ group (juvenile with song experience; three females, five males) were introduced to the set-up at 34 or 35 days posthatch, after being raised by their parents (including the father) and siblings in the same cage.
Apart from the experimentally tested animals, another 34 birds (all female) were used as a social incentive for the tested birds (see Experimental Apparatus).

Ethical Note
All experiments were approved by the Veterinary Office of the Canton of Zurich, Switzerland. The experiments were performed under the specific licence ZH207/2013. The experimental birds were born and reared in our avian breeding facility, in agreement with the Swiss Animal Welfare Act and Ordinance (TSchG, TSchV, TVV). After the experiment they were returned to the aviary. The birds were handled only when they were taken from the aviary to their home cage in the experimental chamber and twice when their home cages were cleaned during the experiment. Apart from this, human disturbance occurred once a day when their food, water, grit and cuttle bone were replaced.
Our experiment used air puffs as an aversive conditioning stimulus. Birds learned to discriminate stimuli based on the predicted occurrence of an air puff after a 1 s delay. Although the air puff pressure was kept high enough to displace birds from their perch, none of the birds showed any signs of injury or panic such as incessantly flying around the cage or calling. Birds continued to voluntarily perform hundreds of trials per day and improved at the task over several days. In our experience viewing (via webcam) birds performing the task, we did not observe any noticeable increase in anxiety among the birds. On completion of training, we checked the false negative rate (FNR; staying on the perch and receiving the air puff on Puff trials) and the false positive rate (FPR; escaping on NoPuff trials) as a proxy for noticeable signs of anxiety. In general, FPR and FNR were low across birds (FPR = 0.31 ± 0.18; FNR = 0.14 ± 0.1; N = 34). We found no statistically significant differences in either FPR or FNR between groups (Kruskal-Wallis test of difference in FPR distribution among the four groups (A+, A-, J+, J-): χ²(3) = 0.9, P = 0.83, N = 9, 9, 8, 8; Kruskal-Wallis test of difference in FNR distribution between groups: χ²(3) = 1.04, P = 0.79, N = 9, 9, 8, 8).

Experimental Apparatus
We adapted a Go/NoGo operant conditioning paradigm including a component of social reinforcement (Narula, Herbst, Rychen, & Hahnloser, 2018; Tokarev & Tchernichovski, 2014). During the experiment, all birds were housed with unrestricted access to food, water for drinking and bathing (separate), grit and cuttle bone, all provided in individual cages (30 × 30 cm and 40 cm high; Qualipet, 8305 Dietlikon, Switzerland) placed inside a custom sound isolation chamber. The cage bedding was made of dry wood chips. The light:dark cycle was 14:10 h. The temperature in the isolation chamber was maintained at room temperature, approximately 25 °C, and it never exceeded 30 °C. The chamber contained a speaker for playing the stimuli, a microphone for sound recordings and a webcam. The cages for the experimental bird and a female companion bird were placed adjacent to each other. Each cage contained three perches, two for unrestricted food and water access and a third (window perch) for viewing the other cage. This perch was equipped with a Hall sensor that measured deviations of a magnetic field that occurred when a bird sat on the perch. These deviations were used as a signature of perch occupancy. We placed a cardboard screen with a small (15 × 15 cm) peeping window between the two cages to block the view into the other cage from all vantage points except the window perch. Experimental birds and companion birds frequently visited their window perches (henceforth referred to as 'perches') to interact with each other.

Stimuli

Duration discrimination (training set)
For the training set (TRAIN), we created a set of 10 stimuli synthesized from the songs of an adult male zebra finch (o7r14) from our colony. Songs were filtered with a fourth-order Butterworth high-pass filter (600 Hz cutoff) and digitized at 32 kHz. We collected all renditions of a syllable produced during 1 day of singing in February 2015. We computed syllable durations via thresholding of sound amplitude traces. Based on the full range of durations of the selected syllable, we defined 10 stimuli of increasing duration, ranging from 144.1 to 190.8 ms. Each stimulus Si (i = 1, 2, …, 10) in this set was made of a string of six syllable renditions, wherein each rendition was longer than the six renditions in stimulus Si-1. Within a stimulus, the six renditions were arranged in order of increasing duration. In total, the stimulus set comprised 60 different syllable renditions (10 stimuli of six renditions each). To avoid sound onset artefacts, we smoothed syllable onsets and offsets by multiplying the sound waveform in the time domain with sigmoid functions of width 16 ms. Intersyllable gaps were set to 22 ms. All stimuli were produced with MATLAB (Mathworks Inc, Natick, MA, U.S.A.). Stimuli were played at 70-75 dB sound pressure level from a Harman-Kardon speaker (HKS 4 BQ 2.0, Harman Deutschland GmbH, 3098 Schliern bei Koeniz, Switzerland), preamplified with an Alesis RA150 amplifier (Alesis, Cumberland, RI, U.S.A.).
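The onset/offset smoothing and concatenation steps above can be sketched as follows. This is a stdlib-only Python illustration (the original stimuli were produced in MATLAB); the sigmoid steepness and the synthetic waveforms are illustrative assumptions, while the 16 ms ramp width, 22 ms gap and 32 kHz rate come from the text:

```python
import math

FS = 32000            # sampling rate (Hz), from the text
RAMP_S = 0.016        # sigmoid ramp width (16 ms)
GAP_S = 0.022         # intersyllable gap (22 ms)

def smooth_edges(wave, fs=FS, ramp_s=RAMP_S):
    """Attenuate a syllable's onset and offset with sigmoid ramps
    to avoid clicks (the steepness of 12 is an illustrative choice)."""
    n = round(ramp_s * fs)
    out = list(wave)
    for i in range(min(n, len(out))):
        g = 1.0 / (1.0 + math.exp(-12.0 * (i / n - 0.5)))  # rising sigmoid
        out[i] *= g          # fade in
        out[-(i + 1)] *= g   # fade out (mirrored)
    return out

def build_stimulus(renditions, fs=FS, gap_s=GAP_S):
    """Concatenate renditions in order of increasing duration,
    separated by silent gaps, smoothing each rendition's edges."""
    gap = [0.0] * round(gap_s * fs)
    stim = []
    for k, r in enumerate(sorted(renditions, key=len)):
        if k:
            stim += gap
        stim += smooth_edges(r)
    return stim
```

Calling `build_stimulus` on six renditions yields one stimulus whose length is the summed rendition lengths plus five 22 ms gaps.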
Based on the 10 stimuli, we defined two stimulus classes: the class 'short' was formed by stimuli S1 to S5, and the class 'long' was formed by stimuli S6 to S10. The stimuli were distinguishable by duration but also by other sound features (Fig. 1b, right), allowing birds to 'overfit' their discriminative systems.
We implemented a Go/NoGo operant conditioning paradigm using an aversive air puff that followed the playback of stimuli from one of the two classes, either short or long. We counterbalanced the aversively reinforced class across birds. For all groups exposed to the same training set, there was no difference in learning time between short- and long-puffed birds (Appendix 1). We therefore use the terms Puff and NoPuff as class labels, irrespective of whether short or long stimuli were reinforced.

Permuted set
To create a permuted set (PERM) in which the duration of each of the five stimuli within a class was the same, we separately randomized the position of each of the 30 syllable renditions Sij (i: syllable position, 1-6; j: stimulus identity, 1-5 or 6-10) within each class in TRAIN (see Appendix 2 and Fig. A1).

NOV set
To test for the generalization of learned categories, we created a novel set (NOV) of 10 stimuli {S'1, …, S'10} from renditions of the same syllable recorded on the next day and processed in the same manner. The stimulus durations in NOV tended to be slightly shorter than in TRAIN, including near the class boundary (Fig. 1c).

Moved boundary set
To move the boundary between stimulus classes, we increased the stimulus duration defining the boundary between short and long stimuli by 8 ms towards the long category: stimuli S6-S8, originally defined as belonging to the long category, were now deemed short (i.e. mapped to S3-S5), and three new long stimuli, taken from the original recording used to create TRAIN, were added to the long category (new S8-S10; Fig. 1c). This was termed the moved boundary set (MOV).

Randomly shuffled set
We constructed a random set (RAND) of 10 stimuli by randomly assigning the 60 song syllables from TRAIN to the Puff and NoPuff classes, resulting in a uniform distribution of durations across both classes (Fig. 1c).
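A class assignment of this kind takes only a few lines; the sketch below uses syllable durations as stand-ins for the 60 recorded renditions (the duration values and the fixed seed are illustrative assumptions):

```python
import random

def random_classes(renditions, seed=0):
    """Randomly split 60 renditions into Puff and NoPuff classes of 30
    each, ignoring duration, so both classes span the full duration range."""
    rng = random.Random(seed)
    pool = list(renditions)
    rng.shuffle(pool)
    return pool[:30], pool[30:]

# represent each rendition by its duration (ms), spanning 144.1-190.8 ms
durations = [144.1 + i * (190.8 - 144.1) / 59 for i in range(60)]
puff, nopuff = random_classes(durations)
```

Because the split is made independently of duration, the two classes have, in expectation, identical duration distributions, as in Fig. 1c.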

Pretraining phase
The experiment began with a pretraining phase that lasted 3-4 days. On the first day, birds could accustom themselves to the set-up, discover the perch and use it to view the social incentive. From the second day on, sitting on the perch (which led to an upward deflection of the Hall sensor signal) for more than 3.5 s triggered the playback of a stimulus (Fig. 1a). In the pretraining phase, we played the two most easily distinguishable stimuli from TRAIN, namely the shortest and the longest (S1 from the short class and S10 from the long class). One of the two classes of stimuli was associated with the aversive reinforcer: a puff of air delivered 1 s after the stimulus offset.
We gradually increased the strength of the puff by increasing its duration. The duration ranged from 0.03 s (light) to 1 s (strong), which we accomplished by approximately doubling the duration each day. With increasing puff strength, the probability of displacing a bird from the perch also increased, and as a result avoidance behaviour such as flying away from the perch to an escape perch also increased.
The probability of presenting a stimulus from the punished class was kept at 0.25. In a pilot study not included in this paper, we found that a probability of 0.5 induced anxiety in the birds, which we inferred from their escaping from the perch on almost every trial. A probability of 0.25 led to birds staying on the perch longer and more often.
Birds voluntarily performed more than 500 trials per day, and we noticed no adverse effects of the air puff. We monitored birds' behaviour through a webcam placed in the chamber. Across all birds, significant differences between the probabilities of escape associated with the Puff and NoPuff classes appeared after about 1 week. We used the z test of independent proportions (see Performance Measures and Criterion) to test for significance of this difference in escape probability. Once differences were significant on 2 consecutive days, we switched the birds to the training phase and maintained a puff duration of 1 s (strong).

Training phase
The training phase was identical to the pretraining phase except that we presented all TRAIN stimuli in a pseudorandom order, with the probability of an air puff stimulus held constant at 0.25. Completion of the training phase was indicated when birds' performance reached a target criterion. We chose the performance criterion to reflect both the discriminatory ability and the stability of behaviour (see Performance Measures and Criterion).
In several birds, we observed that the performance measure fluctuated from day to day. We trained these birds for a further 2-3 days after they reached the performance criterion, to obtain stable estimates of discriminatory ability. Following the training phase, birds were partitioned into groups, each performing a particular subset of the experiments.

Experimental transitions
Changes of stimulus sets were immediate. Stimulus sets were ordered according to the following rules. (1) NOV and PERM: if a bird was assigned to both the NOV and PERM groups, the order of NOV and PERM presentation was chosen at random, to average out any effect of stimulus set order (among NOV and PERM) across birds (note that in the first few birds tested with the PERM set, no reduction in performance was observed).
(2) NOV/PERM and MOV: if a bird was tested on MOV as well as on NOV/PERM, the MOV set was always introduced after NOV and/or PERM because the class boundary in the MOV set was shifted, which could affect subsequent discrimination performance. Thus, MOV was never followed by NOV, because the bird would have to readjust its decision boundary twice, first after the transition TRAIN -> MOV, and then again after the transition MOV -> NOV. In our view, this readjustment would impede test performance on NOV and so would not truly test the bird's generalization ability based on TRAIN.
(3) RAND group birds were not tested on any other stimulus set orders apart from TRAIN followed by RAND.

Performance Measures and Criterion
For each bird and all experimental phases, we partitioned the trials into nonoverlapping blocks of 100 trials. We chose this number to obtain sufficient trial statistics for performing z tests of independent proportions. In each 100-trial block, we computed the true positive (escape) rate (PT) as the probability of escaping on Puff trials and the false positive (escape) rate (PF) as the probability of escaping on NoPuff trials. Our single measure of performance in each block was the difference in escape probabilities, dPesc = PT - PF (Fig. 1d).
Within a block, to decide whether a bird escaped significantly more often on Puff trials than on NoPuff trials, we performed a z test of independent proportions with null hypothesis H0: PT = PF and alternative hypothesis Ha: PT ≠ PF. To obtain z scores for the test, we applied Yates's continuity correction. Thus, for each block we computed the z statistic according to:

z_stat = (|PT - PF| - (1/(2nT) + 1/(2nF))) / sqrt(P(1 - P)(1/nT + 1/nF)),

where P is the pooled escape probability across all trials in the block, nT is the number of Puff trials and nF is the number of NoPuff trials in each block. The P value Pr[z > z_stat] was computed with the normcdf function in MATLAB (Mathworks Inc, Natick, MA, U.S.A.); a block was statistically significant if the P value in that block was smaller than 0.01.
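The per-block test can be reproduced with a short stdlib-only Python sketch (the paper used MATLAB's normcdf; here the normal CDF is obtained from math.erf, and the example trial counts are illustrative):

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def block_z_test(esc_puff, n_puff, esc_nopuff, n_nopuff):
    """Yates-corrected z test of independent proportions for one block.

    Returns (z_stat, p_value) with the one-sided P value Pr[z > z_stat],
    as described in the text."""
    pT = esc_puff / n_puff        # escape rate on Puff trials
    pF = esc_nopuff / n_nopuff    # escape rate on NoPuff trials
    pooled = (esc_puff + esc_nopuff) / (n_puff + n_nopuff)
    num = abs(pT - pF) - 0.5 * (1 / n_puff + 1 / n_nopuff)  # Yates correction
    den = math.sqrt(pooled * (1 - pooled) * (1 / n_puff + 1 / n_nopuff))
    z = max(num, 0.0) / den if den > 0 else 0.0
    return z, 1.0 - norm_cdf(z)

# e.g. a block with 25 Puff and 75 NoPuff trials (Puff probability 0.25)
z, p = block_z_test(20, 25, 15, 75)  # escapes mostly on Puff trials
```

A block then counts as significant when the returned P value is below 0.01.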

Statistical Tests
Performance comparisons between groups were performed with either parametric (t test and ANOVA F test) or nonparametric (Wilcoxon rank sum and Kruskal-Wallis) tests. Before performing a t test or Wilcoxon rank sum test, we ran a Shapiro-Wilk test of normality. A t test was performed when the null hypothesis of a normal distribution was not rejected (Shapiro-Wilk P > 0.05). All tests were two tailed, and paired-sample tests are mentioned explicitly. Note that, when comparing performance between NOV and TRAIN groups, only a small subset of the birds performed both sets, which prevented us from using a paired-sample test. Also, one of the TRAIN birds was part of another experiment (Narula et al., 2018), in which it acted as an observer bird. As a result of having been an observer, it learned the TRAIN set discrimination faster than an average bird. For this reason, we did not include its TRAIN performance in the statistical comparison with NOV; hence N = 11 for TRAIN.

Task Completion, Criterion and Trials to Criterion
If a bird did not improve its performance (dPesc) for at least 5000 trials, based on visual inspection of daily performance and on statistical testing (using z tests as described above), we stopped its training. We measured dPesc at training completion to be either the dPesc value after these 5000 trials or, when the bird's performance improved gradually, the value after reaching criterion.
We established our performance criterion to include two key features: a statistically significant difference between PT and PF, and stability of this behaviour over several hundred trials. That is, we computed the fraction of 100-trial blocks that were significant in a sliding window of eight consecutive blocks. When this fraction reached 87.5% (i.e. 7/8 blocks), we took the last block in the window as the block at which the performance criterion was reached (the 'criterion block'). 'Trials to criterion' is then simply the number of all trials performed by the bird up to and including the criterion block.
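The criterion computation above reduces to a sliding-window scan over per-block significance flags; a minimal Python sketch (the flag vector is illustrative):

```python
def criterion_block(significant, window=8, threshold=7 / 8):
    """Index (0-based) of the block at which the performance criterion is
    reached: the last block of the first 8-block window in which at least
    7/8 blocks are individually significant; None if never reached."""
    for end in range(window, len(significant) + 1):
        if sum(significant[end - window:end]) / window >= threshold:
            return end - 1
    return None

def trials_to_criterion(significant, trials_per_block=100):
    """Total number of trials up to and including the criterion block."""
    cb = criterion_block(significant)
    return None if cb is None else (cb + 1) * trials_per_block

flags = [0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1]  # one flag per 100-trial block
print(trials_to_criterion(flags))  # -> 1000
```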

RESULTS
In the duration discrimination task, several birds did not reach our learning criterion within 5000 trials, which resulted in training being aborted. To test whether this happened more often in some bird groups than in others, we performed two Fisher exact tests of independence between each factor (experience or age) and training completion. First, we found that a larger proportion of birds with auditory experience (0.71) completed training than of birds without (0.35), and the difference in proportions showed a trend towards significance (Table 1). Second, we found that similar fractions of juveniles (0.44) and adults (0.61) completed training (Table 2). We compared the discriminative performance (dPesc) of birds at the end of training (training completion), either when birds reached the training criterion or after 5000 trials when training was aborted due to poor performance. The average Pesc curve for each group at training completion is depicted in Fig. 1e. On the group level, there was no significant difference between sample distributions of dPesc for birds that completed training (Fig. 1f, Table 3). We then performed pairwise comparisons between groups. Juveniles performed marginally better than adults (mean ± SD: juvenile dPesc = 0.32 ± 0.13, N = 16; adult dPesc = 0.3 ± 0.2, N = 18), but the difference was not statistically significant (two-sample t test: t29.6 = -0.46, N1 = 16, N2 = 18, P = 0.65; Fig. 2a). Birds with auditory experience (dPesc = 0.37 ± 0.12, N = 17) fared significantly better than those without (dPesc = 0.25 ± 0.18, N = 17; two-sample t test). We also computed the number of trials birds needed to reach the performance criterion but found no effect of either age or experience on learning speed (Table 3).
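The Fisher exact test on the age factor (Table 2: 11 of 18 adults and 7 of 16 juveniles completed training) can be reproduced with a stdlib-only sketch. Note that the reported odds ratio of 1.98 is presumably a conditional estimate; the simple sample odds ratio is (11 × 9)/(7 × 7) ≈ 2.02:

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]].

    Returns (sample_odds_ratio, p_value); the two-sided P value sums the
    hypergeometric probabilities of all tables with the same margins that
    are no more likely than the observed table."""
    r1, r2, c1, n = a + b, c + d, a + c, a + b + c + d

    def p_table(x):  # probability of observing x in the top-left cell
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)

    p_obs = p_table(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    p = sum(p_table(x) for x in range(lo, hi + 1)
            if p_table(x) <= p_obs * (1 + 1e-9))
    odds = (a * d) / (b * c) if b * c else float('inf')
    return odds, p

# Table 2: completed/aborted counts for adults (11/7) and juveniles (7/9)
odds, p = fisher_exact_2x2(11, 7, 7, 9)
print(round(odds, 2), round(p, 2))  # -> 2.02 0.49
```

The resulting P value of about 0.49 matches the value reported with Table 2.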
To quantify the explanatory effects of age and auditory experience on discrimination performance, we fitted a general linear model to dPesc at criterion, using age and auditory experience as factors. We observed a main effect of auditory experience (Table 4).
Following this negative result, we set out to characterize the categorical percept formed by the birds and to probe its flexibility.
First, an open question was whether birds discriminated the string of six syllables in any training set stimulus as a composite pattern, or whether they extracted class-predictive information from each of the six syllables and integrated (accumulated) this information before making a decision. When we tested birds on the PERM set of stimuli (see Methods, Stimuli) we found that they integrated information from each syllable, and did not use the composite pattern when making a decision (Appendix 2).
Our stimulus set consisted of actual syllable renditions, which reflected the natural within-syllable distribution of acoustic features. Apart from duration, birds could have used other features for discrimination. We used the acoustic analysis software Sound Analysis Pro (soundanalysispro.com) to compute song features such as pitch goodness, mean frequency modulation, variance of amplitude modulation and Wiener entropy (see Appendix 3, Table A1 for the full list of acoustic features). Indeed, we found that features such as mean frequency modulation and variance of amplitude modulation across a syllable's time course were significantly correlated with duration (Pearson correlation coefficient: mean frequency modulation versus duration: r = -0.37, N = 60, P = 0.004; variance of amplitude modulation versus duration: r = -0.73, N = 60, P = 3.8 × 10^-10; Fig. 1b, right). In previous work (Narula et al., 2018), we fitted a logistic regression classifier using these exact stimuli and stimulus features and showed how an unregularized classifier can develop a strong dependence on the aforementioned correlated features, which results in very high classification performance on the training set but poor generalization to novel stimuli, an effect termed 'overfitting' in the statistical learning literature (Geman, Bienenstock, & Doursat, 1992). Similarly, we hypothesized that if birds weighted such nonduration features highly, they might also show a tendency to overfit, which would be contrary to the idea of a categorical percept that generalizes well to other instances of the same category. If the overfitting hypothesis were correct, we should observe a significant drop in performance when birds were made to generalize their percept. We tested a subset of 12 birds (5 A+, 3 J+, 2 A- and 2 J-), after they reached the criterion on TRAIN, on new renditions of the same song syllable (novel set, NOV), with durations closely matching those in TRAIN (Fig. 1c).
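The reported feature-duration correlations are plain Pearson coefficients over the 60 renditions; computing r takes only a few lines (a stdlib sketch with made-up feature values; the actual analysis used Sound Analysis Pro features):

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# a feature that decreases with syllable duration yields r < 0, as for
# amplitude-modulation variance in the text (these values are made up)
durations = [144, 150, 160, 170, 180, 190]
feature = [5.0, 4.6, 4.1, 3.3, 2.8, 2.1]
print(pearson_r(durations, feature))  # strongly negative, close to -1
```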
We observed large dPesc values after switching from TRAIN to NOV (Fig. 2b, red squares), revealing a performance comparable to that with the training set, which implies good generalization. The average dPesc in the first 300 trials of NOV was on a par with dPesc at criterion on TRAIN (average dPesc in first three blocks of NOV = 0.37 ± 0.14, N = 12; average dPesc in last three blocks of TRAIN = 0.42 ± 0.11, N = 11; two-sample t test: t20.5 = 1.1, N1 = 12, N2 = 11, P = 0.28; Fig. 2e). Birds achieved a dPesc after reaching criterion on the NOV set of 0.44 ± 0.15, which was significantly more accurate than performance on the training set (two-sample t test on dPesc at criterion NOV versus dPesc at criterion TRAIN: t20.7 = -2.49, N1 = 12, N2 = 11, P = 0.021). Also not surprisingly, birds reached the criterion on NOV significantly faster (trials to criterion NOV: 1.8 ± 1.3 × 10^3, N = 12) than on TRAIN (5.9 ± 6.3 × 10^3 trials, N = 11; Wilcoxon rank sum test: Fig. 2h). These findings indicate that birds tended to focus their attention on duration as a discriminating feature and less on correlated features. Fast and accurate generalization to novel stimuli suggests that birds learned the category underlying the stimuli, in some sense the 'category' defined by duration. Categorical perception in human speech postulates that vocalizations belonging to the same category (e.g. a syllable such as 'b' in English) cannot be discriminated from each other, but are easily discriminated against syllables of another, phonetically similar category (e.g. 'p' as a comparison to 'b' in English; Eilers, Gavin, & Wilson, 1979).

Table 2
Number of juveniles and adults and their training outcomes

          Training completed  Training aborted  Total
Adult     11                  7                 18
Juvenile  7                   9                 16
Total     18                  16                34

Fisher exact test of an association between age and training completion: odds ratio = 1.98, N = 34, P = 0.49.
Essentially, perceptual categories are indivisible and clearly separated from other categories. We explored whether the arbitrary duration categories learned by birds in the training phase possessed such an indivisible nature. We tested this feature of categorical perception with the MOV set of stimuli. If birds had learned an indivisible categorical percept such as phonemes in human speech, we expected to see a significant drop in performance when tested on the MOV set.
We tested seven birds on the MOV set after they reached the criterion on TRAIN (see Methods, Experimental transitions for details on the transition to MOV). We observed behaviour similar to that on NOV: dPesc trajectories showed no decline after stimuli were switched to MOV (Fig. 2c). In the first 300 trials of MOV, the birds were as accurate as in the last 300 trials of TRAIN (average dPesc MOV set = 0.29 ± 0.18, N = 7; average dPesc TRAIN = 0.36 ± 0.08, N = 6; Wilcoxon rank sum test: W = 29, N1 = 6, N2 = 7, P = 0.62; Fig. 2f). Good early performance on the MOV set shows that the birds required very few trials to adapt their learned duration boundaries. Thus, the seemingly learned categorical percept in birds was not indivisible, because the boundaries were flexible. On the MOV set, birds achieved a dPesc at criterion of 0.39 ± 0.15 within 2.26 ± 1.78 × 10^3 trials, numerically faster than on TRAIN (TRAIN trials to criterion = 5.9 ± 6.3 × 10^3, N = 6; Wilcoxon rank sum test: W = 20.5, N1 = 6, N2 = 7, P = 0.11; Fig. 2i). That birds learned to rapidly categorize MOV stimuli was indeed a feat, not to be expected from naïve and inflexible responses to the moved boundary. To simulate naïve responses, for each bird in the MOV set, we constructed a synthetic performance vector Pesc^s of length 10 (one component for each stimulus S1-S10) by removing from the measured Pesc on TRAIN the three components Pesc_i (i = 1, 2, 3), shifting the Pesc curve towards the shorter stimuli (Pesc^s_i = Pesc_(i+3), i = 1, …, 7) and adding three naïve components Pesc^s_i = 0.5 (i = 8, 9, 10). This construction captures the hypothesis that birds maintain their responses to the known stimuli and are ignorant of the three new stimuli. Thus, simulated inflexible birds would get four stimuli right (MOV stimulus IDs S1-S2 and S6-S7) and three wrong (MOV stimulus IDs S3-S5), and they would be undecided on the three remaining stimuli (MOV stimulus IDs S8-S10).
On these synthetic birds with their hypothetical Pesc^s vector, our test very sensitively detected a drop in performance upon switching from TRAIN to MOV (Appendix 4, Fig. A2). Thus, the lack of disrupted performance of our tested birds on the MOV set emphasizes how quickly birds can adapt their decision boundary.
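As a minimal sketch (function and variable names are ours, not taken from the study's analysis code), the synthetic performance vector described above can be constructed as:

```python
import numpy as np

def synthetic_pesc(pesc_train):
    """Build an inflexible bird's hypothetical response vector Pesc^s.

    Drops the three components for TRAIN stimuli S1-S3, shifts the
    remaining curve towards the shorter stimuli (Pesc^s_i = Pesc_{i+3},
    i = 1..7) and appends three naive components of 0.5 (chance level)
    for the three new stimuli S8-S10.
    """
    pesc_train = np.asarray(pesc_train, dtype=float)
    assert pesc_train.shape == (10,), "one escape probability per stimulus S1-S10"
    shifted = pesc_train[3:]                  # Pesc^s_i = Pesc_{i+3}
    naive = np.full(3, 0.5)                   # undecided on the three new stimuli
    return np.concatenate([shifted, naive])   # again length 10

# illustrative bird: low escape probability on short stimuli, high on long ones
pesc = [0.1, 0.1, 0.1, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9, 0.9]
print(synthetic_pesc(pesc))
```

With this illustrative input, the shifted vector keeps the trained responses on the retained stimuli while remaining at chance on the three new ones, exactly the inflexible behaviour the test is designed to detect.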
Our results indicate that even though auditory experience may be slightly advantageous for accurate discrimination, birds were generally successful at (1) learning an arbitrary vocal distinction (duration of a song syllable), (2) generalizing the learned distinction and (3) flexibly relearning decision boundaries. The good generalization to new stimulus sets argues for the existence of a perceptual system that can flexibly learn stimulus categories, even when these are subject to relatively fast change, unlike the rigid phonemes in human speech.
We then asked whether this perceptual acuity and flexibility in songbirds could allow them to discriminate a random shuffle among all syllable renditions in the training set, such that each class contained both short and long syllables (Fig. 1c, green curve). The aim of this experiment was to test whether birds can treat each stimulus-class relationship independently when there is no feature-based definition of a category. We hypothesized that if birds adapted well (and quickly), it would indicate that they have a flexible representation of the stimuli and did not actually learn any categorical information. A significant reduction in performance would indicate that true feature-based categories were learned. We first trained four birds on TRAIN and then switched to the randomly shuffled (RAND) set. Performance at criterion on RAND was comparable to the performance on TRAIN (dPesc RAND = 0.22 ± 0.13, N = 4; dPesc TRAIN = 0.37 ± 0.21, N = 4; statistical test not performed owing to the small sample size; Fig. 2g). The criterion was achieved after an extensive learning period comparable to that of TRAIN (trials to criterion RAND = 4.32 ± 3.15 × 10³; trials to criterion TRAIN = 6.95 ± 7.9 × 10³; Fig. 2j), indicating no rapid transfer of discriminative competence to the RAND set. Most importantly, the average dPesc in the first three blocks after the switch to the RAND set was similar to the average dPesc in the first three blocks of training (RAND = 0.087 ± 0.11, N = 4; TRAIN = 0.037 ± 0.15, N = 4). The longer time taken to reach criterion on the RAND set in comparison to the MOV and NOV sets reinforces the view that, to some extent, birds learned duration categories and did not memorize stimuli in terms of acoustic features that were spuriously correlated with duration (otherwise, learning of RAND might have been as fast as for the MOV and NOV sets).
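For reference, the W statistic reported for the Wilcoxon rank sum tests above corresponds to the Mann-Whitney U, which counts, over all between-group pairs, how often a value from the first sample exceeds one from the second. A minimal pure-Python sketch, using hypothetical per-bird dPesc values (illustrative only, not the study's data):

```python
import itertools

def mann_whitney_u(x, y):
    """Mann-Whitney U of sample x versus y (equals the W of R's wilcox.test).

    Counts, over all (xi, yi) pairs, how often xi exceeds yi; ties count 0.5.
    """
    u = 0.0
    for xi, yi in itertools.product(x, y):
        if xi > yi:
            u += 1.0
        elif xi == yi:
            u += 0.5
    return u

# hypothetical per-bird dPesc values (NOT the study's data)
dpesc_train = [0.30, 0.42, 0.25, 0.44, 0.33, 0.41]       # N1 = 6 birds
dpesc_mov = [0.05, 0.21, 0.35, 0.40, 0.28, 0.44, 0.30]   # N2 = 7 birds

print(mann_whitney_u(dpesc_train, dpesc_mov))  # between 0 and N1 * N2 = 42
```

Values of U near N1 × N2 / 2 = 21 indicate heavily overlapping samples, consistent with the nonsignificant P values reported above.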
Because only male zebra finches can produce the chosen stimuli (song syllables), we also examined whether the sex of the birds had an influence on discrimination but did not find any statistically significant difference in performance between males and females, both experienced and isolated, on the TRAIN set (Appendix 5, Fig. A3).

DISCUSSION
We have shown that adult and juvenile zebra finches with or without experience of learned vocalizations can readily learn to discriminate sequences of a single song syllable type. We used duration as an arbitrary definition of 'category' and found a small but significant main effect of auditory experience on discrimination performance, which is in line with results obtained in songbirds on pure tone pitch discrimination (Njegovan & Weisman, 1997; Sturdy et al., 2001). Sturdy et al. (2001), for example, found that zebra finches reared in isolation showed deficits in both absolute and relative pitch discrimination compared to zebra finches raised with siblings and adults of both sexes.
We observed high variability in performance, with some A-birds (adult, song isolated, i.e. without experience) learning just as well as the best J+ birds (juvenile, with experience). This good learning in isolated birds agrees with a study by Phillmore, Sturdy and Weisman (2003), who trained black-capped chickadees, Poecile atricapillus, reared either in the wild or in isolation, to discriminate vocal distance cues from chickadee song and zebra finch calls. They found that chickadees both with and without experience discriminated accurately.
Unlike in discrimination studies using synthetic pure-tone stimuli presented to black-capped chickadees (Njegovan & Weisman, 1997), budgerigars, Melopsittacus undulatus, and zebra finches (Dooling, Best, & Brown, 1995), we chose natural stimuli that were generated from zebra finch song syllables without significant postprocessing. Our stimuli allowed the learning of discriminants along acoustic features other than duration.
Arguably, such natural stimuli make the task easier because they are more ethologically relevant. Our choice of natural stimuli echoes classic studies of song perception in white-crowned sparrows, Zonotrichia leucophrys (Marler & Tamura, 1964; Nelson & Marler, 1993), and chaffinches, Fringilla coelebs (Thorpe, 1958), demonstrating a propensity to process own-species vocalizations in songbirds. One particular reason why the songbird auditory system is genetically primed to recognize, filter and categorize species-typical vocalizations is that auditory feedback plays an important role in successful song learning in males (Brainard & Doupe, 2000; Konishi, 1965). Since female zebra finches do not sing, auditory feedback plays a lesser role in self-evaluation, so we might expect isolated females not to learn as well. However, we did not find such a trend but instead found roughly equally good performance in isolated males and females. We speculate that one evolutionary reason why isolated females would perform equally well is that females need to listen and pay attention to male songs to choose a suitable mate (Byers & Kroodsma, 2009).
Birds were able to accurately memorize stimulus-response relationships that were generated by a random assignment of stimuli to classes, showing that they can categorize individual vocalizations in an experimentally induced, arbitrary manner. This ability suggests that birds may be capable of recognizing not only individuals in the wild (Falls, 1982; Gentner, 2004; Gentner & Hulse, 1998) but even subsets of renditions of an individual's songs.
Zebra finches are known to find it more difficult to classify song syllables into pseudocategories than into natural categories (Sturdy et al., 1999, 2000). Given that we found relatively slow learning of randomly classified song syllables (TRAIN -> RAND) compared to fast and accurate generalization of learning from duration categories (TRAIN -> NOV), our duration-based stimulus classes appear to have been perceived as natural categories by our birds.
The good learning performance in some A-birds may be explained by theoretical considerations of the neural underpinnings of perceptual decision making. It has been suggested that the transformation of stimulus representations between primary sensory areas such as the nucleus ovoidalis and secondary sensory areas such as the caudomedial and caudolateral mesopallium (Jeanne, Thompson, Sharpee, & Gentner, 2011) can act as a random projection of stimuli onto many dimensions (Babadi & Sompolinsky, 2014). The synaptic transformations between these areas would be naïve and random in birds without exposure to song. Provided the neural activity patterns representing the stimuli are sparse enough, these projections can increase the signal-to-noise ratio for effective stimulus discrimination. In other words, birds without experience may be good discriminators when their sensory cortices recruit sufficient neurons with sufficiently low activity.
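The random-projection argument can be illustrated with a toy simulation (our sketch, with arbitrary parameters, not a model from the cited work): expanding a sparse input pattern through naive random weights and applying a firing threshold tends to reduce the overlap between two similar stimuli, making them easier to separate downstream.

```python
import numpy as np

rng = np.random.default_rng(1)

n_in, n_out = 200, 5000  # expansion from a small "primary" to a large "secondary" area
f = 0.10                 # fraction of active units in a sparse input pattern

# two similar sparse stimulus patterns: b is a slightly perturbed copy of a
a = (rng.random(n_in) < f).astype(float)
b = a.copy()
flip = rng.choice(n_in, 4, replace=False)
b[flip] = 1 - b[flip]    # flip four units -> a nearby but distinct stimulus

W = rng.normal(size=(n_out, n_in))   # naive, random synaptic weights
theta = np.quantile(W @ a, 0.90)     # threshold so that ~10% of output units fire

ya = (W @ a > theta).astype(float)   # binary secondary-area response to a
yb = (W @ b > theta).astype(float)   # binary secondary-area response to b

def overlap(u, v):
    """Cosine overlap between two patterns (1 = identical, 0 = orthogonal)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print("input overlap: ", round(overlap(a, b), 2))
print("output overlap:", round(overlap(ya, yb), 2))
```

The output overlap comes out lower than the input overlap: the random expansion plus threshold decorrelates the two representations, which is the sense in which naive random projections can aid discrimination when activity is sparse.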
We showed that songbirds were able to generalize their knowledge to novel instances of the same stimulus type. Similar ease of generalization has been shown in adult zebra finches (Geberzahn & Derégnaucourt, 2020; Narula et al., 2018) and in swamp sparrows, which can discriminate between dialects of neighbouring species (Nelson & Marler, 1962). Our work suggests that this generalization ability is not limited to a stringent critical period but instead continues into adulthood, regardless of experience. Learning in our task was flexible because birds could rapidly relearn already acquired decision boundaries, demonstrating their capability to adapt to new environmental contingencies. For example, in a territorial defence context, songbirds may use a single acoustic feature to categorize song/call types from a large set of individuals belonging to a neighbouring flock. In a mate recognition or courtship context, a male songbird might use acoustic features from the distance calls of a specific female to define one category and features of calls from other females nearby as multiple (or a single 'other') categories. This would be a scenario similar to discriminating the RAND set in our experiment. Overall, these experiments suggest that the auditory system of songbirds can flexibly assume a learned categorical state, supported by results from Jeanne et al. (2011), who showed that neurons in the mesopallium of the European starling, Sturnus vulgaris, respond to song features that encode learned categorical representations of conspecific song.
We used operant conditioning to train birds to discriminate the stimuli, which contrasts with other categorical perception experiments based on the detection of aggressive behaviour (Nelson & Marler, 1962), which have been carried out on swamp sparrows but not zebra finches. Our paradigm also contrasts with the head turn paradigm used to elicit responses to vocalizations from human infants (Werker & Tees, 1984). Both these latter paradigms make no explicit use of reinforcement, since they are dishabituation experiments in which reinforcing feedback is either absent or not contingent. In principle, these differences in paradigms could constitute a confounding factor, suggesting that categorical learning is facilitated in the presence of reinforcers, in line with Narula et al. (2018). Indeed, our experiments bear a resemblance to experiments aimed at training adult Japanese speakers in /r/ versus /l/ phoneme discrimination (McClelland, Fiez, & McCandliss, 2002). The /r/-/l/ phoneme distinction has no functional significance in the Japanese language and is difficult for Japanese adults to disambiguate. However, training with feedback improves discrimination performance remarkably within 1500-3000 trials, which corresponds to roughly half the average number of trials to criterion in our birds. Thus, our choice to use reinforcers could have been the key factor that explains the excellent categorization performance in all bird groups and that sets our experiment apart from the purely perceptual experiments used to probe categorical perception in infants. The statistical learning hypothesis (Kuhl, 2004; McClelland, 2014) describes the shaping of auditory perception through repetitive exposure to the distribution of naturally occurring categories present in conspecific vocalizations.
Our results, in particular the main effect of auditory experience and not age on discrimination performance, suggest that this hypothesis may hold true provided the acoustic categories are behaviourally relevant, in a similar way to a word's meaning being independent of dialect, speaker identity and gender. However, the statistical learning hypothesis may be irrelevant for discrimination when the signals exchanged between senders and receivers are deliberately rich, such as a song syllable without allocated meaning within the context of surrounding vocalizations. In this sense, it would be interesting to probe whether differences arising from age and experience are amplified when birds are trained to discriminate within an acoustic category that has behavioural relevance, for example food-begging calls produced by juvenile zebra finches and directed at their parents. Traditionally, calls and songs have been considered functionally distinct vocalizations, as songs are learned and calls are not (Catchpole & Slater, 2008; Elie & Theunissen, 2015; Zann, 1996). However, in juvenile zebra finches, the distinction between calls and songs is less clear. Lipkind et al. (2017) showed that juvenile zebra finches learning to imitate a model song often modify calls before inserting them into the song motif. This suggests that a future avenue of research would be a closer examination of the perceptual implications of discriminating call instances within a category such as the begging call.

[Table fragment: median pitch in specific time windows, computed using the harmonic power spectrum method in three 16 ms wide windows beginning 20 ms, 24 ms and 28 ms after the onset of the syllable.]