Auditory category learning is robust across training regimes

Multiple lines of research have developed training approaches that foster category learning, with important translational implications for education. Increasing exemplar variability, blocking or interleaving by category-relevant dimension, and providing explicit instructions about diagnostic dimensions each have been shown to facilitate category learning and/or generalization. However, laboratory research often must distill the character of natural input regularities that define real-world categories. As a result, much of what we know about category learning has come from studies with simplifying assumptions. We challenge the implicit expectation that these studies reflect the process of category learning of real-world input by creating an auditory category learning paradigm that intentionally violates some common simplifying assumptions of category learning tasks. Across five experiments and nearly 300 adult participants, we used training regimes previously shown to facilitate category learning, but here drew from a more complex and multidimensional category space with tens of thousands of unique exemplars. Learning was equivalently robust across training regimes that changed exemplar variability, altered the blocking of category exemplars, or provided explicit instructions of the category-diagnostic dimension. Each drove essentially equivalent accuracy measures of learning generalization following 40 min of training. These findings suggest that auditory category learning across complex input is not as susceptible to training regime manipulation as previously thought.


Introduction
Is this mushroom edible? Is that a squeal of danger, or delight? Is that stranger trustworthy? Humans and other organisms readily learn complex constellations of cues that signal functionally equivalent sensory objects and events, such as crying babies. Cries of pain during a vaccination tend to be louder and longer, with more variable pitch and greater nonlinear acoustic characteristics compared to cries of bath time discomfort (Helmer et al., 2020; Koutseff et al., 2018). But adults' ability to categorize pain versus discomfort based on these complex cues demands experience; adults who have spent little time with infants categorize cries no better than chance. In contrast, parents and infant caregivers are significantly more accurate in categorizing cries, their accuracy scales with how much infant experience they have, and their categorization ability generalizes to unfamiliar infants' cries (Corvin, Fauchon, Peyron, Reby, & Mathevon, 2022). Experience molds caregivers' ability to use imperfect and complex sensory input regularities and guides behavior upon encountering novel input with similar properties. The latter ability, generalization, is a signature characteristic of effective category learning.
Cognitive science has long investigated the emergence of categories. One especially productive approach has been to utilize training paradigms to teach participants categories across novel or unfamiliar exemplars. In addition to advancing theoretical accounts of category learning and generalization, these literatures have informed real-world applications in second-language acquisition (Lim & Holt, 2011; Reetzke, Xie, Llanos, & Chandrasekaran, 2018), science learning (Eglington & Kang, 2017; Goldwater, Hilton, & Davis, 2022; Nosofsky, Sanders, & McDaniel, 2018), social group recognition through faces and voices (Lavan, Burton, Scott, & McGettigan, 2019; Retter, Jiang, Webster, & Rossion, 2020), stereotyping (Hugenberg & Sacco, 2008), and approaches to building effective educational materials (Carvalho & Goldstone, 2021; Nosofsky, Slaughter, & McDaniel, 2019). Many studies of category learning have examined aspects of training that best support effective learning, informing both theory and application. We examine three such aspects in more depth in the next sections.

Explicit instruction
The provision of explicit instructions may also promote category learning. Explicitly instructing learners to focus on a category-diagnostic dimension, or to direct attention away from a category non-diagnostic dimension, can result in enhanced non-native speech category learning (Chandrasekaran, Yi, Smayda, & Maddox, 2016). Moreover, when explicit instruction draws attention to a category-diagnostic dimension, it benefits non-native speech category learning and production above and beyond what is achieved with high-variability training alone (Wiener, Chan, & Ito, 2020). More nuanced, less explicit manipulations that guide learners to category-diagnostic dimensions have also been effective in facilitating non-native speech category learning (Ingvalson, Holt, & McClelland, 2012; Iverson, Hazan, & Bannister, 2005; Jamieson & Morosan, 1986; McCandliss, Fiez, Protopapas, Conway, & McClelland, 2002; McClelland, Fiez, & McCandliss, 2002).

Summary and aim of the study
In summary, examination of category learning across novel or unfamiliar categories has been useful in understanding how category training regimes affect learning and suggests means of improving real-world categorization. Indeed, an implicit assumption of category learning research has been that laboratory training tasks with relatively simple stimuli can inform real-world category learning. Studying visual category learning across simple dimensions, for example, may reveal processes available to early-career radiographers learning to categorize the subtle patterns that differentiate a benign from a cancerous tumor (Waite et al., 2019). Correspondingly, learning a simplified category characteristic of non-native speech might suggest scenarios that would improve classroom second language learning (Wiener, Murphy, Goel, Christel, & Holt, 2019).
However, most category learning studies differ substantially from natural category learning challenges, often by design. For example, the number of unique exemplars in lab experiments vastly undersamples natural exemplar variation. Laboratory studies tend to model real-world exemplar variability with a Gaussian distribution for simplicity. Exemplars are often defined across just two sensory dimensions, and dimensions tend to be simple, easily verbalized sensory features (e.g., line orientation, acoustic frequency). Even when categories are defined by natural visual objects or spoken utterances, exemplar sampling tends not to truly reflect the full complexity of natural categories. As a result, much of what we know about category learning has come from studies with simplifying assumptions. This entirely reasonable approach nonetheless calls into question the implicit expectation that these studies reflect the process of category learning under more complex learning challenges, such as those posed by real-world input.
Here, we put this question to the test by creating an auditory category learning challenge that intentionally violates some common simplifying assumptions. We create a novel, nonspeech acoustic stimulus space comprising >36,000 tokens across four auditory categories. The categories rely upon natural acoustic variability from spoken language (Mandarin lexical tone across multiple talkers) with underlying regularities known to be learnable because they are derived from real speech. Despite their speech origins, these sounds are not familiar, do not convey talker information, and are not heard as speech. This is because we use signal processing to eliminate voice and linguistic information, leaving only the fundamental frequency (F0) contour thought to be the most diagnostic dimension for conveying Mandarin lexical tone category to native listeners (Ho, 1976; Howie, 1976). In tonal languages like Mandarin, F0 differences like these allow a syllable like "ma" to have four different meanings according to its intonation (Chao, 1965; Gandour, 1983). As noted, we can be confident the structure of these novel categories is learnable because they are drawn from natural categories. Further, prior research examining category learning among the same pool of nonspeech hums demonstrates robust category learning among non-Mandarin listeners (Liu, 2014).
We exaggerate the learning challenge in two ways. First, each category exemplar is composed of two streams of three hums, each stream spectrally filtered such that one is situated in a high frequency band and the other in a low frequency band. These two streams are played simultaneously, but only one carries information diagnostic to category decisions; the other is acoustically variable and non-diagnostic. This creates a rarely examined category learning challenge: Listeners must forage the acoustic soundscape to discover category-diagnostic information as it evolves (and dissipates) over time. By design, we build this qualitative category learning challenge into our stimulus set without modeling specific details of speech per se. Instead, our approach is to create a novel version of an important puzzle present in auditory category learning: Listeners must discover category-diagnostic acoustic dimensions in the context of non-diagnostic (or less-diagnostic) acoustic variability arising from other dimensions of the same sound source (e.g., across different bands of formant frequencies) or even across simultaneous competing sounds.
Second, the hum stream in one frequency band is a concatenation of three unique hums drawn from a single Mandarin tone category. The other is a concatenation of three unique hums, each drawn from different Mandarin tone categories. In this way, one frequency band contains tone-category-diagnostic information, and the other frequency band is category uninformative. Thus, category learning requires both discovering (at least implicitly) the category-diagnostic frequency band that contains a statistically regular pattern derived from a single Mandarin tone category, and also recognizing the category-diagnostic, but acoustically variable, pattern within this band (see Fig. 1 for a schematic depiction of the stimuli). In summary, this creates a complex high-dimensional exemplar space across which four categories are defined over multiple difficult-to-verbalize dimensions and sampling distributions.
We intentionally chose a learning challenge that would not approach ceiling in a single session so that we could better capture differences that might be apparent across training regimes; ceiling performance would make this problematic. Although this approach does not measure what learners can achieve with longer training, examining category learning across a single session has been the workhorse paradigm across both the visual and auditory category learning literatures because it tracks early, online category acquisition. Further, by limiting training to a single session, we can examine effects of online category learning without influences of offline learning or consolidation (which might be productively examined in future work).

Experiments overview
Here, we first examine whether young adult participants recruited from a diverse online sample can accomplish this complex category learning challenge in a single training session that involves overt category decisions and explicit feedback. We then examine how learning is influenced by variability in three aspects of the training regime, each of which has been shown to affect learning in simpler categorization challenges: manipulations of exemplar variability, category exemplar sequencing, and explicit instructions.

Participants
Since this was a novel categorization challenge, we conducted several pilot studies from which to estimate power. These studies revealed robust learning across ~30 participants. Here, we doubled the sample, targeting recruitment of 60 participants per experiment to improve our ability to detect subtle learning differences across learning contexts.
In total, 300 young adults aged 18-35 years participated online for monetary compensation via recruitment through Prolific.co. There were no restrictions on language background, and all participants self-reported normal hearing. Table 1 reports participant demographics. Given our relatively unrestricted recruitment of participants online, our sample is likely more representative of the general population than that of studies that recruit from a university student population (Henrich, Heine, & Norenzayan, 2010). Four participants were excluded due to an experimental error that duplicated trials, leaving 296 participants in the final analyses and a minimum of 58 participants per experiment. All participants provided informed consent approved by the Carnegie Mellon University Institutional Review Board (IRB).

Stimuli
Fig. 1 illustrates the construction of sound exemplars. Stimuli for all experiments were drawn from the same acoustic space. The building blocks for these stimuli were nonspeech hums created by extracting the fundamental frequency (F0) contour from natural speech recordings of single-syllable words, each recorded by four native Mandarin speakers (2 male, 2 female; Liu, 2014). A screen displayed both the Mandarin Chinese character and the pinyin spelling of the word frame (with tone number 1, 2, 3, 4) to prompt native speakers to utter each word twice, with self-paced progress as utterances were digitally recorded with Praat (Boersma, 2001). Each speaker produced 20 unique word-frames (pinyin spellings: can, chou, di, fa, ge, guo, huan, jie, kui, peng, pu, qian, shi, tuo, xi, xiang, xing, xue, yang, yu) in each of the four lexical tones for a total of 80 utterances per talker. A native Mandarin listener checked stimuli for clarity and representativeness of the lexical tone contour.
These speech recordings were processed in the open-source speech analysis software Praat (Boersma, 2001) to create non-speech hums: the pitch contour of each recording was extracted using the Analyse periodicity: To Pitch function and then converted into a hum using the To Sound (hum) function. Expert listeners removed some stimuli from the pool based on poor pitch tracking and discontinuous hum outcomes (Liu, 2014).
To make a single stimulus exemplar, three unique non-speech hums drawn from the same Mandarin talker were assigned to a higher frequency band and three to a lower frequency band. As illustrated in Fig. 1B, one of the frequency bands was designated the diagnostic band; it possessed 3 unique hums drawn from a single lexical tone category. The other band possessed 3 unique hums from any lexical tone category ("wild card").
Next, the hums were processed using the audio processing software Sound eXchange (sox.sourceforge.net), with additional processing in Adobe Audition (version 13.0.7). First, hums were padded with 50 ms of silence at the beginning and end of the sound clip, high-pass filtered at 30 Hz to remove slow drift, and reduced in gain by 10 dB. Second, high- and low-frequency-band versions of these stimuli were created. To create the high-frequency-band components, hums were pitch-shifted +33 semitones in Audition and then high-pass filtered using a sinc Kaiser-windowed filter in SoX to preserve all frequencies at and above 1000 Hz. To create the low-frequency-band components, the same hums were pitch-shifted by −1 semitone and low-pass filtered to preserve all frequencies at and below 500 Hz. In the process of pitch shifting, hums were simultaneously normalized to be 400 ms using the iZotope algorithm in Audition, using the high precision mode with pitch coherence set to 4. The 400-ms, pitch-shifted and high/low-pass filtered hums were RMS-matched in amplitude and normalized to be −6 dB below the maximum digital range.
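The pitch-shift and level values in this paragraph correspond to simple ratios. The following is a minimal sketch, assuming equal-tempered semitones (a frequency ratio of 2^(n/12)) and amplitude decibels (20·log10 of the amplitude ratio). The 200 Hz input F0 is a hypothetical value chosen for illustration, not a measurement from the stimulus set, and the function names are ours.

```python
def semitone_ratio(semitones: float) -> float:
    """Frequency ratio for a shift of `semitones` equal-tempered semitones."""
    return 2.0 ** (semitones / 12.0)


def db_to_amplitude(db: float) -> float:
    """Linear amplitude ratio for a level change of `db` decibels."""
    return 10.0 ** (db / 20.0)


HIGH_SHIFT_ST = 33   # semitones, high-frequency band (value from the text)
LOW_SHIFT_ST = -1    # semitones, low-frequency band (value from the text)
PEAK_LEVEL_DB = -6   # dB re: maximum digital range (value from the text)

example_f0_hz = 200.0  # hypothetical F0, for illustration only

high_f0_hz = example_f0_hz * semitone_ratio(HIGH_SHIFT_ST)
low_f0_hz = example_f0_hz * semitone_ratio(LOW_SHIFT_ST)
peak_amplitude = db_to_amplitude(PEAK_LEVEL_DB)

print(f"+33 semitones: {high_f0_hz:.1f} Hz")
print(f"-1 semitone:   {low_f0_hz:.1f} Hz")
print(f"-6 dB peak:    {peak_amplitude:.3f} of full scale")
```

Under these assumptions, a 200 Hz fundamental shifted by +33 semitones lands near 1345 Hz, above the 1000 Hz high-pass cutoff, while the same fundamental shifted by −1 semitone lands near 189 Hz, below the 500 Hz low-pass cutoff.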
As shown in Fig. 1B, the category-diagnostic band was created by drawing from the pool of hums derived from a single talker, choosing a frequency band (high or low), randomly selecting three hums from a single hum (lexical tone) category, and concatenating the hums with 100 ms of silence between successive tokens. We created all permutations in both high and low frequency bands.
Similarly, the category-uninformative "distractor" band was created by drawing from a pool of hums from the same talker used to create the diagnostic-band hum sequence, with hums placed into the frequency band opposite the diagnostic band. For the non-diagnostic band, hums were randomly selected from three different hum categories (selected from any of the four hum categories) and concatenated with 100 ms silences between each hum. This was repeated for all permutations. The diagnostic band and uninformative distractor band were then added together such that the onset of each of the three hums of each frequency band was temporally aligned, and stimuli evolved across 1400 ms in all.
For counterbalancing purposes, there were two sets of four categories. Fig. 1B illustrates Set 1, in which Category A and Category B are defined by high-frequency diagnostic bands whereas Category C and D are defined by low-frequency diagnostic bands. This relationship was reversed in Set 2 (e.g., low-frequency diagnostic for Categories A and B, not shown in Fig. 1). Assignment of set was counterbalanced across participants in each experiment and analyses collapse across set assignment.
Overall, the full constellation of hum permutations resulted in a stimulus pool with over 36,000 exemplars. From this exemplar space we randomly selected 2048 total exemplars (256/category/set) for the
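The two-band construction and the 1400 ms total duration can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the pool layout (a dictionary from tone number to hum identifiers for one talker), the hum identifiers, and the function names are hypothetical, and the sketch draws a single random exemplar whereas the text describes enumerating all permutations.

```python
import random

TONES = (1, 2, 3, 4)   # Mandarin lexical tone categories
HUMS_PER_STREAM = 3    # hums per frequency band (from the text)
HUM_MS = 400           # duration of each hum (from the text)
GAP_MS = 100           # silence between successive hums (from the text)


def stream_duration_ms() -> int:
    """Three hums separated by two silent gaps."""
    return HUMS_PER_STREAM * HUM_MS + (HUMS_PER_STREAM - 1) * GAP_MS


def make_exemplar(pool, tone, diagnostic_band, rng):
    """Pick hums for one exemplar from a single talker's pool.

    pool: dict mapping tone -> list of hum ids (hypothetical layout).
    """
    # Diagnostic band: three unique hums from ONE tone category.
    diagnostic = rng.sample(pool[tone], HUMS_PER_STREAM)
    # Distractor band: one hum from each of three DIFFERENT tone categories.
    distractor_tones = rng.sample(TONES, HUMS_PER_STREAM)
    distractor = [rng.choice(pool[t]) for t in distractor_tones]
    other_band = "low" if diagnostic_band == "high" else "high"
    return {diagnostic_band: diagnostic, other_band: distractor}


# Toy pool with made-up identifiers, five hums per tone.
toy_pool = {t: [f"t{t}_h{i}" for i in range(5)] for t in TONES}
exemplar = make_exemplar(toy_pool, tone=2, diagnostic_band="high",
                         rng=random.Random(0))
print(exemplar)
print(stream_duration_ms())  # 3 * 400 + 2 * 100 = 1400
```

Because hum onsets are aligned across the two bands, both streams share this 1400 ms duration.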

General procedure
Five experiments shared common procedures, differing only in their approach to training. In all experiments, training blocks alternated with generalization blocks (see Fig. 2C). Generalization was similar to training, but participants did not receive feedback. Generalization trials for all experiments were 4AFC. Each of the four generalization blocks consisted of 20 novel exemplars/category (80 total stimuli) not encountered during training. These 80 exemplars were randomly selected without replacement from the stimulus pool reserved for novel generalization prior to the experiment, and the generalization set was used consistently for each participant across each experiment. This presented the opportunity to examine cross-experiment effects of training manipulations via participants' ability to generalize category learning to novel exemplars.
Participants completed the experiment online via Gorilla, an online experiment creation and hosting website (Anwyl-Irvine, Massonnié, Flitton, Kirkham, & Evershed, 2020) on a laptop or desktop computer using the Google Chrome browser. Prior to beginning the category learning task, participants underwent a system check to ensure the auto play of sound at a comfortable listening level and a short task to ensure compliance with the use of binaural headphones (Milne et al., 2021). All sounds were presented in the lossless *.FLAC format. After the experiment, participants shared language and music training history, were invited to share notes detailing their task strategies, and received an experiment debriefing.
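Drawing a fixed generalization set once, before the experiment, can be sketched as below. This is a minimal illustration, assuming a reserved pool organized by category; the pool size of 128 per category, the exemplar identifiers, the seed, and the function name are hypothetical and not taken from the study.

```python
import random

CATEGORIES = ("A", "B", "C", "D")
PER_CATEGORY = 20  # novel exemplars per category (from the text)


def draw_generalization_set(reserved_pool, seed=0):
    """Sample 20 exemplars per category without replacement.

    A fixed seed yields the same 80 exemplars on every call, mirroring
    a set that is chosen once and reused for all participants.
    """
    rng = random.Random(seed)
    chosen = []
    for category in CATEGORIES:
        chosen.extend(rng.sample(reserved_pool[category], PER_CATEGORY))
    return chosen


# Hypothetical reserved pool: 128 made-up exemplar ids per category.
reserved = {c: [f"{c}{i:03d}" for i in range(128)] for c in CATEGORIES}
generalization_set = draw_generalization_set(reserved)
print(len(generalization_set))  # 4 categories * 20 exemplars = 80
```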

Approach to analyses
For each experiment, we analyzed training and generalization blocks separately, asking whether significant learning and generalization occurred with a specific training regime. For training and generalization blocks, we analyzed: (1) the overall change in performance across Blocks 1-4 using a repeated measures ANOVA and post-hoc comparison of Block 1 and Block 4; and (2) indices of early learning by examining Block 1 accuracy compared to chance. We compared training and generalization performance between select pairs of experiments using mixed model ANOVA. (Linear mixed effects modeling yielded the same results, which are available on OSF.io.) To ask whether training regime differentially affected generalization overall, a set of cross-experiment analyses (reported after Experiment 5) compared generalization progress from Block 1 to 4 as well as final generalization achievement in Block 4. We supplemented these analyses with Bayesian Equivalence Independent t-tests across all pairs of experiments, looking both at generalization progress and final generalization achievement.
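As an illustration of comparing Block 1 accuracy to chance, the sketch below computes a one-sample t statistic. It rests on assumptions: the text does not name the specific test used for this comparison, chance is taken to be 1/k for a k-alternative task (0.25 for 4AFC, 0.5 for 2AFC), and the accuracy values are made up for the example.

```python
import math
import statistics


def chance_level(n_alternatives: int) -> float:
    """Expected accuracy when guessing among equally likely alternatives."""
    return 1.0 / n_alternatives


def one_sample_t(scores, mu0):
    """Return (t statistic, degrees of freedom) for H0: mean == mu0."""
    n = len(scores)
    mean = statistics.fmean(scores)
    sd = statistics.stdev(scores)  # sample standard deviation (n - 1)
    t = (mean - mu0) / (sd / math.sqrt(n))
    return t, n - 1


# Made-up Block 1 accuracies for six hypothetical participants (4AFC).
block1_accuracy = [0.30, 0.35, 0.28, 0.40, 0.33, 0.25]
t_stat, df = one_sample_t(block1_accuracy, chance_level(4))
print(f"t({df}) = {t_stat:.3f}")
```

A p-value would additionally require the t distribution, which the Python standard library does not provide; a statistics package would be needed for that step.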

Experiment 1: 4AFC training with full exemplar variability
Experiment 1 tested listeners' ability to learn the complex auditory categories under conditions of full acoustic variability in a four-alternative forced-choice categorization task, with feedback (Table 2).

Methods specific to experiment 1
Here, 480 exemplars (120/category) were randomly selected from the full pool of 512 training stimuli (128/category). On each trial, participants chose which of four aliens (4AFC) corresponded to the sound they had heard; as with all experiments, they received feedback after each training trial. Participant characteristics are shown in Table 1; data are shown in Fig. 3.

Experiment 2: 4AFC training with low exemplar variability
As noted above, high exemplar variability may lead to slower and initially less accurate performance in training. However, it can yield dividends in supporting better generalization (Logan et al., 1991; Lively et al., 1993 1. Fig. 4 shows training and generalization data.

Experiment 3: 2AFC training with pairs grouped by category-diagnostic band
Recall that the auditory category exemplars confront participants with two learning challenges: (1) to identify the diagnostic frequency band in the context of a simultaneous, non-diagnostic band and (2) to learn the pattern of hums present in the diagnostic band despite their within-category acoustic variability. In Experiment 3, we block categorization decisions according to the category-diagnostic band, thereby potentially (implicitly) encouraging selective attention to the category-relevant frequency band within blocks of trials (Carvalho & Goldstone, 2017).

Methods specific to experiment 3
Here, training trials were blocked as 2AFC category decisions. Like Experiment 1, participants completed 480 training trials with feedback, where the 480 trials (120/category) were randomly selected from the full pool of 512 training exemplars (128/category). This was accomplished by dividing each training block (120 trials) into six 20-trial mini-blocks. Half of the mini-blocks were grouped by high-frequency diagnostic band and half by low-frequency diagnostic band. For example, as shown in Fig. 1B, Category A and B stimuli were presented in one half of the mini-blocks, and Category C and D were presented in the other half. Mini-blocks alternated between category pairs differentiated in either the high- or low-frequency diagnostic band, with order counterbalanced across participants. Generalization blocks mirrored Experiments 1 and 2. Participant demographics are shown in Table 1. Data are plotted in Fig. 5.
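The mini-block structure follows from the trial counts given above. The sketch below is illustrative only: the function name and the `first_band` argument (standing in for the counterbalanced order) are hypothetical, and the text does not specify whether alternation carries over between training blocks.

```python
TRIALS_TOTAL = 480       # training trials per participant (from the text)
TRIALS_PER_BLOCK = 120   # trials per training block (from the text)
MINI_BLOCK_TRIALS = 20   # trials per mini-block (from the text)


def miniblock_schedule(first_band="high"):
    """Diagnostic band of each mini-block within one training block."""
    n_mini = TRIALS_PER_BLOCK // MINI_BLOCK_TRIALS  # 120 / 20 = 6
    bands = ("high", "low") if first_band == "high" else ("low", "high")
    # Alternate bands so half the mini-blocks are high and half are low.
    return [bands[i % 2] for i in range(n_mini)]


n_blocks = TRIALS_TOTAL // TRIALS_PER_BLOCK  # 480 / 120 = 4
schedule = miniblock_schedule("high")
print(n_blocks)
print(schedule)
```

In Set 1, a "high" mini-block would pair Categories A and B and a "low" mini-block would pair Categories C and D; the mapping is reversed in Set 2.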

Experiment 4: 2AFC training with all category pairs
As a counterpart to Experiment 3, Experiment 4 examines whether category learning with 2AFC training is successful without category-diagnostic blocking. Here, all six possible pairs of categories were presented in separate training blocks (e.g., AB/AC/AD/BC/BD/CD). We hypothesized that without implicit direction to the diagnostic band, participants would be forced to discover the two learning challenges simultaneously and that this would, akin to interleaved presentation, exaggerate between-category differences (Carvalho & Goldstone, 2017). After Experiment 4 findings are reported, results from Experiments 3 and 4 are directly compared.

Methods specific to experiment 4
Experiment 4 used full exemplar variability (like Experiments 1 and 3) and presented 2AFC training across six 20-trial mini-blocks per training block (like Experiment 3). The order of category pair mini-blocks was randomized for each training block, for each participant. Generalization blocks mirrored previous experiments. Table 1 provides demographic information, and data are plotted in Fig. 6.
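The six category pairs are the two-element combinations of four categories, one pair per mini-block. A minimal sketch follows; the function names and the seed are hypothetical, and the band assignment shown is that of Set 1 as described in the Stimuli section.

```python
import itertools
import random

CATEGORIES = ("A", "B", "C", "D")
# Set 1 diagnostic bands (from the Stimuli section).
SET1_BAND = {"A": "high", "B": "high", "C": "low", "D": "low"}

# 4 choose 2 = 6 pairs: AB, AC, AD, BC, BD, CD.
ALL_PAIRS = list(itertools.combinations(CATEGORIES, 2))


def pair_order_for_block(rng):
    """Random order of the six pair mini-blocks for one training block."""
    order = ALL_PAIRS.copy()
    rng.shuffle(order)
    return order


def shares_band(pair, band_of=SET1_BAND):
    """True when both categories in the pair use the same diagnostic band."""
    first, second = pair
    return band_of[first] == band_of[second]


print(pair_order_for_block(random.Random(1)))
print([pair for pair in ALL_PAIRS if shares_band(pair)])
```

Only two of the six pairs (AB and CD in Set 1) share a diagnostic band; the other four contrast a high-band category with a low-band category.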

Comparison of experiments 3 and 4
We asked how training that paired categories according to diagnostic band (Experiment 3) compared to pairing categories randomly regardless of diagnostic band (Experiment 4). A mixed-model ANOVA revealed that there was a significant effect of Training Regime (F(1, 116) = 12.130, p = 0.000701, η_G² = 0.073) but no interaction between Block and Experiment 257.56 Random pairing of categories without regard to the diagnostic band in Experiment 4 resulted in significantly better Block 1 training accuracy (t(114.7529) = −5.5759, Bonferroni-adjusted p = 6.64e-7, Cohen's d = −1.027) compared to Experiment 3. However, there was no significant difference in final training achievement in Block 4 (t(114.

Experiment 5: 2AFC training with pairs grouped by category-diagnostic band and explicit instructions
Experiment 3 blocked categories according to their diagnostic frequency band in a manner that might implicitly guide discovery of category-relevant dimensions. Experiment 5 takes a more explicit approach, asking whether category learning is facilitated by providing instructions about the category-relevant frequency band.

Methods specific to experiment 5
Experiment 5 used full exemplar variability (like Experiment 1) and a 2AFC training task with trials blocked according to a shared diagnostic band (like Experiment 3). In addition, participants were informed that "previous participants […] found it beneficial to listen to the higher [or lower] pitched sounds when learning which sounds go with which alien." Before each mini-block of 20 trials, participants were presented with a blank screen with the text "Listen high!" or "Listen low!" in accordance with the diagnostic frequency band of the category pairs in the mini-block. Otherwise, the procedure followed that of Experiment 3. Table 1 shows participant demographics. Data are plotted in Fig. 7.

Comparison of experiments 3 and 5
There was a significant influence of the presence of explicit in-

Comparing generalization across training regimes
As described above, each experiment involved generalization testing blocks composed of the same 80 exemplars, not heard during training. This allows for direct comparison of the influence of different training regimes on generalization of category learning. To this end, we conducted a two-way mixed model ANOVA of generalization accuracy across Block versus all five Training Regimes (Experiments). The significant effect of Block 845.94 Given the similarity among experimental outcomes, we also conducted Bayesian equivalence testing to examine the strength of the evidence that training regime manipulations have essentially equivalent effects. We again focus on generalization progress along with final generalization achievement, setting the equivalence region from −0.05 to 0.05 in Cohen's d units using a Bayesian Independent Samples Equivalence t-test (JASP Team, 2022).
Fig. 9 shows the Bayes Factor (BF) comparing the equivalence hypothesis (i.e., that the effect falls within our equivalence interval) versus the hypothesis that the effect lies outside this interval. For each pairwise comparison, the evidence is stronger for equivalence. Using criteria suggested by Andraszewicz et al. (2015), there is moderate evidence that generalization progress and ultimate achievement are not differentially influenced by the training regimes that manipulate exemplar variability, exemplar sequencing, or explicit instruction.
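To make the scale of the equivalence region concrete, the sketch below computes Cohen's d for two independent groups and checks whether it falls between −0.05 and 0.05. This is not the Bayesian equivalence test reported here, which weighs posterior evidence for the effect lying inside versus outside the interval; it only illustrates the effect-size units in which that interval is defined. The pooled-standard-deviation form of d, the function names, and the toy data are assumptions.

```python
import math
import statistics

EQUIV_LOW, EQUIV_HIGH = -0.05, 0.05  # equivalence region in Cohen's d units


def cohens_d(x, y):
    """Cohen's d for independent samples, using the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = (
        (nx - 1) * statistics.variance(x) + (ny - 1) * statistics.variance(y)
    ) / (nx + ny - 2)
    return (statistics.fmean(x) - statistics.fmean(y)) / math.sqrt(pooled_var)


def in_equivalence_region(d):
    """True when a point estimate of d lies inside the equivalence interval."""
    return EQUIV_LOW <= d <= EQUIV_HIGH


# Toy data: identical groups give d = 0; a one-SD shift gives d = 1.
same = cohens_d([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
shifted = cohens_d([2.0, 3.0, 4.0], [1.0, 2.0, 3.0])
print(same, in_equivalence_region(same))
print(shifted, in_equivalence_region(shifted))
```

An interval of ±0.05 is narrow: it corresponds to a group difference of one twentieth of a pooled standard deviation.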

General discussion
Category learning studies have often taken the entirely reasonable approach of examining simplified category-learning challenges; one or a few often easily verbalizable diagnostic dimensions with low exemplar variability and a small number of category exemplars have been typical (e.g., Gabay, Dick, Zevin, & Holt, 2015; Lim & Holt, 2011; Maddox, Koslov, Yi, & Chandrasekaran, 2016; Roark, Lehet, Dick, & Holt, 2022). This has been as true for natural exemplars, like non-native speech sounds, as for novel objects and events. Overall, these studies have informed theories of category learning and have significantly driven our understanding of both basic processes and application. Yet we do not completely understand how factors that impact simplified category learning challenges might play out in more real-world category learning. Here, we developed a novel space of auditory categories that embodied some of the natural complexity and variability typically encountered in real-world stimuli. Within this space, categories were characterized by many unique exemplars, difficult-to-characterize dimensions, and simultaneous non-diagnostic information.
We observed strong evidence that these categories are learnable even over short-term training. Moreover, this learning generalizes readily to novel exemplars. Across five independent experiments involving 296 listeners, adult participants learned these challenging auditory categories above chance accuracy at the group level. Learning was rapid. There was evidence of learning as early as the first block across all training regimes; for most participants, categorization improved across the 40-45 min of total training. The learning curves across training are consistent with results from a wide variety of category learning studies with simpler category learning challenges. Typically, these studies show evidence of significant learning early in training followed by relatively slow, incremental increases in accuracy across subsequent blocks (e.g., Reetzke et al., 2018; Roark & Holt, 2019; Zeithamova & Maddox, 2006).
As is often the case in category learning studies, there was a substantial range of individual differences in learning outcomes (e.g., Baese-Berk, Chandrasekaran, & Roark, 2022). We informally examined two potential contributors to these individual differences across our sample of almost 300 participants: (1) experience with Mandarin or another tonal language and (2) musical expertise. Neither was predictive of generalization outcomes (supplemental information can be found at OSF.io).
With this learning and generalization as a baseline, we examined the extent to which manipulations of exemplar variability (Logan et al., 1991), exemplar blocking (Carvalho & Goldstone, 2017), and provision of explicit instruction (Chandrasekaran et al., 2016), each shown to impact category learning outcomes in prior research, modulate generalization of category learning in a more complicated stimulus space. Under the present category learning challenge, learning was surprisingly consistent across training regimes. As demonstrated by the Bayesian analyses, generalization progress and final generalization achievement were essentially equivalent. This is quite unexpected given the prior literature. Even participants left to discover diagnostic dimensions implicitly via feedback did not fare more poorly in generalization of category learning than participants provided explicit instruction about where to find category-relevant information. Next, we consider the findings from prior literature and how they diverge from and inform our findings by examining the three training manipulations.

Exemplar variability
The expectation that training with high variability exemplars produces more robust generalization of category learning has a long history and continues to have a substantial impact on theory and application. As we noted in the introduction, the implications of high variability training have been especially well-investigated in non-native speech category learning (e.g., Logan et al., 1991). Brekelmans et al. (2022) review this literature thoroughly and make a case that evidence is mixed regarding an advantage of high versus low exemplar variability. Moreover, in their well-powered replication of Logan et al. (1991) and Lively et al. (1993), Brekelmans and colleagues observed no learning differences across high and low exemplar variability.
Other studies have shown that the benefit from high variability acoustic training interacts strongly with participants' individual characteristics and perceptual abilities. For example, Perrachione, Lee, Ha, and Wong (2011) demonstrated that high-variability training benefited only learners with already strong perceptual abilities and indeed impeded learners with weaker perceptual abilities. Several other studies have reported variation in the effectiveness of high-variability training for different learners, with some studies finding no beneficial effect of the high-variability condition, and others finding that high exemplar variability in training hinders learning (Fuhrmeister & Myers, 2017, 2020; Sadakata & McQueen, 2014). Further, another recent study has demonstrated that high variability training sets could confer an advantage or a disadvantage in voice-identity category learning, depending on stimulus type, the dimension that is varied, and the nature of the posttest (Lavan, Knight, Hazan, & McGettigan, 2019).
In summary, emerging evidence challenges the strength and/or consistency of effects of exemplar variability on category learning outcomes. The present results echo these concerns. Here, there was no advantage to generalization progress or ultimate achievement across training with high exemplar variability (480 unique exemplars) versus low exemplar variability (40 unique exemplars).

Exemplar sequence
A recent meta-analysis revealed that interleaved exemplar presentation tends to benefit learning (Brunmair & Richter, 2019), but vanishingly few studies have examined exemplar sequencing in the auditory modality. Studies examining learning across auditory input of non-native speech sounds, though few in number, have found benefits of blocking, rather than interleaving, category exemplars (Carpenter & Mueller, 2013; Fuhrmeister & Myers, 2020). These studies also found that participants learned to rely on the category-diagnostic dimensions and made error judgments based on category-irrelevant dimensions.
In the present study, exemplars blocked according to the category-diagnostic frequency band initially led to significantly poorer training performance than randomly paired category exemplars. Even so, by the end of training there was no difference in learning outcomes or generalization across training regimes. Any influence of blocked versus interleaved presentation of exemplars in training was ephemeral and contrary to expectations that category-diagnostic blocking would support learning. Participants left to discover category-relevant dimensions through trial-and-error tuned by explicit feedback fared no better or worse than learners who were supported by blocking according to the category-diagnostic dimension.

Explicit instruction
Explicitly instructing learners about the nature of category-diagnostic dimensions can improve categorization accuracy for non-native speech categories (Chandrasekaran et al., 2016). Other studies have more implicitly "instructed" participants via training methods that exaggerate category-relevant dimensions; these appear to enhance learning compared to control conditions (Ingvalson et al., 2012; Iverson et al., 2005; Jamieson & Morosan, 1986; McCandliss et al., 2002; McClelland et al., 2002).
In the present study, explicit instruction improved early training accuracy compared to implicit support to learning via blocking by category-diagnostic frequency band. But that advantage was fleeting. By the culmination of training, groups' learning and generalization achievements were equivalent. It is possible that simple instructions such as "Listen high!" or "Listen low!" may not be informative enough to direct listeners to the diagnostic dimension. However, we modeled our instructions after those of Chandrasekaran et al. (2016), who instructed listeners that previous participants had succeeded in listening to a specific dimension of sound, and listeners are fully capable of paying explicit attention to one of two spectrally separated dimensions in a range of tasks (Dick et al., 2017; Holt, Tierney, Guerra, Laffere, & Dick, 2018).

Conclusions
In sum, the present results underscore the robustness of auditory category learning, regardless of training regime. A large, diverse sample of online research participants exhibited the ability to acquire novel auditory categories drawn from a complex acoustic space within 40 min and to generalize this knowledge robustly to novel exemplars. At a group

Fig. 1. Schematic of Sound Exemplars. A. Non-speech hums derived from natural utterances of four native Mandarin (2 female) speakers producing utterances varying in lexical tone, which is conveyed by fundamental frequency (F0) contours. Hums preserve only the F0 contour and do not sound like speech, yet they possess natural acoustic regularity within hum categories and distinct patterns across hum categories. Here and in subsequent panels, color conveys the hum category. B. Hums were filtered into high (≥1000 Hz) and low (≤500 Hz) frequency bands and three hums were concatenated in each band to compose a sound exemplar. For each, a diagnostic band (colored boxes) possessed within-hum-category exemplars and a non-diagnostic band had 3 hums, each drawn from a different one of the four hum categories (open "wild card" boxes). Exemplars defining the four categories were created such that listeners needed to discover the diagnostic band in the context of the simultaneous non-diagnostic band and learn the hum pattern across acoustic variability within the diagnostic band. The four aliens used to guide categorization responses are shown, as well. C. A spectrogram showing a representative exemplar drawn from Category A, for which the high-frequency band was diagnostic. Here, and in Panel D, colored rectangles indicate the lexical tone category from which the hum was created. Solid colored lines indicate the category-diagnostic frequency band; dashed lines show the category-uninformative frequency band. D. Spectrogram showing a representative exemplar drawn from Category D, for which the low-frequency band was diagnostic.
Only the nature of the training blocks varied across experiments. Generalization blocks were identical across experiments to facilitate cross-experiment outcome comparisons. All experiments involved training over 40 min. Moreover, across all experiments, training involved overt category decisions and explicit feedback (see Fig. 2 for schematics). Following a 500-ms fixation, participants heard a category exemplar and matched it to one of four novel 'alien' illustrations via keyboard response at sound offset, with immediate feedback lasting 1500 ms; the next trial commenced immediately. Across experiments, each auditory category consistently mapped to a specific alien presented on the screen. In Experiments 1 and 2, all four alien creatures were visible on the screen (4-alternative forced choice, 4AFC), whereas in Experiments 3-5, only pairs of alien creatures were visible (2AFC), with the other two aliens greyed out and unavailable for response. Each of the four training blocks consisted of 120 trials (30 trials/category), totaling 480 training trials. At the commencement of each training block, 30 exemplars/category were randomly selected without replacement from the pool of 128 category exemplars. Thus, exemplars were never repeated within a single training block, and there was a low probability of any single exemplar repeating across training blocks. Each training block was divided into either three mini-blocks of 40 trials each (Experiments 1 and 2, for 4AFC training) or six mini-blocks of 20 trials each (Experiments 3, 4, and 5, for 2AFC training), to allow for brief self-timed breaks between mini-blocks. Except for Experiment 5 (see Section 7), participants were not informed of the dual-band nature of the stimuli and were simply instructed to use the feedback during training trials to learn which sounds corresponded with which alien.
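The block structure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the experiment code: the category labels B and C, the integer exemplar indices, the function names, and the fully shuffled trial order within a block (which fits the 4AFC experiments but not the paired 2AFC designs) are all hypothetical choices made for the example.

```python
import random

# Values taken from the text: 4 categories, 128 training exemplars per
# category, 30 trials per category per block, 4 training blocks.
CATEGORIES = ["A", "B", "C", "D"]   # labels B and C are assumed
POOL_PER_CATEGORY = 128
TRIALS_PER_CATEGORY = 30
N_BLOCKS = 4


def build_training_block(rng, n_mini_blocks):
    """Draw 30 exemplars per category without replacement, shuffle the
    120 trials, and split them into equally sized mini-blocks."""
    trials = []
    for category in CATEGORIES:
        # Sampling without replacement: no exemplar repeats within a block.
        exemplar_ids = rng.sample(range(POOL_PER_CATEGORY), TRIALS_PER_CATEGORY)
        trials.extend((category, exemplar_id) for exemplar_id in exemplar_ids)
    rng.shuffle(trials)  # assumed: categories intermixed within a block
    size = len(trials) // n_mini_blocks
    return [trials[k * size:(k + 1) * size] for k in range(n_mini_blocks)]


def build_session(seed=0, n_mini_blocks=3):
    """Build four training blocks; each block draws afresh from the pool,
    so an exemplar may recur across blocks but never within one."""
    rng = random.Random(seed)
    return [build_training_block(rng, n_mini_blocks) for _ in range(N_BLOCKS)]
```

Calling `build_session(n_mini_blocks=3)` yields mini-blocks of 40 trials, as in the 4AFC experiments, and `build_session(n_mini_blocks=6)` yields mini-blocks of 20 trials, as in the 2AFC experiments; both give 480 training trials in total.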

Fig. 2. Trial and Block Structure Across Experiments. A. Training trials with overt categorization decisions and immediate feedback. B. Generalization trials with novel sound exemplars not encountered in training, with no feedback. C. Training regimes (defined by the nature of training trials) differed across experiments, but all experiments were composed of four cycles of 120 training trials (A) followed by 20 generalization trials (B). Note that generalization trials were identical across experiments.
; see Raviv et al., 2022 for review). Conversely, small numbers of training exemplars may lead to faster and more accurate learning, but poorer generalization. We test this hypothesis in Experiment 2 with a limited set of training exemplars, but with the same set of novel generalization exemplars as in Experiment 1.

4.1. Methods specific to Experiment 2
Here, training involved only 40 exemplars (10 exemplars/category) randomly selected from the training pool of 512 training exemplars prior to experimentation and consistent among participants. Each exemplar was encountered 12 times across training to arrive at the same number of 480 training trials as Experiment 1. Participant demographics are in Table

Fig. 3. Experiment 1, 4AFC Full Exemplar Variability: Training and Generalization Accuracy by Block. The top panel represents training accuracy. The bottom panel shows generalization accuracy. Dashed lines represent chance and error bars reflect standard error of the mean. Each individual gray point represents an individual participant's mean accuracy and larger, colored symbols show mean across-participant accuracy.

Fig. 4. Experiment 2, 4AFC Low Exemplar Variability: Training and Generalization Accuracy by Block. The top panel represents training accuracy. The bottom panel shows generalization accuracy. Dashed lines represent chance and error bars reflect standard error of the mean. Each individual gray point represents an individual participant's mean accuracy and larger, colored symbols show mean across-participant accuracy.

Fig. 5. Experiment 3, 2AFC Pairs Grouped by Category-Diagnostic Band: Training and Generalization Accuracy by Block. The top panel represents training (2AFC) accuracy. The bottom panel shows generalization (4AFC) accuracy. Dashed lines represent chance and error bars reflect standard error of the mean. Each individual gray point represents an individual participant's mean accuracy, and larger, colored symbols show mean across-participant accuracy.

Fig. 6. Experiment 4, 2AFC All Category Pairs: Training and Generalization Accuracy by Block. The top panel shows training accuracy, and the bottom panel shows generalization accuracy. Dashed lines represent chance and error bars reflect standard error of the mean. Each individual gray point represents an individual participant's mean accuracy and larger, colored symbols show mean across-participant accuracy.
Fig. 8B) differed across training regimes. Given the similarity among experimental outcomes, we also conducted Bayesian equivalence testing to examine the strength of the evidence that training regime manipulations have essentially equivalent effects. We again focus on generalization progress along with final generalization achievement, setting the equivalence region from −0.05 to 0.05 in Cohen's d units using the Bayesian Independent Samples Equivalence t-test (JASP Team, 2022). Fig. 9 shows the Bayes Factor (BF) comparing the equivalence hypothesis (i.e., that the effect falls within our equivalence interval) versus the

Fig. 7. Experiment 5, 2AFC Pairs Grouped by Category-Diagnostic Band + Explicit Instructions: Training and Generalization Accuracy by Block. The top panel represents training accuracy. The bottom panel shows generalization accuracy. Dashed lines represent chance and error bars reflect standard error of the mean. Each individual gray point represents an individual participant's mean accuracy and larger, colored symbols show mean across-participant accuracy.

Fig. 8. Generalization Progress and Achievement Across Training Regimes. Generalization of category learning was very robust. Training regime manipulations across experiments did not influence generalization progress from Block 1 to Block 4 (panel A), nor did they influence ultimate generalization achievement in Block 4 (panel B). Error bars indicate standard error.

Fig. 9. Generalization Across Training Regimes, Bayesian Equivalence Testing. Each panel shows comparison of two Bayes Factors (BF) across experiments: the top number indicates the evidence that the difference lies within the equivalence region, and the bottom number indicates the evidence that the difference lies outside the equivalence region. A. BF results from Generalization Progress (Block 4 − Block 1 accuracy). B. BF results from Generalization Achievement (Block 4). For ease of interpretation, comparisons where BF > 4 are in bold font, and BF < 1 in italics.
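The two outcome measures compared in these analyses, and the Cohen's d scale on which the equivalence region is defined, can be illustrated with a short Python sketch. It computes only a point estimate of the standardized group difference and checks it against the −0.05 to 0.05 region; it does not reproduce the Bayesian equivalence t-test or the Bayes Factors reported from JASP. The function names and the pooled-standard-deviation form of Cohen's d are assumptions made for illustration.

```python
from math import sqrt
from statistics import mean, variance

# Equivalence region in Cohen's d units, as stated in the text.
EQUIVALENCE_REGION = (-0.05, 0.05)


def generalization_progress(block1_acc, block4_acc):
    """Per-participant progress: Block 4 accuracy minus Block 1 accuracy."""
    return [b4 - b1 for b1, b4 in zip(block1_acc, block4_acc)]


def cohens_d(group_a, group_b):
    """Standardized mean difference between two independent groups,
    using the pooled sample standard deviation (an assumed convention)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def in_equivalence_region(d, region=EQUIVALENCE_REGION):
    """True if a point estimate of d falls inside the equivalence region.
    This is a descriptive check only, not a Bayes Factor."""
    lower, upper = region
    return lower <= d <= upper
```

Generalization achievement corresponds to passing each group's Block 4 accuracies directly to `cohens_d`, whereas generalization progress first passes the Block 1 and Block 4 accuracies through `generalization_progress`.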

Table 1
Participant demographics.
1 Based on self-reported languages when asked to "List language(s) spoken before age 2."

C.O. Obasih et al.

present experiments. Half of the exemplars for each condition (128/category/set) were reserved as the training stimulus pool whereas the other half was reserved as a pool to test generalization. The 2048 stimuli selected for the present experiments are available on OSF.io.

Table 2
Training Protocols.