Voice actors show enhanced neural tracking of pitch, prosody perception, and music perception

Experiences with sound that make strong demands on the precision of perception, such as musical training and experience speaking a tone language, can enhance auditory neural encoding. Are high demands on the precision of perception necessary for training to drive auditory neural plasticity? Voice actors are an ideal subject population for answering this question. Voice acting requires exaggerating prosodic cues to convey emotion, character, and linguistic structure, drawing upon attention to sound, memory for sound features, and accurate sound production, but not fine perceptual precision. Here we assessed neural encoding of pitch using the frequency-following response (FFR), as well as prosody, music, and sound perception, in voice actors and a matched group of non-actors. We find that the consistency of neural sound encoding, prosody perception, and musical phrase perception are all enhanced in voice actors, suggesting that a range of neural and behavioural auditory processing enhancements can result from training which lacks fine perceptual precision. However, fine discrimination was not enhanced in voice actors but was linked to degree of musical experience, suggesting that low-level auditory processing can only be enhanced by demanding perceptual training. These findings suggest that training which taxes attention, memory, and production but is not perceptually taxing may be a way to boost neural encoding of sound and auditory pattern detection in individuals with poor auditory skills.

Individual differences in auditory neural encoding reflect not only innate variability, but also the effects of experience with sound. For example, tone language speakers must track pitch on a finer scale, compared to non-tone-language speakers, in order to distinguish lexically-relevant tone contours. As a result, neural representation of the fundamental frequency (F0, an acoustic correlate of pitch) is enhanced in tone language speakers (Bidelman, Gandour, & Krishnan, 2010; Chandrasekaran, Krishnan, & Gandour, 2009; Krishnan, Xu, Gandour, & Cariani, 2005, 2009). This enhanced pitch representation could help explain the enhanced pitch processing skills demonstrated by tone language speakers. These include more precise pitch discrimination (Creel, Weng, Fu, Heyman, & Lee, 2018), enhanced melody memory (Liu, Hilton, Bergelson, & Mehr, 2023), more accurate singing (Pfordresher & Brown, 2009), and greater use of pitch as a cue to speech and music perception (Jasmin, Sun, & Tierney, 2021; Petrova, Jasmin, Saito, & Tierney, 2023).
Effects of experience on the robustness of auditory neural encoding are not limited to experiences begun in infancy and early childhood. For example, music perception requires tracking of pitch on a finer scale than does speech perception. Cross-culturally, music tends to feature discrete pitch variations which listeners map onto fixed musical intervals (Ozaki et al., 2024). Speech, on the other hand, features continuous changes in pitch, with intonational structure conveyed by changes in the direction of pitch contour (Zatorre & Baum, 2012). As a result, musicians must learn to perceive pitch very precisely: a difference of less than one semitone can separate a note which is in tune from one which is out of tune, while similarly sized pitch changes in non-tonal languages are inconsequential. The need to track such fine gradations in pitch may enhance auditory neural encoding. Indeed, musicians have more robust pitch representations, compared to non-musicians, as measured using the mismatch negativity (MMN; Fujioka, Trainor, Ross, Kakigi, & Pantev, 2004; Magne, Schön, & Besson, 2006; Chandrasekaran et al., 2009) and the frequency-following response (FFR; Musacchia, Sams, Skoe, & Kraus, 2007; Wong, Skoe, Russo, Dees, & Kraus, 2007; Skoe & Kraus, 2012). Although most prior research on musical experience and auditory neural encoding has been cross-sectional, longitudinal studies have demonstrated that music training can lead to enhancements of auditory function and sound perception (Nan et al., 2018; Tierney, Krizman, & Kraus, 2015).
What conditions must be met for an experience to enhance neural representation of sound and sound perception? Both musical training and tone language experience make strong demands on the precision of pitch perception. One possibility, therefore, is that high demands on the precision of perception drive auditory neural plasticity. An alternate possibility, however, is that auditory neural encoding can be enhanced by training that taxes sound perception without requiring high perceptual precision, instead requiring attention to sound, memory for sound features, and accurate sound production. This possibility would be consistent with Reverse Hierarchy Theory, which suggests that attention can enhance processing at low levels of perceptual systems (Ahissar & Hochstein, 2004).
Determining whether high perceptual precision is a necessary ingredient of training for auditory neural enhancement would require testing a population who receives specialized sound training but without needing to perform precise sound discrimination. One such population is actors, who must exaggerate their use of prosodic cues (Berry & Brown, 2019; Jürgens, Grass, Drolet, & Fischer, 2015; Matharu, Berry, & Brown, 2022) to clearly communicate phrase structure and pragmatics, to produce speech with maximal emotional intensity and arousal (Whiting, Kotz, Gross, Giordano, & Belin, 2020), and to convey a character's personality. To learn which characteristics to exaggerate, actors must explicitly attend to the relevant acoustic characteristics, including pitch, duration, and amplitude (Breen, Fedorenko, Wagner, & Gibson, 2010; Fear, Cutler, & Butterfield, 1995; Mattys, 2000). For example, judgments of a speaker's dominance versus trustworthiness are linked to distinctive pitch contours, with the former characterized by a rapid decrease in pitch, and the latter by a moderate rise in pitch (Ponsot, Burred, Belin, & Aucouturier, 2018); exaggerating these features would clearly convey to a listener that a character is dominant or trustworthy, even in the absence of cues from facial expression, gesture, or gait. In order to learn to clearly convey dominance and trustworthiness, therefore, actors must selectively direct attention towards pitch and away from less relevant acoustic characteristics, detect the difference in pitch contour associated with the two characteristics, and exaggerate it.
Moreover, prosodic exaggeration requires explicit control over the production of cues to prosodic features. Prior neural evidence suggests that this enhanced control over the production of sound could also have consequences for auditory perception: professional actors showed greater premotor activation when listening to speech versus violin music, whereas the opposite was true for professional violinists (Dick, Lee, Nusbaum, & Price, 2011), suggesting the use of motor resources during sound perception. Acting training also taxes memory, as actors must memorize long passages of speech, including relevant prosodic cues to information structure and emotion; indeed, prior evidence suggests enhanced verbal memory in actors compared to non-actors (Groussard, Coppalle, Hinault, & Platel, 2020; Noice & Noice, 2009), although it remains unclear whether this advantage extends to memory for non-verbal sounds.
To test whether speech training which does not tax auditory precision can enhance auditory neural encoding and sound perception, we tested voice actors and non-actors matched for age and degree of musical training. (The need to exaggerate prosodic cues is especially strong for voice actors, who cannot use facial expression or gesture and so must rely on sound to communicate emotion, character, and phrase structure.) We assessed the consistency and amplitude of their neural encoding of sound using the FFR, as well as behaviourally assessing their acoustic discrimination, memory for sound patterns, and ability to categorize prosodic and musical patterns.
We report below how we determined our sample size, all data exclusions (if any), all inclusion/exclusion criteria, whether inclusion/exclusion criteria were established prior to data analysis, all manipulations, and all measures in the study. Sample size was determined as the largest number of participants we could recruit, given time/resource limitations, with sample size primarily limited by the feasibility of recruiting voice actors. No data were excluded. Inclusion/exclusion criteria were established prior to data analysis.
Detailed information on acting training programmes was available from only a subset of participants. Within this subset, there was substantial variability in the type of training obtained. Most of these participants completed programmes in arts, drama, theatre, or stage performance. Along with modules focused on voice control, they also completed additional specialized voice acting courses, seminars, and professional development practices (e.g., voice and accent coaching, pronunciation).

Test of musical aptitude
The Musical Ear Test (Wallentin, Hojlund Nielsen, Friis-Olivarius, Vuust, & Vuust, 2010) measures musical abilities across two fundamental aspects of music: melody (including pitch and contour) and rhythm (Krumhansl, 2000). On each trial participants decided whether two short melodic or rhythmic phrases they heard were identical by clicking an appropriate button on the screen. The test consisted of 104 trials, 52 in each subtest, half of which were "same" trials and half of which were "different" trials. Each subtest started with two examples with feedback and there was no feedback thereafter. The melodic phrases contained 3–8 tones played with sampled piano sounds. The "different" trials had one pitch violation, half of which (13 trials) involved a contour violation while the remainder did not. 25 trials contained nondiatonic tones, while 20 were in a major and 7 in a minor key.
The rhythmic phrases consisted of 4–11 wood block beats. The "different" trials had one rhythmic change. Rhythmic complexity differed across trials: there were triplets in 21 trials and the remaining 31 trials only contained even beat subdivisions. 37 trials began on the downbeat whereas the rest began later. All melodic and rhythm sequences had a duration of one measure and were played at 100 beats per minute. The order of trials within each subtest was randomized. Performance was assessed as percent correct and averaged across the melody and rhythm sub-tests to form an overall measure of musical aptitude.

Musical phrase perception test
A musical phrase perception test (Jasmin, Dick, Holt, & Tierney, 2020) was used to assess participants' use of pitch and temporal cues in perception of musical phrase boundaries. The stimuli were 150 musical phrases taken from a corpus of folk songs (Schaffrath & Park, 1995) synthesised as sequences of six harmonic complex tones with 10 ms on/off cosine ramps. To include trials with a variety of difficulty levels there were three conditions that manipulated the available acoustic information: a Pitch Only condition, a Duration Only condition, and a Both Cues condition (50 trials in each condition). Musical phrases in the Both Cues condition contained both pitch and duration information. Phrases in the Pitch Only condition had natural pitch variations, but the durations were set to be isochronous and equal to the mean duration of the notes in the original melody. Conversely, the Duration Only condition retained the original note durations while setting the pitch to a monotone at 220 Hz. Additionally, half of the stimuli in each condition formed a complete musical phrase with the notes in an unmodified sequential order; these could be perceived as a complete musical phrase. The remaining stimuli were made to sound incomplete by combining the second half of one musical phrase and the first half of the next musical phrase in the song. The order of the notes within the two halves was preserved, resulting in stimuli with a phrase boundary in the middle of the sequence, rather than at the end. Phrases were presented in the same condition order across participants (Both Cues, then Duration Only, and then Pitch Only condition) with short breaks between the conditions. Within each condition, half of the trials were complete, and half incomplete. On each trial participants heard a single phrase once, and then indicated how complete the phrase sounded by clicking on a red bar on the screen. The red bar represented the degree of perceived phrase completeness on a continuum from incomplete (left side of the bar) to complete (right side of the bar). Participants could click anywhere on the bar to indicate their responses. We opted for a continuous measure of completeness to avoid potential ceiling/floor effects attributable to listeners' bias to hear phrases as complete or incomplete. The main outcome measure was the raw rating difference between complete and incomplete trials for each condition. This outcome measure was averaged across the three conditions to form a composite measure of musical phrase perception.
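The composite score described above can be illustrated with a short Python sketch. This is not the authors' analysis script: the trial format (dictionaries with `condition`, `complete`, and `rating` keys) and the 0–100 rating scale used in the example are assumptions made purely for illustration, since the source does not state the numeric range of the response bar.

```python
from statistics import mean


def phrase_perception_scores(trials):
    """Return (per-condition rating differences, composite score).

    `trials` is a hypothetical list of dicts with keys:
      'condition' : condition label, e.g. 'both', 'pitch', 'duration'
      'complete'  : True if the phrase was a complete phrase
      'rating'    : position clicked on the bar (assumed 0-100 scale)
    """
    per_condition = {}
    for cond in sorted({t["condition"] for t in trials}):
        complete = [t["rating"] for t in trials
                    if t["condition"] == cond and t["complete"]]
        incomplete = [t["rating"] for t in trials
                      if t["condition"] == cond and not t["complete"]]
        # Raw rating difference: complete minus incomplete trials.
        per_condition[cond] = mean(complete) - mean(incomplete)
    # Composite: average of the per-condition differences.
    composite = mean(list(per_condition.values()))
    return per_condition, composite
```

Under this scoring, a listener who rates complete and incomplete phrases identically scores 0, and a listener who rates incomplete phrases as more complete than complete ones scores below 0.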

Auditory processing battery
Pitch and duration discrimination thresholds were estimated using the adaptive maximum likelihood procedure (Green, 1993) as implemented in the MLP Matlab toolbox (Grassi & Soranzo, 2009). The tests use an adaptive three-alternative forced-choice design, with the difficulty decreasing after every incorrect response and increasing after every correct response. The difference between the tones was adaptively reduced until the smallest discriminable difference between the frequencies and duration was established.
The stimuli for the pitch discrimination test were 250-ms complex tones with four harmonics (gated on and off with 10-ms raised cosine ramps). The standard tone was 330 Hz and the range of comparison tones extended to 390.1 Hz. The stimuli for the duration discrimination test were complex tones with four harmonics and F0 set at 330 Hz (ramped with 10-ms cosine onset and offset gates). The durations ranged from 250 ms (standard tone) to 450.1 ms. In the pitch discrimination task, participants heard three tones in each trial and were asked to identify the tone which had the highest pitch by pressing the appropriate number on the keyboard ('1', '2', or '3'). In the duration discrimination task, participants heard two tones and had to indicate which tone was longer by pressing '1' or '2' on the keyboard.
In each condition, participants performed three blocks of 30 trials. In addition, 20% of the trials across the three blocks were catch trials, in which there was no change in the duration or frequency of the tones. Stimulus level thresholds corresponded to the midpoint of the psychometric curve defined as a probability of correct responses across trials (66% for pitch discrimination and 63.1% for duration discrimination; Grassi & Soranzo, 2009).

Prosody perception test
A prosody perception task was used to assess participants' ability to distinguish between phrases varying in placement of linguistic focus (first vs second word emphasized) and phrase boundaries (early vs late closure). Stimuli were taken from the Multidimensional Battery of Prosody Perception (MBOPP; Jasmin, Dick, & Tierney, 2020). All samples were recorded by a native British English-speaking male. The stimuli were pairs of recorded phrases which were identical lexically but differed in prosody. For phrase boundary stimuli, silent pauses after phrase breaks were removed prior to further stimulus manipulation. These pairs of phrases were morphed onto one another with the speech morphing software STRAIGHT (Kawahara & Irino, 2005), by varying the degree to which pitch and duration were informative of linguistic focus or phrase boundary placement. We synthesized pairs of stimuli in three conditions in which different types of acoustic cues to prosodic categorization were present, to ensure a wide range of difficulty across trials. (1) In the Pitch Only condition, the stimulus pair had the same duration as the average durations of the two original recordings but the pitch information was set at the morphing level of 80% versus 20%. (A morphing level of 100% would indicate that the pitch information was taken entirely from the late closure or second word emphasized recordings, while a level of 0% would indicate that the pitch contour exactly matched the early closure or first word emphasized recordings.) (2) In the Duration Only condition, pitch was set to the mean of the two original versions but the duration was set at the morphing levels of 80% versus 20%. (3) In the Both Cues condition, both pitch and time cues were available simultaneously (both set at 80% vs 20% morphing levels). The morphed samples varied only in duration and pitch, while amplitude and spectral characteristics other than pitch were held at the 50% morphing level between the two original recordings and therefore remained uninformative.
There were 42 trials in the phrase boundary test and 47 trials in the linguistic focus test. Each test included three conditions: Pitch Only, Duration Only, and Both Cues. During each trial a single sentence appeared on the screen for participants to read. They were asked to imagine how the sentence might sound if they were to read it aloud. After 5 s, participants heard a man speaking the first part of the sentence in two different ways. The task was to choose which recording best matched the sentence they had just read by pressing either '1' or '2' on the keyboard for the first or second recording respectively. Prior to the main task, participants completed a short practice run to familiarize themselves with the task procedure. Performance was scored as proportion correct, averaged across the Both Cues, Pitch Only, and Duration Only conditions, and averaged across the focus and phrase perception sub-tests to form an overall measure of prosody perception.

EEG data acquisition and pre-processing
The stimulus was the consonant-vowel syllable /da/ synthesised with a Klatt-based synthesiser. The syllable was 190 ms in duration with an F0 contour that was flat at 100 Hz for the initial 110 ms of the sound and then rose to 143 Hz over the course of the subsequent 80 ms. The stimulus began with a 5 ms onset burst. Between 5 and 50 ms F1 rose from 400 to 720 Hz, F2 fell from 1700 to 1240 Hz, and F3 fell from 2580 to 2500 Hz. Between 50 and 190 ms F1, F2, and F3 were stable at 720 Hz, 1240 Hz, and 2500 Hz, respectively. F4, F5, and F6 were constant between 5 and 190 ms at 3300 Hz, 3750 Hz, and 4900 Hz, respectively. The stimulus was presented 6000 times over the course of 25 min at alternating polarities and at a rate of 3.8 Hz. The sound was delivered diotically through E-A-RTONE 3A insert earphones at 80 dB SPL. During the recording, participants were allowed to read a magazine or a book of their choice. They were asked to relax and stay still during the sound presentation. The data were recorded using a BioSemi ActiveTwo EEG system at a 16,384 Hz sample rate and with open filters in ActiView (BioSemi) acquisition software. A montage of five electrodes with a sintered Ag–AgCl pellet was used. One active electrode was placed at Cz, reference electrodes on both earlobes and two ground electrodes above the left and right eyebrows. Electrode contact impedance was kept below 20 kΩ throughout the session.
The mean of the reference channels was subtracted from the signal at Cz during offline data processing. Data were then bandpass-filtered using a first-order Butterworth filter with a high-pass cutoff of 70 Hz and a low-pass cutoff of 2000 Hz. Next, neural responses were divided into epochs beginning 10 ms before the onset of the sound and ending 240 ms after. Trials with an amplitude of greater than 35 µV were rejected as artifacts, and 5000 total artifact-free sweeps were selected, 2500 from each polarity.
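The epoching and artifact-rejection steps can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' pipeline: the function names are hypothetical, amplitudes are assumed to be expressed in microvolts, and keeping the first clean sweeps encountered for each polarity is an assumption, since the source does not say how the 2500 sweeps per polarity were chosen.

```python
FS = 16384  # sample rate in Hz, as stated in the text


def epoch_bounds(onset_sample, fs=FS, pre_s=0.010, post_s=0.240):
    """Sample indices for an epoch from 10 ms before to 240 ms after onset."""
    start = onset_sample - round(pre_s * fs)
    stop = onset_sample + round(post_s * fs)
    return start, stop


def select_clean_sweeps(epochs, polarities, threshold_uv=35.0,
                        per_polarity=2500):
    """Reject epochs exceeding the threshold; keep a balanced set per polarity.

    `epochs` is a list of sample lists (assumed microvolts) and `polarities`
    a parallel list of +1 / -1 stimulus polarities. Taking the first clean
    sweeps in presentation order is an assumption made for this sketch.
    """
    kept = {+1: [], -1: []}
    for epoch, polarity in zip(epochs, polarities):
        if max(abs(v) for v in epoch) > threshold_uv:
            continue  # absolute amplitude above threshold: treat as artifact
        if len(kept[polarity]) < per_polarity:
            kept[polarity].append(epoch)
    return kept
```

At the stated sample rate the 250-ms epoch spans 4096 samples under this rounding.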
A windowed FFT analysis was then used to extract the phase consistency of the response over time. 40-ms segments of each epoch were extracted, starting at a center time point of 10 ms after the onset of the sound and ending 220 ms after sound onset, with a 1-ms step size between segments. For each segment, a Hanning-windowed fast Fourier transform was used to calculate the time-frequency spectrum. This procedure generates, for each trial, an amplitude value and a phase value. The resulting vectors were then transformed into unit vectors, which discards the amplitude but retains the phase value, and averaged. The length of the resulting vector was calculated as a measure of inter-trial phase locking, which varies from zero (no consistency) to one (perfect consistency). The result is a measure of the phase consistency of the response across trials for each time-frequency point.
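The phase-consistency computation for a single time-frequency point can be sketched as below. This is a minimal standard-library illustration, not the authors' code: it evaluates one Hann-windowed Fourier coefficient directly rather than running a full FFT, and the function names are hypothetical.

```python
import cmath
import math


def hann(n):
    """Symmetric Hann (Hanning) window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def fourier_coefficient(segment, freq_hz, fs):
    """Hann-windowed Fourier coefficient of one segment at freq_hz."""
    window = hann(len(segment))
    return sum(w * x * cmath.exp(-2j * math.pi * freq_hz * i / fs)
               for i, (w, x) in enumerate(zip(window, segment)))


def inter_trial_phase_locking(segments, freq_hz, fs):
    """Length of the mean unit phase vector across trials (0 to 1)."""
    unit_vectors = []
    for segment in segments:
        c = fourier_coefficient(segment, freq_hz, fs)
        if abs(c) > 0:
            unit_vectors.append(c / abs(c))  # discard amplitude, keep phase
    if not unit_vectors:
        return 0.0
    return abs(sum(unit_vectors) / len(unit_vectors))
```

Because each trial contributes a unit vector, a high-amplitude trial carries no more weight than a low-amplitude one; only the alignment of phases across trials determines the result.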
To measure the amplitude spectrum, an average waveform was first calculated by taking the mean across all 5000 artifact-free sweeps. A windowed FFT analysis was then used to extract the amplitude of the response at each frequency over time. FFT settings were identical to those used for the phase consistency analysis (see the previous paragraph for details).

General procedure
Data were collected at the Department of Psychological Sciences at Birkbeck, University of London. All procedures were approved by the departmental Research Ethics Committee. Each testing session lasted approximately 3 h (with breaks between the tasks). All participants were reimbursed £25 for their time. All participants provided written consent prior to the testing session. Data and scripts are provided at https://osf.io/jevbm/. Study procedures and analyses were not preregistered. Research materials are provided at https://osf.io/eaqbj/.

Results
To test for a relationship between voice acting experience and the robustness of auditory processing, a series of linear regressions were run using the lm function in R (R Core Team, 2023). There were seven dependent measures: prosody perception, musical phrase perception, musical aptitude, duration discrimination threshold, pitch discrimination threshold, FFR F0 consistency, and FFR F0 amplitude. Predictors included group (voice actor vs non-actor), as well as age and years of musical training. Fig. 1 displays individual data points for each dependent measure across voice actors and non-actors. Full regression tables are supplied in the Supplementary Information at https://osf.io/jevbm/, with significant predictors summarized below.
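The structure of each regression (one outcome predicted by group, age, and years of musical training) can be illustrated with an ordinary least squares sketch. The analyses themselves were run with `lm` in R; the Python code below is only an illustration that solves the normal equations directly. The dummy coding of group (1 = voice actor, 0 = non-actor) is an assumption, and the function names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_ols(rows, y):
    """Fit y ~ intercept + group + age + music_years by least squares.

    `rows` holds (group, age, music_years) tuples, with group assumed to be
    dummy coded (1 = voice actor, 0 = non-actor). Returns the coefficients
    [intercept, b_group, b_age, b_music_years].
    """
    x = [[1.0, *row] for row in rows]
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(x, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)
```

With this coding, the group coefficient is the estimated difference between voice actors and non-actors after adjusting for age and years of musical training.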
Voice actors performed better than non-actors at classifying patterns in both speech prosody and music, even after accounting for age and years of musical training. For prosody perception, the overall model predicted 37.4% of the variance (F(3, 38) = 7.57, p < .001). Voice actors (M = 83.6% correct, SD = 9.9%) performed significantly better than non-actors (M = 74.3%, SD = 13.9%; b = .110, p < .001). The only additional significant predictor was age, indicating that older participants performed worse on the task (b = −.006, p = .001). For musical phrase perception, the overall model predicted …
A few participants performed worse than chance on the prosody perception (n = 2) and musical phrase perception (n = 5) tasks, suggesting that they may have either struggled severely with these tasks or did not understand the instructions. To ensure that these few participants with very poor performance were not driving our results, we re-ran the prosody perception and musical phrase perception regressions after removing the participants who performed below chance on each task (performance less than .5 and less than 0, respectively). For prosody perception, voice actors once again performed significantly better than non-actors (b = .087, p = .007). Similarly, for musical phrase perception, voice actors again performed significantly better than non-actors (b = 7.36, p = .020).
Voice actors did not out-perform non-actors on more basic auditory processing tasks. For musical aptitude, the overall model predicted 10.5% of the variance (F(3, 38) = 1.48, p = .24). No predictors were significant, although there was a trend for more years of musical training to predict better performance (b = .376, p = .068). For duration discrimination, the overall model predicted 9.3% of the variance (F(3, 38) = 1.31, p = .29). No predictors were significant. For pitch discrimination, the overall model predicted 12.3% of the variance (F(3, 38) = 1.87, p = .15). The only significant predictor was years of musical training (b = −.031, p = .029), indicating that a greater degree of musical experience was linked to more precise pitch perception.
Voice actors showed more consistent early auditory neural encoding of sound. To measure the consistency of neural encoding of the F0, the inter-trial phase coherence (ITPC) was averaged across a 10-Hz window around the stimulus F0 at each time point (after accounting for a 10-ms stimulus-response lag) and averaged across time points between 21 and 190 ms. The result was an overall measure of the consistency of encoding of the F0 across the entire response (see Fig. 2 for a depiction of ITPC across time and frequency points for voice actors and non-actors, as well as a spectrogram of the stimulus). The overall model predicted 12.3% of the variance in F0 encoding consistency (F(3, 38) = 1.78, p = .17). Voice actors had more consistent neural encoding of the fundamental frequency compared to non-actors (b = .023, p = .034).
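The F0-tracking average described above can be sketched as follows. Several details are assumptions made for illustration only: the F0 rise from 100 to 143 Hz is treated as linear, the 10-Hz window is taken as ±5 Hz around the stimulus F0, the 21–190 ms range is interpreted as response time, and the time-frequency map layout and function names are hypothetical.

```python
def stimulus_f0(t_ms):
    """Stimulus F0 in Hz at stimulus time t_ms.

    Flat at 100 Hz for the first 110 ms, then rising to 143 Hz over the
    following 80 ms; the linear shape of the rise is an assumption.
    """
    if t_ms <= 110:
        return 100.0
    return 100.0 + (143.0 - 100.0) * (t_ms - 110) / 80.0


def f0_tracking_average(tf_map, freqs_hz, times_ms, lag_ms=10,
                        start_ms=21, stop_ms=190, half_width_hz=5.0):
    """Average a time-frequency map along the lag-shifted stimulus F0 track.

    tf_map[i][j] is the value (e.g. ITPC) at times_ms[i] and freqs_hz[j].
    """
    per_time = []
    for i, t in enumerate(times_ms):
        if not (start_ms <= t <= stop_ms):
            continue
        f0 = stimulus_f0(t - lag_ms)  # response lags the stimulus by lag_ms
        in_band = [tf_map[i][j] for j, f in enumerate(freqs_hz)
                   if abs(f - f0) <= half_width_hz]
        if in_band:
            per_time.append(sum(in_band) / len(in_band))
    return sum(per_time) / len(per_time)
```

The same function could be applied to a spectral amplitude map in place of an ITPC map, since only the values being averaged differ.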
However, voice actors did not have larger frequency-following responses. To measure the amplitude of the frequency-following response, spectral amplitudes were averaged across a 10-Hz window around the stimulus F0 at each time point (after accounting for a 10-ms stimulus-response lag) and averaged across time points between 21 and 190 ms. The result was an overall measure of the amplitude of the neural encoding of F0 across the entire response. The overall model predicted only 3.6% of the variance in F0 encoding (F(3, 38) = .47, p = .70). Voice actors and non-actors had frequency-following responses with equivalent amplitude (b = .005, p = .342).
Finally, to determine the extent to which non-verbal auditory processing skill helped predict prosody perception ability, we ran two regressions with prosody perception scores from the linguistic focus and phrase boundary sub-tests as the dependent measures and duration discrimination, pitch discrimination, musical aptitude, the consistency of neural encoding of F0, and musical phrase perception as predictors. For phrase perception, the overall model predicted 31.5% of the variance (F(5, 36) = 3.31, p = .015). Participants who were more successful at phrase perception had lower duration discrimination thresholds (b = −.160, p = .004). No other predictors were significant. For linguistic focus perception, the overall model predicted 38.9% of the variance (F(5, 36) = 4.59, p = .002). Participants who were more successful at focus perception had more consistent neural encoding of F0 (b = 1.427, p = .016) and performed better at musical phrase perception (b = .005, p = .008).

Discussion
To clearly communicate emotion, character, and phrase structure, voice actors must attend to and explicitly exaggerate acoustic cues to prosodic features, a process which does not require precise sound perception. Nevertheless, we find that voice actors benefit from more consistent neural encoding of the fundamental frequency of speech, relative to matched non-actors. That the consistency of neural encoding of pitch is enhanced in actors suggests that precise demands on perception are not a necessary pre-condition for auditory neural enhancement, which can instead be driven by training attention to sound, memory for sound features, and accurate sound production. Our finding of frequency-following response enhancement in voice actors, therefore, is consistent with Reverse Hierarchy Theory, which suggests that attention can have effects which extend to lower levels of perceptual systems (Ahissar & Hochstein, 2004). Our findings also suggest that developing training which is not perceptually demanding but taxes listeners' attention, memory, and production may be a way of boosting auditory processing in listeners with poor auditory skills who would struggle to complete more demanding auditory tasks. For example, individuals could be presented with highly exaggerated prosodic features such as greatly lengthened durations before phrase boundaries or large pitch excursions during emphasized words and explicitly trained to reproduce these patterns.
Please cite this article as: Kachlicka, M., & Tierney, A., Voice actors show enhanced neural tracking of pitch, prosody perception, and music perception, Cortex, https://doi.org/10.1016/j.cortex.2024.06.016
We find that voice acting training is linked to behavioural enhancements as well, with voice actors performing better on tests of both prosody and musical phrase perception. Prior research on the effects of acting experience has focused on changes in social and cognitive skills directly relevant to acting, including episodic memory recall (Banducci et al., 2017), verbal memory (Groussard et al., 2020; Noice & Noice, 2009), social relationships (Joronen, Konu, Rankin, & Astedt-Kurki, 2011), and theory of mind (Goldstein, Wu, & Winner, 2010). One prior study reported that drama training can lead to enhanced emotional prosody perception in 6-year-old children (Thompson, Schellenberg, & Husain, 2004), suggesting that there are perceptual as well as cognitive effects of theatre experience. Here, we find that the voice actor advantage is not limited to speech perception and neural encoding of speech but extends to music perception as well, suggesting that acting can lead to far transfer of learning to other domains (Barnett & Ceci, 2002). It remains to be seen whether the effects of voice acting training on the neural encoding of pitch extend to musical as well as speech stimuli.
Why would voice acting training have consequences for music perception, a task which is not directly relevant to voice acting? One possible explanation is that there is overlap between the acoustic cues used to convey certain prosodic features in language and in music. For example, in both domains notes/syllables feature increased duration and pitch movements just before a phrase boundary (De Pijper & Sanderman, 1994; Palmer & Krumhansl, 1987; Tierney, Russo, & Patel, 2011). By learning to explicitly attend to the acoustic cues conveying the presence of phrase boundaries in order to clearly produce and exaggerate them, voice actors may become primed to detect them even in another domain. A similar explanation could underlie the finding from longitudinal intervention studies that musical training can lead to gains in identification of emotional prosody (Mualem & Lavidor, 2015; Thompson et al., 2004), given overlap in the cues to emotion in language and music (Ilie & Thompson, 2006; Coutinho & Dibben, 2013). If acoustic overlap between speech and music drives learning transfer from voice acting to music, these effects may extend beyond musical phrase perception to other musical features. For example, voice actors may be better able to perceive emotion in music, compared to non-actors. The acoustic cues to the presence of musical beats and stressed syllables are also strikingly similar, including changes in pitch, amplitude, and duration (Chrabaszcz, Winn, Lin, & Idsardi, 2014; Ellis & Jones, 2009; Fear et al., 1995; Hannon, Snyder, Eerola, & Krumhansl, 2004), and so voice actors may demonstrate advantages in perception and neural encoding of musical beats as well.
Our finding of enhanced consistency of neural responses to speech and musical phrase perception in voice actors suggests that precise perceptual demands are not a necessary precondition for certain types of auditory learning, given that actors produce exaggerated cues to prosody (Berry & Brown, 2019; Jürgens et al., 2015; Matharu et al., 2022). The effectiveness of voice actor experience in driving auditory learning may be due to the roles that attention, emotion, and repetition play in acting (Patel, 2011, 2014): emotion is commonly evoked during acting practice and performance, attention must be directed to acoustic cues in order to exaggerate them, and lines are repeated over and over again as actors learn a scene. However, the lack of precision required by voice acting training may limit the extent of the perceptual advantages which result. Here, we find that the voice actor advantage was limited to musical phrase perception, with the actor and non-actor groups performing similarly on tests of pitch and duration discrimination, as well as music aptitude. Actors' lack of enhanced perceptual precision distinguishes them from tone language speakers and musicians, both of whom demonstrate enhanced pitch discrimination (Creel et al., 2018; Micheyl, Delhommeau, Perrot, & Oxenham, 2006; Pfordresher & Brown, 2009); indeed, in our study we found that years of musical training were linked to pitch discrimination. Precise perceptual demands, therefore, may not be necessary for enhancement of auditory neural encoding or transfer of learning across domains, but may be necessary for boosting auditory discrimination.
Another possible explanation for the actor advantage for music perception, prosody perception, and consistency of neural sound encoding is that actors need to exercise explicit control over their production of prosodic features to exaggerate them. This specialized speech production experience could have consequences for sound perception, as acting training has been shown to modulate premotor activation during perception, even in the absence of overt movement (Dick et al., 2011). However, there is some inconsistency in the literature as to the extent to which pitch perception and production are related (Hutchins & Moreno, 2013); for example, pitch discrimination training has been reported to have no effect on the accuracy of melody production (Zarate, Delhommeau, Wood, & Zatorre, 2010), and difficulties with pitch perception can sometimes occur alongside preserved pitch production (Loui, Guenther, Mathys, & Schlaug, 2008). The Linked Dual Representation model (Hutchins & Moreno, 2013) accounts for these findings by proposing that there is a direct link between low-level pitch perception and vocal-motor production, a second indirect link with production mediated by categorical perception, and a "feedback" connection from production to low-level pitch perception. Our finding of more consistent frequency-following responses in voice actors is consistent with this model, as this enhancement could reflect an influence of feedback from pitch production to low-level encoding of pitch.
Although we find that voice actors show more consistent neural encoding of the F0 of speech, we find no difference between voice actors and non-actors in the amplitude of the [...] Kraus, 2013; Tierney et al., 2015) but also greater F0 amplitude (Musacchia et al., 2007; Parbery-Clark, Strait, & Kraus, 2011). If F0 amplitude is enhanced in musicians but not voice actors, this could suggest that musical training enhances auditory neural encoding to a greater extent than voice acting training. However, this finding of greater F0 amplitude in musicians has not always been replicated, with other studies reporting F0 amplitude to be similar between musicians and non-musicians (Parbery-Clark, Skoe, & Kraus, 2009), or even enhanced in non-musicians relative to musicians (Parbery-Clark et al., 2012). To our knowledge, only two prior studies have directly compared neural encoding of sound in voice actors versus musicians. Dick et al. (2011) found voice actor training was linked to greater activation in the right planum temporale for dramatic speech relative to violin music, while the opposite was true for violin training, suggesting possible domain-specific effects of training type on auditory neural encoding of speech versus music. Rosslau et al. (2016) found greater pitch detection accuracy for singers than for actors, as well as prolonged late activity in right temporal and left parietal areas for singers. Based on the current literature, therefore, it remains unclear whether musical training has a greater effect on early auditory neural encoding, compared to voice acting training, a question which could be addressed by future research.
All our participants were native English speakers. However, we found large individual differences across participants in the ability to perceive prosodic features in English, with performance ranging from 45 to 95%. This test featured sound cues that were smaller than those present in naturalistic speech, to limit ceiling effects. Nevertheless, this wide variability suggests that not all individuals are equally able to perceive prosody in their native language. Moreover, we find that variability in prosody perception is linked to the robustness of encoding of the acoustic cue most relevant to a given feature. For example, perception of phrase boundaries, for which duration is the primary cue (Jasmin et al., 2021), was linked to the precision of duration perception, while perception of linguistic focus, for which pitch is the primary cue (Symons & Tierney, 2023), was linked to the consistency of neural encoding of F0. Prior research has indicated a link between auditory processing and individual differences in speech perception in childhood (Cumming, Wilson, & Goswami, 2015) as well as in adults learning a second language (Kachlicka et al., 2019). Our results here suggest that auditory processing continues to drive speech perception in one's native language throughout the lifespan.
An alternative explanation for our results is that individuals with pre-existing consistent neural encoding of sound, excellent prosody perception, and better-than-usual musical phrase perception are more likely to choose to engage in voice acting as a profession. Indeed, prior research has shown that perceptual and cognitive factors can predict whether individuals will engage in music training (Corrigall, Schellenberg, & Misura, 2013; Corrigall & Schellenberg, 2015), a finding which could extend to acting training as well. We would argue that this interpretation is not very parsimonious, as it seems unlikely that F0 consistency, prosody perception, and musical phrase perception could innately co-vary in the absence of variation in more basic acoustic discrimination and musical aptitude skills. It seems particularly unlikely that musical phrase perception skill would be a major factor determining whether an individual pursues acting experience; this cross-domain relationship between voice acting and music perception is better explained as resulting from a transfer of learning from speech to music. Nevertheless, only future research using random assignment of interventions to participants and a longitudinal design could definitively establish whether voice acting can cause auditory neural enhancements.
Here we treat voice acting training as a homogeneous category; this is in line with standard practice in research on the cognitive and perceptual profiles of musicians and non-musicians, which generally characterizes any individual with at least six years of training as a musician (Zhang et al., 2020). In practice, however, the experience of being trained in music can differ widely depending on instrument (voice, wind, percussion, stringed) or genre (classical, jazz, rock, electronic). Similarly, voice acting training is heterogeneous, with training varying depending on genre, including commercials, animation, and video games. Future work should compare and contrast different types of acting training and experience to determine the factors driving improvement in auditory neural encoding.
Overall, we find that although voice acting training does not require precise auditory perception, it is linked to enhancements in neural encoding of sound, as well as detection of musical and prosodic patterns. As a result, training which requires attention to, memory for, and production of exaggerated sound features may provide a method of boosting auditory neural function in individuals with poor auditory perception who might struggle to complete more demanding auditory training programs requiring fine precision.

Open practices
The study in this article has earned Open Data and Open Materials badges for transparent practices. The data and materials are available at: https://osf.io/jevbm/

Fig. 1 – Performance across all seven main outcome variables in voice actors (left, red) and non-actors (right, black). The horizontal line indicates the mean.

Fig. 2 – Stimulus spectrogram (left) and neural encoding of speech in voice actors (middle) and non-actors (right). The color at each time/frequency point indicates the inter-trial phase coherence (ITPC) at that frequency in a 40-ms window centered on that time point.
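
The ITPC value plotted at each time/frequency point can be illustrated with a minimal sketch. This is not the authors' analysis pipeline, whose filtering, tapering, and trial handling are not described in this section. The function names `itpc` and `make_trials`, the Hann taper, the 8 kHz sampling rate, and the 100 Hz test tone are all assumptions made for illustration. Only the 40-ms window centered on the time point comes from the caption. The sketch takes the phase of each trial's response at one frequency, represents it as a unit vector, and returns the length of the average vector across trials, which approaches 1 when phase is consistent and 0 when it is random.

```python
import cmath
import math
import random


def itpc(trials, fs, freq, center, window=0.040):
    """Inter-trial phase coherence at `freq` (Hz) in a window centered on `center` (s).

    trials: list of equal-length lists of samples, one list per trial
    fs:     sampling rate in Hz (hypothetical value chosen by the caller)
    window: window length in seconds (40 ms, as in the figure caption)
    """
    half = int(round(window * fs / 2))
    mid = int(round(center * fs))
    start, stop = max(0, mid - half), min(len(trials[0]), mid + half)
    n = stop - start
    if n < 2:
        raise ValueError("window contains fewer than two samples")

    unit_vectors = []
    for trial in trials:
        # Hann-tapered Fourier coefficient at a single frequency (assumed taper).
        coef = sum(
            trial[start + k]
            * 0.5 * (1 - math.cos(2 * math.pi * k / (n - 1)))
            * cmath.exp(-2j * math.pi * freq * (start + k) / fs)
            for k in range(n)
        )
        if abs(coef) > 0:
            # Keep only the phase by normalising to unit length.
            unit_vectors.append(coef / abs(coef))
    if not unit_vectors:
        return 0.0
    # Length of the mean phase vector across trials.
    return abs(sum(unit_vectors) / len(unit_vectors))


def make_trials(n_trials, random_phase, fs=8000, freq=100.0, dur=0.1, noise=0.1):
    """Synthetic trials: a sinusoid plus Gaussian noise (illustrative data only)."""
    n = int(dur * fs)
    trials = []
    for _ in range(n_trials):
        phase = random.uniform(0, 2 * math.pi) if random_phase else 0.0
        trials.append([
            math.sin(2 * math.pi * freq * t / fs + phase) + random.gauss(0, noise)
            for t in range(n)
        ])
    return trials


if __name__ == "__main__":
    random.seed(0)
    locked = make_trials(50, random_phase=False)
    jittered = make_trials(200, random_phase=True)
    print("phase-locked ITPC:", itpc(locked, 8000, 100.0, 0.05))
    print("random-phase ITPC:", itpc(jittered, 8000, 100.0, 0.05))
```

Repeating this calculation over a grid of frequencies and window centers would give a time/frequency map of the kind shown in the middle and right panels.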