Generalization of auditory expertise in audio engineers and instrumental musicians

From auditory perception to general cognition, the ability to play a musical instrument has been associated with skills both related and unrelated to music. However, it is unclear if these effects are bound to the specific characteristics of musical instrument training, as little attention has been paid to other populations such as audio engineers and designers whose auditory expertise may match or surpass that of musicians in specific auditory tasks or more naturalistic acoustic scenarios. We explored this possibility by comparing students of audio engineering (n = 20) to matched conservatory-trained instrumentalists (n = 24) and to naive controls (n = 20) on measures of auditory discrimination, auditory scene analysis, and speech-in-noise perception. We found that audio engineers and performing musicians had generally lower psychophysical thresholds than controls, with pitch perception showing the largest effect size. Compared to controls, audio engineers could better memorise and recall auditory scenes composed of non-musical sounds, whereas instrumental musicians performed best in a sustained selective attention task with two competing streams of tones. Finally, in a diotic speech-in-babble task, musicians showed lower signal-to-noise-ratio thresholds than both controls and engineers; however, a follow-up online study did not replicate this musician advantage. We also observed differences in personality that might account for group-based self-selection biases. Overall, we showed that investigating a wider range of forms of auditory expertise can help us corroborate (or challenge) the specificity of the advantages previously associated with musical instrument training.


Limitations
However, the conclusions that can be drawn from the current literature on the topic are somewhat limited by conflicting evidence and theoretical issues. An example is speech-in-noise perception. As noted above, a number of studies have reported musician advantages in perceiving speech in noisy or distracting environments, but equally, several studies have failed to detect an association with musical training across multiple experimental conditions (e.g. Boebinger et al., 2015; MacCutcheon et al., 2020; Madsen, Marschall, Dau, & Oxenham, 2019; Madsen, Whiteford, & Oxenham, 2017; Ruggles, Freyman, & Oxenham, 2014). It has been suggested that the advantage of musicians for speech-in-noise perception might depend on the relevance of pitch discrimination for the given task (Fuller, Galvin, Maat, Free, & Başkent, 2014), along with rhythmic skills (Slater et al., 2018) and the presence of spatial cues (Bidelman & Yoo, 2020; Clayton et al., 2016; but see Madsen et al., 2019), and may be partially negated by musicians' high levels of chronic noise exposure (Skoe, Camera, & Tufts, 2019). Importantly, the musician advantage for speech-in-noise perception could also be mediated by other and possibly preexisting cognitive abilities (e.g., working memory, attention) rather than being a direct effect of musical experience (Escobar, Mussoi, & Silberer, 2020; Schellenberg, 2015, 2019; Yoo & Bidelman, 2019). Thus, despite the interest in the topic and promising clinical applications (e.g. the rehabilitation of sensorineural and age-related hearing loss; Alain, Zendel, Hutka, & Bidelman, 2014; Lo, Looi, Thompson, & McMahon, 2020; Parbery-Clark, Strait, Anderson, Hittner, & Kraus, 2011), current evidence does not unequivocally support the hypothesis that musical training enhances speech-in-noise perception.
Another example is the musicians' advantage for auditory sequence memorisation and reproduction (Krishnan, Carey, Dick, & Pearce, 2021; Tierney, Bergeson-Dana, & Pisoni, 2008), which Carey et al. (2015) did not replicate using the same general paradigm, despite testing a relatively large number of highly trained violinists and pianists. More generally, many studies have only observed expertise-related skill transfer to contexts closely related to the original training context (for a review, see Green & Bavelier, 2008), although a lack of granularity in the definition of population characteristics and behavioural measurements might make it difficult to reach conclusive and replicable results (Green, Strobach, & Schubert, 2014).
For example, simple comparisons of musically trained and untrained individuals cannot explain whether any of the observed advantages are specifically associated with unique features of musical training or could instead be observed (or even enhanced) with other types of training. Evidence from single-task randomised controlled training studies on non-musicians shows that several auditory perceptual thresholds (i.e., pitch, duration, intensity, interaural time and level difference) can indeed be individually improved with training (Wright & Fitzgerald, 2005; Wright & Sabin, 2007), subsequently matching those of musicians (Micheyl, Delhommeau, Perrot, & Oxenham, 2006).
Additionally, other types of musical performers such as professional club disk jockeys have been shown to match trained percussionists in rhythmic ability (Butler & Trainor, 2015). Neuroplastic and behavioural correlates of other forms of auditory expertise unrelated to musical training have also been studied. For instance, 60 min of birdsong identification training was shown to lead to a decrease in early (200-300 ms) neural activity in left superior temporal gyrus and middle frontal gyrus in response to trained stimuli, but also a later (500-550 ms) increase in activity in the cingulate cortex bilaterally for untrained songbird stimuli (De Meo, Bourquin, Knebel, Murray, & Clarke, 2015). Additionally, scalp topography of P2 auditory-evoked potentials of songbird experts revealed a more frontal positivity than in naive participants in response to not only birdsongs, but also voice and environmental stimuli, which might reflect a generalised difference in processing strategy (Chartrand, Filion-Bilodeau, & Belin, 2007). Another example is learning to decode Morse code, which has been associated with an increase in neural activity in the inferior and medial parietal cortex bilaterally and in grey matter density in the fusiform gyrus (Schmidt-Wilcke, Rosengarth, Luerding, Bogdahn, & Greenlee, 2010), while musicians have been shown to reproduce Morse code at variable speeds more accurately than non-musicians after training at a static speed (Slayton, Romero-Sosa, Shore, Buonomano, & Viskontas, 2020). Nonetheless, very little attention has been paid to other populations whose profession depends on high levels of auditory sophistication, such as audio engineers.
2 At least in the compartmentalised or quasi-Platonic western notions of "music" and "being a musician" (Cross, 2012; Wiggins, Müllensiefen, & Pearce, 2010).
3 These domains might in fact share perceptual and cognitive processing in the brain, despite appearing superficially unrelated (e.g. the OPERA hypothesis for music and speech processing; Patel, 2011, 2014).

Population characteristics
Audio engineers attempt to create, capture, and modify sound in order to resolve technical issues and meet multiple artists' objectives (e.g., a musician's, a producer's, or their own), ultimately curating the listener's experience (Zwicker & Zwicker, 1991). This process can involve the discrimination and manipulation of psychoacoustic attributes such as pitch and timbre via equalisation and filtering, loudness and dynamic range via compression and expansion, but also synchronicity, phase, filtering, masking, and spatial features via custom configurations of hardware and software tools (Corey & Benson, 2016). Other than professional practice, this level of perceptual expertise is usually achieved via technical ear training, which involves exercises designed to improve the ability to focus on and identify discrete elements of auditory sensations, and associate them with objective acoustical measurements (Corey, 2013; Iwamiya, Nakajima, Ueda, Kawahara, & Takada, 2003; Letowski, 1985), although this practice is not yet fully standardised (Kaniwa et al., 2011; Kim, Kaniwa, Terasawa, Yamada, & Makino, 2013; Marui & Kamekawa, 2013, 2019; McKinnon-Bassett & Martens, 2013). Additionally, audio engineers must learn to deliberately direct their attention to individual elements of sounds or auditory scenes, and to maintain them in memory. For example, the practice of mixing in music production can involve listening to a complex auditory scene (e.g. an instrument group), scanning the scene to identify a source of potential acoustic issues in the global sound (e.g. phase interference, tonal imbalance, lack of definition or "muddy" sound, timing issues, etc.), applying a fix at the level of individual instruments or elements, reintegrating them into the scene, and reevaluating the updated auditory scene (for a detailed description of what mixing entails, see e.g. Case, 2012; Izhaki, 2008). This process intuitively should require considerable sustained selective attention (auditory scene segregation and integration) and auditory working memory (mental sound manipulation and pre-post comparison); the relevant tasks are supported by visual cues provided by screening devices like spectrum analysers.

A different model of auditory expertise
Musicians who play in ensembles must also be able to track the auditory scene and, in large ensembles, interpret the conductor's cues to synchronise with the group and adapt their sound to the collective performance. By comparison, audio engineers are responsible for several sound sources at the same time, have a much larger toolbox for acoustic manipulation that is not constrained by the physical construction of a musical instrument, and can work either synchronously (e.g., live performance) or asynchronously (e.g., studio work). Furthermore, there can be multiple ways of achieving similar acoustic outcomes depending on the available gear, personal workflow, and creative process (De Man et al., 2015). For instance, the adjustment of a sound's intensity could correspond to the turn of a knob or the push of a slider on a mixing board, the click of a mouse in a digital audio workstation, or the repositioning of a microphone. Moreover, these gestures can affect sound in real time or with any amount of delay. Conversely, the correspondence between an instrumentalist's gestures and acoustic outcomes is narrower in terms of range of motion and temporal co-occurrence of action and sound, which may promote auditory-motor coupling (Alluri et al., 2017; de Manzano, Kuckelkorn, Ström, & Ullén, 2020; Li et al., 2018; Palomar-García, Zatorre, Ventura-Campos, Bueichekú, & Ávila, 2017; Zatorre et al., 2007).
Audio engineers are also equipped with domain-specific knowledge such as signal processing, electronics, audio theory, and psychoacoustics (Howard & Angus, 2009), as well as technical language and professional jargon (Porcello, 2004), which can provide context and assist the interpretation of sensory perception. Taken together, the skills of these professionals correspond to a model of auditory expertise that is very different in nature from that of musical instrument training. In contrast to performing musicians, audio engineers do not need high proficiency in playing a musical instrument to excel in their profession. These unique characteristics of audio engineers can be exploited to test the specificity of some of the auditory advantages associated with musical training described in the literature, in particular fine auditory perception and auditory scene analysis.

Current study
The current study contrasts two different ecologically valid, auditory-based forms of expertise: audio engineering and playing a musical instrument. First, we tested the hypothesis that both audio engineers and musicians would show superior auditory skills compared to matched controls across a broad set of auditory-based measures that are both associated with musical training and essential for the practice of audio engineering. We included 6 psychophysical measurements (i.e., frequency, duration, intensity, sinusoidal amplitude modulation, interaural level difference, and interaural time difference) and 4 measurements of auditory scene analysis. The latter were: 1) a sustained auditory selective attention task (Laffere, Dick, & Tierney, 2020) where participants discriminate between two concurrent streams of tonal sequences; 2) a working memory and sound segregation task that involves the memorisation and matching of three concurrent sounds varying in frequency and amplitude modulation with a target sound; 3) a task that involves the detection of changes in the statistical properties of an auditory scene; and 4) a diotic speech-in-babble-noise task.
Second, we ran a set of exploratory analyses to identify and describe the unique attributes of our auditory expert cohorts. To complement the observational nature of this study and detect cohort qualities that may contribute to self-selection and performance, we also included self-report measures of personality and musical sophistication. The latter is particularly important as musicians and audio engineers can present partially overlapping forms of auditory expertise, thus posing a challenge to the interpretation of observational data. It is possible, for instance, for audio engineers to be excellent instrumentalists and, vice versa, for musicians to be knowledgeable in the field of audio engineering, although we aimed to partially reduce the overlap between these two populations by explicitly recruiting musicians with no expertise in audio engineering, including recording, mixing, and mastering. We then evaluated the associations between different levels of audio engineering experience, musical experience, and auditory skill.
Third, we explored whether, and to what extent, low-level perceptual ability, auditory scene analysis, and speech-in-babble perception correlate with each other, and compared how these associations manifest across groups.

Participants
Participants (n = 64) were undergraduate students of either audio engineering, a musical instrument degree, or any other non-musical degree. All participants were native English speakers between 19 and 26 years old and reported no history of hearing impairments. Audiometric thresholds were verified manually (see 2.2.1). Because audio engineers have not previously been studied as an expert auditory group, and therefore no well-motivated effect size could be estimated, N per group was determined from the literature on musicians reviewed above, which has shown quite consistent, musician-specific effects on aspects of auditory perception (Bidelman & Yoo, 2020; Boebinger et al., 2015; Clayton et al., 2016; Escobar et al., 2020; Fuller et al., 2014; Kaganovich et al., 2013; Kishon-Rabin et al., 2001; MacCutcheon et al., 2020; Madsen et al., 2019; Ruggles et al., 2014; Slater et al., 2018). The average group size, (N(musicians) + N(nonmusicians))/2, across all these studies was N = 20 (range 14-30); we recruited on this basis.

Audio engineering students
Students of audio engineering (n = 20, 17 M; age range = 19-26, mean (SD) = 21.3 (1.9)) were recruited first through email and flyer advertising. At the time of testing, they were enrolled full time (year 1, n = 2; year 2, n = 8; year 3, n = 10) in the Music Technology and Sonic Arts (BSc) program at Queen's University Belfast, where they were tested in a sound-insulated recording studio. They reported having on average 3.9 years of experience with audio recording, mixing or mastering (SD = 1.7, range = 1-7; see Table 1).

Musical instrument students
Musicians (n = 24, 16 M; age range = 20-26, mean (SD) = 23.9 (1.69)) were students of a musical instrument degree (see Table 1 for instruments) recruited in London through flyer advertising and UCL/Birkbeck SONA systems. Recruitment criteria included the practice of any musical instrument other than percussion for 4 or more years, with an average of at least 2 h of practice per day and no experience with audio engineering, mixing, mastering, or recording. Despite efforts being made to match all cohorts' demographics, the musician group included 5 more female participants and was on average 2.5 years older than engineers and controls. The effects of these potential confounds on the auditory measurements were evaluated post hoc via nonparametric univariate testing (see 2.4.2) for gender and Spearman correlations for age. No association was found for either demographic variable.

Control group
Controls (n = 20, 17 M; age range = 19-25, mean (SD) = 21.6 (1.9)) were also recruited in London through the UCL and Birkbeck SONA systems. They were undergraduate students of non-musical degrees (i.e., psychology, anthropology, pharmacy, history, management, mathematics, social sciences, finance, jewellery design, computer science, medicine), with no formal training or history of regular practice playing a musical instrument or audio engineering, mixing, mastering, or recording. Both music instrumentalists and controls were tested in a quiet testing booth at Birkbeck, University of London.

Procedure
The test battery was composed of one audiometric screening, 10 behavioural tasks, and 2 questionnaires. Each testing session lasted up to 2 h, with average duration being about 1 h and 45 min. To minimise differences across individuals due to task order, tasks and questionnaires were run in the same order for all participants, which is the order in which they are presented below. All auditory tasks were piloted by three expert raters who determined the ideal headphone volume in terms of task difficulty, clarity (i.e., task not loud enough), and comfort (i.e., task too loud). Once set, volume was kept constant for all tasks and participants. The study was approved by the Birkbeck Department of Psychological Sciences ethics committee (approval number 111228) and all participants gave their informed consent before the start of the experiment.

Audiology
Two different tools were used to measure audiometric thresholds. Students of audio engineering were tested with a Kamplex KC35 Audiometer, while musicians and non-musicians were tested with an Otopod system paired with Symphony software on a Windows XP laptop. In both cases, a 10 dB-down, 5 dB-up adaptive staircase procedure (British Society of Audiology, 2018) was used, and thresholds were measured using pure tones from a range of frequencies presented in this order: 1 kHz, 1.5 kHz, 2 kHz, 3 kHz, 4 kHz, 6 kHz, 8 kHz, 125 Hz, 250 Hz, 500 Hz, 750 Hz. After manually checking that they could hear a sample sound from both ears, participants were asked to listen carefully and to press the provided response button whenever they could hear a tone, starting at 10 dB HL. All frequencies were presented monaurally starting with the left ear. For each frequency, a threshold was determined when the participant performed 2 reversals at the same intensity.

Speech in babble noise (SIN)
Participants were instructed to listen carefully to a target sentence in the presence of four-talker babble and repeat that sentence out loud to the experimenter. Target sentences, spoken by a British male, were sampled from the Bamford-Kowal-Bench Speech-in-Noise (BKB-SIN) sentences (Bench, Kowal, & Bamford, 1979; Etymotic Research, 2005) and included 3 keywords. All stimuli were presented diotically. Participants were encouraged to repeat any word they heard, regardless of whether that was a single word or an entire sentence. The experimenter marked the number of correct words that were repeated. Unlike the original BKB-SIN test, we estimated speech/babble SNR thresholds using an in-house adaptive staircase procedure implemented in MATLAB (2013b; The MathWorks Inc, 2013). The initial SNR value was set to +10 dB and changed adaptively up or down according to participants' response (1-up 1-down). A response was considered correct if at least keywords were identified. After recording the participant's response, a new sentence was presented. The first step size was set to 8 dB and reduced to 6 dB, 4 dB, and 2 dB after each reversal. SNR changes were obtained by increasing or decreasing the amplitude of the target sentence, while the amplitude of the babble mask was kept constant. The experiment terminated after six reversals at the 2 dB step size or when the limit of 20 sentences was reached. A final SNR threshold was calculated as the average SNR of the stimuli presented after the first 3 reversals (i.e., the final set of stimuli presented with a 2 dB step size).
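The adaptive rule described above can be illustrated with a minimal Python sketch. This is not the in-house MATLAB implementation: the function name `run_staircase`, the `respond` callback, and the simulated listener are hypothetical, and the sketch assumes that the six terminating reversals are counted only once the 2 dB step size has been reached.

```python
def run_staircase(respond, start_snr=10.0, steps=(8.0, 6.0, 4.0, 2.0),
                  max_trials=20, final_reversals=6):
    """1-up 1-down staircase on SNR (dB) with a step size that shrinks at each reversal.

    `respond(snr)` is a hypothetical callback returning True when the
    sentence presented at that SNR is scored as correct.
    """
    snr = start_snr
    step_idx = 0               # index into `steps`
    last_direction = None      # -1 = SNR lowered, +1 = SNR raised
    reversals = 0              # all reversals so far
    reversals_at_final = 0     # reversals made at the smallest step (assumption)
    presented = []             # (snr, reversals completed before presentation)

    for _ in range(max_trials):
        presented.append((snr, reversals))
        direction = -1 if respond(snr) else +1   # correct -> harder, wrong -> easier
        if last_direction is not None and direction != last_direction:
            reversals += 1
            if step_idx == len(steps) - 1:
                reversals_at_final += 1
            else:
                step_idx += 1                    # shrink the step after a reversal
        last_direction = direction
        if reversals_at_final >= final_reversals:
            break
        snr += direction * steps[step_idx]

    # Threshold: mean SNR of the stimuli presented after the first three reversals,
    # i.e. those reached with the final (2 dB) step size.
    tail = [s for s, r in presented if r >= len(steps) - 1]
    threshold = sum(tail) / len(tail) if tail else None
    return threshold, presented


# Simulated deterministic listener (purely illustrative): correct at 0 dB SNR or above.
threshold, presented = run_staircase(lambda snr: snr >= 0.0)
```

With a noise-free listener like the simulated one, the track oscillates around the point at which responses flip, and the estimate is the mean of those final presentations.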

Sustained auditory selective attention (SASA)
This task was designed to quantify participants' sustained selective attention (Dick et al., 2017; Holt, Tierney, Guerra, Laffere, & Dick, 2018; Laffere et al., 2020). Each block consisted of a stream of 30 short sequences, each made of six 125 ms cosine-ramped sine tones sampled with replacement from two frequency bands in an alternating pattern (Fig. 1). Each band was composed of three tones set two semitones apart: 185, 207.7, and 233.1 Hz (F#3, G#3, and A#3) for the lower band and 370, 415.3, and 466.2 Hz (F#4, G#4, and A#4) for the higher band (i.e., one octave above). Tones were presented at regular intervals at a rate of 8 Hz, each sequence was followed by a 250 ms pause, and the first tone was always sampled from the lower band. As higher-frequency stimuli tend to be perceived as louder, a difference of 8 dB was set between the amplitudes of the tones in the high and low bands. For the first 10 blocks, participants were asked to respond by pressing the space bar when they heard 2 consecutive identical sequences in the high band. After a short break, participants completed another 10 blocks, this time detecting repetitions in the low band while ignoring tones in the high band. Each block included between 3 and repetitions. The experiment was preceded by 4 training blocks for each condition, during which the amplitude of the confounding stream was initially set to zero and linearly increased until it reached its final amplitude. Answers were evaluated within a 1 s window starting at the onset of the third tone of a sequence (i.e., between 0.5 s and 1.5 s after a sequence's onset). Participants received feedback on screen immediately after responding. Sensitivity to repetitions in the attended band was calculated as d'.
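As a quick check on the stimulus parameters reported above, the following Python sketch reproduces the band frequencies and the duration of one sequence. It assumes equal-tempered tuning (a factor of 2^(1/12) per semitone) and that the 250 ms pause follows each six-tone sequence; the function and variable names are illustrative only.

```python
def band_frequencies(base_hz, n_tones=3, semitone_step=2):
    """Equal-tempered tones spaced `semitone_step` semitones apart, starting at `base_hz`."""
    return [round(base_hz * 2 ** (k * semitone_step / 12), 1) for k in range(n_tones)]


low_band = band_frequencies(185.0)    # F#3, G#3, A#3
high_band = band_frequencies(370.0)   # one octave above: F#4, G#4, A#4

# Timing of a single sequence: six 125 ms tones (8 Hz rate) plus the 250 ms pause.
TONE_MS, TONES_PER_SEQUENCE, PAUSE_MS = 125, 6, 250
sequence_ms = TONES_PER_SEQUENCE * TONE_MS + PAUSE_MS
```

Under these assumptions each sequence occupies one second, so a stream of 30 sequences lasts about 30 s.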

Goldsmiths musical sophistication index (Gold-MSI)
A digital version of the Goldsmiths Musical Sophistication Index (Gold-MSI) (Müllensiefen, Gingras, Musil, & Stewart, 2014) was administered. This extensively normed questionnaire quantifies individual differences in musical sophistication according to five dimensions (Active Engagement, Perceptual Abilities, Musical Training, Singing Abilities, and Emotions) and one common factor, General Sophistication. Participants rated on a 7-point Likert scale how much they agreed with a statement that described their experience with music. Scores for each dimension were calculated as the sum of the ratings given to each item belonging to that dimension after inverting negatively scored items.

Ten item personality inventory (TIPI)
A computerised version of the Ten Item Personality Inventory (TIPI; Gosling, Rentfrow, & Swann, 2003) was administered. In this brief questionnaire, each of the Big Five personality dimensions (i.e., extraversion, agreeableness, conscientiousness, emotional stability, and openness to experience) is represented by two pairs of adjectives, one positive (e.g., "sympathetic, warm" for agreeableness) and one negative (e.g., "reserved, quiet" for extraversion). Participants were asked to indicate how much they identified with each pair of adjectives on a scale from 1 (Strongly disagree) to 7 (Strongly agree). The final scores were calculated by taking the average of the 2 items representing each dimension after inverting the ratings of the negative items.
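Both questionnaires rely on reverse scoring of negatively keyed items on a 7-point scale. The short Python sketch below illustrates that step; the function names and example ratings are hypothetical, and which items are reverse keyed is defined by each instrument rather than by this code.

```python
def reverse(rating, scale_min=1, scale_max=7):
    """Invert a rating on a bounded scale (1 <-> 7, 2 <-> 6, ...)."""
    return scale_max + scale_min - rating


def tipi_dimension(positive_item, negative_item):
    """TIPI-style score: mean of the positive item and the inverted negative item."""
    return (positive_item + reverse(negative_item)) / 2


def summed_dimension(ratings, is_reversed):
    """Gold-MSI-style score: sum of ratings after inverting the reverse-keyed items."""
    return sum(reverse(r) if flag else r for r, flag in zip(ratings, is_reversed))


# Hypothetical ratings, for illustration only.
extraversion = tipi_dimension(positive_item=6, negative_item=2)
training = summed_dimension([7, 1, 4], [False, True, False])
```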

Auditory scene recall (ASR)
This task was designed to measure participants' ability to segregate different sounds in an auditory scene and maintain them in memory for a short period of time (Pomper, Curetti, & Chait, 2023, in press). Each trial was made of three phases. During the first phase ("encoding phase"), participants listened to an auditory scene (2 s) composed of three amplitude-modulated pure tones ("streams") drawn from a fixed pool of 20 log-spaced frequencies between 200 Hz and 3000 Hz, with square-wave amplitude modulation rates set at either 3 Hz, 7 Hz, or 19 Hz, applied with 5 ms cosine ramps at the onset and offset of each pulse. Tone frequencies and modulation frequencies were set so that they would not be multiples of each other. The second phase consisted of 1.5 s of silence. During the third phase ("test phase"), a single stream (2 s) was presented: in half the trials, the stream was identical in both frequency and amplitude modulation rate to one of the streams presented in the encoding phase, whereas in the other half it had a new unique combination of frequency (sampled from the three frequencies presented in the encoding phase) and modulation rate (Fig. 1). For each trial, participants were asked to memorise the three streams presented simultaneously in the encoding phase and determine whether the single stream heard in the test phase was one of the three streams they memorised. This is analogous to a reversed delayed match-to-sample task, in that the options are presented before the sample. Participants responded by pressing the "F" key if they believed the target stream was present in the encoding phase, and the "D" key if it was not. Participants were allowed to respond as soon as they heard the target stream and up to 4 s after the stream offset. Before the task was administered, the experimenter played several sample sounds to make sure participants understood the task. One hundred trials were generated for each participant using MATLAB (2015b; The MathWorks Inc, 2015). Stimuli were generated at a 44.1 kHz sampling rate, saved as WAV files, and subsequently presented to participants in the form of two blocks of 50 trials each, with a break in between the two blocks. Visual feedback was provided for each trial and a summary score of false alarms, correct, and invalid responses was displayed at the end of each block. Target detection sensitivity was calculated as d' following a "1/(2N)" correction for extreme proportions of hit or false alarm rates (Macmillan & Kaplan, 1985; Stanislaw & Todorov, 1999).
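The sensitivity index can be illustrated with a short Python sketch of d' with the 1/(2N) correction, in which a rate of 0 is replaced by 1/(2N) and a rate of 1 by 1 - 1/(2N). The function name and the example counts are hypothetical; this is one common reading of the correction rather than the authors' analysis code.

```python
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections):
    """d' = z(hit rate) - z(false alarm rate), with the 1/(2N) correction."""
    n_signal = hits + misses
    n_noise = false_alarms + correct_rejections

    def corrected_rate(count, n):
        rate = count / n
        if rate == 0.0:
            return 1 / (2 * n)        # avoid z(0) = -infinity
        if rate == 1.0:
            return 1 - 1 / (2 * n)    # avoid z(1) = +infinity
        return rate

    z = NormalDist().inv_cdf          # inverse of the standard normal CDF
    return (z(corrected_rate(hits, n_signal))
            - z(corrected_rate(false_alarms, n_noise)))


# Hypothetical counts for 50 target-present and 50 target-absent trials.
chance_level = d_prime(25, 25, 25, 25)
perfect = d_prime(50, 0, 0, 50)
```

Without the correction, a participant with no misses or no false alarms would obtain an infinite d'; the correction caps the estimate at a finite value that depends on the number of trials.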

Psychophysics
Six psychophysical tasks were administered using the Maximum Likelihood Procedure (MLP) for auditory threshold estimation implemented in the Psychoacoustics toolbox (Soranzo & Grassi, 2014) in MATLAB (2013b; The MathWorks Inc, 2013) running on a MacBook computer. During the pitch discrimination (PD), duration discrimination (DD), intensity discrimination (ID), and sinusoidal amplitude modulation detection (SAMD) tasks, participants were asked to listen carefully to three randomly ordered sounds in a sequence (3AFC): 2 standard sounds set to a fixed parameter value, and 1 target sound whose parameter changed adaptively across trials. They then identified the sound that differed ('odd one out') from the standard sounds by pressing 1, 2 or 3 on the keyboard. For the interaural level difference (ILD) and interaural time difference (ITD) tasks, only two sounds were presented (2AFC), and participants were asked to identify whether the first of the two sounds was perceived as coming from the left or from the right (with the second sound having the same parameter magnitude but coming from the opposite side). All six psychophysical tasks were administered in two blocks of 20 trials each and no feedback was provided. Details of all six psychophysical tasks are reported in Table 2.
The MLP aims to achieve a fast estimate of psychophysical thresholds through a nonparametric adaptive procedure. After each trial, the procedure identifies the logistic function that best fits the expected psychometric function of each participant based on their current responses and other fixed variables such as the function slope, expected error rates (e.g. due to attentional lapses), and chance level (e.g. 0.33 for a 3AFC). The toolbox's default function slopes and expected error rates were used for all tasks. The procedure then calculates the next target stimulus as the parameter corresponding to a certain probability of a correct answer (i.e., 0.73 for 3AFC, 0.81 for 2AFC) in the previously estimated psychometric function. Details of the psychometric function estimation and stimulus selection are described in Grassi and Soranzo (2009). This procedure is very sensitive to attentional lapses, particularly at the beginning of each block, as early estimations of the participants' psychometric functions affect all remaining estimations within a given block. For this reason, if participants failed to identify the target stimulus in the first trial, which was always the easiest and therefore expected to elicit a correct answer, the corresponding block was marked as invalid. Final thresholds were calculated as the average of the two blocks after excluding blocks that were determined to be invalid.
Finally, we subtracted participants' thresholds from the standard values of the Pitch Discrimination, Duration Discrimination, and Intensity Discrimination tests. For instance, a threshold of 335 Hz for the Pitch Discrimination task, which has a standard value of 330 Hz, would correspond to a difference of −5 Hz. This was done to facilitate data visualization across auditory tasks by having greater values always correspond to a greater sensitivity.
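The logic of the maximum likelihood procedure described in this section can be sketched in a few lines of Python. This is a simplified illustration and not the Psychoacoustics toolbox code: the logistic parameterisation, the slope value, the grid of candidate midpoints, and the example responses are all assumptions made for the sake of the example.

```python
import math


def psychometric(x, alpha, beta=1.0, gamma=1 / 3, lam=0.0):
    """Logistic psychometric function with midpoint `alpha`, slope `beta`,
    chance level `gamma` (1/3 for 3AFC), and lapse rate `lam`."""
    return gamma + (1 - gamma - lam) / (1 + math.exp(-beta * (x - alpha)))


def log_likelihood(alpha, trials, **params):
    """Log-likelihood of the responses so far under the candidate midpoint `alpha`."""
    total = 0.0
    for x, correct in trials:
        p = min(max(psychometric(x, alpha, **params), 1e-9), 1 - 1e-9)
        total += math.log(p if correct else 1 - p)
    return total


def most_likely_midpoint(candidates, trials, **params):
    """Candidate function that best explains the responses collected so far."""
    return max(candidates, key=lambda a: log_likelihood(a, trials, **params))


def next_stimulus(alpha, p_target, beta=1.0, gamma=1 / 3, lam=0.0):
    """Stimulus level at which the fitted function predicts `p_target` correct."""
    return alpha - math.log((1 - gamma - lam) / (p_target - gamma) - 1) / beta


# Hypothetical (stimulus level, correct?) history, larger levels being easier.
history = [(10, True), (8, True), (6, True), (4, False), (2, False)]
alpha_hat = most_likely_midpoint(range(0, 11), history)
level = next_stimulus(alpha_hat, p_target=0.73)
```

Because every new estimate is conditioned on the full response history, an early lapse shifts all subsequent stimulus levels, which is why blocks with an incorrect first trial were treated as invalid.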

Stochastic auditory scene (StAS)
This task aimed at measuring participants' sensitivity to statistical changes in auditory sequences. Participants were presented with random sequences of concatenated 50 ms tone pips (gated on and off with 5 ms raised cosine ramps), selected with replacement from a pool of 20 distinct log-spaced frequencies between 222 and 1912 Hz (12% steps or 1/6 of an octave). All trials began with a series of randomly selected tones drawn from the pool. In half the trials, after 40-50 tones (with the number drawn randomly per trial), the sequence would then switch to a halved pool of only 10 frequencies for 40 tones (i.e., 2 s). There were two conditions: in the "full-to-middle" (F-M) condition, the halved pool consisted of the 10 middle frequencies (391-1085 Hz) of the original pool, whereas in the "full-to-edge" (F-E) condition it consisted of the five highest (1215-1912 Hz) and five lowest (222-349 Hz) frequencies (Fig. 1). Listeners were instructed to press the spacebar as soon as they heard a change in the auditory scene. Although they were not given information on what exactly would change, participants were provided with several examples and one practice trial per condition, as well as receiving visual accuracy feedback on the screen at the end of each sequence. Overall detection sensitivity was obtained by calculating d' for the two conditions, correcting for extreme proportions of hit and false alarm rates (Macmillan & Kaplan, 1985; Stanislaw & Todorov, 1999), and averaging them.
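The frequency pool and its two halved versions can be reconstructed from the values reported above. The Python sketch below is illustrative only; it assumes a constant ratio between adjacent frequencies, and the function and variable names are not taken from the original stimulus code.

```python
def log_spaced(f_min, f_max, n):
    """`n` frequencies from `f_min` to `f_max` with a constant ratio between neighbours."""
    ratio = (f_max / f_min) ** (1 / (n - 1))
    return [f_min * ratio ** k for k in range(n)]


pool = log_spaced(222.0, 1912.0, 20)

middle_pool = pool[5:15]           # "full-to-middle": the 10 central frequencies
edge_pool = pool[:5] + pool[15:]   # "full-to-edge": five lowest plus five highest

# Size of one step, as a percentage increase between adjacent frequencies.
step_percent = (pool[1] / pool[0] - 1) * 100
```

Rounding the boundary values of `middle_pool` and `edge_pool` recovers the ranges given in the text, and `step_percent` comes out at roughly 12%.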

Data preprocessing
Scores for all behavioural tasks were screened for univariate outliers and patterns of missing data using JMP 15.2.1.

Outliers
Extreme data points were evaluated manually based on overall data distributions, previous benchmarks, and a combination of robust measures of centre and spread. More specifically, values over one interdecile range from the first or ninth decile or over 4 robust spreads from the centre (M-estimates; Huber, 1973; Huber, 2011) were initially flagged as extreme. A total of 5 individual data points were flagged as outliers: pitch discrimination (29.52 Hz, or about 8.9% of the 330 Hz reference stimulus) for participant M6; intensity discrimination (7.08 dB SPL) for participant C20; sinusoidal amplitude modulation detection (−4.25 dB, 20log(m)) for participant M9; interaural level difference (4.54 dB) for participant C15; and speech-in-babble (−24.5 dB SNR) for participant M22. The first 4 observations correspond to exceptionally high (i.e., poor) psychoacoustic threshold estimates, by far higher than any other participant or benchmark (e.g. Kidd, Watson, & Gygi, 2007). Further inspection revealed that these were due to mistakes (e.g., attentional lapse, wrong button pressed, random guessing) made by participants within the first few trials of both blocks, to which the MLP is particularly sensitive (Soranzo & Grassi, 2014). For this reason, these measurements were judged as invalid and excluded from further analyses. The speech-in-babble outlier, on the other hand, corresponds to an extremely low (i.e., exceptionally good) SNR threshold that could not be attributed to measurement error, and so it was retained. None of the other potential outliers identified by manual inspection of data distributions could be attributed to technical error and so they were retained as valid measurements.
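The interdecile criterion used for the initial flagging can be sketched as follows. This Python example covers only that one rule (not the Huber M-estimate criterion or the manual evaluation), and the quantile estimation method used by `statistics.quantiles` may differ from the one used in JMP; the data are invented for illustration.

```python
from statistics import quantiles


def flag_interdecile_outliers(values):
    """Return values lying more than one interdecile range below the first
    decile or above the ninth decile."""
    deciles = quantiles(values, n=10)   # nine cut points: D1 ... D9
    d1, d9 = deciles[0], deciles[-1]
    interdecile_range = d9 - d1
    lower, upper = d1 - interdecile_range, d9 + interdecile_range
    return [v for v in values if v < lower or v > upper]


# Invented scores: twenty ordinary values plus one extreme observation.
scores = list(range(1, 21)) + [100]
flagged = flag_interdecile_outliers(scores)
```

Flagging is only the first step: as described above, each flagged value was then inspected before deciding whether to exclude or retain it.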

Missing data
A total of 15 out of 640 (2.3%) missing data points were identified across the behavioural dataset, due to either outlier exclusion, technical issues during testing, or time constraints. Missing data points were distributed across nine participants who failed to complete one task each, and one participant (E1) from the audio engineers group who failed to complete six tasks (auditory scene recall and all psychophysical tasks except interaural time difference). Gold-MSI data for 3 participants from the musician group were also missing due to data corruption during the online questionnaire saving process. When using statistical methods that require complete data vectors for every participant (i.e. variable importance, see 2.4.3), participant E1 (audio engineers) was entirely excluded; for the remaining participants, multivariate normal imputation based on a least squares prediction from the non-missing variables with shrinkage (Schäfer & Strimmer, 2005) was calculated for the rest of the dataset using JMP (15.2.1) and employed as an alternative to listwise deletion in order to retain as much information as possible (Schafer, 1999). While imputation can generate redundancy in data and increase the risk of Type 1 error, listwise deletion can increase Type 2 error and reduce statistical power (Cheema, 2014; Mishra & Khare, 2014). Where applicable, both methods were utilised and results compared to verify whether missing data would cause critical differences in statistical analyses. Since results obtained with both procedures were nearly identical, for simplicity, only the results obtained with imputation of missing data are reported. Pairwise deletion was instead employed when calculating correlations (see 2.4.5).

Statistical analyses
Statistical analyses are divided into two sections. First, we tested the a priori global hypothesis that groups do not come from the same auditory population, followed by the more specific hypothesis that auditory experts (i.e., musicians and audio engineers) will outperform controls across each of the auditory tasks, with emphasis on inferential statistics and control of Type 1 error. Second, we ran a series of exploratory analyses to uncover any meaningful patterns in the dataset, as well as to test finer-grained hypotheses regarding the specific differences between the musician and audio engineer populations, relationships between auditory tasks, and the role of musical and audio engineering experience. Methods for data exploration included graphical methods, descriptive statistics, point estimates of relevant sample statistics, and data-driven models (Behrens, 1997; Szucs & Ioannidis, 2017). Any a posteriori hypothesis formulated during data exploration was made explicit so that the associated confidence intervals and p-values could be interpreted as per their descriptive content (Amrhein, Korner-Nievergelt, & Roth, 2017; Lavine, 2014), rather than as confirmatory evidence for inference at the population level (Cohen, 1994; Gaus, 2015). Robust metrics and/or nonparametric methods were preferred across all statistical analyses to accommodate differences in distribution characteristics across tasks and groups, unbalanced classes, heteroscedasticity, and the presence of outliers without resorting to arbitrary data transformations or post-hoc analytic choices. Finally, to improve readability, the signs of all psychophysical and speech-in-babble thresholds were reversed before analyses so that a greater number always represented better performance across all tasks.

Multivariate differences (nonparametric MANOVA)
The global null hypothesis of no group differences in auditory skills was tested with the nonpartest function in the npmv R package (version 2.4.0; Ellis, Burchett, Harrar, & Bathke, 2017), which employs a multivariate ANOVA-type test statistic based on ranks (Brunner, Dette, & Munk, 1997; Brunner & Munzel, 2000) and p-values calculated via an asymptotic F-distribution approximation (Bathke & Harrar, 2008) or resampling. This is a nonparametric equivalent of a MANOVA.

Univariate multiple comparisons and relative effects
In the case of a rejection of the multivariate null hypothesis, a set of univariate tests was planned to test whether experts outperform controls in each auditory task. This was done with a rank-based nonparametric multiple contrast test procedure (MCTP) implemented in the mctp function in the nparcomp R package (version 3.0; Konietschke, Placzek, Schaarschmidt, & Hothorn, 2015; Noguchi, Abel, Marmolejo-Ramos, & Konietschke, 2020). This procedure was selected for all univariate comparisons as it does not assume a particular distribution shape, homoscedasticity, or balanced classes. The MCTP tests hypotheses of stochastic inequality, that is, the probability that a random observation from one sample is larger (or smaller) than a random observation from another sample. This operationalises the notion that one group will tend to outperform another without reference to measures of central tendency and spread (Cliff, 1993; Delaney & Vargha, 2002). This probability is referred to as the relative effect and was calculated for each group against a reference unweighted mean distribution of all group distributions, so that a random measurement from one group is always evaluated in the context of the entire dataset. Relative effects were used to formulate hypotheses about group inequalities. Specifically, for each auditory task, we tested the one-tailed null hypothesis that control participants will show equal or better performance compared to musicians or audio engineers, that is, an equal or higher relative effect. The rejection of a null hypothesis for a given task would then support the alternative hypothesis that one or both auditory expert cohorts scored significantly higher than controls for that task. This was done by setting type = "Dunnett" (i.e., many-to-one comparisons) and alternative = "greater" in the mctp function. In addition to the simple difference between relative effects, a point estimate of a transformed log odds-type effect size comparable in magnitude to Cohen's d was also calculated and reported to facilitate interpretation (Noguchi et al., 2020). The MCTP is a single-step procedure, in that overall and specific contrasts are evaluated at the same time with no contradiction (i.e., a statistically significant omnibus test always corresponds to a significant "post-hoc" test and vice versa) and under strong control of the family-wise error rate (FWER). Asymptotic estimates of adjusted p-values and simultaneous confidence intervals were calculated following a multivariate t-based approximation with adjusted degrees of freedom (Noguchi et al., 2020). The p-values of the overall effects, which always correspond to the lowest p-value of any pairwise comparison, were further corrected following the Benjamini-Hochberg false discovery rate (FDR) adjustment (Benjamini & Hochberg, 1995) implemented in the p.adjust function from the stats R package. An equivalent testing procedure for simple pairwise comparisons (i.e., a studentised permutation test (Neubert & Brunner, 2007) with the npar.t.test function from the same package) was used to complement plots and descriptive statistics during exploratory analyses between audio engineers and musicians. In these cases, p-values were left uncorrected and explicitly reported as such to suggest an appropriate interpretation.
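To make the two central quantities of this section concrete, the sketch below estimates unweighted relative effects and applies the Benjamini-Hochberg adjustment in plain Python. It is not the nparcomp implementation: the Dunnett-type contrasts, the t-based approximation, and the simultaneous confidence intervals are not reproduced, the function names are hypothetical, and the example scores and p-values are invented.

```python
def pairwise_effect(x, y):
    """Estimate P(X < Y) + 0.5 * P(X = Y) over all pairs of observations."""
    total = 0.0
    for a in x:
        for b in y:
            if a < b:
                total += 1.0
            elif a == b:
                total += 0.5
    return total / (len(x) * len(y))


def relative_effects(groups):
    """Relative effect of each group against the unweighted mean of all
    group distributions (each group counts equally, whatever its size)."""
    k = len(groups)
    return [sum(pairwise_effect(other, g) for other in groups) / k
            for g in groups]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(p_values)
    descending = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, i in enumerate(descending):
        rank = m - position  # ascending rank of p_values[i]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical sign-reversed scores (higher = better), for illustration only
controls = [1.0, 2.0, 3.0]
experts = [4.0, 5.0, 6.0]
effects = relative_effects([controls, experts])
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

In this toy example every expert score exceeds every control score, so the relative effects are 0.25 for controls and 0.75 for experts; a value of 0.5 would indicate no tendency towards higher or lower scores relative to the pooled reference.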

Classification of musicians and audio engineers: Variable importance
To further explore the different characteristics of our expert cohorts on a multivariate basis, we extracted variable importance from a random forest classifier (Breiman, 2001) trained with personality scores, Gold-MSI sub-dimensions, and auditory measures as predictors. Random forests are non-parametric algorithms that aggregate predictions from binary decision trees constructed on bootstrap samples or sub-samples of the original dataset and random subsets of predictors (for an overview, see e.g. Strobl, Malley, & Tutz, 2009). We selected a class of random forests that utilises conditional inference trees as base classifiers (Torsten, Kurt, & Achim, 2006). These perform permutation tests (Strasser & Weber, 1999) at each node to identify the predictor most strongly associated with the response variable along with the optimal split point that maximises the discrepancy between the subnodes (Torsten et al., 2006). This method, when applied with subsampling without replacement, has been shown to be unbiased to the nature of a predictor (e.g., categorical, scale, ordinal). This differs from other types of binary decision trees that rely on measures of impurity reduction, such as classification and regression trees (Strobl, Boulesteix, Zeileis, & Hothorn, 2007). This feature is particularly important as our predictors include both continuous variables and low-cardinality questionnaire data. We grew our forest with cforest from the partykit R package (Hothorn & Zeileis, 2015), with hyperparameters set to ntree = 10,000 (number of trees in the forest), mtry = 4 (number of random predictors tested at each node of a tree; default is √p where p is the number of predictors), and perturb set to a subsampling fraction of 0.632 with no replacement in order to achieve unbiasedness to predictor type (see above). Trees in the forest were allowed to fully grow by setting minsplit = minbucket = 1 (minimum size of a node), only limited by a minimum significance of a permutation test set with mincriterion = 0.95 (1 − p-value). These were set with the goal of achieving a compromise between variance (i.e., node size of 1) and bias (i.e., high criterion of 0.95) (see guidelines in Probst, Boulesteix, & Bischl, 2019). The importance of each predictor in the model was calculated as conditional permutation importance (Strobl, Boulesteix, Kneib, Augustin, & Zeileis, 2008). Permutation importance corresponds to the mean decrease in prediction accuracy when the values of a predictor are randomly permuted. Conditional permutation importance also accounts for collinearity between variables by measuring associations between predictors and permuting collinear ones together. This was calculated using the varimp function in partykit with parameters nperm = 5 (number of permutations), conditional = TRUE, and threshold = 0.95. As per default, prediction accuracy and importance were calculated on the "out-of-bag" data (i.e., OOB = TRUE), that is, on the data excluded during subsampling. Random forests were employed here as a fully nonparametric tool for data exploration (Jones & Linder, 2015) which, given a high number of predictor variables and a low number of observations, specifically serves the purpose of identifying and ranking a subset of variables (i.e., feature selection) that can best describe the differences between musicians and audio engineers. As multiple imputation and listwise deletion led to interchangeable results, only results following listwise deletion are reported. This corresponds to a total of 40 participants, 19 audio engineers and 21 musicians. Variables with importance above 2.5%, corresponding to a mean decrease in accuracy equivalent to at least one participant (i.e., 100%/40), were included in an alternative reduced model. For the purpose of replicability, results were obtained using a random seed of 1112.
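The idea behind permutation importance can be illustrated with a short Python sketch. Note that this shows plain (unconditional) permutation importance for a fixed toy classifier evaluated on a single dataset; it does not implement the conditional variant, the out-of-bag evaluation, or the conditional inference forest used in the analysis. The function names, the toy data, and the threshold classifier are all hypothetical.

```python
import random


def accuracy(predict, rows, labels):
    """Proportion of rows for which predict(row) equals the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(predict, rows, labels, column, n_perm=5, seed=1112):
    """Mean decrease in accuracy after shuffling one predictor column.

    Assumption: unconditional permutation on a fixed dataset, whereas the
    analysis above used conditional permutation on out-of-bag data.
    """
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    drops = []
    for _ in range(n_perm):
        shuffled = [r[column] for r in rows]
        rng.shuffle(shuffled)
        permuted = [r[:column] + [v] + r[column + 1:]
                    for r, v in zip(rows, shuffled)]
        drops.append(baseline - accuracy(predict, permuted, labels))
    return sum(drops) / n_perm


# Toy data: 40 "participants", column 0 informative, column 1 irrelevant
rows = [[i - 19.5, float((i * 7) % 5)] for i in range(40)]
labels = [int(r[0] > 0) for r in rows]


def toy_classifier(row):
    """Hypothetical classifier that only looks at column 0."""
    return int(row[0] > 0)


importance_informative = permutation_importance(toy_classifier, rows, labels, 0)
importance_irrelevant = permutation_importance(toy_classifier, rows, labels, 1)

# With 40 participants, one misclassified participant is 100% / 40 = 2.5%
one_participant = 1 / len(rows)
```

Shuffling the irrelevant column leaves accuracy unchanged, so its importance is zero, whereas shuffling the informative column degrades accuracy well beyond the one-participant threshold of 2.5%.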

Musical and audio engineering experience
To draw a more direct comparison between musicians and audio engineers with a similar musical background, we clustered participants into two groups based on their score in the Musical Training sub-dimension of the Gold-MSI questionnaire. Specifically, apart from one musician who scored 31, musicians scored between 37 and 49 (Fig. 2). Therefore, using a cut-off of 37, we were able to match all but this one musician with 8 audio engineers with a similar musical background. The underlying meaning of this cut-off was further examined using two items of the Gold-MSI questionnaire that contribute to the musical training score, namely "I engaged in regular, daily practice of a musical instrument (including voice) for___years" and "I have had___years of formal training on a musical instrument (including voice) during my lifetime," in order to qualify possible differences in formal or informal training between cohorts (Fig. 2). We then re-examined differences in behavioural measures between musicians and engineers with a similar level of musical training, as well as audio engineers with different levels of musical training, using the same methods described in paragraph 2.4.2. Additionally, we explored associations between mixing and mastering experience and behavioural measures with Spearman correlations.

Correlations between auditory tasks
Monotonic relationships between behavioural variables were estimated using Spearman's rank correlation coefficients (ρ) separately for each group. Empirical confidence intervals for individual bivariate ρ were calculated via bootstrapping (Haukoos & Lewis, 2005; Wright, London, & Field, 2011). Relevant correlations, as well as their differences across groups, were assessed graphically with a series of correlograms as well as bivariate scatterplots on both raw data and ranked data. To facilitate comparisons between groups, data were ranked within group and centred at the median rank before plotting.
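A bootstrapped confidence interval for Spearman's ρ can be sketched in plain Python as follows. The percentile method, the number of resamples, the seed, and the function names are assumptions made for illustration; the text does not specify which bootstrap variant was used. The example data are invented.

```python
import random


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; NaN if either variable has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def bootstrap_ci(x, y, level=0.90, n_boot=2000, seed=1):
    """Percentile bootstrap interval for rho (assumed variant)."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        rho = spearman([x[i] for i in idx], [y[i] for i in idx])
        if rho == rho:  # drop NaN from degenerate resamples
            stats.append(rho)
    stats.sort()
    lower = stats[int((1 - level) / 2 * len(stats))]
    upper = stats[min(len(stats) - 1, int((1 + level) / 2 * len(stats)))]
    return lower, upper


# Hypothetical scores on two tasks for one group
task_a = [0.2, 0.5, 0.9, 1.1, 1.4, 1.6, 2.0, 2.3, 2.4, 3.0]
task_b = [1.0, 0.8, 1.5, 1.2, 2.2, 1.9, 2.1, 3.0, 2.6, 3.3]
rho = spearman(task_a, task_b)
ci_low, ci_high = bootstrap_ci(task_a, task_b)
```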

Auditory expertise: Multivariate and univariate tests
The multivariate null hypothesis that participants come from the same population was rejected (ANOVA-type test statistic = 4.254, df1 = 11.616, df2 = 301.082, p < 0.0001), confirming that groups do indeed exhibit overall different degrees of auditory ability. After FDR correction, the null hypothesis of stochastic equality between experts and controls was rejected at the 0.05 level on all tasks except duration discrimination, intensity discrimination, and stochastic auditory scene (full details of test statistics can be found in Table 3). On perceptual tasks, both students of audio engineering and musical instrumentalists had significantly lower thresholds for pitch discrimination and interaural time difference tasks than controls. Musicians also showed significantly lower thresholds than controls on sinusoidal amplitude modulation discrimination and interaural level difference tasks (Fig. 3). On auditory scene tasks, musicians were more accurate than controls on the sustained auditory selective attention task, while audio engineers were more accurate than controls on the auditory scene recall task. Finally, musicians, but not engineers, showed significantly lower SNR thresholds for the speech-in-babble-noise task (Fig. 4). Pitch discrimination had the largest expertise-related effect size across all auditory tasks for both expert cohorts compared to the control group, with median thresholds for audio engineers (median = 3 Hz, or a 0.9% difference from the reference tone; MAD = 1.659 Hz, 0.5%) and musicians (median = 3 Hz, 0.9%; MAD = 1.248 Hz, 0.37%) being approximately half of those of control participants (median = 6.667 Hz, 2%; MAD = 3.983 Hz, 1.2%).

Random forests: Variable importance
To summarise the variables in our dataset that can best discriminate between musicians and audio engineers and rank their relevance, we calculated the conditional permutation importance (i.e., the mean decrease in classification accuracy following a permutation of a given predictor) of a random forest classifier built on all variables in our dataset. The overall accuracy of the full model including all 20 predictors was 80%. A reduced model (Fig. 5) which only included variables with importance above 2.5% had an accuracy of 82.5%. The predictor with the largest influence on prediction accuracy was singing abilities (25.6%), followed by speech-in-babble-noise thresholds and musical training (~15%), and emotional stability (5.4%). Minor contributions between 2.5% and 5% were obtained for active engagement and two psychophysical tasks, interaural time difference and duration discrimination. Bivariate Spearman correlations among the top three predictors revealed that while singing abilities and musical training were strongly correlated for both musicians (ρ = 0.53, 90% CI [0.14, 0.83]) and audio engineers (ρ = 0.62, 90% CI [0.20, 0.86]), speech-in-babble-noise thresholds had no correlation with either predictor.

Auditory tasks
Data plots (i.e., Fig. 3, Fig. 4) and descriptive statistics were used to integrate the results from the random forest importance classification and interpret the directionality of its prediction. In terms of behavioural variables, speech-in-babble-noise thresholds of musicians (median = −9.87 dB SNR, MAD = 1.82 dB SNR) were significantly lower than those of both controls (median = −8.61 dB SNR, MAD = 1.42 dB SNR) and engineers (median = −8.15 dB SNR, MAD = 1.62 dB SNR; post-hoc Brunner-Munzel, effect size = 0.674, test statistic = 3.347, p = 0.003), although musicians were also the most inconsistent within group and displayed the largest range (20.5 dB SNR) of responses on this task, a point we return to below and in Experiment 2. As for the other auditory scene performance tasks, which did not add a unique contribution to classification accuracy according to the random forest model, median sustained auditory selective attention d' was marginally higher for the musician group (0.777, MAD = 0.173) than for audio engineers (0.709, MAD = 0.148), while the opposite was true for the auditory scene recall task (audio engineers: median = 1.411, MAD = 0.307; musicians: median = 1.187, MAD = 0.519), although these differences were not statistically significant.⁴ As for psychophysical tasks, with the exception of sinusoidal amplitude modulation discrimination, audio engineers' median thresholds were the lowest across all tasks, albeit by a very small margin. The most apparent difference between expert cohorts (Fig. 3) was duration discrimination (audio engineers: median = 29.03 ms, MAD = 5.98 ms; musicians: median = 32.55 ms, MAD = 9.47 ms), although a post-hoc test showed this difference was also not statistically significant at the 0.05 level (post-hoc Brunner-Munzel, effect size = 0.408, test statistic = 1.927, p = 0.063).

Musical expertise and personality
Musical sophistication (Fig. S1) and personality traits (Fig. S2) were among the most important variables in the discrimination of musicians and audio engineers. Unsurprisingly, musicians scored substantially higher than audio engineers in the musical training (post-hoc Brunner-Munzel, effect size = 0.693, test statistic = 3.311, p = 0.004) and singing abilities dimensions (post-hoc Brunner-Munzel, effect size = 1.1486983, test statistic = 6.505, p < 0.001) of the Gold-MSI questionnaire, but also marginally higher in the perceptual abilities (post-hoc Brunner-Munzel, effect size = 0.498, test statistic = 2.393, p = 0.028) and emotions (post-hoc Brunner-Munzel, effect size = 0.448, test statistic = 2.134, p = 0.041) dimensions. However, a comparable level of active engagement with music was present in musicians compared to audio engineers. Results from the TIPI questionnaire revealed significant differences in emotional stability (post-hoc Brunner-Munzel, effect size = −0.523, test statistic = −2.569, p = 0.015), with musicians on average seeing themselves as less emotionally stable than audio engineers. Musicians and audio engineers also appeared to cluster around equally higher scores compared to controls in the openness to experience dimension, which included an item about creativity.

⁴ An experiment with much larger sample sizes would be needed to appropriately test the statistical significance of such small effect sizes.

Table 3
Results of many-to-one testing procedure between audio engineers (E) and musicians (M) compared to controls (C). Tasks: pitch discrimination (PD), duration discrimination (DD), intensity discrimination (ID), sinusoidal amplitude modulation discrimination (SAMD), interaural level difference (ILD), interaural time difference (ITD), sustained auditory selective attention (SASA), auditory scene recall (ASR), stochastic auditory scene (StAS), speech in babble noise (SIN). H_a: Alternative hypotheses expressed as the probability that a random participant from the audio engineer group (E > C) or musician group (M > C) would have a higher score than a random participant from the control group. Rel. Effect [95% CI]: relative effects with one-tailed 95% confidence interval. Effect size: log-odds type effect size comparable in magnitude to Cohen's d. Statistic: test statistic. p: test significance with strong control of the family-wise error rate within each task. p_omni: significance of the omnibus test. p_FDR: significance of the omnibus test corrected for false discovery rate across all tasks (bolded if p < 0.05).

Musical training and audio engineering experience
Clustering participants based on their musical training background (see 2.4.5) did not affect previous results: musicians displayed lower speech-in-babble thresholds than audio engineers with a matched degree of musical training (post-hoc Brunner-Munzel, effect size = 0.733, test statistic = 2.93, p = 0.019) and there were no significant differences in auditory ability between audio engineers with different musical backgrounds (nonparametric MANOVA, permutation test of ANOVA-type statistic with 10,000 replications, p = 0.687). On the other hand, audio engineering experience was moderately correlated with both stochastic auditory scene (ρ = 0.43, 90% CI [0.08, 0.70]) and speech in babble noise (ρ = 0.49, 90% CI [0.13, 0.78]) performance, although even the most trained participants' scores fell within the range of control participants.

Associations between fine perception, auditory scene analysis, and speech in noise
Among the auditory scene tasks, sustained auditory selective attention d' scores appeared to be the most consistently (i.e., across groups) associated with psychophysical thresholds, in particular with pitch discrimination, intensity discrimination, and interaural time difference (Fig. 6). Correlations between speech-in-babble-noise thresholds and psychophysical tasks were mixed across groups and overall negligible. Correlations between the auditory scene and speech-in-babble-noise

Fig. 3. Dot plots, same area violin plots, and box plots for all psychophysical measures by group. Just noticeable differences are reported on the y axes with opposite signs in order for a positive effect size to consistently correspond to a better performance across tasks. Brackets above graphs display log-odds-type effect size and one-tailed p values when p < 0.05.

Experiment 2: Online coordinate response measure (CRM) task
The apparent advantage for musicians perceiving speech in babble versus both audio engineers and controls could potentially be related to the difference in dialect between audio engineers (Irish) and musicians and controls (English). Also, as noted above, there was considerable variability in musician performance on this task, with most musicians performing like the other groups, but a minority performing well at very challenging SNR levels. Therefore, we conducted an online coordinate response measure experiment as a follow-up test of the performing musician advantage we observed in recognizing words in the presence of competing babble. It also served as a test of the potential effect of native (Irish versus English) accent on perceiving a southern English accent in challenging conditions.

Participants
We recruited online participants via email advertisement from Queen's University Belfast (n = 32) and via the online recruitment portal Prolific (n = 84) using the following criteria: no hearing impairments, English-speaking monolingual, age range 18-35, nationality (English; Irish or Northern Irish), experience with musical instrument (musicians ≥ 5 years; non-musicians ≤ 1 year). To further validate the screening criterion relating to musical training, we collected data on the number of years of regular practice (mean ± SD: musicians, 9 years ± 4.8; non-musicians, 0.8 years ± 1.8) and formal training (musicians, 5.9 years ± 4.6; non-musicians, 0.7 years ± 1.8) in any musical instrument. Additionally, we asked participants to describe their English background in their own words (e.g. "I grew up in London, my family spoke Punjabi and English at home to me."). Five participants whose accent was neither English nor Irish were excluded from further analyses. The resulting four groups were English musicians (n = 24, 10F, age = 29.6 years old ± 4.2), English non-musicians (n = 30, 16F, age = 28.2 years old ± 5.1), Irish musicians (n = 27, 15F, age = 23.9 years old ± 4.3), and Irish non-musicians (n = 32, 23F, age = 26.3 years old ± 5.2).

Procedure
The speech-in-noise task was an online implementation of the adaptive Coordinate Response Measure (CRM) (Bianco, Mills, de Kerangal, Rosen, & Chait, 2021; Bolia, Nelson, Ericson, & Simpson, 2000; Messaoud-Galusi, Hazan, & Rosen, 2011), in which the target stimuli were sentences including a colour and a number following the format: "show the dog where the [colour] [number] is". We chose this speech-in-noise task as a self-administered alternative to the one used in the first experiment, which instead requires interaction with an experimenter. Each stimulus was masked by the same speech babble used in the previous speech in multi-talker babble task (see paragraph 2.2.2), applied using an adaptive 1-up 1-down staircase procedure. The initial talker-masker ratio and step size were set to +20 dB and +9 dB respectively. Step size was reduced by 2 dB at each reversal, with each run stopping after 4 reversals. Each participant completed 4 runs in total. To perform the task, participants clicked on boxes with the corresponding combination of number and colour. All combinations of numbers (1 to 9, except the bisyllabic 7) and colours (black, white, green, red, blue, pink) were displayed at all times. Thresholds were calculated as the average SNR of the last 3 reversals. Before the start of the experiment, we used an online sound-level-setting paradigm (Zhao, Brown, Holt, & Dick, 2022) to help participants set the amplitude of the stimulus at an average of 74 dB SPL (range 67-80 dB SPL).
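The adaptive track described above can be sketched in Python as a single run. Several details are assumptions rather than facts from the text: a correct response lowers the SNR and an incorrect one raises it, a reversal is logged at the SNR of the trial on which the direction changed, and a trial cap guards against runs that never reverse. The function names and the deterministic listener are hypothetical.

```python
def run_staircase(respond, start_snr=20.0, start_step=9.0, step_decrement=2.0,
                  max_reversals=4, n_average=3, max_trials=200):
    """Simulate one 1-up 1-down run and return (threshold, reversal SNRs).

    respond(snr) must return True for a correct response. The threshold
    is the mean SNR of the last n_average reversals.
    """
    snr, step = start_snr, start_step
    previous_direction = None
    reversals = []
    for _ in range(max_trials):
        direction = -1 if respond(snr) else +1  # correct -> harder (lower SNR)
        if previous_direction is not None and direction != previous_direction:
            reversals.append(snr)  # assumption: log the SNR of this trial
            if len(reversals) == max_reversals:
                break
            step -= step_decrement  # 9 -> 7 -> 5 -> 3 dB
        previous_direction = direction
        snr += direction * step
    else:
        raise RuntimeError("run ended before reaching the reversal limit")
    threshold = sum(reversals[-n_average:]) / n_average
    return threshold, reversals


def ideal_listener(snr, true_threshold=-5.0):
    """Hypothetical listener who is correct whenever snr >= true_threshold."""
    return snr >= true_threshold


threshold, reversals = run_staircase(ideal_listener)
```

With this idealised listener the track reverses at −7, 0, −10, and −4 dB, giving a threshold of about −4.67 dB SNR from the last three reversals. A real listener responds probabilistically, so tracks would be noisier.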

Results
A 2 × 2 ANOVA was conducted to examine the effect of native accent and musical training on speech-in-noise perception thresholds (Fig. 7). There was no statistically significant main effect for either variable (Accent, F(1,112) = 0.447, p = 0.505; Musical training, F(1,112) = 0.031, p = 0.860), nor for their interaction (F(1,112) = 0.473, p = 0.493). Furthermore, there was no correlation between SNR thresholds and years of regular instrument practice (rho = −0.08) or formal musical training (rho = −0.05). As can be seen in Fig. 7, there is essentially complete overlap in the range and distribution of thresholds across groups.

Results summary
This study introduced a novel population of auditory experts: audio engineers. We first tested the hypothesized superiority of their fine perceptual and auditory scene analysis skills in relation to naive participants and contrasted their performance to that of music instrumentalists. We found that, when compared to naive participants, both auditory expert cohorts had lower thresholds for pitch discrimination and interaural time difference discrimination, while musicians also had lower thresholds for sinusoidal amplitude modulation discrimination and interaural level difference discrimination. Audio engineers performed better than controls in auditory scene recall, which requires participants to determine whether a target sound matches one of three sounds presented earlier in terms of pitch and amplitude modulation frequency. On the other hand, musicians outperformed controls in sustained auditory selective attention, during which participants identified repetitions of three-tone sequences in an auditory stream while ignoring a competing stream one octave apart. Musicians also had significantly lower thresholds for speech-in-babble-noise perception than both naive participants and audio engineers (a result that did not extend to the new sample of musicians and non-musicians in Experiment 2). Both auditory expert cohorts showed higher levels of openness to experience and audio engineers had higher levels of emotional stability compared to musicians. Audio engineers had a wider variety of musical backgrounds, although controlling for this did not affect previous conclusions about group differences. The number of years of audio engineering experience was moderately associated with better sensitivity in the stochastic auditory scene task and lower speech-in-babble-noise thresholds, but overall scores for both tasks fell within the normal range. Finally, psychophysical scores were the most associated with sustained auditory selective attention, and speech in babble noise was associated with stochastic auditory scene, particularly for audio engineers.
In sum, we gathered evidence supporting the hypothesis that audio engineers' auditory expertise, similarly to musical training, corresponds to generalised advantages in fine auditory discrimination and auditory scene analysis. The apparent musician-selective advantage for speech in babble (a topic that has been of considerable debate in the literature) was more ephemeral, with the initial result from Experiment 1 not extending to the online study of Experiment 2.

Fine perception
Musicians and audio engineers showed superior fine auditory perception, with pitch discrimination having the largest effect size and clear-cut separation between experts and controls. Thresholds for the control group generally followed a wider distribution, as reflected by a higher median absolute deviation across all psychophysical tasks, and the top performers always matched the performance of the expert groups. These results reflect one challenging aspect of designing a control group for expert populations in a cross-sectional study, as pseudo-randomly sampling from the general population will unavoidably correspond to a wider spectrum of responses and include highly skilled individuals, despite controlling for musical training (Law & Zentner, 2012). Overall, we could not detect a clear advantage of musicians over audio engineers or vice versa in fine auditory perception.

Auditory scene analysis and speech in babble noise
While musicians significantly outperformed controls in sustained auditory selective attention (SASA), audio engineers performed better than controls in auditory scene recall (ASR).⁵ In addition to the differences in cognitive loads for each task (i.e., SASA relies more heavily on sustained selective attention and ASR on working memory), SASA stimuli are comparatively more "musical," in that the two competing auditory streams are constructed from the first three tones of a major scale separated by an octave, which might resemble competing melodies. ASR stimuli, on the other hand, are simple pure tones defined by a pitch and an amplitude modulation frequency but have no tonal relation with each other. Audio engineers' selective attention ability in this case might benefit from a more technical understanding of sound components and a more generalised experience working with any type of sound, musical and non-musical. Furthermore, this task required participants to analyse and maintain the whole auditory scene (i.e., three sounds) in memory, as no distinction between target and foil can be made until the target sound is heard. This type of mental sound manipulation and asynchronous pre-post comparison is common during mixing practices (see paragraph 1.2.1) and could in part account for the audio engineers' advantage.
In Experiment 1, musicians showed significantly lower SNR thresholds for speech-in-babble-noise perception compared to both controls and audio engineers, even accounting for differences in musical training. However, musical training and general musical sophistication (as measured by the Gold-MSI questionnaire) showed no association with speech-in-babble-noise thresholds within each group, implying that the musician effect might be due to characteristics intrinsic to the group not detected by our test battery, a possibility we explored in Experiment 2. Indeed, the only behavioural measure in our data that showed a fairly consistent positive correlation across groups with speech-in-babble-noise perception was sensitivity to statistical changes in a stochastic auditory scene. This could be explained by a better ability to detect changes in higher-order statistics of a sound sequence (Barascud, Pearce, Griffiths, Friston, & Chait, 2016; Skerritt-Davis & Elhilali, 2018), spectral entropy (Overath et al., 2007; Stilp & Kluender, 2010) or, more generally, informational content in a noisy signal, which is a strategy implemented for instance in speech-in-noise recognition algorithms (e.g. Misra, Ikbal, Bourlard, & Hermansky, 2004; Toh, Togneri, & Nordholm, 2005). However, this is speculative and a dedicated experiment is needed to test this specific hypothesis. In this vein, Oberfeld and Klöckner-Nowotny (2016) found that individual differences in selective attention measured in both auditory and visual modes could explain variations in speech-in-noise perception abilities in their sample. However, their stimuli consisted of two individual competing speakers presented binaurally and one central target speaker, which might more explicitly depend on the ability to pay selective attention to one of multiple intelligible elements. Similarly, Tierney et al. (2020) found a correlation between non-verbal sustained selective attention and the perception of speech masked by one distractor talker. On the other hand, De Kerangal, Vickers, and Chait (2021) found an association between musical training and sustained attention, but not between sustained attention and speech-in-noise perception.
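As a rough illustration of what "spectral entropy" measures in the discussion above, the following minimal sketch treats the normalised power spectrum of a signal as a probability distribution and computes its Shannon entropy. It is not the procedure used in this study or in the cited speech recognition work; the function names, the naive DFT, and the synthetic tone and noise signals are all assumptions chosen for the example.

```python
import cmath
import math
import random


def power_spectrum(signal):
    """Power at each non-negative frequency bin, via a naive DFT."""
    n = len(signal)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal)
        )
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def spectral_entropy(signal):
    """Shannon entropy of the normalised power spectrum, scaled to [0, 1]."""
    power = power_spectrum(signal)
    total = sum(power)
    if total == 0:
        return 0.0
    probs = [p / total for p in power]
    entropy_bits = -sum(p * math.log2(p) for p in probs if p > 0)
    return entropy_bits / math.log2(len(power))


# Hypothetical test signals (not stimuli from the study).
rng = random.Random(0)
n = 256
tone = [math.sin(2 * math.pi * 16 * t / n) for t in range(n)]  # energy in one bin
noise = [rng.gauss(0.0, 1.0) for _ in range(n)]                # energy spread out
noisy_tone = [s + 0.5 * v for s, v in zip(tone, noise)]

h_tone = spectral_entropy(tone)
h_noisy_tone = spectral_entropy(noisy_tone)
h_noise = spectral_entropy(noise)
print(h_tone, h_noisy_tone, h_noise)
```

A signal whose energy is concentrated in a few frequency bins yields low entropy, whereas broadband noise yields high entropy, so the measure gives one simple index of how much structured content stands out from a noisy background.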
Furthermore, musical training and general musical sophistication (as measured by the Gold-MSI questionnaire) showed no association with speech-in-babble-noise thresholds within each group, implying that the musician effect might be due to characteristics intrinsic to the group not detected or controlled by our test battery. As noted above, one of these characteristics is the different regional accent of audio engineers (Belfast, UK) and musicians (London, UK). It has been shown that listening to speech in an unfamiliar accent can negatively affect the perception of speech in noisy environments even after long-term familiarization, possibly due to an increase in processing cost (e.g. Smith, Holmes-Elliott, Pettinato, & Knight, 2014). For this reason, we hypothesized that audio engineers might have been at a disadvantage, given that target sentences were spoken in a Southern British English accent, and conducted a second experiment to test this hypothesis.

Speech in babble noise, musical expertise, and native accent
Data from our second online speech-in-noise experiment suggest that neither native accent nor musical training affects the perception of speech in babble noise, which appears to contradict our previous findings. On the one hand, it is possible that results from the first study were simply due to random sampling. On the other hand, the studies differ in several qualities which might have affected participants' SNR thresholds. Specifically, in the first study, responses were directly communicated to the experimenter in an acoustic environment that was controlled and kept constant across participants (i.e., quiet room, headphones, sound pressure levels), which might have led to more precise measurements. Additionally, there was a much greater variability in musical training in the online experiment due to the limitations and self-report nature of the screening procedure, indicating that the "musician" groups in the two experiments might not come from the same expert population. Nonetheless, when examining musical expertise directly, we did not observe any significant correlations between SNR thresholds and either years of regular instrument practice or years of formal musical training.
In summary, our observations appear to be in line with the current literature on this topic, namely the difficulty in capturing a consistent advantage for speech-in-noise perception in musicians. If the musician advantage hypothesis is in fact true, several variables that are assumed to be constant across studies might affect measurements to a greater extent than musical training itself, making studies not directly comparable.

Personality and musical sophistication
Not surprisingly, students of audio engineering had higher levels of musical sophistication compared to the general population, with about half of the participants reporting a similar degree of formal musical training as musicians. According to random forest variable importance, Gold-MSI Singing Abilities is the measure that can best discriminate between musicians and audio engineers in our dataset. Items that contribute to this sub-dimension include questions on singing itself (e.g. "I am able to hit the right notes when I sing along with a recording"), but also melodic memory (e.g. "I only need to hear a new tune once and I can sing it back hours later" or "I can sing or play music from memory") and performance anxiety (e.g. "I don't like singing in public because I'm afraid that I would sing wrong notes"). In terms of personality, both musicians and audio engineers scored higher in openness to experience, which is associated with creative abilities (McCrae, 1987) and musical sophistication (Greenberg, Müllensiefen, Lamb, & Rentfrow, 2015) and has been shown to predict auditory and musical abilities by predicting engagement with music and musical training (Corrigall, Schellenberg, & Misura, 2013). Neuroticism, which is the reverse of emotional stability, was significantly higher in musicians than audio engineers. The association between musicianship and neuroticism has been observed before (Gillespie & Myors, 2000), although the connection between the two is not yet fully understood (for a review, see Miranda, 2020). These findings imply that there can be several covariates specific to the musician population that are not normally controlled during the recruitment process or considered in the interpretation of musicians' data. For instance, differences in musical sophistication profiles and personality could be interpreted as an effect of self-selection of creative individuals (i.e., high openness) who chose a stage-oriented career as music instrumentalists as opposed to a more studio-oriented or "behind-the-scenes" profession such as audio engineering (i.e., emotional stability and singing abilities).

Limitations and future directions
One limitation of this study was the inclusion of students of audio engineering who might still be relatively inexperienced, as they reported having between 1 and 6 years of experience with recording, mixing, and mastering, while musicians had from 4 to over 10 years of regular practice of a musical instrument. For instance, speech-in-babble-noise thresholds showed an association with the years of audio engineering experience, although the performance of even the more experienced audio engineers in our sample was entirely within the range achieved by controls. Data from more experienced professionals could clarify whether audio engineering training can be associated with speech-in-noise perception abilities beyond the levels of the general population. Additionally, the inclusion of only one diotic speech-in-babble-noise test somewhat limits the generality of the conclusions that can be reached with our data. The inclusion of a wider range of tests in future experiments will allow us to determine whether cohort differences should be interpreted at a construct level rather than at a single task level (Green et al., 2014) and to pinpoint which auditory abilities might benefit specific aspects of speech-in-noise perception. Finally, cross-sectional experiments like the one presented in this paper do not allow conclusions to be drawn about the causality of an observed group effect. Despite this, the legitimacy of causal inference in this category of music training studies has often been erroneously assumed (Schellenberg, 2019), underestimating the complexity of the interaction between individual differences and environment (Schellenberg, 2015). For instance, the association between musical training and IQ could be explained by genetic pleiotropy (Mosing, Madison, Pedersen, & Ullén, 2016), and the undertaking and duration of music practice itself can be predicted by general cognitive ability, personality, socioeconomic status (e.g. Corrigall et al., 2013; Schellenberg, 2011; Swaminathan & Schellenberg, 2018), and genetics (Mosing, Madison, Pedersen, Kuja-Halkola, & Ullén, 2014). Genetic variability accounts for individual differences across several musical skills (Gingras, Honing, Peretz, Trainor, & Fisher, 2015), and even in the absence of actual musical training, auditory and musical abilities are associated respectively with enhanced neural encoding of speech (Mankel & Bidelman, 2018) and emotion recognition (Correia et al., 2020). In the current study, by contrasting musicians with another population of auditory experts, we were able to draw a more nuanced and specific picture of musicians' profile in terms of auditory ability, personality, and musical sophistication. More generally, despite not being able to directly test causality, we showed that the inclusion of additional control groups and covariates in cross-sectional studies on musical expertise can help clarify the implicit assumptions about the musically trained population, challenge the specificity of the observed perceptual or cognitive advantages, and form new hypotheses about the potential source of such advantages beyond musical training itself.

Declaration of Competing Interest
None.
Education, Music and Psychology Research (Arnold Bentley New Initiatives Fund). We would also like to thank Christina Makri who helped with data collection.

Fig. 1. Schematic representation of auditory scene analysis stimuli. A. Sustained auditory selective attention. Three-tone repetition in the high band marked by black rectangle. B. Auditory scene recall. Three tones with different frequencies and square-wave amplitude modulation rates followed by a target tone with a new combination of frequency and modulation rate. C. Stochastic auditory scene. Example of a "full to middle" (F-M) transition. Vertical dotted line represents the change in frequency sampling pool for the random tones.

Fig. 2. Musical training background. Left plot represents Musical Training dimension scores from the Gold-MSI questionnaire. Data points above the dashed line correspond to musicians and audio engineers with a matching degree of musical training, defined by a Gold-MSI score higher than or equal to 37, which captures all but one musician. Right plot shows the musical training background of the three cohorts, as well as musical training clusters, in terms of years of formal training and regular practice of a musical instrument. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

F. Caprini et al.

Fig. 4. Dot plots, same-area violin plots, and box plots for all auditory scene tasks by group. Speech-in-babble thresholds are reported with opposite signs in order for a positive effect size to consistently correspond to a better performance across tasks. Brackets above graphs display log-odds-type effect size and one-tailed p values when p < 0.05. Values in blue brackets correspond to post-hoc two-tailed tests and are not corrected for multiple comparisons. Note that for the Auditory Scene Recall task, audio engineers' d' is significantly higher than that of controls overall, despite the two outlier control participants showing high d' values. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 6. Correlograms of behavioural tasks for all groups and pooled correlations obtained by median-centring ranks by group. Top triangles: Spearman's ρ. Positive correlations correspond to red-colored cells, negative correlations to blue-colored cells, while colour saturation reflects correlation magnitude. Correlations whose 90% empirical confidence interval does not include the null are marked with *. Bottom triangles: 90% empirical confidence intervals. Dashed horizontal lines represent ρ = 0. Thicker black margins identify psychophysical tasks and auditory scene tasks. Acronyms: PD = pitch discrimination; DD = duration discrimination; ID = intensity discrimination; SAMD = sinusoidal amplitude modulation discrimination; ILD = interaural level difference; ITD = interaural time difference; SASA = sustained auditory selective attention; ASR = auditory scene recall; StAS = stochastic auditory scene; SIN = speech in babble noise. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
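The pooled correlations described in the Fig. 6 legend can be illustrated with a minimal sketch. The legend does not spell out the exact procedure, so several details here are assumptions: scores are ranked across all participants, each group's median rank is subtracted, Pearson's r on the centred ranks serves as the pooled Spearman-type coefficient, and the "empirical" 90% interval is approximated with a percentile bootstrap. All function names and the toy data are hypothetical.

```python
import random
import statistics


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for pos in range(start, end + 1):
            out[order[pos]] = (start + end) / 2 + 1
        start = end + 1
    return out


def pearson(x, y):
    """Pearson correlation; returns None when either variable is constant."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / (sxx * syy) ** 0.5


def centred_ranks(values, groups):
    """Assumed procedure: rank everyone, then subtract each group's median rank."""
    r = average_ranks(values)
    medians = {
        g: statistics.median([rv for rv, gg in zip(r, groups) if gg == g])
        for g in set(groups)
    }
    return [rv - medians[g] for rv, g in zip(r, groups)]


def pooled_rho(x, y, groups):
    return pearson(centred_ranks(x, groups), centred_ranks(y, groups))


def bootstrap_ci(x, y, groups, n_boot=500, level=0.90, seed=0):
    """Percentile bootstrap over participants (an assumed stand-in for the paper's CI)."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        rho = pooled_rho([x[i] for i in idx], [y[i] for i in idx], [groups[i] for i in idx])
        if rho is not None:  # skip degenerate resamples
            estimates.append(rho)
    estimates.sort()
    tail = (1 - level) / 2
    lo = estimates[int(tail * len(estimates))]
    hi = estimates[min(len(estimates) - 1, int((1 - tail) * len(estimates)))]
    return lo, hi


# Toy data: group offsets run in opposite directions for x and y,
# while the within-group relation between x and y is positive.
groups = ["controls"] * 6 + ["engineers"] * 6 + ["musicians"] * 6
x = [v + off for off in (0, 10, 20) for v in [0, 1, 2, 3, 4, 5]]
y = [v + off for off in (20, 10, 0) for v in [1, 0, 2, 4, 3, 5]]

raw_rho = pearson(average_ranks(x), average_ranks(y))
pooled = pooled_rho(x, y, groups)
ci_low, ci_high = bootstrap_ci(x, y, groups)
print(raw_rho, pooled, (ci_low, ci_high))
```

In this toy example the uncentred rank correlation is negative because it is dominated by group-level offsets, whereas the pooled coefficient recovers the positive within-group association. This is the motivation for centring by group before pooling across cohorts.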

Table 1
Demographics, years of formal training (i.e., musical instrument lessons), years of regular practice of a musical instrument, and years of audio engineering experience.

Table 2
Details of the six psychophysical tasks.