Cognitive performance in open-plan office acoustic simulations: Effects of room acoustics and semantics but not spatial separation of sound sources

The irrelevant sound effect (ISE) refers to the impairment of short-term memory performance during irrelevant sound relative to quiet. The irrelevant sound presentation in most laboratory-based ISE studies has been too limited to represent complex scenarios such as open-plan offices (OPOs), and few studies have considered serial recall of heard information. This paper investigates the ISE using an auditory-verbal serial recall task, wherein performance was evaluated for factors relevant to simulating OPO acoustics: the irrelevant sounds including the semanticity of speech, reproduction methods over headphones, and room acoustics. Results (Experiments 1 and 2) show that the ISE was exhibited in most conditions with anechoic (irrelevant) nonspeech sounds with or without speech, but the effect was substantially larger with meaningful speech than with foreign speech, suggesting a semantic effect. Performance differences between diotic and binaural reproductions were not statistically robust, suggesting a limited role of spatial separation of sources. In Experiment 3, a statistically robust ISE was exhibited for binaural room acoustic conditions with mid-frequency reverberation times of T30 (s) = 0.4, 0.8, and 1.1, suggesting cognitive impairment regardless of sound absorption representative of OPOs. Performance differences between the T30 = 0.4 s condition and the T30 = 0.8 and 1.1 s conditions were statistically robust. This emphasizes the benefits of increased sound absorption for cognitive performance, reinforcing extant room acoustic design recommendations. The performance difference between T30 = 0.8 s and 1.1 s was not statistically robust. Collectively, these results suggest that certain findings from ISE studies with idiosyncratic acoustics may not translate well to complex OPO acoustic environments.


INTRODUCTION
Open-plan offices (OPOs) are notorious for the detrimental effect of their acoustic environments on occupants' activities [1][2][3]. Many such activities including proofreading, writing, etc. are believed to involve a critical role of short-term processing of verbal information in serial order [4]. Verbal short-term memory (STM) performance has been shown to be vulnerable to certain irrelevant sound sequences [5]. This phenomenon is referred to as the irrelevant sound effect (ISE) and has a rich history of development [6] since its first report [7].
ISE studies feature prominently in laboratory research with a focus on OPOs besides many OPO studies citing ISE research in general (summary in [8]). In fact, a model based on verbal STM serial recall performance [8] is integral to the room acoustic standard for OPOs [9].
However, a basic aspect shared by most ISE studies, with or without an OPO focus, remains largely unexplored: the acoustic presentation of the irrelevant sounds, and the corresponding ecological validity, 'the degree to which research findings reflect real-life hearing-related function, activity, or participation' [10], of the ISE findings. The acoustic content in most ISE research has largely been idiosyncratic (e.g., spoken words, letters, etc.; details in section I.A.2), with predominantly headphone-based diotic/monaural (same signal to both ears) sound reproduction (details in section I.A.3), and an ambiguous role of room acoustics (details in section I.A.4).
Findings from such ISE studies represent low ecological validity for complex acoustic environments such as OPOs. Even for ISE studies using more representative OPO simulations [8,[11][12][13], broader engagement with various germane aspects of acoustic presentation (e.g., irrelevant sounds, reproduction methods, room acoustic effects, etc.) has generally not been extensive. Moreover, the task used in most ISE studies involves serial recall of visually presented sequences (e.g., digits, words). Many situations or tasks in OPOs can involve processing aural information (e.g., from computers, headphones, telephones, neighbors) in the presence of irrelevant sounds. Hence, serial recall of aurally presented verbal information represents a valid consideration. Although studied less in comparison, serial recall of aurally presented information (e.g., [14][15][16]) has been shown to exhibit a similar effect as serial recall of visual information [15,16], which is a reasonable starting point for further ISE investigations in OPO simulations.
Hence, this study aims to investigate ISE using serial recall of aurally presented digits in laboratory settings wherein the acoustic presentation includes a closer representation of the mundane OPO speech and nonspeech acoustic environment, compared to extant studies. Herein, the acoustic presentation involves several considerations regarding the composition of the irrelevant sounds and the headphone-based reproduction method in laboratory settings.

A. Considerations for irrelevant sound presentation in ISE studies
To set the stage, OPOs typically comprise several spatially distributed sources of intermittent (e.g., workstation sounds, doors) and continuous nonspeech sounds (e.g., heating, ventilation, and air conditioning (HVAC) noise), along with multi-talker speech from neighboring and faraway workstations [3]. Further, these sounds interact with the room acoustics in a complex manner [17]. To characterize such complex acoustic conditions in the laboratory, an appropriate level of sophistication in the simulation is arguably needed. This is addressed in the following, starting with a more elaborate description of the ISE.

1. Duplex-mechanism account of the ISE for serial recall
For task-irrelevant background sound sequences to notably impair verbal serial recall performance, at least one of two conditions must be met [5]. The first refers to the changing-state characteristics of the sound sequence(s). Herein, to elicit the ISE, a certain degree of change in spectro-temporal state is required to impede verbal STM performance compared to steady-state sound sequences and/or quiet conditions [18]. This is attributed to the interference-by-process principle. In the current case, this principle posits that the order information for successively varying auditory-perceptive tokens in changing-state sound stream(s) is processed obligatorily and pre-attentively, which interferes with the processing of order required in the serial recall task [5]. The second condition refers to unexpected deviations in auditory-perceptive stream(s), known as the attentional capture mechanism [5,6]. Herein, the deviations exhibit attention-grabbing potential (e.g., one's name being called, a brief tonal deviation) away from the focal activity, which may or may not require processing of serial order. The interference-by-process and attentional capture mechanisms constitute the duplex-mechanism account of the ISE for serial recall [5], which postulates functional differences between disruption due to these two mechanisms.
See [19], however, for the competing unitary account.

2. Content of irrelevant sounds in ISE studies and the role of semanticity
From an OPO perspective, speech and (sufficiently) changing-state nonspeech sounds that can be segmented into coherent auditory-perceptive stream(s) have been shown to exhibit ISE ( [20] for office sounds; general review in [5]). This is relevant for OPO simulations, where nonspeech sounds and their typical regularity of occurrence can be deduced from audio recordings or literature [3]. The role of semanticity of speech, however, needs additional considerations. Within OPO occupants' surveys, which perhaps represent maximal ecological validity, intelligible speech, especially from nearby workstations [21], consistently ranks as the most disturbing component among all OPO sounds [1,2,21]. This is reflected in current OPO acoustic standards. These prioritize reducing speech transmission index (STI) across workstations as a key strategy for improving room acoustics [9] and overall acoustic comfort [22]. However, as per the duplex-mechanism account (section I.A.1), while meaningful speech can capture attention away from the focal task (e.g., due to emotional or personal relevance to speech, taboo words) [23][24][25][26], the role of meaningful speech is considered irrelevant for interference-by-process. This is because the serial recall task is presumed to not require extensive semantic processing ( [27]; review in [6]). Yet, it is worth noting that speech has consistently been reported as the most potent distractor during serial recall [28][29][30]. Further, speech in participants' own language has generally been shown to be more disruptive in magnitude than speech in a foreign language, although not always significantly so [29].
Additionally, emerging evidence points towards automatic semantic processing of irrelevant sounds during serial recall. Such automatic processing can disrupt task performance when contextual expectations are not met. This includes categorical deviation (e.g., number in a word sequence) [31][32][33], or semantic mismatch (e.g., unexpected ending to a sentence) [19,25,34].
Unlike attentional capture, the categorical deviation and semantic mismatch effects (both arguably semantic effects) have been shown to be immune to top-down control. For instance, these effects have been observed to be immune to habituation over the course of an experiment [25,26,33,34]. Further, these effects have been shown to occur for deviants that are not relevant to the participants [25,26,31]. Categorical deviation has additionally been shown to be immune to foreknowledge about the deviant [31,32] and to be unrelated to working memory capacity [32]. These effects are hence proposed to be functionally separate from current formulations of attentional capture in serial recall [26,31,33]. Besides, physiological evidence from a pupillometry-based study of auditory attention also suggests that, while performing a cognitive task, meaningful irrelevant speech in one's native language consumes more attentional resources than meaningless speech in a foreign language [35]. This indicates that although task-irrelevant, native speech may nevertheless be processed semantically, and perhaps be more attention-grabbing than foreign speech. Determining the functional basis (if any) of, and potential differences between, automatic semantic processing and the attentional capture mechanism during serial recall is beyond the scope of the current study. However, the higher magnitude of the ISE for native vs. foreign speech, and the emergent categorical deviation and semantic mismatch effects, at least provide some basis for exploring semantic effects in simulated OPO environments.
Moreover, in most ISE studies (primarily with cognitive psychology roots), the irrelevant nonspeech content has traditionally included tones, music, etc., and/or spoken content including words, letters, repetitive sentences, etc. The spoken content is typically acoustically dry and clear speech, often spoken by a single talker, rather than in a natural or conversational speech style by multiple talkers (see [5,6,28,36] for summaries). Such relatively simple stimuli may be necessary/justified for certain investigations, e.g., for basic cognitive psychology, and indeed be ecologically valid for such purposes. However, they are restricted representations of everyday scenarios such as OPOs, which may involve aspects that are not possible to study using simplistic stimuli (e.g., fluctuating spatial location of speech, staggered storylines as in halfalogues over telephones, etc.). To address the role of more conversational speech to an extent, some ISE research with an OPO context has included studies with more complex stimuli.
This includes using a recorded mix of speech and several nonspeech sounds common in offices ([20] included sounds from doors, computers, typing, telephone rings, etc.), using in-situ OPO sound recordings with speech and nonspeech sounds (e.g., [28,37,38]), and recordings of multi-talker speech with repetitive (e.g., [20,39,40]) or non-repetitive content [11,12] in mock office set-ups. However, no study (to the best of the authors' knowledge) has considered both naturalistic and non-repetitive multi-source speech (in languages native and foreign to the participants) and nonspeech sounds representative of the OPO context in a controlled setup (i.e., not uncontrolled in-situ recordings) using an auditory-verbal serial recall task.

3. Acoustic reproduction format
The reproduction of acoustic content involves additional spatial, binaural, and room acoustic considerations. Diotic/monaural (i.e., same signal to both ears) headphone-based reproduction implies a spatially fused, 'in the head'/internalized perception of the acoustic content for the participant, which oversimplifies binaural hearing in complex acoustic environments such as OPOs. Nevertheless, diotic headphone-based reproduction of irrelevant background sounds is predominant in previous ISE studies. Exceptions include some studies using binaural reproduction, i.e., adequately considering and reproducing interaural differences (e.g., [11,13,17,39,[41][42][43]).
Using binaural reproduction, externalization of sound sources can be achieved. This leads to more complex variations in the intelligibility/audibility of certain sources compared to diotic reproduction of the same stimuli (summary in [44]). Such changes to intelligibility/audibility in binaural reproduction can include contributions due to spatial release from energetic masking including binaural unmasking and better-ear listening [45]. Additionally, contribution from informational masking is also likely [46], at least for nearby talkers [47,48]. This is relevant if the unmasked sounds are salient enough to draw attention, which is of interest from an attentional capture perspective, and may have additional semantic relevance for naturalistic speech. In that regard, spatially separated multi-talker speech from loudspeakers in acoustically dry conditions has been shown to restore the capacity of irrelevant speech to reduce cognitive performance, compared to same multi-talker speech from spatially fixed condition [39].
However, the stimuli in [39] included repetitive speech, which underrepresents the complexity of speech in OPO environments. A recent study compared visual-verbal serial recall performance in a baseline condition with steady-state ventilation noise against diotic and binaural reproductions of irrelevant classroom sounds (nonspeech sounds mixed with foreign-language multi-talker speech) [43]. Serial recall performance in the classroom sound conditions differed significantly from the baseline condition, i.e., the ISE was exhibited. Further, performance did not differ significantly between the diotic and binaural conditions for either adult or child participants [43]. However, a comparison of traditional diotic reproduction and acoustically more accurate binaural reproduction of the same irrelevant acoustic content, using an auditory-verbal serial recall task and a perceptually salient mix of speech and nonspeech sounds representative of OPOs, has not been conducted yet. Such a comparison is a key consideration in the current paper.

4. Room acoustics
Beyond (anechoic) spatial separation in the reproduction format, OPO room acoustics can further complicate characterizing the impact of task-irrelevant background sounds on STM cognitive performance [49]. Serial recall performance during irrelevant speech with a reverberation time (T, in seconds) of 5 s, unrealistic at least for OPOs, was shown to be similar to performance in quiet [50]. Using a mix of OPO nonspeech sounds (telephone rings, door slams, ventilation, etc.) and conversations recorded anechoically in virtual settings, another study showed no significant difference between serial recall performance at T = 0.7 vs. 0.9 s (several details, including source placements, are unavailable) [51]. However, performance in both reverberant conditions was significantly worse than in quiet (i.e., the ISE was exhibited). Another study tested the effect of a 3- or 15-voice mix (repetitive speech) originating 10 m from a simulated listener position for T = 0.4 s and 1.0 s, plus an anechoic condition [52]. Serial recall performance was significantly reduced relative to the quiet condition in all conditions with increased reverberation time and/or number of voices (including the anechoic condition), except for the condition with 15 voices and T = 1.0 s [52]. In the latter, performance was not significantly different from the quiet condition, which the authors attributed to the changing-state characteristics being sufficiently reduced by multiple voices and high reverberation [52]. Moreover, the authors suggested long reverberation times as a possible solution for reducing speech-based distraction in multi-talker OPOs [52]. However, neither of the studies with more plausible OPO reverberation times [51,52] considered realistic spatial sound arrangements and background sounds. Most speech-based disruption in OPOs tends to come from intelligible speech from spatially separated nearby workstations [1,21].
Hence, distant 'mixed/fused' voices (e.g., [52]) likely underestimate the changing-state characteristics of actual OPO speech. Besides, the current room acoustic perspective for OPOs [9] advocates reducing speech intelligibility between workstations through the combined use of sound masking and sound absorption. The latter lowers T values, which goes against the recommendation of high reverberation times in [52]. Yet, there is also a trend of eschewing sound absorption in many recent OPOs, including those with activity-based working (ABW) [3]. From this perspective, investigating the effect of higher T values using realistic simulations of representative OPO room acoustics is informative, and is considered in this paper.
Moreover, while not studied systematically in the current paper, early reflections (~ 50-100 ms after direct sound in a room impulse response) can assist with speech intelligibility [53,54], at least for nearby talkers. Hence, when implemented in room acoustic simulations (see section III), early reflections may potentially degrade cognitive performance and/or increase attentional focus when the reverberant decay is not too long [12]. This may need to be considered alongside the smearing effect due to reverberant energy. Overall, the role of realistic OPO room acoustics on ISE using representative background sounds is still largely unexplored.

B. Motivations for the current paper
As expounded above, most ISE studies typically include a limited representation of complex acoustic conditions such as OPOs [55]. This paper investigates the role of several variables that are relevant for laboratory-based simulations of OPOs in ISE studies. Based on the literature review above, these variables include: semanticity of speech (native vs. foreign language to the participants, mixed with nonspeech sounds; section I.A.2), spatial presentation of stimuli (diotic vs. binaural reproduction; section I.A.3), and room acoustics (section I.A.4). Overall, the expectations are that while the ISE in serial recall may be exhibited in all conditions with changing-state sounds, a more realistic representation of the OPO sound environment regarding spatial presentation of stimuli and room acoustics may contribute to an overall decline in cognitive performance. This decline may be more pronounced in conditions with meaningful speech than with speech foreign to the participants. This is expected because the speech content here is more salient, and thus potentially more attention-grabbing than the more rudimentary speech in traditional ISE studies, besides other semantic effects.

II. EXPERIMENTS 1 AND 2
These experiments test the ISE in terms of two variables noted above for irrelevant sound stimuli that include nonspeech sounds and speech: the semanticity of speech, and the spatial presentation of stimuli. The main difference between the experiments was that the irrelevant speech was either in semantically meaningful native German (Experiment 1) or in Hindi, semantically meaningless to the participants (Experiment 2). Hence, the effect of semanticity was tested in a more ecologically valid manner than in studies using manipulations such as spectral transformations [56,57] or vocoding [58] of speech. Such manipulations, while preserving temporal fluctuations, generally do so at the cost of sounding artificial. Each experiment had five acoustic conditions, and irrelevant sounds were presented as continuous audio over headphones (Table 1). These conditions included Quiet as the baseline condition for cognitive performance. The other conditions included diotic and binaural reproductions of task-irrelevant nonspeech sounds only, and of nonspeech sounds mixed with speech, respectively. All conditions (including Quiet) included low sound pressure level (SPL) noise representing HVAC noise in offices (see Table 1 and section II.A.1). The diotic conditions represent most traditional ISE studies, where the same audio signal is presented in both headphone channels with no interaural differences and/or spatial cues. The binaural conditions in these experiments included spatialized sounds under anechoic conditions.

Irrelevant OPO background sounds for both binaural and diotic conditions
The goal here was to include sounds that are common in OPOs, selected based on actual office recordings (reported elsewhere [3]). First, pink noise shaped with a -5 dB/octave decay was used, which is representative of the HVAC noise spectrum in OPOs [3,59]. This noise was presented at an LA,eq,1min (energy-equivalent A-weighted SPL over a 1-minute period) of 41.5 dB (Table 1). This represents relatively quiet/modern HVAC systems in OPOs [60], where the SPL is lower than what would be required for substantial sound masking [61]. The signal was decorrelated between the two headphone channels for a more 'diffuse' noise perception. This provided seamless steady-state low-SPL noise throughout the experiment, including during the Quiet condition (Table 1). The latter provides a better representation of 'quiet' conditions (i.e., without other non-HVAC sounds) in OPOs.
Second, the nonspeech (NS) sounds comprised a diverse set of anechoic recordings of typical activities in OPOs. The overall choice of nonspeech sounds, and how frequently each group of sounds occurred, was based on listening to actual recordings of several medium-sized OPOs. As such, some sounds were more frequent than others, so as to depict activities in a typical OPO.
Besides, it was ensured that each type of nonspeech sound was presented at irregular intervals to avoid repetitive patterns. The nonspeech sounds included those originating from relatively regular use of furniture, keyboard and mouse and other stationery items at workstations, and printer operation; and relatively irregular phone rings, footsteps, door opening and closing, and elevator bell (ping) [3]. The latter three sounds were especially irregular. In the diotic conditions, the sounds were presented without any spatial separation between source locations. In the binaural conditions, the sounds were spatially spread according to their source location in the simulated room. This spatial spread, combined with the randomized order of nonspeech sounds, resulted in spatial randomization (see section II.A.2).
Lastly, multi-talker speech was used. Herein, the goal was to present non-repeating multi-talker speech with rich content, where each talker represents one side of a telephone conversation (i.e., a halfalogue) at a workstation, which is a common scenario in OPOs [21,56].
Studies have shown that such halfalogues are not only perceived as more annoying and distracting than hearing both sides of a dialogue [62,63], but also impede performance in cognitive tasks involving semantic processing [56,64,65]. The speech was recorded close to the talkers' lips, similar to previous research [12].
Further signal processing was done in MATLAB with the overall goal of creating naturalistic halfalogues. Voiced segments per talker (2 to 50 s long) were separated. Per experimental condition, contiguous segments from four different voices (two of them female) were arranged into 4-channel files in which two talkers were simultaneously active at any time (see Fig. 2 in [12] for a visual representation). Besides the natural pauses while speaking, a variable amount of silence (randomized between 0 and 4 s) was introduced between the segments. The long-term spectra of the talkers were matched to the 'normal' vocal effort spectra in [9], as in previous studies [11,12]. Finally, the order of active talkers was algorithmically randomized. None of the conditions had any repetition in speech content, and it was ensured that the conversation topics of the halfalogues differed per condition.
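The segment arrangement above can be sketched as follows. This is a simplified Python illustration (the study used MATLAB); `build_halfalogue` and `two_active_talkers` are hypothetical helper names, not the authors' code, and the block-wise talker scheduler is a toy stand-in for the actual randomization.

```python
import numpy as np

def build_halfalogue(segments, fs, max_pause_s=4.0, seed=None):
    """Concatenate one talker's voiced segments into a halfalogue track,
    inserting a random silence of 0..max_pause_s seconds between segments
    (on top of the natural pauses already inside each segment)."""
    rng = np.random.default_rng(seed)
    parts = []
    for i, seg in enumerate(segments):
        parts.append(np.asarray(seg, dtype=float))
        if i < len(segments) - 1:
            pause = int(rng.uniform(0.0, max_pause_s) * fs)
            parts.append(np.zeros(pause))
    return np.concatenate(parts)

def two_active_talkers(tracks, block_s, fs, seed=None):
    """Toy scheduler: per time block, pick 2 of the talker tracks to be
    audible (others muted), so two talkers are active at any time."""
    rng = np.random.default_rng(seed)
    n = min(len(t) for t in tracks)
    block = int(block_s * fs)
    out = np.zeros((len(tracks), n))
    for start in range(0, n, block):
        active = rng.choice(len(tracks), size=2, replace=False)
        stop = min(start + block, n)
        for ch in active:
            out[ch, start:stop] = tracks[ch][start:stop]
    return out
```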
For the binaural conditions, the randomly changing order of two simultaneous talkers meant that the spatial location of active talkers changed in tandem (i.e., randomly). Along with the randomized order of nonspeech sounds, this is an example of spatial changing state for the sound sources (i.e., talkers and nonspeech sounds). The effect of such spatial changing state was, however, not experimentally tested. In comparison, the perceived spatial location of active talkers and nonspeech sounds remained the same in the diotic conditions, although the order of active talkers and nonspeech sounds changed randomly.

FIG. 1. Normalized SPL in the diotic nonspeech (NS) conditions and the nonspeech-and-speech conditions in Experiment 1 (native German (G) speech) and Experiment 2 (foreign speech in Hindi (H)).
The overall presentation level of all conditions except Quiet was an LA,eq,1min of 55 dB (Table 1). Figure 1 shows the spectra for the diotic conditions in Experiments 1 and 2, normalized with respect to the 500 Hz octave band (binaural conditions had similar spectra). Due to the concentration of speech energy, the conditions with speech mixed with nonspeech sounds (i.e., NS_G and NS_H) have substantially higher SPLs in the 250 Hz and 500 Hz octave bands than the corresponding nonspeech-only (NS) conditions. The NS_G and NS_H conditions also show relatively higher SPLs in the 125 Hz and 1000 Hz bands, whereas the nonspeech conditions show somewhat higher SPLs in the 4 kHz and 8 kHz bands. Importantly, the spectral shapes of NS_G and NS_H are similar, i.e., there are limited spectral differences between the content of the two languages (German and Hindi).
In anechoic conditions with spatial separation, the SPL of some sounds decreases relative to others due to distance attenuation. This effectively means that the binaural sound conditions would have a lower overall SPL than the diotic conditions. However, to remove the influence of overall SPL as a confounding factor, SPLs were matched across the diotic and binaural conditions, while maintaining the relative levels of the different sounds per condition (determined by the simulation output). The disadvantage is that the binaural sound conditions do not represent the true acoustics of the modelled source distances (see section II.A.2). However, there is some evidence that the overall presentation SPL does not affect serial recall performance for SPLs typical of OPOs [12]. The room acoustic simulation was based on [68]. There were 16 sound sources in total. Twelve of these were assigned omnidirectional directivities to represent nonspeech sound sources and were placed at a height of 0.7 m each per workstation. Four sound sources were arranged at a (mouth) height of 1.1 m each at the corner workstations along two rows in front of the listener: two talkers at 3 m distance at ±30 degrees, and two talkers at 5 m distance at ±20 degrees. These four sources represented talkers (1 M, 1 F per row) and were assigned the directivity of a singer (deemed sufficient here). The auralization output was 16 binaural room impulse responses (BRIRs; one per source-receiver combination) for each of the experimental conditions except Quiet (Table 1).
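The level matching across diotic and binaural conditions described above (one broadband gain per condition, preserving the relative levels of all sources) can be sketched as follows. Plain RMS stands in here for the A-weighted LA,eq actually used in the experiments, and the function names are illustrative.

```python
import numpy as np

def overall_rms(x):
    """Root-mean-square over all samples of a (channels, samples) array."""
    return np.sqrt(np.mean(np.square(x)))

def match_overall_level(mix, target_rms):
    """Apply one broadband gain to a 2-channel mix so its overall RMS hits
    target_rms, keeping the relative levels of all sources/channels intact.
    (A simplified stand-in for matching A-weighted levels, LA,eq.)"""
    return mix * (target_rms / overall_rms(mix))
```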

Simulated room and sound sources for binaural conditions
The BRIRs were convolved with the respective audio per sound source, which resulted in auralized files for 12 nonspeech and 4 speech sources per experimental condition.
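The convolution and mixdown step can be illustrated as below: a minimal sketch assuming mono source signals and BRIRs stored as 2-row (left/right) arrays, not the actual processing chain used in the study.

```python
import numpy as np

def auralize(sources, brirs):
    """Convolve each mono source signal with its binaural room impulse
    response (BRIR, shape (2, taps)) and sum everything into one
    2-channel (left/right) mix."""
    n_out = max(len(s) + b.shape[1] - 1 for s, b in zip(sources, brirs))
    mix = np.zeros((2, n_out))
    for src, brir in zip(sources, brirs):
        for ch in range(2):                      # left, right
            y = np.convolve(src, brir[ch])
            mix[ch, :len(y)] += y                # superpose all sources
    return mix
```

With 16 sources per condition, `sources` and `brirs` would each hold 16 entries; the linearity of convolution makes the per-source processing and final summation order irrelevant.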

Auditory stimuli for serial recall
The to-be remembered digits were recorded as spoken digits (0-9) in clear speech in German by trained professional speakers using normal intonation and constant vocal effort [69]. The recordings were conducted in an anechoic chamber using a Neumann TLM170 microphone placed at a 1 m distance from the talkers. The voice used in this paper was the female voice B in [69], where more details of the recording are included. Each (anechoic) digit was 0.6 s in duration and presented diotically over headphones at 61 dB LA,Fmax in all conditions (Table 1).
Hence, there was an effective signal-to-noise ratio (SNR) of at least 6 dB (20 dB for Quiet) between the to-be-remembered digits and the to-be-ignored background sounds (Table 1), which ensures sufficient intelligibility of the spoken digits. For comparison, an SNR of 4 dB was sufficient in [16] for eliciting the ISE with auditory presentation of digits. Moreover, pilot experiments were conducted with three participants (neither experiment participants nor the authors) whose data were not used any further. They reported that the digits were clearly audible in all conditions and sounded distinct from the irrelevant sounds, which included conversational (not clear) speech and nonspeech sounds. Further, it must be noted that the rather simple auditory presentation of digits was chosen to establish a baseline; more complex presentations could be explored in future studies, wherein elaborate investigations of masking mechanisms may be necessary.

Calibration
The auralized files with irrelevant sounds were mixed down to a 2-channel audio file per experimental condition (Table 1). The to-be-recalled digits were always presented diotically. The headphone (Sennheiser® HD650; Wedemark, Germany) output per channel (for each sound condition and each digit) was then calibrated using an artificial ear as described in [70]. The artificial ear used was a Brüel & Kjaer (B&K, Naerum, Denmark) Artificial Ear Type 4153 with a B&K Type 4192 omnidirectional microphone capsule (conditioning through a B&K Nexus Type 2690).

Exploratory metrics quantifying changing-state characteristics
While there is no consensus in terms of signal-based changing-state characterization, two metrics are included here for exploratory purposes: Fluctuation Strength (FS; [28,29,71,72]) and the Frequency Domain Correlation Coefficient (FDCC; [58,73,74]). FS (in vacil) is a psychoacoustic parameter that quantifies slower amplitude modulations in a signal (up to 20 Hz).
FS was calculated using [75] as implemented within [76]. FS has been shown to be useful in characterizing the ISE [28,72] in laboratory studies, although not always [29], and to exhibit a negative relationship with increasing number of workstations in actual OPOs [3]. FDCC aims to model changing-state characteristics (hence, the ISE) by analyzing spectral similarities between successive sound segments. FDCC has been used in studies with multi-talker speech and noise mixes [58,77] and was calculated here based on [74]. FDCC values range from 0 to 1, with values approaching 1 indicating lower spectro-temporal changing state, and vice versa for values approaching 0. While there are conceptual differences between these metrics [74], the general idea here is that decreasing FS and increasing FDCC relate to decreasing spectro-temporal changes in the irrelevant sounds.
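To illustrate the FDCC idea (a simplified sketch, not the exact formulation in [74]), one can correlate the magnitude spectra of successive fixed-length segments and average the result:

```python
import numpy as np

def fdcc(signal, fs, seg_s=0.5):
    """Simplified FDCC sketch: mean Pearson correlation between magnitude
    spectra of successive fixed-length segments. Values near 1 indicate
    little spectral change (low changing-state); values near 0 indicate
    strong spectral change between segments."""
    seg = int(seg_s * fs)
    n_segs = len(signal) // seg
    spectra = [np.abs(np.fft.rfft(signal[i * seg:(i + 1) * seg]))
               for i in range(n_segs)]
    corrs = [np.corrcoef(spectra[i], spectra[i + 1])[0, 1]
             for i in range(n_segs - 1)]
    return float(np.mean(corrs))
```

A steady tone yields values near 1, while a signal whose spectrum changes from segment to segment yields much lower values, mirroring the intended interpretation above.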
However, both these parameters are currently limited in terms of reliably describing changing-state characteristics. Both FS and FDCC are monaural parameters and hence do not incorporate binaural effects. The thresholds for exhibiting an effect (i.e., just-noticeable differences (JNDs)) are not known for either of these parameters. Moreover, it is not clear whether these parameters sufficiently address all acoustic and perceptual aspects necessary to quantify changing-state characteristics [28,58].
Given these limitations, the interpretations of these parameters are largely considered tentative. FS and FDCC values for the experimental conditions (Table 1) expectedly show a negative relationship with each other. The baseline Quiet condition has the highest FDCC value (close to 1) and an appropriately low FS value, predicting the least changing-state characteristics and the best performance for this condition.

There were 12 trials per experimental condition, hence 60 trials overall (Table 1). Latin squares were used to balance the sequence of digits across trials. Participants were instructed to remember the digits in the order of presentation for subsequent recall, and to ignore the other sounds. Once all eight digits were presented, there was a 6 s retention period. Following this period, the GUI presented three rows of digits (rows with 3, 2 and 3 digits), with each digit in black font in a square against a white background. The order of the digits across rows was randomized. The participants then recalled the digits by clicking the corresponding digits in the GUI.
Once a digit was selected, it was greyed out and could not be selected again. After each experimental condition, participants were notified via the GUI and were allowed to take a short break. The procedure here and in Experiment 3 was approved by the Medical Ethics Committee at the RWTH Aachen University (EK 055-18).

Data analysis
The same statistical analysis method was used for Experiments 1 and 2, conducted using the software R (version 4.2.2). For each experimental condition per experiment, the first and last trial out of the 12 trials were removed. This was done to minimize any effect due to familiarization and fatigue during the first and last trials, respectively (results were similar with all 12 trials). For the remaining 10 trials per experimental condition, a correct response was registered for every digit recalled in the exact serial order of its presentation (i.e., strict serial recall criterion). The total number of correct responses per condition was summed, and the percentage error per condition (err%) was calculated. The relative difference in the error percentages between conditions (RDerr%) was also calculated based on [6]. Statistical inference was performed using Bayesian modelling. This minimizes the possibility of erroneous effect sizes that may result from fitting small or noisy data sets [78]. Additionally, given prior distributions and the data, Bayesian modelling provides the entire posterior probability distribution (PPD) per effect, rather than the point estimate of the most probable effect obtained in frequentist methods. Bayesian credible intervals (CIs), calculated here using the Highest Density Interval (HDI) of the PPD rather than quantiles, provide a more comprehensive picture of uncertainty. This is because Bayesian CIs indicate the range of values that has the highest probability of containing the true value of the effect [79]. In contrast, confidence intervals used in frequentist statistics only indicate the range of values that has a certain probability of containing the true parameter value, and do not provide information about the probability of the parameter being within that range. Finally, Bayesian methods also provide the ability to test for null effects by calculating the probability of direction (PD), which is a tool for understanding the strength of evidence in support of a particular hypothesis (e.g., the null hypothesis).
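The strict serial recall scoring and the relative error difference described above can be sketched in a few lines. The RDerr% formula below is one common definition, assumed here for illustration rather than taken verbatim from [6]; the mini-trials are hypothetical.

```python
def err_percent(trials):
    """Strict serial recall scoring: a digit counts as correct only if
    recalled in the exact position in which it was presented.

    `trials` is a list of (presented, recalled) digit-sequence pairs.
    Returns the percentage of position-wise errors across all trials.
    """
    correct = sum(1 for presented, recalled in trials
                  for p, r in zip(presented, recalled) if p == r)
    total = sum(len(presented) for presented, _ in trials)
    return 100.0 * (1 - correct / total)

def rd_err(err_cond, err_ref):
    """Relative difference in error percentages (assumed definition):
    how much larger err_cond is, relative to the reference condition."""
    return 100.0 * (err_cond - err_ref) / err_ref

# Hypothetical mini-example: two 8-digit trials.
trials = [((1, 2, 3, 4, 5, 6, 7, 8), (1, 2, 3, 4, 5, 6, 8, 7)),  # 2 swaps
          ((8, 7, 6, 5, 4, 3, 2, 1), (8, 7, 6, 5, 4, 3, 2, 1))]  # perfect
print(err_percent(trials))        # 12.5
print(rd_err(20.0, 10.0))         # 100.0 (condition has double the errors)
```

Note that under the strict criterion a single transposition produces two position errors, which is why the first trial contributes 2/8 errors.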
The Bayesian modelling was done using the brms (version 2.18) package [80] with mildly informative, conservative priors. The posterior was sampled 80000 times: 8 independent chains of 11000 samples each, discarding the first 1000 warm-up samples per chain. No-U-Turn sampling was used [80] to reduce dependencies within each chain. More specifically, a Gaussian distribution with a mean of 10 and a standard deviation of 5 was used as the prior to describe the difference in err% between Quiet and each of the other sound conditions (i.e., ISE per condition relative to Quiet). This prior is based on various studies where the ISE was demonstrated for mainly diotic presentation of nonspeech sounds with/without speech in conditions that represent office environments at least to some extent (e.g., [20,28]). The large standard deviation of the prior was chosen as a conservative representation of the variation in values amongst such ISE studies, and of the fact that no previous study has reported binaural reproduction effects for speech and nonspeech sounds mixed to the extent of the current study.
The calculated PPDs are summarized using median values and associated 95% CIs (calculated using HDIs). The latter is the interval with a 95% chance of containing the effect's true value, given the data and the model. Possible existence of an effect is inferred if the 95% CI does not contain zero when comparing two conditions, and PD > 97.5% in the effect's (positive or negative) direction. The latter provides a link to the frequentist framework since PD > 97.5% is mathematically equivalent to p < .05 [81]. CIs and PDs per PPD were calculated using the bayestestR package [82].

Fig. 1 shows the modelled median errors (err%) in the serial recall task per experimental condition. Table 2 shows the effect sizes in the form of relative differences in error percentages (RDerr%).
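The HDI and PD summaries of a posterior sample can be illustrated with a short generic sketch (not the bayestestR implementation), using a toy Gaussian posterior standing in for an err% difference between two conditions.

```python
import numpy as np

def hdi(samples, mass=0.95):
    """Highest Density Interval: the narrowest interval containing
    `mass` of the posterior samples (unimodal case)."""
    s = np.sort(np.asarray(samples))
    n = len(s)
    k = int(np.ceil(mass * n))                 # samples per candidate interval
    widths = s[k - 1:] - s[:n - k + 1]         # width of each candidate
    i = int(np.argmin(widths))                 # narrowest one
    return s[i], s[i + k - 1]

def prob_direction(samples):
    """Probability of direction: share of posterior mass on the
    dominant (positive or negative) side of zero."""
    s = np.asarray(samples)
    p_pos = np.mean(s > 0)
    return max(p_pos, 1 - p_pos)

# Toy posterior for an err% difference: mean 5, sd 2.
rng = np.random.default_rng(0)
post = rng.normal(5, 2, 100_000)
lo, hi_ = hdi(post)
print(lo, hi_, prob_direction(post))
```

For this toy posterior the 95% HDI excludes zero and PD exceeds 97.5%, so under the decision rule in the text an effect would be inferred.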

Experiment 1 (native German speech)
The err% were lower and statistically robust in the Quiet condition compared to the DNS (-

Experiment 2 (foreign speech in Hindi)
The err% were lower and statistically robust in the Quiet condition compared to the DNS (-

C. Discussion for Experiments 1 and 2
As expected, and in line with most previous research, ISE was exhibited for the diotic conditions in both experiments, albeit using an auditory-verbal serial recall task here. However, the main finding is the extensive difference in the conditions with irrelevant speech across the two experiments (Fig. 2). In Experiment 1, ISE was substantially larger in both diotic and binaural conditions with meaningful speech (i.e., DNS_G and BNS_G), compared to Experiment 2 (i.e., DNS_H and BNS_H) which included foreign speech that was meaningless to the participants.
This difference is evident from the fact that the speech conditions in Experiment 1 had 49% higher RDerr% than those in Experiment 2 (averaged across diotic and binaural conditions; Table 2).
An important consideration here is whether these results can be explained using changing-state variations, at least based on the (exploratory) signal-based (i.e., non-semantic) parameters in section II.A.5 and Table 1. While most previous ISE studies (e.g., [29]) include diotic reproduction, a closer comparison is possible with a recent study which included Quiet, diotic and binaural conditions that resembled those in Experiment 2 [43]. The main difference was that this previous study included classroom nonspeech sounds mixed with multi-talker speech of children in Hindi [43] (recorded using the same method as in II.A.1), which was a language foreign to the participants. For the adult participants in that study [43], ISE relative to Quiet was reported in the diotic condition (RDerr% ≈ 21%), like that for DNS_H in Experiment 2 (RDerr% = 16%; Table 2). However, ISE relative to Quiet was also reported in the binaural condition (RDerr% ≈ 29%) [43], unlike that for BNS_H in Experiment 2 (RDerr% ≈ 11%). While the RDerr% values for the diotic conditions in Experiment 2 and in [43] are somewhat similar, those for the binaural conditions are not. To address the latter, the role of some methodological differences between Experiment 2 and [43] must be noted. Primarily, these included differences in the speech (adults' voices here vs. children's voices in [43]) and the nonspeech sounds used. Moreover, the sound reproduction SPL in Experiment 2 was considerably lower (by roughly 8 dB) than in [43], which may be relatively less disruptive [83]. To consider a signal-based assessment of the audio used: while the FDCC values for DNS_H and BNS_H in Experiment 2 were identical (Table 2), the FDCC values for the diotic and binaural conditions in the previous study [43] changed from 0.7 to 0.6 (although with an identical FS of 1.3). This implies larger differences in the spectro-temporal/changing-state characteristics between the diotic and binaural conditions in the previous study [43].
These larger differences may explain the ISE in both conditions compared to Quiet in [43], but not in the binaural condition in Experiment 2. Besides the issues in predicting ISE in BNS noted above, the role of FS seems ambiguous in this case as well, as it did not vary much between the diotic and binaural conditions in the previous study [43]. While the FS values varied in Experiment 2 here, the predictions do not match the results, as discussed above. Hence, the different audio used in Experiment 2 and the previous study [43], and the correspondingly different changing-state characteristics, may underlie the differing binaural results.

To continue with sound reproduction, the role of energetic masking (including its relationship with changing-state characteristics) between the diotic and binaural conditions was considered above. Furthermore, the speech content (two talkers randomly active at any time out of four locations) may have involved some variation in informational masking across these conditions. However, the role of informational masking is harder to assess here based on previous findings. This is because the closer talkers in Experiments 1 and 2 are at longer distances and wider angles than the talkers in [48], where informational masking with close talkers was found. Hence, a detailed examination of masking principles is not possible here. Yet, the results overall suggest that differences in cognitive performance in the more ecologically valid binaural multi-talker reproduction of meaningful speech (in German, and without room acoustics; see Experiment 3) mixed with OPO nonspeech sounds, relative to the corresponding monaural reproduction (common in traditional ISE studies), were not statistically robust.
Finally, to investigate the role of auditory vs. visual presentation of digits for serial recall, it is worth comparing the RDerr% values in the current vs. previous studies. For instance, an RDerr% of around 36% using a visual-verbal serial recall task was reported for diotic speech (meaningful to participants, in English) mixed with office nonspeech sounds (similar sounds as the current study) relative to a condition with just office nonspeech sounds in [20]. In comparison, the RDerr% in DNS_G relative to DNS is around 41% (Table 2), and a range of -5% to 62.7% has been reported in [6]. This included studies with comparisons between various types of diotic (non-office) nonspeech sound conditions (similar in principle to DNS) and quiet [6]. Moreover, as discussed previously, the RDerr% values in Experiment 2 and in a study with a similar design of diotic audio [43], albeit using a visual-verbal serial recall task, are quite close. These RDerr% comparisons between Experiments 1 and 2 and previous studies, especially those with meaningful speech conditions (e.g., [20,43]), suggest that the current results with an auditory-verbal serial recall task are at least plausible, if not expected to be similar, compared to visual-verbal serial recall tasks. Note again that similar results for visual- and auditory-verbal serial recall tasks have been shown previously for irrelevant music and irrelevant (meaningful) speech conditions, relative to quiet [16]. Yet, foreign speech in the binaural condition in Experiment 2 did not lead to a robust increase in err% values relative to Quiet or binaural nonspeech OPO sounds. Hence, it cannot be ruled out entirely that the auditory presentation of digits in the native language influenced the results. Moreover, while the presentation level of digits was determined here based on pre-experiment tests (section II.A.3), the effect of masking mechanisms may be relevant for more complex presentations.
These aspects regarding the digit presentation, however, were not tested directly and are recommended for future research using variable SNRs and more detailed investigations of potential masking mechanisms.
To summarize, ISE was exhibited in irrelevant nonspeech conditions and in conditions with native (hence, meaningful) speech. Moreover, STM performance in anechoic OPO simulations using speech in the participants' native language was much worse than performance with foreign speech, suggesting a semantic effect beyond changing-state characteristics. This motivates examining the effect of room acoustics representative of OPOs on STM performance, which is studied in the next experiment.

III. EXPERIMENT 3
This experiment investigates the role of representative room acoustics in binaural OPO simulations using typical OPO nonspeech sounds and multi-talker speech in the participants' native language (German). The chosen room acoustic scenarios include reverberation times (T30 in seconds) spanning a mid-frequency (average over the 500 Hz to 2 kHz octave-band center frequencies) T30 range of 0.4 s to 1.1 s, typical of most OPOs [3]. This T30 range is moreover comparable with the similar (presumably broadband) T range of 0.4 s to 1 s in previous studies [51,52]. Besides the reverberation time, the ratio of early to late energy was considered by keeping the clarity index (C50 in dB) within an acceptable range.

Simulated room acoustics and irrelevant sounds
The same room setup as described in section II.A.2 was used here. However, the room surfaces and furnishings were now assigned typical absorption and scattering coefficients within RAVEN [67]. The various material groups in the room included ceiling tiles, floor (with carpet), chairs with sitting persons, tables, windows, door, etc., which provided a room-averaged mid-frequency T30 of 0.4 s and C50 of 13.1 dB (RA-1 in Table 3). These values represent 'high' speech clarity in relatively 'high' sound absorption conditions, also shown in the STI values representing 'good-excellent' speech intelligibility for single talker-receiver scenarios [84]. Note that the STI values here are strictly for illustrating the room acoustic variations, as STI is not well defined for fluctuating background noise (section II.A.4).
Starting from RA-1, absorption coefficients of the materials were iteratively adjusted using the MATLAB-based ITA-Toolbox [85] to derive the conditions RA-2 and RA-3 in Table 3. This was done while ensuring that the spectral shape of reverberation (T30 plotted over octave-band center frequencies) remained similar across conditions RA-1 to RA-3 (3-5 in Table 3) over the mid frequencies. During this process, physically implausible absorption coefficients for surfaces were also avoided. RA-2 could nominally be ascribed relatively 'medium-quality' room acoustics for speech communication (high T30 and fair-good STI and C50) for its room volume. However, RA-3 is edging towards relatively 'bad' room acoustics for intelligible speech communication, with relatively high T30 and poor STI and C50, achieved by mostly reflective surfaces. Moreover, the potentially excess reverberant energy in RA-3 (and perhaps RA-2) leads to an even higher SPL of background speech. Such high-SPL background speech potentially provides some beneficial speech masking and the 'live' ambience preferred in some workplaces. Conversely, it could also be perceptually more annoying and distracting than quieter background noise with less reverberant energy buildup [3,21]. However, exploring such subjective aspects is outside the current scope of research. Besides, the nominal assessment here could be questioned within certain contexts (see Figure 8 and linked discussion in [3]). Yet, RA-1 to RA-3 at least represent OPO conditions derived using acoustically more appropriate simulation methods than previous studies [51,52].
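For illustration, T30 and C50 can be estimated from an impulse response via Schroeder backward integration. This broadband sketch is not the RAVEN/ITA-Toolbox implementation and omits the octave-band filtering behind the mid-frequency averages reported in Table 3; the synthetic impulse response is a hypothetical stand-in.

```python
import numpy as np

def schroeder_db(ir):
    """Schroeder backward-integrated energy decay curve in dB re total energy."""
    edc = np.cumsum(ir[::-1] ** 2)[::-1]
    return 10 * np.log10(edc / edc[0])

def t30(ir, fs):
    """Reverberation time: linear fit of the decay curve over the
    -5 dB to -35 dB span, extrapolated to a 60 dB decay."""
    curve = schroeder_db(ir)
    t = np.arange(len(ir)) / fs
    mask = (curve <= -5) & (curve >= -35)
    slope, _ = np.polyfit(t[mask], curve[mask], 1)   # dB per second
    return -60.0 / slope

def c50(ir, fs):
    """Clarity index: early (< 50 ms) vs. late energy ratio in dB."""
    k = int(0.05 * fs)
    early = np.sum(ir[:k] ** 2)
    late = np.sum(ir[k:] ** 2)
    return 10 * np.log10(early / late)

# Synthetic exponentially decaying noise IR with a known T of 0.8 s:
# amplitude envelope 10^(-3 t / 0.8) gives a 60 dB energy decay per 0.8 s.
fs = 8000
t = np.arange(int(1.5 * fs)) / fs
rng = np.random.default_rng(1)
ir = rng.normal(size=t.size) * 10 ** (-3 * t / 0.8)
print(round(t30(ir, fs), 2))
```

Running this recovers a T30 close to the constructed 0.8 s decay, matching the RA-2 order of magnitude; lengthening the decay lowers C50 correspondingly, which is the trade-off discussed for RA-3.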
The binaural room impulse responses (BRIRs) from the procedure above were convolved with the same nonspeech sounds as in section II.A. The FDCC values varied across the resulting room acoustic conditions (Table 3), while the FS values changed slightly over the same conditions. The experimental conditions comprised RA-1 to RA-3 (Table 3), besides also including the anechoic BNS_G condition (referred to as Anechoic in the following) from Experiment 1 and the Quiet condition (Table 1). Errors in Quiet were lower than in all other (100% PD) conditions. The RDerr% values for the comparison shared between Experiment 1 (Quiet vs. BNS_G; Table 2) and Experiment 3 (Quiet vs. Anechoic; Table 3) were similar.

C. Discussion for Experiment 3
This experiment shows that serial recall performance in the Quiet condition was better (statistically robust differences) than other conditions including Anechoic, and room acoustic conditions with representative T30 values for medium sized OPOs (C50 variation in Table 3), and hence, exhibited the ISE. The relatively similar err% in the Anechoic and RA-1 (Table 3) conditions, which were not statistically robust, imply that a 'moderately-highly' absorptive room with high speech clarity may not be optimal for STM performance. This is consistent with the current recommendations in room acoustic design where speech clarity/intelligibility reduction between OPO workstations is the key consideration [9].
Moreover, [52] had reported ISE in a condition with T = 0.4 s in an OPO simulation with spatially fused multi-talker speech. The current results regarding RA-1 are hence consistent in principle with [52] in terms of ISE, but not in terms of RDerr%. The RDerr% values in [52] between the quiet and T = 0.4 s conditions were 9% and 4% for 3- and 15-voice mixes, respectively. These values are much lower than the RDerr% of 44.9% for Quiet vs. RA-1 (Table 4). While err% decreased with decreased sound absorption in RA-2 and RA-3 relative to RA-1, also consistent with [52] at least in terms of the direction of the effect, err% differences between RA-2 and RA-3 were not statistically robust. The latter is consistent in principle with [51], wherein serial recall performance in T = 0.7 s vs. 0.9 s conditions did not vary significantly.
However, err% in RA-3 was still higher compared to Quiet (difference statistically robust). This is not consistent in principle with the main finding in [52]. Therein, serial recall performance was not significantly different in quiet compared to a condition with T = 1.0 s and with irrelevant speech from 15 voices at a 10 m distance [52]. The RDerr% were 1% and 7% higher in the 3- and 15-voice mixes, respectively, with T = 1.0 s in [52], which is quite different from the 27.5% for RA-3 vs. Quiet (Table 4).
Although both refs. [51,52] used a range of T similar to the current T30, they have a rather limited and ambiguous representation of room acoustics and speech sources in OPO simulations.
This likely leads to different conclusions and RDerr% between conditions across studies.
Further, besides not including any OPO nonspeech sounds, the speech in [52] did not include nearby talkers, whose speech is generally intelligible and considered the primary source of auditory distraction in OPOs [1,2,21]. Hence, the current results can be considered a more ecologically valid comparison between different room acoustic conditions representative of OPO scenarios. As such, the current findings disagree with the (rather unintuitive) suggestion in [52] that increased T30 (around 1.0 s) with distant and spatially-fused multi-talker 'babble' can exhibit serial recall performance similar to quiet conditions in OPOs.
Moreover, increased T30 (i.e., reduced sound absorption) values for a fixed volume come with their own set of problems. These include increased overall reverberant SPL, etc., and designers generally tend to control the reverberant energy in OPOs. Note that this discussion does not intend to ascribe a special role to T30, which would lead to a rather simplistic (even misinformed) assessment of room acoustics at best. As outlined in several studies [1,9,21,86,87], a careful combination of various room acoustic criteria is needed to achieve 'good' acoustics in OPOs. One such aspect of sensible room acoustic design that was not included in the current simulation was the use of partitions/screens between workstations, which was partly to allow comparisons with the simulations in refs. [51,52]. Sound absorptive screens/partitions are typically used in OPO room acoustic design to allow better attenuation of direct sound between workstations, implying a more effective reduction of speech clarity/intelligibility. In comparison, a previous study with almost the same multi-talker speech simulation (in English, and without nonspeech sounds) as the current study had included absorptive screens between workstations. This previous study had reported an RDerr% of 40% for a condition with T30 = 0.5 s relative to quiet [12], which is relatively similar to the RDerr% of 45% for RA-1 vs. Quiet in the current study (Table 4). This comparison underscores again the general difficulty in sufficiently reducing speech intelligibility between the nearest workstations using sound absorption alone, and where additional sound masking may be beneficial [1,21].
From another (and topical) perspective, the situation in RA-2 and especially RA-3 depicts recent trends in OPOs where screens/partitions are eschewed, including within offices supporting activity-based working [3]. The current results do not support such practices without additional considerations such as sound masking, work type, etc. [22]. This is because the current findings show that STM performance in conditions with/without high sound absorption but without screens/partitions may still exhibit substantial ISE. However, there are a few caveats that need to be considered. Lombard speech [88], wherein speech with higher vocal effort is likely in reverberant conditions [89], was not implemented here. While there is some evidence that the overall SPL within OPOs may not be a major factor mediating ISE [12,90,91], this needs to be tested in simulations using Lombard speech corresponding to varying room acoustics. This would allow more realistic comparisons between different room acoustic conditions. In RA-2 and RA-3, the task performance was almost the same (and around 12% better than in RA-1).
Although higher T30 along with lower C50 than RA-3 may be achievable, it would perhaps be unreasonable for an OPO of current volume. In that regard, one may consider RA-2 as a compromise, with err% in serial recall performance similar to RA-3 but lower (statistically robust) than RA-1. However, the 'sweet spot' for OPO room acoustics is a much broader topic (see [9]), which needs to be tested using serial recall and other OPO tasks (e.g., writing, etc.) with variable room acoustics.
Consequently, the results here only represent the role of sound absorption, without screens/partitions, in OPO room acoustic design in terms of performance in STM tasks. While the current results represent higher ecological validity than previous ISE studies, many areas of improvement are possible. In this regard, as discussed in section II.C, the effect of better source localization with individual/individualized HRTFs and head tracking needs to be tested. The to-be-recalled stimuli were presented as acoustically 'dry' spoken digits, i.e., they did not change according to the room acoustics. This unrealistic aspect of the experimental design could be addressed in future studies. Moreover, the current experiment is limited to medium-sized OPOs, with two nearby talkers and two talkers further away, along with representative OPO nonspeech sounds that are spatially distributed. Larger OPOs will have a more complicated assortment of speech and nonspeech sound sources, and the spatial release from masking can be quite complicated. Hence, generalization of the current results beyond the conditions considered needs caution, and more systematic studies are needed to explore the effect of nearby sources.
In terms of the metrics to explore changing-state characteristics, the values are more informative here than in Experiments 1 and 2 (Table 1). The values of both FDCC and FS track the err% across conditions, except for RA-2 vs. RA-3. One major difference in this experiment was that the irrelevant sound content remained the same across conditions, whereas in Experiments 1 and 2 the content changed across conditions. Hence, it seems likely that FDCC and FS may be more useful in some cases than others, and the role of these metrics needs elaboration in future studies.

IV. GENERAL DISCUSSION AND OUTLOOK
The overall aim of this paper was to investigate the well-established ISE paradigm for STM serial recall using representative acoustics for OPO environments. To summarize, in Experiments 1 and 2, ISE was exhibited as expected in all but one condition with OPO nonspeech sounds. Further, ISE was exhibited in all conditions with meaningful speech mixed with nonspeech sounds in Experiment 1, and in the diotic condition with foreign speech mixed with nonspeech sounds in Experiment 2. However, the magnitude of the ISE was much larger in conditions with native speech relative to foreign speech, which suggests a semantic effect.
Moreover, findings from Experiments 1 and 2 show that STM cognitive performance may not vary between anechoic diotic and binaural reproduction, suggesting limited role (yet no disadvantage) of incorporating realistic spatial separation and interaural cues in the irrelevant stimuli. Experiment 3 showed that with realistic room acoustics for OPOs, serial recall task performance may not approach performance in Quiet as suggested in previous studies (e.g., [52]). As such, findings in Experiment 3 do not endorse exorbitant reverberant conditions as solutions to auditory distraction. Nor do they support highly reverberant conditions due to eschewing sound absorption in some office design philosophies, e.g., those with activity-based working (ABW). Instead, Experiment 3's findings support sensible room acoustic and workplace design considerations, as noted previously in other studies [1,3,9,22].
These findings collectively highlight, at the very least, the importance of using representative speech and nonspeech sound conditions when studying the ISE using serial recall tasks in complex acoustic scenarios such as OPOs. Yet, it is worth noting that while the current acoustic presentation is an improvement over studies using simple irrelevant sounds (e.g., words, short sentences, etc.), it is still a simplified version of actual OPO acoustics. Hence, the current findings are not intended to challenge the basis of studies using simpler irrelevant sounds. Moreover, many other relevant variations of acoustic presentation are possible, and recommended for future studies. If anything, the findings here present another piece in the puzzle for understanding cognitive performance in verbal STM tasks, wherein studies with both simple and complex acoustic representation may be justified. An important aspect in the current study was the use of the less common auditory-verbal serial recall task (to represent auditory information in OPOs). However, the STM performance trends were similar to those reported in several ISE studies using the more traditional visual presentation of verbal information. Besides, visual- and auditory-verbal serial recall performance has been shown to be similar previously [16].
In general, it is suggested that ISE studies methodically describe and critically examine the context of irrelevant sounds (e.g., sounds used and the reproduction method) more thoroughly than is currently common. This should enable more extensive studies that investigate aspects such as ecological validity in a more controlled manner, which is relevant for both simple and complex acoustic presentations. Within this backdrop, it is worth addressing the acoustic conditions in the context of the results, and in terms of limitations of the current experimental design. As mentioned in section III.A.1, the presentation SPL across all conditions (except Quiet) was kept the same, which is not realistic. In terms of acoustic reproduction, the binaural method here used generic HRTFs. It is possible that the attention-grabbing nature of binaural presentation could vary [92] due to spatial and binaural unmasking effects in more ecologically valid [10] settings, including individual/individualized HRTFs. Signal-based changing-state characterization using objective parameters (Tables 1 and 3) has limitations (sections II.C and III.C), but the parameter FDCC provided more reliable predictions than FS. The finding that FS is not a reliable predictor of ISE is consistent with at least one previous study [29], although some studies have suggested otherwise [28,72]. Moreover, the subjective nature of sound design choices within these conditions cannot be overlooked either (section II.C). To address such limitations, a conservative approach is proposed. Herein, given their higher fidelity to actual OPO acoustic environments, it is suggested that the scope of the current findings be aligned very closely with the acoustic conditions and other methodological choices herein, and that generalizations (e.g., for large OPOs, and different acoustic presentations) be reserved for future research.
In that regard, the following research directions are proposed: considering the ISE (and auditory distraction in general) within variations in 'busyness', and hence changing states, of acoustic events for both speech and nonspeech sounds; testing the effect of intonation, emotionality, etc. of voices and having more realistic speech content than the current (e.g., using Lombard speech); systematically comparing ISE in quantified changing-state experimental conditions using parameters such as the FDCC (and perhaps even FS); more acoustically accurate simulation approaches (e.g., SPL reflecting spatial separation of sources, individualized HRTFs, room acoustics for nearby and far reflections, etc.); more realistic presentation of auditory to-be-remembered items and the possible masking mechanisms involved for to-be-remembered items due to irrelevant sounds (and vice-versa); using multimodal stimulus presentation and measuring physiological responses during task performance (e.g., [35]); comparing semantic effects in traditional tasks such as visual-verbal and auditory-verbal short-term memory serial recall with those believed to involve extensive semantic processing (e.g., writing, etc. [36,40,56]); and exploring greater task engagement to study the role of attentional capture [56]. Besides these largely acoustic considerations, it is worth noting that the current experiments were conducted in laboratory conditions with limited correspondence to actual OPOs in terms of visuals, etc. The current laboratory conditions (excluding acoustics) represented those in many such laboratories and test booths worldwide, and hence allow easier repetition by other groups in the future. Regardless, for higher ecological validity, the use of mock offices with realistic visuals, furniture, etc., and/or testing in actual OPOs is highly recommended.
From an even broader perspective, the current findings present a case for pushing the boundaries of representative acoustic conditions in laboratory-based experiments. This seems most pertinent for studies like the current one that are at the crossroads of multiple disciplines, and highlights the need for such studies to occupy a space alongside studies following a more traditional and more limited acoustic design. Moreover, there is also a possibility of repeated findings from studies with non-representative acoustic design, which, while following established experimental paradigms such as the ISE in the serial recall task, may not follow expectations within realistic settings such as OPOs and similar application areas. If studies such as the current one are not conducted and their results are not further explored within basic and applied research, the authors believe that it would stifle the otherwise accelerated and comprehensive growth possible in understanding cognitive phenomena related to auditory distraction from either end of the ecological validity continuum. This continuum is meant here to span simple laboratory acoustic representations to corresponding real-world instances, and the current findings fit in a region closer to the latter.
CRediT AUTHOR STATEMENT