Listeners are sensitive to the speech breathing time series: Evidence from a gap detection task

The effect of non-speech sounds, such as breathing noise, on the perception of speech timing is currently unclear. In this paper we report the results of three studies investigating participants' ability to detect a silent gap located adjacent to breath sounds during naturalistic speech. Experiment 1 (n = 24, in-person) asked whether participants could either detect or locate a silent gap that was added adjacent to breath sounds during speech. In Experiment 2 (n = 182; online), we investigated whether different placements within an utterance were more likely to elicit successful detection of gaps. In Experiment 3 (n = 102; online), we manipulated the breath sounds themselves to examine the effect of breath-specific characteristics on gap identification. Across the studies, we document consistent effects of gap duration, as well as gap placement. Moreover, in Experiment 2, whether a gap was positioned before or after an interjected breath significantly predicted accuracy as well as the duration threshold at which gaps were detected, suggesting that nonverbal aspects of audible speech production specifically shape listeners' temporal expectations. We also describe the influences of the breath sounds themselves, as well as the surrounding speech context, that can disrupt objective gap detection performance. We conclude by contextualising our findings within the literature, arguing that the verbal acoustic signal is not "speech itself" per se, but rather one part of an integrated percept that includes speech-related respiration, which could be more fully explored in speech perception studies.


Introduction
Breathing provides both flow and form to speech. The development of the intercostal muscular control required to sustain vocal air pressure was a turning point in human evolution (MacLarnon & Hewitt, 1999), and David Abercrombie, an early proponent of one influential theory of speech rhythm, considered breathing as its basis, stating that speech rhythm is "essentially a muscular rhythm, and the muscles concerned are the breathing muscles" (Abercrombie, 1967). Despite the vital contribution of breathing to vocalisation, its role in determining the temporal-dynamic structure of speech remains scientifically enigmatic, in part because the percept of breathing is rarely investigated experimentally. Indeed, the speech research community has been subject to recent criticism concerning its perceived over-reliance on unnatural, laboratory-produced stimuli, but even these commentaries neglect to engage with breathing as a necessary and ubiquitous, yet overlooked, component of speech production and perception in the wild (see, for example, Alexandrou, Saarinen, Kujala, & Salmelin, 2020; Hamilton & Huth, 2020; Hitczenko, Mazuka, Elsner, & Feldman, 2020). That speech breathing has not received widespread scientific attention may be attributable to its very banality: we tend not to consciously notice respiration (our own, or another person's) at all, unless some irregularity or breakdown has occurred. The addition of breath sounds (but not acoustically similar, non-breath sounds) to synthesized speech, however, improves listeners' recall (Elmers, Werner, Muhlack, Möbius, & Trouvain, 2021; Whalen, Hoequist, & Sheffert, 1995), and a speaking virtual character is perceived as more human and trustworthy when it audibly breathes (Bernardet, Kang, Feng, DiPaola, & Shapiro, 2019).
Multi-sentence synthesized speech (e.g., audiobook narration) with breath sounds modelled on human respiratory patterns is strongly preferred for naturalism over the equivalent with silent pauses (Braunschweiler & Chen, 2013). Although little is known concerning how speech breathing is planned, analysis so far suggests that breaths are predominantly taken at syntactic boundaries (Fuchs, Petrone, Krivokapić, & Hoole, 2013; Grosjean & Collins, 1979; Henderson, Goldman-Eisler, & Skarbek, 1965; Winkworth, Davis, Ellis, & Adams, 1994), and that longer phrases are usually preceded by longer or larger inhalations (Rochet-Capellan & Fuchs, 2013; Whalen & Kinsella-Shaw, 1997; Winkworth et al., 1994). There is also a strong social-pragmatic function for speech breathing, with dyadic and group experimental data revealing systematic relationships between respiratory and turn-taking behaviours in conversation (Aare, Gilmartin, Wlodarczak, Lippus, & Heldner, 2020; Rochet-Capellan & Fuchs, 2014; Torreira, Bögels, & Levinson, 2015; Włodarczak & Heldner, 2016a, 2016b). Taken together, these preliminary data suggest that we expect to hear breath sounds in accompaniment of speech; that the duration of vocal inhalation may offer some cues as to the upcoming content being communicated; and that seamless interaction is possibly facilitated by interlocutors' awareness of each other's respiratory movements. What is unclear, however, is how breathing is perceptually integrated on the sub-second timescale as a component of speech prosody or rhythm. Do listeners, consciously or non-consciously, impose temporal expectations upon speech-related inhalation, for example, by expecting speech to occur at a certain point after hearing a breath sound? If so, are they sensitive to perturbations of the natural speech breathing time series? Towards an initial understanding of listeners' sensitivity to the timing of speech breath sounds, we devised two modified gap detection tasks.
In its most basic form, a gap detection task consists of the serial presentation of auditory stimuli, such as tones or noise, in which a very brief silent (i.e., acoustically empty) gap may or may not occur. The participant's job is to report whether or not they noticed a gap, and the task can incorporate adaptive procedures to determine an individual's unique perceptual threshold in terms of the duration of the gap, or some other property of the stimulus. The results are taken as an indication of auditory temporal resolution (Phillips, 1999), a dimension that predicts future reading problems in preschoolers (Boets et al., 2011) and shows decline in the early stages of dementia (Jalaei et al., 2019). This form of perceptual acuity is considered essential to speech processing, which by its nature relies on the ability to track a rapidly changing signal in potentially challenging listening environments, an effort that demands efficiency. This need for efficiency may explain why sensitivity to gaps does not appear to be uniformly distributed in time: Henry and Obleser (2012) employed a gap detection task in conjunction with electroencephalography (EEG) to show that attending power (that is, the likelihood of successful gap detection) is concentrated at particular instants in the phase relationship between a rhythmic stimulus (a frequency-modulated tone) and neural oscillations in the delta band (3 Hz). Given the purported social-linguistic salience of breath sounds, it is possible that the interlocked rhythms of speech and speech breathing could also influence listeners' discrimination of gap targets; for instance, listeners may be more likely to detect a gap that follows, rather than anticipates, a breath sound associated with speech.
In the current paper, we ask how listeners temporally assimilate a speaker's breaths, and the extent to which such temporal assimilation is flexible, by manipulating the respiratory time series in the context of varied and naturalistic speech stimuli. Across three experiments, we alter the placement and duration of an artificially added silent gap adjacent to breath sounds and speech in order to assess the approximate threshold at which a gap can be reliably detected or located by listeners. Using two tasks in the first experiment, we asked participants to 1) report whether or not they perceived a gap that could occur between a breath sound and speech; and 2) identify after which of two breaths a gap occurred within a single utterance. In the second experiment, we followed up on the first task, but added multiple positions in which the gap could occur relative to two breath sounds. In the third and final experiment, we extended the results of the second task in Experiment 1 by manipulating the breath sounds themselves, in order to interrogate breath-specific influences when determining where the gap occurred.
These paradigms allow us to interrogate listeners' expectations of the timing of speech breathing from slightly different angles, and at differing levels of naturalism. Before describing the experiments in more detail, we make some broad predictions concerning the perceptual thresholds at which listeners may detect the manipulation of the natural speech breathing time series and outline their justifications in the following section. Speech breathing is neurally (Mckay, Evans, Frackowiak, & Corfield, 2003; Von Euler, 1982), behaviourally (Winkworth et al., 1994), and acoustically (Trouvain, Möbius, & Werner, 2019) distinct from metabolic breathing, and it follows that a listener should plausibly expect to hear speech following a speech-associated inhalation sound. What is less obvious is how flexibly listeners treat the variable pause between the cessation of air intake and the ensuing speech onset. Although this question appears to be scientifically unexplored thus far, previous physiological analyses of speakers' respiratory activity provide some a priori rationale to expect that listeners will be sensitive to temporal manipulations in this domain. Specifically, it is reported that phonation usually begins between 100 and 300 ms after the post-inspiration period, although nontrivial interspeaker differences are also recorded (Atkinson, 1973; Lieberman, 1967; Slifka, 2003). Other researchers have focused on the acoustic presentation of breathing, observing that audible breath sounds tend to be flanked by silent "edges", which tend to last for tens of milliseconds (Fukuda, Ichikawa, & Nishimura, 2018; Trouvain et al., 2019).
In view of the recognisable acoustic profile of speech breathing, its necessary association with imminent speech, and the normative transition period between the end of a breath and vocalisation, we speculate that perturbations as short as an extra 200 ms between breath and speech may be sufficient for detection; however, it is likely that performance will increase with increasing gap length, as would be expected in the traditional gap detection task. We further predict that a gap following an interjected, rather than initial, breath should be more likely to register as a violation of natural speech breathing timing. This idea is congruent with evidence that, at least in nonverbal neural entrainment, the stimulus-brain phase relationship consolidates over several seconds of rhythmic stimulation (Bauer, Bleichner, Jaeger, Thorne, & Debener, 2018). It is also possible that the acoustic profile of interjected speech breaths may make them more amenable to temporal discrimination: by comparison, initial breaths tend to be longer lasting and larger in volume (Winkworth et al., 1994; Winkworth, Davis, Adams, & Ellis, 1995), and are hypothetically freer to vary than interjected breaths, which presumably conform to the surrounding speech rhythm. In sum, we predict that participants should be more likely to correctly identify silent gaps associated with interjected than initial breaths.
1.1.1.3. Rhythm sensitivity. In addition to the speech breathing gap detection tasks, we also administer a nonverbal rhythm discrimination task. Musicians achieve comparatively lower perceptual thresholds in traditional auditory gap detection tasks (Donai & Jennings, 2016; Mishra, Panda, & Herbert, 2014), and demonstrate enhanced abilities in auditory processing more generally (see Strait and Kraus (2014); Vasuki, Sharma, Demuth, and Arciuli (2016) for review), including speech-in-noise perception (see Coffey, Mogilever, and Zatorre (2017) for review). It is therefore likely that musical aptitude, especially in the domain of rhythm, confers an advantage in the present experiment; however, rather than rely on self-reported musician status, we opt to measure sensitivity to rhythm directly, as rhythmic aptitude is known to vary across musicians by genre and training (Bailey & Penhune, 2010; Matthews, Thibodeau, Gunther, & Penhune, 2016). The rhythm task employed consists of same-different judgements following the presentation of rhythmic drum loops. We predict that performance in this rhythm discrimination task will positively correlate with accuracy in gap detection.
To summarise, we expect that listeners will be more likely to detect silent gaps adjacent to breath sounds that occur within, rather than before, the speech stream. We also expect that increasing the gap duration will facilitate gap detection, and that participants with an enhanced sense of timing in the nonverbal domain will also perform relatively better in making judgements about the timing of speech breath sounds. Finally, in Experiment 2, we contrast the detection of gaps added either before or after interjected breaths. If gap detection is superior for gaps occurring after breath sounds, this may suggest that breath sounds are informative concerning the timing of upcoming speech. In practice, temporally asymmetric sensitivity to breath sounds could serve as a sort of phase reset, thereby enhancing early entrainment to the speech stream. We discuss these and other findings in view of embodied speech perception and the recent surge of interest in rhythmic speech entrainment. We close by making our case for speech breathing as a consequential, if hitherto latent, factor in auditory speech processing more generally.

Participants
Participants were locally recruited from the University College London subject pool and were paid for taking part in the experiment, which was approved by the local ethics committee and conducted in accordance with the Declaration of Helsinki. All participants had normal or corrected-to-normal hearing, and gave informed consent to participate. 25 participants took part in this experiment, of whom one had to be excluded due to a data-saving problem, resulting in a total N = 24 (14 male, 10 female; ages 18-35).

Design
2.1.2.1. Gap detection tasks. Experiment 1 consisted of two gap detection tasks, the first of which (Experiment 1A) simply asked participants to report whether or not they detected a silent gap that may have been inserted between a breath sound and speech. Half of the trials contained a gap, and the other half did not. Each utterance repeated once across the pseudo-random trial order, such that participants heard both the gap and no-gap versions of each unique utterance. The gaps ranged in duration from 100 to 1600 ms, with the exact values determined by generating a random gamma distribution with 95% CI [325, 595] ms using the gamrnd function in MATLAB. Each duration value occurred once per speaker. The gamma distribution was chosen to give participants more chances to respond to trials with shorter, and presumably more difficult, gaps.
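The sampling procedure above can be sketched as follows, as a rough Python analogue of the MATLAB gamrnd step; the shape and scale parameters here are illustrative assumptions, not the values used in the study:

```python
import numpy as np

def sample_gap_durations(n, shape=9.0, scale=50.0, lo=100, hi=1600, seed=0):
    """Draw n gap durations (ms) from a gamma distribution, discarding
    samples outside [lo, hi]. Shape/scale values are illustrative only."""
    rng = np.random.default_rng(seed)
    out = []
    while len(out) < n:
        d = float(rng.gamma(shape, scale))
        if lo <= d <= hi:
            out.append(round(d))
    return out

# One duration per unique gap trial (108 utterances in Experiment 1A)
durations = sample_gap_durations(108)
```

A right-skewed gamma like this concentrates most samples at the short end of the range, matching the stated rationale of over-representing shorter, harder gaps.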
In the second task (Experiment 1B), participants heard an utterance containing two breath sounds, an initial and an interjected breath, one of which would be followed by a silent gap before speech started or resumed. The participant's job was to answer after which breath they thought the gap had occurred. The gap followed the initial breath (Breath 1) in half of trials, and the interjected breath (Breath 2) in the other half. The utterances did not repeat. The gap duration consisted of three discrete levels, resulting in a 2 (Gap Position: Breath 1, Breath 2) × 3 (Gap Duration: 200, 400, 800 ms) within-subjects factorial design. The counts of gap duration values followed a ratio of 3:2:1, with more gaps having the shortest duration, and fewer gaps having the longest duration. The trial structures are depicted in Fig. 1.

Nonverbal rhythm discrimination task.
To gauge participants' nonverbal musical rhythm processing skills, we administered a forced-choice discrimination task. Each trial consisted of the presentation of a short drum loop, followed by a pause, before a second drum loop played. The participants' job was to answer whether the second drum loop was the same as or different from the first. The drum rhythm loop stimuli consisted of 3.2 s rhythmic sequences introduced by Tierney and Kraus (2015), adapted from Povel and Essens (1985), which were made up of nine conga drum sounds separated by the following inter-onset intervals: 5 × 200 ms; 2 × 400 ms; 1 × 600 ms; and 1 × 800 ms, the reorganisation of which generates distinct rhythmic patterns. There were 40 trials, which were equally distributed between same/different correct responses. Each participant's percentage correct (Rhythm Task) was used as a covariate when modelling gap detection performance.
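The interval structure of these loops can be illustrated with a short sketch (hypothetical Python; the original stimuli were pre-rendered audio, so this only shows how permuting the fixed interval set yields distinct patterns of identical total duration):

```python
import random

# Inter-onset intervals (ms) composing each 3.2 s conga-drum loop
# (after Tierney & Kraus, 2015; Povel & Essens, 1985)
BASE_IOIS = [200] * 5 + [400] * 2 + [600, 800]

def make_rhythm(seed=None):
    """Return one ordering of the fixed interval set; distinct
    orderings correspond to distinct rhythmic patterns."""
    iois = BASE_IOIS[:]
    random.Random(seed).shuffle(iois)
    return iois

rhythm = make_rhythm(seed=1)  # total always sums to 3200 ms (3.2 s)
```

Because every pattern uses the same nine intervals, same/different judgements hinge on temporal order alone, not on overall duration or event count.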

Stimuli
2.1.3.1. Gap detection task. The corpus from which we produced all speech stimuli used in the current study was balanced across four speakers (2 male, 2 female, ages 18-35), comprising approximately 50% reading and 50% spontaneous speaking styles. Texts consisted of: adaptations of popular science articles chosen for an accessible reading level and neutral tone; four poems characterised as typical of traditional English rhyming verse; and a variety of prompts, characterised as small talk-style questions, to produce the spontaneous speech. The full-length recordings, consisting of approximately one hour of speech per speaker, were combed through for fluent utterances (i.e., without hesitations or silent pauses) where there was a discrete but minimal transition between the breath sound and the speech. In Experiment 1A, the gap detection task stimuli consisted of 108 unique utterances containing one breath followed by a complete phrase or sentence. Each speech stimulus was repeated (i.e., presented once with and once without a gap), resulting in 216 trials in total. The mean duration of breath sounds was 578 ms (SD 260 ms) and the mean duration of the speech was 4.46 s (SD 1.58 s). In Experiment 1B, the stimuli consisted of 150 unique utterances that naturally consisted of a breath-speech-breath-speech structure. The speech stimuli did not repeat within the task. The mean duration of the first breath sound was 578 ms (SD 280 ms), and the second breath sound was 418 ms (SD 167 ms). Mean Speech 1 duration was 3.00 s (SD 1.17 s), and mean Speech 2 duration was 2.49 s (SD 0.96 s). In both Experiments 1A and 1B, the stimuli were roughly balanced across all four speakers (2 male, 2 female) and speaking styles.
The speakers were recruited from the broader UCL research community and were unaware of the specific aims of the experiment. They were given no instructions or direction regarding their breathing during speaking. Source recordings were made in a sound-attenuated studio using SM58 cardioid dynamic microphones (Shure Inc., Niles, IL), positioned via a mic stand in front of the speaker's mouth and sampled at 44,100 Hz. The speakers wore transducer plethysmography belts (MLT1132, ADInstruments, Castle Hill, Australia) to monitor breathing and aid in breath event identification. The acoustic stimuli were segmented and preprocessed in Audacity (Audacity Development Team, 2020), and the beginning and ending of breaths were labelled following visual inspection of the spectrogram with corroboration from the respiratory belt data. Speech and individual breath sounds were root mean square normalised for intensity, which was verified by the authors and adjusted where needed to ensure an approximate perceptual balance within and across excerpts. Onset and offset ramping (10 ms) was applied to the boundaries of speech and breath sounds, and low-intensity background pink noise was added throughout each trial, including during the "silent" gaps, to mask any residual noise or artifacts caused by splicing the audio.
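A minimal sketch of this preprocessing chain is given below (illustrative Python; the published stimuli were prepared in Audacity rather than with code, and the target RMS level here is an assumption):

```python
import numpy as np

FS = 44100  # sampling rate (Hz), matching the source recordings

def rms_normalise(x, target_rms=0.05):
    """Scale a waveform to a common RMS level (target level illustrative)."""
    rms = np.sqrt(np.mean(x ** 2))
    return x * (target_rms / rms)

def apply_ramps(x, ramp_ms=10, fs=FS):
    """Apply linear onset/offset ramps to suppress splicing clicks."""
    n = int(fs * ramp_ms / 1000)
    y = x.copy()
    y[:n] *= np.linspace(0.0, 1.0, n)
    y[-n:] *= np.linspace(1.0, 0.0, n)
    return y

# Example: 500 ms of white noise standing in for a breath sound
seg = apply_ramps(rms_normalise(np.random.default_rng(0).normal(size=FS // 2)))
```

The ramps guarantee that each spliced segment begins and ends at zero amplitude, so that inserting a silent gap does not itself produce an audible click cue.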
2.1.3.2. Nonverbal rhythm discrimination task. The conga sounds used to generate the rhythmic drum loop stimuli are freely available from MusicRadar (Music radar drum samples, 2022).

Apparatus
The complete experiment was run in PsychoPy (Peirce, 2007) using a laptop computer connected to studio headphones. Participants' responses were recorded via the built-in laptop keyboard.

Procedure
The components of Experiment 1 were performed in the following order: 1. Administration of information sheet and informed consent; 2. Experiment 1A: "Was there a gap?"; 3. Experiment 1B: "Where was the gap?"; 4. Nonverbal rhythm discrimination task; 5. Verbal debrief on the aims and background of the study. The tasks all contained introductory practice sessions, each consisting of three example trials, with otherwise unused stimuli. In the practice sessions, automatic accuracy feedback was given and participants could repeat the practice sessions as many times as they liked. Both of the gap detection tasks and the rhythm discrimination task incorporated a visual warning stimulus (a headphones icon) to alert participants that an upcoming trial was about to begin in 500 ms. There was a variable delay of 500-1000 ms between each trial presentation. All testing took place in a private, quiet space at the Institute of Cognitive Neuroscience. An experimenter was on hand to answer any questions and discreetly ensure task compliance. Throughout the tasks, participants were periodically given the option to take a self-paced rest.

Data processing and analysis
All analyses in this study were performed using R 3.5.0 (R Core Team, 2013). To explore the factors affecting correct/incorrect responses in the two gap detection tasks, we ran logistic regressions with a binomial distribution and logit link function, using a generalised linear mixed effects model with the glmer function of the lme4 package with bobyqa optimisation. As a secondary analysis, reaction time data were also analysed using a linear mixed effects model with the lmer function, estimated with restricted maximum likelihood and optimx optimisation. In Experiment 1A ("Was there a gap?"), the explanatory variables were Gap Duration (continuous) and Rhythm Task (continuous) with a random intercept for Participant and Speech Excerpt, and a random slope for Gap Duration within Participant. Breath Duration and Speech Duration were also fit as continuous terms to control for natural variability in the speech excerpts. Only responses to trials truly containing gaps were analysed in the modelling, but signal detection analyses were also conducted to examine sensitivity. Hit rate was defined as the proportion of trials on which participants correctly detected the gap, and the False Alarm rate was defined as the proportion of trials on which participants incorrectly responded that a gap was present, with the loglinear correction applied where rates were 1 or 0 (Stanislaw & Todorov, 1999). Z-values were computed for Hit and False Alarm rates and d-prime was calculated. In Experiment 1B ("Where was the gap?"), explanatory variables were Gap Position (Breath 1/Breath 2), Gap Duration (200/400/800 ms), and Rhythm Task (continuous) with random intercepts for Participant and Speech Excerpt, as well as random slopes for Gap Duration and Gap Position within Participant. We also fit Breath 1 Duration, Breath 2 Duration, Speech 1 Duration, and Speech 2 Duration as covariates.
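The sensitivity computation can be sketched as follows (Python, using the standard-normal quantile function; the correction shown follows the description above of applying the loglinear adjustment only when a rate would be 0 or 1, and the example counts are hypothetical):

```python
from statistics import NormalDist

def dprime(hits, misses, false_alarms, correct_rejections):
    """d' from raw counts. A rate of exactly 0 or 1 is replaced using
    the loglinear correction (add 0.5 to the count and 1 to the total),
    following Stanislaw & Todorov (1999)."""
    z = NormalDist().inv_cdf

    def rate(k, n):
        r = k / n
        if r in (0.0, 1.0):
            r = (k + 0.5) / (n + 1)
        return r

    hit_rate = rate(hits, hits + misses)
    fa_rate = rate(false_alarms, false_alarms + correct_rejections)
    return z(hit_rate) - z(fa_rate)

# Hypothetical counts: 90 hits on 108 gap trials, 20 false alarms on 108 no-gap trials
d = dprime(90, 18, 20, 88)
```

The correction keeps the z-transform finite for participants with perfect hit rates or zero false alarms, which would otherwise yield infinite d' values.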
Model selection was performed using a step-wise additive approach, beginning first with random intercepts, before including fixed main effects and interactions, then random slopes, with side-by-side model comparisons reported. Model residuals were visually inspected for normality. The significance of fixed effects was tested using Wald chi-square tests (Anova from the car package; Fox et al., 2012). Statistical variation explained by fixed effects was estimated as semi-partial R² (R²sp), calculated in the r2glmm package (Jaeger, 2017). Post hoc contrasts were conducted on estimated marginal means using the package emmeans (Lenth, 2018).

Results

Participants were able to achieve ≥70% correct at gap duration values as short as ~440 ms (Fig. 2); however, mean accuracy on trials with gaps <200 ms in duration hovered around chance (48%, SD 15%). By contrast, ceiling performance (approximately 80%) was reached by 700 ms, with negligible improvement as gap duration increased beyond that. We found no association between Rhythm Task and correct responses to trials containing gaps (Appendix A, Table A2), despite Rhythm Task correlating very highly with overall performance in the task (r_s = 0.71, p = 0.003). Signal detection analysis revealed that this relationship was driven by the False Alarm rate. Summarising by median-split Rhythm Task, participants with higher nonverbal rhythm perception ability had a much lower False Alarm rate (Mean 0.19, SD 0.11) in comparison to those with lower nonverbal rhythm perception (Mean 0.41, SD 0.11). The Hit rate, however, was more similar between participants with higher (Mean 0.71, SD 0.11) and lower (Mean 0.64, SD 0.09) nonverbal rhythm. In other words, although doing well in the Rhythm Task did not confer an advantage in gap detection, participants with higher nonverbal rhythm skill were less likely to report gaps erroneously (Fig. 3).
Turning to stimulus-specific effects, we found Speech Duration did not improve model fit, but increases in Breath Duration were significantly associated with higher likelihood of gap detection (Beta = 0.

Interim discussion
Few experiments directly examine the perception of speech breathing, and to our knowledge, none so far have investigated its timing on the sub-second scale. We therefore relied on data from speech production to form our expectation that listeners would be sensitive to violations of the natural breathing time series at the level of 200 ms. We instead found in Experiment 1A that, as a group, participants did not reliably detect gaps (i.e., at >70% accuracy) until the gaps reached approximately 440 ms. The covariate Rhythm Task, which was a measure of participants' nonverbal rhythm processing ability, failed to predict successful gap detection, but it was associated with a lower likelihood of incorrectly reporting gaps. By comparison, we observed a significant benefit of higher Rhythm Task score in Experiment 1B, which forced participants to choose whether they thought a gap had occurred after Breath 1 or after Breath 2. It is possible that participants adopted diverging strategies across the two tasks. In Experiment 1A, for instance, listeners might have relied on their sense of absolute or interval timing when deciding if a gap had occurred. In choosing between two potential gap locations, however, listeners could instead make a relative judgement, in which case the precise duration of the gap need not be known. It is therefore possible that the Rhythm Task, which involves making same-different comparisons between time series, draws on similar cognitive resources to Experiment 1B. Performance was not uniform between the two gap positions, however. Gaps that occurred after Breath 2, the breath that was interjected partway through the utterance, were more often correctly identified than gaps occurring after Breath 1, at the beginning of the utterance. Finally, in addition to the experimental manipulations, we observed that longer values of Breath Duration led participants to report having heard a gap, correctly or not (Experiment 1A).
When asked after which breath a gap occurred, however, only the duration of the second, interjected breath influenced participants (Experiment 1B). We were somewhat surprised to see that increases in Breath 2 Duration resulted in fewer correct responses, regardless of whether the gap occurred after Breath 1 or Breath 2. We speculate that longer Breath 2 sounds may have been distracting, or could have possibly taxed participants' working memory. Unfortunately, because each speech utterance was paired to the same experimental condition across participants, the opportunity to tease apart speech-specific influences is limited in Experiment 1.
As a first exploration into the perception of the speech breathing time series, Experiment 1 provides early indication of the thresholds at which participants perceive an artificially introduced silent gap. We can moreover confirm that listeners' acuity when making judgements about the locations of gaps varies between breaths within an utterance, with stronger performance associated with breaths that interject, in comparison to breaths that initiate, ongoing speech. On the other hand, the role of individual differences in nonverbal rhythm perception was unclear due to task-related differences. The sample (n = 24) is not large, and we did not estimate power a priori, given the exploratory nature of this study. It is nonetheless possible that we were unable to estimate the true underlying relationship between speech breathing gap detection and inter-individual differences in nonverbal rhythm processing. Another consideration is that the naturalistic speech stimuli do not vary across conditions, which is problematic given the observed stimulus-specific effects. Finally, the advantage of Breath 2 in Experiment 1B led us to wonder whether participants were simply better able to identify a gap within speech; that is, to succeed in the task, listeners could ignore the breath sounds and attend to interruptions of speech timing only. Gap Position is therefore confounded with exposure to speech.
To confirm our in-person findings, extend the current results, and address the limitations discussed above, we planned two larger, online studies. Experiment 2 is, like Experiment 1A, a "true" gap detection task, in that we asked participants to report whether or not they heard a gap somewhere in the utterance; however, we introduced multiple gap positions to better understand the contribution of the breath sounds and to disentangle the role of speech exposure. For instance, if breath sounds contribute to listeners' expectations concerning the timing of upcoming speech, we should expect that gap detection is better after, rather than before, a breath sound. If, on the other hand, the advantage of later gap positions is due to speech exposure alone, we should not find a difference between the two interjected-breath gap positions. By comparison, Experiment 3 is a more faithful replication of Experiment 1B, as both share the same trial structure. The key development is that we manipulate the breath sounds themselves in Experiment 3. Specifically, Experiment 3 asks participants to judge after which breath a gap occurs, but the naturally occurring breath sounds have been inverted in half of trials. This allows us to dissociate Gap Position and inherent differences between initial and interjected breath sounds in naturalistic speech. We continue in Section 3 by describing Experiment 2 and its implementation in more detail.

Participants
Participants were recruited using either the professional online subject pool Prolific (Palan & Schitter, 2018), or University College London's internal online undergraduate subject pool. The former group were paid approximately £8 per hour for their time, and the latter were awarded course credits in exchange for participating. A total of 123 participants on Prolific and 76 participants from the university completed the experiment, resulting in a data set of N = 199 (gender was not confirmed; ages 18-35). Participants were prescreened for normal hearing, and all reported English as a primary language. On the basis of quality control, we excluded 15 participants from the Prolific set, and 2 participants from the university set, resulting in a final N = 182.

Gap detection task.
In Experiment 2, we examine silent gap detection using a similar paradigm to Experiment 1A, except that the gap can occur in one of three Gap Positions: After Breath 1, Before Breath 2, or After Breath 2 (Fig. 5). Participants were asked to report whether or not they heard a gap within each breath-speech-breath-speech trial, but the exact positions were not identified, and speech breathing was not mentioned in the task instructions. The experiment followed a 3 (Gap Position: After Breath 1, Before Breath 2, After Breath 2) × 5 (Gap Duration: 200, 325, 450, 575, 700 ms) within-subjects design, with Rhythm Task as a covariate. To address variability in the naturalistic speech stimuli, we shuffled each utterance across the factorial combinations, meaning that participants all heard the same stimuli, but in different experimental conditions. The trial structure is depicted in Fig. 5, Panel A.

Stimuli
To enhance experimental control, we employed a subset of the full corpus of speech excerpts, produced by the two speakers (1 male and 1 female) who happened to produce the highest number of matched read excerpts naturally following the same breath-speech-breath-speech structure. In the case of spontaneous speech, we attempted to pair the excerpts as closely as possible between the speakers. This resulted in a balanced data set of 80 utterances, consisting of 50% reading and 50% spontaneous speech. There were a total of 80 trials, of which 25% truly contained no added silent gap. The remaining 75% of trials were divided evenly among the three possible gap positions. There were 3× the shortest two durations (200 and 325 ms) and 2× the middle duration (450 ms) for every 1× the longest two durations (575 and 700 ms), a ratio of 3:2:1. Trial order was pseudorandom, with seven different order conditions: four for the Prolific group and three for the university group, with a projected 25 participants per order. The speech excerpts were shuffled across experimental conditions by trial order, meaning that a single excerpt would be associated with multiple trials across different experimental conditions, but heard only once per participant.
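As an illustration only (not the authors' materials), the per-condition trial counts implied by these proportions can be reconstructed in a few lines; the 3:3:2:1:1 weighting over the five durations follows from the stated 3:2:1 ratio:

```python
from fractions import Fraction

TOTAL_TRIALS = 80
NO_GAP_SHARE = Fraction(1, 4)   # 25% of trials contained no added gap
POSITIONS = ["After Breath 1", "Before Breath 2", "After Breath 2"]
# 3x the two shortest durations, 2x the middle, 1x the two longest (3:2:1)
DURATION_WEIGHTS = {200: 3, 325: 3, 450: 2, 575: 1, 700: 1}

no_gap = int(TOTAL_TRIALS * NO_GAP_SHARE)                 # 20 catch trials
per_position = (TOTAL_TRIALS - no_gap) // len(POSITIONS)  # 20 per gap position

weight_sum = sum(DURATION_WEIGHTS.values())               # 10
counts = {pos: {dur: per_position * w // weight_sum
                for dur, w in DURATION_WEIGHTS.items()}
          for pos in POSITIONS}

print(no_gap, per_position, counts["After Breath 2"])
```

Under these assumptions, each gap position contributes 6 trials at 200 and 325 ms, 4 at 450 ms, and 2 each at 575 and 700 ms.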

Apparatus
The experimental components and participant experience were, in general, unchanged from Experiment 1, except that the user interface was delivered online using Gorilla experiment builder (Anwyl-Irvine, Massonnié, Flitton, Kirkham, & Evershed, 2020).

Procedure
The components of the study were performed in the following order: 1. Administration of participant information and informed consent; 2. Gap detection task; 3. Nonverbal rhythm discrimination task; 4. Text-based debrief on the aims and background of the study.
3.4.1.1. Gap detection task. At the start of the task, participants were instructed that they would hear clips of speech wherein a brief silent gap may or may not have been added, and that this gap could occur in different locations within the speech. No mention was made of the gap's proximity to the breath sounds, nor of breathing more generally. Participants were given a practice session of 4 trials, consisting of no gap and one of each of the possible gap positions, with feedback on their accuracy. The task was otherwise implemented the same as in Experiment 1A.

Data processing and analysis
The analysis pipeline was in general similar to Experiment 1A. The explanatory variables were Gap Position (After Breath 1/Before Breath 2/After Breath 2), Gap Duration (200/325/450/575/700 ms), and Rhythm Task (continuous) with random intercepts for Participant and Speech Excerpt, as well as random slopes for Gap Duration and Gap Position within Participant. Breath 1 Duration, Breath 2 Duration, Speech 1 Duration, and Speech 2 Duration were also fit as covariates. Signal detection and reaction time analyses were also undertaken. In addition, we performed an exploratory analysis of the speech stimuli, using a data-driven random forest classification of responses to examine the role of linguistic and acoustic factors in gap detection. The details of this analysis are provided in Appendix D.

Accuracy
Descriptive statistics are provided in Table 2. ~4% (count = 475) of reaction times exceeded 5 s, and these responses were removed from further analysis. Accuracy varied by Gap Position, most strikingly in that participants were much more likely to detect gaps that occurred Before Breath 2 (Mean 68%, SD 20%; Beta = 1.73, SE = 0.14, z = 12.07, p < 0.001, R²sp = 0.013) or After Breath 2 (Mean 78%, SD 15%; Beta = 2.18, SE = 0.14, z = 15.28, p < 0.001, R²sp = 0.020), when compared to After Breath 1 (Mean 27%, SD 19%). Bonferroni-corrected post hoc tests between the three levels of Gap Position were significant at nearly every Gap Duration (p ≤ 0.001), such that After Breath 2 was most likely to predict correct answers, followed by Before Breath 2, and then After Breath 1. The exception was the Before versus After Breath 2 contrast, which failed to reach significance at Gap Duration 700 ms (p = 0.35).
Closer examination of Gap Duration revealed a shift to lower gap detection thresholds from gaps placed Before Breath 2 to After Breath 2 (Fig. 6). Specifically, gaps as short as 450 ms yielded a mean accuracy of 93% (SD 6%) when they were placed After Breath 2, with a maximum performance of 96% (SD 4%) at 700 ms. None of the pairwise tests between levels of Gap Duration longer than 450 ms met significance (p ≥ 0.63). By contrast, when gaps occurred Before Breath 2, responses to 450 ms-long gaps were 77% (SD 28%) correct on average. Ceiling performance was not approached until 575 ms (Mean 89%, SD 11%), with a maximum mean accuracy of 91% (SD 7%) at 700 ms. In other words, the perceptual threshold of Gap Duration was lower for gaps placed after, rather than before, the interjected breath. Participants were very unlikely to report a gap that followed Breath 1, even when it was 700 ms long (Mean 38%, SD 18%). The addition of Breath 1 Duration, Speech 1 Duration, and Breath 2 Duration did not improve model fit (Appendix C, Table C1). We did, however, find a global negative effect of Speech 2 Duration on gap detection, regardless of Gap Position. Taking trials without gaps into consideration, signal detection analysis showed that, whereas the mean Hit rate was 0.57 (SD 0.15, Range [0.14, 0.86]), the mean False Alarm rate was just 0.18 (SD 0.18, Range [0.01, 0.88]). Overall, d-prime ranged from a minimum of −0.29 to a maximum of 3.09, with a mean of 1.36 (SD 0.70). 1% of participants (n = 2) produced negative d-prime values, meaning that their False Alarm rate (i.e., incorrectly reported gaps) was higher than their Hit rate (i.e., correctly reported gaps). On the whole, however, it appears that participants adopted a generally conservative strategy when reporting gaps in Experiment 2. As we had also seen in Experiment 1A, Rhythm Task predicted correct answers in trials without gaps (r_s = 0.21, p = 0.03), but not in trials with gaps (r_s = −0.01, p = 0.90; Fig. 7).
Details concerning the modelling and post hoc tests are given in Appendix C, Tables C1, C2, C3, C4.
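For readers unfamiliar with the signal detection measure used here, d′ is the difference between the z-transformed Hit and False Alarm rates. The sketch below is ours, not the authors' analysis code; the input rates are the group means reported for Experiment 2:

```python
from statistics import NormalDist

def d_prime(hit_rate: float, fa_rate: float) -> float:
    """d' = z(Hit) - z(False Alarm), via the inverse normal CDF."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)

# Group-mean rates from Experiment 2 (0.57 hits, 0.18 false alarms).
# Note: the reported mean d' (1.36) is the mean of per-participant d'
# values, which is not the same as d' computed from the mean rates.
group_level = d_prime(0.57, 0.18)
print(round(group_level, 2))

# A negative d' indicates a false-alarm rate exceeding the hit rate.
assert d_prime(0.40, 0.60) < 0 < group_level
```

In practice, hit and false-alarm rates of exactly 0 or 1 would need a correction (e.g., the log-linear adjustment) before the inverse CDF is applied; that step is omitted here for brevity.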

Reaction times
Trials where the gap occurred After Breath 1 produced the slowest reaction times (Mean 1.42 s, SD 0.42 s), followed by Before Breath 2 (Mean 1.30, SD 0.41; Beta = 0.10,

Interim discussion
Experiment 2 built on Experiment 1A by asking participants to detect whether or not a silent gap had been added between a breath sound and speech; however, we found that participants were far more likely to correctly report the gap when it occurred adjacent to the second of the two breaths, perhaps due to a speech entrainment-related enhancement in temporal processing. We moreover found evidence of a strong bias against the initial gap position. It is unclear why participants were so unlikely to report perceiving a gap in that instance, given that participants in Experiment 1A could reliably detect gaps as short as 440 ms that followed initial breaths. If speech entrainment indeed facilitates temporal processing, it may be that the obviousness of gaps inserted later within the speech stream (i.e., after entrainment has begun) had the unintended effect of suppressing awareness of the gaps that preceded speech. Put otherwise, because participants were so confident when recognising gaps occurring in the two later gap positions, they may have overlooked or second-guessed the more difficult After Breath 1 trials. They were unlikely to realise that there were more trials with than without gaps, rather than a 50/50 split, which likely compounded the bias against gaps After Breath 1. On the other hand, we also found that the Before Breath 2 and After Breath 2 gap positions differed from one another. Specifically, gaps following Breath 2 elicited higher rates of detection, and at lower duration thresholds, suggesting that respiratory sounds may indeed contribute to listeners' temporal expectations concerning the onset of speech. Reaction times corroborated these results, indicating that participants responded more slowly even when correctly identifying gaps that occurred Before Breath 2, in comparison to After Breath 2.
Finally, we replicated the pattern of results seen in Experiment 1A concerning Rhythm Task; namely, nonverbal rhythm skills did not predict Hit rates, but they were associated with lower False Alarm rates.
Up to this point in the current study, the natural locations of the breath sounds within the speech stimuli have been preserved. This means that the placement of gaps is hitherto confounded with potential breath-specific latent factors. For example, in our corpus, and as reported more widely in the literature, interjected breaths tend to be shorter than breaths that initiate utterances (Winkworth et al., 1994, 1995). Although we did not find an effect of either Breath 1 or Breath 2 Duration in Experiment 2, the latter was associated with poorer performance in Experiment 1B, which instead asked participants where the gap occurred. In planning Experiment 3, we therefore determined to manipulate the breath sounds themselves directly. A relatively naturalistic change is to simply re-arrange the natural breath sounds, such that the initial breath would instead be heard partway through the utterance, and the formerly interjected breath would then take place at the beginning of the utterance. This step disentangles gap position from the unique characteristics of the natural breath sounds. We wondered whether properties of the respiratory sounds would have any bearing on listeners' perception of ensuing speech; in particular, whether their ability to correctly identify the gap position would change depending on whether the speaker's natural breathing was preserved or manipulated. Although the exploratory acoustic analysis of Experiment 2 (Appendix D) did not indicate that acoustic properties of the breath sounds played a strong role in gap detection accuracy, inverting the breath sounds provides the opportunity to explore this possibility further.
Another question concerns the role of nonverbal rhythm perception skills. So far, we have observed that participants with higher Rhythm Task scores produce lower False Alarm rates during gap detection, and are also better at determining where a gap has occurred when given the choice between possible locations. It is possible, however, that the benefit of Rhythm Task merely reflects superior auditory processing, or even forced-choice judgement, rather than timing or rhythm perception per se. We therefore developed a new task that could both serve as a control for the nonverbal rhythm discrimination task and help ascertain the possible influence of conscious awareness of switched breath sounds. The solution was a switched breath recognition task. Briefly summarised, participants were presented with speech stimuli, half of which contained switched breaths, while the remaining half consisted of the unaltered utterances. The participant's objective was to report whether or not the breaths had been switched.
In planning Experiment 3 (Fig. 5, Panel B), we chose the same paradigm as in Experiment 1B, which asked participants to locate, rather than detect, a gap between breath sounds and speech. We were firstly motivated to replicate our in-person findings in a larger, online sample. The second reason was that the overall effect of Rhythm Task was stronger in Experiment 1B than in the other experiments, and we wished to confirm whether this relationship could be attributed to task-specific demands, rather than a spurious correlation arising from the small sample size. Finally, we found no effect of any breath duration in Experiment 2, but Breath 2 Duration was associated with poorer accuracy in Experiment 1B. It therefore seemed reasonable to select the paradigm that had, so far, elicited the strongest breath sound-specific effects. This final experiment is more exploratory than the others, and we formed three hypotheses, the first being the null hypothesis that altering the natural order of the breath sounds will not affect gap detection accuracy. The second hypothesis is that participants will in fact be more likely to correctly locate a gap if the breath sounds are manipulated, possibly due to the stimulus' unnaturalness, leading in turn to conscious awareness and/or heightened attention. The third hypothesis is that accuracy will be degraded in manipulated breath sound trials, potentially because of misleading information about upcoming speech timing conveyed by the breath sounds themselves.

Participants
Participants were recruited using Prolific (Palan & Schitter, 2018) and were paid £8 per hour for their time. A total of 109 participants completed Experiment 3 (ages 18-35). Participants were screened in the same manner as Experiment 2, and had not taken part in our earlier experiments. On the basis of data quality measures, we were forced to exclude 7 participants, resulting in a final N = 102.

Design
4.1.2.1. Gap detection task. Experiment 3 was similar to Experiment 1B, in that participants were asked to choose which of two gap positions (After Breath 1 / After Breath 2) contained a silent gap. To explore breath-specific effects, the natural initial and interjected breath sounds were switched in half of the trials. This allowed us to test whether unique characteristics of natural speech-related breathing would moderate successful gap detection. Experiment 3 was designed as a 2 (Switched Breath: Yes, No) × 2 (Gap Position: Breath 1, Breath 2) × 3 (Gap Duration: 200, 450, 700 ms) within-subjects design. The primary covariate terms were Rhythm Task, as well as score correct in the switched breath recognition task (Switched Breath Task), but we also fit terms for Breath 1 and Breath 2 Duration (i.e., the durations of the first and second breaths heard by the participant), and Speech 1 and Speech 2 Duration.

Switched breath recognition task.
To determine whether conscious awareness of switched breaths exerted an influence on gap detection performance, we administered a short two-alternative forced-choice task wherein participants answered whether they thought the breaths had been switched. To make the task as easy as possible, we chose, for each speaker, the 4 utterances in the corpus whose two breaths contrasted most in duration, resulting in 16 trials in total, 8 of which contained truly switched breaths. Participants were distributed across two pseudorandom trial orders, with the switched and non-switched versions of each utterance counterbalanced across trial orders.

Stimuli
From the full corpus of four speakers, we selected, for each speaker, the twenty utterances with the greatest contrast in duration between the first and second breath. Within this subset, the mean duration was 705 ms (SD 295 ms) for initial breaths and 371 ms (SD 108 ms) for interjected breaths. Although we did not explicitly balance the three speech types, there was nonetheless good representation of each: 38.75% Articles Reading, 28.75% Poems Reading, and 32.5% Spontaneous Speech. The stimuli were generated using the same audio processing pipeline as in Experiments 1 and 2. 50% of trials belonged to the Switched Breath condition, and the remaining 50% were unaltered in this respect. Gap position and duration were balanced across the 80 utterances, with the exception of the 700 ms duration condition, which occurred half as often as the other two duration values. This approach was chosen given the relatively high accuracy for longer gaps in Experiment 1B. Trial order was pseudorandom, with eight different order conditions across participants. The speech excerpts were shuffled across experimental conditions by trial order, ensuring each utterance appeared across multiple conditions, but was heard only once per participant.

Apparatus
The experiment was again run via the Gorilla online testing platform (Anwyl-Irvine et al., 2020), using the same participant interface as in Experiment 2.

Procedure
The components of the study were performed in the following order: 1. Administration of information sheet and informed consent; 2. Gap detection task; 3. Switched breath recognition task; 4. Nonverbal rhythm discrimination task; 5. Text-based debrief on the aims and background of the study. 4.1.5.1. Gap detection task. The task instructions were the same as in Experiment 1B, and the user interface was essentially the same as in Experiment 2. Although we did not describe or refer to any switching of natural breath sounds in the task instructions, the practice session contained trials from both switched and non-switched conditions. 4.1.5.2. Switched breath recognition task. We implemented this new task similarly to the other components in the experiment, with the same inter-trial timing and visual cues prompting participants at the beginning of trials. Participants were explicitly informed that, in some trials, the breath at the beginning and the breath in the middle of the speech had been switched around. Their job was to guess whether the breaths were switched in the speech they heard. Due to time constraints, there was no practice session with feedback for this task.
4.1.5.3. Nonverbal rhythm discrimination task. The task procedure was the same as in the previous experiments.

Data processing and analysis
Analysis was performed using the same pipeline, in general, as in Experiments 1 and 2. The factorial explanatory variables of interest were Switched Breath (Yes/No), Gap Position (Breath 1/Breath 2), Gap Duration (200/450/700 ms). The covariate terms were Rhythm Task, Switched Breath Task, Breath 1 Duration, Breath 2 Duration, Speech 1 Duration, and Speech 2 Duration. There were random intercepts for Participant and Speech Excerpt, and random slopes for Gap Duration and Gap Position within Participant.

Accuracy
Descriptive statistics are provided in Table 3. ~5% (count = 380) of reaction times exceeded 5 s and were removed from further analysis. We found that the inclusion of the Switched Breath term did not improve the model fit (Appendix E, Table E1), nor did we find any correlation between participants' ability to discern switched breaths and their gap detection (r_s = 0.05, p = 0.64; Fig. 10). There was, however, a main effect of Rhythm Task (Beta = 0.33, SE = 0.11, z = 2.86, p = 0.004, R²sp = 0.008). Summarising by median-split group, low Rhythm Task scores were associated with a mean accuracy of 79% (SD 41%), and high scores with 84% (SD 37%). Similar to our findings from the in-person Experiment 1B, gaps following Breath 2 were more likely to be correctly identified (Mean 83%, SD 38%) than those that occurred after Breath 1 (Mean 77%, SD 42%), even after controlling for the duration of the breath and speech segments (Beta = 0.57, SE = 0.17, z = 3.47, p < 0.001, R²sp = 0.002; Fig. 9). Unlike Experiment 1B, however, we found that increasing Gap Duration predicted higher accuracy, regardless of Gap Position, and that all pairwise contrasts between the factorial predictors were significant (p ≤ 0.04). Hence, the current data are, in short, less variable, and we see a comparatively subtler difference driven by the placement of the gap. Closer inspection of the complex interactions between Gap Position and the speech-specific covariates showed that an increase in Breath 1 Duration was associated with improved accuracy, but only when the gap followed Breath 1 (Trend = 0.0008, SE = 0.0003, z = 2.43, p = 0.03; Fig. 11). Similarly, we found a positive correlation between Breath 2 Duration and accuracy when the gap occurred after Breath 2 (Trend = 0.002, SE = 0.0004, z = 5.19, p < 0.001); on the other hand, when the gap was in fact after Breath 1, Breath 2 Duration had a negative influence on accuracy (Trend = −0.003, SE = 0.0003, z = −11.13, p < 0.001).
Speech 1 Duration was associated with correct answers, but only when Breath 2 Duration <400 ms, and the gap followed Breath 2 (Trend = 0.22, SE = 0.08, z = 2.65, p = 0.008). None of the other post hoc covariate terms reached statistical significance (Appendix E, Table E4).

Interim discussion
In Experiment 3, we confirmed the association between Rhythm Task and the ability to locate a gap in the speech breathing time series, which we had first observed in Experiment 1B. Given that Rhythm Task only predicted False Alarm rates in Experiments 1A and 2, this suggests that the role of nonverbal rhythm sensitivity may indeed be task-specific, although the correlation was weaker in Experiment 3 (r s = 0.30, p = 0.003) in comparison to Experiment 1B (r s = 0.59, p = 0.003). This difference may be attributable to the smaller sample size of Experiment 1B, or the demographic character of our in-person versus online samples, the latter of which had a slightly higher Rhythm Task score (Mean 82%, SD 12%) than in the in-person group (Mean 78%, SD 14%). Similarly, participants in Experiment 3 were less troubled by gaps that occurred after Breath 1, although the results followed the same overall pattern as in Experiment 1B, with greater accuracy for gaps following Breath 2. Because we switched true initial and interjected breaths across our stimuli, we were moreover able to experimentally decouple breath duration from Gap Position, and show that the latter was not entirely dependent on differences between initial and interjected breaths.
The influences of Breath 1 and especially Breath 2 Duration were, nonetheless, appreciable, along with their interactions with Gap Position. Specifically, we saw that an increase in breath duration was associated with a greater propensity for participants to choose that breath as the gap's location. In particular, longer Breath 2 Durations appear to have dissuaded participants from correctly identifying gaps occurring after Breath 1. As a factor, however, Switched Breath status did not significantly explain any variability above and beyond Breath 1 and Breath 2 Duration. These results do not support our second hypothesis, which was that listeners would be more likely to correctly locate the gap in Switched Breath trials, perhaps having been alerted by the unnaturalness of switched breath sounds. Indeed, despite reasonable group performance in the switched breath recognition task (Mean 66%, SD 17%), participants' ability to explicitly identify whether or not breaths had been switched did not at all correspond with their success in the gap detection task. Given that we deliberately made the switched breath recognition task as easy as possible, this result may indicate that the identification of gaps indeed draws upon temporal acuity (as measured in the Rhythm Task), rather than auditory processing skills more generally. In any case, although we did not find definitive evidence for the alternative hypotheses that Switched Breath trials would confer either an advantage or a disadvantage when identifying gap positions, we can establish that the duration of breath sounds affects listeners' impression of the timing of speech onsets.

Gap position and duration
The primary motivation of this study was to determine whether listeners impose strict temporal expectations upon the speech breathing time series. We explored this question using a more implicit, classic gap detection paradigm (Experiments 1A and 2), as well as an explicit gap location paradigm (Experiments 1B and 3). Across the three experiments (total n = 308), we report consistent results concerning the effect of gap position and its mediation of gap duration; that is, listeners are sensitive to sub-second perturbations in the speech and speech breathing time series, but this sensitivity is not uniformly distributed across time. Rather, short silent gaps are more likely to be detected and identified when they occur following an interjected breath, rather than an initial breath. But we also found greater accuracy for gaps occurring after, rather than just before, interjected breaths (Experiment 2). In other words, the amount of prior speech exposure alone does not explain the perceptual benefits associated with later gaps. Although, as we discuss in the following section, participants were also influenced by speech stimulus-specific characteristics, on the whole, the effect of gap position held when controlling for natural differences between initial and interjected breaths, as well as the surrounding speech context. The gap's position also determined at what duration threshold it would be noticed. In the constrained paradigm used in Experiments 1B and 3, participants were able to locate gaps as short as 200 ms, especially when the gaps followed the interjected breath. Experiments 1A and 2, however, provided a more nuanced look at gap duration. For instance, when the gap followed the interjected breath in Experiment 2, there was a large jump in accuracy from about 55% for gap duration 200 ms, to over 75% at 325 ms. Ceiling performance, >90% correct, was reached by 450 ms. 
By comparison, ceiling performance was lower and not achieved until gap duration 575 ms when the gap occurred before the interjected breath. Surprisingly, gap duration was relatively ineffective in trials where the gap occurred after the initial breath in Experiment 2, with participants displaying a strong bias against gaps even as long as 700 ms. This stands in sharp contrast to Experiment 1A, where the gap could only ever follow Breath 1, in which case, participants reached a 70% detection rate by about 440 ms. A major difference between Experiment 2 and the other experiments is that we did not explicitly describe or direct participants' attention to breath sounds in the task instructions; hence, the gap duration at which participants can reliably detect or identify gaps may scale with attention and their expectations as listeners.
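The duration thresholds discussed above can be summarised by interpolating the psychometric data at a criterion accuracy. The sketch below is illustrative only: the accuracy values are the approximate group means quoted above, not the raw data, and the linear interpolation scheme is our assumption rather than the analysis used in the paper:

```python
def duration_threshold(durations, accuracies, criterion):
    """Linearly interpolate the gap duration (ms) at which accuracy
    first reaches the criterion; None if it is never reached."""
    if accuracies[0] >= criterion:
        return durations[0]
    points = list(zip(durations, accuracies))
    for (d0, a0), (d1, a1) in zip(points, points[1:]):
        if a0 < criterion <= a1:
            return d0 + (criterion - a0) / (a1 - a0) * (d1 - d0)
    return None

durs = [200, 325, 450, 575, 700]                 # Gap Durations (ms)
after_breath_2 = [0.55, 0.75, 0.93, 0.95, 0.96]  # approx. Exp. 2 means

print(duration_threshold(durs, after_breath_2, 0.90))
```

With these illustrative values, a 90%-correct criterion is crossed between 325 and 450 ms for the After Breath 2 condition, consistent with the ceiling performance reached by 450 ms described above.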

Nonverbal rhythm sensitivity
We also observed that individual differences in nonverbal rhythmic auditory processing affected performance in Experiments 1B and 3, wherein listeners were forced to choose between gap positions. Additionally, Rhythm Task in Experiment 3 was associated with faster reaction times, providing indirect evidence that participants with enhanced rhythm processing skills may have also been more confident when judging where the gap occurred. Individual differences in awareness of manipulated breath sounds were not correlated with gap location performance, suggesting a specific role for sensitivity to rhythm, and not enhanced auditory processing more generally. In Experiments 1A and 2, in which participants were asked to report the presence of a gap, participants with higher Rhythm Task scores were less likely to incorrectly report having heard a gap in trials when there truly was no gap, meaning that they performed the gap detection task with greater specificity. Their hit rate, however, was not improved as a function of sensitivity to nonverbal rhythm. We can again speculate that the nature of the task and listeners' beliefs may underlie the changing influence of nonverbal rhythm perception skills between experiments, a result that speaks to the idea of rhythm processing as a form of active sensing (Morillon, Hackett, Kajikawa, & Schroeder, 2015). Namely, the forced choice between possible locations of the gap may have pushed listeners to adopt an active listening strategy more similar to that used in the Rhythm Task. Specifically, the participants may have taken on a listening approach optimised to facilitate the later comparison between events in a time series. By contrast, in Experiments 1A and 2, participants could have approached gap detection differently; for instance, they may have relied on their sense of duration or interval timing, which is behaviourally and neurally distinct from event timing (Teki, Grube, Kumar, & Griffiths, 2011; Tierney & Kraus, 2015).
Put differently, participants paid attention to how long, rather than when, a gap happened. There are possible implications here for daily life, in that optimal listening strategies probably also vary across different social contexts. For example, during spontaneous conversation, we may be sensitive to the timing of our partner's breathing to facilitate turn-taking and conversational flow (Rochet-Capellan & Fuchs, 2014;Torreira et al., 2015). In other situations (e.g., audiobook listening), wherein respiratory cues need not be acted upon, we do not need to engage with breath sounds in the same way. In any case, we can qualify our results concerning Rhythm Task, and suggest that there is indeed a specific relationship between nonverbal rhythm perception and speech breathing gap detection, but this correspondence is not universal across listening conditions.

Speech entrainment
The current work contributes new evidence that exposure to speech modulates temporal processing (Bosker & Ghitza, 2018; Kösem et al., 2018), by demonstrating that listeners were substantially more likely to both detect and locate gaps that occurred within an ongoing speech stream, rather than before speech begins. We ensured that the breath sounds had a standardised intensity and even manipulated them directly in Experiment 3, in addition to visually priming participants at the onset of each trial. As such, the effect of gap position cannot be attributed to breath sound-specific effects (e.g., louder interjected than initial breath sounds), nor to subjects' having been caught unawares by the first breath sound in the case of Experiment 2. Although this is an interesting finding in the context of speech perception generally, a primary question motivating our study concerned the anticipatory temporal relationship between speech breathing and speech production. Given the relevance of breath sounds to social interaction (Rochet-Capellan & Fuchs, 2014; Włodarczak, Heldner, & Edlund, 2015), does a listener's perception of a speaker's respiratory activity help them form predictions about forthcoming speech timing? Experiment 1A provides preliminary evidence that listeners require just a few hundred milliseconds of silence before they are sure of a gap's presence, even before having heard any speech. Moreover, our results in Experiment 2 show that gaps occurring after, rather than before, a breath sound are detected at a lower duration threshold, and that listeners also respond comparatively more quickly to gaps that follow the breath. If breath sounds did not interact with ongoing speech entrainment processes, we would not necessarily expect to see this asymmetry. Taking these findings into account, what is it that breath sounds contribute to temporal expectations?
Research into visuomotor synchronisation suggests that humans are highly sensitive to changes in velocity when perceiving biological motion (Su, 2014; Varlet et al., 2014). Similarly, the moment of peak velocity of the stimulus envelope has been shown to be the primary target during sensorimotor synchronisation to speech (Rathcke, Lin, Falk, & Bella, 2021; Scott, 1998). In short, moments of rapid acceleration within a given signal may provide a relatively discrete cue or anchor by which listeners can anticipate future events. When measured using inductance plethysmography, the kinematic profile of speech-related inhalation can be characterised as having a relatively shallow early slope, before quickly rising to terminate in a sharp peak as speech begins. To our knowledge, the acoustic correlates of this speech breathing "shape" are not established, but it seems intuitively likely that breath sounds convey kinematic information about speech preparation, thereby helping listeners to form temporal predictions. Some support for this idea can be taken from the switched breath recognition task (Experiment 3). Although it was primarily used as a control in the current study, performance in the task was above chance on average (61%), and some participants achieved a perfect or near-perfect score. If listeners can detect manipulations of the natural breath-utterance pairings, unique properties of breath sounds may well carry some meaning concerning forthcoming speech, even though switched breath recognition was not associated with judgement of gap position in that experiment. Future work could explore this idea further by replacing natural breath sounds with spectrally matched noise or time-reversed breaths, for example.

Gap detection within and between utterances
Although our focus was on listeners' perception of the speech breathing time series, the data may also be of interest to research on pause perception, an area that explores sensitivity to interruptions in speech timing more generally. Previous work investigating silent pauses inserted between words suggests above-chance detection rates at ~200 ms gap duration (Lovgren & Doorn, 2005; Warner, Whalen, Harel, & Jackson, 2022). Given that naturally occurring gaps, if present at all, normally last for just tens of milliseconds in connected speech, this approximate threshold seems plausible, with the caveat that the studies by Lovgren and Doorn (2005) and Warner et al. (2022) each used just a small number of tightly controlled speech excerpts. In any case, the threshold for interruptions at speech breathing boundaries seems to be about twice this duration (~400 ms). We speculate that, whereas within-speech gap detection may operate at the faster, syllabic time scale (e.g., ~4-8 Hz), speech breathing gap detection could instead be more comparable to stressed syllable timing (e.g., ~2-3 Hz). Cross-linguistic studies could investigate this possibility further, for example, by comparing speech breathing gap detection between speakers of languages known for their differing rhythmic structure.
Indeed, multiple cues beyond the absolute duration of silences appear to influence listeners when asked to segment speech or detect pauses in real time (Cole, Mo, & Baek, 2010; Duez, 1985, 1993; Lundholm Fors, 2015). These cues can include intelligibility (Butcher, 1981; Duez, 1985), prosodic cues such as vowel lengthening (Duez, 1993), and syntactical constraints, in that listeners may be primed to expect a pause at grammatical junctures (Cole et al., 2010). Moreover, evidence from cross-linguistic studies suggests that listeners of unfamiliar foreign speech are more objective than native listeners when estimating the length of pauses, at least for English listeners of Italian (Chiappetta, Monti, & O'Connell, 1987). Finally, the rapidity with which participants categorise pauses may also be influenced by online lexical processing, such that reaction times to pauses in short utterances are slower when the pause follows a plausible word, in comparison to a nonsense word (Mattys & Clark, 2002). In sum, temporal expectation in pause perception is subject to both acoustic and linguistic influences. With regards to speech breathing, we observed more mixed effects of the speech and breath sounds. Specifically, in Experiments 1A and 3, increasing breath duration was associated with an increased likelihood of listeners reporting or identifying a gap, correctly or otherwise. We did not, however, find strong modulation of performance by breath duration in Experiment 2, and longer interjected breaths in Experiment 1B tended to make participants perform worse, no matter where the gap was.
There was also a generally deleterious effect of longer speech following both breath sounds in Experiment 2. Given that Speech 2 by definition occurred after the presentation of any gap, this finding could potentially be linked to cognitive load. Cognitive load is typically studied with dual-task paradigms, which the current study did not incorporate. On the other hand, speech perception is itself an active, cognitive process, rather than passive and automatic (Heald & Nusbaum, 2014). It is therefore possible that, if making temporal judgements draws upon the same attending resources as ongoing speech perception, increasing the amount of speech participants heard could have inhibited their retrospective awareness of gaps. In this case, the effect of Speech 2 Duration may be related to previous work showing that higher levels of cognitive load interfere with listeners' ability to make fine acoustic-phonetic judgements, such as vowel duration estimation (Chiu, Rakusen, & Mattys, 2019; Mattys, Barden, & Samuel, 2014). Bosker, Reinisch, and Sjerps (2017) reported that increasing cognitive load led participants to overestimate the preceding speech rate and subjectively perceive a subsequent ambiguous word as longer. Hence, if increasing Speech 2 Duration did in fact introduce additional cognitive load, it could be that participants' recollection of the gap "shrunk", along with their sense of the timing of speech more generally. An alternative, but related, explanation for how Speech 2 Duration could have affected gap perception is based on working memory constraints: Teki and Griffiths (2014) found in a nonverbal temporal memory task that accuracy decayed with an increasing number of sequential intervals to be memorised. The relationship between cognitive load and working memory is beyond the scope of the present discussion, but our incidental finding here suggests that gap detection tasks present a potentially useful paradigm in which to explore these ideas further.

Limitations
Having used naturalistic speech stimuli, we forfeited a level of control, which we addressed via counter-balancing across experimental conditions in large sample sizes, as well as by incorporating terms such as breath duration in the modelling. We moreover interrogated speaker and other speech-specific predictors in the exploratory analyses (Appendix D), finding that, although present, the influence of unique characteristics of the speech excerpts did not in general distort our experimental results. Future work should incorporate stimuli produced more systematically to interrogate these features. For example, Werner, Fuchs, Trouvain, and Möbius (2021) found that utterance-interjected breath sounds tend to have higher intensity values than utterance-initial breath sounds. We standardised the intensity of breath sounds when producing the stimuli, and so may have artificially subdued qualities of the breath sounds that would otherwise act as cues. With regard to syntactical aspects of the speech, we did not perform an in-depth linguistic analysis and cannot speak to these factors, such as the likely interaction between grammar and gap detection. We did employ automatically calculated measures of pace and regularity in speech timing in the exploratory analysis, but saw no obvious effect of these terms. That said, algorithmically obtained parameters may not have been precise enough to reflect potentially subtle effects of speech timing (MacIntyre, Cai, & Scott, 2022). In light of the pause perception literature covered earlier, we expect that systematically varying the timing of breath sounds relative to local grammatical context could influence listeners' awareness of artificially imposed silent gaps, although breathing and grammar are not easily decoupled in naturalistic speech (Grosjean & Collins, 1979; Henderson et al., 1965; Winkworth et al., 1994).
Another limitation is that we only tested English speech on English primary language listeners, so the impact of intelligibility and other cross-linguistic factors remains unknown. Finally, there is, of course, the artificial nature of the gap detection tasks themselves, which required participants to scrutinise speech in a manner unlikely to arise outside of a laboratory. Conversational speech is often conducted in live, face-to-face settings, but visual and other multimodal aspects of speech breathing are not addressed in the present experiments. Future studies could investigate how the visual and auditory percepts of breathing interact by experimentally decoupling them.

Conclusion
Speech breathing is not well explored by the current speech perception literature. Yet behavioural studies of social interaction (McFarland, 2001; Rochet-Capellan & Fuchs, 2014; Torreira et al., 2015) imply a facilitating role, and perhaps even an adaptive advantage, for sensitivity to speech breathing. Taken together, results in the present study indicate that the verbal acoustic signal alone does not comprise the "speech itself", but rather can be understood as part of an integrated percept that also includes nonverbal traces. Hence, instead of scrubbing the breath sounds from naturalistic recordings, or neglecting to account for respiration in synthesized stimuli, researchers might consider an active role for speech breathing in their own experimental work. As one hypothetical example, the distinctive sound of speech-related inhalations could serve as a rhythmic "phase reset", thereby readying listeners to entrain to an upcoming speech stream. In that case, altering the placement of breath sounds relative to speech may afford a unique opportunity to manipulate the perception of speech rhythm with a greater degree of naturalism than is normally possible, leading to new avenues in entrainment research. Indeed, engaging directly with speech breathing could open the door to generating new hypotheses using many paradigms already familiar to speech science, as well as help build towards a more naturalistic and complete understanding of speech perception in the wild. But beyond speech, the current work underscores the dynamic nature of auditory attention, and in particular the potential for embodied traces to exert a persuasive influence on temporal perception more generally. If confirmed in future work, these results may be of consequence for clinical applications, such as in the development of smart hearing devices that are sensitive to speech-related respiration.
After all, if humans are specially attuned to social signals, few signs affirm the presence, or communicative intent, of another so well as breathing does.

Data availability
Data associated with the gap detection tasks are available to download from https://osf.io/3ecj7/. Data from the other tasks, the stimuli, and the scripts used to generate stimuli and run the analyses are available by request from the corresponding author.

Funding
Funding was provided by University College London Graduate and Overseas Research Scholarships to Alexis Deighton MacIntyre.

Declarations of interest
None.

D.1.1. Introduction
The speech stimuli employed in the current study were naturally produced by non-professional speakers reading from a variety of texts and speaking spontaneously. The reason for including a variety of speaking styles was to enhance the generalisability of our results; however, we wondered whether some acoustic or linguistic aspects of the stimuli may have influenced accuracy in the gap detection task. We therefore extracted a set of acoustic features to explore this question further, using data from Experiment 2, which employed the most naturalistic paradigm in the current study (participants were not informed of the relevance of breathing during task instructions). First, we estimated acoustic features from the breath sounds, the speech, and short segments (250 ms) of the speech immediately adjacent to (i.e., before and/or after) the breath sounds, which we term "Post-Breath 1", "Pre-Breath 2", etc. The extracted acoustic features included root mean square, spectral centroid, and F0 values. In addition, we calculated duration values of Breath 1 and Breath 2, as well as of Speech 1 and Speech 2, which refer to the complete vocalisations following Breath 1 and Breath 2, respectively. Finally, in the case of the speech, we generated the amplitude envelope following the approach described by Oganian and Chang (2019) and MacIntyre et al. (2022). From this envelope, we used the peaks in its first derivative to estimate vocalic onsets, and took the intervocalic onset as an approximation of the syllabic unit (MacIntyre et al., 2022; Oganian & Chang, 2019). This time series allowed us to derive the mean and coefficient of variation of inter-event intervals as metrics of speech timing. We performed this process twice, varying the stringency of the peak-finding algorithm (islocalmax in MATLAB) by adjusting the threshold of peak prominence, in order to estimate the syllable- and stressed syllable-levels of timing, respectively.
Finally, the estimated counts of syllables and stressed syllables were also divided by the durations of Speech 1/Speech 2 to produce measures of speech rate. Gap Position, Gap Duration, Rhythm Task score, Speech Type (Articles/Poems/Spontaneous), and Speaker Identity were also included as predictors.
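The envelope-based timing measures described above were computed in MATLAB; purely as an illustrative sketch, the following Python code approximates the same logic under stated assumptions: a generic analytic-signal envelope with a 10 Hz low-pass stands in for the Oganian and Chang (2019) envelope, and scipy.signal.find_peaks with a prominence threshold stands in for islocalmax. Onset events are taken from peaks in the envelope's first derivative, from which the mean and coefficient of variation of inter-event intervals and an event rate are derived.

```python
import numpy as np
from scipy.signal import butter, filtfilt, find_peaks, hilbert

def amplitude_envelope(signal, fs, cutoff=10.0):
    """Broad amplitude envelope: analytic-signal magnitude, low-passed at ~10 Hz.
    (A simplified stand-in for the envelope method used in the paper.)"""
    b, a = butter(2, cutoff / (fs / 2), btype="low")
    return filtfilt(b, a, np.abs(hilbert(signal)))

def timing_metrics(signal, fs, prominence):
    """Estimate onset events from peaks in the envelope's first derivative,
    then summarise inter-event interval timing. A stricter `prominence`
    approximates the stressed-syllable level; a looser one, the syllable level."""
    env = amplitude_envelope(signal, fs)
    rate_of_change = np.diff(env) * fs            # first derivative (units/s)
    peaks, _ = find_peaks(rate_of_change, prominence=prominence)
    onsets = peaks / fs                           # event times in seconds
    intervals = np.diff(onsets)                   # inter-event intervals
    return {
        "mean_interval": float(np.mean(intervals)),
        "cv_interval": float(np.std(intervals) / np.mean(intervals)),
        "rate": len(onsets) / (len(signal) / fs),  # events per second
    }

# Demo on a synthetic "speech-like" signal: a 150 Hz carrier with
# five amplitude bursts per second (roughly syllable-rate modulation).
fs = 16000
t = np.arange(0, 2.0, 1 / fs)
carrier = np.sin(2 * np.pi * 150 * t)
bursts = 0.5 * (1 + np.sin(2 * np.pi * 5 * t - np.pi / 2))
metrics = timing_metrics(carrier * bursts, fs, prominence=1.0)
print(metrics)
```

For the 5 Hz modulated demo signal, the recovered event rate sits near five events per second with low interval variability; on real speech the two prominence settings would yield the syllable- and stressed syllable-level statistics described above.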

D.1.2. Feature selection with random forest classification
With this many potential predictors, there is a strong danger of over-parameterisation, which may lead to unreliable estimates and overfitting, especially where there is likely to be collinearity between features (Harrison et al., 2018; Zuur, Ieno, & Elphick, 2010). We therefore performed a feature selection analysis with a random forest classifier using the caret package in R (Kuhn, 2008; Kuhn & Johnson, 2019). A random forest is an ensemble machine learning method consisting of many decision trees (hierarchies of logical nodes) that classify new observations by majority vote. During training, each decision tree applies a subset of randomly selected predictors to uniquely bootstrapped samples, making this technique highly robust to over-fitting (Biau & Scornet, 2016). Random forests are also sensitive to interactions between variables (provided they are of a sufficient magnitude; see discussion in Darst, Malecki, & Engelman, 2018; Inglis, Parnell, & Hurley, 2022). The random forest was trained using 5-fold cross-validation repeated 10 times, with 10% of the data held out for testing. The class to be predicted was "correct" or "incorrect" response, and we also included data from trials where no gap had occurred (total number of observations: 15,022).
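The analysis itself was run in R with caret; purely as an illustration of the validation scheme just described (10% hold-out, 5-fold cross-validation repeated 10 times, variable importance extraction), a hypothetical scikit-learn analogue might look like the following. The data here are simulated stand-ins, not the study's trial data, with two informative predictors and one noise predictor.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (RepeatedStratifiedKFold,
                                     cross_val_score, train_test_split)

rng = np.random.default_rng(0)

# Toy stand-in data: columns 0 and 1 are informative, column 2 is noise.
n = 1000
X = rng.normal(size=(n, 3))
y = ((X[:, 0] + X[:, 1] + rng.normal(size=n)) > 0).astype(int)  # correct/incorrect

# 10% of observations held out for testing, as in the reported analysis
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)

# 5-fold cross-validation repeated 10 times on the training set
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
cv_acc = cross_val_score(clf, X_train, y_train, cv=cv, scoring="accuracy")

clf.fit(X_train, y_train)
test_acc = clf.score(X_test, y_test)          # held-out accuracy
importances = clf.feature_importances_        # analogue of caret's varImp weights
print(round(cv_acc.mean(), 3), round(test_acc, 3), importances)
```

In this sketch, as in the reported analysis, the importance weights can then be ranked and thresholded to decide which predictors to carry forward into subsequent models.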
The classifier accuracy for test data was 75% (95% CI [73%, 77%]), and this overall accuracy was significantly greater than the no-information rate, that is, the proportion of the majority class, "correct" (62%; p < 0.001). Having established that the model performed reasonably well, we extracted the variable importance weights (Kuhn, 2012), with the complete list of weights given in Appendix D, Table D3. As can be seen, Gap Position, Rhythm Task, and Gap Duration are weighted most highly, with a substantial drop-off from Gap Duration (weight 39.03) to the highest-ranked acoustic feature, which was Speech 2 Duration (weight 6.11). No single acoustic predictor has a comparatively large weight, and the decision of which ones to retain was therefore somewhat arbitrary. As 29/36 (81%) of the variables were awarded a weight <3.00, we opted to take forward the predictors with weight ≥3.00, each of which was duration-based: Speech 2 Duration (weight 6.11), Breath 2 Duration (weight 4.75), and Speech 1 Duration (weight 3.43). Breath 1 Duration, though not weighted as highly, was also included for completeness. These speech excerpt-specific, continuous predictors were fit as fixed effect terms in the modelling throughout the current study.

Table E4
The results of post hoc contrasts using estimated marginal means for Breath 1 Duration, Breath 2 Duration, Speech 1 Duration, and Speech 2 Duration with Gap Position in Experiment 3. P-values are adjusted using the Bonferroni method.