Tapping into linguistic rhythm

Tamara Rathcke1,2,3, Chia-Yuan Lin3, Simone Falk4,5 and Simone Dalla Bella5,6,7,8
1 Department of Linguistics, University of Konstanz, DE
2 MARCS Institute for Brain, Behavior and Development, Western Sydney University, AU
3 English Language and Linguistics, University of Kent, UK
4 Department of Linguistics and Translation, University of Montreal, CA
5 International Laboratory for Brain, Music and Sound Research (BRAMS), Canada
6 Department of Psychology, University of Montreal, Canada
7 Centre for Research on Brain, Language and Music, Montreal, Canada
8 Department of Cognitive Psychology, University of Economics and Human Sciences in Warsaw, Warsaw, Poland


Introduction
Is language rhythmic? For decades, this seemingly simple but profoundly important question, which connects language with other aspects of human cognition, has been a matter of controversy (Cummins, 2012; Roach, 1982). Early accounts of linguistic rhythm suggest that it relies on some acoustic isochrony in spoken language, either at the level of syllables or inter-stress intervals (Abercrombie, 1967). However, this idea did not stand up to acoustic measurement, as no temporal analyses of speech have ever provided any evidence for an isochrony-based rhythm (Dauer, 1983; Fowler & Tassinary, 1981; Pointon, 1980; Roach, 1982; Uldall, 1971; van Santen & Shih, 2000). The failure to find isochrony in speech has led to the development of alternative approaches. One of the most prominent proposals has suggested that linguistic rhythm may be ascribed to the durational variability present in consonantal and vocalic intervals, with languages being more or less variable and thus sounding rhythmically different (Dellwo & Wagner, 2003; Deterding, 2001; Grabe & Low, 2002; Low, Grabe, & Nolan, 2000; Ramus, Nespor, & Mehler, 1999; White & Mattys, 2007). Initially appealing and fuelling much research into the cross-linguistic study of rhythm, this approach has recently been critiqued as empirically inadequate and as misrepresenting the issue at heart (Arvaniti, 2009, 2012; Arvaniti & Rodriquez, 2013; Barry, Andreeva, & Koreman, 2009; Kohler, 2009; Rathcke & Smith, 2015a, 2015b; Wiget et al., 2010). The latest attempts at capturing rhythmic properties of spoken language involve analyses of the prosodic hierarchy and its temporal signatures, i.e., the durational implementation of the hierarchical structure of linguistic utterances (Rathcke & Smith, 2015a; White, Payne, & Mattys, 2009), though this proposal has also seen some counterevidence (Mairano, Santiago, & Romano, 2015).
Other recent approaches are less theoretically than acoustically driven, and rely on signal analyses such as properties of the amplitude envelope to capture some rhythmic properties in language (Goswami et al., 2002; Port, Cummins, & Gasser, 1995; Šturm & Volín, 2016; Tilsen & Arvaniti, 2013). Following on from these diverse and controversial accounts of rhythm, ideas have been put forward that language has a scale of rhythmicity (Kohler, 2009), is only occasionally rhythmic (White, Mattys, & Wiget, 2012), or even anti-rhythmic (Nolan & Jeon, 2014).
However, it has also been noted that previous production and perception studies of linguistic rhythm do not capture one of the core features of rhythm: its ability to entrain movement (Cummins, 2009, 2012). The idea that rhythm perception and movement are closely interconnected has a long history (Bolton, 1894), and is supported by a growing body of evidence showing that beat and rhythm perception involve motor regions of the brain (e.g., the basal ganglia and premotor cortex) and their connections to auditory regions (e.g., Brett & Grahn, 2007; Grahn & Rowe, 2009; Patel & Iversen, 2014; Zatorre, Chen, & Penhune, 2007). Behavioural research has exploited the potential of external, rhythmically structured events to entrain movement, with the goal of gaining a better understanding of rhythmic mechanisms and their underpinnings. The sensorimotor synchronization (SMS) paradigm has been developed and successfully utilized to study rhythm perception and the properties of the human timing system by observing how a motor action is temporally coordinated with an external auditory event (Aschersleben, 2002; Repp, 2005; Repp & Su, 2013). Such coordination of a perceived rhythm and a motor action gives insights into mechanisms underlying our capacity to achieve complex coordination in time when we dance, jointly sing, or chant, such as perceptual beat tracking and the generation of temporal expectancies (Repp, 2005). In the present study, we test the potential of SMS to provide new insights into the rhythmic organization of language.
The simplest way to measure SMS is to record finger tapping in time with an auditory stimulus, such as repeated sounds of a metronome or more complex musical sequences (Aschersleben, 2002; Repp, 2005; Repp & Su, 2013; for batteries of tests involving SMS, see Dalla Bella et al., 2017; Iversen & Patel, 2008). Typically, the task consists in synchronizing finger taps produced with the dominant hand to the beat perceived in the auditory signal. Measures of the temporal asynchrony between the stimulus and the tap, the duration, and the variability of inter-tap intervals quantify the synchronization performance and motor stability during the task.
According to this research, the most stable patterns of synchronization arise when participants tap at 1:1 or other multiple integer ratios in-phase with the beat, whereas more complex ratios and anti-phase tapping are generally more difficult (Bouvet, Varlet, Dalla Bella, Keller, & Bardy, 2019; Repp, 2005). Synchronized tapping on temporal scales longer than 800 ms is often more difficult and breaks down completely when inter-onset intervals exceed 1.8-2 seconds (Engström, Kelso, & Holroyd, 1996; Mates, Müller, Radil, & Pöppel, 1994). The fastest tapping rates occur at inter-onset intervals of 150-200 ms (Repp, 2005; Truman & Hammond, 1990), and are faster in musicians than non-musicians (Repp, 2003). In addition to motor constraints, upper and lower rate limits reveal general cognitive constraints on temporal processing (Repp, 2005) and have been related to working memory capacity (Pöppel, 1997). Importantly, both upper and lower SMS limits of manual synchronization with a pacing signal have meaningful temporal counterparts in speech acoustics. Lower limits correspond to the average duration of a vowel or a syllable, and upper limits can correspond to the duration of larger units such as a prosodic or a syntactic phrase. Existing evidence further demonstrates that SMS can take place even in syncopated signals (Large & Palmer, 2002) and in auditory stimuli with complex metrical structures (Madison, 2014). That is, SMS responds to those features that bear the closest resemblance to language. Unlike traditional speech perception experiments, which rely on listeners' metalinguistic conceptualization of rhythm and have delivered inconclusive results in the past (Miller, 1984; Rathcke & Smith, 2015b), SMS is more intuitive for non-specialists and taps into the motor routines that are known to sharpen sensory rhythmic representations in music (Morillon, Schroeder, & Wyart, 2014; Ravignani et al., 2019).
Although it has been noted that the ability to synchronize movement to an external timekeeper is predominantly human and might have even played an important role in the evolution of language and music (Merker, 2000; Ravignani et al., 2019), SMS paradigms have so far inspired relatively little interest within the linguistic field. Some early work on speech rhythm (Allen, 1972) utilized a version of sensorimotor synchronization with speech by asking listeners to tap to a designated syllable in a spoken sentence that was repeated and played back to them 50 times. The results suggested that the paradigm was able to unveil the 'beat location' of a syllable, which was close to a vowel onset and varied with prominence and syllable structure. Similarly, Falk, Rathcke, and Dalla Bella (2014) used sentence looping and demonstrated that SMS was highly sensitive to low-level, within-language timing variation, thus suggesting that the method is well suited for the study of subtle rhythmic differences in spoken signals, despite their high complexity and temporal variability. Lidji, Palmer, Peretz, and Morningstar (2011) compared finger tapping performance of monolingual French and English participants as well as French-English bilinguals when tapping to French and English spoken sentences with a regular metrical structure of strong and weak syllables. The study revealed that both the listener's native language and the language-specific acoustics affected the obtained tapping patterns in terms of tapping frequency and inter-tap-interval variability. Falk and Dalla Bella (2016) used finger tapping with metrically regular sentences to examine potential benefits for speech and language processing that might arise from a concurrent motor activity while listening. Tapping congruently (i.e., in-phase) with accented syllables was found to enhance speech processing compared to incongruent (i.e., anti-phase) tapping, or listening without the motor activity.
Such linguistic processing advantages may be supported by increased attentional resources being available through the coupling of perception and action that is typical of SMS tasks (cf. Hommel, 2015; Large & Jones, 1999). Overall, the recent SMS studies with language suggest that movement-based paradigms tap into language-specific rhythmic properties of speech.
Most of the studies above reported measures of motor rate and variability (including duration of inter-tap-intervals, ITI, and the coefficient of their variation, CV). Dalla Bella and colleagues (Dalla Bella, Białuńska, & Sowiński, 2013) note that SMS with language displays a relatively high amount of variation in contrast to SMS with music (reflected in a CV of 30% versus 4%, respectively), though it is unclear if this variability arises from the fact that language entrains movement less than temporally more regular stimuli (e.g., music) as the authors suggest, or is rather reflective of the unique temporal properties of language such as lack of isochrony in its acoustic signal (cf. Dauer, 1983; Pointon, 1980; Roach, 1982; Uldall, 1971; van Santen & Shih, 2000).
Only a few studies to date have addressed the question of potential SMS anchors in the acoustic signal of speech (Allen, 1972; Falk, Volpi-Moncorger, & Dalla Bella, 2017; Rathcke, Lin, Falk, & Dalla Bella, 2019). An answer to this question crucially hinges on empirical evidence that would demonstrate whether or not listeners attempt to systematically synchronize their movement with some specific points in the time course of an acoustic signal. Falk et al. (2017) defined SMS anchors as coinciding with the so-called 'perceptual centres' (or p-centres, Marcus, 1981). A p-centre describes the subjective moment of occurrence of an event (typically a syllable in speech). More often than not, the p-centre and the acoustic onset of the corresponding event do not co-occur in time (Cooper, Whalen, & Fowler, 1986; Marcus, 1981; Morton, Marcus, & Frankish, 1976). The original interest in p-centres arose from a search for some temporal constancy in language, and led to the hypothesis that temporal isochrony in language might be perceptual, and not acoustic, in nature (Lehiste, 1977; Morton et al., 1976). There have been attempts to localize the p-centre at the midpoint of the amplitude rise-time at the onset of nuclear accented vowels (following Cummins & Port, 1998; see also Morton et al., 1976). However, neither kinematic nor acoustic properties of speech signals seem to consistently capture the essence of the p-centre location (De Jong, 1994; Patel, Löfqvist, & Naito, 1999), and after 40 years of p-centre research, a comprehensive and reliable account of the phenomenon still remains a desideratum (Villing, Repp, Ward, & Timoney, 2011).
In our own study (Rathcke et al., 2019), we systematically examined several potential anchors by measuring SMS with very simple verbal stimuli, namely regularly spaced sequences containing alternations of syllables /bi/ and /bu/. The results of the study suggest that vowel onsets are likely to serve as attractors of individual taps. Moreover, SMS accuracy with similarly structured verbal and tonal stimuli did not significantly differ, if SMS in verbal stimuli was measured at vowel onsets. The latter result echoes previous findings obtained using a different tapping task (Dalla Bella et al., 2013). When linguistic stimuli closely resemble the metrical structure of music, the discrepancy between music and speech in their ability to attract movement disappears. However, natural speech is rarely metrical and never isochronous. Thus, rhythmic motor entrainment with language is yet to be demonstrated.
In contrast, a movement task that does not involve synchronization avoids the challenge of locating a tapping anchor in the acoustic signal. This non-synchronized motor reproduction (henceforth, NMR) paradigm has occasionally been used in previous speech research (Donovan & Darwin, 1979; Scott, Isard, & de Boysson-Bardies, 1985; Wagner, Cwiek, & Samlowski, 2019). In this task, listeners are asked to tap or drum a perceived rhythmic pattern after listening to an auditory prompt. NMR is somewhat similar to the synchronization-continuation paradigm which is common in the timing literature (e.g., Wing, 2002), with the difference that a synchronized tapping phase is missing. In NMR, listeners' tapping performance can be quantified by the interval duration between their taps (ITI) and by the variability in the interval duration (CV of the ITIs). When measuring period tracking of a beat in a linguistic input by means of such a non-synchronized task, ITI and CV could reflect some meaningful timing properties of the corresponding speech signal, e.g., syllable, word, or phrase duration, and the number of taps could reflect the number of rhythmically relevant events. Since this beat tracking ability of NMR has not been explicitly demonstrated in previous work, it is as yet unclear if, and how well, this motor paradigm can assist with understanding rhythm perception in language.

Aims and hypotheses
The aims of the present study were two-fold: (1) to test whether or not motor paradigms can help to tap into rhythm perception in natural language, and (2) to identify which paradigm would optimally support this.
While it seems natural and easy to synchronize to music, prolonged motor synchronization to speech is at first sight a less obvious and widespread activity (Dalla Bella et al., 2013), although previous studies have provided some evidence that it is not impossible (e.g., Allen, 1972; Falk et al., 2014; Lidji et al., 2011; Rathcke et al., 2019). It has its natural precursors in motor engagement with nursery rhymes (Cardany, 2013), clapping to political oratory (Tanaka & Rathcke, 2016) or co-speech gesturing (Wagner, Malisz, & Kopp, 2014). There are different implementations of a laboratory SMS paradigm with speech that can be found in the literature. For example, Lidji et al. (2011) asked their participants to listen to three repetitions of a spoken sentence in total, and to synchronize their finger taps with the beat during the second and the third presentation of the sentence. In the present study, we decided to use a larger number of repetitions, and participants were instructed to tap along with the perceived beat throughout a sentence loop.
Capitalizing on the general perceptual phenomenon of repetition (Margulis & Simchy-Gross, 2016; Rowland, Kasdan, & Poeppel, 2019), looped speech has the potential to reveal underlying rhythmic structures of sentences (Falk et al., 2014; Rathcke, Falk, & Dalla Bella, 2018). After having listened to repetitions of a sentence, listeners are no longer engaged in cognitively demanding semantic and syntactic processing. Instead, they can attend to the prosodic structure of the sentence and extract its rhythmic properties more easily. Looped speech is also known to sometimes induce the so-called 'speech-to-song illusion' (S2S). S2S describes a perceptual phenomenon in which an originally spoken phrase switches to being perceived by many people as a song if it is embedded in a loop (Deutsch, 2003; Deutsch, Henthorn, & Lapidis, 2011). However, not all phrases are equally likely to transform into song (e.g., Falk et al., 2014), and we controlled for this phenomenon in our materials. Early work by Allen (1972) also utilized looped sentences, though participants in that experiment were only asked to synchronize with one designated syllable on each repetition and not to tap along with the beat of the looped sentence as in this study.
The present approach is also different from the NMR paradigm implemented in previous research in which listeners were either presented with one repetition of each stimulus and could hear the stimulus again if needed (Wagner et al., 2019), or presented with 10 repetitions of a stimulus but were asked to tap after each repetition (Donovan & Darwin, 1979; Scott et al., 1985). The looped version of both SMS and NMR represents a principled way of expanding and applying current movement-based paradigms to language.
Given the differences in the nature of the SMS and NMR tasks, different scenarios would indicate (the degree of) the success of each paradigm in representing perceived rhythmic structures. In the case of SMS, motor entrainment necessarily involves the presence of a consistent anchor of synchronization in the speech signal. Lack of temporal consistency between a tap occurrence and an acoustic landmark will suggest lack of entrainment. In the case of NMR, beat tracking and rhythmic reproduction would be considered successful if tapping rates diverge from participants' natural and preferred tapping rate and converge towards the IOI of meaningful durational intervals in the linguistic input. If patterns resulting from NMR deviate from this prediction, beat tracking cannot be assumed to have been successful during the task. We further expect to find individual variation in both tasks, which should be at least partially explainable by individual musicality, timekeeping, and synchronization abilities (Dalla Bella et al., 2017).
In summary, this study set out to assess two movement-based paradigms that had been previously used with language, synchronization and reproduction, with the aim of providing empirical evidence on a methodology that would be best suited for studying rhythmic properties of spoken language.

Experimental stimuli
Six English sentences were chosen from an existing database that we had previously created to investigate S2S. Every sentence in this database has been tagged for S2S likelihood based on perception data obtained from 40 healthy native listeners (Rathcke et al., 2018, forthcoming; see Table 1). The sentences were read by a female native Standard Southern British English speaker (22 years old at the time of the recording).
The materials of the present study comprised six sentences, two with four syllables, two with seven, and two with ten syllables. The ten-syllable sentences were syntactically more complex than the shorter ones. To control for the possibility that repetition might induce musical interpretation of speech and thereby bias synchronization or reproduction in unexpected ways, the chosen pairs varied in their probability to induce S2S, with one high-transforming and one low-transforming sentence in each pair (see Table 1).
The stimuli were repeated 20 times for the SMS task and 10 times for the NMR task (see 3.7). A 400 ms pause separated the repetitions.

Sentence annotations
A trained phonetician (first author) annotated onsets of vowels and syllables in the test sentences. Vowels were defined as syllabic nuclei identified by the presence of voicing, formant structure, and relatively high intensity. Accordingly, pre-aspiration was excluded. Segmentation of vowel onsets in post-sonorant contexts combined acoustic and auditory criteria that were guided by impressions of the intended vowel quality. There were no segmental reduction or deletion phenomena, given that the recordings consisted of clear, read speech samples. There were also no cases of glottalization in these recordings. Segmentation examples are given in Figures 2 and 3. All materials are available from https://osf.io/3dh4m/. An independent annotator segmented vowel onsets in all test sentences following the criteria given above, and reached a cross-annotator agreement of .999945 (p < .001) in Pearson's correlation coefficient.
Additionally, each syllable (and its vowel) was specified with respect to its metrical status (strong or weak) and phrasal prominence (accented or unaccented). From these annotations, we derived acoustic timings of the two potential synchronization anchors (syllable or vowel) and linguistic prominence of the underlying units (0 for metrically weak, 1 for metrically strong but unaccented, 2 for accented syllables/vowels).
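The three-level prominence coding can be illustrated with a short Python sketch. The class name, field names, and the timings for "Friday" are hypothetical and serve only as an illustration; the sketch also assumes that an accented syllable is always metrically strong, which the text implies but does not state.

```python
from dataclasses import dataclass

def prominence_level(strong: bool, accented: bool) -> int:
    """0 = metrically weak, 1 = strong but unaccented, 2 = accented."""
    if accented:
        return 2  # assumption: accented syllables are also metrically strong
    return 1 if strong else 0

@dataclass
class Syllable:
    label: str
    syllable_onset_ms: float  # first candidate anchor
    vowel_onset_ms: float     # second candidate anchor
    strong: bool
    accented: bool

    @property
    def prominence(self) -> int:
        return prominence_level(self.strong, self.accented)

# Hypothetical annotation of the word "Friday" (timings are made up).
friday = [
    Syllable("Fri", 0.0, 95.0, strong=True, accented=True),
    Syllable("day", 260.0, 300.0, strong=False, accented=False),
]
levels = [s.prominence for s in friday]
```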

Acoustic pre-processing
To define potential acoustic SMS anchors, the linguistically informed data preparation described above was complemented by analyses of the amplitude envelope (cf. Goswami et al., 2002; Port, Cummins, & Gasser, 1995; Tilsen & Arvaniti, 2013) and signal energy derivatives (Šturm & Volín, 2016). Amplitude envelopes were created by employing the envelope function in Matlab (2018b). Accordingly, envelopes were derived from the absolute signal amplitude, and smoothed using a spline interpolation with a window of at least 500 samples (amounting to approximately 11 ms). Smoothed energy contours were derived following the procedure developed by Šturm and Volín (2016), which was based on the calculation of energy averages across 40-ms segment windows with a 44-sample shift and the 6th-order moving-average filter. Figure 1 compares the amplitude envelope and the smoothed energy contour for the test sentence M2, and demonstrates core differences between the two contours. The amplitude envelope (shown in blue) closely follows the waveform, apart from the sections where there is a discrepancy between positive and negative amplitude values (since the envelope is based on an average of the absolute values). In contrast, the energy function shows multiple deviations from the original waveform, especially in regions of low sonority. Moreover, energy contours of open vowels at the beginning of the sentence are more closely matched to the waveform than the contours of close vowels towards the end of the sentence.
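As a rough illustration of the smoothed energy contour, the Python sketch below averages squared amplitudes over 40-ms windows shifted by 44 samples and then applies a moving average. It is not the Matlab code used in the study: the definition of energy as mean squared amplitude, the reading of the "6th-order" filter as a trailing mean over six values, and the default sampling rate of 44.1 kHz (inferred from 500 samples being about 11 ms) are all assumptions.

```python
import math

def frame_energy(signal, sr=44100, win_ms=40.0, hop=44):
    """Mean squared amplitude in overlapping windows (assumed energy definition)."""
    win = int(round(sr * win_ms / 1000.0))
    return [
        sum(x * x for x in signal[start:start + win]) / win
        for start in range(0, len(signal) - win + 1, hop)
    ]

def moving_average(values, order=6):
    """Trailing mean over up to `order` values (one reading of a 6th-order filter)."""
    out, running = [], 0.0
    for i, v in enumerate(values):
        running += v
        if i >= order:
            running -= values[i - order]  # drop the value leaving the window
        out.append(running / min(i + 1, order))
    return out

# Toy signal: 100 ms of silence followed by 100 ms of a 200 Hz tone at 8 kHz.
sr = 8000
toy = [0.0] * 800 + [math.sin(2 * math.pi * 200 * n / sr) for n in range(800)]
contour = moving_average(frame_energy(toy, sr=sr))
```

In the toy signal, the contour stays at zero during the silence and rises towards the mean power of the tone once the analysis windows lie fully inside it.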
Additionally, local energy dynamics of voiced parts of the acoustic signal were described following the formula in Šturm and Volín (2016). This function operates on the smoothed energy contour, calculates differences between two neighbouring samples (disregarding samples with zero-crossing rates higher than 4000 that are typical of voiceless fricatives), and smooths the difference values via a moving-average filter of order 10. Figure 2 displays the two energy contours in comparison (red versus green lines). Subsequently, local maxima of the smoothed energy function (maxE) and local maxima of the energy difference function (maxD) were identified and localized within each syllable of the test sentences. According to Šturm and Volín (2016), maxD represents a close approximation of the p-centre in Czech.
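The derivation of maxE and maxD can be sketched as follows. This is a simplified stand-in rather than the formula of Šturm and Volín (2016): the exclusion of samples with high zero-crossing rates is omitted, the filter is a plain trailing mean, and the largest value inside a syllable is used in place of a proper local-maximum search. All function names and the toy contour are hypothetical.

```python
def smooth(values, order=10):
    """Trailing moving average, a stand-in for the order-10 filter."""
    out, running = [], 0.0
    for i, v in enumerate(values):
        running += v
        if i >= order:
            running -= values[i - order]
        out.append(running / min(i + 1, order))
    return out

def energy_difference(energy, order=10):
    """Differences between neighbouring energy samples, then smoothed."""
    return smooth([b - a for a, b in zip(energy, energy[1:])], order)

def argmax_between(values, start, end):
    """Index of the largest value in values[start:end] (first one on ties)."""
    return max(range(start, end), key=lambda i: values[i])

# Toy energy contour for one syllable spanning all samples (made-up numbers).
energy = [0, 0, 1, 3, 6, 8, 9, 9, 8, 5, 2, 0]
diff = energy_difference(energy, order=1)  # order=1 leaves the differences unsmoothed
max_e = argmax_between(energy, 0, len(energy))  # sample index of maxE
max_d = argmax_between(diff, 0, len(diff))      # sample index of maxD
```

In this toy case maxD precedes maxE, since the steepest energy increase occurs before the energy peak is reached.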
Both values (maxE and maxD) were subsequently examined with respect to their ability to serve as SMS anchors, along with the syllable amplitude maxima derived from amplitude envelopes and the syllable/vowel onsets identified manually. Figure 2 compares temporal locations of the five potential SMS anchors for the bisyllabic word "Friday" (taken from L1). The figure illustrates our general finding that distances between the five temporal landmarks varied depending on the properties of the syllable, and could be rather small (as in the second syllable of the example word) or large (as in the first syllable). Each temporal landmark was then further described on the basis of the acoustic properties of its local amplitude envelope. More specifically, two measures of local changes in the signal amplitude were identified: (1) rise-time, i.e., the temporal distance between a local amplitude minimum and a maximum (Goswami et al., 2002; Goswami & Leong, 2013) and (2) rise-slope, i.e., the steepness of a change in the envelope measured as the amplitude difference between a minimum and a maximum, divided by the duration of the rise-time. These measures are illustrated in Figure 3.
To derive these measures, neighbouring maxima and minima in the amplitude envelope were located using the findpeaks function in Matlab 2018b. First, a local maximum was found in the closest proximity to the event onset (in vowels, it could precede or follow the identified vowel onset but in syllables, the temporal location was restricted by syllable boundaries). Second, the algorithm searched for a preceding minimum in the amplitude envelope. The local minima were mostly located around the 'valleys' between local maxima in the amplitude envelope (see Figure 3). A local threshold was adjusted to each individual case, based on a combination of two parameters (duration of the sampling time window and average amplitude decrease over a series of consecutive sampling intervals). Automatically detected turning points of the amplitude envelope were manually checked by a trained phonetician (first author).
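The search for turning points and the two rise measures can be approximated with the Python sketch below. It is only an illustration under simplifying assumptions: a hand-written neighbour comparison replaces Matlab's findpeaks, the case-by-case local threshold and the manual checks are left out, and the function names and envelope values are hypothetical.

```python
def local_extrema(env):
    """Indices of simple local maxima and minima of an envelope."""
    maxima, minima = [], []
    for i in range(1, len(env) - 1):
        if env[i - 1] < env[i] >= env[i + 1]:
            maxima.append(i)
        if env[i - 1] > env[i] <= env[i + 1]:
            minima.append(i)
    return maxima, minima

def rise_measures(env, sr, onset_idx):
    """Rise-time (s) and rise-slope (amplitude per s) around an event onset."""
    maxima, minima = local_extrema(env)
    # Step 1: the local maximum closest to the event onset.
    peak = min(maxima, key=lambda i: abs(i - onset_idx))
    # Step 2: the nearest local minimum preceding that maximum.
    preceding = [i for i in minima if i < peak]
    trough = preceding[-1] if preceding else 0
    rise_time = (peak - trough) / sr
    rise_slope = (env[peak] - env[trough]) / rise_time
    return rise_time, rise_slope

# Made-up envelope sampled at 1 kHz, with an event onset at sample 4.
env = [0.5, 0.2, 0.1, 0.3, 0.6, 0.9, 0.7, 0.4, 0.2, 0.5, 0.8, 0.6]
rt, rs = rise_measures(env, sr=1000, onset_idx=4)
```

Here the closest maximum lies at sample 5 and the preceding minimum at sample 2, giving a rise-time of 3 ms. The sketch does not restrict the search to syllable boundaries, as the study did for syllable onsets.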

Individual data
All participants had to fill in an online questionnaire prior to their scheduled experimental session. The form asked about musical training, ongoing and past musical activities, and dancing experience. A general musicality index was derived from these data (similar to the approach by Šturm & Volín, 2016). The index was an aggregate score based on years of musical training (from 0 to 12 in the present sample), current regular music practice (0 for non-active and 1 for active participants), number of musical instruments (which included singing and dancing, from 0 to 4 in the sample), and finally the age at which participants started their musical training (below the age of 10 was coded as 2, from 10 up to 20 years as 1, above 20 years as 0). The resulting musicality indices varied from minimally 0 (no musical experience) to 18 (a high level of musical experience and skills). There were no professional musicians or dancers among the participants of this study, though 69% had received some musical training, taken dancing classes, or danced regularly. Given the aims and hypotheses of the present study, the questionnaire only included questions about active music practice and did not collect information about passive experience of listening to specific music styles.
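One way to compute such an index is sketched below in Python. The text describes the index as an aggregate of four components but does not give the aggregation rule, so the unweighted sum used here is an assumption, as are the treatment of participants without training (coded 0 for starting age) and the placement of exactly 20 years in the middle category.

```python
from typing import Optional

def onset_age_code(age: Optional[float]) -> int:
    """Code the age at which musical training started."""
    if age is None:   # assumption: no training contributes 0
        return 0
    if age < 10:
        return 2
    if age <= 20:     # assumption: 'up to 20' includes 20
        return 1
    return 0

def musicality_index(years_training: int, currently_active: bool,
                     n_instruments: int, onset_age: Optional[float]) -> int:
    """Unweighted sum of the four questionnaire components (assumed rule)."""
    return (years_training + int(currently_active)
            + n_instruments + onset_age_code(onset_age))

# Hypothetical participants.
no_experience = musicality_index(0, False, 0, None)
experienced = musicality_index(11, True, 4, 7)
```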
We assessed individual SMS abilities with the Battery for the Assessment of Auditory Sensorimotor Timing Abilities (BAASTA; Dalla Bella et al., 2017). Six tasks were selected from the battery, including two unpaced tapping tasks, two paced tapping-to-tones tasks, and two paced tapping-to-music tasks. Unpaced tapping tasks measured the speed of participants' spontaneous and fast tapping rates without a stimulus, and their ability to sustain the regular motor activity. In these tasks, participants were instructed to either tap at their most comfortable speed for 60 seconds, or to tap at their fastest possible speed for 30 seconds, paying attention to maintaining a constant speed for the whole duration of a trial. In the paced tapping tasks, participants' synchronization abilities were measured with a simple regular sound (here, a piano tone with a frequency of 1319 Hz or E6) and computer-generated excerpts of classical music. When tapping to piano tones, participants were presented with 60 repetitions of a tone presented at a faster (450 ms) and a slower (600 ms) IOI, and asked to tap in synchrony with the tones throughout the repetitions. When tapping to music, participants were instructed to synchronize with what they perceived as the beat in musical excerpts from Bach's "Badinerie" and Rossini's "William Tell Overture." Both music extracts consisted of 64 beats with a quarter note of 600 ms IOI (see Dalla Bella et al., 2017 for more detail). The task order was counterbalanced across experimental sessions, following a Latin square design. The order within each task category was fixed, though: in the unpaced tapping task, participants first tapped spontaneously then fast; in the paced tapping task, they first synchronized with the metronome of 450 ms IOI and then 600 ms IOI; in the music synchronization task, they first tapped to Bach and then to the Rossini piece.
After the experiment, all participants filled in a brief questionnaire that collected confidence ratings for their self-evaluated experimental performance. Participants indicated how easily they were able to extract beat patterns from looped test sentences, and how well they were able to replicate them in the NMR task. In addition, they were asked to self-report how confident they felt about tapping precisely in time with the stimuli they experienced in the SMS task. A 9-point Likert scale (with 9 being the highest level of confidence) was used to collect the ratings.

Participants
Thirty-one native speakers of Southern British English (21 female; mean age 23.1 years, range 18-36 years) participated in the study. They gave informed consent and received a small fee in compensation for their time and efforts. The data of two participants were removed from the sample because they self-declared as dyslexic. All remaining participants had no existing history of language impairments or motor disorders that could affect their rhythmic processing or SMS abilities (e.g., dyslexia: Leong & Goswami, 2014; apraxia: Park, 2017; dystonia: Liu et al., 2008), and no hearing impairments at the time of testing. Moreover, their individual performance with the metronome tasks of BAASTA did not indicate any issues with their general synchronization abilities (cf. Dalla Bella et al., 2017).

Tasks, procedure, and apparatus
The study obtained ethical approval from the Ethics Committee of the University of Kent, and was conducted in a quiet behavioral testing room of the Kent Linguistics Laboratory. Each experimental session consisted of one SMS and one NMR task with the experimental stimuli. During the SMS task, participants were presented with 20 repetitions of each target sentence and asked to start synchronizing with what they perceived as the beat structure of the sentence as soon as they felt able to, while the repeated auditory sequence was still ongoing. In the NMR task, participants were asked to listen to 10 repetitions of a test sentence first, and then to replicate the beat pattern they had heard. No instruction was given as to how many taps or cycles they should reproduce. During each task, test sentences were presented in increasing order of complexity, i.e., the short sentences were tested first, the long ones last. Test sentences were played binaurally through Sennheiser HD 380 headphones. The order of the SMS and NMR tasks was counterbalanced across participants. At the start of an experimental session, participants familiarized themselves with the equipment and had an opportunity to clarify their questions about the procedure. The session ended with the BAASTA tests and the post-test questionnaire, and took 35-45 minutes in total to complete.
Tapping responses were collected using a Roland HandSonic drum pad (HPD-20) and a Dell Latitude 7390 laptop in the CakeWalk MIDI software (BandLab). BAASTA was implemented as an app running on an Acer tablet (Iconia One 10 B3-A40FHD 32GB) with an Android 7.0 system. Participants were free to adjust the sound volume to a comfortable level.

Preparation of SMS data
Collected taps can be analyzed with regard to different aspects of their distribution in time. Figure 4 shows a hypothetical example of two taps produced in time with the bisyllabic word Friday. The first tap follows the syllable onset (resulting in positive asynchronies measured with this landmark) but precedes all other landmarks (resulting in negative asynchronies indicative of anticipation of different magnitudes: maxD can be considered less anticipated than the vowel onset in this example). The second tap shows positive asynchronies for all landmarks, though the magnitude of the time lag is landmark-specific: here, it is the smallest for maxE and the largest for the syllable onset. Moreover, the distance between the taps (or the inter-tap interval, ITI) and the variability of these intervals can provide insightful information about the synchronization performance with each sentence as a whole.
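The sign convention for asynchronies can be made explicit with a minimal Python sketch: an asynchrony is the tap time minus the landmark time, so negative values indicate anticipation. The landmark times below are invented to mirror the first tap of the hypothetical example and are not measurements from the study.

```python
def asynchronies(tap_ms, landmarks_ms):
    """Tap time minus landmark time; negative values mean the tap came early."""
    return {name: tap_ms - t for name, t in landmarks_ms.items()}

# Made-up landmark times (ms) for the first syllable of "Friday".
first_syllable = {
    "syllable_onset": 0.0,
    "maxD": 70.0,
    "vowel_onset": 95.0,
    "maxE": 140.0,
}
first_tap = asynchronies(40.0, first_syllable)
```

With these values, the tap lags the syllable onset by 40 ms and anticipates maxD by 30 ms and the vowel onset by 55 ms, matching the ordering described for the first tap.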
We extracted the tapping data using Matlab MIDI toolbox (Eerola & Toiviainen, 2004), and corrected the timing of taps by subtracting the delay of the MIDI device (here, 5 ms). For each sentence and participant, we calculated the temporal distribution of individual taps within the temporal window of the sentence duration and then aggregated the available taps across all repetitions of the same sentence. Using ggplot2 (Wickham, 2016), a Gaussian kernel estimation with a bandwidth adjustment of ⅛ was applied to the aggregated data. This procedure allowed us to obtain a smoothed distribution for each participant and sentence while retaining salient peaks of the aggregated taps. Figure 5 shows an example of such density functions for the test sentence S1. Individual densities were obtained from the SMS data and aggregated across all participants. The resulting distribution in Figure 5 is clearly multimodal, with one tapping peak per syllable of this test sentence. Figure 6 displays density functions created for the group tapping performance with the same sentence during the NMR task. The NMR task seems to increase variability at both individual and group level, and lacks the clearly defined quadrimodality observed in the SMS task with this sentence.
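The smoothing step described above can be sketched in Python as follows. This is an illustrative re-implementation, not the R/ggplot2 code used in the study. It assumes that the bandwidth adjustment of ⅛ scales a rule-of-thumb default bandwidth of the form 0.9 × min(SD, IQR/1.34) × n^(-1/5); the function names and the tap times are hypothetical.

```python
import math
import statistics


def rule_of_thumb_bandwidth(samples):
    """Rule-of-thumb bandwidth: 0.9 * min(SD, IQR / 1.34) * n^(-1/5)."""
    n = len(samples)
    sd = statistics.stdev(samples)
    q1, _, q3 = statistics.quantiles(samples, n=4, method="inclusive")
    # Fall back to the SD (or 1.0) if the spread estimate is zero.
    spread = min(sd, (q3 - q1) / 1.34) or sd or 1.0
    return 0.9 * spread * n ** (-0.2)


def tap_density(tap_times_ms, grid_ms, adjust=1 / 8):
    """Gaussian kernel density of pooled tap times, evaluated on grid_ms."""
    h = adjust * rule_of_thumb_bandwidth(tap_times_ms)  # assumed meaning of the 1/8 adjustment
    norm = 1.0 / (len(tap_times_ms) * h * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((g - t) / h) ** 2) for t in tap_times_ms)
        for g in grid_ms
    ]


# Hypothetical taps (ms from sentence onset), pooled over repetitions.
taps = [190, 200, 205, 210, 590, 600, 605, 615]
grid = list(range(0, 801, 5))
density = tap_density(taps, grid)
```

A small adjustment factor keeps the kernel narrow, so that separate tapping peaks are not merged into one broad mode.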
The temporal location of the density peak maxima (see Figure 5) was used to quantify the individual SMS performance. To derive this measure, the findpeaks function from the R-package pracma (Borchers, 2018) was applied. It identified all peaks using a 40%-threshold of the maximum peak value of each sentence, separated by at least 100 ms distance. The timepoints of the density peaks and locations of the temporal landmarks under investigation (see 3.3) were then compared. Asynchronies between the taps and the temporal landmarks were calculated for those density peaks which occurred within a ±120 ms window of the corresponding landmark location (cf. Repp, 2004).
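The two peak-selection criteria named above (a height threshold of 40% of the maximum and a minimum separation of 100 ms) can be illustrated with a short Python sketch. It is a simplified stand-in and does not reproduce the findpeaks function of pracma. The function name, the rule of keeping the higher of two peaks that lie too close together, and the toy density are assumptions made for illustration.

```python
def find_density_peaks(grid_ms, density, rel_height=0.4, min_distance_ms=100.0):
    """Return times of local maxima that pass the height and distance criteria."""
    floor = rel_height * max(density)
    candidates = [
        i for i in range(1, len(density) - 1)
        if density[i] > density[i - 1]
        and density[i] >= density[i + 1]
        and density[i] >= floor
    ]
    kept = []
    # Visit candidates from highest to lowest; drop any that is too close
    # to a peak that has already been kept (assumed tie-breaking rule).
    for i in sorted(candidates, key=lambda i: density[i], reverse=True):
        if all(abs(grid_ms[i] - grid_ms[j]) >= min_distance_ms for j in kept):
            kept.append(i)
    return sorted(grid_ms[i] for i in kept)


# Toy density on a 10-ms grid with four bumps (hypothetical values).
grid = list(range(0, 1001, 10))
density = [0.0] * len(grid)
for centre_ms, height in [(200, 1.0), (250, 0.8), (600, 0.5), (850, 0.2)]:
    idx = grid.index(centre_ms)
    density[idx] = height
    density[idx - 1] = density[idx + 1] = height / 2

peaks = find_density_peaks(grid, density)
```

In this toy case the bump at 250 ms is discarded because it lies within 100 ms of a higher peak, and the bump at 850 ms is discarded because it falls below 40% of the maximum.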

Period tracking in SMS and NMR
To compare the properties of period tracking in SMS and NMR, we calculated ITIs (in ms) that participants produced in each target sentence during the very first movement cycle as well as mean ITIs upon completion of a trial. Variability of the interval duration between taps was expressed by the coefficient of variation CV, calculated as SD(ITI)/mean(ITI).
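As a worked illustration of these two measures, the following Python sketch derives ITIs from a sequence of tap times and computes CV = SD(ITI)/mean(ITI). The use of the sample standard deviation is an assumption (the text does not specify it), and the function names and tap times are hypothetical.

```python
import statistics


def inter_tap_intervals(tap_times_ms):
    """Differences between successive tap times (ms)."""
    return [later - earlier for earlier, later in zip(tap_times_ms, tap_times_ms[1:])]


def coefficient_of_variation(intervals_ms):
    """CV = SD(ITI) / mean(ITI); sample SD is assumed here."""
    return statistics.stdev(intervals_ms) / statistics.mean(intervals_ms)


regular_itis = inter_tap_intervals([0, 500, 1000, 1500])  # perfectly even tapping
uneven_itis = inter_tap_intervals([0, 400, 1000])         # ITIs of 400 and 600 ms

cv_regular = coefficient_of_variation(regular_itis)
cv_uneven = coefficient_of_variation(uneven_itis)
```

Because CV is normalized by the mean interval, it allows variability to be compared across sentences and participants with different tapping rates.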

Figure 5:
An aggregated density function of the group SMS-performance with the four-syllable test sentence S1 ("I wove a yarn").

Figure 6:
An aggregated density function of the group NMR-performance with the test sentence S1 ("I wove a yarn").

Motor activity in SMS and NMR
First, one-sample Kolmogorov-Smirnov tests confirmed that tapping data were not uniformly distributed in either task. That is, participants did not tap randomly in either task or in any of the test sentences (see Table 2; all p values were < .01). Moreover, confidence ratings did not differ significantly between the two tasks. Participants felt equally confident (median: 6, interquartile range: 5-7) about their ability to extract the beat patterns in NMR and to synchronize with the beat in SMS.
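The logic of a one-sample Kolmogorov-Smirnov test against a uniform distribution can be sketched in Python as follows. The sketch assumes that the reference distribution is uniform over the sentence duration and uses the asymptotic approximation of the p value, which is only rough for small samples. The function name and the tap times are hypothetical, and the study's own test settings are not documented in the text.

```python
import math


def ks_uniform(samples, lo, hi, terms=100):
    """KS statistic D and asymptotic p value against Uniform(lo, hi)."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        cdf = (x - lo) / (hi - lo)  # uniform CDF at x
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    lam = math.sqrt(n) * d
    if lam < 0.2:
        # The series below is numerically unreliable here; p is 1 to many digits.
        return d, 1.0
    p = 2.0 * sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, terms + 1)
    )
    return d, min(max(p, 0.0), 1.0)


# Hypothetical taps clustered around two time points in an 800-ms sentence.
clustered = [190, 200, 205, 210, 590, 600, 605, 615] * 10
d_clustered, p_clustered = ks_uniform(clustered, 0, 800)

# Evenly spread taps, as expected for tapping unrelated to the sentence.
spread = [(i + 0.5) * 10 for i in range(80)]
d_spread, p_spread = ks_uniform(spread, 0, 800)
```

Clustered taps yield a large D and a small p value, whereas evenly spread taps do not depart from uniformity.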

Comparisons between landmarks as potential SMS anchors
To find the most appropriate SMS-anchor, we fitted linear mixed-effects models to absolute asynchronies between the tapping peaks and the temporal landmarks. The model included a five-level predictor landmark and two random effects: participant (P1-P29) and sentence (1-6). We started with a maximal random effect structure recommended by Barr, Levy, Scheepers, and Tily (2013), and iteratively removed random effects if the model failed to converge or produced a singular fit. A change of the default optimizer (to 'optimx'; John et al., 2020) helped to resolve the model convergence issues and keep the random effect structure maximal. Likelihood ratio tests were run to determine the best-fit models. Figure 7 displays estimates and standard errors of absolute asynchronies for the five landmarks under investigation. Smaller asynchronies indicate a higher accuracy of a tap in the proximity of the corresponding landmark (the ±120 ms window applied across all landmarks). Here (and below), raw duration measurements in ms were logarithmically transformed to reduce or remove the skewness of the distribution that is typically observed in durational data (Baayen, 2008, pp. 31ff). Visual inspection of the estimates in Figure 7 led to the conclusion that vowel onsets demonstrated the smallest temporal discrepancy between SMS-peak locations and the nearby landmarks. Vowel onsets were thus taken as the reference level for the pairwise comparisons with the Bonferroni-corrected α-level set to 0.0125 (0.05/4). Accordingly, syllable onsets (t = 6.80, p < .001) and maxE (t = 3.15, p < .01) differed significantly from the asynchronies measured with vowel onsets while LAM did not reach significance at the Bonferroni-corrected α-level of 0.0125 (t = 2.18, p = 0.03). Despite the numerical difference observed in Figure 7, maxD did not produce significantly longer absolute asynchronies in comparison to vowel onsets (t = 0.91, n.s.).
Based on the analyses above, we conclude that vowel onsets constitute the best anchor of SMS in these data. Figure 8 displays an example of the group performance with the vowel onsets of the test sentence S1. Despite some individual variation, cumulative tapping peaks shown in the graph are temporally well aligned with the vowel onsets (indicated by vertical dashed blue lines). While the sentence-initial vowel demonstrates large negative asynchronies typical of SMS performance with a metronome (e.g., Aschersleben, 2002), all following vowels seem much less anticipated in this example, i.e., display smaller or no negative mean asynchronies.
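The Bonferroni adjustment used for the pairwise comparisons above is a simple division of the family-wise α by the number of comparisons. The Python sketch below works through the arithmetic. The p values are those given in the text, exact for LAM and upper bounds where only "p <" is reported; maxD is left out because no p value is reported for it. The function and variable names are hypothetical.

```python
def bonferroni_alpha(family_alpha, n_comparisons):
    """Per-comparison alpha level under the Bonferroni correction."""
    return family_alpha / n_comparisons


# Four landmarks are compared against the reference level (vowel onsets).
alpha_corrected = bonferroni_alpha(0.05, 4)

# Reported values: an upper bound for "p < .001" and "p < .01", exact for LAM.
reported_p = {"syllable onset": 0.001, "maxE": 0.01, "LAM": 0.03}
significant = {name: p < alpha_corrected for name, p in reported_p.items()}
```

With the corrected level of 0.0125, the LAM comparison (p = 0.03) falls short of significance although it would pass an uncorrected threshold of 0.05.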

Probability of a tap in the SMS task
A logistic mixed-effects regression was performed to test for the likelihood of a tapping peak being present (1) or absent (0) in the proximity of a vowel onset (±120 ms around the temporal landmark; see 3.7). Metrical status (i.e., the vowel being nucleus of a metrically weak, strong, or pitch-accented syllable), rise-time and rise-slope of the amplitude envelope, S2S likelihood of the sentence (high/low), and participant-specific characteristics were entered as predictors. We also tested if the order of tasks (SMS first versus NMR first) had an impact on tapping with vowels. Participant (P1-P29) and sentence (1-6) were fitted as random effects. Again, we started with a maximal random effect structure and retained those random effects that allowed the models to converge. To combat the model convergence issues of the mixed-effects logistic regressions, we changed the default optimizer (to 'bobyqa') and increased the number of iterations from default 10,000 to 100,000. Summary of the best-fit model established by the likelihood ratio tests can be found in the supplementary materials.
The best-fit model produced two main effects, including the metrical status of the vowel and S2S likelihood of the sentence (see Table 3). Accordingly, metrically weak vowels were less likely to attract a tap, in comparison to either metrically strong (z = 3.28, p < .01) or accented vowels (z = 4.50, p < .001). Although accented vowels were slightly more often tapped to than metrically strong but phrasally unaccented vowels, the difference between them was not significant. Sentences identified as high-transforming in previous S2S-experiments (Rathcke et al., 2018, forthcoming) were also more likely to induce a higher number of taps, forming a tapping peak in the density map around a vowel onset (z = 2.37, p < .05). These effect estimates are summarized in Table 4.
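The binary response that enters the logistic model above (a tapping peak present or absent within ±120 ms of a vowel onset) could be coded along the following lines. This Python sketch also returns the signed asynchrony (peak time minus landmark time) for matched peaks. Choosing the nearest peak when several fall inside the window is an assumption, and the function name, vowel onsets, and peak times are hypothetical.

```python
def match_peaks_to_landmarks(peak_times_ms, landmark_times_ms, window_ms=120.0):
    """For each landmark, return (landmark, present, signed_asynchrony_ms)."""
    rows = []
    for landmark in landmark_times_ms:
        in_window = [p for p in peak_times_ms if abs(p - landmark) <= window_ms]
        if in_window:
            # Assumed rule: take the peak closest to the landmark.
            nearest = min(in_window, key=lambda p: abs(p - landmark))
            rows.append((landmark, 1, nearest - landmark))
        else:
            rows.append((landmark, 0, None))
    return rows


vowel_onsets = [100, 350, 620, 900]  # hypothetical landmark times (ms)
tapping_peaks = [80, 360, 930]       # hypothetical density peaks (ms)
coded = match_peaks_to_landmarks(tapping_peaks, vowel_onsets)
```

A negative signed asynchrony means that the tapping peak precedes the vowel onset; the third vowel in this toy example attracts no peak and is coded as 0.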

SMS accuracy
SMS accuracy was measured as absolute asynchronies between SMS peaks and vowel onsets (on a logarithmic scale), and entered linear mixed-effects modelling as the dependent variable. Again, we tested the predictive power of metrical status (i.e., the vowel being nucleus of a metrically weak, strong, or pitch-accented syllable), rise-time and rise-slope of the amplitude envelope, S2S likelihood of the sentence (high/low), and participant-specific characteristics. We further added the order of tasks (SMS first/NMR first) to check if SMS improved in those participants who first performed NMR. Individual participant (P1-P29) and sentence (1-6) were fitted as random effects. Starting with the maximal random effect structure and changing the default optimizer (to 'optimx'; John et al., 2020), random effects were iteratively removed if they produced convergence or singular-fit issues. The best-fit model established by the likelihood ratio tests is given in the supplementary materials. Table 5 displays the best-fit model which included two factors, (1) the rise-time of the amplitude envelope around the vowel onset and (2) the musicality score of participants. Effect estimates from the best-fit model are plotted in Figure 9. Vowels with shorter amplitude rise-times displayed smaller asynchronies (t = 2.50, p < .05). Higher levels of musical training also improved SMS accuracy (t = -2.27, p < .05).

Anticipation during SMS
To see if participants displayed anticipation in SMS with language, we analyzed signed asynchronies between tapping peaks and vowel onsets. Here, negative values indicated that a tap preceded (i.e., anticipated) a vowel onset. Linear mixed-effects models tested four stimulus-related predictors, including metrical status (weak, strong, or accented), rise-time and rise-slope of the amplitude envelope, S2S likelihood of the sentence (high/low), and participant-specific characteristics. As observed before, targets that occurred at the beginning of a sentence seemed more anticipated than any of the subsequent targets, i.e., they show larger negative mean asynchronies. To see how systematically this effect occurred in our data, we included the serial order of targets within a sentence as a covariate. We also fitted the order of tasks (SMS first/NMR first) as a fixed effect to see if the anticipation of the upcoming vowels is reduced after participants had experienced the sentence during the NMR-task. The model further contained two random effects: participant (P1-P29) and sentence (1-6). The maximal random effect structure initially included random slopes and was iteratively simplified if convergence or singular-fit issues persisted despite the change in the optimizer (John et al., 2020). The final model established by the likelihood ratio tests is shown in the supplementary materials. The best-fit model retained three covariates related to the acoustic and positional properties of sentence targets (see Table 6). Both rise-time and rise-slope of the amplitude envelope around the vowel onset showed a strong influence (see footnote 2). More specifically, vowels with longer rise-times (t = -4.21, p < .001) and steeper rise-slopes (t = -3.55, p < .001) were more anticipated than vowels with shorter rise-times and shallower rise-slopes (see Figure 10A-B). As far as the serial order of a vowel in a sentence was concerned, our preliminary observations were confirmed. Each subsequent vowel showed smaller negative asynchronies and was thus less anticipated than its predecessor (t = 2.29, p < .05). That is, SMS accuracy increased incrementally and was particularly high for sentence-final vowels (see Figure 10C).

Number of repetitions in SMS
Given that our SMS task involved a total of 20 stimulus presentations, we examined participants' tapping behaviour across the repetitions. In particular, we were interested in answering the two main questions. Firstly, when did participants start tapping, and how might this have been influenced by their self-reported confidence in the personal synchronization performance? Secondly, assuming that SMS improved with practice, how many tapping cycles were needed for participants to achieve their best, stable SMS performance in this task?
Most participants started to tap during the third repetition of a sentence (median: 3, interquartile range: 2-3). The first tap was recorded slightly later at the very beginning of the SMS task (interquartile range: 2-4) and generally shifted to an earlier repetition cycle at the end of the SMS session (interquartile range: 2-3). Only on a few trials, participants started to synchronize as late as during the 10th or the 11th repetition cycle. The location of the first synchronization attempt within the loop was unaffected by the self-reported confidence ratings participants provided upon completion of the task, or by their musicality score. The order of tasks (SMS first or NMR first) did not have any impact, either. The time-point at which tapping performance stabilized also differed across participants. Figure 11 displays examples of the time-series data collected for the participants P02 (A) and P12 (B) during an SMS trial with the target sentence S2 ("I took the prize"). All taps collected for each participant are plotted along the x-axis where 1 demarcates the first tap recorded. If participants had tapped to every single vowel from the very first repetition of the sentence in the loop, the total number of taps would be (4 vowels × 20 repetitions =) 80, which was not the case for either participant in the example. Instead, there was a lot of individual variability. The overall number of taps available per trial differs, depending on when the participant started tapping and how many vowels they synchronized with (P02 produced more taps than P12).
Footnote 2: A correlation test (see supplementary materials) indicated that rise-time and rise-slope were not correlated in these data (r = -0.12, n.s.), i.e., each affected the participants' SMS-performance independently.
The y-axis in Figure 11 displays signed asynchronies (in ms) where 0 represents the vowel onset. The two chosen examples suggest that P02 started off tapping 10-20 ms ahead of the vocalic targets in this sentence and became consistently more accurate in synchronization with the vowel onset after s/he had produced 12 taps, while P12 started off lagging behind the vowel targets by 20-40 ms and reached a stably improved performance after s/he had tapped 17 times. Alternatively, these asynchronies could be interpreted as stable from the start of synchronization but timed with a different landmark at the beginning versus toward the end of a trial. Yet, this interpretation seems very unlikely. As shown in Figure 2, timing of every landmark varied quite substantially with respect to the vowel landmark. For example, maxE could occur before or after a vowel onset in two successive syllables. Such variability means that the trajectories plotted in Figure 11A or 11B would show little systematicity prior to the identified point of stability when synchronization with the vowel onset begins (which was clearly not the case).
These time-series data were analyzed using R-library changepoint (Killick & Eckley, 2014). For each participant and sentence, we identified the individual point of change in the synchronization accuracy by examining global fluctuations in the mean and variance. We further calculated the number of sentence repetitions that participants required as input as well as the number of tapping cycles that participants performed until they achieved stable synchronization (as measured by the mean and variance in their signed asynchronies). In Figure 11, horizontal red lines show estimated means of asynchronies. The local discontinuity between the two fitted lines indicates the location of the change point. On average, participants performed 5 tapping cycles of each sentence (interquartile range: 3-7) until they reached the point of stability in their synchronization. Each of these tapping cycles could consist of 10-20 taps, depending on the participant's performance. None of the hypothesized predictors (confidence ratings, musicality, order of tasks) had an influence on the individually achieved point of synchronization stability.
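The idea of locating a point of change in both the mean and the variance of signed asynchronies can be sketched in Python as follows. This is a simplified illustration and not the procedure of the changepoint package: it assumes at most one change point, a Gaussian likelihood, and no penalty term, and the text does not state which settings were used in the study. The function names and the asynchrony values are hypothetical.

```python
import math
import statistics


def segment_cost(values, min_var=1e-6):
    """n * log(variance): Gaussian -2 log-likelihood up to additive constants."""
    var = max(statistics.pvariance(values), min_var)
    return len(values) * math.log(var)


def single_change_point(values, min_segment=3):
    """Index that best splits the series into two segments, each with its own mean and variance."""
    best_tau, best_cost = None, float("inf")
    for tau in range(min_segment, len(values) - min_segment + 1):
        cost = segment_cost(values[:tau]) + segment_cost(values[tau:])
        if cost < best_cost:
            best_tau, best_cost = tau, cost
    return best_tau


# Hypothetical signed asynchronies (ms): 12 anticipatory, variable taps
# followed by 12 taps close to the vowel onset.
early = [-18, -12, -20, -10, -16, -14, -19, -11, -17, -13, -15, -12]
late = [-2, 1, 0, -1, 2, 0, 1, -1, 0, 1, -2, 0]
asynchronies = early + late

tau = single_change_point(asynchronies)
mean_before = statistics.mean(asynchronies[:tau])
mean_after = statistics.mean(asynchronies[tau:])
```

The two segment means correspond to the horizontal lines fitted before and after the change point in plots of this kind. A penalized criterion would additionally be needed to decide whether any change occurred at all.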

Period tracking in SMS versus NMR
To understand how well participants were able to track the beat period in speech stimuli in the two tasks under investigation, we compared SMS and NMR in two aspects: (1) how successful the two tasks were in making participants deviate from their spontaneous tapping rates (measured by BAASTA, Dalla ; and (2) if both tasks induced convergence between participants' tapping rates and IOI of the intervocalic intervals of the test sentences. A mixed-effects regression was fitted to the dependent variable mean ITI per sentence and participant. We tested for two interactions, namely (1) between the task and the participant's unpaced spontaneous tapping rate and (2) between the task and vocalic IOI. We also fitted the order of tasks (SMS first/NMR first) as a predictor to control for a potential task order effect. Participant and sentence were defined as random effects. Again, we started with a maximal random effect structure and iteratively simplified it if the model failed to converge or produced a singular fit. A change of the default optimizer (to 'optimx'; John et al., 2020) counteracted some of the convergence issues. The likelihood ratio test helped to determine the best-fit model which is given in the supplementary materials.
Both interactions were significant in the best-fit model (see Tables 7 and 8). Accordingly, larger intervocalic intervals significantly increased ITI, with a positive linear relationship in both tasks (t = 4.76, p < .01). However, an increase in vocalic IOI showed a notably smaller effect on the increase of ITI in NMR than in SMS (t = -2.63, p < .01). While individual tapping rates did not have any effect on ITI obtained in the SMS task, ITI in the NMR task tended to show longer durations if participants' spontaneous tapping tempo also had longer ITI (t = 2.08, p < .05). These findings demonstrated that period tracking was present in both tasks, though it had a subtler effect in NMR whose ITI drifted toward the participant's preferred individual tapping tempo in the absence of a simultaneous auditory signal. Crucially, an ITI regularization could also be observed in the NMR task. According to an additional model fit to the CV of ITI (see Tables 9 and 10), this dependent variable differed significantly across the two tasks, showing that NMR led to less variability across ITI than SMS did (t = -2.55, p < .05). That is, taps were paced more regularly in NMR.
To test whether or not the above effects were merely consequences of a self-sustained, repeated movement in the SMS task, in contrast to the NMR task which generally led to fewer taps, we compared SMS and NMR data collected on the very first tapping trial. However, these comparisons produced comparable results.

Discussion
The present study was conducted to examine the suitability of two movement-based paradigms, synchronization (SMS) and reproduction (NMR), for the study of rhythm perception in natural language, and to provide empirical evidence on the settings of such a paradigm. Below, we discuss the results with reference to our original research aims and hypotheses and comment on how they compare to previous research with other types of auditory stimuli.

Suitability of motor paradigms for linguistic rhythm research
The present study demonstrates that motor paradigms are suitable tools to investigate rhythm perception in language. Our results suggest that particularly SMS is informative and better suited than NMR to support rhythm research in language. Our version of SMS with natural language produced consistent patterns of synchronization with vowel onsets, thus replicating our previous results with simpler verbal stimuli (Rathcke et al., 2019). In our version of NMR, a certain level of period tracking could also be observed. However, the NMR results showed weaker relations to linguistic stimuli and an overall trend to converge towards participants' spontaneous tapping rates. Alongside this shift toward individually preferred tapping rates, the overall variability of inter-tap intervals was reduced, suggesting that participants tapped more regularly. This result parallels previous research with a similar motor reproduction paradigm (see Donovan & Darwin, 1979;Scott et al., 1985), as well as with other paradigms that involve beat tracking during speech production (Jungers, Palmer, & Speer, 2002;Port et al., 1995).
When movement is not synchronized in time with an auditory signal, temporal regularization of ITI prevails in linguistic stimuli, though not in other types of stimuli (see Donovan & Darwin, 1979;Scott et al., 1985). Such regularization is likely to arise due to a high level of rhythmic complexity in language that lacks temporal isochrony (Dauer, 1983;Pointon, 1980;Roach, 1982;Uldall, 1971;van Santen & Shih, 2000) while employing a highly intricate hierarchy of nested constituents (cf. Nespor & Vogel, 1986;Selkirk, 1984) and prominence alternations (Liberman & Prince, 1977;Prince, 1983). In the context of such complexity, the superiority of SMS is in line with previous research that demonstrated that movement along a complex rhythm facilitates the discovery of its beat (Su & Pöppel, 2012). Such an advantage of a synchronized movement possibly arises due to an enhanced internal representation of an auditory rhythm that accompanies movement (Chemin, Mouraux, & Nozaradan, 2014). In contrast, movement without a concurrent auditory signal relies more heavily on an internal representation of temporal patterns, thus increasing working memory load and the associated processing cost (Repp & Su, 2013). Recent evidence suggests that tapping without a concurrent auditory signal might be a more demanding task than SMS (Koch, Oliveri, & Caltagirone, 2009;Lewis, Wing, Pope, Praamstra, & Miall, 2004). More specifically, reproduction or continuation tasks with a metronome, whose settings are quite similar to the NMR task with language in our experiment, seem to be placing higher demands on both working memory (Jantzen, Oullier, Marshall, Steinberg, & Kelso, 2007;Koch et al., 2009) and motor timing abilities (Serrien, 2008). In contrast, SMS is likely to enhance basic perceptual abilities (Valdesolo, Ouyang, & DeSteno, 2010).
These findings provide some explanations as to why participants might have tended to converge toward their spontaneous tapping rates in the NMR but not in the SMS task, as well as why SMS might be superior to NMR in the context of beat perception and rhythmic processing in language.

SMS anchors and acoustic influences on SMS
As hypothesized, SMS with language can produce systematic responses to the temporal structure of natural spoken sentences. The present study tested five potential anchors of synchronization, including onsets of linguistic units (syllables, vowels) and acoustic landmarks (local maxima of energy or amplitude, local changes in the smoothed energy contour). The shortest asynchronies were observed between SMS peaks and nearby vowel onsets, followed by the moment of the fastest change in the smoothed energy contour (maxD) and the local amplitude maximum (LAM). The numerical difference in synchronization accuracy with these three landmarks was not significant at the Bonferroni-corrected α-level, though vowel onsets produced smallest asynchronies. In contrast, anchoring taps to syllable onsets and local maxima in the energy contours led to a significantly deteriorated accuracy in the participants' performance. These results indicate that vowel onsets seem to reliably attract taps not only in simple verbal prompts (Rathcke et al., 2019), but also in complex temporal patterns of natural spoken language. Recent evidence from naturally evolved drummed languages like Amazonian Bora further corroborates this finding. In drummed Bora, rhythmic units have also been shown to consistently match intervocalic intervals, irrespective of syllable complexity (Seifart, Meyer, Grawunder, & Dentel, 2018). Vowels play an important role in shaping the trajectory of the sonority contour in speech signal, frequently constituting local sonority peaks (Morgan & Fosler-Lussier, 1998;Wang & Narayanan, 2007). The sonority contour reflects variable degrees of energy emanating from the vocal tract during speech production and is particularly high for open vowels. The cyclical production of vowel gestures in connected speech has been previously highlighted as one of the potential reasons why spoken language might be rhythmic in nature (Fowler, 1983;Fowler & Tassinary, 1981). 
Local fluctuations in signal sonority related to vowel acoustics have also been argued to guide speech segmentation and to assist first language acquisition (Räsänen, Doyle, & Frank, 2018). It thus does not seem surprising that beat perception (at least in English) locks on to vocalic and not to syllabic onsets, though more research is needed to determine if beat perception in English involves tracking of vowels per se or rather tracking of nuclear constituents within larger units such as syllables.
Importantly, SMS-performance with the linguistic stimuli of the present study demonstrated anticipation of vowel targets (as indicated by negative mean asynchrony, cf. Aschersleben, 2002). Especially the first synchronization target within a sentence, i.e., the target that occurred after an acoustic silence, showed larger negative asynchronies and was thus more anticipated than all subsequent vocalic targets. This finding is in keeping with the existing evidence on the properties of SMS with other types of auditory stimuli. Anticipation seems to be a characteristic of non-musicians' synchronizing with a metronome signal where regular auditory prompts are interspersed with silences (Aschersleben, 2002;Repp, 2005). The negative mean asynchrony is reduced, or even disappears completely in more complex rhythmic contexts where synchronization targets are not separated by silences, e.g., in music (Thaut, Rathbun, & Miller, 1997;Wohlschläger & Koch, 2000). SMS in the present study was more precise with those vowels that had shorter rise-times. Unfortunately, rise-time (and also rise-slope) of an amplitude envelope is a complex acoustic measure that is influenced by many aspects of speech production. These properties can change depending on the manner and the place of articulation of the onset consonant(s), levels of syllabic prominence (weak, strong, accented, or emphatic), degrees of coarticulation, and syllable reduction. This underlying complexity impedes a meaningful interpretation of the rise-time contribution to beat perception (cf. Peelle & Davis, 2012), though interestingly, once again we find parallels to SMS with a metronome where shorter rise-times of tones have been shown to improve synchronization accuracy (Vos, van Kruysbergen, & Mates, 1995).

SMS sensitivity to the metrical structure
One of the crucial findings of the present study is that SMS is sensitive to the metrical structure of spoken sentences. In the present study, native English participants were more likely to tap to metrically strong than metrically weak syllables. These results are somewhat comparable with the findings by Allen (1972: 89) who concluded that English participants tend to "tap before the nuclear vowels of rhythmically accented syllables." Given that the prosodic system of English incorporates word-level stress and sentence-level alternations of strong and weak syllables (Liberman & Prince, 1977;Prince, 1983), it is highly likely that the prosodic system of a listener's native language plays a major role in inducing their feeling of a beat in speech and that rhythm perception in language might be a constructive perceptual process.

On the role of repetition in SMS
In our view, repetition is a crucial aspect of the success of the SMS-paradigm with language. Despite being a laboratory task, looping resonates with the idea that linguistic rhythm arises on large temporal scales through repeated experience with one's native language. Unlike other approaches that rely on special cases of language use like poetry, mantra, or chant (cf. Cummins, 2012) or on short, metrical, or regularized speech (Lidji et al., 2011), looping can be applied to any natural spoken utterances, leading to an increased ecological validity of the proposed paradigm (cf. Allen, 1975). The present SMS method creates a unique situation for unlocking the rhythmic structure of natural, unmanipulated language while bypassing other mechanisms of sentence processing (cf. Rathcke et al., 2018).
In the present experiment, most participants appeared to have created an internal representation of the sentence beat structure after two repetitions and could start synchronizing during the third presentation cycle of the sentence. As our results indicate, a total of three repetitions used in previous research (Lidji et al., 2011) is not quite sufficient to fully capture stable SMS patterns. For example, the kernel density fitting procedure relies on the presence of at least two events and can lead to missing data in shorter sentences or in participants who might require longer to entrain. Given the results of our time-series analyses, we recommend using at least 10 repetitions of a sentence to produce stable, consistent, and representative patterns across individual participants (e.g., Gérard & Rosenfeld, 1995;Pressing & Jolley-Rogers, 1997;Repp, 2005;Repp & Penel, 2002).
Finally, the results of the present study exclude the possibility that the speech-to-song illusion interferes with SMS patterns in a significant way. Both high- and low-transforming sentences tested in the present study produced similar results in terms of synchronization accuracy and targets. The only difference between high- and low-transforming sentences consisted in the overall number of recorded taps. Accordingly, sentences that led to more S2S transformations (Rathcke et al., 2018, forthcoming) also induced more taps. The reason for this effect is as yet not quite clear, though it might be related to a higher level of the overall signal sonority in the high-transforming set (Rathcke et al., 2018, forthcoming). It is, however, clear that the speech-to-song illusion is not a core prerequisite for a successful application of the SMS paradigm to language, which is in line with our previous work showing that S2S transformations rely more heavily on pitch- than on time-related features of speech (Falk et al., 2014;Falk & Rathcke, 2010).

Individual variability in SMS with language
As expected, we found some individual variation in the SMS task. Some aspects of this variation, e.g., SMS accuracy, could be partially explained by individually varying levels of musical training. Participants produced lower asynchronies if they had higher levels of musical training and experience (which included playing an instrument, singing, and dancing). Musical sophistication is also known to decrease error and variability in synchronization with a metronome in non-professional musicians (e.g., Gérard & Rosenfeld, 1995;Pressing & Jolley-Rogers, 1997;Repp & Penel, 2002). However, measures of general synchronization performance with music and metronome employed in the present study (BAASTA, Dalla  did not help to explain individual variability in SMS with speech. Reasons for this might be multiple (e.g., different mechanisms of rhythm perception in isochronous versus non-isochronous signals or in non-speech versus speech) and require further investigation.

Conclusions and outlook
The two movement-based paradigms that were elaborated and tested in the present study view language rhythm as a consequence of general internal timekeeping mechanisms that allow us to synchronize, anticipate, and adapt our behaviour in response to an external stimulus (Repp, 2005). We showed that SMS performance can be successfully used with linguistic stimuli and that SMS patterns resemble well-documented findings of SMS with metronome and music (Aschersleben, 2002;Repp, 2005;Repp & Su, 2013) in listeners of various degrees of musical training (Gérard & Rosenfeld, 1995;Pressing & Jolley-Rogers, 1997;Repp & Penel, 2002). Like music, beat perception in language can be linked to temporal expectancy and prediction of upcoming events, and we showed that such expectancies can be elicited during SMS with spoken sentences presented in a loop. Our study further demonstrated that vowels constitute the most likely rhythmic anchors in language, though more work is required with diverse languages to establish if the present finding generalizes beyond English. An alternative movement task, NMR, showed some potential to engage listeners' capacity to extract rhythmic patterns from speech, though it also tended to evoke motor regularization arising from preferred individual finger-tapping rates.
In sum, the present study demonstrates that natural language can entrain movement. Our setting of the SMS paradigm is a valid experimental paradigm to study beat tracking and rhythm extraction in linguistic stimuli of different degrees of complexity, and can be used in future work to answer many open questions on rhythm perception and cognition across prosodically diverse languages.