Auditory-motor synchronization varies among individuals and is critically shaped by acoustic features

The ability to synchronize body movements with quasi-regular auditory stimuli represents a fundamental trait in humans at the core of speech and music. Despite the long history of research on this ability, little attention has been paid to how the acoustic features of the stimuli and individual differences can modulate auditory-motor synchrony. Here, by exploring auditory-motor synchronization abilities across different effectors and types of stimuli, we revealed that this capability is more restricted than previously assumed. While the general population can synchronize to sequences composed of repetitions of the same acoustic unit, synchrony in a subgroup of participants is impaired when the unit's identity varies across the sequence. In addition, synchronization in this group can be temporarily restored by priming with a facilitator stimulus. Auditory-motor integration is stable across effectors, supporting the hypothesis of a central clock mechanism subserving the different articulators but critically shaped by the acoustic features of the stimulus and individual abilities.

6. Line 146. The authors show that the low synchronizers are actually synchronizing instead of reacting. It would be helpful to read a clear interpretation of phase lag sign and behavior (well before reaching the Discussion, where there is a paragraph dedicated to this). For instance, in the Introduction the authors note that in finger tapping the taps typically precede the tones by some tens of milliseconds. In this work, does a positive phase lag mean that actions succeed stimuli? (Additionally, in that case, please consider that the 222 ms period plus the observed 33 ms lag would indeed be compatible with a reaction time, i.e. reacting to the stimulus before. Please discuss.)

7. Line 401. The phase lag is an angle. Wasn't the mean phase lag computed via circular statistics?

8. Line 426. Please provide details about the random forest clustering. Was it computed before the t-SNE reduction, or after?

9. Figure 1a. The first triangle is smaller than the others, does it mean anything?

10. Figure 1b. Please indicate axis titles.

11. Figures 2, 3 and 4. Please indicate whether the p-values are corrected for multiple comparisons like in Figure 1.

12. Figure 4. Panel b shows that in the baseline condition the subjects are all consistently bad at synchronizing. In the other conditions there seems to be substantial variability across subjects (it is mean +/- SD across subjects, right? It is not stated), as if only some of them can actually make use of the priming. Please discuss.
Reviewer #2 (Remarks to the Author):

The authors investigated individual differences in how acoustic features of stimuli affect sensorimotor synchronization (SMS) abilities in both clapping and speaking modalities. The results indicate a subpopulation struggles with SMS accuracy to rhythmic spoken syllables and tones, but only when synchronizing with stimuli that vary over a sequence, such as with tones of differing pitches, or different spoken syllables. In addition, differences were seen across genders in the low-performing group, with males outperforming females when clapping, but not for vocal synchronization, and when synchronizing with tones, but not with vocals. Further findings indicate a cross-effector-modality priming effect, which the authors interpret as evidence of modality-general effector timekeeping. The overall results highlight the importance of factoring in individual differences when researching SMS capabilities.
Overall, this paper is well written with findings that should be of interest across multiple subfields. I have one suggestion and a number of minor points that will need clarification (mostly in the methods).

I suggest the authors may wish to apply an autocorrelation analysis to the effector data. Lag-1 autocorrelation can reveal the amount of error correction that occurs during a synchronization task. This may be of use when comparing across groups, and between the training and synchronization conditions, where you would expect different lag-1 autocorrelation results. Here is an example: Iversen, J. R., Patel, A. D., Nicodemus, B., & Emmorey, K. (2015). Synchronization to auditory and visual rhythms in hearing and deaf individuals. Cognition, 134, 232-244.

See the rest of my comments below:

Intro / Results
1) 130: is this result just for the low synchrony group?
2) 134: saying "the presence or absence of the stimulus does not modify the behavior of low synchronizers" may be overstated. For example, there may be differences related to error correction (lag-1 autocorrelation) that do not show up in the mean phase measurements.
3) 160: is this experiment 2? Please clearly indicate so it can be tied to the methods.
4) 187: This is experiment 3?
5) 201: is this a 4th experiment?

Discussion
6) 265: Confusing/missing wording. I do not understand this sentence: "However, our pattern of results evidence that synchronization is less ubiquitous than previously assumed" Please clarify.
Methods
7) 334: How were the numbers of participants for the different experiments determined?
8) 352-353: Was volume level measured? Variation may play a role in the results.
9) 364: The 4.3-4.7 Hz stimulus presentation rate is very fast for any SMS task. How was the presentation rate determined?
10) 366: Were the spoken stimuli in a male voice/female voice? Was the voice always the same?
11) 370: What are the duration, and the rise and fall times of the tones?
12) 370-378: How was the tone frequency range determined? Were they based on previous works?
13) 374: Please describe the sine function to facilitate replication.
14) 382: How was it determined to use the 'tah' effector, as opposed to 'dah' or 'bah', for example?
15) 364: So, seven total runs for each stimulus? Please clearly indicate the total number of runs for each condition/stimulus type.
16) 408: It is stated the stimuli are presented at a mean of 70 dB, yet earlier it is stated that subjects adjusted volume? Please clarify.
17) How were effector onsets determined? By algorithm? By hand? What software was used? Related, how was alignment determined between stimuli timing & effector timing?
Reviewer #1

The authors study sensorimotor synchronization to auditory stimuli. Experimental manipulations include stimulus type (speech/tones) and effector (vocal/clap), and additional analysis factors include synchronization ability (low/high) and gender. They find interesting interaction effects, e.g. tones vs. speech improve synchrony but only for low synchronizers, and hands improve synchrony but only for high synchronizers. The manuscript is well-written. The introduction is succinct and clear, the results are well organized, and the discussion is thorough. The quality of the figures is very good and they have a clear relationship with the results. I have some concerns, though, so I recommend revision before publication.

We thank the Reviewer for the positive overall assessment of our work. We integrated all his/her suggestions into this new version of the manuscript. Please find below a point-by-point answer to each of the raised comments. Here, as well as in the new version of the manuscript uploaded to the system, all the modifications are highlighted in red.

MAJOR COMMENTS
1. In subsection "Auditory-motor synchronization..." the authors make the claim that synchronization is impaired (instead of just reduced) for low synchronizers with speech stimulus. To show this they take the training data (i.e. free motor gestures, no stimuli--traditionally known as continuation) and pair them with an imaginary constant-period sequence of stimuli to compute a surrogate synchronization value. Then the training step and main task are compared. 1.a. (Minor suggestion) I suggest changing the wording "synchrony" and "synchronization" when referring to the training data, as subjects were not actually synchronizing to anything but in "free clapping/speaking". At least enclose it in quotes with a modifier like "surrogate" or "sham" as in Figure 2. 1.b. (Major concern) The authors conclude that "the presence or absence of the stimulus does not modify the behavior of low synchronizers". I find this conclusion hardly supported by the data. First, the stimuli sequence in the main task is continuously accelerating from 4.3 Hz to 4.7 Hz, do the authors claim that the subjects were not adjusting their motor actions to it? Second, the authors are comparing two different tasks that are most likely supported by different brain regions (Lewis et al 2004, Teghil et al 2019, Repp & Su 2013. Third, in order to show that synchronization is impaired I would suggest comparing it to random (continuation is not), for instance by computing the PLV after shuffling stimuli and responses. I suggest revising this subsection. I think the rest of the manuscript would do just fine if "impaired" is replaced by "reduced/altered".
We thank the Reviewer for this thoughtful comment. Following his/her advice, we changed the word "synchrony" to "sham synchrony" when referring to the training data, and we added an extra analysis using a surrogate audio. We believe that this extra analysis strengthens our results by addressing the Reviewer's concerns ("First, the stimuli sequence in the main task is continuously accelerating from 4.3 Hz to 4.7 Hz, do the authors claim that the subjects were not adjusting their motor actions to it? Second, the authors are comparing two different tasks that are most likely supported by different brain regions"). Following the Reviewer's advice, we estimated a surrogate synchrony by pairing the sounds produced during the main task with an audio keeping the same acoustic features as the presented stimulus but with a fixed rate of 4.3 Hz. Results show that, for the low synchronizers, when the stimulus comprises speech the surrogate synchrony did not differ from the experimental one. This result shows that the low synchronizers do not adjust their motor actions to match the accelerating stimulus. This new analysis has been included in the Results section, and two extra panels were added to Figure 2 and Figure S1. The new Results text reads:

"Additionally, we explore whether the accelerating feature of the auditory stimulus impacts the rate of the low synchrony group. Are they increasing the rate of their speech, even if not matching the external frequency? To answer this question, we computed a "surrogate synchrony" between the sounds produced during the main task and an audio file comprising the same acoustic units as the experimental auditory stimuli but concatenated at a fixed rate of 4.3 Hz. For the low synchrony group, we found the same pattern of results as the one obtained for the sham synchrony. While the experimental synchrony (i.e., the one computed with the auditory stimulus accelerating from 4.3 to 4.7 Hz, the one the participants listened to) did not differ from the surrogate one for the speech-like stimulus, it surpassed it when the stimulus comprises tones (Fig. 2c&d; speech: t(20)=-8.017, pBonf<0.001 & tones: t(20)=-11.284, pBonf<0.001). Again, the same analysis for the high synchronizers reached significance for both stimulus types (Fig. S1c&d; speech: t(29)=-35.879, pBonf<0.001 & tones: t(29)=-41.291, pBonf<0.001).
These results show that, in a subgroup of the population, auditory-motor integration is impaired for the speech-like stimulus; neither the presence nor the absence of the stimulus, nor its accelerating nature, modifies the syllabic rate produced by the low synchronizers."
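For concreteness, a minimal Python sketch of this experimental-vs-surrogate comparison via the phase-locking value (PLV); this is illustrative only, not our analysis code, and all variable names are hypothetical:

```python
# Minimal sketch (illustrative, not the analysis code) of the
# experimental-vs-surrogate synchrony comparison via the PLV.
import numpy as np
from scipy.signal import hilbert

def plv(env_a, env_b):
    """Phase-locking value between two band-passed amplitude envelopes."""
    phase_a = np.angle(hilbert(env_a))  # instantaneous phase via Hilbert transform
    phase_b = np.angle(hilbert(env_b))
    return float(np.abs(np.mean(np.exp(1j * (phase_a - phase_b)))))

# Hypothetical inputs: env_response (participant's produced sounds),
# env_stimulus (accelerating 4.3-4.7 Hz stimulus), env_surrogate (fixed 4.3 Hz audio).
# experimental_sync = plv(env_response, env_stimulus)
# surrogate_sync    = plv(env_response, env_surrogate)
# If experimental_sync does not exceed surrogate_sync, the participant is not
# adjusting their motor output to the stimulus' acceleration.
```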

Supplementary Figure 1: High synchronizers' synchrony remains stable across stimuli. a&b. Sham synchrony (i.e., estimated with the whispering and clapping produced without auditory stimulation) compared to the one obtained during the main synchronization task. Average across effectors for the speech-like stimulus and tones in panels a and b, respectively (N=30). c&d. Surrogate synchrony (i.e., estimated between the sounds produced during the main task and a surrogate audio with a fixed rate of 4.3 Hz) compared against the experimental synchrony (i.e., the one estimated between the sounds produced during the main task and the accelerated stimulus presented to the participants). Average across effectors for the speech-like stimulus and tones in panels c and d, respectively (N=30). Dots represent mean values, bars SD; **p<0.001. e&f. Rose plots illustrate the histogram of the mean phase lag between the produced and the perceived sounds. Lag between perceived tones and: whispered "tahs" in panel e, hands clapping in panel f. All panels relate to Experiment 1.
2. Line 160. Subsection "Which auditory feature..." In Figure 3b the authors commpare between speech stimuli and same tone stimuli and conclude that the repetition of the same acoustic unit makes synchronization improve. There is an additional hint that repetition might have an effect by itself: comparison same tone vs 16 different tones. However, two aspects of the stimuli are changed in the comparison in Figure 3b: the repetition (same vs different) and the stimulus type (speech vs tone), so it is difficult to find a causal relationship between a single factor and the effect. Why didn't the authors compare between "speech" and "repeated speech"? (a stimulus of the same syllable repeated 16 times) In fact for low-synchronizers switching from speech to tones has a significant effect (Figure 1e), so this could potentially account for the whole effect shown in Figure 3b. I think you would need to control for it. I'm not asking for any additional experiment, but please discuss.  experiment 1, comprising the 16 different syllables). The other task, instead, comprised a "repeated speech" stimulus where all syllables were replaced by the syllable "go". Results show that synchronization significantly differs between tasks, with synchronization for "repeated speech" stimulus being higher than for the "variable speech" one ( Fig. S2 t(11)=−4.421, p=0.001). This outcome demonstrates that synchronization is restored when the same unit is repeated, independently of its particular acoustic features."

MINOR COMMENTS
3. What is the purpose of accelerating stimuli? (as opposed to the more traditional constant-period stimulus sequences)

The present study builds on previous work of our team showing a bimodal distribution in the general population when assessing speech-to-speech synchrony using two versions of the same protocol: the Implicit Fixed Version and the Explicit Accelerated one (Lizcano-Cortés et al., 2022). While the first version comprises a constant-period stimulus sequence, as suggested by the Reviewer, it has the disadvantage of the instructions being orthogonal to the synchronization task (i.e., participants are instructed to recall the perceived syllables; no explicit instruction is given about the expected syllabic rate). We previously received concerns about different subjects understanding the instructions differently, and this factor being the one leading to the bimodal outcome (although this explanation is hard to reconcile with the brain structural differences between the groups). For this reason, we chose the Explicit Accelerated version, which has also been shown to yield a bimodal distribution, and in which there is no plausible confusion in the interpretation of the instructions, since the experimenter directly asks participants to synchronize to the stimulus. Additionally, the stimulus comprising the sequence of tones was not compatible with the Implicit Fixed Version (i.e., it was not plausible to instruct the subject to recall the auditory units being repeated, since it was always the same tone).
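For readers unfamiliar with the Explicit Accelerated protocol, a short sketch of how onset times for a sequence sweeping from 4.3 to 4.7 Hz could be laid out; this is our illustration, and the number of units is a hypothetical placeholder:

```python
import numpy as np

def accelerating_onsets(f_start=4.3, f_end=4.7, n_units=240):
    """Onset times (s) for a sequence whose rate ramps linearly across units."""
    rates = np.linspace(f_start, f_end, n_units)  # per-unit instantaneous rate (Hz)
    iois = 1.0 / rates                            # inter-onset intervals shrink
    return np.concatenate(([0.0], np.cumsum(iois[:-1])))

onsets = accelerating_onsets()
# Early intervals are ~233 ms (1/4.3 Hz); the final ones approach ~213 ms (1/4.7 Hz).
```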
4. Please indicate which experiment you are analyzing in each subsection. For instance, I assume subsection "Which auditory feature..." describes results from experiment 2 based on the number of subjects (N=16), and that subsection "Can rhythmic priming..." describes experiment 3 (but then I found N=15 in the text and N=45 in Methods). Please clarify.
We apologize for the lack of clarity and for a mistake committed in the description of the groups. On one hand, the Reviewer is right: there is a mismatch between the N reported in Results and in Methods. There were 31 participants who completed Experiment 3 (16 high and 15 low synchronizers), not 45. We amended this in this new version of the manuscript. Additionally, to clarify which analysis was performed on each experiment, we specify this information in the caption of each figure. At the end of each caption, we added the following sentence: "All panels relate to Experiment n".

5. Line 84. "The synchronization values obtained for the four effector-stimulus combinations were submitted to a random-forest clustering algorithm". Please clearly state what the main observable "synchronization value" is. I imagine it is the 4D-vector PLV?
We clarified this in this new version of the manuscript.

"The synchronization values obtained for the four effector-stimulus combinations (i.e., a 4 dimensions PLVs vector per subject) were submitted to a random-forest clustering algorithm."
6. Line 146. The authors show that the low synchronizers are actually synchronizing instead of reacting. It would be helpful to read a clear interpretation of phase lag sign and behavior (well before reaching the Discussion where there is a paragraph dedicated to this). For instance, in the Introduction the authors note that in finger tapping the taps typically precede the tones by some tens of milliseconds. In this work, does a positive phase lag mean that actions succeed stimuli? (Additionally, in that case, please consider that the 222 ms period plus the observed 33 ms lag would indeed be compatible with a reaction time, i.e. reacting to the stimulus before. Please discuss.)

"For this goal, we computed the phase lag between the produced sounds (i.e., clapping and "tahs") and the train of tones. The phase lag has been estimated as the phase of the auditory stimulus minus the one of the participant's response (see Methods). Thus, a positive value indicates that the tone precedes the response. We found that when the effector is the vocal tract, participants have phase lag of 15.2° (
We believe that a misunderstanding with the lag sign led to the confusion regarding adding one period. According to the method used to estimate the phase, it is not necessary to add one period. To clarify this point, we computed the phase lag between two short exemplar signals (see Figure R1). We got a mean phase lag of 52.4°; since the rate of the signal is 4.4 Hz, this corresponds to approximately 33 ms, a value that aligns well with the observed lag between the signals.

7. Line 401. The phase lag is an angle. Wasn't the mean phase lag computed via circular statistics?

The Reviewer is right, we apologize for the inaccuracy. This new version of the manuscript reads:

"The mean phase lag was computed using circular statistics, as $\Delta\phi = \arg\left(\frac{1}{T}\sum_{t=1}^{T} e^{i(\theta_s(t)-\theta_r(t))}\right)$, where $\theta_r$ and $\theta_s$ are the phases of the envelopes of the participant's response and the acoustic stimulus, respectively, and T is the discretized total task time."
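A minimal sketch of that circular mean, including the degrees-to-milliseconds conversion used above; the code is illustrative, not the analysis pipeline:

```python
import numpy as np

def circular_mean_lag(theta_stim, theta_resp, rate_hz):
    """Circular mean phase lag in degrees and its equivalent time lag in ms."""
    diff = np.asarray(theta_stim) - np.asarray(theta_resp)
    dphi = np.angle(np.mean(np.exp(1j * diff)))  # mean angle in (-pi, pi]
    deg = np.degrees(dphi)
    ms = (deg / 360.0) * (1000.0 / rate_hz)      # one full cycle = one period
    return deg, ms

# A 52.4 degree lag at 4.4 Hz gives (52.4/360) * (1000/4.4) ~= 33 ms,
# with no extra period added.
```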
8. Line 426. Please provide details about the random forest clustering. Was it computed before the t-SNE reduction, or after?

10. Figure 1b. Please indicate axis titles.

The random forest clustering was computed on the PLV values; the t-SNE is only for visualization purposes and, as stated in the JASP manual, its axes are uninterpretable. According to the JASP manual:
t-SNE cluster plot: Generates a t-SNE plot of the clustering output. t-SNE plots are used for visualizing high-dimensional data in a low-dimensional space of two dimensions aiming to illustrate the relative distances between data observations. The t-SNE two-dimensional space makes the axes uninterpretable. A t-SNE plot seeks to give an impression of the relative distances between observations and clusters. To recreate the same t-SNE plot across several clustering analyses you can set their seed to the same value, as the t-SNE algorithm uses random starting values.
To avoid confusions, the main text now reads:

"The synchronization values obtained for the four effector-stimulus combinations (i.e., a 4 dimensions PLVs vector per subject) were submitted to a random-forest clustering algorithm."
And the Figure's caption states:

"b. t-Distributed Stochastic Neighbor Embedding of the synchronization data. This panel is only for visualization purposes, it illustrates the relative distance between the four-dimensional synchronization measurements, axis are uninterpretable"
9. Figure 1a. The first triangle is smaller than the others, does it mean anything?
The triangle is smaller because we cut the audio signal at a random time and the tone started some milliseconds before. To avoid confusion, we modified the figure, choosing a starting time point corresponding to a silence in the sequence of tones. See new Figure 1.

11. Figures 2, 3 and 4. Please indicate whether the p-values are corrected for multiple comparisons like in Figure 1.

In this new version we reported all the corrected p values and explicitly stated so.
12. Figure 4. Panel b shows that in the baseline condition the subjects are all consistently bad at synchronizing. In the other conditions there seems to be substantial variability across subjects (it is mean +/-SD across subjects, right? It is not stated), as if only some of them can actually make use of the priming. Please discuss.
Reviewer is right, we added the following sentence, highlighted in red, to the Discussion: "… (Fig. 4b)"

Reviewer #2
The authors investigated individual differences in how acoustic features of stimuli affect sensorimotor synchronization (SMS) abilities in both clapping and speaking modalities. The results indicate a subpopulation struggles with SMS accuracy to rhythmic spoken syllables and tones, but only when synchronizing with stimuli that vary over a sequence, such as with tones of differing pitches, or different spoken syllables. In addition, differences were seen across genders in the low-performing group, with males outperforming females when clapping, but not for vocal synchronization, and when synchronizing with tones, but not with vocals. Further findings indicate a cross-effector-modality priming effect, which the authors interpret as evidence of modality-general effector timekeeping. The overall results highlight the importance of factoring in individual differences when researching SMS capabilities. Overall, this paper is well written with findings that should be of interest across multiple subfields.
We very much thank the Reviewer for his/her supportive statements. We modified our Manuscript according to his/her comments, which we believe strongly increased the quality of the study. Please, find below an answer for each of the raised concerns. Here, as well as in the new version of the manuscript uploaded to the system, you can find all the modifications done highlighted in red.
I have one suggestion and a number of minor points that will need clarification (mostly in the methods). I suggest the authors may wish to apply an autocorrelation analysis to the effector data. Lag-1 autocorrelation can reveal the amount of error correction that occurs during a synchronization task. This may be of use when comparing across groups, and between the training and synchronization conditions, where you would expect different lag-1 autocorrelation results. Here is an example: Iversen, J. R., Patel, A. D., Nicodemus, B., & Emmorey, K. (2015). Synchronization to auditory and visual rhythms in hearing and deaf individuals. Cognition, 134, 232-244.
While we acknowledge that lag-1, as well as higher-lag autocorrelations, represent informative measures about the sensorimotor synchronization process, this kind of analysis does not fit well with our experimental design. Our main goal was to assess differences between participants in their degree of synchrony and how those differences are modulated by different effector-stimulus combinations. We aimed to better understand previous work identifying a two-group segregation in the speech-to-speech synchronization abilities of the general population, predicting cognitive skills as well as functional and brain features (Assaneo et al., 2019). A deeper exploration of the specific features of how the synchrony is established was out of the scope of this study. For that reason, we designed a paradigm in which the participants' audible responses were recorded in parallel with the perceived sounds, and the synchrony was estimated as the PLV between the envelopes of these two acoustic signals. The envelopes were filtered between 3.3 and 5.7 Hz, i.e., +/-1 Hz around the syllabic rate, which starts at 4.3 Hz and ends at 4.7 Hz (see Figure 1). We did not compute each response onset; instead, we estimated the synchrony between the two continuous time signals (Lizcano-Cortés et al., 2022). In the present dataset, while it would be relatively easy to define onsets for the clapping and tones, it is not so straightforward for the speech audios (see Figure 1). This work represents a first attempt to characterize individual differences in the degree of synchrony across effectors and stimuli, and is successful in the sense that it identifies which acoustic features altered synchrony in a subgroup of the population. We believe that assessing lag-1, lag-2 and lag-3 autocorrelations could represent the goal of a follow-up study specifically designed for this purpose. The speech stimulus can be constructed by concatenating consonant-vowel syllables (instead of being coarticulated as in this case) starting with an unvoiced stop consonant to get a clear auditory onset. Participants' responses can be collected using electropalatography (https://icspeech.com/electropalatography.html) to get the precise time point at which the tongue touches the palate.
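A sketch of that envelope pipeline; the sampling rate, filter order, and function names are our assumptions, not the exact analysis code:

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def bandpassed_envelope(audio, fs, lo=3.3, hi=5.7, order=4):
    """Amplitude envelope of `audio`, band-passed +/-1 Hz around the syllabic rate."""
    env = np.abs(hilbert(audio))                         # broadband amplitude envelope
    b, a = butter(order, [lo, hi], btype='bandpass', fs=fs)
    return filtfilt(b, a, env)                           # zero-phase band-pass filter

# Synchrony is then the PLV between the Hilbert phases of the stimulus and
# response envelopes (see the plv() sketch earlier); no onset detection is needed.
```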

Lizcano-Cortés, F., Gómez-Varela, I., Mares, C., Wallisch, P., Orpella, J., Poeppel, D., & Assaneo, M. F. (2022). Speech-to-Speech Synchronization protocol to classify human participants as high or low auditory-motor synchronizers. STAR Protocols.
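For completeness, the reviewer-suggested measure is simple to compute once response onsets are available (as discussed above); a minimal sketch over a hypothetical series of stimulus-response asynchronies:

```python
import numpy as np

def lag1_autocorrelation(asynchronies):
    """Lag-1 autocorrelation of onset asynchronies; its strength indexes
    trial-to-trial error correction (cf. Iversen et al., 2015)."""
    x = np.asarray(asynchronies, dtype=float)
    x = x - x.mean()
    return float(np.sum(x[:-1] * x[1:]) / np.sum(x * x))
```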
See the rest of my comments below:

Intro / Results
1) 130: is this result just for the low synchrony group?

Reviewer is right, we clarified this in this new version of the manuscript. Line 104 now reads:
"For the low synchrony group, we found that …" 2) 134: saying "the presence or absence of the stimulus does not modify the behavior of low synchronizers" may be overstated. For example, there may be differences related to error correction (lag 1 auto-correlation) that do not show up in the mean phase measurements.
In line with the Reviewer's observation, we modified this sentence to: "These results show that, in a subgroup of the population, auditory-motor integration is impaired for the speech-like stimulus; neither the presence nor the absence of the stimulus, nor its accelerating nature, modifies the syllabic rate produced by the low synchronizers."

3) 160: is this experiment 2? Please clearly indicate so it can be tied to the methods.
4) 187: This is experiment 3?
5) 201: is this a 4th experiment?
We apologize for the lack of clarity. In this new version of the manuscript, for each analysis we clearly stated to which experiment it belongs. At the end of each figure caption, we added the following sentence: "All panels relate to Experiment n".

Discussion
6) 265: Confusing/missing wording. I do not understand this sentence: "However, our pattern of results evidence that synchronization is less ubiquitous than previously assumed" Please clarify.

We hope this modified version is clearer; we would be happy to clarify further if the Reviewer considers it necessary:
"However, our pattern of results evidence that synchronization is more restricted than previously assumed."

Versaci, L., & Laje, R. (2021). Time-oriented attention improves accuracy in a paced finger-tapping task. European Journal of Neuroscience.
To clarify this procedure, we included the following paragraph in the Participants section:

"The number of participants were determined based on previous studies with similar protocols. Previous work exploring the auditory-motor synchronization abilities in the general population reported positive results with numbers of participants in the same order of magnitude as in experiment 1 1,2 . For the 2 other experiments the number of participants was smaller since subjects were already classified as high or low synchronizers (experiment 2: 16 low synchronizers; experiment 3: 15 low synchronizers and 16 high synchronizers). When dealing with better characterized populations positive results have been reported with number of participants in the range of 10 to 15 3-5 ."
8) 352-353: Was volume level measured? Variation may play a role in the results.

Volume level was not measured in this study, because in a previous work we showed that it plays no role in the high vs. low synchronizers classification. See: Lizcano-Cortés, F., Gómez-Varela, I., Mares, C., Wallisch, P., Orpella, J., Poeppel, D., & Assaneo, M. F. (2022). Speech-to-Speech Synchronization protocol to classify human participants as high or low auditory-motor synchronizers. STAR Protocols.