Music of infant-directed singing entrains infants’ social visual behavior

Significance Singing to infants is observed in all human cultures. Beyond known roles in infant soothing and social bonding, this study shows that singing to infants elicits physical entrainment, an elemental phenomenon enabling synchronization of wide-ranging physical and biological processes. Here, singing synchronizes subsecond social visual interactions between infants and adults. When an adult sings expressively to an infant, infants often increase their looking to the adult’s eyes around the musical beats. Adults also synchronize their facial expressions in time to their singing, presenting facial expressions on the beats that are more positive and engaging than those between beats. The rhythm of infant-directed singing thus entrains rich social-communicative interpersonal engagement, providing a remarkable yet ready-made means of supporting infants’ social development.

Correspondence and requests for materials should be addressed to Miriam.Lense@vanderbilt.edu or Warren.Jones@emory.edu

[…] rather, abiding here by a narrower definition allows the behavior of human infants to be studied within a mathematical framework that is common to other studies of elemental entrainment processes (from mechanical and electrochemical coupling (1), to phase-locking of cells in a network (4), to the synchronization of animals' activity (5, 6)).

Audiovisual Recordings
Children watched audiovisual recordings of actresses singing common infant-directed songs (e.g., "Twinkle, Twinkle Little Star", "Old MacDonald"). Nine audiovisual recordings were presented, each with an average duration of 23.6 s (SD = 3.7 s; range 18.2-29.4 s). In total, the 9 audiovisual recordings comprised 227 beats (consistent with song notations). Actresses in each recording were filmed singing directly into the camera (to engage the onlooking child) in front of a background decorated like a child's room, with toys, pictures, and stuffed animals (see Figure S1A). Actresses were non-professional singers, with naturally-occurring variation in tempo, amplitude, and tone, instructed to sing as if they were engaging with an infant: the average inter-beat interval across all songs (strong/weak beat metric structure) was 434 ms (SD = 112 ms; 138 beats per minute); the average coefficient of variation was 12.7% (SD = 1.9%).

We used audiovisual recordings of infant-directed singing to create an explicit, unidirectional test of infant entrainment: while coordination of actual infant-caregiver interaction is, of course, bidirectional (9, 10), in our experimental design, infant behavior could have no effect on caregivers (the audiovisual recordings); if the two became synchronized, the effect would necessarily be due to infant entrainment to caregiver cueing (rather than caregiver accommodation). This experimental design is critical for this initial investigation of infant entrainment and lays the groundwork for future studies of mutual entrainment. However, we note that more complex mathematical techniques will need to be employed when investigating dynamic, bidirectionally coupled systems, especially in light of potential confounds of a conscious, accommodating partner (e.g., as demonstrated in more simplified systems in (11-13)); the challenges of measuring mutual entrainment of caregiver and infant behaviors, which can be either automatic/reflexive or volitional/consciously controlled, are different from those encountered when measuring phase-locking of two signals not under conscious control (such as EEG).

[…] The average inter-beat interval across replication stimuli was 488 ms (SD = 119 ms; 123 beats per minute); the average coefficient of variation was 13.2% (SD = 1.0%). As in the original experiment, replication set stimuli were interleaved with two other types of stimuli as part of ongoing experiments not analyzed here (naturalistic scenes of infant-directed speech, as in (7), and scenes of other children at play (8)).

In addition, replication set videos were interleaved with reduced predictability stimuli (see next section below). The experimental design decision to use fewer audiovisual recordings to test for replication (4 rather than 9) was made to allow for additional data collection time to test effects of reduced predictability: because replication of the original entrainment finding is a necessary prerequisite to meaningfully disrupting it, we needed to present both original waveform videos (for replication of the main finding) and reduced predictability stimuli (for the new experiment). Consequently, within the same total duration of viable testing time for infants, half of the videos presented in replication were original (unmanipulated) audiovisual recordings and half were reduced predictability stimuli.

Reduced Predictability Stimuli
To test whether entrainment in infant eye-looking does or does not depend upon rhythmic predictability of caregiver cueing, we experimentally manipulated original audiovisual recordings to make jittered versions of each recording, resulting in reduced rhythmic predictability. Specifically, we re-sampled the original audiovisual recordings, which had naturally varying but predictable inter-beat intervals, to instead reduce their predictability: in each song, two-thirds of the inter-beat intervals were randomly varied by +/-30% of their original duration, disrupting the original rhythmic structure and reducing beat-to-beat predictability (main text Figure 4). The manipulation of inter-beat intervals in audiovisual stimuli was accomplished via granular resynthesis in Ableton Live, simultaneously warping audio and visual signals to ensure fully synchronous audiovisual stimuli. Jittered versions had a mean duration of 20.8 s (SD = 4.7 s; range 15.4-26.7 s). The average inter-beat interval was 492 ms (SD = 118 ms; 122 beats per minute); the average coefficient of variation was 28.4% (SD = 5.5%).

It is worth noting that the reduced predictability stimuli, while designed and implemented by means of changes in beat predictability, also provide strong experimental controls for both simple motion effects and for caregiver visual cueing, as the overall motion and visual facial cueing of the caregiver are preserved in the reduced predictability stimuli: i.e., the same range of head and facial motion and all affective cues are preserved and presented, while only their relative temporal predictability is manipulated. Stated differently, the reduced predictability stimuli present the same range in rigid head motion, range in facial feature motion, and range of facial expressions (all in the same spatial locations) as in the original audiovisual recordings, but the predictability in timing of when those events occur is disrupted.
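The timing perturbation itself can be illustrated in a few lines. Below is a minimal MATLAB sketch, assuming beat onsets are given as a vector of times in seconds; the variable names and example values are illustrative, and the actual audiovisual warping was performed via granular resynthesis in Ableton Live, not in code like this.

```matlab
% Minimal sketch of the inter-beat-interval (IBI) jitter: two-thirds of the
% IBIs are randomly scaled by up to +/-30% of their original duration.
beatTimes = [0.0 0.45 0.90 1.33 1.80 2.26];   % hypothetical beat onsets (s)
ibis = diff(beatTimes);                        % original inter-beat intervals

nJitter = round(2/3 * numel(ibis));            % jitter two-thirds of the IBIs
idx = randperm(numel(ibis), nJitter);          % randomly chosen intervals
scale = 1 + 0.30 * (2*rand(1, nJitter) - 1);   % uniform factors in [0.7, 1.3]
ibis(idx) = ibis(idx) .* scale;                % perturb selected intervals

% Rebuild beat onsets from the jittered intervals
jitteredBeats = [beatTimes(1), beatTimes(1) + cumsum(ibis)];
```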

Experimental Procedures and Data Collection
Data collection procedures matched those reported in (7). Infants sat in a reclined bassinet mounted on a table that was raised or lowered to ensure standardized position of infants' eyes relative to the display monitor (28 inches diagonally, subtending an approximately 24° x 32° portion of the infants' visual field). Lights in the room were dimmed. A parent or primary caregiver accompanied the infant at all times, but both the parent and experimenter were out of the infant's view during data collection. Experimenters monitored infants via the eye-tracking camera and a second video camera that displayed […]

Data collection began by presenting soothing but engaging videos to acclimate the child to the testing set-up (e.g., Baby Mozart). When the infant was attentive, a 5-point calibration scheme was presented utilizing audiovisual stimuli (spinning and/or flashing lights, cartoon animations, together with accompanying sounds). Calibration stimuli began as large targets (≥10° in horizontal and vertical dimensions) which then shrank (via animation) to their final size of 1° to 1.5° of visual angle. The calibration routine was followed by verification of calibration, in which more calibration targets were presented at any of nine on-screen locations. Throughout the remainder of the testing session, calibration targets were shown between experimental videos to measure possible drift in accuracy. After calibration checks, the system was re-calibrated if excessive drift (>3° of visual angle) in calibration accuracy occurred. Please see Data Processing: Calibration Accuracy below for measures of calibration accuracy.
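As an illustration of the drift criterion, the following MATLAB sketch computes the angular deviation between a known calibration-target location and the measured fixation; the coordinate values are hypothetical, and only the 3° threshold comes from the text.

```matlab
% Sketch of the between-video drift check: re-calibrate when the measured
% fixation deviates from the known target by more than 3 degrees.
target  = [12.0, 8.0];       % target location (degrees of visual angle)
gazePos = [12.8, 9.1];       % measured fixation location (degrees)

drift = hypot(gazePos(1) - target(1), gazePos(2) - target(2));
needsRecalibration = drift > 3;   % 3-degree criterion from the text
```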

Identification of Eye Movement Events
Analysis of eye movements and coding of fixation data were performed with software written in MATLAB (MathWorks). The first phase of analysis was an automated identification of non-fixation data comprising blinks, saccades, and any missing data or fixations directed away from the presentation screen. Saccades were identified by eye velocity using a threshold of 30° per second (14). We tested the velocity threshold with the 60-Hz eye-tracking system described above and, separately, with an eye-tracking system collecting data at 500 Hz (SensoMotoric Instruments GmbH). In both cases, saccades were identified with equivalent reliability compared with both hand coding of the raw eye-position data and high-speed video of the child's eyes. Blinks were identified as described in (15). Missing data and off-screen fixations (when a participant looked away from the video) were identified either by missing values in gaze vector data or by gaze vectors directed to locations beyond the stimuli presentation monitor.
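A minimal MATLAB sketch of the velocity-threshold step is given below, assuming gaze-position traces already expressed in degrees of visual angle and sampled at 60 Hz. The example data and onset bookkeeping are illustrative; only the 30°/s threshold comes from the text.

```matlab
% Sketch of velocity-threshold saccade detection (30 deg/s).
fs = 60;                          % sampling rate of the eye tracker (Hz)
t = (0:199) / fs;                 % 200 samples of hypothetical data
x = 5 * sin(2*pi*0.5*t);          % horizontal gaze position (deg)
y = zeros(size(t));               % vertical gaze position (deg)

vx = gradient(x) * fs;            % horizontal velocity (deg/s)
vy = gradient(y) * fs;            % vertical velocity (deg/s)
eyeSpeed = hypot(vx, vy);         % combined angular speed

isSaccade = eyeSpeed > 30;                        % velocity threshold
saccadeOnsets = find(diff([0, isSaccade]) == 1);  % sample indices of onsets
```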

Calibration Accuracy
Average calibration accuracy for all groups was less than 1° of visual angle. Figure S1C,D shows total variance in calibration accuracy, and Figure S1E,F shows average calibration accuracy. Calibration accuracy did not differ significantly between age groups (Figure S1E,F).

Minimum Valid Data Criterion
For each audiovisual recording, we used a minimum-valid-data criterion of fixation time greater than or equal to 20% of total recording duration, as in (8). We set no thresholds for either minimum number of audiovisual recordings or minimum number of beat trials sufficient for inclusion of an infant's data in analyses; if usable data were collected, with a given audiovisual recording fixated at a level greater than or equal to the minimum-valid criterion noted, then the infant's data were included. Of 9 possible audiovisual recordings (main experiment), the mean number included for 2-mo-olds was 4.3 (1.5) and for 6-mo-olds was 5 (2.5) (data given as mean (SD); t(110) = 1.71, p = 0.09). Mean number of beat trials per child at 2 months was 96.9 (38.0) (mean (SD)) and at 6 months was 105.3 (55.9) (t(110) = 0.92, p = 0.36) (Figure S1G).
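A minimal MATLAB sketch of the inclusion rule, with hypothetical per-recording fixation times and durations:

```matlab
% Sketch of the minimum-valid-data criterion: a recording contributes to an
% infant's data only if fixation time >= 20% of that recording's duration.
fixationTime = [12.1  3.0 20.4];   % fixation time per recording (s), hypothetical
recDuration  = [23.6 24.0 22.5];   % corresponding recording durations (s)

include = (fixationTime ./ recDuration) >= 0.20;   % logical inclusion mask
```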

Region-of-Interest (ROI) Comparisons
Eye movements identified as fixations were coded into four regions of interest (ROIs) that were defined within each frame of all video stimuli as shown in Figure S1A. […]

We also considered other acoustic cues related to rhythmic structure. The rhythmic structure of infant-directed singing necessarily involves multiple inter-related prosodic parameters, including variation in parameters such as pitch and loudness (16). These parameters could play a role in modulating infants' visual attention. We quantified these parameters as follows in order to measure their relationship to infant looking: acoustic measures of pitch and loudness were calculated as mean fundamental frequency (Hz; proxy for perceived pitch) (20) and root-mean-square amplitude (proxy for perceived loudness) (21), respectively, in time intervals equivalent to the duration of each video frame (i.e., in 33.3 ms bins). For each video, time intervals with fundamental frequency or amplitude values greater than the 90th percentile were used to define time series of "high" frequency or amplitude. As noted in the main text (Figure 2), neither high frequency nor high amplitude alone was sufficient to drive synchronous infant eye-looking. To ensure that results were not dependent upon threshold selection (90th versus other percentiles), follow-up analyses were conducted with varying thresholds (95th, 92nd, 88th, 85th, 80th percentiles) and yielded consistent results across all comparisons. Note that in infant-directed song, frequency is influenced by the melody of the song; this is in contrast to infant-directed speech, which employs pitch accents for communicative emphasis (22). Amplitude, however, is related to rhythmic structure (16) but also reflects the variable volume (i.e., musical dynamics) used during expressive singing. The goal of these comparative analyses was to test the extent to which discrete occurrences of each parameter drive synchronous responding, providing evidence for which parameters play a greater (beat) or lesser (high frequency, high amplitude) role.
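The amplitude measure can be sketched in MATLAB as follows: RMS amplitude is computed in bins matching the video frame rate, and bins above the 90th percentile define the binary "high amplitude" series. The file name is hypothetical, and fundamental-frequency extraction (for the pitch series) is handled analogously but omitted here.

```matlab
% Sketch of frame-wise RMS amplitude and the 90th-percentile "high" series.
[audio, fsAudio] = audioread('song01.wav');    % hypothetical stimulus audio
audio = mean(audio, 2);                        % mix to mono if stereo
frameLen = round(fsAudio / 30);                % one 33.3 ms bin per video frame
nFrames = floor(numel(audio) / frameLen);

rmsAmp = zeros(1, nFrames);
for k = 1:nFrames
    seg = audio((k-1)*frameLen + 1 : k*frameLen);
    rmsAmp(k) = sqrt(mean(seg.^2));            % root-mean-square amplitude
end

highAmp = rmsAmp > prctile(rmsAmp, 90);        % binary "high amplitude" series
```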

Motion of Singers
Motion of the singers was quantified in two ways for two different kinds of motion: motion of the internal features of the face, and rigid motion of the head. To quantify motion of the internal features of the face, we calculated the absolute difference in image intensity (luminance) per video pixel over time. Change in intensity was summed for all pixels in the eyes region-of-interest (ROI) to provide a metric of change within the eye region. We then identified frames with values less than the 10th percentile to define a time series of low motion (i.e., periods relatively free from motion in the eye region). We were interested in periods of low motion as they represent relative stilling. As before, to ensure that results were not dependent upon threshold selection, we repeated analyses with additional thresholds (5th, 15th, 20th); results across varying thresholds were consistent with those presented in Figure 3.

To quantify rigid motion of the head, we tracked the (x,y) location in video pixel coordinates of the tip of the nose through all frames of all videos. With these data, we could measure up-and-down and side-to-side motion of the head, in relation to the beat and in relation to infant eye-looking. Not surprisingly, up-and-down movements of the head are synchronized with the beat (we found no significant side-to-side head motion versus beat synchrony). Notably, however, increase in infant eye-looking precedes the up-and-down motion of the head, indicating anticipatory looking behavior.
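The facial-motion metric can be sketched in MATLAB as below: summed absolute frame-to-frame luminance change within the eyes ROI, with the 10th percentile defining the "low motion" series. The file name and ROI pixel bounds are hypothetical.

```matlab
% Sketch of the internal facial-motion metric within the eyes ROI.
v = VideoReader('song01.mp4');          % hypothetical stimulus video
roiRows = 100:180;  roiCols = 220:420;  % illustrative eyes-ROI bounds (px)

prev = [];
eyeMotion = [];
while hasFrame(v)
    frame = rgb2gray(readFrame(v));               % luminance image
    cur = double(frame(roiRows, roiCols));        % eyes ROI
    if ~isempty(prev)
        % summed absolute luminance change across all ROI pixels
        eyeMotion(end+1) = sum(abs(cur(:) - prev(:)));
    end
    prev = cur;
end

lowMotion = eyeMotion < prctile(eyeMotion, 10);   % relative stilling
```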

Blinking of Singers

Blinks of the singing actresses were coded manually from each video using frame-by-frame inspection. Timing of blink onsets and offsets was coded based on coders' observation of occlusion of the singer's pupils. All blinks were determined by two independent coders, with >99% agreement. Frame-by-frame binary time series were then created to indicate whether or not each video frame aligned in time with a singer's blinking.

Emotional Expression of Singers
In general, when singing to infants, caregivers display positive affect and smiles (23, 24). In our analyses, we were most interested in changing facial expressions reflecting (a) varying levels of caregiver communicative content and (b) varying levels of caregiver engagement, both of which will impact what and how a caregiver conveys information and may also impact infants' attention to a singing caregiver's eyes.
To quantify emotional expressions in the faces of singing caregivers, we used IntraFace software (25). In brief, IntraFace uses feature tracking in videos of faces (via a "Supervised Descent Method" (26)) to track points on a face, and then, based on the positions of those points, quantifies the activity of facial action units (accomplished by an inductive machine learning approach dubbed a "Selective Transfer Machine") to categorize the resultant patterns into generic facial expressions. The result is a quantification of facial action unit activity and a probability rating of emotional expression for every frame of video. [Note: When these analyses were conducted, IntraFace software was freely available for research use; it was subsequently acquired by Facebook and is no longer publicly available. OpenFace is a comparable package that can be found at https://cmusatyalab.github.io/openface/.]

Analyses focused on variation in two facial expressions: neutral, which involves relaxed eyes/brows (the absence of facial action unit activity; IntraFace's "neutral" classification), and "mock-surprise"/wide-eyed engagement, which involves raising of the upper eyelids and brows (action units 1, 2, 5; IntraFace's "surprised" classification). The "surprised" classification from IntraFace is consistent with the canonical expression of surprise in adults but also with an expression called "mock-surprise" or the "wow" expression that commonly occurs in infant-directed communication (27, 28). This infant-directed mock-surprise involves raised eye action units (wide open eyes, raised eyebrows) and open mouth, and is rated as expressing surprise, excitement, and interest by naïve raters (28). It is worth noting that mock-surprise, despite being extremely common in caregiver-infant interaction (29, 30), and immediately known to most parents and caregivers, is rarely mentioned in the adult facial expression literature (31): rather, mock-surprise exists specifically within the developmental context of infant-caregiver interaction (one of multiple such acts that emerge and exist specifically within the context of dyadic interaction with infants (30)).

To test whether IntraFace facial expression classifications were consistent with human observer perceptions, 10 naïve adults rated the emotional expressions of video frames pseudo-randomly selected from all videos (selected pseudo-randomly to ensure a variety of expressions; 36 ratings for each of 10 coders). Frames classified as "surprised" by IntraFace were consistently rated as higher in surprise, wide-eyed engagement, excitement, and interest than non-surprised frames (t(349)'s ≥ 6.44, p's < 0.001), confirming the reliability of the software's surprise classification.

For each video, IntraFace ratings were used to define frame-by-frame time series indicating presence or absence of the expression of interest (either neutral expression or wide-eyed engagement). […]

To examine whether PSTH magnitude and shape were significantly greater for 6-month-old versus 2-month-old infants, we again used permutation testing. In 10,000 random re-samplings, we repeatedly created two groups of independent infants, randomly selected across all 6-month-olds and 2-month-olds, and then computed their between-group difference in PSTHs. The mean difference across all 10,000 permuted samples represents chance-level difference at each time point.
We then compared the actual observed 6- versus 2-month between-group PSTH difference against the 95th percentile of PSTH differences expected by chance alone (one-sided test, alpha = 0.05). That comparison enabled us to test whether time-locked eye-looking at 6 months was significantly greater than at 2 months.
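The permutation test can be sketched in MATLAB as follows, assuming per-infant PSTHs stacked as rows of a matrix for each age group; the group sizes, time base, and placeholder data are illustrative.

```matlab
% Sketch of the between-group PSTH permutation test: group labels are
% shuffled 10,000 times to build a null distribution of group differences.
n6 = 56; n2 = 50; T = 40;                   % illustrative sizes (infants x bins)
psth6 = rand(n6, T); psth2 = rand(n2, T);   % placeholder per-infant PSTHs

obsDiff = mean(psth6, 1) - mean(psth2, 1);  % observed 6- vs 2-month difference
allPsth = [psth6; psth2];
nAll = n6 + n2;

nPerm = 10000;
nullDiff = zeros(nPerm, T);
for p = 1:nPerm
    order = randperm(nAll);                 % random relabeling of infants
    g1 = allPsth(order(1:n6), :);           % pseudo "6-month" group
    g2 = allPsth(order(n6+1:end), :);       % pseudo "2-month" group
    nullDiff(p, :) = mean(g1, 1) - mean(g2, 1);
end

crit = prctile(nullDiff, 95);               % 95th percentile per time point
sigTimepoints = obsDiff > crit;             % one-sided test, alpha = 0.05
```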

Phase Analyses
To estimate each infant's phase of response, ϕ, at the beat, each infant's PSTH data were first fitted with each of 3 models (fitting via nonlinear least squares method). The data were fitted with a simple […]

To analyze distributions in ϕ estimates, we used circular statistics. We assessed synchronization of eye-looking response with the beat using the V-test, testing for non-uniformity of ϕ distributions around 0 (32, 33). To compare tightness of phase-locking between the two- and six-month groups (i.e., consistency of response among individuals), we used the Wallraff test of angular dispersion (32, 34).
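For reference, the V-test on a set of phase estimates can be written compactly; below is a minimal MATLAB sketch with placeholder phases, using the standard formulation of the V statistic with hypothesized mean direction 0 (the model fits that produce each infant's ϕ are not reproduced here).

```matlab
% Sketch of the V-test for non-uniformity of phases around 0 radians.
phi = 0.3 * randn(1, 56);          % placeholder per-infant phase estimates

n = numel(phi);
C = mean(cos(phi));
S = mean(sin(phi));
Rbar = hypot(C, S);                % mean resultant length
meanDir = atan2(S, C);             % mean direction of the sample

mu0 = 0;                           % hypothesized direction (the beat)
V = n * Rbar * cos(meanDir - mu0); % V statistic
u = V * sqrt(2 / n);               % approximately standard normal under H0
pval = 1 - normcdf(u);             % one-sided p-value
```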

Lissajous Curves
As a complementary method for observing synchronization between the beat of infant-directed singing and the looking behavior of infants, we constructed Lissajous curves comparing the changing phase of the beat with the varying probability of infant-looking behavior. Lissajous curves provide a direct record of how two time-varying signals vary in relation to one another, and Lissajous curves can be used to visualize synchronization between two continuous signals, to quantify phase shift from one signal to another, and to identify higher order synchronization (e.g., 2:1, 3:1, … n:m frequency coupling) (Figure 1L).
In these analyses, the phase of the beat was estimated as a continuously varying cosine function as plotted in Figure 1D. As noted in the main text and described above in the "Rhythmic Structure and Acoustic Parameters" section, beats were coded and quantified as the vowel durations of all metrically strong syllables within each song. With manually-labeled beats in all songs, the corresponding cosine function was calculated to reach a local maximum at the midpoint of each labeled beat and to reach a local minimum value at the midpoint of each between-beat interval (a sketch of this construction follows below).

To quantify probability of infant-looking behavior, the probability of a given behavior was defined as the number of infants performing that behavior (numerator) divided by the total number of infants who […]

There were no significant differences in proportion of time spent in eye-looking between the two age groups. That absence of significant differences contrasts somewhat with results from our earlier work (7), in which we observed an increase in eye-looking between 2- and 6-mo-old males followed longitudinally. Notably, however, the current comparisons differ in 3 ways from those prior results. First, the results here are for infant-directed singing, not speech. Second, the results here are for an independent-sample, between-subjects, cross-sectional comparison of means rather than a longitudinal within-subjects comparison of developmental change. And third, the current sample includes both males and females, rather than males alone as in (7); when followed longitudinally, females increase their eye-looking more rapidly than males, from 2 until ~4 months, before then decreasing slightly from ~4 to 6 months; in contrast, males increase looking more slowly from 2 until 6 months, as in (7).
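The beat-phase signal just described can be sketched in MATLAB as below: phase advances linearly from 0 to 2π between consecutive beat midpoints, so the cosine peaks at each beat midpoint and bottoms out at the midpoint of each between-beat interval. Beat times and the frame rate are illustrative.

```matlab
% Sketch of the continuous beat-phase cosine used for the Lissajous curves.
beatMid = [0.25 0.70 1.12 1.58];   % hypothetical beat midpoints (s)
t = 0:1/30:2;                      % one time point per video frame

phase = nan(size(t));
for k = 1:numel(t)
    j = find(beatMid <= t(k), 1, 'last');      % most recent beat midpoint
    if ~isempty(j) && j < numel(beatMid)
        % linear phase ramp from this beat midpoint to the next
        phase(k) = 2*pi * (t(k) - beatMid(j)) / (beatMid(j+1) - beatMid(j));
    end
end
beatSignal = cos(phase);   % maximum at beats, minimum between beats

% A Lissajous curve then plots looking probability against beat phase,
% e.g., plot(beatSignal, lookingProb) for a frame-wise probability series.
```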

Lissajous Curves: Comparisons of Continuous Time-Varying Signals
Complementary analyses of synchronization compared continuously-varying measures of the changing phase of the beat with continuously-varying probability of infant-looking behavior by constructing Lissajous curves (Figure S2).

Beginning with the 6-month data, as shown in Figure S2C, […] (Figure S2D), maximally reduced prior to the beat, but is more phase-shifted in 2- than 6-month-olds: ~π/2.97 vs ~π/5.5, respectively. As with 6-month-olds, variation in probability of 2-month-old body-looking does not vary significantly with the beat (Figure S2F). Finally, the Lissajous curve for 2-month-old saccade probability appears to show an early developmental transition towards 2 saccades per 1 beat period, but only weakly so, approaching ~2:1 coupling and phase-shifted by ~π/2.82 (Figure S2H).

All Lissajous curves plotted in Figure S2B-I show average probability across all beat trials, with variance in beat-by-beat response indicated by gray shading, which shows +/-1 standard error of the mean.

Caregiver Acoustic Cues
Infant-directed communication is well-known for its properties of heightened fundamental frequency, greater pitch contours and variability, longer pauses, slower tempo, and increased repetition (16, 22, 36). Prior research indicates that these acoustic features of infant-directed communication capture and maintain infants' attention (e.g., high fundamental frequency, (22, 37)). However, when we specifically examined moment-by-moment drivers of infant visual attention to the eyes of an engaging caregiver, infant eye-looking was time-locked to the rhythmic structure (beats) but was not significantly time-locked to moments of high frequency or high amplitude (main text Figure 2). These results do not contradict the general importance of pitch and loudness in infant-directed communication; rather, they offer evidence that during infant-directed singing, rhythm organizes those and other features.

The lack of time-locked looking to high amplitude or frequency events may be due to the context of infant-directed song. During song, the singer's use of volume for expressiveness (i.e., musical dynamics) impacts amplitude levels while melodic contours dictate frequency patterns. Individual notes also exhibit greater pitch stability in infant-directed singing compared to infant-directed speech (38, 39). Thus, the use of specific acoustic parameters in song contrasts with the role of pitch accents in contributing to rhythmic structure during infant-directed speech, during which high pitch and pitch variability capture infants' attention (40, 41). The global prosodic frequency and amplitude contours of song may have rendered these specific acoustic cues less relevant for dynamically modulating infants' eye gaze on a moment-by-moment basis. Even while cues such as high frequency are important for attracting infants' overall attention, including during infant-directed singing (22, 37), the precise timing of infants' attentional allocation to a singing caregiver's eyes is more strongly influenced by rhythm than by other acoustic cues.

Previous studies of non-infant-directed singing (e.g., professional or layperson singing performances directed toward other adults) indicate that when acoustic cues are constrained due to the musical/singing context (e.g., by melodic contour), they may be less informative for socio-communicative judgments: when pitch level is controlled, naïve observers are less accurate at identifying specific emotions in audio-only versions of singing versus visual-only or audiovisual versions (42), and the identified emotions are also perceived less intensely in audio-only formats (43). Additionally, as rhythm and other acoustic elements (e.g., pitch) are intertwined for the listener during song perception (44-47), the temporal organization provided by the beat-based rhythmic structure constrains pitch and melody perception: rhythmically shifting a song so that specific pitches are or are not aligned with the beats changes the perceived tonality and reduces recognition of pitches, even if the pitches themselves are unchanged (44, 45). This is consistent with rhythm as a temporal organizer of listeners' experiences: rhythm plays an important role in structuring and scaffolding experience when engaged with song.
All stimuli in the current study were common children's songs performed in an infant-directed manner (i.e., with higher fundamental frequency) to be developmentally appropriate for our sample and research questions. Future studies could use specifically constructed melodies performed at multiple different pitch levels to further examine effects of high frequency when controlling for the rhythmic structure in which frequency is embedded.

Caregiver Visual Cues and Rhythmic Structure
Rhythm is a salient cue to infants because it is expressed amodally (48, 49). As demonstrated in the current data in main text Figure 3, caregivers unconsciously structure their own visual cueing in time to the rhythm of their singing, redundantly and repeatedly highlighting infant-relevant communicative cues. Because caregivers use these cues to engage their infants socially, especially during infant-directed singing, a key question is whether these cues drive infant behavioral response independently (i.e., are sufficient on their own), or if infant response relies on or benefits from the redundant, repeated structure provided by rhythm and entrainment (in order to ultimately, most effectively engage infant behavior). A related question is how this confluence of cueing affects multimodal social information transfer developmentally, to support children's social adaptive learning over time.

[…] function of different components of caregiver cueing and the extent to which responses vary developmentally. We hypothesized that by imposing a structure on the interaction, rhythm may support other cueing signals by enabling predictable and repeated presentation of multimodal social information, and that these effects should strengthen over developmental time.

To test this, we compared entrainment of infant eye-looking during the following conditions: during all beats; during beats without co-occurring wide-eyed, positive affect; and during beats with co-occurring wide-eyed, positive affect (Figure S3; a sketch of this partitioning follows below). In both 2- and 6-month-old infants, entrainment is evident during all beats (Figure S3A,D; results repeated from Figure 2A,B). However, a developmental progression is apparent when we separate instances when beats do or do not co-occur with a caregiver's presentation of wide-eyed, positive affect: at 2 months, entrainment is driven by the beats, with no effect of co-occurring presentation of wide-eyed, positive affect (Figure S3B,C); at 6 months, however, the timing of infant looking is aligned with the beat but also potentiated by a caregiver's presentation of co-occurring wide-eyed, positive affect (Figure S3E,F). With development, precise time-alignment of eye-looking behavior is supported by the rhythmic structure of multiple redundant cues.

While these findings in 2-month-old infants may seem surprising, closer inspection of individual phase responses provides some indication of why this may be. As depicted in Figure S3A, while 2-month-olds entrain to the beat, there is also variability in infants' precise individual response timing, with some 2-month-olds aligning just prior to, and others just after, the beat. This variability is consistent with the increased variability in latencies to saccade onset observed in control comparisons between 2- and 6-month-olds (Figure S1K), and would be consistent with less mature motor control in 2-month-olds. By comparison, individual 6-month-olds are less variable in their individual time alignment with the beat (Figure S3D). We can then compare the slightly increased variability in 2-month-old response with the time-alignment of caregivers' wide-eyed positive affect (also in relation to the beat; i.e., comparing time-alignment of infant response to the beat versus time-alignment of caregiver behavior to the beat).
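For concreteness, a minimal MATLAB sketch of the beat partitioning referenced above is given below; the binary affect series, beat frame indices, and especially the +/-100 ms co-occurrence window are illustrative assumptions, not parameters specified in the text.

```matlab
% Sketch of partitioning beat trials by co-occurring caregiver affect: a beat
% counts as "with affect" if any wide-eyed positive-affect frame falls within
% an assumed +/-100 ms window around the beat-aligned frame.
frameRate = 30;
affect = false(1, 600);  affect(200:230) = true;   % per-frame affect series
beatFrames = [60 120 180 240 300];                 % beat-aligned frame indices

halfWin = round(0.1 * frameRate);                  % assumed +/-100 ms window
withAffect = false(size(beatFrames));
for k = 1:numel(beatFrames)
    lo = max(1, beatFrames(k) - halfWin);
    hi = min(numel(affect), beatFrames(k) + halfWin);
    withAffect(k) = any(affect(lo:hi));
end

beatsWithAffect    = beatFrames(withAffect);       % beats with co-occurring affect
beatsWithoutAffect = beatFrames(~withAffect);      % beats without co-occurring affect
```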
Time alignment of caregiver wide-eyed positive affect with the beat (main text Figure 3A) is much more precise, tightly aligning with or just prior to the beat. We think it likely that the slightly increased individual variability in 2-month-old time-aligned eye-looking, coupled with the precise time-alignment of caregivers' own synchronized expressions, leads to the pattern of observed results.

This developmental progression, aided by infants' maturing oculomotor function, suggests that the rhythm of infant-directed communication provides a scaffolding mechanism for increasing the effectiveness of social information transfer, supporting infants' developing sensitivity to meaningful social signals by presenting those signals repeatedly and predictably. To test for further evidence of rhythm as the primary driver of infants' entrained eye-looking to caregivers' social-affective cueing, we also examined whether infants time-align their eye-looking to any moments of caregivers' wide-eyed positive affect (i.e., regardless of whether such expressiveness occurs on or off the beat). While caregivers increase wide-eyed positive affect in time with the beat, this visual cue also occurs at other times throughout their singing. At neither two nor six months of age do infants significantly time-align their eye-looking to this social-communicative cue when it occurs irrespective of the rhythmic structure (Figure S4). (We highlight that these results focus on time-aligned change in levels of infant eye-looking in relation to a given caregiver cue, rather than infants' overall levels of eye-looking. Therefore, these results do not imply that infants don't look at caregivers' wide-eyed positive affect (they do); rather, these combined results demonstrate that the precise timing of infant-looking is time-aligned to the rhythmic structure more than to caregivers' affect presentation alone.) Taken together, the analyses of caregiver visual cueing, both overall and in relation to rhythmic structure, indicate that although what a caregiver expresses in unimodal cueing is important, when and how that cueing occurs are more critical for the infant's response and receipt of information. Rhythm (to specify the "when" of predictable repetition) and rhythmic entrainment (to specify the "how" of complementary redundancy) seem ideally suited to the task of supporting successful social information transfer between caregiver and child.

Beyond infant-directed singing, visual communicative cues including eye contact, head movements, and facial expressiveness are important in other performative musical contexts (42, 50-53). Visual cues involving the eyebrows, lips, jaws, and head positioning covary with aspects of the musical structure (e.g., facial movements provide cues to pitch intervals, phrase closure, and amplitude of the vocal signal (54-56)) while also conveying associated emotions (e.g., eyebrow raises, forward head movements, and upward lip corner movements are associated with positive emotions during singing as they are in speech, highlighting the cross-modal expression of cues during musical performances (42, 57)). Indeed, visual displays are particularly salient for expressing emotions during singing (more so than isolated acoustic counterparts) (42, 43). Observers perceive greater communication and expressiveness from performers who use direct gaze, and this increases the observers' liking and emotional judgments of the performance (50). Some professional musicians are particularly well-known for their expressive visual cues during performances (e.g., (53)). It is possible that in musical performances more generally, expressive visual cues will be time-aligned to the rhythmic structure, as demonstrated here for infant-directed singing. At the same time, the use of such expressive cues and their timing will depend on multiple aspects related to the song requirements, performer attributes, and audience (e.g., (58, 59)). Regardless, it is remarkable that when engaging with infants, who have limited communicative skills and require external support to modulate their attention and arousal, caregivers adopt the highly expressive and engaging visual cues used in performative contexts.

[Figure S1, panels K-L] (L) Fixation duration following first saccade when presented with a non-social target. In (K-L), we measured latency to first saccade after stimulus onset and the duration of first fixation as additional measures of oculomotor control. While 2- and 6-month-olds do not differ in mean or median latency to first saccade, they do differ in variance in saccade latency, with 2-month-olds being more variable than 6-month-olds (F(1,107) = 15.9, p < 0.0001; Levene's test for equality of variance).

[Figure S2 caption] In Lissajous curves in parts (B-I), mean looking probability is plotted in blue while gray areas denote ±1 standard error of the mean (SEM). In all traces, the arrowhead denotes mean response level at the beat (beat phase = 0), with trace thickness denoting direction of travel (thickening as time moves forward, resetting immediately after the beat). Y-axis ranges in parts (B) and (C), and in parts (H) and (I), are the same, whereas Y-axis spans are the same in parts (D) and (E), and in parts (F) and (G), but their ranges differ. Mean probabilities of mouth- and body-looking differ between groups; spans are matched for between-group comparison but ranges necessarily differ. Note that a Lissajous curve when no synchrony is present fills the plot area, and the average response probability is unchanged relative to the beat (a horizontal line, with no significant output signal change relative to beat phase, as observed for body-looking in (F) and (G)). Probabilities of eye- and mouth-looking in 6-month-old infants both show 1:1 synchrony with ~π/5.5 phase shift; however, mouth-looking is synchronous in anti-phase (maximally reduced after the beat). Saccades in 6-month-olds are synchronized at 2 saccade periods per 1 beat period, with maximum increase prior to (in anticipation of) the beat. When comparing synchronization of eye- and mouth-looking at 2 months (left columns) and 6 months (right columns), note the greater magnitude of change in probability for 6-month-olds. Similarly, 6-month-olds exhibit a greater increase in probability of saccades before the beat versus 2-month-olds. (Panel title: Discovery Cohort 6-Month-Olds, N = 56.)
Observers perceive greater communication and expressiveness 614 from performers who use direct gaze, and this increases the observers' liking and emotional judgments of 615 the performance (50). Some professional musicians are particularly well-known for their expressive visual 616 cues during performances (e.g., (53)). It is possible that in musical performances more generally, the 617 expressive visual cues will be time-aligned to the rhythmic structure as demonstrated in the infant-618 directed singing. At the same time, the use of such expressive cues and their timing will depend on 619 multiple aspects related to the song requirements, performer attributes, and audience (e.g., (58, 59)). 620 Regardless, it is remarkable that when engaging with infants, who have limited communicative skills and 621 require external support to modulate their attention and arousal, caregivers adopt the highly expressive 622 and engaging visual cues used in performative contexts. (L) Fixation duration following first saccade when presented with a non-social target. In (K-L), we measured latency to first saccade after stimulus onset and the duration of first fixation as additional measures of oculomotor control. While 2-and 6-month-olds do not differ in mean or median latency to first saccade, they do differ in variance in saccade latency, with 2-month-olds being more variable than 6-month-olds (F 1,107 = 15.9, p < 0.0001; Levene's test for equality of variance  In Lissajous curves in parts (B-I), mean looking probability is plotted in blue while gray areas denote ±1 standard error of the mean (sem). In all traces, the arrowhead denotes mean response level at the beat (beat phase = 0), with trace thickness denoting direction of travel (thickening as time moves forward, resetting immediately after the beat). Y-axis ranges in parts (B) and (C), and in parts (H) and (I) are the same, whereas Y-axis spans are the same in parts (D) and (E), and (F) and (G), but their ranges differ. Mean probabilities of mouth and body-looking differ between groups; spans are matched for between-group comparison but ranges necessarily differ. Note that a Lissajous curve when no synchrony is present fills the plot area, and the average response probability is unchanged relative to the beat (a horizontal line, with no significant output signal change relative to beat phase, as observed for body-looking in (F) and (G)). Probability of eye-and mouth-looking in 6-month-old infants both show 1:1 synchrony with ~π/5.5 phase shift; however, mouth-looking is synchronous in anti-phase (maximally reduced after the beat). Saccades in 6-month-olds are synchronized at 2 saccade periods per 1 beat period, with maximum increase prior to (in anticipation of) the beat. When comparing synchronization of eye-and mouth-looking at 2-months (left columns) and 6-months (right columns), note greater magnitude of change in probability for 6-month-olds. Similarly, 6-month-olds exhibit greater increase in probability of saccades before the beat versus 2-month-olds. ± π 0 ± π 0 Discovery Cohort 6-Month-Olds, N = 56