Spontaneous dyadic behavior predicts the emergence of interpersonal neural synchrony

Synchronization of neural activity across brains – Interpersonal Neural Synchrony (INS) – is emerging as a powerful marker of social interaction that predicts the success of multi-person coordination, communication, and cooperation. As the origins of INS are poorly understood, we tested whether and how INS might emerge from spontaneous dyadic behavior. We recorded neural activity (EEG) and human behavior (full-body kinematics, eye movements, and facial expressions) while dyads of participants were instructed to look at each other without speaking or making co-verbal gestures. We made four fundamental observations. First, despite the absence of a structured social task, INS emerged spontaneously only when participants were able to see each other. Second, we show that such spontaneous INS, comprising specific spectral and topographic profiles, did not merely reflect intra-personal modulations of neural activity, but rather reflected real-time and dyad-specific coupling of neural activities. Third, using state-of-the-art video-image processing and deep learning, we extracted the temporal unfolding of three notable social behavioral cues – body movement, eye contact, and smiling – and demonstrated that these behaviors also spontaneously synchronized within dyads. Fourth, we probed the correlates of INS in such synchronized social behaviors. Using cross-correlation and Granger causality analyses, we show that synchronized social behaviors anticipate and in fact Granger-cause INS. These results provide proof-of-concept evidence for studying interpersonal neural and behavioral synchrony under natural and unconstrained conditions. Most importantly, the results suggest that INS could be conceptualized as an emergent property of two coupled neural systems: an entrainment phenomenon, promoted by real-time dyadic behavior.


Introduction
Social interaction with conspecifics is one of the most complex and critical parts of our life (Chen and Hong, 2018). To successfully navigate our social environment, an information exchange between our brain and the brains of others must occur, leading to a continuous and reciprocal update of information on partners' inner states and actions (Frith and Frith, 2007; Gallotti et al., 2017; Hasson et al., 2012). This continuous exchange and update drives our social interactions and may ultimately impact our capacity to anticipate and adapt to others' behavior. How do individuals accomplish such information transfer during a social interaction?
Interpersonal Neural Synchrony (INS) – the temporal alignment of neural activities between interacting partners – has been proposed as a mechanism mediating the interpersonal exchange of information (Dumas et al., 2010; Hasson et al., 2012). This proposal is based on the notion that neural oscillations reflect transient states of cortical excitability; their interpersonal alignment is therefore assumed to facilitate individual agents in coordinating their behavior, or even in sharing their inner cognitive or affective states (Kingsbury and Hong, 2020). This view is supported by studies indicating that INS predicts the success of social interaction, which is typically measured using structured tasks implying coordination, cooperation, or verbal communication (Cui et al., 2012; Dikker et al., 2017; Hoffmann et al., 2019; Kingsbury et al., 2019; Meshulam et al., 2021; Pan et al., 2018; Parkinson et al., 2018; Reinero et al., 2021; Stephens et al., 2010; Xie et al., 2020; Yang et al., 2020; Zhang and Yartsev, 2019).
Research on human INS has so far focused on correlating the strength of INS with success in a given 'structured' social task, i.e., the achievement of a collective goal according to task instructions (Babiloni and Astolfi, 2014; Nam et al., 2020; Wang et al., 2018). This approach is sufficient to establish a relationship between INS and social behavior and, in some cases, might even suggest that INS has direct consequences on social behavior (Kingsbury et al., 2019; Novembre et al., 2017; Yang et al., 2021). Yet, this approach alone has not clarified how INS emerges in the first place. This question has received little attention by this research community, which often assumes that INS is either triggered by the task itself (e.g., Djalovski et al., 2021; Levy et al., 2017; Li et al., 2021), or by subtle behavioral cues that are normally not monitored (Jiang et al., 2012; Kinreich et al., 2017; Leong et al., 2017; Schippers et al., 2010). Notably, when these cues are measured, they are often confounded with task execution, as happens when, e.g., body language is used to reinforce verbal communication (Hömke et al., 2018, 2017; Tartter, 1980; Tartter and Braun, 1994; Wohltjen and Wheatley, 2021).

Fig. 1. Experimental setup and design. a: Experimental setup. Twenty-three dyads of healthy participants were seated facing each other while their neural activity (EEG), eye gaze, body, and facial movements were recorded (over 2 min-long trials). Participants were not permitted to speak or to make co-verbal gestures. b: Experimental design. Using a 2 × 2 factorial design, we manipulated interpersonal visual contact (Vision, No Vision) and interpersonal spatial proximity (3 m, 1 m). We also included a baseline condition, during which participants were seated in two separate rooms and could not exchange or share information in any way (we assumed that no INS could emerge in such a "ground truth" condition).
To shed light upon the origins of INS, we designed a hyperscanning study that did not enforce a structured social interaction. This allowed us to measure spontaneous brain activity and behavior without the constraints and confounds posed by a specific task (Hamilton, 2021). We recruited 23 dyads of participants and asked them to simply look at each other while acting spontaneously (Fig. 1a). The participants were not permitted to speak or to make co-verbal gestures. We measured INS by simultaneously recording electroencephalography (EEG) in both participants. Crucially, combining eye tracking and automated (deep learning-based) video analysis, we also recorded a number of behavioral measures that allowed us to estimate the temporal unfolding of three fundamental behavioral cues: eye contact, body movement, and smiling (Depaulo, 1992; Kleinke, 1986; Martin et al., 2017). We integrated these neural and behavioral measures and investigated their real-time mutual dependency using computational methods such as hierarchical Bayesian modeling and Granger causality.
We tested two hypotheses: first, that INS might emerge spontaneously in the absence of a structured task. To do so, we manipulated two factors that implicitly regulate social behavior: interpersonal visual contact and interpersonal spatial proximity (Fig. 1b). Second, assuming that such minimal social conditions are sufficient to modulate INS, we hypothesized that specific behavioral cues would predict INS. While testing this second hypothesis, we compared two models: one where the individual behavior of one participant, working as a signal, is sufficient to observe INS; and a second model where such behavior needs to be reciprocated.

Participants
Forty-six individuals (26 females; mean age 21.43 years, range 18-30 years) formed 23 dyads (13 same-sex dyads). All participants forming a dyad were familiar with each other (partners had known each other for 6.59 ± 5.08 SD years; partner subjective closeness was rated 7.87 ± 2.27 SD on a scale of 1-10, where 1 is lowest and 10 is highest subjective closeness). Participants had normal or corrected-to-normal vision and no history of psychological or neurological disorders. All participants provided written informed consent and were compensated with 25 Euros for their participation. All experimental procedures were approved by the local ethical committee and were carried out in accordance with the principles of the revised Helsinki Declaration (World Medical Association General Assembly, 2008). We determined our sample size in advance based on previously published studies measuring INS (Dumas et al., 2010; Goldstein et al., 2018; Hirsch et al., 2017; Noah et al., 2020).

Experimental design and procedure
The experiment comprised four main experimental conditions organized according to a 2 × 2 factorial design (Fig. 1b). We manipulated interpersonal visual contact, i.e., whether the two participants forming the dyad could see each other or not (Vision, No Vision), and interpersonal spatial proximity, i.e., whether the participants forming the dyad were either 3 m or 1 m away from each other (Far, Near). The design also included two control conditions, which we refer to as "positive" and "negative". The positive control condition implied a social task – holding hands – that has previously been associated with INS (Goldstein et al., 2018; Long et al., 2021; Nguyen et al., 2021). The negative control condition (baseline) entailed a resting task where participants were placed in two separate rooms (Fig. 1b). During this baseline condition, the participants did not share the same environment and could not exchange information in any way. Hence, we assumed that no INS could emerge under such a "ground truth" condition.
The data were collected over repetitions (trials), each lasting 2 min. We collected 6 trials associated with the negative control condition (one half of the trials were collected at the start of the experiment and the other half at the end). For all other conditions, we collected 3 trials each, for a total of 21 trials. The trials associated with the experimental conditions and the positive control condition were further grouped into three blocks, each including 5 trials whose order was randomized (with one caveat: the two Far conditions and the two Near conditions were always run consecutively, in order to minimize movement of the participants throughout the experimental procedure).
Before the start of the experiment, participants were provided with information about the equipment used to collect the data. Throughout the experiment, they were asked to simply relax and act spontaneously while looking at each other (unless prevented by a screen that was placed in between the two participants; see also Fig. 1a). In particular, participants were asked to relax, behave naturally and, when possible, look at the other person (not necessarily making eye contact) [this is the original text that the experimenter read to the participants before initiating the task (in Italian): "Il vostro compito è semplice: restare rilassati sulla sedia, comportarsi in modo naturale e, quando è possibile, guardare l'altra persona (non necessariamente negli occhi)."]. Participants were not permitted to communicate verbally or through co-verbal gestures. We further specified that participants were not required to look at each other's faces or eyes; rather, they were generally asked to look at the body of their partner. Following the end of the experiment, participants were informed about the purpose of the experiment.
The full procedure, including task instructions, preparation of the equipment (dual EEG, dual eye tracker, dual video recording) and debriefing took approximately 2.5 h.

Behavioral and neural recordings
During the experimental procedure, we recorded neural activity (using dual EEG), eye movements (using dual eye tracking), and videos from both participants. The video recordings were used to post-hoc quantify body movement and facial expressions (in particular smiling behavior). A custom library, 'Multi-device-Inter-Synchronizer' ( https://github.com/ateshkoul/Multi-device-Inter-Synchronizer ), written in Python, orchestrated the synchronization of this heterogeneous equipment (which is further described below). The library provided a graphical user interface (GUI) for recording participants' data and counterbalancing experimental conditions, as well as for generating synchronized triggers across the devices. The GUI was written using the package 'PySimpleGUI' ( https://pypi.org/project/PySimpleGUI/ ). The library interfaced with the dual EEG system, the two standalone video cameras, and the dual eye tracking system. Two types of information were transmitted using the library: details on tested participants (via Ethernet connection) and trigger codes (via a National Instruments card to the computer controlling the video cameras, and via a virtual serial port to the computer controlling the EEG). The communication with the eye tracking system was performed using custom libraries that interfaced over the network library ZeroMQ ( https://zeromq.org/ ) using the efficient binary serialization format MessagePack ( https://msgpack.org/ ).
Eye movements: We recorded eye movements from both participants simultaneously. Each participant's eye movements were tracked using a binocular, lightweight eye tracking system (Pupil Labs Core; Pupil Labs, Berlin, Germany; Kassner et al., 2014). The eye tracking system consisted of three different cameras. Two infrared-spectrum eye cameras monitored the two eyes simultaneously (120 Hz sampling frequency; 320 × 280 pixels). A third (head-mounted) camera was mounted on the participant's head, video-recording from the participant's viewpoint (100° fisheye field of view; 30 Hz sampling frequency; 1280 × 720 pixels). Data were sampled using the Pupil Capture software (Pupil Labs, Berlin, Germany; version 1.23). Prior to initiating the recording, we calibrated the eye trackers using the Pupil Calibration Marker (Pupil Labs, Berlin, Germany). The calibration required participants to track a fixation marker while the experimenter moved the Pupil Calibration Marker towards different (random) locations in space. The mean resulting validation accuracy was 2.03° (standard error = 0.054°), while precision was 0.1° (standard error = 0.0023°). Following the recording, the raw gaze data were exported using the open-source Pupil Player software.
Body and facial movements: We recorded body and facial movements from both participants simultaneously using a dual video camera setup [two standalone cameras (SVPRO USB Webcam 5-50 mm Varifocal Lens) mounted on tripod stands]. The cameras framed the two participants from the front side, having a slightly tilted (∼30°-45° with respect to the participants) aerial- and side-view of the participants (Fig. 1a). The distance between the cameras and the participants changed minimally across Near (160 and 180 cm) and Far (316 and 260 cm) conditions. A custom Python library, 'synchCams' ( https://pypi.org/project/synchCams/ ), was written for interfacing the video cameras with the rest of the recording system. This library allows a frame-locked dual video recording (i.e., the system acquires frames from the two cameras in an alternating fashion). 'synchCams' utilizes the Python-based libraries 'opencv' ( https://opencv.org/ ) for video capture, 'pyserial' ( https://pythonhosted.org/pyserial/ ) for access to the serial port, and 'socket' ( https://docs.python.org/3/library/socket.html ) for communication over Ethernet.
Neural recordings: The electroencephalogram (EEG) was recorded from 64 Ag/AgCl active electrodes placed on the scalp according to the extended international 10-10 system (Biosemi Active-2 system). The EEG signal was locally amplified and digitized using a sampling rate of 2048 Hz. One additional electrode, recording the electro-oculogram, was placed laterally to the right outer canthus.

Behavioral data analysis
The analysis of behavior focused on (i) eye contact, (ii) body movement, and (iii) smiling. These were chosen based on previous evidence demonstrating that they play an important role in social interaction, including in the context of hyperscanning studies (Hirsch et al., 2017; Leong et al., 2017; Yun et al., 2012). Furthermore, a qualitative analysis of all video recordings (not assessing eye contact) indicated that body movement and smiling were the most pronounced social behaviors (37.71% and 56.38% of all social behaviors, respectively), compared to others such as laughing (1.08%), talking (0.14%), or gesturing (1.61%) (supplementary Fig. S1; see also supplementary Table T1).
All behavioral analyses – reconstructing eye contact, body movement, and smiling – relied on an automated machine learning-based estimation of face and body landmarks. Therefore, we first describe such estimation in detail and then zoom into each of the specific behavioral measures. We utilized a deep learning-based approach to extract (i) bidimensional (2D) face landmark locations from videos captured by the head-mounted cameras of the eye tracker and (ii) 2D full-body landmarks from videos captured by the two standalone video cameras. A trained multi-stage Convolutional Neural Network (CNN) utilized single frames as input to first jointly predict a set of 2D vector fields that encoded the location and orientation of limbs in the image domain – Part Affinity Fields (PAFs) – as well as confidence maps for body part detection. Next, a greedy inference was used to output the face and body part landmarks by parsing the confidence maps and the PAFs. A collection of 25 body landmarks (across head, torso, arms, and legs), as well as 70 facial landmarks, was estimated. A custom library (pytorch_openpose) written in opencv ( https://opencv.org/ ) and pytorch ( https://pytorch.org/ ) was used to load the trained CNN model via the openpose Python API (Cao et al., 2017) and predict 2D face and body landmarks for each frame in a video. For a fast prediction of the landmarks, an NVIDIA GeForce RTX 2060 SUPER graphics processing unit (GPU) was used.
Eye contact: Eye contact was approximated as the distance between the gaze focus (of a given participant) and the center of the face (of the participant's partner). To do so, we integrated facial landmark data (gained from the head-mounted video recordings) and eye gaze data (gained from the eye tracker). Specifically, for each frame of the eye tracker head-mounted videos, the estimated facial landmarks were used to measure the Euclidean distance between the gaze location and the center of the face (estimated as the average of the nose landmarks).
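The eye-contact measure thus reduces to a per-frame distance computation. A minimal sketch (the function name and array layout are our own; in the paper the landmarks come from OpenPose and the gaze point from the Pupil Labs tracker):

```python
import numpy as np

def eye_contact_distance(gaze_xy, nose_landmarks_xy):
    """Euclidean distance (in pixels) between the gaze location and the
    centre of the partner's face, estimated as the average of the nose
    landmarks, for a single frame."""
    face_center = np.asarray(nose_landmarks_xy, dtype=float).mean(axis=0)
    return float(np.linalg.norm(np.asarray(gaze_xy, dtype=float) - face_center))
```

Applying this to every frame of the head-mounted video yields the continuous eye-contact time series used in the later analyses.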
Body movement: Overall (i.e., whole-body) movement was estimated using 25 body landmarks (gained from the videos recorded by the standalone cameras). We specifically extracted absolute (as opposed to relative) body locations in order to avoid issues related to head movements of the eye tracker wearer. Each body landmark carried 2D coordinates, i.e., movement on the x and y axes (relative to the camera's field of view) over time. These time series underwent a preprocessing procedure aimed at removing outlying values (3 standard deviations away from the mean of a given trial, 0.68% of all data) or data points where the algorithm was not able to predict body position (8.0% of all data). These outlying values were interpolated (1D interpolation), and the resulting time series were smoothed using a moving mean (window = 1 s). We then computed a composite index of movement change over time by first computing the Euclidean distance between (x,y) and the reference (0,0) coordinate, and then calculating the absolute value of the first derivative of the resulting time series. Prior to summing the data associated with the 25 body landmarks, the time series were smoothed using a moving mean (window = 1 s). Finally, the data were visually inspected, and trials associated with artifacts (i.e., a high degree of variance, defined as above) were removed (2.17%).
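The pipeline above (outlier interpolation, 1 s moving-mean smoothing, distance to the (0,0) reference, absolute first derivative, second smoothing, summation across landmarks) could be sketched as follows; function names, window handling, and edge behavior are our assumptions, not the authors' code:

```python
import numpy as np

def moving_mean(x, win):
    """Smooth with a centred moving mean of `win` samples."""
    return np.convolve(x, np.ones(win) / win, mode="same")

def interpolate_outliers(x, n_sd=3.0):
    """Replace NaNs and values > n_sd SDs from the mean by 1D interpolation."""
    x = np.asarray(x, dtype=float).copy()
    mu, sd = np.nanmean(x), np.nanstd(x)
    bad = np.isnan(x) | (np.abs(x - mu) > n_sd * sd)
    idx = np.arange(len(x))
    x[bad] = np.interp(idx[bad], idx[~bad], x[~bad])
    return x

def movement_index(landmarks_xy, fps=30, win_s=1.0):
    """landmarks_xy: (n_frames, n_landmarks, 2) pixel coordinates.
    Returns one composite movement time course (length n_frames - 1)."""
    win = int(fps * win_s)
    per_landmark = []
    for k in range(landmarks_xy.shape[1]):
        x = moving_mean(interpolate_outliers(landmarks_xy[:, k, 0]), win)
        y = moving_mean(interpolate_outliers(landmarks_xy[:, k, 1]), win)
        dist = np.hypot(x, y)           # Euclidean distance to (0, 0)
        speed = np.abs(np.diff(dist))   # absolute first derivative
        per_landmark.append(moving_mean(speed, win))
    return np.sum(per_landmark, axis=0)  # sum across the landmarks
```

For a motionless participant the composite index is (edge effects aside) zero, as expected.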
Smiling: Smiling was approximated by computing the aperture of the mouth, which was estimated as the Euclidean distance between the two extreme landmarks on the left and right side of the lips (extracted from the facial landmarks). To increase the signal-to-noise ratio, a procedure similar to that used for the body landmarks was applied: first, outlying values (more than 3 standard deviations away from the mean) were removed, then values were interpolated, and finally the resulting time series were smoothed using a moving mean (window = 1 s). We evaluated the accuracy of the automated smiling estimation against a manual frame-by-frame annotation made by a human observer. We confirmed that the detection of smiles, as generated by thresholding the aperture of the mouth, corresponded well to the manual annotation (accuracy = 0.86, p < 0.001).
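The smiling measure is thus a thresholded mouth-aperture time series; a minimal sketch (identifiers hypothetical; the de-outliering and smoothing steps mirror the body-landmark pipeline and are omitted here):

```python
import numpy as np

def smile_series(mouth_left_xy, mouth_right_xy):
    """Frame-wise mouth aperture: Euclidean distance between the two
    extreme lip landmarks ((n_frames, 2) arrays of pixel coordinates)."""
    return np.linalg.norm(np.asarray(mouth_right_xy) - np.asarray(mouth_left_xy),
                          axis=1)

def detect_smiles(aperture, threshold):
    """Binary smile detection by thresholding the (smoothed) aperture,
    as validated against the manual frame-by-frame annotation."""
    return aperture > threshold
```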

Neural data analysis
EEG data preprocessing and analysis were performed using custom scripts running on MatLab (version 2020a, MathWorks, Natick, MA) and exploiting several functions from the Fieldtrip toolbox (Oostenveld et al., 2011). The recorded EEG data were band-pass filtered (cutoff frequencies: 0.3 Hz and 95 Hz; Butterworth; filter order: 3). A notch filter (47-52 Hz) was also used to remove electrical noise. We then re-referenced the data using a Common Average Reference (CAR).
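The authors used FieldTrip in MatLab; for illustration only, the same filtering chain could be approximated in Python with scipy (filter parameters taken from the text; the use of zero-phase filtering and all identifiers are our assumptions):

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 2048  # Hz, sampling rate from the recording setup

def preprocess(eeg):
    """eeg: (n_channels, n_samples). Band-pass 0.3-95 Hz (3rd-order
    Butterworth), remove 47-52 Hz line noise, re-reference to CAR."""
    sos_bp = butter(3, [0.3, 95.0], btype="bandpass", fs=FS, output="sos")
    eeg = sosfiltfilt(sos_bp, eeg, axis=-1)
    # band-stop over the paper's 47-52 Hz notch range
    sos_stop = butter(3, [47.0, 52.0], btype="bandstop", fs=FS, output="sos")
    eeg = sosfiltfilt(sos_stop, eeg, axis=-1)
    # Common Average Reference: subtract the mean across channels
    return eeg - eeg.mean(axis=0, keepdims=True)
```

After CAR, the channel-wise mean at every sample is zero by construction.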
Noisy or faulty electrodes were interpolated by replacing their voltage with the average voltage of the neighboring electrodes (155/8832 channels across all dyads; average across dyads 1.75%, standard deviation 1.33%, sem 0.28%). To remove movement artifacts, we used a validated algorithm for automatic artifact correction: Artifact Subspace Reconstruction (ASR) (Kothe and Makeig, 2013; Plechawska-Wojcik et al., 2019). ASR is an adaptive algorithm based on principal component analysis. It estimates clean portions of data to determine thresholds that are later used to reject large-variance components. Based on the comparative results from Chang et al. (2018), we used a threshold value of 10. Next, we submitted the ASR-cleaned data to an ICA analysis, and artifacts due to eye blinks or eye movements were subtracted using a validated method (Jung et al., 2000). In all datasets, independent components related to eye movements had a large electrooculogram channel contribution and a frontal distribution.
The resulting data were then time-frequency transformed using a sliding-window Fourier analysis with a tapering function. To reduce spectral leakage and control the frequency smoothing, we used a Hanning-tapered window with a frequency-dependent time window that linearly decreased from 500 ms (1 Hz) to 100 ms (95 Hz) (using the 'ft_freqanalysis' function with the 'mtmconvol' method as implemented in FieldTrip) (Novembre et al., 2019). We computed power estimates for all frequencies between 1 and 95 Hz in steps of 1 Hz and with a time resolution of 100 ms (10 Hz). To avoid overrepresentation of certain bands embracing a disproportionate number of frequency bins (in particular, the gamma band spanning 31-95 Hz), we defined sub-bands for further analyses as illustrated in Table 1. Power estimates were then normalized as follows:

\[ \text{normalized data}_c = \frac{\text{data}_c - \text{mean}(\text{data}_{\text{all experimental conditions}})}{\text{mean}(\text{data}_{\text{all experimental conditions}})} \]

where normalized data_c and data_c are power estimates for each experimental condition at any time point, channel, frequency, and trial, while mean(data_all experimental conditions) are power estimates averaged over trials and time points. The normalized data were then averaged over time and trials to obtain average normalized power estimates for each channel and each frequency.
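Assuming the normalization is a relative change with respect to the grand mean over trials and time points of all conditions (computed per channel and frequency), it could be sketched as:

```python
import numpy as np

def normalize_power(power_by_cond):
    """power_by_cond maps condition name -> power array of shape
    (n_trials, n_times, n_channels, n_freqs). Each condition is expressed
    as a relative change with respect to the grand mean over trials and
    time points of all conditions (per channel and frequency)."""
    all_power = np.concatenate(list(power_by_cond.values()), axis=0)
    grand_mean = all_power.mean(axis=(0, 1), keepdims=True)  # (1, 1, ch, fr)
    return {cond: (p - grand_mean) / grand_mean
            for cond, p in power_by_cond.items()}
```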
In accordance with the 2 × 2 factorial design, the main effects of interpersonal visual contact and interpersonal spatial proximity, and the interaction between them, were computed using a cluster-based non-parametric permutation test (Maris and Oostenveld, 2007). This procedure controls the false-alarm rate by using a cluster statistic that is evaluated under a single permutation distribution. More specifically, the algorithm compares power estimates between conditions, separately for each channel and frequency sub-band, using two-sided paired-samples t-tests, yielding one t-value for each channel and frequency sub-band. The algorithm then forms clusters of (at least three) neighboring channels whose t-values exceed the significance criterion (here p < 0.05) and computes a cluster-level statistic (the sum of the t-values within the cluster). Next, cluster-level statistics from 1000 random partitions of the data are used to build a distribution upon which the significance probability (here p < 0.05) of the experimental cluster can be estimated. In order to run this analysis in the Fieldtrip toolbox (see above), it was necessary to average data across conditions to assess the main effects and interactions using t-tests. Main effects were tested by contrasting the averaged Vision vs. No Vision conditions (irrespective of interpersonal spatial proximity) or Far vs. Near conditions (irrespective of interpersonal visual contact). The interaction was tested by contrasting the difference of Far Vision and Near Vision vs. the difference of Far No Vision and Near No Vision. Effect sizes (Cohen's d) were calculated as the standardized mean difference for each contrast over the channels and sub-bands comprising a cluster (Meyer et al., 2021).
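For intuition, a simplified version of this test (per-dyad label flipping, clusters of at least three neighboring channels, cluster mass as the sum of t-values) might look as follows; unlike FieldTrip, this sketch does not treat positive and negative clusters separately:

```python
import numpy as np
from scipy import stats

def cluster_perm_test(cond_a, cond_b, adjacency, alpha=0.05,
                      n_perm=1000, min_size=3, seed=0):
    """cond_a, cond_b: (n_dyads, n_channels) values for one frequency
    sub-band. adjacency: list of neighbor-index lists per channel.
    Returns the largest observed cluster mass and its permutation p-value."""
    rng = np.random.default_rng(seed)

    def max_cluster_mass(a, b):
        t, p = stats.ttest_rel(a, b, axis=0)
        sig = p < alpha
        best, seen = 0.0, set()
        for start in np.flatnonzero(sig):     # grow connected clusters
            if start in seen:
                continue
            stack, cluster = [start], []
            while stack:
                ch = stack.pop()
                if ch in seen or not sig[ch]:
                    continue
                seen.add(ch)
                cluster.append(ch)
                stack.extend(adjacency[ch])
            if len(cluster) >= min_size:
                best = max(best, abs(t[cluster].sum()))
        return best

    observed = max_cluster_mass(cond_a, cond_b)
    null = np.empty(n_perm)
    for i in range(n_perm):
        flip = rng.random(len(cond_a)) < 0.5  # swap condition labels per dyad
        a = np.where(flip[:, None], cond_b, cond_a)
        b = np.where(flip[:, None], cond_a, cond_b)
        null[i] = max_cluster_mass(a, b)
    return observed, float((null >= observed).mean())
```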

Interpersonal Neural Synchrony (INS)
Because intra-brain power estimates were statistically different across conditions (see below), we employed a separate normalization of the time-frequency transformed data that mitigated these effects and aimed at keeping a comparable range of variability across conditions. This procedure was intended to minimize the possibility of measuring spurious INS. Hence, before computing INS, we used the following normalization procedure:

\[ \text{normalized data}_c = \frac{\text{data}_c - \text{mean}(\text{data}_c)}{\text{mean}(\text{data}_c)} \]

where data_c is the (channel- and frequency-specific) power time course for the entire duration of a single trial (0-120 s) and mean(data_c) is the average (over time) of these data.
To control for time-varying effects on the normalized time courses, we detrended these time series and then computed INS using a trial-by-trial Pearson's correlation (i.e., between the normalized power time courses of the two participants forming a dyad; similarly to Liu et al., 2021; Rose et al., 2021; Zamm et al., 2018; Zhang and Yartsev, 2019). We used this measure of INS because (i) amplitude changes are easier to estimate and often more reliable than changes in other signal properties such as phase (notably, phase might be estimated even when there is no signal in the frequency of interest) (Burgess, 2013; Thatcher, 2010); (ii) amplitude modulations have been characterized more extensively (compared to phase modulations) across several EEG, MEG, and also LFP studies (Aru et al., 2015; Muthukumaraswamy and Singh, 2011); and notably (iii) amplitude modulations have been suggested to reveal neural coupling that is typically not detectable, or is overestimated, by phase-related INS measures such as coherence (Bruns et al., 2000; Hipp et al., 2012; Xu et al., 2022; Zamm et al., 2018). Some studies have reported that amplitude modulations capture brain activity similar to that captured by fMRI and fNIRS measurements (Mantini et al., 2007; Scheeringa et al., 2011). The INS measure was both channel- and frequency band-specific, and entailed only homologous channels and bands. Restricting the analysis to homologous channels was sufficient to address the study's hypothesis. Besides, this implies noteworthy benefits in terms of (i) interpretability and (ii) computational costs. First, using homologous channels yields results that are easier to interpret and relate to traditional electrophysiological studies examining single brains. Second, focusing on homologous electrodes requires fewer computational resources and fewer statistical comparisons (4096 in the present case). We then averaged the correlation coefficients across the different trials belonging to the same experimental condition and Fisher z-transformed them (inverse hyperbolic tangent function) to obtain averaged normalized correlation coefficients for each channel and each frequency band (across conditions within each dyad). Similar to what we did above with intra-brain power, we averaged the coefficients according to frequency sub-bands. Finally, we compared INS (i.e., the normalized coefficients) across conditions using the cluster-based non-parametric permutation test illustrated above (Maris and Oostenveld, 2007).
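The trial-wise INS computation (detrending followed by Pearson correlation of the two normalized power time courses, then averaging across trials and Fisher z-transforming) can be sketched as:

```python
import numpy as np
from scipy.signal import detrend

def ins_per_trial(power_a, power_b):
    """Trial-wise INS: Pearson correlation between the detrended,
    normalized power time courses of the two dyad members, for one
    homologous channel and frequency band."""
    return np.corrcoef(detrend(power_a), detrend(power_b))[0, 1]

def average_ins(trial_r_values):
    """Average the trial-wise coefficients within a condition and Fisher
    z-transform the result (inverse hyperbolic tangent)."""
    return np.arctanh(np.mean(trial_r_values))
```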
A control analysis was run to confirm that significant effects yielded by the previous analysis were attributable to the real-time interaction between the participants forming a given dyad, as opposed to stereotypical modulations of brain activity spuriously causing INS. For this analysis, the power time courses associated with one member of a dyad were correlated with those of all other participants except the actual partner. 'Surrogate' data were thus generated from 'pseudo' dyads, yielding INS values that were subtracted from the genuine data. The difference (correlation) coefficients were then submitted to the cluster-based non-parametric permutation test illustrated above (Maris and Oostenveld, 2007).
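A compact sketch of this pseudo-dyad control, in which each participant's power time course is correlated with those of all non-partners and the mean surrogate INS is subtracted from the genuine INS (array layout assumed):

```python
import numpy as np

def surrogate_corrected_ins(power_a, power_b):
    """power_a, power_b: (n_dyads, n_times) power time courses of the two
    members of each dyad (same channel/band). Returns, per dyad, the
    genuine INS minus the mean INS computed over 'pseudo' dyads."""
    n = len(power_a)
    # full correlation matrix; r[i, j] = corr(member A of dyad i,
    # member B of dyad j)
    r = np.corrcoef(power_a, power_b)[:n, n:]
    genuine = np.diag(r)
    pseudo = (r.sum(axis=1) - genuine) / (n - 1)  # mean over non-partners
    return genuine - pseudo
```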

Integration of neural and behavioral data

Instantaneous synchrony
We estimated time-by-time INS by computing instantaneous correlations between the two partners, focusing on the Vision conditions (i.e., when the two participants could see each other). We focused only on the power time courses associated with the three significant clusters resulting from the previous INS analysis (one cluster per frequency sub-band). The power time courses of each participant were taken; first their means were subtracted, and then they were normalized to unit length. Each time point of these normalized data was then represented as a vector in 2D space, where the two axes corresponded to the normalized data from the two subjects. Next, the correlation index was computed by taking the smaller of the two angles subtended by this vector to a line orthogonal to the equality line (at a 45° angle) (Zhang and Yartsev, 2019). In this scheme, an angle of 90° represents a perfect correlation, while an angle of 45° represents no correlation. The instantaneous INS thus obtained was max-normalized and ranged from 0 to 1.
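The angle-based instantaneous correlation could be sketched as follows (the handling of zero-norm samples and the final max-normalization step are our assumptions):

```python
import numpy as np

def instantaneous_sync(x, y):
    """Angle-based instantaneous correlation (after Zhang and Yartsev,
    2019): both series are mean-centred and scaled to unit length; each
    sample (x_t, y_t) is a vector in 2D, and the index at time t is the
    smaller angle between that vector and the line orthogonal to the
    equality (45 degree) line, so 90 degrees maps to perfect correlation
    and 45 degrees to none. The series is finally max-normalized."""
    x = x - x.mean()
    x = x / np.linalg.norm(x)
    y = y - y.mean()
    y = y / np.linalg.norm(y)
    u = np.array([1.0, -1.0]) / np.sqrt(2.0)  # orthogonal to equality line
    v = np.stack([x, y], axis=1)              # one 2D vector per time point
    norms = np.linalg.norm(v, axis=1)
    cos = np.abs(v @ u) / np.where(norms == 0.0, 1.0, norms)
    angle = np.degrees(np.arccos(np.clip(cos, 0.0, 1.0)))  # in [0, 90]
    return angle / angle.max()                # max-normalized, 0..1
```

When the two series are identical, every sample lies on the equality line (angle 90°), so the index is 1 throughout.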
We also estimated behavioral synchrony between the two partners (again focusing on the Vision conditions). To this aim, the behavioral data – eye contact, smiling, and body movement – were processed in the same way as the power time courses, leading to estimates of instantaneous behavioral synchrony. Such behavioral synchrony was then resampled to 10 Hz to match the sampling frequency of the instantaneous INS.

Computational modeling
We used a Bayesian hierarchical linear regression to predict INS (again, focusing on the three significant INS clusters and on the Vision conditions) from behavioral synchrony (eye contact, body movement, and smiling) (see supplementary Fig. S2 for the depiction of the model).
In a first hierarchical model, we predicted INS from discretized behavioral data (results presented in Section 3.3). Specifically, for discretizing eye contact, we first estimated the area occupied by the partner's face on each image recorded by the eye tracker (relying on landmarks estimated using OpenPose). Next, this information was combined with the gaze information retrieved from the eye tracker, resulting in a binary code (1 if the gaze location overlapped with the face location, 0 otherwise). For discretizing smiling and body movement, we used a trial-specific threshold of mean + 1 standard deviation. This analysis revealed that, during the Vision conditions, participants looked at each other's faces 21.20% of the time (20.48% of which in synch with their partners), smiled at each other 14.16% of the time (35.48% of which in synch with their partners), and performed spontaneous body movement 18.38% of the time (22.84% of which in synch with their partners). Next, we compared two hierarchical models testing whether an individual behavior of one participant, working as a signal, is sufficient to observe INS (IND Model), or whether such behavior needs to be reciprocated and therefore occurs in both individuals simultaneously (REC Model). A second hierarchical model instead predicted time-by-time INS from continuous (i.e., not binarized) behavioral data (results presented in Section 3.4).
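The discretization rule and the 'in synch' percentages described above could be sketched as follows (the threshold rule is from the text; function names are hypothetical, and the eye-contact binarization via gaze/face overlap is omitted):

```python
import numpy as np

def discretize(series):
    """Binary code for smiling or body movement: 1 when the signal
    exceeds the trial-specific threshold of mean + 1 standard deviation."""
    series = np.asarray(series, dtype=float)
    return (series > series.mean() + series.std()).astype(int)

def in_synch(code_a, code_b):
    """Of the samples in which A shows the behavior, the fraction in
    which B shows it simultaneously (the 'in synch' percentage)."""
    active = np.asarray(code_a) == 1
    return float((np.asarray(code_b)[active] == 1).mean()) if active.any() else 0.0
```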
Hierarchical models were utilized because their parameters meaningfully describe the observed data at multiple levels, integrating information within and across levels ( Kruschke and Vanpaemel, 2015 ). The Bayesian approach was chosen because it provides a full and directly interpretable representation of parameter uncertainty through the posterior distribution. Unlike frequentist methods, Bayesian methods do not involve the construction of sampling distributions from auxiliary null hypotheses ( Kruschke, 2010, 2013; see also Wagenmakers, 2007 for more details).
The hierarchical regression is a partial-pooling, or multilevel, approach that estimates the group coefficients while allowing the subject-level coefficients to vary ( Gelman et al., 2004 ). The assumption is that the subject-level coefficients come from a distribution centered around their respective group mean. We modeled the averages of the group-level coefficients as random variables with normal distributions (Eq. (1)) centered at 0 and with standard deviations of 100, and the group standard deviations as half-normal distributions (Eq. (2)) with standard deviations of 5. The overall error was modeled as a half-Cauchy distribution (Eq. (3)) with a scale parameter of 5.
p(s | μ, τ) = √(τ / 2π) exp(−τ(s − μ)² / 2)  (1)

where τ = 1/σ², s is the random variable with s ∈ ℝ, μ is the mean, τ is the precision, and σ is the standard deviation of the distribution.

p(s | τ) = √(2τ / π) exp(−τs² / 2)  (2)

where σ² = 1/τ, s is the random variable with s ∈ [0, ∞), τ is the precision, and σ is the standard deviation of the distribution.

p(s | β) = 2 / (πβ(1 + (s/β)²))  (3)

where s is the random variable with s ∈ [0, ∞) and β is the scale parameter.
The subject-level coefficients were modeled as normal distributions with means centered at the group-level average and standard deviations equal to the group-level standard deviation. The linear regression was estimated in a Bayesian framework. We sampled the posterior distribution of the parameters using the Markov chain Monte Carlo (MCMC) sampling algorithm 'No-U-Turn Sampler' (NUTS) ( Hoffman and Gelman, 2014 ). NUTS is a self-tuning variant of the Hamiltonian Monte Carlo (HMC) algorithm that avoids random-walk behavior and sensitivity to correlated parameters by utilizing gradient information from the likelihood. These features allow NUTS to converge to high-dimensional target distributions better than traditional sampling methods such as random-walk Metropolis or Gibbs sampling ( Hoffman and Gelman, 2014 ).
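The partial-pooling logic, whereby each dyad's coefficient is shrunk toward the group mean in proportion to its reliability, can be illustrated with a schematic simulation (a didactic closed-form shrinkage, not the authors' PyMC3/NUTS model; all variable names and the simulated values are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate 10 dyads whose true regression slopes scatter around a group mean.
group_mean, group_sd, noise_sd, n_obs = 0.5, 0.2, 1.0, 50
true_slopes = rng.normal(group_mean, group_sd, size=10)

# Per-dyad ("no pooling") least-squares slope estimates.
ols_slopes = []
for b in true_slopes:
    x = rng.normal(size=n_obs)
    y = b * x + rng.normal(scale=noise_sd, size=n_obs)
    ols_slopes.append(x @ y / (x @ x))
ols_slopes = np.array(ols_slopes)

# Partial pooling: shrink each dyad's estimate toward the grand mean,
# weighted by the ratio of between-dyad variance to total variance.
grand = ols_slopes.mean()
shrink = group_sd**2 / (group_sd**2 + noise_sd**2 / n_obs)
pooled = grand + shrink * (ols_slopes - grand)
```

In the hierarchical Bayesian model this shrinkage emerges automatically from the priors in Eqs. (1)-(2) rather than from an explicit closed-form weight.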
We drew 2500 samples from each of 4 chains run in parallel. Model convergence was verified using the R-hat statistic ( Vehtari et al., 2021 ), which compares the between- and within-chain estimates for each model parameter. R-hat values deviating from 1 reflect issues with model convergence.
Model selection was performed using the Pareto-smoothed importance sampling leave-one-out (PSIS-LOO) cross-validation method ( Vehtari et al., 2017 ). This method estimates the expected log pointwise predictive density (elpd) by fitting a Pareto distribution to the upper tail of the distribution of the importance weights. The procedure is asymptotically equivalent to the Watanabe-Akaike information criterion (WAIC; Watanabe, 2010 ), but PSIS-LOO has been suggested to be more robust in the finite case with weak priors or influential observations ( Vehtari et al., 2017 ).
We determined the significance of the group-level coefficients based on the overlap of their estimated posterior distributions with the test value of 0. An overlap of less than 5% of the posterior distribution (P p|D ) was considered significant. Since the hierarchical regression procedure violates the independence assumption between the dyads, we did not analyze regression parameter estimates with a frequentist test. All Bayesian modeling was performed using the 'pymc3' package ( Salvatier et al., 2016 ), while model selection was performed using the 'arviz' package ( Kumar et al., 2019 ).

Cross-correlation
We explored whether behavioral synchrony anticipated or followed INS. To do so, we cross-correlated the instantaneous estimates of neural and behavioral synchrony (separately for the three examined behaviors and the three identified clusters) at different lags (±10 s) in steps of 0.1 s. We tested each pointwise cross-correlation coefficient for significance using a two-sided one-sample t-test against a test value of 0. We corrected the resulting p-values using a false discovery rate (FDR) correction ( Benjamini and Hochberg, 1995 ).
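The two ingredients of this analysis (a lagged correlation and the Benjamini-Hochberg step-up procedure) can be sketched as follows; these are minimal illustrations under our own conventions, not the authors' exact code, and the pointwise t-test is omitted:

```python
import numpy as np

def lagged_corr(x, y, max_lag):
    """Pearson correlation of x[t] with y[t + lag] for each integer lag
    in [-max_lag, max_lag] (lag expressed in samples)."""
    out = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag < 0:
            a, b = x[-lag:], y[:lag]
        elif lag > 0:
            a, b = x[:-lag], y[lag:]
        else:
            a, b = x, y
        out[lag] = np.corrcoef(a, b)[0, 1]
    return out

def fdr_bh(pvals, q=0.05):
    """Benjamini-Hochberg FDR: boolean mask of rejected hypotheses."""
    p = np.asarray(pvals)
    order = np.argsort(p)
    m = len(p)
    thresh = q * np.arange(1, m + 1) / m
    below = p[order] <= thresh
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    mask = np.zeros(m, dtype=bool)
    mask[order[:k]] = True
    return mask
```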

Granger causality
We further performed a Granger causality analysis to determine the temporal relationship between the neural and behavioral synchronies, i.e., whether behavioral synchronies Granger-caused neural synchronies or vice versa. Given two variables X and Y, X is said to Granger cause Y if the past of X conveys information about the future of Y above and beyond all information contained in the past of Y ( Granger, 1969 ). We operationalized Granger causality in a multivariate, model-based approach by utilizing vector autoregressive (VAR) model theory ( Hamilton, 1994 ; Lütkepohl, 2005 ). We applied the Granger causality analysis to the continuous power time course data from the significant clusters obtained in the previous analyses (each containing 1200 data points). We acknowledge that these time series are longer than those commonly used in the field, and we highlight that this was meant to compensate for the fact that we could rely on only three trials per condition. However, it should also be noted that previous work has indicated that Granger causality indices relying on so few trials might lead to variable estimates and therefore be suboptimal ( Bastos and Schoffelen, 2016 ). For this reason, we recommend that future studies rely on more trials, where possible. We then z-scored the data and estimated the information criteria (Akaike and Bayesian) for the VAR model (maximum model order for model order estimation = 15). We subsequently used the Bayesian model order to fit the VAR model. The resulting regression coefficients were then used to obtain autocovariance sequences and, consequently, the pairwise-conditional time-domain multivariate Granger causalities (MVGCs). We used significance testing to compare differences between the two directions. Since the data were not normally distributed (Anderson-Darling tests, ps < 0.05), we used non-parametric Wilcoxon signed-rank tests. The Granger causality analysis was performed using the MVGC MATLAB toolbox ( Barnett and Seth, 2014 ).
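The logic of time-domain Granger causality (does adding the past of X reduce the residual variance of a model predicting Y?) can be illustrated with a one-lag bivariate sketch. The paper relies on the multivariate, information-criterion-driven MVGC toolbox, so the helper below is only a conceptual stand-in with a model order fixed to 1 and simulated data of our own:

```python
import numpy as np

def granger_1lag(x, y):
    """Time-domain Granger causality x -> y with one lag, as the log ratio
    of restricted vs. full residual variances (Geweke-style measure)."""
    Y, Y1, X1 = y[1:], y[:-1], x[:-1]
    # Restricted model: y[t] ~ 1 + y[t-1]
    A = np.column_stack([np.ones_like(Y1), Y1])
    res_r = Y - A @ np.linalg.lstsq(A, Y, rcond=None)[0]
    # Full model: y[t] ~ 1 + y[t-1] + x[t-1]
    B = np.column_stack([np.ones_like(Y1), Y1, X1])
    res_f = Y - B @ np.linalg.lstsq(B, Y, rcond=None)[0]
    return np.log(res_r.var() / res_f.var())

# Simulate 1200 samples (as in the analyzed clusters) where x drives y.
rng = np.random.default_rng(1)
x = np.zeros(1200)
y = np.zeros(1200)
for t in range(1, 1200):
    x[t] = 0.5 * x[t - 1] + rng.normal()
    y[t] = 0.5 * y[t - 1] + 0.4 * x[t - 1] + rng.normal()
```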

INS emerges spontaneously through interpersonal visual contact
We parameterized INS as the Pearson's correlation between the electrode- and sub-band-specific EEG power time courses (i.e., envelopes) measured from the two individuals forming a dyad ( Fig. 2a; see also supplementary Fig. S3 and Section 2.5.2 for a full description). We compared INS across conditions using a fully data-driven approach (i.e., a cluster-based permutation test, see Section 2.5.1 for a full description). INS clusters represent homologous scalp regions whose activity unfolds similarly over time.
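This parameterization amounts to one Pearson correlation per electrode and sub-band. Assuming the power envelopes are stored as (electrodes × sub-bands × time) arrays (our own layout, for illustration), a vectorized sketch:

```python
import numpy as np

def ins_map(env_a, env_b):
    """Pearson correlation between partners' power envelopes for every
    (electrode, sub-band) pair. env_* : array (n_elec, n_band, n_time)."""
    a = env_a - env_a.mean(-1, keepdims=True)   # mean-center over time
    b = env_b - env_b.mean(-1, keepdims=True)
    num = (a * b).sum(-1)
    den = np.sqrt((a**2).sum(-1) * (b**2).sum(-1))
    return num / den                            # shape (n_elec, n_band)
```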

βenv and γenv INS clusters emerged when participants could see each other (Fig. 2b), as well as when comparing INS across Vision and No Vision conditions (irrespective of interpersonal spatial proximity, Fig. 2c).
A third, alpha (αenv, 10-11 Hz) INS cluster indicated a complex interaction between Vision and Interpersonal Spatial Proximity ( Fig. 2c; αenv INS cluster p = 0.019, Cohen's d = 0.52). None of the follow-up (cluster-based permutation) t-tests aimed at assisting its interpretation yielded significant results (ps > 0.05). Furthermore, while the βenv and γenv INS clusters were replicated in a series of control analyses relying on different preprocessing pipelines, this was not the case for the αenv INS cluster (supplementary Fig. S4).
As a final note, we report that in another supplemental analysis we were able to replicate the above findings after segmenting the data from our trials (2 min long) into shorter epochs of 60 s, 30 s, 20 s, 10 s, or 5 s (supplementary Fig. S5).

INS does not simply depend on individual EEG modulations
For control purposes, we conducted two additional analyses aimed at assessing whether INS depended only on neural modulations occurring in one individual, i.e., irrespective of the modulations simultaneously occurring in the partner's brain.
In a first analysis, we compared intra-brain power across conditions, pooling all participants regardless of the dyads. This analysis revealed power modulations, entailing multiple bands, induced by both visual contact and interpersonal spatial proximity (supplementary Fig. S6). These modulations were often qualitatively different from those associated with INS. For instance, γenv power was not only enhanced when participants were able to see each other (Vision vs. No Vision), but also when they were closer to each other (Near vs. Far) (cluster p = 0.012, Cohen's d = −0.46). Furthermore, βenv power was suppressed during Vision conditions, and this effect did not show any clear lateralization like that observed in the βenv INS cluster (compare supplementary Fig. S6 with Fig. 2; p = 0.005, Cohen's d = 0.72; see also supplementary Fig. S7 for power modulations separately for Vision and No Vision conditions).
In a second analysis, we demonstrated that INS emerged from dyad-specific interactions, as opposed to analogous EEG modulations occurring in all participants and leading to spurious synchrony. We performed a permutation analysis, forming surrogate dyads of participants that did not actually interact with one another (see Section 2.5.2 for details; Djalovski et al., 2021 ; Novembre et al., 2017 ). Comparing INS across real and surrogate dyads, we fully replicated our initial INS results and identified the same βenv, γenv, and αenv INS clusters described above (Vision compared to No Vision: βenv INS cluster p = 0.025, Cohen's d = 0.60; γenv INS cluster p = 0.002, Cohen's d = 0.62; interaction between Vision and Interpersonal Spatial Proximity: αenv INS cluster p = 0.016, Cohen's d = 0.52; see supplementary Fig. S8).
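The surrogate-dyad control can be sketched as re-pairing participants so that no one is matched with their true partner (a derangement of dyad indices), after which the INS pipeline is re-run on the pseudo-pairs. A hypothetical helper, under our own conventions:

```python
import numpy as np

def surrogate_pairs(n_dyads, rng):
    """Return a permutation of dyad indices with no fixed points, i.e.,
    every participant is re-paired with someone other than their partner."""
    perm = rng.permutation(n_dyads)
    while np.any(perm == np.arange(n_dyads)):   # resample until derangement
        perm = rng.permutation(n_dyads)
    return perm
```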
Together, these results indicate that INS was not simply a by-product of individual EEG modulations; rather, it reflected dyad-specific neural dynamics, perhaps tracking real-time information transfer mediated by social behavior.We investigate this hypothesis in the next analysis.

INS correlates with reciprocated (synchronous) behavior
Having demonstrated that INS emerges spontaneously when two individuals simply look at each other, we searched for the behavioral correlates of spontaneous INS. For this analysis, we focused on the Vision conditions, i.e., the experimental conditions where the two participants could see each other. A qualitative assessment of the video recordings on multiple behaviors (see Supplementary Table T1 for a full list and description) identified three social behaviors that the dyads most frequently displayed (see supplementary Fig. S1 and Section 2.4). These were: (i) eye contact, (ii) body movement (i.e., movement of any body part), and (iii) smiling. Using deep learning-based automated analyses, we extracted the time course of these behaviors for each participant ( Fig. 3a; see also Section 2.4 for a detailed description). Next, we computed the time course of INS, i.e., instantaneous INS, a measure used to capture temporally resolved changes in synchrony (see Fig. 4a, Section 2.6.1, and Zhang and Yartsev, 2019 ).
We then tested whether and how instantaneous INS relates to these behaviors. We formulated two potential models for this interaction: either the individual behavior of one participant, working as a signal, is sufficient to observe INS (IND Model), or such behavior needs to be reciprocated and therefore to occur in both individuals simultaneously (REC Model). These models were analytically formalized by thresholding the continuous behavioral cues and applying either a logical XOR operation (for the IND model) or a logical AND operation (for the REC model) (see Section 2.6.2 for details). We compared the performance of these two models using Pareto-smoothed importance sampling leave-one-out cross-validation (PSIS-LOO-CV; Vehtari et al., 2017 ). Results indicated that the γenv INS cluster was better predicted by the REC model, while the αenv and βenv INS clusters were better predicted by the IND model ( Table 2 ). Thus, reciprocated social behaviors (eye contact, body movement, and smiling) predicted γenv INS better than non-reciprocated behaviors (i.e., behaviors performed by only one individual within a dyad). This result suggested that, similar to brain activity, some behaviors may have synchronized within a dyad. This suggestion was confirmed by a correlation analysis indicating that indeed body movement (p < 0.001), smiling (p < 0.001), and, to a lesser extent, eye contact (p = 0.0456) were significantly correlated (over time) across the two members of a dyad ( Fig. 3b). Notably, just like INS, these synchronous behaviors occurred despite the absence of an instructed task.
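Given the binarized behavior traces of the two partners, the IND and REC regressors reduce to element-wise logical operations, as sketched here (a minimal illustration; the function name is ours):

```python
import numpy as np

def ind_rec(beh_a, beh_b):
    """Build the binary IND (XOR: exactly one partner shows the behavior)
    and REC (AND: both partners show it) regressors from two binarized
    behavior traces."""
    a, b = beh_a.astype(bool), beh_b.astype(bool)
    return (a ^ b).astype(int), (a & b).astype(int)
```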

Mapping INS clusters on specific behaviors
We estimated the relationship between INS clusters and synchronized behaviors using a Bayesian hierarchical (partial-pooling) linear regression procedure, predicting instantaneous INS from instantaneous behavioral correlation ( Gelman et al., 2004 ).
We first predicted INS using a model that included all three explored social behaviors: eye contact, body movement, and smiling. This model outperformed the null model (random variations across the dyads) ( Fig. 4b). Specifically, model comparison using PSIS-LOO cross-validation ( Vehtari et al., 2017 ) showed that the social behavior models explained INS variance better than their respective null models, and this was so for all INS clusters ( Table 3 ).
We then zoomed into the contribution of specific synchronized behaviors by examining the posterior distributions of the regression parameter estimates. Notably, the γenv INS cluster was significantly predicted by body movement and smiling, irrespective of one another (posterior distribution overlap P p|D [smiling < 0] < 0.001, P p|D [body movement < 0] < 0.001). Furthermore, the βenv INS cluster was significantly predicted by eye contact (P p|D [eye contact < 0] = 0.042) ( Fig. 4c, Table 4 ). Finally, the αenv INS cluster was not predicted by any of the synchronized behaviors (p > 0.05).

Fig. 4. Behavioral cues predict Interpersonal Neural Synchrony (INS). a: Representation of the computation of instantaneous (time-by-time-point) correlations for interpersonal neural and behavioral synchrony. The instantaneous correlation index is computed by taking the time series from each dyad (X, Y, being brain or behavioral data) and first subtracting its mean and normalizing it to unit length (x, y). Each timepoint of this normalized data is then represented as a vector in 2D space (x_t, y_t), where the axes correspond to the normalized data from each subject. Next, the correlation index is computed from the angle subtended by the point (x_t, y_t) relative to a line at −45° (i.e., a line orthogonal to the equality line at 45°), taking the smaller of the two angles subtended by the (x_t, y_t) vector to the −45° line. In this scenario, a resultant angle of 90° represents a perfect correlation (the vector would be coincident with the equality line at 45°), while an angle of 45° represents no correlation (the vector would be coincident with one of the axes). Finally, the instantaneous correlation thus obtained was max normalized and ranged from 0 to 1.
Hence, when all behaviors were combined into a single model, we could successfully predict all INS clusters. Predicting INS clusters from specific behaviors instead yielded a more detailed picture: synchronized body movement or smiling, taken alone, predicted γenv INS; only eye contact predicted βenv INS; and no behavior alone predicted αenv INS.

Synchronized behavior anticipates and predicts INS
Having identified a tight relationship between instantaneous INS and synchronized social behavior, we aimed at determining the temporal relationship between these variables.
To achieve this, we first cross-correlated the time course of INS and behavioral synchrony, separately for each specific social behavior and INS cluster ( Fig. 5a). The results of this analysis indicated that synchronized social behavior, and specifically body movement and smiling, co-occurred with γenv INS ( Fig. 5b). Notably, synchronized smiling behavior slightly anticipated INS (see inset in Fig. 5b), with a temporal lag of ∼200 ms. This suggested that synchronized smiling behavior was followed by γenv INS after a time interval of ∼200 ms. This precise temporal relationship was identified only in the γenv INS cluster, while the αenv and βenv INS clusters did not exhibit a clear temporal relationship with synchronized behavior. Next, to shed further light upon the relationship between INS and social behavior, and to identify dependencies with potentially variable time lags, we used Granger causality analysis. This aimed to establish whether synchronized social behavior Granger-caused INS, vice versa, or both. The results revealed that INS and synchronized social behavior mutually Granger-caused each other, as suggested by the fact that all pairwise-conditional time-domain multivariate Granger causalities (MVGCs) were significantly higher than 0 (ps < 0.001). Crucially, however, the MVGCs tagging the effect of behavioral synchrony on INS were significantly higher than those tagging the opposite relationship (ps < 0.05), hence implying that behavioral synchrony was more likely to Granger-cause INS than vice versa. This result generalized well across all INS clusters and social behaviors, implying that even though we could not observe a temporally-fixed relationship between social behaviors and the αenv and βenv INS clusters (as indicated by the cross-correlation coefficients), these INS clusters were still likely to be Granger-caused by synchronized social behaviors, probably with a time lag that varied across trials and/or dyads.

(Table legend: PSIS-LOO estimate: Pareto-smoothed importance sampling leave-one-out cross-validation estimate; PSIS-LOO standard error: standard error of the PSIS-LOO computation; Weight: loosely interpretable as the probability of each model being true, among the compared models, given the data; Mean estimate: average coefficient estimate from the posterior distribution; P p|D : posterior distribution overlap of the parameter above the test value of 0; 90% HDI: 90% highest density interval; R-hat: convergence diagnostic comparing between- and within-chain estimates, with values deviating from 1 reflecting issues with model convergence.)

Discussion
In the current study, we investigated the origins of Interpersonal Neural Synchrony (INS) by recording spontaneous behavior and brain activity from dyads of participants. We observed that, despite the absence of a structured task, INS emerged spontaneously when participants could simply see each other.
The observed INS had a specific spectral and topographic profile, comprising the envelopes of frontal alpha (αenv), right-posterior beta (βenv), and occipital-parietal gamma (γenv) EEG activity. This pattern of interpersonal neural activity could not be simply explained by similar intrapersonal neural activities (i.e., neural activities occurring in both individuals irrespective of their partners), for two reasons: first, because analyzing intra-neural activity separately for each individual yielded a markedly different pattern of results, and, second, because our results could not be replicated when extracting INS from surrogate dyads. Hence, the observed INS emerged from dyad-specific interactions.
Probing the behavioral correlates of spontaneous INS, we used advanced image processing, deep learning, and computational modeling to demonstrate that INS was rooted in behavioral synchrony. Notable behavioral cues (eye contact, smiling, and body movement) were reciprocated within dyads. Each of these behaviors predicted and Granger-caused the emergence of spectrally and topographically specific INS.

Spontaneously-emerging synchrony
We conceptualize the above results in terms of a "spontaneously-emerging" synchrony, i.e., the co-occurrence of similar spontaneous behavior and neural activity arising without an explicit task. Indeed, the observed synchrony emerged even though participants were not performing any structured task or pursuing an instructed collective goal. While the majority of neuroscientific studies would tacitly assume that recording behavioral and neural activity in the absence of a task would lead to unpredictable changes over time, here we show that having two individuals share the same (social) environment leads to coupled variations of both behavior and brain activity. This implies that interpersonal synchrony, including INS, can be conceptualized as a fundamental emergent property of two coupled biological systems, i.e., in this case, two humans coupled through visual contact. This observation and conceptualization are remarkably compatible with dynamical systems theory ( Kelso, 1995 ). In particular, the synchronization between two biological systems could be seen as a self-organized process emerging spontaneously, without the need for any external ordering or instruction ( Camazine et al., 2003 ; Pezzulo et al., 2019 ). As such, INS would be reminiscent of many other self-organized entrainment phenomena observed across very heterogeneous fields of research, including psychology ( Oullier et al., 2008 ; Richardson et al., 2005, 2007 ; Schmidt et al., 1998 ; Varlet et al., 2017 ), ethology ( Ballerini et al., 2008 ; Buhl et al., 2006 ; Cavagna et al., 2010 ; Couzin, 2007 ; Yates et al., 2009 ), and even physics ( Peña Ramirez et al., 2016 ; Von Humboldt, 1846 ). For instance, INS might be qualitatively similar to the spontaneous synchronization of behavioral states in collective animal behaviors such as flocking in birds ( Ballerini et al., 2008 ) or swarming in insects ( Yates et al., 2009 ).
The fact that INS can emerge spontaneously (i.e., in the absence of a given task) has remarkable implications for the majority of previous studies investigating INS. Indeed, these studies have attributed INS to specific social tasks entailing coordination ( Dumas et al., 2010 ; Hoffmann et al., 2019 ), conversation ( Dikker et al., 2014, 2021 ; Jiang et al., 2012 ; Kinreich et al., 2017 ; Stephens et al., 2010 ), and cooperation ( Cui et al., 2012 ; Liu et al., 2017 ; Toppi et al., 2016 ). If, as we have shown, INS does not need an instructed collective goal to emerge ( Kourtis et al., 2019 ; Saito et al., 2010 ; Sebanz et al., 2006 ; Vesper et al., 2010 ), then it follows that we might need to reconsider our investigative approach to this phenomenon. Indeed, several studies in this field have observed INS while participants face each other and also perform a structured social task ( Barraza et al., 2020 ; Dikker et al., 2017 ; Djalovski et al., 2021 ; Fishburn et al., 2018 ; Hu et al., 2018 ; Liu et al., 2021 ; Meshulam et al., 2021 ; Pinti et al., 2021 ; Reinero et al., 2021 ; Tang et al., 2016 ). Would the observed INS be driven by the actual task, or by unmonitored behavioral cues such as the ones we characterized here? If the latter holds true, then we might not need formal tasks to understand INS after all.
The proposition that INS is better investigated under natural and unconstrained behavior is further reinforced by recent animal studies ( Kingsbury et al., 2019 ; Yang et al., 2021 ; Zhang and Yartsev, 2019 ). For instance, in pairs of freely-behaving, interacting mice, INS emerges during periods of naturally-occurring social behavior (e.g., sniffing, escape, and approach) but not during non-social behaviors (e.g., nesting, digging, and self-grooming) ( Kingsbury et al., 2019 ). Other studies, examining groups of freely-behaving bats, have even described frequency-specific INS in the gamma band envelope (as in the current study) ( Rose et al., 2021 ; Zhang and Yartsev, 2019 ). Remarkably, this gamma INS proved to be specific to spontaneous, as opposed to instructed, vocal interactions ( Rose et al., 2021 ).

Spontaneously-emerging INS is associated with social behavior
We probed the correlates of INS by looking into social behaviors such as eye contact, body movement, and smiling. To extract these data using an unbiased and unsupervised approach, we used a combination of advanced video-image processing, deep learning, and computational modeling. We found that these spontaneous behaviors were synchronized between the two partners of a dyad. Using a quantitative computational model, we estimated the time-by-time relationship between the neural synchrony and the various behavioral synchronies. This allowed us to investigate the association between behavioral and neural synchrony and to precisely quantify the contribution of each behavior to INS ( Hamilton, 2021 ). Modeling all social behaviors together, we were able to predict all clusters of INS (i.e., αenv, βenv, and γenv), strongly implying a relationship between INS and behavioral synchrony. Further, when we modeled select social behaviors and their relationship to select INS clusters, we identified specific relationships: eye contact was associated with βenv INS, while body movement and smiling were associated with γenv INS. The functional significance of these relationships is difficult to interpret. We speculate that the γenv INS might reflect a concurrent state of high visual attention and arousal, which might mediate the information transfer of salient social cues such as body movement and smiling ( Stein et al., 2000 ; Steinmetz et al., 2000 ). The βenv INS cluster is harder to interpret. Several previous studies have shown similar INS topographies (associated with both EEG and fNIRS datasets) and, overall, have shown that these activities are driven by cooperative behavior (for a meta-analysis, see Czeszumski et al., 2022 ). From this perspective, this finding might fit with evidence showing that eye contact provides a means to extract or deliver social cues during cooperative tasks ( Jarick and Kingstone, 2015 ; Siposova et al., 2018 ). These hypotheses are speculative in nature and call for further work.
Irrespective of the precise functional significance of the INS clusters, having probed the behavioral correlates of INS has two important implications. First, it directly links INS to a myriad of previous studies in social psychology that highlight the importance of these behavioral cues in mediating social interactions ( Adolphs and Tusche, 2017 ; Argyle and Dean, 1965 ; Bull, 2001 ; Depaulo, 1992 ; Kleinke, 1986 ; Martin et al., 2017 ). For instance, smiling is often conceptualized as a powerful social signal that can easily grab the attention of an observer ( Campos et al., 2015 ; Chang et al., 2014 ; Martin et al., 2017 ). Similarly, eye contact has been suggested to signal the intention to establish a communicative interaction ( Farroni et al., 2002 ; Kleinke, 1986 ; Leong et al., 2017 ; Wass et al., 2020 ). Finally, coordinated body movement is a widespread means of communication and cooperation observable in humans and several other species ( Greenfield, 1994 ; Pezzulo et al., 2013 ).
Second, our results highlight the importance of properly quantifying and characterizing behavior when interpreting the significance of brain activity. Previous research has mostly overlooked and underestimated the relationship between behavioral cues and INS. For example, while some studies focus only on select indices of task performance ( Cui et al., 2012 ; Dikker et al., 2017 ; Meshulam et al., 2021 ; Xue et al., 2018 ), others do not record behavior at all ( Hasson et al., 2004 ; Pinti et al., 2021 ; Redcay et al., 2010 ). Indeed, without an extensive characterization of behavior, it is difficult to understand the origin of brain signals, including INS and its underlying mechanisms. Hence, our study reinforces the recent call for richer quantifications of behavior in the context of neuroscience studies ( Krakauer et al., 2017 ), including hyperscanning ones ( Hamilton, 2021 ).

Ruling out artifactual origins of INS
It is imperative to also discuss a potential artifactual origin of our INS clusters, especially the γenv INS cluster, given that previous research has shown that movement artifacts can leak signal into this frequency band ( Goncharova et al., 2003 ; Kappel et al., 2017 ). We highlight two specific results that rule out this interpretation. First, the posterior topography of the γenv INS is not reminiscent of that commonly associated with movement artifacts. For instance, EMG artifacts from jaw movements have been shown to have a predominantly fronto-temporal topography ( Goncharova et al., 2003 ; Kappel et al., 2017 ). Likewise, other movements, such as eye-blinking, vertical or horizontal saccades, or vertical or horizontal head movements, do not lead to prominent modulations of γenv on posterior electrodes ( Kappel et al., 2017 ). Second, the data in the current study underwent a thorough preprocessing pipeline, namely Artifact Subspace Reconstruction (ASR), with a very aggressive cutoff parameter (n = 10) that has been shown to successfully remove movement artifacts while retaining brain signals ( Anders et al., 2020 ; Chang et al., 2018 ; Plechawska-Wojcik et al., 2019 ; Richer et al., 2020 ). This was confirmed by visual inspection of the EEG data, which indicated that the topographical distribution of gamma power was not reminiscent of that typically associated with muscular artifacts (see supplementary Figs. S6 and S7; and Goncharova et al., 2003 ; Kappel et al., 2017 ). Notably, these results were also corroborated by a control analysis where, following Independent Component Analysis (ICA), we discarded all Independent Components (ICs) with artifactual or non-physiological topographies (see supplementary Fig. S4) ( Frølich and Dowding, 2018 ; Klug and Gramann, 2021 ).
Last but not least, we stress that γenv INS was not the only INS cluster predicted by behavioral cues. In fact, when all behaviors were combined, we could also explain αenv and βenv INS better than the null models (see Table 3 ). Hence, all the INS clusters were predicted by behavior, even those falling into frequency bands that are less prone to capturing movement artifacts.

Value of automatized behavioral analysis
Having demonstrated the general importance of properly monitoring behavior in the context of neuroscientific studies, we now highlight some important advances that our study brings to the field of social neuroscience.
We were able to reconstruct behavioral cues such as body movement, smiling, and eye contact using a rigorous methodology that is faster, cheaper, more rater-independent, and less subject to human error than current practices in social neuroscience ( Kingsbury et al., 2019 ; Kinreich et al., 2017 ; Levy et al., 2017 ; Nguyen et al., 2021 ; Zhang and Yartsev, 2019 ). For instance, while body movement is often measured using expensive systems requiring time-consuming analyses ( Hale et al., 2020 ; Yun et al., 2012 ), here we were able to automatically extract this information from simple videos. Furthermore, while specific behaviors are often manually coded from video recordings in a frame-by-frame fashion ( Djalovski et al., 2021 ; Kingsbury et al., 2019 ; Kinreich et al., 2017 ; Levy et al., 2017 ; Rose et al., 2021 ), here we were able to do so automatically by combining deep learning predictions with the information acquired by the eye trackers.
Hence, our methodology may pave the way for future ecologically valid studies combining rich behavioral datasets with measures of brain activity. In particular, exploring the natural complexity of social interactions under unconstrained, possibly ecological, conditions might be facilitated by the use of deep learning approaches (Koul et al., 2023).

Causal relationship between INS and behaviors
Using cross-correlation and Granger causality, we investigated the temporal relationship between behavioral synchronies (eye contact, body movement, and smiling) and INS. We observed that while synchronized social behavior and INS Granger-caused each other, synchronized social behavior had a stronger predictive effect on INS than vice versa.
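The logic of the Granger test can be illustrated with a self-contained bivariate sketch (the study used multivariate, conditional Granger causality; this simplified two-signal version on synthetic data is for intuition only):

```python
import numpy as np

def granger_f(x, y, p=2):
    """F-statistic testing whether x Granger-causes y: compare an order-p
    autoregression of y on its own past (restricted model) against one
    that additionally includes the past of x (full model)."""
    n = len(y) - p
    Y = y[p:]
    own = np.column_stack([y[p - k:len(y) - k] for k in range(1, p + 1)])
    cross = np.column_stack([x[p - k:len(x) - k] for k in range(1, p + 1)])
    Xr = np.column_stack([np.ones(n), own])           # restricted model
    Xf = np.column_stack([np.ones(n), own, cross])    # full model
    rss_r = np.sum((Y - Xr @ np.linalg.lstsq(Xr, Y, rcond=None)[0]) ** 2)
    rss_f = np.sum((Y - Xf @ np.linalg.lstsq(Xf, Y, rcond=None)[0]) ** 2)
    return ((rss_r - rss_f) / p) / (rss_f / (n - 2 * p - 1))

# Toy example: y is driven by the recent past of x, but not vice versa,
# so the F-statistic should be much larger for x -> y than for y -> x.
rng = np.random.default_rng(1)
x = rng.standard_normal(500)
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.6 * x[t - 1] + 0.2 * y[t - 1] + 0.1 * rng.standard_normal()
```

A large F for one direction and a near-null F for the other is the pattern of asymmetric Granger causality reported here between behavior and INS.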
The causal relationship between INS and behavior has been the subject of prolific recent debate (Hamilton, 2021; Moreau and Dumas, 2021; Novembre and Iannetti, 2021a, 2021b). The majority of studies in the field implicitly assume that INS leads to changes in social behavior (Cui et al., 2012; Li et al., 2021; Lindenberger et al., 2009; Piazza et al., 2020; Reinero et al., 2021; Sänger et al., 2011, 2012; Yang et al., 2021; Yun et al., 2012). Support for this assumption has been provided by studies showing that exogenously enhancing INS (using multibrain stimulation) can augment social interaction, including interpersonal coordination (Novembre et al., 2017), social learning (Pan et al., 2021), and even spontaneous social behavior (Yang et al., 2021). However, other authors have suggested that behavioral cues might cause INS through a phase resetting of neural oscillations (Leong et al., 2017; Wass et al., 2020).
A third possibility, which is directly supported by our study, is that INS and social behavior reciprocally influence each other. Indeed, even though we observed a stronger influence of behavioral cues over INS, the opposite directionality was also significant. Future research should try to better understand what factors affect the causal directionality between INS and social behavior. For instance, one potentially important factor is the availability of information about a partner. We speculate that when information is readily available, social behavior would be more likely to drive INS, while the opposite could be predicted for conditions in which information is scarce (Gugnowska et al., 2022; Novembre et al., 2017).
Finally, we wish to stress that the evidence of a relationship between spontaneous INS and social behavior provided by the current research is only correlational. It has recently been suggested that such evidence should be complemented by truly causal methods such as multibrain stimulation (Moreau and Dumas, 2021; Novembre and Iannetti, 2021a, 2021b). We suggest that, to fully clarify the causal relationship between social behavior and INS, analytic methods (such as the Granger causality used in the current study) should be complemented by other approaches such as multibrain stimulation.

Declaration of Competing Interest
The authors declare no conflict of interest.

Fig. 2 .
Fig. 2. Spontaneous emergence of Interpersonal Neural Synchrony (INS). a: Schematic representation of the computation of INS as the Pearson's correlation between the electrode- and sub-band-specific EEG power time courses (i.e., envelopes) measured from the two individuals forming a dyad. b: T-values representing the statistical contrast between each experimental condition and baseline. INS was significantly enhanced (compared to baseline) only when participants could see each other (Far Vision env INS cluster p = 0.004, Cohen's d = 0.58; Near Vision env and env INS clusters p < 0.001, Cohen's d = 0.68). Notably, in a separate analysis, we assessed that INS during the baseline condition was not statistically different from a test value of 0, indexing no correlation (p > 0.05). c: T-values representing the statistical contrast across experimental conditions. INS was significantly enhanced during Vision, as opposed to No Vision, conditions (env INS cluster p = 0.02, Cohen's d = 0.62; env INS cluster p = 0.002, Cohen's d = 0.63). A third env INS cluster reflected an interaction between Vision and Interpersonal Spatial Proximity (env INS cluster p = 0.019, Cohen's d = 0.52). Significance for all analyses was determined using a cluster-based permutation test. L = low sub-band, M = mid sub-band, H = high sub-band (the frequency ranges are reported in Table 1).
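The envelope-correlation computation in panel a can be sketched in Python as follows (an illustrative reconstruction from the caption on toy data, not the authors' analysis code; the band edges and sampling rate are arbitrary):

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def band_envelope(sig, lo, hi, fs):
    """Band-pass filter one channel and return its amplitude envelope
    (magnitude of the analytic signal)."""
    sos = butter(4, [lo, hi], btype="band", fs=fs, output="sos")
    return np.abs(hilbert(sosfiltfilt(sos, sig)))

def ins(sig_a, sig_b, lo, hi, fs):
    """INS as the Pearson correlation between the two subjects'
    band-specific power envelopes (cf. panel a)."""
    return np.corrcoef(band_envelope(sig_a, lo, hi, fs),
                       band_envelope(sig_b, lo, hi, fs))[0, 1]

# Toy dyad: two alpha-range oscillations sharing a slow amplitude
# modulation, so their envelopes (and hence INS) are highly correlated.
fs = 250.0
t = np.arange(0, 20, 1 / fs)
rng = np.random.default_rng(2)
mod = 1 + 0.8 * np.sin(2 * np.pi * 0.2 * t)
sub1 = mod * np.sin(2 * np.pi * 10 * t) + 0.05 * rng.standard_normal(t.size)
sub2 = mod * np.cos(2 * np.pi * 10 * t) + 0.05 * rng.standard_normal(t.size)
```

In the study this correlation is computed per electrode and per sub-band, which is what yields the cluster topographies shown in panels b and c.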

Fig. 3 .
Fig. 3. Automated extraction of behavioral cues and their interpersonal synchrony. a: A number of behaviors were automatically extracted using a trained multi-stage Convolutional Neural Network (CNN) that first encoded the location and orientation of limbs in the image domain. The deep neural network then estimated 25 body and 70 face landmarks. These landmarks were combined with eye-tracking data in order to yield estimates of eye contact, whole-body movement, and smiling behavior. Eye contact was estimated as the Euclidean distance between the center of the face (i.e., indexed by the landmarks associated with the nose) and the gaze of the partner. Body movement was estimated as the average movement change (i.e., first-order derivative) across the 25 body landmarks. Smiling was estimated as the Euclidean distance between the landmarks associated with the two lip ends (see Section 2.4 for details). b: A (Pearson's) correlation analysis revealed that all behavioral cues spontaneously synchronized, across members of a dyad, over time (p < 0.05). Error bars represent standard errors.
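The three landmark-based measures defined in panel a can be sketched as follows (a reading of the caption on toy arrays, not the authors' code; landmark extraction itself is assumed to have been done by the CNN):

```python
import numpy as np

def body_movement(landmarks):
    """Average frame-to-frame displacement (first-order derivative)
    across body landmarks. landmarks: array (frames, n_landmarks, 2)."""
    step = np.linalg.norm(np.diff(landmarks, axis=0), axis=2)
    return step.mean(axis=1)              # one value per frame transition

def smiling(lip_left, lip_right):
    """Euclidean distance between the two lip-end landmarks, per frame."""
    return np.linalg.norm(lip_right - lip_left, axis=1)

def eye_contact_distance(nose, partner_gaze):
    """Distance between the face centre (nose landmark) and the partner's
    gaze point; smaller values index gaze directed at the face."""
    return np.linalg.norm(partner_gaze - nose, axis=1)

# Toy data: 25 landmarks, each drifting one pixel per frame on both axes,
# so every frame-to-frame step has length sqrt(2).
frames = np.tile(np.arange(3.0)[:, None, None], (1, 25, 2))
```

Each function returns a per-frame time course, which is what is then correlated across the two members of a dyad in panel b.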
Fig. 4. Behavioral cues predict Interpersonal Neural Synchrony (INS). a: Representation of the computation of instantaneous (timepoint-by-timepoint) correlations for interpersonal neural and behavioral synchrony. The instantaneous correlation index is computed by taking the time series from each dyad (X, Y, being brain or behavioral data) and first subtracting their means and normalizing them to unit length (x, y). Each timepoint of this normalized data is then represented as a vector in 2D space (x_t, y_t), where the axes correspond to the normalized data from each subject. Next, the correlation index is computed from the angles subtended by the point (x_t, y_t) relative to a line at −45° (i.e., a line orthogonal to the equality line at 45°). This is done by taking the smaller of the two angles subtended by the (x_t, y_t) vector to the line at −45°. In such a scenario, a resultant angle of 90° represents a perfect correlation (the vector would be coincident with the equality line at 45°), while an angle of 45° represents no correlation (the vector would lie on one of the axes). Finally, the instantaneous correlation thus obtained was max-normalized and ranged from 0 to 1.
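The geometric construction described in panel a can be implemented directly (an illustrative reconstruction from the caption, not the authors' code):

```python
import numpy as np

def instantaneous_correlation(X, Y):
    """Timepoint-by-timepoint correlation index following panel a:
    the angle of each normalized sample pair (x_t, y_t) to the line at
    -45 degrees, max-normalized to the range [0, 1]."""
    x = X - X.mean()
    x = x / np.linalg.norm(x)                 # normalize to unit length
    y = Y - Y.mean()
    y = y / np.linalg.norm(y)
    v = np.stack([x, y])                      # one 2D vector per timepoint
    d = np.array([1.0, -1.0]) / np.sqrt(2.0)  # direction of the -45° line
    norms = np.maximum(np.linalg.norm(v, axis=0), 1e-12)
    cos_small = np.abs(d @ v) / norms         # smaller angle to the line
    theta = np.degrees(np.arccos(np.clip(cos_small, 0.0, 1.0)))
    return theta / theta.max()                # 90° (equality line) -> 1

# Identical time series sit on the equality line at every timepoint,
# so the index is 1 throughout.
t = np.linspace(0.0, 1.0, 200)
r_same = instantaneous_correlation(np.sin(6 * t), np.sin(6 * t))
```

Note that, as in the caption, 90° (the equality line) maps to the maximum index while 45° (a vector on one axis) indicates no correlation.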
b: Behavioral cues (all combined) predicting each INS cluster in the Vision conditions. The bar plots represent Pareto-smoothed importance sampling leave-one-out (PSIS-LOO) cross-validation values for model comparison across the three INS clusters (represented by their respective topographies in the top row). The model combining all behavioral cues explained INS variance in all clusters better than their respective null models. c: Individual behavioral cues predicting each INS cluster. Posterior plots of the regression coefficients for each behavioral cue and each of the three INS clusters. Eye contact predicted env INS, while body movement and smiling predicted env INS. The red vertical bar represents the value '0', while the dark horizontal bar below each posterior plot shows the 90% highest density interval (HDI). Error bars represent standard errors. * p < 0.05, *** p < 0.001.

Fig. 5 .
Fig. 5. Temporal relationship between INS and synchronous social behavior. a: Schematic description of the cross-correlation analysis. Cross-correlation tracks the similarity of two time series (A and B) as a function of their relative displacement. Given two vectors A and B, cross-correlation measures the similarity between A and shifted (lagged) copies of B as a function of the lag. The analysis is useful for determining how well the two time series match up with each other and, in particular, at what lag the best match occurs. b: Cross-correlation coefficients and multivariate Granger causalities between behavioral synchrony and interpersonal neural synchrony. The black line represents the averaged correlation coefficients across ±10 s time lags, while the gray shaded area represents the standard error of the mean at each timepoint. The heat background plots represent the significance of the coefficients (t-values, FDR-corrected; significant coefficients are marked by a thicker red line). The bar graphs represent multivariate Granger causalities (dots represent single dyads). Note that all behavioral cues Granger-caused INS (light blue bars), even though only some of them anticipated INS with a fixed temporal relationship. CGC = conditional Granger causalities, BEH = behavioral synchrony, INS = interpersonal neural synchrony. Error bars represent standard errors. * p < 0.05. The black arrow on the heat plots represents the peak in the cross-correlation.
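The lagged-copy comparison described in panel a can be sketched as follows (an illustrative toy example, not the authors' code):

```python
import numpy as np

def lagged_correlation(a, b, max_lag):
    """Correlation between a[t] and b[t + k] for lags k in
    [-max_lag, max_lag]; a positive best lag means a leads b."""
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    lags = np.arange(-max_lag, max_lag + 1)
    cc = np.array([np.corrcoef(a[max(0, -k):len(a) - max(0, k)],
                               b[max(0, k):len(b) - max(0, -k)])[0, 1]
                   for k in lags])
    return lags, cc

# Toy example: b is a copy of a delayed by 5 samples, so the
# cross-correlation should peak at a lag of +5 (a leads b).
a = np.sin(2 * np.pi * np.arange(200) / 50.0)
b = np.roll(a, 5)
lags, cc = lagged_correlation(a, b, 10)
peak = lags[np.argmax(cc)]
```

A peak at a positive lag is the panel-b signature of behavioral synchrony anticipating INS.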

Table 1
Frequency bands and sub-bands definitions.

Table 2
Comparison of Reciprocal and Individual models. The env INS cluster was better predicted by the REC model, while the env and env INS clusters were better predicted by the IND model.
PSIS-LOO estimate: Pareto-smoothed importance sampling leave-one-out cross-validation estimate values. PSIS-LOO standard error: the standard error of the PSIS-LOO computations. Weight: these weights can be loosely interpreted as the probability of each model being true (among the compared models) given the data.
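The intuition behind such weights can be illustrated with a simplified pseudo-BMA-style computation from hypothetical PSIS-LOO estimates (actual stacking weights, e.g. as computed by ArviZ, additionally account for uncertainty and differ in detail):

```python
import numpy as np

def pseudo_bma_weights(elpd):
    """Softmax of expected log predictive density (elpd) estimates: a
    rough 'relative probability of each model' among those compared
    (simplified; ignores the PSIS-LOO standard errors)."""
    e = np.asarray(elpd, dtype=float)
    w = np.exp(e - e.max())       # subtract the max for numerical stability
    return w / w.sum()

# Hypothetical elpd values for three competing models.
weights = pseudo_bma_weights([-120.0, -123.0, -130.0])
```

The weights sum to 1, and the model with the highest elpd dominates, mirroring the interpretation given above.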

Table 3
Model performances for the brain-behavior relationship.The Social behavior model explained INS variance in all clusters better than the null models.

Table 4
Regression coefficient estimates for the brain-behavior relationship. The env INS cluster was significantly predicted by body movement and smiling, irrespective of one another, while the env INS cluster was significantly predicted by eye contact. Significant effects are shown in bold.