Individual gaze shapes diverging neural representations

Complex visual stimuli evoke diverse patterns of gaze, but previous research suggests that their neural representations are shared across brains. Here, we used hyperalignment to compare visual responses between observers viewing identical stimuli. We find that individual eye movements enhance cortical visual responses but also lead to representational divergence. Pairwise differences in the spatial distribution of gaze and in semantic salience predict pairwise representational divergence in V1 and inferior temporal cortex, respectively. This suggests that individual gaze sculpts individual visual worlds.


Procedure (eye-tracking session)
Participants watched the first 50 minutes of the movie in the eye tracker, divided into 10 blocks of 5 minutes each. Each block started with a calibration. Participants were instructed to view the movie freely, keep their head still, and pay attention.

Procedure (fMRI session)
The scanning session was divided into three runs, each lasting approximately 21 min. Each run consisted of four blocks presenting 5 min of the movie. At the beginning of each block, participants were instructed either to fixate a superimposed dot at the center of the movie or to free-view for the following 5 min. Each run had the same order of condition blocks: fixation, free-viewing, fixation, free-viewing. At the end of each block, a blank screen was shown for 20 s. In addition to these functional scans, we recorded an anatomical scan and a field map for each participant.

Data acquisition
The MRI session was carried out on a 3-Tesla imaging system (Siemens Prisma) with a 64-channel head coil at the Bender Institute of Neuroimaging (BION) at Giessen University.
Stimuli were shown with an Epson EB-G5600 projector at a resolution of 1024 × 768 pixels and a refresh rate of 60 Hz. The stimulus video was rescaled to approximate the size of the stimulus in the eye-tracking session, with a width of ~48 dva and a height of ~27 dva. Eye movements were monitored online using a ViewPoint eye camera (Arrington Research Inc., Scottsdale, AZ) to ensure that participants followed the instructions in both conditions (fixating vs. free-viewing the movie).
Functional images covered the whole brain and were obtained using a multiband echo-planar imaging (EPI) sequence with an echo time (TE) of 33 ms and a repetition time (TR) of 1000 ms. Further parameters for obtaining functional data were as follows: field of view (FoV) = 240 × 240 mm, in-plane resolution = 2.5 mm × 2.5 mm, 52 sagittal slices (descending) with a thickness of 2.5 mm and a distance factor of 20%, flip angle (FA) = 59°, acceleration factor = 4. Per participant, 3990 volumes of functional data were acquired (1330 per run).
High-resolution anatomical images were obtained using a T1-weighted magnetization-prepared rapid acquisition gradient-echo (MPRAGE) sequence with the following scan parameters: FoV = 240 × 240 mm, TE = 3.53 ms, TR = 1880 ms, inversion time = 949 ms, in-plane resolution = 0.94 mm × 0.94 mm, number of slices = 176, slice thickness = 0.94 mm, flip angle (FA) = 8°. Magnetic field perturbations were accounted for by measuring a field map with the following scan parameters: FoV = 220 × 220 mm, TE(1) = 10 ms, TE(2) = 12.46 ms, TR = 1,000 ms, in-plane resolution = 2.0 × 2.0 mm, slice thickness = 3.0 mm, number of slices = 40 (transversal), FA = 90°.

Preprocessing
All image files were converted to NIfTI format and preprocessed using SPM 12 (http://www.fil.ion.ucl.ac.uk/spm/software/spm12/) and custom MATLAB code. The first six volumes of each run were discarded to account for delays in the hemodynamic response. The remaining functional images were realigned and unwarped using the voxel displacement maps generated from the field maps. Further, the functional images were co-registered to the structural scan and spatially smoothed using a 4 mm Gaussian kernel [1]. Time series were bandpass filtered with a discrete cosine transform, removing slow drifts and high-frequency noise with a cut-off of 1/128 Hz, and subsequently z-scored. To remove noise, we extracted six rigid-body motion parameters and framewise displacements (all estimated during realignment), as well as three principal components from cerebrospinal fluid and white matter time series. These nuisance regressors were regressed out of the functional data for each run [2].
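As a rough illustration of the final denoising steps, consider the following minimal MATLAB sketch. The variable names Y and R are hypothetical, spm_dctmtx is SPM's discrete-cosine basis generator, and the sketch only covers the drift-removal part of the filtering.

```matlab
% Minimal sketch of drift removal, z-scoring, and nuisance regression.
% Y: (time x voxels) data of one run; R: (time x k) nuisance regressors
% (six motion parameters, framewise displacement, CSF/WM components).
TR = 1;                          % repetition time in seconds
nT = size(Y, 1);
cutoff = 128;                    % cut-off period in seconds (1/128 Hz)

% Discrete cosine basis spanning drifts slower than the cut-off
K  = fix(2 * (nT * TR) / cutoff) + 1;
X0 = spm_dctmtx(nT, K);          % first column is the constant term

Y = Y - X0 * (X0 \ Y);           % project out slow drifts
Y = zscore(Y);                   % z-score each voxel's time series

% Regress out nuisance signals and keep the residuals
X = [R, ones(nT, 1)];
Y = Y - X * (X \ Y);
```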

Gaze consistency measures
To quantify gaze parameters, we extracted saccades and fixations using the SR Research saccade detection algorithm (velocity > 30 deg/s and acceleration > 8000 deg/s²).
Gaze coordinates that fell outside of the video borders were excluded. Additionally, fixations with a duration under 100 ms were excluded (SR Research, 2022). This led to an exclusion of less than 1 % of fixations on average. To prevent erroneous gaze estimation during lid occlusion caused by a blink, saccades occurring within 100 ms before or after a blink were also discarded (~7 % of saccades on average). Additionally, saccades and fixations with a duration > 1,000 ms or a peak velocity > 1,000 deg/s were removed.
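A minimal sketch of these exclusion criteria, assuming (hypothetically) that fixations, saccades, and blinks have already been parsed into struct arrays with fields for onset time, duration, position, and peak velocity:

```matlab
% Hypothetical event structures: fixEv/sacEv have onset t (ms), duration
% dur (ms), position x/y (px), peak velocity vpeak (deg/s); blinkT lists
% blink onsets (ms); vidW/vidH are the video dimensions in pixels.
keepFix = arrayfun(@(e) e.x >= 0 && e.x <= vidW && ...
                        e.y >= 0 && e.y <= vidH && ...
                        e.dur >= 100 && e.dur <= 1000, fixEv);
fixEv = fixEv(keepFix);

% Discard saccades within 100 ms of a blink (lid-occlusion artifacts)
% as well as implausibly long or fast saccades
keepSac = arrayfun(@(e) all(abs(e.t - blinkT) > 100) && ...
                        e.dur <= 1000 && e.vpeak <= 1000, sacEv);
sacEv = sacEv(keepSac);
```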
We used several publicly available DNN algorithms to label the movie stimuli on a frame-wise basis. To label text, we used EAST (An Efficient and Accurate Scene Text Detector) [3]. For face labels, we used YOLO5Face [4], and to label all remaining objects, we used YOLOv5 [5]. We discarded all object labels that overlapped with text or face labels.
We included several gaze parameters for this and subsequent parts of the analysis. Fixations that fell within a distance of 0.5 dva from a given label were marked accordingly as face, text, or other-object fixations. We then calculated the proportion of text and face fixations among all labelled fixations for each observer. Similarly, we computed the median saccadic rate and amplitude for each observer. To assess the consistency of individual differences, we correlated individual gaze parameters across odd and even splits of the movie. Furthermore, gaze parameters served as the basis for calculating the pairwise differences used in linear regression models (see below, Effect of gaze parameters on cross-decoding accuracy).
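A minimal sketch of these gaze parameters, under the hypothetical assumption that label positions have already been matched to the frame of each fixation:

```matlab
% fx, fy: fixation positions (dva, column vectors); faceX/faceY etc. are
% (nFix x nLabels) matrices of label positions on the matching frames.
isFace = any(hypot(fx - faceX, fy - faceY) < 0.5, 2);
isText = any(hypot(fx - textX, fy - textY) < 0.5, 2);
isObj  = any(hypot(fx - objX,  fy - objY)  < 0.5, 2);
propFace = sum(isFace) / sum(isFace | isText | isObj);

% Split-half consistency: param(i, b) holds a gaze parameter (e.g.,
% saccadic rate) of observer i in movie block b
oddHalf  = median(param(:, 1:2:end), 2);
evenHalf = median(param(:, 2:2:end), 2);
[r, p] = corr(oddHalf, evenHalf);     % reported as r(39) in the text
```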
We established that individual gaze varies in highly systematic ways, even for a directed movie (Shaun the Sheep). Individual differences in low-level parameters of viewing dynamics were large (variability up to a factor of 4) and highly consistent (split-half consistency of saccadic rate: r(39) = .97, p < .001; saccadic amplitude: r(39) = .94, p < .001). Similarly, individual fixation biases towards faces and text varied up to a factor of 2 and showed moderate to good consistency (faces: r(39) = .76, p < .001; text: r(39) = .32, p < .05).

Response amplitudes
We specified boxcar regressors for the blocks of fixation and free-viewing in each of the three runs and convolved them with the canonical hemodynamic response function, as implemented in SPM12. Each block had a duration of 5 min. Additionally, we incorporated the six motion parameters estimated during realignment. The conditions were contrasted against each other in a general linear model (free-view > fixation), generating t-maps for each participant (i.e., first-level analysis). These t-maps were subsequently masked with the IT region of interest (ROI). ROI masks were individually defined for each hemisphere using the FreeSurfer (http://surfer.nmr.mgh.harvard.edu) parcellation algorithm [6] and included the following labels: G_oc-temp_lat-fusifor, G_oc-temp_med-Parahip, G_temporal_inf, Pole_temporal, S_collat_transv_ant, S_oc-temp_lat, S_oc-temp_med&Lingual, and S_temporal_inf. Finally, the masked t-maps were averaged for each participant and tested against zero using a one-sample t-test.
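A minimal sketch of the ROI summary step (file names are hypothetical; spm_vol/spm_read_vols are SPM's volume readers):

```matlab
% Average the free-view > fixation t-map within the individual IT mask
% and test the ROI means against zero across participants.
nSub = 19;
roiT = nan(nSub, 1);
for s = 1:nSub
    tVol = spm_read_vols(spm_vol(sprintf('sub-%02d_tmap.nii', s)));
    mask = spm_read_vols(spm_vol(sprintf('sub-%02d_IT_mask.nii', s))) > 0;
    roiT(s) = mean(tVol(mask), 'omitnan');
end
[~, p, ~, stats] = ttest(roiT);   % one-sample t-test against zero
```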
Cross-brain decoding
Chance level corresponds to 0.24 % (1/422 snippets). This procedure was repeated for all possible pairings of observers. Within each pair, decoding accuracy was averaged across both folds of training and test data and across the iterations in which either observer served as the target.
We performed a similar analysis for pre-defined IT and V1 ROIs and extended our analysis to include all remaining ROIs extracted from the Destrieux and Benson atlases. Next, we tested for significant differences in cross-brain decoding accuracy between the free-view and fixation conditions. Given the non-independence of the cross-brain decoding accuracy pairs (N = 171), we built generalized linear mixed-effects (GLME) models for each ROI to test for an effect of condition on cross-brain decoding accuracy. We included a categorical predictor indicating condition with two levels (fixation and free-view) and two random factors (Source and Target) expressing the identities of the two subjects in each pair.
We confirmed that cross-brain decoding accuracy significantly decreased in the free-viewing condition for several ROIs while taking into account the identity of each subject.
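A minimal sketch of one such model (table and variable names are hypothetical; fitglme is from the Statistics and Machine Learning Toolbox):

```matlab
% One row per observer pair and condition: Acc = cross-brain decoding
% accuracy, Cond = Fixation/Free-view, Source/Target = subject identities.
tbl = table(Acc, categorical(Cond), categorical(Source), ...
            categorical(Target), 'VariableNames', ...
            {'Acc', 'Cond', 'Source', 'Target'});
mdl = fitglme(tbl, 'Acc ~ Cond + (1 | Source) + (1 | Target)');
disp(mdl.Coefficients)    % fixed effect of condition (b, SE, t, p)
```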
Additionally, we performed a conservative control analysis that reduced the degrees of freedom to the number of participants. For each observer, we averaged all pairwise instances of cross-brain decoding with this target observer. This yields an index of how well this observer's activation patterns can be decoded on average, based on hyperalignment with all other brains. We did this separately for the free-viewing and central-fixation conditions and entered the difference between conditions for each of our 19 observers into a one-sample t-test against zero, thus reducing the degrees of freedom to 18. This conservative approach confirmed highly significant effects for both IT (t(18) = 5.37, p < .001) and V1 (t(18) = 10.14, p < .001).
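A minimal sketch of this control analysis (accMat is a hypothetical accuracy array):

```matlab
% accMat(t, s, c): decoding accuracy with target t, source s, condition c
% (1 = fixation, 2 = free-viewing); diagonal entries (t == s) are NaN.
perTarget = squeeze(mean(accMat, 2, 'omitnan'));   % 19 x 2 target means
d = perTarget(:, 2) - perTarget(:, 1);             % free-view - fixation
[~, p, ~, stats] = ttest(d);                       % df = 18
```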
The 12 ROIs with the largest decrease in cross-brain decoding accuracy in the free-view condition were all in early visual cortex and IT; for example, V2 (51 % vs. 12 %, b = -0.07, SE = 0.007, t(682) = -10.33, p < .001). All significant differences between conditions for all ROIs are displayed in the main figure (see Figure 1c).
To estimate the success of hyperalignment, we additionally conducted correlation-based nearest-neighbor classification on normalized data in IT without using hyperalignment.
First, we registered the functional data of each subject into standard MNI space using SPM12. Then, we repeated our classification pipeline (specified above), leaving out the Procrustes transformation that aligns the brains of each pair of observers. This procedure resulted in a drastic drop in cross-brain decoding accuracy in both conditions (2.1 % in the fixation condition; 1.7 % in the free-view condition). We tested for the statistical significance of an effect of condition on cross-decoding accuracy using a GLME model. This revealed a significant decrease in cross-brain decoding accuracy in the free-view compared to the fixation condition (b = -0.003, SE = 0.0009, t(682) = -3.84, p < .001).
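A minimal sketch of the correlation-based nearest-neighbor decoding step (Xsrc and Xtgt are hypothetical snippet-by-voxel response matrices; with hyperalignment, Xsrc would first be mapped into the target's space, e.g. via the Statistics Toolbox function procrustes):

```matlab
% Xsrc, Xtgt: (snippets x voxels) response patterns of the source and
% target observer for the same movie snippets.
C = corr(Xtgt', Xsrc');              % target x source pattern correlations
[~, pred] = max(C, [], 2);           % nearest source snippet per target one
acc = mean(pred == (1:size(Xtgt, 1))');   % chance = 1/422, i.e. 0.24 %
```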

Effect of gaze parameters on cross-decoding accuracy
To assess the impact of gaze parameters on representational divergence, we specified separate multiple linear regression models for IT and V1. Both models tested pairwise differences in gaze parameters during the eye-tracking session as predictors of pairwise representational divergence during the scanning session. Specifically, we regressed pairwise differences in gaze parameters onto pairwise cross-brain decoding accuracy in the free-viewing condition, controlling for the cross-brain decoding accuracy in the fixation condition.
Then we flipped the sign of the resulting best-fitting weights to display the contribution of diverging gaze to representational divergence (i.e., positive weights indicating a decrease in cross-brain decoding in the free-viewing condition). The first, low-level model included unsigned inter-observer differences in median saccadic amplitude and rate, as well as the median Euclidean distance of gaze positions, as predictors. The second, high-level model included unsigned individual differences in the proportion of labelled fixations falling onto faces and text. All predictors of interest except the Euclidean distance of gaze positions were normalized by their average across a given pair (to express differences relative to the overall magnitude of a given trait; for instance, the difference in saccadic rate between two observers was normalized by the average saccadic rate of these two observers). This was repeated for all pairs of observers, resulting in observer dissimilarity matrices for each gaze parameter and a corresponding matrix of pairwise decoding accuracies, each of which was entered into the regression models described above.
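A minimal sketch of the low-level model (all inputs hypothetical; fitlm is from the Statistics and Machine Learning Toolbox):

```matlab
% rate, amp: per-observer median saccadic rate and amplitude; gazeDist:
% median pairwise Euclidean gaze distance; accFree, accFix: pairwise
% cross-brain decoding accuracies; pairs: (nPairs x 2) observer indices.
dRate = abs(rate(pairs(:,1)) - rate(pairs(:,2))) ./ mean(rate(pairs), 2);
dAmp  = abs(amp(pairs(:,1))  - amp(pairs(:,2)))  ./ mean(amp(pairs), 2);
mdl = fitlm([dRate, dAmp, gazeDist, accFix], accFree, 'VarNames', ...
            {'dRate', 'dAmp', 'GazeDist', 'AccFix', 'AccFree'});
b = -mdl.Coefficients.Estimate(2:4);  % sign-flipped gaze-divergence weights
```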