Speaking for seeing: Sentence structure guides visual event apprehension

Abstract
Human experience and communication are centred on events, and event apprehension is a rapid process that draws on the visual perception and immediate categorization of event roles ("who does what to whom"). We demonstrate a role for syntactic structure in visual information uptake for event apprehension. An event structure foregrounding either the agent or patient was activated during speaking, transiently modulating the apprehension of subsequently viewed unrelated events. Speakers of Dutch described pictures with actives and passives (agent and patient foregrounding, respectively). First fixations on pictures of unrelated events that were briefly presented (for 300 ms) next were influenced by the active or passive structure of the previously produced sentence. Going beyond the study of how single words cue object perception, we show that sentence structure guides the viewpoint taken during rapid event apprehension.


Introduction
Perception is not a process solely driven by bottom-up input. On the contrary, it is strongly guided by top-down factors related to perceivers' prior expectations, knowledge, the current context, and task goals (e.g., Gilbert & Li, 2013; Lupyan, Abdel Rahman, Boroditsky, & Clark, 2020; Summerfield & de Lange, 2014). This holds for the processing of basic percepts, such as the orientation, size and identity of single objects (Summerfield et al., 2006), but also for more complex scenes. Even early stages of visual processing, such as the rapid extraction of the "gist of a scene" (Henderson & Ferreira, 2004), are conceptually guided (e.g., Henderson, Brockmole, Castelhano, & Mack, 2007). For example, people's prior experiences can enhance the detection of objects or basic scene category information (Biederman, 1981; Hollingworth & Henderson, 1998; Potter & Levy, 1969; Schyns & Oliva, 1994).
For object perception, language can provide rapid online conceptual guidance (Lupyan, 2012; Lupyan et al., 2020): Linguistic labels provide effective cues to perception because the conceptual representation evoked by a label includes a category-diagnostic sensory representation of the concept, so that hearing or reading the word "dog" activates a visual image of a dog. Activating this sensory representation prior to receiving actual perceptual input attunes the visual system to the expected percept and provides top-down feedback during stimulus processing, also when the to-be-perceived object is masked or degraded (Boutonnet & Lupyan, 2015; Lupyan & Ward, 2013; Ostarek & Huettig, 2017; Samaha, Boutonnet, Postle, & Lupyan, 2018). Linguistic labels thus cause, in Lupyan's (2012) terms, temporary perceptual warping.
However, the previous focus on single words leaves two knowledge gaps. First, is the perception of complex visual scenes (with relational structure) also susceptible to cueing effects by language? Second, can the syntactic structure of entire sentences (and their underlying conceptual structure) guide initial scene processing? Moving beyond single words and objects is a crucial step forward in unraveling how language interacts with vision, since objects are often observed in a relational context and we typically speak in sentences, not just single words. Of specific interest for addressing these issues are depictions of events, i.e., dynamic activities happening across time and space (e.g., someone cutting an apple). Central to understanding events are the relations between the participants involved in them, in terms of their event roles (Rissman & Majid, 2019; Zacks, 2020). Agents (the "doers"), patients (the "undergoers") and their relation (defining the event type, e.g., dressing or cutting) comprise the abstract, hierarchical structure of an event (Cohn & Paczynski, 2013; Jackendoff, 1990). These event role configurations are conceptual in nature, as they are not dependent on specific realizations of roles and their relations (e.g., Dowty, 1991; Rissman & Majid, 2019). This information can be extracted from visual stimuli effortlessly, even under very short viewing conditions (less than 100 ms: Dobel, Gumnior, Bölte, & Zwitserlood, 2007; Glanemann, Zwitserlood, Bölte, & Dobel, 2016; Hafri, Papafragou, & Trueswell, 2013; Hafri, Trueswell, & Strickland, 2018) and from early on in infancy (Galazka & Nyström, 2016; Johnson, 2003; Spelke & Kinzler, 2007). Early-stage visual event processing is immediately geared towards the extraction of conceptual and relational information on event roles and types. The ability to extract conceptual event structures rapidly suggests that events are critical units of representation in cognition (Richmond & Zacks, 2017; Zacks, 2020).
Events are also central to communication: We often talk about the events happening around us. When describing an event, one needs to package its conceptual structure into a sentence. This entails linearizing the linguistic expression of event roles and expressing a viewpoint on the event during the construction of the sentence's message (the process of perspective taking; Bock, Irwin, & Davidson, 2004; Levelt, 1989, 1999). For example, the event of a woman dressing a man (cf. Fig. 1) can be expressed with an active ("The woman is dressing the man") or a passive sentence ("The man is being dressed by the woman") in many languages. The core event structure, in terms of who is doing what to whom, expressed by these two sentences is the same: the woman is the agent and the man is the patient, and the relation between them involves some form of physical contact and transfer. Active and passive sentences differ, however, in the viewpoint selected by the speaker. While actives foreground the agent, passives put the patient in the foreground and the agent in the background in the conceptual structure of the event (Bock et al., 2004; Kazenin, 2001; Keenan & Dryer, 2007; see footnote 1). The backgrounding can be so strong that the agent can even be left unmentioned in passive sentences ("The man is being dressed"). The conceptual backgrounding of agents in passives is also shown in experimental work: For example, speakers of English were more likely to produce passives when describing stimuli in which the agent was visually less prominent (i.e., when only the agent's hands were shown, and not their face and torso; Rissman, Woodward, & Goldin-Meadow, 2019). When the agent was thus backgrounded perceptually, speakers foregrounded the patient linguistically. Further, during event description, German speakers also placed fewer fixations on agents, and more fixations on patients, when planning passives as compared to actives (Sauppe, 2017b).
Can event viewpoints as conveyed by different syntactic structures guide information uptake during the rapid apprehension of upcoming scenes? More specifically, can the production of active and passive sentences, and their underlying conceptual structure bias visual attention to events in subsequently presented visual stimuli, analogous to single labels cueing object perception? Such attentional bias should arise through the pre-activation of an abstract event structure by the syntactic structure of the cue sentence; in this event structure either the agent or the patient is foregrounded, depending on active or passive voice. The conceptual foregrounding of patients is hypothesized to induce a bias in visual attention towards patients in subsequently presented event scenes, leading to an increase in first fixations on patients and a decrease in agent-first fixations. It is important to note that the viewpoint conveyed by actives and passives is independent from lexical semantics and form (e.g., "the man was hugged by the woman" and "the bird was eaten by the cat" converge in their viewpoint). This means that syntactic cueing effects could arise when cue and target event overlap in their most basic conceptual structure (i.e., a core skeleton of "agent acting on patient"), regardless of overlap in event type, and agent/patient identity. In the case of linguistic cues (in this case, entire active and passive sentences) preceding visual stimuli, we expect the syntactic structure of the cue sentences to influence the viewpoint that the perceiver takes on a subsequent unrelated event.
We propose that one can shed light on the process of scene apprehension using a brief exposure paradigm (Dobel et al., 2007;Gerwien & Flecken, 2016;Greene & Oliva, 2009) and eye tracking. In this paradigm, a picture is presented to participants so briefly that they either can only perceive it parafoveally (Dobel et al., 2007;Dobel, Glanemann, Kreysa, Zwitserlood, & Eisenbeiß, 2011) or have time for only a single saccade and fixation on the picture (Gerwien & Flecken, 2016). Target picture presentation times in brief exposure studies range from 37 ms (Hafri et al., 2013) when pictures are presented at the center of the visual field to 300 ms when pictures are presented at the corners of the display and thus require eye movements in order to extract detailed information (Gerwien & Flecken, 2016). Given that programming and executing a saccade takes between 100 and 200 ms (e.g., Kirchner & Thorpe, 2006;Pierce, Clementz, & McDowell, 2019), visual information can only be extracted foveally from the latter kinds of briefly presented stimuli for approximately 100-200 ms.
The location of the first and only fixation in the briefly presented picture is taken to be a direct reflex of the process of event apprehension (Gerwien & Flecken, 2016): Based on parafoveally collected information, viewers identify the core structure of the event and then rapidly decide, e.g., whether to fixate on the agent or the patient in the picture, first. Hence, an analysis of first fixation locations to tap into scene apprehension avoids a reliance on offline measures alone that might be influenced by memory and post-hoc reasoning (Firestone & Scholl, 2016;Lupyan, 2016;Lupyan et al., 2020). We hypothesize that the planning and execution of the first fixation can be influenced by the preactivated conceptual structure underlying the active or passive cue sentences, including the respective event viewpoint. We hypothesize this reflex of the apprehension process to be the locus of a potential syntactic cueing effect: A linguistically cued event viewpoint should be reflected in what people visually attend to first in the event picture.
Here, participants first described a picture of a cue event and then they saw a briefly presented target event. Crucially, cue event descriptions had either an active or a passive sentence structure. After producing the cue sentences, an unrelated target picture appeared for only 300 ms in one of the four screen corners, leaving time for only one fixation on the picture (Fig. 1). Participants then indicated by button press whether a probe picture presented next matched the target picture or not, to ensure participants attended to the target pictures. This design allowed us to test whether entire event representations constructed during speaking can guide the apprehension of subsequently seen events, reflected in cueing effects on the location of the first fixation on target pictures.

Participants
Forty-one native speakers of Dutch (27 female, age: mean = 24, range = 20-34) from the participant pool of the Max Planck Institute for Psycholinguistics participated for payment. Data from two additional participants were lost due to technical errors in recording or exporting the eye tracking data. The experiment was approved by the ethics committee of the Faculty of Social Sciences at Radboud University Nijmegen.

Materials and design
Materials consisted of cue, target, and probe pictures. Cue pictures showed 18 different transitive actions with human agents and patients (cf. Appendix A.1, pictures were taken from Segaert, Menenti, Weber, & Hagoort, 2011). Cue pictures were photographed with four actor pairs (two man-woman pairs, two girl-boy pairs) against a black background (cf. Fig. 1). Agent and patient were colored in red and green and participants were instructed to describe these pictures starting with the green character and using a prespecified verb; this reliably elicited active and passive sentences (as in Segaert et al., 2011; Segaert, Menenti, Weber, Petersson, & Hagoort, 2012).
Footnote 1: Actives and passives also differ on additional dimensions. Passives are less frequent and impose more cognitive load during planning than actives and are morphologically derived, whereas actives are not (Sauppe, 2017a). For the current purpose, however, only the different event viewpoints they entail are relevant.
Target pictures showed 36 transitive events with animate agents and inanimate objects as patients (cf. Appendix A.2, pictures were taken from Sakarias & Flecken, 2019; twenty pictures had female agents). Each target picture appeared eight times over the course of the experiment, once in each of eight blocks (four times after an active and four times after a passive cue event), and in each position on the screen (cf. Fig. 1), with agent-left and agent-right orientation, respectively. Each target picture was paired with eight different cue event pictures, each showing different agent-patient combinations and different actions. One half of the participants saw a given cue-target pair with an active cue, the other half saw it with a passive cue. For each participant, the order of blocks was randomized and the order of trials within blocks was pseudorandomized, so that no more than two consecutive target pictures appeared in the same screen position.
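The pseudorandomization constraint on trial order can be illustrated with a small rejection-sampling sketch. This is not the authors' Presentation script: the function names, the position labels, and the assumption that the four screen positions are evenly distributed within a block are all hypothetical and serve only to show one way to guarantee that no more than two consecutive targets share a screen position.

```python
import random

def has_long_run(positions, max_run=2):
    """Return True if any value repeats more than max_run times in a row."""
    run = 1
    for prev, cur in zip(positions, positions[1:]):
        run = run + 1 if cur == prev else 1
        if run > max_run:
            return True
    return False

def pseudorandomize(trials, max_run=2, rng=None, max_tries=10_000):
    """Shuffle a copy of trials until the screen-position constraint holds."""
    rng = rng or random.Random()
    order = list(trials)
    for _ in range(max_tries):
        rng.shuffle(order)
        if not has_long_run([t["position"] for t in order], max_run):
            return order
    raise RuntimeError("no valid order found within max_tries")

# Hypothetical block: 36 targets, positions spread evenly (an assumption).
POSITIONS = ["top_left", "top_right", "bottom_left", "bottom_right"]
block = [{"target": i, "position": POSITIONS[i % 4]} for i in range(36)]
ordered = pseudorandomize(block, rng=random.Random(1))
```

Rejection sampling is simple and unbiased among valid orders; for a constraint this mild, a valid order is typically found after a handful of shuffles.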
Probe recognition pictures were taken from the same stimulus pool as target pictures (Sakarias & Flecken, 2019). The probe recognition task had three conditions: target and probe picture were identical (Match condition, half of the trials), the agent mismatched, or the patient/action mismatched (each 25% of the trials). For the Action/Patient Mismatch trials, one of the other target events with the same agent was presented. For the Agent Mismatch trials, pictures of the same event with a different agent were presented.

Procedure
Participants were tested individually in a laboratory booth. The experiment was programmed in Presentation (Neurobehavioral Systems, Berkeley). Fixation data were collected with an SMI RED250m eye tracker (SensoMotoric Instruments, Teltow), sampling at 250 Hz. Stimuli were displayed on a 15.6″ laptop computer screen with a resolution of 1920 × 1080 pixels, positioned approximately 60 cm away from participants. Target pictures subtended a visual angle of 8.35° horizontally (500 pixels) and 5.64° vertically (333 pixels); the target pictures' center was 9.70° away from the central fixation cross participants fixated on at stimulus onset. Participants first received written instructions on the task and then read further instructions on the screen. After completing six practice trials, they had the chance to ask questions to the experimenter. The eye tracker was then calibrated with a five-point calibration and a four-point validation procedure and participants were told to sit still and not to move their eyes away from the screen. Participants wore a headset recording their descriptions of cue pictures. After every second block there was a self-timed break. The eye tracker was re-calibrated after each break. The total experimental session lasted around 50 min.
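The relation between picture size in pixels and visual angle can be sketched as follows. The computation assumes square pixels, that the stated 15.6-inch diagonal is the visible display area, and a viewing distance of exactly 60 cm; since the distance is only given as approximate, the result should land near, but not necessarily on, the reported angles. Function names are illustrative.

```python
import math

def screen_width_cm(diagonal_in, res_x, res_y):
    """Physical width of a display from its diagonal, assuming square pixels."""
    diag_px = math.hypot(res_x, res_y)
    return diagonal_in * 2.54 * res_x / diag_px

def visual_angle_deg(size_px, px_per_cm, distance_cm):
    """Visual angle subtended by a centrally viewed extent of size_px pixels."""
    size_cm = size_px / px_per_cm
    return math.degrees(2 * math.atan(size_cm / (2 * distance_cm)))

width_cm = screen_width_cm(15.6, 1920, 1080)
px_per_cm = 1920 / width_cm

horizontal = visual_angle_deg(500, px_per_cm, 60)  # picture width
vertical = visual_angle_deg(333, px_per_cm, 60)    # picture height
print(f"{horizontal:.2f} deg x {vertical:.2f} deg")
```

Small deviations from the reported values are expected under these assumptions, because a slightly larger viewing distance yields slightly smaller angles.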

Data processing and analyses
Fig. 1. Trial structure and example stimuli. Trials started with displaying the verb to describe the cue picture (here: "to dress (someone)"). In cue pictures, agent/patient were colored green/red or vice versa; participants were instructed to begin their descriptions with the green character (eliciting active or passive sentences). Cue pictures were presented on the screen until participants pressed a button after having finished their description. Next, after a central fixation cross, target pictures were briefly presented for 300 ms in one of the four screen corners. Finally, a recognition probe was presented and participants indicated by button press whether it matched the target picture. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
For each target picture, (elliptical) agent and patient areas of interest were defined manually in the eye tracker manufacturer's BeGaze software. The agent area encompassed the face and the upper part (head and part of upper body) of the person performing the action. The patient area encompassed the object being manipulated (i.e., the patient in the narrow sense) and also the agent's hands and a potential instrument (i.e., where the action took place). It is often difficult to separate patients and action regions in naturalistic event depictions, e.g., when the agent's hands are touching an object. As patients have close ties to actions (at least in syntax, Kratzer, 1996), we employed an area of interest that encompasses both the patient and the action (Fig. 2). Fixations were detected using the manufacturer's algorithm as implemented in BeGaze.
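Assigning a fixation to an elliptical area of interest amounts to a point-in-ellipse test. The sketch below is only an illustration of that idea: it assumes axis-aligned ellipses and uses made-up coordinates in a 500 × 333 pixel picture; it does not reproduce BeGaze's internal procedure or the actual areas used in the study.

```python
from dataclasses import dataclass

@dataclass
class EllipseAOI:
    """Axis-aligned elliptical area of interest (hypothetical geometry)."""
    name: str
    cx: float  # center x in picture pixels
    cy: float  # center y in picture pixels
    rx: float  # horizontal semi-axis
    ry: float  # vertical semi-axis

    def contains(self, x, y):
        # Standard ellipse inequality: inside if the normalized distance <= 1.
        return ((x - self.cx) / self.rx) ** 2 + ((y - self.cy) / self.ry) ** 2 <= 1.0

def classify_fixation(x, y, aois):
    """Return the name of the first AOI containing (x, y), else 'other'."""
    for aoi in aois:
        if aoi.contains(x, y):
            return aoi.name
    return "other"

# Illustrative AOIs only; real AOIs were drawn manually per picture.
aois = [
    EllipseAOI("agent", cx=150, cy=110, rx=90, ry=100),
    EllipseAOI("patient/action", cx=350, cy=230, rx=110, ry=80),
]
print(classify_fixation(150, 110, aois))
```

Fixations landing between the two ellipses fall into the "other" category, which corresponds to fixations that hit neither area of interest.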
Trials in which participants did not produce the intended cue sentence (e.g., an active instead of passive when the patient was colored green) or did not look at the target picture during the brief exposure period were excluded from analyses. In addition, two participants who had fewer than 50% of trials left after exclusions and one participant who had no correct probe recognition trials in the Match condition were excluded. In total, 9852 trials from 38 participants (84.1% of all data) were available for analyses.
Single-trial level analyses were conducted with brms (Bürkner, 2017, 2018; R Core Team, 2018). Fixations to agents and patients/actions during exposure to the target picture were analysed with hierarchical Bayesian Bernoulli regression. The critical predictor was cueing condition (active vs. passive). Nuisance predictors (Sassenhagen & Alday, 2016) were: block in which each trial occurred (reflecting how many passive trials had been encountered), and the orientation (agent left vs. right) and the screen position of target pictures. Agent and patient/action fixations were analysed separately (Barr, 2008). Models included random intercepts and slopes for cue condition by participant and by item, were fitted using six chains with 6000 iterations (including 3000 warmup iterations) and employed Student t distributions (5 degrees of freedom, μ = 0, σ = 3) as priors for all predictors and the intercept.
Predictive model performance with and without the cue condition predictor was assessed using model stacking (Yao, Vehtari, Simpson, & Gelman, 2018). Frequentist hierarchical regressions were computed with lme4 (Bates, Mächler, Bolker, & Walker, 2015) to supplement the Bayesian analyses and showed the same pattern of results. Statistical significance was assessed with likelihood ratio tests. The maximal random effects structure (that, in the case of frequentist models, allowed convergence) was used for all models (Barr, 2013; Barr, Levy, Scheepers, & Tily, 2013). Categorical predictors were sum coded. Block as continuous predictor was mean-centered.
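To make the predictor coding concrete, the following sketch shows sum coding of categorical predictors and mean-centering of a continuous one. It mirrors the convention of R's contr.sum (values of +1, 0 and -1, with the last level coded -1 in every column); the scale actually used in the analyses is not stated in the text, so this choice, like the helper names, is an assumption.

```python
def sum_contrasts(levels):
    """Sum-to-zero contrast codes for a list of factor levels.

    Returns a dict mapping each level to its k-1 contrast values; the
    last level is coded -1 in every column, so each column sums to zero.
    """
    k = len(levels)
    table = {}
    for i, level in enumerate(levels):
        if i < k - 1:
            table[level] = [1 if j == i else 0 for j in range(k - 1)]
        else:
            table[level] = [-1] * (k - 1)
    return table

def mean_center(values):
    """Subtract the mean so that the predictor is centered on zero."""
    values = list(values)
    mean = sum(values) / len(values)
    return [v - mean for v in values]

cue_codes = sum_contrasts(["active", "passive"])
position_codes = sum_contrasts(
    ["top_left", "top_right", "bottom_left", "bottom_right"]  # labels assumed
)
block_centered = mean_center(range(1, 9))  # blocks 1..8
```

With sum coding, the intercept reflects the grand mean across conditions rather than one reference level, which simplifies the interpretation of the cue condition effect alongside the nuisance predictors.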

Results
Participants fixated on target pictures on average 200 ms after stimulus onset (SD = 22 ms). Whether the cue pictures were described with actives or passives influenced how participants subsequently viewed pictures during brief exposure (Fig. 3). After passive cues, the likelihood of first fixations on the agent decreased and the likelihood of first fixations on the patient/action of the target events increased, as compared to after active cues. Models including cue condition as a predictor for the likelihood of agent and patient/action fixations performed better in model stacking than models ignoring the cues (Table 1; in frequentist models, patient/action: χ²(1) = 5.38, p = 0.02; agent: χ²(1) = 4.01, p = 0.04). In trials in which neither the agent nor the patient/action area of interest was fixated, participants mostly fixated the center of the picture in between these two areas (as in previous studies, e.g., Gerwien & Flecken, 2016). These center fixations were presumably driven by the demands of the recognition task that required participants to rapidly extract information on the entire event. Concerning the two areas of interest, agents were more likely than patients to be fixated first on average, most likely because both agents as such and humans in particular are overall more salient (Cohn & Paczynski, 2013; Crouzet, Kirchner, & Thorpe, 2010; Gao, Baker, Tang, Xu, & Tenenbaum, 2019; Rösler, End, & Gamer, 2017; Webb, Knott, & MacAskill, 2010) and because the human agents in the current stimuli were larger than the inanimate patients.
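The link between a likelihood ratio statistic and its p-value can be illustrated for the one-degree-of-freedom case reported here. For a chi-square distribution with one degree of freedom, the upper-tail probability equals erfc(sqrt(x / 2)), so no statistics library is needed. The function name is illustrative, and the sketch only re-derives p-values from the rounded χ² values given in the text; it does not refit any model.

```python
import math

def chi2_sf_df1(x):
    """Upper-tail probability of a chi-square variable with 1 df.

    Uses the identity P(X > x) = erfc(sqrt(x / 2)), valid for df = 1 only.
    """
    return math.erfc(math.sqrt(x / 2.0))

# Likelihood ratio statistics as reported in the text (rounded values).
p_patient_action = chi2_sf_df1(5.38)
p_agent = chi2_sf_df1(4.01)
print(round(p_patient_action, 3), round(p_agent, 3))
```

Both values fall below the conventional 0.05 threshold, in line with the reported tests; small differences from the printed p-values can arise from rounding of the reported statistics.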
In the recognition task, responses to the probe were slower and less accurate when either the agent or the patient/action mismatched as compared to when the briefly presented target and the probe pictures matched. Whether the cue sentence had an active or passive structure had no effect on recognition performance (cf. Appendix A.3).

Discussion and conclusions
We show that visual event apprehension can be guided by the syntactic structure of recently uttered sentences. Whilst the core event role configuration of cue sentences was kept constant, they differed in the expression of viewpoint on the event, one in which either the agent or the patient was foregrounded conceptually. This viewpoint subsequently influenced the attentional prioritization of agents or patients during the planning and execution of the very first fixation onto the briefly presented target event pictures. We take these first fixations to be a direct reflex of the ongoing or possibly finished apprehension process. Participants likely retrieved the core event structure information parafoveally (Dobel et al., 2007; Hafri et al., 2013), including information on agents and patients and their location (i.e., they extracted what is often called the event's gist, Henderson & Ferreira, 2004). On the basis of this information, they decided where to place their first fixation for further visual information uptake. While the process of event structure extraction itself thus may not have been affected, the subsequent first direction of gaze into the event pictures was informed by the viewpoint conveyed by the syntactic structure of the cue sentences. Event apprehension and saccade programming were executed rapidly: target pictures were fixated already after approximately 200 ms. This means that people could compute their first fixation after only minimal exposure to the stimuli, and that the cue sentences' syntactic structure thus exerted influence on early perceptual processing stages.
Crucially, cue and target events were unrelated: Whilst cue events involved a human agent and a human patient, target events involved a human agent and an inanimate patient (Fig. 1). The discrepancy in event type and in agent and patient properties (such as animacy), however, still allowed for viewpoint cueing from speaking to seeing. This underlines that the effect took place at the level of the conceptual structure of the events, which includes viewpoint information. The abstract conceptual event structure foregrounding either the agent or the patient was part of the message (Levelt, 1989, 1999) generated during production of the cue sentences (cf. also Bunger, Papafragou, & Trueswell, 2013). We propose that this event structure remained activated also after the sentence was uttered. It could therefore "warp" viewers' event apprehension by exerting a top-down influence on perceivers' decision on which part of event pictures appeared most attention-worthy and should be looked at first under the pressing demands of the task to recognize entire events with only brief exposure (cf. Lupyan, 2012). This process may be similar to the processes underlying syntactic priming during language production (Bock, 1986; Pickering & Ferreira, 2008), where a representation stays active after recent use and influences subsequent processing. The event representation activated during speaking retained activation and was used for subsequent seeing, i.e., when extracting the gist and deciding the starting point for detailed processing of the target event. The effect of active and passive cue sentences extends conceptual guidance theories of scene apprehension and eye movements to the domain of events (Henderson, 2017; Henderson et al., 2007; Henderson, Hayes, Peacock, & Rehrig, 2019) and shows how language can provide such conceptual feedback to initial attention allocation.
It expands the evidence for language-perception interactions to the realm of sentences and relational categories. To date, it could be shown that labels denoting object concepts facilitate perceptual categorization of these objects.
Here, we show that sentences that convey a viewpoint through a syntactic structure can transiently cue the conceptual salience of relational percepts and guide the direction of initial gaze into briefly presented event pictures, resulting in early attentional biases in visual information processing.
Cueing effects of linguistic labels on object perception in the literature were mainly behavioural (with the exception of, e.g., Lupyan, 2015 and Samaha et al., 2018, who report effects on early visual EEG responses), and assessed post-hoc, e.g., through button presses (cf. Firestone & Scholl, 2016). Here, by contrast, we report an effect on first fixation locations (Gerwien & Flecken, 2016), providing a direct window into event processing and demonstrating that syntactically conveyed event viewpoints play a role in mediating early visual scene processing.
Both active and passive sentences served as appropriate cues for the uptake of information relevant to the task, i.e., extracting agent-patient relations for later recognition, but their differing viewpoints elicited differential prioritization in online attention allocation (to either the agent or the patient).
Could the cueing effect, at least in part, be driven by a reliance on verbal encoding of target events (due to the production task or the demands of the recognition task), inducing more early patient fixations for passives (Sauppe, 2017b)? Even though people may rely on verbal strategies to support memory (Trueswell & Papafragou, 2010), such strategies are unlikely to go beyond labelling of event type to include the planning of syntactic alternations, as event viewpoint was irrelevant to the task. Exposure to the pictures for only 300 ms is also likely not sufficient to plan the grammatical structure of entire sentences (Griffin & Bock, 2000), including such syntactic alternations as actives and passives. In addition, verbal encoding strategies may not play a role in the encoding of complex scenes for recall (Rehrig, Hayes, Henderson, & Ferreira, 2020).

Table 1
Results of hierarchical Bayesian Bernoulli regression predicting the likelihood of fixations on the patient/action and agent regions in briefly exposed target pictures. All Pareto k values < 0.5 (Vehtari, Gelman, & Gabry, 2017).

Note that in the current study both agents and patients were always overtly mentioned in cue sentences and only differed in foregrounding and backgrounding through syntactic structure.

In conclusion, cueing effects of grammatical structure on event processing open up a new range of possibilities for exploring language-perception interactions, beyond features of single words (like gender, Sato & Athanasopoulos, 2018), and making use of linguistic diversity (Norcliffe, Harris, & Jaeger, 2015). Visual event apprehension could, for example, also be modulated by other grammatical phenomena that attune agent and patient salience such as differential subject and object marking (de Hoop & Malchukov, 2008), ergativity (Bickel, Witzlack-Makarevich, Choudhary, Schlesewsky, & Bornkessel-Schlesewsky, 2015; Dixon, 1994), or information-structurally driven word order variations (Downing & Noonan, 1995).