Speaking and gesturing guide event perception during message conceptualization: Evidence from eye movements

Speakers' visual attention to events is guided by linguistic conceptualization of information in spoken language production and in language-specific ways. Does production of language-specific co-speech gestures further guide speakers' visual attention during message preparation? Here, we examine the link between visual attention and multimodal event descriptions in Turkish. Turkish is a verb-framed language where speakers' speech and gesture show language specificity with path of motion mostly expressed within the main verb accompanied by path gestures. Turkish-speaking adults viewed motion events while their eye movements were recorded during non-linguistic (viewing-only) and linguistic (viewing-before-describing) tasks. The relative attention allocated to path over manner was higher in the linguistic task compared to the non-linguistic task. Furthermore, the relative attention allocated to path over manner within the linguistic task was higher when speakers (a) encoded path in the main verb versus outside the verb and (b) used additional path gestures accompanying speech versus not. Results strongly suggest that speakers' visual attention is guided by language-specific event encoding not only in speech but also in gesture. This provides evidence consistent with models that propose integration of speech and gesture at the conceptualization level of language production and suggests that the links between the eye and the mouth may be extended to the eye and the hand.


Introduction
The idea that linguistic conceptualization of events builds on and guides apprehension of events is prominent in influential theories of language production. According to the thinking for speaking hypothesis (Slobin, 1996), speakers attend to the aspects of the world that they plan to speak about and in ways compatible with the lexical and syntactic constraints of their specific language. Similarly, in Levelt's (1989) language production model, speaking begins with a preverbal apprehension of the broad details of an event, including information about people, objects, places, time and the relations among them. This preverbal event representation is mapped onto a linguistic message, taking into account the constraints on how entities, relations and spatiotemporal information are packaged into different lexical and syntactic structures, which ultimately culminates in an utterance. This model is supported by eye-tracking studies showing that while describing visual scenes, speakers allocate their attention to the features they plan to speak about (Gleitman, January, Nappa, & Trueswell, 2007; Griffin & Bock, 2000; Meyer, Sleiderink, & Levelt, 1998; van de Velde, Meyer, & Konopka, 2014) and in a way reflecting language-specific semantic and grammatical patterns (Norcliffe, Konopka, Brown, & Levinson, 2015; Sauppe, 2017; Sauppe, Norcliffe, Konopka, Van Valin, & Levinson, 2013; for an overview see Norcliffe, Harris, & Jaeger, 2015). Nevertheless, language is a multimodal phenomenon. Speakers frequently use gestures along with speech to convey information about events (McNeill, 2005). Furthermore, they do so in ways linked to language-specific encoding of events in speech (Kita & Özyürek, 2003). The purpose of the present study is to draw on evidence from the eye-tracking paradigm to test whether producing language-specific gestures along with speech further guides visual attention to events.
There are different views on how gesture production is linked to (spoken) language production. One class of models proposes that gestures are pre-linguistically generated from the visual imagery in visuo-spatial working memory and thus function as a direct window into thought (Krauss, Chen, & Chawla, 1996; Krauss, Chen, & Gottesman, 2000). A second class of models proposes that gestures are generated from the speaker's communicative intent about what information they want to convey (de Ruiter, 2000, 2007; Melinger & Levelt, 2004). Part of this information is communicated via speech and part of it is conveyed via gesture. The information conveyed in speech and gesture may or may not overlap. Crucially, both classes of models propose that gesture production is not part of message preparation and that, therefore, the content and the form of gestures should not be influenced by language-specific constraints on how information is expressed in the accompanying speech. As a result, these models cannot explain how gestures produced along with speech show language-specific patterns and how speech and gesture are linked systems. Unlike these two classes of models, the Interface Model of multimodal language production (Kita & Özyürek, 2003) proposes that gesture is also planned during message preparation for language production. In this view, linguistic conceptualization interacts with the spatio-motoric imagery underlying gesture generation during online language production. Through these interactions, co-speech gestures represent information following language-specific constraints on information packaging in the speech that they accompany. Each co-speech gesture expresses semantic information encoded within one processing unit (i.e., verbal clause) in speech. The Interface Model uniquely predicts that gesture production requires conceptualization, as speech production does. If so, one might expect similar links between co-speech gesture production and event apprehension as found for speech production. One of the novelties of the present study is to test this prediction.

Multimodal linguistic encoding and conceptualization of motion events
Motion events provide an ideal domain for investigating how event apprehension during message preparation might be linked to speech and gesture production, and how these links might be indexed through visual attention. There is considerable cross-linguistic diversity in how languages map motion event components onto lexical or syntactic structures (Talmy, 1985). This diversity provides the grounds for looking for potential differences and language specificity in event conceptualization. Satellite-framed languages (e.g., English, Dutch) typically encode manner of motion in the main verb (e.g., "ran" in sentence 1) and path of motion in elements outside of the verb (e.g., in prepositional phrases, which are formed by adding a preposition "into" before a noun phrase as in "into the phone booth" in sentence 1). Verb-framed languages (e.g., Turkish, Greek), however, typically encode path of motion in the main verb (e.g., "girdi" "entered" in sentence 2), and manner of motion in subordinate verbs (e.g., "koşarak" "while running" in sentence 2). In verb-framed languages manner is optional and can be omitted.
(1) The woman ran into the phone booth
Noun phrase Verb Preposition Noun phrase
Figure Manner Path Ground
It should be noted that these patterns are not the only ways in which motion events are encoded in these languages but rather indicate the most frequent and typical ways of encoding motion. Speakers of verb-framed languages occasionally encode path in elements outside of the verb, such as postpositional phrases, either together with path verbs or as the sole expression of path information (Özçalışkan & Slobin, 2003). For example, in Turkish path of motion can occasionally be encoded without a main path verb, through postpositional phrases alone, which are formed by adding a postposition (e.g., "içi-(n)e" "to-inside") after a noun phrase (e.g., "telefon kulübesi-nin içi-(n)e" "to inside the phone booth" in sentence 3).
(3) Kadın telefon kulübesi-nin içi-(n)e koş-tu
woman phone booth-GEN inside-DAT run-PST
Noun phrase Noun phrase Postposition Verb
Figure Ground Path Manner
'The woman ran to inside the phone booth'.
Furthermore, there are systematic cross-linguistic differences in co-speech gestures that depict path and manner of motion. Crucially, these cross-linguistic differences in gesture show striking parallels to the typological patterns in how path and manner information is encoded in speech in these languages. For example, English-speakers are likely to use one clause to express path and manner in speech (i.e., manner as main verb and path as prepositional phrase; see sentence 1 above) and typically conflate path and manner components into a single co-speech gesture (Kita & Özyürek, 2003). In contrast, Turkish- and Japanese-speakers, who are likely to use separate clauses to express path and manner in speech (i.e., path as main verb and manner as subordinate verb; see sentence 2 above), typically produce separate gestures for manner and path. They also tend to produce more path-only than manner-only gestures because path is encoded in the main verb (Kita & Özyürek, 2003; see also Kita et al., 2007; Özçalışkan, 2016; Özçalışkan, Lucero, & Goldin-Meadow, 2016a, 2016b; Özyürek, Kita, Allen, Furman, & Brown, 2005; Özyürek et al., 2008; and Gullberg, 2011 for converging evidence from the domain of placement events). Importantly, these cross-linguistic differences in speech and gesture surface in the descriptions of the very same events. Thus, despite the fact that the visual imagery for the events is the same, gesture patterns differ cross-linguistically, mirroring patterns found in speech for satellite- and verb-framed languages. These findings challenge the view that gestures are not part of message preparation (de Ruiter, 2000, 2007; Krauss et al., 1996, 2000). Instead, they can be taken as evidence for the Interface Model of speech and gesture production (Kita & Özyürek, 2003), according to which gesture production is constrained by the kind of semantic information that can be syntactically packaged in one processing unit (i.e., clause) in speech (Levelt, 1989).
Finally, there has been evidence for a tight link between event conceptualization and language-specific motion event encoding in speech from cross-linguistic eye-tracking studies. In one study, English- and Greek-speaking adults watched videos of motion events (e.g., a man skating to a snowman) while their eye movements were recorded and then described the events (Papafragou, Hulbert, & Trueswell, 2008). While viewing the events prior to speaking, both groups allocated more attention to the component that they were planning to encode in the main verb. Greek-speakers attended more to path of motion than English-speakers, and English-speakers attended more to manner of motion than Greek-speakers. Crucially, these cross-linguistic differences in attention allocation that emerged prior to speaking disappeared when participants freely inspected the events without preparing for linguistic encoding (see also Bunger, Skordos, Trueswell, & Papafragou, 2016, 2021; Bunger, Trueswell, & Papafragou, 2012; Flecken, von Stutterheim, & Carroll, 2014; Sakarias & Flecken, 2019; Trueswell & Papafragou, 2010). These findings strongly suggest tight links between visual attention and language-specific encoding of motion that emerge during speech planning.
Even though the current body of evidence on the tight link between speech and gesture production, and particularly language-specificity of gestures accompanying speech, is best explained by the Interface Model, there is one aspect of this model that remains to be tested empirically. The Interface Model attributes language-specificity of co-speech gestures to interactions between event apprehension, linguistic conceptualization, and visual-spatial imagery during the message preparation stage of multimodal language production. However, empirical evidence for this aspect of the model has been somewhat indirect because it comes from speech and gesture behavior and not from eye-gaze behavior during event apprehension. This is because all of the prior work on eye-gaze patterns during message preparation has focused on (spoken) language production in the auditory modality (with the exception of one study on sign language production; Manhardt et al., 2020). In addition to the modulation in eye-gaze patterns driven by speech production, there might be further modulation of visual attention driven by the additional language-specific encoding of event information in the gestural modality. Gestures produced with speech can encode information overlapping with speech as well as additional information not necessarily found in speech. For example, speakers may express path of motion in speech by saying "the woman entered the phone booth" and convey additional information about direction of motion by producing a co-speech gesture that directly maps onto the visual scene (e.g., moving the index finger across space from left to right). Whether such language-specific encodings in co-speech gesture further guide visual attention remains to be seen.

Present study
In the present study, our primary goal is to seek empirical evidence from eye-gaze behavior for the integration of speech and gesture during message conceptualization. As a secondary goal, we aim to replicate and extend prior evidence on the relation between visual attention and language-specific encoding of motion events in speech. To address these goals, we ask how speakers of Turkish (a verb-framed language) attend to path as opposed to manner of motion events while viewing the events in preparation for linguistic encoding in speech and gesture (i.e., linguistic task) versus freely inspecting events without preparing for any linguistic encoding (i.e., non-linguistic task). For linguistic encoding, we investigate how path and manner are encoded in speech and gesture and whether this is in line with previously shown typological patterns (Talmy, 1985; Slobin, 1996). For visual attention, we ask if attention allocated to path relative to manner varies (a) in relation to variations in linguistic encoding of path of motion in speech and (b) in relation to path gesture production accompanying path encodings in speech.
In language production, Turkish-speakers are expected to encode path in the main verb and manner outside of the main verb. In line with this typological encoding, they might be expected to produce descriptions that encode only path of motion in speech because of optional encoding of manner (i.e., the element encoded outside of the main verb) in verb-framed languages. Alternatively, they may produce spoken descriptions that encode both path and manner of motion, as seen in previous work with Turkish-speakers when both path and manner are salient in an event (Özyürek et al., 2008). More relevant for the aims of this study, however, the majority of path encodings in speech are expected to be within the main verb as opposed to outside of the verb, albeit with some variation.
In gesture, participants are expected to produce descriptions that encode only path of motion more frequently because semantic elements encoded in the main verb (i.e., path in this case) are more likely to determine the content of the gestures than elements encoded outside of the main verb (Kita & Özyürek, 2003; Özyürek, 2018; Özyürek et al., 2005). Since Turkish has a verb-framed typology, speakers should be more likely to produce separate co-speech gestures for path and manner. Furthermore, as seen in previous work, Turkish-speakers should be more likely to leave out the gesture component corresponding to the element encoded outside of the main verb, in this case manner of motion (Özçalışkan, 2016; Özçalışkan et al., 2016a, 2016b).
For eye movements, we expect the time course of visual attention to differ across linguistic and non-linguistic tasks, as found in speakers of other verb-framed languages that encode motion similarly to Turkish (e.g., Greek: Papafragou et al., 2008; Trueswell & Papafragou, 2010). Specifically, speakers should allocate more attention to path over manner of motion in the linguistic task than they do in the non-linguistic task. This is based on the fact that path is likely to be encoded in the main verb in such languages whereas manner is encoded outside of the main verb or omitted.
To be able to show more direct links between the specific choices in actual language use and the time course of visual attention within a single group of language users, we expect speakers to allocate more attention to path over manner of motion in the linguistic task when they encode path in speech compared to when they do not encode it in speech (i.e., mention only manner in their description). We also expect participants to allocate more attention to path over manner of motion when they encode path in the main verb as opposed to when they encode path outside of the verb. These predictions are based on the proposal that verbs are the main unit of sentence planning (Levelt, 1989, among others) as well as the thinking for speaking hypothesis (Slobin, 1996), which proposes that language-specific encodings in the verbs guide attention prior to speaking. Previous cross-linguistic eye-tracking studies have only compared speakers of satellite- and verb-framed languages at the group level without considering within-language variation in motion event encoding. Thus, the current study goes beyond previous studies in testing whether visual event apprehension varies in relation to different syntactic encodings of event components within a single language. Such evidence can illuminate what kind of linguistic encoding (i.e., path within the main verb versus outside of the verb) influences event apprehension in a more fine-grained way.
For eye movements regarding gesture production in the linguistic task, we expect even more attention allocated to path over manner of motion when path is encoded in both speech and gesture compared to when path is encoded only in speech. Unlike other models of gesture production (de Ruiter, 2000, 2007; Krauss et al., 1996, 2000), the Interface Model uniquely predicts gesture production to require similar conceptualization as speech production and to be constrained by the kind of semantic information that can be packaged in one processing unit within the main verb (Kita & Özyürek, 2003). In the case of Turkish, as path is more likely to be encoded in the main verb, path gestures are expected to be produced with similar conceptualization of events as in speech production. Furthermore, path gestures might provide extra information about the direction of motion in the left-right axis not found in speech, which might then lead to more attention allocated to path of motion in the visual event.

Method
The methods reported in this study were approved by the Humanities Ethics Assessment Committee of Radboud University.

Participants
Data were collected from adult native speakers of Turkish (n = 36, 10 males, mean age = 21.5 years, range = 19-24). All of the participants had learned Turkish from birth on and as their first language. Participants were students at Ozyegin University in Istanbul, Turkey and received course credit for their participation. Data from six additional participants were discarded due to trackloss (n = 1), the participant having amblyopia (n = 1), failing to follow the instructions in the linguistic task (n = 3) and equipment error (n = 1).

Stimuli
Stimuli consisted of short video clips depicting two types of events: motion events (target stimuli) featuring intransitive events of an agent moving in relation to a landmark, and transitive events of an agent performing actions on objects (fillers). One of two different female actors performed each event. All stimuli are available at https://osf.io/st5gb/.

Motion events
Fifty video clips that depicted a female actor moving with respect to a landmark object along a particular path with a particular manner served as the stimuli for motion events. Each video clip was 2500 ms long. Motion lasted throughout the entire 2500 ms.
The stimuli included five different spontaneous manners of motion, corresponding to: walk, run, leap, skip, and hop (yürümek, koşmak, sıçramak, hoplamak, sekmek in Turkish) and five different paths of motion, corresponding to: to, past, into, from and out of (yaklaşmak, geçmek, girmek, uzaklaşmak, çıkmak in Turkish). The complete set of stimuli included an equal number of path and manner variations. Manners of motion were filmed in a studio at Radboud University for the purpose of this study. Each actor performed each manner of motion against a green background. The video clips were edited in Adobe Premiere Pro CC 2015. First, each clip was cut to last 2500 ms. Then, the background of the video was made transparent using the ultra key feature of Adobe Premiere Pro. In order to create a scene, each manner of motion was combined with one of two different backgrounds (a white brick wall, or a light pink wall) and a gray asphalt-textured floor. Finally, motion paths were created by combining the moving figure with a landmark object (Fig. 1). For to and into paths, the landmark objects were placed near the final location of the actor's motion. For from and out of paths, the landmark objects were placed near the starting location. For past paths, the landmark objects were placed towards the final location of the motion, but such that the actor would pass the object during the video. Some landmark objects appeared twice, with a different token for each actor.
Previous eye-tracking work in the domain of motion events has revealed that speakers extract path of motion from similar events by predictively fixating on a goal object (Bunger et al., 2012, 2016, 2021; Papafragou et al., 2008; Trueswell & Papafragou, 2010) rather than tracing the trajectory of motion with their eyes, since people rarely fixate on empty space (see also Kamide, Lindsay, Scheepers, & Kukona, 2016). Thus, the events that involved a goal-directed path (to, into, or past) served as the target motion events. In order to ensure that participants did not only view goal-directed motion, events that involved source paths (from or out of) were included as non-target motion events.
Each path-manner combination had two versions, performed by a different actor, creating a total of 50 motion events (see Appendix A for the complete list of motion events). These 50 events were divided into two lists, such that each manner-path combination appeared once in each list. The assignment of lists to tasks (non-linguistic, linguistic) was counterbalanced across participants. For each event list, an additional version was created by reversing the order of items. Thus, for each task (non-linguistic, linguistic) there were four presentation lists. Furthermore, the two actors, the two backgrounds (white, pink) and the direction of motion in the video (left-right, right-left) were counterbalanced across lists, and across each manner and each path within the lists.
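The list construction described above can be illustrated with a minimal sketch. It assumes hypothetical actor labels and a simple alternation rule for assigning each actor's version to a list, and it rotates list-to-task assignment and item order by participant index. It does not reproduce the full counterbalancing of actors, backgrounds, and motion direction reported here, and all function names are illustrative rather than taken from the study materials.

```python
from itertools import product

MANNERS = ["walk", "run", "leap", "skip", "hop"]
PATHS = ["to", "past", "into", "from", "out of"]
GOAL_PATHS = {"to", "past", "into"}   # goal-directed paths are the target events
ACTORS = ["actor_1", "actor_2"]       # hypothetical labels for the two actors


def build_lists():
    """Split the 50 events so each manner-path pair occurs once per list."""
    list_a, list_b = [], []
    for i, (manner, path) in enumerate(product(MANNERS, PATHS)):
        # Assumed rule: alternate which actor's version goes to which list.
        first, second = ACTORS if i % 2 == 0 else ACTORS[::-1]
        is_target = path in GOAL_PATHS
        list_a.append({"manner": manner, "path": path,
                       "actor": first, "target": is_target})
        list_b.append({"manner": manner, "path": path,
                       "actor": second, "target": is_target})
    return list_a, list_b


def presentation_lists(list_a, list_b):
    """Each event list in its original and reversed order: four lists."""
    return {
        "A_forward": list_a,
        "A_reversed": list_a[::-1],
        "B_forward": list_b,
        "B_reversed": list_b[::-1],
    }


def assign_lists(participant_index):
    """Assumed rotation of list-to-task assignment and item order."""
    lists = presentation_lists(*build_lists())
    first = "A" if participant_index % 2 == 0 else "B"
    second = "B" if first == "A" else "A"
    order = "forward" if (participant_index // 2) % 2 == 0 else "reversed"
    return {
        "non_linguistic": lists[f"{first}_{order}"],
        "linguistic": lists[f"{second}_{order}"],
    }
```

Under these assumptions, each list contains 25 motion events, 15 of which involve goal-directed paths and therefore count as targets.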

Transitive filler events
Fifty additional video clips that depicted the same female actors performing every-day actions on objects (e.g., peeling a banana) served as the stimuli for transitive filler events (25 videos per actor). Transitive filler events were 2500 ms long (see Appendix B for a complete list of transitive filler events). Transitive filler events were also filmed at the same studio at Radboud University for the purpose of this study. Actors performed the actions on a gray table against a green background. Video clips were edited in Adobe Premiere Pro CC 2015. First, each clip was cut to last 2500 ms. Then, the green background was removed and replaced with one of the two backgrounds that were used for the motion events. The 50 transitive filler events were divided into two lists: one for the non-linguistic eye-tracking task, and one for the linguistic eye-tracking task. All participants saw the first set during the non-linguistic task and the second set during the linguistic task.

Procedure
Each participant was tested in Turkish by a native speaker in a quiet room on their university campus. Participants first signed a consent form. Then, they were seated approximately 60 cm away from an SMI RED 250 eye-tracker (SensoMotoric Instruments) attached to a DELL Precision M4800 laptop. Binocular eye gaze was sampled at a rate of 250 Hz (every 4 ms). Screen resolution was 1920 × 1080. The size of the stimulus videos was 1280 (width) × 720 (height) pixels. The experiments were run through Presentation® software (Version 16.5, Neurobehavioral Systems, Inc., Berkeley, CA, www.neurobs.com). All participants completed the three components of the experiment in the following order: (1) Non-linguistic eye-tracking task, (2) Filler task, (3) Linguistic eye-tracking task. At the end of the session, all participants completed a demographics and language background survey (Gullberg & Indefrey, 2003) and a post-experiment questionnaire. The non-linguistic task was presented first to avoid carry-over effects from the linguistic eye-tracking task onto the non-linguistic eye-tracking task. The filler task was included to distract the participant's attention from the previous task and lasted approximately 5 min. Each session lasted approximately 45 min.

Non-linguistic eye-tracking task
Participants watched 50 video clips of events (25 motion events and 25 transitive filler events) presented on the computer screen while their eye movements were recorded. In each trial, participants first saw a fixation screen, containing a fixation cross in the center, for 1000 ms. Next, an event was shown for 2500 ms. Finally, a gray screen was presented until the participant clicked on a blue button on the mouse to advance to the next trial. In order to ensure attention to the screen, participants were given a secondary task (Flecken et al., 2014; Sakarias & Flecken, 2019). Participants were asked to press a button on the keyboard marked with a yellow sticker during the gray screen when a given event had been presented twice in a row. There were 5 repeating events in total. Crucially, all of these repeating events were transitive filler events.
Fig. 1. Sample motion event stimuli: (a) "A woman running into a phone booth" (b) "A woman skipping to rocks".
Before the experimental trials started, participants completed 3 practice trials, followed by optional feedback and the opportunity to ask questions. After the practice trials, a 5-point calibration and validation procedure was completed. This part of the experiment lasted approximately 10 min.
In order to ensure that the setups used in the non-linguistic and linguistic tasks were similar, in both tasks the participants were seated across from a confederate whom they believed was another naïve participant. The confederate was included to make the linguistic task more communicative and to elicit more naturalistic descriptions from participants than, for example, speaking to a computer screen in an empty room. The confederate was instructed to just listen and not to direct any questions or comments to the participant. Crucially, in the non-linguistic task, the confederate was busy filling out a questionnaire on a laptop in front of her and did not engage with the participant.

Linguistic eye-tracking task
Participants watched 50 video clips of events (25 motion events and 25 transitive filler events) presented on the computer screen while their eye movements were recorded. In each trial, participants first saw a fixation screen, containing a fixation cross in the center, for 1000 ms. Next, an event was shown for 2500 ms. Finally, a gray screen was presented. Participants were asked to describe what had happened in the video to the confederate once the gray screen appeared. Participants were informed that their eye movements were not recorded during the gray screen, thus they were free to move, look at the other participant and use their hands while speaking. Participants' descriptions (speech and co-speech gestures) during the linguistic task were recorded with a Canon video camera. After the participant had finished the description, the confederate clicked a button on the mouse marked with a blue sticker to initiate the next trial.
Before the experimental trials started, participants completed 2 practice trials, followed by optional feedback and the opportunity to ask questions. After the practice trials, a 5-point calibration and validation procedure was completed. Then, after calibration, the experimental trials started. The calibration procedure was repeated once in the middle of the task.
The confederate was the same research assistant as in the non-linguistic task, whom the participant believed was another participant. The confederate was instructed to listen to the participant's descriptions and click the blue button when the participant finished describing. The confederate was also told to listen to the descriptions carefully because they would be asked to answer some questions afterwards. This part of the experiment lasted approximately 15 min.

Coding
Descriptions of target motion events were transcribed and coded for the presence of path and manner information in speech and gesture using ELAN software (Lausberg & Sloetjes, 2009) by a native speaker of Turkish.

Speech coding
First, event descriptions were segmented into clauses. Each clause consisted of a main verb and its subordinate verbs, if any. Clauses could be coordinated by conjunctions (ve/and, ama/but, sonra/then) or connective morphemes (-erek, -e… e…, -ıp …ıp). Each clause was coded for the presence of path and manner information in speech and gesture separately (following conventions in Allen et al., 2007). At the trial level, participants' speech could either include one component of motion (path-only or manner-only) or both components (path + manner).
In speech, path information was coded as present if it was expressed with path verbs (e.g., gir/enter, yaklaş/approach, geç/pass, git/go) or outside of the verb in postpositional phrases (e.g., içi-(n)e/to-inside). All path mentions were further coded for how path information was encoded (i.e., within a path verb or outside of the verb). Manner information was coded as present if it was expressed as either a manner verb subordinated to a path verb with a connective (e.g., koş-arak/run-CONN) or as a main manner verb (e.g., koş/run, yürü/walk, zıpla/jump).
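As an illustration of how clause-level codes of this kind could be aggregated to the trial level, the sketch below uses a hypothetical `ClauseCode` record. The field names, the function names, and the extra "none" and "both" outcomes are assumptions made for the example and are not part of the coding scheme as reported.

```python
from dataclasses import dataclass


@dataclass
class ClauseCode:
    """Hypothetical record of what one clause encodes in speech."""
    path_in_verb: bool = False        # e.g., gir/enter as the main verb
    path_outside_verb: bool = False   # e.g., postposition içi-(n)e/to-inside
    manner: bool = False              # main or subordinated manner verb


def trial_speech_type(clauses):
    """Collapse clause codes into path-only, manner-only, or path+manner."""
    has_path = any(c.path_in_verb or c.path_outside_verb for c in clauses)
    has_manner = any(c.manner for c in clauses)
    if has_path and has_manner:
        return "path+manner"
    if has_path:
        return "path-only"
    if has_manner:
        return "manner-only"
    return "none"  # assumed label for descriptions with neither component


def path_locus(clauses):
    """Where path was encoded: 'verb', 'outside', 'both', or None."""
    in_verb = any(c.path_in_verb for c in clauses)
    outside = any(c.path_outside_verb for c in clauses)
    if in_verb and outside:
        return "both"  # assumed label; path verb plus postpositional phrase
    if in_verb:
        return "verb"
    if outside:
        return "outside"
    return None
```

For instance, a description such as sentence 3, with path in a postpositional phrase and a main manner verb, would be represented as `[ClauseCode(path_outside_verb=True, manner=True)]` and classified as path + manner with path encoded outside of the verb.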
In order to ensure reliability, 25% of the speech data were coded by a second coder who was also a native speaker of Turkish. The agreement between the coders for the presence of path and/or manner information in speech was 94.1% at the clause level (κ = 0.921). All disagreements were discussed to reach 100% agreement.

Gesture coding
First, we segmented gesture strokes (most meaningful part of the movement) that accompanied speech and represented path and/or manner of motion. Each gesture was coded for the presence of path and manner information (following conventions in Özyürek et al., 2005; Özyürek et al., 2008). Path information was coded as present if speakers traced the figure's change of location across space. Speakers could trace the trajectory of motion either in the lateral axis (from left to right or from right to left) or in the sagittal axis (moving away from or towards the body). Pointing gestures to the location of the landmark object were not coded as path gestures because they do not trace the trajectory of motion and hence fail to convey path information. Manner information was coded as present if the speakers produced a gesture that depicted how the motion unfolds in a non-linear way with a body part chosen to represent the figure (e.g., inverted V-hand shape with wiggling fingers to indicate walking). Manner gestures could indicate the manner of motion from an observer's perspective (e.g., an index finger moving up and down to indicate jumping) or could be an enactment of the figure's posture during motion from an actor's perspective (e.g., moving the arms up and down to indicate running).
At the trial level, participants' gestures could either include one component of motion (path-only, Fig. 2a or manner-only, Fig. 2b) or both components (path + manner, Fig. 2c). When both components were gestured, the gestures could either be a combination of separate path and manner gestures (e.g., a gesture like the one in Fig. 2a and another gesture like the one in Fig. 2b) or a single gesture that conflated path and manner, although the latter pattern was quite rare (10%). See Fig. 2 for examples.
In order to ensure reliability, 25% of the gesture data were coded by the same second coder as for speech. The agreement between the coders for the presence of path and manner information in gesture were 87.9% at the clause level (κ = 0.748). All disagreements were discussed to reach 100% agreement.

Preprocessing of eye movement data
Two rectangular Areas of Interest (AoI) were defined for each of the target motion event stimuli using SMI BeGaze software. Path AoI was defined as the area surrounding the ground object based on previous eye-tracking work on spontaneous motion events (Bunger et al., 2012, 2016, 2021; Papafragou et al., 2008; Trueswell & Papafragou, 2010). Size and position of the Path AoI remained the same throughout the trial since the ground object remained static. Manner AoI was defined as the area surrounding the legs, torso and arms of the figure. Because the figure moved across the screen as the motion unfolded, the coordinates of the Manner AoI had to be updated throughout the trial. To do so, anchor points for the position of Manner AoI were created by repositioning the AoI every 100 ms for the entire 2500 ms. Manner AoI size and shape remained the same across anchor points. Based on these anchor points BeGaze created a dynamic Manner AoI that moved along a connected path and the coordinates of the AoI were updated in a way that always included the legs, arms and torso of the figure. Fixations to the AoIs were computed by SMI BeGaze software. The onset and offset of stimuli for each trial were marked by a message sent from Presentation software to the eye-tracker. Using an R script (version 3.4.3) (R Core Team, 2018), we determined whether a fixation fell into one of the AoIs in successive 100 ms time bins for 2500 ms. Participants with more than 25% trackloss across all trials in a given task were excluded from the analysis for both tasks (n = 1 due to trackloss in the linguistic task). Additionally, we excluded trials in which trackloss was higher than 50% (non-linguistic task: 2.59%, linguistic task: 0.86%).
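The logic of a static Path AoI, a Manner AoI that moves between anchor points, and a trial-level trackloss rate can be sketched as follows. This is an illustration under stated assumptions, not the BeGaze or R implementation used in the study: the linear interpolation between anchors, the coordinates, the rule that the Path AoI is checked first, and all function names are hypothetical.

```python
from bisect import bisect_right

def aoi_position(anchors, t_ms):
    """Interpolate the AoI's top-left corner linearly between anchor points.

    anchors: list of (time_ms, left, top) tuples sorted by time.
    """
    times = [a[0] for a in anchors]
    if t_ms <= times[0]:
        return anchors[0][1], anchors[0][2]
    if t_ms >= times[-1]:
        return anchors[-1][1], anchors[-1][2]
    i = bisect_right(times, t_ms)
    t0, x0, y0 = anchors[i - 1]
    t1, x1, y1 = anchors[i]
    w = (t_ms - t0) / (t1 - t0)
    return x0 + w * (x1 - x0), y0 + w * (y1 - y0)

def in_rect(gx, gy, left, top, width, height):
    """True if the gaze point lies inside the rectangle (borders included)."""
    return left <= gx <= left + width and top <= gy <= top + height

def classify_bin(gaze, t_ms, path_rect, manner_anchors, manner_size):
    """Label one time bin as 'path', 'manner', 'other', or None (trackloss)."""
    if gaze is None:
        return None
    gx, gy = gaze
    # Assumption: the static Path AoI takes priority if the AoIs overlap.
    if in_rect(gx, gy, *path_rect):
        return "path"
    left, top = aoi_position(manner_anchors, t_ms)
    if in_rect(gx, gy, left, top, *manner_size):
        return "manner"
    return "other"

def trackloss_rate(labels):
    """Proportion of bins in a trial without a usable gaze sample."""
    return sum(label is None for label in labels) / len(labels)

# Hypothetical screen coordinates in pixels (illustration only).
path_rect = (800, 250, 150, 200)                    # left, top, width, height
manner_anchors = [(0, 100, 300), (100, 140, 300)]   # time_ms, left, top
manner_size = (120, 200)                            # width, height

print(classify_bin((130, 350), 50, path_rect, manner_anchors, manner_size))
print(trackloss_rate(["path", None, "manner", None]))
```

A trial whose `trackloss_rate` exceeds 0.5 would be dropped under the trial-level criterion described above.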

Language production: Event descriptions in speech and gesture
Speech and gesture production data were analyzed using log-linear models with Poisson-distributed residuals. Models were fit using the glm function from the stats package in R (version 4.0.3; R Core Team, 2020). Significance levels for pairwise comparisons with corrections for multiple comparisons were obtained using the emmeans (version 1.5.5-1; Lenth, 2021) and multcomp (version 1.4-16; Hothorn, Bretz, & Westfall, 2008) packages. Data and analysis code are available at https://osf.io/st5gb/. Table 1 shows the distribution of information about path and manner of motion across speech and gesture that we explore further below with statistical analyses for speech and gesture separately.

Speech
For speech analysis we tested to what extent event descriptions in speech conform to language-specific patterns of motion event encoding such that participants would be more likely to produce descriptions that encode path of motion only. A log-linear model tested the fixed effect of speech type (manner-only, path-only, path + manner) on counts of mention in speech (1 = mentioned, 0 = not mentioned) at the trial level as the dependent measure. Participants were more likely to use path + manner descriptions in speech compared to both manner-only (β = 1.755, SE = 0.127, z = 13.84, p < .001) and path-only (β = 2.407, SE = 0.169, z = 14.21, p < .001) descriptions (see Table 1). Participants were also more likely to use manner-only descriptions than path-only descriptions (β = 0.653, SE = 0.200, z = 3.26, p = .003). This indicated that, contrary to our initial expectation, most of the time participants produced descriptions that encoded both path and manner in speech. Next, we tested the prediction that path of motion would be more likely to be encoded in the main verb as opposed to outside of the verb. For this analysis, we focused on a subset of the data (85%) in which participants encoded path of motion in speech either using a path-only description or a path + manner description (see Table 1). Thus, we excluded data from 15% of trials in which participants did not encode path information and encoded manner information only. When participants encoded path of motion in speech, as expected, the majority of the path mentions were in path verbs (63% of path mentions) and path mentions outside of the verb (i.e., in post-positional phrases only) were less frequent (37% of path mentions). A log-linear model tested the fixed effect of type of path encoding (post-positional phrases, numerically contrast coded as −1/2; verbs, numerically contrast coded as 1/2) on counts of mention in speech (1 = mentioned, 0 = not mentioned) at the trial level.
There was a fixed effect of type of path encoding, indicating that, when participants encoded path of motion in speech, they were more likely to use path verbs than post-positional phrases only (β = 0.542, SE = 0.097, z = 5.59, p < .001).
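As a reading aid for the coefficients above, the following sketch converts a Poisson log-linear estimate into a rate ratio and a Wald z statistic. It assumes the usual log link and that, with codes of −1/2 and 1/2, β is the log of the ratio between the two levels; the helper names are ours, and the sketch does not refit the model.

```python
import math

def rate_ratio(beta):
    """Exponentiate a log-linear coefficient to obtain a ratio of expected counts."""
    return math.exp(beta)

def wald_z(beta, se):
    """Wald z statistic: estimate divided by its standard error."""
    return beta / se

# Reported estimate for path verbs vs. post-positional phrases only.
beta, se = 0.542, 0.097

print(round(rate_ratio(beta), 2))   # ratio of verb to post-position mentions
print(round(wald_z(beta, se), 2))   # compare with the reported z
print(round(63 / 37, 2))            # ratio implied by the rounded percentages
```

Under these assumptions the exponentiated coefficient (about 1.7) is close to the ratio implied by the 63% versus 37% split, with the small gap plausibly due to rounding of the percentages.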

Gesture
Turning to motion event encodings in gesture, we first examined to what extent the gestures that speakers produced conform to language-specific patterns. Of interest was whether participants were more likely to produce gestures that only encode path of motion compared to path + manner or manner-only gestures due to language-specific encoding of path in the main verb in Turkish. To do so, we focused on the trials in which participants produced a gesture (51% of all trials) and assessed which of the three gesture types (path-only, manner-only, and path + manner) was most frequent. A log-linear model tested the fixed effect of gesture type (manner-only, path-only, path + manner) on counts of mention in gesture (1 = mentioned, 0 = not mentioned) at the trial level as the dependent measure. Participants were more likely to produce path-only gestures compared to both path + manner (β = 0.530, SE = 0.138, z = 3.83, p < .001) and manner-only (β = 1.165, SE = 0.173, z = 6.75, p < .001) gestures (Table 1). Furthermore, participants were more likely to produce path + manner gestures compared to manner-only gestures (β = 0.635, SE = 0.187, z = 3.40, p = .002). This confirms that path-only gestures were indeed produced most frequently by our participants. Summarizing, language production data showed that in speech participants most frequently produced descriptions that encoded path and manner together. Furthermore, and most importantly for the purpose of our study, path was mostly encoded within the main verb in speech. In gesture, participants most frequently produced gestures that encoded only path of motion. These patterns largely conform to language-specific encoding of motion events in speech and gesture in Turkish. In the following section, we test the relation between visual attention and these language-specific encodings in speech and gesture.

Analysis of eye movements
We were interested in testing whether the time course of the relative attention allocated to path over manner during message preparation changed across tasks, types of path encoding in speech and types of path encoding in gesture. To test these hypotheses, we analyzed the time course of eye movements using Growth Curve Analysis (GCA; Mirman, 2014; Mirman, Dixon, & Magnuson, 2008).¹ GCA is a multilevel regression method designed for analyzing time course data. GCA uses polynomial functions to model time course and is able to capture changes in time course that follow any shape. Hence, this approach is suitable for modelling the change in eye movements over time in our data, which followed a non-linear shape (i.e., initial increase followed by a decrease, see Figs. 3-5). GCA is also able to quantify variation due to fixed effects (i.e., group-level effects; in our case: tasks, types of path encoding in speech and types of path encoding in gesture) as well as the random variation introduced by individual differences (i.e., participants or items). For our dependent variable, we followed prior eye tracking work in the motion event domain (Bunger et al., 2012, 2021; Papafragou et al., 2008; Trueswell & Papafragou, 2010) and used difference scores as a measure of the preference to fixate on one event component over the other. Thus, our dependent variable was the proportion of fixations to the Path AoI (out of all fixations) minus the proportion of fixations to the Manner AoI (out of all fixations). For the analyses, data were aggregated into 100 ms time bins. All analyses were conducted on log transformed odds ratio of proportions of Path minus Manner fixations. We excluded 8.1% of the data due to participants not fixating anywhere on the stimuli (either path or manner AoIs, or elsewhere on the scene) within a bin.
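The difference score described above can be sketched for a single time bin as follows. The labels and the function name are hypothetical; bins without any fixation on the stimulus are dropped before proportions are computed, mirroring the exclusion described in the text. The subsequent log odds transformation is not reproduced here because the text does not spell out its exact formula.

```python
def path_minus_manner(labels):
    """Difference score for one time bin.

    labels: one entry per observation in the bin, each 'path', 'manner',
    'other', or None when there was no fixation on the stimulus.
    Returns proportion(path) - proportion(manner), or None if no valid data.
    """
    valid = [label for label in labels if label is not None]
    if not valid:
        return None
    p_path = valid.count("path") / len(valid)
    p_manner = valid.count("manner") / len(valid)
    return p_path - p_manner

# Hypothetical observations for one 100 ms bin (illustration only).
bin_labels = ["path", "path", "manner", "other", None]
print(path_minus_manner(bin_labels))
```

Positive scores indicate a preference for the path region, negative scores a preference for the manner region, and values near zero no preference.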
Since we were interested in examining the differences in eye movements tied to linguistic planning, our analyses focused on a subset of the time course of eye movements. Specifically, we focused on the window spanning 200 ms to 1500 ms after stimulus onset. We excluded the eye movements in the last 1000 ms of the trial (between 1500 ms and 2500 ms) from the analyses for two reasons. First, previous work has shown that event apprehension for utterance planning is rapid (Griffin & Bock, 2000) and eye movements in this earlier time window can reflect changes in visual event apprehension due to linguistic planning more accurately. Second, since the target motion events in our stimuli were all goal directed, the figure reached the landmark object at the end of the clip and therefore the path and manner AoIs overlapped in the last second. We also excluded the eye movements in the first 200 ms from the analyses since it takes about 200 ms for participants to plan and land a saccade (Matin, Shao, & Boff, 1993).
All models were fit with the lme4 package (version 1.1.17; Bates, Mächler, Bolker, & Walker, 2015) in R (version 4.0.3; R Core Team, 2020). Polynomial growth functions were created using the psy811 package (version 1.0; Mirman, 2015). P values for the t-tests on the parameter estimates were obtained with the lmerTest package (version 3.1-1; Kuznetsova, Brockhoff, & Christensen, 2017). Figures were produced using the ggplot2 package (version 3.2.1; Wickham, 2016). Details of model fitting are available in Supplementary Materials. Data and analysis code are available at https://osf.io/st5gb/.
¹ Following the approach recommended by Huang and Snedeker (2020), we modelled our data using binomial logistic regressions as well and replicated the findings from the GCA reported in the current article. Results and analysis code from both approaches can be found at https://osf.io/st5gb/.

Eye movements in linguistic vs. non-linguistic tasks
We first tested to what extent eye movements were guided by engaging in linguistic planning, such that participants would allocate more attention to path over manner of motion in the linguistic task compared to the non-linguistic task. Fig. 3 shows the proportion of fixations to path minus manner over time across linguistic and non-linguistic tasks.
Polynomial growth functions were added stepwise to the model and the overall time course of eye movements was modelled with fourth-order orthogonal time terms in addition to the fixed effect of task (non-linguistic contrast coded as −1/2; linguistic contrast coded as 1/2). The model also included random intercepts for Subjects and Items (more complex models with random slopes did not converge). Parameter estimates from the model are presented in Table 2. Most importantly for present purposes, the model revealed a fixed effect of task: participants had a higher preference to fixate on path over manner in the linguistic task compared to the non-linguistic task. Furthermore, there was an interaction between task and the linear time term, indicating that over time, the decrease in path preference was less steep in the linguistic task than the non-linguistic task. The model also revealed that the time course of the data was characterized by Quadratic, Cubic and Quartic terms for time; however, none of these time terms interacted with task, indicating that the curvature was similar across tasks. Overall, these findings indicate that the time course of eye movements varies across linguistic and non-linguistic tasks with more attention allocated to path over manner of motion in the linguistic task.
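To make the notion of orthogonal time terms concrete, the sketch below builds linear through quartic terms by Gram-Schmidt orthogonalization of the powers of a centered time index. This is a generic construction, not the psy811 routine used in the study; the choice of 13 bins (100 ms bins from 200 ms up to 1500 ms) is our assumption about the analysis window, and signs or scaling may differ from those produced by other software.

```python
import math

def orthogonal_time_terms(n_bins, degree):
    """Return `degree` orthonormal polynomial columns over n_bins time points.

    Column k (0-based) is a polynomial of order k + 1 in time and is
    uncorrelated with the intercept and with every lower-order column.
    """
    if degree >= n_bins:
        raise ValueError("degree must be smaller than the number of time bins")
    t = [i - (n_bins - 1) / 2 for i in range(n_bins)]   # centered time index
    basis = [[1.0 / math.sqrt(n_bins)] * n_bins]        # normalized constant
    for power in range(1, degree + 1):
        col = [x ** power for x in t]
        # Remove the part of this power already explained by lower orders.
        for b in basis:
            proj = sum(c * v for c, v in zip(col, b))
            col = [c - proj * v for c, v in zip(col, b)]
        norm = math.sqrt(sum(c * c for c in col))
        basis.append([c / norm for c in col])
    return basis[1:]                                    # drop the constant

# Assumed window: 13 bins of 100 ms; fourth-order polynomial as in the text.
linear, quadratic, cubic, quartic = orthogonal_time_terms(13, 4)
print([round(v, 3) for v in linear])
```

Because the columns are mutually orthogonal, the estimate for one time term (for example, the linear slope) does not change when higher-order terms are added, which is what allows terms to be added stepwise.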

Eye movements across different types of path encoding in speech
Next, we tested whether and to what extent eye movements in the linguistic task varied across different types of path encoding in speech. Of interest was whether participants would allocate more attention to path over manner of motion when they encoded path in speech (with either a post-positional phrase only or a verb) compared to when they did not encode path in their speech. Also of interest was whether participants would allocate even more attention to path over manner of motion when they encoded it in a path verb as opposed to when they encoded it outside of the verb in post-positional phrases only. Fig. 4 shows the proportion of fixations to path minus manner over time when participants did not encode path in speech at all (i.e., manner-only), when they encoded it as a post-positional phrase only and when they encoded it as a path verb.
Polynomial growth functions were added stepwise to the model and the overall time course of eye movements was modelled with fourth-order orthogonal time terms. The fixed effect of path encoding in speech (No Path, Post-positional Phrase, Path Verb) was tested with planned contrasts on only the linear and quadratic time terms. Adding the interactions between the fixed effect of path encoding in speech and the Cubic (χ²(2) = 0.178, p = .915) and Quartic (χ²(2) = 0.047, p = .977) time terms did not improve model fit (see Supplementary Materials for details). For the fixed effect of path encoding in speech, first we compared trials in which participants did not encode path in speech to any type of path encoding (no path encoding contrast coded as −2/3, post-positional phrase contrast coded as 1/3, and path verb contrast coded as 1/3). Then, we compared trials in which participants encoded path in post-positional phrases only to when they used path verbs (no path encoding contrast coded as 0, post-positional phrase contrast coded as −1/2, and path verb contrast coded as 1/2). The model also included random slopes for path encoding in speech by Subjects and Items (models with random slopes for time terms did not converge). Parameter estimates for the fixed effects from the best-fitting model are presented in Table 3.
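The two planned contrasts can be checked algebraically with a short sketch. The contrast codes are those given in the text; the coefficients are hypothetical and chosen only to show what each contrast estimates. Exact fractions are used so the identities hold without rounding.

```python
from fractions import Fraction as F

# Contrast codes from the text: (any path vs. no path, verb vs. post-position).
CONTRASTS = {
    "no_path": (F(-2, 3), F(0)),
    "postpositional_phrase": (F(1, 3), F(-1, 2)),
    "path_verb": (F(1, 3), F(1, 2)),
}

def predicted_mean(level, b0, b1, b2):
    """Model-implied mean for one level given intercept and contrast coefficients."""
    c1, c2 = CONTRASTS[level]
    return b0 + c1 * b1 + c2 * b2

# Hypothetical coefficients (illustration only, not estimates from the study).
b0, b1, b2 = F(1), F(3, 10), F(1, 5)
m_none = predicted_mean("no_path", b0, b1, b2)
m_pp = predicted_mean("postpositional_phrase", b0, b1, b2)
m_verb = predicted_mean("path_verb", b0, b1, b2)

# First contrast: mean of the two path-encoding levels minus the no-path level.
print((m_pp + m_verb) / 2 - m_none == b1)
# Second contrast: path verb minus post-positional phrase only.
print(m_verb - m_pp == b2)
# Intercept: unweighted mean of the three levels.
print((m_none + m_pp + m_verb) / 3 == b0)
```

Each contrast column sums to zero and the two columns are orthogonal, so the first coefficient isolates the effect of encoding path at all and the second isolates where path was encoded.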
As earlier, curvature was similar across different types of path encoding in speech: an initial increase in path preference was followed by a decrease and a second increase and decrease (quartic term). However, and most importantly for present purposes, both of the contrasts for path encoding in speech interacted with the linear time term. This indicated that the overall decrease in path preference was steeper when participants did not encode path in speech (i.e., encoded manner only) compared to when they encoded path in speech in any way (with either a post-positional phrase only or a path verb). Additionally, the overall decrease in path preference was steeper when participants encoded path in speech with a verb compared to when they encoded it with a post-positional phrase only. This reflects the fact that when participants encoded path in speech path preference was particularly high at the beginning of message preparation. However, by the end of the analysis window preference to fixate on path over manner was quite similar across path encodings in verbs and post-positional phrases only. Thus, time course of eye movements for path encodings in path verbs was characterized by a stronger negative slope. Overall, these findings indicate that the time course of eye movements in the linguistic task varies across different types of path encoding in speech with more attention allocated to path of motion compared to manner of motion when path was encoded in verbs.

Eye movements in relation to path encoding in gesture
Finally, we tested to what extent the time course of eye movements in the linguistic task varied when they were accompanied by different types of path encoding in gesture. For this analysis, we focused on a subset of the eye-gaze data based on the linguistic encoding of event components in speech and gesture. As seen in Table 1, the most frequent encoding pattern was path + manner descriptions in speech and path-only gestures. In order to keep the semantic elements encoded in speech constant, we focused on the trials in which participants encoded both path and manner in speech. Then, we examined the time course of eye movements when participants did not use a gesture as opposed to when they used a path-only gesture. Thus, we excluded trials with less frequent speech (path-only and manner-only) and gesture (manner-only or path + manner) patterns from this analysis. This allowed us to test our prediction that path encoding in gesture in addition to what was already encoded in speech would be related to more attention to path over manner of motion in visual event apprehension. Fig. 5 shows the proportion of fixations to path minus manner over time when participants produced a path-only gesture and when they did not produce any gesture.
Polynomial growth functions were added stepwise to the model and the overall time course of eye movements was modelled with third-order orthogonal time terms in addition to the fixed effect of gesture type (no gesture contrast coded as −1/2; path-only gesture contrast coded as 1/2). The model also included random slopes for gesture type by Subjects and Items (models with random slopes for time terms did not converge). Parameter estimates from the model are presented in Table 4.
As Table 4 shows, there was no effect of gesture type, indicating no difference in overall path over manner preference when a path-only gesture was produced compared to when no gestures were produced. However, there was an interaction between gesture type and the cubic time term, indicating differences in curvature when participants produced a path-only gesture versus when they did not produce any gestures. This interaction is likely to be driven by two patterns in the data. First, the initial peak in path preference follows a deeper curve when participants produce a path-only gesture. Second, there is a second rise in path preference when participants produced a path-only gesture compared to when they did not produce any gestures. Overall, these data indicate that time course of relative attention allocated to path over manner of motion in the linguistic task was different across different types of path encoding in gesture, with more attention allocated to path of motion over manner of motion when path was additionally encoded in gesture.

Further exploration of the relation between eye movements and path encoding in gesture
In order to further evaluate how visual attention varies in relation to path encoding in gesture, we conducted two sets of exploratory analyses. In a first set of analyses, we checked whether the variation in the time course of fixations to path and manner of motion linked to gesture production was simply a byproduct of what was encoded in the accompanying speech. It is possible that the speech accompanying path-only gestures had more path encodings in verbs than with post-positional phrases only and thus the results reported above could be due to encoding differences in speech instead of having produced path-only gestures. To rule out this possibility, we examined if the encoding of path of motion in speech was similar (verbs versus post-positional phrases only) in cases when speech was accompanied with path-only gestures versus no gesture.² We found that, when participants produced a path-only gesture, 56% of these path-only gestures occurred with speech in which path was expressed with a verb and 44% of these path-only gestures occurred with speech in which path was encoded with a post-positional phrase only. When participants did not produce any gesture, path was encoded in speech with a verb 61% of the time and with a post-positional phrase only 39% of the time. In fact, a chi-square test revealed that the distribution of trials in which path was encoded with a verb vs. a post-positional phrase only in speech was similar across trials with path-only gesture vs. no gesture (χ²(1) = 0.452, p = .502).
To ensure that the variation in the time course of eye movements in relation to path encoding in gesture remained significant after statistically controlling for type of path encoding in speech, we tested the best-fitting growth curve model on the time course of eye movements by adding type of path encoding in speech as a fixed factor (post-positional phrase contrast coded as −1/2, and path verb contrast coded as 1/2). The model structure for the remaining fixed and random effects was the same (see Supplementary Materials for details of model fitting and the complete list of parameter estimates). This model replicated the previously reported interaction between path encoding in speech and the linear time term (β = −0.290, SE = 0.066, t = −4.368, p < .001). Crucially, the interaction between gesture type and cubic time term remained statistically significant (β = 0.183, SE = 0.066, t = 2.769, p = .006) but did not further interact with path encoding in speech (β = 0.208, SE = 0.132, t = 1.571, p = .116). This indicates that the differences in curvature when participants produced a path-only gesture versus when they did not produce any gestures were observed both when participants encoded path in speech with a verb and when they encoded path in speech with a post-positional phrase only. These data confirm that the differences in the time course of eye movements linked to additional encoding of path of motion in gesture were sustained even after controlling for how path of motion was encoded in the accompanying speech.
Table 4 Parameter estimates from the best-fitting model on the proportion of Path minus Manner fixations across Path-only gesture and No Gesture trials when both path and manner are mentioned in speech. Significant p-values that are critical to the analysis are in boldface.
In a second set of analyses, we explored whether there were any other systematic differences in the speech that accompanied path-only gestures versus no gesture in terms of the ease of planning of the descriptions. It is commonly assumed that speakers produce gestures to compensate for difficulties in speech production. To eliminate the possibility that participants gestured about (and attended to) path of motion merely because they had difficulty speaking about it, we inspected the same subset of the data that was included in the analyses of eye movements in relation to path encoding in gesture. That is, we focused on the trials in which participants encoded both path and manner of motion in speech and produced either a path-only gesture or did not produce a gesture at all. Next, we coded for instances of disfluencies in speech. Disfluencies were defined as filled or unfilled pauses in speech or producing word fragments and self-corrections (following Graziano & Gullberg, 2018). Overall, participants were disfluent about path of motion only 5.9% of the time. When they produced path-only gestures, they were disfluent about path of motion 7.8% of the time, and they were not disfluent about path of motion 92.2% of the time. Similarly, when participants did not produce any gestures together with speech, they were disfluent about path of motion 4.8% of the time, and they were not disfluent about path of motion 95.2% of the time. A chi-square test confirmed that the distribution of the trials in which participants were versus were not disfluent about path of motion did not differ across path-only gesture versus no gesture trials (χ²(1) = 0.683, p = .409).
These findings confirm that there were no systematic differences across path-only gesture versus no gesture trials in terms of difficulty in speaking about path of motion.
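The two chi-square tests in this section compare a 2 × 2 distribution (gesture vs. no gesture by a binary speech property). A minimal sketch of the Pearson statistic and its p-value with one degree of freedom is given below. The cell counts are not reported in the text, so the sketch only converts the reported statistics into p-values; whether the original analysis applied a continuity correction is not stated, and none is applied here.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator

def p_value_df1(chi2):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom.

    With 1 df the statistic is a squared standard normal variable, so the
    tail probability can be written with the complementary error function.
    """
    return math.erfc(math.sqrt(chi2 / 2))

# Reported statistics from the two exploratory comparisons.
print(round(p_value_df1(0.452), 3))   # path in verb vs. post-position
print(round(p_value_df1(0.683), 3))   # disfluent vs. not disfluent
```

Under these assumptions the computed p-values land close to the reported ones (within about .001), and both are far from conventional significance thresholds.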

Discussion
Spoken language production guides visual event apprehension during message preparation (Gleitman et al., 2007; Griffin & Bock, 2000; Meyer et al., 1998) and in language-specific ways (Sauppe, 2017; Sauppe et al., 2013). Our primary goal in the present study was to test if producing language-specific gestures along with speech further guides visual attention to events. Secondarily, as a novel contribution to previous work on cross-linguistic differences in event encoding in speech and visual attention, we tested whether eye gaze patterns vary in relation to variations in linguistic encoding within the typological framework of a specific language. Overall, our findings strongly suggest that language-specific encodings of path in the main verb (as opposed to outside of the verb) as well as producing path gestures along with speech guide visual attention allocated to path over manner of motion during message preparation.

Motion events in speech and gesture
In order to motivate our investigation of potential differences in visual attention linked to language-specific encoding of motion in speech and gesture, we began by exploring linguistic encoding of motion in speech and its links to gesture production in Turkish. This allowed us to re-establish that Turkish-speakers adhered to the patterns reported in previous typological and empirical work. We found that Turkish-speakers produced spoken descriptions that encoded both path and manner of motion more frequently than descriptions that encoded either only path or only manner. Even though this pattern was somewhat less expected based on typological patterns reported in prior work on motion (Slobin, 1996; Talmy, 1985), it is in line with similar reports from speakers of verb-framed languages especially when the manner of motion is salient, contrastive, and cannot be inferred from the context (Özyürek et al., 2008; Papafragou, Massey, & Gleitman, 2006; ter Bekke, Özyürek, & Ünal, 2022). Nevertheless, as expected, the majority of encodings included path of motion and these were expressed through path verbs. This is consistent with previous typological work on the encoding of motion events in verb-framed languages (Talmy, 1985; Turkish: Özyürek et al., 2008; Greek: Papafragou et al., 2008; Papafragou & Selimis, 2010).
In gesture, Turkish-speakers produced path-only gestures more frequently than gestures that encoded only manner or both path and manner. This path-only bias found for gesture supports the Interface Model of multimodal production by showing that the semantic elements that were more likely to be encoded in the main verb were also more likely to be encoded in gesture (Kita & Özyürek, 2003). The speech and gesture patterns in the present study conform to typological gesture patterns in verb-framed languages and contrast with data from speakers of satellite-framed languages where speakers use more manner gestures and express manner and path in a single gesture (Özçalışkan, 2016; Özçalışkan et al., 2016a, 2016b; Özyürek et al., 2005, 2008). The combination of the most frequent encoding patterns in speech and gesture observed in the present study also coheres with the findings of a recent study conducted on Farsi, a language that has a mixed verb-framed and satellite-framed typology (Akhavan, Nozari, & Göksun, 2017). In that study, speakers also encoded path and manner equally frequently in speech, using light verbs plus prepositions to encode path and adverbs to encode manner, and were more likely to produce gestures that encoded only path. Together, these data provide behavioral evidence for the idea that speech and gesture form a tightly integrated system where speech and gesture (at least those about motion) are integrated with what can be packaged in a verb. This idea is corroborated by an exploratory finding in our data: speech disfluencies about path of motion were equally likely to co-occur with or without a path gesture. This is in line with recent evidence that gesture production does not necessarily help speakers retrieve words with spatial content (Kısa, Goldin-Meadow, & Casasanto, 2021; see also Graziano & Gullberg, 2018). Both sets of findings challenge the view that gestures are produced merely to compensate for difficulties in word retrieval.
Finally, our speech and gesture production findings confirm that multimodal linguistic encoding of motion is a good test bed for investigating further links between visual attention and language-specific speech and gesture production.

Attention to motion events prior to speech and gesture production
Turning to eye movements, our eye-tracking data revealed that when Turkish-speakers linguistically encoded events, they allocated more attention to path over manner of motion compared to when they non-linguistically encoded events. These data offer further support for the idea that engaging in linguistic planning guides visual attention (Levelt, 1989). Our findings replicate findings of previous cross-linguistic eye-tracking studies on motion events (Bunger et al., 2012, 2016, 2021; Flecken et al., 2014; Sakarias & Flecken, 2019) including work with speakers of other verb-framed languages (Greek; Papafragou et al., 2008; Trueswell & Papafragou, 2010) and extend these findings to Turkish, a language that had not been studied in this respect before. This finding is also important in showing that path preference in visual attention observed in prior work with Greek-speakers is not merely a reflection of order of mention of event components. In Greek, path of motion is typically mentioned before manner of motion. On the other hand, Turkish is a verb-final language and path of motion is typically mentioned after manner of motion. Despite this variation in word order, speakers of both (verb-framed) languages allocate more attention to path compared to manner of motion during early event apprehension. This suggests that the semantic information encoded within the verb has an important role in guiding visual attention to events (Levelt, 1989).
Our findings also go beyond prior work by pinpointing which types of linguistic encoding in speech within the variations in a given typology are more likely to guide eye movements during language production. Specifically, we showed that Turkish-speakers allocated more attention to path over manner of motion when they encoded path in speech compared to when they did not. Furthermore, they allocated even more attention to path over manner when they encoded path within a verb compared to outside of the verb with post-positional phrases only (i.e., in line with the verb-framed typology). This is compatible with the thinking for speaking hypothesis (Slobin, 1996).
Only two prior studies thus far have examined between-and withinlanguage variation in eye movements based on whether or not some motion event components were mentioned in speech (Bunger et al., 2016(Bunger et al., , 2021. This work demonstrated that when English-and Greekspeakers produced content-wise similar descriptions of caused motion events (e.g., mentioned both causative and resultative subevents) their eye movements prior to speaking were similar. Our findings highlight the importance of looking beyond the content of the descriptions (i.e., whether or not an event component is mentioned) for showing subtle nuances in visual event apprehension tied to language-specific event encoding in speech. To our knowledge, our data offer the first empirical evidence that attention allocation prior to speaking not only varies cross-linguistically but also within speakers of a single language in ways linked to language-specific encoding of motion paths. Together, these data provide further evidence for the idea that verbs are the main processing units of speech planning (e.g., Bock, 1982;Griffin & Bock;Kita & Ö zyürek, 2003;Levelt, 1989;Norcliffe & Konopka, 2015, among others).
As a novel contribution, we also showed that attention allocation prior to linguistic encoding was linked to language-specific encoding of motion event components in co-speech gestures. Turkish-speakers allocated even more attention to path over manner of motion when their spoken descriptions that included both path and manner were accompanied by a path-only gesture compared to when they did not encode any motion event components in gesture. This pattern possibly emerged because path gestures included additional information about the direction of the motion along the left-right axis, which was not necessarily conveyed in path speech. Crucially, the speech produced with path-only gestures was similar to the speech produced without any gestures in several respects, including the syntactic encoding of path of motion. Furthermore, the variation in visual attention linked to path gesture production persisted even after controlling for how path of motion was encoded in the accompanying speech. This indicates that differences in visual attention related to additional encoding of path/direction of motion in gesture emerged in addition to the differences found in relation to speech. In addition to complementing prior behavioral findings on speech and gesture, these findings suggest that prior evidence on the relation between visual event apprehension and spoken language production may be extended to multimodal language production. They also provide evidence consistent with the Interface Model of multimodal language production (Kita & Özyürek, 2003).
These patterns suggest that at the planning stage, there are interactions between visual event apprehension, linguistic constraints on how motion is encoded in a specific language, and the spatio-motoric imagery underlying gesture production. Specifically, the kind of semantic information that can be packaged in a processing unit within the main verb predicts not only gesture production but also attention allocation to event components. Even though the model by Kita and Özyürek (2003) has previously proposed this interface at the conceptualization stage of multimodal language production, this is the first empirical investigation that reveals eye-gaze patterns during message preparation that are compatible with this aspect of the model.

Conclusions
The present study offers novel evidence suggesting that visual event apprehension is guided by multimodal linguistic encoding of events and that the links between the eye and the mouth may be extended to the eye and the hand. These influences seem to be susceptible to language-specific constraints on event encoding in both speech and gesture. Together, these findings advance our understanding of language and its processing as a multisensory, multimodal phenomenon. Finally, the approach reported in this study offers new possibilities for future work investigating previously hypothesized tight links between event representation and language production (Knott & Takac, 2021; Ünal, Ji, & Papafragou, 2021) by taking the multimodal nature of language into account.

Data availability
The data and analysis code for the present study are available from https://osf.io/st5gb/.

Note: Participants always received event set A in the non-linguistic task and event set B in the linguistic task. Within each task, participants saw motion events and filler events in a mixed order.