The effect of input sensory modality on the multimodal encoding of motion events

ABSTRACT
Each sensory modality has different affordances: vision has higher spatial acuity than audition, whereas audition has better temporal acuity. This may have consequences for the encoding of events and for subsequent multimodal language production, an issue that has received relatively little attention to date. In this study, we compared motion events presented as audio-only, visual-only, or multimodal (visual + audio) input and measured speech and co-speech gesture depicting path and manner of motion in Turkish. Input modality affected speech production. Speakers with audio-only input produced more path descriptions and fewer manner descriptions in speech compared to speakers who received visual input. In contrast, the type and frequency of gestures did not change across conditions. Path-only gestures dominated throughout. Our results suggest that while speech is sensitive to auditory vs. visual input in encoding aspects of motion events, gesture is less sensitive to such differences.


Introduction
We usually receive spatial information via multiple channels. For example, while seeing someone walking away, we may also hear the fading sound of footsteps echoing in the corridor. Each sensory modality has different affordances that contribute to our overall experience of an event. At the same time, we can express events in language using different modalities, as in the verbal and manual modalities, each of which has its own channel restrictions. It is possible, therefore, that the expressibility of multisensory events into multimodal language may differ according to the constraints of both input and output channels. To test this, we investigate whether perceiving events through vision or audition influences the way we express spatial events in speech and gesture.
Vision has the unique advantage of providing simultaneous (i.e., holistic) information about features of objects and events in both close and distant space (e.g., Eimer, 2004; Thinus-Blanc & Gaunet, 1997). It is continuously accessible and thus allows perceivers to update information about motion, location, and spatial relations. Like vision, audition is a distant sense; however, it provides better temporal information than vision across locations. Audition dominates in temporal processing, such as discriminating rhythmic changes (e.g., Recanzone, 2003; Repp & Penel, 2002; Shams et al., 2000; Spence & Squire, 2003), and in contrast to the holistic nature of visual information, auditory information is sequential. Even though audition provides information about objects and events, vision typically dominates over conflicting auditory information in spatial perception (e.g., Alais & Burr, 2004; Howard & Templeton, 1966). Therefore, vision is widely considered the primary source of spatial perception (e.g., Ernst & Bülthoff, 2004; Welch & Warren, 1980). It has been claimed that language reflects this asymmetry between vision and audition. Vision appears to have privileged status, especially in languages of Western societies (e.g., Levinson & Majid, 2014; Lynott et al., 2020; Majid et al., 2018; San Roque et al., 2015; Speed & Majid, 2017; Viberg, 1983). This is reflected in the fact that vision-related verbs (e.g., see, look) are more frequent and numerous than non-vision-related verbs (e.g., smell, feel) in the perceptual lexicons of languages of the world (e.g., Floyd et al., 2018; Lynott et al., 2020; San Roque et al., 2015; Speed & Majid, 2017; Winter et al., 2018). Although the number and frequency of words differ across the senses, no study has experimentally investigated the effect of input modality on the language used to describe events. Moreover, little is known about its multimodal expression, particularly in co-speech gesture.
From first principles, one might speculate that the sequential format of speech is best suited to express event information perceived through the auditory modality, while gesture might best express information from the visual modality. Gesture production theories do indeed share an assumption that gesture derives from visuospatial imagery (Sketch Model, de Ruiter, 2000; Postcard Model, de Ruiter, 2007; Gesture as Simulated Action Framework, Hostetter & Alibali, 2008; Information Packaging Hypothesis, Kita, 2000; Interface Model, Kita & Özyürek, 2003; Lexical Retrieval Hypothesis, Krauss et al., 2000; Growth Point Theory, McNeill, 1992; McNeill & Duncan, 2000), with iconic gestures in particular considered an effective tool to convey visuospatial information (Alibali, 2005; Hostetter & Alibali, 2008). While there is nothing in these theories precluding the expression of auditory information in gesture, the emphasis on the "visual" has meant that very few studies have investigated the spatial affordances derived from non-visual information and expressed through gesture (although see, e.g., Holler et al., 2022).
To be able to address the question of whether input sensory modality affects multimodal language production, it is important to situate this work in the broader study of motion events and language typology. This is important as speakers of different languages package the same spatial experience in different ways focusing on, and conversely omitting, certain event components in speech and gesture. Slobin (1996) proposed that speakers encode aspects of events depending on distinctions in their language. For example, unlike a satellite-framed language such as English, Turkish is considered a verb-framed language, which primarily encodes PATH in the main verb and optionally encodes MANNER in a subordinated verb or adverbial phrases (Talmy, 1985). Turkish speakers use PATH and MANNER in separate clauses (e.g., koşarak eve girdi "she entered the house running", see Table 1), whereas English speakers conflate these in a single clause (e.g., she ran into the house) with MANNER as the main verb.
These language-specific patterns in speech are also reflected in co-speech gesture (Kita, 2000; Kita & Özyürek, 2003; Özçalışkan et al., 2016). Turkish speakers gesture PATH and MANNER separately, whereas English speakers are more likely to produce conflated gestures. In addition, given the focus on PATH in verb-framed languages, Turkish speakers have a tendency to gesture only about PATH, even in cases where they mention both PATH and MANNER in speech (Özyürek et al., 2005; Ünal et al., 2022; for a similar tendency in Farsi, Mandarin Chinese, and French respectively, see also Akhavan et al., 2017; Chui, 2009; Gullberg et al., 2008). To account for this, Kita and Özyürek (2003) proposed in their interface model that gesture derives partly from language typology and partly from visuospatial imagery.
With respect to our main research question concerning the role of input modality on the expressibility of motion events, previous studies have relied overwhelmingly on visual stimuli as input (e.g., video-clips, cartoons, line drawings, paintings; Gennari et al., 2002; Gullberg et al., 2008; Papafragou et al., 2002; Slobin et al., 2014; Ter Bekke et al., 2022). A notable exception is the work of Özçalışkan et al. (2016), who examined cross-linguistic differences in motion event descriptions in congenitally blind, sighted, and blindfolded speakers of Turkish and English. In order to elicit descriptions, blind and blindfolded participants explored scenes haptically while sighted speakers explored them visually. Scenes consisted of landmark objects (e.g., toy house, crib), where static dolls in different postures were posed to create the impression of motion (e.g., a girl running into a house). All participants were instructed to describe the scenes and were explicitly encouraged to gesture at the same time. Özçalışkan et al. (2016) found that both blind and sighted speakers (blindfolded or not) of Turkish and English expressed events in speech and co-speech gesture according to the typology of their language. In a follow-up study, Özçalışkan et al. (2018) showed that blind and sighted speakers of Turkish and English do not display typological differences in gesture when produced without speech (i.e., silent gesture), in line with the claim that only co-speech gesture reflects language-specific packaging (Goldin-Meadow et al., 2008).
These findings suggest sensory modality (in this case, visual vs. haptic) does not strongly influence the way speakers express events in speech or co-speech gesture, with language typology playing a more critical role. However, this conclusion may be premature. While Özçalışkan et al. (2016, 2018) developed a clever paradigm to compare people with and without visual access to stimuli, the conditions were not controlled in all respects. People could have spent longer exploring haptic scenes than visual ones, and this could have affected descriptions. Moreover, there was no direct comparison between descriptions of blindfolded and sighted speakers, so it is possible that within each language there were differences between visual and haptic conditions. Finally, in both Özçalışkan et al. (2016) and Özçalışkan et al. (2018) speakers were explicitly asked to gesture while describing events. Encouraging gesturing might have affected the encoding of events and possibly increased speakers' gesture frequency (e.g., Cravotta et al., 2019). Therefore, it remains unclear whether sensory modality of input affects the rate and type of spontaneous gesture production.
There is, in fact, evidence that sensory input could affect multimodal language production for spatial scenes (Iverson, 1999; Iverson & Goldin-Meadow, 1997), which in turn could have implications for motion event encoding. Iverson and Goldin-Meadow (1997), for example, compared blind and sighted English-speaking children during a route description task and found blind children described PATH in a more segmented fashion with more landmarks in their speech than sighted children. For example, a blind child gave the following route description: "Turn left, walk north, then you'll see the office, then you'll see 106, then 108, then 110, 112, then there's a doorway. Then there's a hall … ", whereas a sighted child said: "when you get near the staircase you turn to the left" (p. 463). Interestingly, when children gave segmented verbal descriptions, regardless of their visual status, they produced fewer gestures. Iverson and Goldin-Meadow (1997) claimed that gesture frequency decreases with segmented speech due to the process of gesture generation. As gestures express an image as "a global whole" (McNeill, 1992), information that is represented sequentially in speech is not as well suited to gesture. So, while speech might be more suitable for expressing information from non-visual input, gesture might be less well suited to do so.
To summarise, previous studies provide contradictory evidence about whether sensory modality could influence the way information is expressed in speech and gesture (Iverson, 1999; Iverson & Goldin-Meadow, 1997; Özçalışkan et al., 2016). However, no study has directly varied the input sensory modality of motion events, while also controlling for duration and event type, to test whether it affects speech and gesture.

The present study
We explore the effect of sensory modality of input on multimodal language use by focusing on motion events. Motion events provide a good test bed as there is a large body of previous speech and gesture production studies to build upon (e.g., Akhavan et al., 2017; Brown & Chen, 2013; Chui, 2009; Gennari et al., 2002; Gullberg et al., 2008; Papafragou et al., 2002). Importantly, PATH and MANNER components of motion events can be perceived from both visual and auditory inputs (Geangu et al., 2021; Mamus et al., 2019) and each may be differentially mapped to speech and gesture. Focusing on Turkish in particular allows us to situate our results with respect to previous studies in this language (e.g., Aktan-Erciyes et al., 2022; Allen et al., 2007; Kita & Özyürek, 2003; Özçalışkan et al., 2016; Ter Bekke et al., 2022), which together provide an important corrective to the dominance of English language studies in the literature (cf. Thalmayer et al., 2021).
We compared Turkish speakers' speech and gesture for PATH and MANNER of motion events that were presented as audio-only, visual-only, or multimodal (visual + audio) input. Our main goal was to compare audio-only to visual-only input. Including a multimodal condition allowed us to examine the additional boost, if any, that multiple sources of information provide. In particular, comparing the visual-only to the multimodal condition shows whether auditory information contributes additional spatial information to language production processes.
In speech, there are a number of specific predictions we can make. First, based on the observation that vision dominates in the perceptual lexicons of languages (e.g., San Roque et al., 2015; Winter et al., 2018), it is possible that vision also influences linguistic encoding for motion events. If so, we would predict that participants in the visual conditions (i.e., visual-only and multimodal conditions) would provide more motion event descriptions than participants in the audio-only condition.
In addition, we can make specific predictions about the encoding of PATH vs. MANNER in speech. With regard to PATH, if the previously attested differences in encoding of PATH information from non-visual input (i.e., segmented PATH descriptions in blind vs. non-blind, Iverson, 1999; Iverson & Goldin-Meadow, 1997) are caused by the sensory modality of input at encoding, we would predict that participants in the audio-only condition would describe PATH of motion in a more segmented fashion than in other conditions because auditory input is more sequential. This would lead to more mentions of PATH within each description in speech in the audio-only condition than in the visual conditions.
As for MANNER, it is possible that vision is of advantage here too. For example, in order to differentiate particular MANNERs, such as walk vs. run, vision provides richer information than audition about biomechanical properties (e.g., Malt et al., 2014), as well as providing information about speed and direction of motion. So, participants in the visual conditions might describe MANNER more often than participants in the audio-only condition. On the other hand, audition is also good at providing temporal information, such as rhythm of motion (e.g., Recanzone, 2003; Repp & Penel, 2002), so it is also possible that auditory information might be as rich as visual information and lead to a comparable MANNER encoding of motion.
Regarding co-speech gesture, there are two main possibilities that can be predicted from the previous literature: either visual input is also advantageous for gesture or there is no impact of modality on gesture production. There are three reasons to expect that the frequency of MANNER and PATH gestures would be higher in the visual conditions than in the auditory condition. First, gestures might be more suited for expressing visual information than auditory information, due to the affordances of the visual modality and the possibility of more easily mapping visuospatial information from vision to gesture (Macuch Silva et al., 2020). For example, signing children use more MANNER and PATH expressions in Turkish Sign Language than their Turkish-speaking peers because of the visually motivated linguistic forms available to sign languages (Sümer & Özyürek, 2022). Second, one might expect gesture to parallel speech patterns (e.g., Kita & Özyürek, 2003; Özyürek et al., 2005), thus leading to more MANNER gestures in the visual conditions than the auditory condition. Finally, PATH gestures might be more difficult to produce with the segmented speech predicted for the auditory condition because gestures are less suited for segmented expressions (Iverson, 1999; Iverson & Goldin-Meadow, 1997), leading to higher rates of PATH gestures in the visual conditions. For all these reasons, visual input may be particularly suited to elicit gestures.
On the other hand, it is possible that there is no difference in the frequency of gesture production between different input conditions. Gesture production theories focusing on the role of mental imagery in gesture, such as the GSA framework, have suggested that "any form of imagery [such as auditory or tactile imagery] that evokes action simulation is likely to be manifested in gesture" (Hostetter & Alibali, 2019, p. 726). This suggests the type of input does not matter for how much gesture is elicited, as long as spatial imagery can be generated. Thus, on this account, participants in all conditions could produce comparable gestures.

Participants
We recruited 90 native Turkish speakers with normal or corrected-to-normal vision from Boğaziçi University. We randomly assigned 30 participants to each of three conditions: audio-only (M = 21 years, SD = 2, 17 female), visual-only (M = 22 years, SD = 3, 16 female, 1 nonbinary), and multimodal (M = 21 years, SD = 2, 10 female, 2 nonbinary). We tested participants in a quiet room on Boğaziçi University campus. They all received extra credit in a psychology course for their participation and provided written informed consent in accordance with the guidelines approved by the IRB committees of Boğaziçi and Radboud Universities.

Stimuli
We made video and audio recordings of locomotion and non-locomotion events with an actress. We created 12 locomotion events by crossing 3 MANNERs (walk, run, and limp) with 4 PATHs (to, from, into, and out of) in relation to a landmark object (door or elevator), such as "someone runs into an elevator". So, participants either only listened to the sound of someone running into an elevator or watched the event with or without the sound. A video recorder and an audio recorder were placed next to the landmark objects. For to and into events, the actress moved towards landmarks, with the PATH direction approaching the audio recorder. For from and out of events, the actress moved away from landmarks, with the PATH direction away from the audio recorder.
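The crossing of MANNER and PATH described above can be written out as a small sketch. The labels come from the text; the function names are illustrative only, and the pairing of each event with a door or an elevator is not specified per event here, so it is left out.

```python
from itertools import product

# Labels taken from the stimulus description; the landmark (door or elevator)
# used for each event is not listed in the text, so it is omitted here.
MANNERS = ["walk", "run", "limp"]
PATHS = ["to", "from", "into", "out of"]


def build_locomotion_events():
    """Cross every MANNER with every PATH (3 x 4 = 12 events)."""
    return [{"manner": m, "path": p} for m, p in product(MANNERS, PATHS)]


def moves_toward_recorder(event):
    """'to' and 'into' events approach the landmark and the audio recorder;
    'from' and 'out of' events move away from them."""
    return event["path"] in ("to", "into")


events = build_locomotion_events()
```

Under this layout, half of the locomotion events approach the recorder and half recede from it, so direction is balanced across the three MANNERs.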
We created 12 non-locomotion events with the same actress performing two-participant "transitive" actions with different objects (e.g., cutting paper, eating an apple), and the video and audio were recorded across from her at a fixed distance. Locomotion events served as the critical items, whereas non-locomotion events were included as fillers. Thus, we did not investigate the non-locomotion events.
There were 24 trials per person, including a total of 12 locomotion (M duration = 11.3 s, SD duration = 3.6) and 12 non-locomotion (M duration = 7.7 s, SD duration = 2.3) events presented in different random orders across participants (see Appendix I for a list of all events and their durations). All stimuli are available at https://osf.io/qe7dz/?view_only=d202c274a186461381c09dc70db6ad39.
The experiment used a between-subjects design with three levels of input modality (audio-only vs. visual-only vs. multimodal).

Procedure
Using a laptop and Presentation Software (Version 20.0, Neurobehavioral Systems, Inc., Berkeley, CA, www.neurobs.com), events were presented as audio-clips to participants in the audio-only condition, as silent video-clips to participants in the visual-only condition, and as video + audio clips to participants in the multimodal condition. All participants regardless of the condition wore headphones during the task. The instructions were the same across the conditions except the opening sentence (i.e., in this task, you will "watch video clips" / "listen to sound clips"). Participants were then asked to describe each event at their own pace without any instructions about gesture use. They were told other participants would watch their descriptions and watch/listen to the same events in order to match descriptions with events.
At the beginning of the experiment, participants performed two practice trials with non-locomotion events. Further clarification was provided, if necessary, after the practice trials. Event descriptions were recorded with a video camera that was approximately 1.5 m across from participants. The experimenter sat across from participants and next to the camera. After each event description, participants proceeded with the next trial at their own pace by pressing a button on the laptop. Participants also filled out a demographic questionnaire on another laptop after the event description task. The experiment lasted around 15 min.

Speech
All descriptions of locomotion and non-locomotion events were annotated by two native Turkish speakers using ELAN (Wittenburg et al., 2006), but only descriptions for the locomotion events were transcribed and coded. These descriptions were then split into clauses. A clause was defined as a verb and its associated arguments or a verb with gerund phrases. Clauses including locomotion descriptions (e.g., someone is walking towards the door) were coded as relevant, whereas clauses including a transitive event (such as opening a door or ringing the bell) or other information (such as a person is wearing high heels) were coded as irrelevant to the target event. Each relevant clause was coded according to the type of information it contained: (a) PATH (trajectory of motion), and (b) MANNER (how the action is performed); see Table 1 for an example. We calculated the Intraclass Correlation Coefficient (ICC) between two coders to measure the strength of inter-coder agreement for PATH and MANNER of motion in speech (Koo & Li, 2016). Agreement between coders was .97 for PATH and .94 for MANNER of motion.
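To make the agreement measure concrete, the following is a minimal sketch of one common ICC form computed from a two-way ANOVA decomposition. The text does not state which ICC form was used, so the choice of ICC(2,1) (two-way random effects, absolute agreement, single rater) is an assumption, and the function name and toy counts are hypothetical rather than the study's data or code.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is a list of rows, one per coded item, each row holding one
    score per coder. Assumed form; the study does not specify which ICC it used.
    """
    n = len(ratings)          # number of coded items
    k = len(ratings[0])       # number of coders
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares for items (rows), coders (columns), and residual error.
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical PATH counts per description from two coders (not study data).
toy_counts = [[2, 2], [1, 1], [3, 2], [0, 0], [1, 1], [2, 3]]
toy_icc = icc_2_1(toy_counts)
```

Because this form penalises systematic differences between coders as well as random disagreement, it is stricter than a consistency-only ICC; a different form would give somewhat different values for the same codes.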

Gesture
Participants' spontaneous iconic gestures were coded for each target motion event description. We coded gesture strokes (i.e., the meaningful phase of a gesture) that co-occurred with descriptions (Kita, 2000). Each continuous instance of hand movement was coded as a single gesture. Iconic gestures representing trajectory and/or MANNER of motion were further classified into the following categories (see Figure 1 for gesture examples):
(a) PATH-only gestures depict trajectory of movement without representing MANNER;
(b) MANNER-only gestures show the style of movement without representing trajectory;
(c) PATH + MANNER gestures depict both trajectory and MANNER of movement simultaneously.
We calculated the ICC between two coders to measure the strength of inter-coder agreement for identifying a gesture and coding each type of gesture. Agreement between coders was .98 for identifying gestures and between .92 and .95 for type of gesture (i.e., .95 for coding PATH only, .92 for MANNER only, and .95 for PATH + MANNER gestures).

Results
To analyse the data, we used linear mixed-effects regression models (Baayen et al., 2008) with random intercepts for participants, items, path type, and manner type, using the packages lme4 (Version 1.1-28; Bates et al., 2015) with the optimiser nloptwrap and lmerTest (Version 3.1-3; Kuznetsova et al., 2017) to retrieve p-values in R (Version 4.1.3; R Core Team, 2022). We conducted linear mixed-effects models on distinct motion elements (PATH and MANNER) in speech and gesture. To assess statistical significance of the fixed factors and their interaction, we used likelihood-ratio tests with χ², comparing models with and without the factors and interaction of interest. For post-hoc comparisons and to follow up interactions, we used emmeans (Version 1.7.3; Lenth, 2022). Data and analysis code are available at https://osf.io/qe7dz/?view_only=d202c274a186461381c09dc70db6ad39.
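The logic of the likelihood-ratio tests described above can be illustrated with a short sketch. The analysis itself was run in R; the Python below is only an arithmetic illustration, the function names are hypothetical, and the closed-form tail probability applies to even degrees of freedom only.

```python
import math


def lrt_statistic(loglik_full, loglik_reduced):
    """Likelihood-ratio statistic: twice the gain in log-likelihood of the
    model with the term of interest over the model without it."""
    return 2.0 * (loglik_full - loglik_reduced)


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability P(X >= x).

    Uses the closed form exp(-x/2) * sum_{j < df/2} (x/2)^j / j!, which is
    valid for positive even df only.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j      # builds (x/2)^j / j! incrementally
        total += term
    return math.exp(-half) * total
```

As a check against values reported in the Results, `chi2_sf_even_df(7.77, 2)` is approximately .021 and `chi2_sf_even_df(9.29, 4)` is approximately .054, matching the p-values given for those tests.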

Speech
Overall differences in the amount of speech produced for visual and auditory motion events
First, we tested whether participants differed in the speech they produced for motion events based on audio-only, visual-only, or multimodal input. Table 2 provides the descriptive statistics for the average number of all clauses, motion event clauses, all gestures, and relevant gestures.
We ran a glmer model with the fixed effect of input modality (audio-only, visual-only, multimodal), the fixed effect of manner type (walk, run, limp), and their interaction term on binary values for mention of motion event clauses in speech (0 = no, 1 = yes) as the dependent variable. See Appendix II for the model summary table. The model revealed an effect of input modality, χ²(2) = 42.43, p < .001, R² = .042. Participants in the audio-only condition produced fewer motion event descriptions than participants in both the visual-only (β = −1.07, SE = .170, z = −6.32, p < .001, R² = .031) and multimodal (β = −1.07, SE = .170, z = −6.29, p < .001, R² = .031) conditions. There was no difference between participants in the visual-only and multimodal conditions (β = .006, SE = .178, z = .032, p = .99). Figure 2 shows the ratio of motion event descriptions (i.e., clauses including locomotion descriptions) in all descriptions.
The model also revealed an effect of manner type, χ²(2) = 7.77, p = .021, R² = .002. Participants produced more motion event descriptions for the run than for the limp events (β = 0.29, SE = .102, z = 2.83, p = .013, R² = .002). However, there was no difference between the walk and limp events (β = 0.09, SE = .102, z = .91, p = .63) or between the run and walk events (β = 0.20, SE = .107, z = 1.82, p = .16) in terms of motion event descriptions. The model did not reveal a significant interaction between input modality and manner type, χ²(4) = 9.46, p = .051.

Differences in reference to PATH and MANNER in speech
Next, we examined whether participants differed in how much they expressed PATH and MANNER in speech. To account for baseline differences in the number of motion event descriptions produced, we calculated the ratio of mention of PATH and MANNER per motion event description for each participant and item. We ran a lmer model with the fixed factors of input modality (audio-only, visual-only, multimodal) and type of description (PATH vs. MANNER) and their interaction term using the ratio of mention of PATH and MANNER per motion event description as the dependent variable (see Figure 3). The model revealed no fixed effect of input modality, χ²(2) = 1.37, p = .50, but a fixed effect of type of description, χ²(1) = 15.95, p < .001, R² = .008, showing that MANNER was mentioned more than PATH in speech. However, the model also revealed an interaction between input modality and type of description, χ²(2) = 31.25, p < .001, R² = .023. To follow up the interaction, we first used the emmeans function to compare the use of PATH vs. MANNER within each group. There was more mention of MANNER than PATH in the visual-only (β = .141, SE = .028, t = 5.03, p < .001) and multimodal conditions (β = .115, SE = .028, t = 4.11, p < .001), but more reference to PATH than MANNER in the audio-only condition (β = .068, SE = .029, t = 2.33, p = .020).
That is, MANNER and PATH were differentially salient in the visual versus auditory conditions. Second, to follow up the interaction effect, we also compared reference to MANNER and PATH separately across input modalities. PATH was mentioned more often in the audio-only than visual-only (β = .090, SE = .029, t = 3.15, p = .005) and multimodal (β = .101, SE = .029, t = 3.51, p = .002) conditions. Conversely, MANNER was mentioned less often in the audio-only than visual-only (β = −.12, SE = .029, t = −4.15, p < .001) and multimodal (β = −.08, SE = .029, t = −2.89, p = .011) conditions. There was no difference between the visual-only and multimodal conditions for references to PATH (β = .010, SE = .028, t = 0.36, p = .93) or MANNER (β = .036, SE = .028, t = 1.29, p = .41). See Appendix III for the summary of post-hoc comparisons with emmeans.

Gesture
Overall differences in the amount of gesture produced for visual and auditory motion events
We investigated whether participants differed in how much they gestured about different elements of motion events based on input modality (see Table 2 for the descriptive statistics). Because the amount of gesture changes as a function of the rate of motion event descriptions, we first calculated the gesture ratio per motion event description. We compared the groups in terms of their overall gesture ratio using a one-way between-participants ANOVA. There was no significant difference in the gesture ratio between participants in the audio-only

Differences in PATH and MANNER gestures
To investigate the type of iconic gestures participants produced, we again calculated the ratio of PATH only, MANNER only, and PATH + MANNER conflated gestures per motion event description for each participant and item. For these calculations, total counts of PATH only, MANNER only, and PATH + MANNER gestures were divided by the number of motion event descriptions for each trial. The data were analysed in the same way as for speech. We ran a lmer model with fixed factors of input modality (audio-only, visual-only, and multimodal) and type of description (PATH-only, MANNER-only, and PATH + MANNER) using the ratio of PATH and MANNER gestures per motion event description as the dependent variable (see Figure 5). The model revealed a fixed effect of type of description, χ²(2) = 531.82, p < .001, R² = .156. Regardless of input modality, speakers produced more PATH-only gestures than MANNER-only (β = .230, SE = .011, z = 20.59, p < .001, R² = .107) and PATH + MANNER gestures (β = .236, SE = .011, z = 21.14, p < .001, R² = .113). There was no difference between MANNER-only and PATH + MANNER gestures (β = .006, SE = .011, z = .55, p = .85). The model revealed no fixed effect of input modality, χ²(2) = 3.64, p = .16, and no significant interaction between input modality and type of description on PATH and MANNER gestures, χ²(4) = 9.29, p = .054. See Appendix IV for the model summary table.
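The per-trial normalisation described above (gesture counts divided by the number of motion event descriptions) can be sketched as follows. The function name, dictionary keys, and the example trial are hypothetical, and returning None for trials without any motion event description is an assumption, since the text does not say how such trials were treated.

```python
def gesture_ratios(counts, n_descriptions):
    """Divide each gesture-type count by the number of motion event
    descriptions produced in the same trial.

    Returns None when the trial contains no motion event description;
    this handling is an assumption, not taken from the study.
    """
    if n_descriptions == 0:
        return None
    return {
        gesture_type: count / n_descriptions
        for gesture_type, count in counts.items()
    }


# Hypothetical trial: two motion event clauses and one PATH-only gesture.
trial = {"path_only": 1, "manner_only": 0, "path_manner": 0}
ratios = gesture_ratios(trial, 2)
```

Normalising in this way keeps the gesture measure from simply tracking how much a participant said about motion, which matters here because the audio-only group produced fewer motion event descriptions.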

Discussion
Our goal was to investigate whether sensory modality of input influences the multimodal linguistic encoding of spatial information in motion events in speech and cospeech gesture. To determine this, we first examined the quantity of motion event descriptions in speech to establish whether the dominance of vision shown in perception lexicons (e.g., San Roque et al., 2015;Winter et al., 2018) is reflected in the linguistic encoding of motion events under experimental conditions. We found speakers produced more motion event descriptions when they watched events-either multimodal or visual-only-in comparison to when they only listened to events, i.e., audio-only. So, speakers provide richer linguistic information about spatial components of motion events when visual information is available. There was no difference in the amount of motion event descriptions between the visual-only and multimodal conditions, which suggests having auditory input on top of visual input does not further enrich speakers' motion event descriptions. These findings support the proposal that vision dominates in language, extending it to the domain of motion events.
There was, however, a qualitative difference in the verbal expressions of different spatial aspects of motion drawn from visual vs. auditory input. Speakers within the visual conditions mentioned MANNER more than PATH of motion, whereas speakers within the auditory condition mentioned PATH more often than MANNER. In addition, in the audio-only condition speakers mentioned PATH more often than they did in the visual conditions. This finding is in line with earlier studies of space showing non-visual input at encoding might lead to segmented PATH descriptions when describing routes (e.g., Iverson, 1999;Iverson & Goldin-Meadow, 1997). This might arise from the fact that non-visual spatial information is represented sequentially in contrast to holistic visual information. It is also possible that auditory input foregrounded PATH more than MANNER because information about MANNER of motion is less accessible without visual information. Although audition can provide high temporal acuity to differentiate rhythmic changes of movements (e.g., Recanzone, 2003;Repp & Penel, 2002), it might not provide detailed information to differentiate MANNERs of motion to the same degree as vision (Malt et al., 2014). On the other hand, we used only three simple MANNERs-i.e., walk, run, and limp, which may have been difficult to discriminate between based on auditory input alone. Our findings showed that participants, regardless of the condition, had more difficulty describing the limp than run events. A study using a more diverse set of MANNERs could better test the affordances of audition vs. vision.
Interestingly, Turkish speakers in the visual conditions mentioned MANNER more often than PATH in their speech. Considering the typology of Turkish, this is notable since Turkish speakers might be expected to omit MANNER more often in motion event descriptions (e.g., Kita & Özyürek, 2003;Özçalışkan et al., 2016;Slobin, 1996;Talmy, 1985). Our findings suggest there may be universal processes at work, such that vision always provides more detailed information about MANNER of motion than audition, and therefore MANNER of motion might be more salient in visual input, even in a PATH language like Turkish. This suggests the sensory modality of input could influence speakers' encoding of spatial event components independently of the well-established tendencies of speaking a particular language (e.g., Slobin, 1996;Talmy, 1985). Future cross-linguistic studies could tease apart these possibilities systematically.
Although the finding that speakers in the visual conditions mentioned MANNER more than PATH seems discrepant with the usual typological patterns, we are not the first to report a reversed speech pattern in Turkish (e.g., Allen et al., 2007;Ter Bekke et al., 2022). Recently, Ter Bekke et al. (2022) also found that Turkish speakers used more MANNER than PATH when describing motion events presented as silent videos. To explain their findings, they highlighted the fact that they used salient MANNERs (such as tiptoe, twirl, and hop) that are not "default" ways of changing location. Yet, this explanation does not hold for our findings, as the MANNERs in our study (i.e., walk, run, and limp) were not particularly salient. Alternatively, Allen et al. (2007) claimed that Turkish speakers are more likely to omit MANNER in larger discourse and when it does not simultaneously occur with PATH in motion events, as used in earlier studies. When MANNER and PATH are simultaneously present in motion events, as in the present study, Turkish speakers mention both elements in their event descriptions. Further studies should examine whether the saliency of MANNER or the ease of expression modulates linguistic encoding of MANNER, particularly in PATH-dominant languages (i.e., verb-framed languages; Talmy, 1985).
For gesture, we predicted that gesture frequency for both PATH and MANNER might decrease in the audio-only condition compared to the visual conditions because of the affordances of the visual modality. Due to the available mapping between gesture and vision (Macuch Silva et al., 2020), gesture production might be easier in the visual conditions than the audio-only condition. However, this was not the case in the present study. We found auditory input alone can elicit similar gesture frequency and gesture types (PATH and MANNER) as visual input. This suggests auditory input can lead to spatial imagery just as visual input does, as explicitly claimed by Hostetter and Alibali (2019). In line with this, Holler et al. (2022) found speakers produce spontaneous co-speech gesture depicting metaphorical spatial features of auditory pitch when describing sounds (e.g., producing a gesture higher in space to depict high pitch notes). Thus, our results support the argument that auditory information can also elicit gesture if it triggers spatial imagery.
Unexpectedly, the difference between PATH and MANNER expressions across input modalities found in speech was not reflected in co-speech gesture. Based on prior work (e.g., Iverson, 1999;Iverson & Goldin-Meadow, 1997), if speech for PATH is segmented, it may be ill-suited for PATH gesture, and consequently gesture frequency for PATH may decrease. Contrary to this, we found that although participants in the audio-only condition segmented PATH of motion more (i.e., made more reference to PATH in speech) than participants in the visual conditions, the frequency of their PATH gestures did not differ from those produced in the visual conditions. This discrepancy between our results and earlier findings could arise from the fact that these events only had single PATHs. So, although speech for PATH was segmented into smaller units, the amount of segmentation possible might be diminished since we are dealing with smaller-scale PATHs, as in our motion events, compared to larger-scale route descriptions with multiple PATHs. Indeed, Iverson (1999) showed segmentation in PATH descriptions decreases with the diminishing size of a spatial layout.
We found the same discrepancy between speech and gesture for MANNER. Even though speakers in the visual-only and multimodal conditions mentioned MANNER more often in speech, there was no increase in the frequency of MANNER gestures. Regardless of the sensory modality of input, speakers produced more PATH-only gestures than MANNER gestures, including PATH + MANNER, even in cases where they mentioned both PATH and MANNER in speech. One might hypothesise that expressing manner in speech was easier than in gesture, and participants might have chosen the modality strategically to avoid confusion for potential addressees who, according to our instructions, would go on to match descriptions to motion events. However, we think this is unlikely since earlier gesture studies of Turkish find that Turkish speakers typically gesture more about path than manner of motion (Aktan-Erciyes et al., 2022;Mamus et al., under review;Özyürek et al., 2005;Ünal et al., 2022;although see Ter Bekke et al., 2022). So, the few manner gestures observed in our study fit the broader language typology (e.g., Akhavan et al., 2017;Chui, 2009;Gullberg et al., 2008).
Taken together, our findings are more in line with predictions that language typology is the determining factor in gesture production (e.g., Özçalışkan et al., 2016, 2018) and that gestures are mostly shaped by language typology during speaking (e.g., Kita & Özyürek, 2003;Özyürek et al., 2005;Slobin, 1996) rather than sensory input. The discrepancy between our speech and gesture findings also suggests that even though speech affects gesture through language typology, gesture does not solely depend on speech. This is contrary to the suggestions of some theories (e.g., Sketch Model, de Ruiter, 2000; Lexical Retrieval Hypothesis, Krauss et al., 2000;Growth Point Theory, McNeill, 1992), but consistent with the proposal that speech and gesture are independent, yet highly interactive systems (e.g., Gesture as Simulated Action Framework, Hostetter & Alibali, 2008; Gesture-for-Conceptualization Hypothesis, Kita et al., 2017).
Although our results imply the sensory modality of input does not affect the gesture of Turkish speakers, results may differ for a satellite-framed language that encodes MANNER in the main verb (such as English) or an equipollently-framed language (such as Mandarin Chinese; e.g., Brown & Chen, 2013). As MANNER is usually encoded in speech and co-speech gesture in such languages, the affordances of auditory vs. visual input might be more observable in gestural expressions of MANNER (e.g., auditory input may lead to fewer MANNER gestures than visual input). A cross-linguistic investigation is necessary to better understand whether and how co-speech gesture is influenced by the interaction of sensory modality of input and language typology.

Conclusion
The present study examined the role of sensory modality of input on the linguistic expression of motion event components in both speech and co-speech gesture and found they pattern in distinct ways. In comparison to the auditory modality, the visual modality appears to foreground MANNER more than PATH in speech, but gestures are generated similarly regardless of the sensory modality of input. These findings suggest the sensory modality of input influences speakers' encoding of PATH and MANNER of motion events in speech, but not in gesture.