CC BY 4.0 license, Open Access. Published by De Gruyter, November 16, 2023

Reaching beneath the tip of the iceberg: A guide to the Freiburg Multimodal Interaction Corpus

  • Christoph Rühlemann and Alexander Ptak
From the journal Open Linguistics

Abstract

Most corpora tacitly subscribe to a speech-only view filtering out anything that is not a ‘word’ and transcribing the spoken language merely orthographically despite the fact that the “speech-only view on language is fundamentally incomplete” (Kok 2017, 2) due to the deep intertwining of the verbal, vocal, and kinesic modalities (Levinson and Holler 2014). This article introduces the Freiburg Multimodal Interaction Corpus (FreMIC), a multimodal and interactional corpus of unscripted conversation in English currently under construction. At the time of writing, FreMIC comprises (i) c. 29 h of video-recordings transcribed and annotated in detail and (ii) automatically (and manually) generated multimodal data. All conversations are transcribed in ELAN both orthographically and using Jeffersonian conventions to render verbal content and interactionally relevant details of sequencing (e.g. overlap, latching), temporal aspects (pauses, acceleration/deceleration), phonological aspects (e.g. intensity, pitch, stretching, truncation, voice quality), and laughter. Moreover, the orthographic transcriptions are exhaustively PoS-tagged using the CLAWS web tagger (Garside and Smith 1997). ELAN-based transcriptions also provide exhaustive annotations of re-enactments (also referred to as (free) direct speech, constructed dialogue, etc.) as well as silent gestures (meaningful gestures that occur without accompanying speech). The multimodal data are derived from psychophysiological measurements and eye tracking. The psychophysiological measurements include, inter alia, electrodermal activity or GSR, which is indicative of emotional arousal (e.g. Peräkylä et al. 2015). Eye tracking produces data of two kinds: gaze direction and pupil size. In FreMIC, gazes are automatically recorded using the area-of-interest technology. Gaze direction is interactionally key, for example, in turn-taking (e.g. Auer 2021) and re-enactments (e.g. Pfeiffer and Weiss 2022), while changes in pupil size provide a window onto cognitive intensity (e.g. Barthel and Sauppe 2019). To demonstrate what opportunities FreMIC’s (combination of) transcriptions, annotations, and multimodal data open up for research in Interactional (Corpus) Linguistics, this article reports on interim results derived from work-in-progress.

1 Introduction

While researchers in Interactional Linguistics and related fields commonly refer to collections of related transcripts as ‘corpora’, the concept underlying the use of the word ‘corpus’ in Corpus Linguistics requires that the collection be (i) computer searchable, (ii) large, (iii) annotated, and (iv) representative – that is, including “the full range of variability in a population” (Biber 1993, 243). The corpus we are going to describe in this article aims at least to approximate such a ‘corpus’: it is fully computer-searchable, not (yet) large when it comes to its word count but decidedly large as far as multimodal data are concerned, and annotated on multiple levels. As far as representativeness is concerned, ‘the full range of variability’ can never be established with complete confidence. Pursuing representativeness thus resembles pursuing the holy grail – an ideal that will never be reached (Leech 2007). Our aims regarding representativeness are further humbled by the fact that the recruits were overwhelmingly drawn from the student population of Freiburg University (Section 2). The Freiburg Multimodal Interaction Corpus (FreMIC), then, at best reflects (to avoid the term ‘represent’) talk-in-interaction not in English conversation in general, but in English conversation among university-educated 20- to 30-year-olds.

FreMIC thus constitutes a corpus-linguistic corpus – but one which essentially diverges from most available corpora. Most corpora tacitly subscribe to a speech-only view filtering out anything that is not a ‘word’ and transcribing the spoken language merely orthographically. This monomodal view on language is fundamentally incomplete (Kok 2017, 2, Kelly 2001, 326–7) as speech is just “the tip of an iceberg riding on a deep infrastructure” (Levinson and Holler 2014, 2) of non-verbal modalities.

Three main modalities are distinguished: verbal (speech), vocal (prosody), and kinesic (gesture, facial expressions, proxemics, posture, gaze) (Arndt and Janney 1987). The three modalities work as an integrated multi-modal communication system (Levinson and Holler 2014, 5). Indeed, evolution has equipped the human body optimally for multimodal communication, where bipedal locomotion has freed the hands for gesturing, and the white sclera renders gaze direction easily detectable (Holler and Levinson 2019, 9). Humans have a rich nonverbal toolbox at their disposal: there are 43 facial muscles and 34 muscles per hand; at the same time, the relative lack of bodily hair ensures that even small muscular movements, e.g. squints and blinks, are visible and communicatively effective (Holler and Levinson 2019, 9).

The three modalities form an ‘evolutionarily stratified system’ (Levinson and Holler 2014, 1) that has evolved over two and a half million years (Levinson and Holler 2014, 6). Language gradually co-evolved with the pre-existing gestural and vocal modes of communication over almost a million years, leading to an intricate intertwining of the modalities (Levinson and Holler 2014, 5). Due to this intertwining, the non-verbal modalities are never fully repressed: deaf signers mouth and vocalize, and we gesture on the telephone although our gestures are not visible to the addressee (Beattie 2016). Intertwining facilitates crossmodality (Arndt and Janney 1987), i.e. we can easily shift the burden of information from one modality to another (cf. Levinson and Holler 2014, 1). Far from merely adorning speech, the non-verbal modalities form a constitutive part of the overall “psychology of speaking, along with, and not fundamentally different from, speech itself” (McNeill 1985, 351). Multimodal utterances are thus holistically planned (by the speaker) and their meanings are holistically derived (by the recipient) based on “gestalt-like principles – that is, fast integration of stimuli that ‘make sense’ together and are recognized as holistic percepts” (Holler and Levinson 2019, 641). In the field of conversation analysis (CA), while early analyses were restricted to audio tapes, pioneers such as Goodwin introduced video recordings in the 1980s, noting that conversationalists’ “access to each other’s bodies provides a resource for the display of meaning” (Goodwin 1986a, 29). Since then, CA analyses of video recordings have become mainstream, so much so that Mondada speaks of a “‘visual’, ‘embodied’, or ‘multimodal’ turn” (Mondada 2018, 86). Multimodality is now recognized as constitutive of face-to-face interaction, which operates on a wealth of semiotic resources that are combined in various configurations and form multimodal holistic articulations (‘gestalts’).

Obviously, dealing with multimodal communication poses serious methodological challenges – e.g. the number of variables to take into account is awe-inspiring (cf. Rühlemann 2022) and connected multimodal signals are often temporally disaligned (Holler and Levinson 2019). It is therefore not surprising that only a small number of multimodal corpus compilation projects have taken up the challenge to date.

To start with, a small group of corpora allows research into prosodic patterning. These corpora include, for example, the London-Lund Corpus of Spoken English (Svartvik 1990), the Santa Barbara Corpus of Spoken American English (Du Bois et al. 2000–2005), the ‘Systems of Pragmatic annotation for the spoken component of ICE-Ireland’ (SPICE-Ireland) (Kallen and Kirk 2012), and the Switchboard Corpus (Calhoun et al. 2010), which is perhaps the most widely used in Interactional Linguistics (e.g. Roberts et al. 2015). Another group of corpora allows research into the kinesics–speech interface. For example, the Bielefeld Speech and Gesture Alignment Corpus (SaGA; Lücking et al. 2013) contains just 40,000 words but roughly ‘six thousand gesture units’ (Kok 2017, 4) (which is, considering the complexity of gesture annotation, very large). The Nottingham Multimodal Corpus contains 250,000 words sourced from lectures and supervision sessions at the University of Nottingham (Adolphs and Carter 2013), of which a subset was coded for the co-occurrence of backchannels and head nods. A small number of corpora have been created that investigate online communication multimodally. For example, the Interactional Variation Online project (http://ivohub.com) (Knight et al., Forthcoming, 2023) facilitates multimodal corpus analysis of virtual workplace communication. The VAPVISIO project (Holt et al. 2021, Cappellini et al. 2023) comprises a multimodal and multilingual (French, English, and Mandarin Chinese) corpus of second language learning/teaching through desktop videoconferencing tools, based on 64 h of recordings with audio, video (recording torso, hands, and face), and eye-tracking data. The MIDI group at KU Leuven collected several corpora, of which two are highly relevant: the ALILEGRA corpus and the Insight Interaction corpus. The ALILEGRA corpus (Alignment Leuven Graz) contains transcribed videos of spontaneous dyadic conversations plus physiological data, such as participants’ heart rate and blood pressure (cf. Jehoul et al. 2017). The Insight Interaction corpus also contains transcribed videos of dyadic interactions (experimentally controlled and spontaneous conversations) (Brône and Oben 2015). It differs from the ALILEGRA corpus in that it comes without physiological data but with eye-tracking and gesture data. Corpora within the Nordic NOMCO project (Paggio et al. 2010) can also be used to study multimodal communication (feedback, turn management, and sequencing) in several social activities in Swedish, Danish, Finnish, and Estonian. The annotation scheme opens up avenues for the analysis of head movements, facial expressions, body posture, and hand gestures. Another multimodal corpus with a diverse dataset is the MULTISIMO corpus (Koutsombogera and Vogel 2018). It targets the investigation of task-based multi-party collaboration by analysing verbal and non-verbal signals. These serve as speaker cues and involve speech, acoustic, visual, lexical, and perceptual data, as well as survey and demographic data. Finally, the Corpus of Academic Spoken English, a corpus of Skype conversations between speakers of English as a Lingua Franca, contains about 2 million words (Diemer et al. 2016).

In the following sections, a new addition to this small group of multimodal corpora will be presented – the Freiburg Multimodal Interaction Corpus (FreMIC). To demonstrate what opportunities FreMIC’s (combination of) transcriptions, annotations, and multimodal data open up for research in Interactional (Corpus) Linguistics, this article also reports on interim results derived from a number of case studies.

2 FreMIC

FreMIC is an emerging corpus constructed at the University of Freiburg based on a grant from the Deutsche Forschungsgemeinschaft.[1] It is a multimodal corpus of face-to-face interaction in unscripted dyadic and triadic conversation. As of now, it comprises c. 29 h of video-recordings; the final aim is for at least 40 h. The participants (n = 41 at the time of writing) signed consent forms determining their individual choices as to which of their data can be used and in what contexts. For the recordings, they were seated in an F-formation (Kendon 1973) enabling them to establish eye contact, hear each other clearly, and exchange nonverbal cues: participants in dyads were seated vis-à-vis each other (with the room camera capturing both participants from the side), whereas the seating arrangement in triads was an equilateral triangle (with the room camera frontally capturing the participant sitting at the triangle apex). The participants were instructed to talk freely about whatever they liked. As shown in Table 1, the participants are drawn from the same demographic layer that most convenience samples are drawn from (Robinson 2007), namely, students at the local university as well as their friends and relatives.

Table 1

Descriptive statistics of the participants’ demographics in FreMIC

                          N      %
Gender
  Male                   17     41
  Female                 21     51
  Diverse                 2      5
  NA                      1      2
Age range
  20–25                  22     54
  26–31                  16     39
  >32                     3      7
  NA                      0      0
L1 variety
  BrE                     6     15
  AmE                    24     59
  Others                 11     27
  NA                      0      0
Educational background
  BA                     25     61
  MA                      3      7
  Others                 11     27
  NA                      2      5

The data assembled in FreMIC consist of three major layers:

  1. Transcribed and annotated data in ELAN,

  2. Automatically generated multimodal and psychophysiological data,

  3. Turns component.

The three layers will be described in what follows.

2.1 Transcribed and annotated data in ELAN

2.1.1 Inter-pausal units (IPUs)

The conversational data are transcribed and annotated in ELAN (Wittenburg et al. 2006). Transcriptions are organized around IPUs; i.e. we separate annotations when a speaker pauses for more than 180 ms. This threshold reflects the human threshold for the detection of acoustic silences, which lies between 120 and 200 ms (Heldner 2011, Walker and Trimboli 1982), and it facilitates comparability with Levinson and Torreira (2015) and Roberts et al. (2015). The onsets and offsets of the IPUs were determined through visual and auditory inspection of waveforms during transcription in ELAN’s annotation mode. So far, we have been using ELAN mainly in annotation mode, after additionally trying out segmentation mode in the beginning phase of the corpus construction. We found that the annotation mode provides the most flexibility to create annotations, add their content, and align them with the displayed waveform of the audio signal. ELAN’s audio resolution goes down to 1 ms, and the video time resolution is c. 33 ms in our case (30 fps of split-screen video). We do make use of the different playback options such as slow motion, frame by frame, loop mode, etc. – especially for gesture annotation (Section 2.1.4). Furthermore, automatic pause measurement between annotations (IPUs) can be done in ELAN with a few clicks.
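The same pause measurement can also be reproduced outside ELAN once the annotations have been exported. The following is a minimal sketch, assuming a data frame with hypothetical columns File, Speaker, Starttime_ms, and Endtime_ms of the kind described in Section 2.1.8:

```r
# Minimal sketch (assumed column names): gaps between successive IPUs of the
# same speaker, computed from exported start/end times rather than in ELAN.
library(dplyr)

compute_pauses <- function(ipus) {
  ipus |>
    arrange(File, Speaker, Starttime_ms) |>
    group_by(File, Speaker) |>
    mutate(pause_ms = Starttime_ms - lag(Endtime_ms)) |>  # NA for a speaker's first IPU
    ungroup()
}
```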

2.1.2 Types of transcription and error checks

Two types of transcriptions are used: conversations are transcribed orthographically (by native speakers of English) and using Jeffersonian conventions (e.g. Jefferson 2004). This latter transcription is referred to as CA transcription and records not only verbal content but also interactionally relevant details of sequencing (e.g. overlap, latching), temporal aspects (pauses, acceleration/deceleration), phonological aspects (e.g. intensity, pitch, stretching, truncation, voice quality), and laughter and free-standing (silent) gestures (see below). Departing from CA conventions, we express emphasis using the !…! convention (as italicization is not supported in ELAN) and also annotate what we refer to as the ‘scale’, or ‘Tonleiter’, phenomenon, i.e. a continuous upward or downward glissando in pitch. The conventions used are listed in Figure 1.

Figure 1: Key to CA-style transcription conventions.

The amount of detail in the CA transcriptions is considerable and, as a result, presents a serious source of potential error. There are two specific problem areas: (i) illegal characters and (ii) missing legal (special) characters. While illegal characters are easy to detect and remove/replace, missing legal characters require more work. Two types can be distinguished:

  1. Type A error – missing mirror delimiters, i.e. delimiters that are unambiguously either opening or closing, e.g. [word] for overlapped speech, (word) for candidate hearing, (1.1) or (.) for pause, ((word)) for comment, etc.

  2. Type B error – missing twin delimiters, i.e. delimiters that can be used for both opening and closing annotations, e.g. °word° for quiet/hushed speech, !word! for emphasized speech, ≈word≈ for tremulous voice, etc.

To weed out Type A and B errors, automatic error checks were devised in R for each potential error type and each transcription convention, largely based on Regular Expressions, to detect utterances with missing opening or closing characters; once detected, the errors were manually corrected in the ELAN files.
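The checks can be thought of along the following lines (a minimal sketch, not the project’s actual scripts; the function name is ours): Type A errors surface as unequal counts of opening and closing mirror delimiters, Type B errors as odd counts of twin delimiters.

```r
# Minimal sketch of Regular-Expression-based delimiter checks (illustrative only).
library(stringr)

flag_delimiter_errors <- function(utterance) {
  # Type A: mirror delimiters must occur in matching opening/closing counts
  type_a <- str_count(utterance, fixed("[")) != str_count(utterance, fixed("]")) |
            str_count(utterance, fixed("(")) != str_count(utterance, fixed(")"))
  # Type B: twin delimiters must occur an even number of times
  type_b <- str_count(utterance, fixed("°")) %% 2 != 0 |
            str_count(utterance, fixed("!")) %% 2 != 0 |
            str_count(utterance, fixed("≈")) %% 2 != 0
  tibble::tibble(utterance, type_a, type_b)
}

flag_delimiter_errors("°yeah my [French isn't good enough")  # flags both error types
```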

2.1.3 Transcription of quotes

Quotation, also referred to as (free) direct speech (e.g. Labov 1972), constructed dialog (e.g. Tannen 1986), or re-enactment (e.g. Holt 2007), is a recurrent feature of conversational language. It is particularly frequent in conversational storytelling (e.g. Rühlemann 2013), where it often occurs at or around the story climax (e.g. Li 1986, Mayes 1990, Mathis and Yule 1994, Rühlemann 2013). In FreMIC, re-enactments are exhaustively transcribed as well as annotated on a separate tier to obtain exact temporal data for them.

At the time of writing, FreMIC contains 1,135 quotes. Consider extract (1): Speaker A’s utterance is transcribed in full, whereas the two quotes she uses are separated out on quote tiers (signified by the ‘-Q’ suffix to the ‘Speaker’ Id):

(1)

Speaker Utterance Timestamp
ID08.A and then ↑re↑cently I’ve been thinking about studying a!gain! because my life is this vacuum [like ∼I nee:d] £S(H)OMEthing£∼ = [(.) so] I was looking at like courses in Strasbourg and the whole time I was thinking ∼°↑yeah my French isn’t good↑° enough∼ and at some point someone was like 00:21:28.566 – 00:21:39.922
ID08.B [((v: laughs))] 00:21:31.871 – 00:21:32.661
ID08.A-Q ∼I nee:d] £S(H)OMEthing£∼ 00:21:32.281 – 00:21:33.216
ID08.B [yeah] 00:21:33.371 – 00:21:33.781
ID08.A-Q ∼°↑yeah my French isn’t good↑° enough∼ 00:21:37.216 – 00:21:39.008

The attention paid to re-enactments suits a multimodal corpus well in that such re-enactments are known to attract heightened activation in the vocal and bodily channels by the speaker (e.g. Holt 2007, Blackwell et al. 2015, Stec et al. 2016, Soulaimani 2018) and are a prime means by which speakers express stance and emotion (e.g. Stivers 2008, Rühlemann 2022).

2.1.4 Transcription of silent gestures

A first multimodal element in the ELAN transcription concerns what we call ‘silent gestures’, that is, meaningful (typically iconic) gestures that are embedded in the flow of talk but occur without accompanying speech, i.e. they are not co-speech gestures (cf. Hsu et al. 2021). Silent gestures are exhaustively recorded in both the orthographic and the CA transcriptions. While in the former the description is merely cursory, the description in the latter is comprehensive, indicating the articulating organ(s) (m: hand, f: face, t: torso, etc.) and providing a concise description of the gesture. At present, FreMIC contains c. 300 such silent gestures (virtually always by the current speaker). To illustrate, consider Speaker B’s silent gesture in extract (2). She is telling about an incident where she was followed by a stranger in the night in a small French city (Metz):

(2)

Speaker Utterance Timestamp
ID01.B you know the dark 00:43:16.823 – 00:43:17.893
NA (0.250) 00:43:17.893 – 00:43:18.143
ID01.C I me[an] 00:43:18.143 – 00:43:18.633
ID01.B ((silent m: b hands raised to chest height palms facing each other, zigzags with b hands before returning to centre of chest and tilting b hands forward slightly)) [narrow] streets [of Metz] 00:43:18.400 – 00:43:20.013

Speaker B verbally describes the streets she was followed through as ‘dark’ and ‘narrow’. In the context of her story, dark and narrow streets are threatening because eyesight is poor, and others can hardly see (and possibly help) you and you cannot easily escape an attacker. The silent gesture – a hand movement whose key component is a zigzagging of both hands shown in Figure 2 – adds another quality to the streets not referred to on the verbal plane, a quality that could be glossed as ‘labyrinthine’. The silent gesture thus underscores how dangerous the situation was.

Figure 2: Three stills from Speaker B’s silent zigzagging gesture; frame grabs extracted from participant A’s eye tracking video for illustration (i.e. not automatically retrievable from the corpus).

2.1.5 Transcription of gestures in storytelling interaction

One layer of transcription and annotation in ELAN that is at present not yet available at a large scale but is currently being extended concerns gestures in storytelling interaction. So far, only a subset of storytellings[2] has been fully examined for gestures performed by the storyteller.

Gestures are manually annotated in ELAN on three tiers: gestures as such, gesture phases based on Kendon’s (2004) gesture phase model, and the gesture expressivity index, which captures gesture dynamics. Both the phase model and (a prior version of) the Gesture Expressivity Index are explained in detail in Rühlemann (2022). Here, a mere sketch is attempted.

The gesture phase model distinguishes five phases: preparation, i.e. the departure of the hand(s) from a rest, or home, position; the pre-stroke-hold; the stroke, i.e. the “phase of the excursion in which the movement dynamics of ‘effort’ and ‘shape’ are manifested with greatest clarity” (Kendon 2004, 112); the (optional) hold, i.e. the “phase in which the articulator is sustained in the position at which it arrived at the end of the stroke” (Kendon 2004, 112); and the recovery, i.e. the movement from the stroke (and optional hold) back to the home position. Stroke and hold together form the nucleus, which is that part of the gesture “that carries the expression or the meaning of the gesture” (Kendon 2004, 112). The only phase that must be present in a gesture annotation is the stroke phase; all other phases are optional (cf. the Coding Manual for Gestures and Gestures Phases in Appendix 2).

The Gesture Expressivity Index comprises altogether seven variables; parameters 1–4 are manually annotated on what we call the index tier; parameters 5–7 are extracted from the gesture and gesture phase tiers (cf. Coding Manual for Gesture Expressivity Index in Appendix 3, which also gives a rationale for the coding parameters used):

  1. Gesture viewpoint (gesture viewpoint is the character’s viewpoint rather than the observer’s viewpoint)

  2. Gesture silence (gesture is a silent gesture without co-occurring speech)

  3. Gesture size (gesture radius is sizable)

  4. Gesture force (gesture execution requires muscular effort)

  5. Gesture hold (gesture contains a hold phase)

  6. Gesture articulation (gesture is articulated in concert with other bodily activation)

  7. Gesture duration (gesture nucleus is longer than story average)

Gestures are annotated in FreMIC with a specific research focus in mind, namely, the contribution of gestures (and gesture expressivity) by the storyteller to emotional resonance in the story recipients. As will be explained later (in Section 2.2.2.2), in a subset of the FreMIC recordings, the participants wore wrist watches measuring, inter alia, electrodermal activity (EDA), a psychophysiological proxy for emotion arousal. Given this research focus (and the inherent complexity of gesture annotation as well as the inevitable necessity of manual annotation), the aim in FreMIC is not to achieve full-scale annotation for all storytellings in the whole corpus. Rather, the annotation is restricted to those files where EDA data are available.

To illustrate, in a pilot study (Rühlemann 2022), the Gesture Expressivity Index was implemented for two triadic storytellings that were comparable on a number of counts but starkly different on others, the main selection criteria being the fact that both stories were stories from the sadness/distress spectrum of emotions and the fact that the storytellers came from opposite ends of the gesticulation spectrum: the storyteller in the story ‘Toilet woman’ using their hands a lot and the storyteller in the story ‘Sad story’ using them sparingly (for a complete description of the selection criteria, see Rühlemann (2022, 3–4)). The metric used to estimate how gesture expressivity develops across a storytelling was the slope of a linear regression computed for the index values in the storytelling, with positive slopes indexing intensifying expressivity and negative slopes indexing weakening expressivity.[3]
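A minimal sketch of this slope metric is given below; it assumes, hypothetically, that each gesture’s index value and onset time are stored in columns ‘index’ and ‘Starttime_ms’ of a per-story data frame, and that the onset of the climax is known:

```r
# Minimal sketch (hypothetical column names): slope of a linear regression of
# gesture expressivity on gesture onset time, from story beginning to climax onset.
expressivity_slope <- function(gestures, climax_onset_ms) {
  pre_climax <- subset(gestures, Starttime_ms < climax_onset_ms)
  fit <- lm(index ~ Starttime_ms, data = pre_climax)
  unname(coef(fit)["Starttime_ms"])  # > 0: intensifying; < 0: weakening expressivity
}
```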

It was shown that the unequal use of gesture dynamics by the two storytellers was differentially correlated with the storytellings’ progression from background to climax: while the gestures by the storyteller in ‘Toilet woman’ gained in expressivity, the gestures used by the storyteller in ‘Sad story’ increasingly lost in expressivity (Figure 3).

Figure 3: Gesture Expressivity in ‘Toilet woman’ vs ‘Sad story’. Dashed line: mean gesture expressivity index. Solid pink line: regression (‘trend’) line computed from story beginning to climax onset. Red rectangles: duration of climax. X-axis: timestamps of gestures in the storytellings.

Gesture expressivity may impact on whether, or to what degree, the ultimate goal in storytelling – emotional resonance between storyteller and story recipient(s) (e.g. Labov 1972, Stivers 2008, Peräkylä 2015) – is achieved (see Section 2.2.2.1, and Rühlemann 2022 for more detail).

2.1.6 Annotation of Question–Answer (Q&A) sequences

FreMIC is comprehensively annotated for Q&A sequences. Q&A sequences are of particular interest to research on turn-taking. This is because questions are “among the most frequent high-level action type in all languages” (Levinson 2013, 112) and, unlike other turn types (such as, for example, assessments), an (information-seeking) question puts maximal pressure on the interlocutor(s) to produce an answer that provides the sought information (Stivers and Rossano 2010, 29). If the information is not provided, a negative sanction may be issued (Stivers 2013, 204). Questions thus count among the most strongly next-action and turn-transition mobilizing actions; in Stivers (2010), for example, 93% of all questions were indeed followed by turn transition.

Questions are a functionally highly diverse class of turns/social actions. They can be used not only for information requests but also for, inter alia, repair initiations, confirmations, and assessments (Stivers 2010), and not all question types exert the same kind or amount of response relevance. For example, most questions that become questions only by virtue of an appended question tag serve not to fill a knowledge gap on the part of the questioner but to elicit agreement with the questioner. Other question types are per se restricted in terms of who is selected as the answerer (e.g. a recipient in multi-party interaction asking a repair question is highly likely to select as answerer the current speaker whose talk contained the repairable).

As the focus of interest in the FreMIC annotation of Q&A sequences is on information-seeking questions with high response relevance, we target four types of questions: wh-questions, polar questions, declarative questions, and (multi-clausal) or-questions, as illustrated in (3):

(3)

Q type Speaker Utterance
wh ID07.C yeah how’s everybody doing¿ (.) from the band
polar ID01.B I mean is she at work?=
declarative ID01.C [but you] use leggings or?
or ID07.C is it!mult!iple singers for the band or or is [she like the main one] °then°=

Identifying such Q&A sequences is by no means a trivial task. There are questions that remain unanswered; questions that involve a wh-pronoun but seek not information but affirmation of the stance displayed in the question (as in rhetorical questions); and questions that get responded to, but not in a type-fitted manner by the selected next speaker, but by a third, non-selected party inserting some (intrusive) talk that does not provide the sought information, to name only a few types of unorthodox question–‘response’ sequences. Another complicating factor is the occurrence of the question in turbulent turn-taking, for example due to multiple overlap, which makes identification of the question and particularly the answer difficult. Q&A sequences were therefore annotated by both authors and rigorously filtered. The sequences were also annotated for a large number of additional factors, including the following:

  1. Q_by: Which participant asks the question?

  2. Answ_by: Which participant gives the answer?

  3. Q_dur: How long is the question (and, respectively, the answer) in total?

  4. Last_G_to: Which participant does the questioner last gaze at?

  5. Last_G_to_Answ: Is the last gaze to the answerer or not?

  6. Last_G_dur: How long is that last gaze?

  7. Preselected: Is the answerer preselected sequentially or epistemically?

While variables 1–6 are straightforward, variable 7, Preselected, requires a rigorous case-by-case analysis, often extending deep into the sequential contexts prior to the question. Recipients can be preselected sequentially, for example due to the ‘last-as-next’ bias observed by Sacks et al. (1974), or epistemically, due to ‘tacit addressing’ (Lerner 2003), where, for example, an epistemic asymmetry holds between recipients that effectively limits eligible recipients to the single recipient who is ‘in the know’. Detecting such constraints on next-speaker selection is key for analyses of particular next-speaker selection methods, such as gaze (cf., for example, Lerner 2003): if the recipient that is last looked at during the current speaker’s turn and speaks next is, for example, the only one eligible to respond due to some epistemic advantage, it is impossible to determine which method effected the selection – it could be gaze or tacit addressing or both. We return to this issue in Section 4.2, where we report on a FreMIC-based study involving gaze and other selection mechanisms.
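The gaze-related variables (4) and (5) can in principle be derived from the area-of-interest gaze records described later in Section 2.2.1, where each participant’s gazes during an utterance are stored as a string of AOI targets with ‘*’ marking averted gaze. A minimal sketch under that assumption (the function name is ours):

```r
# Minimal sketch: the questioner's last AOI gaze target, and whether it matches
# the eventual answerer (assumes the stringed AOI format of Section 2.2.1).
last_gaze_to <- function(aoi_string) {
  gazes <- strsplit(aoi_string, "")[[1]]
  aoi   <- gazes[gazes != "*"]                 # drop averted-gaze symbols
  if (length(aoi) == 0) NA_character_ else tail(aoi, 1)
}

last_gaze_to("C*B*C")          # "C"  -> Last_G_to
last_gaze_to("C*B*C") == "C"   # TRUE -> Last_G_to_Answ, if C produces the answer
```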

2.1.7 Descriptive statistics for transcription and annotation layers in FreMIC-so-far

To summarize, Table 2 presents the basic statistics of the transcriptions and annotations at the time of writing.

Table 2

Descriptive statistics of the transcription and annotation layers of FreMIC-so-far

Recordings            Number of recordings   Duration (h)   Complete multimodal integration   IPU count   Word count   Silent gestures   Gestures in storytellings   Enactments   Q&A sequences
Dyadic interaction    21                     15             7                                 11,473      77,708       150               —                           356          —
Triadic interaction   18                     14             11                                19,486      131,373      143               900                         622          383
Total                 39                     29             18                                30,959      209,081      293               900                         978          383

The data in Table 2 seem to suggest that FreMIC is a very small corpus; its word count, for example, is barely above 200,000 words. It should not be forgotten, however, that FreMIC is still under construction, i.e. new recordings will be added while not even half of all existing recordings have yet been transcribed and integrated into the corpus. Moreover, the CA transcriptions are of a granularity not found in any other corpus we are aware of. Finally, the transcriptions and manual annotations are just half the story; the other half are the automatically generated multimodal and psychophysiological observations, which already at this early stage count in the (hundreds of) thousands (Section 2.2).

2.1.8 From ELAN to a corpus architecture

Given our aim to construct a corpus, we cannot yet be satisfied with the large and detailed ELAN files elaborated so far. The files need to be exported and combined into a framework that is capable of holding the information contained in them in ways that are conducive to corpus linguistic methods.

While ELAN offers a large number of export options, we found only one of them to suit our needs, albeit in a form that requires substantial post-processing. We export transcriptions into R using ELAN’s “Traditional Transcript Text …” option and use Regular Expressions to extract distinct variables into columns of a data frame. Distinct variables at this point comprise line number, speaker Ids, utterances (which need not be only verbal but can be of any modality, cf. Kendon 2004), timestamp, and file Id. Further, based on timestamps, we compute for each utterance its starttime, endtime, and duration, as shown in Figure 4.
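The extraction step can be sketched roughly as follows; this is a minimal, illustrative parser that assumes an export layout in which each annotation occupies one line of the form ‘Speaker [HH:MM:SS.mmm - HH:MM:SS.mmm] utterance’; the actual layout depends on the chosen export settings, so the pattern would need to be adapted:

```r
# Minimal sketch (assumed export layout): parse lines of an ELAN
# "Traditional Transcript Text" export into speaker, utterance, and temporal columns.
library(stringr)
library(tibble)
library(dplyr)

parse_transcript <- function(lines, file_id) {
  m <- str_match(
    lines,
    "^(\\S+)\\s+\\[(\\d{2}:\\d{2}:\\d{2}\\.\\d{3}) - (\\d{2}:\\d{2}:\\d{2}\\.\\d{3})\\]\\s*(.*)$")
  to_ms <- function(ts) {               # "00:21:28.566" -> 1288566
    p <- str_split_fixed(ts, "[:.]", 4)
    as.numeric(p[, 1]) * 3600000 + as.numeric(p[, 2]) * 60000 +
      as.numeric(p[, 3]) * 1000 + as.numeric(p[, 4])
  }
  tibble(File = file_id,
         Line = seq_along(lines),
         Speaker = m[, 2],
         Utterance = m[, 5],
         Starttime_ms = to_ms(m[, 3]),
         Endtime_ms = to_ms(m[, 4])) |>
    mutate(Duration_ms = Endtime_ms - Starttime_ms)
}
```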

Figure 4: Screenshot of FreMIC data frame with temporal data.

Considering that interaction plays out in time, these detailed temporal data are essential for most interactional-linguistic/multimodal analyses; see, for example Stivers et al. (2009) on turn transition times, within-turn speech rate (Roberts et al. 2015), timing of eye blinks used as addressee feedback (Hömke et al. 2017), gesture-speech dissociation (Ter Bekke et al. 2020), pauses, gaps, and overlap (Heldner and Edlund 2010), delay of dispreferreds (Kendrick and Torreira 2014), to name only a few.

2.1.9 PoS-tagging

A defining feature of many traditional corpora is that speech is part-of-speech tagged. That is, based on an automatic analysis that considers each word’s usage in its immediate co-text, the words are assigned their (most likely) part-of-speech tag, thus making possible, for example, distinctions between ‘like’ used as a verb, noun, preposition, etc. For FreMIC, we submitted our orthographic transcriptions (after the replacement of unclear speech, silent gestures, comments, and the like with the placeholder ‘NA’) to the CLAWS web tagger (Garside and Smith 1997) and its c7 tag set (http://ucrel-api.lancaster.ac.uk/claws/free.html), which has an accuracy rate of 98.5% (Leech et al. 1994). The resulting PoS transcriptions were then added to the data frame in R. One of the advantages of PoS-tagging speech is that it allows the computation of turn size based on the number of grammatical words; i.e. even in contracted forms, the underlying grammatical words are recognized, tagged, and counted separately. For example, the phrase ‘I’m gonna’ is tagged ‘I_PPIS1 ’m_VBM gon_VVGK na_TO’, resulting in a count of four words rather than two. The word count for each utterance is added as the new column ‘N_c7’ in the data frame (cf. Figure 5).
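The N_c7 count can be derived straightforwardly from the tagged string; a minimal sketch (how the ‘NA’ placeholders are tagged and excluded is our assumption):

```r
# Minimal sketch: count grammatical words in a CLAWS-c7-tagged string by
# counting word_TAG tokens (tokens stemming from the 'NA' placeholder excluded).
library(stringr)

count_c7 <- function(c7_string) {
  tokens <- str_split(c7_string, "\\s+")[[1]]
  sum(str_detect(tokens, "_") & !str_detect(tokens, "^NA_"))
}

count_c7("I_PPIS1 'm_VBM gon_VVGK na_TO")  # returns 4
```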

Figure 5: Screenshot of FreMIC data frame with PoS-tagging (in column ‘c7’) and count of grammatical words (in column ‘N_c7’).

2.2 Automatically generated multimodal and psychophysiological data

The fine-grained Jeffersonian transcription of many hours of dyadic and triadic conversational interaction already sets FreMIC apart from most speech corpora available today. But FreMIC offers much more data not normally found in a linguistic corpus, namely, a wealth of automatically generated multimodal and psychophysiological data. The term ‘multimodal’ is here used to cover gaze data, whereas the psychophysiological data include pupillometric data and measurements of participants’ electrodermal activity (EDA).

2.2.1 Multimodal data – Gaze direction

All participants in the video recordings wear Ergoneers eyetrackers. Eye tracking traces a participant’s foveal vision, which is indicative of gaze direction (e.g. Auer passim). Gaze direction is interactionally key, for example, in turn-taking (e.g. Auer 2021) and re-enactments (e.g. Pfeiffer and Weiss 2022).

While eye tracking technology has been around for quite some time, it has to date predominantly been used in qualitative studies examining sequences of interaction manually.

Eye tracking data come in two separate streams: as continuous measurements of the X/Y coordinates of a participant’s foveal vision, produced on average every 12 ms, and as what is called area-of-interest (AOI) data, produced whenever a gaze falls into that particular area. While the X/Y coordinate data are at present not (yet) accessible in the FreMIC data frame, the AOI data are comprehensively integrated into the corpus architecture.

The AOI technology relies not only on the participants’ eyetrackers and the foveal vision detected by them but also on two more components: on polygons the researchers inscribe into the eye tracking videos around the AOIs – in the case of FreMIC, around the co-participants’ faces – and on QR code markers applied to the participants’ clothing (as shown in Figure 6). Each polygon is assigned to one or more QR markers. The role of the QR markers is to prevent the AOI polygons from shifting when the eyetracked person moves their head: the markers anchor the polygon to the face. Moreover, the markers form their own coordinate system so that any gaze into an AOI polygon is recorded in the marker’s coordinate system. The relation between QR code marker and AOI thus remains stable and independent of head movements of the eyetracked person.

Figure 6: Still from participant A’s eye tracking video (File F01) showing the interplay of QR code markers, AOI polygons, and foveal vision detection (shown in the crosshair).

In Figure 6, for example, Speaker B’s and Speaker C’s faces were defined as AOIs by polygons delineating their heads (plus some surrounding space). Whenever the gaze of the eyetracked participant (Speaker A in this case) falls into the polygon for a predefined time threshold (as shown by the crosshair, see also Appendix), that gaze and its duration are recorded. It is important to acknowledge that the AOI technology is not bullet-proof: it fails if the QR code is illegible because, for example, subjects have removed the item of clothing carrying the QR code, or the codes are covered by hair strands, scarves, the participant’s own gesturing, extreme torso movements, etc.[4]

AOI gazes are comprehensively integrated into the FreMIC corpus architecture: for each participant, there are two columns recording AOI data: one indicating the sequence of AOI targets looked at and another indicating the durations of these gazes. Consider Figure 7, which, for illustrative purposes, shows just the Speaker and Utterance columns as well as the four AOI columns from a dyadic interaction.

Figure 7: Screenshot of FreMIC data frame with AOI-related columns.

In Figure 7, for example, Speaker B requests Speaker A to =tell me everything about your <quaranti:ne>=. The ‘*’ symbol in column ‘B_aoi’ indicates that during her utterance, she does not fixate A. Speaker A, by contrast, does look at B during the utterance: the string ‘*B*B*B’ in column ‘A_aoi’ contains six gaze data points: the ‘B’ values indicate fixed gaze on B, whereas the ‘*’ values indicate averted gaze. In column ‘B_aoi_dur’, we see that B’s averted gaze (‘*’) during the request had a duration of 2,716 ms (which is the duration of the whole utterance); in column ‘A_aoi_dur’, the durations of the six gaze values are given; they sum up to 2,716 ms, the duration of the utterance.

The AOI-based detection of gazes into areas of special interest is fully automatic. That is, notwithstanding its present shortcomings, the promise it holds for gaze, and indeed multimodal, researchers is substantial: it opens up the possibility to study gaze behaviour and its interaction with other relevant semiotic resources at a large scale, thus facilitating serious quantification and statistical analysis.

2.2.2 Psychophysiological data

The psychophysiological data fall into two categories: pupillometric and EDA data.

2.2.2.1 Pupillometric data

A participant’s foveal vision is computed based on the detection of that participant’s pupils through small infra-red cameras located in the eyetrackers and ‘looking’ into the participant’s eyes. A side effect of the tracking of the participant’s foveal vision is hence the constant detection and analysis of their pupils. In other words, eye tracking not only tells us where people are looking but also produces large-scale pupillometric data in microscopic detail.

Of particular interest is the size of the pupils. Pupil size depends on a number of factors such as, obviously, luminance, but also drug consumption, pathological states, emotional arousal, and cognitive load. Indeed, under normal circumstances and stable light conditions, pupil size is a reliable indicator of how intensely the processing system is operating (Just and Carpenter 1993, Barthel and Sauppe 2019, 3, Beatty 1982, Beatty and Lucero-Wagoner 2000, Sirois and Brisson 2014; cf. references in Laeng et al. 2012, 18). While a pupil’s normal diameter is 3 mm, it can extend to a maximal diameter of up to 7 mm (more than doubling) (Laeng et al. 2012). Cognitively induced changes, however, rarely exceed 0.5 mm (Beatty and Lucero-Wagoner 2000).

In linguistics, pupillometry has made inroads in laboratory studies. For example, Papesh and Goldinger (2012) study the cognitive effort in speech planning, Sevilla et al. (2014) compare cognitive costs in canonical v. non-canonical word order, Lõo et al. (2016) measure cognitive effort in word naming tasks, Sauppe (2017) examines processing load in the voice systems (active/passive) in German v. Tagalog, and finally, Barthel and Sauppe (2019) look into speech planning at turn transitions.

Given the availability of pupillometric data in FreMIC, a corpus of unconstrained conversation, using these data has the potential to break new ground. We will illustrate the use and utility of pupillometric data in the brief report on the study by Rühlemann and Barthel (under review) in Section 4.2.

2.2.2.2 EDA data

At the time of writing, participants in 11 video recordings (totalling c. 10 h) wear Empatica wrist watches. These devices record a large number of psychophysiological observations including, most notably, EDA[5] (or GSR). EDA reflects changes in sweat gland activity and skin conductance. Sweating on and near the palms is involved in “emotion-evoked sweating” (Dawson et al. 2000, 202) rather than “physical activity or temperature” (Bailey 2017, 3). EDA is therefore a reliable indicator of emotional arousal (Peräkylä et al. 2015, Scherer 2005), that is, of intensifying excitation of the sympathetic nervous system associated with emotion (Dael et al. 2013, 644, Peräkylä et al. 2015, 302): emotional excitement leads to increases in EDA amplitude.

In the aforementioned pilot study (Rühlemann 2022), EDA responses have been examined in the context of two triadic storytellings with a focus on the storyteller’s deployment of multimodal resources, including re-enactments, prosody (pitch and intensity), gaze (gaze direction and gaze movement), and, as noted, gesture expressivity. It was found that the storytellers differ substantially in their multimodal investment in the storytelling, with the storyteller in the story called ‘Toilet woman’ using re-enactments, prosody, gaze, and gestures in more expressive ways than the storyteller in the story called ‘Sad story’. Figure 8 shows the EDA responses for storytellers and story recipients during the two storytellings. While the results are mixed to a degree (e.g. although the storytelling in ‘Toilet woman’ is more expressive on all counts, recipient C’s EDA amplitude shows no signs of arousal), EDA responses on the whole are stronger for ‘Toilet woman’ than for ‘Sad story’. This is congruent with the Multimodal Crescendo Hypothesis (Rühlemann 2022), according to which greater multimodal effort by the storyteller predicts greater emotional resonance in story recipients – a hypothesis to be tested against larger and more diverse data in future research.

Figure 8: Scatter plots of EDA responses by participants in storytelling ‘Toilet woman’ (left panel) and storytelling ‘Sad story’ (right panel).

2.2.3 Synchronizing ELAN transcription, multimodal, and psychophysiological data

At the point where all the different data streams are available, the corpus construction faces a major challenge: the transcriptions made in ELAN, the EDA data provided by the Empatica wrist watches, the eyetracking data on gaze direction, and pupil size provided by the Ergoneers eyetrackers are independent and heterogeneous data streams of different origin. Moreover, the ELAN data on the one hand, and the EDA and pupil observations on the other hand, play out on very different time scales: while IPU ‘observations’ transcribed in ELAN have a mean duration of 1,700 ms, EDA and pupil observations are produced every 16 ms! The challenge then is to synchronize the various data in a single data frame structure that will hold all the information in a way that benefits corpus research.

Although the data streams are heterogeneous in kind and play out on different time scales, they have one element in common: timestamps. Based on this information, ELAN, EDA, gaze direction, and pupil data can be joined and synchronized in large data frames in R (needless to say, due to the heterogeneity of data and time scales, that synchronization involves large amounts of code).

A critical decision in the synchronization task is which data stream should be prioritized over the others; that is, which ‘variable’ is promoted to the reference variable to which the others are aligned. The decision in FreMIC is that the ELAN transcriptions (stored in column ‘Utterance’) serve as the reference category to which all multimodal data are synchronized.

As mentioned earlier, IPUs, which represent the bulk of ELAN transcriptions, are slow-paced observations, with typical rhythms between 1 and 2 s, whereas the rhythm of most multimodal observations is fast-paced, running up large numbers of observations during one and the same IPU. Moreover, there is the ‘binding problem’ noted by Holler and Levinson (2019) for multimodal data: i.e. the speech (or any other interactionally relevant behaviour) captured in the IPU and the multimodal data are often offset in time rather than neatly aligned.[6] For example, we may address a question to a co-participant and select that participant as the next speaker by gazing at them but will keep our gaze on them during the ensuing turn transition and at least during the initial part of the answer. What this means is that multiple multimodal data points need to be funnelled and tailored into the IPU’s single time slot.

For illustration, consider the IPU like I mean she’s not like in the (.) <!risk! age group> recorded in column ‘Utterance’ in the FreMIC data frame and shown in Figure 10.

The utterance like I mean she’s not like in the (.) <!risk! age group> starts at 13,100 ms and ends at 16,898 ms, as shown by the columns ‘Starttime_ms’ and ‘Endtime_ms’ at the top of Figure 10. The AOI data for this time span show that the speaker, Speaker A, first gazes at Speaker C. That gaze starts long before the utterance starts, namely, at 11,365 ms, and it ends at 14,287 ms, which is within the time span of the utterance. The following non-AOI gaze, shown as ‘NA’ in the figure, the next gaze to Speaker B, and the non-AOI gaze thereafter fall within the utterance’s time slot. The last gaze, however, which is directed to Speaker C, starts within the utterance’s time span, at 15,421 ms, but outlasts it, ending at 20,054 ms. In other words, while the two non-AOI gazes and the one gaze to Speaker B can be seamlessly funnelled into the time span of the utterance, the two gazes to Speaker C cannot – they need to be tailored to fit the utterance’s time span. Note how the result of that funnel-and-tailor operation is given in new columns in the FreMIC data frame (shown at the bottom of Figure 10): in column ‘A_aoi’, we find the string ‘C*B*C’, which represents the speaker’s shift from AOI C, to non-AOI (signified by *), to AOI B, to non-AOI, and to AOI C. In column ‘A_aoi_dur’, we find the string “1187,151,132,851,1477”, which contains the exact durations of the speaker’s gazes during the time span of the utterance.

Figure 9: Data streams feeding into FreMIC – ELAN: ELAN transcription tool; EDA: electrodermal activity observations provided by Empatica wrist watches; ET: eyetracking data provided by Ergoneers eyetrackers; PP: pupillometric observations provided by Ergoneers eyetrackers.

The funnel-and-tailor operation shown in Figure 10 for one utterance and that speaker’s gazes is done for all utterances and all speakers in each recording, and it is implemented not only for AOI data but also for the psychophysiological data. The end result is a data frame that is far too large to be shown in a print medium, containing almost 50 columns. It is also a data format that, to some observers, may appear ‘untidy’, as long sequences of character or numeric data are converted to strings and squeezed into single cells. However, that untidiness, if anything, is the price we are willing to pay for the orderliness we gain with respect to the reference variable ‘Utterance’: we know exactly what happens multimodally and psychophysiologically during any one utterance.
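In essence, the operation clips every gaze (or other) interval that overlaps an utterance to the utterance’s time span and then collapses the clipped intervals into the stringed cells. A minimal sketch (hypothetical column names; the toy data reconstruct the gaze sequence of the example above from the boundary times and durations given in the text):

```r
# Minimal sketch of the funnel-and-tailor operation for one utterance:
# clip overlapping gaze intervals to the utterance's time span and collapse them.
library(dplyr)

funnel_gazes <- function(gazes, utt_start, utt_end) {
  gazes |>
    filter(gaze_end > utt_start, gaze_start < utt_end) |>   # gazes overlapping the IPU
    mutate(clip_start = pmax(gaze_start, utt_start),        # tailor to the IPU's span
           clip_end   = pmin(gaze_end, utt_end),
           clip_dur   = clip_end - clip_start) |>
    summarise(aoi     = paste(target, collapse = ""),
              aoi_dur = paste(clip_dur, collapse = ","))
}

# Speaker A's gazes around the utterance spanning 13,100-16,898 ms
gazes <- tibble::tibble(
  target     = c("C", "*", "B", "*", "C"),
  gaze_start = c(11365, 14287, 14438, 14570, 15421),
  gaze_end   = c(14287, 14438, 14570, 15421, 20054))

funnel_gazes(gazes, 13100, 16898)
# aoi = "C*B*C", aoi_dur = "1187,151,132,851,1477"
```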

Figure 10: Illustration of the ‘binding problem’ in synchronizing ELAN transcription data with multimodal (in this case, AOI) data.

Large strings of data require special treatment when it comes to analysing the stringed data. Modern packages such as the ‘tidyverse’ super-package in R allow for quick and easy unpacking of data strings across multiple columns. For example, using the functions ‘pivot_longer’ and ‘separate_rows’, we can easily fan out the AOI data for all three speakers during the aforementioned utterance like I mean she’s not like in the (.) <!risk! age group>, transforming the densely packed cells, whose strings of multiple data points are spread out across multiple columns, into individual data points neatly ordered in a few columns, as shown in Figure 11.
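A minimal sketch of this wide-to-long transformation is given below; the values for Speaker A are those quoted above, whereas the values for Speakers B and C are hypothetical placeholders added for illustration only:

```r
# Minimal sketch: unpack the stringed AOI cells of one utterance from wide to
# long format with tidyr's pivot_longer and separate_rows.
library(tidyr)
library(dplyr)
library(tibble)

utt_wide <- tibble(
  Utterance = "like I mean she's not like in the (.) <!risk! age group>",
  A_aoi = "C*B*C", A_aoi_dur = "1187,151,132,851,1477",
  B_aoi = "A",     B_aoi_dur = "3798",     # B and C values hypothetical,
  C_aoi = "A",     C_aoi_dur = "3798")     # for illustration only

utt_long <- utt_wide |>
  pivot_longer(cols = -Utterance,
               names_to = c("participant", ".value"),
               names_pattern = "([ABC])_(.*)") |>
  mutate(aoi = gsub("(?<=.)(?=.)", ",", aoi, perl = TRUE)) |>  # "C*B*C" -> "C,*,B,*,C"
  separate_rows(aoi, aoi_dur, sep = ",", convert = TRUE)

utt_long   # one row per gaze, columns 'aoi' and 'aoi_dur'
```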

Figure 11: Result of transformation of the AOI data for all three speakers during the utterance “like I mean she’s not like in the (.) <!risk! age group>” from wide to long format.

Now that the data are unpacked so that each gaze observation has its own row and each variable has its own column – for example, all the AOI gazes are assembled in column ‘aoi’ and all the durations of these gazes are collected in column ‘aoi_dur’ – we can address a large number of analytic questions. Moreover, the transformation from an ‘untidy’ spread-out format to a ‘tidy’ gathered format of the multimodal data opens up novel ways of visualization. In Section 4, these visualizations are discussed, and one ongoing research project is briefly described. In Section 3, we describe yet another corpus component, the FreMIC Turns component.

3 Turns component

As noted in Section 2.1.1, the basic unit of observation in FreMIC is the IPU. IPUs can be safely assumed to be a good approximation to turn-constructional units (TCUs). This is suggested by the fact that IPUs in FreMIC and the TCUs measured in the study by Hömke et al. (2017) are virtually equally long, namely, 1,700 ms and, respectively, 1,754 ms. We also know that there is a preference for short turns consisting of a single TCU – the so-called single-TCU bias, which indeed governs 67% of all turns in conversation (Robinson et al. 2022). But this leaves us with almost a third of all turns that will not consist of a single TCU or IPU. Exceptions to the single-TCU bias include a large number of cases. First, according to Sacks et al.’s (1974) turn-taking rule 1c, “If the turn-so-far is so constructed as not to involve the use of a ‘current speaker selects next’ technique, then current speaker may … continue” (Sacks et al. 1974, 704); next, the turn structure may play a role: speakers bowing to the ‘begin-with-a-beginning’ constraint (Sacks et al. 1974, 719) may start their turn with a prosodically (and often syntactically) disintegrated turn-initial particle (‘insert’), which can form its own TCU (well, oh, etc.; cf. Heritage and Sorjonen 2018). Further, preference may lead to multi-TCU turns: for example, dispreferred responses make relevant additional, same-turn accounts (e.g. justifying, excusing) (cf. Heritage 1984a, Pomerantz and Heritage 2013). Finally, and perhaps most importantly, action/activity type may be critical: (story-)tellings, for example, can be ‘built from many TCUs’ (Goodwin and Heritage 1990, 299; cf. also Sacks 1992, 222).

To approximate turns, specifically (story-)tellings, more comprehensively, a large number of IPUs were contracted into composite units – turns, together forming the FreMIC Turns component.

The contraction was rule-based: if there were successive IPUs by the same speaker interspersed only by (i) pauses, (ii) within-turn overlap, and/or (iii) (free-standing) continuers, the transcriptions stored in the CA-like ‘Utterance’ and the ‘Orthographic’ columns were contracted into a single larger unit, a FreMIC Turn. Importantly, any associated columns – including temporal data as well as multimodal data – were contracted as well.

Using Regular Expressions, continuers were identified based on three criteria drawing on information stored mainly in the ‘Orthographic’ and CA-like ‘Utterance’ transcriptions (a minimal sketch of these criteria follows the list):

  1. They had to be from the small set of backchannels shared across varieties of English, which comprises five items; four of these were used: ‘mm’, ‘yeah’, ‘mhm’, and ‘uh-huh’ (cf. White 1989, Wong and Peters 2007, Rühlemann 2017); the fifth item, ‘oh’, was not included, given its common function as a sequence-closing third (cf. Schegloff 2007).

  2. These backchannels were, however, not treated as continuers if, as shown in the CA-style transcription in ‘Utterance’, they were spoken with greater-than-usual paralinguistic activity; e.g. a ‘yeah’ involving ↑ (high pitch), ↓ (low pitch), ¿ (half rise), ? (rise), £ (smiley voice), # (creaky voice), ≈ (tremulous voice), or capitalization (intensity) is likely not to serve as a mere continuer as it displays Goodwin’s (1986b, 211) ‘elaborated participation display’.

  3. Continuers are brief: durations for high-frequency continuers in Peters and Wong (2015) are <500 ms, and the median duration in the study by Young and Lee (2004) is 390 ms in a range of 240–3,190 ms. The third criterion was therefore a duration of <1,000 ms.
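The three criteria can be operationalized roughly as follows (a minimal sketch, not the project’s actual script; column names assumed):

```r
# Minimal sketch of the continuer criteria: a backchannel token, no marked
# paralinguistic activity in the CA transcription, and a duration below 1,000 ms.
library(stringr)

is_continuer <- function(orthographic, utterance_ca, duration_ms) {
  backchannels <- c("mm", "yeah", "mhm", "uh-huh")
  token_ok    <- str_to_lower(str_trim(orthographic)) %in% backchannels
  # marked paralinguistic activity: pitch arrows, rises, smiley/creaky/tremulous
  # voice, or capitalization (intensity)
  plain_ok    <- !str_detect(utterance_ca, "[↑↓¿?£#≈]|[A-Z]{2,}")
  duration_ok <- duration_ms < 1000
  token_ok & plain_ok & duration_ok
}

is_continuer("mhm", "=mhm", 541)     # TRUE (cf. line 14 in extract (4) below)
is_continuer("yeah", "↑YEAH?", 450)  # FALSE: elaborated participation display
```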

To illustrate the procedure, consider the IPU transcriptions in (4), an excerpt from a triadic interaction. To highlight the issue of multimodal data, specifically gaze data, the excerpt also features co-participant C’s AOI data in column ‘C_aoi’ (‘A’ denotes that C gazes at Speaker A, whereas ‘*’ denotes that her gaze wandered off A’s face; column ‘C_aoi_dur’ gives the duration of each gaze).

Speaker A works for a cruise ship company. In the extract, she is relating her company’s problems in attracting customers. From line 1 through 16, Sacks et al.’s (1974) turn-taking rule 1c is in operation, with the current speaker continuing through multiple pauses and without verbal actions from the recipients. Speaker C’s [mhm] in line 12 too is not intended to take the turn, being a minimal response token in overlap; similarly, Speaker B’s mhm in line 14, although free-standing, serves a continuer function. Only in line 17 does Speaker C self-select to take the turn, initiating a Q&A sequence by asking Speaker A [what] type of: tours is it is it [(like a long] ti:me [or¿], to which A replies [it’s cruise ship]:

(4)

Line Speaker Utterance C_aoi C_aoi_dur
1 ID01.A [AND] on top of it we were already running into a problem with our British companies, because the Brits don’t wanna come to Europe *A 1,081, 3,845
2 NA (0.396) A 396
3 ID01.A °because of Brexit° A 648
4 NA (1.484) A 1,484
5 ID01.A and now we’re having trouble with, I mean and our others are Am(.)ericans and Australians A*A 2,401, 660, 1,318
6 NA (0.273) A 273
7 ID01.A !they!‘re all A 544
8 NA (0.869) A 869
9 ID01.A they’re!a:ll! at least fifty and older (.) it’s!very! rare to >have someone< under fifty do one of these tours A 5,487
10 NA (0.360) A 360
11 ID01.A [um but]!most! people are (.)!well! retired cos you have to have the money and the time to do these things= A 4,456
12 ID01.C [mhm] A 556
13 NA (0.025) A 25
14 ID01.B =mhm A 541
15 NA (0.582) A 582
16 ID01.A u[m] A 300
17 ID01.C [what] type of: tours is it is it [(like a long] ti:me [or¿] A 3,500
18 ID01.A [it’s cruise ship] A 782

That Speaker A in lines 1–16 is engaging in a telling also transpires from Speaker C’s sustained gaze to A. Note, however, that the ‘C_aoi’ and ‘C_aoi_dur’ columns render that sustained gaze as if it were multiple separate gazes – whereas, clearly, these separate observations are in actual fact one single observation. The contraction of the IPU transcriptions into turns as well as the contraction of the multimodal data take care of this. Consider excerpt (5), which gives Speaker A’s telling as a single turn consisting of 10 IPUs: the values in the ‘C_aoi’ and ‘C_aoi_dur’ columns have now shrunk to merely four: an away-gaze (‘*’), as before, followed by a gaze to A, whose duration now has increased from 3,845 to 8,774 ms, followed by an away-gaze, and, finally, another gaze to A, with a duration as long as 13,607 ms!

(5)

Line Speaker Utterance C_aoi C_aoi_dur N_ipu
1–16 ID01.A [AND] on top of it we were already running into a problem with our British companies, because the Brits don’t wanna come to Europe } (0.396) } °because of Brexit° } (1.484) } and now we ‘re having trouble with, I mean and our others are Am(.)ericans and Australians } (0.273) }!they!‘re all } (0.869) } they ‘re!a:ll! at least fifty and older (.) it ‘s!very! rare to >have someone< under fifty do one of these tours } (0.360) } [um but]!most! people are (.)!well! retired cos you have to have the money and the time to do these things= } u[m] *A*A 1,081, 8,774, 660, 13,607 10
17 ID01.C [what] type of: tours is it is it [(like a long] ti:me [or¿] A 3,500 1
18 ID01.A [it’s cruise ship] A 782 1

The sequential pattern targeted by the algorithm (same-speaker IPUs interspersed only by pauses, within-turn overlap, and continuers) seems to be highly recurrent: the algorithm combines 34,861 (verbal) IPUs occurring in 20 files into 19,954 turns. That is, 43% of all IPUs are part of multi-IPU turns.
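The grouping logic itself can be sketched in a few lines of R. The sketch below is a simplification, not the actual implementation: it assumes a data frame ipu ordered by onset time with (hypothetical) logical columns is_pause, is_continuer, and is_within_turn_overlap, it contracts only the transcription column and the IPU count rather than all associated temporal and multimodal columns, and it drops rather than re-inserts the intervening pauses.

```r
library(dplyr)

## Rule-based contraction of IPUs into turns (simplified sketch)
contract_ipus <- function(ipu) {
  ipu %>%
    # rows that may intervene without breaking a turn are set aside
    filter(!is_pause, !is_continuer, !is_within_turn_overlap) %>%
    # a new turn starts whenever the (remaining) speaker changes
    mutate(turn_id = cumsum(speaker != lag(speaker, default = first(speaker))) + 1) %>%
    group_by(turn_id, speaker) %>%
    summarise(
      utterance = paste(utterance, collapse = " } "),  # IPUs joined by ' } '
      n_ipu     = n(),
      .groups   = "drop"
    )
}
```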

The data in the Turns component allow us to approach the gaze data with confidence: the data points respect sustained gazing. The total number of gazes to co-participants’ faces is 69,166 (including speakers’ and recipients’ gazes). To illustrate the research potential of the gaze data in the Turns component, we can ask a straightforward question: Are speakers’ gazes to co-participants equally long in dyads as in triads? In asking this question, we follow the lead of Holler et al.’s (2022) recent comparison of turn transition times in dyads v. triads, which discovered significant differences.

The answer is shown in Figure 12.

Figure 12: Durations of speakers’ gazes to recipients in dyads v. triads (after removal of outliers). Broken horizontal line indicates mean. Formula: Gaze duration ∼ Group size + (1 | Speaker) + (1 | File).

As can be seen in Figure 12, speakers’ gazes to co-participants are substantially longer in dyads than in triadic interaction; in a mixed-effects model, the difference in length is highly significant, and it holds on all counts including the mean, the median, and even the mode (Figure 12). What causes this difference? Focusing on triadic interaction, Rühlemann and Pfeiffer (under review) examine gaze alternation (the speaker fixating first one recipient and then another recipient), an affordance categorically unavailable in dyadic interaction. Gaze alternation is found to occur in three sequential contexts: (i) overlap by a not-gaze-selected recipient: towards the completion of their turn, a current speaker who has been gaze-selecting one recipient shifts their gaze momentarily to another recipient, who comes in early, thus overlapping the speaker’s turn; (ii) dialogic enactment: in animating events that involve turn-taking by story characters, the enacting speaker can gaze alternatingly at different recipients “in order to assign different roles [in the enacted situation] to them” (Pfeiffer and Weiss 2022, 34); and (iii) inclusive gaze alternation, the practice of “select[ing] all co-participants as addressees by looking at them alternatingly” (Auer 2018, 207). This practice is specifically adapted to multi-party conversation, with its ever-present danger of schism and marginalization of non-focal participants. There, gaze alternation precisely serves to “avoid schisms or marginalizations of speakers” (Auer 2018, 207). Rühlemann and Pfeiffer’s analysis shows that what all three contexts have in common is that they play out in longer-than-average turns: for example, the average length of turns with inclusive gaze alternation is almost 11 s (median = 6.5 s), whereas the average length of turns in general is between 1.7 and 3.5 s (Hömke et al. 2017, Robinson et al. 2022). The second commonality is that the speaker’s gaze shift from one recipient to another takes time (on average, 400 ms) – time which cannot be spent looking at recipients. What distinguishes the three contexts are their unequal weights: while contexts (i) and (ii) occur occasionally, context (iii), inclusive gaze alternation, is much more frequent. This finding suggests that, for speakers in triadic interaction, a much greater concern than dealing with overlap by non-selected recipients or doing dialogic enactments is ‘keeping everybody involved’ by gaze-selecting them as addressees to mitigate or avoid exclusion. The study as a whole strongly suggests that gaze alternation, in whatever function, explains the differential durations of speakers’ gazes in dyads and triads.
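For readers who wish to fit a comparable model to their own data, the formula given in Figure 12 translates directly into lme4 syntax. The sketch below is not the authors' original script; the data frame gazes and its columns gaze_dur_ms, group_size, speaker, and file are hypothetical names.

```r
library(lme4)

## Mixed-effects model of gaze duration by group size (dyad vs triad),
## with random intercepts for speaker and file, as in Figure 12
m_gaze <- lmer(gaze_dur_ms ~ group_size + (1 | speaker) + (1 | file), data = gazes)
summary(m_gaze)

## Significance assessed via a likelihood-ratio test against the null model
m_null <- lmer(gaze_dur_ms ~ 1 + (1 | speaker) + (1 | file), data = gazes)
anova(m_null, m_gaze)
```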

4 FreMIC-based visualizations and work-in-progress

4.1 Visualizations

In Sections 4.1.1–4.1.3, we introduce some novel visualization techniques for gaze data. All visualizations are computed in R. For researchers interested in reproducing them or adapting them to their own needs, the relevant scripts will be shared upon request.

4.1.1 Speech-Gaze transcripts

A novel visualization concerns multimodal transcription, specifically representations of gaze in transcriptions of talk-in-interaction. Various conventions have been offered to integrate gaze movements into transcripts of talk-in-interaction (e.g. Goodwin 1984, Rossano 2012, Mondada 2016, Rühlemann et al. 2019, Auer 2021, Laner 2022). While all prior methods rely on manually annotating gaze in transcripts, a procedure that requires much extra work by the researcher and often produces hard-to-read results, the method presented here is both visually intuitive and computationally automated; that is, with the R code for it in place, the researcher just selects the turn or sequence of turns for which a multimodal transcript is needed and runs the code. For illustration, let us stick with the turn discussed earlier and visualize not only the speaker’s gazes during that turn but also everybody’s gazes during the sequence of which the turn is an element. The multimodal speech-gaze transcript of the sequence is shown in Figure 13.
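The following ggplot2 sketch illustrates the basic idea behind such an automated transcript. It is a schematic approximation only: the data frames talk (columns speaker, onset, utterance) and gaze (columns gazer, onset, offset, target) are hypothetical, and the published transcripts contain considerably more detail than this layout suggests.

```r
library(ggplot2)

## One horizontal lane per participant: gaze segments coloured by gaze target,
## with the corresponding utterances printed above the lanes
ggplot() +
  geom_segment(data = gaze,
               aes(x = onset, xend = offset, y = gazer, yend = gazer, colour = target),
               linewidth = 3) +
  geom_text(data = talk,
            aes(x = onset, y = speaker, label = utterance),
            hjust = 0, vjust = -1.2, size = 3) +
  labs(x = "Time (s)", y = NULL, colour = "Gaze target") +
  theme_minimal()
```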

Figure 13: Speech-gaze transcript of file F01 [Lines 2:9].

The turn discussed earlier represents an extension of a Q&A sequence, with A inquiring about C’s mother’s age: “how old’s your mom¿”, after which a long gap (0.855 s) ensues before Speaker C starts the answer somewhat hesitantly with ‘eh’ and then gives the age: sixty:::-one. Speaker A’s turn “like I mean she’s not like in the (.) <!risk! age group>” extends the sequence in that it makes the implications of C’s mother’s age explicit: at age 61, she is not in the age group at (highest) risk of contracting the coronavirus.

As regards gaze, the transcript shows that Speaker B, gazes to whom are highlighted in green, is not gazed at by either Speaker A or Speaker C during the Q–A sequence; only during Speaker A’s sequence-extending turn is she briefly gaze-addressed by Speaker A. The exclusion of Speaker B from the gazes by A and C during the Q–A sequence is arguably due to the fact that the question is addressed by A only to C: it is C’s mother’s age that A wants to know. So Speaker A’s gaze during the question serves to select C, rather than B, to give the answer (on the issue of selection, see the work-in-progress report in Section 4.2). Interestingly, Speaker B retracts her gaze from A during the gap following the question and during C’s answer, as if to signal her not being involved in the completion of the Q–A sequence. Her gaze returns to A during A’s acknowledgment token “yeah … yeah yeah” and remains steady on her during the sequence-extending turn.

All of these (and possibly more) observations are facilitated by the multimodal Speech-Gaze transcript, which shows exactly when gazes start and end and who gazes at whom when. The Speech-Gaze transcript thus represents a viable and cost-effective alternative to traditional multimodal transcripts.

4.1.2 Gaze activity plots

Another novel visualization technique afforded by the unpacking of the multimodal data strings is the Gaze Activity plot. This plot type shows the participants’ gaze behaviour for long stretches of transcription. In Figure 14, for example, we see all AOI gazes for all three participants in the first 10 min of recording F01. Note that the extended Q–A sequence discussed in Section 4.1.1 and shown in the Speech-Gaze transcript in Figure 13 is highlighted in the black rectangle in the left upper corner of Figure 14.

Figure 14: Gaze activity plot F01 (minutes 0:10).
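A plot of this kind can again be produced with a few lines of ggplot2 code. The sketch below assumes a hypothetical data frame aoi with one row per AOI gaze and columns gazer, target, onset_min, and offset_min; it is meant only to convey the layout, not to reproduce Figure 14.

```r
library(ggplot2)

## Gaze activity plot: one panel per gazing participant, time running downwards,
## AOI gazes drawn as coloured vertical segments
ggplot(aoi, aes(x = target, ymin = onset_min, ymax = offset_min, colour = target)) +
  geom_linerange(linewidth = 2) +
  scale_y_reverse(name = "Time (minutes)") +
  facet_wrap(~ gazer) +
  labs(x = "Gaze target", colour = "Gaze target") +
  theme_minimal()
```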

Such gaze activity plots give the researcher valuable insights into gaze patterns and can help single out stretches of particular interest. Notice, for example, how in minutes 2–7 (indicated in the left margin) both Speakers B and C keep their gaze fixated on A with only a few short interruptions. The pattern of sustained gaze by Speakers B and C is particularly strong in minute 3. What happens in the interaction during that time?

The talk-in-interaction during most of minute 3 of file F01 is shown in extract (6) from the Turns component:

(6)

Speaker Utterance Timestamp
ID01.C [when was] that¿= 00:03:00.054 – 00:03:00.668
ID01.A [yes] 00:03:00.135 – 00:03:00.555
NA (0.163) 00:03:00.668 – 00:03:00.831
ID01.A =!this! was on °Wednesday° 00:03:00.831 – 00:03:01.720
NA (0.273) 00:03:01.720 – 00:03:01.993
ID01.C oka[y] 00:03:01.993 – 00:03:02.398
ID01.A ↑[IN] THE PAST↑ } (0.486) } every time this has ever happened here’s how it plays out } (0.364) } I’m walking towards a womens’ bathroom someone says (.) ∼↑wait stop this is a womens’↑ bathroom∼ (.) and I turn and look at them and they go ∼((v: gasps)) Oh I’m sorry I’m sorry I’m sorry I’m sorry∼ like ∼go ahead go ahead go ahead∼ (.) because like!this! } (0.345) } is!not! a man’s face= 00:03:02.330 – 00:03:18.858
NA (0.177) 00:03:18.858 – 00:03:19.035
ID01.B =°ri:ght° 00:03:19.035 – 00:03:19.331
NA (0.424) 00:03:19.331 – 00:03:19.755
ID01.A !yes! I’m six foot (.)!yes! I’ve got short hair (.)!yes! I wear men’s clothes I get it in like a split second that you’re like (.) ↑that’s a dude↑ wait no that’s a lady } (0.697) } ↑she gra:bs my elbow (.) I turn to!look! at her↑ and she’s like ∼this is a (.) womens’ bathroom you can’t go in there∼ } (0.534) } and I’m like ∼((silent f: blank stare))∼ (.) and she didn’t, she was just like ∼you can’t go in∼ (.) I’m like ∼I’m a!woman!∼ she said ∼no you’re not you can’t go in∼ } (0.487) } I’m like ∼[((silent f: annoyed stare))]∼ 00:03:19.755 – 00:03:43.000
ID01.C [((v: gasps)) she said] ∼no you’re [not?]∼ 00:03:42.005 – 00:03:43.600
ID01.B [((v: gasps)) she said] ∼no you’re [not?]∼ 00:03:42.010 – 00:03:43.600
ID01.A ∼<[NO YOU ‘RE] NOT (.) you can’t go!in!>∼ 00:03:43.400 – 00:03:45.900
NA (0.260) 00:03:45.900 – 00:03:46.160
ID01.C show her [your boobs] next time 00:03:46.160 – 00:03:47.395
ID01.A [tt] 00:03:46.504 – 00:03:46.812

Clearly, Speaker A is engaged in storytelling, specifically a trouble telling about her being mistaken for a man and consequently being disallowed from using a public bathroom for women. Speakers B and C produce empathic responses ranging from incredulous repetitions of the toilet lady’s act of denial (“she said ∼no you’re not?∼”), said in unison by both speakers, to sarcastic advice to the storyteller to “show her your boobs next time”, as well as extended laughter. Given the engaging story and the story recipients’ empathizing with it, it is not surprising that the recipients’ gazes should be largely fixated on the storyteller. This line of inquiry can be pursued further with Gaze Path plots, described in the next section.

4.1.3 Gaze path plots

As noted, eye tracking not only produces AOI data (whether or not the gazes hit the AOIs) but also continuously records the X/Y coordinates of the foveal vision. This type of eye tracking data is quite unlike the AOI data: while the latter are discrete (either AOI or not AOI), X/Y coordinate gaze data are continuous, produced on average every 12 ms. While the AOI data thus allow us to see when and for how long a participant looked at particular areas, the X/Y data allow us to comprehensively track gaze movements over time.

For example, while it is well known that non-current speakers gaze at the current speaker more than the other way around (e.g. Kendon 1967, Goodwin 1980), little seems to be known about how this asymmetry plays out in storytelling. To illustrate, consider the Gaze Path plots in Figure 15; they depict not only the storyteller’s but also the recipients’ gaze trajectories.
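A gaze path of this kind can be drawn with geom_path, colour-coding temporal progression. The data frame xy with columns x, y, time_ms, and participant is a hypothetical stand-in for the eye tracker's export.

```r
library(ggplot2)

## Gaze path plot: gaze movement from one X/Y sample to the next,
## coloured from green (story-early) to red (story-late), one panel per participant
ggplot(xy, aes(x = x, y = y, colour = time_ms)) +
  geom_path(linewidth = 0.3) +
  scale_colour_gradient(low = "green", high = "red", name = "Time (ms)") +
  facet_wrap(~ participant) +
  theme_minimal()
```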

Figure 15: Gaze path plots for storyteller and recipients in the story ‘Sad story’ in F16; X/Y scales aligned for all panels per row; lower row: close-ups on narrow X/Y range; curved lines depict gaze movement from one X/Y coordinate datum to the next; curves are colour coded for temporal progression ranging from green for story-early to red for story-late gaze points.

First consider the upper row in Figure 15; it shows all X/Y values recorded for all three participants. Note that for comparison, the X/Y coordinate scales of the three panels have been aligned. The result is striking: while the storyteller’s gaze fans out in all directions of their visual space, the recipients’ gazes can barely be seen as they are steadfastly directed to a very small section of their visual field. It almost appears as if they did not move their gaze at all.

The lower row in Figure 15 shows the gaze paths in close-up resolution. Here, then, we see that the recipients’ gazes are far from arrested; compared, however, to the storyteller’s gaze movements, which reach far beyond the plotting margins, they are restricted to a very small radius.

Which ‘target’ are the two recipients fixating their gaze on? We hypothesize that, in storytelling, they will most likely look at the storyteller. But at this point, based only on the X/Y data, this is merely a hypothesis. To test it, we can synchronize the X/Y data with the AOI data. This synchronization allows us not only to see all gaze observations, to see how they travel across the participant’s field of vision, and to obtain a sense of how the gaze paths relate to the storytelling’s progression through time; it also allows us to discover which X/Y gazes are at the same time gazes to AOIs, that is, to co-participants.
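Synchronizing the two data types amounts to an interval lookup: each X/Y sample is assigned the AOI gaze (if any) within whose interval its timestamp falls. The base-R sketch below assumes hypothetical data frames xy (time_ms, x, y) and aoi (onset_ms, offset_ms, target) for a single participant, with the AOI intervals sorted and non-overlapping.

```r
## Assign each X/Y gaze sample the AOI (if any) whose interval contains it
sync_xy_aoi <- function(xy, aoi) {
  # index of the last AOI interval starting at or before each sample
  idx <- findInterval(xy$time_ms, aoi$onset_ms)
  # the sample counts as an AOI gaze if it also precedes that interval's offset
  inside <- idx > 0 & xy$time_ms <= aoi$offset_ms[pmax(idx, 1)]
  xy$aoi <- ifelse(inside, aoi$target[pmax(idx, 1)], NA)
  xy
}
```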

As shown in Figure 16, all X/Y gazes by Recipient A are at the same time AOI gazes – that is, the recipient’s gaze is completely fixated on the storyteller. Recipient B, by contrast, does allow his gaze to wander off the storyteller’s face, particularly towards the end of the storytelling. During the main bulk of the telling, however, his gaze is fixated on the storyteller too.[7]

Figure 16: Integrated X/Y gaze path and AOI plots of recipients’ gazes in ‘Sad story’; curved lines depict gaze movement from one X/Y coordinate datum to the next; curves are colour-coded for temporal progression (green for story-early gazes and red for story-late gazes); blue crosshairs depict AOI gazes (Note: to reduce the effect of over-plotting, the blue crosshairs are plotted with large colour transparency, which may make the detection of isolated crosshairs difficult).

To judge from the example visualized in Figures 15 and 16 (and many other examples), it appears that the recipients’ gaze behaviour in storytellings is firmly synchronized and aligned: their gazes perform the same gesture – namely, focusing on the storyteller. Storytelling, then, may provide the storyteller with a means of galvanizing gaze and, hence, attention, and provide story recipients with a site in which to align and synchronize their behaviour towards the storyteller. In storytelling as a shared activity, the ‘eye-to-eye ecological huddle’ (Goffman 1964, 95) is pulled together very closely.

4.2 A pilot study on turn-taking, gaze, and cognitive intensity

In this section we report on a pilot study carried out by Mathias Barthel and the present first author (Rühlemann & Barthel, under review). The study triangulates turn-taking, gaze and cognition. The brief report here may serve as an illustration of how different FreMIC annotation layers and corpus components can be combined to ask hitherto unasked questions. The pilot study draws on the following FreMIC resources:

(i) CA transcriptions of talk-in-interaction (cf. Section 2.1)

(ii) Q&A annotation (cf. Section 2.1.6)

(iii) Turns component (cf. Section 3)

(iv) AOI gaze data (cf. Section 2.2.1)

(v) Pupillometric data (cf. Section 2.2.2.1)

The starting point of the analysis relates to the concept of next-speaker selection, which refers to the practices and mechanisms that participants in a conversation use and rely on to determine who should speak next. While this concept is central in Conversation Analysis, there is little empirical evidence to suggest that next-speaker selection has a psychological base in the sense that selected speakers register their being selected, orient to it, and, as a result, take the next turn. One of the few quantitative analyses of how speakers’ use of next-speaker selection methods correlates with next speakers actually taking the turn relates to gaze as a next-speaker selection method in German ‘ihr’ Q&A sequences. There it turns out that “in a vast majority (74%) of cases”, the last-gazed-at participant was at the same time the answerer (Auer 2021, 25; cf. also Lerner 2003 and the follow-up study by Auer, Rühlemann, Gries & Ptak, in preparation). This finding has implications for theories of turn-taking, as it suggests that self-selection – sub-rule (IIb) in Sacks et al.’s (1974) model – is overstated. However, the finding is not as conclusive as it could be: the fact that the question-final gaze to one recipient (vs. to another recipient) co-occurs with the answer by that recipient (vs. by another recipient) describes a correlation, not a causation. If we maintain that gaze is an other-selection technique, we tacitly make assumptions about participants’ intentions and states of mind. That is a stretch, as, in actual fact, we do not know whether the questioner intends to select the recipient that does answer, and we do not know whether the answerer answers because he or she feels selected by having been last gazed at. We at least start filling this gap by drawing not only on gaze data but also on pupillometric data, which, as noted, open up a window on processes in the mind, specifically planning activity.

The study focuses on questions in triadic interaction (cf. Section 2.1.6). Questions are an ideal testing ground for next-speaker selection, as asking a question puts pressure on the addressee of the question to respond, thus leading to turn transition. The study examines the Steeper Increase Hypothesis: it predicts a steeper increase in planning effort in the selected recipient than in the not-selected recipient. The rationale for this hypothesis is that if participants to conversation orient to being selected, the selected recipient should invest more cognitive effort in planning a response than the recipient that is not selected.

The Steeper Increase Hypothesis was tested on 194 questions where the answerer was multiply selected in the sense that the answerer was already pre-selected sequentially prior to the question and was additionally gaze-selected during the question. A case in point is the question in line 6 in extract (7):

(7)

Speaker Utterance
1 ID12.A so I I like <hydraulics,> and I did it in my bachelor's more, than I've done at my: } (1.197) } uh master's program, [but]
2 ID12.C [mhm]
3 ID12.A °it's not too bad I guess° (.) [°°it's°°]
4 ID12.B [mhm]
5 ID12.A the =the basic stuff is pretty (.) I have a pretty good understanding of, } um } (0.838) } but I've never written anything remote(h)ly like a(hh) !the!sis, so I don't know how I'm gonna write like } (0.281) } fifty pages to a hundred pages or whatever but } (0.379) } I'll figure it out [°like i-°]
6 ID12.B [so you] didn't have to write one for [your bachelor either?]
7 ID12.A [ no no we ] did like a group project it was like a hundred pages most of i(h)t was l(h)ike pictures and

Prior to the extract, Speaker A has told his interlocutors that he is doing a master’s in Renewable Energy. His specific topic involves hydraulics, a field he admits he is not very experienced in. In the extract, he elaborates on this lack of experience. In line 1, he says he hasn’t been uh } (0.769) } putting enough time into it. In line 3, he qualifies his expertise as not too bad I guess. In line 5, he deplores that in hydraulics he has never written anything remote(h)ly like a(hh) !the!sis. B’s question in line 6 [so you] didn't have to write one for [your bachelor either?] relates back to the just-stated information that he has never written a thesis-like text about hydraulics. The question thus is a follow-up question that squarely depends on the prior sequence, as evidenced by the pragmatic marker ‘so’, which indicates that the forthcoming utterance ties back to the preceding utterance(s) as its starting point, and by the substitute form ‘one’, whose meaning is only decipherable with recourse to the prior context. In addition to being selected during the question by B’s gaze fixation, recipient A is, then, also sequentially selected to respond to the question. Does this multiple selection impact the planning processes of the two recipients?

As shown in Figure 17, there is a far steeper increase in pupil size for the selected recipient (A) than for the not-selected recipient (C). To what extent does this pattern generalize?

Figure 17: Pupillary responses by recipients A (selected recipient) and C (not-selected recipient) to the question “[so you] didn’t have to write one for [your bachelor either?]”. Solid lines indicate the overall trend of pupil sizes during the question; dotted lines indicate observed pupil sizes.

As shown in Figure 18, which plots the modelled pupil sizes of multiply-selected vs not-selected recipients during 194 questions, the Steeper Increase Hypothesis is fully supported.

Figure 18: Modelled pupil sizes by multiply-selected recipient vs not-selected recipient during question (n = 194). Pupil size values were binned in intervals of 10 values each; means were calculated from these intervals. Formula: Pupil size interval means ∼ 1 + poly(position_rel, degree = 2) * Role + (1 | Speaker) + (1 | File).
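The model reported in Figure 18 can be sketched in lme4 along the following lines; the data frame pupil and its columns bin_mean (mean pupil size per 10-value bin), position_rel (relative position of the bin within the question), role, speaker, and file are hypothetical names, not the ones used in the study.

```r
library(lme4)

## Quadratic growth of binned pupil size over the course of the question,
## interacting with recipient role (selected vs not selected),
## with random intercepts for speaker and file
m_pupil <- lmer(bin_mean ~ 1 + poly(position_rel, degree = 2) * role +
                  (1 | speaker) + (1 | file),
                data = pupil)
summary(m_pupil)
```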

The figure shows that pupils in selected recipients dilate whereas pupils in not-selected recipients contract during the current turn. We take this finding as evidence that speech planning intensifies in the recipient that is selected to respond and that speech planning de-intensifies in the recipient that is not selected.

Overall, the study suggests that next-speaker selection in conversational interaction has a cognitive base: participants orient to selection, synchronizing their speech planning in line with it, either stepping it up when selected or reducing it when seeing another recipient being selected for the next turn.

CA analyses are based on orientation as displayed in conduct. This analysis suggests that orientation has a private mental correlate. The displayed orientation – in this case, the fact that the selected recipient takes the turn whereas the not-selected recipient does not – is inseparably connected to and, indeed, due to orientation that remains undisplayed and that precedes displayed orientation.

The study breaks new ground in that it demonstrates the usability of pupillometric data not in lab settings but in unconstrained conversation, and in that it demonstrates that pupillometry can successfully be triangulated with automatically generated multimodal (AOI gaze) data and talk-in-interaction (transcription) data. We are confident that FreMIC will allow more such examinations of speech planning in multi-party talk-in-interaction.

5 Concluding remarks

FreMIC is still in its early days. New recordings are still being made and transcribed at the time of writing to further enlarge the corpus; also, ways to improve data accuracy, especially as regards gaze detection, are still being experimented with, and more layers of multimodal data will be added, especially gesture annotations.

The basic architecture, though, stands. FreMIC provides conversational data transcribed in great detail, far greater than in most corpora, and it offers a wealth of additional multimodal data not found in any other corpus. The corpus is structured in such a way that one type of data – transcribed talk-in-interaction – can be seamlessly related to the other – multimodal data – allowing corpus users to explore the intricate ways in which participants in casual conversation intertwine the different modalities to create complex meanings and complex interaction. FreMIC thus represents a first major corpus attempt at reaching beneath the verbal level, beneath, that is, the tip of the iceberg.

As such FreMIC may be useful for a wide range of researchers, including hard-core corpus linguists, researchers with an orientation to corpus pragmatics, interactional linguists, and even conversation analysts.

FreMIC will not be made public in the near future. However, we are planning to apply for funding that would allow us to create – following the (mono-modal) BNCweb; cf. Hoffmann et al. (2008) – FreMICweb, a web interface for FreMIC where users could freely work with the data (in as much as the participants in the recordings have given their consent) and which would support typical corpus-linguistic methods on a multimodal level. For example, while mono-modal corpus linguists work with collocations defined as word–word combinations, users could search for word–gesture, gesture–gaze, or word–gesture–gaze combinations. Before the necessary funding is obtained and the interface is created, the data frames, all necessary R scripts, as well as the raw data underlying FreMIC will be happily shared with researchers upon request. It is also hoped that others pursuing the construction of multimodal corpora will take their bearings from the novel and promising ways in which FreMIC may push the corpus-linguistic study and understanding of multimodality in interaction as well as, of course, from the ways in which it may fail to do so.

Acknowledgments

The construction of FreMIC was facilitated by a grant from the Deutsche Forschungsgemeinschaft (DFG; grant number: 414153266).

  1. Conflict of interest: Authors state no conflict of interest.

Appendix 1

FreMIC settings for AOI detection

1. Blinks:

When a person blinks, the eye is closed for several milliseconds, meaning that no pupil detection is possible for this period. If the person blinks while glancing at an AOI, this will lead to a split in the recorded glance: AOI–blink–AOI. It can, however, be argued that the person is still ‘looking’ at the AOI even during the blink. To reflect that sustained gaze direction across the blink, the following procedure is in place:

If:

a person gazes at an AOI called ‘X’ (v., say, ‘Y’),

this is followed by a period of less than 410 ms without any pupil detection,

before the person’s gaze is again detected on the same AOI ‘X’ (v. ‘Y’),

then:

the period between the two AOI gazes is interpreted as a blink and

the AOI gazes prior to and following that period are merged into a single AOI gaze to ‘X’.

The maximum duration/default setting for this procedure in D-LAB is 300 ms (ISO 15007) (cf. https://my.hidrive.com/lnk/1gnnoksg#file). In FreMIC, we have adopted Hömke et al.’s (2017, 4) threshold of 410 ms.

2. Fly Throughs:

A fly-through is a very short glance at an AOI, which is not (to be considered) a gaze fixation. The person’s gaze simply ‘flies’ across the AOI.

The default setting (in D-LAB) for this is 120 ms (ISO 15007); in FreMIC, it is 100 ms, as the lowest threshold for fixations.

→ A gaze to an AOI that is shorter than 100 ms is not counted as an AOI-gaze

Franchak, J. M., & Adolph, K. E. (2010). Visually guided navigation: Head-mounted eye-tracking of natural locomotion in children and adults. Vision research, 50(24), 2766–2774.

Rigby, S. N., Stoesz, B. M., & Jakobson, L. S. (2016). Gaze patterns during scene processing in typical adults and adults with autism spectrum disorders. Research in Autism Spectrum Disorders, 25, 24–36.

→ An * within an AOI coding (e.g. C gazes at: A*B) can therefore mean:

Blink >410 ms

Gaze aversion/no gaze at an AOI

Gaze at an AOI for <100 ms

Failed AOI detection (marker covered by gesture, hair etc.; no pupil detection)

3. Tremors:

Even when the gaze is fixated on an object, there can be involuntary micro-motion of eye gaze (so-called fixation tremor). Similarly, AOI polygons have been observed to flicker at times (due to unstable QR-code marker detection). Either way, tremors of eye gaze or of AOI polygons can result in splitting up what is actually a sustained fixation, thus creating multiple observations where there is really just one (e.g. a C*C*C sequence instead of a prolonged C sequence). To keep tremor-related interference to a minimum, gazes to the same co-participant interrupted by an away-gaze (*) of ≤150 ms were melted into a single gaze.
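The three post-processing rules can be approximated in a single pass over the AOI intervals. The sketch below is a simplification of the settings described above: it assumes a per-participant data frame aoi with (hypothetical) columns onset_ms, offset_ms, and target, where target is a co-participant label (e.g. 'A'), '*' for away-gaze, or NA for periods without pupil detection. A full implementation would iterate the steps (a bridged blink may, for instance, create a new tremor-like interruption) and would also carry along durations and any further columns.

```r
library(dplyr)

clean_aoi <- function(aoi) {
  aoi %>%
    mutate(
      dur  = offset_ms - onset_ms,
      prev = lag(target),
      nxt  = lead(target),
      target = case_when(
        # 1. Blinks: a no-detection period of < 410 ms flanked by gazes to the
        #    same AOI is treated as part of that gaze
        is.na(target) & dur < 410 & !is.na(prev) & prev != "*" & prev == nxt ~ prev,
        # any remaining no-detection period is rendered as '*'
        is.na(target) ~ "*",
        # 2. Fly-throughs: AOI gazes of < 100 ms do not count as AOI gazes
        target != "*" & dur < 100 ~ "*",
        # 3. Tremors: away-gazes of <= 150 ms between gazes to the same
        #    co-participant are melted into that gaze
        target == "*" & dur <= 150 & !is.na(prev) & prev != "*" & prev == nxt ~ prev,
        TRUE ~ target
      ),
      # collapse now-adjacent identical targets into single gazes
      run = cumsum(target != lag(target, default = first(target)))
    ) %>%
    group_by(run, target) %>%
    summarise(onset_ms = min(onset_ms), offset_ms = max(offset_ms), .groups = "drop")
}
```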

Appendix 2

Coding Manual for Gesture and Gesture Phase Annotation in ELAN

1. What gestures are annotated?

The annotation focuses on communicative gestures, i.e. gestures that have semantic, pragmatic, or interactive meaning(s). We do not annotate adaptors, i.e. manual movements that serve a self-soothing function.

2. How are the gestures described?

Gesture descriptions all contain the following three components (in that order) (Figure A1):

  1. a description of the articulators used (head, face, hands, shoulders, torso),

  2. a description of the articulator’s shape and orientation at the end of the stroke, and

  3. a description of the location where the stroke takes place.

  • Articulators: the possible articulators are abbreviated as follows:

    1. m = hand, e = eye, f = face, h = head, s = shoulder, t = torso.

    2. In addition, the abbreviations l (left) and r (right) are used for specifying which hand is being used, and the abbreviation b (both) is used when both hands are used.

  • Shape and orientation: In the case of a static gesture, the hand’s shape and orientation are described considering which direction the palm is facing, whether the palm is open, and the possible combinations of fingers curled and extended (e.g. as in a pointing gesture). If the gesture is dynamic, then a specific quality of the movement is described as well, in addition to shape and orientation (e.g. a cyclical gesture).

  • Location: The location where the gesture occurs is described based on McNeill’s (1992) gesture space schema, which is presented below. Static gestures are annotated with the notation @ (for ‘at’), followed by an abbreviation of the different gesture spaces (e.g. ‘ctct’ for centre-centre). For dynamic gestures, the different gesture spaces are also used for describing the range of motion of each gesture, such as ‘from ctct to rperi’ (from centre-centre to right periphery)

Figure A1: McNeill’s (1992) Gesture Space Schema adapted from Ladewig (2011).

To illustrate how all these parameters are combined in a single notation, consider the following example:

((m: r h snapping of fingers @rperi))

The complete notation reads ‘manual gesture with right hand, snapping of fingers at right periphery’.

3. How are gesture phases annotated?

Gestures are annotated according to different phases of their production, including an obligatory stroke phase, and optional phases such as pre-hold, prep, hold, and relax. We consider the stroke as the only obligatory phase to account for instances of consecutive gestures, since a new gesture may begin directly after the stroke from a previous gesture without a hold or relax phase. The pre-hold and prep phases can occur before the stroke, and the hold and relax after the stroke. The definition and distinctions between the phases are as follows:

  • Stroke: The only obligatory phase, since it contains the most semantically important part of the gesture and is its core. The end of the stroke can be performed in:

    1. static form (the hand assumes a particular shape and orientation, then stops), or

    2. dynamic form (the hand assumes a particular shape and orientation, then performs a dynamic movement)

  • Pre-hold: This is defined as a movement of the hand away from a resting position to a stationary position that precedes the stroke, but does not seem to be directly connected to the stroke with regard to hand orientation and shape.

  • Prep: The preparation phase is defined as an optional phase, in which the hand is moved from the rest position to an intermediary position in the gesture space before the stroke phase begins. This phase rests on the assumption that a gesturer will prefer to produce gestures in the most economical form, moving from a resting position directly to the stroke. Thus, the prep phase indicates that the gesturer has found it necessary, with regard to gesture expressivity, to include this intermediary step. The following illustrations exemplify the differences between a pointing gesture with and without a prep phase.

As is shown by the ‘Without preparation’ illustration in Figure A2, the hand moves from the resting position 1 towards the end of the stroke at 5, and the hand also adopts the pointing shape progressively as it moves forward, as can be seen in 3.

Figure A2: Illustration of gesture without (upper panel) and with (lower panel) preparation phase; colour coding of hand skeleton is time-aligned: from green signifying gesture onset to red signifying gesture end.

In the ‘With preparation’ illustration, the hand first moves from the 1 position towards 3 in a preparatory movement, without assuming the pointing shape until 4. In this scenario, the movement between 1 and 3 is considered as part of the prep, while the movement between 3 and 5 is the stroke.

  • Hold: In this phase, the hand is held at the position of the end of the stroke, including the hand orientation and shape. This phase can be relatively long, and a slight degree of movement may be observed, especially when beats are present – simple up-and-down flicks that usually correspond with the ‘rhythm of the speech’. This phase is also optional, since the hand can move into a relax phase after the stroke, or begin a new gesture.

  • Relax: In this phase, the hands are brought to a relaxed or resting position. While the most common resting positions are the armrests or the lap, a resting position can be identified in different gesture spaces, such as, for instance, resting a hand over the participant’s own neck. Such instances may be related to self-soothing hand activities, such as playing with one’s own hair.

Appendix 3

Coding Manual for Gesture Expressivity Index

1. To be coded in ELAN:

Viewpoint:

CV: Gesture is carried out as if the gesturer slipped into the role of the character (character viewpoint; coding value = 1); alternatively, the gesture is performed from the gesturer’s own perspective, as observed by them (observer viewpoint; coding value = 0) (cf. McNeill 1992).

Applies only to representational gestures.

A good diagnostic is transitivity: character viewpoint gestures occur significantly more often with transitive constructions, where the verb takes an object (“She hit the ball hard”), than with intransitive constructions, where the verb does not take an object (“He’s jumping up and down”) (McNeill 1992, Beattie 2016).

Ask yourself: are the Gesturer’s hands his/her own hands or are the Gesturer’s hands (meant to represent) the character’s hands?

Size:

In judging gesture size, we distinguish lateral and forward movements.

In judging size in lateral movements, we will base our judgements on McNeill’s (1992) gesture space schema:

We consider a movement sizable if it crosses at least two major lines (i.e. solid lines in Figure A3); e.g. from CENTRE-CENTRE to PERIPHERY, or from EXTREME PERIPHERY to CENTRE.

Figure A3: McNeill’s (1992) Gesture Space Schema adapted from Ladewig (2011).

NB: If the onset of a gesture is not at the ‘normal’ rest position (i.e. in the speaker’s lap or on the arm rest) but at some other point in the gesture space, we will consider that onset as the starting point of the gesture’s trajectory and count from there how many major boundaries it crosses. For example, if the gesture’s onset is in the right EXTREME PERIPHERY and moves back to PERIPHERY, one single major boundary is crossed and the movement will be considered not sizable.

Gesture size can also become expansive if the orientation of the hands and arms is from the gesturer’s body into the space in front of them, i.e. if they extend their hands towards the interlocutor, as shown in Figure A4.

Figure A4: 45° angle forward movement.

The speaker extends her arms forward at a 45° angle. We consider this extension as the demarcation point between sizable and not sizable: the forward movement will be considered sizable if the speaker extends her arms beyond the 45° angle; it will be considered not sizable if she does not extend her arms beyond the 45° angle.

SZ: Gesture is expansive (coding value = 1, if expansive; coding value = 0, if not expansive; NA, if undecidable).

Force:

FO: Gesture requires muscular effort (coding value = 1, if muscular effort is required; coding value = 0, if muscular effort is not required). To obtain a sense of whether muscular effort is involved, annotators physically re-enact the gesture.

Silent gesture:

Silent gestures, alternatively referred to as ‘speech-embedded non-verbal depictions’ (Hsu et al. 2021), are gestures that communicate meaning “iconically, non-verbally, and without simultaneously co-occurring speech” (Hsu et al. 2021, 1). With the (default) verbal channel muted, the burden of information is completely shifted to bodily conduct (cf. Levinson and Holler 2014, 1). This shift makes silent gestures particularly expressive: they are ‘foregrounded’ and ‘exhibited’ (Kendon 2004, 147). Actively attending to them is a prerequisite for the recipient’s understanding. Moreover, given that the occurrence of speech is expected, its absence will not only be noticeable but also emotionally relevant, as the omission of an expected stimulus has been shown to increase EDA response (Siddle 1991, 247).

Coding procedure in ELAN:

  1. Start by creating a new tier, using the suffix -I, e.g. ID12.A-I.

  2. On the I-tier, make a copy of the gesture as annotated on the -G tier (make sure the I-tier annotation has exactly the same length as the G-tier annotation).

  3. Insert this exact default coding onto each gesture’s I-tier:

    ((i: CV = 0, SZ = 0, FO = 0, SL = 0))

    That is, we always keep the same order of labels from left to right and all coding values are initially set to 0. Code each parameter separately, asking yourself “Is this gesture (a CV gesture/expansive/forceful/silent)?” If the answer is “(rather) yes”, set the value to 1; if the answer is “(rather) no”, leave the value as is; if you cannot make a decision, change 0 to NA.

  4. For each coding category separately, play each gesture as often as is necessary to determine whether the coding value needs to be changed to 1 or NA or whether 0 should be kept.

2. To be coded in R:

MA: Gesture is performed by multiple articulators

This variable is readily available from the ELAN gesture annotation, where, as noted earlier, the description of the gesture is preceded by the initial(s) of the articulating organ(s). If there is a single initial, the hand movement is the only visible bodily conduct; if there is more than one initial, the gesture is performed in synchrony with other bodily articulators, e.g. the head, the face, the torso, and, in some cases, the eyes (see below). The rationale for the inclusion of this variable in the index is again Dael et al.’s (2013) finding that arousal is positively correlated with the amount of bodily movement. This variable also allows us to take facial expression into account.

ND: Nucleus duration is greater than story-average.

As noted, the nucleus, as the combination of stroke and optional hold, “carries the expression or the meaning of the gesture” (Kendon 2004, 112). We assume that if that expression or meaning is actively displayed to the interlocutor in a prolonged manner, the interlocutor’s perception of the expression or meaning will be facilitated. As a reference measure, the gesture’s nucleus duration is compared to the mean nucleus duration in the storytelling. The value 1 is obtained if the duration is greater than the mean nucleus duration.

HO: Gesture contains a hold phase

Subsuming the presence of a hold phase under expressive gesture dynamics draws on the absence of movement, which is assumed to gain saliency considering the lack of progressivity manifested in the hold. Given the preference for progressivity (Stivers and Robinson 2006, Schegloff 2007), which we assume extends to a preference for progressivity in bodily conduct, the uninterrupted execution of a gesture can be seen as preferred, aligned with the default expectation of progressive movement, whereas execution interrupted by a hold phase will be seen as disaligned with that default expectation and hence dispreferred. As a dispreferred, the gesture hold, just like a ‘hold’ during speech, “will be examined for its import, for what understanding should be accorded it” (Schegloff 2007, 15) and is therefore likely to raise attention and add to the saliency of the gesture.[8] Also, while the overwhelming majority of gestures are not gaze-fixated, i.e. not taken into the foveal vision, but processed based on information drawn from the parafoveal or peripheral vision, those gestures that contain a hold phase, i.e. a momentary cessation in the movement of the gesture, reliably attract higher levels of fixation (Gullberg and Holmquist 2002). Beattie argues that “[d]uring ‘holds’, the movement of a gesture comes to a stop and thus the peripheral vision is no longer sufficient for obtaining information from that gesture, thus necessitating a degree of fixture” (Beattie 2016, 131).
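How the three R-coded variables could be derived is sketched below; the data frame gestures and its columns annotation (the ELAN gesture description), nucleus_dur_ms, has_hold, and story_id are hypothetical stand-ins for the unpacked ELAN annotations.

```r
library(dplyr)

gestures <- gestures %>%
  mutate(
    # articulator initials = everything between '((' and the colon, e.g. "m" or "m h"
    initials = sub("^\\(\\(\\s*([a-z ]+):.*$", "\\1", annotation),
    # MA: gesture performed by more than one articulator
    MA = as.integer(lengths(strsplit(trimws(initials), "\\s+")) > 1),
    # HO: gesture contains a hold phase
    HO = as.integer(has_hold)
  ) %>%
  group_by(story_id) %>%
  # ND: nucleus duration greater than the storytelling's mean nucleus duration
  mutate(ND = as.integer(nucleus_dur_ms > mean(nucleus_dur_ms))) %>%
  ungroup()
```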

References

Auer, P. 2018. “Gaze, addressee selection and turn-taking in three-party interaction.” In Eye-tracking in interaction. Studies on the role of eye gaze in dialogue, edited by Geert Brône and Bert Oben, p. 197–231. Amsterdam: Benjamins.

Auer, P. 2021. “Turn-allocation and gaze: A multimodal revision of the “current-speaker-selects-next” rule of the turn-taking system of conversation analysis.” Discourse Studies 23(2), 117–40.

Auer, P., C. Rühlemann, S. Th. Gries, and A. Ptak. In preparation. The last-gazed-at participant answers the question in triads: Evidence from the Freiburg Multimodal Interaction Corpus [working title].

Adolphs, S. and R. Carter. 2013. Spoken corpus linguistics. From monomodal to multimodal. London/New York: Routledge.

Arndt, H. and R. W. Janney. 1987. InterGrammar. Towards an integrative model of verbal, prosodic and kinesic choices in speech. Berlin/New York: Mouton de Gruyter.

Bailey, Rachel L. (2017). “Electrodermal activity (EDA),” In The International Encyclopedia of Communication Research Methods, edited by Jörg Matthes, Christine Davis, and Robert Potter, p. 1–15. Blackwell: Wiley. 10.1002/9781118901731.iecrm0079.

Barthel, M. and S. Sauppe. 2019. “Speech planning at turn transitions in dialog is associated with increased processing load.” Cognitive Science 43(7), e12768. 10.1111/cogs.12768.

Beattie, G. 2016. Rethinking body language: How hand movements reveal hidden thoughts. London/New York: Routledge.

Beatty, J. 1982. “Task-evoked pupillary responses, processing load, and the structure of processing resources.” Psychological Bulletin, 91(2), 276–92. 10.1037/0033-2909.91.2.276.

Beatty, J. and B. Lucero-Wagoner. 2000. “The pupillary system.” In Handbook of psychophysiology, edited by J. T. Cacioppo, L. G. Tassinary, & G. G. Berntson, 2nd ed., p. 142–62. Cambridge, UK: Cambridge University Press.

Biber, D. 1993. “Representativeness in corpus design.” Literary and Linguistic Computing 8(4), 243–57.

Blackwell, N. L., M. Perlman, and J. E. Fox Tree. 2015. “Quotation as multimodal construction.” Journal of Pragmatics 81, 1–17.

Brône, G. and B. Oben. 2015. “Insight interaction: A multimodal and multifocal dialogue corpus.” Language Resources and Evaluation 49, 195–214.

Calhoun, Sasha, Jean Carletta, Jason M. Brenier, Neil Mayo, Dan Jurafsky, Mark Steedman, and David Beaver. 2010. “The NXT-format switchboard corpus: a rich resource for investigating the syntax, semantics, pragmatics and prosody of dialogue.” Language Resources and Evaluation 44, 387–419. 10.1007/s10579-010-9120-1.

Cappellini, Marco, Benjamin Holt, Brigitte Bigi, Marion Tellier, and Christelle Zielinski. 2023. “A multimodal corpus to study videoconference interactions for techno-pedagogical competence in second language acquisition and teacher education.” Corpus 24.

Dael, Nele, Martijn Goudbeek, and Klaus R. Scherer. 2013. “Perceived gesture dynamics in nonverbal expression of emotion.” Perception 42, 642–57. 10.1068/p7364.

Dawson, Michael E., Anne M. Schell, and Diane L. Filion. (2000). “The electrodermal system,” In Handbook of psychophysiology, edited by John T. Cacioppo, Louis G. Tassinary, and Gary G. Berntson, p. 200–223. New York, NY: Cambridge University Press.

Diemer, Stefan, Marie-Louise Brunner, and Selina Schmidt. 2016. “Compiling computer-mediated spoken language corpora: Key issues and recommendations.” International Journal of Corpus Linguistics 21(3), 349–71 [Special issue: Compilation, transcription, markup and annotation of spoken corpora, ed. by John M. Kirk and Gisle Andersen].

Du Bois, John W., Wallace L. Chafe, Charles Meyer, Sandra A. Thompson, Robert Englebretson, and Nii Martey. 2000–2005. Santa Barbara corpus of spoken American English, Parts 1-4. Philadelphia: Linguistic Data Consortium.

Garside, Roger and Nicholas Smith (1997). CLAWS part-of-speech tagger for English. UCREL.

Goffman, Erving (1964). “The neglected situation.” American Anthropologist 66(6), 133–136.

Goodwin, Charles (1980). “Restarts, pauses, and the achievement of a state of mutual gaze at turn-beginning.” Sociological Inquiry 50(3–4), 272–302.

Goodwin, C. 1984. “Notes on story structure and the organization of participation.” In Structures of social action: Studies in conversation analysis, edited by J. M. Atkinson and J. Heritage, p. 225–46. Cambridge: Cambridge University Press.

Goodwin, C. 1986a. “Gestures as a resource for the organization of mutual orientation.” Semiotica 62(1–2), 29–49. 10.1515/semi.1986.62.1-2.29.

Goodwin, C. 1986b. “Between and within alternative sequential treatments of continuers and assessments.” Human Studies 9, 205–17.

Goodwin, C. and J. Heritage. 1990. “Conversation analysis.” Annual Review of Anthropology 19, 283–307.

Hayano, K. 2013. “Question design in conversation.” In The handbook of conversation analysis, edited by J. Sidnell and T. Stivers, p. 395–414.

Heldner, Mattias. 2011. “Detection thresholds for gaps, overlaps and no-gap-no-overlaps.” The Journal of the Acoustical Society of America 130, 508–13. 10.1121/1.3598457.

Heldner, Mattias and Jens Edlund. 2010. “Pauses, gaps and overlaps in conversations.” Journal of Phonetics 38, 555–68. 10.1016/j.wocn.2010.08.002.

Heritage, J. 1984a. “A change-of-state token and aspects of its sequential placement.” In Structures of social action: Studies in conversation analysis, edited by M. Atkinson and J. Heritage, p. 299–345. New York, NY, USA: Cambridge University Press.

Heritage, J., & M. L. Sorjonen. 2018. “Analyzing turn-initial particles.” In Between turn and sequence: Turn-initial particles across languages, edited by J. Heritage and M.-L. Sorjonen, p. 1–22. Amsterdam: John Benjamins. 10.1075/slsi.31.01her.

Hömke, Paul, Judith Holler, and Stephen C. Levinson. 2017. “Eye blinking as addressee feedback in face-to-face conversation.” Research on Language and Social Interaction 50(1), 54–70. 10.1080/08351813.2017.1262143.

Holler, Judith, Janet Bavelas, Jonathan Woods, Mareike Geiger, and Lauren Simons. (2022). “Given-new effects on the duration of gestures and of words in face-to-face dialogue.” Discourse Processes 59(8), 619–645.

Hoffmann, Sebastian, Stefan Evert, Nicholas Smith, David Lee, and Ylva Berglund-Prytz. 2008. Corpus linguistics with BNCweb-a practical guide (Vol. 6). Lausanne: Peter Lang.

Holler, Judith and Stephen C. Levinson. 2019. “Multimodal language processing in human communication.” Trends in Cognitive Sciences 23(8), 639–52. 10.1016/j.tics.2019.05.006.

Holt, E. 2007. “I’m eying your chop up mind’: Reporting and enacting.” In Reporting talk. Reported speech in interaction, edited by E. Holt and R. Clift, p. 47–80. Cambridge, MA: Cambridge University Press.

Holt, B., M. Cappellini, B. Bigi, and C. Zielinski. 2021. Building a multimodal corpus to study the development of techno-semio-pedagogical competence across different videoconferencing settings and languages. The open archive HAL SHS. halshs-03476577: https://shs.hal.science/halshs-03476577/file/Article_corpus_v3.pdf.

Hsu, H.-C., G. Brône, and K. Feyaerts. 2021. “When gesture “takes over”: Speech-embedded nonverbal depictions in multimodal interaction.” Frontiers in Psychology 11, 552533. 10.3389/fpsyg.2020.552533.

Just, M. A. and P. A. Carpenter. 1993. “The intensity dimension of thought: Pupillometric indices of sentence processing.” Canadian Journal of Experimental Psychology 47(2), 310–39. 10.1037/h0078820.

Jefferson, G. 2004. “Glossary of transcript symbols with an introduction.” In Conversation analysis: Studies from the first generation, edited by G. H. Lerner, p. 13–31. Amsterdam: John Benjamins.

Jehoul, A., G. Brône, and K. Feyaerts. 2017. “The shrug as marker of obviousness.” Linguistics Vanguard 3(s1).

Kallen, J. and J. Kirk. 2012. SPICE-Ireland: A user’s guide. Belfast: Cló Ollscoil na Banríona.

Kelly, Spencer D. (2001). “Broadening the units of analysis in communication: Speech and nonverbal behaviours in pragmatic comprehension.” Journal of Child Language 28(2), 325–349.

Kendon, A. 1967. “Some functions of gaze-direction in social interaction.” Acta Psychol 26, 22–63. 10.1016/0001-6918(67)90005-4.

Kendon, Adam. 1973. “The role of visible behavior in the organization of social interaction.” In Social Communication and Movement: Studies of Interaction and Expression in Man and Chimpanzee, edited by M. Cranach and I. Vine, p. 29–74. New York: Academic Press.

Kendon, A. 2004. Gesture: Visible action as utterance. Cambridge, MA: Cambridge University Press.

Kendrick, K. H. and F. Torreira. 2014. “The timing and construction of preference: A quantitative study.” Discourse Processes 52(4), 1–35. 10.1080/0163853X.2014.955997.

Knight, D., A. O’Keeffe, C. Fitzgerald, G. Mark, J. McNamara, and F. Farr. Forthcoming, 2023. “Indicating engagement in online workplace meetings: The role of backchannelling head nods.” International Journal of Corpus Linguistics.

Kok, K. I. 2017. “Functional and temporal relations between spoken and gestured components of language. A corpus-based inquiry.” International Journal of Corpus Linguistics 22(1), 1–26. 10.1075/ijcl.22.1.01kok.

Koutsombogera, M. and C. Vogel. 2018. Modeling collaborative multimodal behavior in group dialogues: The MULTISIMO corpus.

Labov, W. 1972. Language in the inner city. Oxford: Basil Blackwell.

Ladewig, Silva H. 2011. Putting the cyclic gesture on a cognitive basis, CogniTextes [online], Volume 6. http://journals.openedition.org/cognitextes/406; 10.4000/cognitextes.406.

Laner, Barbara. 2022. ““Guck mal der Baum” – Zur Verwendung von Wahrnehmungsimperativen mit und ohne mal.” Gesprächsforschung – Online-Zeitschrift zur verbalen Interaktion 23, 1–35. http://www.gespraechsforschung-online.de/fileadmin/dateien/heft2022/ga-laner.pdf.

Laeng, B., S. Sirois, and G. Gredebäck. 2012. “Pupillometry: A Window to the Preconscious?” Perspectives on Psychological Science 7(1), 18–27.

Levinson, S. C. and J. Holler. 2014. “The origin of human multi-modal communication.” Philosophical Transactions of the Royal Society B: Biological Sciences 369, 20130302. 10.1098/rstb.2013.0302.

Levinson, S. C. and F. Torreira. 2015. “Timing in turn-taking and its implications for processing models of language.” Frontiers in Psychology 6, 731. 10.3389/fpsyg.2015.00731.

Leech, G., R. Garside, and M. Bryant. 1994. “CLAWS4: the tagging of the British national corpus.” COLING ‘94: Proceedings of the 15th conference on computational linguistics – Volume 1 August, p. 622–8. 10.3115/991886.991996.

Leech, G. 2007. “New resources, or just better ones? The holy grail of representativeness.” In Corpus linguistics and the web, edited by M. Hundt, N. Nesselhauf, and C. Biewer, p. 133–50. Amsterdam/New York: Rodopi.

Lerner, Gene H. (2003). “Selecting next speaker: The context-sensitive operation of a context-free organization.” Language in Society 32(2), 177–201.

Levinson, Stephen C. (2013). “Action formation and ascription.” In The handbook of conversation analysis, edited by Jack Sidnell and Tanya stivers, p. 101–130. Malden/MA: Wiley-Blackwell.Search in Google Scholar

Li, C. L. 1986. “Direct and indirect speech: A functional study.” In Direct and indirect speech, edited by F. Coulmas, p. 29–45. Berlin: Mouton de Gruyter.Search in Google Scholar

Lõo, Kaidi, Jacolien van Rij, Juhani Järvikivi, and Harald Baayen (2016). “Individual Differences in Pupil Dilation during Naming Task.” In CogSci.Search in Google Scholar

Loehr. D. 2012. “Temporal, structural, and pragmatic synchrony between intonation and gesture.” Laboratory Phonology 3(1), 71–89. 10.1515/lp-2012-0006.Search in Google Scholar

Lücking, A., K. Bergmann, F. Hahn, S. Kopp, and H. Rieser. 2013. “Data-based analysis of speech and gesture: The Bielefeld Speech and Gesture Alignment Corpus (SaGA) and its applications.” Journal on Multimodal User Interfaces 7(1–2), 5–18. 10.1007/s12193-012-0106-8.Search in Google Scholar

Mathis, T. and G. Yule. 1994. "Zero quotatives." Discourse Processes 18, 63–76. 10.1080/01638539409544884.

Mayes, P. 1990. "Quotation in spoken English." Studies in Language 14(2), 325–63. 10.1075/sl.14.2.04.

McNeill, D. 1985. "So you think gestures are nonverbal?" Psychological Review 92(3), 350–71.

McNeill, D. 1992. Hand and mind: What gestures reveal about thought. Chicago: University of Chicago Press.

Mondada, L. 2016. "Conventions for multimodal transcription." https://franz.unibas.ch/fileadmin/franz/user_upload/redaktion/Mondada_conv_multimodality.pdf.

Mondada, Lorenza. 2018. "Multiple temporalities of language and body in interaction: Challenges for transcribing multimodality." Research on Language and Social Interaction 51(1), 85–106.

Paggio, P., J. Allwood, E. Ahlsén, and K. Jokinen. 2010. "The NOMCO multimodal Nordic resource: Goals and characteristics." In Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10), edited by N. Calzolari, K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis, M. Rosner, and D. Tapias, p. 19–21. Valletta, Malta: European Language Resources Association (ELRA).

Papesh, M. H. and S. D. Goldinger. 2012. "Pupil-BLAH-metry: Cognitive effort in speech planning reflected by pupil dilation." Attention, Perception, & Psychophysics 74(4), 754–65. 10.3758/s13414-011-0263-y.

Peräkylä, Anssi, Pentti Henttonen, Liisa Voutilainen, Mikko Kahri, Melisa Stevanovic, Mikko Sams, and Niklas Ravaja. 2015. "Sharing the emotional load: Recipient affiliation calms down the storyteller." Social Psychology Quarterly 78(4), 301–23.

Peters, P. and D. Wong. 2015. "Turn management and backchannels." In Corpus pragmatics: A handbook, edited by K. Aijmer and C. Rühlemann, p. 408–29. Cambridge: Cambridge University Press.

Pfeiffer, Martin and Clarissa Weiss. 2022. "Reenactments during tellings: Using gaze for initiating reenactments, switching roles and representing events." Journal of Pragmatics 189, 92–113.

Pomerantz, A. and J. H. Heritage. 2013. "Preference." In The handbook of conversation analysis, edited by J. Sidnell and T. Stivers, p. 210–28. Hoboken: Wiley-Blackwell.

Rayson, Paul and Roger Garside. 1998. "The CLAWS web tagger." ICAME Journal 22, 121–3.

Roberts, S. G., F. Torreira, and S. C. Levinson. 2015. "The effects of processing and sequence organization on the timing of turn taking: A corpus study." Frontiers in Psychology 6(509), 1–16.

Robinson, J. D. 2007. "The role of numbers and statistics within conversation analysis." Communication Methods and Measures 1(1), 65–75. 10.1080/19312450709336663.

Robinson, J. D., C. Rühlemann, and D. T. Rodriguez. 2022. "The bias toward single-unit turns in conversation." Research on Language and Social Interaction 55(2), 165–83. 10.1080/08351813.2022.2067436.

Rossano, Federico. 2012. Gaze behaviour in face-to-face interaction. Unpublished PhD dissertation. Max Planck Institute for Psycholinguistics, Nijmegen, Netherlands.

Rühlemann, C. 2013. Narrative in English conversation: A corpus analysis of storytelling. Cambridge: Cambridge University Press.

Rühlemann, Christoph. 2017. "Integrating corpus-linguistic and conversation-analytic transcription in XML: The case of backchannels and overlap in storytelling interaction." Corpus Pragmatics 1, 201–32.

Rühlemann, C. 2022. "How is emotional resonance achieved in storytellings of sadness/distress?" Frontiers in Psychology (Sec. Language Sciences) 13, 952119. 10.3389/fpsyg.2022.952119.

Rühlemann, C. and M. Pfeiffer. Under review. Timing of gaze alternation [working title].

Rühlemann, C., M. Gee, and A. Ptak. 2019. "Alternating gaze in multi-party storytelling." Journal of Pragmatics 149, 91–113.

Sacks, H. 1992. Lectures on conversation. Vols. I & II. Cambridge: Blackwell.

Sacks, Harvey, Emanuel A. Schegloff, and Gail Jefferson. 1978. "A simplest systematics for the organization of turn taking for conversation." In Studies in the organization of conversational interaction, edited by J. Schenkein, p. 7–55. New York: Academic Press.

Sauppe, S. 2017. "Symmetrical and asymmetrical voice systems and processing load: Pupillometric evidence from sentence production in Tagalog and German." Language 93(2), 288–313. 10.1353/lan.2017.0015.

Schegloff, E. A. 2007. Sequence organization in interaction: A primer in conversation analysis. Cambridge: Cambridge University Press.

Scherer, K. R. 2005. "What are emotions? And how can they be measured?" Social Science Information 44, 693–727.

Sevilla, Y., M. Maldonado, and D. E. Shalóm. 2014. "Pupillary dynamics reveal computational cost in sentence planning." The Quarterly Journal of Experimental Psychology 67, 1041–52. 10.1080/17470218.2014.911925.

Siddle, David A. T. 1991. "Orienting, habituation, and resource allocation: An associative analysis." Psychophysiology 28(3), 245–59.

Sirois, S. and J. Brisson. 2014. "Pupillometry." Wiley Interdisciplinary Reviews: Cognitive Science 5(6), 679–92. 10.1002/wcs.1323.

Soulaimani, D. 2018. "Talk, voice and gestures in reported speech: Toward an integrated approach." Discourse Studies 20, 361–76. 10.1177/1461445618754419.

Stec, K., M. Huiskes, and G. Redeker. 2016. "Multimodal quotation: Role shift practices." Journal of Pragmatics 104, 1–17.

Stivers, T. 2008. "Stance, alignment, and affiliation during storytelling: When nodding is a token of affiliation." Research on Language and Social Interaction 41, 31–57.

Stivers, Tanya and Jeffrey D. Robinson. 2006. "A preference for progressivity in interaction." Language in Society 35(3), 367–92.

Stivers, Tanya and Federico Rossano. 2010. "Mobilizing response." Research on Language and Social Interaction 43(1), 3–31.

Stivers, T., N. J. Enfield, P. Brown, C. Englert, M. Hayashi, T. Heinemann, G. Hoymann, F. Rossano, J. P. de Ruiter, K.-E. Yoon, and S. C. Levinson. 2009. "Universals and cultural variation in turn-taking in conversation." Proceedings of the National Academy of Sciences U.S.A. 106(26), 10587–92. 10.1073/pnas.0903616106.

Svartvik, J., ed. 1990. The London-Lund corpus of spoken English: Description and research. Lund Studies in English, Vol. 82. Lund: Lund University Press.

Tannen, D. 1986. "Introducing constructed dialog in Greek and American conversational and literary narrative." In Direct and indirect speech, edited by F. Coulmas, p. 311–32. Berlin: Mouton de Gruyter. 10.1515/9783110871968.311.

Ter Bekke, M., L. Drijvers, and J. Holler. 2020. "The predictive potential of hand gestures during conversation: An investigation of the timing of gestures in relation to speech." 10.31234/osf.io/b5zq.

Walker, M. B. and C. Trimboli. 1982. "Smooth transitions in conversational interactions." The Journal of Social Psychology 117, 305–6.

White, Sheida. 1989. "Backchannels across cultures: A study of Americans and Japanese." Language in Society 18(1), 59–76.

Wittenburg, P., H. Brugman, A. Russel, A. Klassmann, and H. Sloetjes. 2006. "ELAN: A professional framework for multimodality research." In Proceedings of LREC 2006. Genoa.

Wong, Deanna and Pam Peters. 2007. "A study of backchannels in regional varieties of English, using corpus mark-up as the means of identification." International Journal of Corpus Linguistics 12(4), 479–510.

Young, R. and J. Lee. 2004. "Identifying units in interaction: Reactive tokens in Korean and English conversation." Journal of Sociolinguistics 8(3), 380–407.

Received: 2022-10-27
Revised: 2023-08-09
Accepted: 2023-08-15
Published Online: 2023-11-16

© 2023 the author(s), published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.
