Prosodic features of stances in conversation

Stance—attitudes and opinions about the topic of discussion—has been investigated textually in conversation analysis, discourse analysis, and computational models, but little attention has focused on its acoustic-phonetic properties. It is a challenge, given the complexity of stance and the many other types of meaning that share the same acoustic channels, all overlaid on the lexical and syntactic material of the message. With the goal of identifying automatically-extractable, acoustically-measurable correlates of stance-taking, this work identified signals of stance in prosodic measures of fundamental frequency, intensity, and duration in an audio corpus of dyads engaged in collaborative conversational tasks designed to elicit frequent changes in stance at varying levels of involvement. The study examined over 32,000 stressed vowels in content words spoken by 40 American English-speakers and found that f0 and intensity increased with stance strength, longer vowel duration signaled positive polarity, and a combination of measures distinguished several stance-act types, including: general agreement, weak-positive agreement, rapport-building agreement, reluctance to accept a stance, stance-softening, and backchannels. These results contribute to the understanding of the acoustic-phonetic properties of the social and attitudinal messages conveyed in natural speech.


Introduction
Many layers of meaning are conveyed in natural speech, beyond lexical and sentential denotations. One layer is stance, or the expression of an attitude toward an object, claim, or person relevant within the discussion context (Biber, Johansson, Leech, Conrad, & Finegan, 1999; Du Bois, 2007). Stance can be conveyed in many ways, but with only a fraction of the message sent through textual components, much of the information must be present in the delivery, the acoustics of the speech signal itself. Just as changes in pronunciation and prosody can transform a sentence from statement to question, similar changes can affect the intended meaning and reception of social and attitudinal information. Phonetic correlates of information structure, discourse structure, and such social-indexical aspects as the region, gender, ethnicity, or identity of speakers (and perceptions and interpretations of these features by listeners) have been studied in various sociolinguistic and computational fields. However, the phonetic properties of stance-taking have received less attention. This leads to questions of how stance is signaled acoustically. For example, we can express strong or weak opinions, contrast positive and negative attitudes, convey enthusiastic or reluctant agreement, take confident or uncertain positions, engage in persuasion or show deference, all without changing the words we use. How is this accomplished? In addressing this question, this study presents some of the first work to find automatically-extractable, acoustically-measurable correlates of stance-taking in natural speech. It employed a large audio corpus of stance-dense collaborative different types of stance.
Measures of intensity, pitch height and range, speaking rate, and hyperarticulation were especially useful in locating stances involving assessment (good/bad, praiseworthy/deplorable), newness (new/background, surprising/typical), subjectivity (fact/opinion, controversy), and personal relevance to the audience. Their models considered over 80 different acoustic measures of prosodic features, which has the advantage of testing many complex combinations of features but may have a disadvantage in the human interpretability of the results. The current work took a complementary approach, employing separate acoustic-prosodic measurements (f0, intensity, duration) over all utterances in a conversational corpus. The methods were first employed on a subset of utterances in the corpus, 2266 instances of the word "yeah". In that study, greater stance strength carried higher f0 and intensity; positive polarity was signaled by higher f0, lower intensity, and longer vowel duration; and certain stance types were differentiated by vowel duration and intensity.
The work presented here investigated acoustic correlates of stance-taking in American English conversation with a detailed treatment of stance features, a broad range of stance-expression types, and local phonetic measurements. It took up the argument that since stance presence is signaled acoustically (Freeman, 2014), components or features of stance (strength, polarity, type) are likely to differ acoustically as well. The approach leveraged advantages of qualitative content analysis with quantitative phonetic measurement over a sizeable audio corpus of dyadic conversations.

Methods
The central prediction of this study was that stance type, strength, and polarity are signaled by changes in the acoustic signal. This prediction was tested using measures of fundamental frequency (f0), intensity, and vowel duration extracted from an eight-hour audio corpus of 40 speakers engaged in collaborative tasks annotated for stance features.

Data set
The data set for this study was drawn from the ATAROS corpus, a high-quality audio collection of dyads completing collaborative tasks designed to elicit frequent changes in stance (for a full description of the corpus, see Freeman, 2015; for access to the corpus, contact the author). The sample consisted of 20 dyads engaged in two of the tasks, for a total of nearly eight hours of conversation containing over 71,300 words. The acoustic analyses presented here were conducted on lexically stressed vowels within content words (hereafter called 'stressed-content vowels' or SCVs). This was intended to minimize interactions with phonetic reduction typically found in function words and unstressed vowels. SCVs comprised 37% of all vowels in the sample and provided more than 32,000 vowel tokens for analysis.

Speakers
In order to minimize potential stance-related dialect differences, all speakers in the corpus were adult native English-speakers aged 18-75 who grew up in one dialect region, the Pacific Northwest (Washington, Oregon, and Idaho). Ethnicity was not controlled, but the proportions of self-identified ethnicity were consistent with the general ethnic makeup of the Seattle area, where recordings were made (U.S. Census Bureau, 2010). Speakers reported no history of hearing problems, and any speakers with apparent speech impediments were excluded from the current analysis. Dyads were made up of strangers matched roughly by age (within 10 years) and either crossed or matched by sex. Table 1 shows the distribution of dyads in the sample by age and sex. There were more female speakers than male (24 and 16 total), and half were under age 35. Speakers varied in the amount of speech they contributed, but contributions were proportional by sex and age group, with 57% of vowels uttered by females, almost half by the younger group, and about a quarter each for the middle and older groups.

Tasks
Recordings were made in a sound-attenuated booth in a university lab using head-mounted microphones and a separate recording channel for each speaker, resulting in 16-bit stereo WAV-file recordings with a 44.1 kHz sampling rate.
Dyads completed a brief demographic questionnaire and five collaborative problem-solving tasks designed to elicit frequent changes in stance and differing levels of involvement or engagement. The tasks involved two sets of about 50 target items chosen to represent the main vowel categories of Western American English in fairly neutral consonantal contexts (i.e., avoiding liquids and following nasals, which commonly neutralize vowel contrasts; Labov, Ash, & Boberg, 2006). This study analyzed the Inventory and Budget tasks, the two tasks intended to elicit the weakest and strongest stances and levels of involvement, respectively. Both tasks averaged about 13 minutes in duration and about 150 utterances per speaker (for details, see Freeman et al., 2014; Freeman, 2015).

The Inventory task
This collaborative decision-making task was designed to elicit low levels of involvement and weak stances. Speakers stood facing a felt-covered wall and were given a box of about 50 Velcro-backed cards that could be stuck to the felt. The cards were printed with the names of household items, and about 15 additional cards were already placed on the wall, which represented a store inventory map. Speakers were told to imagine that they were co-managers of a superstore in charge of arranging new inventory. They discussed each item and decided where to place it on the map. This task generally involved polite solicitation and acceptance of suggestions, as in this example exchange:
A: Books could go near toys I think. Maybe.
B: Yeah or travel guide - Yeah, between toys and travel guides?
A: Yeah, sure.

The Budget task
This collaborative decision-making task was designed to elicit high levels of involvement and strong stances. Speakers were seated at a computer screen and told to imagine that they were on a county budget committee in charge of making cuts to about 50 services and expenses. They discussed each item and decided whether to fund or cut it. This task involved more elaborate negotiation, which might include citing personal knowledge or experience as support for stances, as in this excerpt:
A: Well job training programs is pretty crucial. [...] And so is ... chicken pox vaccinations, right?
B: I - well, I didn't get a chicken pox vaccination. I think a lot of kids just naturally get chicken pox and then they're fine.

Annotation
Three levels of manual annotation were conducted: orthographic transcription, stance strength and polarity annotation, and stance type annotation. Annotators were three advanced or recently graduated bachelor's students in linguistics and speech science who were trained and supervised by the author to ensure transcription accuracy and annotation consistency.

Transcription
Tasks were manually transcribed in Praat (Boersma & Weenink, 2013) following a simplification of the ICSI Meeting Corpus guidelines (Morgan et al., 2001). Stretches of speech were demarked when surrounded by at least 500 ms of silence, and the resulting 'spurt' was transcribed orthographically using conventional American spelling, with the addition of common shortenings, discourse markers, filled pauses, disfluencies, and vocalizations with clear meanings (Freeman, 2015). Completed manual transcriptions were automatically time-aligned to the audio using the Penn Phonetics Lab Forced Aligner (P2FA; Yuan & Liberman, 2008), which demarked word and phone boundaries for each speaker.
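The spurt criterion above (speech bounded by at least 500 ms of silence) can be illustrated with a minimal sketch. The original segmentation was done manually in Praat; the `Word` class, the `group_into_spurts` function, and the toy timings below are hypothetical and only show how the same criterion could be applied to time-aligned word intervals from one speaker's channel.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # onset in seconds
    end: float    # offset in seconds


def group_into_spurts(words, min_silence=0.5):
    """Group one speaker's words into spurts split at gaps >= min_silence seconds."""
    spurts = []
    for word in sorted(words, key=lambda w: w.start):
        # A gap shorter than the threshold keeps the word in the current spurt.
        if spurts and word.start - spurts[-1][-1].end < min_silence:
            spurts[-1].append(word)
        else:
            spurts.append([word])
    return spurts


# Toy timings (invented for illustration, not corpus data).
words = [
    Word("books", 0.00, 0.40),
    Word("could", 0.45, 0.70),
    Word("go", 0.72, 0.90),
    Word("yeah", 1.60, 1.95),  # 0.70 s after "go", so a new spurt starts here
    Word("sure", 2.10, 2.50),
]
spurts = group_into_spurts(words)
print([[w.text for w in spurt] for spurt in spurts])
```

Because the threshold is a parameter, the same function could be reused to check how sensitive spurt counts are to the 500 ms choice.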

Stance strength and polarity
Tasks were manually annotated at a coarse level between pauses for two broad features of stance: strength and polarity. Each spurt (stretch of speech said by one speaker between at least 500 ms of silence) was marked with one of the stance strength labels shown in Table 2. Spurts with a discernible stance strength (label 1, 2, or 3) were also labeled for polarity, as shown in Table 3. As a result, each spurt was marked with one of 14 possible strength-polarity label combinations.
Table 2 (excerpt): Stance strength labels.
x  Unclear: cannot be determined, excited pronunciations of minimal-stance content (e.g., "Ooh, buckets!" "I don't know what that means.")
Both textual content and prosody were taken into account when determining labels, as prosody can be used to enhance or even reverse the meaning of text alone. One purpose of this study was to identify acoustic cues that people use to convey (and therefore interpret) stance, making it necessary to include the audio signal in the annotation process; however, annotators considered prosody holistically without specific reference to components to be measured acoustically (pitch, loudness, duration). Because strength is relative, the scheme was applied on a per-speaker, per-task basis. Before labeling a task, annotators listened to a portion of the task or a prior task to get a general sense of each speaker's styles and strategies. For example, for speakers with small f0 and intensity ranges, small deviations are more meaningful than for more energetic speakers, whose modulations must be more extreme to indicate differences in stance. Annotators listened to both channels of the task audio while labeling one speaker's transcription, and then listened to the task again while labeling the other's.
The scheme was verified for its usability with independent annotation. The first two dyads recorded were used for training and reliability testing. Three annotators independently annotated all four task files with moderately high agreement.
Fleiss' kappa was 0.69 for polarity labels, 0.57 for stance strength labels, and 0.55 for combined (strength + polarity) labels. Given the complexity of the annotation task, this level of agreement was deemed sufficient to allow less overlap in annotation in favor of an overall faster procedure. After a task was labeled by one annotator, a second reviewed and verified or corrected each label while listening to the audio and reading the transcript. Asterisks were used to indicate uncertainty, with the second annotator providing a second opinion as needed. If the second annotator remained uncertain about a label, a third annotator served as a tiebreaker. In the 20-dyad sample analyzed here, 5.4% of spurts were marked with uncertainty by a first annotator, and only 1.8% by a second, with a fairly even distribution across strength and polarity levels. This method yielded very high interrater agreement between the two annotators. Weighted Cohen's kappas with equidistant penalties were 0.87 for stance strength labels and 0.93 for polarity labels (p < .001), with the unweighted kappa for combined labels at 0.88 (p < .001).
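As an illustration of the agreement statistic reported above, the following is a minimal sketch of Cohen's kappa with equidistant (linear) disagreement penalties for two raters using ordered labels. The function name `weighted_kappa` and the toy label sequences are hypothetical; the software actually used to compute the reported kappas is not stated in the text.

```python
from collections import Counter


def weighted_kappa(rater_a, rater_b, categories):
    """Cohen's kappa with linear (equidistant) disagreement weights.

    `categories` lists the ordered labels; the penalty for a pair of labels
    is the distance between their positions in that list.
    """
    index = {label: i for i, label in enumerate(categories)}
    n = len(rater_a)
    observed = Counter(zip(rater_a, rater_b))
    marginal_a = Counter(rater_a)
    marginal_b = Counter(rater_b)
    observed_penalty = 0.0
    expected_penalty = 0.0
    for a in categories:
        for b in categories:
            weight = abs(index[a] - index[b])
            observed_penalty += weight * observed[(a, b)] / n
            expected_penalty += weight * marginal_a[a] * marginal_b[b] / (n * n)
    return 1.0 - observed_penalty / expected_penalty


# Toy strength labels from two annotators (invented for illustration).
first = [0, 1, 1, 2, 2, 3, 0, 1, 2, 3]
second = [0, 1, 2, 2, 2, 3, 0, 1, 1, 3]
kappa = weighted_kappa(first, second, categories=[0, 1, 2, 3])
print(round(kappa, 3))
```

With linear weights, a disagreement between adjacent strength levels (for example, 1 versus 2) is penalized less than one between distant levels (0 versus 3), which suits ordered labels such as stance strength.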
With the given annotation protocols, uneven distributions across levels were expected, with strong stances particularly rare. Table 4 shows the distribution of analyzed stressed-content vowels (SCVs) by stance strength and polarity, as inherited from the spurts that contained them. Weak and moderate-strength SCVs were similar in proportion, but over half of SCVs were labeled with neutral polarity, a fifth with positive, and very few with negative. Note that vowels with unclear polarity were included in stance strength analysis but removed for polarity analysis; vowels with unclear stance strength were excluded from both analyses.

Table 3: Stance polarity levels.
Label    Description and examples (applicable only to strength labels 1, 2, 3)
+        Positive: agreement, approval, willing acceptance, encouragement, positive evaluation (e.g., "Sure. Good idea." "Yes! Perfect.")
-        Negative: disagreement, disapproval, rejection, grudging acceptance, hedging, negative evaluation (e.g., "No, I don't think so." "Well, I guess. If you want to.")
(none)   Neutral: none of the above, non-evaluative offering or solicitation of opinions or solutions (e.g., "What should we cut next?" "Let's do this one.")
x        Unclear: cannot be determined.
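The count of 14 strength-polarity combinations mentioned in this section can be checked with a short sketch. It assumes five strength labels (0, 1, 2, 3, x) and four polarity options (positive, negative, neutral, unclear), with polarity applied only to strengths 1 to 3. The string encodings used here, such as "1x" for weak strength with unclear polarity, are hypothetical and are not taken from the annotation files.

```python
STRENGTH_LABELS = ["0", "1", "2", "3", "x"]
POLARITY_LABELS = ["+", "-", "", "x"]  # "" stands for neutral (no polarity mark)


def strength_polarity_labels():
    """Enumerate combined labels; polarity applies only to strengths 1, 2, 3."""
    labels = []
    for strength in STRENGTH_LABELS:
        if strength in {"1", "2", "3"}:
            labels.extend(strength + polarity for polarity in POLARITY_LABELS)
        else:
            labels.append(strength)  # no-stance and unclear strength carry no polarity
    return labels


labels = strength_polarity_labels()
print(len(labels), labels)
```

Under these assumptions the count is 2 + 3 × 4 = 14, which matches the number given in the text.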

Stance type
Stance type was annotated at a more fine-grained level than stance strength and polarity: Words and phrases were only marked when they performed 'stance acts,' or dialog acts involving stance-taking (Carletta et al., 1997; Fairclough, 2003). Stance act boundaries were determined by the annotators, and acts might divide or span multiple spurts. Both lexical and auditory information was considered when marking a stance act, based on whether the utterance performed the functions shown in Table 5 within the discourse context. As with stance strength and polarity annotation, annotators listened to both audio channels of a task while annotating one speaker's transcript, and then listened again to annotate the other's. The stance-act type annotation scheme drew on a range of content- and discourse-analytic literature with a variety of stance-related concepts and classifications (Jaffe, 2009), as described below. This resulted in a combination of dimensions that are often examined separately, including elements of persuasion, discourse management, and interpersonal relations, which were combined into one scheme here in order to capture the range of behaviors typical to the collaborative tasks at hand. Some of the most overt types of stance-taking were included under the opinion-offering label (o): evaluation and evaluative description, appraisal, judgment, appreciation, affect/affective stance, assessment, subjectivity, intersubjectivity, positioning, alignment, attitude/attitudinal stance, recommendation, persuasion, modality, modulation, and prediction (Conrad & Biber, 2000; Du Bois, 2007; Fairclough, 2003; Hunston & Thompson, 2000; Ogden, 2006).
In soliciting another's stance (s), speakers engaged in both knowledge exchange (Fairclough, 2003) and interpersonal stance-taking, which involved negotiating their positions and power relationships, showing deference and politeness, and/or controlling the flow of conversation and the weights or attention given to each person's stances (Du Bois, 2007; Hunston & Thompson, 2000). Both teamwork/rapport-building and encouragement/praise (t, e) were interpersonal in nature (Du Bois, 2007), with speakers working to bolster their cohesiveness as a team by expressing positive sentiments about their jointly-constructed stances, each other, and themselves as team members.
Agreement and disagreement (a, d) can be called second-order stances (Kockelman, 2004) in that they take stances in relation to previous stances of any type (Conrad & Biber, 2000; Du Bois, 2007; Fairclough, 2003; Ogden, 2006). As a polite form of disagreement, reluctance to accept a stance (r) adds a layer of positive interpersonal stance to the rejection of a proposition (Du Bois, 2007; Fairclough, 2003; Hunston & Thompson, 2000; Ogden, 2006).
The remaining labels allowed for types of stance that were difficult to name (strongly expressive intonation, unclear [i, x]) and those which normally carry little or no stance (backchannels, minimal-stance [b, 0]). Although backchannels were considered to have no/minimal stance (Table 2), they were labeled separately for stance type due to their recognizable discourse function and previously-studied acoustic properties (e.g., Beňuš, Gravano & Hirschberg, 2007), which may serve as a useful basis of comparison against stance-carrying types.
Some of the labels served similar functions which were often more difficult to differentiate during annotation. A distinguishing feature between agreement and opinion-offering (a, o) was whether the utterance took a new stance (o) or merely showed acceptance/approval of an existing one (a). Similarly, lexically positive backchannels (b) like 'yeah, right, okay' could be difficult to distinguish from agreement/acceptance (a); here the rule of thumb was whether the speaker took (or attempted to take) the floor (a). (The new turn may continue after the agreement, or if the agreement was the entire turn, the other speaker often began a new turn in response, whereas backchannels generally occurred during another speaker's turn.) While reluctance to accept and hedging (r, f) could sound similar, reluctance usually occurred in response to another's stance to soften or avoid rejection, while hedging attempted to soften the force of one's own offer, allowing more room for the other to reject it. Rapport-building and encouragement (t, e) are very similar concepts, as encouragement could be considered a subtype of rapport. However, they were separated here to allow for potentially strong prosodic differences between the more extreme examples, such as individual esteem-boosting verbal 'pats on the back' (e) versus sarcasm or commiseration (t), which on the surface may appear negative but which served to build solidarity (i.e., "At least we're in the same boat"). Finally, labels for general and intonationally-carried 'stanciness' (x, i) were left underspecified to allow for additional classifications that might emerge in future analyses.
Multiple labels were applied to phrases performing more than one stance act type; e.g., offering a suggestion (o) with questioning intonation to solicit another's opinion about it (s) would be labeled (os). Because stance type annotation is more subjective than stance strength and polarity procedures, all annotations were reviewed and corrected by a second annotator. Any areas of uncertainty or disagreement between the first two annotators were settled by a third. In the 20-dyad sample used here, 5% of acts were marked with uncertainty by a first annotator, and only 1% by a second. Labels receiving greater than 5% initial uncertainty included: reluctance to accept, disagreement, opinion with reasons, softened opinion, strongly-expressive intonation, and unclear (r, d, co, fo, i, x). Finally, stance acts with automatic transcript alignments which deviated substantially from the audio were marked during annotation. These poor alignments made up a small portion of the recordings (4.3% of acts in the 20-dyad sample), and so they were removed from the current acoustic analysis.
Because stance acts were delimited independent of spurt boundaries, they differed in structure from spurts. On average, stance acts in the sample were shorter than spurts, with a mean length of 3.9 words over 1.3 seconds, compared to 6.4 words in 2.2 seconds for spurts. (The speaking rate was unaffected, at about 3 words per second.) As with spurts, stance acts were longer on average in the Budget task (mean 4.4 words, compared to 3.5 in the Inventory task). These patterns held for both sexes.
The 24 stance type labels and label combinations with at least 100 stressed-content vowel tokens were included in the analyses of stance type presented here (Table 6). This helped ensure there were enough tokens with each label for reliable comparisons between types. With over 32,000 total vowels, all types in the annotation scheme (Table 5) were represented except encouragement (e). Table 6 shows the total number of stance acts with each label, the mean and standard deviation of the number of words and the number of stressed-content vowels (SCVs) per act type, and the total number of SCVs with each label. The most frequent stance act types were opinion-offering, convincing/reasoning, and agreement (labels o, c, a); together, these comprised 54% of the measured stressed-content vowels. Also frequent were vowels in stretches of speech labeled here as minimal-stance (labeled 0, 24% of SCVs); these were not considered parts of stance acts, but they were included in acoustic analyses for comparison. Opinions with solicitation or supporting reasons (os, co) together contributed just under 9% of all SCVs, and the remaining stance types contributed less than 2% each. Stance act types varied substantially in length, with acts involving convincing (c, co, cd, ct, cs, cr) being some of the longest, at about 9 words with nearly 4 SCVs on average, those involving opinion-offers (o, os, co, ot, fo, do, ao) next with about 6.5 words and 3 SCVs, other types ranging from 2 to 5 words with about 2 SCVs, and backchannels tending to be one-word acts.
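The inclusion rule described above (keeping only labels and label combinations with at least 100 stressed-content vowel tokens) amounts to a simple frequency filter. The sketch below is illustrative only: the function name `frequent_labels` and the toy counts are invented and do not reproduce the values in Table 6.

```python
from collections import Counter


def frequent_labels(token_labels, min_tokens=100):
    """Return {label: count} for labels with at least min_tokens vowel tokens."""
    counts = Counter(token_labels)
    return {label: n for label, n in counts.items() if n >= min_tokens}


# One stance-type label per stressed-content vowel token (toy counts).
token_labels = ["o"] * 250 + ["a"] * 120 + ["os"] * 100 + ["e"] * 12
kept = frequent_labels(token_labels)
print(sorted(kept))
```

In this toy data the label "e" falls below the threshold and is dropped, mirroring the way encouragement (e) was the one type not represented in the analyses.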

Measurements
After transcription, alignment, and annotation were complete, a Praat script automatically measured the f0 and intensity (Hz, dB) of all vowels at every decile of their duration using Praat's autocorrelation and mean energy functions with a window length of 25 ms, f0 range of 50-300 Hz, and dynamic range of 30 dB. Forced-alignments and automatic measurements were not manually corrected, as the very large size of the data set minimized the effects of alignment and measurement errors. However, spurts with very poor alignments were marked during annotation and excluded from analysis; in the current sample, this resulted in excluding about 3.5% of vowels.
Measurements were normalized within-speaker to allow for cross-speaker comparisons. Vowel f0 and intensity were each z-score normalized using the means and standard deviations of all a speaker's measurements taken over all words in both tasks combined. Similarly, vowel duration was z-score normalized within speaker but also within vowel quality to account for intrinsic vowel duration differences (Peterson & Lehiste, 1960;Tauberer & Evanini, 2009). Each vowel's stance strength, polarity, and type labels were inherited from the spurt or stance act to which the vowel belonged. For example, if an utterance of "I agree absolutely" were a spurt marked with moderate strength and positive polarity, and also marked as an act of agreement, the acoustic measurements of each vowel in the utterance would be tagged with 2+ and agreement.
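The normalization described above can be sketched as a grouped z-score: f0 and intensity are standardized within speaker, and duration within speaker and vowel quality. The function `zscore_within`, the field names, and the toy rows below are hypothetical. The use of the sample standard deviation is also an assumption, since the text does not say which estimator was used.

```python
from collections import defaultdict
from statistics import mean, stdev


def zscore_within(rows, value_key, group_keys):
    """Z-score row[value_key] within groups defined by the fields in group_keys."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in group_keys)].append(row[value_key])
    # Groups need at least two values for a sample standard deviation.
    stats = {g: (mean(v), stdev(v)) for g, v in groups.items() if len(v) > 1}
    scores = []
    for row in rows:
        group = tuple(row[k] for k in group_keys)
        if group in stats and stats[group][1] > 0:
            center, spread = stats[group]
            scores.append((row[value_key] - center) / spread)
        else:
            scores.append(None)  # undefined for singleton or constant groups
    return scores


# Toy vowel tokens for one speaker (invented values; ARPABET-style vowel codes).
rows = [
    {"speaker": "A", "vowel": "IY", "f0": 200.0, "dur": 0.10},
    {"speaker": "A", "vowel": "IY", "f0": 220.0, "dur": 0.14},
    {"speaker": "A", "vowel": "AA", "f0": 180.0, "dur": 0.12},
    {"speaker": "A", "vowel": "AA", "f0": 200.0, "dur": 0.16},
]
f0_z = zscore_within(rows, "f0", ["speaker"])
dur_z = zscore_within(rows, "dur", ["speaker", "vowel"])
print(f0_z)
print(dur_z)
```

Grouping duration by vowel quality as well as speaker means that an intrinsically long vowel is compared only against other tokens of the same vowel from the same speaker.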

Results
Signals of stance strength, polarity, and type were found in the duration, fundamental frequency, and intensity of lexically-stressed vowels within content words (stressed-content vowels, SCVs). Because initial analyses showed f0 and intensity patterns holding across vowel duration, the statistics reported below are for these measures at vowel midpoint. A principal components analysis of the z-score normalized measures revealed that f0 and intensity aligned with one component which accounted for about half the variance in stance labels, and vowel duration aligned with a second component which accounted for another third of the variance (see Freeman, 2015 for the full analysis).
The primary results reported below come from linear mixed-effects models for each dependent measure (midpoint f0, midpoint intensity, vowel duration) with stance strength, polarity, and type as fixed effects, speaker as a random effect (random intercept), and a random slope for stance strength within speaker. Models with a random slope for stance type failed to converge, as did models with random slopes for both stance strength and polarity together. Results from models with a random slope for polarity are noted for each measure below. Models were computed in R (R Core Team, 2017) using the lme4 package's lmer function (Bates et al., 2015) and the afex package's Satterthwaite estimations to compute p-values (Singmann et al., 2019). The smoothing-spline ANOVA plots for each measure were created using the ggplot2 package (Wickham, 2009).

Fundamental frequency
Fundamental frequency (f0) at vowel midpoint was systematically related to stance strength and type. Table 7 shows the results for a linear mixed-effects regression model (LMER) with a random slope for stance strength. Mean midpoint SCV f0 was significantly affected by stance strength, with stronger stances successively higher in f0 but no significant difference between minimal-stance and low-strength vowels (labels 0, 1). Several stance types differed from minimal-stance (label 0), as indicated by the stars in Table 7. Results from an LMER with a random slope for polarity were nearly identical, and likelihood ratio tests showed that both models provided better fits than one without random slopes (χ²(9) = 85.28 against LMER with random slope for strength; χ²(5) = 101.70 against LMER with random slope for polarity, both p < .001).
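The likelihood ratio tests reported above compare each test statistic against a chi-square distribution. The sketch below recomputes the upper-tail probabilities for the two reported f0 statistics using only the standard library. The helper `lrt_df` reflects one reading that happens to match the reported degrees of freedom (a full covariance matrix over four strength levels or three polarity levels, compared with a single intercept variance). That reading is an assumption and is not stated in the text.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for chi-square with integer df, x > 0."""
    z = x / 2.0
    # Start from the closed forms for df = 2 (even) or df = 1 (odd).
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))
    # Recurrence for the regularized upper incomplete gamma function:
    # Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1)
    while a < df / 2.0:
        q += math.exp(a * math.log(z) - z - math.lgamma(a + 1.0))
        a += 1.0
    return q


def lrt_df(n_levels):
    """Extra covariance parameters of a k-term random effect over an intercept."""
    return n_levels * (n_levels + 1) // 2 - 1


# Statistics reported for f0 in the text.
p_strength = chi2_sf(85.28, lrt_df(4))   # lrt_df(4) == 9
p_polarity = chi2_sf(101.70, lrt_df(3))  # lrt_df(3) == 5
print(p_strength < 0.001, p_polarity < 0.001)
```

The same function can be applied to the intensity and duration statistics reported in the following sections.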
With high overlap between stance types, it was difficult to identify clusters of stance types based on f0 at vowel midpoint. However, Welch's t tests identified a few types that were distinct from the others: Reluctance to accept a stance (r) and strongly-expressive intonation (i) were indistinguishable with the highest f0, backchannels (b) had the lowest, and agreement (a) dipped from moderate to low (p < .05). These relationships can be seen in the smoothing-spline ANOVA plot in Figure 1, which shows a contour connecting mean f0 for each stance type cluster at each decile of vowel duration (Gu, 2002;Wassink & Koops, 2013). While f0 generally declined over vowel duration, agreement and backchannels (a, b) showed sharper slopes. These patterns held in words at all utterance locations, with f0 generally declining over utterance duration.
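The pairwise comparisons between stance types rely on Welch's t test, which does not assume equal variances or equal group sizes. The following minimal sketch computes the test statistic and the Welch-Satterthwaite degrees of freedom. The function name `welch_t` and the toy samples are hypothetical, and the p-value step is omitted because it requires a t-distribution function that is not in the Python standard library.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each group mean (sample variance / n).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


# Toy z-scored midpoint f0 values for two stance types (invented for illustration).
reluctance = [0.9, 1.1, 1.4, 0.8, 1.2]
backchannel = [-0.3, 0.1, -0.5, 0.0, -0.2, -0.4]
t_stat, df = welch_t(reluctance, backchannel)
print(round(t_stat, 2), round(df, 2))
```

Because stance types differ widely in token counts, a test that tolerates unequal group sizes and variances is a natural fit for these comparisons.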

Intensity
Stance strength and type were also reliably signaled by intensity at vowel midpoint. Table 8 shows the results for a linear mixed-effects regression model (LMER) with a random slope for stance strength. Similar to f0, mean midpoint intensity was significantly affected by stance strength, with stronger stances successively higher in intensity but little difference between minimal-stance and low-strength vowels (labels 0, 1). This was influenced by the large number of vowels in weak positive utterances (label 1+), which had lower intensity than minimal-stance and other weak-stance vowels (labels 0, 1, 1-). Polarity levels did not differ substantially in intensity. Several stance types differed from minimal-stance (label 0), as indicated by the stars in the table. Estimates from an LMER with a random slope for polarity were very similar, but five fixed effects differed in significance between the two (indicated with exclamation points in Table 8): Neutral polarity (label 0) and unclear stance (type x) did not have significant effects, but low-strength (label 1), backchannels, and disagreement with reasons (types b, cd) reached significance (p < .05). Likelihood ratio tests showed that both models provided better fits than one without random slopes (χ²(9) = 369.41 against LMER with random slope for strength; χ²(5) = 88.04 against LMER with random slope for polarity, both p < .001). As with f0, there was high overlap between stance types, but Welch's t tests identified a few distinct types. Agreement with rapport (at) had the highest intensity and differed significantly from all other types except strongly-expressive intonation (i) (p < .01), and its intensity dropped less at the ends of utterances than in other types. Stance-softening or hesitation (f) had the lowest intensity and overlapped only with backchannels (b), the next highest, which in turn overlapped with the next highest, agreement (a) (p < .05).
Both agreement and backchannels (a, b) dropped more sharply over vowel duration than other types. All other types overlapped heavily and were not clearly distinguishable based on intensity at vowel midpoint. These patterns can be seen in the smoothing-spline ANOVA plot in Figure 2, which shows a contour connecting mean intensity at each decile of vowel duration for each stance type cluster. While intensity generally declined over vowel duration (with drops at the edges, as expected near flanking consonants or silence), agreement and backchannels (a, b) showed sharper slopes, similar to their pattern for f0. The patterns held in words at all utterance locations, with intensity generally declining over utterance duration.

Vowel duration
Finally, distinctions in stance were also associated with systematic differences in vowel duration. Table 9 shows the results for a linear mixed-effects regression model (LMER) with a random slope for stance strength, which provided a better fit than a model without a random slope (χ²(9) = 68.17, p < .001 by likelihood ratio test). (An LMER with a random slope for polarity failed to converge.) Strength levels did not differ reliably, but most stance types differed from minimal-stance (label 0), as indicated by the stars in Table 9. The results for polarity were less clear. The LMER indicated minimal differences between polarity labels but a distinction between neutral and negative. However, mean SCV duration for positive utterances was longest, 121 ms, compared to 96 ms for negative and 94 ms for neutral stances, and post-hoc Welch's t tests showed that positive stances had longer stressed vowel durations than negative and neutral (both p < .001), which did not differ. Thus, there may be strong individual differences in the use of vowel duration to signal polarity. For stance type, there was again high overlap between types, but Welch's t tests identified a few types that differed from most others: backchannels, agreement with rapport, and strongly-expressive intonation (b, at, i) had some of the longest vowel durations and were only indistinguishable from each other and unclear stance (x), which also overlapped agreement (a) and five other types. Agreement (a) also had longer vowel durations and was only indistinguishable from unclear (x) and two other types (fo, r). Other types overlapped heavily and were not clearly distinguishable based on vowel duration.
Figure legend note: "Reluctance + intonation" = combined types: reluctance to accept a stance and strongly-expressive intonation, which were indistinguishable, as were all other types not listed separately, combined here as "else."

Combined prosodic patterns
Following the patterns of each measure above, a few of the stance types were differentiated with a combination of prosodic features. Agreement (a), one of the most frequent types, showed longer vowel duration and moderately low f0 and intensity, which both dipped over the durations of stressed-content vowels. Backchannels (b), one of the least frequent types in the corpus, also showed long vowel duration and low-dropping intensity, but their f0 remained low throughout vowel duration. Reluctance to accept a stance (r) and strongly-expressive intonation (i), also infrequent, showed high f0, the latter also with long vowel duration. Agreement with rapport (at) stood out with the highest intensity and longest vowel duration, and stance-softening/hesitation (f) showed the lowest intensity.
The same prosodic measures also combined to help differentiate levels of stance strength and polarity. Successively increasing levels of strength were best distinguished by increases in both f0 and intensity, while positive polarity was signaled by longer vowel duration. In combining all three measures, weak-positive utterances (1+) stood out as having the longest vowels with the lowest f0 and intensity. This group showed the same patterns as the agreement type mentioned above (a) because the two largely coincided: the majority (66%) of agreeing stance acts (a) occurred in weak-positive utterances (1+), and nearly half (47%) of vowels in weak-positive utterances (1+) contributed to agreement (a), with another 5% involved in a combination of types which included agreement (ac, ae, aet, af, afo, ai, ao, ar, as, at).
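The two overlap percentages above are conditional proportions taken in opposite directions, which is why they differ. A small sketch with invented labels makes the distinction concrete. The `share` helper and the token list are hypothetical, and for simplicity the sketch counts vowel tokens in both directions, whereas the text counts stance acts for the first figure.

```python
def share(tokens, given, target):
    """Fraction of tokens satisfying `given` that also satisfy `target`."""
    subset = [tok for tok in tokens if given(tok)]
    if not subset:
        raise ValueError("no tokens match the conditioning label")
    return sum(1 for tok in subset if target(tok)) / len(subset)

# (stance_type, strength_polarity) labels per token; invented for illustration.
tokens = [
    ("a", "1+"), ("a", "1+"), ("a", "2+"),
    ("o", "1+"), ("o", "1+"), ("b", "0"),
]

def is_agreement(tok):
    return tok[0] == "a"

def is_weak_positive(tok):
    return tok[1] == "1+"

# Of the agreement tokens, how many are weak-positive?
agreement_in_weak_positive = share(tokens, is_agreement, is_weak_positive)
# Of the weak-positive tokens, how many are agreement?
weak_positive_in_agreement = share(tokens, is_weak_positive, is_agreement)
```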

Summary of results
In this study of a large sample of over 32,000 stressed vowels in content words said by 40 speakers, prosodic measures were found to signal stance strength, polarity, and type. F0 and intensity were most associated with differences in stance strength and type: Both increased with stance strength, and they helped distinguish several stance-act types. Reluctance to accept and strongly-expressive intonation (r, i) had very high f0, backchannels (b) very low, and agreement (a) low-dipping; the latter two also showed sharply-dropping intensity, with backchannels lower overall. Stance-softening/hesitation (f) showed the lowest intensity and rapport-building agreement (at) the highest. While most of these types also had longer vowels, vowel duration did not reliably differentiate them. Positive polarity showed longer vowel duration, but individual differences between speakers may cloud the use of duration in signaling polarity. Finally, weak-positive agreement (a,1+) stood out with the longest vowels and lowest f0 and intensity. Table 10 summarizes these results.
These findings support the prediction that information about stance is carried in prosodic features of the acoustic speech signal. It stands to reason that variations in prosody play a strong role in conveying the many complex and subtle meanings of opinions and attitudes. At a phrasal level, many well-known intonational contours can be overlaid on identical lexical/syntactic material to change the meaning from statement to question, scolding to incredulous, genuine to sarcastic, and so on, but in naturally-occurring speech, such well-defined tunes are affected by a host of other contextual factors, making it more difficult to tease apart the acoustic components that contribute to each aspect.
This study identified some components of stance meanings as they were carried on stressed vowels in content words, and while phrasal-level analysis is certainly called for in future work, the very large sample size used here allows pieces of the broader pattern to emerge. Again, it stands to reason that stronger stances had higher f0 and intensity, with increased effort during delivery indicating greater investment; that backchannels and weak agreement were quiet and low-pitched; that rapport-building agreement was delivered energetically; that downplaying a stance was done quietly; and that complex stances (e.g., reluctance to accept an idea without outright rejection) carried complex intonation patterns. Such findings form a solid foundation for expansion into both broader and more detailed acoustic investigations.

Limitations
As some of the first work to report acoustic signals of stance-taking, this study had several limitations, including a 'flattening' of the prosodic information in a spurt caused by collapsing vowel measurements across all spurt positions. Local speaking rate, lexical frequency, and predictability in context were also not considered in detail. Finally, other types of spoken interaction are likely to involve stance types or prosodic contours that are not well represented in the collaborative tasks used here, which encouraged cooperation with low stakes and no consequences attached to any decision the participants made. More competitive tasks or controversial topics are likely to elicit more disagreement, persuasion, and stronger opinions, which may be expressed with distinct prosodic cues.

Conclusion
This study provides an initial sketch of the prosodic cues to stance: the ways in which components like f0, intensity, and duration can be manipulated and combined to send complex messages about our attitudes, opinions, and interpersonal relationships. Such information not only deepens our understanding of human communication but also contributes to the growing body of computational work on sentiment analysis (see e.g., Mäntylä, Graziotin, & Kuutila, 2018), for use in both automatic detection and human-interactive production. Given that many other types of information (social/indexical, discursive, structural, etc.) are sent in the same acoustic stream, stance should be considered as a potential influencing factor when designing and analyzing studies of variation in pronunciation and prosody in natural speech.