On the nature of the perception-production link: Individual variability in English sibilant-vowel coarticulation

This study aims to elucidate the nature of the perception–production link with respect to coarticulation by examining the production and perception of English sibilants before different vowels. A group of native speakers of American English were recorded reciting a set of /s/- and /ʃ/-initial words in different vocalic contexts and took part in an identification experiment designed to test their ability to adjust their perceptual expectation in light of the vocalic influence on the preceding sibilant. Significant correlations between the production and perception results were observed when by-subject estimates for context-relevant predictors (and their interactions) in the perception regression models were examined in relation to the by-subject estimates of the production regression models. These results suggest a positive correlation between how much an individual attends to context-specific variation in perception and how the sibilant contrast is realized in specific vocalic contexts. Ramifications of these findings are discussed for the nature of speech perception and production and the understanding of sound change.


Introduction
Understanding the nature of individual variation in speech perception and production is increasingly important, particularly for research on sound change and propagation (Stevens & Harrington, 2014). Various scholars have argued in recent years that sound change actuation might come about as a result of interactions between individuals with different perceptual and/or articulatory targets for the 'same' sound category (Baker, Archangeli, & Mielke, 2011; Yu, 2013, 2016) or different tendencies to attach social meaning to linguistic differences (Garrett & Johnson, 2013). Beddor (2009), for example, argued that listeners can be accurate perceivers who attend to coarticulatory information available to them in the input signal but nonetheless have different perceptual weightings (or phonological grammars) in terms of how they use coarticulation to signal the presence (or not) of the coarticulation trigger in the signal. Various scholars, most prominently Ohala (1993b), have argued that listeners who fail to compensate for coarticulatory effects properly would lead to sound change. For example, oral vowels in nasal contexts (e.g., VN sequences) might be mistakenly reconstructed as nasal (e.g., ṼN) if listeners fail to take into account the effects of coarticulatory nasalization. Likewise, Beddor and Krakow (1998) suggested that if vowels in nasal contexts are perceived as partially nasalized, listeners might fail to disambiguate fully the spectral contribution of nasalization and tongue/jaw position in both contextual and distinct nasal vowels, fostering vowel height shifts in both. Extending the logic that hypo- or hyper-corrective sound changes are due to listeners failing to properly take contextual information into account, some scholars have argued that listeners who compensate less, or not at all, for coarticulation could be more likely to initiate context-dependent sound changes (Yu, 2010, 2016; Yu & Lee, 2014). 
Identifying the underlying nature of individual variability in the perception and production of coarticulated speech might help explain why certain sound changes happen at all and, conversely, why sound changes are so rarely actuated even though the phonetic pre-conditions are always present in speech.
One common avenue to explore the nature of individual variation in speech perception and production is to examine how the two might be related. Studies using paradigms such as altered auditory feedback (e.g., Houde & Jordan, 2002; Katseff, 2011; Shiller, Sato, Gracco, & Baum, 2009) and phonetic imitation (e.g., Babel, 2012; Nielsen, 2011; Yu, Abrego-Collier, & Sonderegger, 2013) have generally found speakers to be quite adept at adjusting their production patterns in the face of adjusted perceptual feedback or altered perceptual experiences, suggesting a close link between speech perception and production. Within the context of the perception and production of coarticulated speech, some studies have reported a positive mapping between perception and production. Beddor and Krakow (1999), for example, compared the patterns of perceptual compensation for vowel-nasal coarticulation among English and Thai listeners and found that Thai listeners compensated for vowel-nasal coarticulation less than the English listeners (i.e., Thai listeners were more accurate in detecting vowel nasality in contexts where perceptual compensation is expected to reduce sensitivity to the presence of vocalic nasalization). They explained this difference by appealing to the fact that nasal coarticulation in Thai is less extensive than in English. Thai listeners who experience smaller degrees of contextual nasalization on a regular basis might come to expect less nasalization (and conversely be more sensitive to an unexpectedly high degree of nasalization) in the appropriate contexts. A similar cross-linguistic correlation was observed in the perception and production of vowel-to-vowel coarticulation (Beddor, Harnsberger, & Lindemann, 2002). 
Zellou and colleagues have recently demonstrated that American English speakers are able to imitate different degrees of vocalic nasalization from their own, suggesting further a positive correlation between perception and production (Zellou, Dahan, & Embick, 2017;Zellou, Scarborough, & Nielsen, 2016).
However, the correlation between the production and perception of coarticulation is not always consistently observed. Harrington, Kleber, and Reubold (2008), for example, found an age-based correlation between /u/-fronting and listeners' perceptual compensation for this fronting effect. Younger speakers of Southern British English compensated less for the /u/-fronting effect compared to the older speakers who exhibited stronger context-specific /u/-fronting. However, it is unclear how these findings inform the perception-production link concerning coarticulation since listeners who compensate less (i.e., the younger group) also have more fronted /u/ regardless of context. Thus, it is not clear if /u/ is less coarticulated in the fronting context or it is simply produced as more front in general. It also remains unknown whether speakers of the same age range exhibit a similar perception-production link concerning /u/-fronting. Kataoka (2011), who examined /u/-fronting in alveolar contexts in speakers of California English between the ages of 19 and 45, found no significant correlation between the production and perception of /u/-fronting. Others focused on long-distance vowel-to-vowel coarticulation and found that the magnitude of long-distance vowel-to-vowel coarticulation does not correlate with individuals' ability to discriminate coarticulated vowels in isolated contexts (Grosvald, 2009; Grosvald & Corina, 2012). However, since their perceptual measures did not specifically test for the listeners' ability to utilize knowledge of coarticulation, the findings of Grosvald (2009) and Grosvald and Corina (2012) remain inconclusive. That is, there are no theoretical reasons to think that there should be a relationship between the ability to discriminate vowels (coarticulated or otherwise) in isolation (i.e., contextual information is not given) and to produce vowels in context. 
Thus the lack of a relationship between the results of the production and perception tasks reported in Grosvald (2009) and Grosvald and Corina (2012) does not constitute negative evidence against the relationship between the perception and production of coarticulation per se since they did not provide any independent measures of the listener's ability to perform perceptual readjustment to coarticulated speech. In a recent study, Zellou (2017) investigated the correlations between individual differences in the production of nasal coarticulation and patterns of perceptual compensation in American English based on the results of a production task, a paired discrimination task, and a nasality rating task. Individuals who produce less extensive anticipatory nasal coarticulation exhibit more veridical acoustic perception (indicating less attention to potential sources of coarticulation in the signal) than individuals who produce greater coarticulatory nasality in the paired discrimination task. However, listeners' nasal coarticulation in production did not predict results from the rating task. This inconsistency suggests potential task sensitivity in reifying the perception-production link. Clearly, further investigations into the perception-production link with regards to coarticulated speech are needed.
The perception-production link might be further complicated by other mediating factors. For example, studies concerning the processing of coarticulated information have found that coarticulatory information not only affects the classification and discrimination of speech, but also the temporal dynamics of speech processing (e.g., Beddor, Coetzee, Styler, McGowan, & Boland, 2018; Beddor, McGowan, Boland, Coetzee, & Brasher, 2013; Dahan, Magnuson, & Tanenhaus, 2001; Mahr, McMillan, Saffran, Weismer, & Edwards, 2015). The perception-production correspondence might be affected by individual variability in general processing skills and styles. Beddor et al. (2018), for example, found that participants who produced an earlier onset of coarticulatory nasalization on vowels were more efficient users of nasality as listeners as that information unfolds over time. Furthermore, previous literature on the perception-production link outside of the coarticulated speech domain also suggests that the link might be more nuanced. For example, the distinctness of an individual's production of a contrast has been found to correlate with how well the individual discriminates that contrast (Newman, Clouse, & Burnham, 2001). However, it remains unclear how contrast distinctness might play a role in the perception and production of coarticulated speech. That is, how does coarticulation influence the context-specific distinctness of segmental contrasts in production and perception? To this end, the present study provides an opportunity to address this gap in the literature.
Current theories of the perception and production of coarticulated speech differ in their predictions on how the two may be related. Theories of perceptual compensation that see phonetic perception as a matter of simple auditory function (e.g., Lotto & Holt, 2006) predict a relationship whereby speech production is concerned with uncovering strategies for producing optimally perceivable acoustic signals. 
As such, the mapping between the two is imprecise due to the non-uniqueness of the mapping between the acoustic targets and vocal tract configurations. However, given that speech articulation is hypothesized to serve perceptual goals, one might expect the listeners who perceive the presence of coarticulation information in the speech signal to also produce speech that achieves a similar degree of coarticulation. From this perspective then, auditory theories would predict a positive correlation between the perception and production of coarticulated speech. However, the nature of the production rendered could vary as long as the acoustic outputs achieve the desired percepts that resemble coarticulation.
Gestural theories of speech perception, such as the Motor Theory (Liberman & Mattingly, 1985) and Direct Realism (Fowler, 2006), see speech perception as being guided by the recovery of gestures in the underlying signal. Such gestural knowledge might stem from an 'innate vocal tract synthesizer' (Liberman & Mattingly, 1985) or some presumed universal function of perceiving in the world, as in the case of Direct Realism. Due to the universal nature of the presumed gestural knowledge, which helps explain why perceptual compensation is not language specific or unique to humans (Viswanathan, Magnuson, & Fowler, 2010), such theories generally provide little insight into the nature of individual variability in perceptual compensation. To the extent that variation is acknowledged, it is attributed to differences in listening modes (e.g., Fowler & Brown, 2000; Repp, 1981, and more discussion below). If the universalist assumption is relaxed and an individual's knowledge of coarticulation derives from his/her coarticulatory habits, gestural theories would predict that the magnitude of coarticulation in production should mirror the magnitude of perceptual compensation, as the objects referenced in perception and production are one and the same (i.e., phonetic gestures of the vocal tract).
A similarly direct link between the production and perception of coarticulation can be found in exemplar-based approaches to speech perception and production. Pierrehumbert (2002), for example, posits an explicit perception-production loop, where stored perceptual experiences are weighted by social and attentional factors and such perceptual exemplars are drawn upon to generate production targets. Context-specific perceptual experiences would lead to context-specific realizations of production targets. Sonderegger and Yu (2010) laid out an explicit model along this line, showing in particular that the listeners' perceptual compensatory responses to vowel-to-vowel coarticulation can be modeled effectively using a rational (in this case, Bayesian) model of speech perception. Crucially, the model assumes as inputs context-specific acoustic cues, such as means of F1 and F2 of the target vowels in different vocalic contexts in modeling the perception of vowel-to-vowel coarticulation. Thus, the acoustic measures, which were derived from a production study and presumably reflected the production targets of the speakers, helped explain the perceptual behaviors of the listeners. However, since the acoustic measures were not drawn from the same individuals who participated in the perceptual task, it is unclear if the perception-production link is evident at the individual level.
The approaches reviewed thus far assume that speech perception is veridical and that perceptual compensatory behaviors can be fruitfully modeled given proper articulatory/acoustic information and vice versa. There are, however, models that are more 'non-veridicality-centric,' that is, they assume that the input signal could be recoded before perception is registered. The so-called 'C-CuRE' approach to perceptual compensation for coarticulation (Cole, Linebaugh, Munson, & McMurray, 2010; McMurray & Jongman, 2011), as well as models of sound change that rely on listener misperception as a driving force behind certain sound changes, predict a negative correlation between the perception and production of coarticulated speech. C-CuRE, for example, assumes that the incoming acoustic cues are initially encoded veridically, but cues are recoded (hence non-veridically stored) in terms of their differences from expected values, which can be specific to particular individuals, as different sources of variance are categorized. This approach predicts that individuals who engage in expectation adjustment robustly in perception (i.e., those who compensate more robustly) would have production targets of a sound category that have low variance and are relatively context-free (less coarticulated). Conversely, for individuals who do not adjust for contextual information robustly in perception (i.e., they are more veridical in perception, perhaps similar to Repp's 1981 auditory listeners; see discussion below), the same sound category might have a more diffuse distribution and more context-dependent (more coarticulated) production targets.
Finally, as noted in Whalen (1999), some see no necessary connection between perception and production. Perception can proceed with no knowledge of production, as is the typical position in the automatic speech recognition (ASR) literature (see Livescu, Jyothi, & Fosler-Lussier, 2016, for an alternative view).
Understanding the link between speech perception and production is all the more important to language and speech researchers in light of an increasing number of reported cases of variability in the perceptual compensation for coarticulation across individuals. Beddor (2009), for example, found a great deal of individual variability in the perception of nasalization in VNC sequences in American English. She suggested that the variability might stem from differences in perceptual grammar across individuals. Mann and Repp (1980) were the first to report individual variability in perceptual identification of English sibilants in different vocalic contexts. Repp (1981) explained the individual variation in terms of two different strategies for listening to sibilant-vowel sequences.
Some listeners are what he referred to as auditory listeners, who segregate the noise portion from the vocalic portion, while the others are phonetic listeners, where sibilant noise information is more integrated with the vocalic portion. Based on their findings, Yu and Lee (2014), who also observed individual variation in perceptual compensation for sibilant-vowel coarticulation, argued that the observed individual variability is more continuous than suggested by Repp's (1981) two-listening-mode model. Yu (2010) argued that the magnitude of perceptual compensation for the vocalic effect on sibilant perception might be modulated by the listener's gender and autistic-like traits, as measured by the Autism Spectrum Quotient (AQ; Baron-Cohen, Wheelwright, Skinner, Martin, & Clubley, 2001); the Autism Spectrum Quotient is a short, self-administered scale for identifying the degree to which any individual adult of normal IQ may have traits associated with Autism-Spectrum Disorder (ASD), where clinical diagnosis involves difficulties in social development and communication, alongside the presence of unusually strong repetitive behavior or 'obsessive' interests (American Psychiatric Association, 2013;I.C.D-10, 1994). In particular, he found females with a lower AQ score (i.e., less autistic-like) to be less likely to perceptually compensate for coarticulation. More recently, Yu (2016) found that speakers of Hong Kong Cantonese with less autistic-like traits exhibit more vocalic influence on /s/ production than those with more autistic-like traits. That study did not examine the participants' perceptual responses to coarticulatory information, however. The present study contributes to this line of inquiry by examining whether individual variation in perceptual responses to coarticulatory information might correlate with individual variation in coarticulation in production.
To examine the link between the perception and production of coarticulated speech, we report the results of two experiments examining (i) the identification of sibilants in different vocalic contexts and (ii) the acoustic realization of these sibilants in the corresponding contexts. Crucially, each participant took part in both experiments in order to allow for a direct examination of the correlation between the results of the two tasks. The investigation suggests that individual variability in the acoustic realization of /s/ and /ʃ/ in different vocalic contexts is indeed correlated with variability in how individuals respond to sibilants in different vocalic contexts perceptually.

Participants
Forty-two adult native speakers of American English (twenty-eight females), ranging in age from 18 to 53 (median = 20, SD = 7) and with no reported history of speech, language, or hearing problems, were recruited from the University of Chicago community and received a nominal fee or course credit for participating in the study.

Stimuli
For the perceptual task, the stimuli were those employed in Yu and Lee (2014) and can be found at https://bit.ly/2CGcJRK. Readers are referred to Yu and Lee (2014) for a detailed explanation of how the stimuli were created. Briefly, they created two 7-step /V_i sV_i/-/V_i ʃV_i/ continua (V = /a/ or /u/) by mixing, using weighted averages of the waveforms, /s/ and /ʃ/ taken from original /sa/ and /ʃa/ syllables produced by a male native speaker of American English. The natural /s/ and /ʃ/ were included as endpoints of the 7-step series. The seven fricatives (synthesized and natural) were then cross-spliced with /a/ and /u/ taken from original /da/ and /du/ syllables produced by the same male speaker. The resulting tokens were then normalized for intensity and pitch.
The target stimuli for the production task were English words where the initial stressed syllable contained the onset consonant /s/ or /ʃ/ and one of the following vowels: /i/, /u/, or /æ/. As noted above, we chose to employ the stimuli used in Yu and Lee (2014) to ensure maximal compatibility between earlier perceptual experiments and the current one. However, this methodological choice created several complications for the production task. To begin with, as the number of word-initial /s/ and /ʃ/ minimal pairs in the /a/ context is rather limited, we used another low vowel, /æ/, as the environment for the production stimuli in order to preserve the contrast between low vs. high vowel contexts in the perceptual stimuli. In order to expand the empirical coverage of the production study, we also included the /i/ context to examine the effect of lip rounding apart from vowel height. These methodological choices are admittedly not ideal since the vowel contexts examined in the perception and production tasks are not isomorphic. Future extensions of this experiment should aim for a more direct mapping between the perceptual and production stimuli. Each CV combination was instantiated by four distinct words. A total of 48 tokens (2 sibilants × 3 vowels × 4 words × 2 repetitions) were elicited from each participant. The target words are listed in Table 1.

Procedure
Participants performed a two-alternative forced choice (2AFC) identification task in which they listened to a series of VCV sequences, where C was one of the seven steps of the synthesized /s/-/ʃ/ continuum and V was /a/ or /u/, and had to decide whether the fricative was /s/ or /ʃ/. Participants responded to six repetitions of each stimulus for a total of 84 randomly ordered trials (= 7 steps × 2 vowel contexts × 6 repetitions). The order of response options was counter-balanced across participants. Participants were given three seconds to respond before the presentation of the next stimulus. Approximately 2% of the trials had no responses.
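The factorial structure of the identification task can be sketched in a few lines of Python. This is only an illustration of the 7 × 2 × 6 design described above; the function name, the dictionary keys, and the seed are hypothetical, and the actual presentation software is not described in the text.

```python
import itertools
import random

STEPS = range(1, 8)      # 7 steps along the /s/-/sh/ continuum
VOWELS = ("a", "u")      # vocalic context of the VCV stimulus
REPETITIONS = 6          # repetitions of each stimulus

def build_trials(seed=0):
    """Return a randomly ordered list of identification trials.

    The seed argument is a hypothetical convenience for reproducibility.
    """
    trials = [
        {"step": step, "vowel": vowel, "repetition": rep}
        for step, vowel, rep in itertools.product(
            STEPS, VOWELS, range(1, REPETITIONS + 1)
        )
    ]
    random.Random(seed).shuffle(trials)
    return trials

trials = build_trials()
print(len(trials))  # 7 steps x 2 vowels x 6 repetitions
```

Every step-by-vowel cell occurs exactly six times in the resulting list, which is what makes the by-subject identification functions comparable across the two vocalic contexts.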
For the production task, each participant was digitally recorded individually in a quiet room at a sampling rate of 44,100 Hz with a Marantz PMD 670 solid-state recorder and a Shure SM10A head-mounted microphone, reading the target stimuli in a random order in a carrier phrase, "say ___ again." The perception task always preceded the production task. Approximately 5% of the trials were lost due to mispronunciations or noise from participants touching the microphone.
Fricative segmentation involved the simultaneous consultation of the waveform and wideband spectrogram. Fricative onset was defined as the point at which high-frequency energy (roughly in the region above the second formant of the following vowel) first appeared on the spectrogram and/or the point at which the number of zero crossings rapidly increased. Frication offset was defined as the intensity minimum immediately preceding the onset of vowel periodicity.
The spectral properties of English sibilants have been extensively studied in the past (e.g., Blacklock, 2004; Iskarous, Shadle, & Proctor, 2011; Jongman, Wayland, & Wong, 2000; Whalen, 1981, 1991). Following earlier reports, especially Shadle and Mair (1996) and Jongman et al. (2000), we analyzed the spectral properties of sibilant noise in terms of the spectral peak frequency, the first four spectral moments, and the total fricative duration. English /ʃ/ typically exhibits a mid-frequency peak at around 2.5-3 kHz, while /s/ displays a primary spectral peak at around 4-5 kHz, although the location of the spectral peak is partly dependent on the speaker (Hughes & Halle, 1956) and vowel (Soli, 1981). Likewise, the first spectral moment (i.e., the spectral mean) also distinguishes well between /s/ and /ʃ/ in English (Jongman et al., 2000; Shadle & Mair, 1996) and across different vowel contexts (Nittrouer, 1995), genders (Nittrouer, 1995), and socio-economic classes (Stuart-Smith, 2007). Some report /s/ to be distinct from /ʃ/ in terms of having lower standard deviation (Jongman et al., 2000; Tomiak, 1990), although Li, Edwards, and Beckman (2009) found /s/ to have a more diffuse shape (higher standard deviation) than /ʃ/ and /ɕ/. /ʃ/ is found to have a positive skewness, i.e., a concentration of energy in the lower frequencies. The sibilant /s/ also has a higher kurtosis (a more peaked distribution) than /ʃ/ in English (Jongman et al., 2000; Li et al., 2009). Shadle and Mair (1996) reported that the particularly high kurtosis value for /s/ around /u/ compared to other vowels is likely due to a whistly /s/ in rounded contexts.
A custom-made PRAAT script automatically extracted the spectral measurements. Similar to the measurement procedure described in Jongman et al. (2000), DFTs (frequency range: 500-12000 Hz) were calculated using a 40 ms Hamming window with preemphasis at 80 Hz, centered at eleven points (at 10% increments of the fricative's duration from 0% to 100%) during the fricative. That is, each DFT is based on a window that spanned the preceding and following 20 ms of each measurement point. Measurements at 0%, 10%, 90%, and 100% were not included in the analysis. Spectral peak is defined in the script as the highest amplitude peak of the DFT. The same script also measured the duration of the sibilant.
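As a rough illustration of the measurement scheme just described, the following Python sketch computes the seven retained window centres (20% to 80% of the fricative) and derives the peak frequency and the first four spectral moments from a power spectrum. It is not the PRAAT script used in the study: the function names are hypothetical, the spectrum is treated as a probability distribution over frequency, and kurtosis is computed as excess kurtosis, an assumption that matches the remark elsewhere in the text that kurtosis values can be negative.

```python
import math

def measurement_points(onset, offset):
    """Window centres at 10% increments, keeping only 20%..80%."""
    duration = offset - onset
    all_points = [onset + duration * k / 10 for k in range(11)]  # 0%..100%
    return all_points[2:9]  # drop 0%, 10%, 90%, and 100%

def spectral_peak(freqs_hz, power):
    """Frequency of the highest-amplitude component of the spectrum."""
    return max(zip(power, freqs_hz))[1]

def spectral_moments(freqs_hz, power):
    """Return (mean, sd, skewness, excess kurtosis) of a power spectrum."""
    total = sum(power)
    weights = [p / total for p in power]  # normalise to sum to 1
    mean = sum(f * w for f, w in zip(freqs_hz, weights))
    variance = sum((f - mean) ** 2 * w for f, w in zip(freqs_hz, weights))
    sd = math.sqrt(variance)
    skewness = sum((f - mean) ** 3 * w for f, w in zip(freqs_hz, weights)) / sd ** 3
    kurtosis = sum((f - mean) ** 4 * w for f, w in zip(freqs_hz, weights)) / sd ** 4 - 3.0
    return mean, sd, skewness, kurtosis

# Toy, hypothetical three-bin spectrum (not real fricative data).
freqs = [3000.0, 4000.0, 5000.0]
power = [1.0, 2.0, 1.0]
print(measurement_points(0.0, 0.1))
print(spectral_peak(freqs, power), spectral_moments(freqs, power))
```

In practice the spectrum would come from a 40 ms windowed DFT restricted to 500-12000 Hz at each of the seven points; the toy spectrum here only serves to show the arithmetic.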
The results of the experiments were analyzed in two different ways, at the group level and at the individual level. The perception-production link will be examined in the individual-level analysis section.

Group-level perceptual results
To examine the perceptual responses at the group level, participants' perceptual responses (/ʃ/ response = 1, /s/ = 0) were modeled using logistic mixed-effects regression fitted in R, using the lmer() function from the lme4 package (version 0.999999-2; Bates, Maechler, & Bolker, 2011). Wald's z test, which describes how distant a coefficient estimate is from zero in terms of its standard error, was used to test the significance of the model estimates.
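The logic of the Wald test can be made concrete with a small Python sketch. The estimate and standard error below are hypothetical values chosen for illustration, not results from the study, and the two-sided p-value assumes the usual standard-normal reference distribution.

```python
import math

def wald_z(estimate, standard_error):
    """Distance of a coefficient estimate from zero in standard-error units."""
    return estimate / standard_error

def two_sided_p(z):
    """Two-sided p-value under a standard normal: 2 * (1 - Phi(|z|))."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# Hypothetical coefficient and standard error.
z = wald_z(0.5, 0.25)
print(z, two_sided_p(z))
```

The same function can be applied to any reported z value to check the significance threshold it is said to cross.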
A series of regression models were tested with three within-subject predictors (Trial indexed the order in which the stimuli were presented, Vowel indexed the vocalic contexts [/u/ vs. /a/] and Step indexed the seven steps along the /s/-/∫/ continuum) as well as a between-subject predictor of participant's biological Sex (female vs. male). Continuous variables (Trial and Step) were centered and z-scored. Sex was sum-coded, while Vowel was treatment-coded with /a/ as the baseline.
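The predictor coding described above can be sketched as follows. This is a minimal illustration, assuming that z-scoring uses the sample standard deviation (as R's scale() does) and that sum coding assigns +1 to females and -1 to males; the text does not state which level received which sign, so that assignment is hypothetical.

```python
import statistics

def zscore(values):
    """Centre a continuous predictor and scale it by its sample SD."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

# Sum coding for Sex (sign assignment is an assumption).
SEX_SUM_CODING = {"female": 1, "male": -1}

# Treatment coding for Vowel with /a/ as the baseline.
VOWEL_TREATMENT_CODING = {"a": 0, "u": 1}

step_z = zscore([1, 2, 3, 4, 5, 6, 7])
print([round(z, 3) for z in step_z])
print(SEX_SUM_CODING["female"], VOWEL_TREATMENT_CODING["u"])
```

With this coding, the coefficients for Step and its interactions are interpreted at the /a/ baseline and averaged over the two Sex levels.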
The final model included only four main predictors (Trial, Vowel, Step, and Sex), the two-way interactions between Vowel and Step, and their interactions with Sex. To account for the non-linearity of Step, the quadratic term of Step (i.e., Step²) was included in the model. By-subject random slopes were included for Trial, Vowel, Step, and Step², as well as the interactions between Vowel and Step and between Vowel and Step², to allow for by-subject variability in the effect of each of the variables on /ʃ/-identification. The model formula in lme4 style was: /ʃ/-response ~ Trial + Vowel * (Step + I(Step^2)) * Sex + (1 + Trial + Vowel * (Step + I(Step^2)) | subject).
In light of findings from previous literature, listeners exhibiting perceptual compensation for coarticulation are expected to respond /ʃ/ less often before /u/ than before /a/. This is the pattern observed. Figure 1 illustrates the probability of a /ʃ/-response in different vocalic contexts. The regression model shows a main effect of Vowel (β = -1.018, z = -4.604, p < 0.001), suggesting that the cohort, as a whole, exhibits more /ʃ/-responses before /a/ than before /u/. There are also significant main effects of Step (β = 4.516, z = 13.374, p < 0.001) and Step² (β = -0.981, z = -4.396, p < 0.001), suggesting that, in the /a/ context, the log-odds of /ʃ/-identification across the sibilant continuum follow a mildly concave-downward curve. Crucially, Vowel significantly interacts with Step (β = 1.093, z = 2.958, p < 0.01) and Step² (β = -1.318, z = -3.385, p < 0.001), suggesting that the log-odds of /ʃ/-identification across the continuum are more strongly concave downward in the /u/ context than in the /a/ context. As illustrated in Figure 1a, the probability of /ʃ/-identification in the /a/ context was low at the /s/-end of the sibilant continuum and rose gradually toward the middle of the continuum. In the /u/ context, the probability of /ʃ/-identification rose more sharply in the middle of the sibilant continuum than toward the two ends. Neither the main effect of Sex nor any interaction involving Sex was significant.
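To see how the reported fixed effects combine, the following Python sketch evaluates the group-level log-odds and probability of a /ʃ/-response as a function of z-scored Step in each vocalic context. The five coefficients are those reported above; the intercept is not reported in this section, so it is set to 0 purely as a placeholder assumption, and Trial and the sum-coded Sex terms are held at 0. The absolute probabilities are therefore illustrative only; the differences between contexts are what the sketch is meant to show.

```python
import math

# Fixed-effect estimates reported in the text.
COEF = {
    "vowel": -1.018,
    "step": 4.516,
    "step2": -0.981,
    "vowel:step": 1.093,
    "vowel:step2": -1.318,
}
INTERCEPT = 0.0  # ASSUMPTION: intercept not reported here; placeholder only

def log_odds(step_z, vowel_is_u):
    """Log-odds of a /sh/ response; vowel_is_u is 0 for /a/ and 1 for /u/."""
    v = 1 if vowel_is_u else 0
    linear = COEF["step"] + v * COEF["vowel:step"]
    quadratic = COEF["step2"] + v * COEF["vowel:step2"]
    return INTERCEPT + v * COEF["vowel"] + linear * step_z + quadratic * step_z ** 2

def probability(step_z, vowel_is_u):
    """Inverse-logit transform of the log-odds."""
    return 1.0 / (1.0 + math.exp(-log_odds(step_z, vowel_is_u)))

for step_z in (-1.0, 0.0, 1.0):
    print(step_z, round(probability(step_z, 0), 3), round(probability(step_z, 1), 3))
```

Under treatment coding, the quadratic coefficient in the /u/ context is the sum of the Step² estimate and its interaction with Vowel (-0.981 + -1.318 = -2.299), which is why the curve bends downward more strongly before /u/.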

Group-level production results
A summary of six spectral measures for /s/ and /ʃ/ in different vocalic contexts is presented in Table 2. As expected, /s/ shows qualitatively lower spectral mean/peak frequency, higher standard deviation, less peakiness, and less negative skewness in the /u/ context than in the other unrounded vowel contexts. There is less, if any, vowel-dependent variation in the spectral properties of /ʃ/. Sibilants are also found to be shorter before the low vowel /æ/ than before the high vowels (cf. Yu, 2016). While the spectral information measured is commonly analyzed separately for the purpose of elucidating the acoustic properties of fricatives, the mapping between individual spectral measures and their perceptual correlates is not clear, especially since some of the measures are highly correlated with each other (e.g., spectral mean, peak frequency, and skewness). To reduce the dimensionality of the mapping between spectral cues and perceptual responses, and to simplify the correlation analysis used to examine the perception-production link in the next section, the spectral measures were not analyzed individually. Instead, an integrated cue-combination approach was taken such that the five spectral measures (centroid frequency, standard deviation, kurtosis, skewness, and peak frequency; see Table 2 for a summary) were first submitted to a principal component analysis (PCA) to obtain linear combinations of these spectral variables that would capture the maximum variation. This analytic procedure has been successfully applied to the analysis of sibilants elsewhere (e.g., Yu, 2016). The specifics of the PCA are as follows. Spectral peak frequency, spectral mean, and spectral standard deviation, which are all in Hertz, and kurtosis, which is unitless but negatively skewed, were log-transformed (natural log). Skewness was not transformed in any way since it is already unitless and normally distributed. 
Since the kurtosis values can be negative, all kurtosis values were increased by 3 to ensure that they were positive prior to log-transformation. The spectral measures (measurements at the first two and last two measurement points were not included in the analysis to avoid measurement problems at the edges of the sibilant), not including sibilant duration, were analyzed using the prcomp() function in R, which performs a principal components analysis on the given data matrix. All acoustic parameters were centered and z-scored for the PCA.
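The transformations applied before the PCA can be sketched in Python as follows. This covers only the preprocessing (log transforms, the +3 offset for kurtosis, and z-scoring); the PCA itself, which the study ran with prcomp() in R, is not reproduced. The function names, dictionary keys, and the three example rows are hypothetical and do not come from the study's data.

```python
import math
import statistics

def zscore(values):
    """Centre a column and scale it by its sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def preprocess_for_pca(rows):
    """Transform and standardise the five spectral measures.

    Each row is a dict with the (hypothetical) keys
    'peak', 'mean', 'sd', 'skewness', and 'kurtosis'.
    """
    columns = {
        "log_peak": [math.log(r["peak"]) for r in rows],
        "log_mean": [math.log(r["mean"]) for r in rows],
        "log_sd": [math.log(r["sd"]) for r in rows],
        # Shift by 3 so that negative kurtosis values become positive.
        "log_kurtosis": [math.log(r["kurtosis"] + 3.0) for r in rows],
        # Skewness is left untransformed.
        "skewness": [r["skewness"] for r in rows],
    }
    return {name: zscore(values) for name, values in columns.items()}

# Hypothetical measurements (Hz for peak, mean, and sd).
example_rows = [
    {"peak": 4500.0, "mean": 5200.0, "sd": 1500.0, "skewness": -0.4, "kurtosis": 1.2},
    {"peak": 2800.0, "mean": 3600.0, "sd": 1900.0, "skewness": 0.9, "kurtosis": -1.5},
    {"peak": 5100.0, "mean": 5600.0, "sd": 1300.0, "skewness": -0.8, "kurtosis": 4.0},
]
standardised = preprocess_for_pca(example_rows)
print(sorted(standardised))
```

The standardised columns would then form the data matrix passed to the PCA, so that each measure contributes on a comparable scale.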
The relative weightings and proportion of variance for each component are summarized in Table 3. The optimal linear combination (PC1), which accounts for about 59% of the variance, and the 2nd component (PC2), which accounts for approximately 32% of the variance, were selected as independent variables for the analysis below as the first two components collectively account for more than 90% of the variance. PC1 has strong loadings for skewness and log-transformed peak frequency and spectral mean, which are all spectral measures that characterize the concentrations of spectral energy. Higher spectral mean and peak values and left-skewness generally correspond to a more /s/-like percept (Jongman et al., 2000). PC2, on the other hand, is dominated by standard deviation and kurtosis, which pertain to the energy levels across different frequencies of the spectrum. Lower standard deviation and higher kurtosis generally correspond to a more /s/-like percept (Jongman et al., 2000).
The scores for the two linear-transformed components, obtained using the predict() function in R, were modeled separately using linear mixed-effects regression fitted with the lmer() function from the lme4 package (version 0.999999-2; Bates et al., 2011). Both components were tested for the following task-level effects: Trial order (1-45), Syllable count (1-3), sibilant Duration, sibilant Type (/s/ vs. /ʃ/), Vowel (/i/, /æ/, /u/), measurement Position (1-7), the participant's biological Sex (female vs. male), and the two- and three-way interactions between Type, Vowel, Position, and Sex. In order to take into account the temporal dynamics of the vocalic influence, the quadratic term for Position (i.e., Position²) was included in the analysis. The number of syllables of each stimulus, which might affect the realization of the target sibilant, was taken into account since syllable count was not controlled for in stimulus selection in an effort to maintain the prosodic position of the target sibilants. Continuous variables, including Position, were centered and z-scored. The Vowel variable was treatment-coded such that the /æ/ context was taken to be the baseline. Thus the first contrast, Vowel_i, compares the influence of /æ/ to the influence of the high vowel /i/, while the second contrast, Vowel_u, encodes the rounding contrast between /æ/ and /u/. The models also included by-subject random intercepts to allow for subject-specific variation in the specific spectral component as well as by-subject random slopes for each of the main predictors. The model formula in lme4 style for the first two principal components of the spectral measures (PC) was PC ~ Trial + Duration + Syllable + (SibilantType + Vowel + Position + Sex)^3 + (SibilantType + Vowel + I(Position^2) + Sex)^3 + (1 + Trial + Duration + Syllable + SibilantType + Vowel + Position + I(Position^2) | Subject).
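To make the predictor coding concrete, the following minimal sketch shows treatment coding of Vowel with /æ/ as the baseline, together with a centered and z-scored Position and its quadratic term. It is an illustration under stated assumptions: the function and variable names are hypothetical, the vowel labels are ASCII stand-ins ("ae" for /æ/), and the z-scoring is done over the seven nominal positions rather than over the study's actual observations.

```python
import statistics

VOWEL_LEVELS = ("ae", "i", "u")  # "ae" stands in for the baseline vowel

def treatment_code_vowel(vowel, baseline="ae"):
    """Return (Vowel_i, Vowel_u) indicators; the baseline level is coded (0, 0)."""
    if vowel not in VOWEL_LEVELS:
        raise ValueError(f"unknown vowel level: {vowel!r}")
    return tuple(1 if vowel == level else 0
                 for level in VOWEL_LEVELS if level != baseline)

def zscore(values):
    """Center on the mean and divide by the sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

# Measurement positions 1-7, z-scored, plus the quadratic term built from the scaled values.
positions = list(range(1, 8))
position_z = zscore(positions)
position_z_squared = [p * p for p in position_z]
```

Under this coding, a coefficient for Vowel_u is read as the difference between the /u/ context and the /æ/ baseline, which is how the Vowel_u effects are interpreted in the text.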
The residuals of the initial fit of each model were examined and were found to deviate strongly from normality. As a result, residuals which were more than 2.5 standard deviations from the mean were trimmed, which amounted to no more than 2.6% of the data for each principal component modeled, and the models were refitted to the trimmed data set. The new models had residual distributions much closer to normality, and it is the refitted models that are reported below. The estimates for all predictors in the analysis of the first two principal components of the spectral measures can be found in Table 4.
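The trimming step can be sketched as follows. This is a minimal illustration rather than the study's code: the function name trim_by_residual and the toy values are hypothetical, and the fitted values are assumed to come from an already fitted model.

```python
import statistics

def trim_by_residual(observed, fitted, cutoff=2.5):
    """Return indices of observations whose residual lies within
    `cutoff` standard deviations of the mean residual."""
    residuals = [o - f for o, f in zip(observed, fitted)]
    mean = statistics.fmean(residuals)
    sd = statistics.stdev(residuals)
    return [i for i, r in enumerate(residuals) if abs(r - mean) <= cutoff * sd]

# Toy data: 19 small residuals and one extreme value at the end.
fitted = [0.0] * 20
observed = [0.1 if i % 2 == 0 else -0.1 for i in range(19)] + [5.0]
kept = trim_by_residual(observed, fitted)
proportion_trimmed = 1 - len(kept) / len(observed)
```

After trimming, the model would be refitted to the retained observations only, as described above.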
Recall that PC1 has the strongest loadings for peak frequency, spectral mean, and skewness; an increase in PC1 corresponds to an increase in peak frequency and spectral mean and a decrease in skewness (thus less tilt toward the left). Previous reports suggest that, relative to the spectrum of /ʃ/, the spectrum of /s/ has higher spectral mean and spectral peak, and is more negatively skewed. Taken together, we expect higher PC1 values for /s/ than for /ʃ/. Given that lip rounding and protrusion can lead to a more /ʃ/-like /s/ acoustically, PC1 values are expected to be lower before /u/ than before /æ/ and /i/. These predictions are indeed borne out. Figure 2 illustrates the distribution of PC1 for /s/ and /ʃ/ in different vocalic contexts. The PC1 values for /s/ and /ʃ/ differ significantly, as indicated by a main effect of Sibilant Type (β = 1.58, t = 27.17, p < 0.001). The difference in PC1 between /s/ and /ʃ/ varies across measurement points both linearly (Type × Position: β = 0.04, t = 8.16, p < 0.001) and non-linearly (Type × Position²: β = -0.01, t = -2.08, p < 0.05). As illustrated in Figure 2, this difference is primarily driven by the fact that the PC1 trajectory across measurement points has a more downwardly concave shape for /s/ than for /ʃ/.
There is a main effect of Vowel_u (β = -0.14, t = -3.88, p < 0.001), showing that /æ/ and /u/ exert significantly different effects on PC1 such that PC1 is lower before /u/ than before /æ/. Vowel_u interacts significantly with Type (β = -0.24, t = -20.01, p < 0.001); the difference in effects between /æ/ and /u/ is larger in /s/ than in /ʃ/. The vocalic effect also varies temporally. The difference between the effects of /æ/ and /u/ on PC1 varies across measurement points both linearly (Vowel_u × Position: β = -0.03, t = -4.06, p < 0.001) and nonlinearly (Vowel_u × Position²: β = -0.06, t = -6.15, p < 0.001). Moreover, the sibilant-specific effects of vowel also differ across measurement positions. Figure 2 shows that the difference in vocalic influence between the onset and offset of /s/ is larger than the difference between those of /ʃ/. The significant three-way interactions between Type, Vowel_u, and Position (β = -0.05, t = -6.72, p < 0.001) and between Type, Vowel_u, and Position² (β = -0.03, t = -3.13, p < 0.01) suggest that there are sibilant-specific differences in terms of the effects of /æ/ and /u/ on PC1 across measurement points; the temporally-dependent difference in vocalic influence on PC1 is more pronounced on /s/ than on /ʃ/.
Note: '***' = p < 0.001; '**' = p < 0.01; '*' = p < 0.05. p-values were obtained using normal approximation which has the assumption that the t distribution converges to the z distribution as degrees of freedom increase (see Mirman, 2014, for details).
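The normal approximation mentioned in the note can be illustrated with a short sketch: the t statistic is treated as a standard normal z, and the two-sided p-value is the probability mass in both tails. The function names below are hypothetical; the t values are taken from the PC1 results reported above.

```python
import math

def p_value_normal_approx(t):
    """Two-sided p-value treating the t statistic as a standard normal z.
    2 * (1 - Phi(|t|)) equals erfc(|t| / sqrt(2))."""
    return math.erfc(abs(t) / math.sqrt(2.0))

def significance_stars(p):
    """Map a p-value onto the star convention used in the table note."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""

reported = [
    ("Type x Position^2", -2.08),
    ("Type x Vowel_u x Position^2", -3.13),
    ("Vowel_u", -3.88),
]
for label, t in reported:
    p = p_value_normal_approx(t)
    print(label, round(p, 4), significance_stars(p))
```

The approximation is reasonable only when the degrees of freedom are large, which is the assumption stated in the note.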
In terms of sex-specific effects, there is a main effect of Sex on PC1 (β = 0.35, t = 5.01, p < 0.001); in general, females have higher PC1 than males. Sex also interacts significantly with other factors. In particular, the interaction between Sex and Type (β = 0.14, t = 2.53, p < 0.05) indicates that there is a larger difference between /s/ and /ʃ/ in females than in males. The sibilant-specific effect of vocalic rounding on PC1 is smaller in females than in males (Vowel_u × Type × Sex: β = 0.02, t = 2.18, p < 0.05). The significant three-way interaction between Vowel_u, Position², and Sex indicates that the rounding effect on PC1 across measurement points has a more downwardly concave trajectory in males than in females.
While there is not a significant main effect of Vowel_i, suggesting that the influences of /æ/ and /i/ on PC1 do not differ markedly, there is a significant two-way interaction between Vowel_i and Position, indicating that the rise in PC1 across measurement positions is steeper before /æ/ than before /i/. This difference is driven by /s/, however, as indicated by the significant three-way interaction with Type (β = -0.04, t = -5.27, p < 0.001). Furthermore, this vocalic difference in PC1 steepness across measurement positions is stronger in males (Vowel_i × Position × Sex: β = 0.02, t = 2.35, p < 0.05). Finally, while the influences of /æ/ and /i/ do not differ across sibilant type, PC1 is higher before /i/ than before /æ/ for male /s/, but not for female /s/ (β = -0.02, t = -2.59, p < 0.01).
With respect to the second principal component (PC2), recall that PC2 has strong loadings for standard deviation and kurtosis; an increase in PC2 corresponds to a decrease in standard deviation and an increase in kurtosis (i.e., more peaky distribution). Given that /s/ has been shown to have lower standard deviation and greater kurtosis than /ʃ/, /s/ is expected to have higher PC2 than /ʃ/. Likewise, as /s/ is more /ʃ/-like before /u/, /s/ is expected to have lower PC2 before /u/ than before the other vowels. These predictions are only partially borne out. Figure 3 illustrates the distribution of PC2 for /s/ and /ʃ/ in different vocalic contexts. While there is not a significant main effect of Type, Type interacts significantly with Position (β = -0.06, t = -5.29, p < 0.001) and Position² (β = -0.05, t = -4.14, p < 0.001), suggesting that PC2 for /s/ and /ʃ/ have different slopes across measurement points and PC2 has a more downwardly concave trajectory for /s/ than for /ʃ/. Type also interacts with Sex significantly (β = 0.3, t = 3.25, p < 0.01), suggesting that there is a large PC2 difference between /s/ and /ʃ/, but this difference is mostly observed in males, not females. In particular, male speakers have higher PC2 for /ʃ/ than /s/, while female speakers show the opposite trend. Type × Sex also significantly interacts with the linear term of Position (β = 0.02, t = 3.03, p < 0.01), indicating that, while PC2 for /s/ and /ʃ/ diverge linearly across measurement positions for the male speakers, they converge for the females. There is also a significant effect of Duration (β = 0.12, t = 3.68, p < 0.001); an increase in sibilant duration leads to an increase in PC2 (i.e., a decrease in standard deviation and an increase in kurtosis). In terms of vocalic effects, there is, as predicted, a main effect of Vowel_u (β = -0.76, t = -7.74, p < 0.001) such that PC2 is lower before /u/ than before /æ/. Vowel_u interacts with Position (β = -0.13, t = -7.56, p < 0.001), indicating that the PC2 trajectories diverge before /æ/ and /u/. In particular, PC2 has a downward trend before /u/ but an upward trend before /æ/. Vowel_u also interacts with Type (β = -0.13, t = -5.17, p < 0.001), showing that PC2 before /u/ is lower for /s/ than for /ʃ/. Moreover, there is also a three-way interaction between Vowel, Type, and Sex (β = 0.06, t = 3.25, p < 0.01), indicating that the vocalic difference in PC2 between /s/ and /ʃ/ is smaller for females than males.
Error bars present the 95% confidence intervals.
No main effect of Vowel_i is observed. There is a significant three-way interaction between Vowel_i, Type, and Sex (β = 0.05, t = 2.7, p < 0.01), suggesting that there is a larger /æ/-/i/ difference between /s/ and /ʃ/ in females than in males. A significant three-way interaction between Vowel_i, Position², and Sex (β = 0.05, t = 2.54, p < 0.05) indicates that the PC2 before /i/, relative to PC2 before /æ/, has a more downwardly concave trajectory in males than in females.
Overall, the production results suggest that, while males show less distinctness between /s/ and /ʃ/ in PC1 (i.e., in terms of spectral mean, peak frequency, and skewness) relative to females, they exhibit more distinctness in PC2 (in terms of standard deviation and kurtosis) for the same contrast. In terms of vocalic influence, males appear to exhibit more vocalic influences than females.
Error bars present the 95% confidence intervals.
Thus far, we have reviewed the significant effects of vocalic coarticulation on the acoustic realization of /s/ and /ʃ/ in English. The vocalic effects are generally stronger on /s/ than on /ʃ/ and they are stronger toward the sibilant offset than the onset. The perceptual results also indicate that the participants, as a group, exhibit significant effects of perceptual compensation. That is, consistent with previous literature, participants recorded fewer /ʃ/ responses in the /u/ context than in the /a/ context. However, the fact that the inclusion of by-subject random slopes for contextual factors (i.e., the main factors of Vowel in both the perceptual and production models and the interaction of Vowel with Step in the perceptual model) significantly improves model likelihood suggests that there exists a high degree of individual variability in the perception and production of sibilants in different vocalic contexts. The next section explores the nature of this variation in detail.

The perception-production link: A closer look at individual variation
This section considers the nature of individual variability in the perception and production of coarticulated speech by examining the link between them. Specifically, we fit individual regression models for each subject's perceptual and production responses and examine the correlation between the by-subject estimates (i.e., the coefficients) for the coarticulatory context-related predictors in the perceptual and production regression models. The regression models were fitted using the ddply() function in the plyr package (Wickham, 2012). The perceptual responses were modeled in terms of Firth's Bias-Reduced logistic regressions using the logistf() function in the logistf package (Heinze, Ploner, Dunkler, & Southworth, 2016) to avoid problems of separation. The production measures (PC1 and PC2) were modeled using linear regressions. The models included the fixed factors already mentioned above. Model formulas are /ʃ/-response ~ Trial + Vowel * Step and PC ~ Trial + Duration + SibilantType * Vowel * (Position + I(Position^2)). Table 5 summarizes the predictors whose model estimates were used in the correlation study. With 68 possible correlations (4 perception estimates × 17 production estimates), the alpha level with Bonferroni correction is 0.0007.
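The corrected alpha level follows from simple arithmetic, sketched below. The family-wise alpha of 0.05 is an assumption here, since the text reports only the corrected value; the function name is hypothetical.

```python
def bonferroni_alpha(family_alpha, n_tests):
    """Per-test alpha under Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return family_alpha / n_tests

n_perception_estimates = 4
n_production_estimates = 17
n_tests = n_perception_estimates * n_production_estimates  # 68 possible correlations

FAMILY_ALPHA = 0.05  # assumed conventional family-wise level, not stated in the text
adjusted_alpha = bonferroni_alpha(FAMILY_ALPHA, n_tests)
```

With these assumptions the corrected threshold is 0.05 / 68, which rounds to the 0.0007 reported above.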
Table 5: Summary of predictors whose model estimates were used in the correlation study. The variable Vowel has two contrasts, Vowel_i and Vowel_u.
Vowel | Vowel
Step | Position

Theoretical predictions: Although the correlation analysis is admittedly exploratory, it is important to consider what correlations might be expected a priori before turning to the results. As reviewed in the Introduction, the existing literature reported conflicting claims about the perception-production link. Some reported no relationship, while others found indirect or inconsistent evidence. The inconsistency in previous research findings might stem from the fact that the mapping between perception and production is not direct. As shown above, many factors (and the interactions between them) are responsible for explaining the variances in the perception and production results. From a purely theoretical perspective, perception models that do not assume nonveridical encoding of percepts predict a positive relation between perception and production. In its most basic form, one might expect the magnitude of perceptual compensation to find its analog in the degree of coarticulation reflected in production. That is, for example, the coefficients for Vowel from both the perception and production models would be expected to correlate positively. More nuanced models, such as Pierrehumbert (2002) and Sonderegger and Yu (2010), which see perceptual responses as reflective of the frequency distribution of the acoustic cues that are indexed to different contextual information, including coarticulated contexts, would predict that the context-dependent realization of /s/ and /ʃ/ (as indexed by the Type × Vowel interaction in the production model) would positively correlate with the context-dependent classification of the /s/-/ʃ/ continuum (as indexed by the Step × Vowel interaction in the perceptual model). That is, the greater the degree of context-dependent variation observed in production, the more context-sensitivity is expected in the listeners' classification of the sibilant continua. For example, given that the distinctness between /s/ and /ʃ/ is reduced in the /u/ context, one might expect listeners to exhibit less certainty (e.g., a shallower classification function) in classifying the sibilant continuum in the /u/ context. 
Given that the vocalic influence, particularly the effect of /u/, is stronger toward the sibilant offset than at the onset, to the extent that listeners are sensitive to the temporal dynamics of vocalic influence on the spectral quality of the sibilant, a correlation is expected between the coefficients for the Vowel × Position(²) interactions from the production models and the coefficients for the Vowel or Vowel × Step(²) term of the perception models. Models that assume nonveridical encodings, such as C-Cure and perceptual models implicitly assumed in listener-misperception models of sound change (see more discussion below), would predict the opposite correlations. That is, individuals with strong coarticulation in production should exhibit the least compensatory response or less context-sensitivity.
Results of the correlation analysis: Table 6 summarizes the correlation results. While several correlations show p values below 0.05, only one correlation is significant at the alpha-adjusted level, namely, the correlation between the estimates for the Type × Vowel_u interaction in the production models and the estimates for the Step × Vowel_u interaction in the perceptual models (r = 0.51, p = 0.0005; Bonferroni-corrected alpha = 0.0007). Figure 4 shows a scatterplot illustrating the negative relationship between the Step × Vowel estimates (the x-axis) on the one hand, and the estimates for the Type × Vowel_u interaction (the y-axis) on the other.
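For readers who want to see the computation behind a by-subject correlation, here is a minimal sketch of Pearson's r over paired coefficient estimates. The six coefficient pairs are hypothetical placeholders, not the study's data, and the function name is an assumption of this example. The last line only squares the reported coefficient to show how much variance a correlation of that size accounts for.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical by-subject estimates (one pair per participant).
step_by_vowel = [0.2, 0.5, 0.7, 0.9, 1.1, 1.4]           # perception model
type_by_vowel_u = [-0.40, -0.10, -0.30, 0.05, -0.05, 0.20]  # production model

r = pearson_r(step_by_vowel, type_by_vowel_u)

# A correlation with magnitude 0.51 corresponds to about 26% shared variance.
reported_shared_variance = 0.51 ** 2
```

Each significance test on such a coefficient would then be compared against the Bonferroni-corrected alpha rather than the conventional 0.05.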
As the variables being correlated are estimates for interactions between predictors, the correlations cannot be interpreted without first examining the nature of individual variability concerning a given interaction between predictors within each regression model. To this end, we first focus on the Step × Vowel estimates. The left column of Figure 5 shows the mean percentage of /ʃ/ responses in the /a/ and /u/ contexts by individuals in the 1st (top panel) and 4th (bottom panel) quartiles of the Step × Vowel estimates. Individuals in the 4th quartile of the Step × Vowel estimates (i.e., data points toward the right end of the x-axis in Figure 4) show the expected patterns of perceptual compensation for coarticulation, where the identification function for /ʃ/ responses in the /u/ context appears to the right of the corresponding identification function in the /a/ context. The pattern differs for the individuals in the 1st quartile (top panel). While individuals in the 1st quartile (data points toward the left end of Figure 4) exhibited some degree of perceptual compensation, as evidenced by the differences in identification functions across vowel contexts at the crossover point (i.e., when the identification rate is at 50%), the /ʃ/-identification function in the /u/ context shows a shallower rise than the corresponding identification function in the /a/ context, suggesting that the identification of /s/ and /ʃ/ is more gradient in the /u/ context than in the /a/ context. Listeners in this quartile are also less certain in their identification of the sibilant in the /u/ context in general. That is, listeners are less likely to identify a sibilant as /ʃ/ at the /ʃ/-end of the sibilant continuum and they are more likely to identify a sibilant as /ʃ/ even at the /s/-end of the continuum. 
Crucially, this uncertainty is most acute in the /u/ context, not generally, suggesting that this is unlikely to be a general difference in classification strategies across individuals (cf. Kong & Edwards, 2016).
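The notions of crossover point and slope of an identification function can be made concrete with a small sketch of a logistic model with Step, Vowel, and their interaction. Everything numeric here is hypothetical: the coefficients are invented for illustration, Trial is omitted, Step is assumed to increase toward the /ʃ/ end of the continuum, and Vowel is assumed to be coded 0 for /a/ and 1 for /u/.

```python
import math

# Hypothetical coefficients for one listener (not estimates from the study).
COEF = {"intercept": -4.0, "step": 1.0, "vowel": -0.5, "step_x_vowel": -0.3}

def prob_sh_response(step, vowel, coef=COEF):
    """Predicted probability of a 'sh' response at a continuum step in a vowel context."""
    eta = (coef["intercept"] + coef["step"] * step
           + coef["vowel"] * vowel + coef["step_x_vowel"] * step * vowel)
    return 1.0 / (1.0 + math.exp(-eta))

def slope(vowel, coef=COEF):
    """Slope of the linear predictor with respect to Step in a given vowel context."""
    return coef["step"] + coef["step_x_vowel"] * vowel

def crossover(vowel, coef=COEF):
    """Continuum step at which the predicted 'sh' rate is exactly 50%."""
    return -(coef["intercept"] + coef["vowel"] * vowel) / slope(vowel, coef)

crossover_a = crossover(0)  # /a/ context
crossover_u = crossover(1)  # /u/ context
```

In this toy listener the /u/ crossover lies to the right of the /a/ crossover (compensation), while the negative Step × Vowel term makes the /u/ function rise more slowly, which is the shallower, less categorical pattern described for the 1st quartile.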
The production patterns mirror the perceptual patterns. The Type × Vowel_u interaction captures the way the vocalic environment influences the production of the contrast between /s/ and /ʃ/. The larger the Step × Vowel estimate in perception (i.e., more uncertainty between /s/ and /ʃ/ in the /u/ context), the less distinct the contrast in production is between /s/ and /ʃ/ in the /u/ context relative to the /æ/ context. As shown in the right column of Figure 5, which shows the corresponding mean PC1 values for /s/ and /ʃ/ in the contexts of /i/, /æ/, and /u/, individuals in the 1st quartile of the Step × Vowel estimates (top panel) show a weaker distinction (i.e., a smaller PC1 difference) between /s/ and /ʃ/ in the /u/ context, compared to the individuals in the 4th quartile (bottom panel).
Table 6: Correlations between coefficients from the perception (in columns) and production (in rows) regression models.

Discussion
In the preceding section, we observed significant correlations between individual-level estimates for predictors in the perceptual and production regression models. More specifically, our results point to a relationship between the distinctness in the acoustic realization of /s/ and /ʃ/ across vocalic contexts and the categorical perception of /s/ and /ʃ/ in the corresponding vocalic environments. In particular, the less distinct the acoustic differences are between /s/ and /ʃ/ in a particular vowel context (in this case, primarily in the /u/ context), the less categorical the perceptual responses are in the corresponding vocalic context. Simply put, the participants are less certain about the identity of the sibilant in the /u/ context if their own productions of /s/ and /ʃ/ are not so distinct in that vocalic context. This uncertainty in the /u/ context extends even to the endpoints of the continuum where the sibilants should be most distinct. These findings are most consistent with perceptual models that assume veridical encoding, such as the auditorist, the gesturalist, and the exemplar-based approaches to speech perception, since there is a positive relationship between the perception and production of coarticulation.
Our results resonate most strongly with the perception-production loop models such as Pierrehumbert (2002) and Sonderegger and Yu (2010). Specifically, the context-specific distinctness in the acoustic realization of the sibilants is reflected in shifts in context-specific category boundaries as well as the categoricity of the perceptual responses. This finding echoes previous findings that show a correlation between how distinctly an individual produces a contrast and how well that individual discriminates the contrast (Newman et al., 2001;. Our findings extend these earlier findings, demonstrating the context-specificity of such a contrast-based correlation. Our findings are not consistent with predictions of perceptual models that assume nonveridical encoding, such as C-Cure (Cole et al., 2010, cf. Yu, 2016; Yu & Lee, 2014) and models of sound change that rely on listener misperception as a driving force behind certain sound changes. Such models predict a negative relationship between the perception and production of coarticulated speech. That is, assuming that all listeners engage in expectation adjustment somehow, albeit to varying degrees, listeners engaging in weak expectation adjustment (i.e., the so-called 'misperceivers') would register a heavily coarticulated /su/ as /ʃu/ while listeners who engage in more robust compensation would recode the same coarticulated /su/ as a non-coarticulated /su/. Since the perceptual exemplar space for the so-called 'misperceivers' would have more /ʃu/-like exemplars than that of individuals who engage in more robust expectation adjustment, the production targets of the 'misperceivers' should be more [ʃu]-like than those of the individuals who engage in more robust expectation adjustment. Thus the more robustly one engages in expectation adjustment, the less vocalic influence is expected in one's production.
A noteworthy aspect of our findings is the fact that the correlation between perception and production of coarticulation is much more nuanced than previous studies have generally assumed. That is, the correlations between the general effects of vocalic contexts on sibilant perception (i.e., the magnitude of the Vowel effect in perception) and the coarticulatory effects of vowels on sibilants (i.e., the Vowel_u and Vowel_i effects) did not turn out to be significant, echoing, for example, Kataoka's (2011) findings concerning the lack of a correspondence between the extent of coarticulatory fronting of /u/ and the extent of perceptual compensation for /u/-fronting. This fact is particularly striking since the vocalic context is a significant predictor in both group-level models for the perception and production data. Our findings offer potential insights into why previous studies that focused primarily on the extent of the coarticulatory effects on speech perception and production failed to observe a perception-production link. The present findings suggest that what is important is not the extent of coarticulation per se, but rather the effects that coarticulation has on the realization of the segments targeted. Such findings are consistent with recent studies demonstrating that listeners have fine-grained sensitivity to acoustic-phonetic cues needed to track their distributions (Clayards, Tanenhaus, & Jacobs, 2008), as evidenced by listeners' sensitivity to within-category differences in, for example, reaction time (Pisoni & Tash, 1974), patterns of eye movements (McMurray, Tanenhaus, & Aslin, 2002), neural patterns of activities (Blumstein, Myers, & Rissman, 2005), as well as in category formation (Maye, Werker, & Gerken, 2002). 
Thus if one's articulation produces acoustic-phonetic cues for sibilants that result in more overlapping variances in certain coarticulated contexts (e.g., in the /u/ context) and not others (e.g., in the /æ/ context), that person's perceptual responses will exhibit more uncertainty in contexts where there are more overlapping variances. Further research is needed to ascertain the underlying mechanisms responsible for these individual differences in the perception-production linkage. Individuals might differ in the type of coarticulated speech input they encounter, which could result in different distributions of context-specific perceptual exemplars (cf. Pierrehumbert, 2002) or different understandings of how articulatory gestures are coordinated in different vocalic contexts, to the extent the universalist understanding of gestural knowledge can be relaxed. Variability in coarticulatory experiences need not stem from source variability per se, however. Individual differences in cognitive processing style (Yu, 2010) might also affect how individuals analyze coarticulated speech even if the input source is the same. As noted in the introduction, one type of cognitive processing style difference that has been linked to variation in the perception and production of coarticulation is autistic-like traits. While this study does not investigate this factor explicitly, this variable is unlikely to be the type of cognitive processing style difference that mediates the link between the perception and production of coarticulated speech since the effect of AQ is found only within females and not among males (at least according to the findings of Yu, 2010). Individuals might also vary in terms of oral-motor skills. From an auditorist perspective, for example, individuals might differ in how good they are at uncovering strategies for producing optimally perceivable acoustic signals. However, existing evidence for individual variation in oral-motor skills remains scant (cf. Diehl, Preston, & Bennetto, 2011; Iverson & Thelen, 1999; Mahler, 2012). Future research might also elucidate potential lurking variables not tied to knowledge of coarticulation (e.g., individual variation in cognitive sensitivity) that might account for, if only partially, the observed link between perception and production.
The fact that the correlation between perception and production is modest deserves some attention as well. The strength of the correlation hovers around 0.5, suggesting that there remains a sizable portion of the variance not explained by the correlation. Various factors might have contributed to this state of affairs. Given that the perception task was administered prior to the production one, participants might have become hypersensitive to the /s/-/ʃ/ distinction and hyperarticulated in the production task, potentially limiting the degree of coarticulation. In addition, the lack of filler items might have further heightened the participants' hyperarticulation tendencies. The reliance on original /da/ and /du/ stimuli to create the target stimuli in the perceptual study might bias listeners toward a more alveolar percept (i.e., /s/). Also, the fact that the production data but not the perceptual judgments were obtained in lexical contexts might have constrained the strength of the correlation between perception and production. Finally, as noted earlier, the perception and production tasks targeted different low vowels (i.e., /a/ in the perceptual task and /æ/ in the production one), which might have further constrained the nature of the perception-production correlation. To this end, it is worth noting that Jongman et al. (2000), who examined English fricatives in different vocalic contexts, including /a/ and /æ/, reported no significant difference between the effects of /a/ and /æ/ on fricative noise duration, spectral peak location, noise amplitude, and the various spectral moment measures they examined. The only acoustic difference observed concerns F2 onset frequency, at 1820 Hz before /æ/ and 1512 Hz before /a/. Given that F2 onset frequency was not a spectral measure included in the acoustic analysis conducted, this difference is unlikely to have affected the results of the perception-production correlation analysis in any significant way. 
Finally, as summarized succinctly in Beddor et al. (2018), successful communication depends on the listeners being malleable and able to perceptually adapt to a diverse set of variances, including phonetic context, speaker, speaking rate, novel experiences, and others. To the extent that listeners are able to adapt efficiently, the production system may not need to be similarly malleable. As such, the perception-production link is unlikely to be a perfect relation. In any event, despite these potential limitations, a significant correlation between the perceptual and production results is nonetheless found, which can be taken as further evidence of the robustness of the correlation observed.
Given that some of the participants fall outside the 95% confidence intervals (CIs) of the correlation trend lines, as illustrated by the scatterplots in Figure 4, further attention to the perception and production patterns of these individuals might yield fruitful insights. To this end, consider the relationship between the perceptual and production results of the three participants highlighted in Figure 6. All three participants have similar Step × Vowel estimates ((P)articipant 6 = 0.651 (top), P14 = 0.676 (mid), P36 = 0.880 (bottom)), but they have wildly different Type × Vowel_u estimates. P36's Type × Vowel_u estimate (β = -0.507) is below the trend line, showing a strong convergence between /s/ and /ʃ/ in PC1 values in the /u/ context, relative to the /æ/ context. P6's Type × Vowel_u estimate (β = 0.294) is above the trend line, showing almost no vocalic influence on PC1 at all. P14 falls within the 95% CIs (in fact, almost exactly on the trend line; β = -0.079), and shows a modicum of vocalic influence on PC1.
The type of individual variation exemplified by participants 6, 14, and 36 is unlikely to be explainable by general mapping principles between perception and production. Further research is needed to ascertain the mechanism underlying this type of individual variability. To be sure, it would be useful to ascertain first just how stable this type of variation is across individuals. While there is evidence for stable individual differences across perceptual tasks tapping into listeners' knowledge of coarticulation (Yu & Lee, 2014), the stability of perceptual compensation, or more aptly, context-dependent perceptual response, across time remains to be shown. To the extent that the variability observed is stable, it points to a need to further explore other factors that might influence perception and production independently and together. To this end, it should be noted here that, while Sex interacts significantly with the acoustic realization of /s/ and /ʃ/ and modulates the influence of vocalic contexts, the two sexes do not appear to differ in the way perception and production are correlated with each other. To be sure, the sex-based differences observed in the production study are particularly noteworthy from this perspective. While previous studies have identified sex-based differences in the acoustic realization of sibilants, until recently, little was known about the sex-specific nature of the vocalic influence on sibilant production. Such sex-based differences in sibilant realization might stem from potential sociolinguistic differences in articulatory strategies beyond basic variation in male and female vocal physiology (Strand, 1999; Stuart-Smith, 2007), including differences in terms of sexual orientation (Munson & Babel, 2007) or style construction (Podesva, Roberts, & Campbell-Kibler, 2001). In this study, males appear to exhibit a greater degree of vocalic influence in /s/ realization. 
However, no sex-based difference is observed in the perceptual responses.
Another potential source of variance might stem from contrast-related differences in articulation. , for example, show that, while there is generally substantial contact of the underside of the tongue tip with the lower alveolar ridge during the production of /s/ but not /ʃ/, the degree of acoustic contrast between /s/ and /ʃ/ among a gender-balanced cohort of 20 native speakers of American English is related to their contrastive use of contact and to their discriminative performance. The most distinct sibilant productions were from participants who used contact in producing /s/ but not /ʃ/ and who had high discrimination scores, while the participants who did not use contact differentially to produce the sibilants produced the least distinct sibilants and also discriminated synthetic sibilants less well. Individual variation in vocalic influence on the realization of the sibilant contrast might come about as a result of how individuals vary in whether the vocalic context influences the use of contact in producing the /s/ and /ʃ/ contrast.
Finally, as noted earlier, an increasing number of studies have argued for the importance of understanding individual variation in perception and production as a means to understand sound change actuation (Baker et al., 2011; Beddor, 2009; Dimov, Katseff, & Johnson, 2012; Garrett & Johnson, 2013; Mielke, Baker, & Archangeli, 2016; Stevens & Harrington, 2014; Yu, 2010, 2013, 2016; Zellou, 2017). As coarticulation-induced variation in speech is often assumed to be a major source of phonetic precursors to sound change and sound patterns (Ohala, 1993a, 1993b), our findings suggest that some individuals within the same speech community are more advanced in reifying context-specific variation in speech production than others and that this progression is mirrored in the individuals' perceptual behavior as well. Specifically, some individuals exhibit a greater reduction in contrast between /s/ and /ʃ/ in certain vowel contexts than others. This individual variability in context-dependent contrast reduction is reflected in how individuals perceive sibilants in the relevant contexts. Individuals whose sibilants are less distinct in the /u/ context are also less certain in their classification of sibilants in that context. Such findings are reminiscent of recent findings concerning the progression of sound change and categoricity in perception. In particular, Pinget (2015) investigated labiodental devoicing and labial stop devoicing in word onset position across various Dutch-speaking dialect regions where these two instances of sound change in progress are at different stages of completion; fricative devoicing is more advanced than stop devoicing. She found that regions where devoicing was most advanced in production turned out to also be regions where perception was the least categorical. Taken together, Pinget's findings and the findings of the current study suggest that a reduction in contrast could lead to eventual innovation of a new sound pattern. In the present case, the contrast between /s/ and /ʃ/ in American English might eventually be partially or completely neutralized before a rounded vowel. While there remain major gaps in our understanding of the relationship between individuals who exhibit substantial context-dependent variation in speech perception and production and their social profiles within a speech community, to the extent that such individuals (the proverbial 'innovators' in change) become leaders within their community of practice, or have strong influence on such leaders themselves, their patterned variation might propagate throughout their respective communities.

Conclusion
This study establishes significant individual variability in the perception and production of /s/ and /ʃ/ in English across vocalic contexts. The variability is not random, however. There is a significant correlation between how individuals categorize sibilants in context-specific ways and how they realize their sibilants in the corresponding contexts. The present findings not only further the understanding of coarticulation in speech perception and production but also have significant implications for research on sound change and language variation and change in general. Further research is needed to identify the causal mechanism behind the perception-production link identified in this study.