The role of gestural timing in non-coronal fricative mergers in Southwestern Mandarin: acoustic evidence from a dialect island

Merger between a voiceless labiodental fricative, [f], and a voiceless velar fricative, [x], is common across languages, including many varieties of Chinese, particularly those spoken in Southwestern China. The sound changes that lead to merger in Southwestern Mandarin varieties are bidirectional: in some, [f] becomes [x]; in others, [x] becomes [f]. We conducted a study of phonetic variation in one such variety, Zhongjiang (中江) Chinese, which has been reported to merge [x] to [f] in the environment of [w] and [u]. Our results confirm this basic pattern while revealing additional nuances, including a new environment, [_oŋ], which conditions merger in the opposite direction, [f] becomes [x], and new phonetic details. Specifically, [x] exhibits a particularly low spectral Center of Gravity (CoG) and [f] exhibits a wide range of spectral variation, including tokens with low CoG, characteristic of a velar constriction. We interpret these patterns in the context of areal variation, proposing a pathway to change that relates spectral variation attributable to gestural overlap to diachronic observations of labio-velar merger.


Merger between a voiceless labiodental fricative, [f], and a voiceless velar fricative, [x], henceforth labial-velar merger, is a common sound change, attested in Germanic, Romance, Celtic, Slavic, and Uralic (Hickey 1984) as well as many varieties of Southwest Chinese. In a typological survey of 374 varieties spoken in Hunan, Hubei, Sichuan, and Yunnan provinces, He (2004) reports that 212 have the [f]-[x] merger. The merger can also be found in Southern Chinese, e.g., Min (Chen & Li 1991), Cantonese (Zhan 2002), Gan (Sun 2007), Hakka in west Guangdong (Li 1999) and the vernacular of north Guangdong (Zhuang 2004

(1988), to characterize fricatives within and across languages. The mean energy of the spectrum, or Center of Gravity (CoG), is often used even in the absence of other spectral moments (e.g., Gordon, Barthmaier, & Sands, 2002). In their study of English fricatives, Shadle & Mair (1996) also included two additional spectral measures, dynamic amplitude and the slope of a line fit from the maximum frequency to 16.97 kHz. Dynamic amplitude picked up some consistent differences between English fricatives while the slope of the line captured variation in speaker effort. In a study of eight English fricatives, Jongman et al. (2000) investigated the acoustic separability of fricatives, and reported that the variance of the spectrum was a particularly robust acoustic cue to place of articulation. Similarly, Shadle & Mair (1996) found the related measure of spectrum standard deviation to be useful in differentiating English fricatives. Similar methods have been used to analyze Mandarin Chinese fricatives. For example, Svantesson's (1986) 2D 'fricative space' consists of Center of Gravity (CoG) and a measure of spectral dispersion that is closely related to spectrum standard deviation.
Other studies of Mandarin have followed this method, finding that all five fricatives can be clearly differentiated using these two dimensions (Ran 2008) or some normalized version of these measures (Ran & Shi 2012). Following this work, our main analysis in this paper focuses on spectrum Center of Gravity (CoG), which we also plot alongside Spectrum Standard Deviation (SD). To encourage additional analyses, the entire data set, including sound files and textgrids (see methods), has been submitted as a Data in Brief article.

Although spectral moments are rather coarse descriptions of the spectrum, for the specific case of labial-velar fricatives, the interpretation of CoG and Spectrum Standard Deviation is relatively straightforward. The posterior constriction for velar fricatives, [x], usually ensures a low CoG, due to resonance of the long cavity in front of the constriction, and low SD, due to the relatively sharp spectral peaks (Stevens, 1998: 370-372). For [f], the anterior constriction at the lips typically results in a diffuse (flat) spectrum (Stevens, 1998: 389-398), indexed by high spectrum SD, and high CoG, due to resonance of either a very short cavity in front of the constriction or no detectable front cavity resonance at all. Phonetic variation in these measures, conditioned in part by coarticulatory influences, provides some clues to understanding the diachronic patterns across the languages of China more generally.

One important characteristic of the labial-velar merger is that, like many sound changes, it tends to proceed in specific phonological environments.

The remainder of the paper is structured as follows. Section 2 provides background on the labial-velar merger in Southwest China. Section 3 describes the methods of four studies on the language variety of Zhongjiang.
Study 1 provides a descriptive baseline, reporting measurements of [x] and [f] fricatives in minimal pair environments. Against this baseline, Studies 2-4 report the same phonetic measurements in environments related to conditional mergers. Section 4 reports the results of our phonetic analysis. Section 5 discusses synchronic and diachronic issues related to the merger in light of the phonetic results. Section 6 briefly concludes.

1 Glides and corresponding vowels in Mandarin have typically been analyzed as single sounds; relatedly, a consonant followed by a glide is analyzed as a single sound (Chao 1934: 42; Duanmu 2007: 25). The reason is that it is assumed that there is only one slot in the onset, which C and G must share. Glides do not contrast with corresponding high vowels, [i,u,y], and the two sets can be treated as variants of each other. Duanmu

In the Mandarin of Yunnan, Sichuan and Hubei provinces, as well as the area around Chongqing, the majority pattern is Type I, while multiple types of merger exist in Hunan and areas of Hubei that are adjacent to Hunan. In addition, there are a few different types of labial-velar merger in hilly areas in central Sichuan, which are also distributed in Hunan and Hubei. Zhongjiang is one such language. It is a "dialect island" in the sense that it shows a Type II pattern in an otherwise predominantly Type I area.

a Type II language variety, is provided in the appendix in the same pseudo-randomized order as it was presented to participants (see Methods).

The list of 90 items results from balancing a number of design constraints. We attempted to elicit each fricative in a range of different environments but also to balance the rime context, including the tone, across fricatives and to focus on relatively high frequency words.
This last concern arises because low frequency words could potentially elicit more literary, archaic or Standard Mandarin pronunciations, which might not accurately reflect the contemporary speech patterns of the regional Zhongjiang variety. There are systematic phonotactic constraints prohibiting certain fricative-vowel combinations, e.g., [f] cannot be followed by [w] or [aɯ], as well as some accidental gaps which limit the number of perfect minimal pairs. Focusing on well-known words also limited the number of minimal pairs. Of the 90 monosyllabic words, 44 words enter into a minimal pair (a total of 22 minimal pairs). We made sure that there are at least some minimal pairs for each of the vowel environments relevant to characterizing conditional mergers (see Table 1), which we describe in further detail below.

We deal with the lack of balance in the corpus in two ways. First, we use mixed effects modelling to assess the effect of fricative while accounting for contextual variation with a combination of random and fixed effects. Second, we run follow-up analyses on subsets of the corpus consisting of minimal pairs controlled to address specific aspects of the Zhongjiang pattern, including context. We describe these minimal pair subsets below as separate studies along with the specific objective of each.

Study 1: minimal pairs from *x and *xʷ
Since Zhongjiang is a Type II variety, we expect *x and *xʷ to be contrastive:

Participants were recorded in a sound-attenuated room at Wucheng hotel in Zhongjiang City. The data reported in this paper were part of a longer recording session, including spontaneous speech and other elicitation materials. The list of monosyllables reported here was recorded immediately after the spontaneous speech portion of the session, in which participants were asked to talk about their life in Zhongjiang or to introduce some aspect of Zhongjiang life: food, popular local attractions, etc. Before recording the monosyllables, all participants were given the complete list on paper to look over. The order of the items in the list, also the order of items listed in the appendix, was pseudo-random. Each participant confirmed that they knew all of the words on the list.
After this familiarization stage, the target items were displayed one at a time on a computer screen, in the same pseudo-randomized order as they appeared in the paper list. Participants were asked to read each item when it appeared on the screen. The items were presented continuously until 10 repetitions of the list were elicited.

The hotel where we did the recording has carpet on the floors and wallpaper on the walls. We chose a comparatively small room without a window to avoid noise from outside, and hung curtains on the walls to minimize echo. Before we started the recording, we turned off the air conditioner, electric fan, fluorescent lamp and phone, and used Audacity software to test that the noise was under -48 dB, and the volume of the speakers was in the range of -18 to -6 dB. We used BYLY software to record the sound, which can automatically save each monosyllable recording as a single sound file, and display the waveform automatically and synchronously, which is helpful to test the quality of each token. All the tokens were saved in .wav format. All tokens were recorded in mono channel at 44,100 Hz directly to a Thinkpad T440 laptop using an external Samson C03U microphone, which was stabilized with a microphone stand and set to unidirectional mode.

Segment boundaries were determined by forced alignment, using the Montreal Forced Aligner. We created two sets of segment boundaries, one based on the pre-trained Standard Mandarin aligner and another based on a Zhongjiang-specific aligner, trained on our recordings. To evaluate the forced alignment, we hand-segmented 100 tokens from one speaker and assessed correlations between the hand-segmented and force-aligned tokens both for segment duration and also for the dependent measures of interest for the study (see below).
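The evaluation just described amounts to correlating hand-made and force-aligned measurements and keeping the aligner that agrees best with the hand segmentation. A minimal Python sketch of that comparison follows, assuming Pearson correlation (the text does not name the coefficient used). The function names, the aligner labels and the duration values are hypothetical and are not taken from the study's data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def best_aligner(hand, aligned_by_name):
    """Return (name, r) for the aligner that correlates best with hand values."""
    scores = {name: pearson_r(hand, vals) for name, vals in aligned_by_name.items()}
    name = max(scores, key=scores.get)
    return name, scores[name]


# Toy fricative durations in ms (invented for illustration; NOT study data).
hand_ms = [110.0, 95.0, 130.0, 102.0, 121.0]
aligned_ms = {
    "aligner_a": [100.0, 105.0, 118.0, 110.0, 109.0],
    "aligner_b": [112.0, 93.0, 127.0, 104.0, 119.0],
}

name, r = best_aligner(hand_ms, aligned_ms)
print(name, round(r, 3))
```

The same comparison could be repeated for each dependent measure (for example CoG) by passing those values in place of the durations.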
The Zhongjiang-specific aligner showed higher correlations with the hand-measured set than the Standard Mandarin aligner, so we proceeded by using the Zhongjiang-specific aligner throughout. A total of 9 tokens (0.1%) were excluded due to alignment failure.

Spectral measurements were extracted using Praat (Boersma & Weenink, 2016), with reference to the segment boundaries from forced alignment. We extracted measurements in five time windows within the target fricatives: the first 20 ms of the fricative, the second 20 ms, the middle 20 ms, the penultimate 20 ms and the final 20 ms. Our main analysis focuses on spectrum Center of Gravity (CoG). This measurement is known to be sensitive to the frequency range of the analysis (e.g., Shadle & Mair, 1996). Since our recordings are studio-quality, we opted to use the maximal frequency range at our disposal, basing our analyses on the Nyquist frequency: 22,050 Hz. We discarded extreme outliers, defined as tokens that were greater than three standard deviations from the mean CoG and spectrum SD calculated across the entire data set. This resulted in the loss of 46 tokens or 0.5% of the data.

Results
peaks of distributions show subtle differences: *f is slightly shorter than *xʷ which is slightly
shorter than *x. To investigate whether this difference is statistically significant, we fit two nested linear mixed effects models to the duration data using the lme4 package (Bates et al. 2019) in R (version 3.9.2). The baseline model contained only random effects: a random intercept for speaker and a random intercept for item. To this baseline, we added fricative (proto-category) as a fixed factor. A likelihood ratio test showed that the model including fricative did not significantly improve over the baseline model (χ² = 1.19, p = 0.55), indicating that the difference in duration across fricatives is not significant. This includes both the difference between *f and *xʷ, hypothesized to have merged in this variety, and between *f and *x, which are expected to maintain contrast.

We begin with the velar fricative [x]. Figure 3 shows three examples from the corpus. The distribution of energy in these tokens has a long right tail with a peak at low frequency. Since most of the energy is concentrated in the lower frequencies, these tokens are characterized by a low CoG and a low standard deviation. This is expected for fricatives with a posterior constriction in the vocal tract. Aperiodic energy generated at the posterior constriction will resonate in the portion of the oral cavity in front of the turbulent energy source.

by the interaction with rime, a topic which we take up later in the paper, including through an analysis of minimal pair subsets in Section 4.4. The final three rows of Table 7 show the χ² statistic from model comparison via anova for the effects of rime (anova comparison of m1 and m2), fricative (anova comparison of m2 and m3), and the interaction between them (anova comparison of m3 and m4).

To summarize the results so far, we investigated fricatives corresponding to three proto-categories, *f, *xʷ, *x.
The proto-categories correspond to three distinct fricatives in some other Chinese language varieties, including Standard Mandarin. In terms of duration, the three proto-categories are indistinguishable in Zhongjiang. Our analysis of CoG showed that *x is spectrally distinct from *f at all time points, the beginning, middle and end of the fricative, while differences between *f and *xʷ were smaller and significant only at the midpoint and t4. These results are largely consistent with past reports, based on impressionistic listening, that Zhongjiang exhibits a Type II merger. Our analysis is based on 8,945 tokens, repetitions of 90 monosyllabic items, which vary in the composition of their rimes. The analysis above also revealed that the CoG of fricatives is significantly impacted by the identity of the following rime. The effect of rime on CoG is significant throughout the fricative, at early as well as later time points, but it is strongest at the end of the fricative (later time windows). That analysis factored corpus imbalance into the models through a combination of fixed and random effects; rime, one of the factors that was not perfectly controlled for in the materials, nonetheless had a significant effect across time points. To provide a complementary perspective, we now turn our attention to subsets of the larger corpus that constitute minimal pairs, i.e., are controlled for rime context.

4.4 Minimal pair comparisons
We now turn to a spectral analysis for each subset of the data, comprising four studies.

Study 1: minimal pairs from *x and *xʷ
The purpose of Study 1 was to establish phonetic differences between [x] and [f]. For this purpose, we selected four sets of words that we expect to form minimal pairs in contemporary Zhongjiang.
A key assumption underlying our selection of these words as minimal pairs is that Zhongjiang is a Type II language (see Table 1), as has been claimed in the literature.

Figure 7 shows the average CoG over time for each fricative proto-category. Differences in CoG between *x and *xʷ are already present at the earliest time point, t1. The CoG for both fricatives rises from the onset to a maximum at the midpoint of the fricative. This is also the time point with the greatest difference between categories. The difference in CoG decreases dramatically toward the end of the fricative (t4) and is completely neutralized by the end of the fricative (t5).

Figure 7. Average CoG for *x and *xʷ proto-categories at five timepoints.
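As a reading aid for the five time points (t1 to t5) plotted in this figure, the sketch below computes the 20 ms analysis windows described in the Methods from a fricative's onset and offset. It is an illustration under stated assumptions, not the authors' extraction script: the function name is hypothetical, times are in seconds, and the rejection of very short tokens is an assumption, since the text does not say how such tokens were handled.

```python
def analysis_windows(onset, offset, win=0.020):
    """Return a dict of five (start, end) windows, in seconds, for one fricative.

    t1: first 20 ms, t2: second 20 ms, t3: middle 20 ms,
    t4: penultimate 20 ms, t5: final 20 ms.
    """
    duration = offset - onset
    # Assumption: tokens shorter than two windows are rejected; the text does
    # not say how (or whether) such short fricatives were handled.
    if duration < 2 * win:
        raise ValueError("fricative shorter than two analysis windows")
    mid = onset + duration / 2
    return {
        "t1": (onset, onset + win),
        "t2": (onset + win, onset + 2 * win),
        "t3": (mid - win / 2, mid + win / 2),
        "t4": (offset - 2 * win, offset - win),
        "t5": (offset - win, offset),
    }


# Hypothetical fricative spanning 0.100 s to 0.220 s (120 ms long).
for label, (start, end) in analysis_windows(0.100, 0.220).items():
    print(label, round(start, 3), round(end, 3))
```

Under this scheme the windows overlap for fricatives shorter than 100 ms, which is one reason adjacent time points may pattern together in short tokens.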

Since we will be discussing cases of fricative merger, we focus our subsequent discussion of inter-speaker variation on the midpoint of the fricatives, as this is the time window that shows the largest average difference between fricatives.

Figure 8

To evaluate the statistical significance of the trends in Figure 8, we fit linear mixed effects models of the same structure shown in (1), m0, m1, m2, m3, m4, to the Study 1 words. The results for Study 1 were similar to the results for the larger corpus, except that the effect of rime was not significant in the Study 1 words: m2 did not show significant improvement over m1 (χ² = 0.996) and m4 did not show significant improvement over m3 (χ² = 2.08, p = 0.14911). This indicates that the subset of rimes included in the Study 1 words does not impact CoG at the midpoint. As in the larger model, m1, the model adding duration, showed significant improvement over m0 (χ² = 4.36*); and m3, adding fricative proto-category, showed significant improvement over m2 (χ² = 27.45*). The estimate for *x, based on m3, was 1063 Hz. This is higher than the estimate of 776 Hz for the main model, which can be obtained by adding the effect of fricative (-2863.72) to the intercept (3640.16) reported in Table 7. The model estimate for *xʷ in the Study 1 words is 3929 Hz, which is similar to, though somewhat higher than, the larger model estimate of 3238 Hz for *xʷ. The Study 1 words indicate a significant difference between *x and *xʷ; however, as Figure 8 shows, even in minimal pair contexts, the majority of the speakers in this sample show some phonetic overlap between

Zhongjiang as a Type II variety, to be fully merged. These are pairs of words that historically derive from a contrast between *f and *xʷ. In Zhongjiang, *xʷ is reported to have changed to [f]. This change produced homophones from minimal pairs.
For example, 肥 'fat' (Pinyin fei³⁵) and 回 'return' (hui³⁵) were phonetically distinct in medieval Chinese and are synchronically distinct in other Chinese languages (including Standard Mandarin), but are expected to be homophonous in Zhongjiang.

Figure 9 plots the average CoG at each time point of analysis. The CoG is similar across proto-categories at all time points but maximally distinct at the midpoint. At the first time window, *xʷ is already similar to the values reported in Study 1, which are slightly lower than those for *f.

Figure 9. Average CoG for *f and *xʷ proto-categories at five timepoints.

Figure 10 plots the token-by-token variation by speaker. As expected, most speakers show a merger for these words: the CoG and SD values tend to overlap. Moreover, the range of phonetic values for these words corresponds on a speaker-by-speaker basis to the range of values observed for [f] in minimal pairs (Study 1, Figure 8)

Figure 10. Center of Gravity and Spectrum Standard Deviation of non-contrastive pairs (*f_-*xʷ_)

To assess the statistical significance of the differences in Figure 10, we fit the same set of linear mixed effects models, shown in (1), to the Study 2 words. The addition of duration (χ² = 30.69***) and rime (χ² = 27.81***) led to significant improvements. Beyond that, the addition of fricative proto-category also led to significant model improvement (χ² = 24.48***). This indicates that the merger between *f and *xʷ is not complete. We also tested the interaction between rime and fricative, which was not significant (χ² = 19.53, p > 0.05). The estimate for *xʷ was 3815 Hz, which is similar to the larger study (see Table 7). The estimate for *f was 4351 Hz, an increase of 536 Hz. This significant effect is driven entirely by speaker
2. If we re-run the models without speaker 2, then the effect of fricative proto-category is not significant (χ² = 2.24, p > 0.05) and the estimates for *f and *xʷ are nearly identical: 3537 Hz for *f and 3572 Hz for *xʷ.

To summarize, the results of Study 2 show that most speakers (9/10) produce *xʷ as [

Figure 11. Average CoG for *f and *xʷ proto-categories at five timepoints.

Figure 12 shows the range of variation by subjects.

The final study investigated the environment of [oŋ], as this environment has been known to condition mergers and splits in other Chinese language varieties (see Table 1). As a Type II language, Zhongjiang is expected to maintain contrast in this environment. Figure 13

At the beginning of the fricatives, t1, the first 20 ms window, and t2, the second 20 ms of the fricative, there was no significant difference between *xʷ and *f in the larger corpus. Later in the fricative, including the middle 20 ms, there was a small but significant difference, with *xʷ having a lower CoG than *f (β = -402.5 Hz, p < 0.01). This decrease in CoG leaves *xʷ still much closer to *f than to *x, suggesting perhaps incomplete neutralization of *xʷ and *f. When we controlled for rime context (the minimal pairs in Study 2), we found that 9 out of 10 speakers showed a complete merger, i.e., no significant differences between *xʷ and *f. This result is consistent with Zhongjiang as a Type II variety, although there was some inter-speaker variation. The most common pattern was shared by four subjects: S3, S4, S8, and S10. This group includes two male participants (S3 & S8) and two female participants (S4 & S10

available, and the number of tokens (N) that entered into each analysis. For the current study, we also provide the CoG estimates from our mixed models, which take into account control variables and random effects.
As far as we know, there are no other quantitative phonetic analyses of Zhongjiang non-coronal fricatives. For reference, the table includes a seminal study on Standard Mandarin fricatives (Svantesson 1986) as well as a more recent update (Ran 2008), a recent study on another variety of Mandarin, Jianghuai Mandarin, with a larger number of tokens than the other reference points (Wu 2020), and a study on Cairene Arabic (Norlin 1983), which also has a contrast between labiodental and velar fricatives. Most of the studies present CoG values in Hz, but some present results in critical band units (CBU). Svantesson (1986)

Estimates from other studies in Table 8

Our reporting of the data focuses primarily on spectral CoG, a measure which enables broad comparison to past results, as summarized above. We also explored other ways to summarize fricative spectra including additional spectral moments, such as skew and kurtosis, formant values, and duration. In exploring these additional phonetic measures, we found that many of them were correlated. For example, skew and kurtosis, the third and fourth spectral moments, were closely correlated with the mean energy (CoG) and variance of the spectrum, the first and second spectral moments. From the plots in Figures 8, 10, 12, and 14, it is also possible to observe a positive correlation between spectrum CoG and spectrum SD. This is not a general finding for fricatives but follows from the typical properties of velar and labiodental fricatives (see also Section 1). To facilitate additional exploration of the data, including cross-linguistic comparison, we have submitted our entire data set, including additional measurements as well as sound files and text grids, as a Data in Brief article.
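For readers less familiar with spectral moments, the following Python sketch shows one common way to compute the four moments mentioned above from a power spectrum. It is illustrative only and is not the Praat procedure used in the study: the function name and the two toy spectra are hypothetical, weighting by power is an assumption, and kurtosis is reported as excess kurtosis (one convention among several).

```python
from math import sqrt


def spectral_moments(freqs_hz, power):
    """Return (CoG, SD, skewness, excess kurtosis) of a power spectrum.

    freqs_hz: bin centre frequencies in Hz; power: non-negative power per bin.
    """
    total = sum(power)
    if len(freqs_hz) != len(power) or total <= 0:
        raise ValueError("need matching bins and positive total power")
    # First moment: power-weighted mean frequency (Center of Gravity).
    cog = sum(f * p for f, p in zip(freqs_hz, power)) / total

    def central(k):
        # k-th central moment of the spectrum around the CoG.
        return sum((f - cog) ** k * p for f, p in zip(freqs_hz, power)) / total

    variance = central(2)
    if variance == 0:
        raise ValueError("skewness and kurtosis undefined: all power in one bin")
    sd = sqrt(variance)
    skew = central(3) / sd ** 3
    kurt = central(4) / sd ** 4 - 3  # excess kurtosis (one common convention)
    return cog, sd, skew, kurt


# Toy spectra on 1 kHz bins up to about 22 kHz (invented, NOT study data).
freqs = [500 + 1000 * i for i in range(22)]
low_peak = [0.5 ** i for i in range(22)]  # energy concentrated at low frequencies
flat = [1.0] * 22                         # diffuse (flat) spectrum

for label, spec in (("low-frequency peak", low_peak), ("flat", flat)):
    cog, sd, skew, kurt = spectral_moments(freqs, spec)
    print(label, round(cog), round(sd), round(skew, 2), round(kurt, 2))
```

In this toy case the spectrum with a low-frequency peak has both a lower CoG and a lower SD than the flat spectrum, which mirrors the positive CoG-SD relationship noted above for velar-like versus labiodental-like spectra.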

In the remainder of the paper, we discuss how some of the observations above might be related.

words come from *f or *xʷ. Additionally, we observed that the CoG for *x is substantially lower than in other varieties, another fact which is not integrated into this account of the merger. In the remainder of this sub-section, we sketch a pathway to change that links the pattern of phonetic variation that we have observed to the Type II merger.

Our proposed pathway to change decomposes the diachronic process into five steps, which are summarized in Figure 15. The dimensions of each box represent a fricative space defined by CoG (x-axis) and spectrum dispersion (y-axis), as per Figures 8, 10, 12, and 14 of our results. In the first step (top, left panel) we assume that *f, *x, *xʷ, the three relevant proto-categories, are all produced with similar ranges of variation. We also assume that the CoG for *xʷ is slightly higher than for *x, owing to coarticulation with the labial glide. This part of the account is consistent with Zhuang (2004), as summarized above. The next step is characterized by increased variability for *f, which causes this category to encroach on *x. Since *xʷ has higher CoG than *x, increased variability from *f first threatens *xʷ. In the Type II scenario, *xʷ merges with *f. This is shown as a partial merger in the top right panel. In the next proposed step, *x differentiates from *xʷ/*f by lowering CoG (and spectrum dispersion). We surmise that this lowering helps to stabilize a new system in which *xʷ and *f are completely merged. Our proposal, summarized in Figure 15,

Figure 14). However, when we controlled for rime, zeroing in on a single non-posterior rime environment, [əɯ], in Study 3, we still observed a wide range of CoG variation for *f (Figure
12). In particular, speakers 4, 6, and 8 produced a similar range of CoG variation for *f in Study 3 (Figure 12) as they did in Studies 1 and 2

The Zhongjiang variety is a "dialect island", a Type II language in a sea of Type I languages (see Figure 1).

representative of a Type II language in our sample.

Another speaker whose production patterns fell outside of the main pattern was S2. This speaker's productions are also interesting from the standpoint of the sociophonetics of language contact. This speaker had the range of variation for *f in Study 3 that motivated us to posit a secondary velar articulation, [