Acoustic Analysis of EkeGusii Vowel System

This study describes the vowel system of EkeGusii (Bantu E.42; Guthrie, 1948) from an acoustic phonetic perspective, using oral data obtained from purposively sampled subjects: four adult males, four adult females and four children (two boys and two girls, all 8 years old), drawn equally from the two dialects of EkeGusii (EkeMaate and EkeRogoro). To capture the distribution characteristics of the vowel acoustic concentration, the group frequency means are normalized using Lobanov's (1971) algorithm. EkeGusii vowels are analyzed from two viewpoints: (a) the acoustic vowel space as projected by the intersection of F2 vs. F1, or quadrilateral, and (b) the spatial features of high, low, front and back. These qualities are mainly influenced by the physiology of speakers and by social variability as occasioned by gender, age and dialect. The results indicate that children show no gender difference in formants and have the highest frequencies for all formants, followed by adult females and then adult males. Furthermore, acoustic vowel space and spatial features are affected by gender, age and dialect. A vowel pattern, replicated by all informants, is realized in the dispersion of the vowels within the chart, influenced by gender and age. The study found that EkeGusii appears to have a seven-vowel system, /i e ɛ a ɔ o u/, with a length contrast.


Introduction
Among languages of the world, it is a universal tendency to have vowels with maximum dispersal of vowel quality in order to, as much as possible, contrast parts of the vowel space (Hayward, 2013:328). Articulatory vowel quality is also considered since it correlates acoustic signals with perception of vowels in speech. The affinity of the acoustic to the articulatory configurations of the speakers can therefore be recovered from the renditions of vowel quality. The most critical features of traditional affinity structure of phonetic descriptions and F1-F2 plots are presupposed.
The main focus of this study is the vowel quality based on traditional features such as height. It is assumed that acoustic information is mapped in a one-to-one manner into relevant area functions such as height, frontness, backness, which can be represented on a vowel plot. The study kept to the "triangle tradition" because it is considered to be a reliable means for characterizing a given vowel system (cf. Nearey, 1978).
The EkeGusii vowels are analyzed according to their acoustic features in the context of the source-filter model (Fant, 1960). The frequencies of the first two formants, F1 and F2, give the primary acoustic correlates of vowel quality (cf. Peterson, 1961; Fant, 1959). In a descriptive linguistic study such as this, F2-F1 plots are relied on because, as an input to human speech perception, they are sufficient to reflect the vowel quality of a given language (Hayward ibid.).
To determine the mapping between physical parameters and phonetic features, it is normal practice to have an explicit feature extractor that effects the separation of the variability realized by the physical parameters. The first and second formants (F1 and F2 henceforth) extracted from target vowels in EkeGusii are plotted on a graph to simulate the IPA vowel chart. The vowels in EkeGusii are seen to be spread in close proximity, though they still maintain the expected triangular formation. Some informants had their vowels skewed from the group norm, yet hearers could still clearly perceive these outlying vowels. This suggests that F1 and F2 are not the only properties required to describe vowels; F3 and other higher formants, together with pitch and intensity, may also need to be included. The outliers in the formant results suggest that the two formants, F1 and F2, are sufficient for specifying the height, frontness and rounding characteristics of vowels in EkeGusii, but not for bringing out secondary features such as tone, which was not considered in this study since it does not determine the phonemic status of vowels in the language.
1.1 Purpose
Theoretically, the analysis of the EkeGusii vowel system presented here will help to better understand how EkeGusii vowels are realized acoustically, so that it may be possible to estimate articulatory configurations and phonetically grounded phonological rules. This will contribute to the debate on whether we can rely solely on acoustics to fully describe vowels in a way that is usable in language learning and/or lays a phonetic basis for the phonology of human languages.

The language
EkeGusii is an under-described Bantu language (Nash, 2011:1) spoken by the Gusii people, who reside in Kisii and Nyamira counties, Kenya (cf. Basweti et al., 2015). According to the 2009 National Census, EkeGusii has an estimated 2.2 million speakers. The AbaGusii are believed to have migrated from the Congo forest in the 1400s through Uganda, entering Kenya through the western part of the country. They are believed to be descendants of Mogusii (Ochieng', 1974), whose sons form the seven main clans of the community, that is, Getutu, Mogirango-Rogoro, Mogirango-Maate, Monchari, Mobasi, Machoge and Nyaribari. They are bordered by Nilotic speakers: to the East by the Kipsigis, to the West by the Luo, and to the South by the Maasai. In his zonal classification of languages, Guthrie (1948) classifies EkeGusii as a central Bantu language, part of the sub-family of the Kuria language labeled JE. 42. EkeGusii is a tonal language with two major distinct tones, the high and the low (Otieno, 2013:98). Guthrie (1948) relates EkeGusii to other languages including Lulogooli, Ameru (Kenya), Kuria (Kenya and Tanzania), Ware, Ikizu, Ikoma, and Sanjo (Tanzania). Like most Bantu languages, EkeGusii has a seven-vowel system (Nurse & Gerald, 2003). Pioneering researchers of EkeGusii, starting with Guthrie (1948) and Whiteley (1965), attempted to identify and describe the EkeGusii vowels /i e ɛ a ɔ o u/ using impressionistic data. Cammenga (2002:36) points out that Whiteley (1956) did not define the phonetic or phonological significance of these symbols exactly. Cammenga goes on to give an EkeGusii vowel inventory in which the vowel segments are differentiated in terms of minimally distinctive phonological features. It has now become increasingly necessary to describe the sounds of language using acoustic means, to give such descriptions a firm and replicable scientific footing.

Selection of language
This study selected EkeGusii because the language is still under-studied in various respects and, to the best of the researcher's knowledge, there has so far been no acoustic phonetic study of its phones. The choice was also based on the researcher's easy access to study subjects and on the researcher being a native speaker of EkeGusii, so that some aspects can be critiqued using native intuitions. The language is also under threat from the more dominant Kiswahili and English, with the younger generation shunning their native language. All these reasons call for a proper description and documentation of the language.
This study selected EkeGusii in order to give this phonetically under-studied language an evidence-based description of its vowel system. As such, the study includes quantitative analyses, since it quantifies the features of vowels along dimensions such as vowel space and the spatial features of high vs. low and front vs. back.

Sample
This study purposively sampled twelve subjects from both Rogoro (six subjects) and Maate dialects (six subjects) of EkeGusii language. The subjects selected are native speakers of the language who grew up speaking the language in the home setting at least until they started formal education where other languages (Kiswahili and English) are introduced.
For this study, four adult females (MW1 aged 40, MW2 aged 25, RW3 aged 20 and RW4 aged 38), four adult males (RM1 aged 42, RM2 aged 21, MM3 aged 23 and MM4 aged 36) and four children (two boys and two girls, all aged 8) were selected. This is well above the bare minimum of three females and three males that Ladefoged (2003) gives for the number of subjects in a study like the present one.
For dialectal differences, this study purposively picked two subjects from each group (men, women and children) to represent each of the two dialects of EkeGusii as mentioned above.
Each subject was presented with a list of words and sentences in EkeGusii orthography. The list of words for eliciting vowel sounds was constructed in the /tVt/ context. This context was used both for the word lists and for the target words in carrier sentences. In the few instances where a word in a given context was not available, a nonsense word was coined and used.

Data collection
The oral data comprised tokens elicited from the list of words and sentences generated by the researcher in the /tVt/ context, which carried the target vowel sounds. For the stop consonants, a list of words containing the target sounds was generated by the researcher to elicit the oral data. To improve the quality of the recorded data, recording was done in a controlled room in the language laboratory at Kisii University, where background noise was reduced as much as possible. It took three months to gather the data. If any mistake was detected, the informant was asked to redo the recording.

Stimulus
In this research, artificial words were created to fill phonemic gaps in the cases where no minimal pairs exist. The words in citation form were then used to obtain acoustic data that could cover the entire possible vowel space of the language. The entire list of the minimal pairs is given in Table (1). The words in Table (1) were used to generate the carrier sentences given in Table (2), which were read three times by the respondents. The words selected for the word lists, both real and nonsense words, were those that form minimal pairs, in order to control as much as possible for intervening variables such as place and manner of articulation when measuring lip rounding or vowel height. Only the vowels in the first syllable and the middle vowel were acoustically measured. The final vowel could not be measured because the resonances at the end would make the vowel appear longer than it really was.

Data recording
Each vowel was assigned a specific lexical item, which each subject was required to produce. Three repetitions of each word were extracted into separate sound files and TextGrids. In the edit window, the non-empty categories were extracted and listed. These were the ones analyzed, and their averages were calculated. Averages alone are not reliable, since a few outliers can push the figures one way or the other. Individual and group averages were therefore subjected to statistical tests of significance to guard against bias in the data. Apart from t-tests, the data were subjected to ANOVA in SPSS to test the variance within and between groups.
All the recordings were made in the language laboratory at Kisii University. Each subject sat on a chair facing a computer screen, wearing a headset whose microphone was adjusted to be about 15 cm from the mouth and inclined at about 45 degrees. The recordings were made with a Weile WL-906 onto a computer hard drive at a sampling rate of 44100 Hz.
The word list and sentences were prepared in EkeGusii orthography so that the subjects had no difficulty reading them. Before the recording, each subject was given time to practice. Whenever a problem arose, the recording was repeated. This occurred mainly with [±ATR] vowels, since EkeGusii orthography does not distinguish between them; the subject has to infer the appropriate pronunciation from the context by employing native-speaker intuition.
2.6 Data analysis
After the data were collected, they were analyzed using descriptive statistics to find mean frequencies. The data were interpreted using the mean, range and measures of significance. To locate formant centers, this research relied on Linear Predictive Coding (LPC), a tool available in Praat, for accuracy. Accuracy was further checked against the visual display, which helped the researcher to confirm what Praat had automatically generated through LPC.
The monophthongs analyzed in the present work can be tracked easily, since each of them has a single vowel quality throughout its duration. The researcher selected words that are minimal pairs to ensure that the quality of the vowel was kept as uniform as possible in the /tVt/ context. Since this work deals with three different groups of informants (children, women and men), who characteristically differ in the length and thickness of the vocal tract, it was appropriate to adjust the maximum frequency settings in Praat to suit each group. Men have lower frequencies, and their maximum was set at 5000 Hz; women's frequencies are higher, and their maximum was set at 5500 Hz; children's voices have the highest frequencies, and their maximum was set at 6000 Hz. With these settings, the natural formant range of each group was accommodated so that correct measurements could be made.
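The group-specific settings just described can be summarized as a small lookup. The following Python sketch is illustrative only: the names `MAX_FORMANT_HZ` and `formant_ceiling` are hypothetical and are not part of Praat or of the scripts used in the study. Only the three ceiling values are taken from the text.

```python
# Maximum formant frequency (Hz) per speaker group, as stated in the text.
# The dictionary and function names are hypothetical, for illustration only.
MAX_FORMANT_HZ = {"men": 5000, "women": 5500, "children": 6000}


def formant_ceiling(group: str) -> int:
    """Return the formant ceiling (Hz) to use for a given speaker group."""
    key = group.strip().lower()
    if key not in MAX_FORMANT_HZ:
        raise ValueError(f"unknown speaker group: {group!r}")
    return MAX_FORMANT_HZ[key]
```

A value returned by `formant_ceiling` would then be supplied as the maximum formant setting when each speaker's recordings are analyzed.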
The data recorded for the analysis of vowels were down-sampled to 11025 Hz with the CSL 4400 software and analyzed with Praat version 5.1.23 (Boersma & Weenink, 2010). For females, the settings were a window length of 0.01 seconds, a time step of 0.01 seconds and a maximum frequency of 5500 Hz with the Burg method. F0 was extracted from the middle of the vowel with the autocorrelation method recommended for intonation (Boersma & Weenink, 2010). F1 and F2 for each vowel and each token were generated automatically by a script that logged a report to a text file, which was exported to an Excel workbook and to SPSS for statistical analysis.
To do this, the down-sampled speech signals were analyzed with the Burg method using the stated parameters, and the results were extracted to a text file. The formant values closest to the midpoint of the vowel were recorded on a spreadsheet for further statistical analysis. In this study, Lobanov's algorithm (Lobanov, 1971) was used to normalize the formant values. To check whether there were significant differences between the means of different variables and groups, a paired-sample t-test was conducted. In the test, the two contrasting arrays of data were compared using a paired, one-tailed test, with p<0.05 taken as significant.
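Lobanov normalization is commonly described as a per-speaker z-score: for each formant, a value F is replaced by (F - mean) / SD, where the mean and SD are taken over that speaker's vowel measurements for that formant. The Python sketch below illustrates this under stated assumptions: it uses the sample standard deviation (some implementations use the population SD instead), the function name `lobanov` is hypothetical, and the example numbers are invented for illustration and are not the study's data.

```python
from statistics import mean, stdev


def lobanov(values):
    """z-score one speaker's values for a single formant (e.g. all F1 values)."""
    if len(values) < 2:
        raise ValueError("at least two values are needed")
    mu = mean(values)
    sigma = stdev(values)  # assumption: sample SD (n - 1 in the denominator)
    if sigma == 0:
        raise ValueError("values are all identical; SD is zero")
    return [(v - mu) / sigma for v in values]


# Hypothetical F1 values (Hz) for one speaker's seven vowels; NOT study data.
example_f1 = [300.0, 400.0, 500.0, 700.0, 500.0, 400.0, 300.0]
normalized_f1 = lobanov(example_f1)
```

After this step each speaker's values are centred on zero with unit spread, so that speakers with different vocal tract sizes can be plotted on the same axes.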

EkeGusii vowel chart
The most useful representation of vowels is a plot showing the average values of F1 and F2 for each vowel, collected from a group of speakers (Ladefoged & Disner, 2012). This means that the most important descriptors of vowels, that is, height, backness and rounding, can be identified from the F1 and F2 frequencies and, to some extent, from F3 for lip rounding. F1-F2 plots are therefore a very useful means of describing EkeGusii vowels. They are also a way of revealing the universal perceptual space, since the formant frequencies are a direct result of acoustic signals and an input to human speech perception. Producing a vowel (formant) chart for EkeGusii amounts to producing a map of the language's vowels in the universal vowel space (Hayward, 2013:288). F1 identifies the height feature, while F2 identifies the backness and rounding features. Vowel height is inversely related to F1 frequency: the lower the F1 frequency, the higher the vowel is located. Table 3 below gives the mean F2-F1 values for the adult males for target words in citation form. The SDs for the vowels were relatively high, which indicates that the values of individual speakers were widely dispersed around the mean. The resulting vowel chart for men is in Fig. 1 below. Men's vocal tracts are by nature longer and thicker than those of women and children. A longer vocal tract resonates at lower frequencies, hence the lower formant frequencies. These values are within the norm of an F1 of 250-800 Hz according to Stevens (2000).
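The relationship between formant values and chart position described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the plotting procedure used in the study: the function names are hypothetical, and the (F1, F2) values are invented placeholders rather than the means reported in the tables.

```python
def chart_position(f1_hz, f2_hz):
    """Map (F1, F2) to (x, y) so that front vowels fall on the left
    (high F2) and high vowels fall at the top (low F1)."""
    return (-f2_hz, -f1_hz)


def rank_by_height(means):
    """Order vowel labels from highest (lowest F1) to lowest (highest F1).

    `means` maps a vowel label to an (F1, F2) pair in Hz."""
    return sorted(means, key=lambda vowel: means[vowel][0])


# Hypothetical (F1, F2) means in Hz for three vowels; NOT study data.
example_means = {"a": (700.0, 1400.0), "i": (300.0, 2200.0), "u": (350.0, 900.0)}
height_order = rank_by_height(example_means)
```

Negating both formants is one simple way to reproduce the orientation of the IPA chart; plotting libraries can achieve the same effect by reversing the axes.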

Results for adult males
Tab. 4 below contains the results for adult males as extracted from carrier sentences in running speech. The results for connected speech follow the same trajectory as those for the vowels in citation form. However, there is a consistent difference between them: the values for both F1 and F2 in citation form are slightly higher than those from the carrier sentences. The difference was further tested using Student's t-test. The difference between the F2 for word lists and the F2 for carrier sentences was not significant (p = 0.2), so it cannot be distinguished from random chance. The difference between the F1 of word lists and the F1 of carrier sentences was highly significant (p = 0.0005). This implies that the difference between word-list and carrier-sentence vowels was only realized on F1. Fig. 2 below shows the averaged chart for men after the normalization procedure (Lobanov, 1971). As with the values before normalization, the expected triangular formation of vowels is discernible. The short high front vowel /i/ is the most fronted and the highest on the chart. Among the front vowels it is followed, in descending order, by /e/ and then /ɛ/. Among the back vowels, /u/ is the highest, but not as high as /i/. It is followed by /o/, which is also the backmost, and then /ɔ/. /a/ remains the lowest, placed at the centre of the chart.
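The comparison of word-list and carrier-sentence values relies on a paired t-test, whose test statistic can be computed as the mean of the pairwise differences divided by their standard error. The Python sketch below shows only that statistic and its degrees of freedom; it is a generic illustration, not the SPSS or spreadsheet procedure used in the study, and the function name and the numbers are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(a, b):
    """Return (t, df) for a paired-samples t-test on two equal-length lists."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least two pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample SD of the pairwise differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    n = len(diffs)
    return mean(diffs) / (sd / sqrt(n)), n - 1


# Hypothetical F1 means (Hz) for four vowels in two conditions; NOT study data.
word_list = [310.0, 420.0, 530.0, 710.0]
carrier = [300.0, 400.0, 520.0, 690.0]
t_value, df = paired_t(word_list, carrier)
```

Converting the statistic to a p-value requires the t distribution, which a statistics package provides; for example, SciPy's `scipy.stats.ttest_rel` returns both the statistic and the p-value.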
3.2 Results for adult females
Tab. 5 below shows the mean F2-F1 values and their SDs for all the women informants in this study. Tab. 5 shows high SD values for both F2 and F1: the points for the vowels on the chart were widely dispersed around the mean.
The results for females showed higher frequencies than for males for all vowels, for both F1 and F2, which is expected considering the different anatomical structure of adult male and adult female speech organs. The difference was significant (p<0.05) both for the vowels extracted from words in citation form and for those from carrier sentences.
Tab. 6 below gives the females' average values and their SDs for vowels extracted from carrier sentences. The vowel chart for females results from the data in Tab. 6. There was no significant difference in F2 for females between the values for vowels extracted from carrier sentences and those extracted from word lists. A significant difference occurred in F1 (p = 0.01). The same trajectory for female informants is maintained after normalization, as seen in Fig. 4.
Vowels for the males and females follow each other in the same order in both the normalized chart and the non-normalized plot. For instance, the vowel /i/ (for titi) is the highest and most fronted for both the men and the women, and the vowel /a/ (for tata) is central and the lowest of all seven for both sexes. The differences in frequencies were highly significant (p<0.05). This means that female vowels were located higher and more fronted for the high vowels, and lower for the low vowels, than those of the males.
Considering the two plots side by side, women's vowels are pushed more to the front and lower on the chart than men's. The p-values for the difference between the two sets of data were 0.0039 for F2 and 0.0048 for F1. The figures indicate that the two groups have distinct values that reflect actual patterns in the language and not just random chance.

Results for children
Average frequencies of the first (F1) and second (F2) formants for children are in Tab. 7 below. The results in this table are better appreciated when displayed on a scatter plot that shows the collective vowel chart for children. The difference between the children's F2-F1 values for vowels extracted from word lists and those from carrier sentences was significant (p = 0.02). Again, SD values are high for the children, indicating that their values were scattered below and above the averages. A significance test for the difference between the children's scores and those of the men yielded a highly significant p-value of 0.006, while for the women the difference was not significant (p = 0.3). This means that children and adult females had the same range of F1, which differed statistically from that of the men; that is, the frequency ranges of the vowels for children and women were very close. A p-value of 0.3 means that, if there were no real difference between the two groups, a difference at least as large as the one observed would be expected about 30% of the time by chance alone. Since the conventional threshold is p<0.05, the women's and children's F2-F1 values are not significantly different. The vowel chart for children maintained the expected triangular formation even after normalization, as seen in Fig. 7 above, though it is a little different from those of the men and the women, as seen in Fig. 8, where the means of all three groups are plotted on the same axes. To better portray the vowels of EkeGusii on the chart, so as to capture the distribution characteristics of vowel acoustic concentration, group means were normalized using the Lobanov (1971) algorithm, which is considered one of the best vowel normalization procedures (Adank, 2003). Fig. 9 below gives the Lobanov-normalized plot for all 12 speakers.
Key: ⦁ = /i/; ⧠ = /e/; ▲ = /ɛ/; * = /a/; ⟠ = /ɔ/; ▰ = /o/; + = /u/
Subjects: men 1-4; women 5-8; children 9-12
Figure 9. F2-F1 averages for men, women and children: normalized values after the Lobanov (1971) algorithm
This plot brings out spatial differences by gender and age. The normalized values remove the raw Hertz values of vowel formant frequency, which are not suitable for direct comparison. The vowels produced by the informants are generally scattered across the chart, with a little overlap among the back vowels. The front vowels are well distributed in the vowel space.

Discriminant analysis
The results above were subjected to a discriminant analysis (cf. Klecka, 1980), first according to gender. Tab. 8 gives the discriminant results for adult male and female subjects. For the adult informants represented in Tab. 8, 75% of male cases and 100% of the selected female cases were correctly classified. Overall, this translates to 87.5% of selected original grouped cases correctly classified for the adult informants. In addition, 52.1% of unselected grouped cases were correctly classified. The results indicate that values for F1 and F2 can be used to discriminate between the genders by 98%. F3 was the poorest classifier. These results are confirmed by the Wilks' Lambda test in Tab. 9 below. Tab. 9 shows the F1, F2 and F3 test results, with F1 and F2 having highly significant values of p<0.0001, while F3 was not significant (p = 0.393). This means that the F1 and F2 values, unlike the F3 values, are discriminative enough to be used to distinguish between the genders.
The same trend is seen in Tab. 10 for all 12 speakers, that is, when the children are added. The significance values for F1 and F2 remain at p<0.0001, while that of F3 is still not significant (p = 0.283). When the results of all the speakers (men, women and children) are included, the discriminating power of the function by gender drops considerably. Tab. 11 gives the classification results for all informants.
As seen in Tab. 10, including all speakers reduces the correct classification of selected original grouped cases to 69%, down from 87.5% for the adults alone. The results imply that, for children, F1, F2 and F3 values cannot reliably discriminate between the genders. That notwithstanding, 72.2% of unselected original grouped cases were correctly classified. Only 57.1% of the original selected cases for child informants were correctly classified as either male or female, as opposed to 87.5% for the adult informants. In cross-validation, each case in the analysis is classified by functions derived from all cases other than that one. For the children, 53.6% of cross-validated grouped cases were correctly classified. These results mean that it is very difficult to discriminate between the sexes from the formant results of children, whereas for adults the difference between the sexes yields very high classification rates. For children, the most likely explanation is that by age 8 male and female speech features have not yet diverged, because secondary sex characteristics have not yet developed, which makes discrimination difficult.
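The cross-validation scheme described above, in which each case is classified by a rule derived from all the other cases, is known as leave-one-out cross-validation. The Python sketch below illustrates the idea with a nearest-centroid classifier as a deliberately simple stand-in for the discriminant functions; it is not the SPSS procedure used in the study, the function names are hypothetical, and the example cases are invented rather than taken from the data.

```python
from math import dist


def centroid(points):
    """Mean point of a list of equal-length feature tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def loo_accuracy(cases):
    """Leave-one-out accuracy; `cases` is a list of (features, label) pairs."""
    correct = 0
    for i, (features, label) in enumerate(cases):
        training = cases[:i] + cases[i + 1:]  # every case except the held-out one
        groups = {}
        for feats, lab in training:
            groups.setdefault(lab, []).append(feats)
        centroids = {lab: centroid(pts) for lab, pts in groups.items()}
        # Assign the held-out case to the label with the nearest centroid.
        predicted = min(centroids, key=lambda lab: dist(features, centroids[lab]))
        correct += predicted == label
    return correct / len(cases)


# Hypothetical (F1, F2) cases in Hz with arbitrary group labels; NOT study data.
example_cases = [
    ((300.0, 2100.0), "A"), ((320.0, 2150.0), "A"), ((310.0, 2050.0), "A"),
    ((400.0, 2500.0), "B"), ((420.0, 2550.0), "B"), ((410.0, 2450.0), "B"),
]
example_rate = loo_accuracy(example_cases)
```

Because the held-out case never contributes to the rule that classifies it, the resulting rate is a more conservative estimate than the rate obtained on the original cases.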
4.2 Other statistical tests
ANOVA revealed insufficient evidence to reject the null hypothesis, F(1, 83) = 4.098, p > .05. These results indicate that when the means are compared across all the subjects, there is no significant difference. This contrasts with the comparison of group means, where the ANOVA tests indicate high levels of significance. Gender tests between groups yield the greatest significance.
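For readers unfamiliar with the notation, a one-way ANOVA F statistic is the ratio of the between-group mean square to the within-group mean square, and a report of the form F(1, 83) gives the between-group and within-group degrees of freedom in that order. The Python sketch below computes these quantities for arbitrary groups; it is a generic illustration rather than the SPSS analysis reported here, the function name is hypothetical, and the example values are invented.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of observations."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or any(len(g) == 0 for g in groups) or n_total <= k:
        raise ValueError("need at least two non-empty groups and n > k")
    grand_mean = mean([x for g in groups for x in g])
    group_means = [mean(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of observations around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    if ss_within == 0:
        raise ValueError("no within-group variation; F is undefined")
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical observations for two groups; NOT study data.
f_stat, df_b, df_w = one_way_anova_f([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]])
```

The p-value is then read from the F distribution with the two degrees of freedom, which a statistics package supplies.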

Conclusion
Only the first two formants are required to pin down a vowel on the vowel chart, but they are not enough to capture all the qualities of a vowel that are required for its recognition. There could be more going on in the hearer's perception of vowels.
The evidence from the experiments here showed some overlap of vowel sounds in their F1 and F2 measures, and yet listeners can perceptually distinguish them by using tonal frequencies that were not registered by the acoustic measurements.
When the F1-F2 dimensions for each vowel were plotted, it was observed that the norm and the outlier frequencies of the formants were discrete. The spacing of the vowels on the plots must be seen in terms of what Hayward (2013:295) calls the 'psychologically real'. The vowels which fall within the norm and the outliers need to be explained, since the listener can perceive them as alike. This gives credence to the idea that the formant peaks are what the speaker utilizes, or what the listener perceives, as the primary phonetic parameter in making a phonological decision on where to locate a given vowel sound, given the options available in the language.