On the Acquisition of English Voiceless Stop VOT by Indonesian-English Bilinguals: Evidence of Input Frequency

This paper investigated the acquisition of Voice Onset Time (VOT) of the English voiceless stop consonants /p/, /t/, and /k/ by Indonesian-English bilingual children, with a focus on how second language (L2) input shapes L2 VOT production. It looked at two types of bilingual participants: (1) one 6-year-old participant who had received extensive native-speaker English input from YouTube for about 8 hours per day since she was two, in addition to interactive communication in English with her family members, and (2) four students (aged 7-8 years) of an International Class Program with a non-native English environment. Both groups were residing in Malang, East Java, Indonesia at the time of data collection. The comparative analysis concluded that the VOT values differ significantly across the different inputs. The participants with non-native input acquired much shorter VOTs, averaging 28-36 ms, while the participant with native input achieved native-like VOTs, averaging 69 ms for /p/ and /t/ and even longer for /k/. Contributing factors to these individual differences might include input frequency levels, types of input, and the complexities of the phonological properties of Indonesian and English.


INTRODUCTION
The development of two language systems in a bilingual individual has always been thought-provoking in language acquisition research, as the two systems are repeatedly found to influence each other during acquisition and development. This cross-linguistic phenomenon has brought together research on multiple variables attempting to find evidence of how it varies across bilinguals. Looking at developmental variation in bilingual speakers, I follow Unsworth (2013) in maintaining that the source of variation may lie in the amount and type of L1 and L2 input. Examining the role of language input is therefore crucial not only for researchers of bilingual acquisition, who want to see how significantly it assists bilingual development, but also for parents and educators seeking best practices in raising successive bilingual children.
From this point of view, I investigate the extent to which L2 input affects L2 sound production and acquisition. More specifically, I look closely at the production of Voice Onset Time (VOT) of English voiceless stop consonants by two sets of Indonesian-English bilingual children who have been exposed to two different types of L2 input. A further purpose is to establish which of these inputs works best in the acquisition of English voiceless stop VOTs. VOT, according to Ladefoged and Johnson (2011, 151), is "the interval between the release of a closure and the start of the voicing", which in aspirated sounds is characterized by a period of silence during and after the release of the articulation. As they also outline, VOT values differ across languages. The VOT of Sindhi aspirated stops, for example, is only 50 ms and that of Navajo is 150 ms, whereas English initial [p] in particular is around 50-60 ms (Ibid). I borrow Carrol's (2015) hypothesis that language exposure has a remarkable influence on learning outcomes. Her claims, however, need to be re-examined in other bilingual pairings and research contexts, given the complexities of the two language systems and environments that the bilingual children in my data may experience. Referring to the importance of language environment, Abutalebi and Clahsen (2017) discuss two canonical accounts of how much language children can learn by modelling exposure patterns, namely Skinner's (1957) Behaviorist and Chomsky's (1959) Usage-Based accounts; the latter is rooted in the idea that children process linguistic rules from the language they hear in their surroundings.
The nature of input has attracted considerable attention from researchers. According to De Houwer (2011), language input environments, including parental language, age of first regular exposure, input frequency, and interaction strategy, determine individual differences in bilinguals' two languages. She builds this argument on a large-scale survey of 3,390 bilingual children and a more in-depth study of 31 bilingual families. Similarly, Hauser-Grüdl, Arencibia Guerra, Witzmann, Leray, and Müller (2010) find that parental contact-variety input plays a role in cross-linguistic influence, meaning that the outcomes of cross-linguistic interaction in bilinguals' repertoires are governed to a considerable degree by the language patterns spoken in the children's closest circles.
Furthermore, Place and Hoff (2015) sought additional evidence on three quality indicators that shape bilingual language: the amount of input from native speakers, the number of different speakers providing input, and the frequency of language mixing. To examine the role of these three indicators, they observed 90 thirty-month-old Spanish-English bilinguals using the Language Diary method. Their findings suggest that the positive quality indicators might be the amount of native-speaker input and the number of speakers providing it, whereas the negative indicator was the frequency of language mixing. I take Place and Hoff's (2015) critical role of native input as my point of departure. The native input in my study is, however, distinctive: it does not come from native speakers of English in a so-called primary environment, with whom the children in my data could speak interactively, but from YouTube videos, a so-called secondary native environment, which I elaborate further in the Method section.
Scholars have used different terms for children's interactions with the environment. Some use exposure and experience in quite different contexts, while others use input and exposure interchangeably (Carrol, 2015). In a deliberately restricted way, I use the term input to refer to the wider concept of language input environment proposed by De Houwer (2011). She conceptualizes it as covering a number of different aspects of what children hear in a language, including the number of utterances, the length of time, and the way languages are used among parents, which she believes to be the most essential environmental factor in children's bilingual acquisition (Ibid).
Extensive work has also been devoted to investigating VOT acquisition by bilingual children. Most attention has been given to pairs of typologically related languages, such as German-Spanish, Spanish-English, and Dutch-English (see Kehoe, Lleó, & Rakow, 2004; Fabiano-Smith & Bunta, 2012; Balukas & Koops, 2015; Liman, 2013; Schmid, Gilbers, & Nota, 2014). Kehoe, Lleó, and Rakow (2004) compare the production of word-initial stop VOT by four German-Spanish early bilingual children and three early German monolingual children. Their findings suggest three patterns of VOT development: (1) delay in the phonetic realization of voicing, (2) transfer of voicing features, and (3) no cross-language influence in the phonetic realization of voicing. In addition, Fabiano-Smith and Bunta (2012) examine the VOT of /p/ and /k/ in syllable-initial position produced by eight Spanish monolinguals, eight English monolinguals, and eight Spanish-English bilingual children. Using non-parametric statistical analyses, they find that (1) monolingual and bilingual children acquire different VOT values in English but similar values in Spanish, (2) bilingual children produce no difference between their English and Spanish VOTs, and (3) English and Spanish monolinguals produce significantly different VOT values in their respective languages. These two studies illustrate that cross-linguistic influence may or may not occur, which calls for further exploration of the conditions under which it can be predicted to occur.
Less attention has unfortunately been paid to typologically unrelated bilingual pairings. Lee and Iverson (2012) investigate Korean-English bilingual sound production to study whether these children establish distinct categories of speech sounds across languages. Measuring the VOT of word-initial stops produced by thirty Korean-English bilinguals, thirty Korean monolinguals, and thirty English monolinguals aged 5 and 10, the researchers report several findings: (1) bilingual children produced longer VOTs in Korean and shorter VOTs in English compared to their monolingual peers, (2) the ten-year-old bilinguals distinguished all stop categories using both VOT and vowel-onset f0, whereas the five-year-olds tended to make stop distinctions based on VOT but not vowel-onset f0, and (3) bilingual children at around five years of age do not have fully separate stop systems, and the systems continue to evolve during the developmental period. Responding to the lack of studies on unrelated-language pairings, and on the assumption that bilingual complexities are most likely to surface within a pair of unrelated languages, I look at the L2 phonological production of Indonesian-English bilingual children, taking as my standpoint how these two languages interact. Narrowing down from the whole sound structure of a language, I focus on the VOT of the English voiceless stop consonants /p/, /t/, and /k/.
My central concern is the high probability that Indonesian-English bilinguals undergo such cross-linguistic influence, given that the two languages do not share the same phonological properties. The acquisition of the L2 English phonological system is therefore potentially determined by the type of L2 input, framed here primarily as native versus non-native input. In this, I follow De Houwer (2011), Hauser-Grüdl et al. (2010), and Place and Hoff (2015) in assuming that input plays a major part in either strengthening or lessening the effect of cross-linguistic influence.
By conducting a small-scale observation of the production of English /p/, /t/, and /k/ by two groups of bilingual children nurtured with different types of input, I examine how L2 input frequency, including its quantity and quality, can assist bilingual children in acquiring the L2 VOT system. My specific research questions are: (1) how do the children's VOT values of English voiceless stop consonants differ between a native and a non-native input environment? and (2) what are the probable contributing factors in the acquisition of these phonological features? The current study serves as a pilot project for a further large-scale analysis with larger samples of bilingual children and more varied input environments.

METHOD
Participants
The VOT values of the English voiceless stop consonants /p/, /t/, and /k/ were measured in two different groups of participants, categorized according to the type and frequency of L2 input they were exposed to. The first group comprised 4 students (aged 7-8 years) in the 2nd grade of the Primary Laboratory School of State University of Malang, Indonesia. At the time of data collection, they were enrolled in an International Class Program (ICP) class. This is a typical English Partial Immersion Program in which English is used as the medium of instruction in all school subjects except Religion and Civic Education. Through this program, ICP students were immersed in English from various learning sources for 14 hours per week. Despite such intensive and extensive use of English, I considered this group to be receiving non-native input in view of the sociolinguistic environment of the school: the teachers, schoolmates, and staff with whom they interacted are all non-native speakers of English. As I focused on a very specific feature of the English sound system, this linguistic environment could make it significantly challenging for students to achieve target-like VOT values, even though the teacher interview indicated that students received more spoken input (about 70%, mainly videos from the British Council) than written input (about 30%).
The second type of young bilingual speaker is a 6-year-old girl who was nurtured in a Javanese-Indonesian-English speaking family in Malang, East Java, and raised with extensive English exposure from YouTube since she was two years old. She had been watching a variety of children's videos, such as English nursery rhymes, TuTiTu (animated TV shows), Pocoyo Arts and Crafts, Princess Sofia and Disney videos, Play-Doh Arts and Crafts, Minecraft, etc., for about 8 hours in total per day, in addition to communicating interactively in English with her two aunts, who lived with her. The two aunts are multilinguals who speak Javanese and Indonesian as L1 and English as L2. One of them had spent two years in Australia for her master's degree. With this linguistic background in mind, I consider this input environment unique in that the girl absorbed English sounds perceptually mainly through extensive engagement with YouTube videos and developed her productive skills with her aunts. In other words, this girl received secondary native input in a home context, while the other participants received their input at school.
In terms of methodology, an issue of proportionality may arise, as I had only one participant in one group and four in the other. However, including more speakers who obtain input from YouTube in a home context might introduce other methodological problems, such as individual variation during the acquisition process resulting from different patterns of bilingual nurturing.

Data Collection and Analysis
I collected the speech data by recording the participants' production of word-initial voiceless stops /p/, /t/, and /k/ through an object-naming activity. With 14 tokens in hand, I measured the VOT value of each word using Praat, calculated the mean value of each, and compared the two groups. To support my analysis, I interviewed the teachers of the first group and the aunts of the second participant specifically about the input frequency experienced by these participants.
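The averaging step described above can be illustrated with a minimal sketch. It is not the procedure or script used in this study: it assumes that, for each token, the time of the stop release and the time of voicing onset have already been read off the waveform (for example in Praat), and every value, word-to-time pairing, and name in it (TOKENS, vot_ms, mean_vot_by_stop) is hypothetical.

```python
from statistics import mean

# Hypothetical annotations, NOT the study's data. Each token holds:
# (stop, word, release time in seconds, voicing onset time in seconds).
TOKENS = [
    ("p", "pony", 0.100, 0.130),
    ("p", "put", 0.200, 0.240),
    ("t", "table", 0.150, 0.190),
    ("k", "colour", 0.120, 0.145),
]


def vot_ms(release_s, voicing_s):
    """VOT in milliseconds: the interval from stop release to voicing onset."""
    return (voicing_s - release_s) * 1000.0


def mean_vot_by_stop(tokens):
    """Group the tokens by stop consonant and average their VOT values."""
    grouped = {}
    for stop, _word, release_s, voicing_s in tokens:
        grouped.setdefault(stop, []).append(vot_ms(release_s, voicing_s))
    return {stop: mean(values) for stop, values in grouped.items()}


if __name__ == "__main__":
    for stop, value in mean_vot_by_stop(TOKENS).items():
        print(f"/{stop}/ mean VOT: {value:.1f} ms")
```

Running the same function once per group would give per-stop means that can then be compared across the two input environments.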

FINDINGS AND DISCUSSION
As stated above, I aim to (1) measure the VOT values of English voiceless stop consonants produced by Indonesian-English bilingual children across different inputs, and (2) estimate the probable contributing factors in the acquisition of VOTs. I discuss the empirical findings as follows.

Bilingual Children's VOT Values across Different Inputs
The VOT measurements from the speech production of the four participants in Group 1 (children with non-native input) are presented in Table 1 below. Measured against the abovementioned average VOT of native speakers of English (50-60 ms), the values shown in Table 1 are roughly half as long as the natives'. The participants' VOT values for /t/ are the longest compared to /p/ and /k/. To visualize the VOT, Pictures 1 and 2 below illustrate the waveforms of put and princess produced by participants of the first group.

The child with native input (Group 2) can produce VOT values of /p/ and /t/ as long as those of native speakers, and an even longer value for /k/. This native-like acquisition is somewhat surprising given the lack of a primary native environment: her daily communication is conducted with non-native speakers of English. Her major native spoken input is a variety of English conversation on YouTube, which makes her case all the more interesting to study. The essential point I aim to make is that extensive input from videos is still considered limited in the context of language acquisition because the child cannot interact with it; however, the fact that she can produce strikingly native-like VOTs of /p/ and /t/ and a somewhat longer VOT of /k/ has attracted my attention. Pictures 3 and 4 below show waveform samples.

In Carrol's (2015) formulation, one specific claim is: the quantity and quality of exposure to a given language matters regardless of the particular linguistic phenomenon under investigation. A second specific claim is: the quantity and quality of input matter regardless of the age of the learner. A third specific claim is: it is possible to define a threshold for "adequate input" such that when the threshold is not met, children will automatically develop a weak language.
To explain the latter findings, those of the participant with native input, I refer to Carrol's first claim, because what I believe matters significantly for the accurate production of English VOTs is not only how much time (quantity) the children invest in interacting with the language, but also how close the kind of input (quality) is to that of native users of the language.
With regard to bilingual input, Paradis and Genesee (1996) hold that children exposed to two languages simultaneously are assumed to receive less exposure to each language than their monolingual peers. Pearson (2007) argues that an adequate amount of exposure makes children comfortable using the language, which consequently brings more input, which in turn brings children more practice. However, the quantity of input alone cannot determine the complete acquisition of two languages, and this is the scientific motivation for the current study. My study puts forward evidence on how input quantity and quality, viewed through the nature of native and non-native input, can work best in language acquisition and development.
To give a holistic picture of how these two groups differ, Table 2 and Figure 3 summarize the findings. The bilingual children in the two groups show significant differences in the acquisition of VOT: the voiceless stop /k/ has the shortest VOT value in Group 1 (non-native input) but the longest in Group 2 (native input).

Examining the quantity and quality of input, or input frequency in De Houwer's (2011) terminology, is very challenging, especially when it comes to empirical measurement. To assess frequency, we must first assume that the two languages are used neatly in separate domains (Paradis & Genesee, 1996). In the context of my study, the participants with non-native input use English in the classroom for about 14 hours per week, whereas the participant with native input listens to English conversation on YouTube for 8 hours per day, in addition to occasional interactive communication with her aunts at home. In an interview, the aunt reported that watching YouTube videos has had a truly significant impact, mainly on the phonological development of her niece. This comes in addition to a narrow viewing activity: the girl not only chooses the YouTube videos she wants to watch herself, but also chooses only the types of videos she usually watches. She rarely picks different topics. These facts bring further evidence of how "personalized" input helps improve L2 learning, with "watching for enjoyment" as its underlying principle.

Paradis and Genesee (1996) argue that even though it is feasible to interview parents about their children's language choice patterns, the remaining problem is that parents may understand the notion of "language" differently. They may report that their children speak English at home. Yet the question is what kind of English it is, or how much English their children are able to speak. Furthermore, what is uttered phonetically and grammatically in the language can be either a native or a non-native version of it as the result of cross-linguistic influence. This is what I have observed in Group 1: interactive communication with non-native teachers and schoolmates at school will certainly improve their speaking fluency. However, it still seems very demanding for them to acquire target-like English voiceless stop consonants, even though, as revealed in an interview, the teacher indicated that the students receive more spoken input (70%), primarily from British Council videos, and less written input (30%) due to the lack of written resources. The written input, or reading, is given particularly to prepare students for the Cambridge Checkpoint and Progression tests, in which 60% of the test materials are reading texts. In other words, in terms of input quantity and quality, these students may have received enough to develop their English competence. However, the lack of "personalized" and narrow-viewing input is hypothetically the reason for their inability to produce native-like English VOTs. By contrast, immersed in 8 hours of more "personalized" YouTube video watching, the participant with native input is likely to develop better L2 VOT acquisition.
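The difference in input quantity between the two groups can be made concrete with a small calculation. The sketch below takes the two figures reported in this study (14 hours per week at school, 8 hours per day of YouTube viewing) and assumes, hypothetically, that the daily viewing takes place on all seven days of the week, which the interview data do not specify. The constant and function names are illustrative only.

```python
# Figures reported in the study.
SCHOOL_HOURS_PER_WEEK = 14   # Group 1: English use in the ICP classroom
YOUTUBE_HOURS_PER_DAY = 8    # Group 2: English YouTube viewing at home

# Assumption (not stated in the source): viewing happens every day of the week.
ASSUMED_VIEWING_DAYS_PER_WEEK = 7


def weekly_hours(hours_per_day, days_per_week):
    """Convert a daily exposure figure into weekly hours."""
    return hours_per_day * days_per_week


youtube_hours_per_week = weekly_hours(
    YOUTUBE_HOURS_PER_DAY, ASSUMED_VIEWING_DAYS_PER_WEEK
)
exposure_ratio = youtube_hours_per_week / SCHOOL_HOURS_PER_WEEK

if __name__ == "__main__":
    print(f"Group 2 weekly exposure: {youtube_hours_per_week} hours")
    print(f"Ratio to Group 1 weekly exposure: {exposure_ratio:.1f}")
```

Under that assumption the home viewing amounts to 56 hours per week, four times the classroom figure; with fewer viewing days the ratio shrinks accordingly, and the calculation says nothing about input quality.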
It is indeed compelling to suggest that input quantity and quality are major contributing factors in early L2 acquisition. With regard to this assertion, De Houwer (2011) reports the real-life story of an American father of a girl (Lauren) raised bilingually in English and Dutch from birth. With a very limited amount of English exposure (three hours per week), Lauren could only produce "yes" and "no" when she was three years old, which her father took as a rejection of him. He did not consider that the amount of time he spent speaking English to her had a significant impact on her English acquisition and development. Taking this case as an analogy, I argue that the ability to approximate native English VOT values shown by the Group 2 participant is the outcome of the input frequency she has experienced.
Another important consideration is the phonological properties of the languages themselves. I refer to Ladefoged and Johnson (2011) in defining the properties of English /p/, /t/, and /k/: in articulatory terms, these sounds are made using different pairs of primary articulators. The sound /p/ is made with the two lips coming together, /t/ is produced by the tongue tip or blade rising to the alveolar ridge, and /k/ is produced by the back of the tongue rising to touch the soft palate or velum. I assume that the bilingual children in both groups would naturally make use of these articulators when producing the target sounds because these sounds exist in their L1.
However, we may also want to look at the manner of articulation, or how these sounds are produced, which seems to behave differently in the two languages. English /p/, /t/, and /k/ belong to the stop consonants, which are formed by a complete closure of the articulators involved (the upper and lower lips in /p/, the blade of the tongue and the alveolar ridge in /t/, and the back of the tongue and the velum in /k/) so that the airstream cannot pass through the mouth; when the two primary articulators come apart, the airstream is released in a small burst of sound, which is why these sounds are called plosives (Ladefoged and Johnson, 2011). Indonesian /p/, /t/, and /k/ work in a similar manner to those of English, except that the moment of aspiration (a period of silence after the closure is released and before voicing starts for the following vowel) is shorter than in English. Thus, children acquiring the two languages cannot avoid so-called cross-linguistic influence during the developmental stages. Given the different average VOT values of Indonesian and English, imperfect acquisition of L2 VOT is to be expected in the production of English VOT by the participants with non-native input (Group 1). Their inability to approximate English VOT presumably results from L1 influence.
Beyond the individual differences in acquiring English voiceless stop VOTs, I concur with De Houwer's (2016) argument that children's bilingual proficiency levels change continuously along with changes in their input frequencies, linguistic maturity, and practice levels. In the context of my study, the ability of the participant with native input to produce target-like English VOTs and the inability of the participants with non-native input to do so cannot be treated as permanent. They move, shift, and change in response to multiple linguistic and non-linguistic factors in bilingual individuals.

CONCLUSION
My analysis concludes that the VOT values of English voiceless stop consonants differ significantly across different inputs. The participants with non-native input acquire much shorter VOTs of English /p/, /t/, and /k/, averaging 28-36 ms, while the participant with native input achieves native-like VOTs, averaging 69 ms for /p/ and /t/ and even longer than native values for /k/. My further analysis suggests some contributing factors underlying the individual differences in acquiring native-like VOT values, mainly (1) L2 input frequency (the amount and quality of L2 input), with the specific involvement of "personalized" and narrow viewing activity, and (2) the phonological properties of the two languages (the average VOT values of Indonesian and English).
This kind of analysis has practical implications mainly for pedagogy: first, teachers can highlight the different phonological features of L1 Indonesian and L2 English, and second, teachers can provide a wider variety of spoken resources for students to choose from according to their personal preferences.

A similar tendency was indicated in Netelenbos, Li, and Rosen's (2015) study on the acquisition of French stop consonants by English-speaking children enrolled in an early French Immersion Program in Canada. The researchers attempted to see whether English-French linguistic interactions occur during phonological acquisition. Fifty-six bilingual children and 45 English monolingual peers were examined on the basis of word-initial /p/, /t/, /k/, /b/, /d/, and /g/ production. The VOT measurements demonstrate that the English-French bilinguals displayed non-native-like VOTs in the intermediate range between monolingual English voiced and voiceless stops; their English voiceless stops showed higher VOT values than the monolinguals'; and their English and French voiced stops were indistinguishable. Compared to these French-English bilingual data, my Indonesian-English bilingual data show a somewhat comparable finding: the VOT values range from 20 ms to 54 ms (see Figure 1), averaging 28-36 ms. This intermediate range of VOT values is acquired within the context of non-native input, as teachers and schoolmates are all non-native speakers of English, even though these students are situated in an English-speaking environment during school hours.

Picture 1. The waveform of put
Picture 2. The waveform of princess

On the other hand, the observation and measurement of the participant receiving native input show surprisingly significant differences in VOT production, as presented in Figure 2 below.

Figure 1. Mean VOT Value of Participants with Non-Native Inputs

Figure 2. Mean VOT Value of Participant with Native Inputs

Picture 3. The waveform of pony
Picture 4. The waveform of colour

I borrow three hypotheses from Carrol (2015), which she has constructed from a growing literature on bilingual development.

Figure 3. Mean VOT Values in Both Input Environments

Table 1. Mean VOT value in non-native input environment

Table 2. The mean VOT values in both input environments