Successful non-native speech perception is linked to frequency following response phase consistency

Some people who attempt to learn a second language in adulthood meet with greater success than others. The causes driving these individual differences in second language learning skill continue to be debated. In particular, it remains controversial whether robust auditory perception can provide an advantage for non-native speech perception. Here, we tested English speech perception in native Japanese speakers through the use of frequency following responses, the evoked gamma band response, and behavioral measurements. Participants whose neural responses featured less timing jitter from trial to trial performed better on perception of English consonants than participants with more variable neural timing. Moreover, this neural metric predicted consonant perception to a greater extent than did age of arrival and length of residence in the UK, and neural jitter predicted independent variance in consonant perception after these demographic variables were accounted for. Thus, difficulties with auditory perception may be one source of problems learning second languages in adulthood.

Here we examine the link between non-native speech sound perception and auditory processing in Japanese adults learning English as a second language using frequency-following responses (FFRs), an electrophysiological response which reproduces the frequencies present in the evoking sound and reflects early auditory processing in the brainstem and cortex (Coffey, Herholz, Chepesiuk, Baillet, & Zatorre, 2016). The FFR features high test-retest reliability (Hornickel, Knowles, & Kraus, 2012) and reflects neural origins in the brainstem and cortex (Coffey et al., 2016), making it an excellent measure of the robustness of early auditory processing. The precision of FFRs has been linked to individual differences in the development of language skills in children (Hornickel & Kraus, 2013; White-Schwoch et al., 2015), but it remains unknown how FFR precision relates to second language acquisition. Recently, Krizman, Marian, Shook, Skoe, and Kraus (2012) reported that bilingual FFRs more robustly encoded the fundamental frequency (F0) of synthesized speech. Here, therefore, we predicted that non-native speech perception ability would relate to F0 phase-locking. Given that impaired gamma-rate phase-locking has also been shown to characterize children with language impairment (Heim, Friedman, Keil, & Benasich, 2011), we additionally investigated relationships between gamma phase-locking and non-native speech perception.

Participants
Participants were 25 native Japanese speakers [13 female, aged 19 to 35 (M = 29.3, SD = 4.5)] with English learning experience at secondary school level or above in Japan. Participants were required to have arrived in the UK after the age of 18 and to have been resident there for at least 1 month at the time of testing. Secondary inclusion criteria included normal audiometric thresholds (25 dB HL for octaves from 250 to 8000 Hz) and lack of diagnosis of a language impairment. Participants received a mean (SD) score of 7.6 (4.1) on the Musical Experience portion of the Goldsmiths Musical Sophistication Index (Müllensiefen, Gingras, Stewart, & Musil, 2014), indicating low levels of musical training. Mean age of arrival in the UK was 27.8 (4.9) years, and mean duration of residence in the UK was 2.6 (3.1) years. The Ethics Committee in the Department of Psychological Sciences at Birkbeck, University of London approved all experimental procedures. Informed consent was obtained from all participants. Participants were compensated £14 for their participation.

Behavioral measures
English speech perception was tested using the Receptive Phonology Test (Slevc & Miyake, 2006). Each question in this test is designed to assess a phonological contrast in English with which Japanese subjects have difficulty. The test contains three main sections. In the word sub-test, participants see a list of 26 word pairs which differ in a single speech sound (e.g., "late/rate"). Participants then hear a list of words and are asked to indicate which of the two words they heard. In the sentence sub-test, participants see a list of 25 sentences, with one of the words replaced with a word differing in a single speech sound (e.g., "My sister loves to play with crowns/clowns."). Participants then hear a list of sentences and are asked to circle the word that they heard. Finally, participants listen to a short story and are given a written version of the story that includes 42 underlined words. Participants are asked to circle any of the underlined words that are mispronounced.
Because the original version of the Receptive Phonology Test featured a speaker of American English, test materials were re-recorded by a native speaker of British English (Received Pronunciation) in a soundproof room with a RODE NT1-A Condenser Microphone. Three of the items from the original test were removed, as they feature speech sound contrasts which do not exist in British Received Pronunciation. Audio recordings were presented to participants using Sennheiser HD 25-1-II headphones. See Table 1.

Recording parameters
During electrophysiological testing, participants sat in a comfortable chair in a soundproof booth with negligible ambient noise and read a book of their choice. Stimuli were presented through Etymotic earphones in alternating polarity at 80 ± 1 dB SPL to both ears with an inter-onset interval of 251 msec. A total of 6300 trials were collected for each stimulus, and stimuli were presented in blocks (i.e., all [ra] trials were collected in a single block). Electrophysiological data were recorded in LabView 2.0 (National Instruments, Austin, TX) using a BioSEMI Active2 system via the ActiABR module with a sample rate of 16,384 Hz and an online bandpass filter (100–3000 Hz, 20 dB/decade). The active electrode was placed at Cz, the grounding electrodes CMS and DRL were placed on the forehead at FP1 and FP2, and the reference electrodes were placed on the earlobes. Earlobe references were not electrically linked during data collection. Offset voltage for all electrodes was kept below 50 mV.

Data reduction
Electrophysiological data reduction was conducted in Matlab R2016a. Offline amplification was applied in the frequency domain for 3 decades below 100 Hz with a 20 dB rolloff per decade. The data were organized into epochs from 40 msec before to 210 msec after the onset of the stimulus and baseline corrected. To guard against contamination by electrical noise, a second-order IIR notch filter with a Q-factor of 100 was used with center frequencies of 50, 150, 250, 350, 450, and 550 Hz. A bandpass filter (0.1–2000 Hz, 12 dB/oct) was then applied to the continuous EEG recording, and epochs exceeding ± 100 mV were rejected as artifacts. The first 2,500 artifact-free responses to each stimulus polarity were then selected for further analysis.
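The epoch-level steps described above (baseline correction, amplitude-based artifact rejection, and selection of the first artifact-free responses for each polarity) can be sketched roughly as follows. This is a minimal illustration in Python, not the Matlab code used in the study. The function names and the list-based data layout are hypothetical, as is the assumption that the amplitudes are expressed in the same units as the rejection threshold.

```python
def baseline_correct(epoch, n_baseline):
    """Subtract the mean of the first n_baseline (pre-stimulus) samples."""
    baseline = sum(epoch[:n_baseline]) / n_baseline
    return [v - baseline for v in epoch]


def select_clean_epochs(epochs, polarities, threshold=100.0, per_polarity=2500):
    """Keep the first `per_polarity` artifact-free epochs for each polarity.

    An epoch counts as an artifact if any sample exceeds +/- threshold
    (threshold is assumed to be in the same units as the data).
    """
    kept = {}
    for epoch, polarity in zip(epochs, polarities):
        bucket = kept.setdefault(polarity, [])
        if len(bucket) >= per_polarity:
            continue  # already have enough responses for this polarity
        if max(abs(v) for v in epoch) > threshold:
            continue  # reject as artifact
        bucket.append(epoch)
    return kept
```

Epochs are processed in recording order, so the function keeps the earliest clean responses rather than a random subset, which matches the selection rule stated in the text.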

Data analysis (>70 Hz)
To investigate the precision of neural sound encoding we calculated inter-trial phase-locking. This measure involves calculating the phase consistency at a particular frequency across trials and, therefore, no averaging is necessary. This procedure provides information similar to spectral analysis of average waveforms, but with a higher signal-to-noise ratio and less susceptibility to artifact (Zhu, Bharadwaj, Xia, & Shinn-Cunningham, 2013). All electrophysiological data analysis was conducted in Matlab 2016a. Parameters for FFR analysis were used for frequencies >70 Hz, in accordance with the standards of previous research on speech FFRs (Bidelman & Krishnan, 2009; Parbery-Clark, Skoe, & Kraus, 2009). For FFR analysis (>70 Hz), phase-locking was calculated within 40-ms windows that were applied repeatedly across the epoch with a 1 msec step size. First, for each trial, a Hanning-windowed fast Fourier transform was calculated. Second, for each frequency, the resulting vector was transformed into a unit vector. Third, all of the unit vectors were averaged. The length of the resulting vector, ranging from 0 (no phase consistency) to 1 (perfect phase consistency), was then calculated as a measure of cross-trial phase consistency. Phase-locking factors for [la] and [ra] were averaged together to form a global estimate of an individual's inter-trial phase-locking. These time-frequency data were then averaged in the following manner. First, data were collapsed across the entire response (10–170 msec). Phase-locking at the fundamental frequency (100 Hz) and the second through sixth harmonics was measured by extracting the maximum phase-locking value in a 40-Hz bin centered on each frequency. (Harmonics above 600 Hz were not consistently represented in every single participant and were therefore excluded.) Phase-locking at the harmonics was averaged together to form a general measurement of harmonic encoding. In addition, phase-locking was measured separately in the response to the consonant (10–80 msec) and the response to the vowel (80–170 msec).
Cortex 93 (2017) 146–154
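The three-step phase-locking computation described above (windowed Fourier transform, normalization to unit vectors, length of the mean vector) can be sketched as follows for a single analysis window and a single frequency. This is an illustrative Python sketch under stated assumptions, not the analysis code used in the study. The function names are hypothetical, and a direct single-frequency Fourier sum is used in place of a full FFT for brevity.

```python
import cmath
import math


def hann_window(n):
    """Symmetric Hann (Hanning) window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def phase_locking(trials, freq_hz, fs_hz):
    """Inter-trial phase-locking at freq_hz for one analysis window.

    trials: list of equal-length sample lists (one window per trial).
    Returns a value from 0 (no phase consistency) to 1 (perfect consistency).
    """
    n = len(trials[0])
    window = hann_window(n)
    total = 0j
    for trial in trials:
        # Step 1: Hann-windowed Fourier coefficient at the frequency of interest
        coef = sum(
            w * x * cmath.exp(-2j * math.pi * freq_hz * i / fs_hz)
            for i, (w, x) in enumerate(zip(window, trial))
        )
        # Step 2: normalize to a unit vector (keep phase, discard amplitude)
        if abs(coef) > 0:
            total += coef / abs(coef)
    # Step 3: average the unit vectors and take the length of the result
    return abs(total / len(trials))


def peak_in_bin(plv_by_freq, center_hz, width_hz=40.0):
    """Maximum phase-locking value within a bin centered on center_hz."""
    half = width_hz / 2
    return max(v for f, v in plv_by_freq.items() if abs(f - center_hz) <= half)
```

To approximate the sliding analysis in the text, `phase_locking` would be applied to successive 40-ms windows stepped by 1 msec, and `peak_in_bin` would then be applied to the collapsed values at 100 Hz and at each harmonic.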

Data analysis (<70 Hz)
For lower-frequency analysis (<70 Hz), phase-locking was calculated within 80-ms windows with a 1 msec step size. Visual inspection of the cross-subject average (see Fig. 2) revealed an increase in phase-locking over baseline between 0 and 60 msec. Gamma phase-locking was quantified, therefore, as the average phase-locking within a window reaching from 0 to 60 msec and between 30 and 70 Hz.
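The gamma measure described above is an average over a rectangular region of the time-frequency map. A minimal Python sketch follows; the function name, the nested-list layout of the map, and the use of inclusive window edges are assumptions made for illustration.

```python
def window_mean(tf, times_ms, freqs_hz, t_range=(0.0, 60.0), f_range=(30.0, 70.0)):
    """Mean of a time-frequency map inside a time x frequency window.

    tf[i][j] is the phase-locking value at times_ms[i] and freqs_hz[j].
    Window edges are treated as inclusive (an assumption).
    """
    values = [
        tf[i][j]
        for i, t in enumerate(times_ms)
        for j, f in enumerate(freqs_hz)
        if t_range[0] <= t <= t_range[1] and f_range[0] <= f <= f_range[1]
    ]
    return sum(values) / len(values)
```

The default ranges mirror the 0 to 60 msec and 30 to 70 Hz limits given in the text.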

Statistical analyses
Linear models of the behavioral and neural data were constructed using the lm() function with the software package 'R', and model comparisons were performed with the anova() function. For comparisons of correlations that shared one variable in common (Steiger, 1980), the r.test() function in the 'psych' package from 'R' was used.
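For readers unfamiliar with comparisons of correlations that share a variable, the following Python sketch computes Williams's t statistic in the form given by Steiger (1980). It is offered only as an illustration. Whether r.test() computes exactly this form is an assumption here, and the function name is hypothetical.

```python
import math


def williams_t(r12, r13, r23, n):
    """Williams's t for comparing dependent correlations r12 and r13.

    Variable 1 is shared; r23 is the correlation between the other two.
    Returns (t, degrees_of_freedom), with df = n - 3.
    """
    # Determinant of the 3 x 3 correlation matrix
    det = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    r_bar = (r12 + r13) / 2
    denom = 2 * ((n - 1) / (n - 3)) * det + r_bar**2 * (1 - r23) ** 3
    t = (r12 - r13) * math.sqrt((n - 1) * (1 + r23) / denom)
    return t, n - 3
```

With 25 participants, a comparison of this kind would have 22 degrees of freedom.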

Results
First we tested whether the ability to discriminate English consonants was related to our neural measures. Better performance (greater proportion correct items) on the consonant discrimination items of the [...] The correlation between phase-locking at F0 and consonant perception was significantly greater than the correlation with vowel perception (T = 2.76, p = .011); similarly, the correlation between gamma phase-locking and consonant performance was significantly greater than the correlation with vowel perception (T = 2.95, p = .007). The correlation between consonant perception and phase-locking at F0 was significantly greater than the correlation with phase-locking at the higher harmonics (T = 2.81, p = .01). Fig. 2 displays phase-locking for the cortical evoked response and FFR across all subjects. Fig. 3 displays cortical and FFR phase-locking for good and poor perceivers of English consonants (top-bottom split). Fig. 4 is a scatterplot displaying FFR phase-locking and cortical phase-locking versus consonant perception performance.
One possible explanation for this relationship between English speech perception and F0 phase-locking is that greater familiarity with English speech leads to enhanced encoding of neural responses to English speech sounds. If so, one would expect the relationship between English consonant perception and F0 phase-locking to be limited to the response to the consonant, which did not overlap with any Japanese speech sound. On the other hand, if our results reflect a more general relationship between precise auditory encoding and non-native speech perception, then English consonant perception should also relate to F0 phase-locking in the response to the vowel, which contained formant frequencies appropriate for a Japanese [a] (Nishi, Strange, Akahane-Yamada, Kubo, & Trent-Brown, 2008). We found that F0 phase-locking in the response to the consonant (10–80 msec) correlated with performance on consonant items (R² = .426, p = .001). F0 phase-locking in the response to the vowel (80–170 msec) also correlated with performance on consonant items (R² = .260, p = .009). Moreover, the relationship between consonant perception and F0 phase-locking did not significantly differ between these two portions of the response (T = .97, p = .34).
To further test whether confounding effects of language experience could explain our results, "Age Arrived in UK" and "Years in UK" were used to assess the extent of participants' experience with English. "Years in UK" was cube root-transformed to bring its distribution closer to normality (Shapiro–Wilk W = .89, p > .01 after transformation). Subjects who were older when they arrived in the UK made more consonant errors, although the correlation was only marginally significant [...]. To assess whether our neural measures predicted variance in phonological competence that could not be simply explained by experience, we fit two linear models: one with age of arrival in the UK and years of residence in the UK predicting consonant performance (the "Experience Only" model), and another which also included the consistency of the neural response (F0 phase-locking; the "Experience plus Neural" model). The two predictors in the Experience Only model together accounted for 25% of the variance in consonant performance. The Experience plus Neural model with F0 phase-locking as a predictor performed significantly better than the Experience Only model [F(1,21) = 5.43, p = .030], with the F0 phase-locking predictor accounting for an additional 15% of the variance in consonant performance. Including gamma phase-locking as an additional predictor only accounted for an additional 1.5% of the variance, and this reduction in error was not significant (p = .50).
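As a rough consistency check on the model comparison, the F statistic for a nested comparison can be recomputed from the rounded variance figures in the text. The sketch below is illustrative only: the function name is hypothetical, and it assumes that the 25% and additional 15% figures are R² values from ordinary least-squares models fit to all 25 participants.

```python
def nested_f(r2_reduced, r2_full, n, k_full, k_added=1):
    """F statistic for adding k_added predictors to a nested linear model.

    k_full counts the predictors in the full model (intercept excluded).
    Returns (F, df_numerator, df_denominator).
    """
    df_den = n - k_full - 1
    f = ((r2_full - r2_reduced) / k_added) / ((1 - r2_full) / df_den)
    return f, k_added, df_den


# Rounded values from the text: 25% for Experience Only, plus 15% for F0.
f_value, df1, df2 = nested_f(r2_reduced=0.25, r2_full=0.40, n=25, k_full=3)
```

Under these assumptions the degrees of freedom come out as (1, 21), matching the reported test, and the F value is about 5.25. That is close to the reported 5.43, and the small difference plausibly reflects rounding of the percentages.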
Finally, to investigate links between individual differences in low-frequency and high-frequency phase-locking, we compared phase-locking in the gamma band to phase-locking in the FFR at F0 and the harmonics. Gamma phase-locking was correlated with phase-locking at both F0 (R² = .31, p = .004) and the harmonics (R² = .17, p = .039).

Discussion
Here we examined English speech perception and neural sound encoding in twenty-five native speakers of Japanese who moved to the United Kingdom as adults. We found that English consonant perception was linked to the degree of phase-locking to the fundamental frequency of the frequency-following response (FFR) to sound and to phase-locking within the gamma band. Vowel perception, however, did not relate to neural phase-locking. The relationship between these neural metrics and English speech perception ability remained significant even after time in the UK and age of arrival were controlled for. That FFR phase-locking relates to second language speech perception suggests that difficulties with auditory perception can interfere with the acquisition of non-native speech sound categories. On the other hand, we found that non-native vowel perception was not linked to FFR phase-locking, suggesting that vowel perception may depend less on the precision of auditory processing. These findings support previous behavioral research demonstrating relationships between non-native speech perception and auditory abilities including amplitude envelope discrimination (Kempe et al., 2012), frequency discrimination (Lengeris & Hazan, 2010), and spectral discrimination (Kempe et al., 2015). However, language learning is a complex process, and there are likely many ways in which foreign language learning can be disrupted. Only a portion of children with reading impairment, for example, display problems with auditory perception (Ramus et al., 2003), and the causes of adult language learning difficulty are likely to be similarly heterogeneous. FFR phase-locking may be a useful metric to help identify people whose difficulties with non-native language perception stem from auditory impairments.
These findings support and extend previous work demonstrating links between the precision of neural sound encoding, language skill, and language experience. Krizman, Slater, Skoe, Marian, and Kraus (2015), for example, found that in Spanish-English bilinguals, degree of bilingual experience was linked to the strength of fundamental frequency (F0) encoding in the FFR. Here we replicate this relationship in native speakers of Japanese learning English as a second language, and extend this finding by showing that this same neural metric can also explain individual differences in non-native speech perception, even after language experience is accounted for. Hornickel and Kraus (2013) demonstrated that the inter-trial consistency of the FFR is linked to individual differences in language skills in school-age children; here we show that precise neural encoding of sound is linked to successful adult language learning as well. Chandrasekaran, Kraus, and Wong (2012) showed that the robustness of FFR pitch encoding can predict subsequent short-term learning of lexical tones; here we show that FFR phase-locking is linked to long-term language learning of non-tonal speech sounds.
What is the mechanism underlying this relationship between FFR phase-locking and non-native speech perception ability? One possibility is that FFR phase-locking reflects the precision of temporal perception. FFR phase-locking has been linked to the ability to precisely synchronize movements with sound onsets (Tierney & Kraus, 2013; Woodruff Carr, Tierney, White-Schwoch, & Kraus, 2016). This suggests that precise tracking of sound timing relies upon consistent auditory neural timing, as synchronization places stringent demands upon the precision of auditory time perception (on the order of a few milliseconds; Repp, 2000). The ability to track sound timing is also vital for speech perception, as the temporal information contained in the speech envelope contains information relevant to speech sound discrimination (Rosen, 1992); in fact, discrimination of speech sounds is possible even if spectral information is greatly reduced (Shannon, Zeng, Kamath, Wygonski, & Ekelid, 1995). Moreover, non-native speech perception may rely more upon temporal information than does native speech perception. For example, Japanese adults have a strong bias towards the use of temporal information such as closure duration and formant transition duration when distinguishing [la] and [ra], whereas native English speakers rely more heavily upon the frequency of the third formant (Iverson et al., 2005).
We replicate the finding of Krizman et al. (2012) that F0 encoding in the FFR is related to degree of bilingual experience but encoding of the harmonics is not. Moreover, we show that phase-locking at the F0 but not the harmonics is also linked to non-native speech perception ability. The specificity of this relationship was predicted based on these previous findings, but the underlying mechanism remains unclear. One possibility is that this result reflects a relationship between non-native speech perception ability and cortical auditory encoding. There is strong evidence that frequency-following responses at 250 Hz and above are generated within the auditory brainstem, as cooling the inferior colliculus in cats abolishes the scalp-recorded FFR (Smith, Marsh, & Brown, 1975) and patients with inferior colliculus lesions do not display an FFR (Sohmer, Pratt, & Kinarti, 1977). However, both of these studies included no stimuli below 250 Hz, and recent work has suggested that the FFR at 100 Hz is generated within multiple sources, including both cortical and subcortical regions (Coffey et al., 2016). Thus, the higher frequencies of the FFR may reflect a greater contribution from more peripheral areas such as the inferior colliculus, as generally the upper limit of phase-locking to sound is lower in more central structures (Joris, Schreiner, & Rees, 2004). Our finding of a relationship between non-native speech perception ability and phase-locking within both the low-frequency FFR and the gamma band, therefore, may indicate that learning a second language in adulthood relies upon precise cortical but not subcortical auditory processing. This hypothesis cannot be properly evaluated by the current study; however, it could be tested by future work examining FFR phase-locking and non-native speech perception using MEG.
Previous work (Heim et al., 2011; Nagarajan et al., 1999) has demonstrated that children with language learning difficulties have less phase-locked gamma band onset responses to sounds presented with a short inter-stimulus interval (ISI). Here we find that degree of gamma phase-locking is linked to non-native speech perception. Given that our stimuli were presented with a short ISI, this could reflect an impaired ability to process rapidly presented sounds on the part of the participants who struggled to learn to perceive English. Future work could examine this hypothesis by examining links between non-native speech perception and gamma phase-locking to stimuli presented at different ISIs. This enhanced gamma phase-locking in participants better able to perceive English may also reflect greater recruitment of speech processing resources in response to synthesized English speech sounds in these participants, as gamma phase-locking has been shown to be greater for speech stimuli as compared to non-speech stimuli (Palva et al., 2002). This would be consistent with fMRI evidence showing that subjects who are better at learning novel speech sounds display more STG activity when passively listening to speech sounds (Archila-Suerte, Bunta, & Hernandez, 2016). Finally, gamma phase-locking has also been hypothesized to be an important component of speech perception in multi-time resolution models (Poeppel, Idsardi, & van Wassenhove, 2008), in which phonetic information is carried within the gamma band and prosodic information is carried within the delta and theta bands. Greater gamma phase-locking in the participants who were better able to perceive English speech may, therefore, indicate more precise neural encoding of the timing of the speech envelope. This interpretation is supported by our finding that gamma phase-locking was correlated with FFR phase-locking.
One limitation of this work is that it is difficult to rule out the possibility that the link between neural sound encoding and non-native speech perceptual ability is driven by experiential factors. Time spent in the United Kingdom, for example, was linked to both F0 phase-locking and English perception, a relationship which is likely contributing to the link between F0 phase-locking and speech perception performance. However, the relationship between neural sound encoding and non-native speech perception held even after time in the UK and age of arrival were controlled for, suggesting that this relationship partially reflects the dependence of successful non-native language learning on auditory skills. Moreover, the relationship between non-native speech perception and F0 phase-locking held both for the neural response to the consonant, which did not overlap with any Japanese speech sound category, and the response to the vowel, which contained formant frequencies similar to those of the Japanese [a] (Nishi et al., 2008). Nevertheless, in a retrospective study it is difficult to account for all possible confounding experiential factors. This limitation could be addressed in future work in which participants are tested prior to beginning study of a foreign language for the first time or through the use of very short-term training paradigms (Lim & Holt, 2011).