Asynchronous online peer judgments of intelligibility: simple task, complex factors

Pronunciation learners can benefit from peer feedback in a Computer-mediated Communication (CMC) environment that allows them to notice segmentals and suprasegmentals. This paper explores the intelligibility judgments of same-L1 peers using P-Check (Version 2, https://ver2.jp), a Learning Management System (LMS) plug-in that aggregates peer feedback on local intelligibility (Munro & Derwing, 2015). P-Check randomly delivers written prompts for learners to record. Recordings are randomly delivered to peers, who choose from a drop-down menu which utterance was perceived. Aggregated judgments from peers and from the instructor are displayed to learners as feedback on intelligibility. This study used eight segmental contrasts: /b-v/, /s-θ/, /l-ɹ/, /l-ɹ/-clusters, /æ-ʌ/, /ɑ-ʌ/, /ɑ-oʊ/, and /i-ɪ/. Participants (N=38) made 3,451 intelligibility judgments on 1,203 recordings. The effects of rater listening discrimination proficiency and of utterance intelligibility were examined in six contrasts using Generalized Estimating Equations (GEE). Results showed that intelligibility was generally a significant predictor of judgment accuracy, but rater listening discrimination proficiency was not.


Background
In pronunciation learning, the effectiveness of same-L1 peer feedback in English as a foreign language environments has yet to be fully explored. There is some evidence that same-L1 learners can benefit from peer pronunciation feedback, especially in asynchronous CMC environments that allow repeated listening to recorded speech, giving learners time to notice pronunciation features (Correa & Grim, 2014; Gilakjani, Ahmadi, & Ahmadi, 2011). However, feedback from same-L1 learners may be problematic in task-based communication due to learners converging on a shared non-standard pronunciation. Walker (2005) recommends preventing convergence through highly-controlled activities. Thus, the present study directs participants' feedback to selected features by having them make a forced-choice judgment of local intelligibility (Munro & Derwing, 2015).
Pronunciation instruction would benefit from a better understanding of what factors underlie the accuracy of these same-L1 peer judgments of intelligibility. This exploratory study focuses on two aspects: the stimulus and the learner. The research questions are: (1) to what extent does the accuracy of local intelligibility judgments by same-L1 peers vary depending on the targeted phoneme and utterance accuracy, and (2) to what extent does it depend on rater listening discrimination ability?

Participants
The 38 participants (17 male, 21 female) in this convenience sample were Japanese university students enrolled in an elective first-year practical English phonetics course who provided their informed consent following Teaching English as a Second Language (TESOL) standards.

Classroom environment
The language targets were eight segmental contrasts that are difficult for this learner population: /b-v/, /s-θ/, /l-ɹ/, /l-ɹ/-clusters, /æ-ʌ/, /ɑ-ʌ/, /ɑ-oʊ/, and /i-ɪ/. Materials consisted of 47 pairs of two-line contrastive conversations with L1 glosses; the first lines of each conversation pair differed in a single target phoneme. After receiving focused instruction on the targeted phoneme, participants did individual online listening discrimination and pronunciation practice. Due to time constraints, this practice was not completed for the /s-θ/ or /ɑ-oʊ/ contrasts.
The learning sequence included a listening discrimination pre-test, shadowing, listening discrimination practice, visual input, pronunciation practice, and choral repetition of the contrastive conversations to familiarize participants with their meaning and pronunciation. Finally, participants engaged in online peer judgments of intelligibility for approximately 15 minutes per contrast.

Software
The peer judgments were conducted using P-Check, a plug-in for Glexa, a proprietary LMS that has been used by more than 100,000 students in over 1,000 university courses throughout Japan. P-Check randomly presented the first line of one of the two conversations onscreen for the learner to record. Recordings were randomly delivered to peers who selected the appropriate second line of the conversation from a drop-down menu. After recordings received four judgments, they were taken out of circulation. A Native-Speaker (NS) rater also used P-Check to judge the intelligibility of all recordings. Peer and NS rater feedback for each recording was displayed to individual participants.
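The delivery rule described above can be illustrated with a minimal sketch. This is not P-Check's actual code; the data structure, field names, and helper function are assumptions for illustration only:

```python
import random

MAX_JUDGMENTS = 4  # recordings are retired after receiving four peer judgments

def next_recording(recordings, rater_id, rng=random):
    """Pick a random still-circulating recording not produced by the rater."""
    pool = [r for r in recordings
            if r["speaker"] != rater_id and r["judgments"] < MAX_JUDGMENTS]
    return rng.choice(pool) if pool else None

recordings = [
    {"id": 1, "speaker": "s1", "judgments": 4},  # already retired
    {"id": 2, "speaker": "s2", "judgments": 1},  # still circulating
]
print(next_recording(recordings, rater_id="s2"))  # → None (own recording excluded)
print(next_recording(recordings, rater_id="s3"))  # → the circulating recording
```

The retirement threshold bounds how much listening any one recording requires of the class while still yielding multiple independent judgments per utterance.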

Data collection and analysis
Data were gathered from the P-Check database. Data consisted of participants' intelligibility judgments (n=3,451), which were compared to the NS rater's judgments of the 1,215 recordings produced by participants.
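The comparison amounts to scoring each peer judgment as accurate when it matches the NS rater's judgment of the same recording. A minimal sketch, with hypothetical recording IDs and choices standing in for the database records:

```python
# NS rater's forced choice per recording (hypothetical placeholder data)
ns_judgment = {"rec1": "A", "rec2": "B"}

# peer judgments as (recording, peer's choice) pairs
peer_judgments = [("rec1", "A"), ("rec1", "B"),
                  ("rec2", "B"), ("rec2", "B")]

# a judgment is accurate if it agrees with the NS rater on that recording
accurate = sum(choice == ns_judgment[rec] for rec, choice in peer_judgments)
print(f"{accurate}/{len(peer_judgments)} accurate "
      f"({accurate / len(peer_judgments):.0%})")
# → 3/4 accurate (75%)
```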
In further analysis, the relative effects of utterance intelligibility and rater listening discrimination proficiency were modeled for the six contrasts listed in Table 3. GEE, which produces a population-average model, was used because it can accommodate repeated categorical outcomes while accounting for differing numbers of outcomes per participant.
The model included one centered covariate, listening discrimination proficiency, and one factor, intelligibility of the utterance. The events-in-trials outcome variable was accurate/inaccurate judgment of intelligibility. The models used a binomial distribution with a logit link function, exchangeable correlation structures, and robust standard errors (Heck, Thomas, & Tabata, 2012). Models were run separately for each contrast, and those with the lowest QICC, a criterion used in model selection, were chosen. An alpha level of .05 was used for all statistical tests.

Initial results and discussion
Research Question 1 asks to what extent the accuracy of local intelligibility judgments varies depending on the targeted phoneme and utterance accuracy. Table 1 suggests that participants were more likely to make accurate judgments when the utterance was intelligible (to the NS rater) than when it was not. Of all judgments, 21.3% occurred when peers were unable to recognize an intelligible phoneme; these may be related to low listening discrimination ability. Only 12.7% of judgments involved participants judging an unintelligible phoneme to be intelligible; these judgments may involve using knowledge of the L1 phonology. Table 2 indicates substantial variation in mean intelligibility among the contrasts. The least intelligible contrasts were /s-θ/ and /ɑ-oʊ/, which had been taught but not fully practiced, and /l-ɹ/-clusters. The /æ-ʌ/ and /ɑ-ʌ/ contrasts were the most intelligible and were judged most accurately.
Research Question 2 asks to what extent the accuracy of judgments depends on rater listening discrimination ability. This was measured for six contrasts by the listening discrimination pre-test at the beginning of the teaching sequence. Participants heard 20 pairs of words, half minimal pairs (e.g. lake, rake) and half tokens of the same word (e.g. lake1, lake2), and marked each pair same or different. Table 3 indicates that participants discriminated /æ-ʌ/ best and /l-ɹ/-clusters least well.
Results of the GEE analysis were as follows. For the /b-v/ and /l-ɹ/-clusters contrasts, neither intelligibility nor listening discrimination was a significant predictor of judgment accuracy. For the remaining contrasts, parameter estimates showed that unintelligible utterances received accurate intelligibility judgments at a significantly lower rate than intelligible utterances (/l-ɹ/: Wald χ2(1)=6.054, p=.014, β=-.584; /i-ɪ/: Wald χ2(1)=12.388, p<.001, β=-.949; /æ-ʌ/: Wald χ2(1)=11.158, p=.001, β=-.928; /ɑ-ʌ/: Wald χ2(1)=69.707, p<.001, β=-2.979).
For /ɑ-ʌ/, listening discrimination was also a predictor of judgment accuracy (Wald χ2(1)=5.888, p=.015, β=.141).

Conclusion
Intelligibility was a significant predictor of judgment accuracy, except for /b-v/ and /l-ɹ/-clusters. Closer examination reveals further variation even within some contrasts. For example, /b/ had 75% judgment accuracy while /v/, a phoneme not in the participants' L1 inventory, had only 56% accuracy. A different pattern was seen for /l/ and /ɹ/, both of which were judged with 61% mean accuracy. However, the /l-ɹ/ contrast showed strong variation at the item level, with accuracy ranging from 25% (long) to 91% (lamp).
Although a detailed analysis is beyond the scope of this paper, it is clear that intelligibility judgments were highly sensitive to target variability. Unexpectedly, listening discrimination ability was found to predict judgment accuracy only for one contrast, possibly indicating that a more robust measure of this covariate is needed.