The Effects of Interlocutor Language Proficiency and Rater Severity on Indonesian Oral Assessment Scores

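In foreign language classrooms that adopt communicative language teaching, oral assessment commonly takes the form of a paired oral exam in which two students converse and raters score the performance with a rubric. However, a student's performance may vary with the partner chosen, and raters who differ in severity may assign different scores when applying the same rubric. Instructors therefore need to consider whether to set criteria for selecting oral exam partners and how to train the teaching assistant team to use the rubric so that the exam is fairer and more objective. This study was conducted in an Indonesian language course offered by the general education center of a national university in Taiwan and used the Rasch model to examine two questions: (1) Can raters who differ in severity achieve consistent oral exam ratings after training? (2) Does a student's choice of partner, either one with a comparable language background (a beginner paired with a beginner) or one with a different language background (a beginner paired with a Chinese Indonesian student), affect the student's oral exam score? The results show that raters did not reach full rating consistency even after training; having several raters share the scoring, removing outliers, and averaging the remaining scores may therefore be a workable stopgap for now. Nevertheless, checking rater severity with many-facet Rasch analysis helps identify problems early. In addition, pairing with a partner of a different language proficiency background did not affect students' oral exam scores. Students should therefore be free to choose their conversation partners, supplemented by incentives that encourage Chinese Indonesian students to pair with beginners more often, to achieve a win-win outcome.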
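The stopgap described above, averaging several raters' scores after removing outliers, can be illustrated with a minimal Python sketch. The abstract does not state which outlier rule was used, so the rule here (drop the single highest and single lowest score) is an assumption, and the function name and sample scores are hypothetical.

```python
from statistics import mean
from typing import Sequence


def trimmed_mean(scores: Sequence[float]) -> float:
    """Average one performance's rater scores after dropping one highest and one lowest.

    Assumed rule for illustration only; falls back to the plain mean
    when fewer than three scores are available.
    """
    if not scores:
        raise ValueError("at least one score is required")
    if len(scores) < 3:
        return mean(scores)
    ordered = sorted(scores)
    # Drop the first (lowest) and last (highest) score, then average the rest.
    return mean(ordered[1:-1])


# Hypothetical rubric totals given by seven raters to the same student pair.
ratings = [18, 19, 19, 20, 20, 21, 12]
print(trimmed_mean(ratings))
```

A rule of this kind limits the influence of one unusually lenient or severe rater, but it does not correct for systematic severity differences, which is what the many-facet Rasch analysis is meant to detect.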

Keywords

many-facet Rasch model; oral assessment; Indonesian; rater severity; conversation partner

Parallel Abstract

The use of pair work in speaking assessment has frequently been adopted as an authentic manner of testing oral proficiency in second-language communicative language classrooms; however, the findings of studies regarding whether interlocutor proficiency influences the outcomes of oral assessment and whether rater training enables long-term interrater reliability have been inconclusive or contradictory. Studies have indicated that if one of a pair of interlocutors exhibits higher proficiency than the other or if the individuals know each other well, they may collaborate to produce more speech and achieve higher performance in oral assessments (Iwashita, 1996; Norton, 2005; Storch, 2001). However, a higher volume of speech is not always associated with higher overall performance scores (Davis, 2009). Other studies (Galaczi, 2008, 2014) have found that weaker language users might be more reluctant to contribute in oral interactions when paired with more proficient interlocutors. Son (2016) reported that Korean students of English as a foreign language spoke less when paired with more proficient interlocutors, although their overall oral performance did not necessarily decrease. The outcomes of oral assessments may also be influenced by the reliability of the ratings of assessors. Rater severity can be identified by applying the many-facet Rasch model (MFRM; Eckes, 2009, 2015). Although rater training can theoretically increase the confidence and consistency of raters (Davis, 2012, 2016; Huang et al., 2016; McNamara, 1996), differences in rater severity often persist after training (Eckes, 2005, 2009, 2015; Knoch, 2011; Sundqvist et al., 2020; Weigle, 1998), and the results of training are not necessarily long-lasting (Bonk & Ockey, 2003; Chang et al., 2011; Kim, 2011; Lan, 2012; Liao, 2016; Lumley & McNamara, 1995).
Because second language assessment generally involves more than one assessor, providing on-the-job rater training is necessary to increase interrater reliability in oral assessments. Therefore, the following must be explored: (1) whether training raters in the use of assessment rubrics increases interrater reliability, and (2) whether test takers perform differently when paired with interlocutors of different proficiency levels. This study investigated oral assessment in two General Education Indonesian language classes conducted at a national university in Taiwan in the fall semesters of 2020 and 2021. The study used Rasch analysis to measure to what extent interlocutor proficiency (Indonesian language learning beginners vs. speakers of Indonesian as a first language) influenced the students' oral performance and to what extent the severity of the Indonesian teaching assistants (TAs) could be identified and controlled for. The 2020 class comprised 44 students (Taiwanese individuals = 26, Chinese Indonesian individuals = 10, individuals of other nationalities = 8; men = 10, women = 34) and 7 Indonesian TAs (TAs from North Sumatra = 4, TAs from Java = 2, TA from Sulawesi = 1; men = 2, women = 5), and the 2021 class comprised 38 students (Taiwanese individuals = 17, Chinese Indonesian individuals = 14, Chinese Malaysian individuals = 4, individuals of other nationalities = 3; men = 18, women = 20) and 8 Indonesian TAs (TAs from North Sumatra = 4, TAs from Java = 4; men = 4, women = 4). The data comprised six oral assessments performed throughout the semester for each class that were scored by the trained TAs according to a rubric containing five categories: content, accuracy, fluency, pronunciation, and interaction. The participants self-assessed their Indonesian language proficiency at the beginning of the semester.
Generally, the Chinese Indonesian and Chinese Malaysian students rated themselves as native speakers of Indonesian and Malay, respectively, whereas the Taiwanese students and those of other nationalities identified themselves as true beginners. The participants selected their partners for the oral exams from among their classmates. The data were analyzed using Facets (Linacre, 2022a) to investigate the oral performance of each student pair, the severity of their assessor, and the difficulty of the criteria in the scoring rubric. The scores were transformed into a logit scale for comparison. Analysis based on the MFRM was used to obtain the following information for interpretation: logit measurements, the information-weighted mean-square fit statistic (infit), the outlier-sensitive mean-square fit statistic (outfit), the separation index, the reliability of the separation index, and chi-square tests for homogeneity. The results were represented using a variable map for each semester, divided into sections for each of the aforementioned three facets. A higher logit value in the three facets indicated, respectively, higher student pair performance in oral exams, more severe rating, and criteria on which high scores were more difficult to obtain. The results indicate that even after training, rater consistency was low. In the 2020 class, Chinese Indonesian students had the highest scores, as expected. Performance ranged widely among the Taiwanese students and those of other nationalities. Among the seven TAs, five provided similar ratings for the midterm oral assessment and two were outliers: one gave excessively high ratings (logit = -2.42) and the other gave excessively low ratings (logit = 1.03). After further training was provided before the final exam, two different TAs provided ratings that were either excessively high (-0.45 logits) or excessively low (0.97 logits); however, rater severity for all seven TAs on the final exam fell within the acceptable range of -1 to 1 logits.
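For orientation, the three facets named above (student pair, rater, and rubric criterion) correspond to the standard rating-scale form of the MFRM shown below. This is the textbook formulation, not a model specification taken from the study, and the symbols are assumptions introduced here for illustration.

```latex
\ln\!\left(\frac{P_{njik}}{P_{nji(k-1)}}\right) = \theta_n - \alpha_j - \delta_i - \tau_k
```

Here $P_{njik}$ is the probability that student pair $n$ receives category $k$ from rater $j$ on criterion $i$, $\theta_n$ is the pair's performance measure, $\alpha_j$ is the rater's severity, $\delta_i$ is the criterion's difficulty, and $\tau_k$ is the threshold between categories $k-1$ and $k$. All terms are expressed in logits, and the signs match the interpretation above: a larger $\theta_n$ raises the chance of a higher category, whereas a larger $\alpha_j$ or $\delta_i$ lowers it.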
The rater variable interacted with the rating criteria. One TA rated accuracy favorably (t = 2.76) but rated interaction severely (t = -2.11). Another rated fluency favorably (t = 2.55) but rated pronunciation severely (t = -4.25). In the 2021 class, although the eight TAs were fully trained to use the rubric consistently, variables beyond our control, especially the interaction between rater and criteria, continued to influence rating consistency. Therefore, using average scores after outliers are removed may be a viable alternative method of grading until a superior solution is identified. Nonetheless, identifying rater severity variability was helpful as a basis for further rater training. Differences in Indonesian proficiency between assessment partners did not influence individual student scores in the oral assessments. The students from the 2020 and 2021 classes were categorized into four groups, LL, LH, HL, and HH (L = true beginner, H = proficient Indonesian/Malay speaker). Their mean scores were analyzed using Kruskal-Wallis tests. We first investigated whether beginners paired with proficient speakers (LH) scored higher than did those paired with other beginners (LL). However, the scores of these groups did not differ significantly. Next, we determined whether proficient speakers paired with beginners (HL) would score lower than did those paired with other proficient speakers (HH). The scores of these groups also did not differ significantly. Our results support the findings of Davis (2009) and Son (2016). We found no evidence that interlocutor proficiency positively or negatively affected the students' oral performance. However, based on the comprehensive analysis of students' feedback on the oral examination method, the students seemed to prefer to select partners and remain in their partnerships throughout the semester.
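The group comparisons described above (LH vs. LL and HL vs. HH) can be illustrated with a minimal Python sketch of the Kruskal-Wallis test. It uses only the standard library; the scores are hypothetical, the function names are introduced here, and the p-value helper assumes a two-group comparison (chi-square approximation with one degree of freedom), which is only approximate for small groups. The study's own software and settings are not stated in the abstract.

```python
import math
from typing import Sequence


def kruskal_wallis_h(*groups: Sequence[float]) -> float:
    """Return the tie-corrected Kruskal-Wallis H statistic for two or more groups."""
    if len(groups) < 2 or any(len(g) == 0 for g in groups):
        raise ValueError("need at least two non-empty groups")
    pooled = sorted(x for g in groups for x in g)
    n_total = len(pooled)

    # Assign each distinct value the average of the 1-based ranks it occupies.
    rank_of = {}
    tie_term = 0
    start = 0
    while start < n_total:
        end = start
        while end < n_total and pooled[end] == pooled[start]:
            end += 1
        count = end - start
        rank_of[pooled[start]] = (start + 1 + end) / 2
        tie_term += count**3 - count
        start = end

    correction = 1 - tie_term / (n_total**3 - n_total)
    if correction == 0:
        raise ValueError("all observations are identical")

    # Sum of (rank sum squared / group size) over groups.
    weighted = sum(sum(rank_of[x] for x in g) ** 2 / len(g) for g in groups)
    h = 12 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)
    return h / correction


def p_value_two_groups(h: float) -> float:
    """Chi-square (df = 1) upper-tail probability; valid for two groups only."""
    return math.erfc(math.sqrt(max(h, 0.0) / 2))


# Hypothetical mean oral exam scores for beginners in LL and LH pairings.
ll_scores = [14, 15, 16, 17, 15, 18]
lh_scores = [15, 16, 17, 16, 18, 14]
h_stat = kruskal_wallis_h(ll_scores, lh_scores)
print(h_stat, p_value_two_groups(h_stat))
```

In practice, a statistics library such as SciPy (`scipy.stats.kruskal`) would normally be used instead of a hand-written routine; the version above is meant only to show what the test computes from pooled ranks.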
Because they were allowed to prepare their scripts and practice their oral exams before the exams, the students developed a sense of solidarity and camaraderie with their partners. The amount of speech they produced did not appear to be influenced by differences in interlocutor proficiency. The students were also tolerant of mistakes made by their partners and exhibited patience. Thus, allowing students to choose their own partners and encouraging local students to pair with Chinese Indonesian students would increase their intercultural experiences. The research site had two unique features that may not be present in other second language classrooms. One was team instruction conducted by a linguist and seven or eight TAs. The other was the presence of a considerable number of proficient speakers of Indonesian/Malay as students attending class with true beginners. Nonetheless, these features make this multiyear case study a valuable source of information.

Parallel Keywords

many-facet Rasch model; oral assessment; Indonesian; rater severity; interlocutor proficiency
