Adaptation and assessment of a public speaking rating scale

Abstract
Prominent spoken language assessments such as the Oral Proficiency Interview and the Test of Spoken English have been primarily concerned with speaking ability as it relates to conversation. This paper looks at an additional aspect of spoken language ability, namely public speaking. This study used an adapted form of a public speaking rating scale originally designed for English as a foreign language (EFL) contexts. This paper seeks to evaluate the relationship between this EFL-based scale and assessment within a core US university public speaking course. The relevance of EFL assessments to those used in English medium courses provides insight as to whether language learners are being evaluated on similar public speaking constructs to their English speaking peers, and informs instruction aimed at preparing students for English medium courses. A sample of undergraduate students (N = 44), primarily native speakers of English, performed classroom speeches in an introductory public speaking course and were rated using the adapted public speaking rating scale. The rating scores, independent instructor evaluations, and written feedback from two raters were analyzed to determine reliability and validity, and to inform revisions of the scale. The rating scale was found to be reliable and valid for this population. Limitations of the scale and proposed expansions are discussed as the result of qualitative analysis of the data.


Background
The ACTFL Oral Proficiency Interview (OPI) and the TOEFL Test of Spoken English (TSE) are used broadly in academia and professional disciplines to assess foreign language speaking and EFL speaking, respectively. Public speaking is one specific aspect of academic speaking ability that is neglected by such general speaking tests as the OPI and TSE. Although demands for public speaking are often placed on language learners in academic contexts (as in communication, business, and liberal studies classes), there is not yet enough research on this type of task and appropriate measures of assessment for language learners. Because public speaking is required in many academic contexts, an assessment that reflects the public speaking expectations of mainstream classrooms would benefit these students.
Public speaking is a complex task, requiring an individual to analyze, plan, and produce language in a performance context. According to Turner and Upshur (2002), performance tasks which require extended speaking are among the greatest improvements in second language assessment. Rather than maintaining a strict focus on accuracy, performance assessments allow the evaluator to measure the actual use of language in context. Performance assessment refers to both the assessment task (public speech) and the scoring method used (Khattri, Reeve, & Kane, 1998). This type of assessment has gained increasing support in spoken language testing as researchers strive to increase authenticity and transparency in testing.
The development of an EFL public speaking course and associated rating scale was an important first step in creating an assessment that addresses public speaking specifically. Yamashiro (2002) investigated the connection between second language study and public speaking, and tested the reliability and validity of a rating scale to be used in EFL contexts. The study emphasized assessment measures that correspond to the expectations and methods of a particular academic context. The public speaking rating scale was developed by Yamashiro and Johnson (1997) as part of a curriculum project, and was used to rate the classroom speeches of Japanese university students (N = 170) from seven classes attending two different universities. Reliability and validity were assessed using structural equation modeling, analyzing three traits (non-verbal, verbal, and purpose) and three rating methods (teacher, peer, and self) with a 14-point public speaking rating scale. Yamashiro found the scale to be reliable and valid in this context. These findings are an important step in meeting the demands of EFL public speaking instruction and providing a bridge between assessment and language teaching by validating a test for a specific classroom purpose. It is important, however, to research ways of improving the accuracy of assessment for specific tasks and evaluating the usefulness of L2 assessments. Yamashiro's study does not provide a connection between EFL assessment in public speaking and the expectations of mainstream public speaking courses within the United States. This connection is an important question for many EFL/ESL students planning to transition to credit-bearing university courses.
In order to address this question, the scale used by Yamashiro (2002) has been adapted for the current investigation to examine the relationship between an assessment designed for second/foreign language speaking classrooms and evaluation used within mainstream public speaking classrooms. The purpose of this study is to (a) explore the constructs underlying the assessment of public speaking and (b) evaluate the extent to which Yamashiro's adapted public speaking scale adequately measures the language abilities emphasized in a university public speaking context. Specifically, this study addressed the following research questions:
(1) What is the relationship between scores on the adapted public speaking rating scale and grades assigned by a public speaking instructor?
(2) To what extent does the adapted rating scale address the concerns of a public speaking instructor and experienced raters of public speaking?

Participants
Participants in this study were undergraduate students enrolled in a university in the southwestern United States. There were 44 students from two intact public speaking classes. The two classes were taught by the same teacher, a professor in the Communication Department with over 40 years of teaching experience. The classes met twice a week for 15 weeks during a fall term. There were 14 males and 30 females ranging in age from 18 to 44 with a mean age of 21 and a standard deviation of 4.7 years. Sixteen of the students were communication majors, with the remainder from a variety of other majors (English, Liberal Studies, Math, Parks and Recreation, Social Work, Latin American Studies, Hotel and Restaurant Management, Psychology, Sociology, History, Construction Management, and Education). Two of the participants spoke languages other than English as their first language (Spanish and German), and 11 students had studied other languages in addition to English (Spanish, French, Latin, Japanese, German, and American Sign Language). There were 22 students in each of the two classes. The majority of students reported that they were taking the class to fulfill a requirement (n = 39) and most had little or no speech experience (n = 37). There were two raters, both graduate students in an English program at the same university. Both of the raters had 3 years of experience in teaching and evaluating student public speaking at the university level.

Instruments
The prompt used for this task was taken from the course textbook (Osborn & Osborn, 2002, p. 378) and consisted of an outline of major elements expected in the speaking assignment. It was reviewed during regular class time as part of a 3-week instructional unit. Students also received a copy of the instructor's rubric (see Appendix A), which contained a breakdown of presentation elements required at percentage grade intervals and space for instructor comments. This feedback sheet was developed by the instructor and had been used in previous semesters of teaching public speaking. The rating scale used in this study was adapted from the 14-point EFL public speaking rating scale used by Yamashiro (2002). The original 14-point scale was altered for this study by removing three points addressing non-verbal behavior. The rationale for this decision was that the focus of the current study was on verbal presentation skills only, and that raters would not be able to address nonverbal behavior because they based their evaluations on audio-recorded speeches.
The 11-point public speaking rating scale was previously piloted with 15 pre-recorded speeches to assess inter-rater reliability. Reliability refers to the consistency of measurement (Bachman & Palmer, 1996) and is here determined by the relationship between the scores assigned by two raters on the same speech samples. Inter-rater reliability was calculated using Cronbach's α (Bailey, 1998) on each point of the rating scale. Raters participated in a short training session in order to familiarize themselves with the rating scale and optimize consistent usage. This session included reading over and discussing the rating scale descriptors to ensure that the categories made sense and that the descriptors were relevant and transparent. Audio recordings of three practice speeches were then rated and any differences were discussed, resulting in small changes to the wording of the rating scale descriptors (see Appendix B for rating scale and final descriptors). Each rater then evaluated the 15 pilot speeches independently. Analysis of the scores showed an overall inter-rater reliability of α = .93 (see Table 1), indicating that the public speaking rating scale was used reliably in this public speaking context.
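The reliability calculation described above can be illustrated with a minimal sketch, assuming the standard Cronbach's α formula with the two raters treated as the "items" and the speeches as the cases. The function name `cronbach_alpha` and the scores below are hypothetical and are not the pilot data; the study computed α for each point of the scale, so the sketch shows the calculation for a single scale point.

```python
from statistics import variance


def cronbach_alpha(ratings):
    """Cronbach's alpha for k raters who scored the same n speeches.

    ratings: list of k lists, each holding one rater's scores
             (same speech order in every list).
    """
    k = len(ratings)
    if k < 2:
        raise ValueError("need at least two raters")
    n = len(ratings[0])
    if n < 2 or any(len(r) != n for r in ratings):
        raise ValueError("each rater needs the same number (>= 2) of scores")

    # Summed score per speech across raters.
    totals = [sum(scores) for scores in zip(*ratings)]
    total_var = variance(totals)
    if total_var == 0:
        raise ValueError("summed scores have zero variance")

    # Sum of each rater's own score variance.
    rater_var_sum = sum(variance(r) for r in ratings)
    return (k / (k - 1)) * (1 - rater_var_sum / total_var)


# Hypothetical 1-5 scores from two raters on one scale point (not study data).
rater_a = [3, 4, 5, 2, 4, 3]
rater_b = [3, 4, 4, 2, 5, 3]
alpha = cronbach_alpha([rater_a, rater_b])
print(f"alpha = {alpha:.2f}")
```

Values closer to 1 indicate that the two raters ordered the speeches more consistently on that scale point.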

The speaking task
Data were collected from two sections of an introductory public speaking class. The researcher attended each class informally 4 times prior to the study so that the additional presence would not cause unnecessary anxiety. Study procedures and questionnaires were approved by the Institutional Review Board through an expedited review, and students filled out demographic questionnaires and consent forms explaining the general purpose of the study. The students had performed two speeches in their class earlier in the semester and were accustomed to the speeches being videotaped for self-evaluation. For this reason, the use of a recording device was not considered invasive and none of the students appeared to be concerned or inhibited by its presence. The students delivered one 4-6 minute persuasive speech on a topic of their choice. Each speech considered in this analysis was audio-recorded during performance assessments as scheduled by the instructor in the class syllabus. Data were collected over 6 class periods.

Instructor and rater evaluations
The researcher audio-recorded the student speeches and collected a copy of the instructor's graded evaluations and written comments for each presentation. The two raters used the 11-point public speaking rating scale to score the recorded speeches. In addition to the rating scale scores for each speech, comments were collected from the two raters to inform issues of validity. Raters were asked to provide written feedback on the speech that they thought would enhance the evaluation, note any problems or limitations encountered while using the scale for this population, and comment on any salient features that helped distinguish ability level among the speech performances.
The instructor and one rater also participated in a workshop replicating scale development methods used in North and Schneider (1998). In this workshop the instructor and rater were asked to listen to two of the pilot speeches and discuss which was better and why. The goal of this session was to identify language used to describe aspects of public speaking, in order to compare these descriptors to the rating scale categories. The discussions were transcribed as notes by the researcher. Rating sheets, feedback, and any additional notes made by the raters and the researcher were gathered throughout the assessment process.

Correlational analysis
Each of the 44 audio-recorded speech samples was rated using the adapted public speaking rating scale. Validity corresponds to the degree that the inferences made from the assessment scores reflect the construct or skill of interest, in this case public speaking proficiency. In order to assess the validity of the rating scale, the rating for each speech was compared with the instructor's percentage grade evaluation of each performance. The 5-point scale was treated as interval data and a Pearson correlation was used.
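As an illustration of this comparison, the following is a minimal sketch of a Pearson product-moment correlation between rating scale totals and instructor percentage grades. The function name `pearson_r` and the sample values are hypothetical and are not the study's data.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (>= 2)")

    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")

    return sxy / sqrt(sxx * syy)


# Hypothetical values: scale totals (out of 55) and instructor grades (%).
scale_totals = [38, 42, 47, 33, 51, 45]
instructor_grades = [78, 85, 88, 72, 95, 84]
print(f"r = {pearson_r(scale_totals, instructor_grades):.2f}")
```

Treating the scale scores as interval data, as the study does, is what makes this coefficient an appropriate choice; a rank-based coefficient would be the usual alternative if that assumption were not made.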

Qualitative analysis
A qualitative review of the rater comments, teacher feedback, and research notes was conducted in order to establish an empirical (data-driven) revision of the rating scale for future use. Each of the comments on the instructor evaluation sheets, the written notes made by each rater, and the research notes from the instructor-rater workshop were compared to the rating scale descriptors to discover aspects of public speaking not considered in the rating scale. In order to systematically evaluate the feedback, comments made by the raters reflecting problems with scale use were separated from those directed at the speech performance, and feedback on non-verbal behavior was removed from the analysis. Next, each instructor and rater comment was independently compared to the 11-point public speaking rating scale descriptors in order to determine whether the comment was captured by one of the descriptors, or not addressed in the 11-point scale. Those comments that were addressed in the rating scale were coded by descriptor heading. A list was generated of all comments unrelated to the scale descriptors. Following scale-making procedures similar to those used in previous studies of rating scale development for specific tasks (Turner, 2000; Turner & Upshur, 2002; Upshur & Turner, 1999), these comments were placed on cards and raters were asked to place the cards in piles under categories identified in the communication literature (see Appendix C), as well as an additional pile for comments that did not fit into one of the categories. This sorting process elicited five prominent categories not previously included in the scale: credibility, distracting language, effective vocabulary use, text organization, and audience adaptation. Conversational presentation received only one comment and was not included in further analysis.
The rater comments that reflected problems with scale use, previously set aside, were then discussed by the two raters who sorted eight out of nine of these comments into one of the five categories. One comment, addressing memorization, did not fit into these categories. Based on these instructor and rater comments, revisions to the adapted public speaking rating scale were proposed.

Research question 1: What is the relationship between scores on the adapted public speaking rating scale and grades assigned by a public speaking instructor?
The initial rating scale score obtained for each speech sample was correlated with the percentage grades assigned by the instructor (r = .77, n = 44, p < .01) in order to address research question 1 (see Table 2 for descriptive statistics). This correlation coefficient indicates an agreement between the instrument and the classroom assessment measure (Linn & Gronlund, 2000). The instrument can be considered valid in this context. The results indicate that the two forms of assessment are measuring similar constructs.

Research question 2: To what extent does the adapted rating scale address the concerns of a public speaking instructor and experienced raters of public speaking?
In order to address research question 2, the researcher compared the written comments provided by the instructor and the feedback from the raters with the rating scale and descriptors, sorting comments that were covered in the rating scale from those that were not (see Table 3).
The feedback from both the instructor and the raters was based on the student samples, and can be considered an empirically based contribution to the revision of the rating scale. There were a number of comments found in the qualitative evaluations of the speech samples that were inadequately represented in the rating scale. These comments were sorted, eliciting five prominent categories (see Table 4), which encompassed 102/104 (98%) of the total speech comments that were not sufficiently reflected in the rating scale, as indicated by the teacher and rater feedback.
Feedback related to credibility (36%) included authority to speak on a chosen topic (e.g., a bilingual student speaking on bilingual education) and the quality and quantity of sources or evidence to support claims. The use of distracting language (hedges, tag questions, vocalized pauses) was not included anywhere on the rating scale and was mentioned by both raters and the instructor (23.5%) as having a distracting or detrimental influence on the overall effectiveness of a presentation. Feedback from the instructor and raters included reference to vocabulary choice, such as vivid language, rather than appropriate language use (18.6%); these types of language use could not be summarized appropriately in the existing vocabulary or language use scales. Textual organization (15.7%) was discussed as going beyond the description of points and examples to support a thesis (under "Content" in the rating scale descriptors) to "using an effective organizational format," "presenting the information in a logical way," and "providing continuity throughout the speech." Raters found it difficult to evaluate audience adaptation (5.9%) because they viewed it as an independent construct and it was defined as part of several other skills in the rating scale. Non-verbal behavior was intentionally removed from the rating scale in order to focus on verbal behavior, but one rater noted that memorization (a point under "Non-verbal behavior") was closely tied to fluency and that lack of memorization was clear in the audio tapes. This remark was supported by instructor comments on non-verbal behavior removed from the analysis. A partial list of the feedback placed under each category is found in Appendix D.
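The category percentages above can be checked for internal consistency with a short sketch. It assumes that each percentage was computed out of the 102 categorized comments; the per-category counts it produces are inferred by rounding and are not figures reported in the text. The names `reported_percentages` and `implied_counts` are hypothetical.

```python
# Percentages as reported in the text for the five categories.
reported_percentages = {
    "credibility": 36.0,
    "distracting language": 23.5,
    "effective vocabulary use": 18.6,
    "text organization": 15.7,
    "audience adaptation": 5.9,
}

# Assumption: percentages are shares of the 102 categorized comments.
CATEGORIZED_COMMENTS = 102


def implied_counts(percentages, total):
    """Nearest whole-number comment count implied by each percentage."""
    return {name: round(pct * total / 100) for name, pct in percentages.items()}


counts = implied_counts(reported_percentages, CATEGORIZED_COMMENTS)
for name, count in counts.items():
    print(f"{name}: about {count} comments")
print("sum of implied counts:", sum(counts.values()))
```

Under that assumption the implied counts add up to the 102 categorized comments, which suggests the reported percentages are mutually consistent.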

Discussion
The insights gained through the instructor and rater feedback were crucial to the understanding of an appropriate measure of public speaking. Although the adapted 11-point public speaking rating scale was positively correlated with instructor grades, and could be considered a valid measure of public speaking in this university context, the final analysis revealed some limitations. Instructor and rater feedback indicated specific expansions to the rating scale and descriptors that would more accurately capture public speaking proficiency. These suggestions included: a measure of credibility, delineation between vocabulary choice and accuracy, a measure of distracting language, textual organization, and the addition of audience adaptation as a separate construct. This brings into focus the question of how characteristics of a rating scale should be identified. Several authors have supported data-based criteria for rating scale development (North & Schneider, 1998; Turner, 2000), and the present analysis also supports these findings. Although there was a great deal of overlap between the adapted public speaking rating scale, originally designed for an EFL context, and the instructor's evaluation sheet, the comparison revealed a number of areas missing or in need of clarification in the scale. This is particularly relevant because it illuminates a gap between the foreign language assessment measure and mainstream public speaking classroom expectations. The feedback from the raters provided suggestions for improvement of the public speaking rating scale.
Despite their absence in the adapted EFL public speaking rating scale, many of the proposed expansions are supported in the communication studies literature. Personal and source credibility is discussed with respect to speaking in both public and interpersonal contexts (Burrell & Koper, 1998; McCroskey & Mehrley, 1969; Montgomery, 2001; O'Keefe, 2002; Perloff, 2003), and was perceived by the instructor and raters as a discriminating factor of public speaking ability in this study. Personal credibility is vaguely addressed under topic choice, but should be made more explicit in the future, and the inclusion of quality as a descriptor of source information would improve the evaluation of content. The adapted public speaking rating scale utilizes a positive format for evaluation, rating the presence of desirable attributes highly and their absence poorly. This appears to neglect the importance of distracting or negative language behavior, such as nonfluencies. It is interesting to note that several researchers (Burgoon, Birk, & Pfau, 1990; McCroskey & Mehrley, 1969) have connected nonfluencies in delivery to a perceived lack of credibility or trustworthiness in speaking. The instructor and rater comments indicate that a measure of distracting language would be beneficial to the revision of the scale, as the presence of this type of language was believed to be detrimental to public speaking ability. This could be achieved through a reversal of the usual rating scale points so that a 5 would indicate the absence of nonfluencies or other distracting speech, and a 1 would indicate that the presence of distracting speech made the presentation difficult to understand or follow. Distracting speech was a salient factor for raters when listening to the speech samples, as indicated by the frequency of comments.
Teacher and rater comments also suggest that the definition of vocabulary use needs to be broadened to include not only appropriate word choice, but also engaging word choice. The present scale asks raters to assess vocabulary in relation to the appropriate word choice and clarity for the audience level. An important aspect of public speaking, as revealed by the instructor and rater comments, is the use of vivid language, and the use of intense language has been proposed as a crucial element in advanced language use by a number of researchers (Chimombo & Roseberry, 1998; Hamilton & Hunter, 1999; Sandell, 1977). This suggests delineation between appropriate vocabulary and optimally effective language choice. Gass and Seiter (1999) introduce specific lexical choices that fit into categories described as God terms, Devil terms, and charismatic terms. God terms describe words that a group or culture looks up to (e.g., family values). This term has been used in a number of political campaigns including George W. Bush's election campaign. Devil terms refer to words that the audience almost universally looks down upon. Nazi is an example of a term that usually fits this category. Charismatic terms are words that strike a positive affective chord in the audience. The word freedom has come to be known as one of these terms, at least in the western world. These terms tend to have a positive effect on the audience and increase the feeling of inclusion or common ground. An improvement to the understanding of vocabulary assessment in the rating scale would include an accessible description of both appropriate and effective lexical choice.
Another suggested expansion to the rating scale is the inclusion of textual organization in the description of content. The structure and order of arguments in public speaking was noted in the raters' feedback and is also discussed in the communication literature (O'Keefe, 1997; Struckman-Johnson & Struckman-Johnson, 1996). The organization of arguments within a speech is vital to its overall effectiveness. For example, Allen (1991) compared the persuasiveness of one- and two-sided arguments using meta-analysis, and found that two-sided arguments that mentioned the opposing positions and refuted them were more persuasive than either one-sided arguments or two-sided arguments that simply stated the opposing view. These studies show that the perceived success of public speech is highly reliant on the types of arguments provided and their organizational structure, as well as the linguistic choices made during speech. The adapted rating scale describes the presence of points or arguments to support a thesis but makes no mention of their order of presentation. Organization was evaluated explicitly by the instructor and commented on by the raters. Although it was not included in the adapted scale, raters questioned whether or not an evaluation of organization was inherently part of their rating of content. A revision of the scale should include a specific statement assessing effective organization in the speech.
Finally, audience adaptation has been considered a critical aspect of public speaking. Topic choice, language use, and delivery are all elements of public speaking that rely heavily on audience-related factors. The ability to adapt or style shift in a particular context is crucial to successful public speaking (Bell, 1984; Ladegaard, 1995). Audience adaptation was recognized and included in the original rating scale, described within projection, pace, diction, topic choice, and vocabulary, but feedback from the raters indicated that the inclusion of audience adaptation across the descriptors was confusing and overly general. Adaptation as an independent measure would clarify and improve the assessment of public speaking.
Taken together, these five areas describe the gaps found between the public speaking rating scale, originally designed for use in an EFL context, and written comments provided by a public speaking instructor and two trained raters. This process of quantitative and qualitative assessment of a public speaking rating scale was found to be valuable to the understanding of elements that contribute to public speaking success. Although the rating scale was shown to be reliable and relatively valid in both the present study, using primarily native speakers, and an EFL population (Yamashiro, 2002), the insights gained through instructor and rater feedback in the present study were constructive and contribute to an increased understanding of appropriate measures of public speaking in academic contexts.

Conclusions
The context-specific nature of classroom-based research in the present study does not allow us to make generalized claims regarding foreign language speech pedagogy. The definitive relationship between public speaking instruction in first and second language classrooms remains to be seen. Instead, this research hopes to broaden the understanding of public speaking assessment as it is applied in context, and work toward a more comprehensive evaluation of student public speech. This paper contributes to knowledge of the relationship between a rating scale developed for the foreign language classroom and evaluations of public speaking in a mainstream university classroom. Investigations of the degree to which foreign language public speaking assessments measure the same constructs as mainstream university evaluations of public speech play an important role in our understanding of how best to prepare international students for credit-bearing university classes, as many courses across the curriculum require some degree of public speaking. An informal survey of instructors from various departments at the host university showed that 88% of respondents (n = 25) describe using oral communication activities, such as presentations, in their classes, and the general catalog showed that 27% of undergraduate degree requirements include at least one course which emphasizes speaking. Because the most commonly used assessments of oral language proficiency, such as the OPI and TSE, focus attention on conversation and interview skills, more assessments which emphasize public speaking skills are needed for academic purposes. Public speaking is a fundamental part of academic life in the United States, and international students should be prepared for this type of task. This study takes an important step in understanding the public speaking expectations of students in mainstream university classrooms.
Future research should include assessment of revisions to the public speaking rating scale using performance samples. It would be useful to pilot the rating scale in both mainstream and second/foreign language classrooms and study any differences or limitations that arise during implementation. As revisions are made, the scale should be reassessed for reliability and validity in the specified contexts. The development of adequate assessment measures for speaking ability in university contexts continues to be an ongoing process.

Descriptors for 11-Point Public Speaking Rating Scale (Total: _____/55)

Voice Control

Point 1 Projection
Speaking not too loudly or too softly. Should be loud enough for all members in the audience to hear clearly. Should be a little louder than normal conversational voice. Projection should be varied at times of emphasis.
[...] being made. No variation in pitch for emphasis.
1 Speaker could not be understood without great effort by all members of the audience.

Point 4 Diction
All words/phrases should be clearly spoken so that the audience can easily hear/understand the speaker's points.
5 Speaker speaks clearly so that all members of the audience can easily hear. All key words were clear. Accent and syllable stress were used effectively during the presentation.
4 Speaker speaks smoothly enough, but had a few key words that were unclear. Speaker tried to use accent and syllable stress, but was awkward.
3 Speaker generally speaks clearly enough but sometimes may have accent or syllable stress. Overall meaning can be understood, but some key words were unclear.
2 Speaker generally did not speak clearly enough for easy listening. It was possible with effort to hear the points being made. Many key words were not understood.
1 Speaker could not be understood without great effort by all members of the audience.

Point 5 Introduction
Speakers should have an attention-getting device, thesis statement, and a sentence of method.

Point 7 Conclusion
Speakers should restate the thesis statement/summarize body and make a closing statement.
5 Conclusion has the following two parts: restatement of the thesis or a summary of the body and a closing statement. Both parts are effective.
4 Conclusion has the following two parts: restatement of the thesis or a summary of the body and a closing statement. One part is weak.
3 Conclusion may have the two parts: restatement of the thesis or a summary of the body and a closing statement. Both parts are weak.
2 Conclusion may be missing one of the two parts: restatement of the thesis or a summary of the body and a closing statement.
1 Conclusion is not complete or effective.

Point 8 Topic Choice
Speakers should pick topics that will be interesting for the audience. Speakers should also be interested and knowledgeable in the chosen topic. Narrow the topic to fit the assignment (prepare enough examples/support).