A Study on the Standard Setting, Validity and Reliability of the Standardized Patient Performance Rating Scale – Student Version

Background: A standardized patient is a healthy person trained to role-play a patient and to give feedback on the performance of the student. During the clinical skills training of students, the performance of a standardized patient is essential for the effectiveness of clinical education. Methods: In this study, we developed and determined the psychometric properties and standard-setting of the nine-item "Standardized Patient Performance Rating Scale – Student Version," which is designed for assessing the individual performance of standardized patients. For the scale development process, 702 medical students and seven educators participated in the study. Exploratory and confirmatory factor analyses were performed. Cronbach's alpha internal consistency coefficient and split-half reliability coefficient were calculated. For the standard-setting study, the extended Angoff method was used.


Introduction
The changes in healthcare service delivery, developments in the measurement and evaluation processes in education sciences, and humanistic and ethical education principles have led to a search for new ways of teaching clinical skills. Consequently, the use of a standardized patient (SP), a method of learning through simulation, has become mainstream. Using SPs in health education was first implemented in the 1960s.

When developing such standard scales, it is crucial to set cut-off scores. Standard setting is the methodology for defining achievement and proficiency levels and for identifying cut-off scores corresponding to those levels [13]. If the cut-off scores are not appropriately set, the results of the assessment could be questioned. For that reason, standard-setting is a critical component of the test development process [14].
Consequently, developing a valid and reliable scale that students can use for evaluating SPs' performance in an educational setting, and implementing a standard-setting study for this scale, will make a significant contribution both to the literature and to the standardized evaluation of SP performance. This study aimed to develop the "Standardized Patient Performance Rating Scale – Student Version" (SPS-S), determine its psychometric properties, and implement a standard-setting study.

Study Design
The model for this study was methodological research, in which the validity and reliability of the developed SPS-S were examined and a standard-setting study was implemented. The study was approved by the Ethics Committee of Ankara University School of Medicine.

Participants
In this study, there were two groups of participants: Sample 1 was used for scale development, and Sample 2 was used for the standard-setting study of the scale.
Sample 1: For the scale development process, 702 second- and third-year medical students enrolled in a public university participated. These students had previous SP encounter experience, and rater training was performed before the SP encounters. The sample size was decided based on the assumptions of the statistical methods. Exploratory Factor Analysis (EFA) and Confirmatory Factor Analysis (CFA) were computed to test the construct validity of SPS-S. These are multivariate statistical techniques that require large sample sizes. According to Comrey & Lee, a sample size of 200 is fair, and a sample size of 300 is suitable for EFA [15].
Moreover, at least 300 cases are needed with low communalities, a small number of factors, and just three or four indicators per factor [16]. As EFA and CFA should be conducted on two different groups selected from the same population, we divided the participants into two groups: one for EFA (N=307) and the other for CFA (N=395).
Sample 2: The standard-setting study of the scale was carried out with a test-centered approach, and expert opinions were collected in this context. Experts who had at least five years of experience in using and training SPs were selected by criterion sampling and maximum variation sampling methods. In this study, the variation factors were being an SP trainer, using SPs in student training, and working in different departments (Table 1).

Development of Data Collection Tools
As the research is partly a scale development study, the development stages of the scale and information on the "standard-setting form" are explained in this section.

SPS-S Development Steps
The scale development process consisted of seven steps, which included a literature review, conducting interviews, synthesis of the literature review and interviews, developing items, consulting expert validation, preliminary application, and pilot testing [17].
Literature Review: At this stage, the psychological attributes to be measured were defined, and a literature review was performed to determine which characteristics should be present in the SP. During this stage, research focusing on the development of measurement tools for SP performance evaluation was investigated [10,11].
Conducting Interviews: The interviews were conducted with seven tutors, 50 students, and two field experts who had participated in SP practices. They were asked what they considered to be the main attributes of good and poor SP performance. They focused on different performance characteristics, such as persuasiveness in the role, successful portrayal, adherence to the scenario, and giving effective feedback. Both written and verbal answers were collected from these interviews.
Synthesis of the Literature Review and Interviews: The data from the literature review and the interviews were evaluated together. The scope and content of the measurement tool intended to be developed were determined. Nine indicators of performance (1-Persuasiveness of acting, 2-Portraying of a patient, 3-Acting according to the scenario, 4-Using communication skills while giving feedback, 5-Competency in giving feedback, 6-Recalling the encounter, 7-Efficacy of the feedback, 8-Professional attitude while giving feedback, 9-Observing the student performance) were identified for scaling [7].
Developing Items: An item pool was created, consisting of 18 items, and two items were assigned to each indicator (Appendix A). Two items for each SP indicator were assigned to prevent the narrowing of the scope of the scale in a situation where an item was removed as a result of expert opinion or item analysis.
All developed items were positively worded. A five-point Likert-type scale was determined in consultation with the experts. The response anchors of the items were defined as "poor (1)", "fair (2)", "good (3)", "very good (4)", and "excellent (5)". Following these steps, a draft version of SPS-S was formed.
Consulting Expert Validation: To obtain opinions on the 18-item draft scale, seven experts working in the field were consulted: four faculty members experienced in using and training SPs, two linguists, and one measurement-evaluation specialist. These experts examined the scale items in terms of content, scope, language, comprehensibility, and measurement and evaluation principles using an evaluation form. On the form, the experts rated each item as "applicable," "not applicable," or "needs revision," and included their recommendations. Based on the recommendations from the experts, six items were excluded and one item was revised; thus, a 12-item pilot version of the scale was created (Appendix B).
Preliminary Application: At this stage, the scale was administered to a group of 81 students to determine the approximate duration of implementation and make changes where necessary. As a result of the preliminary application, no item was misunderstood, none of the items was left unanswered, the instructions were comprehensible, and the evaluation of an SP took three to five minutes.
Pilot Testing: The 12-item pilot form was applied to a large group (N=702), and the validity and reliability studies were performed. As a result of these analyses, the scale was finalized. Relevant findings are presented in the "Data Analysis" section.

Standard-Setting Form
In this study, the extended Angoff method was used to determine the cut-off score for SPS-S. In this method, the experts estimated the number of scale points that they believed borderline examinees would obtain for each response item [13]. In this context, the experts determined the level of performance of an SP at the borderline by using the "Standard-Setting Form" for each item in the scale. The "Standard-Setting Form" contains two sections for specifying the level of performance of an SP at the borderline for each item. The experts complete one section in the first evaluation session and the other in the second evaluation session, after carrying out discussions between the two sessions.

Data Collection
Scale Development Data: SPS-S was applied to the participants following the preliminary application. For this purpose, a group of second- and third-year students enrolled during the 2015-2016 and 2016-2017 academic years participated. The students were asked to respond to SPS-S immediately after encountering SPs, because it was thought that they would evaluate the patients more accurately. The students were informed about the study, and their consent was obtained before the interview with SPs. Care was taken for the students to complete the scale alone, without any interaction with their peers or others. Twenty-five SPs participated in the study (19 females and six males, aged between 32 and 65 years). Each student evaluated the SP he/she interviewed only once, during communication skills training.
Data for Standard Setting: First, the experts were trained by a specialist in the field of measurement and evaluation. During this training, the aim of the standard-setting and the methods used were explained, and information was given about the procedures to be performed. In the next step, the experts discussed the characteristics they considered should be present in an SP and agreed on the level of competence that an SP at the borderline should have. Then, the first evaluation session began, and the experts were asked to give the score (min: 1, max: 5) for each item that an SP at the borderline would obtain. This process was carried out individually. They were then asked to share their evaluations with the group and justify them. The experts then stated their opinions about each other's evaluations, and a discussion was held. This discussion was followed by the second session, in which the experts were asked once again to provide the scores for an SP at the borderline. The first session, the discussion, and the second session took approximately one hour.

Data Analysis
The Validity of the Scale
Structure Validation: Before EFA, the Kaiser-Meyer-Olkin (KMO) measure and Bartlett's test were examined, and the data were tested for appropriateness for factor analysis. The principal components method was used for factor extraction. The scree plot and parallel analysis methods were used to determine the number of factors. Since the scale had a single-factor structure, no rotation method was used.
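The KMO measure and Bartlett's test of sphericity can be computed from the correlation matrix alone; a minimal Python sketch follows (NumPy/SciPy; the function names and simulated data are illustrative, not the study's code or data):

```python
import numpy as np
from scipy import stats

def bartlett_sphericity(data):
    """Bartlett's test of sphericity: tests whether the correlation
    matrix differs significantly from an identity matrix."""
    n, p = data.shape
    R = np.corrcoef(data, rowvar=False)
    chi2 = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
    df = p * (p - 1) / 2
    return chi2, stats.chi2.sf(chi2, df)

def kmo(data):
    """Overall Kaiser-Meyer-Olkin measure of sampling adequacy."""
    R = np.corrcoef(data, rowvar=False)
    inv_R = np.linalg.inv(R)
    # Partial correlations from the inverse of the correlation matrix
    d = np.sqrt(np.outer(np.diag(inv_R), np.diag(inv_R)))
    partial = -inv_R / d
    off = ~np.eye(R.shape[0], dtype=bool)      # off-diagonal mask
    r2 = np.sum(R[off] ** 2)
    p2 = np.sum(partial[off] ** 2)
    return r2 / (r2 + p2)

# Illustrative data: 300 respondents, 9 items driven by one common factor
rng = np.random.default_rng(0)
common = rng.normal(size=(300, 1))
data = common + 0.6 * rng.normal(size=(300, 9))
chi2, p = bartlett_sphericity(data)
k = kmo(data)
```

KMO values closer to 1 indicate that factor analysis is appropriate; the study's value of 0.92 is well above the commonly cited 0.60 adequacy threshold.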
In the CFA process, the data were initially tested for the multivariate normal distribution assumption. As the data did not satisfy this assumption, the analysis was carried out using the weighted least squares method, and the standardized coefficients, corresponding t values, and several fit indices were evaluated.
Item Discrimination: To assess item discrimination, the significance of the differences between the scores of the participants in the upper and lower 27% groups was tested for each item using the Mann-Whitney U test. Since the item scores were at the ordinal level, parametric tests (such as the t-test for independent groups) were not used in this comparison.
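The upper/lower 27% comparison can be sketched as follows (the helper function and the simulated Likert data are illustrative assumptions, not the study's procedure verbatim):

```python
import numpy as np
from scipy.stats import mannwhitneyu

def item_discrimination(scores, item, tail=0.27):
    """Compare one item's scores between the upper and lower 27% groups
    (ranked by total scale score) with the Mann-Whitney U test."""
    totals = scores.sum(axis=1)
    order = np.argsort(totals)          # ascending by total score
    k = int(np.ceil(len(totals) * tail))
    lower = scores[order[:k], item]     # lower 27% group
    upper = scores[order[-k:], item]    # upper 27% group
    return mannwhitneyu(upper, lower, alternative="greater")

# Illustrative 1-5 Likert data: higher-ability respondents rate higher
rng = np.random.default_rng(1)
ability = rng.normal(size=(200, 1))
scores = np.clip(np.rint(3 + ability + 0.5 * rng.normal(size=(200, 9))), 1, 5)
u, p = item_discrimination(scores.astype(int), item=0)
```

A significant one-sided p value for an item indicates that it discriminates between high- and low-scoring respondents.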

Reliability of the Scale
In order to assess the reliability of the developed scale, Cronbach's alpha internal consistency coefficient [18] and the split-half reliability coefficient [19] were calculated.
Standard-Setting Study: An adaptation of the Angoff method for items with more than two possible scores is called the extended Angoff method [20]. Candidates at the borderline are those at the sufficient-insufficient border, i.e., those considered barely sufficient. Using the extended Angoff method, the experts decided the score for each item that an SP at the borderline would obtain and recorded these estimates. The mean of the experts' estimates was calculated for each item, and the sum of the means gave the cut-off score.
During the data analysis process, SPSS 21.0, Lisrel 8.7, Excel 2016, and Monte Carlo PCA for Parallel Analysis packages were used.

Structure Validation Studies
Exploratory Factor Analysis: EFA was carried out with 307 participants. The KMO value for this study was calculated as 0.92. Bartlett's test of sphericity was applied to determine whether the data met the multivariate normality assumption. A statistically significant chi-square value can be interpreted as meaning that the distribution meets the multivariate normality assumption and that factors can be deduced from the correlation matrix [21].
The results of Bartlett's test were significant (p < 0.05), which shows that the multivariate normality assumption was met and that factors can be deduced from the correlation matrix [22].
As a result of EFA, the scale had three factors with eigenvalues greater than 1. When the sample size exceeds 200, it is recommended to examine the scree plot of the eigenvalues to determine the number of important factors [23]. The scree plot shows a marked drop after the first factor, and the rate of decline decreases and follows a horizontal course after the second factor. Moreover, the eigenvalue of the first factor (5.432) is approximately five times greater than that of the second factor (1.082) before rotation. Because the first factor alone yielded a high variance (45%), the scale was interpreted as having a single-factor structure. Parallel analysis of the number of factors also supports the single-factor structure. The scree plot is presented in Fig. 1.
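Parallel analysis retains only those factors whose observed eigenvalues exceed the eigenvalues expected from random data of the same dimensions. A minimal sketch of Horn's procedure follows (an illustrative implementation, not the Monte Carlo PCA package used in the study; the data are simulated):

```python
import numpy as np

def parallel_analysis(data, n_iter=100, seed=0):
    """Horn's parallel analysis: keep factors whose observed eigenvalues
    exceed the mean eigenvalues of same-sized random normal data."""
    rng = np.random.default_rng(seed)
    n, p = data.shape
    obs = np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]
    rand = np.empty((n_iter, p))
    for i in range(n_iter):
        noise = rng.normal(size=(n, p))
        rand[i] = np.sort(np.linalg.eigvalsh(np.corrcoef(noise, rowvar=False)))[::-1]
    threshold = rand.mean(axis=0)
    return int(np.sum(obs > threshold)), obs, threshold

# Illustrative single-factor data: parallel analysis should retain 1 factor
rng = np.random.default_rng(3)
common = rng.normal(size=(300, 1))
data = common + 0.6 * rng.normal(size=(300, 9))
n_factors, obs, threshold = parallel_analysis(data, n_iter=50)
```

Unlike the eigenvalue-greater-than-1 rule, this criterion corrects for the eigenvalue inflation that sampling error produces even in uncorrelated data.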
After establishing that the scale had a single-factor structure, the analysis was performed once again. The single-factor structure explained 59% of the total variance. Since the scale had a single-factor structure, no rotation was performed. In the analysis process, three items with factor loadings below 0.40 (item 4, 0.19; item 5, 0.18; item 7, 0.23) were excluded, starting from the item with the smallest loading, and the remaining items were renumbered. The factor loadings for the rest of the scale are presented in Table 2.
According to these results, SPS-S had a single-factor structure and consisted of nine items.
Confirmatory Factor Analysis: CFA was performed for the verification of the single-factor structure resulting from the EFA of SPS-S developed within the scope of the study. In the CFA, the latent variable is SP performance, abbreviated as "PERFORMAN." The observed variables are the items of SPS-S, abbreviated S1 to S9. All variables entered into the model are displayed in Fig. 2 and Fig. 3.
As a result of the CFA on the single-factor structure of SPS-S, the t values relating the latent variable to the observed variables were greater than the critical value (2.58) and statistically significant at the 0.01 level. Fig. 2 and Fig. 3 show the standardized coefficients and t values for the relationships in the model, respectively.
In the analysis, the software suggested that the errors of items s1 and s2 be correlated and that this modification could decrease the chi-square value by 26.91. This theoretically reasonable modification was accepted, and the errors of items s1 and s2 were correlated. After that, the ratio of the chi-square value to the degrees of freedom was calculated as 1.81, which can be considered an indicator of a perfect fit [16]. The other calculated indices were as follows: NNFI 0.97, CFI 0.98, GFI 0.97, AGFI 0.96, RMSEA 0.04, and SRMR 0.04. All fit indices are indicative of a perfect fit [24].
Item Discrimination: For each item in the scale, the mean ranks of the upper 27% group were higher than those of the lower 27% group, and these differences were significant (p < 0.01).

Reliability Studies
Cronbach's alpha internal consistency coefficient was calculated as 0.91, and the split-half reliability coefficient as 0.87. These findings show that the internal consistency coefficient of the scale is at the desired level.

Findings Regarding the Cut-Off Score Obtained by the Extended Angoff Method
In order to determine the cut-off score of SPS-S (nine items), seven experts were consulted. The experts were asked to use the extended Angoff method and give a score between 1 and 5, taking into account an SP at the borderline for each item. This scoring was undertaken in two rounds (R), as shown in Table 3.
When the cut-off scores for each item were examined after the second round, the experts allocated the lowest cut-off point to item 4 (2.86) and the highest cut-off point to item 9 (4.14). The experts also stated that an SP at the borderline should obtain an average of 3.44 points per item. The standard deviation values were examined to determine the variability between the experts' scores, and the variability between expert opinions was less in round 2 (0.53) than in round 1 (0.59).
To calculate the cut-off score, each expert's mean rating across the items was computed; these per-expert means were then summed to obtain the cut-off score.
Cut-off score = 3.00 + 3.78 + 3.56 + 3.33 + 4.00 + 3.11 + 3.33 = 24.11

Therefore, according to the findings from the extended Angoff method, for an SP to be considered sufficient, he/she must obtain at least 24.11 out of 45 on SPS-S.
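The arithmetic above can be reproduced directly; a minimal sketch, using the seven per-expert means reported in the text:

```python
# Per-expert mean ratings for a borderline SP (seven experts, as reported)
expert_means = [3.00, 3.78, 3.56, 3.33, 4.00, 3.11, 3.33]

# The cut-off score is the sum of the per-expert means
cut_off = round(sum(expert_means), 2)
print(cut_off)  # 24.11
```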

Discussion
SPs are used for teaching, practicing, and assessing a wide range of communication, counseling, and clinical/physical examination skills in medicine. In this context, the authenticity of role-play and the quality of feedback provided by SPs are of high importance for the quality of learning during SP contact sessions. Also, Objective Structured Clinical Examinations (OSCEs) that use SPs are subject to many measurement errors based on SP performances [25]. Therefore, training SPs and following their performance and development are important. A standard scale can be used by stakeholders to assess and follow the development of SP performance.
In this study, SPS-S was developed, and validity, reliability, and standard-setting studies were performed. The scale is intended for students, who can assess the quality of SP performance after the interaction. As a result of EFA, the scale was determined to have a single-factor structure, which was confirmed by CFA. The reliability analysis showed the internal consistency of the scale to be at the desired level.
The scale consists of nine items, each scored out of 5, meaning that the lowest achievable score is 9 and the highest is 45. The borderline score of an SP was determined as 24.11 out of 45 by the extended Angoff method. In this context, for someone to successfully qualify as an SP, he/she must obtain at least 24 points on SPS-S.
Students are important stakeholders in the assessment of SP performance. As students interact with SPs one-on-one during the encounters, it is essential to learn their perspectives to obtain straightforward and intuitive multi-source information about SP performance. The assessment of SP performance by students may have certain advantages compared to assessment by other stakeholders. For example, several students can assess SPs, whereas only a few faculty members and even fewer SP trainers can. Furthermore, students can assess SPs at different times, so it is possible to monitor SP performance over time. Moreover, this scale can be used to identify SPs who need further training early on, by picking out those who score less than 20, which will bring efficiency to the SP training and development process.
The strength of this study is that the SPS-S is not case-speci c and can be used for various scenarios.
Moreover, it is short (9 items) and can easily be completed by students. This scale was personalized for students; it contains items relating to their interactions with SPs and feedback about their performances.
The limitation of the study is that the criterion validity could not be tested because there is no other reliable and valid scale to assess SP performance by students. However, in similar studies, SPS-S can be used as a criterion scale to test validity. Another limitation could be the presence of a recall bias amongst students during their evaluation of the SP performance.

Conclusion
This study provides a valid and reliable instrument for the assessment of SP performance by students. The use of SPS-S, which has been confirmed through validity, reliability, and standard-setting studies, will guide SP trainers during training. It will also help SPs assess their weaknesses at an individual performance level. For further studies using SPS-S, we recommend that researchers re-assess validity and reliability using CFA and the internal consistency coefficient. The scale can also be modified for other stakeholders who will use it to assess SP performance.
In conclusion, this scale can be used for the selection and evaluation of SPs in the field of health sciences.

Availability of data and materials: The datasets used and analysed during the current study are available from the corresponding author on reasonable request.
The study was designed and developed by IG, DCD and SE. SE was responsible for data collection. DCD was responsible for data analysis; interpretation was undertaken by IG, DCD and SE. IG undertook the preliminary drafting of the paper, which IG, DCD and SE revised together and approved as the final version of the manuscript.