Bi-cultural, bi-national benchmarking and assessment of clinical reasoning in Obstetrics and Gynaecology

Background: The Script Concordance Test (SCT) is increasingly used in professional development in clinical reasoning (CR) in postgraduate medicine. On-line delivery favours multi-institutional collaboration. Objectives: To establish whether: 1) SCT questions developed in the French-speaking University of Montreal were readily adaptable for use in the English-speaking University of Adelaide; 2) expert reference panels (ERP) from both institutions could be used interchangeably; 3) student cohorts would perform similarly in the same test. Study Design: 82 SCT questions based on 27 clinical cases in Obstetrics and Gynaecology were developed in Montreal and run in a volunteer cohort of year 3 and year 4 medical students (n=154). Local faculty translated all questions, selecting 31 based on 17 clinical cases for use in summative examinations for a year 5 student cohort in Adelaide (n=123). Results: Mean (SD) percentage scores using each ERP key were: 74.2 (6.4) versus 73.3 (6.9), p<0.001 for Adelaide students and 72.5 (7.8) versus 70.6 (8.8), p<0.001 for Montreal students. The correlation coefficients were ≥ 0.928 (p<0.001).


Introduction
Clinical reasoning (CR) is a cornerstone of medical practice. Whilst most methods of teaching and assessing CR have their own benefits and drawbacks, a common element is that testing CR is resource-intensive. Thus, the potential for efficiency gains through multi-institutional collaboration in the development of assessment tools is attractive, as is cross-institution benchmarking.
Testing the clinical reasoning of medical students is a core component of assessment in medical programs. Tests of CR are now becoming increasingly important in the postgraduate domain, where it is recognized that errors in CR make the single most significant contribution to successful malpractice claims. (Saber Tehrani 2013) The Script Concordance Test (SCT) is a tool for assessment of CR that is increasingly being used in continuing professional development in medicine (Ahmadi et al 2014), including in large, geographically dispersed medical communities. (Hornos et al 2013) Script theory explains how physicians progressively acquire knowledge adapted to their clinical tasks. (Charlin et al 2000a, Charlin et al 2000b) The SCT is a written assessment based on clinical scenarios designed to measure clinical data interpretation. An expert reference panel of 10-20 members is recommended for optimal reliability. (Gagnon et al 2005) One significant characteristic of the SCT format is that it allows testing in ill-defined contexts that are often typical of clinical practice. (Lubarsky et al 2013) The SCT has been used in assessment in disciplines including radiology, neurology, radio-oncology, surgery and emergency pediatric medicine, and as an assessment tool for intraoperative decision-making in gynecological surgery. (Brailovsky et al 2001, Brazeau-Lamontagne et al 2004, Lambert et al 2009, Lubarsky et al 2009, Meterissian et al 2006, Park et al 2010) In these reports, tests were statistically reliable and showed construct validity (Lubarsky et al 2011), with statistically linear progression of scores with clinical experience.
These studies in postgraduate medicine have been undertaken with participants of differing levels of clinical expertise. A few studies have assessed reasoning among same-level medical students in specific domains. (Collard et al 2009, Duggan 2007, Duggan and Charlin 2012, Monnier et al 2011) We report our experience in the development and application of a "trans-national, bi-lingual" SCT developed for assessment of senior medical students in Obstetrics and Gynaecology. The research questions were: 1) Can SCT questions in Obstetrics and Gynaecology developed in the French-speaking University of Montreal, Canada be readily adapted for use in the English-speaking University of Adelaide, Australia? 2) Could the independent expert reference panels from both institutions be used interchangeably? 3) Would student cohorts in both centres perform to an equivalent level in the same test?

Background
The University of Montreal has a four-year postgraduate medical program with Obstetrics and Gynaecology taught in a clinical clerkship of 8 weeks' duration (4 weeks in Obstetrics and 4 weeks in Gynaecology). Students choose whether to undertake this clerkship in the third or fourth year of their program. In contrast, the University of Adelaide has a six-year undergraduate program with Obstetrics and Gynaecology taught in 9-week clinical clerkships in the fifth year.
Duggan P, Monnier P, Roex A, Bédard M, Charlin B. MedEdPublish. https://doi.org/10.15694/mep.2016.000025

Structure, production and scoring of the SCT cases and questions
The SCT format is shown in Figure 1. This provides a clinical scenario (case), a hypothesis or plan of action based on the scenario, and some additional information that may or may not have an effect on the hypothesis or plan. Each scenario is followed by a number of questions. For each question, the participant selects the single best Likert response that describes the effect of the additional information that has been given. In contrast to many conventional forms of written testing (e.g. multiple choice questions), there is no single correct answer; several responses to each question may be considered acceptable. Credit is assigned to each response based on the proportion of experts on the reference panel choosing that response. A maximal score of 1 is given for the response chosen by most of the experts (i.e., the modal response). Other responses are given partial credit in proportion to the number of experts choosing them. The information in each question stands alone; that is, when considering the answer to Q1 there is no oxygen saturation result available and for Q2 there is no chest X-ray result available. Typically, between 3 and 5 questions are provided per clinical case.
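The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the scoring software used in either institution: it assumes a five-point Likert scale coded -2 to +2, that partial credit equals the number of experts choosing a response divided by the number choosing the modal response, and that a test score is the mean credit expressed as a percentage. The panel responses are invented for the example.

```python
from collections import Counter

# Assumption: a five-point Likert scale coded -2..+2 (not stated in the text).
LIKERT_SCALE = (-2, -1, 0, 1, 2)


def build_scoring_key(panel_responses):
    """Return the credit for each Likert option on one question.

    The modal response earns 1; every other option earns the number of
    experts who chose it divided by the number who chose the modal response.
    """
    counts = Counter(panel_responses)
    modal_count = max(counts.values())
    return {option: counts.get(option, 0) / modal_count for option in LIKERT_SCALE}


def score_test(student_answers, keys):
    """Return a percentage score: mean credit across questions times 100."""
    total = sum(key[answer] for answer, key in zip(student_answers, keys))
    return 100 * total / len(keys)


# Hypothetical panel of 12 experts answering one question.
panel_q1 = [1, 1, 1, 1, 1, 1, 2, 2, 2, 0, 0, -1]
key_q1 = build_scoring_key(panel_q1)

# A student who picks the modal response on one question and the
# second most popular response on another question with the same key.
example_score = score_test([1, 2], [key_q1, key_q1])
print(key_q1)
print(example_score)
```

With this invented panel, six experts chose +1 (credit 1), three chose +2 (credit 0.5), and nobody chose -2 (credit 0), so an option that no expert selected earns nothing.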
Figure 1 (case scenario): You are called to a hospital ward to evaluate a 74-year-old woman three days following vaginal hysterectomy and anterior repair for prolapse. She is complaining of a sore leg and now feels short of breath whilst sitting in a chair.
[…] Obstetrics and Gynaecology wrote the questions (in their native French language), taking into account clerkship educational objectives. Questions for the expert reference panel (ERP) were placed on line in a purpose-written, restricted-access electronic database. The Montreal ERP comprised 15 volunteer experts in Obstetrics and Gynaecology (specialists and subspecialists) who were actively involved in teaching in the university. In Montreal, 154 of 171 (90%) medical students who had completed clinical clerkships in Obstetrics and Gynaecology in four consecutive rotations agreed to sit the 82-question paper-based SCT under normal examination conditions. This part of the study received approval from the University's Institutional Review Board. Neither institution required ethical approval for the remainder of the study. The complete question set of this examination was forwarded electronically to the University of Adelaide.
In brief, the "fate" of the Montreal-derived questions was as follows: 82 questions (27 cases) were received; the first filter removed 5 questions, the second filter removed 8 questions, and the third filter removed 38 questions. The final set comprised 31 questions from 17 cases (range of 1-3 questions per case; see table 1). Thirty-one questions on topics from Obstetrics and Gynaecology was the maximum allowed in the assessment blueprint for the clerkship examination in the University of Adelaide.
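The attrition through the three filters can be checked with a short calculation. The sketch below only restates the counts given above; the function name and the derived percentages are illustrative additions.

```python
def apply_filters(received, removed_per_filter):
    """Return the number of questions remaining after each successive filter."""
    remaining = [received]
    for removed in removed_per_filter:
        remaining.append(remaining[-1] - removed)
    return remaining


# 82 questions received; the three filters removed 5, 8 and 38 questions.
counts = apply_filters(82, [5, 8, 38])
print(counts)  # [82, 77, 69, 31]

# Share of received questions judged usable after the second filter,
# and share actually selected for the Adelaide examination.
usable_pct = round(100 * counts[2] / counts[0], 1)
selected_pct = round(100 * counts[3] / counts[0], 1)
print(usable_pct, selected_pct)  # 84.1 37.8
```

The third filter reflects the cap of 31 questions in the assessment blueprint, so the 38 questions it removed were set aside for lack of space and not because they were judged unsuitable.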
The first filter (formatting compatibility)
An Adelaide-based specialist obstetrician and gynecologist (AR) translated the questions into English. Of the 82 questions, 77 were in a suitable format for entry into the Adelaide on-line SCT database. Five questions related to two cases had Likert anchors of a mixed format (i.e. mixed hypothesis, investigation or management type questions in the one case scenario), which did not fit the configuration of the Adelaide on-line test facility. An Adelaide ERP, made up of 12 Obstetrics and Gynaecology specialists and subspecialists actively involved in teaching, answered the remaining 77 questions on-line.
The second filter (post hoc review of questions for curriculum compatibility, language and transposition errors).
After completion of the Adelaide ERP work, a review of the 77 questions was independently undertaken by PD. Eight questions had ambiguous phrases or key errors in translation or transposition of data and were removed. Examples include transposition errors for key laboratory data and omission of a key word such as "only". One question, though appropriately translated, was considered to be ambiguous: in the case of a woman with severe pre-eclampsia, the phrase "managing her conservatively" could have been interpreted by students as meaning observation only or observation plus antihypertensive therapy. This left 69 questions in the question bank that were suitable for use in the clerkship examinations.
The third filter (selection of 31 questions for use in Adelaide end of year examinations)
The benchmarking exercise to be undertaken was to compare the performance of Montreal and Adelaide student cohorts in identical SCT questions. A representative sample comprising 17 clinical cases and 31 questions was chosen covering diagnosis, management and investigation of common clinical conditions (table 1). Question topics were selected with reference to our assessment blueprint for the end of year examination and after applying the Adelaide criteria for selection of SCT questions (Duggan and Charlin 2012): 1) the modal response was consistent with current best evidence; 2) alternatives to the modal answer were chosen by more than one member of the ERP; 3) the questions reflected the contents of the curriculum; 4) the questions were of an appropriate degree of difficulty for the Year 5 cohort; 5) the questions complemented our other assessment items.

Statistical analysis
The effect of the choice of ERP scoring key on Adelaide and Montreal student scores, and the correlation between scores obtained with the two keys, were analysed using the paired samples t-test function in SPSS version 20 for Mac.
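For readers without SPSS, the two quantities involved can be sketched with the Python standard library. This is an illustrative stand-in for the SPSS procedure, not the analysis actually run: it computes the paired t statistic and the Pearson correlation for each student's score under two scoring keys, and it does not compute the p-value or the confidence interval. The six pairs of scores are invented for the example.

```python
import math
from statistics import mean, stdev


def paired_t(x, y):
    """Return (t statistic, degrees of freedom) for paired samples x and y."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    # Mean difference divided by its standard error.
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


def pearson_r(x, y):
    """Return the Pearson correlation coefficient between x and y."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Invented percentage scores for six students, each scored with two keys.
scores_key_a = [74.0, 68.5, 80.0, 71.5, 77.0, 65.0]
scores_key_b = [73.0, 67.0, 79.5, 70.0, 76.5, 63.5]

t_stat, df = paired_t(scores_key_a, scores_key_b)
r = pearson_r(scores_key_a, scores_key_b)
print(f"t = {t_stat:.2f}, df = {df}, r = {r:.3f}")
```

A paired test is the natural choice here because every student contributes two scores, one per key, so the comparison is made within students and not between independent groups. This also explains how a small mean difference can be statistically significant while the correlation between the two sets of scores remains very high.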

Results
There were 123 students (67 female, 56 male) in the fifth year Adelaide cohort and 154 students (98 female, 42 male, 14 no data) in the Montreal cohort. 92 Montreal students reported they were in third year, 37 in fourth year, and 25 did not declare their year.
Mean (SD) percentage scores using each ERP key were 74.2 (6.4) versus 73.3 (6.9), p<0.001 for the Adelaide students and 72.5 (7.8) versus 70.6 (8.8), p<0.001 for the Montreal students. Overall, the 95% confidence interval of the difference of the means was 0.4% to 2.4% (table 2). The correlation coefficient for the Adelaide student data run with Adelaide and Montreal ERP scoring keys was 0.93, and the scatter plot of those data is shown in figure 2. Very similar results were obtained when the data for the Montreal student cohort were analysed (correlation coefficient 0.94, scatter plot not shown).

Discussion
To our knowledge this is the first report of an international collaboration in assessment in Obstetrics and Gynaecology, made all the more remarkable for its bilingual nature. We had 3 research questions: 1) can SCT questions in Obstetrics and Gynaecology developed in the French-speaking University of Montreal, Canada be readily adapted for use in the English-speaking University of Adelaide, Australia; 2) could the independent expert reference panels from both institutions be used interchangeably, and 3) would student cohorts in both centers perform to an equivalent level in the same test?
The key finding of our study was that the English-speaking University of Adelaide was able to utilise the majority of questions developed in the French-speaking University of Montreal. Whilst significant editorial input was required, this is also necessary for locally written questions in any institution. Furthermore, the level of attrition of questions provided by Montreal to Adelaide is similar to, if not better than, the outcome of many local question-writing workshops. The development of examination questions of any sort is a resource-intensive process. There is significant potential for cost sharing in addition to enhanced opportunities for professional development of faculties.
In relation to our second research question, although the differences in means for each cohort were statistically significant, the absolute difference was small and well in keeping with, if not better than, results when different local expert reference panels are used. The effect of this difference on real student scores is remarkably small in comparison with, for example, differences reported in multi-institutional standard setting studies of other forms of assessment, such as OSCE examinations. (Boursicot et al 2006) The correlation between the results our students achieved with the two scoring keys was very high, and it is highly unlikely that better correlations would be obtained between two different reference panels from the same institution. These data lend support to the notion that institutions can use scoring keys derived from other institutions in their assessments. The resourcing implications of this finding are potentially significant, especially in small to medium sized institutions that might have difficulty in local recruitment of an adequate number of panelists. Is it appropriate to use a scoring key developed by a different faculty in another country? There are some risks in doing so, particularly if the assessment is high stakes. These risks can be ameliorated but not eliminated by appropriate local editorial control and cross-institutional benchmarking.
Our third research question was answered in the benchmarking exercise. Benchmarking by definition requires sharing of questions. In the exercise we have described, the cohorts were compared to each other rather than to an external control group. The questions chosen for this comparison were determined by both faculties to be appropriate to their curricula. To our knowledge such an international benchmarking exercise has not previously been conducted. We believe that the small difference in scores between the two student cohorts, although statistically significant, is consistent with an equivalent and satisfactory level of performance. However, the two cohorts are not directly comparable, in part due to differences in the structure and length of the two programs. Furthermore, the Montreal students sat their SCT as volunteers (90% participation rate) with no stakes attached, whereas the Adelaide students were sitting a summative test.

Conclusions
Our study demonstrated that SCT questions in Obstetrics and Gynaecology can be effectively shared between French and English speaking institutions located in different hemispheres. ERP data derived from the collaborating institution can be used provided there is appropriate local editorial control. There appeared to be few differences in clinical practice. Potential advantages include the creation of an international database of assessment items, benchmarking and cost sharing. This will save time for teachers and can be a first step towards standardisation in assessment, which is particularly useful in a world where faculty frequently move from one country to another.

Take Home Messages
Script Concordance Test questions can be effectively shared between French and English speaking institutions located in different hemispheres. International collaboration in question development creates opportunities for benchmarking and cost sharing. Scoring keys derived from independent expert reference panels of collaborating institutions can be used interchangeably.