Validity evidence for two objective structured clinical examination stations to evaluate core skills of the shoulder and knee assessment

We developed two objective structured clinical examinations (OSCEs) to educate and evaluate trainees in the evaluation and management of shoulder and knee pain. Our objective was to examine the evidence for validity of these OSCEs. A multidisciplinary team of content experts developed checklists of exam maneuvers and criteria to guide rater observations. Content was proposed by faculty, supplemented by literature review, and finalized using a Delphi process. One faculty member simulated the patient, and another rated examinee performance. Two faculty members independently rated a portion of cases. Percent agreement was calculated and Cohen’s kappa corrected for chance agreement on binary outcomes. Examinees’ self-assessment was explored by written surveys. Responses were stratified into 3 categories and compared with similarly stratified OSCE scores using Pearson’s coefficient. A multidisciplinary cohort of 69 examinees participated. Examinees correctly identified rotator cuff and meniscal disease 88% and 89% of the time, respectively. Inter-rater agreement was moderate for the knee (87%; k = 0.61) and near perfect for the shoulder (97%; k = 0.88). No correlation between stratified self-assessment and OSCE scores was found for either the shoulder (0.02) or the knee (−0.07). Validity evidence supports the continuing use of these OSCEs in educational programs addressing the evaluation and management of shoulder and knee pain. Evidence for validity includes the systematic development of content, rigorous control of the response process, and demonstration of acceptable inter-rater agreement. Lack of correlation with self-assessment suggests that these OSCEs measure a construct different from learners’ self-confidence.


Background
The prevalence of musculoskeletal (MSK) problems is substantial, and in 2006, data from diagnostic coding showed that MSK conditions were the most common reason for patients to visit primary care clinics in the United States (US). [1][2][3] Nevertheless, clinical training in MSK diseases has been widely regarded as inadequate across multiple levels of medical education in the US and abroad. [4][5][6] Calls for innovations in response to these training needs have come in the context of an increasing awareness of the need for reflective critique and scholarly review of initiatives in medical education. [7,8] The US Bone and Joint Initiative's 2011 Summit on The Value in Musculoskeletal Care included the following recommendation in the summary of the proceedings: "Training programs for all health care providers should improve the knowledge, skills, and attitudes of all professionals in the diagnosis and management of musculoskeletal conditions. At present, many graduates report a deficit of knowledge of musculoskeletal conditions and competence in patient evaluation and treatment, including performance of the musculoskeletal physical examination." [7] In response to this call, and as part of a broad initiative to enhance MSK care, we convened a multidisciplinary group to develop two objective structured clinical examination (OSCE) stations to facilitate training and assessment in the evaluation and management of shoulder pain and knee pain in primary care. We designed these exercises to be capstone elements within MSK educational programs, developed for students, post-graduate trainees, and practicing providers. [9][10][11][12]
The purpose of the OSCEs was to assess the ability to 1) perform a systematic, efficient, and thorough physical exam, 2) recognize history and exam findings suggestive of problems commonly seen in primary care (rotator cuff disease, osteoarthritis (OA), adhesive capsulitis, and biceps tendinitis in patients with shoulder pain; OA, meniscal disease, ligamentous injury, iliotibial band pain and patellofemoral syndrome in patients with knee pain), and 3) suggest an initial management plan, including the appropriate use of imaging, corticosteroid injections, and specialty referral. Our objective in this study was to examine the evidence for validity of these two OSCE experiences.
Contemporary understanding of validity has developed recently, exchanging an older framework that had considered content, criterion (including predictive), and construct validity to be distinct concepts, for a unified hypothesis in which validity is viewed as an argument to be made (using theory, data, and logic) rather than the measurable property of an instrument or assessment tool. [13,14] In this contemporary construct, evidence used to argue validity is drawn from multiple sources: 1) content, 2) response process, 3) internal structure, 4) relations to other variables, and 5) consequences. [15,16]

Content
The OSCE stations were created by a group consisting of two orthopedic surgeons (RZT, JPB), two rheumatologists (MJB, GWC), and a primary care provider with orthopedic experience (AMB). Station content (the set of elements constituting a complete examination for the shoulder and knee) was proposed by faculty, supplemented by literature review, and finalized through a Delphi process. Checklist items representing observable exam maneuvers and the criteria for guiding rater observations to assess the quality of performance of each of these items were also developed and finalized through faculty consensus. Simulated cases representing causes of shoulder pain (rotator cuff disease, OA, adhesive capsulitis, and biceps tendinitis) and knee pain (OA, meniscal disease, ligamentous injury, iliotibial band pain and patellofemoral syndrome) commonly encountered in primary care settings were created; expert clinical faculty drafted, reviewed, and revised these cases together with the checklists and rating scales, and additional faculty reviewed and critiqued the revised versions. Exacting specifications detailed all the essential clinical information to be portrayed by the simulated patient (SP).

Response process
OSCE scores were collected in the context of intensive structured educational programs developed for trainees and practicing primary care providers. [11,12] To promote accuracy of responses to assessment prompts, and to ensure strong data collection, one faculty member served as the SP (MJB) and another as the rater (AMB). OSCEs were conducted in clinical exam rooms, and ratings were recorded in real time. Any and all questions regarding the performance of specific exam maneuvers or the quality of the technique were resolved between the two faculty immediately following the exercise.
A scoring rubric was designed to produce five total possible points for the shoulder OSCE. The elements were distributed and organized into five domains: observation, palpation, range of motion, motor function of the rotator cuff, and provocative testing. Each domain was assigned a factor weight by clinical experts on the basis of their assessment of the importance that each domain contributed to clinical decision-making. For example, testing the rotator cuff motor function was assigned a factor of 1.5. This domain received a greater factor weight than that assigned to provocative testing (a factor of 1) because if weakness of the rotator cuff is noted during the physical exam, magnetic resonance imaging (MRI) may be considered, whereas a positive Speed's or Yergason's test suggesting biceps tendinitis would not be expected to lead to advanced imaging. Each of the items within each domain was weighted equally. When the rater scored the OSCE, if a skill was not performed the item was scored as "0." If the skill was attempted but the technique was not adequate, it was scored as "1"; if performed correctly, it was scored as "2." The score within each domain was the percentage of possible points within that domain.
A similar five-point scoring rubric was developed for the knee OSCE, with elements distributed and organized across five domains: observation, range of motion, palpation, stability testing, and provocative testing. As for the shoulder station, each domain was assigned a factor weight to reflect differences in how the respective maneuvers might have greater or lesser impact on clinical decisions. Rating and scoring the knee OSCE followed the same procedure used in the shoulder station.
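The scoring procedure described above can be illustrated with a minimal sketch. It assumes that the total score is the sum of each domain's fraction of possible points multiplied by its factor weight; the text does not state this combination rule explicitly. Only the factor weights for rotator cuff motor function (1.5) and provocative testing (1) come from the text. The remaining weights, the number of items per domain, and the example ratings are hypothetical placeholders.

```python
from typing import Dict, List

# Item ratings follow the text: 0 = not performed, 1 = attempted with
# inadequate technique, 2 = performed correctly.
MAX_ITEM_SCORE = 2

# Factor weights per domain. Only rotator cuff motor function (1.5) and
# provocative testing (1.0) are stated in the text; the other three values
# are HYPOTHETICAL, chosen so that the weights sum to five points.
SHOULDER_WEIGHTS: Dict[str, float] = {
    "observation": 0.5,          # assumed
    "palpation": 1.0,            # assumed
    "range_of_motion": 1.0,      # assumed
    "rotator_cuff_motor": 1.5,   # from the text
    "provocative_testing": 1.0,  # from the text
}


def domain_fraction(item_scores: List[int]) -> float:
    """Return the fraction of possible points earned within one domain."""
    if not item_scores:
        raise ValueError("a domain needs at least one item")
    if any(s not in (0, 1, 2) for s in item_scores):
        raise ValueError("item scores must be 0, 1, or 2")
    return sum(item_scores) / (MAX_ITEM_SCORE * len(item_scores))


def osce_score(ratings: Dict[str, List[int]], weights: Dict[str, float]) -> float:
    """Sum each domain's fraction of possible points times its factor weight."""
    if set(ratings) != set(weights):
        raise ValueError("ratings and weights must cover the same domains")
    return sum(weights[d] * domain_fraction(ratings[d]) for d in weights)


# Hypothetical ratings for one examinee (item counts per domain are assumed).
example = {
    "observation": [2, 2],
    "palpation": [2, 1, 2, 2],
    "range_of_motion": [2, 2, 2, 2],
    "rotator_cuff_motor": [2, 1, 0, 2],
    "provocative_testing": [2, 2, 0, 2, 2, 1, 2],
}
total = osce_score(example, SHOULDER_WEIGHTS)
print(f"Shoulder OSCE score: {total:.2f} of 5")
```

Because items are weighted equally within a domain and the weights sum to five, a domain with many items does not dominate the total merely by its length; its contribution is capped by its factor weight.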

Internal structure
To establish interrater agreement, two faculty members (AMB, MJB) independently rated 10% of the cases. Inter-rater agreement was calculated and Cohen's kappa corrected for chance agreement on binary outcomes.
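As an illustration of the two agreement statistics named above, the following sketch computes percent agreement and Cohen's kappa for a pair of raters. It assumes each checklist item has been reduced to a binary outcome (for example, 1 = performed correctly, 0 = otherwise); the text does not specify how items were dichotomized, and the ratings shown are hypothetical.

```python
from typing import List


def percent_agreement(a: List[int], b: List[int]) -> float:
    """Proportion of items on which the two raters gave the same rating."""
    if len(a) != len(b) or not a:
        raise ValueError("ratings must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: List[int], b: List[int]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_observed = percent_agreement(a, b)
    # Chance agreement: for each category, the product of the two raters'
    # marginal proportions, summed over categories.
    categories = set(a) | set(b)
    p_expected = sum((a.count(c) / n) * (b.count(c) / n) for c in categories)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical binary ratings of ten checklist items by two raters.
rater_1 = [1, 1, 1, 1, 0, 0, 1, 1, 0, 1]
rater_2 = [1, 1, 1, 0, 0, 0, 1, 1, 1, 1]

agreement = percent_agreement(rater_1, rater_2)
kappa = cohens_kappa(rater_1, rater_2)
print(f"Percent agreement: {agreement:.0%}; kappa: {kappa:.2f}")
```

The example shows why both figures are reported: raw agreement can be high simply because most items are rated the same way by both raters, and kappa discounts that portion of agreement expected by chance.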

Relations to other variables
Relationship to self-assessment of ability to evaluate shoulder pain and knee pain was explored with written surveys, using Likert scales anchored at five points ranging from 1 (Strongly Disagree) to 5 (Strongly Agree). Five items related to the shoulder and 5 to the knee. In addition to the traditional pre-course and post-course measurements, participants were asked, after the course ended, to retrospectively rate their pre-course proficiency, in effect capturing information that trainees "didn't know they didn't know." [17,18] Responses were averaged for each of the 5 items; averaged responses were then stratified across 3 categories of self-assessed ability (low, medium, and high) and compared with similarly stratified OSCE scores.
This project was reviewed by the Institutional Review Board of the University of Utah and was determined to meet the definition of a quality improvement study but not the definition of research with human subjects, and was classified as exempt. Written consent was not requested.
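The stratified comparison of self-assessment with OSCE scores can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: it assumes that the five Likert responses are averaged per examinee, that both measures are stratified with fixed cut points, and that Pearson's coefficient is computed on the resulting category codes (1 = low, 2 = medium, 3 = high). The cut points and all data values are hypothetical; the text does not report how the strata were defined.

```python
from math import sqrt
from statistics import mean
from typing import List, Sequence


def stratify(values: Sequence[float], low_cut: float, high_cut: float) -> List[int]:
    """Code each value as 1 (low), 2 (medium), or 3 (high)."""
    return [1 if v < low_cut else 2 if v < high_cut else 3 for v in values]


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's product-moment correlation coefficient."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


# Hypothetical data: five Likert responses (1 to 5) per examinee, and each
# examinee's OSCE score on the five-point rubric.
likert = [
    [2, 2, 3, 2, 2],
    [4, 4, 5, 4, 4],
    [3, 3, 3, 4, 3],
    [5, 4, 5, 5, 4],
    [1, 2, 2, 1, 2],
    [3, 4, 3, 3, 3],
]
osce = [4.5, 3.8, 2.9, 4.1, 3.6, 4.4]

self_means = [mean(items) for items in likert]
# Cut points below are assumptions chosen only for illustration.
self_strata = stratify(self_means, low_cut=2.5, high_cut=3.5)
osce_strata = stratify(osce, low_cut=3.5, high_cut=4.2)

r = pearson_r(self_strata, osce_strata)
print(f"Pearson's r between stratified scores: {r:.2f}")
```

A coefficient near zero, as in the results reported for both stations, indicates that an examinee's self-assessed stratum carries little linear information about the OSCE stratum.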

Content evidence
Final versions of the shoulder (21 items) and knee (25 items) checklists are shown in Tables 1 and 2, respectively. Individual items were grouped into 5 domains: "observation", "palpation", "range of motion", "motor function of rotator cuff", and "provocative testing" for the shoulder; "observation", "range of motion", "palpation", "stability testing", and "provocative testing" for the knee. Videos demonstrating the performance of the complete exam were developed (AMB, MJB) to accompany these checklists as teaching tools. [9] Copies of these videos are available as supplementary materials (see Shoulder Exam Small 2014.mov and Knee Exam Small 2014.mov) and online [19,20].
Table 1 Legend: Shoulder Examination Checklist.
Table 2 Legend: Knee Examination Checklist.

Response process evidence
A multidisciplinary cohort of 69 trainees participated in the OSCEs in 2014-15 (Table 3).
Using the examination approach in the checklists, 88% of the trainees correctly identified rotator cuff pathology and 89% of them correctly diagnosed meniscal disease.

Internal structure evidence
Observed inter-rater agreement was 87% for items on the knee checklist, and 97% for those on the shoulder checklist (Table 4).
Kappa coefficients indicated moderate agreement for the knee (0.6) and near perfect agreement for the shoulder (0.9), according to a commonly cited scale [21].

Relations to other variables evidence
Sixty-nine pre-course, 67 post-course, and 63 retrospective pre-course surveys were collected (response rates of 100%, 97%, and 91%, respectively); mean self-assessment ratings are shown in Table 5.
Relationship of stratified self-assessment and OSCE scores is shown in Table 6.

Discussion
We have developed a systematic, efficient, and feasible method of organizing, teaching, and evaluating the physical examination of the shoulder and the knee. This paper presents validity evidence supporting the use of these examination checklists and OSCE stations in the context of an educational program focused on strengthening these clinical skills.
Several recent reports describe the development and use of OSCE stations and checklists in the context of MSK and rheumatology; the two most recent of these emphasize the importance of developing consensus among educators regarding the elements of these important teaching and assessment tools. [22][23][24][25][26][27][28] There are many possible techniques used in examining the MSK system, and a recent review by Moen et al. reported that at least 109 specific maneuvers for the shoulder have been described. [29] Although some individual studies have reported sensitivity and specificity properties for these maneuvers that may seem reasonable, other studies have arrived at different results, and a recent Cochrane review has not found sufficient evidence to recommend any examination element, likely due to "extreme diversity" in techniques compared to the original descriptions. [30] Many studies have not examined combinations of individual elements into a systematic, synthetic approach; some even question the relevance and role of the physical examination altogether, in contrast with the summary recommendation of the US Bone and Joint Initiative. [31,32] To our knowledge, no group has yet proposed a detailed checklist of elements for the physical exam of the shoulder for use in a multidisciplinary educational program.
Our study has several strengths. First, the content of our instruments was developed using a well-defined process, grounded in an explicit theoretical and conceptual basis: that in order to be effective, the physical exam must balance thoroughness with feasibility. Our checklists represent those elements that were identified in the literature and finalized in a systematic item review by a multidisciplinary panel of experts representing orthopedics, rheumatology, and primary care.
Second, the strength of our methods to control the response process and preserve a coherent internal structure within these OSCEs is demonstrated by the high rate of accuracy in identifying simulated rotator cuff and meniscal pathology, as well as good interrater agreement of faculty assessors. Finally, we have addressed the relationship of these structured observations of clinical skill to written self-assessments.
Further development of this educational initiative will involve exploring the use of these tools in several additional settings: 1) a national continuing professional education initiative to strengthen the evaluation and    [10,12,24] This exploration will involve work to examine validity evidence informing the interpretation of scores in each of these contexts.
We acknowledge several limitations to our study. First, we have not examined the relationship of these OSCEs to other assessments of knowledge, including written examinations. We are currently developing additional methods of evaluation, including multiple choice questions that will evaluate content knowledge. Second, we do not currently have evidence of consequence to inform our validity hypothesis. Sources of consequence evidence might include more appropriate use of high-cost imaging, better prioritization of referrals to physical therapy, surgery, or specialty care, and more precise documentation of the physical exam. Finally, our study examines evidence of the performance of these assessments within a single institution. We believe that these teaching and assessment tools are generalizable and offer a valuable resource for additional sites.

Conclusions
In summary, we have presented evidence of validity supporting the use of these shoulder and knee OSCEs as a capstone element of a structured educational program designed to strengthen the evaluation and management of common MSK complaints. This initial critical review of these assessment tools prepares the way for dissemination of these OSCEs to other institutions, learning platforms, and contexts, where additional examination of the experiences of implementation will be important to determine generalizability and feasibility.
Pearson's coefficient indicated no correlation for either the shoulder (0.02) or the knee (−0.07).