Validity, reliability and acceptability of Professionalism Mini-Evaluation Exercise (P-MEX) for emergency medicine residency training.

Professionalism is a core competency in the medical profession. In this paper, we aimed to confirm the validity, reliability and acceptability of the Professionalism Mini-Evaluation Exercise (P-MEX) instrument for the emergency medicine (EM) residency program. Twenty-two EM attending physicians completed 383 P-MEX forms (the Persian version) for 90 EM residents. Construct validity was assessed via structural equation modeling (SEM). The reliability coefficient was estimated by the generalizability theory, and acceptability was assessed using two researcher-made questionnaires to evaluate the perspectives of residents and assessors. There was a consensus among the participants regarding the content of P-MEX. According to the results of SEM, the first implementation of the original model was associated with a moderate fit and high item loadings. The model modified with correlated error variances for two pairs of items showed an appropriate fit. The reliability of P-MEX was 0.81 for 14 occasions. The perception survey indicated high acceptability for P-MEX from the viewpoint of the residents and increasing satisfaction with P-MEX among the assessors over time. According to the results of the research, P-MEX is a reliable, valid, and acceptable instrument for assessing professionalism in EM residents.


Introduction
Professionalism is a core characteristic of the medical profession (1). In recent years, increasing attention has been paid to professionalism due to concerns regarding the decline of professional and ethical values (2). One responsibility of medical schools is determining whether such competencies have been achieved (3). Over the last three decades, various instruments have been developed to assess medical professionalism (4). Recognizing the observation of students' performance as the most efficient technique to evaluate professionalism in real clinical practice led to the identification of the Professionalism Mini-Evaluation Exercise (P-MEX) as the core of any assessment strategy (5).
Introduced by Cruess et al., P-MEX measures four areas of professionalism skills: doctor-patient relationship, reflective skills, time management, and interprofessional relationship skills (6). The necessity to reevaluate professionalism assessment scales before application in a new country has been emphasized due to cultural and contextual differences (7,8). For instance, Tsugawa modified the instrument so that it could be applied to Japanese medical students (9,10). Unfortunately, no observational instrument has been validated for the assessment of the professionalism of emergency medicine (EM) residents (11,12). Working as a resident in the EM department is more stressful compared to other departments due to the unique features of this ward, e.g. heavy workload, uncontrolled environment, and an unlimited number of patients with a vast spectrum of diseases and a short-term stay (13).
Studies have shown that professional values are violated by residents who suffer from burnout due to prolonged exposure to stress. Formative assessment of behavior facilitates early identification of unprofessional behavior before it becomes a significant issue. It also assists trainers in opening the dialogue on signs of burnout with residents through feedback for minimizing professionalism lapses and ameliorating burnout. Therefore, it is essential to apply effective assessment strategies in the clinical workplace (14,15). Considering the differences between the EM department and other clinical settings, this study aimed to confirm the reliability, validity, and acceptability of the P-MEX for EM residents in Iran.

Methods
This study was conducted in the EM departments of four teaching hospitals in Iran from July 2017 to January 2018. The research was approved by the Institutional Review Board of the School of Medicine of Tehran University of Medical Sciences. In translating the P-MEX from English to Persian, the guidelines for the translation and adaptation of tests developed by the International Test Commission (ITC) were followed (16). First, two experts conducted a forward translation, which was then evaluated by an expert panel consisting of five professionals. This evaluation led to the formation of a single Persian translation by consensus. Second, the Persian version was back-translated into English by two bilingual native English speakers. Third, Richard Cruess and Sylvia Cruess (the two developers of the P-MEX) discussed the discrepancies in the two backward translations.
Based on their recommendations, a final draft of the Persian P-MEX was prepared. Fourth, cognitive debriefing interviews were conducted with a sample of participants, consisting of six assessors and six EM residents, to assess the comprehension and face validity of the translated P-MEX.
All EM residents (n = 90) and 22 attending physicians voluntarily participated in the study. Non-monetary incentives were used to encourage participation in the research. Participants were first instructed to perform the P-MEX exercise through weekly meetings, in which they received a booklet containing an instruction guide and the P-MEX forms. The P-MEX comprises 20 minutes of observing clinical encounters followed by five minutes of immediate feedback. In the present study, the full 24-item P-MEX scale was used, scored on a four-point Likert scale with the options of exceeded expectations (score 4), met expectations (score 3), below expectations (score 2), and unacceptable (score 1). A fifth category was entitled "not observed" or "not applicable". The original P-MEX form (questionnaire) is presented at the end of the paper as an appendix.

Analysis
Structural equation modeling (SEM) was used to investigate the construct validity of the P-MEX. The following indices were applied to evaluate the model's goodness-of-fit: the comparative fit index (CFI > 0.90 indicating a good fit), the root-mean-square error of approximation (RMSEA < 0.08 indicating an acceptable fit), and the ratio of chi-square to degrees of freedom (χ2/d.f. ≤ 3). SEM analyses were conducted using STATA/IC 14.2 (StataCorp, College Station, TX, USA). Moreover, generalizability theory was used to evaluate the reliability of the scores. To this end, the generalizability coefficient (G) was estimated for a one-facet crossed design, in which the resident (R) was the object of measurement and the occasion (o) was the facet of measurement, using G-STRING IV version 6.3.8 (Bloch & Norman, 2011). The G study was followed by a decision (D) study to identify the number of occasions (P-MEX forms) per resident required to achieve an acceptable level of reliability. After the completion of the P-MEX assessment process, residents and faculties were asked to complete a questionnaire on their perception of the P-MEX from various aspects, including the feasibility, content, fairness, and educational impact of the assessment. The questionnaire for residents contained 52 items, whereas the scale for assessors encompassed 37 items, both scored on a five-point Likert scale ranging from "strongly disagree" (1) to "strongly agree" (5). The face and content validity of the questionnaires were confirmed by a group of experts consisting of two medical education faculty members and four emergency medicine specialists who participated in the study as assessors. To determine the stability of the questionnaires over time, test-retest reliability was assessed using Pearson's correlation coefficient; to this end, the questionnaires were readministered to 21 residents and 8 assessors two to three weeks later.
In the current study, Pearson's correlation coefficient was between 0.726 and 0.943 for the residents' questionnaire (P < 0.01), and between 0.779 and 0.906 for the assessors' questionnaire (P < 0.01), which suggested satisfactory stability. Furthermore, the reliability of the questionnaires was estimated at the Cronbach's alpha of 0.88. Data analysis was performed in SPSS 22.
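The two questionnaire statistics reported above can be illustrated with a minimal Python sketch. It is not the SPSS procedure used in the study; it only shows the standard textbook formulas for Pearson's correlation (test-retest stability) and Cronbach's alpha (internal consistency). The function names and the small response matrix are hypothetical and were invented purely for illustration; they are not study data.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


def sample_variance(values):
    """Unbiased sample variance (divides by n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def cronbach_alpha(rows):
    """Cronbach's alpha for a respondents-by-items matrix of Likert scores."""
    k = len(rows[0])  # number of items
    item_variances = [sample_variance([row[j] for row in rows]) for j in range(k)]
    total_variance = sample_variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical five-point Likert responses (rows = respondents, columns = items).
first_round = [
    [4, 5, 4, 4],
    [3, 3, 2, 3],
    [5, 5, 5, 4],
    [2, 3, 2, 2],
    [4, 4, 5, 4],
]
# Hypothetical total scores for the same respondents at retest.
retest_totals = [16, 12, 19, 10, 17]

first_totals = [sum(row) for row in first_round]
print("test-retest r:", round(pearson_r(first_totals, retest_totals), 3))
print("Cronbach's alpha:", round(cronbach_alpha(first_round), 3))
```

In this sketch the test-retest correlation is computed on total scores; computing it per item or per subscale, as the reported range of coefficients suggests, would only require calling `pearson_r` once per column.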

Results
In total, 383 P-MEX forms were completed by 22 EM faculties for 90 residents during a seven-month period. The mean number of evaluations per resident was 4.26 (range of 1-11). In addition, the number of P-MEX forms completed per rater ranged from 1 to 46, with an average of 17.41 (± 2.68 SD). Moreover, the mean of the evaluation scores of all residents for overall competency was 3.32 (± 0.04 SD) out of 4. According to the results, the mean observation time equaled 128.3 minutes (median of 120 and range of 10-600) and the mean feedback time equaled 13.06 minutes (median of 10 and range of 1-35). In the present research, the residents received the lowest scores on items 10 (23.8%), 8 (21.4%), 17 (14.9%), and 13 (12.5%), which pertained to soliciting feedback, warning about the limitations, addressing the gap between knowledge and skills, and maintaining composure in difficult situations, respectively.
In 11% and 8.6% of the assessments, item 23 (using health resources appropriately) and item 22 (maintaining patient confidentiality) were rated as not applicable by the assessors, respectively. These items were reconsidered for their additional value in the assessment of the EM residents in this research. However, item 22 was more applicable in observations lasting over an hour. It should be noted that the correlation between items was evaluated using the Pearson product-moment correlation coefficient, and the results were indicative of a significant and strong correlation between items 2 and 3 (r = 0.774, P < 0.005). Moreover, item 5 was highly correlated with items 4, 6, and 7 (r = 0.773, 0.864, and 0.743, respectively).
In addition, SEM was used to confirm the model's goodness-of-fit. As presented in figure 1, factor loadings for all items were significantly above Kline's cut-off point (>0.50) (17). However, item 12 (appropriate boundaries with patients/colleagues) had been cross-loaded on two latent variables (i.e. patient-doctor communication skills and interpersonal skills in the original model), and was barely loaded on factor 1 (loading value 0.096) but mostly on factor 4 with a value of 0.65. Based on the literature, correlated error terms are often caused by item wording, item placement, double-barreled questions, or the effects of missing variables. In order to address the issue of correlated error terms, researchers recommended removing or merging items with the correlated error terms and proposing new items (17-21). The error correlation between items 2 and 3 could be justified by both items referring to the same concept of respect for patients. Item 2 was removed from the scale due to the failure of the evaluators to differentiate between the items after scoring the performance of the residents.
Additionally, the error terms of items 22 and 23 were allowed to correlate. However, item 23 did not apply in our setting since it is not the responsibility of residents to use or allocate health resources. As a result, item 23 was eliminated from the scale. The reliability of P-MEX scores was measured by the generalizability theory using a one-facet (the resident by form) crossed design. The G coefficient was estimated at 0.647 based on the six levels of the forms crossed with residents. The D study results (table 1) revealed that the optimal number of occasions required for reaching acceptable reliability on the P-MEX assessment was 14 occasions with the G coefficient equal to 0.81.
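The projection from the G study to the D study can be reproduced approximately with a short Python sketch. It assumes the standard one-facet crossed-design relation, in which the error variance of a mean score shrinks in proportion to the number of occasions; it is not the G-STRING computation itself, and the variance components are not reported here, so the sketch works only from the published G coefficient (0.647 for six occasions). The function names and the 0.80 target are assumptions introduced for illustration.

```python
import math


def error_ratio(g_observed, n_observed):
    """Single-occasion error variance divided by resident (true-score) variance,
    recovered from a G coefficient that was estimated for n_observed occasions."""
    return n_observed * (1.0 - g_observed) / g_observed


def projected_g(g_observed, n_observed, n_new):
    """D-study projection: G coefficient expected when n_new occasions are averaged."""
    return 1.0 / (1.0 + error_ratio(g_observed, n_observed) / n_new)


def min_occasions(g_observed, n_observed, target):
    """Smallest whole number of occasions whose projected G reaches the target."""
    ratio = error_ratio(g_observed, n_observed)
    return math.ceil(target / (1.0 - target) * ratio)


G_OBSERVED = 0.647  # G coefficient reported for the G study
N_OBSERVED = 6      # number of form levels used in the G study
TARGET = 0.80       # assumed threshold for "acceptable" reliability

for n in (1, 6, 10, 14):
    print(n, "occasions -> G =", round(projected_g(G_OBSERVED, N_OBSERVED, n), 2))
print("occasions needed for target:", min_occasions(G_OBSERVED, N_OBSERVED, TARGET))
```

Under these assumptions the projection for 14 occasions rounds to 0.81, which agrees with the D study result reported above, and 14 is also the smallest number of occasions that reaches the assumed 0.80 target.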

Table 1. D Study Results for S×O Design
Based on table 2, the acceptability of the P-MEX was measured by a post-intervention questionnaire. The results indicated that all of the participants were satisfied with the content of the P-MEX. In this regard, 56.6% of the residents responded "strongly agree", whereas 43.3% selected the option "agree". On the other hand, 55.6% and 44.4% of the faculties chose "strongly agree" and "agree", respectively.

EM= emergency ward; * = Faculties; ** =Residents
As presented in this table, the majority of the participants "agreed" or "strongly agreed" that the P-MEX was easily administered in EM clinical settings, and confirmed the adequacy of the time allocated for completing the questionnaire. Nevertheless, it seems that in some cases, the assessment process was negatively affected by the heavy workload in overcrowded EM settings. In addition, it was found that feedback was recorded in only 12% of the P-MEX forms.
Moreover, a small number of residents reported only receiving general verbal comments on their performance. Most of the residents and some of the faculties believed that raters' prior knowledge about resident's performance creates a positive or negative halo, influencing the grading of the latter's professional behavior. They also mentioned the effect of the quality of the relationship with raters on the rating scores. Furthermore, more than half of the faculty members preferred the indirect observation of professional behavior in which residents are unaware that they are being observed; most of the residents, however, selected the options "disagree" and "strongly disagree" regarding this statement.

Discussion
To the best of our knowledge, this was the first psychometric study of the P-MEX in the EM clinical setting. In total, two items were identified as "unfitting and problematic" in the divergent validity analysis. Items 2 and 23 were removed due to the error correlation observed for similar wording and non-applicability. Moreover, item 12 was only loaded on the interpersonal factor. Finally, data were fitted to the proposed model by removing two items.
In the present study, when assessments were performed during the day when patient flow was lowest, time was not a significant issue for either the faculties or the residents. However, they believed that situations with clinical overload or high stress conflicted with the implementation of the P-MEX, since assessment had an adverse effect on the patient care process in life-threatening situations.
Providing feedback was the most significant challenge faced by assessors in implementing P-MEX in the present study. While feedback is an essential component of this formative assessment, provision of feedback on the observed clinical performance was inadequate. Since professionalism is subjective in nature, different assessors judge behaviors in different ways and may give different feedback, so residents are likely to view a low score and constructive feedback as unfair. In interviews, assessors expressed their interest in providing feedback but had concerns about the emotional and defensive reactions of residents to criticism, which could lead to poor performance in clinical settings.
Moreover, they believed that it could potentially cause tension in the supervisory relationship. The working relationship between faculties and residents over an extended period caused leniency bias in ratings in this face-to-face assessment. To address the dilemma existing between the necessity of providing feedback and preventing tension in the busy and stressful emergency setting, assessors suggested the anonymity of raters whereby residents would be aware of the scores and feedback but the identity of raters would remain confidential.
Nevertheless, the residents stated that they were enthusiastic about having an opportunity to learn from feedback and even criticism because it made them understand expectations and identify their own weak points. These findings are consistent with those of Colletti et al. (22), who found that while medical students desire more timely, direct observation and feedback on their clinical performance, faculties are unwilling to point out students' weaknesses face-to-face, particularly when it involves negative feedback, resulting in score inflation. Therefore, there is an obvious need for residents to improve their feedback solicitation skills, and for faculty members to develop their observation and feedback skills with an emphasis on creating a feedback-friendly environment and professional support with mutual trust.
Another phenomenon observed in the present study was that the mean time of observation was about one hour and a half.
Although all raters were instructed on the principles of this assessment, a few were still unfamiliar with the P-MEX instrument and its ultimate goal, which is formative coaching rather than assessment, and tended to implement it using the classical global rating methods. They applied the instrument for assessing residents in one shift using the multiple mini-observations technique for the completion of each P-MEX, so that their

Conclusion
According to the results of the present study, the P-MEX is a valid instrument on the condition that several modifications are made, including removal and addition of some items. Moreover, the reliability and feasibility of the instrument were confirmed in EM settings. While the P-MEX was highly accepted by residents, faculties were not initially comfortable with the instrument. It became progressively easier as the assessors observed that residents were showing more interest in receiving and soliciting feedback. To accurately assess professionalism among residents, we need to go beyond traditional methods. If consensus is achieved on the fact that the importance of any professionalism assessment lies in professional identity formation, educational goals should be modified so that challenges in the emergency ward, such as heavy workload and stress, become educational opportunities for residents, resulting in the development of their professional identity rather than burnout and professional insufficiency. It is believed that developing the faculty's perception of this issue with an emphasis on enhancing their knowledge and skills regarding principles of effective feedback and assessment methods in clinical settings can play an important role in the improvement of the feasibility, acceptability, and validity of the P-MEX.