The impact of interviewer characteristics on residency candidate scores in Emergency Medicine: a brief report

Background: At the conclusion of residency candidate interview days, faculty interviewers commonly meet as a group to reach conclusions about candidate evaluations based on shared information. These conclusions ultimately translate into rank list position for The Residency Match. The primary objective was to determine whether the post-interview discussion influences the final scores assigned by each interviewer, and to investigate whether interviewer characteristics are significantly associated with the likelihood of changing a score. Based on Foucault's 'theory of discourse' and Bourdieu's 'social capital theory,' we hypothesized that interviewer characteristics, and the discourse itself, would contribute to score changes after a post-interview discussion regarding emergency medicine residency candidates.

Methods: We conducted a cross-sectional observational study of scores for all candidates to a four-year emergency medicine residency program affiliated with Yale University School of Medicine during a single application cycle. The magnitude and direction of any score changes after group discussion were plotted and grouped by interviewer academic rank. We created a logistic regression model to determine the odds that candidate scores changed from pre- to post-discussion ratings in relation to specific interviewer factors.

Results: A total of 24 interviewers and 211 candidates created 471 unique interviewer-candidate scoring interactions, with 216 (45.8%) changing post-discussion. All interviewers ranked junior to professor were significantly more likely to change their score compared to professors. Interviewers who were women had significantly lower odds of changing their individual scores following group discussion (p=0.020; OR 0.49, 95% CI 0.26-0.89).

Conclusions: Interviewers with lower academic rank had higher odds of changing their post-discussion scores of residency candidates compared to professors. Future work is needed to further characterize the influencing factors and could help create more equitable decision processes during the residency candidate ranking process.


Introduction
Given the binding nature of 'The Match' in determining residency candidate-program pairings, all parties are incentivized to ensure optimum compatibility during residency recruitment season. Despite this, a validated scoring system to assess residency candidate interview performances does not exist. Interviewers should therefore consider any factors potentially influencing their scores [1][2][3][4][5][6].
A literature search of all Ovid MEDLINE(R) database entries from 1946 to November 5, 2020 (the date the search was performed) did not identify previous investigations focused on emergency medicine (EM) candidates or on the impact of a post-interview group discussion. Search terms included: bias, medical school, decision, debrief, interview. A limited number of studies have investigated interviewer characteristics and their possible impact on residency match scores. A recent study reported no significant effect of interviewer sex, faculty academic rank, or title on internal medicine candidate scoring 7 . Another found that internal medicine residents involved in interviewing consistently gave candidates more favorable scores than faculty interviewers, but that including resident interviewers did not significantly affect the initial or final rank list position of candidates 8 . Our primary objective was to determine whether the post-interview discussion influences the final scores assigned by each interviewer, and to investigate whether interviewer characteristics are significantly associated with the likelihood of changing a score. Our hypothesis was based on Foucault's 'theory of discourse' and Bourdieu's 'social capital theory' 9 . According to the theory of discourse, what a society [in this case, the interviewer group] holds true changes based on the exchange of ideas of those belonging to the society. Social capital theory describes the concept that one's social position [in this case, interviewer characteristics including academic rank] is a form of resource or commodity that can be used in times of discourse or conflict [in this case, post-interview candidate ranking] 10 . Therefore, we hypothesized that interviewer characteristics, and the discourse itself, would contribute to score changes.

Ethical considerations
The Yale University School of Medicine Institutional Review Board deemed this deidentified study "Not human research" and exempt from consent requirements (Protocol ID #2000025029, determined 2/21/2019). Specifically, no consent for publication was required because the data have been anonymized, and the data alterations have not distorted scientific meaning.

Participants
The study was conducted at a four-year Accreditation Council for Graduate Medical Education (ACGME)-accredited EM residency program affiliated with Yale University School of Medicine, a quaternary referral center in the United States. Subjects were all faculty and resident interviewers during the 2017-18 application cycle. Twenty-four interviewers and 211 candidates were included, for a total of 471 unique interviewer-candidate score pairings. Interviewers included eight senior residents, four chief residents, three clinical instructors, six assistant professors, one associate professor, and two professors of EM. Interviewers were directed to score each candidate on a scale from 0 to 8 and were provided examples of historical scores and corresponding likelihood to match. Each candidate had a maximum of one resident interviewer. Interviews conducted by the program director (PD) were not included in the data set at the recommendation of our statistician, as post-discussion changes were exceedingly rare and PD scores mirrored the final rank list very closely. Interviews from the faculty interviewer collecting the data for the study were also excluded to avoid bias. In rare instances, a faculty member was unable to attend the debriefing and thus could not provide revised scores; these interactions were excluded from the data set. Ultimately, a total of 454 candidate interviews were included in the analysis, all of which were performed in person.

Data collection
We conducted a cross-sectional, consecutive observational study to determine any change in score that resulted from the discussion session at the end of interview days. The interview structure at the study site included four one-on-one in-person interviews. After each interview, but prior to group discussion, interviewers independently scored each candidate numerically on a scale from 0 to 10. The day concluded with a closed discussion session attended by all interviewers. During this discussion, each candidate's application and interview performance were reviewed by the entire group, allowing an opportunity for shared perspectives and optional revisions to initial candidate scores. These scores were ultimately used to create the first iteration of the rank list, reviewed again at the final end-of-season discussion. Two scores were obtained for each candidate from each individual interviewer: the first immediately following the one-on-one interview, and the second following review of the candidate in the closed group discussion. A validated scoring system to assess residency candidate interview performances does not exist; this method differs from the historical practice of recording a single score after the closed group session, which was anecdotally edited with some frequency after discussion but before submission. Closed group discussions lasted approximately 10 minutes per candidate.
Data were collected by one physician author (a man). The dataset was de-identified, coded, and entered into a Microsoft Excel (RRID:SCR_016137) worksheet by the residency program coordinator (a non-physician woman), who did not participate in interviews or in the composition of any portion of this manuscript. Data from interviews conducted by the physician author who collected the data were not included in the analysis. Interviewers included peers, educators, advisees, and supervisors of the physician data collector. All participants were aware of data collection and its purpose. Gender identification was by self-report. No interviews occurred more than once. No prompts were provided by the authors, and no audiovisual recording was used in data collection. No notes other than scores were taken, and corrections of scores after final submission were not allowed. Statistical calculation indicated that our dataset was appropriately powered to draw conclusions.

Statistical analysis
We determined the odds of candidate scores changing before and after discussion in relation to specific interviewer factors using logistic regression modeling. The following variables were included in the model: interviewer academic rank, interviewer sex, score prior to the discussion, and candidate final rank group. A p-value of <.05 was chosen as statistically significant, and 95% confidence intervals (CI) were reported. We used IBM SPSS Statistics (RRID:SCR_016479) software (v.22.0, IBM Corp) to perform the statistical analyses. No funding was obtained for this work.
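The modeling approach above can be illustrated in code. The following is a minimal sketch only, not the study's actual analysis: the dataset is simulated, the covariates are simplified to two binary indicators (a hypothetical "junior rank" flag and a "woman" flag), and the fitting routine is a plain Newton-Raphson implementation rather than SPSS. Exponentiating the fitted coefficients yields the odds ratios of the kind reported in the Results.

```python
import numpy as np

def fit_logit(X, y, iters=25):
    """Fit a logistic regression by Newton-Raphson; returns coefficients."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))        # predicted probabilities
        grad = X.T @ (y - p)                        # score vector
        hess = (X * (p * (1 - p))[:, None]).T @ X   # observed information
        beta += np.linalg.solve(hess, grad)
    return beta

# Simulated interviewer-candidate interactions (illustrative only; NOT the
# study data). Junior interviewers are made more likely, and women less
# likely, to change a score, mirroring the direction of the reported ORs.
rng = np.random.default_rng(0)
n = 1000
junior = rng.integers(0, 2, n)   # 1 = interviewer ranked below professor
woman = rng.integers(0, 2, n)    # 1 = interviewer is a woman
true_logit = -0.5 + 1.0 * junior - 0.7 * woman
changed = (rng.random(n) < 1.0 / (1.0 + np.exp(-true_logit))).astype(float)

X = np.column_stack([np.ones(n), junior, woman])  # intercept + covariates
beta = fit_logit(X, changed)
odds_ratios = np.exp(beta[1:])   # exponentiated coefficients are ORs
print(odds_ratios)               # expect OR > 1 for junior, OR < 1 for women
```

With these simulated effects, the recovered odds ratios fall above 1 for junior rank and below 1 for women, matching the direction (though not the exact magnitudes) of the study's findings.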

Results
In total, 216 (45.8%) scores changed from pre- to post-discussion. Logistic regression results are summarized in Table 1 11 .
All interviewers ranked below professor were significantly more likely to change their score compared to professors. Candidates in the top two thirds of the ultimate rank list were less likely to have their score changed post-discussion compared to the bottom third (top third: OR 0.26, 95% CI 0.14-0.48; middle third: OR 0.34, 95% CI 0.20-0.59). Interviewers who were women had significantly lower odds of changing their individual scores following group discussion (OR 0.49, 95% CI 0.26-0.89) compared to men. For a graphical representation of the degree and direction of candidate score change after discussion, see Figure 1.

Discussion
Post-interview discussion resulted in an increased likelihood of score changes by all interviewers except full professors. One of the full professors in the study was the residency program director, who has a supervisory role over every interviewer within the department except the other full professor. Evaluated through Foucault's 'theory of discourse,' the findings could be explained by the idea that the power structures of the post-interview discussion influenced the outcomes: junior faculty and resident interviewers could have a conscious or unconscious desire to 'agree' with their supervisor 9 . Alternatively, they may simply have adjusted their score after reconsidering the candidate in light of other interviewers' perspectives. Another consideration, rooted in 'social capital theory,' is that residents and junior faculty are closer to candidates in career progression and so may find it easier to relate to applicants, whereas senior faculty have more life, clinical, and interviewing experience and may be less likely to be swayed by the impressions of others (i.e., the one full professor did not change their rank) 10 . Both interviewer groups use their social context to rate the candidate, and thus scores change after the discussion occurs.
In future investigations it would be interesting to include candidate demographics and non-academic-rank groupings of faculty, such as years of practice. Further work could also examine whether score changes are more likely to affect final rank list positions for particularly strong or particularly weak candidates than for those in the middle.
Given the stakes for all parties involved in the residency matching process, it is particularly important that interviewers consider all variables, including the influence of group discussion observed here, that may affect a candidate's position on the rank list.This is especially true of senior faculty who must consider their potential influence on junior interviewers, while junior interviewers should recognize their possible vulnerability to this influence.
A larger conversation by national organizations may be warranted regarding the positive and negative aspects of subjective numerical candidate ranking systems and the complexities of associated biases. This study was conducted when all interviews took place in person, so further work is needed to investigate whether these findings apply to virtual interview procedures. As mentioned in recent work, further consideration of holistic and behaviorally based interview questions and scoring systems may allow programs to better design interview assessments to match their priorities 12 . In addition to the small sample size, very few professors and associate professors were interviewing during the investigation window. Further exploration of any cumulative effect that non-PD interviewers have on final rank list position seems a worthwhile next step, since PD scores were nearly identical to final rank list positions. As previously mentioned, the PD in this case was also a professor with over twenty years of interviewing experience, which is not universally the case among training programs. A replication of this study could also consider using years of interviewing experience as a seniority variable instead of academic rank.
Interviewer reasoning for score adjustment was not evaluated in this study. Further investigation of score change rationale may clarify the influences on interviewers' decisions, though it could also be affected by reporting bias. It would be noteworthy to study any difference in reasoning between gender groups, as there was a significant difference in score change in our study.
The roles of candidate socioeconomic status, race, ethnicity, and gender identity, which have been investigated in medical school interviews as well as in several other industries, were not addressed in this study but could be an area for future investigation regarding candidate characteristics and any association with interview scores 7,8,13 .

Conclusions
Interviewers with lower academic rank had higher odds of changing their post-discussion scores of residency candidates compared to those at the professor level. Future work is needed to further characterize the influencing factors and could help create more equitable decision processes during the residency candidate ranking process.

The authors have evidence supporting the assertion that scores by junior faculty are more likely to change. The question remaining for me is whether the senior evaluators (including the PD, who is one of the professors) have a post-hoc opportunity to influence the rank list, thereby making them less likely to change the actual score in the first iteration of the rank list upon which the analysis is based. The authors mention a final "end-of-season discussion" which may allow further modification. As such, if the discussion session is seen as an information-gathering exercise for the PD and other senior faculty, who may have a later opportunity to state their views compared to the more junior interviewers, the measured difference may be less meaningful. Additional limitations are the small number of faculty involved and the single-institution nature of the study, limiting generalizability. These types of complexities are difficult to account for in a limited setting with a small number of faculty.
I fully agree with the authors' suggestion for further studies to evaluate the impact of bias and discussion on rankings for all candidates. This study, despite limitations in generalizability, is helpful as an exploration into this important area of study. Given that these discussions and ranking adjustment processes are likely universal in residency training programs, more studies like this, and attempts at understanding the impact of social dynamics on interviewee evaluations, are needed.

Are sufficient details of methods and analysis provided to allow replication by others? Partly
If applicable, is the statistical analysis and its interpretation appropriate? Defer to statistician

Have any limitations of the research been acknowledged? Yes
Are all the source data underlying the results available to ensure full reproducibility? Yes

Are the conclusions drawn adequately supported by the results? Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Medical Education, Learner Development, Growth Mindset, Coaching
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Version 1
Reviewer Report 30 January 2024
https://doi.org/10.21956/mep.21143.r35732
© 2024 Hopson L. This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Laura R Hopson
Department of Emergency Medicine, University of Michigan, Ann Arbor, Michigan, USA

This is an interesting study, and I will admit that I have contemplated (but never executed) a similar approach to understanding the role of group dynamics in the assignment of interview scores. I appreciate the anchoring of the study in theory and the attempt to explain behaviors.
There are several areas that I believe would benefit from additional attention from the authors in order to present a well-rounded perspective on their data and its analysis.
First, these interactions were, almost certainly given their dates, in-person rather than virtual, for both the interview and the scoring discussion. This has the potential to be a significant limitation given that the analysis relies on social interactions. I agree that a single "validated scoring system does not exist" for residency interviews, as noted in the introduction. However, one such tool was published for EM by Hopson et al. (2019) 1 (self-citing). There are also strategies that can increase reliability, such as behaviorally based interview questions and behaviorally anchored scoring systems. In addition, holistic review allows programs to design interview assessments to map to program priorities. A discussion of these would, at a minimum, add richness to the discussion. I would also appreciate seeing the scoring tool shared, along with a brief discussion of its development and the validity evidence behind it, to strengthen this work.
Data collection processes are clearly reviewed and appear appropriate. The statistical analysis generally appears appropriate. However, I am worried there may be significant missing data and would like to see how these data were treated. Four interviews per candidate with 211 candidates should yield almost 850 data points rather than the 471 included. I would like to see either a clarification of this or an accounting for an almost 50% loss of data.
While it may be beyond the scope of this manuscript, there are some interesting nuances in the data which may merit further investigation: the marked variability in group members' propensity to change, which appears particularly prominent among the residents and chief residents in Figure 1. In addition, the degree of influence of the discussion is not linearly related to academic rank. Both of these make me wonder if another confounder, such as years of interview experience, could be involved.
The authors also have the potential to propose interventions to mitigate the influence of senior members during discussion and I would like to see them add that to this report in the discussion.
There is a question as to whether the methods allow the study question to be adequately addressed. Given that the more senior evaluators (including the PD, who is one of the professors) may have more insight into each candidate than the more junior interviewers, there are alternative explanations for the differences in rating changes. There may be other reasons why the more senior faculty do not change ratings. It is also difficult to draw conclusions when comparing small numbers (in this case, an "n" of 2 full professors).
Additionally, the statistical description is not fully clear and requires further explanation. The methods state that a logistic regression was used, without further explanation. The adjusted ORs reported are all within the 95% confidence intervals for all of the analyses cited and therefore would presumably not reject the null hypothesis in this instance. Unless I am missing something, it is quite unclear how these are statistically significant outcomes. This is altogether not surprising given that the comparison is potentially underpowered.
This is an important study and would be interesting if expanded to include more individuals such that statistically significant outcomes could be determined. As it stands, given the above limitations, it is unclear what to take away from this manuscript and whether the authors' conclusions can be accepted.

Christie Lech
Weill Cornell Medicine, New York, NY, USA

It is important to also comment on how the interviewers were trained to interview the candidates and how they were trained to score them, as this is also an area for potential bias. In addition, years of experience may come into play. The authors could also touch more on the significance of this work for future outcomes, such as matching of highly ranked residents.

Is the work clearly and accurately presented and does it cite the current literature? Yes
Is the study design appropriate and does the work have academic merit? Partly

Are sufficient details of methods and analysis provided to allow replication by others? Yes
If applicable, is the statistical analysis and its interpretation appropriate? Yes

Have any limitations of the research been acknowledged? Partly
Are all the source data underlying the results available to ensure full reproducibility? Yes

Are the conclusions drawn adequately supported by the results? Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Medical education
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
Thank you for taking the time to review our work. A new version has been submitted. We have added a sentence regarding scoring training in the first paragraph of the Methods section. We agree that years of experience would be interesting and important and have attempted to include it in the third-to-last paragraph of the Discussion section.
Competing Interests: No competing interests were disclosed.

Figure 1. Change in interview candidate scores following group interviewer discussion.

© 2023 Lech C. This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Table 1. OR of interviewers changing candidate scores following group discussion. (Abbreviations: n = number; OR = odds ratio; CI = confidence interval.)