Resident Operative Autonomy and Attending Verbal Feedback Differ by Resident and Attending Gender

Objectives: This study tests the null hypotheses that overall sentiment and gendered words in verbal feedback and resident operative autonomy relative to performance are similar for female and male residents. Background: Female and male surgical residents may experience training differently, affecting the quality of learning and graduated autonomy. Methods: A longitudinal, observational study using a Society for Improving Medical Professional Learning collaborative dataset describing resident and attending evaluations of resident operative performance and autonomy and recordings of verbal feedback from attendings from surgical procedures performed at 54 US general surgery residency training programs from 2016 to 2021. Overall sentiment, adjectives, and gendered words in verbal feedback were quantified by natural language processing. Resident operative autonomy and performance, as evaluated by attendings, were reported on 5-point ordinal scales. Performance-adjusted autonomy was calculated as autonomy minus performance. Results: The final dataset included objective assessments and dictated feedback for 2683 surgical procedures. Sentiment scores were higher for female residents (95 [interquartile range (IQR), 4–100] vs 86 [IQR 2–100]; P < 0.001). Gendered words were present in a greater proportion of dictations for female residents (29% vs 25%; P = 0.04) due to male attendings disproportionately using male-associated words in feedback for female residents (28% vs 23%; P = 0.01). Overall, attendings reported that male residents received greater performance-adjusted autonomy compared with female residents (P < 0.001). Conclusions: Sentiment and gendered words in verbal feedback and performance-adjusted operative autonomy differed for female and male general surgery residents. These findings suggest a need to ensure that trainees are given appropriate and equitable operative autonomy and feedback.


INTRODUCTION
As graduate medical education shifts towards competency-based models, meaningful attending-to-resident feedback and appropriate provision of graduated autonomy are imperative. Female and male residents may experience training differently, affecting the quality of learning and graduated autonomy. [1][2][3][4] Implicit bias may contribute to observed discrepancies in the way female and male residents experience surgical training. 5,6 It has been suggested that either gender bias in surgical resident evaluation or the way in which female and male residents experience surgical training is responsible for the observation that female trainees have significantly lower attainment of Accreditation Council for Graduate Medical Education milestones for several subcompetencies. 7 However, data regarding gender differences in the provision of operative autonomy and equity in performance evaluations are conflicting. [8][9][10][11] It also remains unclear whether the content of verbal feedback from attendings to residents differs by resident and attending gender, as has been described for written performance evaluations. 12 The Society for Improving and Measuring Procedural Learning (SIMPL) is increasingly being used to better understand the progression of intraoperative resident autonomy and performance, offering unique opportunities to systematically assess associations among surgical resident and attending gender, operative performance, and operative autonomy. 13 This study evaluates differences in narrative feedback and performance assessments recorded in the SIMPL app with the null hypotheses that overall sentiment and gendered words in verbal feedback and resident operative autonomy relative to performance are similar for female and male residents.

METHODS
Study Design and Data Source
This longitudinal, observational study used an existing multicenter dataset maintained by the SIMPL Collaborative, which contains detailed information regarding resident and attending surgeon evaluations of a sample of residents' operative performance and autonomy as well as verbal feedback from attendings to residents. This study includes both objective and subjective evaluations of resident performance and autonomy in the form of validated ordinal scales as well as verbal feedback recorded on mobile devices, as these elements provide context for one another and are presented together to surgical residents on the SIMPL platform to offer a thorough and holistic evaluation. The University of Florida Institutional Review Board approved this study (IRB No. 202101698). From 2016 to 2021, the SIMPL app (described in detail at: https://www.simpl.org/simpl, accessed January 13, 2022) was used to generate records representing individual surgical procedures performed by residents at 54 general surgery residency training programs. Variables representing each case were trainee postgraduate year and gender (available classifications on the SIMPL app were female, male, not known, or not applicable; the dataset was filtered to include female and male genders), attending gender, description of procedure type (eg, "Excision soft tissue mass, neck," "Roux-en-Y gastric bypass (laparoscopic)"), debriefing narratives dictated by attending surgeons, and objective assessments of resident performance and autonomy, assessed separately by attending surgeons and by residents. Resident performance was quantified by the validated Performance Scale, consisting of 5 levels: Unprepared/Critical Deficiency, Inexperienced with Procedure, Intermediate Performance, Practice-Ready Performance, and Exceptional Performance. 13,14 Resident autonomy was quantified by the Zwisch scale, consisting of 4 levels (Show and Tell, Active Help, Passive Help, or Supervision Only).
The Zwisch scale has been shown to be a valid and reliable way to differentiate between levels of faculty guidance provided (and its inverse, resident autonomy granted) during an operation. 13,15,16 The dataset contained information representing 140,420 evaluations submitted by residents, 98,784 evaluations submitted by attendings, and 4313 dictations submitted by attendings. After excluding blank or duplicated dictations (N = 586) and excluding attending evaluations that were missing dictations (N = 61,961) or objective assessments of resident performance or autonomy (ie, the Performance Scale and Zwisch scale, N = 3226), all remaining evaluations and dictations were merged on common case identification numbers. Cases were then excluded if resident gender was "not known" or "not applicable." To minimize heterogeneity in the study population, cases were also excluded if the trainee was a preliminary resident or fellow or was in postgraduate year 6 or greater. The final dataset included 2683 cases with complete debriefing narratives and objective assessments, as shown in Supplemental Digital Content 2 (http://links.lww.com/AOSO/A204).
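The exclusion and merge steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the SIMPL export schema is not described in the text, so every field name (`case_id`, `text`, `performance`, `autonomy`, `resident_gender`, `trainee_type`, `pgy`) and the definition of a duplicate dictation (same case and same text) are hypothetical.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]  # hypothetical record layout, not the SIMPL schema


def build_cohort(evaluations: List[Record], dictations: List[Record]) -> List[Record]:
    """Apply the exclusion steps in the order the text lists them."""
    # Step 1: drop blank dictations and duplicates (assumed: same case, same text).
    seen = set()
    kept_dictations: Dict[Any, str] = {}
    for d in dictations:
        text = (d.get("text") or "").strip()
        key = (d["case_id"], text)
        if not text or key in seen:
            continue
        seen.add(key)
        kept_dictations.setdefault(d["case_id"], text)

    cohort = []
    for ev in evaluations:
        # Step 2: require a dictation and both objective assessments.
        if ev["case_id"] not in kept_dictations:
            continue
        if ev.get("performance") is None or ev.get("autonomy") is None:
            continue
        # Step 3: keep only residents recorded as female or male.
        if ev.get("resident_gender") not in ("female", "male"):
            continue
        # Step 4: exclude preliminary residents, fellows, and PGY 6 or greater.
        pgy = ev.get("pgy")
        if ev.get("trainee_type") in ("preliminary", "fellow") or pgy is None or pgy >= 6:
            continue
        # Merge the evaluation and dictation on the common case identifier.
        cohort.append({**ev, "dictation": kept_dictations[ev["case_id"]]})
    return cohort
```

Keeping each exclusion as its own step makes it straightforward to count how many cases each criterion removes, which is the information a flow diagram of the cohort would report.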

Natural Language Processing of Dictated Feedback
This study builds on previous work demonstrating that Natural Language Processing (NLP) can accurately classify feedback quality among 3 surgical residency training programs. 17 Operative performance assessments by faculty were analyzed with NLP techniques, which allowed for the systematic evaluation of word use to understand overall sentiment and use of gendered language in dictated feedback. Sentiment analysis is a subtask of NLP encompassing the extraction of opinions, evaluations, attitudes, and emotions from written language. 18 This study applies advanced sentiment analysis techniques using deep learning methods to assess the degree of positivity or negativity in dictated feedback, which offers the potential advantage of greater contextual understanding of natural language compared with classical approaches to sentiment analysis that use linguistic analysis or rule-based phrase matching against predefined positive and negative word lists. 19 Pre-trained language models, in which machine learning models are trained on large corpora of generalized human language (eg, the entirety of Wikipedia) and then fine-tuned on a domain and task-specific dataset, have outperformed classical rule-based models for several NLP tasks. [20][21][22] The sentiment analysis prediction pipeline was based on a previously established deep learning model that was fine-tuned on the Stanford Sentiment Treebank v2 23 dataset of movie reviews for predicting positive and negative sentiment from text. 24,25 Sentiment analysis experiments were conducted using PyTorch and the Huggingface library, truncating dictation transcriptions to 128 tokens (ie, sequences of 1 or more text characters that, when grouped, convey meaning, and represent the smallest unit for processing; in the present study, most tokens were words). 26 For each transcription, the deep learning model predicted a sentiment label (positive or negative) and a corresponding sentiment score ranging from 0.5 to 1.0.
For dictations predicted to be negative, sentiment scores were subtracted from 1.0 so that all transcription scores would range from 0 (most negative) to 100 (most positive). For example, a dictation with positive sentiment and a sentiment score of 0.7 would have a final score of 0.7; a dictation with negative sentiment and a sentiment score of 0. This study also classifies gendered words, including both agentic (male-associated) and communal (female-associated) words, adapted from a previously established classification system that has been cited in peer-reviewed literature more than 400 times, 27-31 as listed in Supplemental Digital Content 4 (http://links.lww.com/AOSO/A204). Each dictation was analyzed using this word bank to determine whether a dictation contained any agentic, communal, or gendered words as well as the proportion of adjectives that were agentic, communal, or gendered words relative to the number of adjectives in the dictation. Although gendered words may be adjectives or nouns, in dictated feedback almost all gendered words were used as adjectives.
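A minimal Python sketch of the two per-dictation computations described above follows. It rests on several assumptions that are not in the text: multiplying by 100 to reach the reported 0 to 100 range is inferred rather than stated, the word sets are toy placeholders (the study's lists are in Supplemental Digital Content 4 and are not reproduced here), words are matched exactly rather than by stem, and the adjectives are assumed to come from a separate part-of-speech tagging step.

```python
import re
from typing import Dict, FrozenSet, Iterable, Union


def rescale_sentiment(label: str, score: float) -> float:
    """Map a (label, score) prediction with score in [0.5, 1.0] onto 0-100."""
    if not 0.5 <= score <= 1.0:
        raise ValueError("score is expected to lie in [0.5, 1.0]")
    if label.lower() == "negative":
        score = 1.0 - score  # negative predictions fall below 0.5
    return 100.0 * score  # assumption: the 0-1 value is scaled by 100


# Toy word lists for illustration only (hypothetical, not the study's lists).
AGENTIC: FrozenSet[str] = frozenset({"confident", "independent", "decisive"})
COMMUNAL: FrozenSet[str] = frozenset({"kind", "helpful", "supportive"})


def gendered_word_summary(
    text: str,
    adjectives: Iterable[str],
    agentic: FrozenSet[str] = AGENTIC,
    communal: FrozenSet[str] = COMMUNAL,
) -> Dict[str, Union[bool, float]]:
    """Flag gendered words in a dictation and compute their share of adjectives."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    adjs = [a.lower() for a in adjectives]
    n_agentic = sum(a in agentic for a in adjs)
    n_communal = sum(a in communal for a in adjs)

    def share(count: int) -> float:
        # Proportion relative to the number of adjectives; 0 if there are none.
        return count / len(adjs) if adjs else 0.0

    return {
        "has_agentic": bool(words & agentic),
        "has_communal": bool(words & communal),
        "has_gendered": bool(words & (agentic | communal)),
        "prop_agentic": share(n_agentic),
        "prop_communal": share(n_communal),
        "prop_gendered": share(n_agentic + n_communal),
    }
```

Under these assumptions, a positive prediction with score 0.7 maps to 70, and any negative prediction maps below 50, so higher values always mean more positive feedback.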

Statistical Analysis
NLP and statistical analyses were performed with Python 3.8.8 software. Discrete variables were compared by Fisher exact test and presented as raw number with percentages. Continuous variables were compared by the Kruskal-Wallis test and presented as median values with interquartile ranges (IQRs). All statistical tests were 2-sided with alpha = 0.05.
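For readers who want to see what these two tests compute, the sketch below implements a two-sided Fisher exact test for a 2x2 table and a tie-corrected Kruskal-Wallis test for two groups using only the Python standard library. The function names are illustrative, and this is not the authors' code; the text does not say which Python packages were used.

```python
from math import comb, erfc, sqrt
from typing import Sequence, Tuple


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    support = range(max(0, col1 - row2), min(row1, col1) + 1)
    # Sum every table that is no more likely than the observed one.
    p_value = sum(p for p in map(prob, support) if p <= p_observed * (1 + 1e-7))
    return min(1.0, p_value)


def kruskal_wallis_two_groups(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Tie-corrected H statistic and p-value for two independent samples."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n = len(pooled)
    rank_sums = [0.0, 0.0]
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j for tied values
        for k in range(i, j):
            rank_sums[pooled[k][1]] += avg_rank
        t = j - i
        tie_term += t**3 - t
        i = j
    correction = 1 - tie_term / (n**3 - n)
    if correction == 0:
        raise ValueError("all values are identical")
    h = 12 / (n * (n + 1)) * (rank_sums[0] ** 2 / len(x) + rank_sums[1] ** 2 / len(y))
    h = max((h - 3 * (n + 1)) / correction, 0.0)
    # With two groups H has 1 degree of freedom, so the chi-square
    # survival function reduces to erfc(sqrt(H / 2)).
    return h, erfc(sqrt(h / 2))
```

In practice, `scipy.stats.fisher_exact` and `scipy.stats.kruskal` provide the same tests; the hand-written versions are shown only to make the calculations explicit.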
Comparisons of cases performed by female versus male residents were made for case characteristics, resident postgraduate year, total number of evaluations per resident for the index case (eg, if the case was a laparoscopic cholecystectomy and the resident had 2 prior evaluations for laparoscopic cholecystectomy, this number would be 3) and for all types of cases (eg, if the resident had 15 total prior evaluations, this number would be 16), and resident and attending assessments of case complexity, resident performance, and resident autonomy. The proportion of gendered words in dictated feedback was compared for female versus male residents, with subgroup analyses of dictations by female versus male attendings.
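The two evaluation counts described above are running totals that include the current case. A short sketch, assuming evaluations arrive in chronological order as hypothetical `(resident_id, procedure)` pairs:

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def number_evaluations(evaluations: Iterable[Tuple[str, str]]) -> List[Tuple[int, int]]:
    """Return (count for this procedure, count for all procedures) per evaluation.

    Both counts include the current evaluation, so a resident with 2 prior
    evaluations for the same procedure gets 3 for the index case.
    """
    per_procedure = defaultdict(int)
    overall = defaultdict(int)
    counts = []
    for resident, procedure in evaluations:
        per_procedure[(resident, procedure)] += 1
        overall[resident] += 1
        counts.append((per_procedure[(resident, procedure)], overall[resident]))
    return counts
```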
To test the hypothesis that resident operative autonomy relative to performance is similar for female and male residents, it was first necessary to generate a statistical representation of the balance between autonomy and performance for each individual case. This was performed by converting the Performance Scale and Zwisch scale to ordinal scales with maximum score 5 (Performance Scale: exceptional = 5, practice-ready = 4, intermediate = 3, inexperienced = 2, critical deficiency = 1; Zwisch scale: supervision only = 5, passive help = 4, active help = 3, show & tell = 2), and subtracting the performance score from the autonomy score; these scores were compared between female and male residents. Additional analysis included subgroup analyses of dictations by female versus male attendings.
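The score described above can be written out directly. The numeric codes below are the ones given in the text; the dictionary keys and the function name are illustrative choices.

```python
# Ordinal codes as stated in the text (both scales share a maximum of 5).
PERFORMANCE = {
    "critical deficiency": 1,
    "inexperienced": 2,
    "intermediate": 3,
    "practice-ready": 4,
    "exceptional": 5,
}
ZWISCH = {
    "show & tell": 2,
    "active help": 3,
    "passive help": 4,
    "supervision only": 5,
}


def performance_adjusted_autonomy(performance: str, autonomy: str) -> int:
    """Autonomy score minus performance score for a single case."""
    return ZWISCH[autonomy.lower()] - PERFORMANCE[performance.lower()]
```

With these codes the score ranges from -3 (exceptional performance with show & tell) to 4 (critical deficiency with supervision only). A value of 0 means autonomy and performance sit at the same ordinal level, and positive values mean more autonomy than the performance rating alone would suggest.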
Due to the observation that male residents performed a greater proportion of cases in the "Hardest 1/3" category of case complexity, subgroup analyses excluded cases in the "Hardest 1/3" category, as shown in Supplemental Digital Content 5 and 6 (http://links.lww.com/AOSO/A204).

RESULTS
Resident, Attending, and Case Characteristics
Forty-four percent of cases were performed by female residents (N = 1166; Table 1). Seventy-seven percent of cases were performed by residents at postgraduate year 3 or higher.
Postgraduate year levels were similar for female and male residents. The total number of all previous evaluations per resident was higher for male residents  vs 16 [IQR 7-33]; P = 0.005). The number of previous evaluations for the case being evaluated was one (IQR 1-2) for both female and male residents (P = 0.80). SIMPL evaluations could be initiated by either residents or attendings; most evaluations were initiated by residents, and a greater proportion of evaluations were initiated by male residents (61% vs 57%; P = 0.04).
The top 3 most common procedures were laparoscopic cholecystectomy, open inguinal hernia repair, and laparoscopic appendectomy. For 18 of the top 20 most common procedures, there were similar proportions of cases performed by female and male residents. Male residents performed a greater proportion of hiatal hernia repairs (1.6% vs 0.5%; P = 0.006); female residents performed a greater proportion of breast excisional biopsies (1.5% vs 0.7%; P = 0.03). Greater proportions of cases classified as "Hardest 1/3" were performed by male residents as compared with female residents, according to both resident (15% vs 12%; P = 0.01) and attending (27% vs 24%; P = 0.05) assessments of case complexity, as shown in Table 1. Case characteristics for a subset of "Easiest 1/3" and "Average" complexity cases were otherwise similar to case characteristics in the primary analysis, as summarized in Supplemental Digital Content 5 (http://links.lww.com/AOSO/A204).
Female residents had higher sentiment scores (95 [IQR 4-100] vs 86 [IQR 2-100]; P < 0.001). A greater proportion of cases performed by female residents had an overall positive sentiment (62% vs 57%; P = 0.01). There were no gender differences in the total number of words, adjectives, or communal words used in dictated feedback. Gendered words were present in a greater proportion of dictations for female residents (29% vs 25%; P = 0.04); this difference was attributable to male attendings using agentic (male-associated) words in a greater proportion of dictations for female residents (28% vs 23%; P = 0.01). For cases by male attendings, the proportion of gendered words in dictations was greater for female residents (0 [IQR 0-5] vs 0 [IQR 0-2]; P = 0.02). Characteristics and results from subgroup analyses for cases by male attendings only and cases by female attendings only are summarized in Supplemental Digital Content 9-12 (http://links.lww.com/AOSO/A204). For cases by male attendings, objective assessments of resident performance and autonomy mirrored the primary analysis; sentiment scores were higher for female residents (90 [IQR 3-100] vs 79 [IQR 2-99]; P = 0.02), but proportions of cases with overall positive sentiment were similar for female and male residents. For cases by female attendings, operative performance and autonomy were similar for female and male residents, sentiment scores were higher for female residents (99 [IQR 52-100] vs 96 [IQR 4-100]; P < 0.001), and proportions of cases with overall positive sentiment were higher for female residents (75% vs 63%; P = 0.005).
Associations between attending assessments of resident autonomy and resident performance are illustrated in Figure 1. When considering all cases performed by both female and male attendings, progressive increases in resident performance were associated with progressive increases in resident autonomy. Overall, male residents had greater autonomy relative to performance compared with female residents, as determined by subtracting the ordinal performance score for each case from the ordinal autonomy score for the same case and comparing aggregate results by resident gender (P < 0.001). Male residents had greater autonomy relative to performance compared with female residents when the attending was male (P = 0.004) and when the attending was female, although the difference for female attendings was not statistically significant (P = 0.06). When high-complexity cases were excluded, again, male residents had greater autonomy relative to performance compared with female residents (P = 0.004); this was true whether the attending was female (P = 0.04) or male (P = 0.03; Supplemental Digital Content 6, http://links.lww.com/AOSO/A204). There were no significant differences in autonomy relative to performance at individual performance levels, as suggested by overlapping 95% confidence intervals.

DISCUSSION
These findings suggest sentiment and gendered words in dictated feedback from attendings to residents and alignment between resident operative performance and autonomy may be different for female and male residents. Overall, attendings reported that male residents were granted greater autonomy relative to their operative performance compared with female residents. Surprisingly, the increased use of gendered words in dictated feedback for female residents was due to greater use of agentic (traditionally male-associated) words. Female residents had higher sentiment scores and overall sentiment positivity compared with male residents, which was more evident in female attending verbal feedback. These findings occurred in the absence of differences in resident postgraduate year or case complexity. There were subtle differences in the types of cases performed by female and male residents, but these cases represented a small proportion of the dataset. Male residents had more total prior evaluations for cases of any type, although the authors are unaware of any evidence that the total number of cases performed by graduating residents differs between female and male residents.
A recent study used NLP to detect linguistic differences and gender bias in letters of recommendation written for general surgery residency applicants and found that letters of recommendation for females contained more gendered wording compared with letters for males, and that female applicants who were described with male-associated words had higher sentiment scores, consistent with the present study. 29 When female attending verbal dictations were evaluated herein, the use of gendered words was similar for female and male residents but sentiment score and overall positive sentiment were higher for female residents, suggesting that female attendings provided comparatively gender-neutral verbal feedback but were verbally more positive for female residents. The available data do not allow for understanding of why female attendings provided comparatively gender-neutral feedback but with greater positivity for female residents. These findings could reflect a progression towards gender equity in surgery and "lift as you climb" movements such as online campaigns put forth by the Association of Women in Surgery to connect female surgeons with one another. 32 Importantly, while positive feedback and encouragement may serve to build confidence and reinforce good habits, negative feedback is occasionally necessary for growth, and should be valued by residents and attendings alike, regardless of gender. Too much positivity, unanchored to negativity and sober recognition of challenges and opportunities for growth, has been associated with lower business performance. 33,34 Maintaining optimal levels of positivity (not too much, not too little) has the potential to cultivate grit and improve human performance for complex tasks like surgery, suggesting that positivity in dictated feedback is relevant to the quest of optimizing surgical training. 33
Overall, our findings suggest that gender differences exist in the form of verbal feedback from female attendings, and underscore the importance of ensuring that trainees are given appropriate and equitable operative autonomy and feedback. Few prior studies have assessed associations between trainee gender and operative autonomy, and none report autonomy relative to performance. 8,9,[35][36][37] Three studies of thoracic and general surgery training programs similarly found that female trainees were granted less operative autonomy compared with male counterparts. 8,9,37 However, in studies using a different measurement tool to assess entrustment in the operating room, using observation and assessment by a third party, there was no association between resident gender and resident autonomy granted by faculty. 35,36 The latter studies may have been influenced by the Hawthorne effect, which could account for differences in results relative to the former studies (ie, provision of autonomy was equitable while provision of autonomy was being observed and recorded).
In 2018, the American Board of Surgery initiated the Entrustable Professional Activities (EPA) pilot project, "designed to integrate competencies and milestones to provide an evaluation platform that translates often-theoretical concepts into a tool that can be used in a single patient encounter." 38 In a recent study evaluating the effect of general surgery resident gender on EPA entrustment levels, Padilla et al 39 used NLP methods to analyze narrative comments in identifying topics correlated with resident sex. Faculty assessments showed no differences in EPA levels between female and male residents. Additionally, resident self-ratings were lower for female residents compared with male residents, a finding for which the authors describe a theoretical framework of perceived differences in autonomy. This important work showed that EPA-based evaluation, when performed by attendings, can be gender neutral, which is a powerful argument for moving surgical training toward competency-based assessment models fueled by frequent workplace-based assessment. 40 Progressive autonomy in the operating room is a useful measure of preparedness for independent practice and will remain an important assessment tool even during the planned implementation of competency-based education. This study was not designed to assess underlying causes of the observed differences in autonomy relative to performance seen between male and female residents. It is, however, important that trainees be given adequate operative autonomy throughout their training that should not differ by gender or other sociodemographic factors. One way to mitigate bias in autonomy may be to specifically articulate which procedure-specific behaviors should be displayed for residents to gain graduated levels of autonomy.
Additionally, it is imperative for educators to acknowledge that there is potential room for subjectivity and bias within assessment tools and that to create a work environment in which all trainees can succeed, these biases must be acknowledged and mitigated. For SIMPL, this could be accomplished by expanding the existing guidance for providing verbal feedback, which is currently provided to attendings in training sessions and within the mobile application.

Limitations
The determination of gender was based on a binary response option as designed by the SIMPL application, and although this may be a gross representation of the breakdown of male and female residents in a program, it may not be representative of transgender and gender nonconforming individuals. Further, data collected using the SIMPL app represent a small portion of observed clinical performances, and trainees may disproportionately solicit feedback for cases in which they performed well or in which they are interested in improving. Therefore, the cases included in this study may not truly be representative of the full breadth of experience of surgical trainees. A recent study using SIMPL data confirmed that there was a strong correlation between SIMPL procedure frequencies and Accreditation Council for Graduate Medical Education case log procedure frequencies. 41 The present study only includes cases logged by attendings and residents if attending verbal dictations were provided, which may generate reporting bias. The present study included the same top 3 most common procedures (laparoscopic cholecystectomy, laparoscopic appendectomy, and open inguinal hernia repair) with similar procedure frequencies; however, there were substantial differences in the remaining procedures included in the study, likely attributable to the factors described above. Sentiment in dictations was assessed using a publicly available benchmark model; a domain-specific training dataset may lead to more accurate sentiment predictions.

CONCLUSIONS
Gendered words, especially male-associated words, were present in a greater proportion of dictations for female trainees compared with male trainees, primarily due to male attendings using male-associated words in feedback for female residents. Female residents received higher sentiment scores while male residents received greater performance-adjusted autonomy. These findings suggest the need to ensure that trainees are given appropriate and equitable operative autonomy and feedback that should not differ by gender or other sociodemographic factors. To promote equity in resident operative autonomy, we propose that attendings articulate procedure-specific behaviors that should be displayed for residents to gain graduated levels of autonomy. To promote equity in verbal feedback, we propose the implementation of specific guidance for the verbal feedback portion of SIMPL evaluations. For individual attendings, we suggest granting residents autonomy that corresponds to their operative performance and providing equitable, formative feedback that builds better surgeons, regardless of gender.