Pediatric residency milestone performance is not predicted by the United States Medical Licensing Examination Step 2 Clinical Knowledge

Objectives: This study aims to determine whether a correlation exists between pediatric residency applicants' quantitative scores on the United States Medical Licensing Examination Step 2 Clinical Knowledge examination and their subsequent performance in residency training based on the Accreditation Council for Graduate Medical Education Milestones, which are competency-based assessments that aim to determine residents' readiness for unsupervised practice after postgraduate training. No previous literature has correlated Step 2 Clinical Knowledge scores with pediatric residency performance assessed by Milestones. Methods: In this retrospective cohort study, United States Medical Licensing Examination Step 2 Clinical Knowledge scores and Milestones data were collected for all 188 residents enrolled in a single categorical pediatric residency program from 2012-2017. Pearson correlation coefficients were calculated among available test and Milestone data points to determine correlation between test scores and clinical performance. Results: No significant correlation was found between quantitative scores on the Step 2 Clinical Knowledge exam and average Milestones ratings (r = -0.1 for post-graduate year 1 residents; r = 0.25 for post-graduate year 3 residents). Conclusions: These results demonstrate that Step 2 scores have no correlation with success in residency training as measured by progression along competency-based Milestones. This finding should limit the weight residency programs place on quantitative Step 2 scores when ranking residency applicants. Future studies should include multiple residency programs across multiple specialties to make these findings more generalizable.


Introduction
The United States Medical Licensing Exam (USMLE) determines whether physicians-in-training qualify to practice medicine in the United States. The exam "assesses a physician's ability to apply knowledge, concepts, and principles, and to demonstrate fundamental patient-centered skills, that are important in health and disease and that constitute the basis of safe and effective patient care." 1 The USMLE is a three-step examination, with Step 1 and Step 2 Clinical Knowledge (CK) occurring during medical school and Step 3 occurring during the first year of residency training. (From 2004-2019, Step 2 had a second component called Clinical Skills, which was graded on a pass/fail basis, but this has since been eliminated.) While the stated purpose of the USMLE is to determine fitness for medical licensure, it has become common practice for residency programs to use USMLE scores, especially Step 1, as an important consideration when selecting students for interview and determining rank order in the National Resident Matching Program (NRMP) Main Residency Match [2][3][4][5]. The NRMP uses a proprietary algorithm that combines applicants' rankings of residency programs and programs' rankings of applicants to produce a "match" for residency training. As the outcome of this match is contractually binding for the duration of residency training (3 years for pediatrics), it represents a very high-stakes decision.
In February 2020, the USMLE announced that it would change reporting of students' performance on Step 1 to pass/fail 6. This decision was partly in response to concerns raised during the 2019 Invitational Conference on USMLE Scoring (InCUS) regarding the use of scaled USMLE scores as predictors of, and requirements for, residency success 7. As part of its process, InCUS considered recent data that called into question the role of scaled USMLE scores due to biases inherent in this standardized examination. One study showed consistent bias associated with race, age, and gender in USMLE Step 1 scores, with racial bias persisting across Steps 2 and 3 8. The authors cited the natural implication that this bias limits opportunity in competitive specialties, where Step scores may be used to screen large numbers of applicants.
Since that time, there has been speculation about the change in relative importance of the USMLE Step 2 CK score 9. The USMLE has announced that Step 2 CK will remain a scaled quantitative score. The natural implication of these changes is that residency programs might simply substitute the USMLE Step 2 CK scaled score for the previously scaled USMLE Step 1 score in their evaluation and ranking of applicants.
Logic and previous studies could substantiate this approach.
Step 2 CK is allegedly more clinically relevant than Step 1 10,11, which draws more on the basic science focus of the first two years of typical American undergraduate medical education curricula. Step 2 CK is also taken during or after a student's core clinical clerkships, so it could reflect knowledge gained through clinical experiences. Some studies have shown positive correlations specifically between Step 2 CK scores and in-training exam scores 12,13, certifying examinations 14,15, and survey-based residency performance 16,17. One study conducted in an Emergency Medicine residency found a statistically significant positive relationship between Step 2 CK scores and competency-based milestones ratings 18.
Conversely, some studies suggest a weaker correlation with Board passage for Step 2 CK scores than for Step 1 [19][20][21]. Another study found no relevance of Step 2 CK scores when examining clinical outcomes of discrepant radiology reports 22.
In this study, we sought to determine whether USMLE Step 2 CK scores are predictive of success in pediatric residency training. Our measure of "success" was the Accreditation Council for Graduate Medical Education (ACGME) competency-based Milestones. Milestones are reported by all accredited United States residency programs to the ACGME and are defined as "competency-based developmental outcomes (e.g., knowledge, skills, attitudes, and performance) that can be demonstrated progressively by residents/fellows from the beginning of their education through graduation to the unsupervised practice of their specialties." 23 The ACGME categorizes competencies into six Domains of Competence: Patient Care, Medical Knowledge, Practice-Based Learning and Improvement, Interpersonal and Communication Skills, Professionalism, and Systems-Based Practice. The ACGME applied 21 of these competencies to pediatric training in the initial iteration of Milestones, on which this study is based: five in Patient Care, one in Medical Knowledge, four in Practice-Based Learning and Improvement, two in Interpersonal and Communication Skills, six in Professionalism, and three in Systems-Based Practice. Milestone ratings range on a scale from 1 to 5, in increments of 0.5, with the exception of three competencies (Patient Care 4, Systems-Based Practice 1, and Systems-Based Practice 3), which are rated on a scale from 1 to 4.
As such, progression in Milestones is expected to mirror learners' progression toward unsupervised practice 24. Studies have used Milestones not only as a measurement of individual success in residency 25,26, but also as an appraisal of medical schools' preparation of students for postgraduate training 27.

Amendments from Version 1
In response to reviewers, the authors have clarified two points: about competencies that have only 4 milestones (instead of 5), and about the number of residents in the study for whom ABP certifying exam scores were available. In addition, the authors have produced supplemental figures showing correlations between standardized test scores and each competency (instead of each domain of competence, figures of which are included in the manuscript). This can be found at 10

Methods
We reviewed and collected data from the UPMC Medical Education categorical pediatric residency program in a retrospective cohort design. This study was deemed exempt from formal review by the Institutional Review Board of the University of Pittsburgh School of Medicine because of its focus on medical education and because no information was obtained from subjects that was not already available to the authors. Based at UPMC Children's Hospital of Pittsburgh, a free-standing, 313-bed tertiary care center in Pittsburgh, Pennsylvania, USA, the residency program recruits up to 30 categorical pediatric residents each year.
During the study period from 2012-2017, information was gathered on each of the 188 residents matriculated to our pediatric residency training program. We collated USMLE Step 1 and Step 2 CK scores through retrospective review of residents' previous residency applications. The American Board of Pediatrics (ABP) annually provides ABP In-Training Exam (ITE) scores for residents in post-graduate year one (PGY1) and post-graduate year three (PGY3), as well as ABP General Pediatric Certifying Examination scores after training completion, to residency program directors; three of the authors (BM, AN, SD) accessed these data through their roles as residency program directors or associate program directors. Each of these examinations yields a scaled score between 1 and 300 1. To assess the generalizability of this study population to other pediatric residents, national data were abstracted from the National Resident Matching Program (NRMP) and the ABP during the same time period 28,29.
The primary endpoints of this study were the Milestone ratings at two time points during residency training: end-of-year PGY1 milestones as a measure of initial success, and end-of-year PGY3 milestones as a measure of end-of-training attainment of skills. These milestones reflect a set of key competencies central to pediatric residency training. Milestones data are generated directly from evaluations by attending physicians, pediatric fellows, supervising residents, medical students, and nurses. The only exception is for two competencies (Systems-Based Practice 2 and Practice-Based Learning and Improvement 3) dealing with engagement in quality improvement; residency program directors assign these milestones ratings independently, based on their assessment of specific resident performance in quality improvement endeavors. These data are then combined to create an overall average Milestone rating across all competencies, as well as an average for each domain of competence. Eighteen of the competencies have 5 milestone ratings, while three (Patient Care 4, Systems-Based Practice 1, and Systems-Based Practice 3) have 4. All were averaged together equally among residents, so this would not affect correlation coefficients.
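The averaging described above can be sketched as follows. This is a minimal illustration only: the competency names and ratings are hypothetical examples, not study data, and the authors' actual code may differ.

```python
import pandas as pd

# Illustrative milestone ratings for one resident; competency and domain
# labels and the rating values are hypothetical, not taken from the study.
ratings = pd.DataFrame({
    "competency": ["PC1", "PC4", "MK1", "SBP1", "SBP3", "PROF1"],
    "domain":     ["PC",  "PC",  "MK",  "SBP",  "SBP",  "PROF"],
    "rating":     [3.0,   2.5,   3.5,   2.0,    2.5,    4.0],
})

# Overall average across all competencies; competencies rated on a
# 1-4 scale are averaged in unweighted, as described in the Methods.
overall = ratings["rating"].mean()

# Average within each domain of competence.
by_domain = ratings.groupby("domain")["rating"].mean()
print(overall, dict(by_domain))
```

Because every resident's ratings are averaged the same way, the unweighted treatment of the three 1-4 competencies shifts all residents' averages identically and does not affect correlation coefficients.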
We sought to measure whether performance on exam-based and milestone-based metrics correlated. Pairwise Pearson correlation coefficients were computed for every combination of USMLE Step 1, Step 2 CK, ITE, and milestone ratings. For each pairwise comparison, any individual missing data for either feature was excluded from that comparison. All analyses were performed in the Python programming language with the NumPy and pandas analysis libraries 30.
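A minimal sketch of this analysis in pandas is shown below; the column names and scores are illustrative assumptions, not the authors' actual variables or data.

```python
import numpy as np
import pandas as pd

# Hypothetical per-resident score table. Column names and values are
# illustrative only; NaN marks a score unavailable for that resident.
scores = pd.DataFrame({
    "step1":    [231.0, 245.0, 250.0, np.nan, 238.0],
    "step2ck":  [240.0, 252.0, 249.0, 244.0, np.nan],
    "pgy1_ave": [2.5, 3.0, 2.0, 3.5, 3.0],
    "pgy3_ave": [4.0, 4.5, np.nan, 4.0, 4.5],
})

# DataFrame.corr computes pairwise Pearson coefficients and, by default,
# excludes a row only from the comparisons in which one of the two
# features is missing -- matching the per-comparison exclusion described
# in the Methods.
r = scores.corr(method="pearson")
print(r.round(2))
```

The resulting symmetric matrix of correlation coefficients is what the heat maps in Figure 1 and Figure 2 visualize.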

Results
We examined correlations between standardized examinations before, during, and after residency training to test the hypothesis that they would be predictive of residency performance as assessed by individual pediatric Milestones data. USMLE Step 1 and Step 2 CK scores were available for 180 (97%) and 179 (96%) of residents, respectively. Our program began administering the ITE for PGY-1 residents in 2015; therefore, ITE scores were available for 74 PGY-1 residents (40% of the total sample, 100% of those eligible). In the final two years of our study period, two classes (PGY-1 and PGY-2) had not yet taken the PGY-3 ITE; therefore, ITE scores were available for 122 PGY-3 residents (65% of the total sample, 100% of those eligible). Similarly, milestone reporting became available in 2014, naturally limiting data for some residents in the study population due to the rolling annual nature of residency matriculation. Milestones data were available for 129 PGY-1 residents (69% of the total sample, 100% of those eligible) and 122 PGY-3 residents (65% of the total sample, 100% of those eligible). One hundred one residents had graduated the program and taken the ABP certifying exam at the time of data analysis (54% of the total sample, 100% of those eligible). Our program expanded its complement of residents during the study period, so a higher number of residents was included each year as the study progressed.
When a similar comparison was made with the end-of-PGY3 milestones assessment, as shown in Figure 2, again no correlation was observed between standardized examinations and milestones performance, either for the overall average or for specific domains. Of specific note, overall PGY3 milestone performance did not correlate with the ITE examination at the beginning of PGY3 (r = -0.071) or the ABP examination after completion of the PGY3 year (r = 0.071).
Demographics of the residents included in the data set are displayed in Table 1, with comparison to ABP and NRMP data for pediatric trainees during the same period and practicing pediatrician data from the ABP. The study population was mostly female (76%), consistent with the national pediatric gender distribution (71%). Residents in our program differed from average NRMP applicants during the study period in Alpha Omega Alpha medical student honor society membership (34% versus 16%) and attendance at a top-40 medical school by NIH research funding (10% versus 30%) but were similar in USMLE Step 1 and 2 scores, with overlapping interquartile ranges.

Discussion
This study demonstrates a lack of correlation between Step 2 CK scores and ACGME competency-based milestones performance in our categorical pediatric residency program. Additionally, we found no correlation of any standardized examination with residency performance; examination scores correlated only with scores on other standardized examinations. These findings are consistent with previous concerns about the limitations of USMLE scores as predictors of success in residency training 31,32. This suggests that residency programs should reconsider the influence of scaled USMLE Step 2 CK scores on the selection of residency applicants in the National Resident Matching Program.
While there is no consensus validated measure of resident performance, competency-based milestones ratings have emerged as potentially the most reliable. Several medical disciplines, including Internal Medicine 33, Family Medicine 34, Anesthesia 35, and Obstetrics/Gynecology 36, have demonstrated reliable, consistent increases in Milestone ratings with advancing years of training. We also found that, while resident performance as judged by milestones ratings did not correlate with standardized examination scores, USMLE Step 2 CK scores correlated with ITE scores in the PGY1 year and with eventual ABP General Pediatric Certifying Examination scores (Figure 1 and Figure 2). These findings are consistent with previously published studies in pediatrics 37 and other medical specialties 38,39. This suggests that previous performance on standardized examinations correlates with performance on subsequent standardized examinations, rather than with competency-based achievement. While previous authors suggest this fact alone substantiates the use of USMLE scaled scores in the selection of residency program applicants 38, our data may suggest otherwise for pediatric programs, under the established assumption that educational achievement within the ACGME six domains of competence is a better measure of physician competence than the ability to pass a standardized examination.
Previous authors have posited a similar belief. In separate publications, Prober et al. 40 and Bernstein 2 argue not only that the USMLE is being misused, but also that the misuse might be harmful. It is estimated that medical students historically would devote weeks to studying for USMLE Step 1 41. It is accepted that residency programs across specialties use examination scores not solely for the declared purpose of assessing competence for practice but also as a measure of acceptability into residency programs, including "filtering out" applicants below a minimum score in some circumstances. Our study suggests that while preparation for USMLE Step 1 may have resulted in a higher score, it does not correlate with improved performance in residency training, and may in fact function at every level to limit opportunity based on race, gender, and age. Indeed, correlations between Step 1 performance and milestones achievement in our program skewed slightly negative, an interesting finding.
With the shift of USMLE Step 1 from a scaled examination to pass/fail, residency program directors will likely look to other sources to differentiate between applicants. A rational replacement could be USMLE Step 2 CK. However, our data demonstrate that Step 2 CK scores are not a good predictor of performance in residency. Even though the correlation between Step 2 CK and milestones ratings is slightly stronger than that between Step 1 and milestones ratings, the correlation here is essentially random and uninformative for the selection of potential trainees.
These data suggest that pediatric residency program directors should exercise caution in using any scaled USMLE score in their evaluation of program applicants. The change to scoring USMLE Step 1 on a pass/fail basis will likely incite significant cultural shifts in undergraduate medical education, including even greater importance placed on USMLE Step 2 and grades in core clinical clerkships. Residency programs may need to follow suit, using these grades and other information in the Medical Student Performance Evaluation letter to determine whether to offer an applicant an interview and the applicant's subsequent rank for the National Resident Matching Program. While this process of filtering applicants through holistic review may be more arduous than screening based on a scaled test score, it will likely result in a more accurate determination of an applicant's ability to achieve success in residency.
One important caveat is that the ACGME uses Board Certifying Examination pass rates of a pediatric training program's graduates in its accreditation process for residency and fellowship programs 42. Thus, because our data suggest that scaled USMLE scores correlate with eventual board pass rates, programs are incentivized to match applicants who score well on standardized exams, even though our data also demonstrate that such scores have no correlation with actual performance as a physician.
Our study has limitations. First, this is a single-center pediatric study over a short time. We compared USMLE scores and milestones ratings only for residents who matched into our training program. We do not have access to data about applicants who matched elsewhere, or about residents in other training programs, pediatric or otherwise. This limitation is somewhat mitigated by the fact that our residents are largely representative of pediatric applicants from allopathic medical schools in the United States, as noted in Table 1. Based on these comparisons to national data, the results of this study may be generalizable, but further study in other pediatric residencies and other specialties is warranted.
Second, there is wide variability in how residency programs establish Milestones ratings for their residents. We have described our method above, which incorporates input from a variety of sources to try to minimize confounding factors and mitigate potential bias, but the collection process has not been standardized among programs. Therefore, programs that generate and/or collate Milestones data differently may see results that differ from ours.

Reviewer Report
In this study, the authors provide a reasonable rationale for their primary study question: "Are USMLE Step 2 scores predictive of success in pediatric residency training?" The main hypothesis is that USMLE Step 2 scores are significantly correlated with ACGME milestones. Secondary hypotheses include significant correlations among USMLE Step 1, USMLE Step 2, the ITE as a PGY-1, the ITE as a PGY-3, and the ABP certifying exam. This is a retrospective cohort design of 188 pediatric residents at UPMC between 2012-2017. ACGME milestones were utilized as a measure of resident "success". It is unclear how the "milestone rating score" was obtained:
- Was it an average across each of the 6 domains and then an average of all 21 sub-competencies?
- Was there any correction for the 2 milestones with a range of 1-4 versus the others with a range of 1-5?
- Were there any milestones scored as "not yet assessable"?
We would like to see these questions answered, both to guide the reproducibility of the study and to further the understanding of the meaningfulness of the final conclusions. There was a well-defined cohort at a single residency program, and subjects were at similar points in their training. Follow-up time, however, was not sufficiently long. Several secular trends occurred during the study years, which led to significant missing data:
- Milestone reporting was only available for 129/188 end-of-year PGY-1 residents.
The study design seems very appropriate, with a good sample size and consistency in obtaining the different variables such as standardized test scores, ITE, and milestone ratings. The details of the methods and analysis were clear, and the statistical analysis and its interpretation seem appropriate. The heatmaps (Figures 1 and 2) were a nice addition and a visually appealing way to understand the results.
I appreciate that the major limitations of the study were addressed. Expanding this single-center study to more institutions would be interesting, and I would like to see how/if the results from pediatric residencies differ from the results in other specialties as well. I also think the subjectivity in determining individual pediatric milestones is a major limitation here, and it was addressed. As stated in the article, the authors tried to minimize confounding factors and mitigate potential bias, but the collection process has not been standardized among programs. Other than program directors and associate program directors, scorers generally don't have access to residents' standardized test scores, so there should be little bias from that standpoint. However, milestone ratings from various individuals may range significantly, and that fact was not studied or commented on in the study. It was only stated that "milestones data are generated directly from evaluations by attending physicians, pediatric fellows, supervising residents, medical students, and nurses…These data are then combined to create an overall average Milestone rating across all competencies, as well as an average for each domain of competence." Perhaps a future step or suggestion for other studies is to look at the inter-rater reliability of the various scorers of the individual milestones.
Overall, I fully support this article with very minimal (if any) revisions. Thank you for this insightful study and for inviting me to be a reviewer.

Is the work clearly and accurately presented and does it cite the current literature? Yes
Is the study design appropriate and is the work technically sound? Yes

Are sufficient details of methods and analysis provided to allow replication by others? Yes
If applicable, is the statistical analysis and its interpretation appropriate? Yes

Have any limitations of the research been acknowledged? Yes

Are all the source data underlying the results available to ensure full reproducibility? Yes

Are the conclusions drawn adequately supported by the results? Yes

Competing Interests: No competing interests were disclosed.
Reviewer Expertise: pediatric hospital medicine, medical education

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Figure 1
Figure 1 depicts the Pearson correlation coefficients between USMLE Step 2, the other standardized examinations (Step 1, ITE-1, and the ABP certifying examination), and milestones averages for the PGY1 year. The heat map depicts correlation coefficients between the averages within each domain of competence, as well as the overall average (1-Ave), and the examinations. No overall correlation was seen between PGY1 performance and USMLE Step 2 (r = -0.1), Step 1 (r = -0.13), or the ABP examination (r = 0.082).

Figure 1 .
Figure 1. ACGME PGY-1 Milestones correlation with standardized examinations. The heat map is labeled with the individual Pearson correlation coefficients between PGY1 milestones and standardized examination scores. Domains of competence (1-ICSALL = Interpersonal and Communication Skills, 1-MK = Medical Knowledge, 1-PALL = Professionalism, 1-PBLIALL = Practice-Based Learning and Improvement, 1-PCALL = Patient Care, 1-SBPALL = Systems-Based Practice) are shown with the overall average of all ratings (1-Ave). Standardized examinations used in the correlation include USMLE Step 1 and Step 2, the PGY1 In-Training Exam (1-ITE), and the American Board of Pediatrics certifying examination (ABP Score). The scale to the right indicates the strength of correlation (blue = negative, red = positive). Note that due to the timing of data collection, no residents in this cohort had both 1-ITE and ABP scores available, leaving that intersecting block empty.

Figure 2 .
Figure 2. ACGME PGY-3 Milestones correlation with standardized examinations. The heat map is labeled with the individual Pearson correlation coefficients between PGY3 milestones and standardized examination scores. Domains of competence (3-ICSALL = Interpersonal and Communication Skills, 3-MK = Medical Knowledge, 3-PALL = Professionalism, 3-PBLIALL = Practice-Based Learning and Improvement, 3-PCALL = Patient Care, 3-SBPALL = Systems-Based Practice) are shown with the overall average of all ratings (3-Ave). Standardized examinations used in the correlation include USMLE Step 1 and Step 2, the PGY3 In-Training Exam (3-ITE), and the American Board of Pediatrics certifying examination (ABP Score). The scale to the right indicates the strength of correlation (blue = negative, red = positive).

Table 1 . Demographics of study population compared to national data.
Using milestones-based evaluations to gather data from multiple evaluators for each resident, our program has observed that overall milestone averages for residents correlate with their competence as evaluated by our Clinical Competency Committee and program leadership. For example, high scores have correlated with chief resident selection, and low scores with remediation or termination (data not shown). Park et al. had similar findings: low milestones ratings correlate with trainees who are struggling in their learning 33.
* = ABP national data, 2012-2017
** = 2018 NRMP data
¥ = 2020 NRMP data (first year NRMP reported information on race/ethnicity)
# = American Medical School Graduate / International Medical School Graduate
^ = Top 40 US medical school by highest NIH funding

It is not ethically feasible to share our raw data publicly. Though the names of residents included in the study have been redacted, there may be enough personal information within the dataset to identify residents who have trained in our program. These data could include sensitive information such as scores on board-certifying examinations, and this information should not be part of the public domain. Reviewers wishing to see the raw data can contact the corresponding author (Benjamin Miller, benjamin.miller@chp.edu) to request access.
Medical College of Wisconsin, Milwaukee, USA

This is a very insightful and well-written retrospective cohort study examining the correlation, or lack thereof, between pediatric residency applicants' quantitative scores on the United States Medical Licensing Exam Step 2 Clinical Knowledge examination and their subsequent performance in residency training based on the Accreditation Council for Graduate Medical Education Milestones. Although prior studies have examined concerns about the limitations of USMLE scores as predictors of success in residency training, this study seems very novel in specifically examining the correlation between Step 2 CK and pediatric residency milestones.
- Milestone reporting was only available for 122/188 end-of-year PGY-3 residents.
- ITE exam results were only available for 74/188 residents in the PGY-1 year.
- ITE exam results were only available for 122/188 residents in the PGY-3 year.