The American Board of Surgery In-Training Examination (ABSITE®) is a multiple-choice exam administered yearly to surgical residents. The ABS provides the following information on its website:

"The ABSITE is furnished to program directors as a formative evaluation instrument to assess residents’ progress. The results are released only to program directors and should not be shared outside of the department GME division." [1]

However, the use of ABSITE scores is often inconsistent with the stated guidelines. While formative assessment is a low-stakes evaluation of an individual's current learning with an emphasis on actionable feedback, summative assessment is a high-stakes evaluation in which the primary objective is to hold an individual accountable for a body of knowledge [2]. Currently, ABSITE scores are used summatively in many fellowship applications and are routinely shared outside departments. This misalignment between intended and actual use challenges the validity of the interpretations drawn from ABSITE scores.

Validity refers to the body of evidence that supports interpreting a test score as a measure of a construct, such as “surgical knowledge” or “preparedness for fellowship” [3]. Specifically, there should be evidence that (1) test content maps to the construct in question (content), (2) test questions evoke the intended thought processes (response process), (3) test questions collectively measure the intended construct (internal structure), (4) test scores correlate in expected ways with related measures of the construct (relationship to other variables), and (5) use of test scores leads to intended outcomes (consequences) [3]. This framework was initially proposed by Messick and later operationalized into practical guidelines in The Standards for Educational and Psychological Testing [3, 4]. Using The Standards, we discuss the limitations of the ABSITE in its current form, including (1) invalid score interpretation, (2) inconsistent use of scores, (3) disparate learning opportunities for test-takers, and (4) variability in test administration. We then suggest future reforms.

First, Standard 1.4 states, "If a test score is interpreted for a given use in a way that has not been validated, it is incumbent on the user to justify the new interpretation for that use" [4]. While studies differ on the importance of the ABSITE in fellowship ranking decisions, all of them indicate that fellowships consider ABSITE scores [5, 6]. By considering them, fellowships implicitly treat ABSITE scores as a measure of candidate quality. Although some studies have correlated low ABSITE scores with failure of the ABS Qualifying Exam (QE), the data are mixed [7, 8]. Furthermore, studies demonstrate poor correlation between ABSITE scores and clinical evaluations, suggesting that the exam is a poor predictor of clinical performance [9]. Thus, while the ABS does provide a blueprint on how questions map to domains of surgical knowledge (content evidence), there is insufficient evidence that ABSITE scores actually correlate with clinical competence (relationship to other variables) [10].

Second, Standard 5.24 states, "When proposed score interpretations involve one or more cut scores, the rationale and procedures used for establishing cut scores should be documented clearly" [4]. Currently, the ways in which fellowship programs use ABSITE scores are heterogeneous and opaque. The number of ABSITE scores that fellowships request from applicants can vary, even across fellowships in the same specialty. Moreover, when they exist, score cutoffs are not transparently published. One study demonstrated that although a minority of fellowship programs enforced ABSITE score cutoffs, those that did had cutoffs ranging from the 10th percentile to the 90th percentile [5, 11]. Because no rigorous methodology for determining ABSITE score cutoffs has been described, fellowships may be inadvertently excluding qualified candidates. As a result, the summative use of the ABSITE lacks consequence validity evidence [12].

Third, Standard 12.8 states, "When test results contribute substantially to decisions about student promotion or graduation, evidence should be provided that students have had an opportunity to learn the content and skills measured by the test" [4]. Thus, without adequate study conditions, ABSITE scores cannot be used summatively. Given institutional differences in protected education time and clinical schedules, residents may not have adequate opportunities to prepare for the ABSITE. In a study analyzing the relationship between burnout and ABSITE scores, 48% of respondents cited "no time secondary to clinical duties" as a barrier to studying. Moreover, burnout due to exhaustion was associated with low ABSITE scores in a multivariable regression analysis [13]. Proponents of maintaining the ABSITE in its current form believe that it incentivizes residents to study, with a recent opinion article noting that "an important by-product of a rigorous, scored exam is the incentive for residents to read even though they may be too tired or distracted to do so" [14]. While this may be true, it also reflects the reality that residents are frequently studying under suboptimal conditions and are not given a reasonable opportunity to learn the required content.

Fourth, Standard 3.0 states, "All steps in the testing process, including…administration…should be designed…to minimize…variance" [4]. Currently, residents are assigned a time slot within a five-day testing window to take the ABSITE. However, some residents may take the test while on call or following overnight call, when they are fatigued and sleep-deprived. While previous studies failed to show a correlation between prior night call and exam scores, subjecting a resident to a five-hour test post-call seems unnecessary [15,16,17]. If ABSITE scores continue to be misused in a summative manner, then test-takers must be given a fair and equitable chance to perform their best. The preceding two examples highlight situations in which scores may inadvertently capture structural differences related to learning conditions or fatigue instead of how well residents can understand and answer the questions being asked. As a result, the summative use of ABSITE scores is again undermined by a lack of consequence and response process evidence [3, 12].

Considering these issues, we can improve the design and implementation of the ABSITE. Ideally, fellowships would stop requesting ABSITE scores. While residents technically report ABSITE scores voluntarily, in practice, this is a forced choice since residents perceive score reporting as the expectation and the norm. If fellowships continue to request scores, current practice can be improved by standardizing score reporting and optimizing the testing environment. Specifically, fellowships should be consistent and transparent about how many scores they request, what score cutoffs they use, and how these cutoffs are determined. In addition, the testing window can be widened to accommodate residents’ clinical schedules. Notably, this was done in 2021 due to the COVID-19 pandemic with minimal issues [18]. Alternatively, multiple testing windows can be offered, as is done with the MCAT and the USMLE. Finally, residency programs should prioritize using the ABSITE in a formative manner. For example, scores can be used to guide educational quality improvement or formulate personalized learning plans [19]. As an added benefit, transitioning to a low-stakes examination reduces the incentive to cheat and thus the need for stringent test security.

Adoption of the proposed reforms invites the question of how fellowships will evaluate applicants without ABSITE scores, as they are currently one of the few quantitative metrics available. As one alternative, the newly developed entrustable professional activities (EPAs) may offer an evidence-based approach to determine resident preparedness [20]. However, since EPAs are also intended to be formative, summative use would need to be thoroughly examined. Ultimately, we recommend that fellowship programs shift toward a more holistic review process for prospective fellows, paralleling initiatives already taking place at the medical school level [21,22,23].

In conclusion, the current use of ABSITE scores neither complies with ABS statements nor follows well-accepted testing guidelines. With the exception of content evidence, the ABSITE lacks validity evidence in every domain. It is essential that we recognize these limitations and use the test in an evidence-based manner. With proper use, we believe that the ABSITE can be a powerful learning tool that can help maximize the educational potential of every future surgeon.