Introduction

In the United States, the federal system of criminal sentencing was created to ensure that federal courts provide similar sentences for similar criminal conduct by individuals with similar criminal histories. At the same time, according to the judicial branch itself, there exists substantial racial inequity in sentencing (Franklin, 2013). As compared to similarly situated white men, Hispanic and Black men receive sentences that are, respectively, 5% and 20% longer on average (United States Sentencing Commission, 2018). While the disparity between Black and white defendants is particularly well-established (Feldmeyer and Ulmer, 2011; Mustard, 2001; Rachlinski and Wistrich, 2017), there is compelling evidence that Native and Hispanic men—especially young men—also receive harsher sentences (Doerner and Demuth, 2010; Franklin, 2013).

How can such inequities exist alongside a system designed specifically to eliminate them? There are at least three reasons. First, the sentence recommended for an individual is determined primarily by their criminal history and the type of crime for which they have been convicted. This system can have a disparate impact on minoritized populations as compared to white ones if minoritized individuals have more substantial criminal histories or are likely to be convicted of more severe crimes. Second, some federal districts may apply sentencing guidelines in a manner that implicitly penalizes racially minoritized defendants as compared to white ones (Spohn, 2000; Yang, 2015). For instance, judges in a particular district may tend to use the upper end of presumptive sentencing ranges for certain offenses, adopting a stance such as “We’re tough on drugs in this district.” This type of structural differential impact could potentially amplify the first type. Finally, actors in the federal system may treat members of racial groups differently even after accounting for legal factors (Spohn, 2000). We are concerned with all three of the aforementioned sources of racial disparity. We will estimate the structural disparate impact and differential treatment by courtroom actors at the national level and, for the first time, within individual federal judicial districts.

How federal sentencing works

The Sentencing Reform Act of 1984 created the United States Sentencing Commission (USSC) and charged it with promoting equity in the justice system. To this end, USSC created its sentencing guidelines and put them into practice three years later (United States Sentencing Commission, 1987). The guidelines account for a defendant’s criminal history and the severity of their alleged crime, as well as numerous possible mitigating and aggravating factors. Based on these factors, each case is placed into a group, visualized as a cell in a sentencing grid, and each cell carries a specific range of time recommended for incarceration. One may view the guidelines as an algorithm that outputs a recommended sentencing range based on the characteristics of the defendant and on the alleged crime.

Initially, federal judges were required to follow these guidelines; that is, they were limited to imposing sentences within the recommended ranges. A pivotal change came with United States v. Booker, in which the Supreme Court held that mandatory guidelines violated the Sixth Amendment. As a result, the language making the guidelines mandatory was stripped as of early 2005, and the guidelines were thenceforth advisory. Legal scholarship has sought to assess the impact of Booker on racial disparities, largely with a focus on Black defendants. Some evidence suggests that Black-white disparities in sentencing increased after Booker (Yang, 2015), while other evidence indicates no such increase (Starr, 2013). Irrespective of Booker, some studies have found evidence suggesting a decrease in race-based sentencing disparity over time (King and Light, 2019; Light, 2022).

Disparate impact and differential treatment

Though scholars debate whether Booker increased race-based sentencing disparities, there is little debate on whether the disparities existed both before and after the decision. Such disparities also show up in state sentencing (Abrams et al., 2012) and state bail decisions (Arnold et al., 2018). In thinking about sentencing differences due to defendant race, we distinguish between the differential impact of structural factors and differential treatment by actors in the courtroom workgroup, including judges, defense attorneys, and prosecutors (Bushway and Piehl, 2001).

Differential treatment by the actors in the courtroom workgroup has been a focus of research in this space, with much of the work in criminology guided by the focal concerns framework (Steffensmeier et al., 1998). This theory proposes that members of the courtroom workgroup have three focal concerns when determining appropriate sentences: the defendant’s blameworthiness, the extent to which a longer sentence would protect the community, and the practical constraints associated with a longer sentence. According to this framework, members of the courtroom workgroup consciously or subconsciously rely on stereotypes to assess the three concerns, in turn reducing the cognitive demand of sentencing. For instance, some groups may be stereotyped as tough enough to handle prison, as being already absent from one’s dependents, and so forth. White defendants benefit from these stereotypes (Smith et al., 2014). Regarding blameworthiness, white individuals are underrepresented as perpetrators of crime in television news reports (Dixon and Linz, 2000), and this underrepresentation appears to provide white individuals an innocence premium in people’s minds (Dixon, 2006). Regarding danger to the community, white individuals are differentially perceived as safe and disinclined to commit crimes (Smith et al., 2014). Regarding practical constraints, qualitative evidence suggests that judges perceive white defendants and their families as being especially harmed by prison time (Kramer and Steffensmeier, 1993).

This racially disparate treatment is important but may account for a relatively small part of the overall racial disparity in the system (Baumer, 2013; Mitchell, 2005). Instead, the bulk of the racial disparity may be attributable to the structural properties of the criminal justice system (Bushway and Forst, 2013; Bushway and Piehl, 2011; Frase and Roberts, 2020). The guidelines system itself is an example of a structure that could create inequity since the recommended punishments were created based on past practices, which could well have been racially biased. Above and beyond the basic guidelines, other mitigating and aggravating factors can cause disparity. For example, the Anti-Drug Abuse Act of 1986 put in place mandatory minimum sentences for certain drug-related crimes. Twenty years after the passage of the law, Black defendants made up 15% of drug users in the U.S., 74% of individuals sentenced to prison for a drug offense, and over 80% of individuals sentenced for crack-related offenses (Vagins and McCurdy, 2006).

The disparity might also stem from variation among federal districts, each of which may choose to apply the sentencing guidelines in its own way. Documentation from USSC at the time they created the guidelines states that concerns about interdistrict variation were one of the primary motivations for the Sentencing Reform Act of 1984. The sentence given to a defendant results from the decisions of prosecutors, defense attorneys, and judges, and there is no reason to assume that the approaches of these actors are uniform across districts. Intra-city/inter-judge variation in sentencing is wide and has generally grown post-Booker (Scott, 2010; Yang, 2014).

Unfortunately, the literature on intra-city (including inter-judge) and interdistrict differences is sparse. Some work in criminology has used multi-level models to demonstrate district-level variations in practices like the provision of downward departures (Johnson et al., 2008). USSC itself has released a report on interdistrict variation in sentencing (United States Sentencing Commission, 2020a). That said, the government’s publicly released sentencing records do not attach judge names to individual sentences, thereby precluding judge-level analysis. The aforementioned judge-level studies (Scott, 2010; Yang, 2014) were possible only because they used proprietary data, and the associated data usage agreement prohibits public access and disclosure of judge names.

Summary of results

In this paper, we estimate the structural disparate impact and differential treatment by courtroom actors at the national level and, for the first time, within individual federal judicial districts. We study over one-half million sentencing records from the USSC database, spanning the years 2006 to 2020. Compared to white defendants, Black and Hispanic defendants receive average sentences that are approximately 19 months longer and 5 months longer, respectively. Personal demographics and sentencing guideline elements account for nearly 17 of the 19 months for Black defendants and all five of the months for Hispanic defendants, demonstrating the disparate impact of the system at the national level. To further investigate the source of these differences, we analyze individual districts. Assuming uniform treatment of sentencing variables across all districts, 22 districts display significantly longer conditional sentences for minoritized defendants. These disparities are evidence of structural inequity, differential treatment within the district, or both. However, even after controlling for each district’s unique demographics and implementation of sentencing factors, 14 districts show significant differences for minoritized defendants as compared to white ones. These unexplained differences are evidence of possible differential treatment by judges, prosecutors, and defense attorneys.

Methodology

Data

We estimate race-based sentencing disparities from public data released by USSC (United States Sentencing Commission, 2021). To avoid potential confounding due to the Booker Supreme Court case in 2005, we utilize data from fiscal years 2006 through 2020, comprising 1,027,976 sentences before restriction. The codebook for these data (United States Sentencing Commission, 2020b) provides detailed explanations of all variables.

We immediately eliminate immigration cases using the variables that specify offense type (OFFGUIDE = 17, OFFTYPE2 = 27, or OFFTYPSB = 27, depending on the sentencing year) and any other cases involving noncitizen defendants (NEWCIT = 1). This is the same approach taken by, for example, Rehavi and Starr (2014), who state that “immigration cases and other cases involving noncitizens were excluded because the stakes in those cases center on deportation, making them not directly comparable to other crimes.” There are 479,347 records that involve noncitizens and/or immigration cases and/or have missing citizenship data. After discarding these, 548,629 sentencing records remain. It is important to keep in mind that this exclusion of data means that the research questions we can address are necessarily limited in scope.

In our regression models, the outcome variable is the total prison sentence in months as captured by the variable SENTTOT in USSC data. There are 82,101 records (15.0%) with missing data for this variable. Among the records with missing data, the vast majority, 75,338 records, have penalties involving probation, monetary fines, and/or home confinement. For these records, we convert the value of SENTTOT to zero months. After this imputation, there are still 6763 records with missing data for SENTTOT. These comprise 1.2% of the 548,629 records, and we eliminate them since we cannot determine the outcome in terms of prison sentence. At this stage, 541,866 records remain.

USSC codes life sentences as 470 months. There are 2295 such cases (0.4%) in the remaining data and we retain these. At the same time, 1570 records (0.3%) indicate sentences longer than 470 months. We discard this small amount of data in order to avoid a potentially nonmonotonic representation of prison sentences. At this stage, 540,296 records remain.
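The handling of the outcome variable described above can be sketched as a small rule applied to each record. This is only an illustration under stated assumptions: the flag `non_prison_penalty` is a hypothetical stand-in for whatever fields identify probation, fines, and/or home confinement, and the toy records are invented.

```python
from typing import Optional

LIFE_CODE_MONTHS = 470  # USSC codes life sentences as 470 months


def clean_senttot(senttot: Optional[float], non_prison_penalty: bool) -> Optional[float]:
    """Return the sentence in months to analyze, or None if the record is dropped.

    `non_prison_penalty` is a hypothetical flag meaning the record's penalty
    involves probation, monetary fines, and/or home confinement.
    """
    if senttot is None:
        # Missing sentence: impute zero months only for non-prison penalties.
        return 0.0 if non_prison_penalty else None
    if senttot > LIFE_CODE_MONTHS:
        # Longer than the life-sentence code: discard the record.
        return None
    return float(senttot)


# Invented toy records: (SENTTOT, non_prison_penalty)
records = [
    (None, True),    # probation only -> imputed as 0 months
    (None, False),   # outcome unknown -> dropped
    (470, False),    # life sentence -> retained
    (600, False),    # above the life code -> dropped
    (37, False),     # ordinary prison term -> retained
]
cleaned = [clean_senttot(s, flag) for s, flag in records]
kept = [s for s in cleaned if s is not None]
print(kept)  # [0.0, 470.0, 37.0]
```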

We will estimate the effects of defendant race and judicial district after controlling for various other explanatory variables. Our strategy is to focus primarily on variables directly related to sentencing, but also to include a small number of extra-legal variables, namely age, sex, and educational attainment. One could choose to exclude any and all extra-legal variables. Instead, we opt to include the few extra-legal variables commonly used as controls in the research literature and in the government’s own studies. Starr (2013), Rehavi and Starr (2014), and United States Sentencing Commission (2018) are merely a few examples.

We eliminate any remaining records that are missing data for variables key to our analysis. These are the defendant characteristics race (NEWRACE), age (AGE), sex (MONSEX), and educational attainment (NEWEDUC), as well as the following sentencing variables: offense level after Chapter 2 adjustments (CHAP2), offense level after Chapter 3 adjustments (COADJLEV), final offense level (XFOLSOR), criminal history group (XCRHISSR), minimum of guideline sentencing range (XMINSOR), trumped guideline minimum (GLMIN), whether the defendant pleaded guilty (NEWCNVTN), and presence of a government departure based on the Booker reporting category (BOOKERCD through fiscal year 2017, and SENTRNGE afterward). There are 21,477 cases with missing data, comprising 4.0% of the data at the previous stage of elimination. We drop these from our analysis dataset, leaving 518,819 records.

Finally, we have only 98 records from the District of the Northern Mariana Islands, which is insufficient for estimating effects within that district. We eliminate those records from our study, leaving us with a final dataset of 518,721 records.
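The sequence of exclusions can be checked with simple bookkeeping. The sketch below uses only the counts reported in this section; the step labels are paraphrases introduced here for readability.

```python
# Counts of dropped records at each stage, as reported in the text.
steps = [
    ("immigration, noncitizen, or missing citizenship", 479_347),
    ("SENTTOT still missing after imputation", 6_763),
    ("sentences longer than 470 months", 1_570),
    ("missing key analysis variables", 21_477),
    ("District of the Northern Mariana Islands", 98),
]

remaining = 1_027_976  # sentences before restriction
trail = []
for reason, dropped in steps:
    remaining -= dropped
    trail.append((reason, dropped, remaining))

for reason, dropped, left in trail:
    print(f"{reason}: -{dropped:,} -> {left:,}")
```

The running totals reproduce the intermediate figures quoted above and end at the final analysis dataset of 518,721 records.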

Treatment of variables

Except for defendant age (AGE) and prison sentence (SENTTOT), we treat all variables as categorical, regardless of coding that appears numerical in the raw data. This is especially important for offense-level variables (CHAP2, COADJLEV, XFOLSOR) and criminal history group (XCRHISSR). We make this choice because the meanings of these variables are not linear in their codings. For instance, a crime with offense level six is in no way expected to confer double the sentence of a crime with offense level three.
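To illustrate what categorical treatment means in practice, the following minimal sketch expands a numerically coded variable into one indicator per observed level, so that no ordering or spacing between codes is imposed. The helper `one_hot` and the toy offense levels are hypothetical; a regression would typically also omit one level as the reference category.

```python
def one_hot(values):
    """Map each value to a dict of 0/1 indicators, one per observed level."""
    levels = sorted(set(values))
    return [{f"level_{lv}": int(v == lv) for lv in levels} for v in values]


offense_levels = [3, 6, 6, 12]  # hypothetical toy values
encoded = one_hot(offense_levels)
print(encoded[0])  # {'level_3': 1, 'level_6': 0, 'level_12': 0}
print(encoded[1])  # {'level_3': 0, 'level_6': 1, 'level_12': 0}
```

Under this encoding, level six receives its own coefficient rather than being forced to contribute twice the effect of level three.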

To further prepare our data for analysis, we add several derived variables. To track the application of mandatory minimum sentences, we create a flag (MANDMIN) that is true whenever the trumped guideline minimum (GLMIN) exceeds the guideline minimum (XMINSOR). To track government-sponsored downward departures, we create a flag (GOVDEPART) that is true whenever such a departure is coded in the relevant sentencing variables (BOOKERCD or SENTRNGE, depending on the year). Finally, as described in the “Introduction”, a defendant’s presumptive sentencing range is determined by the grid cell on which their alleged offense falls, which in turn is determined by the final, adjusted offense level (XFOLSOR) and by the criminal history group (XCRHISSR). For convenience, we create a categorical variable (GRID) out of all possible combinations of those two variables.
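Two of the derived variables follow directly from the definitions above and can be sketched as below. The function name and the example values are hypothetical, and the string format chosen for the GRID label is an assumption. GOVDEPART is omitted because the specific departure codes in BOOKERCD and SENTRNGE are not listed in the text.

```python
def add_derived_variables(record: dict) -> dict:
    """Return a copy of `record` with MANDMIN and GRID added."""
    out = dict(record)
    # MANDMIN: true when the trumped guideline minimum exceeds the guideline minimum.
    out["MANDMIN"] = record["GLMIN"] > record["XMINSOR"]
    # GRID: one categorical label per (final offense level, criminal history group) pair.
    out["GRID"] = f"{record['XFOLSOR']}-{record['XCRHISSR']}"
    return out


example = {"GLMIN": 120, "XMINSOR": 87, "XFOLSOR": 29, "XCRHISSR": 1}  # hypothetical values
derived = add_derived_variables(example)
print(derived["MANDMIN"], derived["GRID"])  # True 29-1
```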

Statistical modeling

The results of Fig. 1 and Table 1 are based on linear regression models fit in the statistical computing environment R. Here we assume that the various factors impacting the sentence are linearly additive and that interactions between factors are negligible except as specified. The response variable is prison sentence (SENTTOT) and the explanatory variables are defendant race (NEWRACE) along with all the variables described in the captions. Table 2 provides a more detailed specification of the variables involved in each model. For each regression, we measure racial disparity via the difference between each coefficient for minoritized race (Black, Hispanic, ARI) and the coefficient for white race. Motivated by a visual inspection of the residuals, we use heteroskedasticity-robust standard errors to compute p-values, which we tabulate along with the estimates. As we have three measurements of disparity for each of the 12 models, there are 36 estimates, and we Bonferroni-adjust the p-values for the differences.
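In the simplest case, where race is the only explanatory variable, ordinary least squares coefficients reduce to group means, so the measured disparity is a difference in mean sentences relative to white defendants. The sketch below illustrates that special case only; it is not the R code used for the analysis, and the toy rows are invented.

```python
from collections import defaultdict
from statistics import mean


def baseline_disparities(rows, reference="White"):
    """Difference in mean sentence (months) between each group and the reference group."""
    by_race = defaultdict(list)
    for race, months in rows:
        by_race[race].append(months)
    ref_mean = mean(by_race[reference])
    return {
        race: mean(vals) - ref_mean
        for race, vals in by_race.items()
        if race != reference
    }


# Invented toy data: (race, sentence in months)
toy = [
    ("White", 20), ("White", 30),
    ("Black", 40), ("Black", 50),
    ("Hispanic", 30), ("Hispanic", 32),
    ("ARI", 10), ("ARI", 22),
]
print(baseline_disparities(toy))
```

Models with additional explanatory variables no longer reduce to raw group means, which is why the conditional disparities differ from this unconditioned comparison.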

Fig. 1: Race-based sentencing disparity in the federal court system.

Each point represents mean sentencing disparity (in months) as estimated from a regression model that conditions the outcome by adding an additional explanatory variable (given on the x-axis) to the model immediately to the left. The one exception is the “Grid Cell” model, marked with an asterisk (*) on the x-axis. This model combines the previous four explanatory variables into a single one representing the cell of the United States Sentencing Grid onto which each defendant is placed, as the grid cell sets the presumptive sentence. Disparities are shown as the average difference in sentences given to defendants who are Black (blue circles), Hispanic (yellow triangles), and another racial identity (ARI, red squares), each as compared to white defendants. The leftmost model, labeled “Baseline,” provides the average difference in sentences between racial groups irrespective of any and all other factors. The rightmost model demonstrates average disparities for minoritized defendants, especially Black and ARI ones, that remain unexplained. Table 1 reports the values of the estimates represented here along with p-values and confidence intervals. Table 2 provides a more detailed specification of the 12 regressions. The analysis is based on N = 518,721 sentencing records. Model diagnostics, including F-statistics, the associated degrees of freedom, the associated p-values, as well as raw and adjusted r² values appear in our permanent data repository (Topaz, 2023). Each of these F-statistic p-values is numerically indistinguishable from zero, and the adjusted r² values range from 0.02 for Model (1) up to 0.79 for Model (12).

Table 1 Race-based sentencing disparity in the federal court system.
Table 2 Specification of regression models.

The results for Figs. 2–4 and Table 3 are based on a separate set of linear regression models. District Model I is similar to Model 12 of Fig. 1 and Tables 1 and 2, except we replace district and race (as separate terms) with the interaction of district and race (treated categorically), that is:

$$\begin{array}{rc}SENTTOT \sim &AGE+MONSEX+NEWEDUC+SENTYR+NEWCNVTN+GRID\\ &+MANDMIN+GOVDEPART+NEWRACE:CIRCDIST\end{array}$$
Fig. 2: Significant race-based sentencing disparities in federal judicial districts.

We calculate the average sentencing disparity in months for defendants who are Black (circles), Hispanic (triangles), and another racial identity (ARI, squares), each as compared to white defendants. There are two sets of results. The first set comes from a regression model (District Model I, black symbols) that includes defendant demographics (age, sex, and educational attainment), sentencing year, presence of a guilty plea, relevant cell on the U.S. sentencing grid, presence of a mandatory minimum, presence of government-sponsored downward departures, and the interaction of judicial district and defendant race. In this model, all terms except for the interaction derive from nationwide data. For example, this model assumes that all districts, on average, apply the U.S. sentencing grid in the same way. For the second set of results (District Model II, orange symbols), we partition the same data by judicial district, and within each district perform a regression that includes the other aforementioned variables, as well as race. This model allows for variations in grid application, defendant demographics, and more. For each model, we present racial disparities unexplained by other factors that are statistically significant (p < 0.05 with Bonferroni adjustment). See Table 3 for estimates of the three racial disparities under each of the two modeling frameworks and see “Discussion” for the interpretation of these disparities vis-a-vis disparate impact and differential treatment of minoritized individuals. The analysis is based on N = 518,721 sentencing records. Model diagnostics, including F-statistics, the associated degrees of freedom, the associated p-values, as well as raw and adjusted r² values appear in our permanent data repository (Topaz, 2023). Each of these F-statistic p-values is numerically indistinguishable from zero. For District Model I, the adjusted r² value is 0.88. For District Model II, the adjusted r² values range from 0.70 to 0.91 with a median of 0.83.

Table 3 Race-based sentencing disparity in the federal court system.

District Model II takes the same regression formula as Model 11 of Fig. 1 and Tables 1 and 2 but applies it to each district individually. By using this framework, we allow district-specific impacts from various factors included in the model. For both district models, we again use robust standard errors. As we have three measurements of disparity for each of the 93 districts under two different modeling frameworks, there are 558 estimates and we Bonferroni adjust the p-values for these. Assuming no further existing confounding effects on the sentence, we could interpret the estimated sentencing disparity (difference in estimated coefficients between two races) as the differential treatment due to race.
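The multiple-testing correction for the district models can be sketched as follows. This assumes the standard Bonferroni adjustment, in which each raw p-value is multiplied by the number of tests and capped at 1; the raw p-values shown are invented for illustration.

```python
def bonferroni(p_values, n_tests):
    """Standard Bonferroni adjustment: multiply by the number of tests, cap at 1."""
    return [min(1.0, p * n_tests) for p in p_values]


# Three disparities (Black, Hispanic, ARI) x 93 districts x two modeling frameworks.
n_tests = 3 * 93 * 2
print(n_tests)  # 558

raw = [1e-6, 5e-5, 1e-4, 0.01]  # invented raw p-values
adjusted = bonferroni(raw, n_tests)
significant = [p < 0.05 for p in adjusted]
print(significant)  # [True, True, False, False]
```

With 558 tests, a raw p-value must fall below roughly 0.05 / 558, or about 0.00009, to remain significant after adjustment.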

Results

National-level race-based sentencing disparities

We study 518,721 federal sentencing records from 2006 through 2020 (see “Methodology”). Figure 1 summarizes the results of linear regression models that account for the contribution of various factors to race-based disparities. Table 1 provides numerical values of these estimates. Within the federal system as a whole, the average sentence given to Black defendants is 18.5 months longer than that given to white ones. For Hispanic defendants, it is 5.3 months longer, and for defendants of another racial identity (ARI), it is 9 months shorter. Several comments are in order. Most importantly, these average differences are not conditioned at all: not on type of crime, criminal history, or anything else. Additionally, the grouping of all races other than Black, Hispanic, and white into the ARI category limits our understanding. For example, Asian defendants and Native American defendants are both in this group (called “other” in the government’s public data). For this reason, we limit our comments on results for ARI defendants in our nationwide analysis. We will return to a more detailed analysis of ARI defendants in our district-level analyses, in which the geography of each district hints at what groups might be driving sentencing disparities in that category.

Conditioning sentencing on personal demographics, namely age, gender (which the government measures as binary sex), and educational attainment, reduces the disparities to 12.9 months (−5.6) and 1.0 months (−4.3) for Black and Hispanic defendants respectively, demonstrating the role that these characteristics play in the disparate impact naively attributable to race. Conditioning on year of sentence has little effect, suggesting that time is not a primary driver of disparities, at least during the 2006–2020 time period spanned by our study. The presence of a guilty plea plays quite a modest role in explaining disparities for Hispanic defendants. In contrast, accounting for such pleas reduces the disparity for Black defendants from 12.8 to 10.3 months (−2.5) as compared to white ones.

Conditioning on criminal history further reduces the sentence differential for Black defendants to 2.2 months. This is a fairly large reduction, and it captures the extent to which the criminal histories of Black and white defendants differ. That is to say, the sentencing guidelines’ emphasis on criminal history accounts for a substantial proportion of the racial difference. In contrast, accounting for criminal history actually exacerbates the disparity between Hispanic and white defendants.

As one would expect based on how sentencing works (see “Introduction”), conditioning the actual sentence on presumptive sentence (labeled “grid cell” in Fig. 1 and Table 1) substantially reduces disparities for Black defendants, from 10.3 to 5.4 months (−4.9). The disparity for Hispanic defendants is no longer statistically significant. Additionally conditioning on mandatory minimum laws and government-sponsored downward departures reduces the Black-white disparity from 5.4 months to 1.9 months (−3.5, and compared with an unconditional difference of 18.5 months, −16.6) and yields a remaining Hispanic-white disparity, now statistically significant, of −1.1 months. This result means that controlling for all aforementioned factors, Hispanic defendants receive a conditional sentence that is slightly shorter than that given to white defendants. We caution that this result does not mean that Hispanic defendants are exempt from disparities. For instance, when examined at a finer jurisdictional level, disparities for Hispanic defendants are well-established (Ulmer and Parker, 2020).

Finally, to begin accounting for the federal judicial district where sentencing occurs, we include the district as a main effect in our model. That said, the observed effect of placing this control is modest. To achieve a more refined understanding, we will consider the role of the district in much greater detail below.

After conditioning the sentence outcome on age, gender, educational attainment, sentencing year, presence of a guilty plea, presumptive sentence, application of mandatory minimum laws, government-sponsored downward departures, and judicial district, there remains a statistically significant Black-white disparity of 1.9 months. The Hispanic-white disparity is negligible and not statistically significant. The ARI-white disparity of 2.8 months is significant and yet is difficult to interpret, as mentioned previously, due to the government combining many different racial groups into one. In summary, even after controlling for numerous factors that should explain sentencing disparities, there remain unexplained differences for Black and ARI defendants. To investigate the loci of these differences, we will proceed to an examination of each individual judicial district. We would like to re-emphasize that while our model and assumptions lead to no unexplained disparity for Hispanic defendants at the national level, prior results (see “Introduction”) suggest that a disparity does exist, and thus whether certain districts display it is a question that merits investigation, just as it does for Black and ARI defendants.

Race-based sentencing disparities within districts

To study associations between sentence length and defendant race within districts, we use two models. Figure 2 and Table 3 display the results of these models. The first model, labeled District Model I, measures the still-unexplained racial disparity within each district by including the interaction of district and defendant race while accounting for the nationwide-aggregated effects of age, gender, educational attainment, sentencing year, presence of a guilty plea, presumptive sentence, mandatory minimum laws, and government-sponsored downward departures. The second model, labeled District Model II, allows for interdistrict variations in defendant demographics, in the application of presumptive sentences, and more. Arguably, the most important role of Model I is to set a benchmark for comparison to Model II, allowing us to ask how racial disparities change once we account for structural factors such as in-district demographics and sentencing practices. See “Methodology” as well as the captions of Fig. 2 and Table 3 for further explanation.

In District Model I, 22 districts have statistically significant unexplained disparities for one or more minoritized racial groups (p < 0.05, Bonferroni-adjusted). Seventeen districts have a Black-white disparity, ranging from a high of 13.0 ± 1.6 months (95% confidence interval) for the Eastern District of Virginia to 2.9 ± 1.4 months for the Middle District of Florida. Three districts have a Hispanic-white disparity, ranging from 9.3 ± 4.6 months for the Western District of North Carolina to 3.5 ± 1.8 months for the Middle District of Florida. Finally, four districts have an ARI-white disparity, ranging from 12.5 ± 1.6 months for the District of Arizona to 6.6 ± 2.1 months for the District of Montana.

In District Model II, 14 districts have statistically significant unexplained disparities (p < 0.05, Bonferroni-adjusted), each for exactly one minoritized racial group. Eleven districts have a Black-white disparity, ranging from a high of 7.8 ± 2.8 months for the Southern District of Iowa to 3.8 ± 1.1 months for the Eastern District of Missouri and 3.8 ± 1.5 months for the Western District of Missouri. Three districts have an ARI-white disparity, ranging from 14.2 ± 1.6 months for the District of Arizona to 7.8 ± 2.7 months for the District of New Mexico.

Discussion

As shown in Fig. 1, the system-wide raw disparities for Black and Hispanic defendants decrease substantially once conditioned on defendant-specific and case-specific characteristics. This result is consistent with prior studies (Light, 2022; Mitchell, 2005; Rehavi and Starr, 2014; Spohn, 2013, 2000). For Black defendants, personal demographics (age, sex, educational attainment) and criminal history appear to play the most significant roles. In contrast, a raw disparity in favor of ARI defendants flips its sign after conditioning. However, interpreting this result is difficult as the actual racial identification of individuals in this group is unknown. We do not observe a significant unexplained system-wide disparity for the conditional sentences of Hispanic defendants. Much of the disparity appears attributable to personal demographics, as was the case for Black defendants.

Of course, the federal judicial system is not centralized. Sentences are created as part of a process by a courtroom workgroup within individual districts, and thus a more granular examination is warranted. Indeed, a meta-analysis of sentencing disparities research notes “[results] that pool data from many jurisdictions... may suffer from aggregation bias,” (Mitchell, 2005). We performed two analyses to assess the contribution of the federal judicial district to sentencing disparities. District Model I assumes that the sentencing guidelines are implemented uniformly across all districts, while the framework of District Model II conditions on all factors, including the application of the sentencing guidelines, uniquely for each district. As mentioned previously, the most important role of District Model I is to set a benchmark for comparison to District Model II. The two models inform our understanding of the disparate impact and differential treatment in the following manner.

First, for 24 district-race combinations, District Model I shows a statistically significant disparity (black symbols in Fig. 2). Specifically, 17 districts show a disparity for Black individuals, three for Hispanic individuals, and four for ARI individuals. Within the framework set by this model, these results have at least two interpretations. One interpretation is that there may be bias of courtroom actors, that is, differential treatment, for the identified district-race combinations. We will consider this point in more detail momentarily. The second interpretation is that the system has a structural disparate impact. As a concrete example, the estimated Black-white disparity of 13.0 months in the Eastern District of Virginia could indicate that for socioeconomic reasons, Black defendants are more strongly associated with offenses and criminal histories that have longer presumptive sentences. To address this type of structural disparity requires one to think not only about socioeconomic determinants, but also about who is policed, who is arrested, who is prosecuted, who is convicted, and why particular cells in the sentencing grid carry longer presumptive penalties. In short, the elimination of these disparities would require structural change focused outside of the courtroom.

Second, for 20 district-race combinations, District Model I shows a statistically significant disparity (black symbols in Fig. 2) that is either reduced or lost in District Model II (orange symbols). In other words, conditioning on all factors at the individual district level yields a smaller estimate of racial disparity in these 20 cases. We interpret this reduction as evidence of differential impact on minoritized groups. For example, in the Eastern District of Virginia, the disparity of 13.0 months for Black individuals in District Model I is reduced to 7.5 months in District Model II, suggesting that in-district factors, including the specific ways in which that district implements the sentencing grid, explain 5.5 months of the disparity measured in District Model I. As shown in Fig. 2 and Table 3, the 20 aforementioned cases consist of 16 districts with disparities for Black individuals, three districts for Hispanic individuals, and one district for ARI individuals. Policies encouraging more consistent application of the guidelines could help reduce such disparities. Overall, our results about differential impact extend previous work that, irrespective of race, identifies substantial inter-jurisdictional discrepancies in sentencing outcomes within state systems (Ridgeway et al., 2020; Wright, 2012) and the federal system (Harries and Lura, 1973; United States Sentencing Commission, 2020a; Wu and Spohn, 2010).
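
The arithmetic behind the Eastern District of Virginia example can be written out explicitly. The following is a minimal sketch for illustration only, not the estimation code used in this study: the class and method names are hypothetical, and only the two point estimates (13.0 and 7.5 months) are taken from the text.

```python
from dataclasses import dataclass


@dataclass
class DisparityEstimate:
    """Point estimates (in months) for one district-race combination.

    Hypothetical container for illustration; not part of the study's code.
    """
    district: str
    group: str
    model_i_months: float   # disparity under District Model I
    model_ii_months: float  # disparity under District Model II

    def explained_by_district_factors(self) -> float:
        """Months of the Model I disparity no longer present in Model II."""
        return self.model_i_months - self.model_ii_months

    def share_explained(self) -> float:
        """Fraction of the Model I disparity accounted for by in-district factors."""
        return self.explained_by_district_factors() / self.model_i_months


# Values for the Eastern District of Virginia as reported in the text.
edva = DisparityEstimate("Eastern District of Virginia", "Black", 13.0, 7.5)
print(edva.explained_by_district_factors())  # 5.5
print(round(edva.share_explained(), 2))
```

The share computed on the last line is a derived quantity that does not appear in the paper, and it ignores the uncertainty in both estimates, so it should be read only as a rough summary of the two point estimates.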

Finally, for 14 district-race combinations, District Model II shows a significant disparity (regardless of results from District Model I). Another way to state this result is that, for these 14 cases, even after conditioning on all factors at the district level, an unexplained racial disparity remains. Figure 3 summarizes these 14 cases, provides confidence intervals for the disparities, and organizes them by federal circuit. There are Black-white disparities in the Fourth, Fifth, Seventh, Eighth, and Eleventh Circuits, and ARI-white disparities in the Ninth and Tenth Circuits. Figure 4 provides a map of these disparities. The Black-white disparities are largely concentrated in the plains, the mid-Atlantic seaboard, and scattered across the deep south. The ARI-white disparities are in Montana, New Mexico, and Arizona, which are states with high per capita Native American populations (United States Census Bureau, 2021). However, due to the racial heterogeneity of the ARI category, and due to complex issues of jurisdiction on tribal lands in the United States, our study cannot provide a definitive and nuanced interpretation of the ARI-white disparity.
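
The three readings described above can be summarized as a simple decision rule. The sketch below is illustrative only: the function name, labels, and inputs are hypothetical, and treating "reduced or lost" as "not significant in District Model II, or a smaller point estimate" is a simplifying assumption made for this illustration rather than the criterion used in the analysis.

```python
from typing import List


def readings(sig_model_i: bool, sig_model_ii: bool,
             est_model_i: float, est_model_ii: float) -> List[str]:
    """Return the interpretations that apply to one district-race combination.

    sig_* are significance flags and est_* are disparity estimates in months.
    All names and labels are hypothetical.
    """
    out = []
    if sig_model_i:
        # First reading: structural disparate impact and/or differential treatment.
        out.append("model_i_disparity")
        # Second reading: the disparity is reduced or lost in District Model II.
        if (not sig_model_ii) or est_model_ii < est_model_i:
            out.append("differential_impact")
    if sig_model_ii:
        # Third reading: an unexplained disparity remains within the district.
        out.append("possible_differential_treatment")
    return out


# Hypothetical inputs, not values from the study.
print(readings(True, False, 10.0, 2.0))
print(readings(True, True, 10.0, 6.0))
print(readings(False, True, 3.0, 8.0))
```

Because the categories overlap, a single district-race combination can receive more than one label, which mirrors the fact that the counts of 24, 20, and 14 cases reported above are not mutually exclusive.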

Fig. 3: Summary of evidence of race-based criminal sentencing bias (differential treatment) in federal judicial districts.

In Fig. 2, 14 federal districts display evidence of differential treatment by judges and other members of the courtroom workgroup even in a modeling framework that allows sentences to be conditioned separately for each district. In the figure above, we summarize the disparities and visualize their 95% confidence intervals. Eleven districts show evidence of bias against Black defendants (blue circles). Three districts show evidence of bias against defendants who have another racial identification (ARI, red squares), a category that includes Native Americans. The fourteen districts come from seven of the federal circuits. Circuit numbers appear in the center of each subdivided section above.

Fig. 4: Map of evidence of race-based criminal sentencing bias (differential treatment) in federal judicial districts.

This map puts Fig. 3 and Table 3 in geographic context; see those captions for further explanation. The horizontal axis labels in Fig. 3 provide the full district name corresponding to each abbreviation in the map. States marked “none” are those for which our study did not produce evidence of differential treatment. The designation “none” does not mean that differential treatment is, in reality, absent.

The results summarized in Figs. 3 and 4 are racial disparities that are not explained by defendant demographics (age, sex, educational attainment), sentencing year, the presence of a guilty plea, presumptive sentence (based on criminal history and severity of the offense), mandatory minimum laws, or government-sponsored downward departures. We interpret these unexplained disparities as evidence of possible differential treatment of defendants, whether due to implicit or explicit bias of judges, prosecutors, and defense attorneys. One study that focused on federal drug sentencing described this type of differential treatment as “state-level racial animus” (Tuttle, 2019).

Conclusions

While racial disparities and interdistrict variability in sentencing have both been previously established in the literature, their interplay has not. We have provided a look at district-level variations in race-based sentencing disparity and identified 14 judicial districts showing significant differences for minoritized defendants as compared to white ones.

It is important to keep in mind several limitations of our work:

  1. We have focused on sentence length as opposed to the judge’s decision of whether or not to impose prison time at all. Though not the subject of our study, the imprisonment decision itself is an important source of racial disparity (Ulmer et al., 2016).

  2. We have not disaggregated our results by offense type. Offense type is not central to our research question, though it is possible that in some districts, disparities exist for some offense types and not others. This could be an appropriate route of investigation for future work.

  3. As we mentioned previously, and in keeping with other published work, we have omitted noncitizens from our analysis. Unfortunately, this likely limits the understanding of disparities experienced by Hispanic individuals.

  4. Evidence of differential treatment within a district is not proof and, as with any statistical modeling study, hinges on the modeling assumptions we have articulated (see “Methodology”).

  5. Equally, evidence of differential treatment does not reveal which judges, prosecutors, or defense attorneys, if any, might be responsible.

  6. On the other hand, there could certainly be differential treatment by judges, prosecutors, and defense attorneys in districts other than those we have identified here. Biased sentences in those districts could remain undetected in our analysis due to other sentences that mask or counterbalance them.

While the actions of members of courtroom workgroups are certainly not the only source of inequity in the federal court system, as our results confirm, their decisions should be subject to close scrutiny and analysis. In the 1980 case Richmond Newspapers v. Virginia (448 U.S. 555), the Supreme Court found that openness of criminal courtrooms is “implicit in the guarantees of the First Amendment.” This means that in theory, the public is allowed to know the sentence given by each judge to each defendant. However, that theoretical right does not translate into public transparency on a large scale because the USSC does not include in its public data the names of the members of the courtroom workgroup who created the sentence, thereby precluding analysis at the individual level. Some preliminary steps have been taken to unmask judge identities, including the JUSTFAIR database of approximately 600,000 federal sentences (Ciocanel et al., 2020). Of course, any such large-scale, data science-based effort to identify sentencing judges is subject to error. For this reason, in order to enable the most accurate, and ideally, causal, analyses of race-based differential treatment, the government would need to cease redacting judge identity in its public records and find ways to identify other actors in the courtroom workgroup who help shape the outcome.