Computational analysis of 140 years of US political speeches reveals more positive but increasingly polarized framing of immigration

Significance In the first comprehensive quantitative analysis of the past 140 y of US congressional and presidential speech about immigration, we identify a dramatic rise in proimmigration attitudes beginning in the 1940s, followed by a steady decline among Republicans (relative to Democrats) over the past 50 y. We also reveal divergent usage of positive (e.g., families) and negative (e.g., crime) frames—over time, by party, and between frequently mentioned European and non-European groups. Finally, to capture more suggestive language, we introduce a method for measuring implicit dehumanizing metaphors long associated with immigration (animals, cargo, etc.) and show that such metaphorical language has been significantly more common in speeches by Republicans than Democrats in recent decades.

date. Most speeches in the dataset include a named speaker, state, and party, though about 15% of speeches are missing this information; most of these are procedural (e.g., "Without objection it is so ordered"). For the speeches from the @unitedstates project (https://github.com/unitedstates/congress), we also tokenize them using spaCy, and then apply pre-processing to more closely match the format of the Gentzkow data. To do this, we replace all commas with periods, drop all apostrophes, and remove all hyphens connecting words.

For the Presidential data, we first split each document into paragraphs (by splitting on newlines), and treat each paragraph as a segment. Some documents (such as transcripts of press briefings) include comments from multiple speakers. To exclude speeches by everyone except the President, we filter out blocks of text that begin with "Q." or "Q:" (question), as well as other named positions (e.g., "The Vice President:") and those that begin with a name preceded by a non-Presidential title (e.g., "Dr. [NAME]:"). For additional details, please refer to the online replication code.

Although the Congressional Record data is generally of high quality, there are some errors from Optical Character Recognition (OCR), especially in earlier years. Table S1 shows, for three key terms, the most common similar tokens found in the first part of this corpus (sessions 46-70), i.e., those that have an edit distance of one to the target. The counts of obvious OCR errors are relatively small, but we nevertheless include some common variations (e.g., "inmigration", "imnigration") as keywords when doing the initial speech selection for annotation.

Table S1. Common OCR errors for three key terms. The table shows the token counts for the most common terms with an edit distance of one from the target terms (top row) in all speeches from the 46th to 70th Congress, revealing that OCR errors are present, but relatively rare.

In order to get a sufficiently large number of positive examples for annotation, initial keyword filters were used to identify sentences in the Congressional speeches that were potentially about immigration. These were developed through an iterative process of keyword selection, query expansion, and exploration, and encompassed prefixes relating to immigration.

Details of Training and Applying Classifiers for Relevance and Tone
Using the inferred labels from the annotated segments, we then trained a pair of classifiers for both the early and modern time periods, which we chose to focus on due to the relatively small amount of immigration to the U.S. in the intervening years. For both time periods, we trained a binary classifier for relevance, and a ternary classifier for tone. To do so, we built on the transformers library, beginning with the pretrained roberta-base model provided by Hugging Face (https://huggingface.co/). We implemented a weighted classifier that incorporates the inferred label probabilities into the cross-entropy loss as weights during training. For the tone classifier for the modern time period, we also augmented the annotations we collected with similar tone annotations from the immigration news articles in the Media Frames Corpus (2), which resulted in a slight increase in held-out performance.

To identify relevant speeches in the full corpus, we first applied the relevance and tone classifiers to all segments which contained keywords (the pool of candidate segments for annotation). A second, smaller set of speeches was found by taking speeches not yet identified as being about immigration, breaking them into segments, classifying all of those segments as relevant or not, and keeping only those speeches that contained segments classified as relevant and that occurred on a day with many speeches already classified as being about immigration (see replication code for details). In this way, the relevant speeches primarily consisted of those that contained keywords, but were not limited to those.

The overall tone of each speech is obtained by counting the number of segments in a speech classified as being pro, neutral, or anti (among those classified as being relevant to immigration). In other words, although we count entire speeches which only briefly mention immigration as being relevant to the issue, the tone assigned to each speech is determined only by those parts of the speech which are judged relevant, not the unrelated parts.

For the early and later periods, we only use the predicted labels from the corresponding models. To get final labels for segments from the middle period, we used a linear interpolation of the predicted probabilities from the two models, placing all the weight on the early model in 1934 and transitioning to placing all the weight on the modern model by 1957.

To confirm that these models are comparable, we inspect the aggregate predictions made by each model individually on all of these years, as shown below. We also plot the predictions using both the Gentzkow et al. data and the @unitedstates project data for the years where they overlap (the 104th to the 112th Congress) to confirm that the slight differences between these data sources are not consequential. As shown in Figures S2 and S3 below, there is nearly perfect overlap for the later models (for both relevance and tone) for the comparison between data sources, demonstrating that the minor differences between these sources are not consequential. By contrast, the difference between Model 1 and Model 2 is slightly larger. This makes sense, however, given that the content of speeches about immigration changed dramatically over this time, hence the use of two different models.
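Concretely, the interpolation for the middle period can be sketched as follows (a minimal illustration, assuming each model returns a vector of class probabilities per segment; names here do not come from our replication code):

import numpy as np

def blend_predictions(p_early: np.ndarray, p_modern: np.ndarray, year: int) -> np.ndarray:
    """Linearly interpolate class probabilities: all weight on the early
    model in 1934, ramping to all weight on the modern model by 1957."""
    w = np.clip((year - 1934) / (1957 - 1934), 0.0, 1.0)
    return (1.0 - w) * p_early + w * p_modern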
Nevertheless, the earliest aggregate predictions of the later model agree strongly with those of the earlier model, implying that the predictions from these two models are indeed comparable.

Figure S2. Aggregate relevance predictions from the two models. The green line shows the predictions of the later model on the @unitedstates project data. As can be seen, there is strong agreement between these three lines, indicating that (a) the two models are comparable to each other, and (b) the two data sources are comparable to each other.
Overall performance estimates are provided in Tables S4 and S5.

Figure S3. Aggregate tone predictions from the two tone models, on both the full Gentzkow et al. data and the @unitedstates project data (USCR). As above (Figure S2), there is near perfect agreement between the two data sources. The agreement between the two models in the middle period is slightly weaker than for relevance, but the aggregate predictions still show the same trends. In particular, the predictions of the later model match the earlier model almost perfectly at the beginning of the transition period. By contrast, the latest predictions of the earlier model are not as positive as those made by the later model, likely because there are aspects of positive immigration language in the 1960s that the earlier model was not exposed to in the data it was trained on.
In order to ensure that our results are not being excessively influenced by underlying biases in the base RoBERTa model, or by differential classifier performance, we include a series of validation checks, using alternative model specifications and aggregate corrections (see Validity Checks for Tone below).
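For reference, the weighted training objective described above can be sketched as follows (a minimal illustration using the transformers library; names are illustrative, not taken from our replication code):

import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Start from pretrained roberta-base, with three classes for tone
# (pro / neutral / anti); the relevance classifier uses num_labels=2.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=3)

def weighted_cross_entropy(logits, labels, label_probs):
    """Cross entropy in which each example is weighted by the inferred
    probability of its label from the annotation aggregation step."""
    per_example = F.cross_entropy(logits, labels, reduction="none")
    return (label_probs * per_example).mean()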

Validity Checks for Tone
Linear Models. Classifiers based on pre-trained models, such as RoBERTa, typically perform better, but introduce some amount of unknown bias (from the pre-training data). To verify that our results are not being excessively influenced by roberta-base, we repeat our pipeline using basic logistic regression models operating on bag-of-words features (which avoids all issues with biases from pretraining data) and reproduce Figure 1 in the main paper based on the predictions from these simpler classifiers.

For the sake of simplicity, and to simultaneously address concerns about our use of separate models for the earlier and later parts of our data, we do this replication using only a single model for relevance, and a single model for tone. That is, we combine all annotated data, leaving out a random set of 400 segments for each, and train one logistic regression model on the combined relevance annotations, and another on the combined tone annotations. Because data is limited, we do not do extensive tuning of these models. Rather, we use what are known to be strong default choices (4): we use all unigrams and bigrams that occur at least twice in the training data, binarize all features (present or absent), and regularize the model using L1 regularization, with strength tuned using five-fold cross validation. The resulting models have similar accuracy for relevance (0.89), but slightly lower accuracy for tone (0.63). Nevertheless, the resulting time series for Congressional speeches using the logistic regression models, shown in Figure S5, is overall extremely similar to the one based on the contextual embedding models (Figure 1 in the main paper).

Figure S5. Reproduction of Figure 1 in the main paper (average tone over time for Congress and Presidents in the top and bottom sub-panels, respectively), using logistic regression models for relevance and tone (based on unigram and bigram features), rather than models based on roberta-base. As can be seen, results are extremely similar to the main results in terms of all important findings.
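A minimal sketch of this baseline using scikit-learn, with the settings described above (not our exact replication code):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline

# Unigrams and bigrams occurring at least twice, binarized features,
# L1 regularization with strength tuned by five-fold cross validation.
tone_clf = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), min_df=2, binary=True),
    LogisticRegressionCV(Cs=10, cv=5, penalty="l1", solver="liblinear"),
)
# tone_clf.fit(train_texts, train_labels)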
We similarly use these models to recreate Figure 2 in the main paper, shown in Figure S6, again finding no meaningful difference from the models based on roberta-base, which helps to assuage concerns that biases present in the RoBERTa models could be distorting our results. (Note that for Figure S6, because we only plot sessions with at least 20 relevant speeches, there are some differences in terms of which sessions are plotted, due to slight differences in which speeches are classified as relevant.)

Figure S7. Reproduction of Figure 1 in the main paper using a different RoBERTa model, trained only on pro- and anti-immigration speeches, to predict tone as a binary variable. The y-axis shows the percentage of speeches per Congress classified as pro-immigration.

To correct for classifier error rates, we first estimate prior label distributions by pooling the annotations from within the previous or following 10 sessions of Congress. This gives us a prior label distribution per session and party, based only on the raw annotated data, i.e., p(y | P, C), where y is the true (human-identified) tone of a speech, P is the party (Republican or Democrat), and C is the session of Congress (43-116).

We then use the predictions of our tone models to estimate confusion matrices for tone predictions, similar to what is shown in Table S5, but normalized by row, and computed per-party for the modern model.* This gives us an estimate of the probability of each predicted label, conditional on each true label, with estimates that vary depending on whether the Congressional session is from the early, mid, or modern period of annotations.† We denote these values as p(ŷ | y, P, C), where ŷ is the predicted label, and y is the true label (provided by the annotators). Because what we actually want is the probability of a true label, conditional on a predicted label (provided by the models), we invert the confusion matrix using Bayes' rule, i.e., p(y | ŷ, P, C) ∝ p(ŷ | y, P, C) · p(y | P, C), and normalize by summing over all values for each predicted label.
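The inversion step can be sketched as follows (a simplified illustration with hypothetical array layouts):

import numpy as np

def invert_confusion(p_pred_given_true: np.ndarray, p_true: np.ndarray) -> np.ndarray:
    """Given p(yhat | y) as a row-normalized confusion matrix
    (rows: true labels, columns: predicted labels) and a class prior
    p(y), return p(y | yhat) via Bayes' rule."""
    joint = p_pred_given_true * p_true[:, None]      # p(yhat | y) * p(y)
    return joint / joint.sum(axis=0, keepdims=True)  # normalize per predicted label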

Putting this all together, we use the inverted confusion matrix to correct the predicted tone probabilities for each segment (which generally softens them away from high probability assigned to a single label), and recreate Figure 1 from the main paper using the corrected probabilities, rather than the predicted labels (i.e., the sum of the pro-immigration probabilities minus the sum of the anti-immigration probabilities). The resulting figure is shown in Figure S8. Although there are some minor changes from the original time series, the overall patterns are essentially the same. Regardless, we treat this as a supporting validity check, rather than our main result, in part due to potential error in estimating error rates and class priors from limited amounts of annotated data.

*For the early model, we just use a single estimate, since there is so little difference between the parties.
†The confusion matrices are not estimated per session of Congress because they require more data than the class priors.

Figure S8. Reproduction of Figure 1 in the main paper, after applying a Bayesian correction to the predicted probabilities, to account for error rates in models and prior label distributions from annotation (by party and over time).
Leave-one-out Analysis. Finally, to ensure that the results are not being strongly influenced by individual members of Congress, we examine robustness at the level of individual speakers. As a complement to this, Figure S10 shows the pattern in net tone per speaker, for all speakers in sessions of Congress where they have at least 20 immigration speeches. As can be seen, there is considerable variability in tone within each party at all points in time. In addition, we can see that the most recent sessions of Congress are unprecedented, in that, except for one or two outliers, the most anti-immigration Democrat is still more pro-immigration than the most pro-immigration Republican. During the Trump administration, we find that all Democrats (among those for whom we have sufficient data) were pro-immigration in their speech, on average, whereas nearly all Republicans were anti-immigration, according to our metric, as was the case for nearly all legislators before the 1950s.
The largest outlier with respect to this finding is the one Republican who appears very pro-immigration in 2017-2018. This appears to be related, in part, to speeches referencing the Immigration Reform and Control Act, which targeted employers who hired people without authorization to work in the U.S.

Finally, Figure S12 shows the percent of immigration speeches classified as pro, neutral, and anti over time, both overall (top), and by party (Democrats middle and Republicans bottom). As can be seen, most speeches are neutral over most of this time series, though neutral speeches become less common over time, consistent with greater polarization. Meanwhile, there is a decline in negative speeches and a rise in positive speeches in the middle of the century, followed by a resurgence in anti-immigration sentiment.

Given the uneven geographical nature of immigration to the U.S., as well as regional realignments in party affiliation during this time period, we provide a validity check on party polarization in which we disambiguate the contributions of party and geography to changes in expressed attitudes towards immigration.‡ To do so, we fit hierarchical Bayesian models to the predicted tone of immigration segments, building linear models with effects per year for each of party and region, as well as overall tone, with each type of offset drawn from its own hierarchical prior, which we fit using Stan (7).

In more detail, we model the tone of each speech as drawn from a normal distribution, with a mean parameter modeled as the sum of an overall per-year tone offset, plus per-year offsets for the speaker's party and region, each drawn from its own hierarchical prior.

In terms of geography, meanwhile, the estimated biases by region (North, South, and West) are shown in Figure S14. We see that the North has had a consistent, mildly pro-immigration bias over this time period, while the South had a relatively anti-immigration bias over the entire twentieth century, and the West has gradually changed from having an anti- to a pro-immigration bias, with a considerable dip from the late 1970s to the early 2000s, especially during the period of discussion of Prop 187 in California. Overall, however, these regional biases pale in comparison to the growing partisan divide shown in Figure S13 (note the difference in scale between the two figures). Finally, Figure S15 shows the estimated offset for the Senate over time, which fluctuates between slightly more pro- and anti-immigration than the House of Representatives, with no consistent difference between them.
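A minimal sketch of such a specification, written in Stan and fit via cmdstanpy (simplified relative to the full model, with illustrative names; see the replication code for the exact specification):

from cmdstanpy import CmdStanModel

# Tone of each speech ~ Normal(mu[year] + party offset + region offset, sigma),
# with per-year party and region offsets drawn from hierarchical priors.
stan_program = """
data {
  int<lower=1> N;  int<lower=1> T;  int<lower=1> P;  int<lower=1> R;
  array[N] int<lower=1, upper=T> year;
  array[N] int<lower=1, upper=P> party;
  array[N] int<lower=1, upper=R> region;
  vector[N] y;                          // predicted tone per speech
}
parameters {
  vector[T] mu;                         // overall tone per year
  matrix[P, T] party_off;               // party offsets per year
  matrix[R, T] region_off;              // region offsets per year
  real<lower=0> tau_party;              // hierarchical scales
  real<lower=0> tau_region;
  real<lower=0> sigma;                  // residual standard deviation
}
model {
  mu ~ normal(0, 1);
  tau_party ~ normal(0, 1);
  tau_region ~ normal(0, 1);
  to_vector(party_off) ~ normal(0, tau_party);
  to_vector(region_off) ~ normal(0, tau_region);
  sigma ~ normal(0, 1);
  for (n in 1:N)
    y[n] ~ normal(mu[year[n]] + party_off[party[n], year[n]]
                  + region_off[region[n], year[n]], sigma);
}
"""

with open("tone_model.stan", "w") as f:
    f.write(stan_program)
model = CmdStanModel(stan_file="tone_model.stan")
# fit = model.sample(data=dict(N=..., T=..., P=..., R=..., year=..., party=..., region=..., y=...))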

The analyses of tone in the main paper primarily emphasize trends over time, and overall differences between the parties. It is also interesting, however, to investigate whether our measure of tone captures variation within parties.

To do so, we obtained ideological positions for each speaker in Congress using DW-NOMINATE, a widely used scaling procedure to compare politicians based on their voting records (8). For each member of Congress from 1880-2020, we collected information for the first and second dimension recovered by DW-NOMINATE using Voteview data from March 2022.¶ The first dimension captures the liberal-conservative spectrum, while the second has historically captured more differences within political parties over regions, civil rights, and lifestyle (9).

We then matched the names from the Voteview database (which are official full names) with the names of the speakers in our immigration speeches dataset. To deal with partial matches (due to partial names and OCR errors), we used the following procedure: first, we normalized all names in both the Voteview and immigration speech records to be all uppercase and converted accents and other diacritics to their ASCII representations. From our list of speakers, we then removed honorifics as well as mentions of their district or state within the speaker field (e.g., converting "Ms. Warren of Massachusetts" to "Warren"). We then used fuzzy string matching to resolve the remaining partial matches.‖

Among those matched speakers with at least 10 speeches relevant to immigration, we calculated their average tone, i.e., their percentage of pro-immigration speeches minus their percentage of anti-immigration speeches. We stratified the speakers by party (Democrat and Republican) and time period (pre-1924, 1924-1965, and post-1965), so that we could analyze how the correlation between our tone measure and DW-NOMINATE varied across parties and over time.
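The normalization step described above can be sketched as follows (the honorific list here is illustrative, not exhaustive):

import re
import unicodedata

HONORIFIC = re.compile(r"^(MR|MRS|MS|DR)\.?\s+")
OF_STATE = re.compile(r"\s+OF\s+[A-Z ]+$")

def normalize_speaker(name: str) -> str:
    """Uppercase, strip diacritics to ASCII, and drop honorifics and
    trailing state mentions, e.g., 'Ms. Warren of Massachusetts' -> 'WARREN'."""
    name = unicodedata.normalize("NFKD", name.upper())
    name = name.encode("ascii", "ignore").decode("ascii").strip()
    name = HONORIFIC.sub("", name)
    return OF_STATE.sub("", name)

print(normalize_speaker("Ms. Warren of Massachusetts"))  # WARREN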
¶https://voteview.com/data
‖https://pypi.org/project/fuzzywuzzy/

Our results show that the first dimension of DW-NOMINATE, which captures the liberal-conservative spectrum, has a negative relationship with our tone measure.** That is, more conservative speakers within each party tend to be more anti-immigration. We also find that the second dimension of DW-NOMINATE has a negative relationship with our tone measure. The interpretation of the second dimension is more ambiguous; it tends to capture cross-cutting issues that are not already captured by the liberal-conservative spectrum. Like the first dimension, we find that the relationship becomes increasingly negative over time.

These results demonstrate that our tone measure is not only correlated with political ideology, but also that it can capture variation across individuals within each party. Furthermore, this correlation becomes stronger over time, supporting the argument that immigration has become an increasingly polarized issue.

We found, as in Congressional speeches, that attitudes towards immigrants improved over time among Democrats but not among Republicans. We also found in general that more respondents wanted to keep immigration at its present levels or decreased, and fewer wanted levels increased. On average over the 12 surveys, 33% of respondents wanted immigration levels kept at present, 16% wanted them increased, and 42% wanted them decreased (see Figure S18). We then report the relationship between the share of speeches that are anti-immigration and the share of respondents in each state who wanted immigration levels decreased.

**Typically our tone measure, as a difference of percentages, falls between −100 and 100, while DW-NOMINATE falls between −1 and 1 (as shown in Figures S16 and S17). However, here we rescaled our tone measure to the range of −1 to 1 when reporting regression coefficients, so that the two measures could be on the same scale.

In other words, within a state, as the local population reports attitudes that are more anti-immigration, political representatives from that state are measured as making more anti-immigration speeches. This correlation does not tell us the direction of the relationship between local attitudes and political speeches (it could be that politicians are responding to changing attitudes in the electorate, or that local residents are influenced by political elites), but we find the association between these two state-level measures of attitudes toward immigration to be reassuring.

We conducted a supplementary analysis to test whether the divergence in attitudes toward immigration that we observed could be explained by electoral incentives. In our first regression model, we regressed each party's average tone on an indicator for whether each year was an election year (even) or non-election year (odd), with fixed effects for decades. We found for both parties that there was no significant difference in tone between on- and off-election years, either pre-1979 (before C-SPAN) or post-1979 (after C-SPAN).

In our second regression model, we conducted a very similar analysis, but instead of simply indicating whether it was an on- or off-election year, we provided as an independent variable the number of months until the next election. For example, November in an even year mapped to 1, October mapped to 2, and so on, with December mapping to 24. In an odd year, November mapped to 13, October to 14, and so on, with December mapping to 12. Again, we found no significant effect for either party: both pre-1979 and post-1979, the number of months until the next election did not have a significant effect on average tone.
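This mapping can be expressed compactly as follows (a direct implementation of the scheme just described):

def months_until_election(year: int, month: int) -> int:
    """Map a (year, month) pair to the months-until-next-election variable:
    November of an election (even) year -> 1, December of an election
    year -> 24, November of an odd year -> 13, and so on."""
    if year % 2 == 0:
        return 12 - month if month <= 11 else 24
    return 24 - month

assert months_until_election(2020, 11) == 1
assert months_until_election(2020, 12) == 24
assert months_until_election(2019, 11) == 13
assert months_until_election(2019, 12) == 12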

Thus, we did not find evidence that the polarization we observed could be explained by the advent of new technology like C-SPAN and a desire to appeal to voters. In addition, Congressional Representatives do not seem to be altering their tone with respect to immigration in election years, which suggests that anti-immigration attitudes are not being driven primarily by electoral cycles.

To demonstrate that Figure 2 in the main paper (which plots tone over time for the three most frequently mentioned nationalities: Italian, Chinese, and Mexican) is representative of broader regional trends, we create an equivalent plot here for the corresponding regions (Europe, Asia, and Latin America), as shown in Figure S20. Specifically, we count mentions of the corresponding regions. Among individual countries, Cuba only became more positive than the average in the 2000s, and Mexico is still at or below the overall average today.

The corresponding mention frequencies are shown in Figure S22. We report results from a regression with country-of-origin-by-decade observations for 14 countries of origin. In each case, we calculate the share of speeches about that country of origin in that decade that are pro-immigration or anti-immigration. Our dependent variable is % pro − % anti, as in our main results (e.g., Figure 1). As explanatory variables, we include, among others, country-of-origin fixed effects (see Table S6). We leave a fuller exploration of these patterns to future work. One explanation could be that groups are spoken about very positively when they are refugees and perceived to be in need of help (e.g., Cubans, Vietnamese), and that this positive speech diminishes as the groups become perceived to be made up of more "economic" migrants over time. Exploring these hypotheses in greater detail is beyond the scope of this paper, but the patterns that emerge are interesting and worthy of further study.
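A minimal sketch of such a regression (hypothetical column names; only the country-of-origin fixed effects named above are included, with decade effects added here purely for illustration; the full set of explanatory variables is listed in Table S6):

import pandas as pd
import statsmodels.formula.api as smf

def fit_origin_regression(df: pd.DataFrame):
    """df: one row per country of origin per decade, with columns
    'country', 'decade', and 'net_tone' (% pro - % anti)."""
    return smf.ols("net_tone ~ C(country) + C(decade)", data=df).fit()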

Immigration Topics
Figure S23 shows a set of 40 topics discovered using Latent Dirichlet Allocation (10), plotted in terms of the mean document proportions over time. As can be seen, some are procedural (e.g., "act, section, amendment"), some group together nationalities (e.g., "chinese, treaty, china, government, japanese"), some reflect aspects of the immigration debate that are relatively focused in time (e.g., "education, school, students"), and some represent enduring issues (e.g., "tax, percent, budget"). Most, however, are relatively localized in time. Because of this, we choose to make use of semi-automatically constructed frames (tagged lexicons) for measuring the prevalence of immigration frames across the entire time period (see below).
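For reference, a topic model of this kind can be fit as follows (a minimal sketch with illustrative parameters; `speeches` is assumed to hold the speech texts):

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words="english", min_df=5)
doc_term = vectorizer.fit_transform(speeches)        # speeches: list of strings
lda = LatentDirichletAllocation(n_components=40, random_state=0).fit(doc_term)

vocab = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):          # print top words per topic
    print(k, ", ".join(vocab[i] for i in topic.argsort()[-5:][::-1]))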

In order to deepen our analysis of the language used in relation to immigrants, we constructed a set of fourteen immigration "frames", i.e., thematic groups of words used in association with immigrants, and measured their prevalence across parties, across ethnic groups, and over time.

In this section, we discuss how we chose these fourteen frames, and how we constructed word lists for each frame. Our approach consisted of three steps: (1) we applied computational methods to uncover all words that were used significantly more in reference to immigrants as opposed to generic people mentions (e.g., "man", "woman", etc.); (2) based on a combination of automatic word clustering methods and hypotheses from the literature on immigration, we identified fourteen relevant frames; (3) all authors on this paper manually annotated the immigrant-associated words to label each one with the frame(s) that they belonged to.

Identifying immigrant-associated words. In the first step, our goal was to automatically identify all words used in association with immigrants, i.e., words that were used significantly more frequently to modify immigrant terms than general person terms.

First, we constructed two groups of "anchor" terms, one of immigrant terms (e.g., "immigrant", "emigrant") and one of generic terms related to people (e.g., "person", "man", "woman"), which also included the immigrant terms; we provide a full list of these anchor terms in Table S7.

To identify modifying words, we applied part-of-speech and dependency parsing to all of the sentences in the Congressional speeches. For a given anchor term, we collected all adjectives, verbs, and nouns that appeared in the speeches with certain dependency relations to the anchor term. For example, we collected all adjectives that were adjectival modifiers of the anchor term, such as "illegal" in "illegal immigrants". In Table S8, we list the dependency relations and filtering criteria used. Let Ci denote the contexts collected for the immigrant anchors, and Cp those collected for the generic person anchors.

To identify terms that were significantly associated with immigrants, we compared the relative frequency of words in Ci versus Cp. Formally, for a given term w (defined by a lemma and a part-of-speech), we computed its relative frequency fi(w) among the immigrant contexts and its relative "background" frequency fp(w) among the generic person contexts, and retained terms with fi(w) significantly greater than fp(w), with 898 unique words (lemma and part-of-speech) in total. In Figure S24, we also visualize a subset of these words, the "strongest" associations, for which fi(w) was at least 5 times as large as fp(w).
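A simplified sketch of this extraction step using spaCy (the anchor list and dependency relations here are a subset for illustration; see Tables S7 and S8 for the full sets):

import spacy

nlp = spacy.load("en_core_web_sm")
ANCHORS = {"immigrant", "emigrant"}  # subset of the anchor terms in Table S7

def immigrant_modifiers(sentence: str):
    """Collect (lemma, POS) pairs for words standing in selected
    dependency relations to an anchor term."""
    pairs = []
    for tok in nlp(sentence):
        # adjectival modifiers of the anchor, e.g., "illegal" in "illegal immigrants"
        if tok.dep_ == "amod" and tok.head.lemma_.lower() in ANCHORS:
            pairs.append((tok.lemma_.lower(), tok.pos_))
        # verbs governing the anchor as subject or object, e.g., "flee"
        if tok.lemma_.lower() in ANCHORS and tok.dep_ in {"nsubj", "dobj"}:
            pairs.append((tok.head.lemma_.lower(), tok.head.pos_))
    return pairs

print(immigrant_modifiers("These illegal immigrants flee persecution."))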

Identifying immigrant frames. In our second step, our goal was to construct frames from among the immigrant-associated words. We first approached this automatically, using a combination of word embedding and clustering techniques. In order to learn word embeddings that were specific to the context of the Congressional speeches, we trained our own word embeddings on the Congressional speeches using word2vec (11). As input to the word2vec model, we provided all of the immigration speeches.

Figure S24. Association scores for immigrant-associated words across time periods. The x-axis shows the score in the earlier time period, and the y-axis the same score in the later period. From the 898 words that we kept (see Table S8 for filtering criteria), we plot the words whose scores are at least 5 for one or both periods. The grey line indicates y = x; falling above the line (e.g., the verb "flee") means that the word was more associated with immigrants in the later period, while falling below the line (e.g., the adjective "undesirable") means that the word was more associated with immigrants in the earlier period.
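These two steps can be sketched as follows (illustrative hyperparameters; `tokenized_speeches` and `associated_words` are assumed to come from the preceding steps):

from gensim.models import Word2Vec
from sklearn.cluster import KMeans

# Train embeddings on the immigration speeches themselves
w2v = Word2Vec(sentences=tokenized_speeches, vector_size=100,
               window=5, min_count=5, workers=4)

# Cluster the embeddings of the immigrant-associated words
vectors = [w2v.wv[w] for w in associated_words if w in w2v.wv]
for k in (10, 14, 20):                   # try a range of cluster sizes
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(vectors)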
Then, we gathered the word embeddings for each of the immigrant-associated words output in the previous step. To identify potential themes among these words, we ran k-means clustering on their word embeddings, with a range of possible cluster sizes.

Table S9. Curated word lists for frames. Letters in parentheses indicate part of speech (n = noun, v = verb, a = adjective). Note that hyphens between words were dropped in preprocessing the speeches, hence the appearance of terms like "selfsufficiency (n)".

Figure S25. Comparison of usage of frames by parties (reproducing the analysis in Figure 3 in main paper), when using the expanded frame lexicons combining words from Tables S9 and S10.

Figure S26. Comparison of usage of frames (reproducing the analysis in Figure 4 in main paper), when using the expanded frame lexicons combining words from Tables S9 and S10.

To measure implicit dehumanization, we use contextual embedding models to measure the extent to which mentions of immigrants "sound like" each of several metaphorical categories.

The basic idea of this method is illustrated in Figure S28. Contextual embedding models, such as BERT (12), are trained to predict the identity of randomly masked words based on the surrounding context. Here, we repurpose the model by intentionally masking entire mentions of immigrants (which could be, for example, "aliens" or "Mexican nationals", etc.), and computing the probability, according to the model, that each word in its vocabulary would serve as a replacement for the mask. By adding up the probabilities for a set of words which we have previously identified as being representative of particular categories, we get an estimate of how much a particular mention suggests the corresponding metaphor. In Figure S28, for example, the reference to "dumping" something "into this country" suggests that words in the Cargo category would be likely replacements.

To construct the word lists for each metaphorical category, starting with an initial set, we use static embeddings to look for semantically similar terms, and then limit the list to those that are in the BERT vocabulary as whole words. The full set of terms we use as targets for each of the metaphorical categories are given in Table S11.
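A minimal sketch of this computation with the transformers library (the category word list here is abbreviated for illustration; see Table S11 for the full lists):

import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
mlm = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

def category_probability(masked_text: str, category_words: list) -> float:
    """Sum the model's fill-in probabilities over a category's word list,
    for a context whose immigrant mention is replaced by [MASK]."""
    inputs = tokenizer(masked_text, return_tensors="pt")
    mask_idx = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero()[0]
    with torch.no_grad():
        logits = mlm(**inputs).logits
    probs = logits[0, mask_idx].softmax(dim=-1).squeeze(0)
    ids = tokenizer.convert_tokens_to_ids(category_words)
    return probs[ids].sum().item()

print(category_probability("They are dumping [MASK] into this country.",
                           ["cargo", "goods", "waste"]))  # abbreviated Cargo list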

Table S11 also includes a set of random control terms. To choose these, we counted the occurrence of all words that occur as nouns in the Congressional Record (after parsing it with spaCy), and restricted the possible set to those words that occur at least 1000 times and that exist as whole words in the BERT vocabulary. We then selected a random set of 50, excluding all terms that had previously been used in identifying immigrants, nationalities, or other metaphorical categories. Finally, we noted that the resulting random list included the word "humans". Since this term would account for most of the probability mass for the Random category, we exclude this term, although leaving it in leads to similar results.

Without any correction, the raw probabilities for all metaphorical categories decline over time. However, in order to correct for unrelated factors which might explain this decline (e.g., older data being less similar to the data that BERT was trained on, or factors related to the Congressional Record itself), we make use of the set of random control terms described above.

In more detail, to estimate the prevalence of contexts which cue these metaphors over time, we compute the average probability assigned to all terms in each category at each point in time (across contexts), and divide by the number of terms in the metaphorical category. We then repeat this for the terms in the Random category. Finally, we plot the log of the ratio between these two, which is equivalent to the log of the first quantity minus the log of the Random quantity, as shown in Figure S29.

More formally, for a set of N contexts (mentions of immigrants), we compute the relative log probability for metaphor m as

score(m) = log[ (1 / (N · |Wm|)) Σ_{i=1..N} Σ_{w ∈ Wm} p(w | ci) ] − log[ (1 / (N · |Wr|)) Σ_{i=1..N} Σ_{w ∈ Wr} p(w | ci) ],

where ci is the i-th context, Wm is the set of words associated with metaphor m, |Wm| is the number of terms in that category, Wr is the set of terms in the Random category, and p(w | ci) is the probability assigned to word w in context ci with the masked mention.
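Equivalently, in code (given per-context summed probabilities, e.g., from the category_probability function sketched above):

import numpy as np

def metaphor_score(p_category: np.ndarray, n_category_words: int,
                   p_random: np.ndarray, n_random_words: int) -> float:
    """Relative log probability for a metaphor: the log average per-word
    probability of category terms minus the same quantity for random terms."""
    return float(np.log(p_category.mean() / n_category_words)
                 - np.log(p_random.mean() / n_random_words))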

After correcting for changes in random terms, we see in Figure S29 that in fact there is no significant increase or decrease in dehumanizing metaphors over time. In addition, we can see that the Animal and Cargo words are the most prominent. By contrast, the Vermin terms are actually less likely than random terms, on average, though we still see that they are significantly more likely in speeches by Republicans than Democrats in the past two decades.

In addition, if we do the same comparison between groups (e.g., Republicans vs. Democrats) as we did for the dehumanizing metaphors, but using the Random category, we find that all comparisons are either not significant or show the opposite sign of the differences observed between parties and groups. Thus, we are confident that the observed differences are real, and report them without correction.

Looking at the list of examples given in Table S12, we note that some of the animal examples seem to be picking up on mentions of farms and ranches. In order to ensure that our results are not confounded by differences in how the two parties talk about agriculture in relation to immigration, we investigate these mentions in more detail.

To better understand what terms are driving our measurements of mentions of immigrants as suggestive of an Animal metaphor, we take all of the mentions of immigrants from the past two decades, and train a basic bag-of-words logistic regression model to differentiate between those that are found to be relatively highly indicative of this metaphor (the top quartile) and the rest. The resulting model shows that some of the most heavily weighted words are indeed related to agriculture, with the most heavily weighted among the agricultural terms being "agricultural", "dairy", "agriculture", and "farm". We then examine how frequently these terms are used by Democrats and Republicans. For the sake of completeness, we consider the four terms mentioned above, along with singular and plural forms of "farm", "farmer", "ranch", and "rancher". The mention frequencies by party are shown in Figure S31.

Figure S29. Average relative log probabilities for each dehumanizing metaphor (as well as overall) for mentions of immigrants over time. Black lines show the overall logged average probability prior to the period of polarization; red and blue lines show the averages for Republicans and Democrats, respectively. All lines represent the log of the average probability per word in the metaphorical category relative to the average probability per word for a set of random terms. For most metaphors, probabilities are significantly higher for Republicans (relative to Democrats) in the past two decades, but there is no significant increase or decrease in the overall probability of dehumanizing metaphors over time.

Table S12. The 12 contexts (and corresponding mention terms) most strongly suggestive of the Animal metaphor. Cumulative probability is the sum of probabilities assigned by the model to all words in the Animal category (see Table S11).

As can be seen, there is indeed a difference between the parties in terms of how frequently they use some of these terms in the context of immigration. However, the differences are somewhat symmetric, with Democrats referring more to farms and farmers, and Republicans referring more to ranches and ranchers. Moreover, among the most highly weighted of these terms, two are more frequently used by Democrats ("farm" and "dairy"), and two are more frequently used by Republicans ("agriculture" and "agricultural"). As such, we do not believe that the difference between the parties in terms of how much their mentions of immigrants cue the "animal" metaphor is driven primarily by a difference in their respective references to certain agricultural terms.

To be doubly sure, we re-run the metaphorical analysis after masking out all of the agriculture-related words listed in Figure S31 (i.e., replacing those tokens with "[MASK]"). Although the numbers change slightly, this does not change any of our conclusions.

In particular, rounded to the nearest decimal, the resulting probability ratios (for the combined dehumanization metric) remain the same: 1.4 for Chinese:European in the early time period, 1.9 for Mexican:European in the past two decades, and 1.6 for Republicans:Democrats in the past two decades, with no significant difference between the parties in the earlier time period.

Finally, to verify that the observed differences in the use of dehumanizing metaphors are not excessively influenced by the fact that BERT was trained on modern data, we repeat the analysis using HistBERT, a version of BERT that has been fine-tuned on historical data covering the entire 20th century (13). Although this produces slightly different values, the overall conclusions remain the same.

Although our use of immigration frames in the main paper is restricted to testing differences between parties and groups, they also allow us to study broader changes over time in how immigration is discussed. To do so, we plot the combined frequency of terms associated with each frame (with the appropriate part-of-speech tags), both in speeches about immigration, and in all non-procedural speeches (Figure S32).

Although the comparison is imperfect (because non-procedural speeches that are not about immigration represent a variety of different types of speech, not just those about comparable issues), we use the relative frequency, measured using pointwise mutual information (PMI), to measure how salient each frame was to the issue of immigration over time (Figure S33). To make the PMI scores comparable across sessions of Congress (which differ in the total number of tokens), we normalize these scores by dividing by the PMI score for the term "immigration", which almost never occurs except in speeches that have been classified as being about immigration (hence the use of "Scaled PMI" in these figures). We also use PMI to show the divergence between the parties on framing over time (Figure S34) and the overall trends for each party (Figure S35).

Figure S31. Mention frequencies, by party, of agriculture-related words, in segments mentioning immigration from the past two decades. As can be seen, there is a difference between the parties, but this is not likely to explain the differences in our dehumanization metric, given that Republicans mention ranching more, whereas Democrats mention farming more.
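Returning to the scaled PMI computation described above, a minimal sketch (assuming per-session term and token counts are available; the relative-frequency form of PMI used here is one natural reading of the description):

import numpy as np

def scaled_pmi(frame_imm: int, tokens_imm: int, frame_all: int, tokens_all: int,
               immig_imm: int, immig_all: int) -> float:
    """PMI of a frame's terms with immigration speeches, normalized by
    the PMI of the word 'immigration' itself (per session of Congress)."""
    pmi_frame = np.log((frame_imm / tokens_imm) / (frame_all / tokens_all))
    pmi_anchor = np.log((immig_imm / tokens_imm) / (immig_all / tokens_all))
    return float(pmi_frame / pmi_anchor)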
In terms of raw frequencies, immigration speeches have made increasing use of terms associated with Victims, Family, Legality, Quantity, Crime, Threats, and Contributions (all significant at the p < 0.001 level; see Figure S32). For many of these frames, however, the associated terms have also become more common in all non-procedural speeches. As a result, the only frames that show a significant increase in their association with immigration (as measured using PMI) are Crime, Legality, and Victims. Although each of these frames has always had some association with immigration, these associations have grown much stronger over time.

For Legality, this was driven partly by the rise in mentions of "illegal immigrants", but also relates to the expansion of the immigration bureaucracy, including issues related to visas and naturalization. The term "illegal" also contributes to the growing association with Crime, but this association also depends on the popularly expressed notion that immigrants are bringing "drugs", committing "crimes", and are connected to "terrorism".

By contrast, the only frame for which there has been a significant decline in raw frequencies within immigration speeches is Deficient, which was once extremely common due to references to immigrants said to be "illiterate", "diseased", or "undesirable".

However, several frames show a significant relative decline in their association with immigration (measured using PMI), including Deficient, Culture, Threats, Family, Labor, Contributions, and Economics, as the terms associated with these frames have become relatively more frequent in non-immigration speeches.

As expected, nearly all frames show a positive association with immigration during this time, because of how they were constructed (beginning with terms which referred to immigrants more frequently than generic person mentions). The one exception to this is Economics. Although economic terms are common in speeches about immigration, they are even more frequent in non-immigration speeches, and this effect is stronger now than in the past, hence the significant decline in this association. For the differences between parties, these results largely match those shown in Figure 3 in the main paper.

As shown in Figure S34, however, we can see that all of these differences have grown stronger over time since the 1980s.

Figure S33. Scaled PMI using all speeches, with slopes estimated over the entire time period from 1880-2020. Slopes are scaled such that a slope of 1.0 would be equivalent to moving from 0 to 1 in PMI over the entire time series. Asterisks indicate significant changes after applying a Bonferroni correction (with uncorrected p-values listed). Shaded bands indicate the area from minimum to maximum scaled PMI obtained from re-computing values when excluding one term at a time from the terms associated with a given frame (showing that no single term has a massive effect on the slope, though significance would change in some cases if we took the maximum p-value). Note that all plots are scaled consistently and are comparable in absolute terms.