1 Introduction

Social media platforms must balance two important but conflicting goals: preserving freedom of speech and curbing hate speech and extremism. If the pendulum swings too far toward curbing undesired ideas, platforms may become monotonous, ultimately losing their appeal to users. This paper explores the converse case, in which a social media platform, Reddit (https://www.reddit.com), gave communities so much leeway to promulgate their ideologies that highly toxic hate groups formed. We seek to understand the impact these problematic groups had on the dissemination of hate through the platform as a whole by exploring the following research question:

How does becoming active in a hate subreddit change a user’s behavior on the platform?

The reason we explore this question is to better understand whether (and how) users adopt extremist ideas from exposure to hate subreddits, where we define “exposure” as posting messages on such a subreddit. We answer this question by matching treatment users, i.e., those who become active in a hate subreddit, with control users, who are similar in behavior but never join said hate subreddit. We then model users’ hate speech through an interrupted time series design to obtain a causal estimate of how much of the change in hate speech can be attributed to becoming active in a hate subreddit. In total, we analyzed ten subreddits across four categories: Alt-Right, Racist, Sexist, and Fat-Shaming. In contrast to a previous paper (Schmitz et al. 2022), we use a novel method to systematically label hate subreddits and use this larger pool of subreddits to find statistically significant effects of joining distinct hate subreddits and subreddit categories.

For each subreddit, we use sparse additive generative models of text (SAGE) (Eisenstein et al. 2011) to generate a lexicon of community-specific hate slurs. The lexicons are then used to measure the hate speech of each treatment user through time, both before and after they join the hate community, as well as that of their matched control user over the same time period. Causal analysis through interrupted time series leads us to conclude that becoming active in a hate community leads to a measurable spillover of hate speech into other, non-hate communities, meaning that a user’s hate is not self-contained within the subreddit they join. The findings are consistent across subreddits and categories, demonstrating the qualitative and quantitative robustness of this effect. Our results provide tentative evidence that users adopt extremist ideas (Marwick et al. 2022), and point to the need for social media platforms to improve their moderation in order to reduce hate both within and outside hate communities.

2 Related work

2.1 Online communities

Motivations for joining online communities can be neutral or benign, such as boredom, seeking social support, and information exchange (Ridings and Gefen 2004; Laeeq 2017; Lewis et al. 2018; Yao et al. 2021; Stockdale and Coyne 2020). Yet motivations can also be malicious, such as attacking the community and its members (Jhaver et al. 2018), subsequently diminishing overall community activity (Kumar et al. 2018).

The continued engagement of users with a community hinges on several factors, such as whether a newcomer receives reciprocal interaction from existing community members (Arguello et al. 2006; Burke et al. 2009; Burke and Settles 2011; Kraut and Resnick 2012), with engagement being further boosted when replies are positive (Arguello et al. 2006) or personalized (Kraut and Resnick 2012). New editors making their first contributions to Wikipedia are less likely to keep contributing if their original edits are reverted (Halfaker et al. 2011; Zhang and Zhu 2006); for this reason, Wikipedia encourages existing users to be gentle with newcomers (Wikipedia Contributors 2022). The expression of negative emotion by other group members leads to community abandonment (Yao et al. 2021). A user’s stay in a community is also related to the user’s adoption of the group’s linguistic norms (Danescu-Niculescu-Mizil et al. 2013). At times, certain newcomers are undesired (Kraut and Resnick 2012; Choi et al. 2008), such as those who do not conform to community ideology and culture (Boero and Pascoe 2012; Santos et al. 2020).

The characteristics of each community on Reddit are sufficiently idiosyncratic that communities can be accurately identified from either their style (86.5% accuracy) or their topic (71.1% accuracy) (Tran and Ostendorf 2016). Rudeness and attacks between dissimilar people reduce online engagement across ideological lines (Marchal 2020), yet content about the out-group is shared at higher rates than content regarding the in-group (Rathje et al. 2021).

2.2 Hate speech in social media

It has been shown that news-feed algorithms favor users who share hate and extremist content over those whose posts are neutral or positive (Phadke and Mitra 2021). Further, Facebook has been shown to play a larger role in the active recruitment of new members, whereas Twitter is more prominent in broadcasting the groups’ ideologies (Phadke and Mitra 2020).

Physical characteristics, such as ethnicity, height, weight, and gender, alongside observable behaviors, such as sensitivity or insecurity, are the primary attributes targeted by attackers in their hate speech (Mondal et al. 2017; Silva et al. 2016). When hate speech is directed, as opposed to generalized, it tends to be angrier (ElSherief et al. 2018), and an increase in time spent on far-right websites such as Gab has been linked to an increase in hate speech (Gallacher and Bright 2021; Zannettou et al. 2018). On social media, it is estimated that as much as 25% of content contains some form of hate speech, with explicit hate speech observed in 13.7% of content and implicit hate speech in 15.5% (Rieger et al. 2021). Hate users display changes in both activity levels and lexicon compared to regular users (Chatzakou et al. 2017).

Extremist subreddits have been shown to prop up extremist content through user-level actions such as upvotes and downvotes, which favor an echo chamber by surfacing content aligned with the group’s views while dismissing contrarian ones (Gaudette et al. 2021; Massanari 2017; Gothard 2020). One important challenge in studying hate speech is the inherent difficulty of satisfying all critics with regard to the definition used, while also constructing a definition specific enough to be implemented programmatically as software (Silva et al. 2016). Users who receive replies are less likely to become engaged in hate subreddits than users who do not, while the opposite effect is observed for non-hate subreddits; this effect is attributable to the toxic, negative, and attacking nature of replies in hate subreddits (Hickey et al. 2023).

2.3 Moderation of hate speech

Both soft approaches, such as quarantines, where users receive a warning and must agree to proceed before viewing the contents of toxic communities, and hard approaches, such as banning problematic subreddits, have been shown to systematically reduce hate speech on Reddit (Chandrasekharan et al. 2017, 2022; Copland 2020). There is, however, evidence that such actions merely push hateful users from Reddit to other, less regulated platforms (Copland 2020). Taken together, these findings indicate that there is likely a user overlap between different hate subreddits and their external counterparts, such that any moderation action must account for user migration (Zannettou et al. 2017). For example, it has been shown that banning a subreddit leads its most active users to reduce their overall activity the most, but most users did not reduce their overall activity; rather, they only reduced their usage of subreddit-specific language (Trujillo et al. 2021). The reduction in moderation on Twitter that followed Elon Musk’s acquisition has led to a significant increase in hate speech on the platform (Hickey et al. 2023). Another challenge to moderation is that suppressing hate content on one platform can inadvertently boost the sharing of that content on another platform (Johnson et al. 2019). Both as a result of moderation initiatives and due to a myriad of other changing factors, certain communities can migrate, sometimes with the intent of avoiding moderation and sometimes for other reasons (Davies et al. 2021). A factor that helps Reddit retain users is its long tail of niche content, which is only viable because of the size of Reddit’s user base. This suggests that whether a hate or fringe community can be eliminated by a Reddit ban depends on the community’s prominence and its ability to attract new members while residing on less popular platforms (Newell et al. 2016).

3 Methods

3.1 Identifying communities of interest

We base our selection of hate subreddits on our prior work (Hickey et al. 2023), which identified 25 hate subreddits that have since been banned; from these we subsampled ten subreddits across the Racism, Sexism, Alt-Right, and Fat-Shaming categories, as shown in Table 1. The code used to parse and analyze the data in this paper is available online (Footnote 1).

Table 1 Subreddits studied and their category

3.2 Identifying subreddit-specific hate language

Our approach to defining hate words follows Supreme Court Justice Potter Stewart’s principle for defining a threshold for obscenity: “I know it when I see it.” Namely, words that are observed to have hateful connotations in their contexts are considered hate words. This lexicon-based approach allows us to implement our analysis in a computationally tractable manner.

We use sparse additive generative models of text (SAGE) (Eisenstein et al. 2011) to rank all unigrams from the studied subreddit by how distinctive (i.e., community-specific) they are relative to typical word usage on Reddit. To define typical word usage, we gather a 10 GB corpus containing posts (submissions and comments) randomly sampled across all of Reddit. For a given studied subreddit, we use all posts within it as the community corpus. We remove stopwords from both corpora. Many hate words are slang terms that are not well handled by stemming or lemmatization, so we do not apply either technique.
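To make the ranking step concrete, the sketch below scores unigrams by a simple smoothed frequency ratio between the community corpus and the background corpus. This log-ratio-style scoring is only an illustrative stand-in for SAGE (Eisenstein et al. 2011), which instead fits sparse deviations from background log-frequencies; the function and variable names are ours, not the original implementation.

import re
from collections import Counter

def rank_distinctive_words(community_corpus, background_corpus, min_count=10):
    # Illustrative stand-in for SAGE: score each unigram by how much more
    # frequent it is in the community corpus than in the background corpus.
    def tokens(text):
        return re.findall(r"[a-z']+", text.lower())
    community = Counter(tokens(community_corpus))
    background = Counter(tokens(background_corpus))
    n_comm = sum(community.values())
    n_back = sum(background.values())
    scores = {
        word: (count / n_comm) / ((background[word] + 1) / n_back)
        for word, count in community.items()
        if count >= min_count  # ignore very rare words
    }
    # Highest-scoring words are the most community-specific.
    return sorted(scores, key=scores.get, reverse=True)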

We further filter the ranked word list of each subreddit to the top 100 most distinctive words and then rate them using a three-rater system, in which authors of this paper independently score each word as 0, 1, or 2, meaning, respectively, not a hate word, sometimes a hate word, or always a hate word. Words with a cumulative score at or above 4 (i.e., at least one rater rated the word as “always a hate word”) are then defined as subreddit-specific hate words. Creating a customized lexicon for each hate subreddit is important because much of the hate expressed in these subreddits is based on group-specific slang that is poorly captured by generalized lexicons (van der Does et al. 2022; Gerrard 2018). From these hate word lists, we can then quantify the hate speech of a given user on a given day as the ratio of hate words to total words. The complete hate lexicon for each subreddit is available on our GitHub page (Footnote 2).
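A minimal sketch of the daily hate-speech measure follows, assuming posts are available as (user, date, text) records and that each subreddit’s lexicon has been loaded as a set of lowercase words; the function name and data layout are illustrative rather than the exact implementation.

from collections import Counter

def daily_hate_rate(posts, hate_lexicon):
    # posts: iterable of (user, date, text); hate_lexicon: set of hate words.
    # Returns {(user, date): fraction of that day's words that are hate words}.
    hate_counts = Counter()
    total_counts = Counter()
    for user, date, text in posts:
        words = text.lower().split()
        key = (user, date)
        total_counts[key] += len(words)
        hate_counts[key] += sum(1 for w in words if w in hate_lexicon)
    return {key: hate_counts[key] / total
            for key, total in total_counts.items() if total > 0}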

3.3 Data gathering

In our study, “treatment” is defined as becoming active in a hate subreddit. Users who become active are considered “treated users”, while users who never become active are “control users”. We explain how we gather each set of users below.

3.3.1 Treatment data

For each user who posted on each hate subreddit, we collect their entire posting history using the PushShift API (Baumgartner et al. 2020). To prevent our analysis from being influenced by automated accounts (bots), we manually inspect accounts whose names match the keywords “bot”, “auto”, “transcriber”, “gif”, “link”, and “twitter”, removing those that self-identify as bots or are likely bots based on their behavior. Similarly, for each subreddit we manually inspect the top 20 accounts by activity level, again removing those found to be automated. Overall, a small number of accounts (\(< 1\%\)) are removed.
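The keyword-based screening step can be sketched as below; accounts flagged here are only candidates, since (as described above) the final removal decision is made by manual inspection. The function name is illustrative.

BOT_KEYWORDS = ("bot", "auto", "transcriber", "gif", "link", "twitter")

def flag_bot_candidates(usernames):
    # Return account names containing any bot-related keyword; these are then
    # inspected by hand, and only confirmed automated accounts are removed.
    return [name for name in usernames
            if any(keyword in name.lower() for keyword in BOT_KEYWORDS)]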

To increase signal, we further filter treatment users to include only those who, at the time of becoming active in their respective hate subreddit, are not active in any of the other hate subreddits among the 25 compiled by Hickey et al. (2023). We follow this procedure under the rationale that users who are already adopting extremist ideas as a result of interactions on other hate subreddits are unlikely to show a significant change in behavior when becoming active in “just one more” hate subreddit. Table 1 displays both the original number of users collected and the number remaining after filtering.

3.3.2 Control data

Our goal is to match each treated user to their most similar control user prior to the treated user becoming active in a hate subreddit. We thus opt for a targeted crawling approach, focusing on obtaining candidate control users that are as similar as possible to the treated users. To that end, we draw control users from the thirty subreddits with the highest percentage of user-base overlap with each hate subreddit, always removing from that list users who are also active on the treatment subreddit itself, as those are considered treatment users.
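A minimal sketch of the overlap ranking follows, assuming the set of authors for each subreddit has already been collected; the names and dictionary layout are ours.

def top_overlap_subreddits(hate_subreddit_users, users_by_subreddit, k=30):
    # hate_subreddit_users: set of authors active in the hate subreddit.
    # users_by_subreddit: dict mapping subreddit name -> set of its authors.
    # Returns the k subreddits whose user bases overlap most with the hate subreddit.
    overlap = {
        subreddit: len(users & hate_subreddit_users) / len(hate_subreddit_users)
        for subreddit, users in users_by_subreddit.items()
    }
    return sorted(overlap, key=overlap.get, reverse=True)[:k]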

3.3.3 Banned subreddits

We obtain a non-exhaustive list of banned subreddits in order to analyze hate speech by users within under-moderated communities outside the hate subreddit. Reddit does not divulge its bans, so all information on bans comes from user-generated content and self-reports. The gathering process was entirely manual and consisted of browsing Reddit itself for posts compiling subreddit bans. Given the nature of this data-gathering process, prominent bans and bans of large subreddits are more likely to be featured in our set. In total, we obtained 3515 subreddits reported as banned.

3.3.4 Categorical aggregation

One negative side effect of filtering our treatment data to consider only users with no prior activity in other hate subreddits is that it reduces the overall sample size, which can make it harder to detect whether the measured effects are statistically significant. For this reason, we pool subreddit data into four categorical groups, as shown in Table 1. Since results are similar across individual subreddits and grouped categories, we report figures only for the groups in the main section of this paper and include the (much more numerous) plots for individual subreddits in the Supplementary Materials.

3.4 Treatment-control matching

For each treatment user, we find the control user that is most similar to them (Niven et al. 2012; Ali et al. 2018), with similarity defined in terms of Mahalanobis distance (Stuart 2010). For computational reasons, we randomly subsample the (initially much larger) pool of potential control users to be ten times the size of the treatment users’ pool; this is done separately for each subreddit.

The Mahalanobis distance is calculated using the following user attributes: account creation date, Reddit karma (sum of all up-votes minus all down-votes), total number of submissions, total number of comments, and the count of posts made in each of the 50 most similar subreddits (those with the highest ratio of treatment members, as per the list generated when defining the subreddits from which to sample control candidates).

We always cap the matching features at the month prior to the treatment user becoming active on the hate subreddit; the same date cutoff is applied to all control candidates under consideration for matching. This follows standard causal methods that avoid unintentional correlations between matching and outcome (Ham and Miratrix 2022).

The matching algorithm uses the following procedure: (1) select a treatment user, (2) check in which month that user became active on the hate subreddit, (3) consider that user’s features and the control candidates’ features in the month prior to the activation month, (4) find the most similar control candidate via Mahalanobis distance matching, (5) store a triplet of (treatment, control, distance), (6) move to the next treatment user.

More than one treatment user can be matched to the same control candidate. Hence, once all users are matched, we check for control candidates with more than one match and keep only the match with the shortest Mahalanobis distance. The de-matched treatment users are put back in the queue to be re-matched, and control users who already have a match are removed from the candidate pool, so that each re-matched treatment user finds its next best available match. This is repeated until each control user has at most one match.
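The matching and de-duplication procedure can be sketched as follows, assuming the features listed above have already been assembled into numeric matrices capped at the month before activation; this is a simplified rendering of the iterative procedure, not the exact implementation.

import numpy as np
from scipy.spatial.distance import cdist

def match_one_to_one(treatment_features, control_features):
    # Greedy one-to-one Mahalanobis matching: each round, every unmatched
    # treatment user claims its nearest still-available control; contested
    # controls keep the closest treatment user, and the others retry.
    pooled = np.vstack([treatment_features, control_features])
    inv_cov = np.linalg.pinv(np.cov(pooled, rowvar=False))
    distances = cdist(treatment_features, control_features,
                      metric="mahalanobis", VI=inv_cov)

    pairs = {}                      # treatment index -> control index
    taken = set()                   # control indices already claimed
    unmatched = list(range(len(treatment_features)))
    while unmatched:
        retry = []
        best = {}                   # control index -> (distance, treatment index)
        for t in unmatched:
            row = distances[t].copy()
            row[list(taken)] = np.inf
            c = int(np.argmin(row))
            if not np.isfinite(row[c]):
                continue            # no available controls remain for this user
            if c not in best or row[c] < best[c][0]:
                if c in best:
                    retry.append(best[c][1])
                best[c] = (row[c], t)
            else:
                retry.append(t)
        for c, (_, t) in best.items():
            pairs[t] = c
            taken.add(c)
        unmatched = retry
    return pairs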

Once complete, the algorithm provides a one-to-one pairing between treatment and control users containing the most similar pairs possible. For each pair, we center its data, defining day 0 as the day on which the treatment user first posted in the hate community (i.e., when treatment began); that same calendar day is also day 0 for the matched control user. In absolute calendar dates, day 0 differs between pairs, as different treatment users became active in the hate community on different dates. This makes our analysis robust to disturbances introduced by any single-day event (Pearce 2016).

3.5 Interrupted time series

Within the regression discontinuity design family of models, the interrupted time series (ITS) approach was conceived for causal-effect modeling of changes to systems over a defined time span, and it has been successfully applied in numerous fields (Ham and Miratrix 2022; Lee and Lemieux 2009; Cattaneo et al. 2020).

3.5.1 Bandwidth optimization

The analysis period in which datapoints are considered is referred to as the bandwidth and is usually symmetric around the date on which the event of interest happened, such that a bandwidth of, e.g., 30 days means an analysis period from -30 to +30 days around the event of interest. In our research, the date of the event of interest (becoming active in a hate subreddit) has been normalized to day 0, as explained in Sect. 3.4.

To determine the optimal bandwidth, we employ the same quantitative method leveraged in prior research (Imbens and Lemieux 2008; Baicker and Svoronos 2019; Ludwig and Miller 2005), whereby leave-one-out cross-validation is used to select amongst a set of candidate bandwidths. We again follow prior research and use root-mean-squared error (RMSE) (Jacob et al. 2012) as the metric to evaluate and choose amongst the bandwidth candidates.

To ensure stability, the leave-one-out cross-validation analysis is run for 100 rounds. This number was chosen by selecting one subreddit at random and checking the recommended bandwidth as the number of rounds was set to 5, 10, 15, ..., 200; beyond roughly 100 cross-validation rounds the results stabilize, i.e., the same bandwidth is consistently recommended. Furthermore, since bandwidth optimization should not interfere with the modeling and estimation of the interrupted time series parameters, when defining the optimal bandwidth we only consider data points prior to the event of interest, i.e., datapoints in the range \((-n, -1]\) (Turner et al. 2021; Lopez et al. 2018; Ewusie et al. 2017).

We illustrate the process of calculating the RMSE associated with a given bandwidth using an example in which the candidate bandwidth is ten days and three cross-validation rounds are used: (A) fit a linear regression on days \([-11, -2]\); (B) use the fitted model to predict day \(-1\); (C) in two separate (initially empty) arrays, append the true value for day \(-1\) and the predicted value for day \(-1\), respectively; (D) repeat (A–C) using days \([-12, -3]\) to predict day \(-2\); (E) repeat (A–C) using days \([-13, -4]\) to predict day \(-3\); we have now reached the desired number of cross-validation rounds for this bandwidth; (F) compare the array of true values with the array of predicted values, calculate the RMSE, and store it as the RMSE associated with a bandwidth of ten days.

This process is then replicated for the entire set of candidate bandwidths; once it is complete, we compare the RMSEs of all candidates and choose the one with the lowest value, i.e., the number of days that best models our data before the event of interest (day 0). In our analysis, we considered bandwidths in the range of 30–365 days in five-day increments. Values below 30 were left out because they provide too few samples for later modeling, and values above 365 days were deemed too susceptible to longitudinal factors that could confound the causal effects being studied. The optimal bandwidth found for each subreddit is shown in Table 1.
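The bandwidth search can be sketched as follows, assuming a per-day series of mean pre-treatment hate rates keyed by day offsets (negative integers before day 0); variable names and the data layout are illustrative.

import numpy as np

def choose_bandwidth(pre_series, candidates=range(30, 366, 5), rounds=100):
    # pre_series: dict mapping day offset (e.g. -1, -2, ...) -> mean hate rate.
    # For each candidate bandwidth, repeatedly fit a line on a sliding window of
    # that width and predict the next (held-out) day, then score by RMSE.
    best_bandwidth, best_rmse = None, np.inf
    for bandwidth in candidates:
        truths, predictions = [], []
        for r in range(rounds):
            target = -1 - r                             # held-out day
            window = range(target - bandwidth, target)  # the bandwidth days before it
            xs = [d for d in window if d in pre_series]
            if target not in pre_series or len(xs) < 2:
                continue
            slope, intercept = np.polyfit(xs, [pre_series[d] for d in xs], 1)
            predictions.append(slope * target + intercept)
            truths.append(pre_series[target])
        if truths:
            errors = np.array(truths) - np.array(predictions)
            rmse = float(np.sqrt(np.mean(errors ** 2)))
            if rmse < best_rmse:
                best_bandwidth, best_rmse = bandwidth, rmse
    return best_bandwidth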

The datapoints considered during ITS modeling are limited to posts made by members of the studied subreddit outside of the community in question, both to control for group-specific behavior and because our aim is to measure how hate spreads to the out-group. This restriction only applies to treatment users: control users have no datapoints inside the studied subreddit by definition, since posting inside the studied subreddit is what qualifies a user as an active member and thus places them in the treatment group.

3.5.2 Model design

ITS is based on ordinary least squares (OLS) regression (Turner et al. 2021; Lopez et al. 2018). Its two base variables are const and time, which measure the baseline hate speech level and the longitudinal trend, respectively. Two dummy variables are added to generate interaction terms for modeling specific subsections of the data. The first of these variables is exposed, which indicates whether a datapoint belongs to the control group (exposed = 0) or to the treatment group (exposed = 1). The second dummy variable is interrupted, which indicates whether a datapoint belongs to the pre- or post-treatment period. Regardless of whether a user is in the treatment or control group, all datapoints from days \((-n, -1]\) are pre-treatment and thus have their interrupted variable set to 0, and all datapoints from days [0, n) are post-treatment and have interrupted set to 1.

Next, the ITS model considers all interaction terms between time, exposed, and interrupted (Lee and Lemieux 2009). Table 2 summarizes all ITS coefficients and their interpretation. Through all possible interaction terms, a single ITS model generates four regression best-fit estimates, covering both treatment and control users in both the pre- and post-treatment periods.

Table 2 Interpretation of coefficients obtained from the ITS model
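A minimal sketch of the model fit is shown below using the statsmodels formula interface; the column names mirror the variables described above, and the exact construction of the design matrix in the original analysis may differ.

import statsmodels.formula.api as smf

def fit_its(data):
    # data: one row per user per day, with columns
    #   hate_rate   - fraction of that day's words that are hate words
    #   time        - day offset relative to day 0
    #   exposed     - 1 for treatment users, 0 for controls
    #   interrupted - 1 for post-treatment days (day >= 0), 0 otherwise
    # The formula expands into the constant, the three main effects, and all
    # of their interaction terms, i.e., the kind of coefficients Table 2 interprets.
    model = smf.ols("hate_rate ~ time * exposed * interrupted", data=data)
    return model.fit()

# Example usage: fitted = fit_its(data); print(fitted.summary())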

3.5.3 Model interpretation

For a given fitted model, it is possible to leverage the obtained parameter coefficients to calculate the relative increase in hate speech when going from pre-treatment to post-treatment (Lopez et al. 2018; Turner et al. 2021). This is done by first obtaining the incremental post-treatment effect size, which for the treatment group is the sum of inter and \(expos \times inter\). Next, we obtain the pre-treatment baseline for the treatment group, calculated as \(const + expos\). We then measure the size of the post-treatment increment in proportion to pre-treatment levels as the ratio of the incremental post-treatment effect to the pre-treatment baseline. This gives us the best estimate of the instantaneous change in hate speech caused by the treatment. Note that this measure does not use the time coefficient and is thus not affected by longitudinal trends; rather, it compares the best estimates of hate speech right before and right after becoming active in the hate subreddit (Turner et al. 2021).
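In code, the relative-increase calculation reads as below; the coefficient keys follow the shorthand used in the text (const, expos, inter, and their interaction) and would need to be mapped to whatever names the fitted design matrix actually produces.

def relative_increase(coefficients):
    # coefficients: mapping from coefficient name to fitted value, using the
    # shorthand names from the text; the key names are illustrative.
    post_treatment_increment = coefficients["inter"] + coefficients["expos_x_inter"]
    pre_treatment_baseline = coefficients["const"] + coefficients["expos"]
    return post_treatment_increment / pre_treatment_baseline

# Example: relative_increase({"const": 0.010, "expos": 0.005,
#                             "inter": 0.000, "expos_x_inter": 0.003})
# -> approximately 0.2, i.e., a 20% instantaneous increase over the baseline.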

When we consider the individual variables in each model, \(expos \times inter\) is the most relevant coefficient, as it captures the instantaneous effect on a user’s hate speech of becoming active in the hate community. inter would be relevant for a similar reason, but it is always statistically indistinguishable from zero: it applies to both treatment and control users, and the latter show no statistically significant changes in hate speech post-treatment, driving the coefficient to zero. If \(expos \times inter > 0\), then a user’s hate speech increases as a result of exposure to the hate group, whereas if \(expos \times inter < 0\), exposure would have reduced hate speech.

3.5.4 Sensitivity analysis

To ensure that our results are robust to variations in bandwidth size, we run a sensitivity analysis in which we fit one ITS model for each possible bandwidth in the range between 30 and 365 days and record that model’s coefficients and their p-values (Lopez et al. 2018; Chatterjee and Hadi 2009). This test shows that, for a fixed subreddit dataset, the ITS coefficients converge to consistent values regardless of bandwidth size. Yet sensitivity analysis also highlights one difficulty at the intersection of bandwidth optimization and ITS modeling: models built with a small bandwidth contain fewer datapoints, which increases variance and leads to higher p-values. P-values then decrease as the bandwidth grows. As previously explained, our category-based pooling of data is intended to counterbalance this effect (Ewusie et al. 2017; Niven et al. 2012). As a reminder, bandwidth optimization and ITS modeling are performed independently.
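The sensitivity analysis amounts to refitting the model at every candidate bandwidth and tracking the key coefficient, as in the sketch below, which reuses the illustrative fit_its routine from Sect. 3.5.2; the coefficient label depends on how the design matrix is named.

def sensitivity_analysis(data, bandwidths=range(30, 366, 5),
                         coefficient="exposed:interrupted"):
    # data: the per-user, per-day dataset with a 'time' column (day offsets).
    # Refit the ITS model once per bandwidth and record the estimate and
    # p-value of the exposed-by-interrupted interaction.
    results = []
    for bandwidth in bandwidths:
        window = data[(data["time"] >= -bandwidth) & (data["time"] < bandwidth)]
        fitted = fit_its(window)
        results.append((bandwidth,
                        fitted.params.get(coefficient),
                        fitted.pvalues.get(coefficient)))
    return results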

4 Results

We first explore how the hate speech expressed by users changes over time. When we inspect the ITS plots, shown in Fig. 1, we notice certain trends that replicate across subreddits and categories. First, and unsurprisingly, users who become active on hate subreddits have a higher baseline of hate speech even in the pre-treatment period. Figure 1 also reveals that treatment users display an upward trend in their hate speech before becoming active in a hate community, as can be observed in the rising orange line (best-fit estimate) for all categories. Despite this trend, ITS’s causal-effect modeling at the break-point (the interruption, i.e., becoming active) still allows us to infer a causal link between becoming active in the hate community and an immediate increase in hate speech outside that community. Nonetheless, this pre-treatment trend is worth considering given that it is consistently observed across all studied subreddits, and we explore it in more detail in the Discussion section.

Fig. 1 Interrupted time series plots for the studied subreddit categories. We plot the rate of hate speech over time (percent of words that are hate words) outside of the subreddit in question for treatment users that join the hate subreddit and control users that never join. The subreddit categories are, in order, (a) alt-right, (b) racist, (c) sexist, and (d) fat-shaming. We consistently observe a pattern of increasing hate speech before users become active in the hate community, followed by a spike when the user becomes active, and then a slow decline in hate speech levels.

Using the approach outlined in Sect. 3.5, we estimate the effect that becoming active in a hate community has on users’ hate speech outside of that particular subreddit. For each category, we aggregated the time series data of all its subreddits. Figure 2 shows the effect for each hate category and demonstrates that the effect size is relatively consistent, but highest for alt-right subreddits. When we disaggregate across categories in Supplementary Fig. S1, we observe similar increases in hate speech, with the largest effect in an alt-right subreddit, r/frenworld. For all subreddits in which the effect size is not statistically distinguishable from zero, we find that the algorithmically derived bandwidth is especially small (milliondollarextreme = 35 days, CoonTown = 35 days, GreatApes = 80 days, MGTOW = 30 days). For all other subreddits, whose algorithmically derived bandwidths are larger, we observe statistical significance in this plot. More importantly, when we look at Supplementary Fig. S2 (aggregating across all subreddits in a given category) or Supplementary Fig. S3 (data disaggregated by subreddit), we see that \(expos \times inter\) reaches statistical significance (p-values below 0.05) at all larger bandwidths. Those plots show that individual model coefficients are consistent across wide variations in bandwidth, indicating that results are not sensitive to any one particular bandwidth. For example, looking at r/frenworld in Fig. S3, we observe that once the bandwidth is large enough to be robust to noise, the coefficients no longer vary; their associated p-values remain below 0.05 for the coefficients with a nonzero estimate, whereas the coefficients with a zero estimate remain at zero and display high p-values.

Fig. 2 Relative increase in the rates of hate speech immediately after users become active in a hate subreddit category, as obtained from the interrupted time series model. All categories observe increases in hate speech, but the magnitude of change varies. Error bars represent 95% confidence intervals.

Wald’s F-test shows that all fitted models are statistically significant (p-value \(<10^{-20}\)). All of our models have p-values > 0.05 for the three coefficients (time, inter, and \(time \times inter\)) that apply to all users (treatment and control). We therefore cannot reject the null hypothesis that the real value of those coefficients is zero. The fact that the coefficients applying to all users are statistically indistinguishable from zero tells us that control users show no statistically significant changes in hate speech over time. This result gives us greater confidence that the studied users are not exposed to significant external effects during the analysis period.

Persistent across all studied subreddits is a gradual reduction in hate speech levels after the initial spike that follows joining a hate subreddit, although hate speech still remains above pre-treatment levels even at the end of the analysis bandwidth. Because Reddit takes an active stance in moderating its platform, Reddit’s banning of hate users could be behind this trend, especially considering that the users most likely to be struck by a ban are those who use the most hate speech and who thus most elevate the treatment group’s hate speech. As these users are banned, the group’s hate speech levels trend down.

We assess this hypothesis by investigating the connection between an account’s hate speech levels and how long that account continued to show overall Reddit activity after becoming active in the hate subreddit. We cannot measure actual bans, but rather the last time an account made a submission or comment on the platform. Table 3 shows the hate speech levels for accounts that lasted less versus more than one year after becoming active in the studied hate subreddit. For both groups, we only consider posts made after users become active in the hate subreddit and within the first 365 days from that date. We see that, for all considered subreddits, the average hate speech of users who disappear within a year is several times greater than that of longer-lasting users. This lends evidence to the “active moderation” hypothesis.

Table 3 Rate of hate speech for hate users who remain for under or over one year
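The comparison in Table 3 can be sketched as below, assuming activation dates, last-observed-post dates, and the per-day hate rates from Sect. 3.2 are available as dictionaries; names and layout are illustrative.

import numpy as np

def hate_rate_by_longevity(users, activation_date, last_post_date, daily_rates):
    # activation_date / last_post_date: dicts mapping user -> datetime.date of
    # the first hate-subreddit post and of the last observed post on Reddit.
    # daily_rates: dict mapping (user, day offset after activation) -> hate rate.
    # Only the first 365 post-activation days are considered for both groups.
    short_lived, long_lived = [], []
    for user in users:
        lifespan_days = (last_post_date[user] - activation_date[user]).days
        rates = [daily_rates[(user, day)] for day in range(365)
                 if (user, day) in daily_rates]
        if not rates:
            continue
        (short_lived if lifespan_days < 365 else long_lived).append(np.mean(rates))
    # Mean hate rate for accounts lasting under vs. over one year.
    return float(np.mean(short_lived)), float(np.mean(long_lived))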

In Fig. 3, we show the percent change in the rate of hate words for users across all subreddits: inside the treatment subreddit, outside the treatment subreddit, inside banned subreddits, and inside non-banned subreddits, for each hate subreddit category. The change in hate speech is most dramatic for the alt-right subreddits, but the change appears longest lasting in the sexist subreddits. Supplementary Fig. S4 shows similar results when disaggregated by subreddit. As in the aggregated data, there are dramatic increases in alt-right subreddits such as r/frenworld and r/honkler, while long-lasting changes appear in sexist subreddits such as r/MGTOW.

Fig. 3 Hate speech over time. The percent of all words (comments and submissions) that are hate words for (a) alt-right subreddits, (b) racist subreddits, (c) sexist subreddits, and (d) fat-shaming subreddits. In all plots, we show hate speech for treatment users in the hate subreddit (blue), treatment users outside the hate subreddit (orange), all posts among treatment users (green), all posts by treatment users in non-banned subreddits (red), and all posts by treatment users in subreddits that will eventually be banned as of January 1, 2023 (brown). Control users are in purple. The red dashed line is when hate users first start posting in hate subreddits. (Color figure online)

It becomes clear that there are limitations to considering a single hate subreddit in a vacuum: before becoming active in the studied subreddit, treatment users were already using the same hate lingo at higher levels inside other subreddits that were eventually banned. For example, looking at the racist subreddits (panel b), treatment users were already using much higher levels of racist language inside other eventually banned subreddits (solid brown line) than baselines such as the control users (solid purple line), or even than the treatment users themselves inside non-banned subreddits (solid red line). Since all studied subreddits were themselves eventually banned, they are also part of the “banned subreddits” category, so it is only natural that this category shows a significant spike when users start posting in the studied subreddit. A similar logic applies to the “all” category, which contains all subreddits, including the studied one. This gives us the perspective that, although treatment users are nastier than control users across the platform as a whole, they let loose their worst when communicating inside hate communities.

For our analysis, we built a community-specific lexicon, as described in Sect. 3.2. A sample of the most popular words from each lexicon is shown in Supplementary Fig. S5, where we consider their usage by hate users both inside and outside of their communities. Supplementary Figure S5 depicts how treatment users modulate their vocabulary when talking to the in-group versus the out-group. In more detail, we can see that words most clearly associated with hate, such as the n-word within r/GreatApes and “obeast” within r/fatpeoplehate, double in frequency when users address the in-group, whereas when communicating with the out-group, users reduce their usage of obscure group slang such as “roastie” (used in r/Incels) and “sheboon” (used within r/CoonTown and r/GreatApes). On the whole, we qualitatively observe that hate users favor less intense hate words when addressing the out-group.

5 Discussion and conclusions

We modeled the impact of joining a hate subreddit on users’ hate speech in the out-group, unveiling a causal connection between the two that is consistent and replicable across communities while also being robust to variation in the analysis period, as shown by our sensitivity analysis. Namely, our causal models consistently show that users who become active in a hate community increase their hate speech outside of that community, implying a viral spreading effect on new members. This causal link is shown across ten subreddits and four categories of hate speech.

We began the paper with a tantalizing question: do users adopt extremist beliefs from exposure to hate subreddits? To answer this, we need to determine whether users are “...exhibiting intrinsic out-group hostility” and “rejecting egalitarian and democratic values” (Marwick et al. 2022). Our results imply that users become more antagonistic after becoming active in hate groups, using more hate speech in non-hate subreddits. Moreover, the posts users write in hate subreddits tend to demonstrate hatred or even violence towards these out-groups. Both findings lead us to conclude that users exhibit greater out-group hostility. Because we do not know what users think, we cannot directly determine whether they reject egalitarian and democratic values more than they did prior to exposure (Marwick et al. 2022), but this is a reasonable if still hypothetical conclusion. Our work therefore provides evidence that users adopt these extremist beliefs simply from exposure to hate subreddits.

Supplementary Figure S5 shows that users exposed to these hate subreddits also tend to use hate words at different frequencies when communicating inside versus outside the hate subreddit. Moreover, in line with prior research (Trujillo et al. 2021), we see an increase in the usage of more cryptic in-group words, such as “foid” (used in the incel communities) or “dindu” (used in racist communities), when communicating with hate group peers, which could be a way of evoking membership through specific language and thus enticing increased responsiveness from other members (Tran and Ostendorf 2016). It is also possible that the relative increase of insider language in the in-group merely reflects the opposite case, in which users tone down such language in the out-group because using group-specific language there might lead to lower engagement, perhaps because the user base at large is not interested in the causes related to the hate words, or simply because they cannot comprehend the in-group slang.

Moreover, when treatment users speak to the out-group, we also observe a reduction in words that we qualitatively perceive as more egregiously offensive (such as using “faggot” instead of the n-word). This could signify that even hate users still feel some of the societal pressure to behave properly or else face harassment. Another explanation is, again, moderation: it might simply be that users who freely use a highly offensive vocabulary are quickly banned from out-group subreddits and soon cease to be part of our data sample of out-group speech.

There are a few caveats to our research, however. First, despite the matching technique obtaining the best pairing possible, users were not identical in the pre-treatment period, as shown by the differing levels of hate speech that pre-date treatment. This stems at least partially from the need to avoid using the target variable of the ITS (hate speech) during matching, so as to prevent contamination of the results (Ham and Miratrix 2022). As explained in the Methods section, our model is robust to these differences in initial hate speech levels and can still assign a causal effect to the instantaneous spike in hate speech observed immediately after treatment. Additionally, hate speech exists on a spectrum, yet lexicon-based approaches require binary categorizations for practical reasons; the construction of hate speech lexicons can therefore have a significant impact on the analyses that follow, since one might, for example, end up ignoring large amounts of milder hate speech.

Next, in Fig. 1, we notice an upward trend amongst treatment users before the treatment period. This indicates that before becoming active in a hate subreddit, they already begin to use more hate speech. This might imply that increased hate pushes them into these subreddits, or that users may already have been exposing themselves to the community’s material as passive readers, often called “lurkers” on Reddit. As the data Reddit makes available only tracks user posts, and not what users see, it is difficult to know the degree to which passive reading plays a role. Once again, despite this upward trend, the causal nature of ITS models at the break-point allows us to attribute the instantaneous spike in hate speech at the moment of becoming active precisely to becoming active. We show the robustness of our findings for individual subreddits in Supplementary Figs. S1 and S6, which display data similar to the figures presented in the main text but broken down by individual subreddit. Those figures are connected in that the bars shown in Fig. S1 are proportional to the delta between the treatment best-fit estimates (orange lines) at day 0 in Fig. S3; for example, r/frenworld has the highest relative hate speech increase estimate in Fig. S1, matching its having the highest delta (discontinuity) between pre- and post-treatment estimates in Fig. S3. We specifically show that the ITS regression models (plotted in Supplementary Fig. S6) have consistent discontinuities around the time users post in hate subreddits. These models find a substantial increase in hate speech, shown in Supplementary Fig. S1.

As noted in the initial analyses, there is a mirror case: the initial post-treatment spike in hate speech subsides over time, although it still remains above pre-treatment levels. Our analysis points to this potentially being caused by accounts with higher hate speech disappearing from the platform earlier than those with lower hate speech; therefore, the long-term effect of joining hate groups may be undercounted due to survivorship bias. Although the root cause remains uncertain, the most plausible explanation seems to be that we are observing the effects of Reddit’s moderation efforts, which ban the most egregious hate users first, thus bringing the group’s average hate speech down, as the plot shows.

In Supplementary Fig. S4, our time series analysis considering various contexts composed of subsets of subreddits, we observe that subreddits in the banned category have persistently higher levels of hate speech. Since a subreddit cannot accumulate hate words after its ban, the observed higher hate speech must precede the ban, although we cannot prove whether those higher hate speech levels cause the ban or whether both are merely correlated due to other underlying factors.