Comprehensive dataset of user-submitted articles with ideological and extreme bias from Reddit

Our study collects data to understand ideological and extreme bias in text articles shared across online communities, focusing in particular on the language used in subreddits associated with extremism and targeted violence. We first gathered data from two ideologically aligned communities on Reddit, r/Liberal and r/Conservative, using the Pushshift Reddit API to collect the URLs shared within these subreddits. Our aim was to gather news, opinion, and feature articles, resulting in a corpus of 226,010 articles. We also curated a balanced subset of 45,108 articles and annotated 4,000 articles to validate their relevance, supporting the study of language usage within ideological Reddit communities and of ideological bias in media content. Expanding beyond binary ideologies, we introduced a third category, termed "Restricted," which encompasses articles shared in restricted, privatized, quarantined, or banned subreddits characterized by radicalized and extremist ideologies. This expansion yielded a larger dataset of 377,144 articles. Additionally, we included articles from subreddits with unspecified ideologies, creating a holdout set of 922,522 articles. In total, our combined dataset of 1.3 million articles, collected from 55 different subreddits, supports the examination of radicalized communities and discourse analysis of the associated subreddits, enhancing understanding of the language used in articles shared within radicalized Reddit communities and offering insights into extreme bias in media content. In summary, we collected 1.52 million articles to understand ideological and extreme bias, providing a comprehensive dataset for studying language usage in text articles posted to ideological and extreme Reddit communities.

Value of the Data
• Comprehensive Ideological Spectrum Analysis: This dataset enables researchers to perform a nuanced analysis of ideological and extreme bias by providing a broad spectrum of articles from various Reddit subreddits. The inclusion of categories such as Liberal, Conservative, Restricted, and Undefined allows for a detailed examination of how different ideological perspectives and extremisms are expressed in online discourse, facilitating a deeper understanding of the ideological landscape on Reddit.
• Benchmarking and Model Evaluation: The large and diverse dataset serves as a valuable benchmark for evaluating and improving computational models used in natural language processing and machine learning. By offering labeled data from multiple ideological categories, it allows for the development and testing of models designed to detect and classify ideological bias, advancing research in automatic bias detection and content moderation.
• Impact of Restricted Content: The dataset's inclusion of articles from restricted or banned subreddits provides unique insights into the language and rhetoric used in more radicalized or extremist online communities. This aspect of the data is crucial for studying the propagation of extreme ideologies and their potential impacts, offering a window into how extremist content is communicated and received within isolated or covert online spaces.

Data Description
We conceptualize ideological affiliation as a summation of personally held community values [10,12]. With the goal of better understanding discourse and how different communities, each with their own common ideological affiliation, connect through news, our data collection centers on the social media website Reddit. Our approach to understanding ideological bias in the news sets us apart from common data sources such as surveys or crowd-sourced datasets, which typically investigate political ideology and media bias. Some prior work uses platforms such as Amazon Mechanical Turk or conventional surveys like Pew Research [1]. Other studies rely on datasets like media bias datasets [2], the Congressional Tweets Dataset [3], Twitter fake-news influence datasets [4], biased Wikipedia articles [5], and Facebook News outlets [6]. Some surveys do not assess the articles themselves; instead, they gauge the ideological bias of the source. These datasets often include content from diverse users with varying ideologies. Our approach is unique in that we do not separate data collection and labeling.
For the first dataset (Data 1), specifically, we are interested in the r/Liberal and r/Conservative communities, where members actively share and discuss news articles and blog posts. Toward this goal, we collected news articles posted to each subreddit to better understand ideological expression through the shared news articles and, more specifically, if and how ideological bias may be expressed in the news articles. This method effectively captures the values and perspectives of these specific online communities. A summary of the collected data is provided in Table 1. However, this dataset was limited in its ability to capture ideological extremism due to its binary classes, which, in turn, hindered its ability to demonstrate the effectiveness of its approach. To overcome this limitation, we expanded the dataset (Data 2) in two key ways. First, we included text articles from additional subreddits identifying with Liberal and Conservative ideologies. Second, we introduced a new category referred to as the "Restricted" class. This category enables us to explore radicalized and extremist ideologies by incorporating articles from restricted, privatized, quarantined, or banned subreddits. Table 2 provides a summary of this comprehensive data.

Experimental Design, Materials and Methods
To collect the articles contained within the corpus, we gather identification numbers (IDs) and web addresses (URLs) of submissions from the subreddits, starting from the first post date of each subreddit up to August 10, 2021, using the Pushshift Reddit API [7]. Many of the URLs lead to non-text websites like YouTube and Imgur, which are excluded and outside the scope of the resulting corpus. We utilize the Beautiful Soup API [8] to extract text content from the remaining URLs and then perform data processing to eliminate empty responses and duplicate articles from different URLs (duplicate webpage responses are common when the original and short-form URLs are both shared via URL shortening tools like Bitly).
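The first stage of this pipeline, gathering submission URLs and discarding non-text hosts, can be sketched as follows. This is a minimal illustration, not the authors' exact code: the Pushshift endpoint path and the specific domain blocklist are assumptions, and `fetch_submission_urls` is a hypothetical helper.

```python
import json
import urllib.request
from urllib.parse import urlparse

# Domains that host non-text media (video/image) and fall outside the corpus scope.
NON_TEXT_DOMAINS = {"youtube.com", "youtu.be", "imgur.com", "i.imgur.com"}

def is_text_url(url: str) -> bool:
    """Keep only URLs that are unlikely to point at video/image hosts."""
    host = urlparse(url).netloc.lower()
    # Strip a leading "www." so "www.youtube.com" matches "youtube.com".
    host = host[4:] if host.startswith("www.") else host
    return host not in NON_TEXT_DOMAINS

def fetch_submission_urls(subreddit: str, before: int, size: int = 500) -> list:
    """Query the public Pushshift submission endpoint for (id, url) records,
    paginating backwards in time via the `before` epoch timestamp."""
    api = ("https://api.pushshift.io/reddit/search/submission/"
           f"?subreddit={subreddit}&before={before}&size={size}&fields=id,url")
    with urllib.request.urlopen(api) as resp:
        return json.load(resp)["data"]

if __name__ == "__main__":
    # Example: gather r/Liberal submissions posted before 2021-08-10, then
    # keep only URLs that plausibly lead to text articles.
    batch = fetch_submission_urls("Liberal", before=1628553600)
    urls = [s["url"] for s in batch if "url" in s and is_text_url(s["url"])]
```

In practice, collection would loop, feeding the oldest timestamp of each batch back into `before` until the subreddit's first post date is reached.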
Further, we refine the corpus by eliminating non-relevant webpage texts that are unlikely to be news articles or posts, such as "404 error" messages or copyright statements. To achieve this, we apply a simple word-count threshold to identify and exclude documents that are unlikely to be articles. This process ensures that our dataset contains relevant and meaningful articles, enhancing its quality and usefulness.
To establish an appropriate word-count threshold, we first divided the articles into 20 bins based on their word-count distribution. Next, 100 articles from each bin were selected at random for human annotation, labeling each document as a verified article or not. The annotation process was conducted by a graduate student using the open-source annotation tool Doccano [9]. Subsequently, we established a word-count threshold to exclude non-articles.
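The binning-and-sampling step above can be sketched as a small routine. This is an illustrative reading of the procedure (equal-width bins over the word-count range), not the authors' exact implementation; the function and parameter names are hypothetical.

```python
import random

def word_count(text: str) -> int:
    """Whitespace-delimited token count of a document."""
    return len(text.split())

def sample_for_annotation(articles, n_bins=20, per_bin=100, seed=0):
    """Split documents into equal-width word-count bins, then draw up to
    `per_bin` documents uniformly at random from each bin for annotation.

    `articles` is a list of raw text strings; returns {bin_index: [texts]}.
    """
    counts = [word_count(a) for a in articles]
    lo, hi = min(counts), max(counts)
    width = max(1, (hi - lo) // n_bins + 1)  # equal-width bins over the range
    bins = {i: [] for i in range(n_bins)}
    for art, c in zip(articles, counts):
        idx = min((c - lo) // width, n_bins - 1)
        bins[idx].append(art)
    rng = random.Random(seed)  # fixed seed for a reproducible annotation set
    return {i: rng.sample(b, min(per_bin, len(b))) for i, b in bins.items()}
```

The sampled documents would then be exported to Doccano for the binary "verified article or not" labeling task.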
For collecting the first dataset (Data 1), we followed the above-described method and collected 22,554 liberal articles and 203,456 conservative articles. The resulting corpus presented a class imbalance between r/Liberal and r/Conservative spanning the years 2008-2021. For classifier development using a balanced dataset, we retained all 22,554 r/Liberal articles and sampled an equal number from r/Conservative, maintaining the same daily rates. In this data collection, we labeled 4,000 articles as genuine articles or not in order to establish the word-count threshold.
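The balanced-subset construction can be sketched as below. This assumes one reading of "maintaining the same daily rates": for each day, the majority class is downsampled to match the minority class's article count on that day. The function name and per-article `"date"` field are hypothetical.

```python
import random
from collections import defaultdict

def balance_by_day(minority, majority, seed=0):
    """Downsample `majority` so that each day contributes as many articles
    as `minority` did on that day. Articles are dicts with a "date" key
    (e.g. "YYYY-MM-DD")."""
    rng = random.Random(seed)
    pool_by_day = defaultdict(list)
    for art in majority:
        pool_by_day[art["date"]].append(art)
    target_by_day = defaultdict(int)
    for art in minority:
        target_by_day[art["date"]] += 1
    sampled = []
    for day, n in target_by_day.items():
        pool = pool_by_day.get(day, [])
        # If the majority class has fewer posts that day, keep what exists.
        sampled.extend(rng.sample(pool, min(n, len(pool))))
    return sampled
```

Applied to the corpus, `minority` would be the 22,554 r/Liberal articles and `majority` the 203,456 r/Conservative articles, yielding the 45,108-article balanced subset.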
Next, to build the expanded dataset (Data 2), we collect articles from the Liberal, Conservative, and Restricted category subreddits that align with similar beliefs and interests. Further, we also include a separate holdout dataset collected from various subreddits representing a mix of overt, vague, and undefined ideologies, as shown in Table 3. To ensure the inclusion of only authentic text articles, while excluding unrelated webpage content like video descriptions and copyright templates, we introduce a word-count threshold. The threshold was determined by conducting annotations on a subset of 600 articles from the CringeAnarchy subreddit. All CringeAnarchy articles were divided into 12 bins based on word count, and 50 articles were randomly selected from each bin. The annotations were carried out using Doccano [9], an open-source web-based annotation tool. Our objective is to identify a word-count threshold at which 90% of the articles can be classified as "long text." Building on the first dataset collection, we determined that a word limit of 300 is suitable for categorizing articles as long text. However, certain subreddits failed to meet these criteria, leading to the exclusion of the Socialism_101 and far_right subreddits from our data collection.
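The 90% criterion can be made concrete with a small search over candidate cutoffs. This is one plausible interpretation, not the authors' stated algorithm: it picks the smallest word-count cutoff at which at least 90% of the annotated documents above the cutoff were labeled genuine long-text articles. The function name and candidate grid are assumptions.

```python
def find_threshold(annotated, ratio=0.9, candidates=range(0, 1001, 50)):
    """Return the smallest word-count cutoff at which at least `ratio` of
    the annotated documents at or above the cutoff were labeled genuine.

    `annotated` is a list of (word_count, is_article) pairs, with
    is_article a boolean produced by the Doccano annotation pass.
    """
    for t in candidates:
        kept = [label for wc, label in annotated if wc >= t]
        if kept and sum(kept) / len(kept) >= ratio:
            return t
    return None  # no cutoff on the grid satisfies the criterion
```

Under this reading, a cutoff near 300 words is where short debris ("404" pages, copyright templates) drops out while genuine articles remain.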
Subsequently, the remaining corpus is categorized, as presented in Table 3, based on whether the respective subreddits express a clear ideology on the Reddit platform. Articles originating from subreddits with explicitly stated ideologies are categorized into three groups: 72,488 articles in the Liberal class, 79,573 articles in the Conservative class, and 225,083 articles in the Restricted class. Conversely, articles from subreddits lacking a clearly defined ideology, whether their leanings are implicit or only loosely expressed, are merged to form a holdout dataset comprising 922,522 articles. This holdout dataset will serve as a case study.
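The categorization step amounts to a mapping from subreddit to class, with unmapped subreddits routed to the holdout set. A minimal sketch, assuming a dict-based mapping; the subreddit names shown are a tiny illustrative subset of the 55 subreddits, and the helper name is hypothetical.

```python
# Illustrative mapping only; the full study covers 55 subreddits (Table 3).
SUBREDDIT_CLASS = {
    "Liberal": "Liberal",
    "Conservative": "Conservative",
    "CringeAnarchy": "Restricted",
}

def categorize(articles):
    """Split articles (dicts with a "subreddit" key) into a labeled set
    (Liberal/Conservative/Restricted) and an unlabeled holdout set."""
    labeled, holdout = [], []
    for art in articles:
        cls = SUBREDDIT_CLASS.get(art["subreddit"])  # None if undefined ideology
        (labeled if cls else holdout).append({**art, "label": cls})
    return labeled, holdout
```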
To compile text articles, we begin by extracting website links (URLs) from all posts within the targeted subreddits. Utilizing the Pushshift Reddit API, we initiate data retrieval from the inception of each subreddit's posts and extend the collection period until August 2021. Links originating from platforms such as YouTube and Imgur are filtered out, focusing solely on textual articles. Subsequently, employing the Beautiful Soup API, we scrape text content from the retained URLs. Throughout this process, measures are implemented to exclude empty or duplicate articles from our dataset. Furthermore, to ensure the inclusion of genuine text articles while eliminating irrelevant web content like video descriptions or copyright notices, a word-count threshold is applied.

Data source location: Reddit.

Data accessibility: Part 1 contains Data 1 (all) and Data 2 (Raw and Labeled Data - Restricted.json). Part 2 contains Data 2 (Raw and Labeled Data - Liberal.json and Conservative.json) and Data 2 (Raw and Unlabeled Data - first 40 of the 76 .json files). Part 3 contains Data 2 (Raw and Unlabeled Data - remaining 36 of the 76 .json files).

Related research article: Ravi, K., Vela, A. E., & Ewetz, R. (2022, December). Classifying the Ideological Orientation of User-Submitted Texts in Social Media. In 2022 21st IEEE International Conference on Machine Learning and Applications (ICMLA) (pp. 413-418). IEEE.

Table 1
Open-sourced data details for Data 1.

Table 2
Open-sourced data details for Data 2.

Table 3
Open-sourced data details.