Reddit financial image post sentiment dataset

The dataset presented in this paper consists of sentiment information extracted from image and text data of financial subreddit posts. Members of these subreddits post about their trading behavior, express their opinions, and discuss capital market trends. Their posts contain sentiment information on financial topics as well as signaling information on trading decisions. Frequently, members post screenshots of their portfolios from their mobile broker apps. We collected the posts, processed them to extract sentiment scores using various methods, and anonymized them. The dataset therefore contains neither content from the posts nor information about their authors, only the processed sentiment information within each post. Furthermore, financial tickers mentioned in the posts are tracked, so that the effect of sentiment in the posts can be attributed to financial products and used in the context of financial forecasting. The posts were collected using the Reddit [2] and Pushshift [3] APIs and processed using an Amazon Web Services architecture. A fine-tuned MobileNets artificial neural network [4] was used to classify images into four distinct categories, which had been determined in a preliminary analysis. The categories are classical memes, number posts (e.g. screenshots of mobile broker portfolios), text posts (e.g. screenshots from Twitter) and chart posts (e.g. other financial screenshots, such as charts). The images are classified because the four categories differ so fundamentally that a different extraction method had to be applied to each. OCR methods [5] were used to extract text from images. Custom methods were applied to extract sentiment and other information from the resulting text. The data [1] is available on a 20-minute basis and can be used in many areas, such as financial forecasting and analyzing sentiment dynamics in social media posts.


How the data were acquired: Extraction via API calls from financial subreddits and subsequent feature extraction using a custom-built feature engineering pipeline on an architecture using a cloud computing service provider.
Data format: Mixed (raw and preprocessed)
Description of data collection: Raw social media posts were collected via API calls from reddit.com. The data was processed using a custom-built feature engineering pipeline on an architecture using a cloud computing service provider. We anonymized the data by hashing the user id for each post.
Data source location: The data was collected from the following publicly available subreddits of Reddit

Value of the Data
• The data provides quantitative sentiment extracted from text and images in finance-related social media posts on Reddit. The data can be aggregated on a 20-minute, hourly or daily basis and used in time series analyses.
• The dynamics and changes in sentiment can be analyzed over time and across posts, which is relevant for the field of sentiment analysis.
• The features can be used in the context of Machine Learning prediction. Since financial tickers are often included for posts, one can analyze the influence of changes in sentiment on stocks.
• Generally, the data can be used to supplement conventional datasets in the context of stock price prediction.
• The extensive time series information of posts allows research into the dynamics that drive the popularity of memes and other social media posts, and into the factors that make posts go viral.
• As the data set consists of sentiment extracted from financial subreddit posts, it allows for analyses in the context of behavioral finance with respect to the members of such forums. Further, educators can use the variety of features to demonstrate a range of models and methods in the fields of Machine Learning and Data Mining.
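The aggregation from 20-minute observations to an hourly basis mentioned above can be sketched as follows. This is a minimal illustration using only the Python standard library; the timestamp format and the sample values are assumptions made for the example and are not taken from the dataset.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def aggregate_hourly(rows):
    """Average a sentiment value per hour from 20-minute observations.

    rows: iterable of (timestamp_string, sentiment_value) pairs.
    The timestamp format below is an assumption for illustration.
    """
    buckets = defaultdict(list)
    for ts, value in rows:
        t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        # Truncate to the start of the hour to form the bucket key.
        hour = t.replace(minute=0, second=0)
        buckets[hour].append(value)
    return {hour: mean(values) for hour, values in sorted(buckets.items())}


# Made-up observations on a 20-minute grid.
observations = [
    ("2021-12-01 10:00:00", 0.25),
    ("2021-12-01 10:20:00", 0.75),
    ("2021-12-01 10:40:00", -0.25),
    ("2021-12-01 11:00:00", 0.5),
]
hourly = aggregate_hourly(observations)
```

Replacing the hour truncation with a date truncation would give a daily aggregate in the same way; whether a mean, a sum or another statistic is appropriate depends on the variable being aggregated.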

Objective
The data was generated to allow the investigation of the relationship between sentiment contained in social media posts on Reddit and movements on the financial markets. The dataset covers not only sentiment extracted from textual data but also sentiment extracted from images, which has not been done so far in this context.

Data Description
The data [1] consists of sentiment information extracted from social media posts of financial subreddits. We applied custom-written methods to images and text to extract sentiment values and create the features in the provided data set files. We collected the data to extend the research capacities in the field of sentiment analysis in financial forecasting. In particular, we aim to facilitate research on sentiment extraction from images. With this paper we provide three CSV files, features.csv, comments.csv and meta_time_series.csv, each containing a different part of the data. The datasets start on different dates but end on the same date, and the shared range covers more than 3 months. The features.csv ranges from October 1st, 2021 to February 25th, 2022, the comments.csv ranges from November 12th, 2021 to February 25th, 2022 and the meta_time_series.csv ranges from November 14th, 2021 to February 25th, 2022. The start dates differ because each data set was created by a different feature of the pipeline, and these features were ready to launch at different times. We chose to begin running each feature as early as possible to collect the largest amount of data for each dataset.
The file features.csv contains the static sentiment features that were extracted from the content of the posts. Although the observations in this file have time stamps, the file does not contain time series information itself, since the content of a post usually does not change over time.
This is different for meta_time_series.csv, which stores the meta-information of a post, since variables such as the number of comments change drastically over time. Therefore, meta_time_series.csv contains the time series of the meta-information of each post over the lifetime of the post. To map the static information from features.csv onto the data in meta_time_series.csv, the variable submission_id can be used. For privacy reasons, the submission ids were hashed. This way, they can still be used to join the data in the three files, but not to identify the post's author. Lastly, comments.csv contains features from sentiment extraction methods applied to the comments for each post. The data in comments.csv also contains time series information, since the number of comments below a post might change over time. The methods were applied to all comments below a post and produce an aggregated value. The resulting value can change over time when new comments are added to a post.
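The mapping via submission_id described above can be sketched as follows. This is a minimal illustration with the Python standard library, using small in-memory stand-ins for the two files. The column num_comments, the presence of a timestamp column in meta_time_series.csv and all values are assumptions made for the example; only submission_id is stated in the text as the shared key.

```python
import csv
import io

# Tiny in-memory stand-ins for features.csv and meta_time_series.csv.
# All values are made up; num_comments is a hypothetical column name.
features_csv = io.StringIO(
    "submission_id,normal_sentiment_weighted\n"
    "a1f3,0.6\n"
    "b7c9,-0.2\n"
)
meta_csv = io.StringIO(
    "submission_id,timestamp,num_comments\n"
    "a1f3,2021-12-01 10:00:00,3\n"
    "a1f3,2021-12-01 10:20:00,11\n"
    "b7c9,2021-12-01 10:00:00,1\n"
)


def join_on_submission_id(features_file, meta_file):
    """Attach the static features of a post to each of its time series rows."""
    # One static row per post, indexed by the hashed submission id.
    static = {row["submission_id"]: row for row in csv.DictReader(features_file)}
    joined = []
    for row in csv.DictReader(meta_file):
        feat = static.get(row["submission_id"])
        if feat is None:
            continue  # no static features available for this post
        joined.append({**feat, **row})
    return joined


rows = join_on_submission_id(features_csv, meta_csv)
```

Each time series observation of a post then carries the same static sentiment values, which is consistent with the content of a post usually not changing over its lifetime.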

Experimental Design, Materials and Methods
The approach we followed to create the dataset consists of several steps. First, the raw data was collected from Reddit using API calls on a 20-minute basis. We tracked static variables, such as the ones derived from the content within the post, as well as non-static variables, such as the ones derived from comments, which constantly change. The posts were retrieved using the Reddit API [2] as well as the Pushshift API [3].

Table 1 Features.

[...]: If there are more green than red pixels, the entry is "positive"; if there are more red than green pixels, it is "negative"; else it is "nan".

[...]: Negative percentage values are aggregated.

normal_sentiment_weighted (float): Value for overall positive or negative sentiment in the text of a post, ranging between -1 (most negative) and 1 (most positive). Text in images and titles is evaluated using the sentiment classification model VADER [6]. Additional custom weights are introduced to weigh sentiments based on group-specific keywords which are used by the communities in the considered subreddits.

normal_sentiment_score_positive (float): Value for the degree of positive sentiment, ranging between 0 (no positive) and 1 (most positive). Text in images and titles is evaluated using the sentiment classification model VADER for positive sentiment only. Additional custom weights are introduced to weigh sentiments based on group-specific keywords which are used by the communities in the considered subreddits.

normal_sentiment_score_negative (float): Value for the degree of negative sentiment, ranging between 0 (no negative) and 1 (most negative). Text in images and titles is evaluated using the sentiment classification model VADER for negative sentiment only. Additional custom weights are introduced to weigh sentiments based on group-specific keywords which are used by the communities in the considered subreddits.

social_media_type (categorical): Kind of post, either "twitter", "reddit" or "unknown". If the post is a repost from Twitter, it is classified as "twitter"; if it is originally from Reddit, it is classified as "reddit"; else "unknown".

submission_id (string): Unique identifier for each post (anonymized). Meta-information derived directly from each post via API call.

ticker (list): List of mentioned tickers. Tickers in a post were identified using a keyword list and regular expressions searching for a $ sign in front of strings.

timestamp (string): Time when the post was pulled via API call. Meta-information derived directly from each post via API call.

Subsequently, the posts were processed using a custom feature extraction pipeline running on Amazon Web Services servers. A MobileNets artificial neural network [4] was trained to classify the images contained in posts into four categories, since the images were so inherently different in the structure of their contents that different methods for sentiment extraction needed to be applied. We use several custom functions to create sentiment variables from the image and textual information in the posts as well as the title and meta-information, according to the descriptions in Tables 1-3 above. The output of this pipeline is the set of final features containing different forms of sentiment. The time series sequences contained in the data are rather short, since posts are tracked only for as long as they are relevant and never for longer than 24 hours.

Some of the variables include outliers. One reason is that the author of a post might exaggerate or even post an unrealistically high number (e.g. of realized percentage gains) as a joke. Our methods cannot filter for such scenarios. We chose not to exclude outliers from the dataset; we provide the raw data and leave it to the individual researcher using this dataset to decide how to deal with this issue. The categorical variable flair was excluded from the summary statistics of Table 7, as it has 254 unique values and a breakdown of the distribution of observations over these categories is not reasonable.
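The ticker identification described in Table 1 (a keyword list combined with regular expressions that search for a $ sign in front of strings) could look roughly like the sketch below. It is not the pipeline used to build the dataset: the keyword list, the limit of five letters per symbol and the function name extract_tickers are assumptions made for illustration.

```python
import re

# Hypothetical keyword list; the list used for the dataset is not given here.
KNOWN_TICKERS = {"GME", "AMC", "TSLA"}

# A $ sign followed by one to five letters, e.g. "$GME" (length limit assumed).
CASHTAG = re.compile(r"\$([A-Za-z]{1,5})\b")
# Stand-alone upper-case words that may match the keyword list.
WORD = re.compile(r"\b[A-Z]{1,5}\b")


def extract_tickers(text):
    """Return the tickers mentioned in text, without duplicates, in order."""
    found = []
    # Any symbol written with a leading $ sign is accepted as a ticker.
    for match in CASHTAG.finditer(text):
        found.append(match.group(1).upper())
    # Bare upper-case words count only if they are on the keyword list.
    for word in WORD.findall(text):
        if word in KNOWN_TICKERS:
            found.append(word)
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(found))


tickers = extract_tickers("Bought $pltr and more GME, YOLO")
```

Restricting bare words to a keyword list keeps common upper-case slang such as "YOLO" from being recorded as a ticker, while the $ prefix is treated as a sufficiently explicit signal on its own. Dollar amounts such as "$5000" are not matched, because the pattern requires letters after the $ sign.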

Ethics Statements
The data is based on publicly available social media posts, which have been processed in such a way that they do not contain any personal data or copyrighted material. Furthermore, the data is fully anonymized. Reddit's data redistribution policies were complied with.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data Availability
Reddit financial image post sentiment dataset (Original data) (Mendeley Data)