1 Introduction

In recent years, it has become increasingly popular among investors to comment on or to share their opinion about companies’ stock market performances and prospects on social media platforms, such as Twitter and StockTwits. While institutional investors have the means to actively monitor stock markets and public news over the trading day, social media platforms constitute an especially valuable channel for retail investors to obtain stock market relevant information (e.g., Chen et al. 2014). The trading activities of the latter, often portrayed as noise traders in the spirit of Kyle (1985) and Black (1986), may in part be influenced by subjective beliefs about future cash flows and investment risks. These subjective beliefs are referred to as investor sentiment in behavioral models along the lines of De Long et al. (1990), which assume two types of investors, namely rational, sentiment-free arbitrageurs and irrational, sentiment-prone noise traders. Based on their erroneous conviction of having unique information about future stock prices, noise traders buy (sell) stocks when feeling bullish (bearish) about a company. In addition, both types of traders face downward-sloping demand curves for risky assets, which leads to an equilibrium in which these random beliefs of noise traders influence prices. More precisely, De Long et al. (1990) predict that a positive sentiment shock leads to an increase in prices and, conversely, a negative sentiment shock to a decrease in prices.

While prior research has disregarded the role of irrational investors, assuming that arbitrageurs would trade against them and keep prices at their fundamental values (Friedman 1953; Fama 1965), behavioral models following De Long et al. (1990) and Shleifer and Vishny (1997) instead suggest that arbitrageurs are likely to be risk-averse, and their willingness to trade against noise traders is limited. The model introduced by De Long et al. (1990), for instance, postulates that arbitrageurs face not only fundamental risks when taking positions against noise traders but also the risk that the beliefs of irrational investors may not revert to their mean for a prolonged period of time. This implies that noise traders can drive stock prices away from their fundamental values, at least over short time periods, given that the willingness of risk-averse arbitrageurs to bet against them is limited.

Thus, classical finance theory, in which the cross-section of expected returns is determined solely by the cross-section of systematic risk in equilibrium, has been augmented by these behavioral aspects. Consistent with this view, retail investors have been shown empirically to trade excessively in attention-grabbing stocks (Barber and Odean 2007) and in concert with other retail investors (Kumar and Lee 2006; Barber et al. 2009), with a significant impact on stock prices.

Following this line of thought and the initial findings of Antweiler and Frank (2004) and Das and Chen (2007), a vast literature has evolved around the question of how to augment and improve forecasts of financial variables, such as stock returns, volatility, and trading volume, with measures of investor sentiment derived from online sources (for a recent survey, see Nardo et al. 2016). For example, Sprenger et al. (2014a) extract good and bad news from Twitter messages related to the S&P 500 and link this news to market movements. Yang et al. (2015) provide further empirical evidence for the existence of a financial community on Twitter and demonstrate that the weighted sentiment of its most influential contributors has significant predictive power for such market movements. Da et al. (2015) use online search queries of sentiment-specific terms to construct a measure of market-wide investor sentiment. Their results are broadly in line with the theories on investor sentiment mentioned above. At the level of individual stocks, Sprenger et al. (2014b) find an association between Twitter sentiment and returns as well as between the volume of Twitter messages and trading volume. Moreover, using stock picks from the CAPS website, Avery et al. (2015) demonstrate that negative stock picks strongly predict future stock price declines. Other findings point towards a relation between message board posts and contemporaneous returns of underperforming small-cap stocks (Leung and Ton 2015). Recently, some studies have investigated the predictive performance of online investor sentiment measures at intraday frequencies. While Behrendt and Schmidt (2018) show that the economic significance of Twitter sentiment in intraday volatility forecasting applications is negligible, Renault (2017) provides some empirical evidence for sentiment-driven noise trading throughout the trading day using investor sentiment estimated from StockTwits messages.

In light of these empirical findings, which involve different online sources and methods to estimate investor sentiment, researchers and practitioners alike still face one crucial question: how can investor sentiment be measured and quantified adequately? As far as textual analysis in finance is concerned, conventional approaches usually involve dictionaries and machine learning techniques (for recent surveys, see Das et al. 2014; Kearney and Liu 2014). The latter are predominantly used when online investor sentiment is estimated from individual messages published on social media platforms, such as Twitter and StockTwits, since dictionaries developed for short messages that also cover financial topics are scarce. By contrast, methods based on dictionaries, such as the Harvard-IV dictionary or the dictionary of Loughran and McDonald (2011), are more often used in the context of textual analysis of traditional news channels. An exception is the pair of dictionaries of Renault (2017), which are tailored to finance-specific short messages on StockTwits. Although dictionaries are usually publicly available and ready to use, this is not the case for most approaches based on machine learning techniques. Lastly, some commercial data vendors offer investor sentiment measures for researchers and practitioners to use. While these commercial measures may increase the reproducibility of findings, they are inherently opaque since the exact way of calculating the respective measure is usually not publicly disclosed.

This paper contributes to the literature in several ways: (i) we estimate daily online investor sentiment from short messages published on Twitter and StockTwits for 360 stocks over a seven-year period from the beginning of 2011 to the end of 2017 using a wide selection of sentiment estimation techniques from the finance literature, (ii) we compare the performance of the different approaches by means of financial applications, and (iii) we rank and explain the performance of the dictionaries as well as the machine learning approaches in order to provide a guideline for both researchers and practitioners on the basis of field-specific applications. To be more precise, we estimate investor sentiment with five publicly available dictionaries, two open-source and pre-trained neural networks, and two simple machine learning models trained by us on labeled StockTwits data. The dictionaries considered in this paper are the Harvard-IV dictionary, the dictionary of Loughran and McDonald (2011) (hereafter, LM), both short message- and finance-specific dictionaries of Renault (2017) (hereafter, L1 and L2), and the VADER dictionary (Hutto and Gilbert 2014), which is a general dictionary optimized for short messages. The machine learning models used to estimate investor sentiment are the naive Bayes classifier, a maximum entropy model, the convolutional neural network Deep-MLSA of Deriu et al. (2017), and the long short-term memory neural network DeepMoji of Felbo et al. (2017). Note that the focus of this paper is on publicly available sentiment estimation techniques. For a further comparison of trainable machine learning approaches, we refer to Renault (2019). While some of the prior research has focused on analyses at lower frequencies and over longer time horizons, we follow more recent literature by considering a daily frequency. Moreover, we make use of the method proposed by Boehmer et al. (2020) for the identification of retail investor trades in the NYSE Trade and Quote (TAQ) database. This allows us to adhere more closely to the above-mentioned theoretical models and to relate the effects of online investor sentiment to order imbalances based on trades conducted by these investors.

Our comparison of the above-mentioned sentiment measures is based on two financial applications that help to study the effect of online investor sentiment on the cross-section of stocks, which is of central importance in both classical and behavioral finance theory (see Baker and Wurgler 2006, 2007, for a discussion): Firstly, we investigate the effect of each sentiment measure on the cross-section of retail investors’ order imbalances within a model framework in the spirit of Fama and MacBeth (1973). This allows us to estimate the direct impact of the sentiment measures on trades initiated by retail investors. Secondly, since asset pricing applications are often of primary interest for researchers and practitioners, we use the sentiment measures in a model-free portfolio sorting exercise and forecast abnormal portfolio returns. Overall, while the performance of the considered sentiment measures varies considerably, we find that the LM dictionary of Loughran and McDonald (2011) and the L2 dictionary of Renault (2017) perform well in terms of their effect on retail investors’ order imbalances and their ability to forecast abnormal portfolio returns. Thus, finance-specific dictionaries perform on par with or even better than state-of-the-art machine learning approaches.

The remainder of the paper is structured as follows: Section 2 describes the different online investor sentiment measures, their calculation, and some instructive descriptive statistics of the data set. The effect of the respective online investor sentiment measures on the cross-section of retail investors’ order imbalances is investigated within a Fama-MacBeth (1973) regression framework in Section 3 and, subsequently, in a model-free portfolio sorting application to forecast abnormal portfolio returns in Section 4. Lastly, Section 5 offers some concluding remarks.

2 Online investor sentiment data

2.1 The raw text data

We consider two sources of online text data that are widely used in the finance literature, namely Twitter (e.g., Sprenger et al. 2014a, b; Yang et al. 2015; Bartov et al. 2018; Audrino et al. 2020; Lehrer et al. 2019; Nofer and Hinz 2015; Rao and Srivastava 2014; Ballinari and Behrendt 2020) and StockTwits (e.g., Audrino et al. 2020; Cookson and Niessner 2020; Giannini et al. 2019; Renault 2017; Guégan and Renault 2020; Mahmoudi et al. 2018; Ballinari and Behrendt 2020). Twitter is a social media network with roughly 126 million active daily users on which people share thoughts, ideas, and opinions in the form of short messages of at most 140 characters. Similarly, StockTwits allows users to share messages of up to 120 characters with the online community, the difference being that it is specifically tailored towards investors and traders. Focusing on the time period between 2011 and 2017, we analyze 360 companies that are constituents of the S&P 500 throughout that period.

Our motivation for focusing on S&P 500 stocks is twofold. Firstly, to accurately compute daily sentiment measures, we want to consider companies mentioned in large amounts of social media messages. For the same reason, Cookson and Niessner (2020) focus on the 100 stocks with the highest posting volume on StockTwits. Secondly, considering the S&P 500 universe makes our analysis conservative in the sense that we rule out the possibility that our results are driven by micro-capitalized stocks.

Messages shared on StockTwits are directly obtained through the StockTwits API, whereas Twitter messages are collected by following the procedure outlined in Hernandez-Suarez et al. (2018). We collect all shared messages from Twitter and StockTwits either mentioning a company’s name or its cashtag (the company’s ticker symbol preceded by the dollar sign, e.g., “$AAPL” for Apple Inc.). For both data sources, we account for changes in a company’s name or ticker. In total, we collect 30,520,617 and 9,890,132 relevant short messages from Twitter and StockTwits, respectively.
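To illustrate the matching step, the following is a minimal sketch of how messages can be screened for a company’s cashtag or name. The helper name and the example keyword list are ours and purely illustrative; they are not part of the paper’s actual data pipeline, which additionally tracks name and ticker changes.

```python
import re

def mentions_company(text: str, ticker: str, names: list) -> bool:
    """Return True if a message mentions the company's cashtag or name.

    `ticker` and `names` are illustrative inputs (e.g., "AAPL" and
    ["Apple"]); handling of name/ticker changes is omitted here.
    """
    # Cashtag match: "$AAPL" as a standalone token, case-insensitive.
    if re.search(rf"(?<!\w)\${re.escape(ticker)}(?!\w)", text, re.IGNORECASE):
        return True
    # Company-name match on word boundaries.
    return any(re.search(rf"\b{re.escape(n)}\b", text, re.IGNORECASE) for n in names)

print(mentions_company("Feeling bullish on $AAPL today!", "AAPL", ["Apple"]))  # True
```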

2.2 Different sentiment estimation techniques

After collecting text data from social media platforms, one faces the challenge of transforming the unstructured text data into a quantitative measure of the latent investor sentiment. The two main approaches used for sentiment analysis in finance are dictionary-based and machine learning-based techniques (Das et al. 2014). Table 1 summarizes the approaches that are considered in this study. For each dictionary and machine learning model, the table reports a selection of previous studies that make use of the respective sentiment estimation technique. The primary focus of this paper is on publicly available dictionaries and pre-trained machine learning models that researchers can directly use. Nevertheless, given their great popularity in the finance literature, we also include the naive Bayes and maximum entropy classifiers in our analysis, both of which are trained on our StockTwits data.

Table 1 Overview of investor sentiment estimation techniques

2.2.1 Dictionary-based approaches

In the finance literature, dictionary-based approaches are the most widely adopted methodologies to gauge the mood and sentiment enclosed in text data. These approaches are based on lists of words associated with a particular sentiment (e.g., positive or negative). One then counts the number of times that words with a particular connotation occur in the analyzed text. In the case of a social media post, for instance, we count the number of positive and negative words used in the message as defined by a specific dictionary. We then categorize the message as optimistic (or, in the context of finance, bullish) if more words with a positive than a negative connotation are identified. The use of dictionaries for sentiment analysis has several advantages: Firstly, the computational cost of counting positive and negative words is usually low. Secondly, the implementation of a dictionary-based approach is relatively simple and transparent. Thirdly, given the computational feasibility and transparency of the approach, results based on dictionaries are relatively straightforward to reproduce. Lastly, most dictionaries are publicly available. Dictionaries commonly used for sentiment analysis range from very broad and general to field-specific lists of words (often called lexicons).

For a long time, the most frequently used dictionary for sentiment analysis in finance was the Harvard-IV dictionary, a general-purpose dictionary developed at Harvard University and used in the General Inquirer software. We refer to Loughran and McDonald (2016) for a more extensive review of studies using this general-purpose dictionary. The Harvard-IV dictionary consists of 2005 negative and 1637 positive words. After applying standard pre-processing methods to the textual data (e.g., tokenizing, transforming words into lower case, and removing stop words), we use the dictionary to capture the tone of social media messages by counting the number of positive and negative words. The sentiment score of a given Twitter or StockTwits short message is then defined as the difference between the share of positive and the share of negative words, yielding a score ranging from −1 (negative) to +1 (positive).
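As a concrete illustration of this scoring rule, the following minimal sketch computes the score for a tokenized message given arbitrary positive and negative word lists. The tiny example lists are placeholders of our own, not actual Harvard-IV entries.

```python
def dictionary_score(tokens, positive, negative):
    """Share of positive words minus share of negative words, in [-1, 1]."""
    if not tokens:
        return 0.0
    n_pos = sum(t in positive for t in tokens)
    n_neg = sum(t in negative for t in tokens)
    return (n_pos - n_neg) / len(tokens)

# Toy example with placeholder word lists (not the real dictionaries).
positive = {"gain", "strong", "beat"}
negative = {"loss", "weak", "miss"}
tokens = "strong quarter big beat despite small loss".split()
print(dictionary_score(tokens, positive, negative))  # (2 - 1) / 7 ≈ 0.14
```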

However, the use of general-purpose dictionaries, such as the Harvard-IV dictionary, for sentiment analysis in finance might produce misleading results (Loughran and McDonald 2016; Renault 2017). Almost three-fourths of the words classified as having a negative tone by the Harvard-IV dictionary do not necessarily have a negative connotation in finance-related sentences (for example, “tax”, “cost”, “capital”, and “liability”). Motivated by this issue, Loughran and McDonald (2011) have developed a dictionary consisting of six different word lists (negative, positive, uncertain, litigious, strong modal, and weak modal). The dictionary, often abbreviated as LM, is constructed from a large sample of Form 10-K filings of US companies over the period from 1994 to 2008. After creating a list of words occurring in at least 5% of the filings, Loughran and McDonald (2011) classify each word based on its most likely connotation in a finance context. In our analysis, we consider only the positive and negative word lists, consisting of 354 and 2,355 words, respectively. The dictionary and Python implementations for textual sentiment analysis are available at the software repository for accounting and finance of the University of Notre Dame. Again, the sentiment of a given social media message is calculated as the difference between the share of positive and negative words occurring in the pre-processed text data.

Since the LM dictionary is constructed from words occurring in 10-K filings, the semantic connotations of typical expressions used on social media platforms are not necessarily captured. Emoticons (e.g., a smiling face), abbreviations (e.g., “LOL” stands for “laughing out loud”), and slang (e.g., “nah” or “meh”), which most likely carry some sentiment connotation, are covered by neither the Harvard-IV nor the LM dictionary. VADER (Valence Aware Dictionary and sEntiment Reasoner), the dictionary and rule-based approach introduced by Hutto and Gilbert (2014), is specifically constructed to capture the sentiment of short and informal text messages, such as those published on social media platforms. In a first step, a dictionary is created by combining word lists from existing general-purpose dictionaries with common expressions occurring in social media messages (e.g., emoticons and abbreviations). The semantic connotation of each of the roughly 7500 words and expressions is obtained by averaging the opinions of ten independent human raters. Contrary to the previous two dictionaries, the VADER word list does not only classify a word as positive or negative but also defines the intensity of a word’s sentiment. In a second step, Hutto and Gilbert (2014) define a rule-based model that increases or decreases the sentiment intensity of a text based on five grammatical and syntactical heuristics (e.g., punctuation, upper case letters). For our analysis, sentiment scores based on this dictionary and rule-based approach are obtained by processing the social media data with the publicly available implementation of VADER.
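For reference, a minimal usage sketch of the publicly available Python implementation (the vaderSentiment package) is shown below; the example message is ours. The compound score aggregates the valence of all matched terms and heuristics into a value between −1 and +1.

```python
# pip install vaderSentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

# Emoticons, slang, capitalization, and punctuation all affect the score.
scores = analyzer.polarity_scores("Earnings beat is HUGE!!! :) to the moon")
print(scores)  # dict with 'neg', 'neu', 'pos', and 'compound' entries
```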

While the methodology introduced by Hutto and Gilbert (2014) accounts for the short and informal structure of textual data obtained from social media platforms, it is still based on a general-purpose dictionary and thus might not classify words with a finance-specific meaning correctly (e.g., “liability” has a negative connotation in the VADER dictionary). In a recent study, Renault (2017) proposes two dictionaries specifically designed to capture the sentiment in finance-related social media short messages. The dictionaries are constructed from messages shared on StockTwits, exploiting a feature of this social media platform introduced in 2012 that allows users to “tag” their short messages as being either bullish (i.e., positive) or bearish (i.e., negative). The first dictionary, hereafter referred to as Renault L1, is constructed by selecting all uni-grams (1 word) and bi-grams (2 subsequent words) appearing at least 75 times in a sample of 750,000 StockTwits messages. The semantic connotation of each term is then defined as the difference between its share of appearances in bullish and bearish messages. The dictionary is refined by only considering the 20% most positive and the 20% most negative terms (in total 8000 items). Due to anomalies identified in the data-driven dictionary (e.g., the word “commodity” has a negative connotation as a result of the decline in commodity prices during the sample period), Renault (2017) proposes a second dictionary, hereafter referred to as Renault L2, constructed by manually classifying the uni-grams and bi-grams as positive, neutral, or negative. The Renault L2 dictionary consists of 543 positive and 768 negative terms. In our analysis, we compute the sentiment of messages from Twitter and StockTwits using both the Renault L1 and L2 dictionaries. The textual data are pre-processed by following the approach outlined in Renault (2017), and the sentiment connotation of a given message is defined as the difference between the share of positive and negative terms.

To sum up, we compare five publicly available dictionaries that are either general-purpose or specific to a particular field (social media platforms, finance-related text). Table 2 illustrates the commonalities and differences among the five considered dictionaries. More precisely, the table reports the number of shared terms between pairs of dictionaries (the diagonal elements show the total number of words in each dictionary). In parentheses below the number of common terms, we report the share of words to which both dictionaries assign the same sentiment connotation. Except for the comparisons of the Renault L1 dictionary with the Harvard-IV and the VADER dictionaries, the share of words with the same sentiment direction between two dictionaries is relatively high, ranging from 93.2 to 99.9%. Table 2 highlights two main differences among the five considered dictionaries: Firstly, the number of common terms between the field-specific and general-purpose dictionaries is low. For example, less than one-fifth of the words in the LM dictionary are also part of the Harvard-IV dictionary. Secondly, among the field-specific dictionaries, the number of shared terms between the LM dictionary and the word lists specially constructed for finance-related short messages is also low.

Table 2 Comparison of shared terms among dictionaries
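A computation of the kind underlying Table 2 can be sketched as follows, assuming each dictionary is represented as a mapping from term to sentiment sign (+1 or −1); the toy word lists are placeholders of ours, not entries of the actual dictionaries.

```python
def overlap_stats(dict_a: dict, dict_b: dict):
    """Number of shared terms and share with the same sentiment sign."""
    shared = dict_a.keys() & dict_b.keys()
    if not shared:
        return 0, float("nan")
    agree = sum(dict_a[w] == dict_b[w] for w in shared)
    return len(shared), agree / len(shared)

# Placeholder dictionaries mapping terms to +1 (positive) or -1 (negative).
lex_a = {"gain": 1, "loss": -1, "liability": -1}
lex_b = {"gain": 1, "loss": -1, "liability": 1, "beat": 1}
print(overlap_stats(lex_a, lex_b))  # (3, 0.666...): 3 shared terms, 2 agree
```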

2.2.2 Machine learning techniques

With increasing computational power, machine learning algorithms have become increasingly popular for sentiment classification. The underlying idea of these techniques is to train a model to predict the sentiment of a text given a set of features (predictors). The main steps in implementing such a methodology are (i) defining the predictors (often referred to as feature engineering), (ii) estimating the relevant parameters (training), and (iii) evaluating the model’s accuracy (testing). Compared to dictionary-based approaches, machine learning techniques have some advantages: Firstly, these models can better capture the complex structure of text data, whereas the dictionaries discussed above rely on the assumption that words (or, at most, bi-grams) in a sentence are independent (i.e., their ordering does not matter). Secondly, instead of requiring a manual selection of words and their connotations, machine learning techniques are more flexible in choosing relevant features. However, there are also some drawbacks: Firstly, the classification accuracy of the model depends highly on the quantity and quality of the training data. This implies that a large amount of pre-classified text data is necessary to train and test a model properly (Renault 2017). Secondly, the predictions made by these models are generally nontransparent and challenging to comprehend. For a comparison of different machine learning classifiers for social media messages about finance, we refer to Renault (2019).

Two of the most popular machine learning approaches used for sentiment classification are the naive Bayes classifier and maximum entropy models (e.g., Cookson and Niessner 2020; Giannini et al. 2019). Recently, researchers have also started to rely more frequently on neural networks. Mahmoudi et al. (2018), among others, train convolutional and recurrent neural networks to classify StockTwits messages. Unfortunately, the number of pre-trained sentiment classification models that are publicly available is quite small, especially when it comes to field-specific models. To the best of our knowledge, there exist no publicly available sentiment classification models trained specifically for finance-related text data.

One of the few publicly available pre-trained machine learning approaches for sentiment classification is the (deep) convolutional neural network proposed by Deriu et al. (2017), hereafter referred to as Deep-MLSA. Several considerations motivate our choice of this model: Firstly, as already mentioned, the authors have made a pre-trained Python implementation of their model publicly available. Secondly, the model has been trained specifically for classifying social media short messages. Furthermore, having won the message polarity classification task “Sentiment Analysis in Twitter” at the 2016 SemEval competition, this technique can be considered one of the best performing sentiment classification approaches for social media short messages currently available. For a detailed description of the model and training procedure, we refer to Deriu et al. (2017).

For the sake of completeness, we also consider three other machine learning approaches used in the finance literature. More precisely, we consider the long short-term memory neural network introduced by Felbo et al. (2017), a naive Bayes classifier, and a maximum entropy model. As mentioned above, the naive Bayes and maximum entropy models have been trained on labelled StockTwits data since there exist no pre-trained models that are publicly available.

The (deep) neural network developed by Felbo et al. (2017), hereafter referred to as DeepMoji, is trained to predict the emoticons associated with a tweet. The authors train a long short-term memory network to predict the probability that one of 64 considered emoticons occurs in a given social media post. For a detailed description of the model and the pre-training, we refer to Felbo et al. (2017). A Python implementation of the pre-trained DeepMoji model is publicly available. However, to use this model for a binary classification task (e.g., bullish and bearish short messages), it is necessary to further train and fine-tune the neural network. We do this by following the approach of Renault (2017) and Mahmoudi et al. (2018), i.e., using the self-reported labels associated with StockTwits messages. To be more precise, the training set contains all 241,591 labeled messages published about the 360 companies considered in this study between June 1, 2013, and August 31, 2014. Due to the larger proportion of messages tagged as “bullish”, we obtain a balanced training set by undersampling the positive messages. We retain 30% of the training data for validating the model. The resulting training and validation data sets are then used to train and fine-tune the DeepMoji model of Felbo et al. (2017).
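The balancing and validation split can be sketched as follows; the file name, DataFrame, and column names are illustrative assumptions of ours, and the actual fine-tuning of DeepMoji is done with the publicly available implementation.

```python
import pandas as pd

# Illustrative DataFrame of labeled StockTwits messages with columns
# "text" and "label" in {"bullish", "bearish"} (hypothetical file name).
msgs = pd.read_csv("stocktwits_labeled.csv")

# Undersample the majority (bullish) class to match the bearish count.
n_bearish = (msgs["label"] == "bearish").sum()
bullish = msgs[msgs["label"] == "bullish"].sample(n=n_bearish, random_state=42)
bearish = msgs[msgs["label"] == "bearish"]
balanced = pd.concat([bullish, bearish]).sample(frac=1, random_state=42)

# Retain 30% of the balanced data for validation.
n_val = int(0.3 * len(balanced))
valid, train = balanced.iloc[:n_val], balanced.iloc[n_val:]
```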

The naive Bayes and maximum entropy classifiers are trained on the same data set of labeled StockTwits messages described in the previous paragraph. We apply standard cleaning procedures to the textual data, i.e., we turn all words into lowercase, remove stop words and punctuation, shorten repeated characters (e.g., “allllll” becomes “all”), apply the WordNet lemmatizer to each token, and replace URLs, user names, company names, cashtags, and numbers with corresponding tags (e.g., “tag_username” or “tag_url”). We adopt a bag-of-words representation for the cleaned text data. Following the results documented in Renault (2019), we consider both uni- and bi-grams. The bag-of-words representation of the messages is stored in a term frequency-inverse document frequency (TF-IDF) document-term matrix. We then train the naive Bayes and maximum entropy classifiers on this matrix of predictors (see Hastie et al. 2009, for a general description of the models).
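A minimal sketch of this training setup follows, using scikit-learn (our choice of library; the paper does not name its implementation), with a multinomial naive Bayes classifier and a multinomial logistic regression as the maximum entropy model. It assumes `train` and `valid` are the labeled splits from the sketch above and that the text has already been cleaned.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# TF-IDF document-term matrix over uni- and bi-grams of the cleaned text.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=5)
X_train = vectorizer.fit_transform(train["text"])
X_valid = vectorizer.transform(valid["text"])

# Naive Bayes classifier.
nb = MultinomialNB().fit(X_train, train["label"])

# Maximum entropy model, i.e., (multinomial) logistic regression.
maxent = LogisticRegression(max_iter=1000).fit(X_train, train["label"])

for name, model in [("naive Bayes", nb), ("maximum entropy", maxent)]:
    acc = accuracy_score(valid["label"], model.predict(X_valid))
    print(f"{name}: validation accuracy = {acc:.3f}")
```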

2.3 Aggregation to a daily investor sentiment measure

After classifying social media short messages as having either a positive or a negative sentiment connotation, one usually needs to aggregate the unevenly spaced sentiment scores to obtain an evenly spaced time series at a lower frequency. In this paper, we focus on the construction of daily sentiment measures. We define a day to start at 16:00 Eastern Time of the previous trading day and end at 16:00 Eastern Time of the current day. In the finance literature, different aggregation schemes have been suggested. Renault (2017, 2019) and Cookson and Niessner (2020), among others, aggregate the sentiment scores of StockTwits short messages to a lower frequency with a simple empirical average. In our case, for company i on day t, denoting the empirical average by \(A_{i,t}\), this amounts to:

$$\begin{aligned} A_{i,t} = \frac{1}{N_{i,t}} \sum _{t_n} S_{i,t_n}, \end{aligned}$$
(1)

where \(N_{i,t}\) refers to the total number of short messages published on a social media platform about company i on day t, and \(S_{i,t_n}\) is the sentiment score at intraday time \(t_n\), with \(n = 1, 2, \ldots , N_{i,t}\), assigned to a short message and ranging from −1 (negative sentiment) to +1 (positive sentiment). By contrast, Antweiler and Frank (2004) propose a so-called bullishness measure, denoted here by \(B_{i,t}\) and defined as:

$$\begin{aligned} B_{i,t} = \log \left( \frac{1 + N_{i,t}^{pos}}{1 + N_{i,t}^{neg}} \right) , \end{aligned}$$
(2)

where \(\log (\cdot )\) stands for the natural logarithm, \(N_{i,t}^{pos}\) is the number of messages classified as being positive, and \(N_{i,t}^{neg}\) the number of messages classified as being negative.

We conduct our analysis for both aggregation schemes but only report the results obtained with the bullishness measure, since the two schemes produce somewhat different results and those based on the bullishness measure appear more plausible. The findings obtained with the average aggregation scheme are available from the authors upon request. The discrepancies most likely stem from the fact that the measure proposed by Antweiler and Frank (2004) also takes into account the volume of messages posted over a given day. To be more precise, the bullish sentiment can be approximated by \(B_{i,t} \approx \log (1+N_{i,t}^{pos} + N_{i,t}^{neg}) \left( N_{i,t}^{pos} - N_{i,t}^{neg} \right) /\left( N_{i,t}^{pos} + N_{i,t}^{neg} \right)\). Consider, for example, a day on which only one message about Apple Inc. is posted and classified as positive, and another day on which 1,000 messages mentioning Apple Inc. are published and all are classified as positive. The average sentiment is the same for both days, i.e., \(A_{i,t} = 1\). However, the bullish sentiment for the first day is \(B_{i,t} = \log (2) \approx 0.69\), and for the second day \(B_{i,t} = \log (1001) \approx 6.91\). As such, the aggregation approach proposed by Antweiler and Frank (2004) considers not only the sentiment but also the intensity of investors’ attention, which has been shown to have a significant impact on future stock returns (see, among others, Barber and Odean 2007; Da et al. 2011). Note that even after aggregating the estimated sentiment of social media short messages to the daily frequency, it is still possible that no messages about a company are shared on Twitter or StockTwits on a given day. For those days, we make the simplifying assumption that investors’ sentiment remains unchanged until the next message is published, i.e., we replace the missing daily bullish sentiment with the most recent observation.
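The bullishness aggregation in Equation (2), including the carry-forward of the most recent observation on days without messages, can be sketched with pandas as follows; the DataFrame layout and the simplified trading calendar are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Illustrative message-level data: one row per classified message, where
# "day" is the 16:00-to-16:00 trading day the message is assigned to.
msgs = pd.DataFrame({
    "company": ["AAPL"] * 4,
    "day": pd.to_datetime(["2017-01-03", "2017-01-03", "2017-01-03", "2017-01-05"]),
    "positive": [True, True, False, True],
})

daily = msgs.groupby(["company", "day"])["positive"].agg(
    n_pos="sum", n_neg=lambda s: (~s).sum()
)
# Bullishness measure of Antweiler and Frank (2004), Equation (2).
daily["B"] = np.log((1 + daily["n_pos"]) / (1 + daily["n_neg"]))

# Reindex to all trading days and carry the last observation forward.
days = pd.date_range("2017-01-03", "2017-01-06", freq="B")  # simplified calendar
bullish = daily["B"].unstack("company").reindex(days).ffill()
print(bullish)
```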

Table 3 Correlations of daily bullish sentiment across sentiment measures

Table 3 reports correlations between the estimated bullish sentiment for all considered sentiment measures and data sources, pooled over companies and days. More precisely, Panels A and B report the correlations between daily bullish sentiment scores obtained with the five dictionaries and the four machine learning models, as estimated from short messages posted on Twitter and StockTwits, respectively. Panel C reports, for each sentiment measure, the correlation between the bullish sentiment obtained from Twitter short messages and the bullish sentiment obtained from StockTwits short messages. Noteworthy is the fact that bullish sentiment obtained with the LM dictionary is most highly correlated with that obtained with the VADER rule-based approach. Moreover, we find that, when using Twitter data, the bullish sentiment obtained with the Deep-MLSA model has a very low correlation with the daily sentiment measures obtained with dictionary-based approaches. By contrast, the correlations between the two dictionaries proposed by Renault (2017), the naive Bayes classifier, the maximum entropy model, and the DeepMoji neural network are relatively high. This is not surprising since all five of these sentiment estimation approaches are constructed or trained on StockTwits data. The results presented in Panel C of Table 3 show that the correlation between sentiment measures obtained from Twitter and StockTwits messages lies between 0.2 and 0.3. The LM, L1, L2, and VADER dictionaries appear to produce the most “consistent” bullish sentiment signal across the two social media platforms.

Table 4 reports summary statistics for daily bullish online investor sentiment derived from Twitter (Panel A) and StockTwits (Panel B), again pooled over companies and days. For each of the nine sentiment estimation approaches, the table reports the mean daily bullish sentiment and its 1, 10, 25, 50, 75, 90, and 99%-quantiles. The table uncovers two essential features of the sentiment scores: Firstly, we note that the average daily bullish sentiment is positive and the median non-negative, regardless of the data source or sentiment estimation technique being considered. The proportion of days with negative daily bullish sentiment is rather low; the VADER dictionary and the Deep-MLSA model, in particular, classify very few short messages as having a negative investor sentiment. Secondly, a considerable number of days have a neutral investor sentiment, i.e., a bullish sentiment of zero, in particular when sentiment is estimated with the LM dictionary or the Deep-MLSA model. This effect is more pronounced when investor sentiment is estimated from short messages published on StockTwits. For instance, when applying the LM dictionary and the Deep-MLSA model to StockTwits messages, 59.9 and 73.2% of the days in our sample have a neutral sentiment, respectively.

Table 4 Summary statistics for daily bullish sentiment

2.4 Filtering tweets and companies

Following the data collection approach described previously (see Section 2.1), all messages that mention a company’s name and/or cashtag are considered for the construction of the daily bullishness score. Since a single message may mention more than one company, it becomes difficult to attribute the negative or positive connotation of such a social media post to a specific company’s stock. Thus, in addition to our baseline data collection approach, we also consider a more conservative selection of short messages and, following, among others, Cookson and Niessner (2020), retain only posts on Twitter and StockTwits that mention a unique cashtag. Table 5 reports the correlations across the resulting sentiment scores. In comparison with Table 3, the correlation coefficients for many of the nine sentiment estimation techniques increase.

Table 5 Correlations of daily bullish sentiment across sentiment measures (unique cashtags)

3 Investor sentiment and retail investors’ order imbalance

Theoretical models of investor sentiment in the context of financial markets assume that there exist two types of investors, namely irrational, sentiment-prone noise traders and rational, sentiment-free arbitrageurs (see, among others, De Long et al. 1990). The former hold random beliefs about future cash flows and dividends, i.e., beliefs that are not necessarily related to fundamental values. Based on their erroneous conviction of having unique information about future stock prices, noise traders buy (sell) stocks when feeling bullish (bearish) about a company. We therefore expect to observe a positive relation between a given measure of investor sentiment and the future short-term order imbalance of retail investors, i.e., the difference between the volume of buy and sell transactions initiated by retail investors. In other words, when the sentiment of messages published on social media platforms is positive, we expect retail investors to initiate more buy transactions than sell transactions.

We follow the approach suggested by Boehmer et al. (2020) and identify all transactions in the TAQ database with exchange code “D” and a price just below (above) a round penny as retail-initiated buy (sell) transactions. Let \(VB_{i,t}\) and \(VS_{i,t}\) denote the buy and sell trading volume of retail investors for stock i on day t, respectively. Retail investors’ order imbalance for stock i on day t is then defined as:

$$\begin{aligned} OI_{i,t} = \frac{VB_{i,t} - VS_{i,t}}{VB_{i,t} + VS_{i,t}}. \end{aligned}$$
(3)
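A sketch of this identification and of Equation (3), assuming a trades DataFrame with columns price, volume, exchange, symbol, and day (our illustrative layout), and using the subpenny cutoffs of Boehmer et al. (2020) (fractions of a penny above 0.6 for retail buys, below 0.4 for retail sells):

```python
import numpy as np
import pandas as pd

def retail_order_imbalance(trades: pd.DataFrame) -> pd.Series:
    """Daily retail order imbalance per stock from TAQ-style trade data."""
    # Subpenny part of the trade price (fraction of a penny).
    frac = (trades["price"] * 100) % 1
    off_exchange = trades["exchange"] == "D"
    buy = off_exchange & (frac > 0.6) & (frac < 1.0)   # just below a round penny
    sell = off_exchange & (frac > 0.0) & (frac < 0.4)  # just above a round penny

    vb = trades.loc[buy].groupby(["symbol", "day"])["volume"].sum()
    vs = trades.loc[sell].groupby(["symbol", "day"])["volume"].sum()
    vb, vs = vb.align(vs, fill_value=0)
    return (vb - vs) / (vb + vs)  # Equation (3)

# The Fisher transform used in Section 3's regressions is np.arctanh(oi).
```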

Following, among others, Loughran and McDonald (2011) and Da et al. (2011), we consider a Fama-MacBeth (1973) cross-sectional regression framework. For each trading day, we regress retail investors’ daily order imbalances on the previous day’s bullish sentiment. Note that the order imbalance of retail investors defined in Equation (3) is bounded between −1 and +1. To avoid imposing parameter restrictions, we consider the Fisher-transformed order imbalance, denoted by \(\widetilde{OI}_{i,t} = 0.5\log (1+OI_{i,t}) - 0.5\log (1-OI_{i,t})\). Moreover, we include several control variables in the cross-sectional regression framework. In the spirit of Fama and French (1993) and Carhart (1997), we control for lagged (log) market capitalization, market-to-book ratio, and returns. In addition, we control for lagged retail investors’ order imbalance, the abnormal news volume, defined as the natural logarithm of the ratio between the news volume and its average over the previous 21 days, and the lagged daily realized volatility. The relevant data are obtained from the Center for Research in Security Prices (CRSP), Compustat, RavenPack News Analytics, and the TAQ database. For each trading day, we run the following cross-sectional regression:

$$\begin{aligned} \widetilde{OI}_{i,t+h} = \alpha _t + \beta _t B_{i,t} + \theta _t' X_{i,t} + \varepsilon _{i,t+h}, \qquad \text {for } h = 1, \dots , 4, \end{aligned}$$
(4)

where \(\widetilde{OI}_{i,t+h}\) is the Fisher-transformed retail investor order imbalance of company i on trading day \(t+h\), \(B_{i,t}\) is the daily bullish sentiment measure, and \(X_{i,t}\) is the vector of the above-mentioned control variables. All covariates are standardized such that their coefficients can be interpreted as the effect of a one standard deviation change in the respective variable. The daily regression coefficients are then averaged over time, and Newey-West (1987) standard errors are used to construct t-statistics. Following Da et al. (2011), we include only the first lag of bullish sentiment in the cross-sectional regression. However, in the empirical analysis below, we also vary the forecasting horizon h. By doing so, we implicitly account for sentiment effects at longer lags.
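A compact sketch of the Fama-MacBeth procedure for Equation (4), using statsmodels (our choice of library) and assuming a long-format DataFrame with one row per stock-day; all column names in the example call are placeholders:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def fama_macbeth(panel: pd.DataFrame, y: str, xvars: list, nw_lags: int = 5):
    """Daily cross-sectional OLS; Newey-West t-stats on the mean coefficients."""
    coefs = []
    for day, cs in panel.groupby("day"):
        X = sm.add_constant(cs[xvars])
        coefs.append(sm.OLS(cs[y], X, missing="drop").fit().params)
    coefs = pd.DataFrame(coefs)

    # Time-series mean of each daily coefficient with HAC (Newey-West)
    # standard errors: regress the coefficient series on a constant.
    stats = {}
    for col in coefs.columns:
        fit = sm.OLS(coefs[col].values, np.ones(len(coefs))).fit(
            cov_type="HAC", cov_kwds={"maxlags": nw_lags})
        stats[col] = (fit.params.item(), fit.tvalues.item())
    return pd.DataFrame(stats, index=["mean", "t-stat"]).T

# Example call (column names are placeholders):
# fama_macbeth(df, y="oi_fisher_lead1", xvars=["bullish", "size", "mb", "ret"])
```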

The regression results are reported in Table 6. Panels A and B report the average daily cross-sectional regression coefficients obtained for the nine different sentiment measures using Twitter and StockTwits data, respectively. The table reports the average of the estimated regression coefficients for four different forecasting horizons. As mentioned previously, the results presented in theoretical and empirical studies suggest that an increase in retail investors’ sentiment, i.e., noise traders feeling more optimistic about a company, has a positive effect on their order imbalance, at least in the short term (see, among others, De Long et al. 1990; Tetlock 2007; Chen et al. 2014).

Table 6 Fama-MacBeth (1973) regression coefficients for daily bullish sentiment

To compare the different sentiment measures, we first investigate whether bullish sentiment has a (significantly) positive effect on the 1-day-ahead order imbalance of retail investors and, subsequently, compare the magnitude of this relation. From the first column in Table 6, we observe that, except for the bullish sentiment measure obtained from StockTwits messages and estimated with Deep-MLSA, all regression coefficients are indeed positive. More interesting are the differing magnitudes of the coefficients. For both Twitter and StockTwits data, we obtain the smallest regression coefficients for the sentiment measure based on Deep-MLSA. For Twitter data, the largest impact is observed for the naive Bayes classifier, the L2 dictionary, the LM dictionary, and the maximum entropy classifier. Concerning the StockTwits data, we observe the largest impact for the LM dictionary, followed by the L2 dictionary.

To assess whether these discrepancies are statistically significant, we report t-statistics for the pairwise differences between the sentiment coefficients obtained from the nine estimation methods in Table 7. Panels A and B report t-statistics for the difference between the coefficient obtained with the estimation method reported in the rows and that obtained with the method reported in the columns, for messages shared on Twitter and StockTwits, respectively. More precisely, for each data source and each pair of sentiment estimation approaches, we construct a time series of differences in the cross-sectional estimates of the sentiment coefficients and compute t-statistics to test whether the average difference equals zero (using Newey-West (1987) standard errors). Differences that are statistically significant at the 5% level are highlighted in boldface.

Table 7 Differences in Fama-MacBeth (1973) regression coefficients for daily bullish sentiment

Particularly notable are the results for the L2 and LM dictionaries. For messages published on Twitter, we observe that when estimating sentiment with the L2 dictionary the impact of bullish sentiment on \(\widetilde{OI}_{i,t+1}\) is statistically larger compared to the Harvard-IV dictionary, the L1 dictionary, VADER, and the two neural networks Deep-MLSA and DeepMoji. Similarly, the impact of Twitter bullish sentiment estimated with the LM dictionary on \(\widetilde{OI}_{i,t+1}\) is statistically larger than that of bullish sentiment estimated with the Harvard-IV dictionary, VADER, or Deep-MLSA. For StockTwits messages, bullish sentiment estimated with the L2 dictionary has a significantly larger effect on \(\widetilde{OI}_{i,t+1}\) compared to the maximum entropy classifier and Deep-MLSA. The effect of StockTwits bullish sentiment estimated with the LM dictionary on \(\widetilde{OI}_{i,t+1}\) is significantly larger than for all other sentiment measures.

In columns 2 through 4 of Table 6, we report the average cross-sectional regression coefficients of daily bullish sentiment for longer horizons. For the sentiment estimation techniques that perform well at the 1-day horizon, the relation remains positive and statistically significant also at longer horizons. In general, however, we observe that the positive relation between bullish sentiment and future order imbalances decreases in magnitude. This result suggests that the two social media platforms considered in this paper are particularly well suited to capture the short-term sentiment of retail investors.

The regression results obtained when estimating retail investors’ sentiment using only social media messages that mention a unique cashtag are reported in Table 8. The corresponding t-statistics for the pairwise differences between the sentiment coefficients obtained from the nine estimation methods are reported in Table 9. For measures estimated with Twitter data, the effect of bullish sentiment on future retail investors’ order imbalances is smaller compared to the regression results reported in Table 6. Sentiment measures estimated with StockTwits data, by contrast, become more informative for future \(\widetilde{OI}_{i,t+1}\) when filtering the data. The reason for these changes might be that the use of cashtags to identify a company when sharing a message is more common on StockTwits than on Twitter. When removing all messages that do not mention a unique cashtag, the number of messages in our sample is reduced by 62% for Twitter and 36% for StockTwits. As such, our filtering approach might remove messages shared on Twitter that contain valuable information about investors’ sentiment, even though they do not mention a company’s cashtag. Nevertheless, the results reported in Tables 8 and 9 confirm our previous finding: the L2 and LM dictionaries are overall associated with the largest impact on future order imbalances of retail investors.

Table 8 Fama-MacBeth (1973) regression coefficients for daily bullish sentiment (unique cashtags)
Table 9 Differences in Fama-MacBeth (1973) regression coefficients for daily bullish sentiment (unique cashtags)

The findings reported in Tables 6 through 9 show that dictionaries tailored specifically towards financial topics, such as the L2 and LM dictionaries, are able to capture investor sentiment quite well, and in some cases even better than machine learning approaches. The results presented thus far focus on the predictive power of online investor sentiment for retail investors’ order imbalances. However, academics and practitioners alike are usually more interested in asset pricing implications. We therefore address the effect of the nine sentiment estimation approaches on stock returns in the next section.

4 Model-free forecasts of annualized abnormal portfolio returns

As discussed in the introduction, early research disregarded the role of irrational investors, assuming that arbitrageurs would trade against them and keep prices at their fundamental values (Friedman 1953; Fama 1965), whereas more recent theoretical models and empirical findings suggest that arbitrageurs are likely to be risk-averse and limited in their willingness to trade against noise traders (De Long et al. 1990; Shleifer and Vishny 1997). Since arbitrageurs face not only fundamental risks but also the risk that the beliefs of irrational investors may not revert to their mean for a prolonged period of time, noise traders can drive stock prices away from their fundamental values, at least over short time periods. Following these theoretical postulations and corresponding empirical findings (e.g., Tetlock 2007; Baker and Wurgler 2006, 2007; Barber et al. 2009), we expect to observe a positive relation between a given measure of investor sentiment and future short-term returns.

Thus, we now investigate the ability of the different investor sentiment measures estimated from short messages published on Twitter and StockTwits to forecast annualized abnormal portfolio returns in a model-free setup (for a similar exercise focusing on online search intensity and a weekly trading pattern, see Joseph et al. 2011). To this end, denote by \(q_{0.10}\) and \(q_{0.90}\) the 10%- and 90%-quantiles of the empirical distribution of the respective investor sentiment measure across stocks on a given trading day. On each trading day, we form two equal-weighted portfolios of stocks based on the bullish sentiment of the previous trading day for each of the considered sentiment measures. The first portfolio (Short) contains the stocks for which the estimated online investor sentiment on the previous trading day is \(\le q_{0.10}\). Conversely, the second portfolio (Long) contains the stocks for which the estimated online investor sentiment on the previous trading day is \(\ge q_{0.90}\). A long-short raw portfolio return (Long – Short) is obtained as the difference between these two raw portfolio returns. The stocks are held in the portfolio for one trading day and are then re-sorted on the following trading day. Thus, we implement a daily sorting exercise of zero-cost portfolios.

Based on the assumption that a positive sentiment shock leads to an increase in returns and, conversely, a negative sentiment shock to a decrease in returns, the long-short portfolio should yield a positive return across the considered sentiment measures. Our choice of cutoffs is guided by the aim to include only stocks exhibiting a rather extreme positive or negative sentiment on the previous trading day. Given the classification issues of some sentiment measures, as elaborated upon in Section 2, the two portfolios often contain more than 36 stocks per trading day. In terms of robustness, the results remain qualitatively unchanged if we instead use, for example, the first and last quintiles as cutoffs. Since we want to investigate the forecasting performance of different investor sentiment measures in such a hypothetical and model-free portfolio trading application, transaction costs are ignored.
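The daily sorting can be sketched as follows, assuming a long-format DataFrame of daily returns and lagged sentiment; the function and column names are illustrative, not the paper’s exact data structure.

```python
import pandas as pd

def long_short_returns(df: pd.DataFrame) -> pd.Series:
    """Daily long-short return from sorting on the previous day's sentiment.

    `df` has one row per stock-day with columns "day", "sentiment_lag1"
    (bullish sentiment of the previous trading day), and "ret" (return
    on day t). Illustrative layout.
    """
    def one_day(cs: pd.DataFrame) -> float:
        lo = cs["sentiment_lag1"].quantile(0.10)
        hi = cs["sentiment_lag1"].quantile(0.90)
        short = cs.loc[cs["sentiment_lag1"] <= lo, "ret"].mean()  # equal-weighted
        long = cs.loc[cs["sentiment_lag1"] >= hi, "ret"].mean()
        return long - short

    return df.groupby("day").apply(one_day)
```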

Abnormal, or risk-adjusted, portfolio returns are then obtained as follows: For each portfolio, we run a regression of daily excess returns on the three factors of Fama and French (1993) and the momentum factor of Carhart (1997), which have been found to explain cross-sectional differences in stock returns empirically. Thus, in each case, the regression is given by:

$$\begin{aligned} R_{p,t} - R_{f,t} = \alpha + \beta _m (R_{m,t} - R_{f,t}) + \beta _s \text {SMB}_{t} + \beta _h \text {HML}_{t} + \beta _{mom} \text {MOM}_{t} + \varepsilon _t, \end{aligned}$$
(5)

where \(R_{p,t}\) is the portfolio return on trading day t, \(R_{f,t}\) is the risk-free rate, \((R_{m,t} - R_{f,t})\) denotes the excess return on the market, \(\text {SMB}_{t}\) is the return difference between portfolios of “small” and “big” stocks, \(\text {HML}_{t}\) refers to the return difference between portfolios consisting of “high” and “low” stocks as categorized by the book-to-market ratio, and \(\text {MOM}_{t}\) denotes the momentum factor of Carhart (1997). Data on both the three factors of Fama and French (1993) and the momentum factor of Carhart (1997) are obtained from Kenneth French’s website. Accordingly, the daily abnormal return is given by \(\alpha\). As mentioned above, we report the implied annualized return for both raw and abnormal returns, the latter calculated as \((1 + \alpha )^{252}-1\), which denotes the total return from holding the portfolio for one year. Statistical inference is based on Newey-West (1987) standard errors, and statistical significance at the 5% level is indicated by boldfaced numbers. Results for bullish Twitter and StockTwits sentiment are shown in Table 10.
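A sketch of the risk adjustment in Equation (5) with statsmodels (our choice of library), assuming the daily portfolio returns and factor data have already been merged into one DataFrame; the column names are illustrative.

```python
import statsmodels.api as sm

def annualized_alpha(df, nw_lags: int = 5):
    """Carhart four-factor alpha with a Newey-West t-stat, annualized.

    `df` holds daily observations with columns "ret_p" (portfolio return),
    "rf", "mkt_rf", "smb", "hml", and "mom" (illustrative names).
    """
    y = df["ret_p"] - df["rf"]
    X = sm.add_constant(df[["mkt_rf", "smb", "hml", "mom"]])
    fit = sm.OLS(y, X).fit(cov_type="HAC", cov_kwds={"maxlags": nw_lags})

    alpha = fit.params["const"]           # daily abnormal return
    annualized = (1 + alpha) ** 252 - 1   # implied annual abnormal return
    return annualized, fit.tvalues["const"]
```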

Table 10 Annualized portfolio returns based on daily bullish sentiment

There are a few interesting findings: Firstly, looking at Panel A and the investor sentiment measures obtained from short messages published on Twitter, only the raw portfolio returns of the Short and Long portfolios are statistically significant at the 5% level. While not statistically significant, the long-short portfolio returns based on portfolios sorted according to investor sentiment estimated with Harvard-IV, naive Bayes, and maximum entropy are even negative. Secondly, looking at Panel B and the investor sentiment measures obtained from short messages published on StockTwits, we find statistically significant raw and risk-adjusted returns only for portfolios sorted based on the L2 dictionary and Deep-MLSA neural network. Interestingly, the long-short portfolio returns based on L2 are slightly larger than for Deep-MLSA. Again, the long-short portfolio returns based on portfolios sorted according to investor sentiment estimated with Harvard-IV are negative, albeit not statistically significant.

To assess whether the differences in raw and risk-adjusted annualized portfolio returns are statistically significant, we report t-statistics for the pairwise differences between long-short raw and risk-adjusted returns in Tables 11 and 12, respectively. The reported t-statistics are calculated for the difference between the long-short return obtained with the estimation method reported in the rows and that obtained with the method reported in the columns. More precisely, for each data source and each pair of sentiment estimation techniques, we construct a time series of differences in the returns and compute t-statistics to test whether the average difference equals zero (using Newey-West (1987) standard errors). Panels A and B report the t-statistics for differences in returns using Twitter and StockTwits data, respectively. Differences that are statistically significant at the 5% level are highlighted in boldface.

We focus in particular on the risk-adjusted returns from portfolio sortings based on the empirical distribution of investor sentiment estimated from StockTwits data, since these are statistically significant in some relevant cases. Most notably, while the pairwise differences in risk-adjusted returns are statistically significant for Harvard-IV versus L2 as well as for Harvard-IV versus Deep-MLSA, the pairwise difference between L2 and Deep-MLSA is not statistically significant. Thus, the L2 dictionary and the Deep-MLSA neural network seem to perform very similarly in terms of their ability to predict annualized abnormal portfolio returns. This is a striking finding since it shows that a dictionary that is tailored well towards a specific kind of content can at least compete with state-of-the-art machine learning based approaches. Thus, for practical applications in general, it might be worthwhile to consider building a dedicated dictionary for a specific type of textual data instead of building a highly complex model that has to be trained on large amounts of labeled data before being of use.

Table 11 Differences in raw annualized returns of long-short portfolios
Table 12 Differences in risk-adjusted annualized returns of long-short portfolios

As a robustness check, we consider again the subsample of short messages published on Twitter and StockTwits that are identified by a unique cashtag. Results for bullish Twitter and StockTwits sentiment are shown in Table 13, and t-statistics for the pairwise differences between long-short raw and risk-adjusted returns are reported in Tables 14 and 15, respectively. Although more of the long-short raw and risk-adjusted returns are statistically significant in this case, the above findings do not change qualitatively. Interestingly, the portfolio returns based on a sorting according to investor sentiment estimated with DeepMoji are now statistically significant and very close to those based on Deep-MLSA. Overall, when predicting abnormal returns, considering only short messages that can be identified with a unique cashtag seems to reduce noise and to improve performance noticeably. Therefore, if return prediction is the goal, filtering short messages for unique cashtags should be considered.

Table 13 Annualized portfolio returns based on daily bullish sentiment (unique cashtags)
Table 14 Differences in raw annualized returns of long-short portfolios (unique cashtags)
Table 15 Differences in risk-adjusted annualized returns of long-short portfolios (unique cashtags)

5 Conclusion

We have taken a pragmatic approach to answering the question of how best to gauge investor behavior by means of different online investor sentiment measures. Given the increasing number of publicly available dictionaries and implemented machine learning techniques that researchers and practitioners can use, our comparison of sentiment measures is mostly restricted to such publicly available approaches. The empirical analysis is based mainly on two financial applications that reveal the effects of the online investor sentiment measures on the cross-section of stocks, both in terms of retail investors’ order imbalances and forecasts of portfolio returns.

The performance of the considered sentiment measures varies considerably. We find the LM and L2 dictionaries to perform best throughout both applications. This finding is especially striking since the dictionary of Loughran and McDonald (2011) is not optimized for short messages published on social media platforms. These results demonstrate not only that publicly available dictionaries constitute a methodology that ensures reproducibility of results but also that finance-specific dictionaries are at least on par with, or even superior to, publicly available neural network techniques, such as Deep-MLSA and DeepMoji, in financial applications. Thus, for future research, we strongly advocate the development of new and the refinement of existing dictionaries that not only cover the specifics of financial terminology but are also optimized for short messages published on social media platforms. The dictionaries of Renault (2017) may be taken as good examples. On a different note, publicly available machine learning techniques are still scarce. Our understanding of sentiment-driven investor behavior would benefit from researchers making their approaches available to others. Lastly, as our analyses demonstrate, empirical results involving online investor sentiment should always be scrutinized and compared with other approaches to the estimation of investor sentiment from online sources to avoid misleading conclusions.