Stock trend prediction using sentiment analysis

These days, the vast amount of data generated on the Internet is a new treasure trove for investors. They can utilize text mining and sentiment analysis techniques to reflect investors’ confidence in specific stocks in order to make the most accurate decision. Most previous research just sums up the text sentiment score on each natural day and uses such aggregated score to predict various stock trends. However, the natural day aggregated score may not be useful in predicting different stock trends. Therefore, in this research, we designed two different time divisions: 0:00t∼0:00t+1 and 9:30t∼9:30t+1 to study how tweets and news from the different periods can predict the next-day stock trend. 260,000 tweets and 6,000 news from Service stocks (Amazon, Netflix) and Technology stocks (Apple, Microsoft) were selected to conduct the research. The experimental result shows that opening hours division (9:30t∼9:30t+1) outperformed natural hours division (0:00t∼0:00t+1).


INTRODUCTION
For decades, stock trend prediction has been a popular topic due to its importance in the economy and risk management (Ou & Penman, 1989;Dimson, 1979;Checkley, Higón & Alles, 2017;Ranaldi, Gerardi & Fallucchi, 2022). However, its inborn complexity and uncertainty decide its difficulty (Agrawal, Chourasia & Mittra, 2013;Holthausen & Larcker, 1992). For example, politics, wars, and many factors would sharply affect stock prices (Agrawal, Chourasia & Mittra, 2013;Alzazah & Cheng, 2020). Thus, achieving the best result with the minimum required data is the goal (Agrawal, Chourasia & Mittra, 2013). In the past, most research employed many financial or technical factors, which aimed to reflect investors' interest and predict stock from a financial perspective (Weng, Ahmed & Megahed, 2017;Bujari, Furini & Laina, 2017), while a few years ago, with the rapid expansion of social media, people could easily post and spread their emotions through micro-blogging (e.g., Twitter and Reddit) (Checkley, Higón & Alles, 2017;Lassen & Brown, 2011). Therefore, researchers attempted to utilize micro-blogging data to directly attain investors' moods regarding the stock market, as financial decisions are significantly driven by emotion and mood (Bujari, Furini & Laina, 2017;Nofsinger, 2005). This provides the theoretical foundation for connecting social media sentiments and stock fluctuations.
Twitter, one of the micro-blogging platforms, is a significant data source due to its popularity, transparency, and timeliness (Checkley, Higón & Alles, 2017). Twitter provides a cashtag symbol ($) search to obtain relevant stock Twitter messages (tweets) (Mudinas, Zhang & Levene, 2019). For example, $AAPL is the stock topic only for Apple Inc. Therefore, by harvesting tweets with $AAPL, we can build one dataset containing all AAPL stock discussions.
In addition, news is also considered significant in stock prediction and has been widely used for many years (Kabbani & Usta, 2022). An expert's authoritative opinions, the description of one company's business and many other factors hidden in the news can motivate investors to buy or sell stocks (García-Méndez, De Arriba Pérez & González-Castaño, 2022;Hao & Chen-Burger, 2022;Li & Pan, 2022;Sharma & Bhalla, 2022;Srivastava, Tiwari & Gupta, 2022). Among various financial media, Bloomberg, Forbes, and Reuters are some valuable financial news sources (García-Méndez, De Arriba Pérez & González-Castaño, 2022;Li & Pan, 2022;Srivastava, Tiwari & Gupta, 2022). In our research, we will use news attained from eight popular and reputable websites or medias to conduct the stock trend prediction.
Natural language processing (NLP) is one computational technique to analyze and understand human language (Cambria & White, 2014;Hirschberg & Manning, 2015). Sentiment analysis, a branch of the NLP, identifies and extracts people's subjective attitudes, opinions, and emotions (Khedr & Yaseen, 2017;Hussein, 2018;Chalothom & Ellman, 2015). Sentiment analysis has been widely used in checking online comments, and the main goal is to examine sentiment scores (Hussein, 2018;Chalothom & Ellman, 2015). This research will adopt sentiment analysis technology as the primary text analysis method.
In this article, we harvest 2021 tweets and news mentioned Service stock AMZN, NFLX, and Technology Stock AAPL, MSFT. We combine VADER sentiment analysis along with other extracted tweets features to generate a novel weighted sentiment index T weighted for each tweet. We utilize FinBERT to analyze news titles and generate N weighted to represent news opinion value. In addition, we also design two time divisions: 0:00 t ∼0:00 t +1 and 9:30 t ∼9:30 t +1 to explore how tweets and news from different periods can predict the future stock trend. Goal I: Open t +1 -Open t and Goal II: Open t +1 -Close t (t is given transaction date) are created to evaluate their performance separately. Finally, six classifier algorithms (KNN, Tree, SVM, Random Forest, Naïve Bayes, Logistic Regression) are applied to check the result with 10-fold cross-validation.

Related work
In the work of Bujari, Furini & Laina (2017), the authors only used tweets features and achieved 82% accuracy with an ad-hoc model. However, no unified model fits all cases, and the prediction accuracy varies greatly for different stocks with the same model. The limitation of this research is the considerably short period, only 70 days (including non-transaction days) of data were collected. In Checkley, Higón & Alles (2017), the dataset goes from the 17th February 2012 to 17th October 2014. The authors proposed a model to predict market volatility, volume, and returns (direction) by using data granular to two-minute intervals. The evidence showed a causal link between all three target and bull-bearish sentiment metrics. Among the five stocks investigated, market volatility and volume are more predictable than market returns. A similar approach using more granular data also appeared in Kinyua et al. (2021), and they analyzed the impact of President Trump's tweets on SPX and INDU in a 30-minute event window (15 min before tweet posted and 15 min after tweet posted). To focus on the immediate influence of Trump's tweets, only tweets in transaction hours were kept in this research. The model used random forest, decision tree, and logistic regression algorithm to predict how market response to Trump's tweets. Machine learning algorithms showed that the inclusion of Trump's tweets significantly decreased prediction RMSE. Both Trump's strong negative and positive sentiments resulted in an uptrend of SPX and INDU indices. While, for all other sentiment categories, Trump's tweets caused a downtrend.
In addition, Maqsood et al. (2020) built a stock trend prediction model based on tweets related to local or global events. The significant events from 2012 to 2016 in USA, Hong Kong, Turkey, and Pakistan markets were investigated. For example, in USA market, the author examined how tweets mentioned the US election 2012 (local event) and Brexit 2016 (global event) affected Apple and Google stock fluctuations. The percentage of positive and negative tweets was calculated per day for each event. The result revealed that not all major events severely impact the stock market. However, more important events like the US election can strongly affect algorithm performance.
Many researchers have tried to make predictions by using data other than tweets. Kabbani & Usta (2022) used financial articles from 2016-01-01 to 2020-04-01 to predict the stock trend inside a given day. Considering the rapid change of sentiments, only today and tomorrow trend is predicted. After correlation analysis, highly correlated features with the article's sentiment scores were selected in the final data set. The model used linear regression, random forest and the Gradient Boosting Machine Algorithm, and hit 63.58% accuracy on average. Weng, Ahmed & Megahed (2017) introduced a model to predict the stock movement one day ahead. Different from sentiment analysis, the author used Wikipedia traffic, Google news counts, some market data, and various technical indicators to make predictions. The model just focused on Apple stock fluctuations from May 1st, 2012, to June 1st, 2015. Combining data from multiple sources, their expert system hit 85.8% accuracy, reflecting that the increase in data categories can boost the prediction result.
Mudinas, Zhang & Levene (2019) divided news and tweets sentiments into eight emotions (e.g., anger, fear). Only a few sentiment emotions showed some correlation with the future stock movements. In most cases, the prediction based on technical factors achieves a better result without emotions analysis. The same conclusion also got in Khedr & Yaseen (2017). They used N-gram and TF-IDF method to generate the weight for each token and classify collected financial news into positive and negative sentiment attributes. The sentiment attributes method achieved 59.18% accuracy in the K-NN classifier, and sentiment and historical stock data method achieved 89.80% accuracy. However, the author did not show the accuracy achieved with sole historical data. It seems that technical factors are the determining factors in stock prediction while extract sentiment attributes have harmful effects on the result, similar to the conclusion of Mudinas, Zhang & Levene (2019).
Except for traditional Machine learning methods, new technologies were also applied in stock predictions. Nguyen, Shirai & Velcin (2015) designed a novel aspect-based sentiment method to calculate the sentimental values for each topic in one sentence. Instead of extracting hidden topics together with sentiments, this new model represents each message as a set of topics with corresponding sentimental values. To examine the efficiency of the novel aspect-based sentiment method, the authors applied SVM on 18 stocks data from July 2012 to July 2013. The aspect-based sentiment feature method achieved 71.05% high accuracy for Amazon stock. Qiu, Song & Chen (2022) created a new sentiment analysis model based on Baidu AI Cloud, which provides an algorithm automatically mines sentiment knowledge through unsupervised learning. Also, a novel weighted sentiment index considers the holiday effect. Sentiments in non-transaction were somehow allocated to the next transaction day. This novel index increased seven of the eight algorithms' performance compared to the sentiment index without the holiday effect. Cryptocurrency price forecasting is also a hot topic outside of stock prediction. Parekh et al. (2022) presented a hybrid model, DL-GuesS in cryptocurrency price prediction considering historical price and Twitter sentiments. DL-GuesS model adopted LSTM and GRU deep learning to predict cryptocurrency prices considering different window sizes. The proposed DL-GuesS model achieved better Bitcoin-Cash performance than the model only using Bitcoin-Cash price.

MATERIALS & METHODS
The proposed model presented in Fig. 1 is designed to predict stock fluctuations and help stock investors make proper investment decisions. For news and tweets data, they are first sent to the sentiment analysis test section to select the most appropriate technology for applying sentiment analysis respectively. Then both generated weighted scores will be summed up in given intervals and apply holiday effect processing. Finally, summed news and tweets scores are combined according to intervals as input features. While for stock data, we turn stock fluctuations into binary changes and regard such change as the expected result in machine learning.

Data collection
Tweets: Compared with Twitter official API, the Twint library has many advantages and has been used in research (Dutta et al., 2021;Bruno Taborda et al., 2021;Yohapriyaa & Uma, 2022). Twint has no restrictions in the official Twitter API and provides many options to filter data (e.g., language selection). We use the cashtag symbol to harvest twitter messages of Service stock AMZN, NFLX and Technology Stock AAPL, MSFT in 2021. News: To avoid the bias of specific financial media, we scrape news from eight websites or medias including CNBC, Forbes, The Street, Reuters, The Motley Fool, Business Insider and Wall Street Journal, Bloomberg (García-Méndez, De Arriba Pérez & González-Castaño, 2022;Li & Pan, 2022;Srivastava, Tiwari & Gupta, 2022). Finally, six thousand pieces of news are retained for this study. Stock Data: Yahoo Finance is an influential financial news and data website widely adopted in stock research (Ahmar & Del Val, 2020;Budiharto, 2021;Topcu & Gulal, 2020). It provides historical stock price fluctuations (Bujari, Furini & Laina, 2017  Lowest price, and Volume. We download data for four selected stocks from January 4, 2021 to December 31, 2021.

Data preprocessing
Because the tweets dataset is built by using text scraped from Twitter, raw tweets with many undesired stuff need to be cleaned. Emoticons, symbols, URLs, and some meaningless texts (stopwords) are removed (Bruno Taborda et al., 2021;Budiharto, 2021). In addition, tweets containing more than three cashtags are discarded as meaningless messages, the same as Bujari's research (Bujari, Furini & Laina, 2017). For news preprocessing, we only keep news that mentioned chosen stock name exactly in titles so that we can avoid the disturb of much unrelated information. Moreover, we calculate the value Goal I: Open t +1 -Open t and Goal II: Open t +1 -Close t (t is given transaction date) for each stock, then transform all values into a binary variable, where positive values → 1 (uptrend) and negative values → 0 (downtrend) (Weng, Ahmed & Megahed, 2017;Bujari, Furini & Laina, 2017). The equation is shown below: Goal II = 1,Close t ≤ Open t +1 0,Close t > Open t +1 . (2)

Sentiment analysis Test
Testing datasets to find the most proper sentiment analysis technology Financial PhraseBank: Financial PhraseBank is one primary dataset for financial area sentiment analysis (Ding et al., 2022;Ye, Lin & Ren, 2021), which was created by Malo et al. (2014). Financial PhraseBank contains 4,845 news sentences found on the LexisNexis database and marked by 16 people with finance backgrounds. Annotators were required to label the sentence as positive, negative, or neutral market sentiment (Malo et al., 2014). All 4845 sentences were kept with higher than 50% agreement. In our study, we will use this dataset to test the performance of the lexicon or model on news text. Tweets_labelled_09042020_16072020: It consists of 5,000 tweets randomly selected out from 943,672 raw tweets (Bruno Taborda et al., 2021). The raw tweets were harvested by using Twitter hashtags and cashtags like #SP500, $MSFT, $AAPL. Inside this file, 1,300 were manually labeled with positive, negative, or neutral market sentiment and reviewed by a second independent annotator. In our study, we will use this dataset to test the performance of the lexicon or model on tweets.

Sentiment analysis technologies
VADER sentiment analysis In our model, VADER library is the first technology used to analyze text sentiment (Hutto, 2016). The VADER library is a lexicon and rule-based framework that can detect the three dimensions of sentiment in the text (Hutto, 2016;Pano & Kashef, 2020). The generated result consists of positive, neutral, negative, and compound scores ranging from −1 to 1 for each tweet.
Here, we just apply VADER sentiment analysis and consider the compound score to represent the general sentiment of each tweet. Then, to expand the influence of those popular tweets, we create a new feature Tweets Weighted: T weighted T weighted = compound score * (retweets + 1) where T weighted is the weighted sentiment compound score for each tweet.

Loughran-McDonald word dictionary
The Loughran-McDonald dictionary was widely adopted in financial area research (Wang, Yuan & Wang, 2020), since it was the first dictionary to explore the potential benefits by using specific financial domain words (Karalevicius, Degrande & De Weerdt, 2018 Figure 2 shows the testing results of VADER, Loughran-McDonald dictionary and FinBERT performance on dataset Tweets_labelled_09042020_16072020. VADER gets the best classification accuracy 68%; Loughran-McDonald dictionary and FinBERT gain 56%, 53% classification accuracy respectively. The results reflect that VADER has better classification performance on messy Twitter messages, while the FinBERT trained on well-structured news text gives the worst result in tweet classification. In addition, we can find that VADER far outperforms other two methods in positive tweets classification and VADER also gives twice as many positive labels as the others. VADER tends to label more tweets in the testing dataset as positive than the Loughran-McDonald dictionary and FinBERT do. Figure 3 contains the classification accuracy of VADER, Loughran-McDonald dictionary and FinBERT on Financial PhraseBank, the result for FinBERT is gathered from Araci (2019), while other two confusion matrix are calculated by using our own results. In news classification, the situation is reversed, FinBERT gets 86% accuracy, but VADER is only 54%. FinBERT is highly adaptive to standard financial text written by professionals compared with tweets proposed by laymen. However, VADER is more suitable for general messages and cannot identify domain-specific terms well. According to the test results, we will apply VADER analysis on tweets dataset, and FinBERT model on news data we gathered for prediction since they present better sentiment classification results.

Data manipulation
Because our idea is to explore how tweets and news from different periods can predict the future stock trend, we also design two time divisions: Natural hours division 0:00 t ∼0:00 t +1 and Opening hours division 9:30 t ∼9:30 t +1 (t is given natural date) and get Tweets in Natural hours: TN t , and Tweets in Opening hours: TO t TN t = 0:00 t +1 i=0:00 t T weighted (5) News in Natural hours: NN t , and News in Opening hours: NO t are similar to above formular above, except that T weighted is changed to N weighted .
The new variables are the summation of sentiment values in each time division. For example, in 9:30 t ∼9:30 t +1 time division on January 5th, T weighted from January 5th, 9:30 to January 6th, 9:30 are summed up to generate TO t for January 5th.
In addition, calendar anomalies (holiday effect, day-of-the-week effect,) are common in financial markets (Berument & Kiymaz, 2001;Jacobs & Levy, 1988;Kiymaz & Berument, 2003). The holiday effect, which leads to abnormal stock price fluctuation on Monday, has been extensively studied in the stock area (Bujari, Furini & Laina, 2017;Berument & Kiymaz, 2001;Shiller, 2003). The news released on weekends or holidays is one reason that changes investors' behavior (Qiu, Song & Chen, 2022). Inspired by Qiu, Song & Chen (2022), we generate an equation to consider holiday effect as shown below (n is days of vacation): TO modified−t = e −n TO t −n + ... + e −1 TO t −1 + TO t (7) Modified score NO modified−t and NN modified−t are also similar to get, just to change Tweets value to corresponding News value.
Take TN modified−t as an example, if n = 2 days, the TN t −2 , TN t −1 and TN t can stand for summed T weighted on t = Sunday, t-1 = Saturday, t-2 = Friday. As sentiments have a stronger impact more recently, sentiment in the past decreasing exponentially (Barberis et al., 2015). Therefore, we concentrate weekend sentiments on Sunday and use modified Sunday weighted score to predict Monday's trend.
Finally, we will combine modified news and tweets score according to function Combine Natural Hours CN and Combine Opening Hours CO: In our experiment, result is the best when α = 0.25. Thus, we apply 0.25 in the function to get CO and CN .

Evaluation methodology
To evaluate the performance of different machine learning algorithms, we employ Classification Accuracy (CA), Precision, Recall, and F1-score. Equations of these four are shown in the following: TP stands for true positive classification, and FP is also positive classification but false. TN stands for true negative classification, and FN is the positive result but wrongly classified as negative.

RESULTS
Tables 1∼4 presents the six algorithms' performance for AAPL, AMZN, MSFT, NFLX in Goal I: Open t +1 -Open t and Tables 5∼8 are the performance for those companies in Goal II: Open t +1 -Close t . As the result shows, Naïve Bayes is the best classification algorithm in our model, six out of eight best classification accuracy comes from Naïve Bayes.
In Goal I: Open t +1 -Open t prediction, all best Accuracy, F1-score, Precision, and Recall, except AAPL's F1-Score, are generated by using CN . In contrast, all best results in Goal II: Open t +1 -Close t prediction are achieved with CO.
Therefore, since Naive Bayes algorithm has proven to be the most effective, it demonstrates strong performance in TD1 for Goal I (Open t +1 -Open t ) and even better performance in TD2 for Goal II (Open t +1 -Close t ). To determine the overall performance, we calculate the average of Naive Bayes accuracy from both Goal I in TD1 and Goal II in TD2.

CONCLUSION
In this article, we develop a new weighted sentiment index T weighted by using Twitter retweets counts and VADER sentiment score, and we utilize FinBERT to generate N weighted to represent news opinion value. We propose a new time division, opening hours division, to study how tweets and news released in different time periods can predict next-day stock movement. Based on that, the summation of scores in different time divisions is calculated. In addition, the holiday effect are also considered in this article to construct modified score, which proved to be a more reliable and realistic indicator (Qiu, Song & Chen, 2022). Finally, we apply KNN, Tree, SVM, random forest, Naïve Bayes, logistic regression algorithms on aligned series to evaluate the performance of combined score CN and CO in predicting Goal I: Open t +1 -Open t and Goal II: Open t +1 -Close t . The experimental result shows that Naïve Bayes is the best classification algorithm among six, six out of eight best results are produced by Naïve Bayes. The possible reason is that the dataset might be highly skewed, with one class being much more prevalent than the others. NB is known to perform well in such scenarios. Furthermore, CN is better in predicting Goal I: Open t +1 -Open t . Except for AAPL's F1-score, all best Accuracy, F1-score, Precision and Recall are generated by using CN . In reverse, CO presents all best results in Goal II: Open t +1 -Close t prediction. It proves that non-natural time division

ADDITIONAL INFORMATION AND DECLARATIONS Funding
The authors received no funding for this work.

Author Contributions
• Qianyi Xiao conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.
• Baha Ihnaini conceived and designed the experiments, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.