STTM: an efficient approach to estimating news impact on stock movement direction

Open text data, such as financial news, are thought to affect or to describe stock market behavior; however, there are no widely accepted algorithms for extracting the relationship between stock quote time series and the fast-growing textual representation of economic information. The field remains challenging and understudied. In particular, topic modeling, a powerful tool for interpretable dimensionality reduction, has hardly ever been used for such tasks. We present a topic modeling framework for assessing the relationship between a financial news stream and stock prices in order to maximize a trader's gain. To do so, we use a dataset of the economic news sections of three Russian national media sources (Kommersant, Vedomosti, and RIA Novosti) containing 197,678 economic articles. They are used to predict 39 time series of the most liquid Russian stocks collected over eight years, from 2013 to 2021. Our approach shows the ability to detect significant return-predictive signals and outperforms 26 existing models in terms of the Sharpe ratio and annual return of a simple long strategy. In particular, it shows a significant Granger causal relationship for more than 70% of portfolio stocks. Furthermore, the approach produces highly interpretable results, requires no domain-specific dictionaries, and, unlike most existing industrial solutions, can be calibrated for individual time series. This makes it directly usable for trading strategies and analytical tasks. Finally, since topic modeling shows its efficiency for most European languages, our approach is expected to be transferable to European stock markets as well.


INTRODUCTION
The efficient market hypothesis (EMH) (Fama, 1970; Fama, 1991) argues that all publicly available information is immediately and fully reflected in stock market prices. Consequently, neither historical data nor the forecasts based on them are seen as usable for the development of efficient investment strategies. However, many approaches to stock market movement prediction developed since EMH was proposed (Fabozzi & Fabozzi, 2020) have demonstrated certain levels of efficiency. At the same time, as the task remains challenging due to the high volatility of stock quotes, new approaches are still needed. Overall, two main groups of approaches, technical and fundamental, are usually singled out by researchers, both nowadays employing machine learning methods (Dixon, Halperin & Bilokon, 2020). In technical analysis, the analyst uses past trends in share prices to predict their performance in the future, without inferring the causes of the observed trends. Fundamental analysis is based on the assumption that the market price of an asset tends to its intrinsic value but always deviates from it, with the asset thus being either overvalued or undervalued. By inferring the intrinsic value to which the market is expected to correct, this approach aims to predict stock price behavior. For this, various external data are often used, including information disclosed by companies, such as revenues, earnings, or profit margin, as well as independent analytics.
One of the promising types of external information is unstructured textual data, notably financial news. Coupled with automated machine learning techniques, it allows investors to solve predictive and descriptive tasks, saving the time and labor costs of finding important information in large amounts of text. Such data has been found able to generate interpretable and significant information signals that help investors to minimize investment risks.
Shallow feature based methods of text processing play a special role in predicting the direction of different types of financial movement, such as stock or commodity prices, with unstructured text data, such as news or user-generated content. Most often, these methods do not require markup (unlike approaches based on sentiment analysis) and do not need their parsing algorithms updated (unlike event extraction methods). The general procedure of building such algorithms begins with preprocessing of the source texts, then passes to constructing vector representations, or embeddings, of these texts (e.g., TF-IDF, BoW, Doc2Vec, DL-based embeddings) and finally incorporates these embeddings into machine learning (ML) techniques to predict stock trends. The main disadvantage of such approaches is the low interpretability of vector text representations as predictors. Meanwhile, topics generated by probabilistic topic models are easily interpreted by humans based on the lists of most probable words, but are mostly missing from the relevant literature reviews (Usmani & Shamsi, 2021; Jurczenko, 2020; Shah Dev & Zulkernine, 2019). Other dimensionality reduction methods that do find their way into the financial movement prediction domain are mostly based on hard clustering approaches, e.g., K-Means (Babu, Geethanjali & Satyanarayana, 2011). This is suboptimal for the classification of texts, which usually belong to more than one topical cluster. Additionally, such clusters are difficult to interpret, as they are delivered unlabelled. As topic modeling co-clusters both texts and words by topics, top words can be used as natural cluster labels, while simple clustering yields nothing except lists of items grouped into unlabelled clusters. Although K-Means-based approaches can be ideologically adapted to fuzzy logic and to the logic of simultaneous co-clustering of items and their features, we are unaware of such applications in the sphere of stock market prediction.
In this article, we propose a new method for predicting stock price movement direction based on topic modeling. Our algorithm is highly interpretable, requires no fixed markup or pre-existing sentiment dictionaries, and at the same time remains an end-to-end solution within the paradigm of machine learning techniques for stock prediction using numerical and textual data. Our approach achieves high predictive power in the weekly price trend prediction task, where the stocks of the largest Russian companies are considered as time series (spanning eight years between 2013 and 2021), and economic news of the three largest Russian-language news agencies is used as textual data. We use the Granger causality test to evaluate the statistical significance of the obtained predictions. In addition, we consider a simple trading strategy and evaluate the success of a portfolio calibrated on the obtained predictions through the Sharpe ratio and annual return. In doing so, we consider portfolios derived from the predictions of various ML models (Random Forest, Logistic Regression, Gradient Boosting Machine, Support Vector Machine, 3-layer Neural Network) and using different embeddings (average Word2Vec, Navec, Doc2Vec, FastText) of the news title, of its entire text, and of its first paragraph. We also consider the quality of the strategies of the mentioned ML models built on endogenous data (5 lags of the time series). We compare our approach to the SESTM model (Ke, Kelly & Xiu, 2020), which has shown promising results for the US stock market and English-language news and which, according to its authors, outperforms RavenPack algorithms (the industry-leading commercial vendor of financial news sentiment) in terms of Sharpe ratio scores. We show that our approach yields the best results more often than the other models included in the comparison. As topic modeling performs universally well across European languages, our approach is expected to be applicable to European stock markets as well.
The rest of the article is structured as follows. The 'Related Work' section reviews approaches based on interpretive sentiment analysis, methods based on combinations of embeddings and ML models, and topic models that are conceptually close to our framework. The 'Methodology' section introduces the proposed method. The 'Datasets and preprocessing' section describes the data used in the current study. The 'Metrics' section describes the return and risk metrics for portfolios obtained using various approaches discussed in this article. The 'Experiments' section contains a description of the procedure for forecasting and constructing various schemes for stock trend modeling. The 'Numerical results' section contains the results of our experiments. The 'Discussion' section interprets the obtained results. The 'Conclusion' summarizes our findings and discusses the possibilities for further framework improvements. Appendix A is devoted to a qualitative analysis of the results of topic modeling. This part of the article, first of all, compares the results of different topic models with each other. Second, it shows which topics are most frequently covered in the main federal media and in the trading terminal news. Finally, the temporal saturation of the market with new information is shown. Appendix B contains supplementary materials of this article, such as illustration of data and models, cumulative divergence of topic profiles, coherence scores and tables with the Granger causality test values.

RELATED WORK
Much research exploring the relationship between textual information and financial time series relies on sentiment dictionaries, such as the Harvard-IV-4 dictionary and the Loughran-McDonald Financial Dictionary (Loughran & McDonald, 2016). For instance, Li et al. (2014) use both of the mentioned dictionaries to create a sentiment-based model for stock market prediction tasks. Kim, Jeong & Ghani (2014) assign a sentiment score to a textual data stream using a dictionary and rules, after which the authors identify the significance of correlations between this news stream and stock market fluctuations. Li et al. (2018) extract sentiment information using Loughran-McDonald, Harvard IV-4, and SenticNet 3.0 in their research. Picasso et al. (2019) use the McDonald dictionary and AffectiveSpace 2 (Cambria et al., 2015) to evaluate sentiment information from financial news for the twenty most capitalized companies listed in the NASDAQ 100 index. However, the dictionary approach is hard to customize to specific data and prediction tasks. Existing dictionaries still require extension to the financial domain. Moreover, sentiment dictionaries remain underdeveloped for less-resourced languages, including Russian, and are the subject of recent research (Panchenko, 2014; Koltsova et al., 2020).
Other related works exploit various machine learning approaches (Rundo et al., 2019; Thakkar & Chaudhari, 2021) combined with different encoding procedures used to assign vector representations to documents; these procedures include TF-IDF features (Bing et al., 2017), word embeddings (Mahmoudi, Docherty & Moscato, 2018), and deep learning methods (Matsubara, Akita & Uehara, 2018; Xu et al., 2020), among others. These vector representations, sometimes combined with other financial numerical features (Gu, Kelly & Xiu, 2020; Li, Wu & Wang, 2020), are used as an input for classification or regression models, depending on the time series problem statement (Henrique, Sobreiro & Kimura, 2019). For example, Khedr, Salama & Yaseen Hegazy (2017) propose a model that predicts the rise and fall of shares of companies traded at NASDAQ based on economic news. They combine stemming, n-grams, TF-IDF, and numerical features with Naïve Bayes and KNN algorithms. Manela & Moreira (2017) use n-gram features with an SVR model to estimate the relationship between the front-page text of The Wall Street Journal and the VIX volatility index. Weng et al. (2018) extract public information from Google and Wikipedia and use a Random Forest model (while simultaneously testing NN, SVR, and boosted regression trees) to predict the 1-day-ahead price of 19 additional stocks from different industries. Such approaches are difficult for a potential investor to interpret: it is often hard to understand why the vector representation model learned a certain word embedding and what effect it had on the final result, as well as to explain why the ML model chose a specific combination of non-transparent features as significant.
Latent Dirichlet Allocation (LDA) is one of the most popular topic modeling techniques, using a Bayesian approach to generating topics (Blei, Ng & Jordan, 2003). Topics derived from topic modeling can be good predictors of financial time series. For instance, Chester Curme & Preis (2017) show that forecasts of trading volume can be improved by accounting for news topical diversity, which they measure as the Shannon entropy of a topic distribution yielded by a topic modeling algorithm run over daily corpora of Financial Times news. There are also natural extensions of the LDA model to the temporal structure of texts: DTM (Blei & Lafferty, 2006) and DIM (Gerrish & Blei, 2010). These models allow tracing the temporal evolution of topics and their lexical composition and reveal the most influential documents. Other papers integrate text and time-series data into a single probabilistic model expanding DTM or LDA (Park, Lee & Moon, 2015; Kanungsukkasem & Leelanupab, 2019). In these papers, researchers carry out a qualitative analysis of topics associated with time series and evaluate the predictive power of the respective model. Kim et al. (2013) develop the Topic Modeling with Time Series Feedback model (ITMTF) that infers topics iteratively while optimizing their correlation with time-series data in terms of its strength and direction. The latter means that topics are gradually re-defined so as to include only the words that affect the predicted time series in the same way (either negatively or positively). This approach yields more causal topics than the baseline LDA in terms of Granger causality and purer topics in terms of the coherence of the effect's direction. However, the ITMTF model uses time series data from a very limited pool of only three US companies, Apple and two airlines, and only for six months. Another important approach ideologically close to ours is SESTM, Sentiment Extraction via Screening and Topic Modeling (Ke, Kelly & Xiu, 2020).
Its authors use a supervised topic model with two topics, one being assigned the words that have a positive impact on asset returns and the other the words having a negative impact, to calculate word-level predictive scores (termed sentiment scores) that are later transformed into text-level predictive scores. These latter scores are used to optimize the construction of an investment portfolio whose quality is assessed in terms of Sharpe ratio and annual return metrics.

METHODOLOGY
We propose the Stock Tonal Topic Modeling approach (STTM) by introducing an index that reflects the association between topics occurring in news stream and the stock prices movement. This index, hereafter termed STTM index, is positive if the overall association of all the topics in the news stream of a given time period with the stock movement is positive (i.e., it predicts stock growth), and negative in the opposite case. Further, we use the STTM index to optimize investment portfolio construction and show its efficiency.
The proposed procedure of computing the STTM index has several stages. First, we perform topic modeling of the news flow and calculate the salience of each topic at each time point of a pre-selected period, thus receiving a distribution of saliences for each topic over time. This distribution is further referred to as the topic stream. Similarly, we obtain the word stream for each word as a distribution of word frequencies in our news flow over time. Second, we compute the tone of each word as the value of an association measure (in our case, Pearson correlation) between the word stream and the target time series (in our case, stock prices). Words found to be positively associated with the target time series are considered to have positive tone, and vice versa. Third, topic-level tone is computed based on the tones of high-probability words from each topic, according to a procedure described further below. Finally, topic-level tones are aggregated over all topics into a distribution termed the tonal topic stream, which in turn is aggregated over time into a single value, the STTM index. This index thus reflects the strength and the direction of the aggregated impact of all topics on the stock price movement.
More formally, let us denote the textual data stream as a collection $D = \{(d_1, t_{d_1}), \dots, (d_m, t_{d_m})\}$, where $d_i$ is a document and $t_{d_i}$ is the date and time of the document's release. Let us also denote the financial time series as $p = \{(p_1, t_1), \dots, (p_N, t_N)\}$, where $p_i$ is a value and $t_i$ is the corresponding time stamp. Our key problem is predicting $P(r_t \geq 0)$, where $r_t = (p_t - p_{t-1})/p_{t-1}$, with the STTM index based on the textual data flow between $t-1$ and $t$ as the feature. In our notation, we normalize the raw STTM index to the range $[0, 1]$, so that $\mathrm{STTM} \to 1$ and $\mathrm{STTM} \to 0$ mean that the textual information pulls the time series $p_t$ up and down, respectively. We give a detailed description of the entire procedure; its graphical representation can be found in Fig. 1.
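The prediction target can be illustrated with a minimal sketch (function names are ours, not the paper's):

```python
def simple_returns(prices):
    """r_t = (p_t - p_{t-1}) / p_{t-1} for a list of prices."""
    return [(p - q) / q for q, p in zip(prices, prices[1:])]

def movement_direction(prices):
    """Binary target: 1 if the return is non-negative, 0 otherwise."""
    return [1 if r >= 0 else 0 for r in simple_returns(prices)]
```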

Data preprocessing
First of all, we preprocess the input text data. Each text is subject to tokenization and lemmatization, as well as removal of stop words and punctuation symbols. Next, we calculate the idf parameter (inverse document frequency, part of the TF-IDF feature) for each unique word $w$ in $D$:

$\mathrm{idf}(w) = \log \frac{|D|}{|\{d_i \in D \mid w \in d_i\}|},$

where $|D|$ is the number of documents in the collection, and $|\{d_i \in D \mid w \in d_i\}|$ is the number of documents from the collection $D$ in which the word $w$ occurs. After that, we remove from the text data the words whose idf falls into the upper and lower quantiles of the $\alpha$-level. Thus, we do not consider the most rare or the most frequent words. Then, we transform the resulting text data into a bag-of-words representation.
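A minimal sketch of the idf-based filtering, assuming the standard logarithmic idf and symmetric $\alpha$-quantile cut-offs (the paper's exact quantile handling may differ):

```python
import math
from collections import Counter

def idf_scores(docs):
    """docs: list of token lists. idf(w) = log(|D| / |{d in D : w in d}|)."""
    df = Counter()
    for d in docs:
        df.update(set(d))
    n = len(docs)
    return {w: math.log(n / k) for w, k in df.items()}

def filter_by_idf(docs, alpha=0.05):
    """Drop words whose idf falls into the lower or upper alpha-quantile."""
    scores = idf_scores(docs)
    ranked = sorted(scores.values())
    lo = ranked[int(alpha * (len(ranked) - 1))]
    hi = ranked[int((1 - alpha) * (len(ranked) - 1))]
    keep = {w for w, s in scores.items() if lo <= s <= hi}
    return [[w for w in d if w in keep] for d in docs]
```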

Topic modeling
We feed the preprocessed text data as an input to the topic model for generating probabilistic topics $T_1, \dots, T_n$. The topic model can be LDA, DTM, DIM, ITMTF, or any other technique. It can be pre-trained in advance or online-trained on the textual data stream $D$. As a result of the topic modeling procedure, each document $d$ is represented by an $n$-dimensional vector of topic probabilities: $\theta^{(d)} = (\theta_{d,1}, \dots, \theta_{d,n})$. We numerically estimate the salience $S_{j,i}$ of each topic $T_j$ in each time unit $t_i$ of the textual data stream $D$ as the cumulative topic probability over the documents released in that time unit:

$S_{j,i} = \sum_{d:\; t_d \in t_i} \theta_{d,j}. \quad (2.1)$

The topic stream $TS_j$ of each topic $T_j$ is thus defined as the set of all salience scores of this topic in a given time period: $TS_j = (S_{j,1}, \dots, S_{j,N})$. Consequently, we associate each topic $T_j$ with the time series of its stream $TS_j$.
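Assuming salience is accumulated as the sum of per-document topic probabilities within each time unit (our reading of (2.1)), topic streams can be sketched as:

```python
def topic_streams(doc_topics, doc_times):
    """doc_topics: theta vectors, one per document; doc_times: time-bucket id per
    document. Returns {topic j: [S_{j,i} per bucket i]}, with salience assumed to be
    the sum of theta_{d,j} over documents released in bucket i."""
    n_topics = len(doc_topics[0])
    buckets = sorted(set(doc_times))
    streams = {j: [] for j in range(n_topics)}
    for i in buckets:
        in_bucket = [th for th, t in zip(doc_topics, doc_times) if t == i]
        for j in range(n_topics):
            streams[j].append(sum(th[j] for th in in_bucket))
    return streams
```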

Topic tonality based on word-level tone
By analogy with the topic stream, to define the word stream, for each word $w$ we first calculate its frequency as the sum of its occurrences over all documents $d$ that have appeared in our stream at a given time point $t_i$:

$F_{w,i} = \sum_{d:\; t_d \in t_i} c(w,d),$

where $c(w,d)$ is the number of occurrences of the word $w$ in the document $d$. The word stream $WS_w$ is thus defined as the set of word frequencies in a given time period: $WS_w = (F_{w,1}, \dots, F_{w,N})$. Consequently, we associate each word $w$ with the time series of its stream $WS_w$. In general, $c(w,d)$ can be any additive function of the number of occurrences of the word $w$ in the document $d$.

The tone $\tau_w$ of the word $w$ is determined as a function of the target time series and the word stream:

$\tau_w = f_w(p, WS_w).$

The function $f_w$ can be any regression evaluation metric or any time series proximity metric. We use the Pearson correlation coefficient: $\tau_w = r_{p, WS_w}$ if its significance level is less than $\gamma$ (we use the standard significance level of 5%), and $\tau_w = 0$ in other cases. Since each topic $T_j$ is, at each time unit $t_i$, a probability distribution over a dictionary $V$: $T_{j,i} = ((w_1, \varphi^{j,i}_{w_1}), \dots, (w_{|V|}, \varphi^{j,i}_{w_{|V|}}))$, we define the topic tone $\psi_{j,i}$ as a function of the words' probabilities in the topic $\varphi^{j,i}_w$ and the corresponding words' tones $\tau_w$ for each time point $t_i$. The overall tonality of the topic $T_j$ is defined as the set of its tones in a given time period: $\Psi_j = (\psi_{j,1}, \dots, \psi_{j,N})$. We implement the function $f_{T_{j,i}}$ as follows:

1. The most probable words in the topic $T_j$ at time point $t_i$ (i.e., the words for which the tone will be calculated), sorted by probability in descending order, are selected so that the sum of their probabilities does not exceed the specified threshold $C_j$ (we use 0.3 as the default value).

2. The variables $pProb$ and $nProb$ are calculated as the probability mass of the selected words with positive and negative tones, respectively; the calculations involve only the words selected at the previous step. If the significance level of $r_{p, WS_w}$ for all the selected words is higher than $\gamma$, then we define $pProb = 0$, $nProb = 0$.
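A sketch of the word- and topic-level tone computations. Two caveats: the significance test here uses a normal approximation of the t-test for the Pearson coefficient, and the final combination of pProb and nProb into a single topic tone is taken to be their difference, which is our assumption rather than the paper's exact formula:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def word_tone(word_stream, prices, gamma=0.05):
    """Tone = Pearson r if significant at level gamma, else 0.
    The p-value uses a normal approximation of the t-test (an approximation)."""
    r = pearson(word_stream, prices)
    n = len(prices)
    t = abs(r) * math.sqrt((n - 2) / max(1e-12, 1.0 - r * r))
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(t / math.sqrt(2.0))))
    return r if p < gamma else 0.0

def topic_tone(topic_words, tones, c_threshold=0.3):
    """topic_words: [(word, prob)] sorted by prob, descending. Top words are taken
    while their cumulative probability stays within c_threshold; pProb and nProb are
    the probability mass of positive- and negative-tone words among them. Combining
    them as pProb - nProb is our assumption. Insignificant words have tone 0 and
    contribute to neither sum."""
    total, p_prob, n_prob = 0.0, 0.0, 0.0
    for w, phi in topic_words:
        if total + phi > c_threshold:
            break
        total += phi
        tone = tones.get(w, 0.0)
        if tone > 0:
            p_prob += phi
        elif tone < 0:
            n_prob += phi
    return p_prob - n_prob
```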

Tonal topic stream
At the next step, we calculate the tonal topic stream (TTS) as the product of topic tonalities and topic streams at each time point:

$TTS_{j,i} = \psi_{j,i} \cdot TS_{j,i}.$

Thus, TTS is a matrix whose rows correspond to the topics and whose columns correspond to the time points. Each element of this matrix, $TTS_{j,i}$, is the value of the tonal topic stream for the topic $T_j$ at time point $t_i$.

STTM index
Finally, the STTM index, as the overall tonality of all topics over a given period of time, is defined as a time aggregate of the TTS matrix:

$\mathrm{STTM} = \mathrm{agg}_{i}\Big(\sum_{j} TTS_{j,i}\Big).$

The aggregation function can be a simple or weighted mean, median, or sum; we use the simple sum by default. The tonality of each specific news item $d$ is, analogously, the aggregate over the products of the topic probabilities of the news item $\theta^{(d)}$ and its topic tonalities $\psi^{(d)}$ at time point $t_d$. As noted above, for comparability purposes we normalize the STTM index to the range $[0,1]$ based on a sigmoid regressor. Figure 2 shows a general scheme for predicting directions of financial market movements using the STTM approach.
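Putting the pieces together, the TTS matrix and the normalized STTM index can be sketched as follows; the plain logistic squashing here is a simple stand-in for the paper's sigmoid regressor:

```python
import math

def tonal_topic_stream(streams, tones):
    """TTS_{j,i} = psi_{j,i} * TS_{j,i}; streams and tones map each topic j
    to its list of per-time-point values."""
    return {j: [psi * s for psi, s in zip(tones[j], streams[j])] for j in streams}

def sttm_index(streams, tones):
    """Sum TTS over topics and time (the default aggregation), squashed to [0, 1]."""
    tts = tonal_topic_stream(streams, tones)
    raw = sum(sum(row) for row in tts.values())
    return 1.0 / (1.0 + math.exp(-raw))
```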

DATASETS AND PREPROCESSING
In this section, we describe in detail our datasets. We collect two types of data: financial time series data and textual data stream. We consider stocks included in the MOEX Russia Index as the time-series data and Russian-language news from the largest and most influential economic media as a textual data stream. In addition, we describe the required raw data preprocessing.

MOEX Russia Index
The MOEX (Moscow Exchange) Russia Index (see Fig. 3) is a capitalization-weighted composite index serving as the primary ruble-denominated benchmark of the Russian stock market. It is calculated as the sum of the prices of the 39 most liquid Russian stocks weighted by expert assessments of their impact on the Russian economy. These stocks are pre-selected by experts from among the largest and the most dynamically developing Russian issuers with economic activities in the leading sectors of the Russian economy. The MOEX Russia Index is used as one of the baseline investment portfolios in this research.
Weekly returns are computed as $r_t = (p_t - p_{t-1})/p_{t-1}$, where $p_t$ is the closing price of the share's time series and $t$ is a weekly timestamp.

Economic news
Our text dataset of daily news includes three Russian national media sources: Kommersant, RIA Novosti, and Vedomosti. Kommersant (The Businessman, available at https://www.kommersant.ru) is a nationally distributed daily newspaper devoted to politics and business. RIA Novosti (Russian Information Agency, available at https://ria.ru) is one of the principal state-owned news agencies, publishing news and opinions on social, political, economic, scientific, and financial subjects. Vedomosti (The News, available at https://www.vedomosti.ru) is a national daily newspaper specializing in business. In each media outlet we consider the texts from the economy section only. Each such section also contains a significant number of editorials, analytical reviews, and expert opinions, which affects the estimated textual data flow. Table 1 illustrates the amount of collected data and the corresponding date intervals.

News preprocessing pipeline
We use a common natural language preprocessing pipeline. We begin with tokenization (using the natural language toolkit, NLTK: https://www.nltk.org), breaking each text into sentence components and then into word components. Next, we normalize all tokens in each article to lower case, remove punctuation, non-alphabetic, and non-Cyrillic symbols, and perform lemmatization with Yandex MyStem, an instrument developed specifically for the Russian language and based on extensive morphological analysis (we use its Python wrapper: https://pypi.org/project/pymystem3/). A lemmatizer was preferred to stemmers because it avoids aggressive suffix stripping, an approach that is not applicable to highly inflected languages, such as those of the Slavic family, in which suffixes are heavily used for word formation and can entirely change word meaning. In terms of recognizing lemmas from word forms, Yandex MyStem has an error rate of only about 2-3% and consistently outperforms other existing models on different Russian-language corpora (Kotelnikov, Razova & Fishcheva, 2018). We then remove stop words, such as prepositions, participles, interjections, and numbers (the list of stop words is available from item 70 at https://www.nltk.org/nltk_data/), as well as the tokens in the upper and lower quantiles of the α-level of the idf parameter. We use the standard 95% quantile as the default value for the α-level. Finally, we convert each news item into a vector of word counts.

METRICS
We consider two approaches to assessing the obtained results. First, we explore the relationship between the stock market movement and the news flow impact estimated by the proposed tonal topic modeling procedure using the Granger causality test (Deveikyte et al., 2020). Second, we introduce a simple trading strategy of calibrating investment portfolios based on the predictions of the model and evaluate the performance of this strategy with the Sharpe ratio and the annual return of the portfolio (Ke, Kelly & Xiu, 2020). Economic metrics for evaluating the success of a calibrated portfolio appear to be more suitable for the stock trend prediction problem than standard classification metrics, such as accuracy, the Receiver Operating Characteristic (ROC), and the area under the curve (AUC). The reason is that standard classification quality metrics, computed for a single time series, may not be high, whereas economic metrics aggregate predictions over many time series, so a good financial result can still be obtained.

Granger causality test
The Granger causality test is a statistical hypothesis test for determining whether one time series is useful in forecasting another. Granger causality requires time series stationarity. Let $y_t$ and $x_t$ be two time series. To see if $x_t$ 'Granger-causes' $y_t$ with a maximum time lag of $q$, the following regression is performed:

$y_t = a_0 + \sum_{l=1}^{q} a_l\, y_{t-l} + \sum_{l=1}^{q} b_l\, x_{t-l} + \varepsilon_t.$

Then, F-tests are used to evaluate the significance of the lagged $x$ terms; the coefficients of the lagged $x$ terms estimate the impact of $x$ on $y$. The null hypothesis that $x$ does not Granger-cause $y$ is accepted if and only if no lagged values of $x$ are retained in the regression. Since in our case the time series have shown non-stationary behavior, as determined by the Dickey-Fuller unit root test, first-order and, where necessary, second-order differencing was performed to achieve stationarity.

Sharpe ratio and annual return
The Sharpe ratio measures the performance of an investment, such as a share or a portfolio of shares, compared to a risk-free asset, after adjusting for its risk. It is defined as the difference between the returns of the investment and the risk-free return, divided by the standard deviation of the investment returns. It represents the additional amount of return that an investor receives per unit of increase in risk. More formally,

$\mathrm{Sharpe} = \frac{R_p - R_f}{\sigma_p},$

where $R_p$ is the return of the portfolio, $R_f$ is the risk-free rate, and $\sigma_p$ is the standard deviation of the portfolio's excess return. As a risk-free asset for the Russian market, we use government bond yields (available at https://www.cbr.ru/eng/hd_base/zcyc_params/). A Sharpe ratio of less than one is usually considered unacceptable or bad: it means that the risk the portfolio encounters is not offset well enough by its return. Annual return, or Compounded Annual Growth Rate (CAGR), is the average annual rate calculated from the returns observed in the first and the last years of a given time span, assuming that all the dividends are reinvested at the end of each year. More formally,

$\mathrm{CAGR} = \left(\frac{EV}{SV}\right)^{1/n} - 1,$

where $EV$ is the ending value, $SV$ is the starting value, and $n$ is the number of years. For both metrics we use the implementations from the empyrical Python package: https://github.com/quantopian/empyrical.
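Both metrics are straightforward to reproduce; the sketch below annualizes the Sharpe ratio of per-period returns with a square-root-of-periods factor, mirroring the usual convention of packages such as empyrical:

```python
import math

def sharpe_ratio(returns, risk_free=0.0, periods_per_year=52):
    """(mean excess return / std of excess returns), annualized by sqrt(periods)."""
    excess = [r - risk_free for r in returns]
    mean = sum(excess) / len(excess)
    var = sum((r - mean) ** 2 for r in excess) / (len(excess) - 1)
    return mean / math.sqrt(var) * math.sqrt(periods_per_year)

def annual_return(start_value, end_value, n_years):
    """CAGR = (EV / SV)^(1/n) - 1."""
    return (end_value / start_value) ** (1.0 / n_years) - 1.0
```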

EXPERIMENTS
In this section, we describe our experimental procedures. The 'STTM' subsection describes how to build and evaluate the quality of models based on the proposed framework. The 'SESTM' subsection describes how we build the SESTM baseline, a topic model reported by its authors to outperform RavenPack, the industry-leading commercial vendor of financial news sentiment. The 'Shallow feature based methods of text processing' subsection describes a scheme for building models based on news embeddings. The 'Endogenous models' subsection reveals a way to build simple endogenous models on time series lags. The 'Evaluation procedure' subsection describes the scheme for splitting the initial data into training and test samples to evaluate quality metrics.

STTM
As mentioned before, our Stock Tonal Topic Modeling (STTM) approach can have any basic topic model as its core. In this article, we implement two models, LDA and DTM, using the Python wrappers from the gensim package: https://radimrehurek.com/gensim/. A qualitative analysis of these models is presented in the section 'Qualitative Analysis of Russian Economic News Topic Modeling' in Appendix A. DTM extends LDA by allowing word probabilities in a topic to change over time; we have chosen to update them in increments of one month. Further, we select the number of topics so as to avoid solutions with either a large number of too granular topics or a small number of too broad topics.
To do so, we optimize topic coherence with Röder's C_v metric (Röder, Both & Hinneburg, 2015). Although the dynamics of coherence change with the increase of the number of topics is somewhat different for different news sources (see Figs. B4, B5 and B6), the overall optimum appears at n = 20, where the C_v curve reaches its maximum before flattening out for all national media sources. This optimum is the same for both LDA and DTM. We run a topic model on the training dataset and apply it to the test dataset. The datasets are constructed according to the procedure described in the 'Evaluation procedure' section below. Topic tonality based on word-level tone is calculated from the solution obtained on the training set. The topic stream is calculated from the test set, and the topic tonality based on word-level tone is applied to it. STTM hyper-parameters are selected by grid search on the training set, optimizing the ROC-AUC metric for each time series independently. Examples of the STTM index, the stock time series, news, topics, and words, along with their tonalities and tones, as well as an example of a tonal topic stream, are presented in Figs. B9, B10, B11 and B12. It can be seen that the results of the proposed procedure are highly interpretable. Since topic modeling possesses a certain level of instability leading to fluctuations in the word probabilities, we repeat all calculations for trading strategy performance at least ten times. After that, we estimate the mean and variance of each of the considered economic metrics.

SESTM
We implemented the Sentiment Extraction via Screening and Topic Modeling (SESTM) procedure (Ke, Kelly & Xiu, 2020) as a baseline topic model. We apply it to each time series from the MOEX Russia Index for the three national news sources. The SESTM approach infers only two topics, containing words that have either a negative or a positive effect on the target time series, with the composition of these topics optimized iteratively based on an association metric. The process consists of three steps: (1) isolating a list of sentiment terms via predictive screening; (2) assigning sentiment weights to these words via topic modeling; (3) aggregating terms into an article-level sentiment score via penalized likelihood. The main assumption of the model is that the news is generated from the following mixture multinomial distribution:

$d_i \sim \mathrm{Multinomial}\big(s_i,\; p_i O_{+} + (1 - p_i) O_{-}\big),$

where $s_i$ is the total count of sentiment-charged words in article $d_i$, $p_i$ is the sentiment score, and $O_{+}$ and $O_{-}$ are the positive and negative topics, respectively, which are probability distributions over the sentiment-charged words.
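The first (screening) step can be sketched as follows; the thresholds here are illustrative and not the calibrated values from Ke, Kelly & Xiu (2020):

```python
from collections import Counter

def screen_sentiment_words(articles, returns, kappa=0.25, alpha_plus=0.9, alpha_minus=0.1):
    """SESTM step 1, simplified: for each word, f_w is the fraction of articles
    containing it that coincide with a positive return; a word is kept if f_w >=
    alpha_plus (positive list) or f_w <= alpha_minus (negative list), provided it
    occurs in at least a kappa fraction of articles."""
    n = len(articles)
    count, pos = Counter(), Counter()
    for doc, r in zip(articles, returns):
        for w in set(doc):
            count[w] += 1
            if r > 0:
                pos[w] += 1
    selected = {}
    for w, k in count.items():
        if k / n < kappa:
            continue
        f = pos[w] / k
        if f >= alpha_plus or f <= alpha_minus:
            selected[w] = f
    return selected
```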

Shallow feature based methods of text processing
To compare our approach with models based on shallow-feature text preprocessing, we apply the pipeline shown in Fig. 5. For each economic news item from the considered news agencies (Kommersant, Vedomosti, RIA Novosti), various textual components are extracted (full text, first paragraphs, titles). After that, each component is preprocessed as described in the 'Datasets and preprocessing' section. Next, various embedding techniques are applied to each news item: Word2Vec, Navec (from the natasha Python package: https://github.com/natasha/navec), Doc2Vec, and FastText (Python package from https://fasttext.cc/); all embedding models are trained on the first two years of texts for each news outlet separately. The resulting news embeddings (where all news for the same week are treated as one document) are fed as input features to the following machine learning models: random forest (RF), logistic regression (LR), gradient boosting machine (GBM), support vector machine (SVM), and a three-layer neural network (NN), all implemented with the sklearn Python package (https://scikit-learn.org/stable/). The target variable for all models is the sign of returns for each ticker included in the MOEX Russia Index, which equals 1 for a growing time series and 0 otherwise.
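The embedding-plus-classifier stage of this pipeline can be sketched as follows; a random embedding table stands in for the trained Word2Vec/Navec vectors, and the weekly data are synthetic:

```python
# Sketch of the shallow-feature pipeline: mean-pool per-word embeddings into one
# weekly document vector and feed it to a classifier whose score ranks stocks.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
vocab, dim = 50, 16
emb = rng.normal(size=(vocab, dim))               # stand-in embedding table

def doc_vector(token_ids):
    """Mean-pool the word embeddings of one weekly 'document'."""
    return emb[token_ids].mean(axis=0)

weeks = [rng.integers(0, vocab, size=rng.integers(20, 60)) for _ in range(120)]
X = np.vstack([doc_vector(w) for w in weeks])     # one row per week
y = rng.integers(0, 2, size=len(weeks))           # sign of the weekly return

clf = GradientBoostingClassifier(random_state=0).fit(X[:90], y[:90])
proba = clf.predict_proba(X[90:])[:, 1]           # scores used to rank stocks
```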

Endogenous models
In addition, we compare the proposed framework with simple endogenous models: random forest (RF), logistic regression (LR), gradient boosting machine (GBM), support vector machine (SVM), and a three-layer neural network (NN), where weekly return lags are used as features and the target variable is the same as in the shallow-feature-based models. Figure 6 shows the pipeline we used for constructing these endogenous models.
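The lag-feature construction behind these endogenous models can be sketched as follows (the returns are synthetic and the lag count is an assumption):

```python
# Sketch of an endogenous model: weekly return lags as the only features,
# next week's return sign as the target.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
r = rng.normal(0, 0.02, size=300)                 # weekly returns of one ticker
n_lags = 4                                        # number of lags (assumed)

# row t holds [r_t, r_{t+1}, ..., r_{t+n_lags-1}]; the target is the next sign
X = np.column_stack([r[i:len(r) - n_lags + i] for i in range(n_lags)])
y = (r[n_lags:] > 0).astype(int)                  # 1 for a growing week

model = LogisticRegression().fit(X[:250], y[:250])
pred = model.predict(X[250:])
```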

Evaluation procedure
Since for time series data the test and train datasets have to cover uninterrupted periods of time, we use an iterative expanding cross-validation scheme (visualised in Fig. 7): the time window of the training set is expanded by one year at each iteration, starting from two years and ending with six years, while the test window is kept fixed at one year across all iterations. The topic model is retrained on the training set at each iteration and then applied to the test set. We estimate our models on all economic news published between the stock market opening each Monday and its close each Friday. The target variable is the movement direction sign(r_t) between the closing price on Friday and the opening price on the preceding Monday for each share in the MOEX Russia Index. For the proposed STTM approach, we compute a Granger causality test between the weekly value of the STTM index and the weekly stock price of each ticker included in the MOEX Russia Index, for each test interval separately. For each ticker and for the different topic models (LDA and DTM), including STTM, we evaluate weekly trading strategy performance. We use a straightforward long strategy: each Friday, at the close of the stock market, we select the top 20 percent of stocks with the highest model prediction for the current week, i.e., those most likely to demonstrate price gains in the upcoming week, and buy them at the Friday prices. This procedure is repeated each Friday, providing weekly portfolio recomposition. We then evaluate the strategy with the annual return and Sharpe ratio metrics introduced above. In doing so, we account for neither broker commissions nor transaction costs (see the Limitations section), because our goal is to evaluate the predictive power of our method in comparison with other methods, rather than to calculate the exact amount of the final return.
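The split-generation logic of this expanding scheme can be sketched as follows (year boundaries are assumed to align with calendar years):

```python
# Sketch of the expanding-window scheme: the training window grows by one year
# per iteration (from 2 to 6 years), the test window stays fixed at one year.
def expanding_splits(start_year=2013, min_train=2, max_train=6):
    """Yield (train_years, test_year) pairs for the iterative scheme."""
    for k in range(min_train, max_train + 1):
        train = list(range(start_year, start_year + k))
        yield train, start_year + k

splits = list(expanding_splits())
# first split: train on 2013-2014, test on 2015; last: train 2013-2018, test 2019
```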

Granger causality tests
In this section, we provide numerical results for the Granger causality test between the STTM index and Friday prices for each of the 39 tickers included in the MOEX Russia Index; this is done for the two topic models employed (LDA and DTM) and for the three news sources. Calculation details can be found in Tables C1-C6. Figure 8 shows the proportion of assets in our sample for which the STTM index has significant Granger predictive power in each of the studied years.

Weekly trading strategy performance
In total, we compare the performance of 28 different portfolios: five endogenous model based portfolios, 20 portfolios using shallow-feature-based text processing methods, two portfolios based on the proposed STTM approach, and one based on SESTM. Each of these portfolios is constructed for the three news sources independently. Given that we have 39 tickers in our analysis, in total we obtain 2,886 different models, each validated on six train/test splits. We also compare all these models to two baselines: the MOEX Russia Index, a broad stock market index introduced above, and an Equal Weight Index (EWI) based on the MOEX tickers, a type of buy-and-hold strategy. While the former exemplifies a capitalization-weighted index, the latter gives equal weight to all stocks, including small-cap stocks that are generally considered higher-risk, higher-potential-return investments compared to large-caps. In theory, giving greater weight to the smaller names of the MOEX Russia Index in an equal-weight portfolio should increase the return potential of the portfolio, so the EWI may be expected to outperform the MOEX Russia Index. Table 2 contains information about the performance of the baselines. Table 3 demonstrates the results for the topic modeling based approaches, Table 4 contains the results for the shallow-feature-based text processing methods, and Table 5 presents the results for the endogenous models. Figure 9 shows how weekly returns accumulate depending on the model used to form the investment strategy and compares them to the MOEX Russia Index (IMOEX) and the Equal Weight Index (EWI) as baseline strategies. Each strategy uses a one-week-ahead approach and sorts portfolios by the score obtained from the chosen model. Specifically, Figs. 9A-9C present strategies using data from only one information source: Kommersant, RIA Novosti and Vedomosti, respectively. Figure 9D contains the best models, with a Sharpe ratio greater than one.
We discuss these results in detail further below.
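The two evaluation metrics used throughout can be computed from weekly strategy returns as follows (the returns are toy values, and a zero risk-free rate is assumed):

```python
# Sketch of the evaluation metrics for the weekly long strategy: annualized
# return via compounding and the annualized Sharpe ratio (risk-free rate = 0).
import numpy as np

weekly = np.array([0.01, -0.004, 0.008, 0.012, -0.002] * 10)  # toy weekly returns

# compound the weekly returns, then rescale to a 52-week year
annual_return = (1 + weekly).prod() ** (52 / len(weekly)) - 1
# mean over sample standard deviation, annualized with sqrt(52)
sharpe = weekly.mean() / weekly.std(ddof=1) * np.sqrt(52)
```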

DISCUSSION
In this section we interpret the obtained results. Figure 8 shows that, as time passes and the volume of training data increases, the proportion of tickers for which our STTM approach turns out to be Granger-causal tends to increase as well. An exception is a sharp fall in predictive power for the majority of models in 2018. This fall is most probably explained by a number of international macroeconomic events in the second half of 2018 (including the USA-China trade wars, the US Federal Funds Rate hike, and the collapse of a large number of global financial indices). These events were poorly covered in the Russian media, which focused on internal agendas such as the resonant rise of the retirement age. From Fig. 8 it can be seen that, according to the Granger causality test, the proposed STTM index can be significant for as much as 70% of the stock quotes listed in the MOEX Russia Index if the data are sufficient to calibrate the model.

Weekly trading strategy performance
Facet D of Fig. 9 illustrates the one-week-ahead performance of the most economically successful portfolios (Sharpe ratio above one), as well as the MOEX Russia Index. The top models in terms of the success of the investment portfolios built on them are as follows: STTM (LDA) on Kommersant with mean Sharpe ratio 1.3706 (annual return 36.12%), GBM + Navec on RIA Novosti first paragraphs with Sharpe ratio 1.1917 (annual return 32.84%), STTM (LDA) on Vedomosti with mean Sharpe ratio 1.1059 (annual return 28.40%), STTM (DTM) on Kommersant with mean Sharpe ratio 1.0798 (annual return 28.45%), STTM (LDA) on RIA Novosti with mean Sharpe ratio 1.0140 (annual return 27.55%), and LR + Doc2Vec on Kommersant titles with Sharpe ratio 1.0056 (annual return 27.46%). Portfolios built on these models are also the most profitable. Thus, out of 69 different approaches to stock trend prediction, only six turned out to be economically viable. Among them, four are built on our proposed STTM approach and only two are derived using shallow-feature-based text processing methods. Three of the six are based on Kommersant data, two on RIA Novosti data, and one on Vedomosti. Now let us take a closer look at the best models for each individual news agency. Facets A, B and C of Fig. 9 display performance for Kommersant, RIA Novosti and Vedomosti, respectively. Each facet includes all topic modeling based approaches and the five most successful models among the remaining approaches (shallow-feature-based text processing methods and endogenous models). For Kommersant, the STTM (LDA) model takes first place (Sharpe ratio 1.3706, annual return 36.12%), followed by STTM (DTM) (Sharpe ratio 1.0798, annual return 28.45%), both in terms of Sharpe ratio and annual return. Next come five models with shallow-feature-based text processing.
For RIA Novosti, STTM (LDA) and STTM (DTM) take second (Sharpe ratio 1.0140, annual return 27.55%) and fourth (Sharpe ratio 0.9691, annual return 26.75%) places, respectively; the rest of the best models use shallow-feature-based text processing. For Vedomosti, STTM (LDA) and STTM (DTM) take first (Sharpe ratio 1.1059, annual return 28.40%) and third (Sharpe ratio 0.7941, annual return 20.95%) places, respectively; the remaining best models again use shallow-feature-based text processing. Our experiments show that for all news sources the proposed STTM approach is among the best models, while maintaining the interpretability of results (see Figs. B9, B10, B11 and B12 in Appendix B). It is worth noting that the endogenous models do not make it to the top of the list, which indicates that more useful economic information can be obtained from external data sources than from the time series themselves. SESTM never enters any list of the best strategies, possibly due to the smaller size of our dataset compared to the dataset used by the SESTM developers. However, we note that SESTM still outperforms the general economic baseline, the MOEX Russia Index, both in terms of Sharpe ratio and annual return.

CONCLUSION
In this article, we have proposed a new approach, STTM, for evaluating the impact of a news stream on the stock market trend, which is novel in several aspects. First, it does not use domain-specific dictionaries or any other manual markup. Next, unlike many commercial solutions, such as Reuters and Bloomberg, which produce general impact coefficients for the entire market, our algorithm can be fine-tuned for any individual issuer. At the same time, our analytical pipeline remains transparent and interpretable for an investor or a risk manager. It clusters news streams via topic modeling, finds the most influential terms among the most probable words of each topic with a tone assessment procedure, and assesses the overall tone of each topic through a trade-off between positive and negative terms and their probabilities, as well as tone aggregation across the entire news stream. Topic tone reflects the strength and the direction of its potential impact on stock prices. Our procedure can be combined with various topic modeling techniques and time series proximity measures. It can also be generalized to other domains and used to assess the impact of text data on various time series, in both predictive and explanatory tasks.
To illustrate the usefulness of the proposed method, we have carried out a large number of experiments on predicting the Russian stock market with texts from the economic sections of the most significant Russian-language news editions. We investigated Granger causality between the output of the proposed STTM approach and each of the 39 tickers included in the MOEX Russia Index over six years and for two different topic modeling algorithms (LDA and DTM). The model shows significant causality across multiple tickers and Granger-causes more than 70% of them when the training data is large enough. We compared 28 different models by assessing their performance in terms of the efficiency of a simple long trading strategy. For that, we created portfolios based on the predictions of each of these models and of each of our three news sources independently: 20 portfolios used shallow-feature-based text processing methods, one was based on SESTM, five on endogenous models, and two on our approach (STTM). This corresponds to the construction of 2,886 different model variations, as each portfolio creation method was applied to each of the 39 tickers and validated on six train/test splits. The quality of the resulting portfolios was evaluated by two metrics: Sharpe ratio and annual return.
Of all the multitude of model variations, only six turned out to be economically viable, with a Sharpe ratio above one. Of them, as many as four were based on STTM, and the remaining two were shallow-feature-based text processing methods, which were initially represented by a much larger number of model variations than STTM. The STTM-based models ranked at the top of the list for the various news publications, consistently outperforming the MOEX Russia Index baseline, the endogenous models, and the SESTM-based topic model. Thus, our work shows that the proposed framework is promising in explaining and predicting financial time series based on textual data flows. The applicability of topic modeling to all European languages, as well as to some other languages, suggests that this framework has good prospects of being usable far beyond the Russian stock market.
The novelty of STTM, as compared to other approaches that make use of topic modeling, SESTM (Ke, Kelly & Xiu, 2020) and ITMTF (Kim et al., 2013), is two-fold. First, STTM allows one to directly optimize the efficiency of investment portfolios, a task that ITMTF does not address, and does it better than SESTM. Second, both SESTM and ITMTF work to homogenize the generated topics by the direction of their effect on the target variable, either negative or positive. For this purpose, SESTM reduces the number of topics to only two, which renders them uninterpretable (and, as we have shown, less predictive than our approach). ITMTF's approach is more nuanced: while optimizing both the topics' predictive power and their purity in terms of the effect's direction, it yields genuinely interpretable topics. However, it does not evaluate the overall effect of the entire news stream of a given time period on share prices, which, ultimately, is the main practical goal of using news in such models. Additionally, it is not obvious that the overall predictive power of purified topics is higher than that of naturally occurring topics. Thus, adapting the ITMTF purification logic to the goal of direct trading strategy optimization and comparing the resulting pipeline to STTM is an interesting task for future research.
Our approach has several practical implications. First, its ability to create impact indices of a news stream, or of a stream of textual data from social media, for an individual issuer should be of higher practical value for traders than overall market indices. Issuer-specific indices can be used directly in trading strategies or as a factor in more complex models. Second, the transparency and interpretability of our approach should make it attractive to the mass-market investment applications that are appearing in large numbers. Our approach can make the decision advice rendered by such apps more understandable for lay investors and thus increase customer trust and loyalty. Finally, professional risk analysts can benefit from in-depth analysis of the rich information provided by our approach: they can numerically analyze the past behavior of their companies for better risk management in the future.

LIMITATIONS
Like all approaches involving topic modeling, our approach is sensitive to duplicate news. Although a large number of duplicates may indicate a topic's importance, duplicate-based topics tend to be artificially separated from similar, but not identical, texts. The effect of this phenomenon on model performance needs to be studied experimentally. Likewise, coverage of economic events may be heavily skewed by editorial choices that, as in 2018, may hinder the model's predictive power. This effect might be mitigated by broader samples of media outlets. Finally, as mentioned above, in this article we ignore brokers' commissions and transaction costs when evaluating the performance of our strategy. Although our goal here is to find return-predictive signals, models aiming at exact calculation of returns should account for these additional costs.

Trading terminal news
We consider real-time trading terminal news produced by the Interfax agency (a financial and economic news product; a trial is available at https://interfax.com/products/newsproducts/). We collected 739,680 news items from Jan 1, 2017 to Jan 1, 2020. Plots of the total number of articles per calendar week and the empirical distribution of article lengths in symbols are given in the supplementary materials (Fig. B3). Note that the number of articles per week correlates with the corresponding curves for the general economic news, which illustrates the shared picture of economic processes in general economic and trading news. We can also note the lack of analytical reviews among trading news. Within the day, trading news follows a unimodal distribution: Fig. A1 plots the average number of articles in each hour interval of the day.

Qualitative analysis of Russian economic news topic modeling
We investigated the similarity of the national news agencies and estimated the amount of new information contained in the trading terminal news in terms of topic modeling. We train topic modeling algorithms on each of the national media sources and then apply the pre-fitted models to real-time news from the trading terminal. The choice of this order is due, on the one hand, to the technical features of topic modeling algorithms (topics are easier to extract from longer texts) and, on the other hand, to the natural features of editorial policy: in national media, news is published with the appropriate context, while trading news is published as is and contains a lot of irrelevant noise. We use two baseline topic models: LDA and DTM. As noted in the 'STTM' subsection of the 'Experiments' section, we set one month as the time interval for the change of word probabilities in the DTM topics and the number of topics n = 20, which gave the highest C_v score before the C_v curve flattens out for all national media sources. The resulting topics can be titled as follows: macroeconomic indicators, Central Bank statements, pension legislation, tax law, economic reforms, monetary policy, financing of national projects, public procurement, trade duties, investment climate and economic development, export figures, rules in entrepreneurship and trade, energy tariffs, insurance, digital technologies, international trade agreements, debt burden, labor and employment, mining and energy, and international relations. Figure A2 shows the topic similarities for different models and data sources, aligned using the Hungarian algorithm. Since the distribution of words in DTM topics varies from month to month, we use the time-averaged word distribution for each topic. The topic modeling results show high cosine similarity between the same models based on data from different sources, and between different models based on the same sources.
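The Hungarian alignment step behind Fig. A2 can be sketched as follows; the topic-word matrices are toy, with the second model being a known permutation of the first:

```python
# Sketch of the topic-alignment step: match the topics of two models by
# maximizing total cosine similarity with the Hungarian algorithm.
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
A = rng.dirichlet(np.ones(30), size=5)        # model 1: 5 topics over 30 words
B = A[[3, 0, 4, 1, 2]]                        # model 2: same topics, permuted

def normed(M):
    return M / np.linalg.norm(M, axis=1, keepdims=True)

sim = normed(A) @ normed(B).T                 # pairwise cosine similarities
row, col = linear_sum_assignment(-sim)        # maximize total similarity
# topic i of model 1 is matched to topic col[i] of model 2
```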
The exception is the national agency Vedomosti, whose LDA model is poorly consistent with its DTM model and with the LDA models based on the data from Kommersant and RIA Novosti. Further, we apply the obtained models to real-time news from the trading terminal. We can estimate the amount of new information in the intraday data through the diversity feature (Chester Curme & Preis, 2017), which characterizes the topic model's degree of confidence:

D_t = -Σ_n ρ_{t,n} log ρ_{t,n},

where ρ_{t,n} is the relative weight of topic n in the news on time interval t. Our hypothesis is simple: each news item should belong to a small number of prominent topics, so the diversity feature of the topic model should be small. When the topic model has no suitable topics for a news item, it tends to distribute the probabilities evenly, and the diversity indicator is inflated. Figure A3 shows the LDA model's diversity distributions on its training sample (Kommersant) and applied to real-time news. We observe a slight shift in the distribution of diversity, but in general the model shows significant confidence. From these considerations, we can estimate the amount of new information that is contained in trading terminal news but lost in the general economic news. Also, the topic streams (general economic news vs. trading news) show a significant correlation (see Fig. B8), which once again confirms the unity of the described economic processes. On the other hand, there is a cumulative divergence of topic profiles (see Fig. B7): national media tend to write on macroeconomic indicators, Central Bank statements and pension legislation, whereas real-time news writes more often about labor and employment, mining and energy, and international relations. In addition, we estimated the distribution of the obtained topics within the trading day.
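The diversity feature above can be sketched as Shannon entropy over the topic weights, an assumption consistent with the description (an even spread of probabilities maximizes it); the exact formula used by Chester Curme & Preis (2017) may differ in normalization:

```python
# Sketch of the diversity feature: entropy of one document's topic distribution,
# maximal when the weights are spread evenly (i.e., the model is unconfident).
import numpy as np

def diversity(rho, eps=1e-12):
    """Shannon entropy of one document's topic weights rho (sums to 1)."""
    rho = np.asarray(rho)
    return float(-(rho * np.log(rho + eps)).sum())

confident = diversity([0.9, 0.05, 0.03, 0.02])    # one dominant topic
uniform = diversity([0.25, 0.25, 0.25, 0.25])     # no suitable topic found
```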
Figure A4 shows the cumulative division of the intraday news into topics: every hour, we calculate the total topic probabilities of the newly incoming data, add them to the amounts accumulated over the previous hours, and normalize. The figure shows topic saturation from a certain hour onward. To determine the elbow point on the timeline, we use the KL divergence between the cumulative topic distribution at a specific time and the final topic distribution at the end of the day:

KL_t = Σ_i ρ_{t,i} log(ρ_{t,i} / q_i),

where ρ_{t,i} is the cumulative weight of topic i in the intraday trading news on time interval t, and q_i is the final weight of topic i at the end of the day. After that, we determine when the graph of this KL divergence (middle graph of Fig. A4) reaches a plateau and find the elbow point using the Kneedle algorithm. In the given example (middle graph of Fig. A4), topic saturation occurs from 8 a.m.; from that time, the topic profile hardly changes within the day, and new information mostly refines the previous picture. We performed the above procedure for all dates in the dataset of intraday trading news and obtained the distribution of topic-profile saturation points demonstrated in Fig. A5. The figure shows one pronounced modality centered at 9 a.m., consistent with the opening time of the Moscow Exchange. Thus, the main discussion of the economic situation in the Russian trading news takes place before the start of trading.
Notes. *Significant at the p-value < 0.01. **Significant at the p-value < 0.05. ***Significant at the p-value < 0.10.
NaN, lack of data on the issuer in the year under consideration.
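The intraday saturation procedure described above (cumulative topic profile, KL divergence to the end-of-day profile, elbow detection) can be sketched as follows; the hourly topic draws are synthetic, and a simple threshold stands in for the Kneedle algorithm for self-containedness:

```python
# Sketch of the saturation measure: KL divergence between the cumulative
# intraday topic distribution at hour t and the final end-of-day distribution.
import numpy as np

rng = np.random.default_rng(0)
q = rng.dirichlet(np.ones(5))                 # final end-of-day topic profile
hours = np.arange(1, 25)

def kl(p, q, eps=1e-12):
    return float((p * np.log((p + eps) / (q + eps))).sum())

kls = []
cum = np.zeros_like(q)
for t in hours:
    cum = cum + rng.dirichlet(q * 50)         # toy hourly news drawn near q
    kls.append(kl(cum / cum.sum(), q))        # cumulative profile vs final one

# crude elbow: first hour where KL falls below half its initial value
# (the paper uses the Kneedle algorithm instead)
elbow = next((h for h, v in zip(hours, kls) if v < 0.5 * kls[0]), hours[-1])
```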